New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Categorical.from_codes shouldn't coerce to int64 #18501

Closed
dcolascione opened this Issue Nov 26, 2017 · 7 comments

Comments

Projects
None yet
4 participants
@dcolascione

dcolascione commented Nov 26, 2017

Categorical.from_codes coerces its input to an array of np.int64 unconditionally even though the Categorical constructor immediately coerces the input to some other dtype using coerce_indexer_dtype. This coercion might cause a memory usage spike when codes is large. ISTM that we can just avoid the conversion in from_codes entirely and let coerce_indexer_dtype take care of any error case.

Version: master

@jreback

This comment has been minimized.

Contributor

jreback commented Nov 26, 2017

pls show a copy pastable example

@dcolascione

This comment has been minimized.

dcolascione commented Nov 26, 2017

I'm not sure what you want. The intermediate conversion to int64 isn't observable, so there'd be nothing to show in a copy-paste example. It's an efficiency enhancement request, not a functionality one.

@jreback

This comment has been minimized.

Contributor

jreback commented Nov 26, 2017

i want a copy pastable example of your input

@dcolascione

This comment has been minimized.

dcolascione commented Nov 26, 2017

    # Shouldn't convert the input codes array to int64, because we then convert it to 
    # to int8 anyway.
    pd.Categorical.from_codes(codes=np.asarray([0,1], np.int16), categories=["foo", "bar"])
@TomAugspurger

This comment has been minimized.

Contributor

TomAugspurger commented Nov 26, 2017

Should be able to wrap this:

try:
codes = np.asarray(codes, np.int64)
except:
raise ValueError(
"codes need to be convertible to an arrays of integers")

in an

if not is_integer_dtype(codes):
    # do the try / except

And see what breaks. @dcolascione could you submit a PR for that, along with tests and a release note?

Note that we even if we avoid the cast to int64, I think we will still try to cast it to a smaller dtype (int8 in this case).

In [6]: pd.Categorical.from_codes(codes=np.asarray([0,1], np.int16), categories=["foo", "bar"]).codes.dtype
Out[6]: dtype('int8')

Avoiding all copies may be more difficult, but possible.

@TomAugspurger TomAugspurger added this to the Next Major Release milestone Nov 26, 2017

@jreback

This comment has been minimized.

Contributor

jreback commented Nov 26, 2017

ok this looks fine to do

In [1]: arr = np.ones(10000000,dtype='int8')

# with change
In [5]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar'])
8.92 ms ± 510 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# master
In [2]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar'])
53.2 ms ± 3.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

change

diff --git a/pandas/core/categorical.py b/pandas/core/categorical.py
index deaec20..8710f50 100644
--- a/pandas/core/categorical.py
+++ b/pandas/core/categorical.py
@@ -597,7 +597,7 @@ class Categorical(PandasObject):
             unordered.
         """
         try:
-            codes = np.asarray(codes, np.int64)
+            codes = np.asarray(codes)
         except:
             raise ValueError(
                 "codes need to be convertible to an arrays of integers")
@renewang

This comment has been minimized.

renewang commented Jan 27, 2018

Hi, I would like to work on this issue. Is any issue that I couldn't work on this? And any suggestions or documents for the first starter like how to write a test case etc? Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment