New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Categorical.take and Series([Categorical]).take is inconsistent with other dtypes #20664

Closed
jorisvandenbossche opened this Issue Apr 12, 2018 · 5 comments

Comments

Projects
None yet
2 participants
@jorisvandenbossche
Member

jorisvandenbossche commented Apr 12, 2018

Related to #20582 and #20640

Categorical.take does not do any bounds checking, so produces wrong results:

In [9]: cat = pd.Categorical(['a', 'b', 'a'])

In [12]: cat.take([3])
Out[12]: 
[0]                       # <--------- non-sensical result (produces invalid Categorical), should error instead
Categories (2, object): [a, b]

In [10]: cat.take([-1])
Out[10]: 
[NaN]                      # <--------- this is correct
Categories (2, object): [a, b]

In [15]: cat[:0].take([0])
Out[15]: 
[a]                             # <--------- should also error
Categories (2, object): [a, b]

Series.take with categorical data does not follow the numpy behaviour of "-1" (starting from end):

In [22]: s = pd.Series(cat)

In [24]: s.take([-1])
Out[24]: 
2    NaN            # <--------- wrongly NaN, because it dispatches to `Categorical.take` which returns this
dtype: category
Categories (2, object): [a, b]

In [25]: s.astype(object).take([-1])
astype
Out[25]: 
2    a
dtype: object
@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented Apr 12, 2018

(at least the Series.take part is a regression from 0.20)

@TomAugspurger

This comment has been minimized.

Contributor

TomAugspurger commented Apr 26, 2018

FYI, I'm working on this. Will have a PR tonight hopefully.

@TomAugspurger

This comment has been minimized.

Contributor

TomAugspurger commented Apr 27, 2018

Ok, will make a PR once #20814 is merged. One point to clarify, we only want to warn when the user passes negative values, correct?

cat = pd.Categorical(['a', 'b'])
cat.take([0, -1])  # FutureWarning
cat.take([0, 1])  # no warning
@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented Apr 27, 2018

One point to clarify, we only want to warn when the user passes negative values, correct?

Yes. (I don't think there is any change in behaviour for the other case?)

@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented Apr 27, 2018

Personally, I would even consider breaking it without deprecation (would people really use this?), but either is fine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment