Series.isin fails (errors) for categoricals #16639

Closed
aviolov opened this issue Jun 8, 2017 · 10 comments · Fixed by #16858

@aviolov
Contributor

aviolov commented Jun 8, 2017

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

print(pd.__version__)

# Integer codes index into the category list: 0 -> 'a', 1 -> 'b', 2 -> 'c'
vals = np.array([0, 1, 2, 0])
cats = ['a', 'b', 'c']

# from_codes is a classmethod, so call it on pd.Categorical directly
DFtrades = pd.DataFrame({'id': pd.Series(pd.Categorical.from_codes(vals, cats))})
DFscores = pd.DataFrame({'id': pd.Series(pd.Categorical.from_codes(np.array([0, 1]), cats))})

print(DFtrades)
print(DFscores)

# Raises ValueError on pandas 0.20.1
select_ids = DFtrades['id'].isin(DFscores['id'])

Problem description

I get an error in 0.20.1

  File "", line 12, in <module>
    select_ids = DFtrades['id'].isin(DFscores['id']);

  File "C:\Users\alexandre\Anaconda3\lib\site-packages\pandas\core\series.py", line 2555, in isin
    result = algorithms.isin(_values_from_object(self), values)

  File "C:\Users\alexandre\Anaconda3\lib\site-packages\pandas\core\algorithms.py", line 421, in isin
    return f(comps, values)

  File "C:\Users\alexandre\Anaconda3\lib\site-packages\pandas\core\algorithms.py", line 399, in <lambda>
    f = lambda x, y: htable.ismember_object(x, values)

  File "pandas\_libs\hashtable_func_helper.pxi", line 428, in pandas._libs.hashtable.ismember_object (pandas\_libs\hashtable.c:29677)

ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long long'

Expected Output

a boolean array (or series?) indicating the third row of DFtrades is not in DFscores but the other three are

for reference, this worked (I did not get an error) in 0.19.(something)

also this code will work as expected:

select_ids = DFtrades['id'].isin(DFscores['id'].values)
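
For reference, here is a minimal sketch of the expected result, assuming a pandas version where the workaround (or the eventual fix) applies. The names `trades`, `scores`, `mask`, and `expected` are illustrative stand-ins for the columns above, and the plain-Python membership check is only there as a cross-check.

```python
import numpy as np
import pandas as pd

cats = ['a', 'b', 'c']

# Same data as the report: codes 0, 1, 2, 0 map to 'a', 'b', 'c', 'a'
trades = pd.Series(pd.Categorical.from_codes(np.array([0, 1, 2, 0]), categories=cats))
scores = pd.Series(pd.Categorical.from_codes(np.array([0, 1]), categories=cats))

# Workaround from the report: pass the underlying values, not the Series
mask = trades.isin(scores.values)

# Cross-check against ordinary Python set membership on the labels
wanted = set(scores)
expected = [label in wanted for label in trades]

print(mask.tolist())
print(expected)
```

Only the third row ('c') should come back False, which matches the expected output described above.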

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.20.1
pytest: 3.1.1
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.19.0
xarray: 0.9.5
IPython: 6.1.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.8.0
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.10
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@chris-b1 chris-b1 added Bug Categorical Categorical Data Type Regression Functionality that used to work in a prior pandas version labels Jun 8, 2017
@chris-b1 chris-b1 added this to the 0.20.3 milestone Jun 8, 2017
@chris-b1
Contributor

chris-b1 commented Jun 8, 2017

I'm guessing the fix for this looks something like #16543 - we did some refactoring of the algorithms file, and this is a case that probably got missed

@jreback
Contributor

jreback commented Jun 9, 2017

this fixes it, though I think we should add some asv benchmarks with categoricals to make sure they are hitting the right path

diff --git a/pandas/core/algorithms.py b/pandas/core/algorithms.py
index d74c5e6..a651817 100644
--- a/pandas/core/algorithms.py
+++ b/pandas/core/algorithms.py
@@ -113,7 +113,8 @@ def _ensure_data(values, dtype=None):
 
         return values.asi8, dtype, 'int64'
 
-    elif is_categorical_dtype(values) or is_categorical_dtype(dtype):
+    elif (is_categorical_dtype(values) and
+          (is_categorical_dtype(dtype) or dtype is None)):
         values = getattr(values, 'values', values)
         values = values.codes
         dtype = 'category'
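
One way to read the changed condition, sketched under assumptions: the helpers `is_cat`, `old_condition`, and `new_condition` below are hypothetical stand-ins written for illustration, not the pandas internals. The sketch assumes `isin` passes the dtype of the left-hand values (object in the report) when preparing the right-hand values.

```python
import numpy as np
import pandas as pd


def is_cat(obj):
    """Hypothetical helper: True if obj is, or carries, a categorical dtype."""
    if isinstance(obj, pd.CategoricalDtype):
        return True
    return isinstance(getattr(obj, "dtype", None), pd.CategoricalDtype)


def old_condition(values, dtype=None):
    # Before the patch: any categorical input took the "use codes" branch
    return is_cat(values) or is_cat(dtype)


def new_condition(values, dtype=None):
    # After the patch: use codes only if the requested dtype is also
    # categorical, or if no dtype was requested at all
    return is_cat(values) and (is_cat(dtype) or dtype is None)


values = pd.Series(pd.Categorical.from_codes([0, 1], categories=["a", "b", "c"]))
comps_dtype = np.dtype(object)  # assumed dtype of the left-hand side

old_hit = old_condition(values, comps_dtype)
new_hit = new_condition(values, comps_dtype)
new_hit_no_dtype = new_condition(values)

print(old_hit, new_hit, new_hit_no_dtype)
```

Under that assumption, the old condition converts the right-hand side to integer codes while the left-hand side stays object, which would be consistent with the "expected 'Python object' but got 'long long'" message in the report.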

@jreback
Contributor

jreback commented Jul 6, 2017

@aviolov want to push a PR for the above fix?

@aviolov
Contributor Author

aviolov commented Jul 6, 2017

@jreback at the risk of sounding ignorant - how would I do that (maybe a link to some documentation / how-to)?

@TomAugspurger
Contributor

@aviolov which part, specifically? All the contributing docs are at http://pandas.pydata.org/pandas-docs/stable/contributing.html. If you have any additional questions, just ask them here.

@aviolov
Contributor Author

aviolov commented Jul 6, 2017

@TomAugspurger , thanks for the link. I guess a 'PR' is a pull request in this case. Is the idea that I download version 0.20.3 and check that my minimal example above works now or that I branch the current version and implement the fix suggested above and then try to push it back or... ? I haven't made a branch off pandas before, but would be fun to try - the how-to looks quite comprehensive

@TomAugspurger
Contributor

TomAugspurger commented Jul 6, 2017

@aviolov you'll fork the repo as described in http://pandas.pydata.org/pandas-docs/stable/contributing.html#forking

Then create a new branch

Then apply your changes:

  • Add a test in pandas/tests/test_categorical.py with your original example
  • Run the tests with something like pytest pandas/tests/test_categorical.py -k <test name> to verify that it fails
  • Add the fix from @jreback
  • Add a release note in doc/source/whatsnew/v0.21.0.txt

Then push and make a pull request (PR)
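
The test step above might look roughly like the following. This is a sketch only: the test name `test_isin_categorical_series` is made up for illustration, and the test that was actually merged may be structured differently.

```python
import numpy as np
import pandas as pd
import pandas.testing as tm


def test_isin_categorical_series():
    # Hypothetical regression test based on the example in the report
    cats = ['a', 'b', 'c']
    trades = pd.Series(pd.Categorical.from_codes(np.array([0, 1, 2, 0]), categories=cats))
    scores = pd.Series(pd.Categorical.from_codes(np.array([0, 1]), categories=cats))

    # Passing the categorical Series directly is the call that raised in 0.20.1
    result = trades.isin(scores)

    # 'c' (third row) is absent from scores; the other rows are present
    expected = pd.Series([True, True, False, True])
    tm.assert_series_equal(result, expected)
```

Running it with the -k option shown in the list should fail before the fix is applied and pass afterwards.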

@aviolov
Contributor Author

aviolov commented Jul 6, 2017

@TomAugspurger cool, I'll give it a try

@aviolov
Contributor Author

aviolov commented Jul 7, 2017

I could not get git rebase -i HEAD~2 to work for squashing two commits into 1 (possibly b/c i had pushed the first one prior to committing the second one) - I got

$ git rebase -i HEAD-2
fatal: Needed a single revision
invalid upstream HEAD-2

@jreback
Contributor

jreback commented Jul 7, 2017

you don't need to squash
