Merge on categorical type columns gives wrong results #19551

Closed
vmuriart opened this issue Feb 6, 2018 · 8 comments · Fixed by #19553
Labels: Categorical (Categorical Data Type); Regression (Functionality that used to work in a prior pandas version); Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

@vmuriart
vmuriart commented Feb 6, 2018

Code Sample, a copy-pastable example if possible

import pandas as pd
from pandas.api.types import CategoricalDtype

# Setup CategoricalDtype causing issue
cat_type1 = CategoricalDtype(categories=['A', 'B', 'C'], ordered=False)
cat_type2 = CategoricalDtype(categories=['C', 'B', 'A'], ordered=False)

print('Check dtypes are equivalent:', cat_type1 == cat_type2)
print()

# Test Data
df1 = pd.DataFrame({
    'Foo': pd.Series(['A', 'B', 'C']).astype(cat_type1),
    'Left': ['A0', 'B0', 'C0'],
})

df2 = pd.DataFrame({
    'Foo': pd.Series(['C', 'B', 'A']).astype(cat_type2),
    'Right': ['C1', 'B1', 'A1'],
})

print('df1:\n', df1)
print()

print('df2:\n', df2)
print()

# Issue happens here. Merges on codes instead of value.
# Notice, data from df2 isn't merged correctly.
df_merge = df1.merge(df2, on=['Foo'])
print('df_merge:\n', df_merge)
print()

results = """
Check dtypes are equivalent: True

df1:
   Foo Left
0   A   A0
1   B   B0
2   C   C0

df2:
   Foo Right
0   C    C1
1   B    B1
2   A    A1

df_merge:
   Foo Left Right
0   A   A0    C1
1   B   B0    B1
2   C   C0    A1
"""

Problem description

Since upgrading from v0.20.3 to v0.22.0 I noticed data missing from my datasets. After a few hours of debugging I narrowed it down to merges on Categorical columns. I downgraded to v0.20.3 to test my original code and didn't have the issue. I then tested on v0.21.0 and found that the issue was first introduced in that version.

While the example provided doesn't run in v0.20.3, it does highlight the bug and I think shows why it's happening. Notice that the merge should give a result of

   Foo Left Right
0   A   A0    A1
1   B   B0    B1
2   C   C0    C1

instead of

   Foo Left Right
0   A   A0    C1  # Wrong
1   B   B0    B1
2   C   C0    A1  # Wrong
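
A minimal sketch of why matching on codes would produce the swapped rows above. It assumes, as the comment in the code sample suggests but nothing in this thread confirms, that the affected versions compare the integer codes of the two categoricals directly. The names `left`, `right` and `right_aligned` are illustrative only.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Same labels, different category order (as in the report).
cat_type1 = CategoricalDtype(categories=['A', 'B', 'C'], ordered=False)
cat_type2 = CategoricalDtype(categories=['C', 'B', 'A'], ordered=False)

left = pd.Series(['A', 'B', 'C']).astype(cat_type1)
right = pd.Series(['C', 'B', 'A']).astype(cat_type2)

# Each code is the position of the value within its own categories.
left_codes = left.cat.codes.tolist()    # [0, 1, 2] -> A, B, C
right_codes = right.cat.codes.tolist()  # [0, 1, 2] -> C, B, A

# The codes are identical even though the values differ, so pairing
# rows by raw code would match 'A' with 'C' and 'C' with 'A'.
print(left_codes, right_codes)

# Recoding the right side against the left side's category order
# makes equal codes mean equal values again.
right_aligned = right.cat.set_categories(left.cat.categories)
aligned_codes = right_aligned.cat.codes.tolist()  # [2, 1, 0]
print(aligned_codes)
```

Under that assumption, code 0 on the left ('A') lines up with code 0 on the right ('C'), which matches the wrong rows in the output.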

Expected Output

   Foo Left Right
0   A   A0    A1
1   B   B0    B1
2   C   C0    C1

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.22.0
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.12.1
scipy: 0.19.1
pyarrow: 0.7.1
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

This looks similar to #19096 / #19097, taking a closer look now.

@TomAugspurger added the Categorical label Feb 6, 2018
@TomAugspurger added this to the 0.23.0 milestone Feb 6, 2018
@TomAugspurger added the Reshaping and Regression labels Feb 6, 2018
@TomAugspurger
Contributor

Thanks for the bug report. Should be fixed by #19553 if you want to take a look.

@vmuriart
Author

vmuriart commented Feb 6, 2018

Just took a look; the changes make sense given what I was seeing.
Sadly I can't test it locally, as I'm getting an unrelated error during import: AttributeError: module 'pandas' has no attribute 'compat'.

The changes make sense though, so hopefully they cover what I was seeing on my larger dataset too 👍

@jorisvandenbossche
Member

Side note: should we consider doing a bugfix release for the 0.22 series? (I don't know how many issues we would have to backport, and it depends on how long 0.23.0 will still take.)

@jorisvandenbossche
Member

Other side note: is there actually a good use case for having differently ordered categories in an unordered categorical? (Although always sorting the categories upon construction might also cause trouble down the road in other places.)

@jreback
Contributor

jreback commented Feb 6, 2018

I think we should just release 0.23 soon.

@TomAugspurger
Contributor

TomAugspurger commented Feb 6, 2018 via email

@vmuriart
Author

vmuriart commented Feb 6, 2018

In my case they are pretty common, but that is because most of my data comes from pyarrow's parquet files. However pyarrow creates the unordered categoricals, it was yielding differently ordered indexes.
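
For data like this, where the same labels arrive with differently ordered categories, one possible workaround is to give both key columns the same category order before merging. This is a sketch, not something proposed in the thread. It assumes both columns hold the same set of labels (a label missing from the target categories would become NaN), and it has not been checked here against 0.21.x or 0.22.0. `df2_aligned` and `merged` are hypothetical names.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

cat_type1 = CategoricalDtype(categories=['A', 'B', 'C'], ordered=False)
cat_type2 = CategoricalDtype(categories=['C', 'B', 'A'], ordered=False)

df1 = pd.DataFrame({
    'Foo': pd.Series(['A', 'B', 'C']).astype(cat_type1),
    'Left': ['A0', 'B0', 'C0'],
})
df2 = pd.DataFrame({
    'Foo': pd.Series(['C', 'B', 'A']).astype(cat_type2),
    'Right': ['C1', 'B1', 'A1'],
})

# Recode df2's key against df1's category order so that both
# columns use the same value-to-code mapping.
df2_aligned = df2.assign(
    Foo=df2['Foo'].cat.set_categories(df1['Foo'].cat.categories)
)

merged = df1.merge(df2_aligned, on='Foo')
print(merged)

# Alternative: merge on plain object columns, trading the memory
# savings of categoricals for a value-based comparison.
merged_obj = df1.assign(Foo=df1['Foo'].astype(object)).merge(
    df2.assign(Foo=df2['Foo'].astype(object)), on='Foo'
)
```

Either variant should pair A0 with A1, B0 with B1 and C0 with C1; the object-based merge returns an object key column, so cast it back to a categorical if that dtype matters downstream.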
