Merge on categorical type columns gives wrong results #19551

vmuriart · 2018-02-06T15:57:28Z

Code Sample, a copy-pastable example if possible

import pandas as pd
from pandas.api.types import CategoricalDtype

# Setup CategoricalDtype causing issue
cat_type1 = CategoricalDtype(categories=['A', 'B', 'C'], ordered=False)
cat_type2 = CategoricalDtype(categories=['C', 'B', 'A'], ordered=False)

print('Check dtypes are equivalent:', cat_type1 == cat_type2)
print()

# Test Data
df1 = pd.DataFrame({
    'Foo': pd.Series(['A', 'B', 'C']).astype(cat_type1),
    'Left': ['A0', 'B0', 'C0'],
})

df2 = pd.DataFrame({
    'Foo': pd.Series(['C', 'B', 'A']).astype(cat_type2),
    'Right': ['C1', 'B1', 'A1'],
})

print('df1:\n', df1)
print()

print('df2:\n', df2)
print()

# Issue happens here. Merges on codes instead of value.
# Notice, data from df2 isn't merged correctly.
df_merge = df1.merge(df2, on=['Foo'])
print('df_merge:\n', df_merge)
print()

results = """
Check dtypes are equivalent: True

df1:
   Foo Left
0   A   A0
1   B   B0
2   C   C0

df2:
   Foo Right
0   C    C1
1   B    B1
2   A    A1

df_merge:
   Foo Left Right
0   A   A0    C1
1   B   B0    B1
2   C   C0    A1
"""

Problem description

After upgrading from v0.20.3 to v0.22.0, I noticed data missing from my datasets. After a few hours of debugging, I narrowed it down to merges on Categorical columns. I downgraded to v0.20.3 to test my original code and didn't have the issue. I then tested on v0.21.0 and found that the issue was first introduced in that version.

While the example provided doesn't run in v0.20.3, it does highlight the bug and, I think, shows why it's happening. Notice that the merge should give a result of

   Foo Left Right
0   A   A0    A1
1   B   B0    B1
2   C   C0    C1

instead of

   Foo Left Right
0   A   A0    C1  # Wrong
1   B   B0    B1
2   C   C0    A1  # Wrong
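
To make the "merges on codes instead of value" observation concrete, here is a minimal sketch. It assumes, as the comment in the code sample suggests, that the affected versions compared the integer codes of the two columns directly. The variable names are illustrative only.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

cat_type1 = CategoricalDtype(categories=['A', 'B', 'C'], ordered=False)
cat_type2 = CategoricalDtype(categories=['C', 'B', 'A'], ordered=False)

left = pd.Series(['A', 'B', 'C']).astype(cat_type1)
right = pd.Series(['C', 'B', 'A']).astype(cat_type2)

# Unordered dtypes compare equal regardless of category order.
dtypes_equal = cat_type1 == cat_type2

# A code is the position of a value in that dtype's own category list,
# so the same code can stand for different labels on each side.
left_codes = left.cat.codes.tolist()
right_codes = right.cat.codes.tolist()

# Labels that share a code: this is the pairing a code-based join would make.
label_by_shared_code = dict(zip(cat_type1.categories, cat_type2.categories))

print('dtypes equal:', dtypes_equal)
print('left codes:  ', left_codes)
print('right codes: ', right_codes)
print('paired by code:', label_by_shared_code)
```

Both columns carry the codes 0, 1, 2, yet code 0 means 'A' on the left and 'C' on the right, which matches the crossed rows in the wrong output above.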

Expected Output

   Foo Left Right
0   A   A0    A1
1   B   B0    B1
2   C   C0    C1

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.22.0
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.12.1
scipy: 0.19.1
pyarrow: 0.7.1
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


TomAugspurger · 2018-02-06T17:18:44Z

This looks similar to #19096 / #19097; taking a closer look now.

TomAugspurger · 2018-02-06T17:37:45Z

Thanks for the bug report. Should be fixed by #19553 if you want to take a look.
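
For anyone who wants to check whether an installed pandas is affected, a small self-check along the lines of the report might look like the sketch below. The function name `merge_matches_values` is hypothetical and is not taken from #19553; the data mirrors the code sample in the report.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def merge_matches_values():
    """Return True if a merge on reordered-but-equal categoricals pairs rows by label."""
    dtype_left = CategoricalDtype(categories=['A', 'B', 'C'], ordered=False)
    dtype_right = CategoricalDtype(categories=['C', 'B', 'A'], ordered=False)

    df1 = pd.DataFrame({
        'Foo': pd.Series(['A', 'B', 'C']).astype(dtype_left),
        'Left': ['A0', 'B0', 'C0'],
    })
    df2 = pd.DataFrame({
        'Foo': pd.Series(['C', 'B', 'A']).astype(dtype_right),
        'Right': ['C1', 'B1', 'A1'],
    })

    merged = df1.merge(df2, on='Foo')

    # In a correct merge, 'A0' sits next to 'A1', 'B0' next to 'B1', and so on,
    # so the first letters of Left and Right agree on every row.
    rows_agree = all(l[0] == r[0] for l, r in zip(merged['Left'], merged['Right']))
    return len(merged) == 3 and rows_agree


print(pd.__version__, 'merges by value:', merge_matches_values())
```

A result of `False` would indicate the behaviour described in this issue.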

vmuriart · 2018-02-06T19:12:08Z

Just took a look; the changes match what I was seeing.
Sadly, I can't test it locally: I'm getting an unrelated error during import, `AttributeError: module 'pandas' has no attribute 'compat'`.

The changes make sense though, so hopefully they cover what I was seeing on my larger dataset too 👍

jorisvandenbossche · 2018-02-06T21:04:32Z

Side note: should we consider doing a bugfix release for the 0.22 series? (I don't know how many issues we would have to backport, and it depends on how long 0.23.0 will still take.)

jorisvandenbossche · 2018-02-06T21:06:15Z

Other side note: is there actually a good use case for differently ordered categories in an unordered categorical? (Although always sorting the categories upon construction might also cause trouble down the road in other places.)

jreback · 2018-02-06T21:16:10Z

I think we should just release 0.23 soon.

TomAugspurger · 2018-02-06T21:17:02Z

Not sure... These are pretty serious bugs, but I don't have a good feeling for how common differently-ordered unordered categories are.


vmuriart · 2018-02-06T22:58:54Z

In my case they are pretty common, but that is because most of my data comes from pyarrow's Parquet files. However pyarrow creates the unordered categoricals, it was yielding differently ordered category indexes.
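
For data that arrives with differently ordered categories, one possible stopgap is to give both key columns the same category order before merging, so that codes and labels line up. This is a sketch under assumptions: the helper name `align_categories` is hypothetical, and whether this sidesteps the bug on 0.21.x or 0.22.0 has not been verified in this thread. It relies only on `Series.cat.set_categories`, which reorders categories while keeping the values.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def align_categories(left, right, key):
    """Return a copy of `right` whose `key` column uses `left`'s category order.

    Hypothetical helper; assumes both columns hold the same set of categories.
    """
    aligned = right.copy()
    aligned[key] = aligned[key].cat.set_categories(left[key].cat.categories)
    return aligned


df1 = pd.DataFrame({
    'Foo': pd.Series(['A', 'B', 'C']).astype(
        CategoricalDtype(categories=['A', 'B', 'C'], ordered=False)),
    'Left': ['A0', 'B0', 'C0'],
})
df2 = pd.DataFrame({
    'Foo': pd.Series(['C', 'B', 'A']).astype(
        CategoricalDtype(categories=['C', 'B', 'A'], ordered=False)),
    'Right': ['C1', 'B1', 'A1'],
})

df2_aligned = align_categories(df1, df2, 'Foo')

# The values are unchanged; only the codes are remapped to df1's order.
print(df2_aligned['Foo'].tolist(), df2_aligned['Foo'].cat.codes.tolist())

merged = df1.merge(df2_aligned, on='Foo')
print(merged)
```

Casting the key columns to plain `object` dtype before the merge would be another option, at the cost of losing the categorical dtype on the result.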

TomAugspurger added the Categorical (Categorical Data Type) label Feb 6, 2018

TomAugspurger added this to the 0.23.0 milestone Feb 6, 2018

TomAugspurger added the Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) and Regression (Functionality that used to work in a prior pandas version) labels Feb 6, 2018

TomAugspurger mentioned this issue Feb 6, 2018

BUG: Fixed merge on dtype equal categories #19553

Merged

jreback closed this as completed in #19553 Feb 8, 2018
