Concatenating two series of categoricals results in data corruption without warning #19096

Closed
ediphy-azorab opened this Issue Jan 5, 2018 · 4 comments

ediphy-azorab commented Jan 5, 2018

Code Sample, a copy-pastable example if possible

Unfortunately I can't share the underlying data, and I have not yet been able to produce a minimised reproduction.

In [202]: s1 = df1.symbol

In [203]: s2 = df2.symbol

In [204]: s1.dtype
Out[204]: CategoricalDtype(categories=['RE00012ME6MA', 'RE00002YE6MA', 'RE00018ME6MA', 'RE00012YE6MA', 'RE00013YE6MA', 'RE00010YE6MA', 'RE00014YE6MA', 'RE00015YE6MA', 'RE00016YE6MA', 'RE00017YE6MA', 'RE00018YE6MA'
, 'RE00019YE6MA', 'RE00020YE6MA', 'RE00025YE6MA', 'RE00011YE6MA', 'RE00003YE6MA', 'RE00005YE6MA', 'RE00009YE6MA', 'RE00004YE6MA', 'RE00008YE6MA', 'RE00006YE6MA', 'RE00007YE6MA', 'RE00030YE6MA'], ordered=False)

In [205]: s1.shape
Out[205]: (2084,)

In [206]: s2.dtype
Out[206]: CategoricalDtype(categories=['RE00030YE6MA', 'RE00008YE6MA', 'RE00016YE6MA', 'RE00015YE6MA', 'RE00018YE6MA', 'RE00017YE6MA', 'RE00020YE6MA', 'RE00006YE6MA', 'RE00005YE6MA', 'RE00004YE6MA', 'RE00014YE6MA'
, 'RE00025YE6MA', 'RE00003YE6MA', 'RE00013YE6MA', 'RE00002YE6MA', 'RE00009YE6MA', 'RE00018ME6MA', 'RE00011YE6MA', 'RE00019YE6MA', 'RE00010YE6MA', 'RE00007YE6MA', 'RE00012YE6MA', 'RE00012ME6MA'], ordered=False)

In [207]: s2.shape
Out[207]: (1030,)

In [208]: pd.concat([s1, s2]).astype('object') == pd.concat([s1.astype('object'), s2.astype('object')])
Out[208]:
0        True
1        True
2        True
3        True
4        True
        ...
1025    False
1026    False
1027    False
1028    False
1029    False
Name: symbol, Length: 3114, dtype: bool

In [209]: pd.concat([s1, s2], ignore_index=True).astype('object') == pd.concat([s1.astype('object'), s2.astype('object')], ignore_index=True)
Out[209]:
0        True
1        True
2        True
3        True
4        True
        ...
3109    False
3110    False
3111    False
3112    False
3113    False
Name: symbol, Length: 3114, dtype: bool

In [210]: pd.concat([s1.astype('object'), s2.astype('object')], ignore_index=True).iloc[-5:]
Out[210]:
3109    RE00012ME6MA
3110    RE00012ME6MA
3111    RE00005YE6MA
3112    RE00015YE6MA
3113    RE00015YE6MA
Name: symbol, dtype: object

In [211]: pd.concat([s1, s2], ignore_index=True).astype('object').iloc[-5:]
Out[211]:
3109    RE00030YE6MA
3110    RE00030YE6MA
3111    RE00016YE6MA
3112    RE00012YE6MA
3113    RE00012YE6MA
Name: symbol, dtype: object

Problem description

The row values have changed without warning. This is extremely surprising behaviour!

Expected Output

Concatenating two series whose categories contain the same values in different orders should not change the row values.
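The expectation above can be written as a small check. This is only a sketch: `concat_preserves_values` is a hypothetical helper name, and the two toy series stand in for the real `symbol` columns, which can't be shared. On an affected version (0.22.0 as reported) the check should come back `False`; on a version with the fix it should be `True`.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def concat_preserves_values(s1, s2):
    """Hypothetical helper: True if a categorical concat keeps every row value."""
    # Concatenate as categoricals first, then convert to plain objects.
    via_category = pd.concat([s1, s2], ignore_index=True).astype(object)
    # Convert to plain objects first, then concatenate (the trusted path).
    via_object = pd.concat(
        [s1.astype(object), s2.astype(object)], ignore_index=True
    )
    # Series.equals compares values position by position.
    return via_category.equals(via_object)


# Same category values, different category order (toy stand-ins for the real data).
s1 = pd.Series(list("abc"), dtype=CategoricalDtype(categories=list("cba")))
s2 = pd.Series(list("abc"), dtype=CategoricalDtype(categories=list("bac")))

ok = concat_preserves_values(s1, s2)
print(ok)
```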

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-4-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 10.0.0.subpip_fix
setuptools: 36.5.0
Cython: None
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 4.1.0
bs4: 4.6.0
html5lib: 1.0b10
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.5.0

ediphy-azorab commented Jan 5, 2018

Explicitly declaring the CategoricalDtype up front is a valid workaround, but only if you know that this is an issue. The silent nature of this seems particularly nasty to me.
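For anyone landing here, a minimal sketch of that workaround, assuming the full set of categories is known in advance. The values and the name `symbol_dtype` are illustrative, not taken from my real data. The second half shows one way to realign a series that already carries a differently ordered dtype.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# One shared dtype, so both series use the same category <-> code mapping.
symbol_dtype = CategoricalDtype(categories=["a", "b", "c"], ordered=False)

s1 = pd.Series(["a", "b", "c"]).astype(symbol_dtype)
s2 = pd.Series(["c", "a", "b"]).astype(symbol_dtype)
combined = pd.concat([s1, s2], ignore_index=True)

# For a series that already has the same categories in another order,
# realign it to the first series' categories before concatenating.
s3 = pd.Series(["a", "b", "c"], dtype=CategoricalDtype(categories=["b", "a", "c"]))
s3_aligned = s3.cat.set_categories(s1.cat.categories)
realigned = pd.concat([s1, s3_aligned], ignore_index=True)
```

Both `combined` and `realigned` stay categorical, and their values match what an object-dtype concat would give.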

Contributor

jreback commented Jan 5, 2018

This reproduces on master, so it appears that the recoding of unordered categoricals is going awry.

In [2]: from pandas.api.types import CategoricalDtype as CDT

In [3]: df1 = pd.DataFrame({'A': pd.Series(list('abc'), dtype=CDT(categories=list('cba'), ordered=False))})

In [4]: df2 = pd.DataFrame({'A': pd.Series(list('abc'), dtype=CDT(categories=list('bac'), ordered=False))})

In [5]: pd.concat([df1, df2])
Out[5]: 
   A
0  a
1  b
2  c
0  b
1  c
2  a

In [6]: pd.concat([df1, df2]).dtypes
Out[6]: 
A    category
dtype: object

In [7]: pd.concat([df1.astype(object), df2.astype(object)]).dtypes
Out[7]: 
A    object
dtype: object

In [8]: pd.concat([df1.astype(object), df2.astype(object)])
Out[8]: 
   A
0  a
1  b
2  c
0  a
1  b
2  c

jreback added the Categorical label Jan 5, 2018

jreback added the Reshaping label Jan 5, 2018

Contributor

TomAugspurger commented Jan 5, 2018

Sounds like #18822, but that doesn't fix it.

I think the bug is in

if all(first.is_dtype_equal(other) for other in to_union[1:]):

where we assume that is_dtype_equal(cat1, cat2) implies they share the same category <-> code mapping. The fix should look similar to #18822.
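To illustrate why that assumption breaks, here is a pure-Python sketch of the category <-> code mapping. It does not use or reproduce pandas internals; `decode`, `naive`, and `recode` are names made up for this illustration. The categories are the ones from the `df1`/`df2` reproduction above.

```python
def decode(codes, categories):
    """Turn integer codes back into values using a category list."""
    return [categories[code] for code in codes]


values = ["a", "b", "c"]
cats1 = ["c", "b", "a"]  # category order of the first categorical
cats2 = ["b", "a", "c"]  # same values, different order

codes1 = [cats1.index(v) for v in values]  # [2, 1, 0]
codes2 = [cats2.index(v) for v in values]  # [1, 0, 2]

# Buggy path: the dtypes compare equal (same unordered set of categories),
# so the raw codes are glued together and decoded with cats1 only.
naive = decode(codes1 + codes2, cats1)

# Correct path: translate each code of the second categorical into the
# first categorical's code for the same value before concatenating.
recode = [cats1.index(cat) for cat in cats2]  # code in cats2 -> code in cats1
fixed = decode(codes1 + [recode[code] for code in codes2], cats1)

print(naive)  # ['a', 'b', 'c', 'b', 'c', 'a']
print(fixed)  # ['a', 'b', 'c', 'a', 'b', 'c']
```

The `naive` result matches the corrupted `Out[5]` above, which is consistent with the codes being reused without recoding.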

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Jan 5, 2018

jreback added this to the 0.23.0 milestone Jan 6, 2018

jreback added the Bug label Jan 6, 2018

jreback added a commit that referenced this issue Jan 7, 2018

BUG: Fixed union_categoricals with unordered cats (#19097)
* BUG: Fixed union_categoricals with unordered cats

Closes #19096

* TST: Added concat test
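
A regression check along these lines could look like the sketch below. It is not the test added in #19097; the function name and the toy categories are assumptions for illustration. It uses `union_categoricals` from `pandas.api.types` and is expected to pass only on a version that includes the fix.

```python
import pandas as pd
from pandas.api.types import union_categoricals


def test_union_unordered_same_categories_different_order():
    # Same category values, different category order, both unordered.
    c1 = pd.Categorical(list("abc"), categories=list("cba"))
    c2 = pd.Categorical(list("abc"), categories=list("bac"))

    result = union_categoricals([c1, c2])

    # The values must survive the union unchanged, in order.
    assert list(result) == list("abcabc")
    # The result should still describe the same set of categories.
    assert sorted(result.categories) == ["a", "b", "c"]
```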