pd.Categorical(Series, categories=..) returns broken data with out-of-bound codes #25318

batterseapower · 2019-02-14T12:49:34Z

Code Sample, a copy-pastable example if possible

c1 = pd.Categorical(['A', 'B', 'B', 'C'])
# Works as expected:
#c2 = pd.Categorical(c1, categories=c1.categories[1:])
# Generates broken data:
c2 = pd.Categorical(pd.Series(c1), categories=c1.categories[1:])
print(c1[:2])
print(c2[:2])
print(c2.codes.max(), len(c2.categories))

Prints:

[A, B]
Categories (3, object): [A, B, C]
[B, C]
Categories (2, object): [B, C]
2 2

Problem description

pd.Categorical should not be able to construct an object where codes is >= len(categories). If it does this, then it's very likely that passing the array into other Pandas functions will trigger a segfault. For example, if you try to use this Categorical as a column in pd.concat then Python just dies:

>>> pd.concat([pd.DataFrame.from_dict({'A': c2}), pd.DataFrame.from_dict({'A': c2})], axis=0)

C:\>echo %ERRORLEVEL%
-1073741819

Expected Output

[A, B]
Categories (3, object): [A, B, C]
[NaN, B]
Categories (2, object): [B, C]
1 2

Output of `pd.show_versions()`

INSTALLED VERSIONS ------------------ commit: None python: 3.6.1.final.0 python-bits: 64 OS: Windows OS-release: 10 machine: AMD64 processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: None.None

pandas: 0.24.1
pytest: 3.1.2
pip: 19.0.2
setuptools: 39.0.1
Cython: 0.27.2
numpy: 1.16.1
scipy: 1.2.1
pyarrow: 0.9.0
xarray: None
IPython: 6.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: 1.3.0.dev0
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 3.8.0
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.1.11
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: 0.1.5
pandas_gbq: None
pandas_datareader: None
gcsfs: None

The text was updated successfully, but these errors were encountered:

ghost · 2019-02-15T05:50:50Z

In pandas.__version__ == '0.23.4', you can get what you excepted output.

Of course, this version gets:

>>> pd.concat([pd.DataFrame.from_dict({'A': c2}), pd.DataFrame.from_dict({'A': c2})], axis=0)
     A
0  NaN
1    B
2    B
3    C
0  NaN
1    B
2    B
3    C

Python works well.

And then I use pandas==0.24.1 and run these codes

c2 = pd.Categorical(pd.Series(c1), categories=c1.categories[1:])

yes, my Python died. But we can check c2.codes, output:
array([0, 1, 1, 2])

That may be the error lies. In fact we want:

>>> c2.codes
array([-1, 0, 0, 1])

and pd.Categorical 's help doc says:
If categories are given, values not in categories will be replaced with NaN.
and the NaN is the code -1 if everything is OK, but

c2 = pd.Categorical(pd.Series(c1), categories=c1.categories[1:])

does not work well.

We check the source code pandas.core.arrays.categorical.py in pandas==0.24.1 and we find in class Categrical it runs:

# at about 326 lines
if is_categorical(values): # this situation is True
            # GH23814, for perf, if values._values already an instance of
            # Categorical, set values to codes, and run fastpath
            if (isinstance(values, (ABCSeries, ABCIndexClass)) and
               isinstance(values._values, type(self))):
                values = values._values.codes.copy()
                fastpath = True

We find it sets values to codes, then call function coerce_indexer_dtype and get the codes. But in pandas == 0.23.4 it runs:

# at about 363 lines
elif is_categorical_dtype(values):
            old_codes = (values.cat.codes if isinstance(values, ABCSeries)
                         else values.codes)
            codes = _recode_for_categories(old_codes, values.dtype.categories,
                                           dtype.categories)

However, I don't know why the latter version has this changes. So if you want to get excepted output temporarily, try to use pandas == 0.23.4 or know these changes and tell me...

batterseapower · 2019-02-15T14:17:30Z

Yeah, it does work with 0.23.x: I had a bunch of code relying on this, it started mysteriously segfaulting when I upgraded.

The comments in that code refer to #23814 and thence to #23888 by @eoveson (https://github.com/pandas-dev/pandas/pull/23888/files#diff-f3b2ea15ba728b55cab4a1acd97d996dR353). So this looks like a well-intentioned performance improvement gone a bit wrong?

jorisvandenbossche · 2019-02-15T19:54:03Z

@batterseapower Thanks for the report!

This is indeed a regression, let's try to fix it for 0.24.2. @batterseapower or @yichen8 a PR is very welcome. We should indeed keep a call to _recode_for_categories (or at least in certain cases)

jorisvandenbossche added Regression Functionality that used to work in a prior pandas version Categorical Categorical Data Type labels Feb 15, 2019

jorisvandenbossche added this to the 0.24.2 milestone Feb 15, 2019

batterseapower mentioned this issue Feb 19, 2019

BUG: Fix potential segfault after pd.Categorical(pd.Series(...), categories=...) #25368

Merged

jreback closed this as completed in #25368 Mar 3, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pd.Categorical(Series, categories=..) returns broken data with out-of-bound codes #25318

pd.Categorical(Series, categories=..) returns broken data with out-of-bound codes #25318

batterseapower commented Feb 14, 2019

ghost commented Feb 15, 2019

batterseapower commented Feb 15, 2019

jorisvandenbossche commented Feb 15, 2019

pd.Categorical(Series, categories=..) returns broken data with out-of-bound codes #25318

pd.Categorical(Series, categories=..) returns broken data with out-of-bound codes #25318

Comments

batterseapower commented Feb 14, 2019

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of pd.show_versions()

ghost commented Feb 15, 2019

batterseapower commented Feb 15, 2019

jorisvandenbossche commented Feb 15, 2019

Output of `pd.show_versions()`