df.to_json segfaults with categorical index #10317

sborgeson · 2015-06-09T17:25:42Z

DataFrame.to_json is reliably segfaulting python when the DataFrame has an index of type CategoricalIndex.

import pandas as pd
idx = pd.Categorical([1,2,3], categories=[1,2,3])
df = pd.DataFrame( {  'count' : pd.Series([3,2,2],index=idx) } )
# this will crash python (2.6.X or 2.7.X on linux 64 or win 64 with pandas 0.16.1)
print df.to_json(orient='split')

If I call with orient='index', I get a value error instead:

# this throws a ValueError
print df.to_json(orient='index')

ValueError: Label array sizes do not match corresponding data shape

For what it's worth, my workaround, which is acceptable in my application, is to convert my index to strings:

df.index = df.index.astype(str)
print df.to_json(orient='split')
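
For readers on Python 3, the same workaround can be sketched as below. This is only an illustration: it assumes a current pandas release (where the crash reported here should no longer occur) and uses the standard `json` module to inspect the result. The variable name `payload` is an assumption introduced for the example.

```python
import json

import pandas as pd

# Build the frame from the report, with a categorical index
idx = pd.CategoricalIndex([1, 2, 3], categories=[1, 2, 3])
df = pd.DataFrame({"count": pd.Series([3, 2, 2], index=idx)})

# Workaround: replace the categorical index with plain strings
df.index = df.index.astype(str)

# Serialize, then parse the JSON text back to check its structure
payload = json.loads(df.to_json(orient="split"))
print(payload["index"])
```

The index labels come back as strings, so any consumer that expects integer labels would need to convert them again.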

Windows config:

INSTALLED VERSIONS

commit: None
python: 2.7.7.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.16.1
nose: None
Cython: 0.20.1
numpy: 1.8.1
scipy: 0.15.1
statsmodels: None
IPython: 2.1.0
sphinx: None
patsy: None
dateutil: 2.2
pytz: 2014.4
bottleneck: None
tables: 3.1.1
numexpr: 2.4
matplotlib: 1.3.1
openpyxl: 2.0.3
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None

Linux config:

INSTALLED VERSIONS

commit: None
python: 2.6.8.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.18-274.el5
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.16.1
nose: None
Cython: None
numpy: 1.9.2
scipy: None
statsmodels: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.4
bottleneck: None
tables: 3.2.0
numexpr: 2.4.3
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None

jreback · 2015-06-09T19:55:33Z

Yep, not implemented ATM; pull requests welcome. This would require delving into the C code, but it's not that difficult. The reason this fails is that .values is the accessor the JSON code uses. It returns a numpy array for all types except Categorical/Sparse. I think this could be changed to .get_values() to handle these types properly (this would also densify a sparse series, for example).
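
The distinction described in this comment can be checked from Python. The sketch below is an illustration only, not the code path used by the JSON writer. It assumes a current pandas and NumPy under Python 3, and it uses `np.asarray` as a stand-in for densifying, since `.get_values()` is not available in every pandas release.

```python
import numpy as np
import pandas as pd

idx = pd.CategoricalIndex([1, 2, 3], categories=[1, 2, 3])

# .values on a CategoricalIndex hands back a Categorical, not an ndarray
raw = idx.values
values_is_ndarray = isinstance(raw, np.ndarray)

# np.asarray converts the categorical data into a plain NumPy array
dense = np.asarray(idx)
dense_is_ndarray = isinstance(dense, np.ndarray)

print(type(raw).__name__, values_is_ndarray)
print(type(dense).__name__, dense_is_ndarray)
```

Code that assumes `.values` is always an ndarray, as the C serializer apparently did, has no valid buffer to read in the first case.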

sborgeson · 2015-06-09T21:07:55Z

With respect for all the work you do, avoiding a segfault is not an enhancement. If you don't have near-term plans to support serializing DataFrames with categorical indices (my example was the result of using pandas.core.algorithms.quantile), you should add a guard clause that throws an exception indicating this fact, not let the code run through to a segfault in C. This is especially true because earlier versions of pandas (0.14 in my experience) did allow categorical indices, so a simple package update can introduce a segfault.

sborgeson · 2015-06-09T21:21:43Z

Looking a little deeper, I can see that there is a guard clause, but it is not working. It lives in io/json.py, in _format_axes() for FrameWriter, and relies on this line:

if not self.obj.index.is_unique and self.orient in ('index', 'columns'):
    raise ValueError(" .... ")

However, the CategoricalIndex in the example above returns True for is_unique, and because the 'split' orientation is not addressed in the guard clause, it runs past this protection and into C where it segfaults.
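
The observation above can be reproduced without calling `to_json` at all. The following is a minimal sketch under Python 3 with a current pandas. `guard_fires` is a hypothetical helper that mirrors only the condition quoted from `_format_axes()`; it is not part of pandas.

```python
import pandas as pd

idx = pd.CategoricalIndex([1, 2, 3], categories=[1, 2, 3])
df = pd.DataFrame({"count": pd.Series([3, 2, 2], index=idx)})


def guard_fires(frame, orient):
    """Hypothetical mirror of the quoted condition: True means a ValueError would be raised."""
    return (not frame.index.is_unique) and orient in ("index", "columns")


# Evaluate the condition for each orientation discussed in this thread
results = {orient: guard_fires(df, orient) for orient in ("split", "index", "columns")}
print(results)
```

Because the index is unique, the condition is false for every orientation, so this check never stops a categorical index from reaching the C serializer.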

jreback · 2015-06-09T21:41:09Z

@sborgeson I marked this as a bug; the enhancement tag is there because I think this may need to be restructured a bit.

As I said, happy to take a patch, even if it just raises a ValueError on the Python side, as we do try to avoid segfaults.

All that said, CategoricalIndex is quite a new feature (only released a few weeks ago), and even though we have quite a few tests, stuff does break. This is the first bug report for this.

Further, when upgrading across multiple major versions it is important to review the release notes and test, which I am sure is how you found this :)

I'm going to release 0.16.2 in the next few days, so if you'd like to put up a simple patch, it would be great to get it in.

sborgeson · 2015-06-09T22:57:00Z

As I seem to have already demonstrated, I'm not sure I know enough about the latest index types to reliably patch this. I was thinking the clause would be something like (in io/json.py, _format_axes() for FrameWriter):

if type(self.obj.index) == pandas.core.index.CategoricalIndex: raise ValueError( ... )

But I've looked at core/index.py and I see that there are several index types, especially MultiIndex, that might also require protection. I also noticed that the ValueError that is thrown for orient='index' comes out of the C code rather than the guard clauses in Python, so I'm no longer convinced I understand the intention of the guard clauses I was looking at vs. the error handling in C. I think I'll have to wait for the benevolent actions of a developer more familiar with the code base than I am. So thanks for wrangling the bugs.
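
A Python-side guard of the kind discussed in this comment might look like the sketch below. It is not the fix that was merged in #10321 or #10322. `check_json_index` is a hypothetical helper name, the error message is invented for the example, and `isinstance` is used instead of `type(...) ==` so that subclasses are also caught. It assumes Python 3 and a current pandas, where the class is exposed as `pd.CategoricalIndex`.

```python
import pandas as pd


def check_json_index(frame):
    """Hypothetical guard: reject a CategoricalIndex before serialization."""
    if isinstance(frame.index, pd.CategoricalIndex):
        raise ValueError("CategoricalIndex is not supported by this sketch")


idx = pd.CategoricalIndex([1, 2, 3], categories=[1, 2, 3])
categorical_df = pd.DataFrame({"count": [3, 2, 2]}, index=idx)
plain_df = pd.DataFrame({"count": [3, 2, 2]})

# The categorical frame is rejected with a Python exception
try:
    check_json_index(categorical_df)
    guard_raised = False
except ValueError:
    guard_raised = True

# A frame with the default integer index passes silently
check_json_index(plain_df)
print(guard_raised)
```

A check like this covers only one index type; the concern raised above about MultiIndex and other index classes would need separate handling.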

jreback · 2015-06-10T00:43:20Z

@sborgeson

fixed in #10321

note that I am going to keep this open, as this should actually be handled in the c-code (maybe)

evanpw · 2015-06-10T01:05:04Z

I've got a fix for the C-code which passes the same tests as #10321.

jreback · 2015-06-10T01:06:22Z

@evanpw ohh, excellent. Pull in my tests and put up your fix!

jreback · 2015-06-10T10:30:12Z

closed by #10322

@sborgeson this is now merged into master and will be in 0.16.2. thanks for the report!

sborgeson · 2015-06-10T16:51:15Z

This was a very efficient process. Thanks for your work on such a great set of tools.

jreback added Bug Enhancement IO JSON read_json, to_json, json_normalize Categorical Categorical Data Type Difficulty Intermediate labels Jun 9, 2015

jreback added this to the 0.17.0 milestone Jun 9, 2015

jreback modified the milestones: 0.16.2, 0.17.0 Jun 9, 2015

jreback added the Error Reporting Incorrect or improved errors from pandas label Jun 9, 2015

jreback mentioned this issue Jun 10, 2015

BUG: Bug in to_json with certain orients and a CategoricalIndex would segfault, closes #10317 #10321

Closed

jreback modified the milestones: 0.17.0, 0.16.2 Jun 10, 2015

evanpw added a commit to evanpw/pandas that referenced this issue Jun 10, 2015

Bug in to_json causing segfault with a CategoricalIndex (GH pandas-de…

588437c

…v#10317)

evanpw mentioned this issue Jun 10, 2015

Bug in to_json causing segfault with a CategoricalIndex (GH #10317) #10322

Merged

jreback modified the milestones: 0.16.2, 0.17.0 Jun 10, 2015

jreback added a commit that referenced this issue Jun 10, 2015

Merge pull request #10322 from evanpw/json

07ea11c

Bug in to_json causing segfault with a CategoricalIndex (GH #10317)

jreback closed this as completed Jun 10, 2015
