df.to_json segfaults with categorical index #10317

Closed
sborgeson opened this issue Jun 9, 2015 · 10 comments
Labels
Bug Categorical Categorical Data Type Enhancement Error Reporting Incorrect or improved errors from pandas IO JSON read_json, to_json, json_normalize

@sborgeson

DataFrame.to_json reliably segfaults Python when the DataFrame has an index of type CategoricalIndex.

import pandas as pd
idx = pd.Categorical([1,2,3], categories=[1,2,3])
df = pd.DataFrame( {  'count' : pd.Series([3,2,2],index=idx) } )
# this will crash python (2.6.X or 2.7.X on linux 64 or win 64 with pandas 0.16.1)
print df.to_json(orient='split')

If I call it with orient='index', I get a ValueError instead:

# this throws a ValueError
print df.to_json(orient='index')
ValueError: Label array sizes do not match corresponding data shape

For what it's worth, my workaround, which is acceptable in my application, is to convert my index to strings:

df.index = df.index.astype(str)
print df.to_json(orient='split')
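
The workaround above can be sketched in Python 3 syntax as follows. This is a minimal illustration rather than a recommended fix: the helper name `to_json_with_plain_index` is hypothetical, and it works on a copy so the caller's categorical index is left untouched.

```python
import json

import pandas as pd

# Same data as in the report, with the categorical index built explicitly.
idx = pd.CategoricalIndex([1, 2, 3], categories=[1, 2, 3])
df = pd.DataFrame({"count": pd.Series([3, 2, 2], index=idx)})


def to_json_with_plain_index(frame, orient="split"):
    """Hypothetical helper: serialize a copy whose index is cast to strings."""
    safe = frame.copy()
    safe.index = safe.index.astype(str)  # categorical labels become plain strings
    return safe.to_json(orient=orient)


# Parse the JSON back so the structure can be inspected.
payload = json.loads(to_json_with_plain_index(df))
```

Note that the index labels come back as strings, so a consumer that needs integers has to convert them again after reading the JSON.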

Windows config:

INSTALLED VERSIONS

commit: None
python: 2.7.7.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.16.1
nose: None
Cython: 0.20.1
numpy: 1.8.1
scipy: 0.15.1
statsmodels: None
IPython: 2.1.0
sphinx: None
patsy: None
dateutil: 2.2
pytz: 2014.4
bottleneck: None
tables: 3.1.1
numexpr: 2.4
matplotlib: 1.3.1
openpyxl: 2.0.3
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None

Linux config:

INSTALLED VERSIONS

commit: None
python: 2.6.8.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.18-274.el5
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.16.1
nose: None
Cython: None
numpy: 1.9.2
scipy: None
statsmodels: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.4
bottleneck: None
tables: 3.2.0
numexpr: 2.4.3
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None

@jreback
Contributor

jreback commented Jun 9, 2015

Yep, not implemented ATM; pull requests welcome. This would require delving into the C code, but it's not that difficult. The reason this fails is that .values is the accessor the JSON code uses. It returns a numpy array for all types except Categorical/Sparse. I think this could be changed to .get_values() to handle these types properly (this would also densify a sparse series, for example).
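
The distinction described in this comment can be seen from Python. The sketch below is an illustration only: it uses `np.asarray` as a stand-in for the `.get_values()` call mentioned above (which, as far as I know, is no longer available in recent pandas releases), and it says nothing about how the C serializer was eventually changed.

```python
import numpy as np
import pandas as pd

idx = pd.CategoricalIndex([1, 2, 3], categories=[1, 2, 3])

# .values on a categorical index hands back a pandas Categorical,
# not a plain numpy array.
raw = idx.values

# Assumption: np.asarray is used here to "densify" the labels into
# an ordinary ndarray, similar in spirit to the old .get_values().
dense = np.asarray(idx)

print(type(raw).__name__)
print(type(dense).__name__, dense.tolist())
```

Code that expects an ndarray from `.values` therefore needs an explicit conversion step for categorical data.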

@jreback jreback added Bug Enhancement IO JSON read_json, to_json, json_normalize Categorical Categorical Data Type Difficulty Intermediate labels Jun 9, 2015
@jreback jreback added this to the 0.17.0 milestone Jun 9, 2015
@sborgeson
Author

With respect for all the work you do, avoiding a segfault is not an enhancement. If you don't have near-term plans to support serializing DataFrames with categorical indices (my example was the result of using pandas.core.algorithms.quantile), you should add a guard clause that throws an exception indicating this fact, rather than letting the code run through to a segfault in C. This is especially true because earlier versions of pandas (0.14 in my experience) did allow categorical indices, so a simple package update can introduce a segfault.

@sborgeson
Author

Looking a little deeper, I can see that there is a guard clause, but it is not working. It lives in io/json.py, in _format_axes() for FrameWriter, and relies on this line:

if not self.obj.index.is_unique and self.orient in ('index', 'columns'):
    raise ValueError(" .... ")

However, the CategoricalIndex in the example above returns True for is_unique, and because the 'split' orientation is not addressed in the guard clause, it runs past this protection and into C where it segfaults.
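
To illustrate why the uniqueness check does not help here, the sketch below shows that the index from the report is unique, and then applies a separate, purely hypothetical pre-check that rejects a CategoricalIndex regardless of `orient`. The function name `reject_categorical_index` and its message are assumptions made for this example; this is not the check that pandas contains.

```python
import pandas as pd

idx = pd.CategoricalIndex([1, 2, 3], categories=[1, 2, 3])
df = pd.DataFrame({"count": pd.Series([3, 2, 2], index=idx)})

# The existing guard only fires for non-unique indexes, and this one is unique.
unique_flag = df.index.is_unique


def reject_categorical_index(frame):
    """Hypothetical pre-check (not part of pandas): refuse categorical indexes."""
    if isinstance(frame.index, pd.CategoricalIndex):
        raise ValueError(
            "CategoricalIndex is not supported here; convert the index first"
        )


try:
    reject_categorical_index(df)
    guard_fired = False
except ValueError:
    guard_fired = True

# A frame with an ordinary index passes the pre-check without raising.
plain_result = reject_categorical_index(pd.DataFrame({"count": [3, 2, 2]}))
```

A check keyed on the index type, rather than on uniqueness, would cover every orientation, including 'split'.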

@jreback
Contributor

jreback commented Jun 9, 2015

@sborgeson I marked this as a bug; the enhancement tag is there because I think this may need to be restructured a bit.

As I said, happy to take a patch, even if it just raises a ValueError on the Python side, as we do try to avoid segfaults.

All that said, CategoricalIndex is quite a new feature (only released a few weeks ago), and even though we have quite a few tests, stuff does break. This is the first bug report for this.

Further, when upgrading across multiple major versions it is important to review the release notes and test. Which is how I am sure you found this :)

So I'm going to release 0.16.2 in the next few days; if you'd like to put up a simple patch, it would be great to get it in.

@jreback jreback modified the milestones: 0.16.2, 0.17.0 Jun 9, 2015
@jreback jreback added the Error Reporting Incorrect or improved errors from pandas label Jun 9, 2015
@sborgeson
Author

As I seem to have already demonstrated, I'm not sure I know enough about the latest index types to reliably patch this. I was thinking the clause would be something like (in io/json.py, _format_axes() for FrameWriter):

if isinstance(self.obj.index, pandas.core.index.CategoricalIndex):
    raise ValueError( ... )

But I've looked at core/index.py and I see that there are several index types, especially MultiIndex, that might also require protection. I also noticed that the ValueError that is thrown for orient='index' comes out of the C code rather than the guard clauses in Python, so I'm no longer convinced I understand the intention of the guard clauses I was looking at vs. the error handling in C. I think I'll have to wait for the benevolent actions of a developer more familiar with the code base than I am. So thanks for wrangling the bugs.

@jreback
Contributor

jreback commented Jun 10, 2015

@sborgeson

fixed in #10321

note that I am going to keep this open, as this should actually be handled in the c-code (maybe)

@evanpw
Contributor

evanpw commented Jun 10, 2015

I've got a fix for the C-code which passes the same tests as #10321.

@jreback
Contributor

jreback commented Jun 10, 2015

@evanpw ohh, excellent. Pull in my tests and put up your fix!

@jreback jreback modified the milestones: 0.16.2, 0.17.0 Jun 10, 2015
jreback added a commit that referenced this issue Jun 10, 2015
Bug in to_json causing segfault with a CategoricalIndex (GH #10317)
@jreback
Contributor

jreback commented Jun 10, 2015

closed by #10322

@sborgeson this is now merged into master and will be in 0.16.2. Thanks for the report!

@jreback jreback closed this as completed Jun 10, 2015
@sborgeson
Author

This was a very efficient process. Thanks for your work on such a great set of tools.


3 participants