
BUG: fix json segfaults #12802

Closed · wants to merge 2 commits

Conversation

Komnomnomnom commented Apr 5, 2016

closes #11473
closes #10778
closes #11299

  • tests added / passed
  • vbench / asv ok
  • windows tests pass
  • valgrind clean
  • passes git diff upstream/master | flake8 --diff
  • whatsnew entry

This fixes several potential segfaults in the json code:

Fixing #10778 means non-ndarray blocks are now supported (although I think at present category is the only one?).
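
For reference, a minimal sketch of the categorical case from #10778 that could previously crash (the column name and data here are illustrative):

import pandas as pd

df = pd.DataFrame({'x': list('aabbc')})
df['x'] = df['x'].astype('category')   # block backed by a Categorical, not an ndarray
print(df.to_json())                    # previously a potential segfault; now valid JSON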

Not tested on Windows yet. I'm seeing some unrelated Travis failures on my fork (msgpack); is that normal?

jreback added the IO JSON (read_json, to_json, json_normalize) and Bug labels on Apr 5, 2016
jreback commented Apr 5, 2016

non-ndarray values are the extension types:

  • categorical
  • datetime w/tz
  • sparse (which are actually a ndarray sub-class); not a big deal ATM.

jreback added this to the 0.18.1 milestone on Apr 5, 2016
jreback added the Dtype Conversions, Timezones, Compat, and Categorical labels on Apr 5, 2016
jreback commented Apr 5, 2016

for tz (this is on 0.18.0), but add it as a test for your PR:

In [5]: df = DataFrame({'A' : pd.date_range('20130101',periods=3,tz='US/Eastern'), 'B' : pd.date_range('20130101',periods=3)})

In [6]: df
Out[6]: 
                          A          B
0 2013-01-01 00:00:00-05:00 2013-01-01
1 2013-01-02 00:00:00-05:00 2013-01-02
2 2013-01-03 00:00:00-05:00 2013-01-03

In [7]: df.dtypes
Out[7]: 
A    datetime64[ns, US/Eastern]
B                datetime64[ns]
dtype: object

In [8]: df.to_json()
/Users/jreback/miniconda/envs/coursebuild/bin/python.app: line 3: 49679 Segmentation fault: 11  /Users/jreback/miniconda/envs/coursebuild/python.app/Contents/MacOS/python "$@"

Komnomnomnom (author) commented:

welp, ok I'll take another look.

jreback commented Apr 5, 2016

@Komnomnomnom thanks!

jreback commented Apr 9, 2016

please update when you can

Komnomnomnom (author) commented:

The json code assumed in several places that the .values attribute would always return an ndarray, which of course is not the case for any of the extension types. It then makes extensive use of the numpy C-API to iterate over the assumed ndarray, which is a terrible idea for non-ndarray objects (or even ndarray subclasses in the case of sparse).

I added a helper function to handle this a bit better. It initially attempts to use the .values attribute and falls back to get_values() if necessary. If get_values() fails an exception is raised. Subclasses of ndarray are not accepted. Seems to work ok now for sparse and categorical, but it could be better as some coercion / copying is needed.
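
Roughly, the helper's logic in Python terms (the actual implementation is in the ujson C code; the function name here is illustrative):

import numpy as np

def get_block_values(block):
    # Prefer .values; must be an exact ndarray (subclasses rejected)
    values = block.values
    if type(values) is not np.ndarray:
        try:
            values = block.get_values()
        except AttributeError:
            raise TypeError("%r is not JSON serializable" % type(block))
    if type(values) is not np.ndarray:   # e.g. get_values() returned a DatetimeIndex
        raise TypeError("%r is not JSON serializable" % type(block))
    return values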

datetime w/tz no longer segfaults, but it is not JSON serializable, since neither .values nor get_values() returns an ndarray. I'm inclined to think this is a bug?

In [35]: df
Out[35]: 
                          A
0 2013-01-01 00:00:00-05:00
1 2013-01-02 00:00:00-05:00
2 2013-01-03 00:00:00-05:00

In [36]: df.values
Out[36]: 
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
               '2013-01-03 00:00:00-05:00'],
              dtype='datetime64[ns, US/Eastern]', freq='D')

In [37]: df.get_values()
Out[37]: 
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
               '2013-01-03 00:00:00-05:00'],
              dtype='datetime64[ns, US/Eastern]', freq='D')

In [38]: df._data.blocks[0].values
Out[38]: 
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
               '2013-01-03 00:00:00-05:00'],
              dtype='datetime64[ns, US/Eastern]', freq='D')

In [39]: df._data.blocks[0].get_values()
Out[39]: 
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
               '2013-01-03 00:00:00-05:00'],
              dtype='datetime64[ns, US/Eastern]', freq='D')

In [40]: type(df.values)
Out[40]: pandas.tseries.index.DatetimeIndex

In [41]: df.dtypes
Out[41]: 
A    datetime64[ns, US/Eastern]
dtype: object

In [42]: type(df._data.blocks[0])
Out[42]: pandas.core.internals.DatetimeTZBlock

In [43]: df._data.blocks[0].dtype
Out[43]: datetime64[ns, US/Eastern]

jreback commented Apr 12, 2016

@Komnomnomnom no, that is all as expected. Internally the DatetimeTZBlock holds a DatetimeIndex rather than an ndarray (though it duck-types like an ndarray).

How do you think we should serialize this?

In general converting to an ndarray is not a great thing, as these become UTC times (don't be fooled by the repr; they have been converted).

In [44]: np.array(df._data.blocks[1].values)
Out[44]: 
array(['2013-01-01T05:00:00.000000000', '2013-01-02T05:00:00.000000000',
       '2013-01-03T05:00:00.000000000'], dtype='datetime64[ns]')

converting to a list will preserve everything

In [42]: list(df._data.blocks[1].values)
Out[42]: 
[Timestamp('2013-01-01 00:00:00-0500', tz='US/Eastern', offset='D'),
 Timestamp('2013-01-02 00:00:00-0500', tz='US/Eastern', offset='D'),
 Timestamp('2013-01-03 00:00:00-0500', tz='US/Eastern', offset='D')]

This converts to strings; not sure how useful that is to you (we use it for csv writing).

In [54]: df._data.blocks[1].values.to_native_types()
Out[54]: 
array(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
       '2013-01-03 00:00:00-05:00'], dtype=object)

This is probably the most general thing you can do (and it will coerce all things correctly). So if you find a block whose .values is not something you recognize, you can just do this:

In [57]: df._data.blocks[1].get_values(dtype=object)
Out[57]: 
array([Timestamp('2013-01-01 00:00:00-0500', tz='US/Eastern'),
       Timestamp('2013-01-02 00:00:00-0500', tz='US/Eastern'),
       Timestamp('2013-01-03 00:00:00-0500', tz='US/Eastern')], dtype=object)

jreback commented Apr 17, 2016

lmk when you are ready for review

jreback commented Apr 25, 2016

@Komnomnomnom how's this coming?

Komnomnomnom (author) commented:

@jreback should have some time this evening

Komnomnomnom (author) commented:

(on UK time)

jreback commented Apr 25, 2016

@Komnomnomnom great thanks!

Komnomnomnom (author) commented:

@jreback TZ block is supported now. Note that at the moment the JSON handler always converts to UTC when encoding, and this seems to be the case for any time object. It would be nice if it honoured TZ-aware timestamps (see #12997), but I've added some regression tests to ensure this behaviour remains consistent in the meantime.
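
For illustration, a small sketch of that UTC behaviour (the epoch value assumes US/Eastern is UTC-5 on that date):

import pandas as pd

s = pd.Series(pd.date_range('20130101', periods=1, tz='US/Eastern'))
print(s.to_json())   # {"0":1357016400000} -- epoch ms for 2013-01-01 05:00:00 UTC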

The behaviour now is to repeatedly call .values on an object until a concrete ndarray is returned, falling back to get_values() if that doesn't work (see the sketch after this list). The resulting calls end up being:

  • DatetimeTZBlock -> tzblock.values.values
  • SparseBlock -> spblock.values.values
  • CategoricalBlock -> catblock.get_values()
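
A rough Python rendering of that lookup (the real code is in the ujson C layer; the function name and depth limit here are illustrative):

import numpy as np

def unbox_block_values(block, max_levels=3):
    # Follow .values until we reach a plain ndarray, e.g.
    #   DatetimeTZBlock.values -> DatetimeIndex, DatetimeIndex.values -> ndarray
    #   SparseBlock.values -> SparseArray, SparseArray.values -> dense ndarray
    obj = block
    for _ in range(max_levels):
        if type(obj) is np.ndarray:        # an exact ndarray, not a subclass
            return obj
        nxt = getattr(obj, 'values', None)
        if nxt is None or nxt is obj:
            break
        obj = nxt
    # Fall back for blocks with no ndarray .values chain (e.g. CategoricalBlock)
    return block.get_values()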

I just want to give it a run through vbench, otherwise I'm happy with it.

Let me know if you have any comments. Am I missing something regarding the TZ behaviour?

jreback commented Apr 26, 2016

@Komnomnomnom no problem with coercing to UTC; that's what we do anyway when converting to anything that doesn't support tz.

note that this is numpy 1.11; under older versions you would get a different display (but the actual values are correct).

In [1]: s = Series(pd.date_range('20130101',periods=3,tz='US/Eastern'))

In [2]: s
Out[2]: 
0   2013-01-01 00:00:00-05:00
1   2013-01-02 00:00:00-05:00
2   2013-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]

In [3]: s.values
Out[3]: 
array(['2013-01-01T05:00:00.000000000', '2013-01-02T05:00:00.000000000',
       '2013-01-03T05:00:00.000000000'], dtype='datetime64[ns]')

@@ -321,6 +322,9 @@ Deprecations



- Potential segfault in ``DataFrame.to_json`` when serialising ``datetime.time`` (:issue:`11473`).
jreback (review comment):

add the issue for the datetime w/tz fixes?

Komnomnomnom (author) replied:

DatetimeTZBlock was covered by #10778 mentioned here. (#11473 is a separate issue related to datetime objects)

Komnomnomnom (author) commented:

@jreback is there CI support for windows? If not is there any chance you could run the JSON tests on your windows box?

jreback commented Apr 26, 2016

yes, you can sign up for http://www.appveyor.com/.

we run this on master, but you can run it on a personal (free) account.

I will run the tests now (on my local vm).

jreback commented Apr 26, 2016

@Komnomnomnom passes!

jreback commented Apr 26, 2016

side issue: for conformity I was going to rename `pandas\io\tests\test_json` to `pandas\io\tests\json\` (this also needs a tiny change in `setup.py`). Want to do that here?

jreback commented Apr 26, 2016

or I can do it in a follow-up.

Komnomnomnom (author) commented:

Great!

np I'll rename it now

jreback commented Apr 26, 2016

@Komnomnomnom this looks fine. ping me when your travis run is green (no need to wait for master)

jreback commented Apr 26, 2016

@Komnomnomnom

also move test_json_norm.py in there as well.

jreback commented Apr 26, 2016

@Komnomnomnom looks good. ping when green.

Komnomnomnom (author) commented:

Some unrelated msgpack and pickle failures (during version comparison?), not sure what's going on there. Otherwise clean:

https://travis-ci.org/Komnomnomnom/pandas/builds/125890385

jreback commented Apr 26, 2016

@Komnomnomnom thanks. yeah, those happen when the build doesn't get the tag information; not really sure why it doesn't always work.

jreback closed this in 37a7e69 on Apr 26, 2016
jreback commented Apr 26, 2016

thanks for the fixes!
