
BUG: fix json segfaults #12802

Closed · wants to merge 2 commits

Conversation

Komnomnomnom commented Apr 5, 2016

closes #11473
closes #10778
closes #11299

  • tests added / passed
  • vbench / asv ok
  • windows tests pass
  • valgrind clean
  • passes git diff upstream/master | flake8 --diff
  • whatsnew entry

This fixes several potential segfaults in the json code:

Fixing #10778 means non-ndarray blocks are now supported (although I think at present category is the only one?).
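
For reference, a minimal sketch of the categorical case from #10778 that could previously crash (the column name and data here are illustrative):

import pandas as pd

df = pd.DataFrame({'x': list('aabbc')})
df['x'] = df['x'].astype('category')   # block backed by a Categorical, not an ndarray
print(df.to_json())                    # previously a potential segfault; now valid JSON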

Not tested on Windows yet. I'm seeing some unrelated Travis failures on my fork (msgpack); is that normal?

jreback added the IO JSON (read_json, to_json, json_normalize) and Bug labels on Apr 5, 2016
jreback commented Apr 5, 2016

non-ndarray values are the extension types:

  • categorical
  • datetime w/tz
  • sparse (which are actually a ndarray sub-class); not a big deal ATM.

jreback added this to the 0.18.1 milestone on Apr 5, 2016
jreback added the Dtype Conversions, Timezones, Compat, and Categorical labels on Apr 5, 2016
jreback commented Apr 5, 2016

for tz (this is on 0.18.0), but add it as a test for your PR:

In [5]: df = DataFrame({'A' : pd.date_range('20130101',periods=3,tz='US/Eastern'), 'B' : pd.date_range('20130101',periods=3)})

In [6]: df
Out[6]: 
                          A          B
0 2013-01-01 00:00:00-05:00 2013-01-01
1 2013-01-02 00:00:00-05:00 2013-01-02
2 2013-01-03 00:00:00-05:00 2013-01-03

In [7]: df.dtypes
Out[7]: 
A    datetime64[ns, US/Eastern]
B                datetime64[ns]
dtype: object

In [8]: df.to_json()
/Users/jreback/miniconda/envs/coursebuild/bin/python.app: line 3: 49679 Segmentation fault: 11  /Users/jreback/miniconda/envs/coursebuild/python.app/Contents/MacOS/python "$@"

Komnomnomnom (author) commented:

welp, ok I'll take another look.

jreback commented Apr 5, 2016

@Komnomnomnom thanks!

jreback commented Apr 9, 2016

please update when you can

Komnomnomnom (author) commented:

The json code assumed in several places that the .values attribute would always return an ndarray, which of course is not the case for any of the extension types. It then makes extensive use of the numpy C-API to iterate over the assumed ndarray, which is a terrible idea for non-ndarray objects (or even ndarray subclasses in the case of sparse).

I added a helper function to handle this a bit better. It initially attempts to use the .values attribute and falls back to get_values() if necessary. If get_values() fails an exception is raised. Subclasses of ndarray are not accepted. Seems to work ok now for sparse and categorical, but it could be better as some coercion / copying is needed.
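
Roughly, the helper's logic in Python terms (the actual implementation is in the ujson C code; the function name here is illustrative):

import numpy as np

def get_block_values(block):
    # Prefer .values; must be an exact ndarray (subclasses rejected)
    values = block.values
    if type(values) is not np.ndarray:
        try:
            values = block.get_values()
        except AttributeError:
            raise TypeError("%r is not JSON serializable" % type(block))
    if type(values) is not np.ndarray:   # e.g. get_values() returned a DatetimeIndex
        raise TypeError("%r is not JSON serializable" % type(block))
    return values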

datetime w/tz no longer segfaults, but it is not JSON serializable, since neither .values nor get_values() returns an ndarray. I'm inclined to think this is a bug?

In [35]: df
Out[35]: 
                          A
0 2013-01-01 00:00:00-05:00
1 2013-01-02 00:00:00-05:00
2 2013-01-03 00:00:00-05:00

In [36]: df.values
Out[36]: 
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
               '2013-01-03 00:00:00-05:00'],
              dtype='datetime64[ns, US/Eastern]', freq='D')

In [37]: df.get_values()
Out[37]: 
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
               '2013-01-03 00:00:00-05:00'],
              dtype='datetime64[ns, US/Eastern]', freq='D')

In [38]: df._data.blocks[0].values
Out[38]: 
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
               '2013-01-03 00:00:00-05:00'],
              dtype='datetime64[ns, US/Eastern]', freq='D')

In [39]: df._data.blocks[0].get_values()
Out[39]: 
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
               '2013-01-03 00:00:00-05:00'],
              dtype='datetime64[ns, US/Eastern]', freq='D')

In [40]: type(df.values)
Out[40]: pandas.tseries.index.DatetimeIndex

In [41]: df.dtypes
Out[41]: 
A    datetime64[ns, US/Eastern]
dtype: object

In [42]: type(df._data.blocks[0])
Out[42]: pandas.core.internals.DatetimeTZBlock

In [43]: df._data.blocks[0].dtype
Out[43]: datetime64[ns, US/Eastern]

jreback commented Apr 12, 2016

@Komnomnomnom no, that is all as expected. Internally the DatetimeTZBlock holds a DatetimeIndex rather than an ndarray (though it duck-types like an ndarray).

How do you think we should serialize this?

In general converting to an ndarray is not a great thing, as these become UTC times (don't be fooled by the repr; they have been converted).

In [44]: np.array(df._data.blocks[1].values)
Out[44]: 
array(['2013-01-01T05:00:00.000000000', '2013-01-02T05:00:00.000000000',
       '2013-01-03T05:00:00.000000000'], dtype='datetime64[ns]')

converting to a list will preserve everything

In [42]: list(df._data.blocks[1].values)
Out[42]: 
[Timestamp('2013-01-01 00:00:00-0500', tz='US/Eastern', offset='D'),
 Timestamp('2013-01-02 00:00:00-0500', tz='US/Eastern', offset='D'),
 Timestamp('2013-01-03 00:00:00-0500', tz='US/Eastern', offset='D')]

This converts to strings; not sure how useful that is to you (we use it for csv writing).

In [54]: df._data.blocks[1].values.to_native_types()
Out[54]: 
array(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
       '2013-01-03 00:00:00-05:00'], dtype=object)

This is probably the most general thing you can do (and it will coerce all things correctly). So if you find a block whose .values is not something you recognize, you can just do this:

In [57]: df._data.blocks[1].get_values(dtype=object)
Out[57]: 
array([Timestamp('2013-01-01 00:00:00-0500', tz='US/Eastern'),
       Timestamp('2013-01-02 00:00:00-0500', tz='US/Eastern'),
       Timestamp('2013-01-03 00:00:00-0500', tz='US/Eastern')], dtype=object)

jreback commented Apr 17, 2016

lmk when you are ready for review

jreback commented Apr 25, 2016

@Komnomnomnom how's this coming?

Komnomnomnom (author) commented:

@jreback should have some time this evening

Komnomnomnom (author) commented:

(on UK time)

jreback commented Apr 25, 2016

@Komnomnomnom great thanks!

Komnomnomnom (author) commented:

@jreback TZ block is supported now. Note that at the moment the JSON handler always converts to UTC when encoding, and this seems to be the case for any time object. It would be nice if it honoured TZ-aware timestamps (see #12997), but I've added some regression tests to ensure this behaviour remains consistent in the meantime.
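
For illustration, a small sketch of that UTC behaviour (the epoch value assumes US/Eastern is UTC-5 on that date):

import pandas as pd

s = pd.Series(pd.date_range('20130101', periods=1, tz='US/Eastern'))
print(s.to_json())   # {"0":1357016400000} -- epoch ms for 2013-01-01 05:00:00 UTC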

The behaviour now is to repeatedly call .values on an object until a concrete ndarray is returned, falling back to get_values() if that doesn't work (see the sketch after this list). The resulting calls end up being:

  • DatetimeTZBlock -> tzblock.values.values
  • SparseBlock -> spblock.values.values
  • CategoricalBlock -> catblock.get_values()
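
A rough Python rendering of that lookup (the real code is in the ujson C layer; the function name and depth limit here are illustrative):

import numpy as np

def unbox_block_values(block, max_levels=3):
    # Follow .values until we reach a plain ndarray, e.g.
    #   DatetimeTZBlock.values -> DatetimeIndex, DatetimeIndex.values -> ndarray
    #   SparseBlock.values -> SparseArray, SparseArray.values -> dense ndarray
    obj = block
    for _ in range(max_levels):
        if type(obj) is np.ndarray:        # an exact ndarray, not a subclass
            return obj
        nxt = getattr(obj, 'values', None)
        if nxt is None or nxt is obj:
            break
        obj = nxt
    # Fall back for blocks with no ndarray .values chain (e.g. CategoricalBlock)
    return block.get_values()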

I just want to give it a run through vbench, otherwise I'm happy with it.

Let me know if you have any comments. Am I missing something regarding the TZ behaviour?

jreback commented Apr 26, 2016

@Komnomnomnom no problem with coercing to UTC; that's what we do anyway when converting to anything that doesn't support tz.

note that this is numpy 1.11; under older versions you would get a different display (but the actual values are correct).

In [1]: s = Series(pd.date_range('20130101',periods=3,tz='US/Eastern'))

In [2]: s
Out[2]: 
0   2013-01-01 00:00:00-05:00
1   2013-01-02 00:00:00-05:00
2   2013-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]

In [3]: s.values
Out[3]: 
array(['2013-01-01T05:00:00.000000000', '2013-01-02T05:00:00.000000000',
       '2013-01-03T05:00:00.000000000'], dtype='datetime64[ns]')

@@ -321,6 +322,9 @@ Deprecations



- Potential segfault in ``DataFrame.to_json`` when serialising ``datetime.time`` (:issue:`11473`).
jreback (review comment):

add the issue for the datetime w/tz fixes?

Komnomnomnom (author) replied:

DatetimeTZBlock was covered by #10778 mentioned here. (#11473 is a separate issue related to datetime objects)

Komnomnomnom (author) commented:

@jreback is there CI support for windows? If not is there any chance you could run the JSON tests on your windows box?

jreback commented Apr 26, 2016

yes, you can sign up for http://www.appveyor.com/.

we run this on master, but you can run it on a personal (free) account.

I will run the tests now (on my local vm).

jreback commented Apr 26, 2016

@Komnomnomnom passes!

jreback commented Apr 26, 2016

side issue: for conformity I was going to rename `pandas\io\tests\test_json` to `pandas\io\tests\json\` (this also needs a tiny change in `setup.py`). Want to do that here?

jreback commented Apr 26, 2016

or I can do it in a follow-up.

Komnomnomnom (author) commented:

Great!

np I'll rename it now

jreback commented Apr 26, 2016

@Komnomnomnom this looks fine. ping me when your travis run is green (no need to wait for master)

jreback commented Apr 26, 2016

@Komnomnomnom

also move test_json_norm.py in there as well.

jreback commented Apr 26, 2016

@Komnomnomnom looks good. ping when green.

Komnomnomnom (author) commented:

Some unrelated msgpack and pickle failures (during version comparison?), not sure what's going on there. Otherwise clean:

https://travis-ci.org/Komnomnomnom/pandas/builds/125890385

jreback commented Apr 26, 2016

@Komnomnomnom thanks. yeah, those happen when the build doesn't get the tag information; not really sure why it doesn't always work.

jreback closed this in 37a7e69 on Apr 26, 2016
jreback commented Apr 26, 2016

thanks for the fixes!
