
ENH: Add JSON export option for DataFrame (take 2) #1263

Closed
wants to merge 1 commit

Conversation

Komnomnomnom
Contributor

Second attempt at the JSON pull request (original #1226) for issue #631.

All tests pass apart from the two below; tested on 64-bit OS X and 32-bit Ubuntu.

Timestamp JSON encoding/decoding in test_frame and test_series fails. I haven't looked into this too much, since Wes mentioned the timestamp work is still ongoing, but it appears to happen because the code always operates on the underlying numpy array rather than the pandas objects, and the numpy array returns a bad Python date object to be encoded.

For numpy 1.6

(Pdb) series.index[0]  
Timestamp(2000, 1, 3, 0, 0)
(Pdb) series.index.values[0]
1970-01-11 232:00:00 
(Pdb) ujson.encode(series.index)   # encoder operates on series.index.values
'[1699200,864000,950400,1036800,1123200,1382400,1468800,1555200,1641600,1728000,1065600,1152000,1238400,1324800,1411200,1670400,1756800,921600,1008000,1094400,1353600,1440000,1526400,1612800,1699200,1036800,1123200,1209600,1296000,1382400]'

For numpy 1.7 the situation is improved and the encoding is correct (although the timestamps are in nanoseconds):

(Pdb) series.index[0]
Timestamp(2000, 1, 3, 0, 0)
(Pdb) series.index.values[0]
numpy.datetime64('2000-01-03T00:00:00.000000000+0000')
(Pdb) ujson.encode(series.index)
'[946857600000000000,946944000000000000,947030400000000000,947116800000000000,947203200000000000,947462400000000000,947548800000000000,947635200000000000,947721600000000000,947808000000000000,948067200000000000,948153600000000000,948240000000000000,948326400000000000,948412800000000000,948672000000000000,948758400000000000,948844800000000000,948931200000000000,949017600000000000,949276800000000000,949363200000000000,949449600000000000,949536000000000000,949622400000000000,949881600000000000,949968000000000000,950054400000000000,950140800000000000,950227200000000000]'

@wesm
Member

wesm commented May 21, 2012

Cool. Do you have an opinion on whether timestamps should be converted to JavaScript timestamps (milliseconds since the epoch)? Everything is nanoseconds now in pandas. I'm guessing "probably not", but I guess it could be an option in to_json.

@Komnomnomnom
Contributor Author

Hmm, when decoding to DataFrames it obviously shouldn't be an issue, but it would be nice to have the option of milliseconds and/or seconds for sharing data with JavaScript, assuming it's easy to deduce the unit from the datetime object. Especially as JSON will probably (?!) primarily be used for sending data client-side (i.e. to browsers).

Note the original ujson encodes datetimes to seconds, and I don't think there is a standard JSON timestamp unit.

Sort of a tangent, but it would also be nice to have an efficient way of converting those timestamps back to datetimes when rebuilding the DataFrame (I'm not suggesting it should be part of the JSON decoding). Maybe that exists already and I'm just not aware of it?
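
For what it's worth, a rough sketch of one way to go back from integer nanosecond timestamps to a DatetimeIndex with plain numpy/pandas (not part of this PR; exact behaviour and reprs may differ by version):

>>> import numpy as np
>>> import pandas as pd
>>> ns = np.array([946857600000000000, 946944000000000000], dtype='int64')
>>> pd.DatetimeIndex(ns.view('M8[ns]'))[0]   # reinterpret int64 ns as datetime64[ns]
Timestamp('2000-01-03 00:00:00')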

@seth-brown

Just throwing in my 2 cents; I'd really like a simple way to export a DataFrame as JSON in Pandas.

@wesm
Member

wesm commented May 23, 2012

@drBunsen after this PR is merged, it will be dirt simple: df.to_json() (or possibly with a different JSON format specified, depending on your application).
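
For illustration, here is roughly what that looks like with a recent pandas (the orient parameter and the default format may differ across versions):

>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2], 'b': [3.5, 4.5]}, index=['x', 'y'])
>>> df.to_json()                    # default: column -> {index -> value}
'{"a":{"x":1,"y":2},"b":{"x":3.5,"y":4.5}}'
>>> df.to_json(orient='records')    # list of row objects, handy for JavaScript
'[{"a":1,"b":3.5},{"a":2,"b":4.5}]'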

@wesm
Member

wesm commented May 23, 2012

@Komnomnomnom working on the timestamp handling issues. Doing everything (in particular, working with the pandas data structures) in C is probably not the best long-term solution, since Index subclasses may not be completely represented as NumPy arrays. I don't have time to refactor this, but I'm going to insert some kludges (+ tests) to deal with the datetime64[ns] arrays so that things work as they should for now.

@wesm
Member

wesm commented May 24, 2012

OK I think I have the kludge mostly sorted. @Komnomnomnom you have CRLF line endings in these files:

http://help.github.com/line-endings/

wesm added a commit that referenced this pull request May 24, 2012
@wesm
Member

wesm commented May 24, 2012

Cool, I merged this. There will no doubt be a bunch of follow-up issues as things settle. One question is how to encode datetime.datetime objects:

In [2]: ujson.encode(datetime.now())
Out[2]: '1337804262'

In [3]: np.array(datetime.now(), dtype='M8[ns]').view('i8')
Out[3]: array(1337804288505948000)

I have the appropriate code in datetime.pyx and np_datetime.c to do the PyDateTime -> nanosecond timestamp conversion. I need to think a bit more; I don't think it can be left as is. There should also be some flexibility about the unit for timestamps. Exporting as millisecond timestamps, for example, is probably what you want for JavaScript Date objects, e.g. for d3 usage.
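
As a quick illustration of the unit question (not the PR's implementation), the nanosecond value above divides down to a millisecond timestamp that a JavaScript Date can take directly:

>>> ns = 1337804288505948000        # nanoseconds since the epoch, from above
>>> ns // 10**6                     # truncate to milliseconds
1337804288505
>>> # JavaScript: new Date(1337804288505) -> 2012-05-23T20:18:08.505Z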

@Komnomnomnom
Contributor Author

Thanks Wes. Re the CRLF, I initially preserved the line endings and tabs just to be consistent with the original ujson. Should have switched them for pandas though :).

@Komnomnomnom
Contributor Author

It will probably be necessary to add special handling functions (in either C or Python) for complex objects like MultiIndex, but here are some thoughts about different options for the ujson code to deal with the pandas wrapping of numpy values, i.e. things like the timestamp issue.

  1. write a custom PyArray getitem C function in ujson code which would do any necessary interpretation of the values. This would mean duplicating any pandas special handling into this function. (yuck)
  2. when encoding numpy arrays, use the Python __getitem__ method of a 'container' object if one is supplied, where e.g. the container is a pandas object (see the Python sketch below). This adds some overhead, but it keeps things general and shouldn't change the current code too much. It does assume that all the special handling would be taken care of by __getitem__ though.
  3. add special handling functions for series, index and dataframe (i.e. don't consider them to be numpy arrays) and call their iteration and accessor methods directly. (Note special handling functions already exist but they are merely thin wrappers around the numpy array handling, and only used to support the split format).
  4. pass ujson a python function or object which will be called to deal with iteration and/or accessing items.

There are probably more possibilities, but if it works, 2 is the best option IMO: it doesn't impact the current code too much and keeps things general. Although if the timestamp issue is the only one, perhaps the kludge should remain until numpy 1.7 comes along?
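
Here's a minimal pure-Python sketch of what option 2 amounts to (hypothetical names; the real change would live in the ujson C encoder): item access goes through the pandas container's __getitem__ when one is supplied, so values come back boxed (e.g. as Timestamp) rather than as raw numpy scalars.

# hypothetical illustration of option 2, not the actual ujson C code
def iter_encodable(values, container=None):
    """Yield items for encoding, preferring the container's __getitem__."""
    getter = container if container is not None else values
    for i in range(len(values)):
        yield getter[i]   # boxed pandas value (e.g. Timestamp) when a container is given

# e.g. list(iter_encodable(series.index.values, container=series.index))
# yields Timestamp objects instead of raw datetime64 / integer values.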

@wesm
Member

wesm commented May 24, 2012

Will have to return to this at some point. I am fairly certain a lot of performance is being left on the table due to all of the "boxing" of array values. The right approach as always is to add performance tests to the vbench suite (vb_suite/) so we can monitor and track the performance of to_json.
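
A rough sketch of what such an entry might look like, assuming vbench's Benchmark API and a new module under vb_suite/ (file and benchmark names here are hypothetical):

# vb_suite/json_bench.py (hypothetical)
from vbench.api import Benchmark

setup = """
import numpy as np
from pandas import DataFrame
df = DataFrame(np.random.randn(100000, 5))
"""

frame_to_json = Benchmark("df.to_json()", setup, name='frame_to_json')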

I'm testing on both NumPy 1.6 and 1.7, so as long as the kludge works and the tests pass, good enough for me right now.

@trottier

fyi: ultrajson/ultrajson#83

@jreback
Contributor

jreback commented Jun 26, 2013

@trottier this is in master now (#3876); docs: http://pandas.pydata.org/pandas-docs/dev/io.html#json

@trottier

Sorry, I should have been more explicit. I would recommend against using ujson in pandas because (and unfortunately this isn't documented) ujson handles floating-point numbers unconventionally.

ultrajson/ultrajson#69 (comment)
ultrajson/ultrajson#83
ultrajson/ultrajson#90

>>> import ujson
>>> ujson.dumps(1e-40)
'0.0'
>>> ujson.dumps(1e-40, double_precision=17)
'0.0'

simplejson is almost as fast, and doesn't have these issues.
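
For comparison, the standard-library json module (and simplejson) round-trips the same value as expected, at least on a recent Python:

>>> import json
>>> json.dumps(1e-40)
'1e-40'
>>> json.loads(json.dumps(1e-40)) == 1e-40
True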
