
ENH: Add JSON export option for DataFrame (take 2) #1263

Closed
wants to merge 1 commit

Conversation

Komnomnomnom
Contributor

Second attempt at the JSON pull request (original #1226) for issue #631.

All tests pass apart from the two below; tested on 64-bit OS X and 32-bit Ubuntu.

Timestamp JSON encoding/decoding in test_frame and test_series fails. I haven't looked into this too much, since Wes mentioned the timestamp work is still ongoing, but it appears to happen because the code always operates on the underlying numpy array rather than the pandas objects, and the numpy array returns a bad Python date object to be encoded.

For numpy 1.6

(Pdb) series.index[0]  
Timestamp(2000, 1, 3, 0, 0)
(Pdb) series.index.values[0]
1970-01-11 232:00:00 
(Pdb) ujson.encode(series.index)   # encoder operates on series.index.values
'[1699200,864000,950400,1036800,1123200,1382400,1468800,1555200,1641600,1728000,1065600,1152000,1238400,1324800,1411200,1670400,1756800,921600,1008000,1094400,1353600,1440000,1526400,1612800,1699200,1036800,1123200,1209600,1296000,1382400]'

For numpy 1.7 the situation is improved and the encoding is correct (although the timestamps are in nanoseconds):

(Pdb) series.index[0]
Timestamp(2000, 1, 3, 0, 0)
(Pdb) series.index.values[0]
numpy.datetime64('2000-01-03T00:00:00.000000000+0000')
(Pdb) ujson.encode(series.index)
'[946857600000000000,946944000000000000,947030400000000000,947116800000000000,947203200000000000,947462400000000000,947548800000000000,947635200000000000,947721600000000000,947808000000000000,948067200000000000,948153600000000000,948240000000000000,948326400000000000,948412800000000000,948672000000000000,948758400000000000,948844800000000000,948931200000000000,949017600000000000,949276800000000000,949363200000000000,949449600000000000,949536000000000000,949622400000000000,949881600000000000,949968000000000000,950054400000000000,950140800000000000,950227200000000000]'

@wesm
Member

wesm commented May 21, 2012

Cool. Do you have an opinion on whether timestamps should be converted to JavaScript timestamps (milliseconds since the epoch)? Everything is nanoseconds now in pandas. I'm guessing "probably not", but I guess it could be an option in to_json.

@Komnomnomnom
Contributor Author

Hmm, when decoding to DataFrames it obviously shouldn't be an issue, but it would be nice to have the option of milliseconds and/or seconds for sharing data with JavaScript, assuming it's easy to deduce the unit from the datetime object. Especially as JSON will probably (?!) primarily be used for sending data client-side (i.e. to browsers).

Note the original ujson encodes datetimes to seconds, and I don't think there is a standard JSON timestamp unit.

Sort of a tangent, but it would also be nice to have an efficient way of converting those timestamps back to datetimes when rebuilding the DataFrame (I'm not suggesting it should be part of the JSON decoding). Maybe that exists already and I'm just not aware of it?
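
For what it's worth, a rough sketch of one way to go back from integer nanosecond timestamps to a DatetimeIndex with plain numpy/pandas (not part of this PR; exact behaviour and reprs may differ by version):

>>> import numpy as np
>>> import pandas as pd
>>> ns = np.array([946857600000000000, 946944000000000000], dtype='int64')
>>> pd.DatetimeIndex(ns.view('M8[ns]'))[0]   # reinterpret int64 ns as datetime64[ns]
Timestamp('2000-01-03 00:00:00')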

@seth-brown

Just throwing in my 2 cents; I'd really like a simple way to export a DataFrame as JSON in Pandas.

@wesm
Member

wesm commented May 23, 2012

@drBunsen after this PR is merged, it will be dirt simple: df.to_json() (or possibly with a different JSON format specified, depending on your application).
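
For illustration, here is roughly what that looks like with a recent pandas (the orient parameter and the default format may differ across versions):

>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2], 'b': [3.5, 4.5]}, index=['x', 'y'])
>>> df.to_json()                    # default: column -> {index -> value}
'{"a":{"x":1,"y":2},"b":{"x":3.5,"y":4.5}}'
>>> df.to_json(orient='records')    # list of row objects, handy for JavaScript
'[{"a":1,"b":3.5},{"a":2,"b":4.5}]'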

@wesm
Member

wesm commented May 23, 2012

@Komnomnomnom working on the timestamp handling issues. Doing everything (in particular, working with the pandas data structures) in C is probably not the best long-term solution, since Index subclasses may not be completely represented as NumPy arrays. I don't have time to refactor this, but I'm going to insert some kludges (+ tests) to deal with the datetime64[ns] arrays so that things work as they should for now.

@wesm
Member

wesm commented May 24, 2012

OK I think I have the kludge mostly sorted. @Komnomnomnom you have CRLF line endings in these files:

http://help.github.com/line-endings/

wesm added a commit that referenced this pull request May 24, 2012
@wesm
Member

wesm commented May 24, 2012

Cool, I merged this. There will no doubt be a bunch of follow-up issues as things settle. One question is how to encode datetime.datetime objects:

In [2]: ujson.encode(datetime.now())
Out[2]: '1337804262'

In [3]: np.array(datetime.now(), dtype='M8[ns]').view('i8')
Out[3]: array(1337804288505948000)

I have the appropriate code in datetime.pyx and np_datetime.c to do the PyDateTime -> nanosecond timestamp conversion. I need to think a bit more; I don't think it can be left as is. There should also be some flexibility about the unit for timestamps. Exporting as millisecond timestamps, for example, is probably what you want for JavaScript Date objects, e.g. for d3 usage.
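
As a quick illustration of the unit question (not the PR's implementation), the nanosecond value above divides down to a millisecond timestamp that a JavaScript Date can take directly:

>>> ns = 1337804288505948000        # nanoseconds since the epoch, from above
>>> ns // 10**6                     # truncate to milliseconds
1337804288505
>>> # JavaScript: new Date(1337804288505) -> 2012-05-23T20:18:08.505Z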

@Komnomnomnom
Contributor Author

Thanks Wes. Re the CRLF, I initially preserved the line endings and tabs just to be consistent with the original ujson. Should have switched them for pandas though :).

@Komnomnomnom
Contributor Author

It will probably be necessary to add special handling functions (in either C or Python) for complex objects like MultiIndex, but here are some thoughts about different options for the ujson code to deal with the pandas wrapping of numpy values, i.e. things like the timestamp issue.

  1. write a custom PyArray getitem C function in ujson code which would do any necessary interpretation of the values. This would mean duplicating any pandas special handling into this function. (yuck)
  2. when encoding numpy arrays, use the Python __getitem__ method of a 'container' object if one is supplied, where e.g. the container is a pandas object (see the Python sketch below). This adds some overhead, but it keeps things general and shouldn't change the current code too much. It does assume that all the special handling would be taken care of by __getitem__ though.
  3. add special handling functions for series, index and dataframe (i.e. don't consider them to be numpy arrays) and call their iteration and accessor methods directly. (Note special handling functions already exist but they are merely thin wrappers around the numpy array handling, and only used to support the split format).
  4. pass ujson a python function or object which will be called to deal with iteration and/or accessing items.

There are probably more possibilities, but if it works, 2 is the best option IMO: it doesn't impact the current code too much and keeps things general. Although if the timestamp issue is the only one, perhaps the kludge should remain until numpy 1.7 comes along?
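
Here's a minimal pure-Python sketch of what option 2 amounts to (hypothetical names; the real change would live in the ujson C encoder): item access goes through the pandas container's __getitem__ when one is supplied, so values come back boxed (e.g. as Timestamp) rather than as raw numpy scalars.

# hypothetical illustration of option 2, not the actual ujson C code
def iter_encodable(values, container=None):
    """Yield items for encoding, preferring the container's __getitem__."""
    getter = container if container is not None else values
    for i in range(len(values)):
        yield getter[i]   # boxed pandas value (e.g. Timestamp) when a container is given

# e.g. list(iter_encodable(series.index.values, container=series.index))
# yields Timestamp objects instead of raw datetime64 / integer values.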

@wesm
Member

wesm commented May 24, 2012

Will have to return to this at some point. I am fairly certain a lot of performance is being left on the table due to all of the "boxing" of array values. The right approach as always is to add performance tests to the vbench suite (vb_suite/) so we can monitor and track the performance of to_json.
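
A rough sketch of what such an entry might look like, assuming vbench's Benchmark API and a new module under vb_suite/ (file and benchmark names here are hypothetical):

# vb_suite/json_bench.py (hypothetical)
from vbench.api import Benchmark

setup = """
import numpy as np
from pandas import DataFrame
df = DataFrame(np.random.randn(100000, 5))
"""

frame_to_json = Benchmark("df.to_json()", setup, name='frame_to_json')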

I'm testing on both NumPy 1.6 and 1.7, so as long as the kludge works and the tests pass, good enough for me right now.

@trottier

fyi: ultrajson/ultrajson#83

@jreback
Contributor

jreback commented Jun 26, 2013

@trottier this is in master now (#3876); docs: http://pandas.pydata.org/pandas-docs/dev/io.html#json

@trottier

Sorry, I should have been more explicit. I would recommend against using ujson in pandas because (and unfortunately this isn't documented) ujson handles floating-point numbers unconventionally.

ultrajson/ultrajson#69 (comment)
ultrajson/ultrajson#83
ultrajson/ultrajson#90

>>> import ujson
>>> ujson.dumps(1e-40)
'0.0'
>>> ujson.dumps(1e-40, double_precision=17)
'0.0'

simplejson is almost as fast, and doesn't have these issues.
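
For comparison, the standard-library json module (and simplejson) round-trips the same value as expected, at least on a recent Python:

>>> import json
>>> json.dumps(1e-40)
'1e-40'
>>> json.loads(json.dumps(1e-40)) == 1e-40
True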
