
ENH: add ujson support in pandas.io.json #3804

Merged 10 commits into pandas-dev:master on Jun 11, 2013

Conversation

@jreback (Contributor) commented Jun 7, 2013

This is @wesm's PR #3583, plus the following:

It builds now and passes travis on py2 and py3; there were 2 issues:

  • clean was erasing the *.c files from ujson
  • the module import didn't work because it was using the original init function

Converted to new io API: to_json / read_json

Docs added

@hayd (Contributor) commented Jun 7, 2013

yay!

@hayd (Contributor) commented Jun 8, 2013

This is pretty awesome.

One thing I think is worth being explicit about in the docs (am I right in saying this?): it only works with valid JSON.

@jreback (Contributor, Author) commented Jun 8, 2013

the json routines read/write from strings;
this is unlike any of the other io routines that pandas has,
which all take a path_or_buf

is this typical of dealing with JSON data?

should we have a kw to do this? or always do it?

@hayd (Contributor) commented Jun 8, 2013

@jreback That is an excellent point; this should work like all the other read_* functions. I don't think it's necessarily typical to always have the string, but at least it makes it clear that read_json takes the entire string at once rather than in chunks.

(The first thing I did was open a json file and try f.readlines(), then f.read().)

It'd certainly be a useful feature if we could go pd.from_json(datas_url), and perhaps this would be a fairly standard use case.

We could either:

  1. Have a string kwarg; filepath_or_buffer stays the first argument (this would be my preference).
  2. Check if it's a filepath_or_buffer; if it isn't, treat it as a json string (seems like a can of worms).

(Also, to clarify the previous point: from_json only reads valid json :) )
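(For reference, a usage sketch of option 1 roughly as it was merged; per the commit messages further down, read_json ended up accepting a JSON string as well as a filepath or buffer. The filename below is hypothetical.)

import pandas as pd

# read_json accepts the JSON document itself as a string...
df = pd.read_json('[{"a": 1, "b": 2}, {"a": 3, "b": 4}]')

# ...or a path / file-like object, like the other read_* parsers.
df = pd.read_json("data.json")  # hypothetical local file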

@jreback (Contributor, Author) commented Jun 8, 2013

do you have a URL that yields JSON?

@hayd (Contributor) commented Jun 8, 2013

A more interesting example: https://api.github.com/repos/pydata/pandas/issues?per_page=100

@jreback (Contributor, Author) commented Jun 8, 2013

parsed first try!

(Pdb) url_table
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 0 to 99
Data columns (total 19 columns):
assignee        6  non-null values
body            100  non-null values
closed_at       0  non-null values
comments        100  non-null values
comments_url    100  non-null values
created_at      100  non-null values
events_url      100  non-null values
html_url        100  non-null values
id              100  non-null values
labels          100  non-null values
labels_url      100  non-null values
milestone       75  non-null values
number          100  non-null values
pull_request    100  non-null values
state           100  non-null values
title           100  non-null values
updated_at      100  non-null values
url             100  non-null values
user            100  non-null values
dtypes: int64(3), object(16)
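
(A minimal sketch of reproducing this with the merged API, which reads from a URL per the commit messages further down:)

import pandas as pd

# Fetch the GitHub issues JSON suggested above and parse it into a frame.
url = "https://api.github.com/repos/pydata/pandas/issues?per_page=100"
url_table = pd.read_json(url)
url_table.info()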

@hayd (Contributor) commented Jun 8, 2013

The one thing that trips it up is dates (you just have to to_datetime afterwards), but that can be left for another day.

Whoop! :)

@jreback (Contributor, Author) commented Jun 8, 2013

yeh....we'll see how this goes....in 0.12 we can add an infer_types directive, kind of like read_html

@cpcloud (Member) commented Jun 8, 2013

i wonder if there are any other similar libraries or systems that have this much io functionality in a single package...

@hayd (Contributor) commented Jun 8, 2013

(Does infer_types work for unix time stamps? ...to get the roundtrip working. Anyway....)

@cpcloud (Member) commented Jun 8, 2013

doubtful since those are just integers...but i haven't tested

@cpcloud (Member) commented Jun 8, 2013

i tried

date +"%s" | python -c 'import sys; from dateutil.parser import parse; parse(sys.stdin.read())'

that doesn't work, so i'm going to say no, it won't work.

@hayd (Contributor) commented Jun 8, 2013

You can do pd.to_datetime on the column after reading.
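
(A small sketch, with a hypothetical column name: epoch values come through as plain integers, and to_datetime converts them given the unit.)

import pandas as pd

# Unix timestamps (seconds since 1970) read from JSON as int64;
# convert after reading by telling to_datetime the unit.
df = pd.DataFrame({"created_at": [1370649600, 1370736000]})
df["created_at"] = pd.to_datetime(df["created_at"], unit="s")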

@jreback (Contributor, Author) commented Jun 8, 2013

yes i know....we have an open issue #3540 to make a better API for this, but look at #2015

so if you know they are epoch timestamps (e.g. passed in as an option maybe), then it's easy, we can convert them

@jreback (Contributor, Author) commented Jun 8, 2013

Timestamp accepts a nanosecond-based epoch timestamp (i.e. nanoseconds since 1970); epoch timestamps are in seconds since 1970, so just multiply by int(1e9) and it will work.....

but there is an issue because sometimes they are not in seconds...so we have to disambiguate
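
(Illustrating the scaling:)

from pandas import Timestamp

# Timestamp interprets a bare integer as nanoseconds since the epoch,
# so a value in seconds must be scaled by 10**9 first.
epoch_seconds = 1370649600
ts = Timestamp(epoch_seconds * int(1e9))  # Timestamp('2013-06-08 00:00:00')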

@hayd (Contributor) commented Jun 8, 2013

Just saying, since to_json exports timestamps as unix time.
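
(A quick check of that, as a sketch:)

import pandas as pd

# to_json writes datetimes as epoch values by default
# (date_format='epoch'; milliseconds in current pandas).
s = pd.Series(pd.to_datetime(["2013-06-08"]))
print(s.to_json())  # {"0":1370649600000}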

@cpcloud (Member) commented Jun 8, 2013

oh yes Timestamp works...hmm maybe we should add this to read_html...i doubt people are using html tables to store unix timestamps but who the hell knows? maybe i'll wait until the api is sorted out

@jreback (Contributor, Author) commented Jun 8, 2013

could add a parse_dates arg that takes a list of fields to try to convert?
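
(Roughly what landed, per the commit messages further down, which name the keyword parse_dates; in later pandas the read_json keyword is convert_dates. The payload and column name here are hypothetical.)

import pandas as pd

# Only the listed fields are attempted as dates; the epoch-millisecond
# integers in "created_at" are converted, "id" is left alone.
payload = '[{"created_at": 1370649600000, "id": 1}]'
df = pd.read_json(payload, convert_dates=["created_at"])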

@jreback (Contributor, Author) commented Jun 8, 2013

read_html is much tougher because people are just reading arbitrary tables in...and there is no standard...@hayd is right...since the json export is a standard, we could even try to convert an integer column (if the values are in range)?

@hayd (Contributor) commented Jun 8, 2013

(obviously far too reckless to just .applymap(pd.to_datetime) lol)

@jreback (Contributor, Author) commented Jun 8, 2013

what's a quick way to fix inconsistent spaces/tabs....something got screwed up...

@cpcloud (Member) commented Jun 8, 2013

M-x untabify

@cpcloud (Member) commented Jun 8, 2013

that works on a region i think; prolly the whole file works too

@jreback (Contributor, Author) commented Jun 8, 2013

@hayd actually .apply(pd.to_datetime) will not change the column if all conversions fail, so it is sort of safe

try out: convert_objects(convert_dates='coerce')
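
(A sketch of the modern equivalent; convert_objects has since been deprecated and removed in favor of to_datetime with errors='coerce'.)

import pandas as pd

# 'coerce' forces the conversion: unparseable entries become NaT
# instead of raising, so it is safe on mixed columns.
df = pd.DataFrame({"when": ["2013-06-08", "not a date"]})
df["when"] = pd.to_datetime(df["when"], errors="coerce")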

@hayd (Contributor) commented Jun 8, 2013

Quite slow though?

@cpcloud (Member) commented Jun 8, 2013

convert_objects is ok speed-wise; it operates on blocks using cython functions, so it's gotta be faster than lambdas :)

@cpcloud (Member) commented Jun 8, 2013

@jreback correct me if i'm wrong here...

@jreback (Contributor, Author) commented Jun 11, 2013

@Komnomnomnom go ahead and paste here
and we'll get this in

@Komnomnomnom (Contributor) commented

Ok, the following patch should make it safe to call Npy_releaseContext multiple times (which is what was causing the problem). The segmentation fault is gone and valgrind output from a Python 2.7 debug build is clean. Likewise, all tests pass on Python 2.7 and the valgrind output for the json tests is clean (i.e. there are no warnings for json-related code).

diff --git a/pandas/src/ujson/python/JSONtoObj.c b/pandas/src/ujson/python/JSONtoObj.c
index 1db7586..160c30f 100644
--- a/pandas/src/ujson/python/JSONtoObj.c
+++ b/pandas/src/ujson/python/JSONtoObj.c
@@ -10,6 +10,7 @@ typedef struct __PyObjectDecoder
     JSONObjectDecoder dec;

     void* npyarr;       // Numpy context buffer
+    void* npyarr_addr;  // Ref to npyarr ptr to track DECREF calls
     npy_intp curdim;    // Current array dimension

     PyArray_Descr* dtype;
@@ -67,9 +68,7 @@ void Npy_releaseContext(NpyArrContext* npyarr)
         }
         if (npyarr->dec)
         {
-            // Don't set to null, used to make sure we don't Py_DECREF npyarr
-            // in releaseObject
-            // npyarr->dec->npyarr = NULL;
+            npyarr->dec->npyarr = NULL;
             npyarr->dec->curdim = 0;
         }
         Py_XDECREF(npyarr->labels[0]);
@@ -88,6 +87,7 @@ JSOBJ Object_npyNewArray(void* _decoder)
     {
         // start of array - initialise the context buffer
         npyarr = decoder->npyarr = PyObject_Malloc(sizeof(NpyArrContext));
+        decoder->npyarr_addr = npyarr;

         if (!npyarr)
         {
@@ -515,7 +515,7 @@ JSOBJ Object_newDouble(double value)
 static void Object_releaseObject(JSOBJ obj, void* _decoder)
 {
     PyObjectDecoder* decoder = (PyObjectDecoder*) _decoder;
-    if (obj != decoder->npyarr)
+    if (obj != decoder->npyarr_addr)
     {
         Py_XDECREF( ((PyObject *)obj));
     }
@@ -555,6 +555,7 @@ PyObject* JSONToObj(PyObject* self, PyObject *args, PyObject *kwargs)
     pyDecoder.dec = dec;
     pyDecoder.curdim = 0;
     pyDecoder.npyarr = NULL;
+    pyDecoder.npyarr_addr = NULL;

     decoder = (JSONObjectDecoder*) &pyDecoder;

@@ -609,6 +610,7 @@ PyObject* JSONToObj(PyObject* self, PyObject *args, PyObject *kwargs)

     if (PyErr_Occurred())
     {
+        Npy_releaseContext(pyDecoder.npyarr);
         return NULL;
     }

wesm and others added 10 commits, June 11, 2013 10:02:

  • DOC: docs in io.rst/whatsnew/release notes/api
  • TST: cleaned up cruft in test_series/test_frame
  • …will return a StringIO object)
    read_json will read from a string-like or filebuf or url (consistent with other parsers)
  • …or JSON string
    added keywords parse_dates, keep_default_dates to allow for date parsing in columns of a Frame (default is False, not to parse dates)
  • …(which both can be
    can be parsed with parse_dates=True in read_json)

@jreback (Contributor, Author) commented Jun 11, 2013

patch applied.....looking good now

@hayd (Contributor) commented Jun 11, 2013

@jreback Something like this for requests: hayd@dbd968b

@jreback (Contributor, Author) commented Jun 11, 2013

@wesm this is mergeable....any objections?

@wesm (Member) commented Jun 11, 2013

Looks good to me, bombs away

@jreback (Contributor, Author) commented Jun 11, 2013

3.2.1.....

jreback added a commit that referenced this pull request Jun 11, 2013
ENH: add ujson support in pandas.io.json
jreback merged commit a7f37d4 into pandas-dev:master on Jun 11, 2013
@Komnomnomnom (Contributor) commented

Awesome. I'll see about merging in upstream changes. Will send thru a pull request soonish.

@jreback (Contributor, Author) commented Jun 11, 2013

oh...you have additional dependencies on this?

@Komnomnomnom (Contributor) commented

As mentioned in #3583, there have been some enhancements/fixes in ultrajson since the pandas json version was originally written. Nothing major (I think), and it should be straightforward enough to merge, but it'd be a good idea to keep them in sync.

@jreback (Contributor, Author) commented Jun 11, 2013

ok...sure...

@wesm (Member) commented Jun 13, 2013

thanks all for making this happen, especially to @Komnomnomnom for authoring this code in the first place =)
