ENH: add ujson support in pandas.io.json #3804
Conversation
jreback
referenced
this pull request
Jun 7, 2013
Closed
Moving pandasjson back into mainline pandas #3583
|
yay! |
|
This is pretty awesome. One thing I think is worth being explicit about in the docs (am I right in saying this?): it only works with valid JSON. |
|
the json routines read/write from strings. Is this typical of dealing with JSON data? Should we have a kw to do this? Always do it? |
|
@jreback That is an excellent point, this should work as all the other (The first thing I did was open a json_file It'd certainly be a useful feature if we could go We could either:
(Also, to clarify previous point, from_json only reads valid json :) ) |
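As an aside for readers, the point above (only valid JSON is accepted) can be illustrated with the reader as it eventually merged into pandas; a minimal sketch, assuming a current pandas install (the sample payload is mine):

```python
from io import StringIO

import pandas as pd

# Valid JSON parses; malformed input (single quotes, trailing commas, ...)
# raises ValueError from the decoder.
valid = '{"a": {"0": 1, "1": 2}, "b": {"0": 3, "1": 4}}'
df = pd.read_json(StringIO(valid))

try:
    pd.read_json(StringIO("{'a': 1}"))  # single quotes are not valid JSON
except ValueError:
    print("rejected invalid JSON")
```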
|
do u have a URL that yields JSON? |
|
A more interesting example: https://api.github.com/repos/pydata/pandas/issues?per_page=100 |
|
parsed first try!
|
|
The one thing that trips it up is dates (you just have to to_datetime after), but that can be left for another day. Whoop! :) |
|
yeh....we'll see how this goes....in 0.12 can add |
|
i wonder if there are any other similar libraries or systems that have this much io functionality in a single package... |
|
(Does infer_types work for unix time stamps? ...to get the roundtrip working. Anyway....) |
|
doubtful since those are just integers...but i haven't tested |
|
i tried `date +"%s" | python -c 'import sys; from dateutil.parser import parse; parse(sys.stdin.read())'`; that doesn't work, so I'm going to say no, it won't work. |
|
You can do |
|
but there is an issue because sometimes they are not in seconds...so have to disambiguate |
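The disambiguation being discussed (epoch values may be seconds, milliseconds, etc.) can be done by magnitude; a sketch with a function name of my own, not pandas API:

```python
import pandas as pd

def to_datetime_guess_unit(values):
    """Hypothetical heuristic: guess the epoch unit from the magnitude of
    the values, then hand off to pd.to_datetime with that unit."""
    s = pd.Series(values)
    m = s.abs().max()
    if m < 10**11:      # roughly: seconds, valid until far-future dates
        unit = "s"
    elif m < 10**14:    # milliseconds
        unit = "ms"
    elif m < 10**17:    # microseconds
        unit = "us"
    else:               # nanoseconds
        unit = "ns"
    return pd.to_datetime(s, unit=unit)

print(to_datetime_guess_unit([1370000000]))     # treated as seconds
print(to_datetime_guess_unit([1370000000000]))  # treated as milliseconds
```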
|
Just saying as |
|
oh yes |
|
could add a |
|
|
|
(obviously far too reckless to just |
|
what's a quick way to fix inconsistent spaces/tabs....something got screwed up... |
|
M-x untabify |
|
on a region I think, prolly whole file works too |
|
@hayd actually try out: |
|
Quite slow though? |
|
convert_objects is ok speed-wise; it operates on blocks using cython functions, so it's gotta be faster than lambdas :) |
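For readers on current pandas: convert_objects was later removed, and infer_objects is the modern block-wise analogue of what is described above (this is a sketch of today's API, not the 2013 one):

```python
import pandas as pd

# Object-dtype columns holding homogeneous values get converted block-wise
# in cython, with no per-element Python lambdas.
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]}, dtype=object)
converted = df.infer_objects()
print(converted.dtypes)
```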
|
@jreback correct me if i'm wrong here... |
|
Well:
|
|
i stand corrected... |
|
but, you actually wouldn't do that you would just take the int columns...so should be a bit faster..(as they are already parsed to int) |
|
Ok, so if we make it a unix time and roundtrip:
We could parse after we've created it pretty quick like this:
Then this is not toooo slow:
? In fact, just doing this over all columns isn't too slow:
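The unix-time roundtrip sketched above (the inline code was lost from this page) looks roughly like the following against the API as merged; the column name and dates are mine, and `date_format="epoch"` has since been deprecated in newer pandas, though it still works:

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({"t": pd.to_datetime(["2013-06-07", "2013-06-11"])})

payload = df.to_json(date_format="epoch")  # datetimes become epoch ms
back = pd.read_json(StringIO(payload), convert_dates=["t"])

assert back["t"].tolist() == df["t"].tolist()
```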
|
|
Ah, |
|
Why doesn't |
|
Yeah, let's not use any of the methods I mentioned. They're not robust at all. I'm not sure how you can check whether a unix time is in a certain range (since you may actually have a timestamp from |
|
try: |
|
I think most robust is to accept |
|
ok...updated to read from file/stringio/url and write to file (or stringio if it's None). we don't actually write to a string in any other to_xxx, but I think if you really want it you can pass None
alternatively I could make the path optional, but I have to have or should I just try to open any string as a file??? and if we can't then it's a JSON string? |
|
I was kind of thinking (if we wanted to have option to pass a string):
So you'd use it like |
|
ok...done (used the arg |
|
This is looking good! |
|
should we just scrap the json arg? I can test whether the file exists: if so then it's a file; otherwise I see if it has a read attribute, so it's like stringio; otherwise it's a string |
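The dispatch described in that comment could be sketched like this (function name and return values are mine, purely illustrative):

```python
import os

def interpret_json_input(obj):
    """Hypothetical sketch: decide whether `obj` is a file-like object,
    a URL, an existing path, or a raw JSON string."""
    if hasattr(obj, "read"):  # open file, StringIO, ...
        return "filelike", obj.read()
    if isinstance(obj, str):
        if obj.startswith(("http://", "https://", "ftp://")):
            return "url", obj
        if os.path.exists(obj):
            with open(obj) as fh:
                return "path", fh.read()
        # anything else is assumed to be JSON text itself
        return "json-string", obj
    raise TypeError(f"cannot interpret {type(obj).__name__} as JSON input")
```

One risk with this approach, raised later in the thread, is ambiguity: a short string could in principle collide with an existing filename, which is why an explicit keyword was also discussed.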
|
It could be safe, but then first arg is not the standard I think I prefer |
|
ok will change json to string |
|
Well, I was thinking you could then do Agreed about |
|
ok I guess it comes down to: do people typically use json as strings or write them to files? |
|
I did feel pretty strongly First argument could be Maybe if we wanted more control/"safety" we could pass a boolean ( Perhaps I am just overcomplicating/overthinking it. Sorry. |
|
Just to roll back to the dates munging thing.
although fast, it converts everything to dates, even columns of low numbers. It's not clear how we could "guess" whether columns were dates or not... but what about a hack like this, where we try to apply it to columns which are very likely to be dates. For example, I think these would cover a large number of cases (and we can tweak it):
And the user can add some additional ones to check?
Thoughts? |
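The name-based "likely date column" hack proposed above (the inline pattern list was lost from this page) could look like the following; the exact patterns are my illustration, not necessarily what was merged:

```python
import re

# Column names that very often hold dates: created_at, updated_at,
# timestamp, modified, *_date, *_time, etc.
DATE_NAME_RE = re.compile(
    r"(^date|_date$|_time$|^timestamp$|_at$|^modified$|^datetime)",
    re.IGNORECASE,
)

def looks_like_date_column(name):
    """Return True if the column name suggests it holds dates."""
    return bool(DATE_NAME_RE.search(str(name)))

print(looks_like_date_column("created_at"))  # likely a date
print(looks_like_date_column("count"))       # not a date
```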
|
So it's like:
:s |
|
@hayd ok..fixed up almost like you suggested, I dropped
I also try to convert the index on a Series and either the index or columns (depending on the orient) for a Frame if it looks like a datelike series, should I provide an option to turn this on/off? (again in case we are getting false positives?) |
|
@hayd maybe should add a date_format kw in to_json epoch (default) |
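The date_format kw proposed here did land in pandas; a minimal illustration of the two formats with a sample frame of my own:

```python
import pandas as pd

df = pd.DataFrame({"d": pd.to_datetime(["2013-06-11"])})

print(df.to_json(date_format="iso"))    # ISO 8601 strings
print(df.to_json(date_format="epoch"))  # milliseconds since the epoch
```

Note that in recent pandas versions "epoch" is deprecated in favour of "iso", which is the opposite of the compat-driven default chosen below.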
|
Having a kw for the way datetimes are output is an awesome idea! (what is the reason iso8601 is not the default?) It's probably a good idea to have Oooooh, good work:
Not sure I understand the series/column bit (I have a strong suspicion I'm being thick), are you suggesting some of these should work?
? |
And your examples from above (I turned on date parsing for series and its index)
This is interpreted as a frame (I don't try to parse column labels by default in a frame, I guess I could)
|
|
Wowza. This is looking good. (Probably the only dodgy bit is my "guess at which are date columns" hack, but you've made it easy to add to that. Going to look at some more apis and see how this does.) |
|
@hayd a couple of more examples from the web would be great |
|
@wesm ? |
|
Date col name coverage is pretty good I think, maybe this improves it slightly:
Everything has Just Worked™ so far. :) Thoughts on default |
|
I'll add those in default format I think for compat should be epoch |
|
I agree with epoch. Hmmm here's a failing one (the claim is it's valid: http://jsonlint.com/ ):
|
|
I'm going to throw a curve ball here (a can of worms for the future) about deeply nested json e.g. jsend etc., where the json is neatly wrapped up in silly tags e.g. Actually stuff is also going weird from those files (the ones on their site are actually invalid). When I do this I get a segmentation fault (!)... every time:
or
|
|
things are going to break this....I got the same behavior you did (no parse on the first, seg fault on the 2nd)....Not really sure why; error messages aren't great |
|
Could it be that it doesn't make sense to parse it to a DataFrame? |
|
This is the top link, which I grabbed, opened in notepad and saved to a string, it DOES parse, sort of
|
|
See what I mean about some of these being wrapped in junk, the meat is in items. Presumably it's not easy to pass in 'items' and we return dataframe for json['items'], or in the one above pass in ('data', 'posts') and get back json['data']['posts'] ? |
|
I wonder if it is the same problem and the stackexchange api is wrapping it in 'data'? i.e. ('data', 'items') |
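For readers: the "unwrap the meat" workflow discussed here is straightforward once the payload is parsed as plain Python, and later pandas versions grew json_normalize with a record_path that walks the nesting. A sketch with an illustrative payload shaped like the jsend example above:

```python
import pandas as pd

# Wrapped payload like the jsend / API shapes discussed above (structure
# and values are illustrative).
payload = {
    "status": "success",
    "data": {
        "posts": [
            {"id": 1, "title": "first"},
            {"id": 2, "title": "second"},
        ]
    },
}

# Manual unwrapping:
df = pd.DataFrame(payload["data"]["posts"])

# Or let json_normalize walk the path (added to pandas later than this PR):
df2 = pd.json_normalize(payload, record_path=["data", "posts"])
```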
|
You guys want me to merge this for 0.11.1? |
|
Nice stuff btw |
|
(I think it'd be great to have in 0.11.1 :) ) |
|
Looks great. Opens up the opportunity to read from databases like MongoDB, which store data as JSON, and get the data into a DataFrame (which can of course be very unstructured) and vice versa. See for instance [this](http://api.mongodb.org/python/current/tutorial.html#querying-for-more-than-one-document). |
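The MongoDB angle is mostly about lists of dicts: pymongo's find() yields documents as dicts, and a DataFrame can be built straight from them. A sketch with hard-coded documents standing in for a real cursor:

```python
import pandas as pd

# Documents as a MongoDB query would return them; with a real pymongo
# collection this would be e.g.  docs = list(collection.find())
docs = [
    {"author": "Mike", "text": "first post", "tags": ["mongodb", "python"]},
    {"author": "Anna", "text": "second post", "tags": ["pandas"]},
]

df = pd.DataFrame(docs)
print(df)
```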
|
Well, here's the valgrind output
Awesome. Let me look quickly at the invalid write (causing the segfault) |
|
I'm not equipped to debug this today. @Komnomnomnom can I beg you to take a look? |
|
(stating the obvious here: presumably it's to do with it's bad shape.) |
|
yeah, definitely raises an exception inside the decoder but there's some data that's being incorrectly freed or otherwise modified |
|
@wesm should be able to take a look tomorrow (on vacation, returning today) |
|
@jreback The stackoverflow issue is because of its gzip encoding. (Probably quite a few additional arguments for this in the future or maybe io.common...)
|
|
ahh....makes sense.....should prob be able to detect that.... actually I think I see the problem with that particular one:
also see here: pydata#2636. def need a generalized gzip handler (in io.common) |
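A generalized gzip handler of the kind proposed for io.common might look like this in modern Python 3 (function names are mine; the thread predates this stdlib layout):

```python
import gzip
import urllib.request

def decode_body(raw, content_encoding=None):
    """Decompress a response body if the server gzip-encoded it (sketch)."""
    if content_encoding == "gzip":
        raw = gzip.decompress(raw)
    return raw.decode("utf-8")

def read_maybe_gzipped(url):
    """Fetch a URL, transparently handling gzip Content-Encoding."""
    req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(req) as resp:
        return decode_body(resp.read(), resp.headers.get("Content-Encoding"))
```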
|
Maybe we should just use requests everywhere? I don't mind doing the porting for to it (or is there a reason not to?). e.g. it not being in the standard lib... |
|
can we include it as a sub-library? it's apache2 licensed, don't know if that's compatible |
|
Licenses should be compatible, I was thinking just have an optional dependency with a fall back (i.e. if they have it installed use it, else use urllib2), was going to make another thread, but I think I'll just put a pr together later in the week and see what people think. |
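The optional-dependency-with-fallback idea is a standard pattern; a sketch (function name mine, shown with the Python 3 stdlib rather than the urllib2 of this era):

```python
# Prefer requests if installed, otherwise fall back to the stdlib.
try:
    import requests

    def fetch(url):
        """Fetch a URL's body as text via requests."""
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        return resp.text

    BACKEND = "requests"
except ImportError:
    import urllib.request

    def fetch(url):
        """Fetch a URL's body as text via the standard library."""
        with urllib.request.urlopen(url) as resp:
            return resp.read().decode("utf-8")

    BACKEND = "urllib"
```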
|
I'd go for that....makes things cleaner |
|
So I've tracked at least part of the seg fault problem down to line 78, as @wesm surmised it looks like it is freeing some memory it shouldn't be. With this commented out it parses successfully. Still investigating but once solved @jreback should I just paste the patch here or submit another pull request? |
|
@Komnomnomnom go ahead and paste here |
|
Ok the following patch should make it safe to call

diff --git a/pandas/src/ujson/python/JSONtoObj.c b/pandas/src/ujson/python/JSONtoObj.c
index 1db7586..160c30f 100644
--- a/pandas/src/ujson/python/JSONtoObj.c
+++ b/pandas/src/ujson/python/JSONtoObj.c
@@ -10,6 +10,7 @@ typedef struct __PyObjectDecoder
JSONObjectDecoder dec;
void* npyarr; // Numpy context buffer
+ void* npyarr_addr; // Ref to npyarr ptr to track DECREF calls
npy_intp curdim; // Current array dimension
PyArray_Descr* dtype;
@@ -67,9 +68,7 @@ void Npy_releaseContext(NpyArrContext* npyarr)
}
if (npyarr->dec)
{
- // Don't set to null, used to make sure we don't Py_DECREF npyarr
- // in releaseObject
- // npyarr->dec->npyarr = NULL;
+ npyarr->dec->npyarr = NULL;
npyarr->dec->curdim = 0;
}
Py_XDECREF(npyarr->labels[0]);
@@ -88,6 +87,7 @@ JSOBJ Object_npyNewArray(void* _decoder)
{
// start of array - initialise the context buffer
npyarr = decoder->npyarr = PyObject_Malloc(sizeof(NpyArrContext));
+ decoder->npyarr_addr = npyarr;
if (!npyarr)
{
@@ -515,7 +515,7 @@ JSOBJ Object_newDouble(double value)
static void Object_releaseObject(JSOBJ obj, void* _decoder)
{
PyObjectDecoder* decoder = (PyObjectDecoder*) _decoder;
- if (obj != decoder->npyarr)
+ if (obj != decoder->npyarr_addr)
{
Py_XDECREF( ((PyObject *)obj));
}
@@ -555,6 +555,7 @@ PyObject* JSONToObj(PyObject* self, PyObject *args, PyObject *kwargs)
pyDecoder.dec = dec;
pyDecoder.curdim = 0;
pyDecoder.npyarr = NULL;
+ pyDecoder.npyarr_addr = NULL;
decoder = (JSONObjectDecoder*) &pyDecoder;
@@ -609,6 +610,7 @@ PyObject* JSONToObj(PyObject* self, PyObject *args, PyObject *kwargs)
if (PyErr_Occurred())
{
+ Npy_releaseContext(pyDecoder.npyarr);
return NULL;
} |
wesm
and others
added some commits
May 12, 2013
|
patch applied.....looking good now |
|
@jreback Something like this for requests: hayd/pandas@dbd968b |
|
@wesm this is mergable....any objections? |
|
Looks good to me, bombs away |
|
3.2.1..... |
jreback
added a commit
that referenced
this pull request
Jun 11, 2013
|
|
jreback |
a7f37d4
|
jreback
merged commit a7f37d4
into pandas-dev:master
Jun 11, 2013
jreback
referenced
this pull request
Jun 11, 2013
Closed
ENH: Add JSON export option for DataFrame #631 #1226
|
Awesome. I'll see about merging in upstream changes. Will send thru a pull request soonish. |
|
oh...you have additional dependencies on this? |
This was referenced Jun 11, 2013
|
ok...sure... |
|
thanks all for making this happen, especially to @Komnomnomnom for authoring this code in the first place =) |
jreback commented Jun 7, 2013
This is @wesm PR #3583 with this:
It builds now, and passes travis on py2 and py3; had 2 issues:
Converted to new io API: to_json/read_json
Docs added