ENH: add ujson support in pandas.io.json #3804

Merged
merged 10 commits on Jun 11, 2013

Conversation

6 participants
Contributor

jreback commented Jun 7, 2013

This is @wesm's PR #3583 with the following changes:

It builds now and passes travis on py2 and py3; there were 2 issues:

  • clean was erasing the *.c files from ujson
  • the module import didn't work because it was using the original init function

Converted to new io API: to_json / read_json

Docs added
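For reference, a minimal round-trip through the new API might look like this (a sketch against a later pandas, where read_json prefers a buffer over a literal string; the frame is hypothetical):

```python
from io import StringIO

import pandas as pd

# Hypothetical frame to round-trip through the new io API.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

json_str = df.to_json()                 # serialize to a JSON string
df2 = pd.read_json(StringIO(json_str))  # parse it straight back
```

With the default orient the round-tripped frame compares equal to the original.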

Contributor

hayd commented Jun 7, 2013

yay!

Contributor

hayd commented Jun 8, 2013

This is pretty awesome.

One thing I think is worth being explicit about in the docs (am I right in saying this?): it only works with valid JSON.

Contributor

jreback commented Jun 8, 2013

the json routines read/write from strings
this is unlike any of the other io routines that pandas has
which all take a path_or_buf

is this typical of dealing with JSON data?

should we have a kw to do this? always do it?

Contributor

hayd commented Jun 8, 2013

@jreback That is an excellent point, this should work like all the other read_* functions. I don't think it's necessarily typical to always have the string, but at least it makes it clear that read_json takes the entire string at once rather than in chunks.

(The first thing I did was open a json file and try f.readlines(), then f.read().)

It'd certainly be a useful feature if we could go pd.from_json(data_url), and perhaps this would be a fairly standard use case.

We could either:

  1. Have a string kwarg; filepath_or_buffer is the first argument (this would be my preference).
  2. Check if it's a filepath_or_buffer; if not, treat it as a json string (seems like a can of worms)

(Also, to clarify previous point, from_json only reads valid json :) )

Contributor

jreback commented Jun 8, 2013

do u have a URL that yields JSON?

Contributor

hayd commented Jun 8, 2013

Contributor

jreback commented Jun 8, 2013

parsed first try!

(Pdb) url_table
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 0 to 99
Data columns (total 19 columns):
assignee        6  non-null values
body            100  non-null values
closed_at       0  non-null values
comments        100  non-null values
comments_url    100  non-null values
created_at      100  non-null values
events_url      100  non-null values
html_url        100  non-null values
id              100  non-null values
labels          100  non-null values
labels_url      100  non-null values
milestone       75  non-null values
number          100  non-null values
pull_request    100  non-null values
state           100  non-null values
title           100  non-null values
updated_at      100  non-null values
url             100  non-null values
user            100  non-null values
dtypes: int64(3), object(16)
Contributor

hayd commented Jun 8, 2013

The one thing that trips it up is dates (you just have to to_datetime afterwards), but that can be left for another day.

Whoop! :)

Contributor

jreback commented Jun 8, 2013

yeh....we'll see how this goes....in 0.12 we can add an infer_types directive, kind of like read_html

Member

cpcloud commented Jun 8, 2013

i wonder if there are any other similar libraries or systems that have this much io functionality in a single package...

Contributor

hayd commented Jun 8, 2013

(Does infer_types work for unix timestamps? ...to get the roundtrip working. Anyway....)

Member

cpcloud commented Jun 8, 2013

doubtful since those are just integers...but i haven't tested

Member

cpcloud commented Jun 8, 2013

i tried

date +"%s" | python -c 'import sys; from dateutil.parser import parse; parse(sys.stdin.read())'

that doesn't work so i'm going to say no it won't work.

Contributor

hayd commented Jun 8, 2013

You can do pd.to_datetime on the column after reading.

Contributor

jreback commented Jun 8, 2013

yes I know....we have an open issue #3540 to make a better API for this, but look at #2015

so if you know they are epoch timestamps (e.g. passing in as an option maybe), then it's easy, we can convert them

Contributor

jreback commented Jun 8, 2013

Timestamp accepts a nanosecond-based epoch timestamp (i.e. nanoseconds since 1970); epoch timestamps are in seconds since 1970, so just multiply by int(1e9) and it will work.....

but there is an issue because sometimes they are not in seconds...so have to disambiguate
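To illustrate the scaling (my toy value, not from the PR): Timestamp wants nanoseconds, so a seconds-based epoch needs the int(1e9) multiplier, or equivalently unit='s' in to_datetime:

```python
import pandas as pd

epoch_s = 1370697759                     # hypothetical seconds-since-1970 value
ts = pd.Timestamp(epoch_s * int(1e9))    # Timestamp interprets ints as nanoseconds
ts2 = pd.to_datetime(epoch_s, unit="s")  # same thing, letting pandas do the scaling
```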

Contributor

hayd commented Jun 8, 2013

Just saying as to_json exports timestamps to unix time.

Member

cpcloud commented Jun 8, 2013

oh yes Timestamp works...hm maybe should add to read_html...i doubt people are using html tables to store unix timestamps but who the hell knows? maybe i'll wait until the api is sorted out

Contributor

jreback commented Jun 8, 2013

could add a parse_dates arg that takes a list of fields to try to convert?

Contributor

jreback commented Jun 8, 2013

read_html is much tougher because people are just reading it in ...and there is no standard...@hayd is right...since epoch timestamps are a standard, we could even try to convert an integer column (if the values are in range)?

Contributor

hayd commented Jun 8, 2013

(obviously far too reckless to just .applymap(pd.to_datetime) lol)

Contributor

jreback commented Jun 8, 2013

what's a quick way to fix inconsistent spaces/tabs....something got screwed up...

Member

cpcloud commented Jun 8, 2013

M-x untabify

Member

cpcloud commented Jun 8, 2013

on a region i think prolly whole file works too

Contributor

jreback commented Jun 8, 2013

@hayd actually .apply(pd.to_datetime) will not change the column if all conversions fail, so it is safe, sort of

try out: convert_objects(convert_dates='coerce')
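convert_objects was later removed from pandas; the same coercion survives as the errors='coerce' option of pd.to_datetime (a sketch with my toy data, not the PR's code):

```python
import pandas as pd

s = pd.Series(["2013-06-08", "not a date"])
# 'coerce' turns unparseable entries into NaT instead of raising
out = pd.to_datetime(s, errors="coerce")
```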

Contributor

hayd commented Jun 8, 2013

Quite slow though?

Member

cpcloud commented Jun 8, 2013

convert_objects is ok speed-wise; it operates on blocks using cython functions so it's gotta be faster than lambdas :)

Member

cpcloud commented Jun 8, 2013

@jreback correct me if i'm wrong here...

Contributor

hayd commented Jun 8, 2013

Well:

In [44]: %timeit with open('issues.json', 'r') as f: s = pd.read_json(f.read())
100 loops, best of 3: 8.83 ms per loop

In [45]: %timeit s.convert_objects(convert_dates='coerce')
1 loops, best of 3: 520 ms per loop
Member

cpcloud commented Jun 8, 2013

i stand corrected...

Contributor

jreback commented Jun 8, 2013

but you actually wouldn't do that, you would just take the int columns...so it should be a bit faster...(as they are already parsed to int)

Contributor

hayd commented Jun 8, 2013

Ok, so if we make it a unix time and roundtrip:

In [11]: df['created_at'] = pd.to_datetime(df['created_at'])

In [12]: df = pd.read_json(df.to_json())

In [13]: df.created_at.iloc[0]
Out[13]: 1370697759000000000

We could parse after we've created it pretty quickly like this:

In [14]: int_cols = df.dtypes[df.dtypes == np.int64].index.tolist()

In [15]: int_cols  # maybe by default try to do this
Out[15]: [u'comments', u'created_at', u'id', u'number']

parse_dates = []  # additional columns to convert

Then this is not toooo slow:

In [21]: %timeit for col in int_cols + parse_dates: df[col] = pd.to_datetime(df[col])
1000 loops, best of 3: 1.12 ms per loop

?

In fact, just doing this over all columns isn't too slow:

In [31]: %timeit for col in df.columns: df[col] = pd.to_datetime(df[col])
100 loops, best of 3: 9.05 ms per loop
Contributor

hayd commented Jun 8, 2013

Ah, int_cols = df.dtypes[df.dtypes == np.int64].index.tolist() is shockingly slow; there's probably a better way to do that.
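As an aside, a later pandas grew select_dtypes, which does this lookup directly (a sketch with a hypothetical frame, not what the PR used):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5], "c": ["x", "y"]})
# select columns by dtype instead of filtering df.dtypes by hand
int_cols = df.select_dtypes(include="integer").columns.tolist()
```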

Contributor

hayd commented Jun 8, 2013

Why doesn't to_json just export ISO 8601 ?

Contributor

hayd commented Jun 8, 2013

Yeah, let's not use any of the methods I mentioned. They're not robust at all.

I'm not sure how you can check whether a unix time is in a certain range (since you may actually have a timestamp from 1970-01-01 00:00:00.000000001).

Contributor

jreback commented Jun 8, 2013

try: df.blocks['int64'].columns

Contributor

jreback commented Jun 8, 2013

I think the most robust is to accept parse_dates; the problem is, say you have a column of all 1s; that is technically a valid date
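The false-positive risk is easy to demonstrate (my toy data): to_datetime happily reads plain integers as nanosecond offsets from the epoch:

```python
import pandas as pd

ones = pd.Series([1, 1, 1])
# integers are interpreted as nanoseconds since 1970-01-01
converted = pd.to_datetime(ones)
```

Every value becomes a "valid" 1970 timestamp, which is why blind conversion is dangerous.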

Contributor

jreback commented Jun 8, 2013

ok...updated to read from file/stringio/url and write to file (or stringio if it's None)

we don't actually write to a string in any other to_xxx, but I think if you really want it you can pass None
e.g.

to_json(None) returns a StringIO object

alternatively I could make the path optional

but I have to have read_json take a StringIO, otherwise a 'string' is interpreted as a filename/url
(I think for safety you have to do this)?

or should I just try to open any string as a file??? and if we can't then it's a JSON string?

Contributor

hayd commented Jun 8, 2013

I was kind of thinking (if we wanted to have option to pass a string):

def read_json(path=None, string=None,...):
    if path is None and isinstance(string, basestring):
        return the json from the string like it was before
    else:
        do what it's doing now

So you'd use it like pd.read_json(string='{"0":{"0":1,"1":3},"1":{"0":2,"1":4}}'). :s

Contributor

jreback commented Jun 8, 2013

ok...done (used the arg json=).....see the docs and tell me if it makes sense

Contributor

hayd commented Jun 8, 2013

This is looking good!

Contributor

jreback commented Jun 8, 2013

should we just scrap the json arg?
I can just see if it's file-like?

I can test whether the file exists; if so then it's a file

otherwise I see if it has a read attribute, so it's like stringio

otherwise it's a string
I think that's safe?

Contributor

hayd commented Jun 8, 2013

It could be safe, but then the first arg is not the standard filepath_or_buffer; I like it as an argument. That saves worrying about whether it's hypothetically safe or not, though I think it's safe too. Not sure either way.

I think I prefer string over json though, more descriptive.

Contributor

jreback commented Jun 8, 2013

ok will change json to string
why at the end of the kw list?
also I might as well just return a regular string from to_json now
rather than stringio

Contributor

hayd commented Jun 8, 2013

Well, I was thinking you could then do read_json(data_url, 'index'), not that doing such a thing is necessarily a good idea... but

Agreed about to_json() being a string.

Contributor

jreback commented Jun 8, 2013

ok

I guess it comes down to: do people typically use json as strings, or write them to files?
we could make the first argument the string and have a path= argument instead for path/URL/buffer?

Contributor

hayd commented Jun 8, 2013

I did feel pretty strongly filepath_or_buffer should be first argument, my guess is files/urls will be the most common use case (but I've no evidence to support that). Maybe you're right and read_json(string) should also just work...

First argument could be json which is either a filepath_or_buffer (checked first), else a valid json string.

Maybe if we wanted more control/"safety" we could pass a boolean (string? defaults to None) which if True doesn't try for filepath etc, if False doesn't try for string, and if None tries both.

Perhaps I am just overcomplicating/overthinking it. Sorry.

Contributor

hayd commented Jun 8, 2013

Just to roll back to the dates munging thing.

int_cols = df.blocks['int64'].columns
for col in int_cols: df[col] = pd.to_datetime(df[col])

although fast, it converts everything to dates, even columns of low numbers.

It's not clear how we could "guess" whether columns were dates or not... but what about a hack like this, where we try to apply it to columns which are very likely to be dates. For example, I think these would cover a large number of cases (and we can tweak it):

%timeit for col in [col for col in df.columns if col.endswith('_at')]: df[col] = pd.to_datetime(df[col])
1000 loops, best of 3: 719 us per loop

%timeit for col in [col for col in df.columns if col.endswith('_at') or col.endswith('_time')]: df[col] = pd.to_datetime(df[col])
1000 loops, best of 3: 732 us per loop

%timeit for col in [col for col in df.columns if col.endswith('_at') or col.endswith('_time') or col.lower() == 'modified' or col.lower() == 'date']: df[col] = pd.to_datetime(df[col])
1000 loops, best of 3: 745 us per loop

And the user can add some additional ones to check?

date_cols = ['modified']  # additional columns to parse as dates (just like with read_csv's na_values).
%timeit for col in [col for col in df.columns if col.endswith('_at') or col.endswith('_time') or col.lower() == 'modified' or col.lower() == 'date' or col in date_cols]: df[col] = pd.to_datetime(df[col])
1000 loops, best of 3: 752 us per loop

keep_default_dates = False
%timeit for col in [col for col in df.columns if (keep_default_dates and (col.endswith('_at') or col.endswith('_time') or col.lower() == 'modified' or col.lower() == 'date')) or col in date_cols]: df[col] = pd.to_datetime(df[col])

Thoughts?

Contributor

hayd commented Jun 8, 2013

So it's like:

def read_json(json, orient, typ, dtype, numpy, parse_dates=True, date_cols=None, keep_default_dates=True):

    # json is filepath_or_buffer or valid_json_string like we were saying
    # do what you are doing atm

    if isinstance(obj, DataFrame) and parse_dates:  # not sure what to do if a Series
        if date_cols is None:
            date_cols = []
        for col in [col for col in obj.columns
                          if (keep_default_dates and (col.endswith('_at') or
                                                       col.endswith('_time') or
                                                       col.lower() == 'modified' or
                                                       col.lower() == 'date' or
                                                       col.lower() == 'datetime'))  # and we can add some more in here
                               or col in date_cols]:
            obj[col] = pd.to_datetime(obj[col])

:s

Contributor

jreback commented Jun 9, 2013

@hayd ok..fixed up almost like you suggested, I dropped date_cols, instead allowing True/False/a list for parse_dates which is basically the same thing. Also I actually try to parse all columns that are int/float/object, BUT, I apply a heuristic to the float/int ones to avoid false positives, and of course only those columns that are in the keep_default_dates spec are considered

parse_dates=False is now the default, in case we get too many false positives (or something is wrong)

I also try to convert the index on a Series, and either the index or columns (depending on the orient) for a Frame, if it looks like a datelike series. Should I provide an option to turn this on/off? (again in case we are getting false positives?)

Contributor

jreback commented Jun 9, 2013

@hayd maybe should add a date_format kw in to_json
with values like:

epoch (default)
iso8601
?

Contributor

hayd commented Jun 9, 2013

Having a kw for the way datetimes are output is an awesome idea! (what is the reason iso8601 is not the default?)

It's probably a good idea to have parse_dates=False as default, it's pretty "experimental". :)

Oooooh, good work:

In [1]: df = pd.read_json('https://api.github.com/repos/pydata/pandas/issues?per_page=3', parse_dates=True)

In [2]: df.created_at[0]
Out[2]: Timestamp('2013-06-09 04:38:21', tz=None)

Not sure I understand the series/column bit (I have a strong suspicion I'm being thick), are you suggesting some of these should work?

In [3]: pd.read_json('{"date": "2013-06-09T04:38:21Z"}', typ='series', parse_dates=True)
Out[3]:
date    2013-06-09T04:38:21Z  # not a Timestamp
dtype: object

In [4]: pd.read_json('{"date": "2013-06-09T04:38:21Z"}', parse_dates=True, typ='series', orient='records')
Out[4]:
date    2013-06-09T04:38:21Z  # not a Timestamp
dtype: object

In [5]: pd.read_json('{"2013-06-09T04:38:21Z": 7}', typ='series', parse_dates=True)
Out[5]:
2013-06-09T04:38:21Z  # not a Timestamp
dtype: int64

In [6]: pd.read_json('{"2013-06-09T04:38:21Z": [7]}', parse_dates=True)
Out[6]:
   2013-06-09T04:38:21Z  # not a Timestamp
0                     7

?

Contributor

jreback commented Jun 9, 2013

In [15]: DataFrame(randn(2,2),index=date_range('20130101',periods=2)).to_json(date_format='iso')
Out[15]: '{"0":{"2013-01-01T00:00:00":-1.3240571997,"2013-01-02T00:00:00":-0.6429140007},"1":{"2013-01-01T00:00:00":0.6358852931,"2013-01-02T00:00:00":-1.0422148029}}'

And your examples from above (I turned on date parsing for series and its index)

In [9]: pd.read_json('{"date": "2013-06-09T04:38:21Z"}', typ='series', parse_dates=True)
Out[9]: 
date   2013-06-09 04:38:21
dtype: datetime64[ns]

In [10]: pd.read_json('{"date": "2013-06-09T04:38:21Z"}', parse_dates=True, typ='series', orient='records')
Out[10]: 
date   2013-06-09 04:38:21
dtype: datetime64[ns]

In [11]: pd.read_json('{"2013-06-09T04:38:21Z": 7}', typ='series', parse_dates=True)
Out[11]: 
2013-06-09 04:38:21    7
dtype: int64

This is interpreted as a frame (I don't try to parse column labels by default in a frame, I guess I could)

In [12]: pd.read_json('{"2013-06-09T04:38:21Z": [7]}', parse_dates=True)
Out[12]: 
   2013-06-09T04:38:21Z
0                     7
Contributor

hayd commented Jun 9, 2013

Wowza. This is looking good.

(Probably the only dodgy bit is my "guess at which are date columns" hack, but you've made it easy to add to that. Going to look at some more apis and see how this does.)

Contributor

jreback commented Jun 9, 2013

@hayd a couple more examples from the web would be great

Contributor

jreback commented Jun 9, 2013

@wesm ?

Contributor

hayd commented Jun 9, 2013

Date col name coverage is pretty good I think, maybe this improves it slightly:

(col.lower().endswith('_at') or  # quite a few words end in at
 col.endswith('At') or
 col.lower().endswith('time') or  # was `_time` before, perhaps 'time' in col.lower() ?
 col.lower().endswith('date') or  # perhaps 'date' in col.lower() ?
 col.lower() in ['modified', 'created', 'updated'])
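The name checks above can be wrapped up as a small function (a sketch of the heuristic, exercised on a hypothetical frame with an epoch-seconds column):

```python
import pandas as pd

def looks_like_date_col(col):
    # name-based heuristic from the discussion: common date-ish column names
    c = col.lower()
    return (c.endswith("_at") or c.endswith("time") or c.endswith("date")
            or c in ("modified", "created", "updated"))

df = pd.DataFrame({"created_at": [1370697759], "comments": [100]})
date_cols = [c for c in df.columns if looks_like_date_col(c)]
for c in date_cols:
    df[c] = pd.to_datetime(df[c], unit="s")  # assuming epoch seconds here
```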

Everything has Just Worked™ so far. :)

Thoughts on default to_json date format ?

Contributor

jreback commented Jun 9, 2013

I'll add those in

default format I think for compat should be epoch
unless you see differently

Contributor

hayd commented Jun 9, 2013

I agree with epoch.

Hmmm here's a failing one (the claim is it's valid: http://jsonlint.com/ ):

pd.read_json('https://api.stackexchange.com/2.1/search?page=1&pagesize=10&order=desc&sort=activity&tagged=pandas&site=stackoverflow', parse_dates=True)
# also can't read it with ordinary json.load so... I guess something is fishy with it
Contributor

hayd commented Jun 9, 2013

I'm going to throw a curve ball here (a can of worms for the future) about deeply nested json, e.g. jsend etc., where the json is neatly wrapped up in silly tags, e.g. json['data'], or perhaps you really want to make a DataFrame from json['data']['posts'].

Actually stuff is also going weird with those files (the ones on their site are actually invalid). When I do this I get a segmentation fault (!)... every time:

In [1]: s = r'''{
    "status": "success",
    "data": {
        "posts": [
            {
                "id": 1,
                "title": "A blog post",
                "body": "Some useful content"
            },
            {
                "id": 2,
                "title": "Another blog post",
                "body": "More content"
            }
        ]
    }
}'''

In [2]: import pandas as pd

In [3]: pd.read_json(s)
[1]    23137 bus error  ipython

or

[1]    23301 segmentation fault  ipython
Contributor

jreback commented Jun 9, 2013

things like this are going to break....I got the same behavior you did (no parse on the first, seg fault on the 2nd)....Not really sure why; the error messages aren't great

Contributor

hayd commented Jun 9, 2013

Could it be that it doesn't make sense to parse it to a DataFrame?

Contributor

jreback commented Jun 9, 2013

This is the top link, which I grabbed, opened in notepad and saved to a string; it DOES parse, sort of,
and it's very nested.....what, if anything, to do here?

In [19]: x = pd.read_json(data)

In [20]: x
Out[20]: 
  has_more                                              items  quota_max  quota_remaining
0     True  {u'view_count': 28, u'title': u'Extracting XML...        300              299
1     True  {u'view_count': 42, u'title': u'Missing data i...        300              299
2     True  {u'view_count': 17, u'title': u'pandas timeser...        300              299
3     True  {u'view_count': 35, u'title': u'How do I creat...        300              299
4     True  {u'closed_date': 1370807778, u'view_count': 34...        300              299
5     True  {u'view_count': 28, u'title': u'Python Pandas ...        300              299
6     True  {u'view_count': 27, u'title': u'Using pandas a...        300              299
7     True  {u'view_count': 31, u'title': u'Merging multip...        300              299
8     True  {u'view_count': 31, u'title': u'Python - Creat...        300              299
9     True  {u'view_count': 14, u'title': u'pandas resampl...        300              299
In [21]: x = pd.read_json(data)['items'].iloc[0]

In [22]: x
Out[22]: 
{u'accepted_answer_id': 16993660,
 u'answer_count': 1,
 u'creation_date': 1370633671,
 u'is_answered': True,
 u'last_activity_date': 1370806974,
 u'last_edit_date': 1370800493,
 u'link': u'http://stackoverflow.com/questions/16991691/extracting-xml-into-data-frame-with-parent-attribute-as-column-title',
 u'owner': {u'accept_rate': 100,
  u'display_name': u'Jessi',
  u'link': u'http://stackoverflow.com/users/2437407/jessi',
  u'profile_image': u'http://www.gravatar.com/avatar/6c77f4d2d81be0774483548a52ade9ef?d=identicon&r=PG',
  u'reputation': 32,
  u'user_id': 2437407,
  u'user_type': u'registered'},
 u'question_id': 16991691,
 u'score': 1,
 u'tags': [u'python', u'pandas', u'lxml'],
 u'title': u'Extracting XML into data frame with parent attribute as column title',
 u'view_count': 28}

Contributor

hayd commented Jun 9, 2013

See what I mean about some of these being wrapped in junk, the meat is in items.

Presumably it's not easy to pass in 'items' and we return dataframe for json['items'], or in the one above pass in ('data', 'posts') and get back json['data']['posts'] ?
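One way to unwrap such payloads by hand before pandas sees them, using the jsend-style example from earlier in the thread (a sketch; json_normalize is a helper pandas grew much later than this PR):

```python
import json

import pandas as pd

s = '''{"status": "success",
        "data": {"posts": [
            {"id": 1, "title": "A blog post", "body": "Some useful content"},
            {"id": 2, "title": "Another blog post", "body": "More content"}]}}'''

payload = json.loads(s)
# unwrap the nested records manually before building the frame
posts = pd.DataFrame(payload["data"]["posts"])
# or let the later json_normalize helper walk the path for us
posts2 = pd.json_normalize(payload, record_path=["data", "posts"])
```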

Contributor

hayd commented Jun 9, 2013

I wonder if it is the same problem and the stackexchange api is wrapping it in 'data'? i.e. ('data', 'items')

Owner

wesm commented Jun 9, 2013

You guys want me to merge this for 0.11.1?

Owner

wesm commented Jun 9, 2013

Nice stuff btw

Contributor

hayd commented Jun 9, 2013

(I think it'd be great to have in 0.11.1 :) )

Contributor

jreback commented Jun 9, 2013

+1 on 0.11.1.....at the very least to put it out there for people to 'try'...

@wesm would like to investigate why it core dumps (in the case @hayd put up)...

choking somewhere ......

Contributor

nipunbatra commented Jun 10, 2013

Looks great. Opens up the opportunity to read from databases like MongoDB which store data as JSON and get the data into a DF (which can of course be very unstructured) and vice versa. See for instance [this](http://api.mongodb.org/python/current/tutorial.html#querying-for-more-than-one-document).

Owner

wesm commented Jun 10, 2013

Well, here's the valgrind output

In [9]: pd.read_json(s)
==26273== Invalid read of size 4
==26273==    at 0x4EBED84: PyObject_Free (obmalloc.c:969)
==26273==    by 0x4E985B8: EnvironmentError_dealloc (exceptions.c:641)
==26273==    by 0x4EB66DE: PyDict_DelItem (dictobject.c:854)
==26273==    by 0x4EB6C96: PyDict_DelItemString (dictobject.c:2452)
==26273==    by 0x4F192FB: PyEval_EvalFrameEx (ceval.c:3442)
==26273==    by 0x4F1EEC7: PyEval_EvalFrameEx (ceval.c:4107)
==26273==    by 0x4F1FC88: PyEval_EvalCodeEx (ceval.c:3253)
==26273==    by 0x4F1DA34: PyEval_EvalFrameEx (ceval.c:4117)
==26273==    by 0x4F1FC88: PyEval_EvalCodeEx (ceval.c:3253)
==26273==    by 0x4F1FD01: PyEval_EvalCode (ceval.c:667)
==26273==    by 0x4F1EF89: PyEval_EvalFrameEx (ceval.c:4718)
==26273==    by 0x4F1FC88: PyEval_EvalCodeEx (ceval.c:3253)
==26273==  Address 0x5f30020 is not stack'd, malloc'd or (recently) free'd
==26273== 
==26273== Invalid read of size 8
==26273==    at 0x4EB8507: dict_subscript (dictobject.c:1201)
==26273==    by 0x4F1A95F: PyEval_EvalFrameEx (ceval.c:1391)
==26273==    by 0x4F1FC88: PyEval_EvalCodeEx (ceval.c:3253)
==26273==    by 0x4F1DA34: PyEval_EvalFrameEx (ceval.c:4117)
==26273==    by 0x4F1FC88: PyEval_EvalCodeEx (ceval.c:3253)
==26273==    by 0x4EA3A51: function_call (funcobject.c:526)
==26273==    by 0x4E75B77: PyObject_Call (abstract.c:2529)
==26273==    by 0x4E8622E: instancemethod_call (classobject.c:2602)
==26273==    by 0x4E75B77: PyObject_Call (abstract.c:2529)
==26273==    by 0x4ED799B: slot_tp_init (typeobject.c:5688)
==26273==    by 0x4ED2777: type_call (typeobject.c:739)
==26273==    by 0x4E75B77: PyObject_Call (abstract.c:2529)
==26273==  Address 0x5e04000 is 32 bytes inside a block of size 280 free'd
==26273==    at 0x4C29097: realloc (vg_replace_malloc.c:525)
==26273==    by 0x4EC9257: _PyString_Resize (stringobject.c:3908)
==26273==    by 0x4ECAA08: PyString_FromFormatV (stringobject.c:394)
==26273==    by 0x4F2DC1F: PyErr_Format (errors.c:550)
==26273==    by 0x4E7A563: PySequence_Size (abstract.c:17)
==26273==    by 0xE9E836A: PyArray_IntpFromSequence (conversion_utils.c:870)
==26273==    by 0xE9E8556: PyArray_IntpConverter (conversion_utils.c:120)
==26273==    by 0x4F30915: convertsimple (getargs.c:1253)
==26273==    by 0x4F31A10: vgetargskeywords (getargs.c:514)
==26273==    by 0x4F3203F: _PyArg_ParseTupleAndKeywords_SizeT (getargs.c:1464)
==26273==    by 0xEA326E8: array_empty (multiarraymodule.c:1737)
==26273==    by 0x4E75B77: PyObject_Call (abstract.c:2529)
==26273== 
==26273== Invalid read of size 8
==26273==    at 0x4EABE63: PyList_Append (listobject.c:280)
==26273==    by 0x4F1A7D7: PyEval_EvalFrameEx (ceval.c:1451)
==26273==    by 0x4F1FC88: PyEval_EvalCodeEx (ceval.c:3253)
==26273==    by 0x4F1DA34: PyEval_EvalFrameEx (ceval.c:4117)
==26273==    by 0x4F1FC88: PyEval_EvalCodeEx (ceval.c:3253)
==26273==    by 0x4EA3A51: function_call (funcobject.c:526)
==26273==    by 0x4E75B77: PyObject_Call (abstract.c:2529)
==26273==    by 0x4E8622E: instancemethod_call (classobject.c:2602)
==26273==    by 0x4E75B77: PyObject_Call (abstract.c:2529)
==26273==    by 0x4ED799B: slot_tp_init (typeobject.c:5688)
==26273==    by 0x4ED2777: type_call (typeobject.c:739)
==26273==    by 0x4E75B77: PyObject_Call (abstract.c:2529)
==26273==  Address 0x5e04000 is 32 bytes inside a block of size 280 free'd
==26273==    at 0x4C29097: realloc (vg_replace_malloc.c:525)
==26273==    by 0x4EC9257: _PyString_Resize (stringobject.c:3908)
==26273==    by 0x4ECAA08: PyString_FromFormatV (stringobject.c:394)
==26273==    by 0x4F2DC1F: PyErr_Format (errors.c:550)
==26273==    by 0x4E7A563: PySequence_Size (abstract.c:17)
==26273==    by 0xE9E836A: PyArray_IntpFromSequence (conversion_utils.c:870)
==26273==    by 0xE9E8556: PyArray_IntpConverter (conversion_utils.c:120)
==26273==    by 0x4F30915: convertsimple (getargs.c:1253)
==26273==    by 0x4F31A10: vgetargskeywords (getargs.c:514)
==26273==    by 0x4F3203F: _PyArg_ParseTupleAndKeywords_SizeT (getargs.c:1464)
==26273==    by 0xEA326E8: array_empty (multiarraymodule.c:1737)
==26273==    by 0x4E75B77: PyObject_Call (abstract.c:2529)
==26273== 
==26273== Invalid read of size 8
==26273==    at 0x4F1A7DC: PyEval_EvalFrameEx (ceval.c:1452)
==26273==    by 0x4F1FC88: PyEval_EvalCodeEx (ceval.c:3253)
==26273==    by 0x4F1DA34: PyEval_EvalFrameEx (ceval.c:4117)
==26273==    by 0x4F1FC88: PyEval_EvalCodeEx (ceval.c:3253)
==26273==    by 0x4EA3A51: function_call (funcobject.c:526)
==26273==    by 0x4E75B77: PyObject_Call (abstract.c:2529)
==26273==    by 0x4E8622E: instancemethod_call (classobject.c:2602)
==26273==    by 0x4E75B77: PyObject_Call (abstract.c:2529)
==26273==    by 0x4ED799B: slot_tp_init (typeobject.c:5688)
==26273==    by 0x4ED2777: type_call (typeobject.c:739)
==26273==    by 0x4E75B77: PyObject_Call (abstract.c:2529)
==26273==    by 0x4F19021: PyEval_EvalFrameEx (ceval.c:4239)
==26273==  Address 0x5e04000 is 32 bytes inside a block of size 280 free'd
==26273==    at 0x4C29097: realloc (vg_replace_malloc.c:525)
==26273==    by 0x4EC9257: _PyString_Resize (stringobject.c:3908)
==26273==    by 0x4ECAA08: PyString_FromFormatV (stringobject.c:394)
==26273==    by 0x4F2DC1F: PyErr_Format (errors.c:550)
==26273==    by 0x4E7A563: PySequence_Size (abstract.c:17)
==26273==    by 0xE9E836A: PyArray_IntpFromSequence (conversion_utils.c:870)
==26273==    by 0xE9E8556: PyArray_IntpConverter (conversion_utils.c:120)
==26273==    by 0x4F30915: convertsimple (getargs.c:1253)
==26273==    by 0x4F31A10: vgetargskeywords (getargs.c:514)
==26273==    by 0x4F3203F: _PyArg_ParseTupleAndKeywords_SizeT (getargs.c:1464)
==26273==    by 0xEA326E8: array_empty (multiarraymodule.c:1737)
==26273==    by 0x4E75B77: PyObject_Call (abstract.c:2529)
==26273== 
==26273== Invalid write of size 8
==26273==    at 0x4F1A7E6: PyEval_EvalFrameEx (ceval.c:1452)
==26273==    by 0x4F1FC88: PyEval_EvalCodeEx (ceval.c:3253)
==26273==    by 0x4F1DA34: PyEval_EvalFrameEx (ceval.c:4117)
==26273==    by 0x4F1FC88: PyEval_EvalCodeEx (ceval.c:3253)
==26273==    by 0x4EA3A51: function_call (funcobject.c:526)
==26273==    by 0x4E75B77: PyObject_Call (abstract.c:2529)
==26273==    by 0x4E8622E: instancemethod_call (classobject.c:2602)
==26273==    by 0x4E75B77: PyObject_Call (abstract.c:2529)
==26273==    by 0x4ED799B: slot_tp_init (typeobject.c:5688)
==26273==    by 0x4ED2777: type_call (typeobject.c:739)
==26273==    by 0x4E75B77: PyObject_Call (abstract.c:2529)
==26273==    by 0x4F19021: PyEval_EvalFrameEx (ceval.c:4239)
==26273==  Address 0x5e04000 is 32 bytes inside a block of size 280 free'd
==26273==    at 0x4C29097: realloc (vg_replace_malloc.c:525)
==26273==    by 0x4EC9257: _PyString_Resize (stringobject.c:3908)
==26273==    by 0x4ECAA08: PyString_FromFormatV (stringobject.c:394)
==26273==    by 0x4F2DC1F: PyErr_Format (errors.c:550)
==26273==    by 0x4E7A563: PySequence_Size (abstract.c:17)
==26273==    by 0xE9E836A: PyArray_IntpFromSequence (conversion_utils.c:870)
==26273==    by 0xE9E8556: PyArray_IntpConverter (conversion_utils.c:120)
==26273==    by 0x4F30915: convertsimple (getargs.c:1253)
==26273==    by 0x4F31A10: vgetargskeywords (getargs.c:514)
==26273==    by 0x4F3203F: _PyArg_ParseTupleAndKeywords_SizeT (getargs.c:1464)
==26273==    by 0xEA326E8: array_empty (multiarraymodule.c:1737)
==26273==    by 0x4E75B77: PyObject_Call (abstract.c:2529)
==26273== 
==26273== Invalid read of size 8
==26273==    at 0x4EA6ED9: listiter_next (listobject.c:2913)
==26273==    by 0x4F188F1: PyEval_EvalFrameEx (ceval.c:2497)
==26273==    by 0x4F1EEC7: PyEval_EvalFrameEx (ceval.c:4107)
==26273==    by 0x4F1FC88: PyEval_EvalCodeEx (ceval.c:3253)
==26273==    by 0x4F1DA34: PyEval_EvalFrameEx (ceval.c:4117)
==26273==    by 0x4F1FC88: PyEval_EvalCodeEx (ceval.c:3253)
==26273==    by 0x4F1DA34: PyEval_EvalFrameEx (ceval.c:4117)
==26273==    by 0x4F1FC88: PyEval_EvalCodeEx (ceval.c:3253)
==26273==    by 0x4EA3A51: function_call (funcobject.c:526)
==26273==    by 0x4E75B77: PyObject_Call (abstract.c:2529)
==26273==    by 0x4E8622E: instancemethod_call (classobject.c:2602)
==26273==    by 0x4E75B77: PyObject_Call (abstract.c:2529)
==26273==  Address 0x5e04000 is 32 bytes inside a block of size 280 free'd
==26273==    at 0x4C29097: realloc (vg_replace_malloc.c:525)
==26273==    by 0x4EC9257: _PyString_Resize (stringobject.c:3908)
==26273==    by 0x4ECAA08: PyString_FromFormatV (stringobject.c:394)
==26273==    by 0x4F2DC1F: PyErr_Format (errors.c:550)
==26273==    by 0x4E7A563: PySequence_Size (abstract.c:17)
==26273==    by 0xE9E836A: PyArray_IntpFromSequence (conversion_utils.c:870)
==26273==    by 0xE9E8556: PyArray_IntpConverter (conversion_utils.c:120)
==26273==    by 0x4F30915: convertsimple (getargs.c:1253)
==26273==    by 0x4F31A10: vgetargskeywords (getargs.c:514)
==26273==    by 0x4F3203F: _PyArg_ParseTupleAndKeywords_SizeT (getargs.c:1464)
==26273==    by 0xEA326E8: array_empty (multiarraymodule.c:1737)
==26273==    by 0x4E75B77: PyObject_Call (abstract.c:2529)
==26273== 
==26273== Invalid read of size 8
==26273==    at 0x4F196E3: PyEval_EvalFrameEx (ceval.c:1114)
==26273==    by 0x4F1EEC7: PyEval_EvalFrameEx (ceval.c:4107)
==26273==    by 0x4F1FC88: PyEval_EvalCodeEx (ceval.c:3253)
==26273==    by 0x4F1DA34: PyEval_EvalFrameEx (ceval.c:4117)
==26273==    by 0x4F1FC88: PyEval_EvalCodeEx (ceval.c:3253)
==26273==    by 0x4F1DA34: PyEval_EvalFrameEx (ceval.c:4117)
==26273==    by 0x4F1FC88: PyEval_EvalCodeEx (ceval.c:3253)
==26273==    by 0x4EA3A51: function_call (funcobject.c:526)
==26273==    by 0x4E75B77: PyObject_Call (abstract.c:2529)
==26273==    by 0x4E8622E: instancemethod_call (classobject.c:2602)
==26273==    by 0x4E75B77: PyObject_Call (abstract.c:2529)
==26273==    by 0x4ED799B: slot_tp_init (typeobject.c:5688)
==26273==  Address 0x5e04000 is 32 bytes inside a block of size 280 free'd
==26273==    at 0x4C29097: realloc (vg_replace_malloc.c:525)
==26273==    by 0x4EC9257: _PyString_Resize (stringobject.c:3908)
==26273==    by 0x4ECAA08: PyString_FromFormatV (stringobject.c:394)
==26273==    by 0x4F2DC1F: PyErr_Format (errors.c:550)
==26273==    by 0x4E7A563: PySequence_Size (abstract.c:17)
==26273==    by 0xE9E836A: PyArray_IntpFromSequence (conversion_utils.c:870)
==26273==    by 0xE9E8556: PyArray_IntpConverter (conversion_utils.c:120)
==26273==    by 0x4F30915: convertsimple (getargs.c:1253)
==26273==    by 0x4F31A10: vgetargskeywords (getargs.c:514)
==26273==    by 0x4F3203F: _PyArg_ParseTupleAndKeywords_SizeT (getargs.c:1464)
==26273==    by 0xEA326E8: array_empty (multiarraymodule.c:1737)
==26273==    by 0x4E75B77: PyObject_Call (abstract.c:2529)
==26273== 
==26273== Invalid read of size 8
==26273==    at 0x4E780F3: PyObject_IsInstance (abstract.c:2931)
==26273==    by 0x4F14D02: builtin_isinstance (bltinmodule.c:2452)
==26273==    by 0x4F1E378: PyEval_EvalFrameEx (ceval.c:4021)
==26273==    by 0x4F1EEC7: PyEval_EvalFrameEx (ceval.c:4107)
==26273==    by 0x4F1FC88: PyEval_EvalCodeEx (ceval.c:3253)
==26273==    by 0x4F1DA34: PyEval_EvalFrameEx (ceval.c:4117)
==26273==    by 0x4F1FC88: PyEval_EvalCodeEx (ceval.c:3253)
==26273==    by 0x4F1DA34: PyEval_EvalFrameEx (ceval.c:4117)
==26273==    by 0x4F1FC88: PyEval_EvalCodeEx (ceval.c:3253)
==26273==    by 0x4EA3A51: function_call (funcobject.c:526)
==26273==    by 0x4E75B77: PyObject_Call (abstract.c:2529)
==26273==    by 0x4E8622E: instancemethod_call (classobject.c:2602)
==26273==  Address 0x5e04008 is 40 bytes inside a block of size 280 free'd
==26273==    at 0x4C29097: realloc (vg_replace_malloc.c:525)
==26273==    by 0x4EC9257: _PyString_Resize (stringobject.c:3908)
==26273==    by 0x4ECAA08: PyString_FromFormatV (stringobject.c:394)
==26273==    by 0x4F2DC1F: PyErr_Format (errors.c:550)
==26273==    by 0x4E7A563: PySequence_Size (abstract.c:17)
==26273==    by 0xE9E836A: PyArray_IntpFromSequence (conversion_utils.c:870)
==26273==    by 0xE9E8556: PyArray_IntpConverter (conversion_utils.c:120)
==26273==    by 0x4F30915: convertsimple (getargs.c:1253)
==26273==    by 0x4F31A10: vgetargskeywords (getargs.c:514)
==26273==    by 0x4F3203F: _PyArg_ParseTupleAndKeywords_SizeT (getargs.c:1464)
==26273==    by 0xEA326E8: array_empty (multiarraymodule.c:1737)
==26273==    by 0x4E75B77: PyObject_Call (abstract.c:2529)
==26273== 
==26273== Invalid read of size 8
==26273==    at 0x4E77E3E: PyObject_CallFunctionObjArgs (abstract.c:2711)
==26273==    by 0x4E7818F: PyObject_IsInstance (abstract.c:2963)
==26273==    by 0x4F14D02: builtin_isinstance (bltinmodule.c:2452)
==26273==    by 0x4F1E378: PyEval_EvalFrameEx (ceval.c:4021)
==26273==    by 0x4F1EEC7: PyEval_EvalFrameEx (ceval.c:4107)
==26273==    by 0x4F1FC88: PyEval_EvalCodeEx (ceval.c:3253)
==26273==    by 0x4F1DA34: PyEval_EvalFrameEx (ceval.c:4117)
==26273==    by 0x4F1FC88: PyEval_EvalCodeEx (ceval.c:3253)
==26273==    by 0x4F1DA34: PyEval_EvalFrameEx (ceval.c:4117)
==26273==    by 0x4F1FC88: PyEval_EvalCodeEx (ceval.c:3253)
==26273==    by 0x4EA3A51: function_call (funcobject.c:526)
==26273==    by 0x4E75B77: PyObject_Call (abstract.c:2529)
==26273==  Address 0x5e04000 is 32 bytes inside a block of size 280 free'd
==26273==    at 0x4C29097: realloc (vg_replace_malloc.c:525)
==26273==    by 0x4EC9257: _PyString_Resize (stringobject.c:3908)
==26273==    by 0x4ECAA08: PyString_FromFormatV (stringobject.c:394)
==26273==    by 0x4F2DC1F: PyErr_Format (errors.c:550)
==26273==    by 0x4E7A563: PySequence_Size (abstract.c:17)
==26273==    by 0xE9E836A: PyArray_IntpFromSequence (conversion_utils.c:870)
==26273==    by 0xE9E8556: PyArray_IntpConverter (conversion_utils.c:120)
==26273==    by 0x4F30915: convertsimple (getargs.c:1253)
==26273==    by 0x4F31A10: vgetargskeywords (getargs.c:514)
==26273==    by 0x4F3203F: _PyArg_ParseTupleAndKeywords_SizeT (getargs.c:1464)
==26273==    by 0xEA326E8: array_empty (multiarraymodule.c:1737)
==26273==    by 0x4E75B77: PyObject_Call (abstract.c:2529)
==26273== 
==26273== Invalid read of size 8
==26273==    at 0x4E77294: recursive_isinstance (abstract.c:2890)
==26273==    by 0x4ED32E1: type___instancecheck__ (typeobject.c:591)
==26273==    by 0x4E75B77: PyObject_Call (abstract.c:2529)
==26273==    by 0x4E77E70: PyObject_CallFunctionObjArgs (abstract.c:2760)
==26273==    by 0x4E7818F: PyObject_IsInstance (abstract.c:2963)
==26273==    by 0x4F14D02: builtin_isinstance (bltinmodule.c:2452)
==26273==    by 0x4F1E378: PyEval_EvalFrameEx (ceval.c:4021)
==26273==    by 0x4F1EEC7: PyEval_EvalFrameEx (ceval.c:4107)
==26273==    by 0x4F1FC88: PyEval_EvalCodeEx (ceval.c:3253)
==26273==    by 0x4F1DA34: PyEval_EvalFrameEx (ceval.c:4117)
==26273==    by 0x4F1FC88: PyEval_EvalCodeEx (ceval.c:3253)
==26273==    by 0x4F1DA34: PyEval_EvalFrameEx (ceval.c:4117)
==26273==  Address 0x5e04008 is 40 bytes inside a block of size 280 free'd
==26273==    at 0x4C29097: realloc (vg_replace_malloc.c:525)
==26273==    by 0x4EC9257: _PyString_Resize (stringobject.c:3908)
==26273==    by 0x4ECAA08: PyString_FromFormatV (stringobject.c:394)
==26273==    by 0x4F2DC1F: PyErr_Format (errors.c:550)
==26273==    by 0x4E7A563: PySequence_Size (abstract.c:17)
==26273==    by 0xE9E836A: PyArray_IntpFromSequence (conversion_utils.c:870)
==26273==    by 0xE9E8556: PyArray_IntpConverter (conversion_utils.c:120)
==26273==    by 0x4F30915: convertsimple (getargs.c:1253)
==26273==    by 0x4F31A10: vgetargskeywords (getargs.c:514)
==26273==    by 0x4F3203F: _PyArg_ParseTupleAndKeywords_SizeT (getargs.c:1464)
==26273==    by 0xEA326E8: array_empty (multiarraymodule.c:1737)
==26273==    by 0x4E75B77: PyObject_Call (abstract.c:2529)
==26273== 
==26273== Invalid read of size 1
==26273==    at 0x4ED1880: PyType_IsSubtype (typeobject.c:1146)
==26273==    by 0x4E772A5: recursive_isinstance (abstract.c:2890)
==26273==    by 0x4ED32E1: type___instancecheck__ (typeobject.c:591)
==26273==    by 0x4E75B77: PyObject_Call (abstract.c:2529)
==26273==    by 0x4E77E70: PyObject_CallFunctionObjArgs (abstract.c:2760)
==26273==    by 0x4E7818F: PyObject_IsInstance (abstract.c:2963)
==26273==    by 0x4F14D02: builtin_isinstance (bltinmodule.c:2452)
==26273==    by 0x4F1E378: PyEval_EvalFrameEx (ceval.c:4021)
==26273==    by 0x4F1EEC7: PyEval_EvalFrameEx (ceval.c:4107)
==26273==    by 0x4F1FC88: PyEval_EvalCodeEx (ceval.c:3253)
==26273==    by 0x4F1DA34: PyEval_EvalFrameEx (ceval.c:4117)
==26273==    by 0x4F1FC88: PyEval_EvalCodeEx (ceval.c:3253)
==26273==  Address 0x797420666f20750c is not stack'd, malloc'd or (recently) free'd
==26273== 
==26273== 
==26273== Process terminating with default action of signal 11 (SIGSEGV)
==26273==  General Protection Fault
==26273==    at 0x4ED1880: PyType_IsSubtype (typeobject.c:1146)
==26273==    by 0x4E772A5: recursive_isinstance (abstract.c:2890)
==26273==    by 0x4ED32E1: type___instancecheck__ (typeobject.c:591)
==26273==    by 0x4E75B77: PyObject_Call (abstract.c:2529)
==26273==    by 0x4E77E70: PyObject_CallFunctionObjArgs (abstract.c:2760)
==26273==    by 0x4E7818F: PyObject_IsInstance (abstract.c:2963)
==26273==    by 0x4F14D02: builtin_isinstance (bltinmodule.c:2452)
==26273==    by 0x4F1E378: PyEval_EvalFrameEx (ceval.c:4021)
==26273==    by 0x4F1EEC7: PyEval_EvalFrameEx (ceval.c:4107)
==26273==    by 0x4F1FC88: PyEval_EvalCodeEx (ceval.c:3253)
==26273==    by 0x4F1DA34: PyEval_EvalFrameEx (ceval.c:4117)
==26273==    by 0x4F1FC88: PyEval_EvalCodeEx (ceval.c:3253)
==26273== 
==26273== HEAP SUMMARY:
==26273==     in use at exit: 37,670,463 bytes in 30,050 blocks
==26273==   total heap usage: 300,483 allocs, 270,433 frees, 186,670,109 bytes allocated
==26273== 
==26273== LEAK SUMMARY:
==26273==    definitely lost: 288 bytes in 5 blocks
==26273==    indirectly lost: 240 bytes in 10 blocks
==26273==      possibly lost: 6,920,958 bytes in 5,303 blocks
==26273==    still reachable: 30,748,977 bytes in 24,732 blocks
==26273==         suppressed: 0 bytes in 0 blocks
==26273== Rerun with --leak-check=full to see details of leaked memory
==26273== 
==26273== For counts of detected and suppressed errors, rerun with: -v
==26273== Use --track-origins=yes to see where uninitialised values come from
==26273== ERROR SUMMARY: 9045 errors from 180 contexts (suppressed: 174 from 8)
Killed

Awesome. Let me look quickly at the invalid write (causing the segfault)

Owner

wesm commented Jun 10, 2013

I'm not equipped to debug this today. @Komnomnomnom can I beg you to take a look?

Contributor

hayd commented Jun 10, 2013

(stating the obvious here: presumably it's to do with its bad shape.)

Owner

wesm commented Jun 10, 2013

yeah, definitely raises an exception inside the decoder but there's some data that's being incorrectly freed or otherwise modified

Contributor

Komnomnomnom commented Jun 10, 2013

@wesm should be able to take a look tomorrow (on vacation, returning today)

Contributor

hayd commented Jun 10, 2013

@jreback The stackoverflow issue is because of its gzip encoding. (Probably quite a few additional arguments for this in the future or maybe io.common...)

In [1]: import requests
In [2]: data = requests.get("https://api.stackexchange.com/2.1/search?page=1&pagesize=10&order=desc&sort=activity&tagged=pandas&site=stackoverflow").text
In [3]: pd.read_json(data)  # just works TM (but is nested)
Contributor

jreback commented Jun 10, 2013

ahh....makes sense.....should prob be able to detect that....

actually I think I see the problem with that particular one:

In [36]: from urllib2 import urlopen

In [37]: buf = urlopen("https://api.stackexchange.com/2.1/search?page=1&pagesize=10&order=desc&sort=activity&tagged=pandas&site=stackoverflow").read()

In [38]: pd.read_json(StringIO(buf.decode('latin-1')))
ValueError: Expected object or value

requests must do the decoding (then you say it's gzipped?)

also see here: pydata#2636

def need a generalized gzip handler (in io.common)
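Such a handler could be sketched like this (a hypothetical `maybe_decompress` helper for illustration, not actual pandas API): it sniffs the gzip magic bytes and transparently decompresses, passing plain payloads through unchanged.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"

def maybe_decompress(raw):
    # Hypothetical io.common-style helper: if the payload starts with
    # the gzip magic bytes, decompress it; otherwise return it as-is.
    if raw[:2] == GZIP_MAGIC:
        return gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    return raw

# round-trip check: compress some JSON-ish bytes and recover them
payload = b'{"answers": [1, 2, 3]}'
assert maybe_decompress(gzip.compress(payload)) == payload
assert maybe_decompress(payload) == payload
```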

Contributor

hayd commented Jun 10, 2013

Maybe we should just use requests everywhere? I don't mind doing the porting to it (or is there a reason not to?), e.g. its not being in the standard lib...

Contributor

jreback commented Jun 10, 2013

can we include it as a sub-library? it's apache2 licensed, don't know if that's compatible

Contributor

hayd commented Jun 10, 2013

Licenses should be compatible, I was thinking just have an optional dependency with a fall back (i.e. if they have it installed use it, else use urllib2), was going to make another thread, but I think I'll just put a pr together later in the week and see what people think.

Contributor

jreback commented Jun 10, 2013

I'd go for that....makes things cleaner

Contributor

Komnomnomnom commented Jun 11, 2013

So I've tracked at least part of the seg fault problem down to line 78; as @wesm surmised, it looks like it is freeing some memory it shouldn't be. With this commented out it parses successfully.

Still investigating but once solved @jreback should I just paste the patch here or submit another pull request?

Contributor

jreback commented Jun 11, 2013

@Komnomnomnom go ahead and paste here
and we'll get this in

Contributor

Komnomnomnom commented Jun 11, 2013

Ok the following patch should make it safe to call Npy_releaseContext multiple times (which is what was causing the problem). Segmentation fault is gone and valgrind output from Python 2.7 debug build is clean. Likewise all tests pass for Python 2.7 and valgrind output for json tests is clean (i.e. there are no warnings for json related code).

diff --git a/pandas/src/ujson/python/JSONtoObj.c b/pandas/src/ujson/python/JSONtoObj.c
index 1db7586..160c30f 100644
--- a/pandas/src/ujson/python/JSONtoObj.c
+++ b/pandas/src/ujson/python/JSONtoObj.c
@@ -10,6 +10,7 @@ typedef struct __PyObjectDecoder
     JSONObjectDecoder dec;

     void* npyarr;       // Numpy context buffer
+    void* npyarr_addr;  // Ref to npyarr ptr to track DECREF calls
     npy_intp curdim;    // Current array dimension

     PyArray_Descr* dtype;
@@ -67,9 +68,7 @@ void Npy_releaseContext(NpyArrContext* npyarr)
         }
         if (npyarr->dec)
         {
-            // Don't set to null, used to make sure we don't Py_DECREF npyarr
-            // in releaseObject
-            // npyarr->dec->npyarr = NULL;
+            npyarr->dec->npyarr = NULL;
             npyarr->dec->curdim = 0;
         }
         Py_XDECREF(npyarr->labels[0]);
@@ -88,6 +87,7 @@ JSOBJ Object_npyNewArray(void* _decoder)
     {
         // start of array - initialise the context buffer
         npyarr = decoder->npyarr = PyObject_Malloc(sizeof(NpyArrContext));
+        decoder->npyarr_addr = npyarr;

         if (!npyarr)
         {
@@ -515,7 +515,7 @@ JSOBJ Object_newDouble(double value)
 static void Object_releaseObject(JSOBJ obj, void* _decoder)
 {
     PyObjectDecoder* decoder = (PyObjectDecoder*) _decoder;
-    if (obj != decoder->npyarr)
+    if (obj != decoder->npyarr_addr)
     {
         Py_XDECREF( ((PyObject *)obj));
     }
@@ -555,6 +555,7 @@ PyObject* JSONToObj(PyObject* self, PyObject *args, PyObject *kwargs)
     pyDecoder.dec = dec;
     pyDecoder.curdim = 0;
     pyDecoder.npyarr = NULL;
+    pyDecoder.npyarr_addr = NULL;

     decoder = (JSONObjectDecoder*) &pyDecoder;

@@ -609,6 +610,7 @@ PyObject* JSONToObj(PyObject* self, PyObject *args, PyObject *kwargs)

     if (PyErr_Occurred())
     {
+        Npy_releaseContext(pyDecoder.npyarr);
         return NULL;
     }

wesm and others added some commits May 12, 2013

@wesm @jreback wesm ENH: pull pandasjson back into pandas e31f839
@wesm @jreback wesm DOC: add ultrajson license 8327c5b
@wesm @jreback wesm TST: json manip test script. and trigger travis ade5d0f
@jreback jreback BLD: fix setup.py to work on current pandas 9633880
@jreback jreback CLN: revised json support to use the to_json/read_json in pandas.io.json
DOC: docs in io.rst/whatsnew/release notes/api

TST: cleaned up cruft in test_series/test_frame
7dd12cc
@jreback jreback DOC: io.rst doc updates a9dafe3
@jreback jreback API: to_json now writes to a file by default (if None is provided it will return a StringIO object)

     read_json will read from a string-like or filebuf or url (consistent with other parsers)
6422041
@jreback jreback ENH: removed json argument, now path_or_buf can be a path, buffer, url, or JSON string

     added keywords parse_dates,keep_default_dates to allow for date parsing in columns
     of a Frame (default is False, not to parse dates)
8e673cf
@jreback jreback ENH: added date_format parm to to_json to allow epoch or iso formats (which both can be parsed with parse_dates=True in read_json)
2697b49
@jreback jreback BUG: patch in weird nested decoding issue, courtesy of @Komnomnomnom 8e4314d
Contributor

jreback commented Jun 11, 2013

patch applied.....looking good now

Contributor

hayd commented Jun 11, 2013

@jreback Something like this for requests: hayd/pandas@dbd968b

Contributor

jreback commented Jun 11, 2013

@wesm this is mergable....any objections?

Owner

wesm commented Jun 11, 2013

Looks good to me, bombs away

Contributor

jreback commented Jun 11, 2013

3.2.1.....

@jreback jreback added a commit that referenced this pull request Jun 11, 2013

@jreback jreback Merge pull request #3804 from jreback/ujson
ENH: add ujson support in pandas.io.json
a7f37d4

@jreback jreback merged commit a7f37d4 into pandas-dev:master Jun 11, 2013

Contributor

Komnomnomnom commented Jun 11, 2013

Awesome. I'll see about merging in upstream changes. Will send thru a pull request soonish.

Contributor

jreback commented Jun 11, 2013

oh...you have additional dependencies on this?

Contributor

Komnomnomnom commented Jun 11, 2013

Mentioned in #3583, there have been some enhancements / fixes in ultrajson since the pandas json version was originally written. Nothing major (I think) and should be straightforward enough to merge, but it'd be a good idea to keep them in sync.

Contributor

jreback commented Jun 11, 2013

ok...sure...

Owner

wesm commented Jun 13, 2013

thanks all for making this happen, especially to @Komnomnomnom for authoring this code in the first place =)
