BUG: read_json silently skipping records? #4359

Closed
tdhopper opened this issue Jul 25, 2013 · 26 comments

@tdhopper
Contributor

commented Jul 25, 2013

to_json should raise when orient='columns' and the index is non-unique.
Also when orient='index' and the columns are non-unique?

I'm trying out to_json and read_json on a DataFrame with 800k rows. However, after calling to_json on the frame, read_json returns only 2k rows. This happens whether I chain the two calls in memory or have to_json write to a file and pass that filename to read_json. Judging by the size of the file, all the data is being written (the JSON is roughly the size of the pickled DataFrame). Any idea what's going on?

image

@tdhopper

Contributor Author

commented Jul 25, 2013

I just tried to open the pandas-created JSON file with the json module and got the error below.


---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-24-08d57842ab3e> in <module>()
      1 import json
      2 with open("data/df.json") as f:
----> 3     j = json.load(f)

C:\Anaconda\lib\json\__init__.pyc in load(fp, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    288         parse_float=parse_float, parse_int=parse_int,
    289         parse_constant=parse_constant, object_pairs_hook=object_pairs_hook,
--> 290         **kw)
    291 
    292 

C:\Anaconda\lib\json\__init__.pyc in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    336             parse_int is None and parse_float is None and
    337             parse_constant is None and object_pairs_hook is None and not kw):
--> 338         return _default_decoder.decode(s)
    339     if cls is None:
    340         cls = JSONDecoder

C:\Anaconda\lib\json\decoder.pyc in decode(self, s, _w)
    363 
    364         """
--> 365         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    366         end = _w(s, end).end()
    367         if end != len(s):

C:\Anaconda\lib\json\decoder.pyc in raw_decode(self, s, idx)
    379         """
    380         try:
--> 381             obj, end = self.scan_once(s, idx)
    382         except StopIteration:
    383             raise ValueError("No JSON object could be decoded")

UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
@cpcloud

Member

commented Jul 25, 2013

can you provide a minimal reproducible example?

@jreback

Contributor

commented Jul 25, 2013

post df.info() as well

@tdhopper

Contributor Author

commented Jul 26, 2013

When I call pd.read_json(...) on the JSON string below, I only get one dataframe row back.

'{"date":{"79820000000.0":1346889720000000000,"79820000000.0":1346889720000000000},"author":{"79820000000.0":"DEBBIE_GI","79820000000.0":"SPINFUEL_ECIGS"},"content":{"79820000000.0":" from University of Athens Tell the Public They Are Not Sure if Smoking is Any More Hazardous than Vaping... http:\\/\\/t.co\\/kL79zxAF","79820000000.0":"@towrofstgh @gingersejuice IT WAS YOU!!! You freaked me out! :-) I thought you were serious, seriously demented that is. Now? Funny as hell."},"following":{"79820000000.0":49,"79820000000.0":436},"followers":{"79820000000.0":38,"79820000000.0":456},"updates":{"79820000000.0":69,"79820000000.0":3024},"content_stripped":{"79820000000.0":"university athens tell public sure smoking hazardous vaping","79820000000.0":"towrofstgh gingersejuice freaked thought serious seriously demented funny hell"},"original_url":{"79820000000.0":"http:\\/\\/t.co\\/kl79zxaf","79820000000.0":""},"rt_source":{"79820000000.0":"","79820000000.0":""},"author_count":{"79820000000.0":54,"79820000000.0":2141},"is_retweet":{"79820000000.0":0,"79820000000.0":0},"is_reply":{"79820000000.0":0,"79820000000.0":1},"has_url":{"79820000000.0":1,"79820000000.0":0},"spam_prediction":{"79820000000.0":1.0,"79820000000.0":0.0}}'
@jreback

Contributor

commented Jul 26, 2013

can you post what you SHOULD get back?

simplejson returns this, which AFAICT is only 1 row as well.
Is this valid JSON?

(Pdb) x = simplejson.loads(json)
(Pdb) x
{'author': {'79820000000.0': 'SPINFUEL_ECIGS'}, 'spam_prediction': {'79820000000.0': 0.0}, 'original_url': {'79820000000.0': u''}, 'is_retweet': {'79820000000.0': 0}, 'has_url': {'79820000000.0': 0}, 'content': {'79820000000.0': '@towrofstgh @gingersejuice IT WAS YOU!!! You freaked me out! :-) I thought you were serious, seriously demented that is. Now? Funny as hell.'}, 'following': {'79820000000.0': 436}, 'content_stripped': {'79820000000.0': 'towrofstgh gingersejuice freaked thought serious seriously demented funny hell'}, 'followers': {'79820000000.0': 456}, 'updates': {'79820000000.0': 3024}, 'date': {'79820000000.0': 1346889720000000000}, 'is_reply': {'79820000000.0': 1}, 'author_count': {'79820000000.0': 2141}, 'rt_source': {'79820000000.0': u''}}


@tdhopper

Contributor Author

commented Jul 26, 2013

That JSON was generated by Pandas. Does this help?

image

@jreback

Contributor

commented Jul 26, 2013

@jreback

Contributor

commented Jul 26, 2013

@tdhopper can you post, say, a to_csv of the original frame so it's easy to reconstruct?
just paste the text in

@tdhopper

Contributor Author

commented Jul 26, 2013

article_id,date,author,content,following,followers,updates,content_stripped,original_url,rt_source,author_count,is_retweet,is_reply,has_url,spam_prediction
79820000000.0,2012-09-06 00:02:00,DEBBIE_GI, from University of Athens Tell the Public They Are Not Sure if Smoking is Any More Hazardous than Vaping... http://t.co/kL79zxAF,49,38,69,university athens tell public sure smoking hazardous vaping,http://t.co/kl79zxaf,,54,0,0,1,1.0
79820000000.0,2012-09-06 00:02:00,SPINFUEL_ECIGS,"@towrofstgh @gingersejuice IT WAS YOU!!! You freaked me out! :-) I thought you were serious, seriously demented that is. Now? Funny as hell.",436,456,3024,towrofstgh gingersejuice freaked thought serious seriously demented funny hell,,,2141,0,1,0,0.0
@tdhopper

Contributor Author

commented Jul 26, 2013

The problem is that the article_id is an index column, but it is not unique.

@jreback

Contributor

commented Jul 26, 2013

Try this

In [7]: x = df.set_index('article_id')

In [12]: pd.read_json(x.to_json(orient='split'),orient='split')
Out[12]: 
                           date          author                                            content  following  followers  updates                                   content_stripped          original_url  rt_source  author_count  is_retweet  is_reply  has_url  spam_prediction
79820000000 2012-09-06 00:02:00       DEBBIE_GI   from University of Athens Tell the Public The...         49         38       69  university athens tell public sure smoking haz...  http://t.co/kl79zxaf        NaN            54           0         0        1                1
79820000000 2012-09-06 00:02:00  SPINFUEL_ECIGS  @towrofstgh @gingersejuice IT WAS YOU!!! You f...        436        456     3024  towrofstgh gingersejuice freaked thought serio...                  None        NaN          2141           0         1        0                0
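
For anyone who wants to try the orient='split' round trip without the original data, here is a minimal sketch on a toy frame with a duplicated index. The frame contents are made up for illustration, and wrapping the string in io.StringIO assumes a reasonably recent pandas (older versions accepted the raw string directly).

```python
import io

import pandas as pd

# Toy frame with a deliberately non-unique index (illustrative data only).
df = pd.DataFrame([["a", "b"], ["c", "d"]], index=[1, 1], columns=["x", "y"])

# orient='split' stores index, columns and data as separate JSON arrays,
# so repeated index labels never become repeated object keys.
payload = df.to_json(orient="split")

# Read it back with the same orient; both rows should survive.
restored = pd.read_json(io.StringIO(payload), orient="split")
print(restored)
```

Both rows come back because the index labels are written as array elements rather than as object keys.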
@tdhopper

Contributor Author

commented Jul 26, 2013

Should DataFrame.to_json() give a warning when index values are not unique?

@jreback

Contributor

commented Jul 26, 2013

yep....I think it should actually raise.

cc @Komnomnomnom

so a non-unique index when orient='columns' is bad
prob non-unique columns when orient='index' as well

any others? hmm....need some tests for this...

will mark as a bug

@Komnomnomnom

Contributor

commented Jul 27, 2013

Yeah, both orient='columns' (the DataFrame default) and orient='index' encode to JavaScript objects, so the keys should be unique. orient='records' is also a problem when the columns are non-unique. orient='values' and orient='split' should be OK.

The to_json output actually contains the duplicated keys, but they are silently dropped by the JSON parser when reading the JSON string back in.
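
To see the silent drop on the reading side, and one way to make it loud, here is a rough sketch using only the standard library json module. The helper name reject_duplicate_keys is hypothetical; object_pairs_hook is a real json.loads parameter, but this is just one possible check, not what pandas does internally.

```python
import json
from collections import Counter


def reject_duplicate_keys(pairs):
    """Hypothetical object_pairs_hook: raise instead of keeping the last value."""
    counts = Counter(key for key, _ in pairs)
    dupes = sorted(key for key, n in counts.items() if n > 1)
    if dupes:
        raise ValueError("duplicate JSON keys: %r" % dupes)
    return dict(pairs)


# Same shape of string that to_json produced for a non-unique index.
raw = '{"x":{"1":"a","1":"c"},"y":{"1":"b","1":"d"}}'

# Default parsing: the last value for each repeated key wins, no warning.
lenient = json.loads(raw)

# Strict parsing: the hook sees every (key, value) pair and can refuse.
try:
    json.loads(raw, object_pairs_hook=reject_duplicate_keys)
    strict_error = None
except ValueError as exc:
    strict_error = str(exc)

print(lenient)
print(strict_error)
```

The hook only helps a consumer notice the problem; the rows are still lost unless the writer picks an orient that does not use the duplicated labels as keys.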

Note to_dict has similar behaviour (silently drops rows with non-unique indices):

In [30]: df = pd.DataFrame([['a','b'],['c','d']],index=[1,1],columns=['x','y'])

In [31]: df
Out[31]: 
   x  y
1  a  b
1  c  d

In [32]: df.to_json()
Out[32]: '{"x":{"1":"a","1":"c"},"y":{"1":"b","1":"d"}}'

In [33]: json.loads(df.to_json())
Out[33]: {u'x': {u'1': u'c'}, u'y': {u'1': u'd'}}

In [34]: df.to_dict()
Out[34]: {'x': {1: 'c'}, 'y': {1: 'd'}}

In [35]: df.to_json(orient='columns')
Out[35]: '{"x":{"1":"a","1":"c"},"y":{"1":"b","1":"d"}}'

In [36]: df = pd.DataFrame([['a','b'],['c','d']],index=[1,1],columns=['x','x'])

In [37]: df
Out[37]: 
   x  x
1  a  b
1  c  d

In [38]: df.to_json()
Out[38]: '{"x":{"1":"a","1":"c"},"x":{"1":"b","1":"d"}}'

In [39]: json.loads(df.to_json())
Out[39]: {u'x': {u'1': u'd'}}

So what do you think, raise an exception prompting the user to either uniqify the data or choose a different orient if index or columns are not unique and are going to be used as keys?

@jreback

Contributor

commented Jul 27, 2013

I think it should raise on writing (and to_dict should change too)
quite easy to test for it:

index.is_unique

you can do it (or I can); your post above is basically the tests

lmk

@Komnomnomnom

Contributor

commented Jul 27, 2013

FYI to_dict() displays a warning if columns are non-unique:

In [48]: df
Out[48]: 
   x  x
1  a  b
1  c  d

In [47]: df.to_dict()
pandas/core/frame.py:984: UserWarning: DataFrame columns are not unique, some columns will be omitted.
  "columns will be omitted.", UserWarning)
Out[47]: {'x': {1: 'd'}}
@jreback

Contributor

commented Jul 27, 2013

hmm

I think it should raise (and json too) but maybe I am in the minority

@cpcloud @wesm ?

@cpcloud

Member

commented Jul 27, 2013

it will be undefined what is returned from to_dict for non-unique columns, so i think it should raise there. not sure about to_json: if it encodes them that way then i think it should raise too

@Komnomnomnom

Contributor

commented Jul 27, 2013

OK I'll put together a PR for the JSON and to_dict changes.

IMO it should just be a warning for to_dict as it deals with the problem but it should be an exception for to_json, as it ends up producing invalid json.

@jreback

Contributor

commented Jul 27, 2013

how does to_dict deal with this? (aside from the warning)

@cpcloud

Member

commented Jul 27, 2013

i'm not sure why this would be useful since you cannot predict which columns will be returned

In [48]: df
Out[48]: 
   x  x
1  a  b
1  c  d

In [47]: df.to_dict()
pandas/core/frame.py:984: UserWarning: DataFrame columns are not unique, some columns will be omitted.
  "columns will be omitted.", UserWarning)
Out[47]: {'x': {1: 'd'}}

i would definitely like a big honking exception if i tried to do this since i unpredictably lose information

@Komnomnomnom

Contributor

commented Jul 27, 2013

It deals with the problem in the sense that it produces valid output, unlike to_json

In [65]: dict((('a',1),('a',2)))
Out[65]: {'a': 2}
@jreback

Contributor

commented Jul 27, 2013

ok let's leave to_dict for now (though I may still make a PR to fix this)
I think we should raise on any loss of data, because it's not immediately obvious that you lost something

ideally you can provide a recommendation (could be generic)
that the user try a different orient (e.g. if the index is not unique then split is ok)
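
A rough sketch of what such a check might look like from user code, assuming the orient-to-axis mapping described earlier in this thread. The wrapper to_json_checked and the _KEYED_AXES table are hypothetical names for illustration; this is not the change that was merged, and the error wording is invented.

```python
import pandas as pd

# Which axes end up as JSON object keys for each orient, per the discussion
# above. Assumption for illustration, not pandas' internal table.
_KEYED_AXES = {
    "columns": ("index", "columns"),
    "index": ("index", "columns"),
    "records": ("columns",),
}


def to_json_checked(df, orient="columns"):
    """Hypothetical wrapper: raise rather than emit duplicate JSON keys."""
    for axis in _KEYED_AXES.get(orient, ()):
        if not getattr(df, axis).is_unique:
            raise ValueError(
                "DataFrame %s must be unique for orient=%r; "
                "deduplicate it or try orient='split'" % (axis, orient)
            )
    return df.to_json(orient=orient)


df = pd.DataFrame([["a", "b"], ["c", "d"]], index=[1, 1], columns=["x", "y"])

try:
    to_json_checked(df)  # default orient uses index labels as keys
except ValueError as exc:
    print(exc)

# 'split' is not in the table, so the same frame serializes without complaint.
print(to_json_checked(df, orient="split"))
```

The check is cheap because is_unique is already available on both the index and the columns.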

@wesm

Member

commented Jul 27, 2013

I'm +1 on raising an exception, force the user to deal with it.

@jreback

Contributor

commented Jul 28, 2013

closed via #4376

@jreback jreback closed this Jul 28, 2013

@tdhopper

Contributor Author

commented Jul 29, 2013

Thanks everyone!
