Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

json_normalize should raise when record_path doesn't point to an array #26284

Closed
Marky0 opened this issue May 5, 2019 · 7 comments · Fixed by #33585
Closed

json_normalize should raise when record_path doesn't point to an array #26284

Marky0 opened this issue May 5, 2019 · 7 comments · Fixed by #33585
Labels
Error Reporting Incorrect or improved errors from pandas good first issue IO JSON read_json, to_json, json_normalize

Comments

@Marky0
Copy link

Marky0 commented May 5, 2019

There seems to be odd behaviour with json_normalize, specifically when json keys are inconsistently formatted.
json_normalize throws a keyword error as summarised in this question

https://stackoverflow.com/questions/55993192/inconsistent-behaviour-from-json-normalize-with-deeply-nested-json-data

Is this the expected behaviour ?

@jreback
Copy link
Contributor

jreback commented May 5, 2019

pls post a reproducible example and version as the template indicates

@WillAyd WillAyd added the Needs Info Clarification about behavior needed to assess issue label May 5, 2019
@Marky0
Copy link
Author

Marky0 commented May 5, 2019

I am trying to import deeply nested json into pandas (v0.24.2) using json_normalize and coming across a few inconsistencies which I am struggling to resolve. An example json is as follows, which is inconstantly formatted as indicated by Missing keyEB

json =  [   {'keyA': 1,
             'keyB': 2,
             'keyC': [{
                     'keyCA': 3,
                     'keyCB': {'keyCBA':4,
                               'keyCBB':5,
                               'keyCBC': [{'keyCBCA':6, 'keyCBCB':7, 'keyCBCC':8},
                                          {'keyCBCA':9, 'keyCBCB':10, 'keyCBCC':11},
                                          {'keyCBCA':12, 'keyCBCB':13, 'keyCBCC':14}],
                               'keyCBD':15},
                     'keyCC':16}],
            'keyD':17,
            'keyE': [{
                     'keyEA':18,
                     'keyEB': {'keyEBA':19,'keyEBB':20}
                     }]
            },{
            'keyA': 31,
            'keyB': 32,
            'keyC': [{
                    'keyCA': 33,
                    'keyCB': {'keyCBA': 34,
                              'keyCBB': 35,
                              'keyCBC': [{'keyCBCA': 36, 'keyCBCB': 37, 'keyCBCC': 38},
                                         {'keyCBCA': 39, 'keyCBCB': 40, 'keyCBCC': 41},
                                         {'keyCBCA': 42, 'keyCBCB': 43, 'keyCBCC': 44}],
                              'keyCBD': 45},
                    'keyCC': 46}],
            'keyD': 47,
            'keyE': [{
                    'keyEA': 48,
                    'Missing keyEB': 49
                    }]
            }]

The following code gives the expected behavior of json_normalize, extracting the correctly normalised data :

First level json correctly normalized

from pandas.io.json import json_normalize
json_normalize(data = json)

   keyA  keyB       keyC  keyD       keyE
0     1     2  [{'key...    17  [{'key...
1    31    32  [{'key...    47  [{'key...

Second level KeyC correctly normalized

json_normalize(data = json, record_path = ['keyC'], meta = ['keyA']) 

   keyCA      keyCB  keyCC  keyA
0      3  {'keyC...     16     1
1     33  {'keyC...     46    31

Fourth level keyCBC correctly normalized

json_normalize(data = json, record_path = ['keyC', 'keyCB', 'keyCBC'], meta = ['keyA'])

   keyCBCA  keyCBCB  keyCBCC  keyA
0        6        7        8     1
1        9       10       11     1
2       12       13       14     1
3       36       37       38    31
4       39       40       41    31
5       42       43       44    31

However, other branches seem to be inconsistently normalized.

Third level keyCB ......

json_normalize(data = json, record_path = ['keyC', 'keyCB'], meta = ['keyA'])

        0  keyA
0  keyCBA     1
1  keyCBB     1
2  keyCBC     1
3  keyCBD     1
4  keyCBA    31
5  keyCBB    31
6  keyCBC    31
7  keyCBD    31

#Uhhhh ! I was expecting
#    keyCBA   keyCBB   keyCBC   keyCBD   KeyA
# 0       4        5   [{'key..     15      1
# 1      34       35   [{'key..     45     31

and the following completely bombs with a keyword error because of the missing keyEB

json_normalize(data = json, record_path = ['keyE', 'keyEB'], meta = ['keyA'])

Traceback (most recent call last):......
KeyError: 'keyEB'

#I was expecting
#      keyEBA   keyEBB   keyA
# 0        19       20      1
# 1       NaN      NaN     31

Is this expected behaviour from jsons_normalize ?

@WillAyd
Copy link
Member

WillAyd commented May 6, 2019

So record_path is supposed to point to an array of objects. The one place you are failing at is pointing directly to an object instead of an array, which I think is the root cause of the issue. Off the top of my head it isn't useful to try and normalize a single object, so this should probably raise instead of providing back the result you see which I think is just iterating over the keys of the object

As to the second issue you've pointed out there currently isn't support for ignoring errors for items specified in the record_path. If that's something you would like to request should probably open as a separate issue and we can repurpose this one for better error handling with the issue described above

@WillAyd WillAyd added IO JSON read_json, to_json, json_normalize Error Reporting Incorrect or improved errors from pandas and removed Needs Info Clarification about behavior needed to assess issue labels May 6, 2019
@WillAyd WillAyd added this to the Contributions Welcome milestone May 6, 2019
@WillAyd WillAyd changed the title json_normalize fails to import deeply nested json json_normalize should raise when record_path doesn't point to an array May 6, 2019
@Marky0
Copy link
Author

Marky0 commented May 6, 2019

Thanks for the feedback.
For the first issue, I would agree that if the entire json was one dictionary at the root, there would be no useful reason to normalise the json of a single object, unless it was just to read a set of keys, which doesn't need pandas. However, as the entire json is an array at the root (in my case over 1000 objects), then there is value in normalizing the single object providing it can be collected into a dataframe.
Not sure I understand the purpose of returning the keys when record_path points to an object in the current implementation ? Is this ever needed ?
Is there a better way to achieve the result I expected ?

For the second issue I will raise a separate issue as you suggested.
Many Thanks

@WillAyd
Copy link
Member

WillAyd commented May 6, 2019

Not sure I understand the purpose of returning the keys when record_path points to an object in the current implementation ? Is this ever needed ?

Not sure I see a need for this either. If I had to guess its simply a byproduct of the fact that iterating over a dict will return its keys

However, as the entire json is an array at the root (in my case over 1000 objects), then there is value in normalizing the single object providing it can be collected into a dataframe.

Yea I can see that. I'm not sure if the JSON spec makes any distinction between a single object and an array containing a single object that would dictate behavior, but if not I think we'd be open to it if you want to try a PR!

Marky0 added a commit to Marky0/pandas that referenced this issue May 9, 2019
Improves robustness of json_normalize for inconsistently formatted json pandas-dev#26284
Marky0 added a commit to Marky0/pandas that referenced this issue May 9, 2019
Adds new tests for deeply nested json which is inconsistently formatted pandas-dev#26284
@hugosoftdev
Copy link

Hey, i would like to work on this issue. Has it already being resolved?

@Marky0
Copy link
Author

Marky0 commented Feb 26, 2020

Not by me. I recoded normalize.py and got it working for my specific need off-line. Never quite found the time to re-code in a generalised form for approval into the main branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Error Reporting Incorrect or improved errors from pandas good first issue IO JSON read_json, to_json, json_normalize
Projects
None yet
4 participants