json_normalize should raise when record_path doesn't point to an array #26284

Marky0 · 2019-05-05T15:46:04Z

There seems to be odd behaviour with json_normalize, specifically when json keys are inconsistently formatted.
json_normalize throws a keyword error as summarised in this question

https://stackoverflow.com/questions/55993192/inconsistent-behaviour-from-json-normalize-with-deeply-nested-json-data

Is this the expected behaviour ?

jreback · 2019-05-05T15:52:57Z

pls post a reproducible example and version as the template indicates

Marky0 · 2019-05-05T23:01:39Z

I am trying to import deeply nested json into pandas (v0.24.2) using json_normalize and coming across a few inconsistencies which I am struggling to resolve. An example json is as follows, which is inconstantly formatted as indicated by Missing keyEB

json =  [   {'keyA': 1,
             'keyB': 2,
             'keyC': [{
                     'keyCA': 3,
                     'keyCB': {'keyCBA':4,
                               'keyCBB':5,
                               'keyCBC': [{'keyCBCA':6, 'keyCBCB':7, 'keyCBCC':8},
                                          {'keyCBCA':9, 'keyCBCB':10, 'keyCBCC':11},
                                          {'keyCBCA':12, 'keyCBCB':13, 'keyCBCC':14}],
                               'keyCBD':15},
                     'keyCC':16}],
            'keyD':17,
            'keyE': [{
                     'keyEA':18,
                     'keyEB': {'keyEBA':19,'keyEBB':20}
                     }]
            },{
            'keyA': 31,
            'keyB': 32,
            'keyC': [{
                    'keyCA': 33,
                    'keyCB': {'keyCBA': 34,
                              'keyCBB': 35,
                              'keyCBC': [{'keyCBCA': 36, 'keyCBCB': 37, 'keyCBCC': 38},
                                         {'keyCBCA': 39, 'keyCBCB': 40, 'keyCBCC': 41},
                                         {'keyCBCA': 42, 'keyCBCB': 43, 'keyCBCC': 44}],
                              'keyCBD': 45},
                    'keyCC': 46}],
            'keyD': 47,
            'keyE': [{
                    'keyEA': 48,
                    'Missing keyEB': 49
                    }]
            }]

The following code gives the expected behavior of json_normalize, extracting the correctly normalised data :

First level json correctly normalized

from pandas.io.json import json_normalize
json_normalize(data = json)

   keyA  keyB       keyC  keyD       keyE
0     1     2  [{'key...    17  [{'key...
1    31    32  [{'key...    47  [{'key...

Second level KeyC correctly normalized

json_normalize(data = json, record_path = ['keyC'], meta = ['keyA']) 

   keyCA      keyCB  keyCC  keyA
0      3  {'keyC...     16     1
1     33  {'keyC...     46    31

Fourth level keyCBC correctly normalized

json_normalize(data = json, record_path = ['keyC', 'keyCB', 'keyCBC'], meta = ['keyA'])

   keyCBCA  keyCBCB  keyCBCC  keyA
0        6        7        8     1
1        9       10       11     1
2       12       13       14     1
3       36       37       38    31
4       39       40       41    31
5       42       43       44    31

However, other branches seem to be inconsistently normalized.

Third level keyCB ......

json_normalize(data = json, record_path = ['keyC', 'keyCB'], meta = ['keyA'])

        0  keyA
0  keyCBA     1
1  keyCBB     1
2  keyCBC     1
3  keyCBD     1
4  keyCBA    31
5  keyCBB    31
6  keyCBC    31
7  keyCBD    31

#Uhhhh ! I was expecting
#    keyCBA   keyCBB   keyCBC   keyCBD   KeyA
# 0       4        5   [{'key..     15      1
# 1      34       35   [{'key..     45     31

and the following completely bombs with a keyword error because of the missing keyEB

json_normalize(data = json, record_path = ['keyE', 'keyEB'], meta = ['keyA'])

Traceback (most recent call last):......
KeyError: 'keyEB'

#I was expecting
#      keyEBA   keyEBB   keyA
# 0        19       20      1
# 1       NaN      NaN     31

Is this expected behaviour from jsons_normalize ?

WillAyd · 2019-05-06T01:57:25Z

So record_path is supposed to point to an array of objects. The one place you are failing at is pointing directly to an object instead of an array, which I think is the root cause of the issue. Off the top of my head it isn't useful to try and normalize a single object, so this should probably raise instead of providing back the result you see which I think is just iterating over the keys of the object

As to the second issue you've pointed out there currently isn't support for ignoring errors for items specified in the record_path. If that's something you would like to request should probably open as a separate issue and we can repurpose this one for better error handling with the issue described above

Marky0 · 2019-05-06T09:06:58Z

Thanks for the feedback.
For the first issue, I would agree that if the entire json was one dictionary at the root, there would be no useful reason to normalise the json of a single object, unless it was just to read a set of keys, which doesn't need pandas. However, as the entire json is an array at the root (in my case over 1000 objects), then there is value in normalizing the single object providing it can be collected into a dataframe.
Not sure I understand the purpose of returning the keys when record_path points to an object in the current implementation ? Is this ever needed ?
Is there a better way to achieve the result I expected ?

For the second issue I will raise a separate issue as you suggested.
Many Thanks

WillAyd · 2019-05-06T15:03:35Z

Not sure I understand the purpose of returning the keys when record_path points to an object in the current implementation ? Is this ever needed ?

Not sure I see a need for this either. If I had to guess its simply a byproduct of the fact that iterating over a dict will return its keys

However, as the entire json is an array at the root (in my case over 1000 objects), then there is value in normalizing the single object providing it can be collected into a dataframe.

Yea I can see that. I'm not sure if the JSON spec makes any distinction between a single object and an array containing a single object that would dictate behavior, but if not I think we'd be open to it if you want to try a PR!

Improves robustness of json_normalize for inconsistently formatted json pandas-dev#26284

Adds new tests for deeply nested json which is inconsistently formatted pandas-dev#26284

hugosoftdev · 2020-02-26T15:40:04Z

Hey, i would like to work on this issue. Has it already being resolved?

Marky0 · 2020-02-26T19:46:15Z

Not by me. I recoded normalize.py and got it working for my specific need off-line. Never quite found the time to re-code in a generalised form for approval into the main branch.

WillAyd added the Needs Info Clarification about behavior needed to assess issue label May 5, 2019

WillAyd added IO JSON read_json, to_json, json_normalize Error Reporting Incorrect or improved errors from pandas and removed Needs Info Clarification about behavior needed to assess issue labels May 6, 2019

WillAyd added this to the Contributions Welcome milestone May 6, 2019

WillAyd changed the title ~~json_normalize fails to import deeply nested json~~ json_normalize should raise when record_path doesn't point to an array May 6, 2019

WillAyd added the good first issue label May 6, 2019

Marky0 added a commit to Marky0/pandas that referenced this issue May 9, 2019

Update to json_normalize to improve robustness

113e415

Improves robustness of json_normalize for inconsistently formatted json pandas-dev#26284

Marky0 mentioned this issue May 9, 2019

Update to json_normalize to improve robustness Marky0/pandas#1

Merged

4 tasks

Marky0 added a commit to Marky0/pandas that referenced this issue May 9, 2019

Add new tests to test_normalize.py

9715c95

Adds new tests for deeply nested json which is inconsistently formatted pandas-dev#26284

Marky0 mentioned this issue May 9, 2019

Robustness improvement for normalize.py #26328

Closed

4 tasks

This was referenced Apr 16, 2020

BUG: Raise a TypeError when record_path doesn't point to an array LTe/pandas#2

Closed

BUG: Raise a TypeError when record_path doesn't point to an array #33585

Merged

WillAyd closed this as completed in #33585 Apr 16, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

json_normalize should raise when record_path doesn't point to an array #26284

json_normalize should raise when record_path doesn't point to an array #26284

Marky0 commented May 5, 2019

jreback commented May 5, 2019

Marky0 commented May 5, 2019

WillAyd commented May 6, 2019

Marky0 commented May 6, 2019

WillAyd commented May 6, 2019

hugosoftdev commented Feb 26, 2020

Marky0 commented Feb 26, 2020

json_normalize should raise when record_path doesn't point to an array #26284

json_normalize should raise when record_path doesn't point to an array #26284

Comments

Marky0 commented May 5, 2019

jreback commented May 5, 2019

Marky0 commented May 5, 2019

WillAyd commented May 6, 2019

Marky0 commented May 6, 2019

WillAyd commented May 6, 2019

hugosoftdev commented Feb 26, 2020

Marky0 commented Feb 26, 2020