json_normalize raises TypeError exception #22706

Closed
vuminhle opened this issue Sep 14, 2018 · 8 comments

@vuminhle
Contributor

commented Sep 14, 2018

Code Sample, a copy-pastable example if possible

from pandas.io.json import json_normalize

d = {
    'name': 'alan smith',    
    'info': {
        'phones': [{
            'area': 111,
            'number': 2222
        }, {
            'area': 333,
            'number': 4444
        }]
    }
}
json_normalize(d, record_path=["info", "phones"])

Problem description

The code above raises a TypeError:

Traceback (most recent call last):
  File ".\test.py", line 15, in <module>
    json_normalize(d, record_path = ["info", "phones"])
  File "C:\Python36\lib\site-packages\pandas\io\json\normalize.py", line 262, in json_normalize
    _recursive_extract(data, record_path, {}, level=0)
  File "C:\Python36\lib\site-packages\pandas\io\json\normalize.py", line 235, in _recursive_extract
    seen_meta, level=level + 1)
  File "C:\Python36\lib\site-packages\pandas\io\json\normalize.py", line 238, in _recursive_extract
    recs = _pull_field(obj, path[0])
  File "C:\Python36\lib\site-packages\pandas\io\json\normalize.py", line 185, in _pull_field
    result = result[spec]
TypeError: string indices must be integers

Expected Output

   area  number
0   111    2222
1   333    4444

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 62 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: 3.6.2
pip: 18.0
setuptools: 40.2.0
Cython: None
numpy: 1.14.5
scipy: None
pyarrow: None
xarray: None
IPython: 6.3.1
sphinx: 1.5.5
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@WillAyd

Member

commented Sep 15, 2018

Thanks for the report - investigation and PRs are always welcome!

@WillAyd WillAyd added this to the Contributions Welcome milestone Sep 15, 2018

@vuminhle

Contributor Author

commented Sep 16, 2018

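If record_path points into nested dicts, then after one recursive call to _recursive_extract, data is the inner dict ({'phones': ...} in the example).

When data is a dict, the for loop in _recursive_extract iterates only over its keys, so each obj is a string and indexing it raises the TypeError above.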
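
The failure can be reproduced without pandas. The following is a minimal sketch of what the loop sees once data is the inner dict; the variable names are illustrative and not taken from normalize.py:

```python
# Stand-in for `data` after the first recursive call (illustrative values).
data = {'phones': [{'area': 111, 'number': 2222}, {'area': 333, 'number': 4444}]}
path = ['phones']

seen = []
error = None
for obj in data:          # iterating a dict yields its keys, so obj == 'phones'
    seen.append(obj)
    try:
        obj[path[0]]      # same kind of lookup as result[spec]: a str indexed by a str
    except TypeError as exc:
        error = exc

print(seen)                  # ['phones']
print(type(error).__name__)  # TypeError
```

The loop never reaches the list of phone records, because it only ever sees the key 'phones'.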

Do we assume that data is always a list? If so, there are two options:

  1. Turn data into a list if it is a dict (similar to line 194).
  2. Hoist the body of the for loop into a method. If data is not a list, call this method on data directly instead of iterating over its elements.

I prefer (2). Let me know what you think. I can create a PR.

@vuminhle

Contributor Author

commented Sep 19, 2018

@WillAyd: What do you think of the proposed fix? I'll create a PR if you think it's the right thing to do.

@TomAugspurger

Contributor

commented Sep 19, 2018

Do we assume that data is always a list? If that is the case, there are two options:

The docstring claims that either a dict or list of dicts is allowed. The only example with a dict doesn't really do any normalization though:

>>> data = {'A': [1, 2]}
>>> json_normalize(data, 'A', record_prefix='Prefix.')
    Prefix.0
0          1
1          2

I'm inclined to do whatever is easiest to maintain in the long run, though it's not clear what that is in this case.

@WillAyd

Member

commented Sep 19, 2018

I don't think we should assume that it is always a list. In my mind, record_path should behave the same way the top level does, just applied to whatever is found at the specified record_path. These calls return equivalent results:

In [6]: json_normalize({'foo': 1, 'bar': 2, 'baz': 3})
Out[6]: 
   bar  baz  foo
0    2    3    1

In [7]: json_normalize([{'foo': 1, 'bar': 2, 'baz': 3}])
Out[7]: 
   bar  baz  foo
0    2    3    1

So I would expect the following to be equivalent as well (though this currently fails):

>>> json_normalize({'info': {'phones': {'foo': 1, 'bar': 2, 'baz': 3}}}, record_path=['info', 'phones'])
>>> json_normalize({'info': {'phones': [{'foo': 1, 'bar': 2, 'baz': 3}]}}, record_path=['info', 'phones'])
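
One way to read this expectation, as a minimal sketch rather than pandas' actual implementation: treat whatever sits at record_path as a list of records, wrapping a lone dict in a one-element list. The helper name as_records is hypothetical.

```python
def as_records(value):
    """Return `value` as a list of records; a lone dict becomes a one-element list."""
    return value if isinstance(value, list) else [value]


single = {'info': {'phones': {'foo': 1, 'bar': 2, 'baz': 3}}}
listed = {'info': {'phones': [{'foo': 1, 'bar': 2, 'baz': 3}]}}

# Resolve the record path by hand, then normalise the result.
records_single = as_records(single['info']['phones'])
records_listed = as_records(listed['info']['phones'])

print(records_single == records_listed)  # True
```

Under that reading, both inputs yield the same single record, which matches the top-level equivalence shown above.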

@vuminhle

Contributor Author

commented Sep 20, 2018

To be clear, I was asking about data inside _recursive_extract (not the data parameter of json_normalize).

I agree with @WillAyd that the list assumption inside _recursive_extract is wrong. Inside this function, data can be anything (a list, a dict, or a scalar value). That's why the fix I proposed above checks for non-list data. The proposed fix is as follows:

def _extract(obj, path, seen_meta, level):
    # Body of the for loop in the else clause at L237, applied to a single obj
    ...

def _recursive_extract(data, path, seen_meta, level=0):
    if len(path) > 1:
        ...  # unchanged
    else:
        if isinstance(data, list):
            for obj in data:  # same as the current version
                _extract(obj, path, seen_meta, level)
        else:
            _extract(data, path, seen_meta, level)  # new: handles non-list data

Note that the current version is equivalent to

def _recursive_extract(data, path, seen_meta, level=0):
    if len(path) > 1:
        ...  # unchanged
    else:
        for obj in data:
            _extract(obj, path, seen_meta, level)

which raises an exception when data is not a list.

@WillAyd

Member

commented Sep 21, 2018

@vuminhle feel free to submit a PR for code review

vuminhle added a commit to vuminhle/pandas that referenced this issue Sep 22, 2018


@vuminhle

Contributor Author

commented Sep 22, 2018

Btw I think #21605 has the same root cause.
PR #22804 should also fix this bug.
