Handling of nested JSON records #1067

Closed
wesm opened this issue Apr 16, 2012 · 14 comments · Fixed by #4007
Labels
Enhancement IO JSON read_json, to_json, json_normalize
Milestone

Comments

@wesm
Member

wesm commented Apr 16, 2012

Is there a simple way of grabbing nested keys when constructing a pandas DataFrame from JSON? Using the example JSON below, how would I build a DataFrame that uses column_header = ['id_str', 'text', 'user.screen_name'] (i.e. how do I get 'screen_name' from the 'user' key without flattening the JSON)?

Thanks,

Vishal

{   u'_id': ObjectId('4f8b95e8a504d022e2000000'),
    u'contributors': None,
    u'coordinates': None,
    u'created_at': u'Mon Apr 16 03:45:44 +0000 2012',
    u'entities': {   u'hashtags': [],
                     u'urls': [   {   u'display_url': u'stks.co/3M4X',
                                      u'expanded_url': u'http://stks.co/3M4X',
                                      u'indices': [72, 92],
                                      u'url': u'http://t.co/mGdTYusF'}],
                     u'user_mentions': []},
    u'favorited': False,
    u'geo': None,
    u'id': 191734090783916032L,
    u'id_str': u'191734090783916032',
    u'in_reply_to_screen_name': None,
    u'in_reply_to_status_id': None,
    u'in_reply_to_status_id_str': None,
    u'in_reply_to_user_id': None,
    u'in_reply_to_user_id_str': None,
    u'place': None,
    u'possibly_sensitive': False,
    u'possibly_sensitive_editable': True,
    u'processed_metadata': {   u'created_date': datetime.datetime(2012, 4, 16, 3, 45, 44, tzinfo=<bson.tz_util.FixedOffset object at 0x104d63790>),
                               u'search_queries': [   u'$AAPL',
                                                      u'$MSFT',
                                                      u'$GOOG'],
                               u'source': u'Twitter Streaming API'},
    u'retweet_count': 0,
    u'retweeted': False,
    u'source': u'<a href="http://stocktwits.com" rel="nofollow">StockTwits Web</a>',
    u'text': u'Interesting infographic on the internet and evolution of social media \u2794 http://t.co/mGdTYusF $FB $GOOG $TWIT $LNKD $AOL',
    u'truncated': False,
    u'user': {   u'_id': u'speculatethemkt',
                 u'contributors_enabled': False,
                 u'created_at': u'Tue Nov 30 02:28:20 +0000 2010',
                 u'default_profile': False,
                 u'default_profile_image': False,
                 u'description': u"I'm a 22-year old full-time forex trader living the location independent lifestyle. Author of The Trading Elite \u2794 http://amzn.to/I0nacY",
                 u'favourites_count': 1,
                 u'follow_request_sent': None,
                 u'followers_count': 19658,
                 u'following': None,
                 u'friends_count': 596,
                 u'geo_enabled': False,
                 u'id': 221226895,
                 u'id_str': u'221226895',
                 u'is_translator': False,
                 u'lang': u'en',
                 u'listed_count': 6,
                 u'location': u'Portland, OR',
                 u'name': u'Jared M.',
                 u'notifications': None,
                 u'processed_metadata': {   u'created_date': datetime.datetime(2012, 4, 16, 3, 45, 44, tzinfo=<bson.tz_util.FixedOffset object at 0x104d63790>),
                                            u'search_queries': [   u'$AAPL',
                                                                   u'$MSFT',
                                                                   u'$GOOG'],
                                            u'source': u'Twitter Streaming API'},
                 u'profile_background_color': u'4f4f4f',
                 u'profile_background_image_url': u'http://a0.twimg.com/profile_background_images/438353849/TWITTER-BG.jpg',
                 u'profile_background_image_url_https': u'https://si0.twimg.com/profile_background_images/438353849/TWITTER-BG.jpg',
                 u'profile_background_tile': False,
                 u'profile_image_url': u'http://a0.twimg.com/profile_images/2039368182/twitpic1_normal.png',
                 u'profile_image_url_https': u'https://si0.twimg.com/profile_images/2039368182/twitpic1_normal.png',
                 u'profile_link_color': u'bd0000',
                 u'profile_sidebar_border_color': u'eeeeee',
                 u'profile_sidebar_fill_color': u'efefef',
                 u'profile_text_color': u'333333',
                 u'profile_use_background_image': True,
                 u'protected': False,
                 u'screen_name': u'speculatethemkt',
                 u'show_all_inline_media': True,
                 u'statuses_count': 492,
                 u'time_zone': u'Pacific Time (US & Canada)',
                 u'url': u'http://www.speculatethemarkets.com',
                 u'utc_offset': -28800,
                 u'verified': False}}
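For concreteness, here is the frame the question asks for, built by hand over stand-in records (a sketch; automating exactly this nested-key extraction is the feature request):

```python
import pandas as pd

# minimal stand-ins for parsed tweet dicts like the one above
records = [
    {"id_str": "191734090783916032",
     "text": "Interesting infographic ...",
     "user": {"screen_name": "speculatethemkt"}},
]

# the hand-rolled version of what the question asks pandas to automate:
# pull each (possibly nested) key out of every record
df = pd.DataFrame({
    "id_str": [r["id_str"] for r in records],
    "text": [r["text"] for r in records],
    "user.screen_name": [r["user"]["screen_name"] for r in records],
})
```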
@jreback
Contributor

jreback commented Jun 11, 2013

This is invalid JSON (according to jsonlint). Generalized inference is IMHO too complicated, but #3804 should be able to do some of this. Close this issue?

@hayd
Contributor

hayd commented Jun 11, 2013

Is it feasible to grab just the user section? (Actually, this example from the other thread is better.)

Convert the (data, posts) section to a DataFrame:

s = r'''{
    "status": "success",
    "data": {
        "posts": [
            {
                "id": 1,
                "title": "A blog post",
                "body": "Some useful content"
            },
            {
                "id": 2,
                "title": "Another blog post",
                "body": "More content"
            }
        ]
    }
}'''

read_json(s, grab_nest=('data', 'posts'))  # some better argument name
                  body  id              title
0  Some useful content   1        A blog post
1         More content   2  Another blog post

@jreback
Contributor

jreback commented Jun 12, 2013

This is probably getting too cute.....

In [52]: def extract(df, l):
   ....:     for e in l:
   ....:         df = df[e]
   ....:     return df
   ....: 

In [54]: DataFrame.extract = extract

In [56]: DataFrame(pd.read_json(s).extract(['data','posts']))
Out[56]: 
                  body  id              title
0  Some useful content   1        A blog post
1         More content   2  Another blog post
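The extract helper above can also be written as a standalone function over already-parsed JSON, using only the standard library (a minimal sketch; same path-walking idea, just not attached to DataFrame):

```python
import json

def extract(obj, path):
    """Walk a sequence of keys into a nested JSON structure."""
    for key in path:
        obj = obj[key]
    return obj

s = '''{
    "status": "success",
    "data": {
        "posts": [
            {"id": 1, "title": "A blog post", "body": "Some useful content"},
            {"id": 2, "title": "Another blog post", "body": "More content"}
        ]
    }
}'''

# posts is a plain list of flat records, ready for DataFrame(posts)
posts = extract(json.loads(s), ('data', 'posts'))
```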

@hayd
Contributor

hayd commented Jun 12, 2013

Ha! Perhaps less overhead to do pd.DataFrame(extract(pd.io.json.loads(s), ('data', 'posts'))), but in either case we would lose the datetime parsing atm.

@hayd
Contributor

hayd commented Jun 12, 2013

This could be a reasonably ok solution... tricky with orient (?), then parse_dates or whatever?

...other choice is just to loads/dumps/parse? :s

@hayd
Contributor

hayd commented Jun 12, 2013

Here's a biggish nested JSON (~200 MB): https://github.com/hayd/sf-city-lots-json

I think I'd want to extract ['features'].

I'm on an incredibly old MacBook Air, hence the slow timings:

In [9]: %time with open('citylots.json', 'r') as f: pd.read_json(pd.io.json.dumps(extract(pd.io.json.loads(f.read()), ['features'])))
CPU times: user 29.13 s, sys: 28.27 s, total: 57.40 s
Wall time: 304.71 s

In [10]: %time with open('citylots.json', 'r') as f: pd.DataFrame(extract(pd.io.json.loads(f.read()), ['features']))
CPU times: user 11.96 s, sys: 11.79 s, total: 23.75 s
Wall time: 136.50 s

In [11]: %time with open('citylots.json', 'r') as f: pd.read_json(f.read())
CPU times: user 13.47 s, sys: 10.41 s, total: 23.88 s
Wall time: 77.47 s

What would be an extreme case for reading in JSON?

@jreback
Contributor

jreback commented Jun 13, 2013

After I figured out all I needed to do was clone the repository!
(these also include full dtype conversions) FYI

In [3]: %time with open('citylots.json', 'r') as f: pd.read_json(pd.io.json.dumps(extract(pd.io.json.loads(f.read()), ['features'])))
CPU times: user 13.03 s, sys: 0.50 s, total: 13.53 s
Wall time: 15.12 s

In [6]: %time with open('citylots.json', 'r') as f: pd.DataFrame(extract(pd.io.json.loads(f.read()), ['features']))
CPU times: user 6.03 s, sys: 0.08 s, total: 6.11 s
Wall time: 6.13 s

In [7]: %time with open('citylots.json', 'r') as f: pd.read_json(f.read())
CPU times: user 6.27 s, sys: 0.16 s, total: 6.44 s
Wall time: 6.45 s

@jreback
Contributor

jreback commented Jun 13, 2013

see #3876

@hayd
Contributor

hayd commented Jun 13, 2013

Wow, I should never do any data analysis on that laptop... (sorry I forgot that you had to clone it).

But what I mean is, you'd lose the control from the read_json arguments. Will be interesting to see if this use case comes up a lot "in the wild".

... really this kind of stuff should be done at the other end, e.g. with http://www.elasticsearch.org/guide/reference/api/get/ (getting the _source directly).

@wesm
Member Author

wesm commented Jun 13, 2013

I have a JSON normalization function I can clean up and make a PR before anyone goes crazy writing one, to save you some time. It would be nice to have a higher-performance one at some point, though.

@nehalecky
Contributor

Hey @wesm and @hayd! I've been keeping my eye on this thread for a few days now—really impressive all the work that built up to this, thank you. Anyway, I thought you might like to know that I could go crazy writing some JSON normalization soon. Perhaps I should wait? :)

Thanks for all.

@hayd
Contributor

hayd commented Jun 23, 2013

Related: https://groups.google.com/forum/#!topic/pydata/XkiWtZKT698 (json is a list of nested dictionaries...)

Hey @nehalecky , I think @wesm says he has something in the works already, so perhaps if you can hold off til he's pushed, then you could hack that? :)

wesm added a commit that referenced this issue Jun 23, 2013
wesm added a commit that referenced this issue Oct 3, 2013
@ImNick23

(quotes @wesm's original question and example JSON from the first comment in full)

I am trying to convert this kind of 3/4-level nested JSON into a pandas DataFrame with every attribute present in it. I am able to extract up to 2 levels. How can I do it for the rest?
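For reference, the flattening asked about here is what json_normalize (added by #4007, which closed this issue) does by default: nested dicts are expanded into dot-separated columns at every depth. A minimal sketch against a trimmed version of the tweet above (modern pandas exposes it as pd.json_normalize; older versions as pandas.io.json.json_normalize):

```python
import pandas as pd

# trimmed stand-in for the tweet record from the original question
tweet = {
    "id_str": "191734090783916032",
    "text": "Interesting infographic ...",
    "user": {
        "screen_name": "speculatethemkt",
        "followers_count": 19658,
        "processed_metadata": {"source": "Twitter Streaming API"},
    },
}

# flattens every level by default: columns include 'user.screen_name'
# and even 'user.processed_metadata.source' (3 levels deep)
df = pd.json_normalize(tweet)

cols = df[["id_str", "text", "user.screen_name"]]
```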
