Handling of nested JSON records #1067

Closed
wesm opened this issue Apr 16, 2012 · 14 comments · Fixed by #4007
Labels
Enhancement IO JSON read_json, to_json, json_normalize
Milestone

Comments

@wesm
Member

wesm commented Apr 16, 2012

Is there a simple way of grabbing nested keys when constructing a pandas DataFrame from JSON? Using the example JSON below, how would I build a DataFrame that uses column_header = ['id_str', 'text', 'user.screen_name'] (i.e. how do I get 'screen_name' from the 'user' key without flattening the JSON)?

Thanks,

Vishal

{   u'_id': ObjectId('4f8b95e8a504d022e2000000'),
    u'contributors': None,
    u'coordinates': None,
    u'created_at': u'Mon Apr 16 03:45:44 +0000 2012',
    u'entities': {   u'hashtags': [],
                     u'urls': [   {   u'display_url': u'stks.co/3M4X',
                                      u'expanded_url': u'http://stks.co/3M4X',
                                      u'indices': [72, 92],
                                      u'url': u'http://t.co/mGdTYusF'}],
                     u'user_mentions': []},
    u'favorited': False,
    u'geo': None,
    u'id': 191734090783916032L,
    u'id_str': u'191734090783916032',
    u'in_reply_to_screen_name': None,
    u'in_reply_to_status_id': None,
    u'in_reply_to_status_id_str': None,
    u'in_reply_to_user_id': None,
    u'in_reply_to_user_id_str': None,
    u'place': None,
    u'possibly_sensitive': False,
    u'possibly_sensitive_editable': True,
    u'processed_metadata': {   u'created_date': datetime.datetime(2012, 4, 16, 3, 45, 44, tzinfo=<bson.tz_util.FixedOffset object at 0x104d63790>),
                               u'search_queries': [   u'$AAPL',
                                                      u'$MSFT',
                                                      u'$GOOG'],
                               u'source': u'Twitter Streaming API'},
    u'retweet_count': 0,
    u'retweeted': False,
    u'source': u'<a href="http://stocktwits.com" rel="nofollow">StockTwits Web</a>',
    u'text': u'Interesting infographic on the internet and evolution of social media \u2794 http://t.co/mGdTYusF $FB $GOOG $TWIT $LNKD $AOL',
    u'truncated': False,
    u'user': {   u'_id': u'speculatethemkt',
                 u'contributors_enabled': False,
                 u'created_at': u'Tue Nov 30 02:28:20 +0000 2010',
                 u'default_profile': False,
                 u'default_profile_image': False,
                 u'description': u"I'm a 22-year old full-time forex trader living the location independent lifestyle. Author of The Trading Elite \u2794 http://amzn.to/I0nacY",
                 u'favourites_count': 1,
                 u'follow_request_sent': None,
                 u'followers_count': 19658,
                 u'following': None,
                 u'friends_count': 596,
                 u'geo_enabled': False,
                 u'id': 221226895,
                 u'id_str': u'221226895',
                 u'is_translator': False,
                 u'lang': u'en',
                 u'listed_count': 6,
                 u'location': u'Portland, OR',
                 u'name': u'Jared M.',
                 u'notifications': None,
                 u'processed_metadata': {   u'created_date': datetime.datetime(2012, 4, 16, 3, 45, 44, tzinfo=<bson.tz_util.FixedOffset object at 0x104d63790>),
                                            u'search_queries': [   u'$AAPL',
                                                                   u'$MSFT',
                                                                   u'$GOOG'],
                                            u'source': u'Twitter Streaming API'},
                 u'profile_background_color': u'4f4f4f',
                 u'profile_background_image_url': u'http://a0.twimg.com/profile_background_images/438353849/TWITTER-BG.jpg',
                 u'profile_background_image_url_https': u'https://si0.twimg.com/profile_background_images/438353849/TWITTER-BG.jpg',
                 u'profile_background_tile': False,
                 u'profile_image_url': u'http://a0.twimg.com/profile_images/2039368182/twitpic1_normal.png',
                 u'profile_image_url_https': u'https://si0.twimg.com/profile_images/2039368182/twitpic1_normal.png',
                 u'profile_link_color': u'bd0000',
                 u'profile_sidebar_border_color': u'eeeeee',
                 u'profile_sidebar_fill_color': u'efefef',
                 u'profile_text_color': u'333333',
                 u'profile_use_background_image': True,
                 u'protected': False,
                 u'screen_name': u'speculatethemkt',
                 u'show_all_inline_media': True,
                 u'statuses_count': 492,
                 u'time_zone': u'Pacific Time (US & Canada)',
                 u'url': u'http://www.speculatethemarkets.com',
                 u'utc_offset': -28800,
                 u'verified': False}}
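For concreteness, here is the frame the question asks for, built by hand over stand-in records (a sketch; automating exactly this nested-key extraction is the feature request):

```python
import pandas as pd

# minimal stand-ins for parsed tweet dicts like the one above
records = [
    {"id_str": "191734090783916032",
     "text": "Interesting infographic ...",
     "user": {"screen_name": "speculatethemkt"}},
]

# the hand-rolled version of what the question asks pandas to automate:
# pull each (possibly nested) key out of every record
df = pd.DataFrame({
    "id_str": [r["id_str"] for r in records],
    "text": [r["text"] for r in records],
    "user.screen_name": [r["user"]["screen_name"] for r in records],
})
```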
@jreback
Contributor

jreback commented Jun 11, 2013

This is invalid JSON (according to jsonlint). Generalized inference is IMHO too complicated, but #3804 should be able to do some of this. Close this issue?

@hayd
Contributor

hayd commented Jun 11, 2013

Is it feasible to grab just the user section? (Actually, this example from the other thread is better.)

Convert the (data, posts) section to a DataFrame:

s = r'''{
    "status": "success",
    "data": {
        "posts": [
            {
                "id": 1,
                "title": "A blog post",
                "body": "Some useful content"
            },
            {
                "id": 2,
                "title": "Another blog post",
                "body": "More content"
            }
        ]
    }
}'''

read_json(s, grab_nest=('data', 'posts'))  # some better argument name
                  body  id              title
0  Some useful content   1        A blog post
1         More content   2  Another blog post

@jreback
Contributor

jreback commented Jun 12, 2013

This is probably getting too cute.....

In [52]: def extract(df, l):
   ....:     for e in l:
   ....:         df = df[e]
   ....:     return df
   ....: 

In [54]: DataFrame.extract = extract

In [56]: DataFrame(pd.read_json(s).extract(['data','posts']))
Out[56]: 
                  body  id              title
0  Some useful content   1        A blog post
1         More content   2  Another blog post
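The extract helper above can also be written as a standalone function over already-parsed JSON, using only the standard library (a minimal sketch; same path-walking idea, just not attached to DataFrame):

```python
import json

def extract(obj, path):
    """Walk a sequence of keys into a nested JSON structure."""
    for key in path:
        obj = obj[key]
    return obj

s = '''{
    "status": "success",
    "data": {
        "posts": [
            {"id": 1, "title": "A blog post", "body": "Some useful content"},
            {"id": 2, "title": "Another blog post", "body": "More content"}
        ]
    }
}'''

# posts is a plain list of flat records, ready for DataFrame(posts)
posts = extract(json.loads(s), ('data', 'posts'))
```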

@hayd
Contributor

hayd commented Jun 12, 2013

Ha! Perhaps less overhead to do pd.DataFrame(extract(pd.io.json.loads(s), ('data', 'posts'))), but in either case we would lose the datetime parsing atm.

@hayd
Contributor

hayd commented Jun 12, 2013

This could be a reasonably ok solution... tricky with orient (?), then parse_dates or whatever?

...other choice is just to loads/dumps/parse? :s

@hayd
Contributor

hayd commented Jun 12, 2013

Here's a biggish nested JSON (~200 MB): https://github.com/hayd/sf-city-lots-json

I think I'd want to extract ['features'].

I'm on an incredibly old MacBook Air, hence the slow timings:

In [9]: %time with open('citylots.json', 'r') as f: pd.read_json(pd.io.json.dumps(extract(pd.io.json.loads(f.read()), ['features'])))
CPU times: user 29.13 s, sys: 28.27 s, total: 57.40 s
Wall time: 304.71 s

In [10]: %time with open('citylots.json', 'r') as f: pd.DataFrame(extract(pd.io.json.loads(f.read()), ['features']))
CPU times: user 11.96 s, sys: 11.79 s, total: 23.75 s
Wall time: 136.50 s

In [11]: %time with open('citylots.json', 'r') as f: pd.read_json(f.read())
CPU times: user 13.47 s, sys: 10.41 s, total: 23.88 s
Wall time: 77.47 s

What would be an extreme case for reading in JSON?

@jreback
Contributor

jreback commented Jun 13, 2013

After I figured out all I needed to do was clone the repository!
(these also include full dtype conversions) FYI

In [3]: %time with open('citylots.json', 'r') as f: pd.read_json(pd.io.json.dumps(extract(pd.io.json.loads(f.read()), ['features'])))
CPU times: user 13.03 s, sys: 0.50 s, total: 13.53 s
Wall time: 15.12 s

In [6]: %time with open('citylots.json', 'r') as f: pd.DataFrame(extract(pd.io.json.loads(f.read()), ['features']))
CPU times: user 6.03 s, sys: 0.08 s, total: 6.11 s
Wall time: 6.13 s

In [7]: %time with open('citylots.json', 'r') as f: pd.read_json(f.read())
CPU times: user 6.27 s, sys: 0.16 s, total: 6.44 s
Wall time: 6.45 s

@jreback
Contributor

jreback commented Jun 13, 2013

see #3876

@hayd
Contributor

hayd commented Jun 13, 2013

Wow, I should never do any data analysis on that laptop... (sorry I forgot that you had to clone it).

But what I mean is, you'd lose the control from the read_json arguments. Will be interesting to see if this use case comes up a lot "in the wild".

... really this kind of stuff should be done at the other end, e.g. with http://www.elasticsearch.org/guide/reference/api/get/ (getting the _source directly).

@wesm
Member Author

wesm commented Jun 13, 2013

I have a JSON normalization function I can clean up and make a PR before anyone goes crazy writing one, to save you some time. It would be nice to have a higher-performance one at some point, though.

@nehalecky
Contributor

Hey @wesm and @hayd! I've been keeping my eye on this thread for a few days now—really impressive all the work that built up to this, thank you. Anyway, I thought you might like to know that I could go crazy writing some JSON normalization soon. Perhaps I should wait? :)

Thanks for all.

@hayd
Contributor

hayd commented Jun 23, 2013

Related: https://groups.google.com/forum/#!topic/pydata/XkiWtZKT698 (json is a list of nested dictionaries...)

Hey @nehalecky , I think @wesm says he has something in the works already, so perhaps if you can hold off til he's pushed, then you could hack that? :)

wesm added a commit that referenced this issue Jun 23, 2013
wesm added a commit that referenced this issue Oct 3, 2013
@ImNick23

(quotes @wesm's original question and example JSON from the first comment in full)

I am trying to convert this kind of 3/4-level nested JSON into a pandas DataFrame with every attribute present in it. I am able to extract up to 2 levels. How can I do it for the rest?
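For reference, the flattening asked about here is what json_normalize (added by #4007, which closed this issue) does by default: nested dicts are expanded into dot-separated columns at every depth. A minimal sketch against a trimmed version of the tweet above (modern pandas exposes it as pd.json_normalize; older versions as pandas.io.json.json_normalize):

```python
import pandas as pd

# trimmed stand-in for the tweet record from the original question
tweet = {
    "id_str": "191734090783916032",
    "text": "Interesting infographic ...",
    "user": {
        "screen_name": "speculatethemkt",
        "followers_count": 19658,
        "processed_metadata": {"source": "Twitter Streaming API"},
    },
}

# flattens every level by default: columns include 'user.screen_name'
# and even 'user.processed_metadata.source' (3 levels deep)
df = pd.json_normalize(tweet)

cols = df[["id_str", "text", "user.screen_name"]]
```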
