# Pandas Recipes

* https://pandas.pydata.org/pandas-docs/stable/index.html
* https://www.shanelynn.ie/using-pandas-dataframe-creating-editing-viewing-data-in-python/
* https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/



In [None]:
import json
import dimcli
from dimcli.shortcuts import dslquery
import pandas as pd
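from pandas import json_normalize  # pandas >= 1.0 (older versions: from pandas.io.json import json_normalize)
import numpy as np  # used below for np.nan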

### Reminder: about tidy data

http://www.jeannicholashould.com/tidy-data-in-python.html

* Each variable forms a column and contains values
* Each observation forms a row
* Each type of observational unit forms a table

A few definitions:

* Variable: A measurement or an attribute. Height, weight, sex, etc.
* Value: The actual measurement or attribute. 152 cm, 80 kg, female, etc.
* Observation: All values measured on the same unit, e.g. each person.


## Load some data

In [None]:
### With empty columns

df = pd.DataFrame(columns=['A','B','C','D','E','F','G'])

### by columns



In [None]:
df = pd.DataFrame({
    'name' : ['val1', 'val2', 'etc..'],
    'category' : ['val1', 'val2', 'etc..'],
})



### by rows (records)



In [None]:
df = pd.DataFrame.from_records([
    {'name': 'val1', 'category' : 'val1' },
    {'name': 'val1', 'category' : 'val1' },
    # ...
])

### From JSON


In [None]:
df = pd.read_json('file.json', orient='columns') # for rows, use 'records'

# to_json is a DataFrame method, not a top-level pandas function
df.to_json('out.json', orient='columns')

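A minimal sketch of the difference between the two `orient` values, using a small hypothetical frame and an in-memory buffer instead of a file on disk:

```python
import io
import json

import pandas as pd

# Hypothetical data, only for illustration
df_demo = pd.DataFrame({"name": ["val1", "val2"], "category": ["c1", "c2"]})

# 'columns' nests values under each column name, keyed by row label
as_columns = df_demo.to_json(orient="columns")
# 'records' produces one JSON object per row
as_records = df_demo.to_json(orient="records")

parsed_columns = json.loads(as_columns)
parsed_records = json.loads(as_records)

# Read the records-oriented JSON back into a dataframe
back = pd.read_json(io.StringIO(as_records), orient="records")
```

The `records` layout is usually the easier one to exchange with other tools, since each row is a standalone object.
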
### From CSV

In [None]:
df = pd.read_csv("/tmp/tmp07wuam09/data/cereal.csv")

## Add cell value

```
ddf.at['Row Index or Label', 'Col name'] = 10
```

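A small runnable sketch of the same idea, assuming a hypothetical frame with string row labels (`row_a`, `row_b`) and a single `score` column:

```python
import pandas as pd

# Hypothetical two-row frame with string row labels
ddf = pd.DataFrame({"score": [1, 2]}, index=["row_a", "row_b"])

# .at addresses exactly one cell by row label and column name
ddf.at["row_b", "score"] = 10

value = ddf.at["row_b", "score"]
```
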

## Inspect dataframe and Index

In [None]:
df.shape

(1000, 9)

In [None]:
df.ndim

2

In [None]:
df.dtypes

author_affiliations    object
id                     object
issue                  object
journal                object
pages                  object
title                  object
type                   object
volume                 object
year                    int64
dtype: object

In [None]:
# the 'describe' method returns basic statistics for all columns of a dataframe
df.describe(include='all')

### Working with the index

In [None]:
df.index

RangeIndex(start=0, stop=1000, step=1)

In [None]:
df = df.set_index('year')
df.loc[2003] # 'year' is an int64 column, so the label is the integer 2003, not the string '2003'

author_affiliations    [[{'first_name': 'Karl', 'last_name': 'Derouen...
id                                                        pub.1011026276
issue                                                                  4
journal                {'id': 'jour.1139246', 'title': 'Defence and P...
pages                                                            251-260
title                  The Role of the UN in International Crisis Ter...
type                                                             article
volume                                                                14
Name: 2003, dtype: object

In [None]:
df = df.reset_index() # return index to original integer-based state

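The round trip above can be sketched on a tiny hypothetical frame (the column values are made up). It also shows that label lookups must use the same type as the index values:

```python
import pandas as pd

# Hypothetical data, only for illustration
df_demo = pd.DataFrame({"year": [2003, 2006, 2015], "title": ["a", "b", "c"]})

by_year = df_demo.set_index("year")
row = by_year.loc[2003]  # a Series, because the label is unique

# A string label does not match an integer index
try:
    by_year.loc["2003"]
    string_lookup_ok = True
except KeyError:
    string_lookup_ok = False

# reset_index moves 'year' back to a regular column
restored = by_year.reset_index()
```
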
## Rows and columns

In [None]:
df.columns

Index(['year', 'author_affiliations', 'id', 'issue', 'journal', 'pages',
       'title', 'type', 'volume'],
      dtype='object')

In [None]:
df.iloc[2] # select by position

year                                                                2015
author_affiliations    [[{'first_name': 'TIM', 'last_name': 'OLIVER',...
id                                                        pub.1013516622
issue                                                                  1
journal                {'id': 'jour.1024766', 'title': 'International...
pages                                                              77-91
title                  To be or not to be in Europe: is that the ques...
type                                                             article
volume                                                                91
Name: 2, dtype: object

In [None]:
df.loc[2] # equivalent to iloc in this case // label based indexing

year                                                                2015
author_affiliations    [[{'first_name': 'TIM', 'last_name': 'OLIVER',...
id                                                        pub.1013516622
issue                                                                  1
journal                {'id': 'jour.1024766', 'title': 'International...
pages                                                              77-91
title                  To be or not to be in Europe: is that the ques...
type                                                             article
volume                                                                91
Name: 2, dtype: object

In [None]:
df.year.head() # grab a column by its name with dot notation

0    2003
1    2006
2    2015
3    2015
4    2015
Name: year, dtype: int64

In [None]:
type(df.year)

pandas.core.series.Series

In [None]:
df['year'].describe()

count    1000.000000
mean     2017.100000
std         0.954417
min      2003.000000
25%      2017.000000
50%      2017.000000
75%      2018.000000
max      2019.000000
Name: year, dtype: float64

In [None]:
df['year'].value_counts()

2017    479
2018    326
2016    154
2019     28
2015     11
2006      1
2003      1
Name: year, dtype: int64

### Select rows based on cell values and update

In [None]:
df.loc[df['journal'].isnull(), "journal"] = "unknown"

In [None]:
df[df['journal'] == "unknown"].head()

Unnamed: 0,year,author_affiliations,id,issue,journal,pages,title,type,volume
10,2015,"[[{'first_name': 'Julie', 'last_name': 'Smith'...",pub.1024252028,,unknown,370-396,Europe: The coalition's poisoned chalice,John,
43,2016,"[[{'first_name': 'Margaret', 'last_name': 'Sto...",pub.1037023067,,unknown,,A Radically Democratic Response to Global Gove...,John,
75,2016,"[[{'first_name': 'Amr', 'last_name': 'Magdy', ...",pub.1083917941,,unknown,7,GeoTrend: spatial trending queries on real-tim...,John,
85,2016,"[[{'first_name': 'Tim', 'last_name': 'Jackson'...",pub.1052654584,,unknown,,"Prosperity without Growth, 2nd",John,
86,2016,"[[{'first_name': 'Cas', 'last_name': 'Mudde', ...",pub.1052987179,,unknown,,The Populist Radical Right,John,


### Renaming columns
* from https://www.shanelynn.ie/using-pandas-dataframe-creating-editing-viewing-data-in-python/

In [None]:
# Rename columns using a dictionary to map values
# Rename the Area column to 'place_name'
data = data.rename(columns={"Area": "place_name"})

# Again, the inplace parameter will change the dataframe without assignment
data.rename(columns={"Area": "place_name"}, inplace=True)

# Rename multiple columns in one go with a larger dictionary
data.rename(
    columns={
        "Area": "place_name",
        "Y2001": "year_2001"
    },
    inplace=True
)

# Rename all columns using a function, e.g. convert all column names to lower case:
data.rename(columns=str.lower)

### Create a new DF by selecting columns

In [None]:
df2 = df[['id', 'issue', 'pages', 'title', 'type', 'volume', 'year']]

In [None]:
df2.head()

Unnamed: 0,id,issue,pages,title,type,volume,year
0,pub.1059983538,4.0,1593-1636,Measuring Economic Policy Uncertainty,article,131.0,2016
1,pub.1024262334,3.0,575-605,An Illustrated User Guide to the World Input–O...,article,23.0,2015
2,pub.1090420622,,,"Trump, Brexit, and the Rise of Populism: Econo...",preprint,,2016
3,pub.1038710751,9.0,1259-1277,"The Brexit vote: a divided nation, a divided c...",article,23.0,2016
4,pub.1019084958,3.0,323-332,"The 2016 Referendum, Brexit and the Left Behin...",article,87.0,2016


### Dropping columns

In [None]:
df.drop(['author_affiliations'], axis=1, inplace=True)

### Drop rows or columns with missing values
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

In [None]:
df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
                   "toy": [np.nan, 'Batmobile', 'Bullwhip'],
                   "born": [pd.NaT, pd.Timestamp("1940-04-25"),
                            pd.NaT]})

# Drop the rows where at least one element is missing.

df.dropna()
#      name        toy       born
# 1  Batman  Batmobile 1940-04-25

# Drop the columns where at least one element is missing.

df.dropna(axis='columns')
#        name
# 0    Alfred
# 1    Batman
# 2  Catwoman

#Drop the rows where all elements are missing.

df.dropna(how='all')
#        name        toy       born
# 0    Alfred        NaN        NaT
# 1    Batman  Batmobile 1940-04-25
# 2  Catwoman   Bullwhip        NaT

#Keep only the rows with at least 2 non-NA values.

df.dropna(thresh=2)
#        name        toy       born
# 1    Batman  Batmobile 1940-04-25
# 2  Catwoman   Bullwhip        NaT

#Define in which columns to look for missing values.

df.dropna(subset=['name', 'born'])
#        name        toy       born
# 1    Batman  Batmobile 1940-04-25

#Keep the DataFrame with valid entries in the same variable.

df.dropna(inplace=True)
# >>> df
#      name        toy       born
# 1  Batman  Batmobile 1940-04-25

### Drop empty values rows, after replacing empty strings

In [None]:
# assign the result back: an in-place replace on a selected column may not update df itself
df['FOR'] = df['FOR'].replace('', np.nan)

In [None]:
df.dropna(subset=['FOR'], inplace=False).head()

In [None]:
# replace with empty list (can't be done with 'replace')
for row in df.loc[df.ids.isnull(), 'ids'].index:
    df.at[row, 'ids'] = []

### Add new column to existing dataframe

In [None]:
# Use the original df1 indexes to create the series:

df1['e'] = pd.Series(np.random.randn(len(df1)), index=df1.index)

In [None]:
# Declare a list that is to be converted into a column 
address = ['Delhi', 'Bangalore', 'Chennai', 'Patna'] 
  
# Using 'Address' as the column name 
# and equating it to the list 
df['Address'] = address 

### Change column order

In [None]:
# You can get the current list of columns with:

cols = list(df.columns.values)

# The output will produce:
# ['0', '1', '2', '3', 'mean']

# Then select the columns in the order you want:

df = df[['mean', '0', '1', '2', '3']]

## Count Values

### Count unique values

In [None]:
# Count distinct values, use nunique:

df['hID'].nunique()

# Count only non-null values, use count:

df['hID'].count()

# Count total values including null values, use size attribute:

df['hID'].size


In [None]:
# this will show you the distinct elements and their number of occurrences.

df['race'].value_counts()

In [None]:
# only top ten
affiliations['aff_id'].value_counts()[:10]

### Count missing values
* https://chartio.com/resources/tutorials/how-to-check-if-any-value-is-nan-in-a-pandas-dataframe/
* https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html

In [None]:
df.isnull().sum()

author_affiliations      6
id                       0
issue                  162
journal                 56
pages                   38
title                    0
type                     0
volume                  77
year                     0
dtype: int64

In [None]:
# THIS CHANGES ALL VALUES IN ROW! SO NOT IDEAL
ddf[ddf['FOR'].isnull()] = "ciao"

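To change only the one column rather than the whole row, a boolean row mask can be combined with a column label in `.loc`. A minimal sketch on a hypothetical two-row frame (the replacement value `"unknown"` is arbitrary):

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second row has a missing FOR value
ddf = pd.DataFrame({
    "FOR": ["1402 Applied Economics", np.nan],
    "title": ["t1", "t2"],
})

# Row mask AND column label: only the FOR cell of the matching rows changes
ddf.loc[ddf["FOR"].isnull(), "FOR"] = "unknown"
```

The other columns of the matching rows are left untouched, unlike the mask-only assignment above.
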
### Get most frequent value (single)

In [None]:
dataframe['name'].value_counts().idxmax()

### Frequency count based on two columns (variables)

https://stackoverflow.com/questions/33271098/python-get-a-frequency-count-based-on-two-columns-variables-in-pandas-datafra

In [None]:
df.groupby(["Group", "Size"]).size()
# Out[11]:
# Group     Size
# Moderate  Medium    1
#           Small     1
# Short     Small     2
# Tall      Large     1
# dtype: int64

df.groupby(["Group", "Size"]).size().reset_index(name="Time")
# Out[12]:
#       Group    Size  Time
# 0  Moderate  Medium     1
# 1  Moderate   Small     1
# 2     Short   Small     2
# 3      Tall   Large     1

In [None]:
# You can also try pd.crosstab()

pd.crosstab(df.Group,df.Size)

# Size      Large  Medium  Small
# Group                         
# Moderate      0       1      1
# Short         0       0      2
# Tall          1       0      0

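For reference, a self-contained sketch that reproduces the counts shown above. The input rows are an assumption, reconstructed from those outputs:

```python
import pandas as pd

# Assumed input rows, chosen to match the outputs shown above
df_demo = pd.DataFrame({
    "Group": ["Short", "Short", "Moderate", "Moderate", "Tall"],
    "Size": ["Small", "Small", "Medium", "Small", "Large"],
})

# Long format: one row per (Group, Size) combination that occurs
counts = df_demo.groupby(["Group", "Size"]).size().reset_index(name="Time")

# Wide format: every Group x Size combination, with zeros for missing pairs
table = pd.crosstab(df_demo["Group"], df_demo["Size"])
```

`groupby(...).size()` only lists combinations that occur, while `crosstab` fills the absent ones with 0.
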
### Fill in empty values
* https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/

In [None]:
# fill everywhere, returns a frame

ddf.fillna(0) 

In [None]:
# this returns a series, not a frame!
ddf['FOR'].fillna("aaa") # NOTE doesn't save anything! 

0    [{'id': '3292', 'name': '1402 Applied Economic...
1        [{'id': '3313', 'name': '1403 Econometrics'}]
2                                                  aaa
3    [{'id': '3675', 'name': '2103 Historical Studi...
4           [{'id': '3448', 'name': '1608 Sociology'}]
5    [{'id': '3197', 'name': '1199 Other Medical an...
6                                                  aaa
7    [{'id': '3432', 'name': '1606 Political Scienc...
8    [{'id': '3432', 'name': '1606 Political Scienc...
9    [{'id': '3432', 'name': '1606 Political Scienc...
Name: FOR, dtype: object

## Transpose axis
* http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transpose.html

In [None]:
df3 = df.transpose()
df3.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,990,991,992,993,994,995,996,997,998,999
author_affiliations,"[[{'first_name': 'Scott R.', 'last_name': 'Bak...","[[{'first_name': 'Marcel P.', 'last_name': 'Ti...","[[{'first_name': 'Ronald', 'last_name': 'Ingle...","[[{'first_name': 'Sara B.', 'last_name': 'Hobo...","[[{'first_name': 'Matthew J.', 'last_name': 'G...",[[{'first_name': 'The Global Consortium for H5...,"[[{'first_name': 'Liesbet', 'last_name': 'Hoog...","[[{'first_name': 'Jonathan', 'last_name': 'Pol...","[[{'first_name': 'Harold D.', 'last_name': 'Cl...","[[{'first_name': 'D.J.', 'last_name': 'Flynn',...",...,"[[{'first_name': 'Yannis', 'last_name': 'Stavr...","[[{'first_name': 'Vegard', 'last_name': 'Jarne...","[[{'first_name': 'Anastasia', 'last_name': 'Go...","[[{'first_name': 'Eva', 'last_name': 'Thomann'...","[[{'first_name': 'Terrie', 'last_name': 'Epste...","[[{'first_name': 'Kristoffer', 'last_name': 'H...","[[{'first_name': 'Charlotte', 'last_name': 'Ga...","[[{'first_name': 'Nikolas', 'last_name': 'Rose...","[[{'first_name': 'Sophie', 'last_name': 'Meuni...",
id,pub.1059983538,pub.1024262334,pub.1090420622,pub.1038710751,pub.1019084958,pub.1062667950,pub.1085208832,pub.1053864137,pub.1096906068,pub.1074211439,...,pub.1101047240,pub.1100074004,pub.1112309901,pub.1115075916,pub.1091824261,pub.1109782631,pub.1083872789,pub.1092641433,pub.1093091454,pub.1084761130
issue,4,3,,9,3,6309,1,1,,S1,...,1,1,,1,,4,,3-4,7,
journal,"{'id': 'jour.1123532', 'title': 'The Quarterly...","{'id': 'jour.1050256', 'title': 'Review of Int...","{'id': 'jour.1276748', 'title': 'SSRN Electron...","{'id': 'jour.1052590', 'title': 'Journal of Eu...","{'id': 'jour.1026839', 'title': 'The Political...","{'id': 'jour.1346339', 'title': 'Science'}","{'id': 'jour.1052590', 'title': 'Journal of Eu...","{'id': 'jour.1147745', 'title': 'Research & Po...",,"{'id': 'jour.1129081', 'title': 'Political Psy...",...,"{'id': 'jour.1006192', 'title': 'American Beha...","{'id': 'jour.1017488', 'title': 'British Journ...","{'id': 'jour.1027568', 'title': 'Journal of Et...","{'id': 'jour.1153608', 'title': 'European Poli...",,"{'id': 'jour.1149390', 'title': 'Media and Com...",,"{'id': 'jour.1028218', 'title': 'Economy and S...","{'id': 'jour.1138887', 'title': 'Journal of Eu...",
pages,1593-1636,575-605,,1259-1277,323-332,213-217,1-27,205316801668691,,127-150,...,43-58,166-189,1-19,37-57,,49-57,49-72,1-21,891-907,


## Sorting values
* https://thispointer.com/pandas-sort-rows-or-columns-in-dataframe-based-on-values-using-dataframe-sort_values/

In [None]:
df.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
# Arguments:
#
# by : a string or list of strings, either column names or index labels, on which sorting is based.
# axis : if 0 (the default), the names in `by` are treated as column names;
#        if 1, they are treated as row index labels.
# ascending : if True sort in ascending order, otherwise descending. Default is True.
# inplace : if True, perform the operation in place on the DataFrame.
# na_position : position of NaNs after sorting: 'first' puts NaNs at the beginning, 'last' puts them at the end.
#               Default is 'last'.

In [None]:
df.sort_values(by=["year"], inplace=True) # sort in place; rows keep their original index labels
df.reset_index(drop=True) # NOTE: returns a new frame; assign the result (df = ...) to actually renumber the index
df.head()

Unnamed: 0,author_affiliations,id,issue,journal,pages,title,type,volume,year
281,"[[{'first_name': 'Karl', 'last_name': 'Derouen...",pub.1011026276,4.0,"{'id': 'jour.1139246', 'title': 'Defence and P...",251-260,The Role of the UN in International Crisis Ter...,article,14.0,2003
647,"[[{'first_name': 'Meirav', 'last_name': 'Misha...",pub.1053839515,5.0,"{'id': 'jour.1024683', 'title': 'Journal of Pe...",583-600,"Ethnic Diversity, Issues, and International Cr...",article,43.0,2006
351,"[[{'first_name': 'Oliver', 'last_name': 'Daddo...",pub.1033135439,1.0,"{'id': 'jour.1027780', 'title': 'JCMS Journal ...",71-88,Interpreting the Outsider Tradition in British...,article,53.0,2015
114,"[[{'first_name': 'Nicholas', 'last_name': 'Sta...",pub.1011712909,3.0,"{'id': 'jour.1138537', 'title': 'International...",311-323,Have we reached a tipping point? The mainstrea...,article,36.0,2015
809,"[[{'first_name': 'Julie', 'last_name': 'Smith'...",pub.1024252028,,,370-396,Europe: The coalition's poisoned chalice,chapter,,2015


## Apply

In [None]:
# add new column 
df['M1_list'] = df['M1'].apply(lambda x: x.split(","))

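Note that `x.split(",")` raises an error when a cell is missing, since `None` and `NaN` have no `split` method. A hedged sketch that guards against that, on a hypothetical `M1` column:

```python
import pandas as pd

# Hypothetical column of comma-separated strings, with one missing value
df_demo = pd.DataFrame({"M1": ["a,b", "c", None]})

# Split strings into lists; fall back to an empty list for non-string cells
df_demo["M1_list"] = df_demo["M1"].apply(
    lambda x: x.split(",") if isinstance(x, str) else []
)
```
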
## Groupby

### Simple Groupby

In [None]:
df2 = df.groupby('year', as_index=False)
df2.groups.keys()

dict_keys([2003, 2006, 2015, 2016, 2017, 2018, 2019])

In [None]:
group2003 = df2.get_group(2003)
group2003.head()

Unnamed: 0,year,author_affiliations,id,issue,journal,pages,title,type,volume
0,2003,"[[{'first_name': 'Karl', 'last_name': 'Derouen...",pub.1011026276,4,"{'id': 'jour.1139246', 'title': 'Defence and P...",251-260,The Role of the UN in International Crisis Ter...,article,14


In [None]:
# The groupby output will have an index or multi-index on rows corresponding to your chosen grouping variables. 
# To avoid setting this index, pass “as_index=False” to the groupby operation.
df3 = df.groupby('year', as_index=False)
df3.count()

Unnamed: 0,year,author_affiliations,id,issue,journal,pages,title,type,volume
0,2003,1,1,1,1,1,1,1,1
1,2006,1,1,1,1,1,1,1,1
2,2015,11,11,10,11,11,11,11,10
3,2016,154,154,129,154,143,154,154,139
4,2017,475,479,385,479,462,479,479,448
5,2018,324,326,290,326,318,326,326,301
6,2019,28,28,22,28,26,28,28,23


### Counting results

In [None]:
# https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/
# Produces Pandas Series
df3['id'].count()
# Produces Pandas DataFrame
df3[['id']].count()

Unnamed: 0,year,id
0,2003,1
1,2006,1
2,2015,11
3,2016,154
4,2017,479
5,2018,326
6,2019,28


One liner..

In [None]:
df.groupby('year', as_index=False)['id'].count()

Unnamed: 0,year,id
0,2003,1
1,2006,1
2,2015,11
3,2016,154
4,2017,479
5,2018,326
6,2019,28


### Groupby and sum only one column
* https://stackoverflow.com/questions/38985053/pandas-groupby-and-sum-only-one-column

In [None]:
df_by_concept = df.groupby('concept', as_index=False)['score'].sum()

### Using groupby to filter items that occur more than once
* https://stackoverflow.com/questions/32918506/pandas-how-to-filter-for-items-that-occur-more-than-once-in-a-dataframe

In [None]:
df_top_journals = df.groupby('journal.title').filter(lambda x: len(x) > 3)

### Add a column that counts a variable in groupby
* https://stackoverflow.com/questions/29791785/python-pandas-add-a-column-to-my-dataframe-that-counts-a-variable

In [None]:
df['count'] = df.groupby('group')['group'].transform('count')

# Out[223]:
#     org  group  count
# 0  org1      1      2
# 1  org2      1      2
# 2  org3      2      1
# 3  org4      3      3
# 4  org5      3      3
# 5  org6      3      3

In [None]:
#
# add new column by counting unique instances in another column than the grouping one
#
gridaffiliations["tot_pubs"] = gridaffiliations.groupby(['aff_id'])['pub_id'].transform('nunique')

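Both `transform` recipes can be tried on small made-up frames. The first one is an assumption built to reproduce the `org`/`group` output above, the second uses hypothetical `aff_id`/`pub_id` values:

```python
import pandas as pd

# Assumed input matching the org/group output shown above
df_demo = pd.DataFrame({
    "org": ["org1", "org2", "org3", "org4", "org5", "org6"],
    "group": [1, 1, 2, 3, 3, 3],
})
# transform returns one value per original row, so it aligns with the frame
df_demo["count"] = df_demo.groupby("group")["group"].transform("count")

# Hypothetical affiliation rows: g1 appears on two distinct publications
aff = pd.DataFrame({
    "aff_id": ["g1", "g1", "g1", "g2"],
    "pub_id": ["p1", "p1", "p2", "p3"],
})
aff["tot_pubs"] = aff.groupby("aff_id")["pub_id"].transform("nunique")
```

Unlike an aggregation, `transform` keeps the original number of rows, which is why the result can be assigned directly as a new column.
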
### Group by two variables


In [None]:
gridaffiliations.groupby(['aff_id', 'pub_id']).count()

## Iterate through dataframe
* https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas

In [None]:
for index, row in df.iterrows():
    print(row['c1'], row['c2'])

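As an alternative sketch, `itertuples` yields lightweight named tuples instead of a Series per row, and is generally faster than `iterrows`. The column names `c1` and `c2` follow the example above and the data is hypothetical:

```python
import pandas as pd

# Hypothetical data, only for illustration
df_demo = pd.DataFrame({"c1": [1, 2], "c2": ["x", "y"]})

pairs = []
# index=False leaves the row label out of each tuple
for row in df_demo.itertuples(index=False):
    pairs.append((row.c1, row.c2))
```
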
## Query dataframe

In [None]:
df.query("type=='article'").head()

Unnamed: 0,author_affiliations,id,issue,journal,pages,title,type,volume,year
0,"[[{'first_name': 'Scott R.', 'last_name': 'Bak...",pub.1059983538,4,"{'id': 'jour.1123532', 'title': 'The Quarterly...",1593-1636,Measuring Economic Policy Uncertainty,article,131,2016
1,"[[{'first_name': 'Marcel P.', 'last_name': 'Ti...",pub.1024262334,3,"{'id': 'jour.1050256', 'title': 'Review of Int...",575-605,An Illustrated User Guide to the World Input–O...,article,23,2015
3,"[[{'first_name': 'Sara B.', 'last_name': 'Hobo...",pub.1038710751,9,"{'id': 'jour.1052590', 'title': 'Journal of Eu...",1259-1277,"The Brexit vote: a divided nation, a divided c...",article,23,2016
4,"[[{'first_name': 'Matthew J.', 'last_name': 'G...",pub.1019084958,3,"{'id': 'jour.1026839', 'title': 'The Political...",323-332,"The 2016 Referendum, Brexit and the Left Behin...",article,87,2016
5,[[{'first_name': 'The Global Consortium for H5...,pub.1062667950,6309,"{'id': 'jour.1346339', 'title': 'Science'}",213-217,Role for migratory wild birds in the global sp...,article,354,2016


In [None]:
grid_ids = df.query("country_name=='Netherlands'")[:30]['id']

### Select based on list 

In [None]:
df[df['A'].isin([3, 6])]

In [None]:
# negative version
df[~df['A'].isin([3, 6])]

In [None]:
df.query(' column_a == ["val1", "val2", ...]', inplace=True)

In [None]:
my_symbol = 'BUD US'
df.query("Symbol == @my_symbol") # '@' references a Python variable inside the query string

In [None]:
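# select rows whose cell contains the value as a substring (works when the cell is a string,
# e.g. a stringified list of grants); na=False treats missing cells as non-matching
df[df['supporting_grant_ids'].str.contains('grant.2347731', na=False)]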

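When the cells hold real Python lists rather than strings, `.str.contains` does not test list membership. One possible approach, sketched here on hypothetical data, is an explicit membership test with `apply`:

```python
import pandas as pd

# Hypothetical publications, each with a list of grant ids (or nothing)
df_demo = pd.DataFrame({
    "id": ["pub.1", "pub.2", "pub.3"],
    "supporting_grant_ids": [["grant.2347731", "grant.1"], None, ["grant.2"]],
})

# True only where the cell is a list that contains the wanted grant id
mask = df_demo["supporting_grant_ids"].apply(
    lambda ids: isinstance(ids, list) and "grant.2347731" in ids
)
matches = df_demo[mask]
```
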
## Merge and aggregate dataframes
* https://www.dataquest.io/blog/pandas-concatenation-tutorial/

df2 = pd.merge(dfy, dfyears_nl, how='outer')

In [None]:
# concatenating dataframes simply adds new rows at the bottom
# (DataFrame.append was removed in pandas 2.0; use pd.concat instead)
res = pd.concat([df1, df2, df3])
# then usually sort and reset index for visualizations etc...
res.rename(columns={'id':'years'}, inplace=True)
res.sort_values(by="years", inplace=True)
res = res.reset_index(drop=True)


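A self-contained sketch of both operations, using made-up yearly counts. All frames and column names here are hypothetical:

```python
import pandas as pd

# --- Concatenation: stack rows from several frames ---
df1 = pd.DataFrame({"id": [2016, 2015], "count": [3, 1]})
df2 = pd.DataFrame({"id": [2018], "count": [7]})
df3 = pd.DataFrame({"id": [2017], "count": [5]})

res = pd.concat([df1, df2, df3])
res = (
    res.rename(columns={"id": "years"})
       .sort_values(by="years")
       .reset_index(drop=True)  # renumber rows 0..n-1 after sorting
)

# --- Outer merge: join on the shared column, keeping keys from both sides ---
left = pd.DataFrame({"year": [2016, 2017], "pubs_all": [10, 20]})
right = pd.DataFrame({"year": [2017, 2018], "pubs_nl": [2, 3]})
merged = pd.merge(left, right, how="outer")
```

With `how="outer"`, years present in only one frame are kept and the missing side is filled with `NaN`.
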
## Melt a dataframe

In [None]:
formatted_df = pd.melt(df,
                       ["religion"],  # the columns to keep as is
                       var_name="income",  # the column grouping all melted columns 
                       value_name="freq")  # the column counting the objects melted
formatted_df = formatted_df.sort_values(by=["religion"])
formatted_df.head(10)

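A minimal sketch of the same call on a tiny wide table. The religion and income values are invented for illustration and are not the original dataset:

```python
import pandas as pd

# Hypothetical wide table: one column per income bracket
wide = pd.DataFrame({
    "religion": ["Agnostic", "Atheist"],
    "<$10k": [27, 12],
    "$10-20k": [34, 27],
})

# Each (religion, income bracket) pair becomes its own row
tidy = pd.melt(wide, id_vars=["religion"], var_name="income", value_name="freq")
tidy = tidy.sort_values(by=["religion"]).reset_index(drop=True)
```

The result follows the tidy data rules from the beginning of this notebook: one variable per column and one observation per row.
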
## Using json_normalize

* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.json.json_normalize.html
* https://www.kaggle.com/jboysen/quick-tutorial-flatten-nested-json-in-pandas

```
from pandas import json_normalize  # flattens nested JSON into a DataFrame (pandas >= 1.0)

>>> data = [{'state': 'Florida',
...          'shortname': 'FL',
...          'info': {
...               'governor': 'Rick Scott'
...          },
...          'counties': [{'name': 'Dade', 'population': 12345},
...                      {'name': 'Broward', 'population': 40000},
...                      {'name': 'Palm Beach', 'population': 60000}]},
...         {'state': 'Ohio',
...          'shortname': 'OH',
...          'info': {
...               'governor': 'John Kasich'
...          },
...          'counties': [{'name': 'Summit', 'population': 1234},
...                       {'name': 'Cuyahoga', 'population': 1337}]}]
>>> result = json_normalize(data, 'counties', ['state', 'shortname',
...                                           ['info', 'governor']])
>>> result
         name  population info.governor    state shortname
0        Dade       12345    Rick Scott  Florida        FL
1     Broward       40000    Rick Scott  Florida        FL
2  Palm Beach       60000    Rick Scott  Florida        FL
3      Summit        1234   John Kasich     Ohio        OH
4    Cuyahoga        1337   John Kasich     Ohio        OH
```

In [None]:
dfjournals = json_normalize(data.publications)
dfjournals.reset_index()
dfjournals.head()

Unnamed: 0,author_affiliations,id,issue,journal.id,journal.title,pages,title,type,volume,year
0,"[[{'first_name': 'Scott R.', 'last_name': 'Bak...",pub.1059983538,4.0,jour.1123532,The Quarterly Journal of Economics,1593-1636,Measuring Economic Policy Uncertainty,article,131.0,2016
1,"[[{'first_name': 'Marcel P.', 'last_name': 'Ti...",pub.1024262334,3.0,jour.1050256,Review of International Economics,575-605,An Illustrated User Guide to the World Input–O...,article,23.0,2015
2,"[[{'first_name': 'Ronald', 'last_name': 'Ingle...",pub.1090420622,,jour.1276748,SSRN Electronic Journal,,"Trump, Brexit, and the Rise of Populism: Econo...",preprint,,2016
3,"[[{'first_name': 'Sara B.', 'last_name': 'Hobo...",pub.1038710751,9.0,jour.1052590,Journal of European Public Policy,1259-1277,"The Brexit vote: a divided nation, a divided c...",article,23.0,2016
4,"[[{'first_name': 'Matthew J.', 'last_name': 'G...",pub.1019084958,3.0,jour.1026839,The Political Quarterly,323-332,"The 2016 Referendum, Brexit and the Left Behin...",article,87.0,2016


In [None]:
dfjournals['journal.title'].value_counts().head()

Journal of European Public Policy                              30
The British Journal of Politics and International Relations    17
Journal of Ethnic and Migration Studies                        16
Journal of Democracy                                           13
The Political Quarterly                                        12
Name: journal.title, dtype: int64

### Json_normalize on nested objects with missing elements

In [None]:
data = dslquery("""search publications for "brexit" return publications[doi+title+FOR+times_cited] sort by times_cited limit 1000""")

NOTE: the following doesn't work because the nested FOR structure is sometimes missing!

```
json_normalize(data.publications[:10], record_path=['FOR'])

==> 'FOR' key Error! 
```

In essence, the problem is that some of the records above lack the 'FOR' key: the JSON is not regularly structured.

In [None]:
for x in data.publications[:10]:
    if 'FOR' not in x:
        x['FOR'] = []
json_normalize(data.publications[:10], record_path=['FOR'], meta=['doi', 'title']).head()

Unnamed: 0,id,name,doi,title
0,3292,1402 Applied Economics,10.1093/qje/qjw024,Measuring Economic Policy Uncertainty
1,3313,1403 Econometrics,10.1093/qje/qjw024,Measuring Economic Policy Uncertainty
2,3286,1401 Economic Theory,10.1093/qje/qjw024,Measuring Economic Policy Uncertainty
3,3313,1403 Econometrics,10.1111/roie.12178,An Illustrated User Guide to the World Input–O...
4,3675,2103 Historical Studies,10.1080/13501763.2016.1225785,"The Brexit vote: a divided nation, a divided c..."

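The same fix can be tried without an API call. This is a sketch on two hand-written records (the DOIs and titles are hypothetical), assuming pandas >= 1.0 for `pd.json_normalize`:

```python
import pandas as pd

# Hypothetical records: the second one has no 'FOR' key at all
publications = [
    {
        "doi": "10.1/a",
        "title": "Paper A",
        "FOR": [
            {"id": "3292", "name": "1402 Applied Economics"},
            {"id": "3313", "name": "1403 Econometrics"},
        ],
    },
    {"doi": "10.1/b", "title": "Paper B"},
]

# Give every record a FOR list, empty if needed, so record_path always resolves
for pub in publications:
    pub.setdefault("FOR", [])

flat = pd.json_normalize(publications, record_path=["FOR"], meta=["doi", "title"])

# Optionally strip the 4-digit code prefix from the category names
flat["name_clean"] = flat["name"].str[5:]
```

Records with an empty list simply contribute no rows, so "Paper B" does not appear in `flat`.
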

In [None]:
# another more sophisticated approach
data = dslquery("""search publications for "brexit" return publications[doi+title+FOR+times_cited] sort by times_cited limit 1000""")
# ensure that all pubs have a valid (empty, even) FOR value 
for x in data.publications:
    if 'FOR' not in x:
        x['FOR'] = []  # an empty list (not a string), so that record_path can iterate over it
    else:
        x['FOR'] = [{'name' : f['name'][5:]} for f in x['FOR']] # also remove the digit prefix to improve legibility
# then
json_normalize(data.publications, record_path=['FOR'], meta=["doi", "title"], errors='ignore', record_prefix='for_').head()

Unnamed: 0,for_name,doi,title
0,Applied Economics,10.1093/qje/qjw024,Measuring Economic Policy Uncertainty
1,Econometrics,10.1093/qje/qjw024,Measuring Economic Policy Uncertainty
2,Economic Theory,10.1093/qje/qjw024,Measuring Economic Policy Uncertainty
3,Econometrics,10.1111/roie.12178,An Illustrated User Guide to the World Input–O...
4,Historical Studies,10.1080/13501763.2016.1225785,"The Brexit vote: a divided nation, a divided c..."


### Normalizing nested JSON objects

Normalizing author affiliations is more difficult because they are **nested two levels deep**, so the unpacking doesn't work out of the box...

In [None]:
data = dslquery("""search publications for "brexit" return publications[basics] sort by times_cited limit 1000""")

In [None]:
data.publications[0]['author_affiliations'][0]

[{'first_name': 'Scott R.',
  'last_name': 'Baker',
  'orcid': '',
  'current_organization_id': '',
  'researcher_id': '',
  'affiliations': []},
 {'first_name': 'Nicholas',
  'last_name': 'Bloom',
  'orcid': '',
  'current_organization_id': '',
  'researcher_id': '',
  'affiliations': []},
 {'first_name': 'Steven J.',
  'last_name': 'Davis',
  'orcid': '',
  'current_organization_id': '',
  'researcher_id': '',
  'affiliations': []}]

TIP: we can just simplify the input JSON by removing the empty level in the nesting structure... 

In [None]:
for x in data.publications:
    if 'author_affiliations' in x:
        x['author_affiliations'] = x['author_affiliations'][0]
    else:
        x['author_affiliations'] = []

In [None]:
df_aff1 = json_normalize(data.publications[:10], record_path=['author_affiliations'], meta=["id", "doi"], errors='ignore')
df_aff1.head()

Unnamed: 0,affiliations,current_organization_id,first_name,last_name,orcid,researcher_id,id,doi
0,[],,Scott R.,Baker,,,pub.1059983538,
1,[],,Nicholas,Bloom,,,pub.1059983538,
2,[],,Steven J.,Davis,,,pub.1059983538,
3,"[{'id': 'grid.4830.f', 'name': 'University of ...",grid.4830.f,Marcel P.,Timmer,,ur.0637613047.37,pub.1024262334,
4,"[{'id': 'grid.4830.f', 'name': 'University of ...",grid.4830.f,Erik,Dietzenbacher,,ur.010050177407.14,pub.1024262334,


This allows us to have stats on the number of authors per publication, how many have ORCIDs, and their current organizations.

### Recursive applications of json_normalize

We could simply do another round of json_normalize!

In [None]:
type(json.loads(df_aff1.to_json(orient='records')))

list

In [None]:
json_normalize(json.loads(df_aff1.to_json(orient='records')), record_path=['affiliations'], 
               meta=['id', 'researcher_id', 'first_name', 'last_name'], record_prefix='aff_').head()
        

Unnamed: 0,aff_city,aff_city_id,aff_country,aff_country_code,aff_id,aff_name,aff_state,aff_state_code,id,researcher_id,first_name,last_name
0,Groningen,2755251.0,Netherlands,NL,grid.4830.f,University of Groningen,,,pub.1024262334,ur.0637613047.37,Marcel P.,Timmer
1,Groningen,2755251.0,Netherlands,NL,grid.4830.f,University of Groningen,,,pub.1024262334,ur.010050177407.14,Erik,Dietzenbacher
2,Groningen,2755251.0,Netherlands,NL,grid.4830.f,University of Groningen,,,pub.1024262334,ur.014505206773.12,Bart,Los
3,Vienna,2761369.0,Austria,AT,grid.426374.0,Vienna Institute for International Economic St...,,,pub.1024262334,ur.013540400553.29,Robert,Stehrer
4,Groningen,2755251.0,Netherlands,NL,grid.4830.f,University of Groningen,,,pub.1024262334,ur.07400737673.66,Gaaitzen J.,Vries

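The whole two-step flattening can be sketched end to end on hand-written records. The names and ids below are hypothetical, and `to_dict(orient='records')` is used here as a possible alternative to the `to_json`/`json.loads` round trip above:

```python
import pandas as pd

# Hypothetical records mimicking the nested author_affiliations structure
publications = [
    {
        "id": "pub.1",
        "author_affiliations": [[
            {"first_name": "Ada", "last_name": "A",
             "affiliations": [{"id": "grid.1", "name": "Univ One"}]},
            {"first_name": "Bob", "last_name": "B", "affiliations": []},
        ]],
    },
    {"id": "pub.2"},  # no author_affiliations key at all
]

# Step 0: remove the extra nesting level, defaulting to an empty list
for pub in publications:
    nested = pub.get("author_affiliations") or [[]]
    pub["author_affiliations"] = nested[0]

# Step 1: one row per author
authors = pd.json_normalize(
    publications, record_path=["author_affiliations"], meta=["id"]
)

# Step 2: one row per (author, affiliation); the prefix avoids an 'id' name clash
affiliations = pd.json_normalize(
    authors.to_dict(orient="records"),
    record_path=["affiliations"],
    meta=["id", "first_name", "last_name"],
    record_prefix="aff_",
)
```

Authors without affiliations drop out in step 2, so counts per author should be taken from `authors` rather than `affiliations`.
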