# JSON examples and exercise
****
+ get familiar with packages for dealing with JSON
+ study examples with JSON strings and files 
+ work on exercise to be completed and submitted 
****
+ reference: http://pandas.pydata.org/pandas-docs/stable/io.html#io-json-reader
+ data source: http://jsonstudio.com/resources/
****

In [1]:
import pandas as pd

## imports for Python, Pandas

In [2]:
import json
from pandas import json_normalize  # pandas >= 1.0; older releases: from pandas.io.json import json_normalize

## JSON example, with string

+ demonstrates creation of normalized dataframes (tables) from nested json string
+ source: http://pandas.pydata.org/pandas-docs/stable/io.html#normalization

In [3]:
# define the nested JSON-like data (a Python list of dicts)
data = [{'state': 'Florida', 
         'shortname': 'FL',
         'info': {'governor': 'Rick Scott'},
         'counties': [{'name': 'Dade', 'population': 12345},
                      {'name': 'Broward', 'population': 40000},
                      {'name': 'Palm Beach', 'population': 60000}]},
        {'state': 'Ohio',
         'shortname': 'OH',
         'info': {'governor': 'John Kasich'},
         'counties': [{'name': 'Summit', 'population': 1234},
                      {'name': 'Cuyahoga', 'population': 1337}]}]

In [4]:
# use normalization to create tables from nested element
json_normalize(data, 'counties')

Unnamed: 0,name,population
0,Dade,12345
1,Broward,40000
2,Palm Beach,60000
3,Summit,1234
4,Cuyahoga,1337


In [5]:
# further populate tables created from nested element
json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']])

Unnamed: 0,name,population,state,shortname,info.governor
0,Dade,12345,Florida,FL,Rick Scott
1,Broward,40000,Florida,FL,Rick Scott
2,Palm Beach,60000,Florida,FL,Rick Scott
3,Summit,1234,Ohio,OH,John Kasich
4,Cuyahoga,1337,Ohio,OH,John Kasich
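
To make the roles of the second and third arguments concrete, here is a rough pure-Python sketch of what the call above produces: one output row per entry in 'counties', with the listed parent fields copied onto each row. The helper name `flatten_counties` is made up for illustration and is not part of pandas, and `data` is a shortened copy of the list from In [3].

```python
# Hypothetical helper, for illustration only (not a pandas function).
def flatten_counties(records):
    rows = []
    for state in records:
        for county in state['counties']:
            row = dict(county)  # copy so the input is left untouched
            row['state'] = state['state']
            row['shortname'] = state['shortname']
            # nested path ['info', 'governor'] becomes the dotted column name
            row['info.governor'] = state['info']['governor']
            rows.append(row)
    return rows

# Shortened copy of the example data defined earlier.
data = [{'state': 'Florida', 'shortname': 'FL',
         'info': {'governor': 'Rick Scott'},
         'counties': [{'name': 'Dade', 'population': 12345},
                      {'name': 'Broward', 'population': 40000}]},
        {'state': 'Ohio', 'shortname': 'OH',
         'info': {'governor': 'John Kasich'},
         'counties': [{'name': 'Summit', 'population': 1234}]}]

rows = flatten_counties(data)
for row in rows:
    print(row)
```

Each printed dict corresponds to one row of the table, with the keys as column names.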


****
## JSON example, with file

+ demonstrates reading in a json file as a string and as a table
+ uses small sample file containing data about projects funded by the World Bank 
+ data source: http://jsonstudio.com/resources/

In [6]:
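# load json file into Python objects (json.load parses the file; it does not return a string)
with open('data/world_bank_projects_less.json') as f:
    sample_json = json.load(f)
sample_json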

FileNotFoundError: [Errno 2] No such file or directory: 'data/world_bank_projects_less.json'
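
The error above only means that the sample file was not found at that relative path. As a self-contained sketch of what `json.load` returns, the snippet below writes two made-up, heavily trimmed records to a temporary file and reads them back. The file name `sample.json` and the records are assumptions for illustration, not the real World Bank file.

```python
import json
import os
import tempfile

# Made-up stand-in records with only two fields each.
records = [{'countryshortname': 'Ethiopia', 'totalamt': 130000000},
           {'countryshortname': 'Tunisia', 'totalamt': 0}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.json')
    with open(path, 'w') as f:
        json.dump(records, f)
    # The with-block closes the file as soon as parsing is done.
    with open(path) as f:
        loaded = json.load(f)

print(type(loaded).__name__)          # list
print(loaded[0]['countryshortname'])  # Ethiopia

# json.dumps is what turns the parsed objects back into a JSON string.
as_text = json.dumps(loaded)
print(type(as_text).__name__)         # str
```

The parsed result is an ordinary list of dicts, so it can be indexed and looped over like any other Python data.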

In [10]:
# load as Pandas dataframe
sample_json_df = pd.read_json('data/world_bank_projects_less.json')
sample_json_df

Unnamed: 0,_id,approvalfy,board_approval_month,boardapprovaldate,borrower,closingdate,country_namecode,countrycode,countryname,countryshortname,...,sectorcode,source,status,supplementprojectflg,theme1,theme_namecode,themecode,totalamt,totalcommamt,url
0,{u'$oid': u'52b213b38594d8a2be17c780'},1999,November,2013-11-12T00:00:00Z,FEDERAL DEMOCRATIC REPUBLIC OF ETHIOPIA,2018-07-07T00:00:00Z,Federal Democratic Republic of Ethiopia!$!ET,ET,Federal Democratic Republic of Ethiopia,Ethiopia,...,"ET,BS,ES,EP",IBRD,Active,N,"{u'Percent': 100, u'Name': u'Education for all'}","[{u'code': u'65', u'name': u'Education for all'}]",65,130000000,130000000,http://www.worldbank.org/projects/P129828/ethi...
1,{u'$oid': u'52b213b38594d8a2be17c781'},2015,November,2013-11-04T00:00:00Z,GOVERNMENT OF TUNISIA,,Republic of Tunisia!$!TN,TN,Republic of Tunisia,Tunisia,...,"BZ,BS",IBRD,Active,N,"{u'Percent': 30, u'Name': u'Other economic man...","[{u'code': u'24', u'name': u'Other economic ma...",5424,0,4700000,http://www.worldbank.org/projects/P144674?lang=en


****
## JSON exercise

Using data in file 'data/world_bank_projects.json' and the techniques demonstrated above,
1. Find the 10 countries with most projects
2. Find the top 10 major project themes (using column 'mjtheme_namecode')
3. In 2. above you will notice that some entries have only the code and the name is missing. Create a dataframe with the missing names filled in.

In [2]:
import pandas as pd
import json

Solution to problem 1: I selected the 'countryshortname' column and ran the pandas method value_counts on it, which returns the number of projects per country sorted from highest to lowest. I then kept only the first 10 entries and printed the result.

In [26]:
json_df = pd.read_json('data/world_bank_projects.json')
most_projects = json_df['countryshortname'].value_counts().head(10)
print(most_projects)

China                 19
Indonesia             19
Vietnam               17
India                 16
Yemen, Republic of    13
Morocco               12
Bangladesh            12
Nepal                 12
Africa                11
Mozambique            11
Name: countryshortname, dtype: int64


Solution to problem 2: This was a little more complicated because I had to pull the string I needed out of the list of dictionaries stored in each cell of the dataframe column 'mjtheme_namecode'. I did this by looping over the series of name codes, pulling out the name from the first entry of each list and appending it to a list. Then I initialized a Counter object to find the most common occurrences in my list and printed the results. Note that each[0] reads only the first theme listed for each project.

In [42]:
from collections import Counter
json_df = pd.read_json('data/world_bank_projects.json')
project_themes = json_df['mjtheme_namecode']
# each cell is a list of {'code': ..., 'name': ...} dicts; take the first entry's name
project_code_names = []
for each in project_themes:
    project_code_names.append(each[0]['name'])
# tally the names and show the 10 most frequent
top_ten = Counter(project_code_names)
print(top_ten.most_common(10))

[(u'Environment and natural resources management', 85), (u'Human development', 72), (u'Public sector governance', 64), (u'Social protection and risk management', 57), (u'Rural development', 56), (u'Financial and private sector development', 53), (u'Social dev/gender/inclusion', 43), (u'Trade and integration', 25), (u'Urban development', 23), (u'Economic management', 11)]
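
Each cell of 'mjtheme_namecode' holds a list that may contain more than one theme, and the loop above reads only the first. Below is a minimal sketch of counting every entry instead, tallying by code so that entries with a blank name are still attributed to the right theme. The three rows are toy data in the same shape as the column, not values from the file; with the real dataframe, `json_df['mjtheme_namecode']` would take the place of `rows`.

```python
from collections import Counter

# Toy rows shaped like the 'mjtheme_namecode' column (illustrative only).
rows = [
    [{'code': '8', 'name': 'Human development'},
     {'code': '11', 'name': ''}],
    [{'code': '1', 'name': 'Economic management'},
     {'code': '6', 'name': 'Social protection and risk management'}],
    [{'code': '8', 'name': ''},
     {'code': '11', 'name': 'Environment and natural resources management'}],
]

# Flatten the list-of-lists and count every theme entry by its code.
code_counts = Counter(entry['code'] for themes in rows for entry in themes)
print(code_counts.most_common(2))
```

Counting by code and translating codes to names afterwards keeps blank names from forming their own category.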


Solution to problem 3: I loaded the data into a dataframe and pulled out the series of theme codes and names I needed. Then I looped over the series, storing each theme code and its corresponding name in a dictionary. Because the first entry seen for code 4 near the beginning of the pandas series had a missing name, the dictionary recorded an empty string for that code. I corrected this by accessing that key in the dictionary and setting its value by hand. Finally I looped over the series once more, identified each row whose first entry had an empty name, and replaced it with the corresponding name from the dictionary. I then replaced the column in the original dataframe with my updated series and printed the result.
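
The same idea can be sketched on toy data in a way that avoids patching a code by hand: record a name in the lookup only when it is non-empty, then fill every blank entry in each list rather than just the first. The rows and the helper name `fill_missing_names` are illustrative assumptions, not part of the exercise data.

```python
# Hypothetical helper, for illustration only.
def fill_missing_names(rows):
    # First pass: map each code to the first non-empty name seen for it.
    names = {}
    for themes in rows:
        for entry in themes:
            if entry['name']:
                names.setdefault(entry['code'], entry['name'])
    # Second pass: build copies with blank names filled from the lookup.
    return [[{'code': e['code'],
              'name': e['name'] or names.get(e['code'], '')}
             for e in themes]
            for themes in rows]

# Toy rows shaped like the 'mjtheme_namecode' column.
rows = [
    [{'code': '4', 'name': ''}],
    [{'code': '4', 'name': 'Financial and private sector development'},
     {'code': '8', 'name': ''}],
    [{'code': '8', 'name': 'Human development'}],
]

filled = fill_missing_names(rows)
for themes in filled:
    print(themes)
```

Because the function returns new lists instead of mutating its input, the original column stays available for comparison. A code that never appears with a name is left blank.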

In [90]:
json_df = pd.read_json('data/world_bank_projects.json')
project_themes = json_df['mjtheme_namecode']
# build a code -> name lookup from the first entry of each row
project_code_dict = {}
for each in project_themes:
    if each[0]['code'] not in project_code_dict:
        project_code_dict[each[0]['code']] = each[0]['name']
# code 4 was first seen with an empty name, so set its name by hand
project_code_dict['4'] = 'Financial and private sector development'
# fill in the missing name of the first entry in each row (dicts are updated in place)
for each in project_themes:
    if each[0]['name'] == '':
        each[0]['name'] = project_code_dict[each[0]['code']]
json_df['mjtheme_namecode'] = project_themes
print(json_df['mjtheme_namecode'])

0      [{u'code': u'8', u'name': u'Human development'...
1      [{u'code': u'1', u'name': u'Economic managemen...
2      [{u'code': u'5', u'name': u'Trade and integrat...
3      [{u'code': u'7', u'name': u'Social dev/gender/...
4      [{u'code': u'5', u'name': u'Trade and integrat...
5      [{u'code': u'6', u'name': u'Social protection ...
6      [{u'code': u'2', u'name': u'Public sector gove...
7      [{u'code': u'11', u'name': u'Environment and n...
8      [{u'code': u'10', u'name': u'Rural development...
9      [{u'code': u'2', u'name': u'Public sector gove...
10     [{u'code': u'10', u'name': u'Rural development...
11     [{u'code': u'10', u'name': u'Rural development...
12     [{u'code': u'4', u'name': u'Financial and priv...
13     [{u'code': u'5', u'name': u'Trade and integrat...
14     [{u'code': u'6', u'name': u'Social protection ...
15     [{u'code': u'10', u'name': u'Rural development...
16     [{u'code': u'10', u'name': u'Rural development...
17     [{u'code': u'8', u'name'