****
## JSON exercise

Using data in file 'data/world_bank_projects.json' and the techniques demonstrated above,
1. Find the 10 countries with the most projects
2. Find the top 10 major project themes (using column 'mjtheme_namecode')
3. In step 2 you will notice that some entries have only the code, with the name missing. Create a dataframe with the missing names filled in.




In [4]:
import pandas as pd
import json
from pandas import json_normalize  # top-level import; pandas.io.json.json_normalize is deprecated since pandas 1.0

In [5]:
# Load the data as projects, a normalized table containing all the columns of the dataset
# (the with block closes the file once the JSON has been read)
with open('data/world_bank_projects.json') as f:
    data = json.load(f)
projects = json_normalize(data)

# 1. Find the 10 countries with the most projects

# Import the Counter object from collections, create the counter country_counter from the 'countryshortname' column of projects, and display a list of the ten most common countries with their counts
from collections import Counter
country_counter = Counter(projects['countryshortname'])
country_counter.most_common(10)

[('China', 19),
 ('Indonesia', 19),
 ('Vietnam', 17),
 ('India', 16),
 ('Yemen, Republic of', 13),
 ('Morocco', 12),
 ('Nepal', 12),
 ('Bangladesh', 12),
 ('Mozambique', 11),
 ('Africa', 11)]
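
The same ranking can also be obtained with pandas alone. The sketch below uses a small made-up frame (`sample` is a hypothetical stand-in for `projects`) and relies on `Series.value_counts()`, which sorts by count in descending order. The order of tied entries may differ from what `Counter.most_common` returns.

```python
import pandas as pd

# Hypothetical stand-in for the 'countryshortname' column of projects
sample = pd.DataFrame(
    {"countryshortname": ["China", "India", "China", "Nepal", "India", "China"]}
)

# value_counts() sorts by count (descending); head(n) keeps the n largest
top_countries = sample["countryshortname"].value_counts().head(2)
print(top_countries)
```

On the real data the call would presumably be `projects['countryshortname'].value_counts().head(10)`. Note that 'Africa' in the output above is a regional label rather than a single country, so a stricter answer might filter it out first.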

In [6]:
# 2. Find the top 10 major project themes (using column 'mjtheme_namecode')

# Create an empty list and append the name of every theme for each project to it
mjthemes = projects['mjtheme_namecode']
themes = []
for entry in mjthemes:
    for i in entry:
        themes.append(i['name'])

# Create a counter for the now populated list of all the themes from the data, and display a list of the ten most common themes and their counts
themes_counter = Counter(themes)
themes_counter.most_common(10)

[('Environment and natural resources management', 223),
 ('Rural development', 202),
 ('Human development', 197),
 ('Public sector governance', 184),
 ('Social protection and risk management', 158),
 ('Financial and private sector development', 130),
 ('', 122),
 ('Social dev/gender/inclusion', 119),
 ('Trade and integration', 72),
 ('Urban development', 47)]
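
The `('', 122)` entry shows that counting by name lumps every blank name into one bucket. One possible workaround, sketched here on made-up data (`sample_themes` is hypothetical and only mimics the shape of the 'mjtheme_namecode' column), is to flatten the nested lists with a comprehension and count by code instead.

```python
from collections import Counter

# Hypothetical miniature of the 'mjtheme_namecode' column: one list of
# {'code', 'name'} dicts per project, with some names left blank
sample_themes = [
    [{"code": "8", "name": "Human development"}, {"code": "11", "name": ""}],
    [{"code": "11", "name": "Environment and natural resources management"}],
    [{"code": "8", "name": ""}, {"code": "8", "name": "Human development"}],
]

# Flatten the nested lists and count by code, so entries with a blank
# name are attributed to their theme instead of to ''
code_counter = Counter(
    theme["code"] for entry in sample_themes for theme in entry
)
print(code_counter.most_common(2))
```

Counting by code gives the true ranking even before the names are repaired. The codes can be translated back to names afterwards.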

In [9]:
# 3. Create a dataframe with the missing names filled in.

# Make a dictionary with the theme code as the key and the theme name as the value
theme_dict = {}
for entry in projects['mjtheme_namecode']:
    for val in entry:
        if val['name'] != '':
            theme_dict[val['code']] = val['name']
            
# A simple function using theme_dict to return a theme name given a code
def filler_in(code):
    return theme_dict[code]
    

# Loop over every theme of every project and fill in each empty name from its code.
# The theme dicts are mutated in place, so the 'mjtheme_namecode' column of projects is updated directly.
for entry in projects['mjtheme_namecode']:
    for theme in entry:
        if theme['name'] == '':
            theme['name'] = filler_in(theme['code'])

# To check if empty string theme names have been filled in, we can make a new counter similar to the one in 2, and confirm the count for '' is 0
mjthemes = projects['mjtheme_namecode']
themes = []
for entry in mjthemes:
    for i in entry:
        themes.append(i['name'])
themes_counter_2 = Counter(themes)
print(themes_counter_2[''] == 0)

True
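
The approach above repairs the nested dicts in place. Since the exercise asks for a dataframe, an alternative is to flatten the themes into their own table first. The following is only a sketch on made-up records (`sample_data` is hypothetical); it assumes `pd.json_normalize` with `record_path` (pandas 1.0 or later) and leaves a name blank when its code never appears with a name, a case in which `theme_dict[code]` would raise a `KeyError`.

```python
import pandas as pd

# Hypothetical miniature of the loaded JSON: a list of project records
sample_data = [
    {"id": "P1", "mjtheme_namecode": [
        {"code": "8", "name": "Human development"},
        {"code": "11", "name": ""},
    ]},
    {"id": "P2", "mjtheme_namecode": [
        {"code": "11", "name": "Environment and natural resources management"},
        {"code": "8", "name": ""},
    ]},
]

# One row per (project, theme) pair, with columns 'code' and 'name'
themes_df = pd.json_normalize(sample_data, record_path="mjtheme_namecode")

# Build a code -> name lookup from the rows that do have a name
named = themes_df[themes_df["name"] != ""].drop_duplicates("code")
code_to_name = named.set_index("code")["name"]

# Map each code to its name; codes without a known name keep their
# original (blank) value instead of raising an error
themes_df["name"] = themes_df["code"].map(code_to_name).fillna(themes_df["name"])

print(themes_df["name"].value_counts())
```

Applied to the real data, `themes_df['name'].value_counts().head(10)` should then give the top 10 themes without the `''` bucket.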
