## JSON exercise

Using data in file 'data/world_bank_projects.json' and the techniques demonstrated above,
1. Find the 10 countries with most projects
    - `answer_1`
2. Find the top 10 major project themes (using column 'mjtheme_namecode')
    - `answer_2`
3. In 2. above you will notice that some entries have only the code and the name is missing. Create a dataframe with the missing names filled in.
    - `answer_3`

In [1]:
# imports
import pandas as pd
import json
from pandas import json_normalize  # top-level since pandas 1.0; the pandas.io.json path is deprecated

In [2]:
# load data into dataframe
json_df = pd.read_json('data/world_bank_projects.json')

## Question 1. Find the 10 countries with most projects

1. Determine country column
2. Use something like: `json_df['country'].value_counts().head(10)` to get top 10 counts
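
As a minimal sketch of step 2, the snippet below runs `value_counts().head()` on a small made-up frame. The `sample_df` name and its values are illustrative assumptions and are not taken from the data file.

```python
import pandas as pd

# Hypothetical sample data standing in for the real project records
sample_df = pd.DataFrame(
    {"country": ["Nepal", "India", "Nepal", "China", "India", "Nepal"]}
)

# value_counts() tallies each distinct value and sorts by count, descending;
# head(2) keeps the two most frequent values
top_countries = sample_df["country"].value_counts().head(2)
print(top_countries)
```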

In [3]:
# look at column names
json_df.columns

Index(['_id', 'approvalfy', 'board_approval_month', 'boardapprovaldate',
       'borrower', 'closingdate', 'country_namecode', 'countrycode',
       'countryname', 'countryshortname', 'docty', 'envassesmentcategorycode',
       'grantamt', 'ibrdcommamt', 'id', 'idacommamt', 'impagency',
       'lendinginstr', 'lendinginstrtype', 'lendprojectcost',
       'majorsector_percent', 'mjsector_namecode', 'mjtheme',
       'mjtheme_namecode', 'mjthemecode', 'prodline', 'prodlinetext',
       'productlinetype', 'project_abstract', 'project_name', 'projectdocs',
       'projectfinancialtype', 'projectstatusdisplay', 'regionname', 'sector',
       'sector1', 'sector2', 'sector3', 'sector4', 'sector_namecode',
       'sectorcode', 'source', 'status', 'supplementprojectflg', 'theme1',
       'theme_namecode', 'themecode', 'totalamt', 'totalcommamt', 'url'],
      dtype='object')

In [4]:
# Show all rows/columns in output
# Only use if needed, slows things down

#pd.options.display.max_columns = None
#pd.options.display.max_rows = None

In [5]:
# Examine country--- columns closer

json_df[['country_namecode', 'countrycode', 'countryname', 'countryshortname']].head()

|   | country_namecode | countrycode | countryname | countryshortname |
|---|---|---|---|---|
| 0 | Federal Democratic Republic of Ethiopia!$!ET | ET | Federal Democratic Republic of Ethiopia | Ethiopia |
| 1 | Republic of Tunisia!$!TN | TN | Republic of Tunisia | Tunisia |
| 2 | Tuvalu!$!TV | TV | Tuvalu | Tuvalu |
| 3 | Republic of Yemen!$!RY | RY | Republic of Yemen | Yemen, Republic of |
| 4 | Kingdom of Lesotho!$!LS | LS | Kingdom of Lesotho | Lesotho |


**Observations:** `countryshortname` appears to be a good choice to use as country column

In [15]:
answer_1 = json_df['countryshortname'].value_counts().head(10)
answer_1

# `Africa` is a region rather than a country; further investigation of the `borrower` column
# could lead to better identification of the actual countries involved.

Indonesia             19
China                 19
Vietnam               17
India                 16
Yemen, Republic of    13
Nepal                 12
Morocco               12
Bangladesh            12
Mozambique            11
Africa                11
Name: countryshortname, dtype: int64

## Question 2. Find the top 10 major project themes (using column 'mjtheme_namecode')

1. Explore `mjtheme_namecode` column
2. Use something like: `json_df['mjtheme_namecode'].value_counts().head(10)` to get top 10 counts

In [7]:
# Examine a few of the major theme columns: 'mjtheme', 'mjtheme_namecode', 'mjthemecode'
json_df[['mjtheme', 'mjtheme_namecode', 'mjthemecode']].head()

|   | mjtheme | mjtheme_namecode | mjthemecode |
|---|---|---|---|
| 0 | [Human development] | [{'code': '8', 'name': 'Human development'}, {... | 8,11 |
| 1 | [Economic management, Social protection and ri... | [{'code': '1', 'name': 'Economic management'},... | 1,6 |
| 2 | [Trade and integration, Public sector governan... | [{'code': '5', 'name': 'Trade and integration'... | 5,2,11,6 |
| 3 | [Social dev/gender/inclusion, Social dev/gende... | [{'code': '7', 'name': 'Social dev/gender/incl... | 7,7 |
| 4 | [Trade and integration, Financial and private ... | [{'code': '5', 'name': 'Trade and integration'... | 5,4 |


**Observations:** each `mjtheme_namecode` entry is a list of dicts pairing a theme `code` with its `name` (the same information carried by `mjtheme` and `mjthemecode`), so flatten it with `json_normalize()`.
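
To illustrate what that flattening does, here is a small sketch on made-up records shaped like the project file. The `records` and `themes` names and the `id` values are hypothetical, and the top-level `pd.json_normalize` call assumes pandas 1.0 or later.

```python
import pandas as pd

# Hypothetical records mimicking the shape of the project file:
# each record holds a list of {'code', 'name'} dicts
records = [
    {"id": "P1", "mjtheme_namecode": [{"code": "8", "name": "Human development"},
                                      {"code": "11", "name": ""}]},
    {"id": "P2", "mjtheme_namecode": [{"code": "1", "name": "Economic management"}]},
]

# record_path names the nested list to flatten; each inner dict becomes one row
themes = pd.json_normalize(records, record_path="mjtheme_namecode")
print(themes)
```

Every inner dict becomes its own row, so a project with several themes contributes several rows to the result.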

In [8]:
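# First, load the JSON file as parsed Python objects (a list of dicts).
# Despite its name, json_string holds the parsed records, not a string.
with open('data/world_bank_projects.json') as f:
    json_string = json.load(f)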

In [9]:
json_normalize(json_string, 'mjtheme_namecode')

|   | code | name |
|---|---|---|
| 0 | 8 | Human development |
| 1 | 11 |  |
| 2 | 1 | Economic management |
| 3 | 6 | Social protection and risk management |
| 4 | 5 | Trade and integration |
| 5 | 2 | Public sector governance |
| 6 | 11 | Environment and natural resources management |
| 7 | 6 | Social protection and risk management |
| 8 | 7 | Social dev/gender/inclusion |
| 9 | 7 | Social dev/gender/inclusion |


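**Observations:** I see now what the exercise means about missing names: row 1 has code 11 but no name. These can be filled in later from rows that carry both the code and the name. For now, a top ten list can be created using `groupby('name').size()`; entries with a blank name end up in a group of their own.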

In [39]:
mjtheme_namecode_df = json_normalize(json_string, 'mjtheme_namecode')

mjtheme_namecode_df.set_index('code', inplace=True)

answer_2 = mjtheme_namecode_df.groupby('name').size().sort_values(ascending=False).head(10)

answer_2

name
Environment and natural resources management    223
Rural development                               202
Human development                               197
Public sector governance                        184
Social protection and risk management           158
Financial and private sector development        130
                                                122
Social dev/gender/inclusion                     119
Trade and integration                            72
Urban development                                47
dtype: int64

In [62]:
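# Build a lookup of code -> name from the rows that have a name.
# drop_duplicates() compares the `name` column only (`code` is the index),
# leaving one row per distinct theme name.
mjtheme_namecode_values_df = mjtheme_namecode_df[mjtheme_namecode_df['name'] != ''].drop_duplicates()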

In [65]:
# Work on a copy so that update() does not also modify mjtheme_namecode_df
answer_3 = mjtheme_namecode_df.copy()

In [66]:
answer_3.update(mjtheme_namecode_values_df)

In [67]:
answer_3

| code | name |
|---|---|
| 8 | Human development |
| 11 | Environment and natural resources management |
| 1 | Economic management |
| 6 | Social protection and risk management |
| 5 | Trade and integration |
| 2 | Public sector governance |
| 11 | Environment and natural resources management |
| 6 | Social protection and risk management |
| 7 | Social dev/gender/inclusion |
| 7 | Social dev/gender/inclusion |
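
An alternative to `update()` that does not rely on index alignment is to build a code-to-name lookup and apply it with `map`. The following is only a sketch on made-up rows, assuming each code corresponds to exactly one name. The `themes`, `lookup`, `filled` and `theme_counts` names are illustrative and not part of the notebook.

```python
import pandas as pd

# Hypothetical theme rows; '' marks a missing name
themes = pd.DataFrame({
    "code": ["8", "11", "1", "11", "8"],
    "name": ["Human development", "", "Economic management",
             "Environment and natural resources management", ""],
})

# Lookup Series (index: code, value: name), built from rows that have a name
lookup = (
    themes[themes["name"] != ""]
    .drop_duplicates("code")
    .set_index("code")["name"]
)

# Replace every name with the one looked up from its code
filled = themes.copy()
filled["name"] = filled["code"].map(lookup)

# Recount themes now that no name is blank
theme_counts = filled["name"].value_counts()
print(filled)
print(theme_counts)
```

Counting on the filled names folds the previously blank entries into their proper themes, which is what a top ten ranking of themes would need.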
