# Data Wrangling - JSON

## The Data
- This is a dataset of projects funded by the World Bank.
- The original data can be found here: http://jsonstudio.com/resources/

## The Goal
There are three parts to this assignment:
- Compute the 10 countries with the most World Bank funded projects.
- Compute the 10 themes most common among the projects.
- Fill in missing descriptive data for the themes.

## Import and Inspect the Data

In [35]:
# Import pandas, numpy, and json packages:
import pandas as pd
import numpy as np
import json
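# json_normalize is available at the top level since pandas 1.0;
# the old pandas.io.json import path is deprecated.
from pandas import json_normalize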

cd Desktop/Springboard/data_wrangling_json/data

In [36]:
# The local directory should contain the data folder found in the hyperlink.
# Read the json file into a pandas data frame:
json_df=pd.read_json('data/world_bank_projects.json')

In [37]:
json_df.shape

(500, 50)

In [38]:
json_df.columns

Index(['_id', 'approvalfy', 'board_approval_month', 'boardapprovaldate',
       'borrower', 'closingdate', 'country_namecode', 'countrycode',
       'countryname', 'countryshortname', 'docty', 'envassesmentcategorycode',
       'grantamt', 'ibrdcommamt', 'id', 'idacommamt', 'impagency',
       'lendinginstr', 'lendinginstrtype', 'lendprojectcost',
       'majorsector_percent', 'mjsector_namecode', 'mjtheme',
       'mjtheme_namecode', 'mjthemecode', 'prodline', 'prodlinetext',
       'productlinetype', 'project_abstract', 'project_name', 'projectdocs',
       'projectfinancialtype', 'projectstatusdisplay', 'regionname', 'sector',
       'sector1', 'sector2', 'sector3', 'sector4', 'sector_namecode',
       'sectorcode', 'source', 'status', 'supplementprojectflg', 'theme1',
       'theme_namecode', 'themecode', 'totalamt', 'totalcommamt', 'url'],
      dtype='object')

In [39]:
# Set pandas to show all of the columns
pd.options.display.max_columns=50

In [40]:
json_df.head()

Unnamed: 0,_id,approvalfy,board_approval_month,boardapprovaldate,borrower,closingdate,country_namecode,countrycode,countryname,countryshortname,docty,envassesmentcategorycode,grantamt,ibrdcommamt,id,idacommamt,impagency,lendinginstr,lendinginstrtype,lendprojectcost,majorsector_percent,mjsector_namecode,mjtheme,mjtheme_namecode,mjthemecode,prodline,prodlinetext,productlinetype,project_abstract,project_name,projectdocs,projectfinancialtype,projectstatusdisplay,regionname,sector,sector1,sector2,sector3,sector4,sector_namecode,sectorcode,source,status,supplementprojectflg,theme1,theme_namecode,themecode,totalamt,totalcommamt,url
0,{'$oid': '52b213b38594d8a2be17c780'},1999,November,2013-11-12T00:00:00Z,FEDERAL DEMOCRATIC REPUBLIC OF ETHIOPIA,2018-07-07T00:00:00Z,Federal Democratic Republic of Ethiopia!$!ET,ET,Federal Democratic Republic of Ethiopia,Ethiopia,"Project Information Document,Indigenous People...",C,0,0,P129828,130000000,MINISTRY OF EDUCATION,Investment Project Financing,IN,550000000,"[{'Percent': 46, 'Name': 'Education'}, {'Perce...","[{'code': 'EX', 'name': 'Education'}, {'code':...",[Human development],"[{'code': '8', 'name': 'Human development'}, {...",811,PE,IBRD/IDA,L,{'cdata': 'The development objective of the Se...,Ethiopia General Education Quality Improvement...,"[{'DocDate': '28-AUG-2013', 'EntityID': '09022...",IDA,Active,Africa,"[{'Name': 'Primary education'}, {'Name': 'Seco...","{'Percent': 46, 'Name': 'Primary education'}","{'Percent': 26, 'Name': 'Secondary education'}","{'Percent': 16, 'Name': 'Public administration...","{'Percent': 12, 'Name': 'Tertiary education'}","[{'code': 'EP', 'name': 'Primary education'}, ...","ET,BS,ES,EP",IBRD,Active,N,"{'Percent': 100, 'Name': 'Education for all'}","[{'code': '65', 'name': 'Education for all'}]",65,130000000,130000000,http://www.worldbank.org/projects/P129828/ethi...
1,{'$oid': '52b213b38594d8a2be17c781'},2015,November,2013-11-04T00:00:00Z,GOVERNMENT OF TUNISIA,,Republic of Tunisia!$!TN,TN,Republic of Tunisia,Tunisia,"Project Information Document,Integrated Safegu...",C,4700000,0,P144674,0,MINISTRY OF FINANCE,Specific Investment Loan,IN,5700000,"[{'Percent': 70, 'Name': 'Public Administratio...","[{'code': 'BX', 'name': 'Public Administration...","[Economic management, Social protection and ri...","[{'code': '1', 'name': 'Economic management'},...",16,RE,Recipient Executed Activities,L,,TN: DTF Social Protection Reforms Support,"[{'DocDate': '29-MAR-2013', 'EntityID': '00033...",OTHER,Active,Middle East and North Africa,[{'Name': 'Public administration- Other social...,"{'Percent': 70, 'Name': 'Public administration...","{'Percent': 30, 'Name': 'General public admini...",,,"[{'code': 'BS', 'name': 'Public administration...","BZ,BS",IBRD,Active,N,"{'Percent': 30, 'Name': 'Other economic manage...","[{'code': '24', 'name': 'Other economic manage...",5424,0,4700000,http://www.worldbank.org/projects/P144674?lang=en
2,{'$oid': '52b213b38594d8a2be17c782'},2014,November,2013-11-01T00:00:00Z,MINISTRY OF FINANCE AND ECONOMIC DEVEL,,Tuvalu!$!TV,TV,Tuvalu,Tuvalu,"Resettlement Plan,Environmental Assessment,Int...",B,0,0,P145310,6060000,MINISTRY OF TRANSPORT AND COMMUNICATIONS,Investment Project Financing,IN,6060000,"[{'Percent': 100, 'Name': 'Transportation'}]","[{'code': 'TX', 'name': 'Transportation'}]","[Trade and integration, Public sector governan...","[{'code': '5', 'name': 'Trade and integration'...",52116,PE,IBRD/IDA,L,,Tuvalu Aviation Investment Project - Additiona...,"[{'DocDate': '21-OCT-2013', 'EntityID': '00033...",IDA,Active,East Asia and Pacific,[{'Name': 'Rural and Inter-Urban Roads and Hig...,"{'Percent': 100, 'Name': 'Rural and Inter-Urba...",,,,"[{'code': 'TI', 'name': 'Rural and Inter-Urban...",TI,IBRD,Active,Y,"{'Percent': 46, 'Name': 'Regional integration'}","[{'code': '47', 'name': 'Regional integration'...",52812547,6060000,6060000,http://www.worldbank.org/projects/P145310?lang=en
3,{'$oid': '52b213b38594d8a2be17c783'},2014,October,2013-10-31T00:00:00Z,MIN. OF PLANNING AND INT'L COOPERATION,,Republic of Yemen!$!RY,RY,Republic of Yemen,"Yemen, Republic of","Procurement Plan,Project Information Document,...",C,1500000,0,P144665,0,LABOR INTENSIVE PUBLIC WORKS PROJECT PMU,Technical Assistance Loan,IN,1500000,"[{'Percent': 100, 'Name': 'Health and other so...","[{'code': 'JX', 'name': 'Health and other soci...","[Social dev/gender/inclusion, Social dev/gende...","[{'code': '7', 'name': 'Social dev/gender/incl...",77,RE,Recipient Executed Activities,L,,Gov't and Civil Society Organization Partnership,"[{'DocDate': '15-MAY-2013', 'EntityID': '00035...",OTHER,Active,Middle East and North Africa,[{'Name': 'Other social services'}],"{'Percent': 100, 'Name': 'Other social services'}",,,,"[{'code': 'JB', 'name': 'Other social services'}]",JB,IBRD,Active,N,"{'Percent': 50, 'Name': 'Participation and civ...","[{'code': '57', 'name': 'Participation and civ...",5957,0,1500000,http://www.worldbank.org/projects/P144665?lang=en
4,{'$oid': '52b213b38594d8a2be17c784'},2014,October,2013-10-31T00:00:00Z,MINISTRY OF FINANCE,2019-04-30T00:00:00Z,Kingdom of Lesotho!$!LS,LS,Kingdom of Lesotho,Lesotho,"Project Information Document,Integrated Safegu...",B,0,0,P144933,13100000,MINISTRY OF TRADE AND INDUSTRY,Investment Project Financing,IN,15000000,"[{'Percent': 50, 'Name': 'Industry and trade'}...","[{'code': 'YX', 'name': 'Industry and trade'},...","[Trade and integration, Financial and private ...","[{'code': '5', 'name': 'Trade and integration'...",54,PE,IBRD/IDA,L,{'cdata': 'The development objective of the Se...,Second Private Sector Competitiveness and Econ...,"[{'DocDate': '06-SEP-2013', 'EntityID': '09022...",IDA,Active,Africa,[{'Name': 'General industry and trade sector'}...,"{'Percent': 50, 'Name': 'General industry and ...","{'Percent': 40, 'Name': 'Other industry'}","{'Percent': 10, 'Name': 'SME Finance'}",,"[{'code': 'YZ', 'name': 'General industry and ...","FH,YW,YZ",IBRD,Active,N,"{'Percent': 30, 'Name': 'Export development an...","[{'code': '45', 'name': 'Export development an...",4145,13100000,13100000,http://www.worldbank.org/projects/P144933/seco...


Of the country-related columns, `countryshortname` holds the most readable country names.

## Find the 10 Countries with the Most Projects
This can easily be done using `value_counts()` to compute how often a country name appears.

In [41]:
# Compute a list of the top 10 countries with the most projects:
json_df['countryshortname'].value_counts().head(10)

Indonesia             19
China                 19
Vietnam               17
India                 16
Yemen, Republic of    13
Nepal                 12
Morocco               12
Bangladesh            12
Mozambique            11
Africa                11
Name: countryshortname, dtype: int64

`Africa` is a region rather than a country, so the 11 projects filed under it should be investigated further before this ranking is taken at face value.
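One way to begin that investigation is sketched below. It runs on a small made-up frame, not the real `json_df`, so the names `sample_df`, `Region A`, and `Region B` are purely hypothetical. The idea is to pull out the rows labelled `Africa` and look at the longer `countryname` field for a clue about what they represent.

```python
import pandas as pd

# Hypothetical stand-in for json_df; the real values may differ.
sample_df = pd.DataFrame({
    'countryshortname': ['Africa', 'Nepal', 'Africa'],
    'countryname': ['Region A', 'Nepal', 'Region B'],
})

# Keep only the rows labelled 'Africa' and list their distinct long names.
africa_rows = sample_df[sample_df['countryshortname'] == 'Africa']
africa_names = africa_rows['countryname'].unique().tolist()
print(africa_names)
```

On the real data, the same filter applied to `json_df` would show whether these entries are regional projects or mislabelled countries.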

## Find the 10 Most Common Project Themes

Let's start by inspecting the `mjtheme_namecode` feature we were told to use.

In [42]:
json_df['mjtheme_namecode'][0]

[{'code': '8', 'name': 'Human development'}, {'code': '11', 'name': ''}]

The first project has two themes, and most projects have more than one.
The name of the second theme is also missing (as the instructions warned).
Each theme is a dictionary inside a list, which is awkward to work with while the list sits in a single DataFrame column.
Instead, read the file as a list of dictionaries so that its natural nested structure can be used.

In [44]:
# Read the json file into a list of dictionaries:
with open('data/world_bank_projects.json', 'r') as json_file:
    json_dict = json.load(json_file)

In [61]:
# Create a dataframe of values for the nested element 'mjtheme_namecode':
themes_df=json_normalize(json_dict,'mjtheme_namecode')
themes_df.head(10)

|   | code | name |
|---|------|------|
| 0 | 8 | Human development |
| 1 | 11 | |
| 2 | 1 | Economic management |
| 3 | 6 | Social protection and risk management |
| 4 | 5 | Trade and integration |
| 5 | 2 | Public sector governance |
| 6 | 11 | Environment and natural resources management |
| 7 | 6 | Social protection and risk management |
| 8 | 7 | Social dev/gender/inclusion |
| 9 | 7 | Social dev/gender/inclusion |
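To make the flattening step concrete, here is a minimal sketch on two made-up records shaped like entries in the projects file. The record contents and the names `toy_records` and `toy_themes` are illustrative assumptions, and the sketch assumes pandas 1.0 or later, where `json_normalize` is exposed as `pd.json_normalize`. The `meta` argument is not used in the notebook; it is included only to show how a parent field can be carried along with each theme.

```python
import pandas as pd

# Two made-up records shaped like the projects file (illustrative only).
toy_records = [
    {'id': 'A', 'mjtheme_namecode': [{'code': '8', 'name': 'Human development'},
                                     {'code': '11', 'name': ''}]},
    {'id': 'B', 'mjtheme_namecode': [{'code': '1', 'name': 'Economic management'}]},
]

# record_path flattens the nested list into one row per theme;
# meta copies the parent record's 'id' onto each of those rows.
toy_themes = pd.json_normalize(toy_records,
                               record_path='mjtheme_namecode',
                               meta=['id'])
print(toy_themes)
```

Each nested theme becomes its own row, which is why the resulting frame has more rows than there are projects.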


Since `name` has some missing entries, instead compute the most common project themes using `code`.

In [60]:
# Compute a list of the top 10 project themes by code number:
themes_df['code'].value_counts().head(10)

11    250
10    216
8     210
2     199
6     168
4     146
7     130
5      77
9      50
1      38
Name: code, dtype: int64

This information is not very enlightening without knowing what each `code` value represents, so let's go back and redo this calculation using `name`.

## Fill in Missing `name` Values
To perform the calculation using `name`, no values may be missing. Three steps fix this:
1. Sort the data by `code` and then `name`, so that rows sharing a `code` are adjacent and their blank names come first.
2. Replace every blank `name` (an empty string) with `NaN`.
3. Fill each `NaN` with the correct name taken from the neighboring rows of the same `code`.

In [56]:
# Sort project themes by code and then by name:
themes_df=themes_df.sort_values(['code','name'])
themes_df.head(10)

|      | code | name |
|------|------|------|
| 212  | 1 | |
| 363  | 1 | |
| 1024 | 1 | |
| 1114 | 1 | |
| 1437 | 1 | |
| 2    | 1 | Economic management |
| 88   | 1 | Economic management |
| 175  | 1 | Economic management |
| 204  | 1 | Economic management |
| 205  | 1 | Economic management |


Notice that, within each `code`, the rows with a missing `name` now appear before the rows whose `name` is present, because the empty string sorts ahead of any non-empty name.

In [57]:
# Change all blank project theme names to NaN
# (.loc avoids the chained assignment that can trigger SettingWithCopyWarning)
themes_df.loc[themes_df['name'] == '', 'name'] = np.nan

Since the missing `name` values appear before the actual values, we can backfill them.

In [58]:
# backfill the NaNs created to give proper names:
# (DataFrame.bfill() replaces the deprecated fillna(method='bfill'))
themes_df = themes_df.bfill()
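Backfilling relies on the sort order and on every `code` having at least one named row; otherwise a blank could pick up the name of the next `code`. As an alternative that the notebook does not use, the sketch below builds an explicit code-to-name lookup and maps it back onto the frame. The frame `toy` and its rows are made up for illustration.

```python
import pandas as pd

# Made-up theme rows; '' marks a missing name, as in the real data.
toy = pd.DataFrame({
    'code': ['8', '11', '1', '11', '1'],
    'name': ['Human development', '', 'Economic management',
             'Environment and natural resources management', ''],
})

# Build a code -> name lookup from the rows that do have a name.
named = toy[toy['name'] != '']
lookup = named.drop_duplicates('code').set_index('code')['name']

# Overwrite every name from the lookup; a code with no named row becomes NaN.
toy['name'] = toy['code'].map(lookup)
print(toy)
```

This version does not depend on row order, and a `code` that never appears with a name surfaces as `NaN` instead of silently borrowing a neighbor's name.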

In [59]:
# Recompute the most common project themes, this time by their descriptive names.
# The full ranking is shown; the first 10 rows are the top 10.
themes_df['name'].value_counts()

Environment and natural resources management    250
Rural development                               216
Human development                               210
Public sector governance                        199
Social protection and risk management           168
Financial and private sector development        146
Social dev/gender/inclusion                     130
Trade and integration                            77
Urban development                                50
Economic management                              38
Rule of law                                      15
Name: name, dtype: int64
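A quick sanity check can confirm that the fill worked: every `code` should map to exactly one `name`, and no `name` should be blank or `NaN`. The sketch below shows such a check on a small hypothetical frame called `filled`, which stands in for the filled-in `themes_df`; it also uses `head()` to cut the ranking to the requested length.

```python
import pandas as pd

# Hypothetical filled-in theme table standing in for themes_df.
filled = pd.DataFrame({
    'code': ['1', '1', '8', '11', '11', '11'],
    'name': ['Economic management', 'Economic management', 'Human development',
             'Environment and natural resources management',
             'Environment and natural resources management',
             'Environment and natural resources management'],
})

# Each code should correspond to exactly one distinct name.
names_per_code = filled.groupby('code')['name'].nunique()
consistent = bool((names_per_code == 1).all())

# No name should be NaN or an empty string after the fill.
no_missing = bool(filled['name'].notna().all() and (filled['name'] != '').all())

# Limit the ranking to the top entries (2 here; 10 on the real data).
top = filled['name'].value_counts().head(2)
print(consistent, no_missing)
print(top)
```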