# JSON Mini Project
*****
This exercise uses a World Bank projects dataset ('data/world_bank_projects.json'), in which each record describes one project, and addresses the following three tasks:
1. Find the 10 countries with most projects
2. Find the top 10 major project themes (using column 'mjtheme_namecode')
3. In 2., some entries have only the code and the name is missing. Create a dataframe with the missing names filled in.

In [1]:
# Import required libraries
import pandas as pd
import json
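# json_normalize is a top-level pandas function since pandas 1.0 (the pandas.io.json import path is deprecated)
from pandas import json_normalize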

# load as pandas dataframe
df = pd.read_json('data/world_bank_projects.json')

Each row describes one project; the column names list the details recorded for it.

In [2]:
df.columns

Index(['_id', 'approvalfy', 'board_approval_month', 'boardapprovaldate',
       'borrower', 'closingdate', 'country_namecode', 'countrycode',
       'countryname', 'countryshortname', 'docty', 'envassesmentcategorycode',
       'grantamt', 'ibrdcommamt', 'id', 'idacommamt', 'impagency',
       'lendinginstr', 'lendinginstrtype', 'lendprojectcost',
       'majorsector_percent', 'mjsector_namecode', 'mjtheme',
       'mjtheme_namecode', 'mjthemecode', 'prodline', 'prodlinetext',
       'productlinetype', 'project_abstract', 'project_name', 'projectdocs',
       'projectfinancialtype', 'projectstatusdisplay', 'regionname', 'sector',
       'sector1', 'sector2', 'sector3', 'sector4', 'sector_namecode',
       'sectorcode', 'source', 'status', 'supplementprojectflg', 'theme1',
       'theme_namecode', 'themecode', 'totalamt', 'totalcommamt', 'url'],
      dtype='object')

### Task 1:  Top 10 countries with most projects

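To find the top 10 countries, we apply the value_counts() method to the column 'countryname'. The result is sorted by count in descending order, so the first 10 entries are the countries with the most projects.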

In [3]:
top_countries=df['countryname'].value_counts(ascending=False).head(10)

In [4]:
Top_Countries=pd.DataFrame(top_countries)
Top_Countries.columns=['Number of Projects']
Top_Countries

Country,Number of Projects
Republic of Indonesia,19
People's Republic of China,19
Socialist Republic of Vietnam,17
Republic of India,16
Republic of Yemen,13
People's Republic of Bangladesh,12
Nepal,12
Kingdom of Morocco,12
Africa,11
Republic of Mozambique,11
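
The counting step can be checked on a small example. The sketch below uses hypothetical toy data (the `toy` dataframe and its values are assumptions, not part of the dataset) to show how value_counts() ranks the entries and how the result becomes a one-column dataframe.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'countryname' column
toy = pd.DataFrame({'countryname': ['Nepal', 'Republic of India', 'Nepal',
                                    'Republic of Yemen', 'Nepal',
                                    'Republic of India']})

# value_counts() sorts by count in descending order by default,
# so head(2) keeps the two most frequent countries
toy_counts = toy['countryname'].value_counts().head(2)

# Name the series, then turn it into a dataframe with that column name
toy_top = toy_counts.rename('Number of Projects').to_frame()
print(toy_top)
```

Renaming the series before calling to_frame() avoids the separate step of overwriting the column names afterwards.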


### Task 2: Top 10 major project themes

In this task, we use the column 'mjtheme_namecode' to find the top 10 project themes. Each entry of 'mjtheme_namecode' is a list of JSON records, one per theme of the project, each holding the code and the name of the theme. Since some of the names are missing, we first find the top 10 project themes by their code. In the next task, we map the codes to their corresponding names.

In [5]:
# From each entry of 'mjtheme_namecode', we form a dataframe containing the code and name of the project themes.
# For the same project, some of the themes are repeated. To avoid double counting of the themes for the same project,
# we remove any duplicate rows within each project (including the first one) before combining them.
# pd.concat is used because DataFrame.append was removed in pandas 2.0.
df_themes=pd.concat([json_normalize(proj).drop_duplicates() for proj in df.mjtheme_namecode])

The dataframe df_themes contains the names and codes of all project themes.

In [6]:
df_themes.head(8)

index,code,name
0,8,Human development
1,11,
0,1,Economic management
1,6,Social protection and risk management
0,5,Trade and integration
1,2,Public sector governance
2,11,Environment and natural resources management
3,6,Social protection and risk management
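
To see what the per-project de-duplication does, here is a minimal sketch on hypothetical records shaped like the entries of 'mjtheme_namecode'. The `toy_projects` list is an assumption made for illustration, and `pd.json_normalize` assumes pandas 1.0 or later.

```python
import pandas as pd

# Hypothetical records in the same shape as 'mjtheme_namecode' entries
toy_projects = [
    [{'code': '8', 'name': 'Human development'},
     {'code': '11', 'name': ''}],
    # The second project lists the same theme twice
    [{'code': '6', 'name': 'Social protection and risk management'},
     {'code': '6', 'name': 'Social protection and risk management'}],
]

# Flatten each project and drop themes repeated within that project,
# then stack the per-project frames into one
toy_themes = pd.concat(
    [pd.json_normalize(proj).drop_duplicates() for proj in toy_projects]
)
print(toy_themes['code'].value_counts())
```

Note that drop_duplicates() compares whole rows, so a code that appears once with a name and once with an empty name in the same project still counts as two rows.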


To find the top 10 project themes by their code, we apply the value_counts method.

In [7]:
top_themes=df_themes['code'].value_counts(ascending=False).head(10)
Top_Themes=pd.DataFrame(top_themes)
Top_Themes.columns=['Number of Projects']
Top_Themes.index.name='code'
Top_Themes

code,Number of Projects
11,162
10,149
2,141
8,131
4,120
6,120
7,114
5,61
9,40
1,33


### Task 3: Filling the missing names

To fill in the missing names of the project themes, we build a lookup table that maps each theme code to its name. We derive it from the dataframe df_themes by removing any rows with empty names and then any duplicate rows.

In [8]:
import numpy as np
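# Assign the result back instead of calling replace(..., inplace=True) on the selected column,
# which may not modify df_themes in recent pandas versions
df_themes['name']=df_themes['name'].replace('',np.nan)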
lookup_table=df_themes.dropna()
lookup_table=lookup_table.drop_duplicates()
lookup_table=lookup_table.set_index('code')
lookup_table

code,name
8,Human development
1,Economic management
6,Social protection and risk management
5,Trade and integration
2,Public sector governance
11,Environment and natural resources management
7,Social dev/gender/inclusion
4,Financial and private sector development
10,Rural development
9,Urban development
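
A lookup of this kind can also be used to fill the empty names in the flattened themes dataframe directly. The sketch below is one possible approach on hypothetical toy data (`toy_themes`, `toy_lookup` and `toy_filled` are illustrative names, not part of the notebook); it relies on Series.map, which requires each code to appear only once in the lookup.

```python
import pandas as pd

# Hypothetical flattened themes, with two names missing
toy_themes = pd.DataFrame({
    'code': ['8', '11', '11', '8'],
    'name': ['Human development', '',
             'Environment and natural resources management', ''],
})

# Build a code -> name series from the rows that do have a name
toy_lookup = (toy_themes[toy_themes['name'] != '']
              .drop_duplicates()
              .set_index('code')['name'])

# Map every code through the lookup to get a fully named copy
toy_filled = toy_themes.assign(name=toy_themes['code'].map(toy_lookup))
print(toy_filled)
```

Using assign() returns a new dataframe and leaves the original untouched, which makes it easy to compare the two.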


With the lookup table in place, we merge it with the Top_Themes table on the shared 'code' index to obtain the names of the top 10 project themes.

In [9]:
Top_Themes=pd.merge(Top_Themes,lookup_table,on='code')

In [10]:
Top_Themes=Top_Themes[['name','Number of Projects']]
Top_Themes.columns=['Name','Number of Projects']
Top_Themes.index.name='Code'
Top_Themes

Code,Name,Number of Projects
11,Environment and natural resources management,162
10,Rural development,149
2,Public sector governance,141
8,Human development,131
4,Financial and private sector development,120
6,Social protection and risk management,120
7,Social dev/gender/inclusion,114
5,Trade and integration,61
9,Urban development,40
1,Economic management,33
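
Because both tables are indexed by the theme code, an index-based join is a possible alternative to pd.merge. The sketch below uses two rows taken from the tables above; the `toy_*` names are assumptions made for illustration.

```python
import pandas as pd

# Two rows from the counts table, indexed by theme code
toy_counts = pd.DataFrame(
    {'Number of Projects': [162, 149]},
    index=pd.Index(['11', '10'], name='code'),
)

# The matching lookup rows, deliberately in a different order
toy_lookup = pd.DataFrame(
    {'name': ['Rural development',
              'Environment and natural resources management']},
    index=pd.Index(['10', '11'], name='code'),
)

# join() aligns on the index and, as a left join, keeps the order of toy_counts
toy_named = toy_counts.join(toy_lookup)[['name', 'Number of Projects']]
print(toy_named)
```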


#### Filling the missing names back to the original dataframe

We now use the lookup table again to fill the missing names in the columns 'mjtheme_namecode' and 'mjtheme' of the original dataframe df.

In [11]:
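for index, row in df.iterrows():
    # Each entry is a list of theme records; fill in any empty name from the lookup table
    for theme in row['mjtheme_namecode']:
        if theme['name']=='':
            theme['name']=lookup_table.loc[theme['code'],'name']
    # Rebuild 'mjtheme' as the list of (now complete) theme names
    df.at[index,'mjtheme']=list(json_normalize(row['mjtheme_namecode']).name)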
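
The same fill can be written without an explicit row loop. The following is a minimal sketch on hypothetical toy data, assuming the lookup has been turned into a plain dict; `toy_df`, `names` and `fill_names` are illustrative names that do not appear in the notebook.

```python
import pandas as pd

# Hypothetical dataframe with one missing theme name
toy_df = pd.DataFrame({'mjtheme_namecode': [
    [{'code': '8', 'name': 'Human development'},
     {'code': '11', 'name': ''}],
    [{'code': '11', 'name': 'Environment and natural resources management'}],
]})

# Plain dict version of the lookup table (code -> name)
names = {'8': 'Human development',
         '11': 'Environment and natural resources management'}

def fill_names(themes):
    """Return new theme records with empty names filled from `names`."""
    return [{'code': t['code'], 'name': t['name'] or names[t['code']]}
            for t in themes]

# Replace each list of records with its filled copy,
# then derive the list of theme names per project
toy_df['mjtheme_namecode'] = toy_df['mjtheme_namecode'].apply(fill_names)
toy_df['mjtheme'] = toy_df['mjtheme_namecode'].apply(
    lambda themes: [t['name'] for t in themes]
)
print(toy_df['mjtheme'].tolist())
```

Unlike the loop above, this version builds new records instead of mutating the existing ones in place, so the original lists are left unchanged.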