# JSON exercise

Using the data in the file 'data/world_bank_projects.json':
Q1. Find the 10 countries with the most projects.
Q2. Find the top 10 major project themes (using column 'mjtheme_namecode').
Q3. In Q2 you will notice that some entries have only the code, and the name is missing. Create a dataframe with the missing names filled in.

In [1]:
import pandas as pd
from google.cloud import storage

# Download the JSON file from the Google Cloud Storage bucket to the working directory
client = storage.Client()
bucket = client.get_bucket('capstone_project_sr')
blob = storage.Blob('world_bank_projects.json', bucket)
with open('world_bank_projects.json', 'wb') as file_obj:
    blob.download_to_file(file_obj)

In [2]:
import json
from pandas import json_normalize  # pandas >= 1.0; the old pandas.io.json import path is deprecated

In [3]:
# Read the JSON file into a DataFrame
df = pd.read_json('world_bank_projects.json')

In [4]:
df.columns  # list all the columns

Index(['_id', 'approvalfy', 'board_approval_month', 'boardapprovaldate',
       'borrower', 'closingdate', 'country_namecode', 'countrycode',
       'countryname', 'countryshortname', 'docty', 'envassesmentcategorycode',
       'grantamt', 'ibrdcommamt', 'id', 'idacommamt', 'impagency',
       'lendinginstr', 'lendinginstrtype', 'lendprojectcost',
       'majorsector_percent', 'mjsector_namecode', 'mjtheme',
       'mjtheme_namecode', 'mjthemecode', 'prodline', 'prodlinetext',
       'productlinetype', 'project_abstract', 'project_name', 'projectdocs',
       'projectfinancialtype', 'projectstatusdisplay', 'regionname', 'sector',
       'sector1', 'sector2', 'sector3', 'sector4', 'sector_namecode',
       'sectorcode', 'source', 'status', 'supplementprojectflg', 'theme1',
       'theme_namecode', 'themecode', 'totalamt', 'totalcommamt', 'url'],
      dtype='object')

Q1. Find the 10 countries with the most projects. Note that 'Africa' is listed as a country, so the first 11 rows are shown.

In [5]:
df.countryname.value_counts()[:11]  # to exclude 'Africa', use .value_counts().drop('Africa')[:10] instead

People's Republic of China         19
Republic of Indonesia              19
Socialist Republic of Vietnam      17
Republic of India                  16
Republic of Yemen                  13
People's Republic of Bangladesh    12
Nepal                              12
Kingdom of Morocco                 12
Republic of Mozambique             11
Africa                             11
Federative Republic of Brazil       9
Name: countryname, dtype: int64
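
Since 'Africa' is a regional grouping rather than a country, one option is to drop it before taking the top 10. The sketch below shows the idea on a toy `countryname` column; the rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for df.countryname; the values are illustrative only.
toy = pd.DataFrame({
    "countryname": [
        "Nepal", "Africa", "Nepal", "Africa", "Africa", "Republic of India",
    ]
})

counts = toy["countryname"].value_counts()

# errors="ignore" keeps the call safe if the label is absent.
top_countries = counts.drop("Africa", errors="ignore").head(10)
print(top_countries)
```

On the real data the same two steps would be applied to `df.countryname`.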

Q2. Find the top 10 major project themes (using column 'mjtheme_namecode')

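To find the major themes we first have to handle missing values: some entries in 'mjtheme_namecode' have a code but an empty name. Assuming each code maps to a single name, the rows with empty names can be dropped when building a code-to-name lookup, and that lookup can then be used to fill the names back in.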

In [6]:
with open('world_bank_projects.json') as f:  # the context manager closes the file after loading
    data = json.load(f)

In [7]:
df_themes = json_normalize(data, 'mjtheme_namecode')  # one row per (code, name) entry across all projects

In [8]:
df_themes.sample(10)  # spot-check a random sample for missing names

      code  name
105   11    Environment and natural resources management
733   11    Environment and natural resources management
367   2     Public sector governance
1033  11    Environment and natural resources management
644   6     Social protection and risk management
1330  8     Human development
75    10    Rural development
1359  8     Human development
899   6     Social protection and risk management
1063  11    Environment and natural resources management
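
To make the flattening step concrete, here is a small sketch of how `json_normalize` unpacks a nested list through its `record_path` argument. The two project records are hypothetical, shaped like the entries in the file, and the `meta=["id"]` argument is an addition not used in the notebook. It assumes pandas 1.0 or later.

```python
import pandas as pd

# Two made-up project records shaped like the entries in the JSON file.
records = [
    {
        "id": "P1",
        "mjtheme_namecode": [
            {"code": "8", "name": "Human development"},
            {"code": "11", "name": ""},
        ],
    },
    {
        "id": "P2",
        "mjtheme_namecode": [
            {"code": "11", "name": "Environment and natural resources management"},
        ],
    },
]

# record_path names the nested list to unpack: one output row per theme entry.
# meta carries a parent field along so each row can be traced to its project.
flat = pd.json_normalize(records, record_path="mjtheme_namecode", meta=["id"])
print(flat)
```

Each project contributes as many rows as it has theme entries, which is why `df_themes` has more rows than there are projects.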


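Handling missing values (Q3): keep one row per code that has a non-empty name.

In [9]:
# drop rows with an empty name, then de-duplicate to get one (code, name) pair per theme
theme = df_themes[df_themes.name != ''].drop_duplicates()

Build a code-to-name lookup for the major project themes.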

In [10]:
# convert the (code, name) pairs to a dict for lookup
theme_dict = theme.set_index('code')['name'].to_dict()

In [11]:
theme_dict  # code-to-name lookup used to fill in the missing names

{'1': 'Economic management',
 '10': 'Rural development',
 '11': 'Environment and natural resources management',
 '2': 'Public sector governance',
 '3': 'Rule of law',
 '4': 'Financial and private sector development',
 '5': 'Trade and integration',
 '6': 'Social protection and risk management',
 '7': 'Social dev/gender/inclusion',
 '8': 'Human development',
 '9': 'Urban development'}

In [12]:
df_themes.sample(15)  # some rows have a code but an empty name

      code  name
241   4     Financial and private sector development
333   9
3     6     Social protection and risk management
322   6     Social protection and risk management
635   11    Environment and natural resources management
739   9     Urban development
1335  10    Rural development
1224  6
639   6     Social protection and risk management
1228  6     Social protection and risk management
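
A random sample only hints at the problem. To quantify it, the empty names can be counted directly. The sketch below uses a toy frame with the same `code` and `name` columns; the rows are illustrative, not real counts from the dataset.

```python
import pandas as pd

# Toy stand-in for df_themes; the rows are illustrative only.
toy_themes = pd.DataFrame({
    "code": ["6", "9", "6", "9", "11"],
    "name": [
        "Social protection and risk management",
        "",
        "",
        "Urban development",
        "Environment and natural resources management",
    ],
})

# Missing names are empty strings, not NaN, so isna() would not catch them.
is_blank = toy_themes["name"] == ""
n_blank = int(is_blank.sum())

# Which codes are affected, and how often.
blank_by_code = toy_themes.loc[is_blank, "code"].value_counts()
print(n_blank)
print(blank_by_code)
```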
1228,6,Social protection and risk management


### Answering Q3 - Handling missing values

In [13]:
# fill in every name from its code using the lookup
df_themes['name'] = df_themes['code'].map(theme_dict)
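
The fill step can be checked on a small example. This sketch builds the lookup and maps codes to names on toy data, then counts any codes that were left unmapped. The frame and the names `lookup`, `filled` and `unmapped` are hypothetical and used only for illustration.

```python
import pandas as pd

# Toy stand-in for df_themes; the rows are illustrative only.
toy_themes = pd.DataFrame({
    "code": ["6", "9", "6", "9"],
    "name": [
        "Social protection and risk management",
        "",
        "",
        "Urban development",
    ],
})

# Build the lookup from the rows that do have a name.
lookup = (
    toy_themes[toy_themes["name"] != ""]
    .drop_duplicates()
    .set_index("code")["name"]
    .to_dict()
)

# Passing a dict to map() gives NaN for unknown codes instead of raising KeyError.
# assign() returns a new frame, so toy_themes itself is left unchanged.
filled = toy_themes.assign(name=toy_themes["code"].map(lookup))

# Zero here means every code was found in the lookup.
unmapped = int(filled["name"].isna().sum())
print(filled)
print(unmapped)
```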

In [14]:
# Answer to Q2: the 10 most frequent major project themes
df_themes.name.value_counts()[:10]

Environment and natural resources management    250
Rural development                               216
Human development                               210
Public sector governance                        199
Social protection and risk management           168
Financial and private sector development        146
Social dev/gender/inclusion                     130
Trade and integration                            77
Urban development                                50
Economic management                              38
Name: name, dtype: int64
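
If the theme code should appear next to the name in the ranking, one possible variant is to group by both columns. This is a sketch on toy data with names already filled in; the rows are made up and the counts are not those of the dataset.

```python
import pandas as pd

# Toy stand-in for df_themes after the names were filled in.
toy_filled = pd.DataFrame({
    "code": ["11", "11", "11", "8", "8", "6"],
    "name": [
        "Environment and natural resources management",
        "Environment and natural resources management",
        "Environment and natural resources management",
        "Human development",
        "Human development",
        "Social protection and risk management",
    ],
})

# Count rows per (code, name) pair and keep the 10 most frequent.
top_themes = (
    toy_filled.groupby(["code", "name"])
    .size()
    .sort_values(ascending=False)
    .head(10)
)
print(top_themes)
```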