****
## JSON exercise

Using the data in the file 'data/world_bank_projects.json' and the techniques demonstrated above,
1. Find the 10 countries with most projects
2. Find the top 10 major project themes (using column 'mjtheme_namecode')
3. In 2. above you will notice that some entries have only the code and the name is missing. Create a dataframe with the missing names filled in.

## Importing the data 

In [50]:
# importing packages and methods
import pandas as pd
import json
# json_normalize is available at the top level in pandas 1.0+ (the pandas.io.json path is deprecated)
from pandas import json_normalize

# load world bank projects file as a data frame 
world_bank = pd.read_json('data/world_bank_projects.json')

## Finding the 10 countries with the most projects

In [51]:
# finding the top 15 locations with the most projects
top_15 = world_bank['countryshortname'].value_counts().head(15)

# displaying results
top_15

China                 19
Indonesia             19
Vietnam               17
India                 16
Yemen, Republic of    13
Nepal                 12
Morocco               12
Bangladesh            12
Africa                11
Mozambique            11
Brazil                 9
Burkina Faso           9
Pakistan               9
Tajikistan             8
Tanzania               8
Name: countryshortname, dtype: int64

**NOTE:** 
* The results include Africa, which is a continent rather than a country, so it will be removed from the series.
* Brazil, Burkina Faso, and Pakistan each have 9 projects. All three tie for the 10th position, which increases the top ten list to a total of 12 countries.
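
The tie at the 10th position can also be handled without counting rows by hand. The following is a minimal sketch, assuming a small made-up Series named `counts` (hypothetical data, not the World Bank file). It relies on `Series.nlargest` with `keep='all'`, which keeps every entry tied with the n-th largest value.

```python
import pandas as pd

# Hypothetical project counts, for illustration only
counts = pd.Series({'A': 5, 'Region X': 4, 'B': 3, 'C': 2, 'D': 2, 'E': 1})

# Drop the non-country entry, then keep the top 3 plus anything tied for 3rd place
top = counts.drop(labels='Region X').nlargest(3, keep='all')

print(top)
```

On the real data, the same pattern with `n=10` would be one way to avoid hard-coding `head(12)`.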

In [52]:
# removing Africa and keeping the 12 countries that make up the top 10 (ties included)
top_10 = top_15.drop(labels = 'Africa').head(12)

# displaying top 10 countries and their respective number of projects
top_10

China                 19
Indonesia             19
Vietnam               17
India                 16
Yemen, Republic of    13
Nepal                 12
Morocco               12
Bangladesh            12
Mozambique            11
Brazil                 9
Burkina Faso           9
Pakistan               9
Name: countryshortname, dtype: int64

## Finding the top 10 major project themes

In [53]:
# load world bank projects data as a list of dictionaries
with open('data/world_bank_projects.json') as f:
    world_bank_string = json.load(f)

# normalizing semistructured json data into a flat table
project_themes = json_normalize(data=world_bank_string, record_path='mjtheme_namecode')

# viewing first five rows
project_themes.head()

| | code | name |
|---|---|---|
| 0 | 8 | Human development |
| 1 | 11 | |
| 2 | 1 | Economic management |
| 3 | 6 | Social protection and risk management |
| 4 | 5 | Trade and integration |


In [54]:
# viewing descriptions of both columns
project_themes.describe()

| | code | name |
|---|---|---|
| count | 1499 | 1499 |
| unique | 11 | 12 |
| top | 11 | Environment and natural resources management |
| freq | 250 | 223 |


**NOTE:**
* The first five rows of the dataframe show that the name column has missing entries stored as empty strings. The summary confirms this: the code column has 11 unique values while the name column has 12, the extra one being the empty string. Consequently, the code column will be used to identify the top 10 ranked themes.
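
The blank entries can also be counted directly rather than inferred from `describe()`. The following is a minimal sketch on a small hypothetical frame named `sample` that mimics the `code`/`name` layout; the rows are made up for illustration.

```python
import pandas as pd

# Hypothetical sample mimicking the 'code'/'name' layout, for illustration only
sample = pd.DataFrame({
    'code': ['8', '11', '1', '11'],
    'name': ['Human development', '', 'Economic management',
             'Environment and natural resources management'],
})

# Number of rows whose name is an empty string
n_blank = (sample['name'] == '').sum()

# Unique names, counted with and without the empty string
n_unique_all = sample['name'].nunique()
n_unique_real = sample.loc[sample['name'] != '', 'name'].nunique()

print(n_blank, n_unique_all, n_unique_real)
```

When the count of unique names exceeds the count of unique codes by exactly one, and that gap disappears once empty strings are excluded, the empty string is the extra "name".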

In [55]:
# Ranking the codes for the various project themes
theme_codes = pd.DataFrame(project_themes['code'].value_counts())

# Naming columns and indexes
theme_codes.columns = ['Theme count']
theme_codes.index.name = 'code'

Now that the codes have been ranked, each code will be matched with its corresponding project themes. The top 10 project themes will be identified based on code rankings.

In [57]:
# Filter out rows with empty strings as names
no_empty_names = project_themes[project_themes['name'] != '']

# Removing duplicate rows
no_empty_names = no_empty_names.drop_duplicates()

# Making the 'code' column the index
no_empty_names = no_empty_names.set_index(keys = 'code')

# joining the code-to-name table with the theme_codes counts
themes = no_empty_names.join(theme_codes, how='inner')

# displaying the results for the top 10 themes
themes.sort_values(by ='Theme count', ascending = False).head(10)


| code | name | Theme count |
|---|---|---|
| 11 | Environment and natural resources management | 250 |
| 10 | Rural development | 216 |
| 8 | Human development | 210 |
| 2 | Public sector governance | 199 |
| 6 | Social protection and risk management | 168 |
| 4 | Financial and private sector development | 146 |
| 7 | Social dev/gender/inclusion | 130 |
| 5 | Trade and integration | 77 |
| 9 | Urban development | 50 |
| 1 | Economic management | 38 |


## Filling in the missing theme names

In [58]:
# creating a dictionary with codes as key and names as values 
code_to_name = themes['name'].to_dict()

def name_filler(row):
    '''Return the theme name for a row, looking it up by code when it is blank'''
    
    # reading the theme code and theme name by column label
    code = row['code']
    name = row['name']
    
    # changing name if it is an empty string
    if name == '':
        name = code_to_name[code]
    
    return name

# Changing the names of empty strings in the project_themes dataframe
project_themes['name'] = project_themes.apply(func=name_filler, axis=1)

In [59]:
# Checking unique value count to verify there are no empty strings left
project_themes['name'].value_counts()

Environment and natural resources management    250
Rural development                               216
Human development                               210
Public sector governance                        199
Social protection and risk management           168
Financial and private sector development        146
Social dev/gender/inclusion                     130
Trade and integration                            77
Urban development                                50
Economic management                              38
Rule of law                                      15
Name: name, dtype: int64

In [60]:
# Checking that the value counts for the names match the value counts for the theme codes
project_themes['name'].value_counts().tolist() == theme_codes['Theme count'].tolist()

True