# JSON examples and exercise
****
+ get familiar with packages for dealing with JSON
+ study examples with JSON strings and files 
+ work on exercise to be completed and submitted 
****
+ reference: http://pandas.pydata.org/pandas-docs/stable/io.html#io-json-reader
+ data source: http://jsonstudio.com/resources/
****

In [3]:
import pandas as pd

## imports for Python, Pandas

In [4]:
import json
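# json_normalize is exposed at the top level in pandas 1.0 and later;
# the older pandas.io.json import path is deprecated
from pandas import json_normalize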

****
## JSON exercise

Using data in file 'data/world_bank_projects.json' and the techniques demonstrated above,
1. Find the 10 countries with most projects
2. Find the top 10 major project themes (using column 'mjtheme_namecode')
3. In 2. above you will notice that some entries have only the code and the name is missing. Create a dataframe with the missing names filled in.
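
Because each entry in 'mjtheme_namecode' is a list of code/name dictionaries, the column has to be flattened before it can be counted. The sketch below shows one way to do that with `json_normalize` and its `record_path` and `meta` arguments. It assumes pandas 1.0 or later, and the `projects` records are hypothetical sample data shaped like the real file, not values taken from it.

```python
import pandas as pd

# Hypothetical sample records shaped like the World Bank project entries
projects = [
    {
        "countryshortname": "Alpha",
        "mjtheme_namecode": [
            {"code": "1", "name": "Economic management"},
            {"code": "2", "name": ""},  # name missing, only the code is present
        ],
    },
    {
        "countryshortname": "Beta",
        "mjtheme_namecode": [
            {"code": "2", "name": "Public sector governance"},
        ],
    },
]

# record_path names the nested list to expand into rows;
# meta copies a field from the parent record onto each expanded row
themes = pd.json_normalize(
    projects,
    record_path="mjtheme_namecode",
    meta=["countryshortname"],
)
print(themes)
```

Each nested dictionary becomes one row, so a project with several themes contributes several rows to the resulting table.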

In [39]:
## load the JSON data set directly into a data frame
json_df = pd.read_json('data/world_bank_projects.json')

## count how often each country appears in the project listing and keep the top 10
most_projects = json_df['countryshortname'].value_counts()[0:10]
most_projects

China                 19
Indonesia             19
Vietnam               17
India                 16
Yemen, Republic of    13
Morocco               12
Bangladesh            12
Nepal                 12
Mozambique            11
Africa                11
Name: countryshortname, dtype: int64
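
Slicing the first ten rows of `value_counts()` cuts the list at exactly ten entries, even if other countries share the count in last place. A small sketch of how ties at the cutoff could be kept with `Series.nlargest(..., keep="all")` follows. The `countries` series is hypothetical sample data, and whether ties actually occur at the cutoff in the real file is not checked here.

```python
import pandas as pd

# Hypothetical sample data: one entry per project
countries = pd.Series(
    ["A", "A", "A", "B", "B", "C", "C", "D"],
    name="countryshortname",
)

counts = countries.value_counts()

# Plain top-2: exactly two rows, one of the tied countries is dropped
top2 = counts.head(2)

# Top-2 keeping every country tied with the last place
top2_with_ties = counts.nlargest(2, keep="all")

print(top2)
print(top2_with_ties)
```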

In [156]:
# Collect the theme dictionaries of every project into one flat list.
# Each cell of 'mjtheme_namecode' holds a list of {'code': ..., 'name': ...} dictionaries.
themes = [theme for project_themes in json_df['mjtheme_namecode'] for theme in project_themes]

# normalize (flatten) the list to create a table with the codes and their respective names
mjtheme_df = json_normalize(themes)

# build a code -> name lookup from the rows whose name is NOT missing ('')
named = mjtheme_df[mjtheme_df['name'] != '']
code_to_name = dict(zip(named['code'], named['name']))

# fill in the missing names by looking up each row's code;
# assigning through .loc on the original data frame avoids the SettingWithCopyWarning
missing = mjtheme_df['name'] == ''
mjtheme_df.loc[missing, 'name'] = mjtheme_df.loc[missing, 'code'].map(code_to_name)

# count the most frequently occurring project themes (column 'name')
mjtheme_df['name'].value_counts()[0:10]

Environment and natural resources management    250
Rural development                               216
Human development                               210
Public sector governance                        199
Social protection and risk management           168
Financial and private sector development        146
Social dev/gender/inclusion                     130
Trade and integration                            77
Urban development                                50
Economic management                              38
Name: name, dtype: int64
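
Filling the missing names from the codes only works if every code corresponds to exactly one non-empty name. The sketch below checks that assumption before filling, using hypothetical sample rows (the `themes` frame and its values are illustrative, not taken from the data file).

```python
import pandas as pd

# Hypothetical sample of flattened theme rows; '' marks a missing name
themes = pd.DataFrame({
    "code": ["1", "2", "2", "1", "3"],
    "name": ["Economic management", "", "Public sector governance", "", "Rule of law"],
})

# Rows that already carry a name
named = themes[themes["name"] != ""]

# Each code should map to exactly one distinct name
names_per_code = named.groupby("code")["name"].nunique()
is_consistent = bool((names_per_code == 1).all())

# Build the lookup and fill only the rows whose name is missing
lookup = named.drop_duplicates("code").set_index("code")["name"]
filled = themes.assign(
    name=themes["name"].where(themes["name"] != "", themes["code"].map(lookup))
)

print(is_consistent)
print(filled)
```

If `is_consistent` were false, the lookup would silently pick one of several names for a code, so it is worth inspecting `names_per_code` before trusting the filled table.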