In [120]:
import os
from collections import OrderedDict

import numpy as np
import pandas as pd
import requests
import seaborn as sns
import xarray as xr

% matplotlib inline

### WB Data Analysis

**<span style="color:red; background:yellow;"></span>**

**Data & Scope:** downloaded on (date/time)
Two types of datasets:

1) Population (easy)
* years 1960-2016
* definition: total # residents (Regardless of citizenship/legal status)
* sources: (mid-year value)

    1) United Nations Population Division. World Population Prospects
    
    2) Census reports and other statistical publications from national statistical offices
    
    3) Eurostat: Demographic Statistics
    
    4) United Nations Statistical Division. Population and Vita

2) GDP (slightly more tricky)
- years: 1960 - 2016
- four types of `gdp per capita` I looked at:

    * current
    * constant
    * PPP current
    * PPP constant

**Source:** World Bank

**Assumptions/Expectations**
-

**Analysis Goals**
- Task 3 Deliverable: ADM0 population & real income estimates from `1950-2017`  **<span style="color:gray; background:lime;">DONE</span>**

- Find out which countries are missing data for which years. 
    Population **<span style="color:gray; background:lime;">DONE</span>**
    GDP
- Or alternatively, which years are missing certain *countries*

Possible convenience functions:
* Find years missing for a given country (to check important countries)
* Create filter for population > 10 mil: look at countries exit/enter data on which years. (Mike's suggesiton)
* From metadata, get a list of country code that are not *countries* to filter them from the list **<span style="color:gray; background:lime;">DONE</span>**


**Conclusion**
1. Population. Population data exists from **1960 to 2016.**
    
```  
Countries out of 217 that are missing population data (nan):

West Bank and Gaza: 1960 - 1989
Serbia: 1960 - 1989
Sint Maarten (Dutch part): 1960 -1998
Kuwait: 1992-1994
Eritrea: 2012-2016

```

2. GDP 

```
Countries out of 217 that are missing gdp data

a. GDP market constant

b. GDP PPP constant
```


**Questions**
1. How are different sources of population data used to result in one final set? (are there any overlaps between 4 sources?) i.e. what is the methodology for compilation?

2. How often are population/income data updated?



# A. Population data

Total population (absolute units) - based on national census and extrapolation and interpolation for missing values (based on data from United Nations, other census organizations, Eurostats and WB methodology). Subject to undercounting/biases for both high and low/mid income countries. 

Interpolation and extrapolation done by World Bank/UN (??-confirm the responsible party) for certain years/countries that are missing census data, or missing pre/post census information for given time frame. Uses demographic models, etc. 

In [121]:
df2 = pd.read_csv('population/API_SP.POP.TOTL_DS2_en_csv_v2.csv', skiprows=4)
#del df2['Unnamed: 62'] # remove extraneous data
df2.columns
df2.tail(3)


Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
261,South Africa,ZAF,"Population, total",SP.POP.TOTL,17396367.0,17850045.0,18322335.0,18809939.0,19308166.0,19813947.0,...,49557573.3,50255813.11,50979432.36,51729345.36,52506515.08,53311955.61,54146734.74,55011976.68,55908865.0,
262,Zambia,ZMB,"Population, total",SP.POP.TOTL,3044846.0,3140264.0,3240587.0,3345145.0,3452942.0,3563407.0,...,13082517.0,13456417.0,13850033.0,14264756.0,14699937.0,15153210.0,15620974.0,16100587.0,16591390.0,
263,Zimbabwe,ZWE,"Population, total",SP.POP.TOTL,3747369.0,3870756.0,3999419.0,4132756.0,4269863.0,4410212.0,...,13558469.0,13810599.0,14086317.0,14386649.0,14710826.0,15054506.0,15411675.0,15777451.0,16150362.0,


In [122]:
df2
df2['2017'].unique() # nan -> all 2017 values nan

array([ nan])

In [123]:
cols = ['Country Name'] + ['Country Code'] + [str(yr) for yr in range(1960, 2017)]
#print (cols)
df_pop = df2[cols]
df_pop # includes Country Name and all years (1960-2016)
df_pop.shape # 264 countries???, 58 columns 

(264, 59)

In [124]:
df_pop.head(5)   # Show top 5 rows
df_pop.isnull().sum()
# note that nan values per year changes from 1-4, 1989 and prior are all 4,
# after that there is a shifts six times:
# 4 (60-89) -> 2 (90-91)-> 3 (92-94) -> 2 (95-97) -> 1 (98-2011)-> 2 (2012-2016)-> 264 (2017)

# Compare two years (diff countries)

df_pop['1989'].isnull()
pd.isnull(df_pop).any(1).nonzero()[0]

array([ 67, 108, 125, 194, 212, 223])

NOTE on nans:

* `df.isnull().sum()`: returns total number of nan values for each column (whole df) <- what I used
* comparison of nan == np.nan to find nan values does not work (returns False)
* can use np.isnan(val) for a single value OR possibly use `apply` method for whole col (did not try)

In [125]:
df_pop.iloc[67] # Eritrea missing 2012-2016
df_pop.iloc[108] # Not classified missing all # what is this....?
df_pop.iloc[194] # West Bank and Gaza missing 1960 - 1989
df_pop.iloc[212] # Serbia missing 1960 - 1989
df_pop.iloc[223] # Sint Maarten (Dutch part) 1960 -1998
df_pop.iloc[125] # Kuwait missing 1992-1994

# doing this manual painful way for now
df_pop['1960'].isnull().sum() # 4 
df_pop['1990'].isnull().sum() # expect 2 (excluding West Bank/Serbia)
df_pop['1993'].isnull().sum() # expect 3 (Kuwait missing too)
df_pop['1995'].isnull().sum() # expect 2 (Kuwait back again)
df_pop['1999'].isnull().sum() # expect 1 Sint Maarten back
df_pop['2012'].isnull().sum() # expect 2 (Eritrea goes missing)
df_pop['2016'].isnull().sum() # expect 2

# drop unclassified country since it's contributing 0 population 
# df_pop.drop(df_pop.index[108])


2

# B. Meta data (Filter non-countries)

Both population and income dataset includes list of countries (217) plus various classifications as entry data. These include:

Task:

- get a list of country codes for *non-countries* from metadata
- use that list to filter those country codes from income or population data



In [126]:
# 263 - 46 # total # 217 countries

# WARNING: had to re-download the file because I modified it by
# opening and interacting with in via Excel. Produced error in read_csv call.

# Only Open a copy version in Excel

meta_country = pd.read_csv('./population/Metadata_Country_API_SP.POP.TOTL_DS2_en_csv_v2.csv')

meta_country.columns
del meta_country['Unnamed: 5']

In [127]:
def filter_non_countries(_df, _metadata):
    '''
    _df : pd.DataFrame
        either income or population data
        
    _metadata : pd.DataFrame
        metadata on a list of entries including countries and non-countries 
        data source is from the World Bank
        has IncomeGroup column that is not null for countries (217)
    
    '''
    _merged = _df.merge(_metadata, on='Country Code')
    
    non_country_mask = _merged['IncomeGroup'].isnull()
    merged_country_only = _merged[~non_country_mask]
    return merged_country_only

def select_relevant_cols(_df):
    _cols = ['Country Name'] + ['Country Code'] + [str(yr) for yr in range(1960, 2017)]
    return _df[_cols]

df_country_only = filter_non_countries(df2, meta_country)
select_relevant_cols(df_country_only)[200:205]

Unnamed: 0,Country Name,Country Code,1960,1961,1962,1963,1964,1965,1966,1967,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
244,Uganda,UGA,6788214.0,7006633.0,7240174.0,7487429.0,7746198.0,8014401.0,8292776.0,8580676.0,...,30590487.0,31663896.0,32771895.0,33915133.0,35093648.0,36306796.0,37553726.0,38833338.0,40144870.0,41487965.0
245,Ukraine,UKR,42662149.0,43203635.0,43749470.0,44285899.0,44794327.0,45261935.0,45682308.0,46060452.0,...,46509350.0,46258200.0,46053300.0,45870700.0,45706100.0,45593300.0,45489600.0,45271947.0,45154029.0,45004645.0
247,Uruguay,URY,2538651.0,2571690.0,2603887.0,2635129.0,2665390.0,2694537.0,2722877.0,2750093.0,...,3339741.0,3350824.0,3362755.0,3374415.0,3385624.0,3396777.0,3408005.0,3419546.0,3431552.0,3444006.0
248,United States,USA,180671000.0,183691000.0,186538000.0,189242000.0,191889000.0,194303000.0,196560000.0,198712000.0,...,301231207.0,304093966.0,306771529.0,309348193.0,311663358.0,313998379.0,316204908.0,318563456.0,320896618.0,323127513.0
249,Uzbekistan,UZB,8549493.0,8837349.0,9138097.0,9454250.0,9788986.0,10143740.0,10520879.0,10917446.0,...,26868000.0,27302800.0,27767400.0,28562400.0,29339400.0,29774500.0,30243200.0,30757700.0,31298900.0,31848200.0


In [128]:
meta_country['Region'].unique() # 7 excluding nan


'''['Latin America & Caribbean' 
 'South Asia' 
 'Sub-Saharan Africa'
 'Europe & Central Asia' /// nan 
 'Middle East & North Africa'
 'East Asia & Pacific' 
 'North America']

'''

#print(meta_country['IncomeGroup'].nunique()) #4
meta_country['IncomeGroup'].unique() 
'''['High income', 'Low income', 'Lower middle income', 'Upper middle income', nan]'''

meta_country['Country Code'].nunique() # 263 entries 
meta_country['IncomeGroup'].isnull().value_counts() # 217 False (countries) # 46 nan
# False values are countries -- i.e. all countries belong to an IncomeGroup
# Validation: we know from (World Bank 2017) that total of 217 countries were included


False    217
True      46
Name: IncomeGroup, dtype: int64

##  Playing around with data

### Getting data for country (not really used)

In [226]:
get_missing_years(df_gdp, 'Aruba')

# Count number of null values per year
df_gdp.isnull().sum()

# Count number of null values for each country
df_gdp.apply(lambda row: row.isnull(),axis=1)

df_gdp.iloc[0].isnull().sum()

56

In [202]:
df = df_pop


#     country_row = _df.loc[_df['Country Name'] == country_name]
    #df_years_only = select_relevant_cols(_df)
    
def get_missing_years(_df, country_name):
    '''
    Calculate the number of years the country has no
    data points (ie. has nan value)
    
    Parameters
    ----------
    country_name : Str
        name of given country
        
    Returns 
    -------
    Missing years in default range (1960-2017), where 
    each year is a string : List
    
    ex. ['1990',  '2005'']

'''
    # print list of years with missing data
    #print (_df.isnull().any().tolist())

    years_missing_data = _df.columns[_df.isnull().any()]
    return _df.loc[:, years_missing_data]
    #return country_row[country_row.isnull().any()].tolist()
    #return country

In [190]:
# Get countries as dataframe (subset of original) by first letter
def get_by_first_letter(df, first_letter):
    return df.loc[df['Country Name'].str.startswith(first_letter)]

# Get country by name
def get_row_by_country_name(_df, country_name):
    return _df[_df['Country Name'] == country_name]

# Get country row from dataframe by index
def get_by_index(df, country_idx):
    return df.iloc[country_idx]

# Get all countries that start with S
s_countries = get_by_first_letter(df, 'S')
#s_countries
    
# Get_by_first_letter(df, 'K') # example South Korea is index 124
# get_by_index(df, 124)


In [130]:
# Returns all values of the 'Country Name' column as an array
#df2['Country Name'].values #lists all countries
# United States
# China
# India

china = get_row_by_country_name(df2, 'China')
india = get_row_by_country_name(df2, 'India')
usa = get_row_by_country_name(df2, 'United States')
def get_first_and_last_years(_country):
    # _country is a dataframe (1 row)
    return _country[['1960', '2016']]
    
get_first_and_last_years(china)
get_first_and_last_years(usa)
get_first_and_last_years(india)

# manual confirmation: these values match online WB values (https://data.worldbank.org/indicator/SP.POP.TOTL?year_high_desc=true)

Unnamed: 0,1960,2016
107,449480608.0,1324171000.0


## Filter countries by 

# C. GDP data

### 1) PICK ONE: GDP (constant 2010 USD)


In [131]:
# Open files
# NY.GDP.PCAP.PP.KD is WDI indicator
'''
NY = national accounts: income
MKTP = market prices
PCAP = per capita
PP = purchasing power (no PP means not PP)
KD = constant (vs CD = current)
'''
# df_gdp is gdp market constant
gdp_mkt_const = pd.read_csv('./gdp/gdp_constant/API_NY.GDP.MKTP.KD_DS2_en_csv_v2.csv', skiprows=3)
gdp_meta = pd.read_csv('./gdp/gdp_constant/Metadata_Country_API_NY.GDP.MKTP.KD_DS2_en_csv_v2.csv')

In [132]:
# Get variables (column names)
gdp_mkt_const.columns # years 1960-2017
gdp_mkt_const.shape

(264, 63)

### Filter non countries (should return 217 entries) and subset only interested columns

In [133]:
gdp_country_only = filter_non_countries(gdp_mkt_const, gdp_meta)

df_gdp = select_relevant_cols(gdp_country_only)
df_gdp.shape # 217, 59

(217, 59)

### TASK: Get missing years/countries


### Ideas

1. Idea Red
    
    1) Get a list of countries that has any missing data, including # years missing
    
    2) Sort by years missing (country with largest missing data first)

2. Idea Orange (outcome data need)
    1) Check interesting countries
        Mexico
        Philipines
        USA
        Ethiopia
        Germany/West Germany
        Indonesia
        Soviet Union/Russia
        France
        Brazil
        Spain....
    
3. Idea Yellow (population as guage for important countries)
    1) Filter by population


In [134]:
df_gdp.iloc[1:5]
pd.isnull(df_gdp).any(1).nonzero() 

(array([  0,   1,   2,   3,   4,   5,   7,   8,   9,  12,  18,  19,  21,
         22,  24,  27,  28,  29,  33,  34,  42,  43,  45,  46,  47,  48,
         49,  50,  51,  52,  58,  60,  61,  65,  66,  69,  71,  72,  73,
         74,  75,  77,  78,  80,  82,  84,  85,  86,  88,  90,  91,  92,
         96,  97,  99, 101, 102, 103, 104, 106, 107, 108, 110, 111, 112,
        113, 115, 117, 118, 119, 120, 121, 122, 124, 126, 127, 128, 129,
        131, 132, 133, 134, 136, 139, 140, 147, 148, 149, 154, 155, 156,
        157, 158, 161, 162, 163, 164, 165, 167, 171, 173, 174, 175, 176,
        177, 178, 179, 180, 181, 183, 184, 186, 187, 191, 192, 193, 194,
        196, 198, 199, 200, 201, 204, 206, 207, 208, 209, 210, 211, 212, 213]),)

### TASK: Get list of years missing for a country

Super naive EDA



In [135]:
# Count nan
print(sorted(df_gdp.isnull().sum(axis=1).unique().tolist()))

[0, 1, 2, 3, 5, 6, 7, 8, 10, 11, 13, 14, 15, 16, 17, 19, 20, 21, 22, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 40, 41, 42, 43, 44, 47, 49, 56, 57]


#### Number of missing years range from 0 to 57.. T.T

There are 17 unique values

In [174]:
get_missing_years('Aruba')

Unnamed: 0,Country Name,Country Code,1960,1961,1962,1963,1964,1965,1966,1967,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
0,Aruba,ABW,54211.0,55438.0,56225.0,56695.0,57032.0,57360.0,57715.0,58055.0,...,101220.0,101353.0,101453.0,101669.0,102053.0,102577.0,103187.0,103795.0,104341.0,104822.0


In [172]:
list_of_countries = [x for x in df_gdp['Country Name'].values]
num_countries = len(list_of_countries) # 217

get_missing_years('Aruba')

countries = [{'name': list_of_countries[i], 
              'num_missing': num_nan_per_country[i],
              'years_missing': []} for i in range(0, num_countries)]
# countries

# def sort_by_missing_years(dd_gdp):
#     for 

In [160]:
# df_gdp_yrs = df_gdp.loc[:, '1960':'2016']

# df_gdp_yrs

# #df_gdp_yrs.apply(lambda s: s.value_counts(), axis=0)


# df.sum(axis=0)
# #print(len(df_1960_2017.isnull().sum(axis=1).tolist())) # 264 = # countries


In [137]:
# Write a function that counts values for each column
# Purpose: to count how many NA values are there in each row
df['1960'].value_counts()
type(df.iloc[0]['1960']) # numpy.float64


# single value check
np.isnan(df.iloc[0]['1960']) # True 



# find the countries that have nan values in 1989 and prior years, 
# but don't have 


False

### Fetching data via API

This approach is abandoned because json data is harder to manipulate for this analysis.
Pandas rocks.

In [138]:
#r = requests.get('http://api.worldbank.org/v2/countries/all/indicators/SP.POP.TOTL?format=json', auth=('user', 'pass'))

In [139]:
#r.headers['Content-Type'] # 'application/json;charset=utf-8'
#r.json()