In [64]:
import os
from collections import OrderedDict

import numpy as np
import pandas as pd
import seaborn as sns
import xarray as xr

%matplotlib inline

### WB Data Analysis


**Data & Scope:** downloaded on (date/time)
Two types of datasets:

1) Population (easy)
* years 1960-2016
* definition: total # residents (regardless of citizenship/legal status)
* sources: (mid-year value)

    1) United Nations Population Division. World Population Prospects
    
    2) Census reports and other statistical publications from national statistical offices
    
    3) Eurostat: Demographic Statistics
    
    4) United Nations Statistical Division. Population and Vita

2) GDP (slightly more tricky)
- years: 1960 - 2016
- choosing to look at `gdp` (market, constant) first.
- Other types that one can look at:
    * MKT current
    * PPP current
    * PPP constant

**Source:** World Bank

**Assumptions/Expectations**

**Analysis Goals**
- Task 3 Deliverable: ADM0 population & real income estimates from `1950-2017`  **<span style="color:gray; background:lime;">DONE</span>**

- Find out which countries are missing data for which years. 
    Population **<span style="color:gray; background:lime;">DONE</span>**
    GDP
- Or alternatively, which years are missing certain *countries*

Possible convenience functions:
* Find years missing for a given country (to check important countries)
* Create filter for population > 10 mil: look at which countries enter/exit the data in which years. (Mike's suggestion)
* From metadata, get a list of country codes that are not *countries* to filter them from the list **<span style="color:gray; background:lime;">DONE</span>**


**Conclusion**
1. Population. Population data exists from **1960 to 2016.**
    
```  
Countries out of 217 that are missing population data (nan):

West Bank and Gaza: 1960 - 1989
Serbia: 1960 - 1989
Sint Maarten (Dutch part): 1960 - 1997
Kuwait: 1992-1994
Eritrea: 2012-2016

```

2. GDP 

```
Countries out of 217 that are missing gdp data

a. GDP market constant
b. GDP PPP constant

```


**Questions**
1. How are different sources of population data used to result in one final set? (are there any overlaps between 4 sources?) i.e. what is the methodology for compilation?

2. How often are population/income data updated?


**TOC**
* A. Methodology for printing missing years

* B. Convenience functions for filtering relevant columns and country data only

* C. Population Data Analysis

* D. Income Data Analysis

## A. Methodology for printing missing years

0) Download dataframe and metadata

1) Filter irrelevant data (section B)
* `select_relevant_cols(df)`

* `filter_non_countries(df, metadata)`

2) Print missing years (nans) for all countries

* `list_of_country_idx_with_nans = get_country_idx_with_nans(df)` to get the index list of countries that have any nan data points in the years 1960 to 2016

* `print_missing_years_for_all_countries(df, list_of_country_idx_with_nans)` to print the missing years for each country in the provided index list (via `print_missing_years_for_country`)
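
The steps above can be sketched end to end on a tiny made-up dataset. This is only an illustration: the country names, codes and values are invented, and the logic is inlined rather than calling the notebook's own functions, so that the block runs on its own.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the World Bank data and metadata (all values are made up)
toy_df = pd.DataFrame({
    'Country Name': ['Aland', 'Borduria', 'World'],
    'Country Code': ['ALD', 'BRD', 'WLD'],
    'Indicator Name': ['Population, total'] * 3,
    '1960': [1.0, np.nan, 10.0],
    '1961': [2.0, np.nan, 11.0],
    '1962': [3.0, 5.0, 12.0],
})
toy_meta = pd.DataFrame({
    'Country Code': ['ALD', 'BRD', 'WLD'],
    'IncomeGroup': ['High income', 'Low income', np.nan],  # NaN marks a non-country group
})
years = [str(yr) for yr in range(1960, 1963)]

# Step 1: keep countries only, then keep name, code and year columns
merged = toy_df.merge(toy_meta, on='Country Code')
countries = merged[merged['IncomeGroup'].notnull()]
countries = countries[['Country Name', 'Country Code'] + years]

# Step 2: positional indexes of rows with at least one NaN year
idx_with_nans = np.flatnonzero(countries[years].isnull().any(axis=1))

# Collect the missing years for each of those rows
missing = {}
for i in idx_with_nans:
    row = countries.iloc[i]
    missing[row['Country Name']] = [yr for yr in years if pd.isnull(row[yr])]

print(missing)
```

The non-country row (`World`) drops out in step 1, so only real countries are checked for gaps.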

## B. Convenience functions for filtering relevant columns and country data only


Both the population and income datasets include the list of countries (217) plus various other groups (46: regional, income or political) as entries. 217 is the expected number of countries listed by the World Bank, 2017.

Convenience functions:

1) From the original WB dataset, select only columns of interest

2) Get only country-specific GDP and population data (filter out the other groups mentioned above)

### 1) Select columns of interest from data

In [65]:
def select_relevant_cols(_df):
    # returns df with 'Country Name', 'Country Code' and all years (1960 to 2016)
    _cols = ['Country Name'] + ['Country Code'] + [str(yr) for yr in range(1960, 2017)]
    return _df[_cols]

### 2) Filter out non_countries (i.e. get only country data)

In [66]:
def filter_non_countries(_df, _metadata):
    '''
    _df : pd.DataFrame
        either income or population data
        
    _metadata : pd.DataFrame
        metadata on a list of entries including countries and non-countries 
        data source is from the World Bank
        has IncomeGroup column that is not null for countries (217)
    '''
    # Country Code in both dataframes
    _merged = _df.merge(_metadata, on='Country Code')
    
    # non-countries have no IncomeGroup
    non_country_mask = _merged['IncomeGroup'].isnull()
    merged_country_only = _merged[~non_country_mask]
    return merged_country_only

## C. Population data

Total population (absolute units), based on national censuses, with extrapolation and interpolation for missing values (based on data from the United Nations, other census organizations, Eurostat and WB methodology). Subject to undercounting/biases for both high and low/mid income countries.

Interpolation and extrapolation are done by the World Bank/UN (TODO: confirm the responsible party) for certain years/countries that are missing census data, or missing pre/post-census information for a given time frame. Uses demographic models, etc.

### Initial survey of data

#### Download population data

In [67]:
# Have to run jupyter from WB directory
df2 = pd.read_csv('population/API_SP.POP.TOTL_DS2_en_csv_v2.csv', skiprows=4)

#del df2['Unnamed: 62'] # remove extraneous data
df2.columns
df2.tail(3)

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2009,2010,2011,2012,2013,2014,2015,2016,2017,Unnamed: 62
261,South Africa,ZAF,"Population, total",SP.POP.TOTL,17396367.0,17850045.0,18322335.0,18809939.0,19308166.0,19813947.0,...,50255810.0,50979430.0,51729350.0,52506520.0,53311960.0,54146730.0,55011980.0,55908865.0,,
262,Zambia,ZMB,"Population, total",SP.POP.TOTL,3044846.0,3140264.0,3240587.0,3345145.0,3452942.0,3563407.0,...,13456420.0,13850030.0,14264760.0,14699940.0,15153210.0,15620970.0,16100590.0,16591390.0,,
263,Zimbabwe,ZWE,"Population, total",SP.POP.TOTL,3747369.0,3870756.0,3999419.0,4132756.0,4269863.0,4410212.0,...,13810600.0,14086320.0,14386650.0,14710830.0,15054510.0,15411680.0,15777450.0,16150362.0,,


#### Download metadata

**WARNING**: had to re-download the file because I modified it by
opening and interacting with it in Excel, which produced an error in the `read_csv` call.

**TIP for myself**: only open a copy of the file in Excel


In [68]:
meta_country = pd.read_csv('./population/Metadata_Country_API_SP.POP.TOTL_DS2_en_csv_v2.csv')

del meta_country['Unnamed: 5'] 
meta_country.columns

Index(['Country Code', 'Region', 'IncomeGroup', 'SpecialNotes', 'TableName'], dtype='object')

In [69]:
df2
df2['2017'].unique() # nan -> all 2017 values nan

array([ nan])

### Initial survey of metadata

In [70]:
meta_country['Region'].unique() # 7 excluding nan

'''['Latin America & Caribbean' 
 'South Asia' 
 'Sub-Saharan Africa'
 'Europe & Central Asia' /// nan 
 'Middle East & North Africa'
 'East Asia & Pacific' 
 'North America']

'''

#print(meta_country['IncomeGroup'].nunique()) #4
meta_country['IncomeGroup'].unique() 
'''['High income', 'Low income', 'Lower middle income', 'Upper middle income', nan]'''

meta_country['Country Code'].nunique() # 263 entries 
meta_country['IncomeGroup'].isnull().value_counts() # 217 False (countries) # 46 nan
# False values are countries -- i.e. all countries belong to an IncomeGroup
# Validation: we know from (World Bank 2017) that total of 217 countries were included

False    217
True      46
Name: IncomeGroup, dtype: int64

### Select relevant columns

In [71]:
df2 # raw data: includes Country Name, Country Code, indicator columns and all years (1960-2017)
df2.shape # 264 entries, 63 columns

(264, 63)

In [72]:
df2.head(5)   # Show top 5 rows

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2009,2010,2011,2012,2013,2014,2015,2016,2017,Unnamed: 62
0,Aruba,ABW,"Population, total",SP.POP.TOTL,54211.0,55438.0,56225.0,56695.0,57032.0,57360.0,...,101453.0,101669.0,102053.0,102577.0,103187.0,103795.0,104341.0,104822.0,,
1,Afghanistan,AFG,"Population, total",SP.POP.TOTL,8996351.0,9166764.0,9345868.0,9533954.0,9731361.0,9938414.0,...,28004331.0,28803167.0,29708599.0,30696958.0,31731688.0,32758020.0,33736494.0,34656032.0,,
2,Angola,AGO,"Population, total",SP.POP.TOTL,5643182.0,5753024.0,5866061.0,5980417.0,6093321.0,6203299.0,...,22549547.0,23369131.0,24218565.0,25096150.0,25998340.0,26920466.0,27859305.0,28813463.0,,
3,Albania,ALB,"Population, total",SP.POP.TOTL,1608800.0,1659800.0,1711319.0,1762621.0,1814135.0,1864791.0,...,2927519.0,2913021.0,2905195.0,2900401.0,2895092.0,2889104.0,2880703.0,2876101.0,,
4,Andorra,AND,"Population, total",SP.POP.TOTL,13411.0,14375.0,15370.0,16412.0,17469.0,18549.0,...,84462.0,84449.0,83751.0,82431.0,80788.0,79223.0,78014.0,77281.0,,


In [73]:
df2_countries_only = filter_non_countries(df2, meta_country)
df2_countries_only[200:205] # a few U- countries

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2013,2014,2015,2016,2017,Unnamed: 62,Region,IncomeGroup,SpecialNotes,TableName
244,Uganda,UGA,"Population, total",SP.POP.TOTL,6788214.0,7006633.0,7240174.0,7487429.0,7746198.0,8014401.0,...,37553726.0,38833338.0,40144870.0,41487965.0,,,Sub-Saharan Africa,Low income,Fiscal year end: June 30; reporting period for...,Uganda
245,Ukraine,UKR,"Population, total",SP.POP.TOTL,42662149.0,43203635.0,43749470.0,44285899.0,44794327.0,45261935.0,...,45489600.0,45271947.0,45154029.0,45004645.0,,,Europe & Central Asia,Lower middle income,The new base year is 2010.,Ukraine
247,Uruguay,URY,"Population, total",SP.POP.TOTL,2538651.0,2571690.0,2603887.0,2635129.0,2665390.0,2694537.0,...,3408005.0,3419546.0,3431552.0,3444006.0,,,Latin America & Caribbean,High income,,Uruguay
248,United States,USA,"Population, total",SP.POP.TOTL,180671000.0,183691000.0,186538000.0,189242000.0,191889000.0,194303000.0,...,316204908.0,318563456.0,320896618.0,323127513.0,,,North America,High income,Fiscal year end: September 30; reporting perio...,United States
249,Uzbekistan,UZB,"Population, total",SP.POP.TOTL,8549493.0,8837349.0,9138097.0,9454250.0,9788986.0,10143740.0,...,30243200.0,30757700.0,31298900.0,31848200.0,,,Europe & Central Asia,Lower middle income,,Uzbekistan


In [74]:
df_pop = select_relevant_cols(df2_countries_only)
df_pop.shape

(217, 59)

### [F] Function: Get country indexes with nans

F for function

In [94]:
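def get_country_idx_with_nans(_df):
    # positional row indexes of countries with at least one NaN value
    # (Series.nonzero() was removed in newer pandas, so go through NumPy)
    return _df.isnull().any(axis=1).to_numpy().nonzero()[0]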

### Find countries/years missing population data

In [99]:
# df_pop.isnull().sum()
# note that the number of nan values per year ranges from 1 to 4; 1989 and prior are all 4,
# after that it shifts six times:
# 4 (60-89) -> 2 (90-91)-> 3 (92-94) -> 2 (95-97) -> 1 (98-2011)-> 2 (2012-2016)-> 264 (2017)

# Compare two years (diff countries)


In [101]:
list_of_country_idx_with_nans = get_country_idx_with_nans(df_pop) # same as above
list_of_country_idx_with_nans

array([ 58, 106, 161, 176, 184])

NOTE on nans:

* `df.isnull().sum()`: returns total number of nan values for each column (whole df) <- what I used
* comparison of nan == np.nan to find nan values does not work (returns False)
* can use np.isnan(val) for a single value OR possibly use `apply` method for whole col (did not try)
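
A minimal sketch of the points in this note, using made-up values. For a whole column it uses `Series.isnull()` rather than the `apply` route mentioned above, which was not tried.

```python
import numpy as np
import pandas as pd

val = np.nan

# Equality never matches NaN, even when comparing NaN with itself
eq_check = (val == np.nan)

# Checks that do work
single_check = np.isnan(val)                    # single float value
s = pd.Series([1.0, np.nan, 3.0], name='1960')  # toy column
col_mask = s.isnull()                           # boolean Series, True where NaN
n_missing = int(col_mask.sum())                 # number of NaN values in the column

print(eq_check, single_check, n_missing)
```

This is why the notebook counts missing values with `isnull().sum()` instead of comparing against `np.nan`.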

In [76]:
def print_country_name_and_index(_df, _list_of_indexes_with_nans):
    _countries = [(_df.iloc[i]['Country Name'], i) for i in _list_of_indexes_with_nans]
    print(_countries)

print_country_name_and_index(df_pop, list_of_country_idx_with_nans)

[('Eritrea', 58), ('Kuwait', 106), ('West Bank and Gaza', 161), ('Serbia', 176), ('Sint Maarten (Dutch part)', 184)]


### Define `print_missing_years_for_country`

In [102]:
def print_missing_years_for_country(_df, idx):
    '''
    Print the years for which one country has no data (i.e. NaN values)
    
    Parameters
    ----------
    _df : pandas.DataFrame
        dataframe containing countries and these columns:
        Country Name, Country Code, '1960'... '2016' 
        where year values can be either population, income, or NaN
        
    idx : int
        positional row index (country) in the dataframe, as used by `iloc`
        
    Prints 
    -------
    The country name, followed by a pandas.Index of the missing years
    (each year is a string)
    
    ex. Eritrea Index(['2012', '2013', '2014', '2015', '2016'], dtype='object')
        
    Note (additional complexity)
    ----
    Could instead return the start/end of each gap and the number of
    missing years, if the years are consecutive
    '''
    _country = _df.iloc[idx]
    _nan_years = _country.loc[_country.isnull()]
    print(_country['Country Name'], _nan_years.keys())

def print_missing_years_for_all_countries(_df, _list_of_idx):
    for _i in _list_of_idx:
        print_missing_years_for_country(_df, _i)
        
print_missing_years_for_all_countries(df_pop, list_of_country_idx_with_nans)

Eritrea Index(['2012', '2013', '2014', '2015', '2016'], dtype='object')
Kuwait Index(['1992', '1993', '1994'], dtype='object')
West Bank and Gaza Index(['1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
       '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
       '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986',
       '1987', '1988', '1989'],
      dtype='object')
Serbia Index(['1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
       '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
       '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986',
       '1987', '1988', '1989'],
      dtype='object')
Sint Maarten (Dutch part) Index(['1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
       '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
       '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986',
  

### Manual validation

In [78]:
#df_pop.iloc[58] # Eritrea missing 2012-2016
#df_pop.iloc[106] # Kuwait missing 1992-1994
#df_pop.iloc[161] # West Bank and Gaza missing 1960 - 1989
#df_pop.iloc[176] # Serbia missing 1960 - 1989
# df_pop.iloc[184] # Sint Maarten (Dutch part) 1960 -1997

In [79]:
#df_pop['1990'].isnull().sum() == 1 # expect 1 (excluding West Bank/Serbia)
#df_pop['1993'].isnull().sum() == 2 # expect 2 (Kuwait missing too)
#df_pop['1995'].isnull().sum() == 1 # expect 1 (Kuwait back again)
#df_pop['1999'].isnull().sum() # expect 0 Sint Maarten back
#df_pop['2012'].isnull().sum() # expect 1 (Eritrea goes missing)
#df_pop['2016'].isnull().sum() # expect 1

In [80]:
# doing this manual painful way for now

((df_pop['1960'].isnull().sum() == 3) 
 and (df_pop['1990'].isnull().sum() == 1) 
 and (df_pop['1993'].isnull().sum() == 2) 
 and (df_pop['1995'].isnull().sum() == 1)
 and df_pop['1999'].isnull().sum()==0
 and df_pop['2012'].isnull().sum()==1
 and df_pop['2016'].isnull().sum()==1)

True
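
Instead of checking individual years by hand, the missing years for a country could be collapsed into consecutive ranges and compared against the expected gaps. The helper below is a hypothetical sketch (`collapse_years` is not part of this notebook); it takes the year strings as they appear in the column index.

```python
def collapse_years(years):
    """Collapse year strings such as ['1992', '1993', '1994'] into [(1992, 1994)]."""
    ranges = []
    for yr in sorted(int(y) for y in years):
        if ranges and yr == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], yr)   # extend the current run
        else:
            ranges.append((yr, yr))            # start a new run
    return ranges

print(collapse_years(['1992', '1993', '1994']))
print(collapse_years(['1960', '1961', '2012', '2013']))
```

Each tuple is a (first missing year, last missing year) pair, which is the format used in the Conclusion summary.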

## Playing around with data

#### Not used / unhelpful

In [93]:
# Get countries as dataframe (subset of original) by first letter
def get_by_first_letter(df, first_letter):
    return df.loc[df['Country Name'].str.startswith(first_letter)]

# Get country by name
def get_row_by_country_name(_df, country_name):
    return _df[_df['Country Name'] == country_name]

# Get country row from dataframe by index
def get_by_index(df, country_idx):
    return df.iloc[country_idx]

# Get all countries that start with S
# (assumes the raw population frame df2; `df` is not defined in this notebook)
s_countries = get_by_first_letter(df2, 'S')
#s_countries
    
# get_by_first_letter(df, 'K') # example South Korea is index 124
# get_by_index(df, 124)


In [83]:
# Returns all values of the 'Country Name' column as an array
#df2['Country Name'].values #lists all countries
# United States
# China
# India

china = get_row_by_country_name(df2, 'China')
india = get_row_by_country_name(df2, 'India')
usa = get_row_by_country_name(df2, 'United States')
def get_first_and_last_years(_country):
    # _country is a dataframe (1 row)
    return _country[['1960', '2016']]
    
get_first_and_last_years(china)
get_first_and_last_years(usa)
get_first_and_last_years(india)

# manual confirmation: these values match online WB values (https://data.worldbank.org/indicator/SP.POP.TOTL?year_high_desc=true)

Unnamed: 0,1960,2016
107,449480608.0,1324171000.0


## D. GDP data

### 1) PICK ONE: GDP (constant 2010 USD)


In [84]:
# Open files
# WDI indicator codes, e.g. NY.GDP.PCAP.PP.KD, are read as follows
'''
NY = national accounts: income
MKTP = market prices
PCAP = per capita
PP = purchasing power parity (no PP means not PPP-adjusted)
KD = constant (vs CD = current)
'''
# gdp_mkt_const is GDP at market prices, constant 2010 USD (NY.GDP.MKTP.KD)
gdp_mkt_const = pd.read_csv('./gdp/gdp_constant/API_NY.GDP.MKTP.KD_DS2_en_csv_v2.csv', skiprows=3)
gdp_meta = pd.read_csv('./gdp/gdp_constant/Metadata_Country_API_NY.GDP.MKTP.KD_DS2_en_csv_v2.csv')

In [85]:
# Get variables (column names)
gdp_mkt_const.columns # years 1960-2017
gdp_mkt_const.shape

(264, 63)

### Filter non countries (should return 217 entries) and subset only interested columns

In [86]:
gdp_country_only = filter_non_countries(gdp_mkt_const, gdp_meta)

df_gdp = select_relevant_cols(gdp_country_only)
df_gdp.shape # 217, 59

(217, 59)

### TASK: Get missing years/countries


### Ideas

1. Idea Red
    
    1) Get a list of countries that have any missing data, including # years missing
    
    2) Sort by years missing (country with largest missing data first)

2. Idea Orange (outcome data need)

    1) Check interesting countries
        Mexico
        Philippines
        USA
        Ethiopia
        Germany/West Germany
        Indonesia
        Soviet Union/Russia
        France
        Brazil
        Spain....
    
3. Idea Yellow (population as gauge for important countries)
    1) Filter by population


In [87]:
df_gdp.iloc[1:5]
df_gdp.isnull().any(axis=1).to_numpy().nonzero()

(array([  0,   1,   2,   3,   4,   5,   7,   8,   9,  12,  18,  19,  21,
         22,  24,  27,  28,  29,  33,  34,  42,  43,  45,  46,  47,  48,
         49,  50,  51,  52,  58,  60,  61,  65,  66,  69,  71,  72,  73,
         74,  75,  77,  78,  80,  82,  84,  85,  86,  88,  90,  91,  92,
         96,  97,  99, 101, 102, 103, 104, 106, 107, 108, 110, 111, 112,
        113, 115, 117, 118, 119, 120, 121, 122, 124, 126, 127, 128, 129,
        131, 132, 133, 134, 136, 139, 140, 147, 148, 149, 154, 155, 156,
        157, 158, 161, 162, 163, 164, 165, 167, 171, 173, 174, 175, 176,
        177, 178, 179, 180, 181, 183, 184, 186, 187, 191, 192, 193, 194,
        196, 198, 199, 200, 201, 204, 206, 207, 208, 209, 210, 211, 212, 213]),)

### TASK: Get list of years missing for a country

Super naive EDA



In [88]:
# Count nan
print(sorted(df_gdp.isnull().sum(axis=1).unique().tolist()))

[0, 1, 2, 3, 5, 6, 7, 8, 10, 11, 13, 14, 15, 16, 17, 19, 20, 21, 22, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 40, 41, 42, 43, 44, 47, 49, 56, 57]


#### Number of missing years ranges from 0 to 57.. T.T

There are 42 unique values

In [89]:
get_missing_years(df_gdp, 'Aruba')

# Count number of null values per year
df_gdp.isnull().sum()

# Count number of null values for each country
df_gdp.isnull().sum(axis=1)

df_gdp.iloc[0].isnull().sum()

56
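
The `get_missing_years` helper called in this cell is not defined anywhere in the notebook as shown. Going only by the call above (a dataframe and a country name), a hypothetical reconstruction might look like the sketch below; the body is an assumption, not the original function, and the toy values are made up.

```python
import numpy as np
import pandas as pd

def get_missing_years(_df, country_name):
    """Hypothetical helper: list the year columns that are NaN for one country."""
    rows = _df[_df['Country Name'] == country_name]
    if rows.empty:
        raise KeyError(country_name)
    row = rows.iloc[0]
    # Year columns are assumed to be the all-digit column names, e.g. '1960'
    year_cols = [c for c in _df.columns if str(c).isdigit()]
    return [yr for yr in year_cols if pd.isnull(row[yr])]

toy = pd.DataFrame({
    'Country Name': ['Aruba', 'Chile'],
    'Country Code': ['ABW', 'CHL'],
    '1960': [np.nan, 1.0],
    '1961': [np.nan, 2.0],
    '1962': [5.0, 3.0],
})
print(get_missing_years(toy, 'Aruba'))
```

Returning a plain list (rather than printing) makes the result easy to count or sort afterwards.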

### Approach 1: Create a filter for countries with > 10 mil pop (Mike's Suggestion)

In [90]:
df_pop.shape

(217, 59)

In [91]:
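# NOTE: an element-wise mask keeps the original shape (non-matching cells become NaN),
# and 10**6 is 1 million, not 10 million, so this does not yet filter any rows
df_pop[df_pop > 10**6].shape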

(217, 59)
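
The shape is unchanged because masking a dataframe with a boolean dataframe does not drop rows. One possible way to get an actual row filter is to compare a single year column against the threshold. The sketch below assumes 2016 as the reference year and 10 million as the cutoff; both are choices made here for illustration, and `filter_by_population` is a hypothetical name.

```python
import numpy as np
import pandas as pd

def filter_by_population(_df_pop, year='2016', threshold=10**7):
    """Keep rows whose population in `year` exceeds `threshold` (10 million by default)."""
    return _df_pop[_df_pop[year] > threshold]

# Toy data with made-up values; NaN compares as False and is dropped
toy_pop = pd.DataFrame({
    'Country Name': ['Small', 'Large', 'Unknown'],
    'Country Code': ['SML', 'LRG', 'UNK'],
    '2016': [5.0e5, 4.2e7, np.nan],
})
big = filter_by_population(toy_pop)
print(big['Country Name'].tolist())
```

To see which countries enter or exit the filtered set over time, the same function could be called once per year column.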

In [92]:
list_of_countries = df_gdp['Country Name'].tolist()
num_countries = len(list_of_countries) # 217

# number of missing (NaN) years per country, in row order
num_nan_per_country = df_gdp.isnull().sum(axis=1).tolist()

countries = [{'name': list_of_countries[i], 
              'num_missing': num_nan_per_country[i],
              'years_missing': []} for i in range(0, num_countries)]
# countries

# def sort_by_missing_years(dd_gdp):
#     for 

In [None]:
# df_gdp_yrs = df_gdp.loc[:, '1960':'2016']

# df_gdp_yrs

# #df_gdp_yrs.apply(lambda s: s.value_counts(), axis=0)


# df.sum(axis=0)
# #print(len(df_1960_2017.isnull().sum(axis=1).tolist())) # 264 = # countries


In [None]:
# Write a function that counts values for each column
# Purpose: to count how many NA values there are in each row
# (`df` is not defined in this notebook; the GDP frame df_gdp is assumed here)
df_gdp['1960'].value_counts()
type(df_gdp.iloc[0]['1960']) # numpy.float64


# single value check
np.isnan(df_gdp.iloc[0]['1960']) # True 



# find the countries that have nan values in 1989 and prior years, 
# but don't have 
