# General Indicators Data Processing (WDI)
## Data Dictionary
| **Code**                     | **Indicator Name**                                                                                                                               |
|--------------------------|----------------------------------------------------------------------------------------------------------------------------------------------|
| SP.DYN.LE00.IN           | Life expectancy at birth, total (years)                                                                                                      |
| EG.ELC.ACCS.ZS           | Access to electricity (% of population)                                                                                                      |
| SH.STA.HYGN.ZS           | People with basic handwashing facilities including soap and water (% of population)                                                          |
| SH.H2O.SMDW.ZS           | People using safely managed drinking water services (% of population)                                                                        |
| SH.H2O.BASW.ZS           | People using at least basic drinking water services (% of population)                                                                        |
| EN.ATM.CO2E.PC           | CO2 emissions (metric tons per capita)                                                                                                       |
| EN.CO2.TRAN.ZS           | CO2 emissions from transport (% of total fuel combustion)                                                                                    |
| ER.LND.PTLD.ZS           | Terrestrial protected areas (% of total land area)                                                                                           |
| DT.DOD.PVLX.GN.ZS        | Present value of external debt (% of GNI)                                                                                                    |
| FB.ATM.TOTL.P5           | Automated teller machines (ATMs) (per 100,000 adults)                                                                                        |
| FB.CBK.BRCH.P5           | Commercial bank branches (per 100,000 adults)                                                                                                |
| FB.CBK.DPTR.P3           | Depositors with commercial banks (per 1,000 adults)                                                                                          |
| FB.CBK.BRWR.P3           | Borrowers from commercial banks (per 1,000 adults)                                                                                           |
| SG.VAW.1549.ZS           | Proportion of women subjected to physical and/or sexual violence in the last 12 months (% of women age 15-49)                                |
| SG.DMK.ALLD.FN.ZS        | Women participating in the three decisions (own health care, major household purchases, and visiting family) (% of women age 15-49)          |
| SG.DMK.SRCR.FN.ZS        | Women making their own informed decisions regarding sexual relations, contraceptive use and reproductive health care (% of women age 15-49)  |
| SH.STA.SUIC.P5           | Suicide mortality rate (per 100,000 population)                                                                                              |
| SH.STA.WASH.P5           | Mortality rate attributed to unsafe water, unsafe sanitation and lack of hygiene (per 100,000 population)                                    |
| SP.DYN.IMRT.IN           | Mortality rate, infant (per 1,000 live births)                                                                                               |
| SH.STA.BRTW.ZS           | Low-birthweight babies (% of births)                                                                                                         |
| SH.ANM.CHLD.ZS           | Prevalence of anemia among children (% of children under 5)                                                                                  |
| SH.STA.ANVC.ZS           | Pregnant women receiving prenatal care (%)                                                                                                   |
| SH.STA.FGMS.ZS           | Female genital mutilation prevalence (%)                                                                                                     |
| SH.PRV.SMOK              | Smoking prevalence, total (ages 15+)                                                                                                         |
| SH.ALC.PCAP.LI           | Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)                                         |
| IT.NET.BBND.P2           | Fixed broadband subscriptions (per 100 people)                                                                                               |
| IT.NET.USER.ZS           | Individuals using the Internet (% of population)                                                                                             |
| IT.CEL.SETS.P2           | Mobile cellular subscriptions (per 100 people)                                                                                               |
| SM.POP.REFG.OR           | Refugee population by country or territory of origin                                                                                         |
| per_si_allsi.cov_pop_tot | Coverage of social insurance programs (% of population)                                                                                      |
| VC.IHR.PSRC.P5           | Intentional homicides (per 100,000 people)                                                                                                   |
| MS.MIL.TOTL.TF.ZS        | Armed forces personnel (% of total labor force)                                                                                              |
| HD.HCI.OVRL              | Human capital index (HCI) (scale 0-1)                                                                                                        |

In [1]:
import re

import numpy as np
import pandas as pd
import pycountry

%matplotlib inline

pd.set_option('display.float_format', lambda x: '%.3f' % x)

## Load The File

In [5]:
df = pd.read_excel("../data/external/General Indicators/WDI/Data_Extract_From_World_Development_Indicators.xlsx")


In [2]:
df.sample(5)


## Standardize Country Codes

In [6]:
# Keep only the rows whose "Country Code" pycountry can resolve.
def is_valid_country(code):
    """Return True if pycountry recognises the code, False otherwise."""
    try:
        pycountry.countries.lookup(code)
        return True
    except LookupError:
        return False

# Build a boolean mask over the rows and filter the frame with it.
df = df[df['Country Code'].map(is_valid_country)]

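The filtering step in this section can be tried on a toy frame without the real extract. This is a minimal sketch: `KNOWN_CODES` and `is_country` are hypothetical stand-ins for the `pycountry.countries.lookup` call, and the `value` column is made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for pycountry: a tiny set of ISO alpha-3 codes.
KNOWN_CODES = {"SVK", "BOL", "JPN"}


def is_country(code: str) -> bool:
    """Return True when the code is in the known set."""
    return code in KNOWN_CODES


# Toy data: "WLD" is the WDI aggregate code for World, not a country.
toy = pd.DataFrame(
    {"Country Code": ["SVK", "WLD", "JPN"], "value": [1.0, 2.0, 3.0]}
)

# Boolean mask with one entry per row, then filter the frame with it.
mask = toy["Country Code"].map(is_country)
filtered = toy[mask]
```

Only the rows whose code passes the check survive; the aggregate row is dropped.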

## Standardize Indexes

In [5]:
df.rename(
    {
        "Time": "Year"
    },
    axis='columns',
    inplace=True)

In [6]:
df.set_index(["Country Code", "Year"], inplace=True)
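Indexing by country code and year lets later cells select observations by label. The sketch below illustrates this on a toy frame; the `indicator` column and its values are invented for the example and are not taken from the WDI extract.

```python
import pandas as pd

# Toy data with a made-up "indicator" column.
toy = pd.DataFrame(
    {
        "Country Code": ["SVK", "SVK", "JPN"],
        "Year": [2000, 2001, 2018],
        "indicator": [1.0, 2.0, 3.0],
    }
)

indexed = toy.set_index(["Country Code", "Year"])

# A full (country, year) key selects a single observation.
one_value = indexed.loc[("SVK", 2000), "indicator"]

# A partial key (country only) returns all years for that country.
svk_rows = indexed.loc["SVK"]
```

Here `svk_rows` is indexed by `Year` alone, because the matched first level is dropped from the result.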

## Clean Data

### Header

In [7]:
df.drop(["Time Code", "Country Name"],
        axis='columns',
        inplace=True)

In [8]:
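# Extract the bracketed indicator code from each column header;
# the capture group returns the code without the surrounding brackets.
c = [re.search(r"\[((?:\w+\.)+\w+)\]", d).group(1) for d in df.columns]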

In [9]:
# Map each original column header to its extracted indicator code.
c_names = dict(zip(df.columns, c))

In [10]:
df.rename(c_names,axis='columns',inplace=True)
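The header cleanup above can be checked in isolation. This sketch assumes the column headers have the form `Indicator name [CODE]`, which is what the regular expression implies; `extract_code` is a hypothetical helper, and its fallback of returning the header unchanged when no code is found is an added safety choice, not something the original cells do.

```python
import re

# Dotted indicator code inside square brackets, e.g. [SP.DYN.LE00.IN].
CODE_PATTERN = re.compile(r"\[((?:\w+\.)+\w+)\]")


def extract_code(header: str) -> str:
    """Return the bracketed code, or the header itself if none is found."""
    match = CODE_PATTERN.search(header)
    return match.group(1) if match else header


# Assumed header format, built from two data dictionary entries.
headers = [
    "Life expectancy at birth, total (years) [SP.DYN.LE00.IN]",
    "Coverage of social insurance programs (% of population) "
    "[per_si_allsi.cov_pop_tot]",
]

# Mapping from the original header to the bare code, ready for df.rename.
codes = {header: extract_code(header) for header in headers}
```

Because `\w` also matches underscores, codes such as `per_si_allsi.cov_pop_tot` are handled by the same pattern.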

### Data Types

In [11]:
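# Replace the '..' missing-value placeholder with NaN so the columns can be cast to float.
# np.nan is used because the np.NaN alias was removed in NumPy 2.0.
df = df.replace('..', np.nan)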

In [12]:
df = df.astype(float)
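The two type-conversion cells can be illustrated on a small column that mixes numeric strings, numbers, and the `..` placeholder. This is a sketch with made-up values; the `indicator` column name is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy column mixing a numeric string, the '..' placeholder, and a float.
raw = pd.DataFrame({"indicator": ["41.751", "..", 59.75]})

# Swap the placeholder for NaN, then cast the whole frame to float.
clean = raw.replace("..", np.nan).astype(float)

# Missing entries can now be detected with the usual pandas tools.
missing = clean["indicator"].isna().tolist()
```

If a column might contain other non-numeric strings, `pd.to_numeric(series, errors="coerce")` is an alternative that turns every unparseable entry into NaN instead of raising.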

In [13]:
df.sample(5)

| Country Code | Year | SL.TLF.ACTI.1524.FE.ZS | SL.TLF.CACT.ZS | SL.TLF.CACT.FM.ZS | SL.TLF.0714.SW.TM | SL.TLF.0714.WK.TM | SL.EMP.MPYR.ZS | SL.AGR.EMPL.ZS | SL.IND.EMPL.ZS | SL.SRV.EMPL.ZS | SL.ISV.IFRM.ZS | SL.UEM.TOTL.ZS | SL.EMP.SELF.ZS | per_lm_alllm.cov_pop_tot |
|--------------|------|------------------------|----------------|-------------------|-------------------|-------------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|--------------------------|
| SVK          | 2000 | 41.751                 | 59.75          | 77.098            | NaN               | NaN               | 2.528          | 6.935          | 37.252         | 55.813         | NaN            | 19.062         | 7.905          | NaN                      |
| BOL          | 2001 | 44.784                 | 70.81          | 73.179            | NaN               | NaN               | 2.183          | 44.204         | 15.692         | 40.104         | NaN            | 3.418          | 66.392         | NaN                      |
| JPN          | 2018 | 45.106                 | 60.732         | 72.709            | NaN               | NaN               | 1.975          | 3.409          | 24.501         | 72.089         | NaN            | 2.445          | 10.368         | NaN                      |
| CUW          | 2011 | NaN                    | NaN            | NaN               | NaN               | NaN               | NaN            | NaN            | NaN            | NaN            | NaN            | NaN            | NaN            | NaN                      |
| SWZ          | 2007 | 28.465                 | 49.305         | 55.907            | NaN               | NaN               | 1.841          | 14.376         | 25.587         | 60.037         | NaN            | 28.24          | 28.431         | NaN                      |


## Save Data

In [14]:
df.to_pickle("../data/processed/General_WDI.pickle")