# Inequality Data Processing (WDI)

## Data Dictionary

| Code           | Indicator Name                                  |
|----------------|-------------------------------------------------|
| SI.POV.GINI    | GINI index (World Bank estimate)                |
| SI.POV.RUGP    | Rural poverty gap at national poverty lines (%) |
| SI.POV.URGP    | Urban poverty gap at national poverty lines (%) |
| SI.POV.NAGP    | Poverty gap at national poverty lines (%)       |
| SI.DST.10TH.10 | Income share held by highest 10%                |
| SI.DST.FRST.10 | Income share held by lowest 10%                 |

In [1]:
import re

import numpy as np
import pandas as pd
import pycountry

%matplotlib inline

pd.set_option('display.float_format', lambda x: '%.3f' % x)

## Load The File

In [2]:
df = pd.read_excel('../data/external/Inequality/WDI/Data_Extract_From_Poverty_and_Equity.xlsx')

## Standardize Country Codes

In [3]:
# Keep only the rows whose country code pycountry can resolve.
country_locations = []
for country in df['Country Code']:
    try:
        pycountry.countries.lookup(country)
        country_locations.append(True)
    except LookupError:
        country_locations.append(False)
df = df[country_locations]

## Standardize Indexes

In [4]:
df.set_index(["Country Code", "Year"], inplace=True)

## Clean Data

### Header

In [5]:
df.head()

| Country Code | Year | Year Code | Country | GINI index (World Bank estimate) [SI.POV.GINI] | Rural poverty gap at national poverty lines (%) [SI.POV.RUGP] | Urban poverty gap at national poverty lines (%) [SI.POV.URGP] | Poverty gap at national poverty lines (%) [SI.POV.NAGP] | Income share held by highest 10% [SI.DST.10TH.10] | Income share held by lowest 10% [SI.DST.FRST.10] |
|--------------|------|-----------|-------------|--------|----|----|----|--------|-------|
| AFG          | 1994 | YR1994    | Afghanistan | ..     | .. | .. | .. | ..     | ..    |
| ALB          | 1994 | YR1994    | Albania     | ..     | .. | .. | .. | ..     | ..    |
| DZA          | 1994 | YR1994    | Algeria     | ..     | .. | .. | .. | ..     | ..    |
| AGO          | 1994 | YR1994    | Angola      | ..     | .. | .. | .. | ..     | ..    |
| ARG          | 1994 | YR1994    | Argentina   | 45.900 | .. | .. | .. | 34.400 | 1.500 |


In [6]:
df.drop(["Year Code", "Country"],
        axis='columns',
        inplace=True)

In [7]:
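# Rename each column to the indicator code inside its square brackets,
# e.g. "GINI index (World Bank estimate) [SI.POV.GINI]" -> "SI.POV.GINI".
code_pattern = re.compile(r"\[((?:\w+\.)+\w+)\]")
c_names = {col: code_pattern.search(col).group(1) for col in df.columns}
df.rename(c_names, axis='columns', inplace=True)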
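
The renaming step relies on every remaining column label ending in a bracketed indicator code. The sketch below checks that extraction on two labels copied from the `df.head()` output. The helper name `extract_indicator_code` and its fallback (returning the label unchanged when no code is found) are illustrative assumptions, not part of the notebook.

```python
import re

# Dotted indicator code in square brackets, as seen in the column labels
# of the df.head() output (e.g. "[SI.POV.GINI]").
CODE_PATTERN = re.compile(r"\[((?:\w+\.)+\w+)\]")


def extract_indicator_code(label):
    """Return the bracketed indicator code, or the label itself if none is found.

    Hypothetical helper: the fallback avoids an exception on labels
    that carry no code, such as "Country".
    """
    match = CODE_PATTERN.search(label)
    return match.group(1) if match else label


labels = [
    "GINI index (World Bank estimate) [SI.POV.GINI]",
    "Income share held by highest 10% [SI.DST.10TH.10]",
]
codes = [extract_indicator_code(label) for label in labels]
print(codes)
```

Codes that contain digits, such as `SI.DST.10TH.10`, still match because `\w` covers letters, digits, and underscores.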

### Data Types

In [8]:
# Replace the '..' missing-value marker with np.nan so every column can be cast to float.
df = df.replace('..', np.nan)
df = df.astype(float)
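
As an alternative to replacing the marker and then casting, the sketch below uses `pd.to_numeric` with `errors="coerce"`. Note that this is a different technique from the cell above: it turns any non-numeric cell into NaN, not only `'..'`. The two-row `sample` frame is a hypothetical stand-in built from the AGO and ARG rows of the `df.head()` output.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real frame: two rows taken from the
# df.head() output, with '..' marking missing values.
sample = pd.DataFrame(
    {"SI.POV.GINI": ["..", 45.9], "SI.DST.FRST.10": ["..", 1.5]},
    index=pd.MultiIndex.from_tuples(
        [("AGO", 1994), ("ARG", 1994)], names=["Country Code", "Year"]
    ),
)

# Convert column by column; anything that cannot be parsed as a number
# (including '..') becomes NaN, and each column ends up as float64.
cleaned = sample.apply(pd.to_numeric, errors="coerce")
print(cleaned.dtypes)
```

The coercing variant is more forgiving when a file contains other stray strings, but it also hides them, so the explicit `replace` is the safer choice when unexpected values should raise an error.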

## Save Data

In [9]:
df.to_pickle("../data/processed/Inequality_WDI.pickle")