# Urbanization Data Processing (UNPD)
## Data Dictionary
| Code              | Indicator Name                                                                    |
|-------------------|-----------------------------------------------------------------------------------|
| SP.URB.TOTL.IN.ZS | Urban population (% of total)                                                     |
| SP.URB.GROW       | Urban population growth (annual %)                                                |
| EN.POP.SLUM.UR.ZS | Population living in slums (% of urban population)                                |
| EN.URB.MCTY.TL.ZS | Population in urban agglomerations of more than 1 million (% of total population) |
| EN.URB.LCTY.UR.ZS | Population in the largest city (% of urban population)                            |

In [1]:
import re

import numpy as np
import pandas as pd
import pycountry

%matplotlib inline

pd.set_option('display.float_format', lambda x: '%.3f' % x)

## Load The File

In [2]:
df = pd.read_excel("../data/external/Urbanization/UNPD/Data_Extract_From_World_Development_Indicators.xlsx")

In [3]:
df.sample(5)

| | Time | Time Code | Country Name | Country Code | Urban population (% of total) [SP.URB.TOTL.IN.ZS] | Urban population growth (annual %) [SP.URB.GROW] | Population living in slums (% of urban population) [EN.POP.SLUM.UR.ZS] | Population in urban agglomerations of more than 1 million (% of total population) [EN.URB.MCTY.TL.ZS] | Population in the largest city (% of urban population) [EN.URB.LCTY.UR.ZS] |
|------|------|-----------|--------------|--------------|--------|-------|----|--------|--------|
| 355  | 1995 | YR1995 | Greenland | GRL | 80.902 | 0.785 | .. | .. | .. |
| 1948 | 2001 | YR2001 | Hong Kong SAR, China | HKG | 100.0 | 0.737 | .. | 100 | 100 |
| 4993 | 2012 | YR2012 | Least developed countries: UN classification | LDC | 30.462 | 4.133 | .. | 11.703 | 33.354 |
| 600  | 1996 | YR1996 | Dominican Republic | DOM | 58.441 | 3.17 | .. | 21.897 | 37.469 |
| 4832 | 2012 | YR2012 | Ethiopia | ETH | 18.16 | 4.996 | .. | 3.684 | 20.285 |


## Standardize Country Codes

In [4]:
# Keep only rows whose country code pycountry can resolve
country_locations = []
for country in df['Country Code']:
    try:
        pycountry.countries.lookup(country)
        country_locations.append(True)
    except LookupError:
        country_locations.append(False)
# .copy() so the later in-place operations act on an independent frame, not a slice
df = df[country_locations].copy()
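
The cell above performs one lookup per row, although each country code repeats once per year. A minimal sketch of the same filter that checks each distinct code only once is shown here. The `valid_code_mask` helper and the `fake_lookup` stand-in are hypothetical names introduced for illustration; in the notebook the `lookup` argument would presumably be `pycountry.countries.lookup`.

```python
def valid_code_mask(codes, lookup):
    """Return one boolean per code, calling `lookup` once per distinct code."""
    cache = {}
    mask = []
    for code in codes:
        if code not in cache:
            try:
                lookup(code)
                cache[code] = True
            except LookupError:
                cache[code] = False
        mask.append(cache[code])
    return mask


# Hypothetical stand-in for pycountry.countries.lookup, limited to a few codes
# so the sketch runs without pycountry installed.
KNOWN = {"GRL", "HKG", "DOM", "ETH"}


def fake_lookup(code):
    if code not in KNOWN:
        raise LookupError(code)
    return code


mask = valid_code_mask(["GRL", "LDC", "ETH", "LDC"], fake_lookup)
print(mask)  # [True, False, True, False]
```

The resulting list can be used as a boolean mask in the same way as `country_locations`.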

## Standardize Indexes

In [5]:
df.rename(
    {
        "Time": "Year"
    },
    axis='columns',
    inplace=True)

In [6]:
df.set_index(["Country Code", "Year"], inplace=True)
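
With `Country Code` and `Year` as the index, each row is addressed by a (country, year) pair. The sketch below, built from two rows of the sample output rather than the full file, shows a uniqueness check and a lookup by key. The variable names are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

COLUMN = "Urban population (% of total) [SP.URB.TOTL.IN.ZS]"

# Two rows copied from the sample output; not the full dataset.
sample = pd.DataFrame({
    "Country Code": ["DOM", "ETH"],
    "Year": [1996, 2012],
    COLUMN: [58.441, 18.16],
})
indexed = sample.set_index(["Country Code", "Year"])

# Each (country, year) pair should identify exactly one row.
is_unique = indexed.index.is_unique

# Scalar lookup using the full index key.
eth_2012 = indexed.loc[("ETH", 2012), COLUMN]
print(is_unique, eth_2012)
```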

## Clean Data

### Header

In [7]:
df.drop(["Time Code", "Country Name"],
        axis='columns',
        inplace=True)

In [8]:
# Extract the bracketed indicator code, e.g. "... [SP.URB.GROW]" -> "SP.URB.GROW"
c = [re.search(r"\[((?:\w+\.)+\w+)\]", d).group(1) for d in df.columns]

In [9]:
# Map each original column name to its indicator code
c_names = dict(zip(df.columns, c))

In [10]:
df.rename(c_names,axis='columns',inplace=True)
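
The header cleanup above reduces each column name to the indicator code in its trailing brackets. A self-contained sketch of that extraction follows. The `indicator_code` helper is a hypothetical name, and unlike the notebook cell it leaves a name unchanged when no bracketed code is found instead of raising an error.

```python
import re

# Matches a trailing "[CODE]" where CODE consists of word characters and dots.
CODE_PATTERN = re.compile(r"\[([\w.]+)\]\s*$")


def indicator_code(column_name):
    """Return the bracketed code at the end of a column name, if present."""
    match = CODE_PATTERN.search(column_name)
    return match.group(1) if match else column_name


columns = [
    "Urban population (% of total) [SP.URB.TOTL.IN.ZS]",
    "Urban population growth (annual %) [SP.URB.GROW]",
]
renames = {name: indicator_code(name) for name in columns}
for old, new in renames.items():
    print(new, "<-", old)
```

A dictionary like `renames` can be passed to `df.rename(..., axis='columns')` in the same way as `c_names`.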

### Data Types

In [11]:
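# Replace the '..' missing-value marker with NaN so the columns can be cast to float
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
df = df.replace('..', np.nan)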

In [12]:
df = df.astype(float)
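
An alternative to replacing `'..'` after loading is to declare it as a missing-value marker at read time. pandas readers such as `read_csv` and `read_excel` accept an `na_values` argument for this. The sketch below uses `read_csv` on an in-memory string built from two rows of the sample output; whether the same option behaves identically on the notebook's Excel file is an assumption that has not been checked here.

```python
import io

import pandas as pd

# Two rows taken from the sample output, reduced to two indicator columns.
raw = io.StringIO(
    "Country Code,Year,SP.URB.TOTL.IN.ZS,EN.URB.MCTY.TL.ZS\n"
    "GRL,1995,80.902,..\n"
    "DOM,1996,58.441,21.897\n"
)

# '..' is parsed as NaN, so both indicator columns load directly as floats.
parsed = pd.read_csv(raw, na_values=[".."], index_col=["Country Code", "Year"])
print(parsed.dtypes)
```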

In [13]:
df.sample(5)

| Country Code | Year | SP.URB.TOTL.IN.ZS | SP.URB.GROW | EN.POP.SLUM.UR.ZS | EN.URB.MCTY.TL.ZS | EN.URB.LCTY.UR.ZS |
|--------------|------|-------------------|-------------|-------------------|-------------------|-------------------|
| CHN          | 1998 | 33.867            | 3.908       | NaN               | 15.235            | 3.062             |
| ARM          | 2003 | 64.137            | -0.686      | NaN               | 36.314            | 56.62             |
| JPN          | 2016 | 91.457            | -0.032      | NaN               | 64.125            | 32.138            |
| GRL          | 1998 | 81.351            | 0.331       | NaN               | NaN               | NaN               |
| ROU          | 1998 | 53.311            | -0.492      | NaN               | 8.782             | 16.473            |


## Save Data

In [14]:
df.to_pickle("../data/processed/Urbanization_UNPD.pickle")