# Labor Data Processing (WDI/ILO)
## Data Dictionary
**Code**|**Indicator Name**
:-----:|:-----:
SL.TLF.ACTI.1524.FE.ZS|Labor force participation rate for ages 15-24, female (%) (modeled ILO estimate)
SL.TLF.CACT.ZS|Labor force participation rate, total (% of total population ages 15+) (modeled ILO estimate)
SL.TLF.CACT.FM.ZS|Ratio of female to male labor force participation rate (%) (modeled ILO estimate)
SL.TLF.0714.SW.TM|Average working hours of children, study and work, ages 7-14 (hours per week)
SL.TLF.0714.WK.TM|Average working hours of children, working only, ages 7-14 (hours per week)
SL.EMP.MPYR.ZS|Employers, total (% of total employment) (modeled ILO estimate)
SL.AGR.EMPL.ZS|Employment in agriculture (% of total employment) (modeled ILO estimate)
SL.IND.EMPL.ZS|Employment in industry (% of total employment) (modeled ILO estimate)
SL.SRV.EMPL.ZS|Employment in services (% of total employment) (modeled ILO estimate)
SL.ISV.IFRM.ZS|Informal employment (% of total non-agricultural employment)
SL.UEM.TOTL.ZS|Unemployment, total (% of total labor force) (modeled ILO estimate)
SL.EMP.SELF.ZS|Self-employed, total (% of total employment) (modeled ILO estimate)
per\_lm\_alllm.cov\_pop\_tot|Coverage of unemployment benefits and ALMP (% of population)

In [1]:
import re

import numpy as np
import pandas as pd
import pycountry

%matplotlib inline

pd.set_option('display.float_format', lambda x: '%.3f' % x)

## Load the File

In [2]:
df = pd.read_excel("../data/external/Labor/WDI/Data_Extract_From_World_Development_Indicators.xlsx")

In [3]:
df.sample(5)

Unnamed: 0,Time,Time Code,Country Name,Country Code,"Labor force participation rate for ages 15-24, female (%) (modeled ILO estimate) [SL.TLF.ACTI.1524.FE.ZS]","Labor force participation rate, total (% of total population ages 15+) (modeled ILO estimate) [SL.TLF.CACT.ZS]",Ratio of female to male labor force participation rate (%) (modeled ILO estimate) [SL.TLF.CACT.FM.ZS],"Average working hours of children, study and work, ages 7-14 (hours per week) [SL.TLF.0714.SW.TM]","Average working hours of children, working only, ages 7-14 (hours per week) [SL.TLF.0714.WK.TM]","Employers, total (% of total employment) (modeled ILO estimate) [SL.EMP.MPYR.ZS]",Employment in agriculture (% of total employment) (modeled ILO estimate) [SL.AGR.EMPL.ZS],Employment in industry (% of total employment) (modeled ILO estimate) [SL.IND.EMPL.ZS],Employment in services (% of total employment) (modeled ILO estimate) [SL.SRV.EMPL.ZS],Informal employment (% of total non-agricultural employment) [SL.ISV.IFRM.ZS],"Unemployment, total (% of total labor force) (modeled ILO estimate) [SL.UEM.TOTL.ZS]","Self-employed, total (% of total employment) (modeled ILO estimate) [SL.EMP.SELF.ZS]",Coverage of unemployment benefits and ALMP (% of population) [per_lm_alllm.cov_pop_tot]
702,1996,YR1996,Seychelles,SYC,..,..,..,..,..,..,..,..,..,..,..,..,..
1439,1999,YR1999,Lebanon,LBN,17.994,44.148,29.245,..,..,5.175,15.266,23.729,61.005,..,8.407,36.207,..
6246,2017,YR2017,Seychelles,SYC,..,..,..,..,..,..,..,..,..,..,..,..,..
602,1996,YR1996,"Egypt, Arab Rep.",EGY,18.956,45.917,28.172,..,..,15.830,32.702,21.966,45.333,..,9,41.453,..
6200,2017,YR2017,Malawi,MWI,61.779,77.212,89.012,..,..,1.138,72.066,8.224,19.709,..,5.468,60.813,..


## Standardize Country Codes

In [4]:
# Keep only the rows whose country code pycountry can resolve.
country_locations = []
for country in df['Country Code']:
    try:
        pycountry.countries.lookup(country)
        country_locations.append(True)
    except LookupError:
        country_locations.append(False)
df = df[country_locations]
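
The loop above performs one lookup per row, even though each country code repeats for every year. A minimal sketch of the same filter, validating each distinct code once and mapping the result back onto the rows, might look like the following. `is_valid_code` and the small `KNOWN_CODES` set are hypothetical stand-ins for the `pycountry.countries.lookup` call, used only so the example runs without that dependency, and the `demo` frame is made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for pycountry: a tiny set of ISO 3166-1 alpha-3 codes.
KNOWN_CODES = {"JPN", "LBN", "MWI"}


def is_valid_code(code):
    """Return True when `code` is a recognized country code (stand-in check)."""
    return isinstance(code, str) and code in KNOWN_CODES


# Illustrative rows: "WLD" plays the role of a non-country code, None a blank row.
demo = pd.DataFrame({
    "Country Code": ["JPN", "WLD", "JPN", None, "MWI"],
    "Time": [2017, 2017, 2018, 2018, 2017],
})

# Validate each distinct code once instead of once per row.
unique_codes = demo["Country Code"].dropna().unique()
validity = {code: is_valid_code(code) for code in unique_codes}

# Codes missing from the dictionary (including blanks) count as invalid.
mask = demo["Country Code"].map(lambda c: validity.get(c, False)).astype(bool)

filtered = demo[mask]
```

In the notebook itself, `is_valid_code` would wrap the `try`/`except LookupError` block shown above; the rest of the pattern stays the same.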

## Standardize Indexes

In [5]:
df.rename(
    {
        "Time": "Year"
    },
    axis='columns',
    inplace=True)

In [6]:
df.set_index(["Country Code", "Year"], inplace=True)

## Clean Data

### Header

In [7]:
df.drop(["Time Code", "Country Name"],
        axis='columns',
        inplace=True)

In [8]:
# Pull the bracketed indicator code (e.g. "SL.UEM.TOTL.ZS") out of each column header.
c = [re.search(r"\[((?:\w+\.)+\w+)\]", d)[1] for d in df.columns]

In [9]:
# Map each original header to its extracted indicator code.
c_names = dict(zip(df.columns, c))

In [10]:
df.rename(c_names,axis='columns',inplace=True)
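
The header step relies on every column name containing a bracketed code; `re.search` returns `None` otherwise, and indexing that result raises a `TypeError`. As a hedged sketch, a variant that falls back to the original header could be written as below. `extract_code`, `CODE_PATTERN`, and the `"Notes"` column are hypothetical names introduced for illustration, and the pattern is a slightly looser alternative to the one used in the notebook, not the notebook's own.

```python
import re

# Matches a bracketed code such as "[SL.UEM.TOTL.ZS]" at the end of a header.
CODE_PATTERN = re.compile(r"\[([\w.]+)\]\s*$")


def extract_code(header):
    """Return the bracketed code, or the header unchanged when none is present."""
    match = CODE_PATTERN.search(header)
    return match.group(1) if match else header


headers = [
    "Unemployment, total (% of total labor force) (modeled ILO estimate) [SL.UEM.TOTL.ZS]",
    "Coverage of unemployment benefits and ALMP (% of population) [per_lm_alllm.cov_pop_tot]",
    "Notes",  # hypothetical column without a code
]

# Dictionary in the shape expected by DataFrame.rename(columns=...).
column_map = {h: extract_code(h) for h in headers}
```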

### Data Types

In [11]:
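# Replace the '..' missing-value marker with np.nan so the columns can be cast to float.
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0.)
df = df.replace('..', np.nan)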

In [12]:
df = df.astype(float)
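
The two cells above replace the `..` marker and then cast everything to `float`. An alternative sketch, assuming any non-numeric cell should be treated as missing, uses `pd.to_numeric` with `errors="coerce"` to do both in one pass. The `raw` frame below is a made-up example that borrows a few values from the sample output.

```python
import numpy as np
import pandas as pd

# Small illustrative frame mixing numeric strings, the '..' marker, and an integer.
raw = pd.DataFrame({
    "SL.UEM.TOTL.ZS": ["8.407", "..", 9],
    "SL.EMP.SELF.ZS": ["36.207", "41.453", ".."],
})

# errors="coerce" turns every value that cannot be parsed as a number into NaN.
numeric = raw.apply(pd.to_numeric, errors="coerce")
```

The trade-off is that coercion also hides unexpected strings, whereas `astype(float)` fails loudly on anything other than the replaced marker.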

In [13]:
df.sample(5)

Country Code,Year,SL.TLF.ACTI.1524.FE.ZS,SL.TLF.CACT.ZS,SL.TLF.CACT.FM.ZS,SL.TLF.0714.SW.TM,SL.TLF.0714.WK.TM,SL.EMP.MPYR.ZS,SL.AGR.EMPL.ZS,SL.IND.EMPL.ZS,SL.SRV.EMPL.ZS,SL.ISV.IFRM.ZS,SL.UEM.TOTL.ZS,SL.EMP.SELF.ZS,per_lm_alllm.cov_pop_tot
SVK,2000,41.751,59.75,77.098,,,2.528,6.935,37.252,55.813,,19.062,7.905,
BOL,2001,44.784,70.81,73.179,,,2.183,44.204,15.692,40.104,,3.418,66.392,
JPN,2018,45.106,60.732,72.709,,,1.975,3.409,24.501,72.089,,2.445,10.368,
CUW,2011,,,,,,,,,,,,,
SWZ,2007,28.465,49.305,55.907,,,1.841,14.376,25.587,60.037,,28.24,28.431,


## Save Data

In [14]:
df.to_pickle("../data/processed/Labor_WDI.pickle")
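
Because the frame is saved with its `(Country Code, Year)` index, a downstream notebook can load it and address rows directly by that pair. The following is a minimal round-trip sketch on a toy frame; the temporary file name is hypothetical, and the two unemployment values are taken from the sample output above.

```python
import os
import tempfile

import pandas as pd

# Toy frame with the same (Country Code, Year) index layout as the processed data.
demo = pd.DataFrame(
    {"SL.UEM.TOTL.ZS": [19.062, 2.445]},
    index=pd.MultiIndex.from_tuples(
        [("SVK", 2000), ("JPN", 2018)], names=["Country Code", "Year"]
    ),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.pickle")  # hypothetical file name
    demo.to_pickle(path)
    restored = pd.read_pickle(path)

# The MultiIndex survives the round trip, so rows can be selected by (code, year).
jpn_2018 = restored.loc[("JPN", 2018), "SL.UEM.TOTL.ZS"]
```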