# Costa Rican Household Poverty Level Prediction

Tutorial: Easy EDA + Baseline Modeling for Beginners

## *This is a work in progress.*
  
   
### Contents 
0. Introduction  
    0.0 Goal  
    0.1 Core Data description from Kaggle  
    0.2 reference  
1. Load libraries and dataset
2. An overview of dataset  
    2.1 Glimpse of train and test dataset  
    2.2 Summary statistics  
    2.3 Check for missing values  
    2.4 Target variables  
3. Data exploration  
    3.1 Integer Columns  
    3.2 Float Columns  
    3.3 Object Columns  
    
---
## 0. Introduction 
---

### 0.0 Goal
The goal is to predict the poverty level of households.

### 0.1  Core Data description from Kaggle
  
**Id** - a unique identifier for each row.  
**Target** - the target is an **ordinal** variable indicating groups of income levels.   
1 = extreme poverty  
2 = moderate poverty   
3 = vulnerable households  
4 = non vulnerable households  
  
**idhogar** - this is a unique identifier for each household. This can be used to create household-wide features, etc.  All rows in a given household will have a matching value for this identifier.  
**parentesco1** - indicates if this person is the head of the household.  
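
As a rough illustration of how `idhogar` could be used to create household-wide features, the sketch below aggregates a per-person column to the household level and merges the result back onto each row. The toy values and the feature names `hh_size`, `escolari_mean` and `escolari_max` are made up for this example and are not part of the competition data.

```python
import pandas as pd

# Toy data: made-up values, not the competition data
people = pd.DataFrame({
    "idhogar": ["a", "a", "a", "b", "b"],
    "parentesco1": [1, 0, 0, 1, 0],
    "escolari": [6, 9, 0, 11, 12],
})

# One row per household: size and simple education aggregates
household = (
    people.groupby("idhogar")
    .agg(hh_size=("escolari", "size"),
         escolari_mean=("escolari", "mean"),
         escolari_max=("escolari", "max"))
    .reset_index()
)

# Attach the household-wide features back onto each person row
people = people.merge(household, on="idhogar", how="left")
print(people)
```

Every row of a household ends up with the same aggregate values, which matches the idea that all rows in a household share one identifier.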

### 0.2 reference
I referred to several fabulous kernels. Thanks to the authors!  
https://www.kaggle.com/shivamb/costa-rica-poverty-exploration-baseline-model  
https://www.kaggle.com/willkoehrsen/a-complete-introduction-and-walkthrough



---
## 1. Load libraries & dataset
---


Load libraries and dataset.

In [None]:
#Load libraries
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

sns.set(font_scale=2.2)
plt.style.use('seaborn')

from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler, OneHotEncoder
from sklearn.model_selection import StratifiedKFold, train_test_split, ShuffleSplit
from sklearn.metrics import f1_score
import itertools
import lightgbm as lgb
import xgboost as xgb
from xgboost import XGBClassifier
import shap
from tqdm import tqdm
import featuretools as ft
import time
from datetime import date
import random 
import warnings
import operator

from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
from plotly import tools
import plotly.figure_factory as ff

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')
init_notebook_mode(connected=True)

#Load dataset
df_train = pd.read_csv('../input/train.csv')
df_test = pd.read_csv('../input/test.csv')

---
## 2. An overview of dataset
---


### 2.1 Glimpse of Train and Test Dataset

In [None]:
print('Train Dataset shape:', df_train.shape)
print('Test Dataset shape:', df_test.shape)

**Train Dataset** : ID + 141 independent variables + Target variable  
**Test Dataset** : ID + 141 independent variables
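
One quick way to confirm that the two sets differ only by the target column is to compare their column names. The following is a minimal sketch on toy frames; the values are made up and only a few columns are shown.

```python
import pandas as pd

# Toy frames standing in for the real train and test sets
train = pd.DataFrame({"Id": ["ID_1"], "v2a1": [190000.0], "rooms": [3], "Target": [4]})
test = pd.DataFrame({"Id": ["ID_2"], "v2a1": [None], "rooms": [5]})

# Columns present in one set but not the other
only_in_train = sorted(set(train.columns) - set(test.columns))
only_in_test = sorted(set(test.columns) - set(train.columns))
print("Only in train:", only_in_train)
print("Only in test:", only_in_test)
```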

In [None]:
print ("Train Dataset: ")
df_train.head()

In [None]:
print ("Test Dataset: ")
df_test.head()

### 2.2 Summary statistics

In [None]:
print ("Summary Statistics of Train Dataset: ")
df_train.describe()

### 2.3 Check for missing values

Check both the count and the percentage of missing values.

In [None]:
# Columns that contain at least one missing value
null_counts = df_train.isnull().sum()
cols_with_nan = null_counts[null_counts != 0]
print("Total Training Features with NaN values = " + str(cols_with_nan.size))
if cols_with_nan.size:
    print("Features with NaN => {}".format(list(cols_with_nan.index)))
    # A bare expression inside an if block is not displayed, so print it
    print(cols_with_nan.sort_values(ascending=False))

Five columns (`v2a1`, `v18q1`, `rez_esc`, `meaneduc`, and `SQBmeaned`) have missing values. Let's get the count and percentage of null values for each.

In [None]:
print ("Top Columns having missing values")
count = df_train.isnull().sum().sort_values(ascending = False)
percent = 100 * (df_train.isnull().sum() / df_train.isnull().count()).sort_values(ascending=False)
missing_df = pd.concat([count, percent], axis=1, keys=['Count', 'Percent'])
missing_df.head(5)


In [None]:
import missingno as msno
msno.matrix(df_train[['v2a1', 'v18q1', 'rez_esc', 'meaneduc', 'SQBmeaned']], color=(0.42, 0.6, 0.4))
plt.show()
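
The same count-and-percent table can be wrapped in a small helper so that it can be reused, for example on the test set. This is only a sketch: `missing_summary` is a hypothetical name, and the toy frame below uses made-up values.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return count and percent of missing values per column, worst first."""
    count = df.isnull().sum()
    percent = 100 * df.isnull().mean()
    summary = pd.concat([count, percent], axis=1, keys=["Count", "Percent"])
    # Keep only columns that actually have missing values
    return summary[summary["Count"] > 0].sort_values("Count", ascending=False)

# Toy frame with made-up values
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1, 2, 3, 4],
    "c": [np.nan, 1.0, 2.0, 3.0],
})
print(missing_summary(toy))
```

Calling `missing_summary(df_test)` would then give the matching table for the test set.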

### 2.4 Target variables 
  
The target is an ordinal variable representing poverty levels as follows:   
1 = extreme poverty  
2 = moderate poverty  
3 = vulnerable households  
4 = non vulnerable households


In [None]:
# Value counts of target
print ("Value counts of target")
df_train_target_counts = df_train['Target'].value_counts().sort_index()
df_train_target_counts

In [None]:
# Value counts of target - bar plot
levels = ["Extreme Poverty", "Moderate Poverty", "Vulnerable", "Non vulnerable"]
trace = go.Bar(y=df_train_target_counts, x=levels, marker=dict(color='orange', opacity=0.6))
layout = dict(title="Household Poverty Levels", margin=dict(l=200), width=800, height=400)
data = [trace]
fig = go.Figure(data=data, layout=layout)
iplot(fig)

Next, `idhogar` is a unique identifier for each household and `parentesco1` indicates if this person is the head of the household.  

In [None]:
df_train['idhogar'].nunique()

In [None]:
df_train['idhogar'].value_counts()

In [None]:
df_train['parentesco1'].value_counts()

Consider the subset where the variable `parentesco1` is 1.

In [None]:
# the subset with parentesco1 == 1
print ("Value counts of target")
df_train_head = df_train.loc[(df_train['Target'].notnull()) & (df_train['parentesco1'] == 1), ['Target', 'idhogar']]

# Value counts of target when parentesco1 == 1
df_train_target_counts = df_train_head['Target'].value_counts().sort_index()
df_train_target_counts

In [None]:
levels = ["Extreme Poverty", "Moderate Poverty", "Vulnerable", "Non vulnerable"]
trace = go.Bar(y=df_train_target_counts, x=levels, marker=dict(color='orange', opacity=0.6))
layout = dict(title="Household Poverty Levels", margin=dict(l=100), width=800, height=400)
data = [trace]
fig = go.Figure(data=data, layout=layout)
iplot(fig)
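
The head-of-household subset represents every household only if each `idhogar` has exactly one row with `parentesco1 == 1`. That is an assumption worth checking rather than something the data description guarantees. A small sketch of such a check on toy data (values made up):

```python
import pandas as pd

# Toy data: household "b" has two heads, household "c" has none
toy = pd.DataFrame({
    "idhogar": ["a", "a", "b", "b", "c"],
    "parentesco1": [1, 0, 1, 1, 0],
})

# Number of rows flagged as head within each household
heads_per_household = toy.groupby("idhogar")["parentesco1"].sum()

no_head = heads_per_household[heads_per_household == 0].index.tolist()
multiple_heads = heads_per_household[heads_per_household > 1].index.tolist()
print("Households without a head:", no_head)
print("Households with several heads:", multiple_heads)
```

On the real data the same two lists could be built from `df_train`; households that show up in either list would need separate handling before relying on the head-only subset.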

---
## 3. Data exploration
---

We already know that, excluding `Id` and `Target`, there are 141 independent variables in both the train set and the test set. Let's start the exploration by checking the variable types.

In [None]:
df_train.info()

In [None]:
df_test.info()

There are 130 integer columns, 8 float columns and 5 object columns in the train set, and 129 integer columns, 8 float columns and 5 object columns in the test set. The integer columns likely consist of Boolean and ordinal variables.   

To sum up:
 
**Train Dataset** : total 143 columns  
- 130 integer columns  (Target + other columns)
- 8 float columns  
- 5 object columns  (Id + other columns)
  
**Test Dataset** :  total 142 columns   
- 129 integer columns
- 8 float columns
- 5 object columns (Id + other columns)
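
The per-type counts can also be computed directly instead of being read off the `info()` output. A sketch on a toy frame with made-up values:

```python
import pandas as pd

# Toy frame with one string, two integer and one float column
toy = pd.DataFrame({
    "Id": ["ID_1", "ID_2"],
    "rooms": [3, 4],
    "meaneduc": [10.0, 7.5],
    "Target": [4, 2],
})

# Number of columns per dtype
dtype_counts = toy.dtypes.astype(str).value_counts()
print(dtype_counts)
```

The same expression applied to `df_train` and `df_test` should reproduce the integer, float and object totals listed above.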
  
 

Let's start with some overview plots for each data type to explore the data set. 
- Integer columns  
- Float columns  
- Object columns  
  
### 3.1 Integer Columns

In [None]:
# Count of Unique Values in Integer Columns
df_train_int_count = df_train.select_dtypes(np.int64).nunique().value_counts().sort_index()

# x = number of unique values, y = how many integer columns have that many
trace = go.Bar(x=df_train_int_count.index, y=df_train_int_count.values, marker=dict(color='blue', opacity=0.8))
layout = dict(title="Count of Unique Values in Integer Columns", margin=dict(l=100), width=800, height=400,
              xaxis=dict(title='Number of Unique Values'), yaxis=dict(title="Count")
             )
data = [trace]
fig = go.Figure(data=data, layout=layout)
iplot(fig)

### 3.2 Float Columns

Float columns represent continuous variables. We can create distribution plots to see whether the variables differ noticeably across household poverty levels.

In [None]:
# distributions of the float columns by the target 
from collections import OrderedDict  # keeps keys and values in a fixed order

plt.figure(figsize = (20, 16))
plt.style.use('fivethirtyeight')

# Color mapping
colors = OrderedDict({1: 'red', 2: 'orange', 3: 'purple', 4: 'green'})
poverty_mapping = OrderedDict({1: 'Extreme', 2: 'Moderate', 3: 'Vulnerable', 4: 'Non Vulnerable'})

# Iterate through the float columns
for i, col in enumerate(df_train.select_dtypes('float')):
    ax = plt.subplot(4, 2, i + 1)
    # Iterate through the poverty levels
    for poverty_level, color in colors.items():
        # Plot each poverty level as a separate line
        sns.kdeplot(df_train.loc[df_train['Target'] == poverty_level, col].dropna(), 
                    ax = ax, color = color, label = poverty_mapping[poverty_level])
        
    plt.title(f'{col.capitalize()} Distribution'); plt.xlabel(f'{col}'); plt.ylabel('Density')

plt.subplots_adjust(top = 2)

The graphs hint at how each variable relates to the Target. For example, `meaneduc`, the average education of the adults in the household, appears to be related to the poverty level: higher average adult education is associated with higher values of the target, which correspond to less severe levels of poverty. 
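
A numeric summary can back up what a density plot suggests. The sketch below compares a float column across poverty levels with a grouped mean and median. The toy values are invented purely to make the example runnable and say nothing about the real data.

```python
import pandas as pd

# Toy data: made-up education values for each poverty level
toy = pd.DataFrame({
    "Target": [1, 1, 2, 2, 3, 3, 4, 4],
    "meaneduc": [4.0, 6.0, 6.0, 8.0, 7.0, 9.0, 10.0, 14.0],
})

# Mean, median and group size per poverty level
summary = toy.groupby("Target")["meaneduc"].agg(["mean", "median", "count"])
print(summary)
```

Replacing `toy` with `df_train` would give the corresponding table for the real data; missing values are skipped by the mean and median automatically.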

### 3.3 Object Columns

In [None]:
df_train.select_dtypes('object').head()

The `Id` and `idhogar` columns are identifiers. The others, however, mix strings and numbers, so we need to preprocess them before applying any machine learning technique.  

According to the data description from Kaggle, 

- dependency: Dependency rate, calculated = (number of members of the household younger than 19 or older than 64)/(number of members of the household between 19 and 64)
- edjefe: years of education of male head of household, based on the interaction of escolari (years of education), head of household and gender, yes=1 and no=0
- edjefa: years of education of female head of household, based on the interaction of escolari (years of education), head of household and gender, yes=1 and no=0

These explanations clear up the issue. For these three variables, "yes" = 1 and "no" = 0. We will correct the variables using a mapping and convert to floats.

In [None]:
mapping = {"yes": 1, "no": 0}

# Apply same operation to both train and test
for df in [df_train, df_test]:
    # Fill in the values with the correct mapping
    df['dependency'] = df['dependency'].replace(mapping).astype(np.float64)
    df['edjefa'] = df['edjefa'].replace(mapping).astype(np.float64)
    df['edjefe'] = df['edjefe'].replace(mapping).astype(np.float64)

df_train[['dependency', 'edjefa', 'edjefe']].describe()

In [None]:
plt.figure(figsize = (16, 12))

# Iterate through the converted object columns
for i, col in enumerate(['dependency', 'edjefa', 'edjefe']):
    ax = plt.subplot(3, 1, i + 1)
    # Iterate through the poverty levels
    for poverty_level, color in colors.items():
        # Plot each poverty level as a separate line
        sns.kdeplot(df_train.loc[df_train['Target'] == poverty_level, col].dropna(), 
                    ax = ax, color = color, label = poverty_mapping[poverty_level])
        
    plt.title(f'{col.capitalize()} Distribution'); plt.xlabel(f'{col}'); plt.ylabel('Density')

plt.subplots_adjust(top = 2)

To make operations like the one above a little easier, we'll join the training and testing dataframes. This matters once we start feature engineering, because we want to apply the same operations to both dataframes so that we end up with the same features. Later we can separate the sets again based on the Target, which is null for test rows.

In [None]:
# Add null Target column to test
df_test['Target'] = np.nan
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
data = pd.concat([df_train, df_test], ignore_index=True)
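
Separating the sets again could look like the following sketch, which relies on a null `Target` marking the test rows. It uses toy frames with made-up values, and it assumes that no training row has a missing `Target`; if that does not hold on the real data, a separate indicator column would be safer.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for the real train and test sets
train = pd.DataFrame({"Id": ["ID_1", "ID_2"], "rooms": [3, 4], "Target": [4, 2]})
test = pd.DataFrame({"Id": ["ID_3"], "rooms": [5]})

# Combine, with a null Target for the test rows
test["Target"] = np.nan
combined = pd.concat([train, test], ignore_index=True)

# Rows with a Target came from train; rows without came from test
train_back = combined[combined["Target"].notnull()].copy()
test_back = combined[combined["Target"].isnull()].drop(columns="Target")

# The null values turned Target into floats, so restore the integer labels
train_back["Target"] = train_back["Target"].astype(int)
print(train_back)
print(test_back)
```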