# Costa Rican Household Poverty Level Prediction
*From Kaggle ([competition link](https://www.kaggle.com/c/costa-rican-household-poverty-prediction))*
  
**By Nema Sobhani & David LaCharite**

## Summary

This project predicts the poverty level of households in Costa Rica to support income qualification, that is, determining which families need aid. The data were gathered by the *Inter-American Development Bank*.

## Imports

In [6]:
# General tools
import pandas as pd
import numpy as np

# Functions
from functions import *

# Visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from IPython.display import display
pd.options.display.max_columns = None
from pprint import pprint

# Classification
from sklearn.metrics import f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression, LogisticRegressionCV
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import GridSearchCV
import xgboost as xgb
from sklearn.feature_selection import SelectFromModel

# Feature Engineering

## Rent Prediction

We decided to use regression models to predict **rent** and fill in its missing values, with the aim of improving our ability to predict **poverty level**.

After testing tree-based models (Random Forest, XGBoost) and linear models (Linear Regression, RidgeCV, LassoCV, ElasticNetCV), we found that XGBoost gave the highest scores in predicting rent.
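
The kind of comparison described above can be sketched as follows. This is a minimal illustration, not the notebook's actual experiment: the data are synthetic (generated with `make_regression`), the names `X_demo`, `y_demo`, `candidates`, and `cv_scores` are hypothetical, and only scikit-learn regressors are included so the block runs without XGBoost installed.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the rent data (assumption: NOT the competition data)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, noise=10.0, random_state=0
)

# Candidate regressors; an xgb.XGBRegressor could be added to this dict the same way
candidates = {
    "LinearRegression": LinearRegression(),
    "RidgeCV": RidgeCV(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validated R^2 for each candidate
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2").mean()
    for name, model in candidates.items()
}

# Report from best to worst
for name, score in sorted(cv_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: R^2 = {score:.3f}")
```

Cross-validation is used here instead of a single train/test split so that the ranking depends less on one particular partition of the data.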

### DataFrame Setup

In [7]:
# Setting up new dataframe (including rent data)
df_rent = dataframe_generator(rent=True)

In [8]:
print("Missing values of explanatory variables:", df_rent.drop(columns='v2a1').isna().sum().sum())
print("Missing values of target variable (rent):", df_rent.v2a1.isna().sum())

Missing values of explanatory variables: 0
Missing values of target variable (rent): 6860


### Regression Setup

In [9]:
# Rent Prediction Function
def dataframe_generator_rent():
    
    #_______________________________
    # DATAFRAME SETUP
    #_______________________________
    
    # Setting up new dataframe (including rent data)
    df_rent = dataframe_generator(rent=True)
    
    # Remove missing values for target (rent)
    df_rent_predict = df_rent.dropna()

    
    #_______________________________
    # CLASSIFICATION SETUP
    #_______________________________
    
    # Partition explanatory and response variables
    X = df_rent_predict.drop(columns='v2a1')
    y = df_rent_predict['v2a1']

    # Split into training and test data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=12345)
    
    
    #_______________________________
    # CLASSIFICATION 
    # (using random forest because it consistently gave highest score)
    #_______________________________
    
    # XGB (rent is a continuous target, so the regressor is the appropriate estimator)
    # clf = xgb.XGBRegressor(max_depth=6, n_estimators=100, n_jobs=-1, subsample=.7)
    # clf.fit(X_train, y_train)
    # print(clf.score(X_test, y_test))
    
    # Random Forest
    clf = RandomForestRegressor()
    clf.fit(X_train, y_train)
    # print(clf.score(X_test, y_test))
    
    
    #_______________________________
    # FILL NAN USING PREDICTED VALUES FROM MODEL
    #_______________________________
    
    # Boolean mask of rows whose rent is missing
    rent_missing = df_rent['v2a1'].isna()
    
    # Predict using model
    rent_pred = clf.predict(df_rent.loc[rent_missing].drop(columns='v2a1'))
    
    # Fill NaN in the full dataframe directly
    # (.loc assignment avoids pandas' SettingWithCopyWarning)
    df_rent.loc[rent_missing, 'v2a1'] = rent_pred
    
    
    return df_rent

In [10]:
df_rent = dataframe_generator_rent()


In [11]:
df_rent.to_pickle("df_rent.pkl")

# Transformations

In [12]:
top_features = ['meaneduc', 'v2a1', 'SQBedjefe', 'overcrowding', 'SQBdependency', 'age', 'rooms', 'qmobilephone']
top_features_SQ = ["SQ_" + i for i in top_features]
top_features_LOG = ["LOG_" + i for i in top_features]

top_features + top_features_SQ + top_features_LOG

['meaneduc',
 'v2a1',
 'SQBedjefe',
 'overcrowding',
 'SQBdependency',
 'age',
 'rooms',
 'qmobilephone',
 'SQ_meaneduc',
 'SQ_v2a1',
 'SQ_SQBedjefe',
 'SQ_overcrowding',
 'SQ_SQBdependency',
 'SQ_age',
 'SQ_rooms',
 'SQ_qmobilephone',
 'LOG_meaneduc',
 'LOG_v2a1',
 'LOG_SQBedjefe',
 'LOG_overcrowding',
 'LOG_SQBdependency',
 'LOG_age',
 'LOG_rooms',
 'LOG_qmobilephone']
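
The cell above only builds the column names. One way the squared and log columns themselves might be computed is sketched below on a toy frame. The values are made up, `df_demo` and `demo_features` are hypothetical names, and the use of `np.log1p` (log of 1 + x, which keeps zeros finite) is an assumption rather than the notebook's confirmed transform.

```python
import numpy as np
import pandas as pd

# Toy frame using two of the feature names listed above (values are made up)
df_demo = pd.DataFrame({"age": [20, 35, 60], "rooms": [2, 4, 5]})
demo_features = ["age", "rooms"]

for col in demo_features:
    # Squared term
    df_demo["SQ_" + col] = df_demo[col] ** 2
    # Log term; log1p is an assumption, chosen so that zero values stay finite
    df_demo["LOG_" + col] = np.log1p(df_demo[col])

print(df_demo.columns.tolist())
```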

### Feature Selection Map

ID | Original | Squares | Logs
--- | --- | --- | --- 
1 | X | - | -
2 | - | X | -
3 | - | - | X
4 | X | X | -
5 | X | - | X
6 | - | X | X
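
The rows of the map shown above could be turned into column lists along these lines. This is only a sketch: `selection_map` and `features_for` are hypothetical names introduced for illustration, and only the six IDs listed in the table are encoded.

```python
top_features = ['meaneduc', 'v2a1', 'SQBedjefe', 'overcrowding',
                'SQBdependency', 'age', 'rooms', 'qmobilephone']
top_features_SQ = ["SQ_" + i for i in top_features]
top_features_LOG = ["LOG_" + i for i in top_features]

# (original, squares, logs) flags for each ID in the table above
selection_map = {
    1: (True, False, False),
    2: (False, True, False),
    3: (False, False, True),
    4: (True, True, False),
    5: (True, False, True),
    6: (False, True, True),
}

def features_for(selection_id):
    """Return the column names selected by a map ID (hypothetical helper)."""
    use_orig, use_sq, use_log = selection_map[selection_id]
    cols = []
    if use_orig:
        cols += top_features
    if use_sq:
        cols += top_features_SQ
    if use_log:
        cols += top_features_LOG
    return cols

# Example: ID 4 combines the original and squared features
print(features_for(4))
```

Keeping the map as data makes it straightforward to loop over every ID and fit a model on each feature subset.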