# Costa Rican Household Poverty Level Prediction

## Problem and Data Explanation
The data for this competition is provided in two files: train.csv and test.csv. The training set has 9557 rows and 143 columns while the testing set has 23856 rows and 142 columns. Each row represents one individual and each column is a feature, either unique to the individual, or for the household of the individual. The training set has one additional column, Target, which represents the poverty level on a 1-4 scale and is the label for the competition. A value of 1 is the most extreme poverty.

This is a supervised multi-class classification machine learning problem:

- Supervised: labels are provided for the training data
- Multi-class classification: labels are discrete values with 4 classes

The Target values represent poverty levels as follows:

- 1 = extreme poverty
- 2 = moderate poverty
- 3 = vulnerable households
- 4 = non-vulnerable households

Objectives:
The objective of this kernel is to fit the following estimators with default parameters and record their accuracy:

- GradientBoostingClassifier
- RandomForestClassifier
- KNeighborsClassifier
- ExtraTreesClassifier
- XGBoost
- LightGBM

Each estimator is then tuned using Bayesian optimization, and the accuracies before and after tuning are compared.
The code in this kernel is kept deliberately simple so that beginners can follow it.
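The baseline step described above can be sketched as a loop over estimators left at their defaults. This is a minimal illustration on synthetic data from `make_classification`, not the competition data; the names `X_demo`, `y_demo` and `baseline_accuracy` are made up for this example, and only the four scikit-learn estimators are included so the snippet runs without XGBoost or LightGBM installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 4-class data standing in for the competition features (assumption)
X_demo, y_demo = make_classification(n_samples=400, n_features=20, n_informative=8,
                                     n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=0, stratify=y_demo)

# Default-parameter estimators; random_state is set only for repeatability
estimators = {
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "ExtraTreesClassifier": ExtraTreesClassifier(random_state=0),
}

# Fit each estimator and record its accuracy on the held-out part
baseline_accuracy = {}
for name, est in estimators.items():
    est.fit(X_tr, y_tr)
    baseline_accuracy[name] = accuracy_score(y_te, est.predict(X_te))

# Print the estimators from most to least accurate
for name, acc in sorted(baseline_accuracy.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:28s} {acc:.3f}")
```

Keeping the results in one dictionary makes it easy to add the tuned scores later and compare them side by side.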

## Calling required libraries for the work

In [3]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib qt5

# Resampling for imbalanced classes
from imblearn.over_sampling import SMOTE, ADASYN

# Preprocessing and dimensionality reduction
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import OneHotEncoder as ohe
from sklearn.preprocessing import StandardScaler as ss
from sklearn.preprocessing import LabelEncoder
from sklearn.compose import ColumnTransformer as ct
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline, make_pipeline

# from sklearn.impute import SimpleImputer

# Metrics
import sklearn.metrics as metrics
from sklearn.metrics import confusion_matrix, average_precision_score
from sklearn.metrics import accuracy_score, f1_score
from sklearn.metrics import auc, roc_curve

# Estimators
from sklearn.ensemble import RandomForestClassifier as rf
from xgboost.sklearn import XGBClassifier
from xgboost import plot_importance
import lightgbm as lgb

# Hyperparameter tuning and model inspection
from bayes_opt import BayesianOptimization
from eli5.sklearn import PermutationImportance

# Misc
import time
import os
import gc
import random
import warnings
from scipy.stats import uniform




## Reading the data

In [4]:
pd.options.display.max_columns = 150

# Read in data
train = pd.read_csv('/Users/nagesh/BIGDATA/Costarica/train.csv')
test = pd.read_csv('/Users/nagesh/BIGDATA/Costarica/test.csv')
train.head()

Unnamed: 0,Id,v2a1,hacdor,rooms,hacapo,v14a,refrig,v18q,v18q1,r4h1,r4h2,r4h3,r4m1,r4m2,r4m3,r4t1,r4t2,r4t3,tamhog,tamviv,escolari,rez_esc,hhsize,paredblolad,paredzocalo,paredpreb,pareddes,paredmad,paredzinc,paredfibras,paredother,pisomoscer,pisocemento,pisoother,pisonatur,pisonotiene,pisomadera,techozinc,techoentrepiso,techocane,techootro,cielorazo,abastaguadentro,abastaguafuera,abastaguano,public,planpri,noelec,coopele,sanitario1,sanitario2,sanitario3,sanitario5,sanitario6,energcocinar1,energcocinar2,energcocinar3,energcocinar4,elimbasu1,elimbasu2,elimbasu3,elimbasu4,elimbasu5,elimbasu6,epared1,epared2,epared3,etecho1,etecho2,etecho3,eviv1,eviv2,eviv3,dis,male,female,estadocivil1,estadocivil2,estadocivil3,estadocivil4,estadocivil5,estadocivil6,estadocivil7,parentesco1,parentesco2,parentesco3,parentesco4,parentesco5,parentesco6,parentesco7,parentesco8,parentesco9,parentesco10,parentesco11,parentesco12,idhogar,hogar_nin,hogar_adul,hogar_mayor,hogar_total,dependency,edjefe,edjefa,meaneduc,instlevel1,instlevel2,instlevel3,instlevel4,instlevel5,instlevel6,instlevel7,instlevel8,instlevel9,bedrooms,overcrowding,tipovivi1,tipovivi2,tipovivi3,tipovivi4,tipovivi5,computer,television,mobilephone,qmobilephone,lugar1,lugar2,lugar3,lugar4,lugar5,lugar6,area1,area2,age,SQBescolari,SQBage,SQBhogar_total,SQBedjefe,SQBhogar_nin,SQBovercrowding,SQBdependency,SQBmeaned,agesq,Target
0,ID_279628684,190000.0,0,3,0,1,1,0,,0,1,1,0,0,0,0,1,1,1,1,10,,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,21eb7fcc1,0,1,0,1,no,10,no,10.0,0,0,0,1,0,0,0,0,0,1,1.0,0,0,1,0,0,0,0,1,1,1,0,0,0,0,0,1,0,43,100,1849,1,100,0,1.0,0.0,100.0,1849,4
1,ID_f29eb3ddd,135000.0,0,4,0,1,1,1,1.0,0,1,1,0,0,0,0,1,1,1,1,12,,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0e5d7a658,0,1,1,1,8,12,no,12.0,0,0,0,0,0,0,0,1,0,1,1.0,0,0,1,0,0,0,0,1,1,1,0,0,0,0,0,1,0,67,144,4489,1,144,0,1.0,64.0,144.0,4489,4
2,ID_68de51c94,,0,8,0,1,1,0,,0,0,0,0,1,1,0,1,1,1,1,11,,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,1,1,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,2c7317ea8,0,1,1,1,8,no,11,11.0,0,0,0,0,1,0,0,0,0,2,0.5,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,92,121,8464,1,0,0,0.25,64.0,121.0,8464,4
3,ID_d671db89c,180000.0,0,5,0,1,1,1,1.0,0,2,2,1,1,2,1,3,4,4,4,9,1.0,4,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,2b58d945f,2,2,0,4,yes,11,no,11.0,0,0,0,1,0,0,0,0,0,3,1.333333,0,0,1,0,0,0,0,1,3,1,0,0,0,0,0,1,0,17,81,289,16,121,4,1.777778,1.0,121.0,289,4
4,ID_d56d6f5f5,180000.0,0,5,0,1,1,1,1.0,0,2,2,1,1,2,1,3,4,4,4,11,,4,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2b58d945f,2,2,0,4,yes,11,no,11.0,0,0,0,0,1,0,0,0,0,3,1.333333,0,0,1,0,0,0,0,1,3,1,0,0,0,0,0,1,0,37,121,1369,16,121,4,1.777778,1.0,121.0,1369,4


In [5]:
for col in train.columns: 
    print(col) 

Id
v2a1
hacdor
rooms
hacapo
v14a
refrig
v18q
v18q1
r4h1
r4h2
r4h3
r4m1
r4m2
r4m3
r4t1
r4t2
r4t3
tamhog
tamviv
escolari
rez_esc
hhsize
paredblolad
paredzocalo
paredpreb
pareddes
paredmad
paredzinc
paredfibras
paredother
pisomoscer
pisocemento
pisoother
pisonatur
pisonotiene
pisomadera
techozinc
techoentrepiso
techocane
techootro
cielorazo
abastaguadentro
abastaguafuera
abastaguano
public
planpri
noelec
coopele
sanitario1
sanitario2
sanitario3
sanitario5
sanitario6
energcocinar1
energcocinar2
energcocinar3
energcocinar4
elimbasu1
elimbasu2
elimbasu3
elimbasu4
elimbasu5
elimbasu6
epared1
epared2
epared3
etecho1
etecho2
etecho3
eviv1
eviv2
eviv3
dis
male
female
estadocivil1
estadocivil2
estadocivil3
estadocivil4
estadocivil5
estadocivil6
estadocivil7
parentesco1
parentesco2
parentesco3
parentesco4
parentesco5
parentesco6
parentesco7
parentesco8
parentesco9
parentesco10
parentesco11
parentesco12
idhogar
hogar_nin
hogar_adul
hogar_mayor
hogar_total
dependency
edjefe
edjefa
meaneduc
instlevel

In [14]:
train['area_type'] = train['area1'].apply(lambda x: "urban" if x==1 else "rural")

cols = ['area_type', 'Target']
colmap = sns.light_palette("yellow", as_cmap=True)
pd.crosstab(train[cols[1]], train[cols[0]]).style.background_gradient(cmap = colmap)

area_type,rural,urban
Target,Unnamed: 1_level_1,Unnamed: 2_level_1
1,255,500
2,545,1052
3,428,781
4,1500,4496


In [17]:
cols = ['area2', 'Target']
colmap = sns.light_palette("orange", as_cmap=True)
pd.crosstab(train[cols[0]], train[cols[1]]).style.background_gradient(cmap = colmap)

Target,1,2,3,4
area2,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,500,1052,781,4496
1,255,545,428,1500
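Raw counts are hard to compare because the two areas differ in size. As a small sketch, the counts printed above can be turned into within-area shares. The `counts` frame below is typed in by hand from that output, and the names `counts` and `shares` are illustrative. On the full data, `pd.crosstab(..., normalize='columns')` should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the crosstab output above (rows: Target, columns: area)
counts = pd.DataFrame(
    {"rural": [255, 545, 428, 1500], "urban": [500, 1052, 781, 4496]},
    index=pd.Index([1, 2, 3, 4], name="Target"),
)

# Share of each poverty level within each area (each column sums to 1)
shares = counts / counts.sum(axis=0)
print(shares.round(3))
```

Read column by column, this shows how the mix of poverty levels differs between rural and urban households.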


## Explore data and perform data visualization

In [6]:
train.info()   

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9557 entries, 0 to 9556
Columns: 143 entries, Id to Target
dtypes: float64(8), int64(130), object(5)
memory usage: 10.4+ MB


There are 143 columns: 8 float64, 130 int64 and 5 object.
The object columns must be converted to numeric types before they are fed to the estimators.
The identifier columns Id and idhogar are not needed for prediction and may be dropped.
The remaining 3 object columns (dependency, edjefe and edjefa) mix numbers with yes/no strings and need to be converted to a numeric type.

In [8]:
train.select_dtypes('object').head()

Unnamed: 0,Id,idhogar,dependency,edjefe,edjefa
0,ID_279628684,21eb7fcc1,no,10,no
1,ID_f29eb3ddd,0e5d7a658,8,12,no
2,ID_68de51c94,2c7317ea8,8,no,11
3,ID_d671db89c,2b58d945f,yes,11,no
4,ID_d56d6f5f5,2b58d945f,yes,11,no


## Converting categorical objects into numeric values
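This notebook later drops dependency, edjefe and edjefa rather than converting them. If they were to be kept, one possible conversion maps "yes" to 1 and "no" to 0 and parses the remaining strings as numbers. The sketch below uses a toy frame built from the five preview rows shown above. The names `toy` and `mapping` are made up for this example, and the yes/no encoding is an assumption rather than something the data dictionary is quoted on here.

```python
import pandas as pd

# Toy rows mirroring the mixed values shown in the object-column preview
toy = pd.DataFrame({
    "dependency": ["no", "8", "8", "yes", "yes"],
    "edjefe": ["10", "12", "no", "11", "11"],
    "edjefa": ["no", "no", "11", "no", "no"],
})

# Assumed encoding: yes -> 1, no -> 0
mapping = {"yes": 1, "no": 0}
for col in ["dependency", "edjefe", "edjefa"]:
    # Swap the two strings for numbers, then parse everything as float
    toy[col] = pd.to_numeric(toy[col].map(lambda v: mapping.get(v, v))).astype(float)

print(toy.dtypes)
print(toy)
```

The same loop could be applied to both `train` and `test` so that the two frames stay aligned.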

### Fill in missing values (NULL values) with 0

In [12]:
# Number of missing values in each column
missing = pd.DataFrame(train.isnull().sum()).rename(columns = {0: 'total'})

# Fraction of rows that are missing (0 to 1, despite the column name)
missing['percent'] = missing['total'] / len(train)

missing.sort_values('percent', ascending = False).head(10)


Unnamed: 0,total,percent
rez_esc,7928,0.829549
v18q1,7342,0.768233
v2a1,6860,0.717798
SQBmeaned,5,0.000523
meaneduc,5,0.000523
Id,0,0.0
hogar_adul,0,0.0
parentesco10,0,0.0
parentesco11,0,0.0
parentesco12,0,0.0


In [5]:
train['v18q1'] = train['v18q1'].fillna(0)
test['v18q1'] = test['v18q1'].fillna(0)

In [6]:
train['v2a1'] = train['v2a1'].fillna(0)
test['v2a1'] = test['v2a1'].fillna(0)

In [7]:
train['rez_esc'] = train['rez_esc'].fillna(0)
test['rez_esc'] = test['rez_esc'].fillna(0)

### Dropping unnecessary columns

In [29]:
# Identifiers, squared helper features and the mixed-type object columns
drop_cols = ['Id', 'idhogar', 'SQBescolari', 'SQBage', 'SQBhogar_total', 'SQBedjefe',
             'SQBhogar_nin', 'SQBovercrowding', 'SQBdependency', 'SQBmeaned', 'agesq',
             'dependency', 'edjefe', 'edjefa']

train.drop(drop_cols, inplace = True, axis = 1)
test.drop(drop_cols, inplace = True, axis = 1)

In [30]:
train.shape

(9557, 130)

In [31]:
test.shape

(23856, 128)

### Dividing the data into predictors & target

In [34]:
# Select the label by name: position 129 is the helper column 'area_type'
# added for the crosstab, while 'Target' sits at position 128
y = train['Target']
y.unique()

In [36]:
# Positions 1 to 127: leaves out v2a1 (position 0), Target and area_type
X = train.iloc[:,1:128]
X.shape


(9557, 127)

### Scaling  numeric features & applying PCA to reduce features

In [37]:
scale = ss()
X = scale.fit_transform(X)
pca = PCA(0.95)
X = pca.fit_transform(X)




ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
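The error above most likely comes from meaneduc, which still has 5 missing values according to the missing-value table and was not filled along with the other columns. One way to avoid it is to impute before scaling and PCA. The sketch below chains the three steps on a toy matrix; `X_toy`, `prep` and `X_reduced` are illustrative names, and median imputation is an assumption, not necessarily what was used to obtain the final feature matrix.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy matrix with a few NaNs standing in for the real feature matrix (assumption)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 10))
X_toy[rng.integers(0, 200, size=5), 3] = np.nan

# Impute, scale, then keep enough components to explain 95% of the variance
prep = make_pipeline(SimpleImputer(strategy="median"), StandardScaler(), PCA(0.95))
X_reduced = prep.fit_transform(X_toy)

print(X_reduced.shape)
```

Because the steps live in one pipeline, the same fitted `prep` object can later transform the competition test set with `prep.transform(...)`.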

### Final features selected for modeling

In [12]:
X.shape, y.shape

((9557, 61), (9557,))

### Splitting the data into train & test 

In [22]:
X_train, X_test, y_train, y_test = train_test_split(
                                                    X,
                                                    y,
                                                    test_size = 0.2)
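The split above is random and unseeded, so the reported accuracies change from run to run, and the rare poverty levels may be unevenly represented in the two parts. A hedged variant passes `stratify` and `random_state` to `train_test_split`. The toy labels below only roughly echo the class imbalance of Target and are not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels roughly echoing the 1-4 Target scale (assumption)
rng = np.random.default_rng(0)
y_toy = rng.choice([1, 2, 3, 4], size=1000, p=[0.08, 0.17, 0.13, 0.62])
X_toy = rng.normal(size=(1000, 5))

# stratify keeps class proportions equal; random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42)

# Class proportions should match closely between the two parts
for label in [1, 2, 3, 4]:
    print(label, round((y_tr == label).mean(), 3), round((y_te == label).mean(), 3))
```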


# Modelling

## Modelling with Random Forest

In [21]:
 
model_rf = rf()

In [22]:
start = time.time()
model_rf = model_rf.fit(X_train, y_train)
end = time.time()
(end-start)/60    # elapsed time in minutes

0.015074316660563152

In [22]:
classes = model_rf.predict(X_test)

In [23]:
(classes == y_test).sum()/y_test.size 

0.729602510460251
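The manual computation above is the same quantity that `accuracy_score` returns. Because class 4 dominates the data, accuracy alone can look good even when the rarer poverty levels are predicted badly, so a macro-averaged F1 score and a confusion matrix are useful companions. The labels below are hypothetical and chosen only to illustrate the effect.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hypothetical true labels and predictions, for illustration only
y_true = np.array([4, 4, 4, 4, 4, 4, 2, 2, 3, 1])
y_pred = np.array([4, 4, 4, 4, 4, 4, 4, 2, 4, 4])

# Manual accuracy, as computed in the notebook, next to the library call
manual = (y_pred == y_true).sum() / y_true.size
acc = accuracy_score(y_true, y_pred)

# Macro F1 weights every class equally, so missed rare classes pull it down
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

# Rows are true classes 1-4, columns are predicted classes 1-4
cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3, 4])

print(manual, acc, round(macro_f1, 3))
print(cm)
```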

## Performing tuning using Bayesian Optimization.

In [24]:
# Bayesian optimization, first option: the bayes_opt package
from bayes_opt import BayesianOptimization

# Bayesian optimization, second option: scikit-optimize (skopt),
# a parameter-optimisation framework
from skopt import BayesSearchCV



In [25]:
params = { 'n_estimators': (50, 100)   }

In [26]:
bayes_cv_tuner = BayesSearchCV( estimator = model_rf,    # any estimator (rf, lgb, xgb, ...) is treated as a black box
                              search_spaces = params)   # n_estimators is searched between 50 and 100
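The tuner scores each candidate setting by cross-validation on the training data. What a single such evaluation looks like can be sketched with plain scikit-learn for the two ends of the `n_estimators` range. This is only the scoring step, not Bayesian optimization itself; the synthetic data, the 3-fold setting and the name `cv_scores` are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic 4-class data standing in for X_train, y_train (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=15, n_informative=6,
                                     n_classes=4, random_state=0)

# Mean cross-validated accuracy for the lower and upper bound of the search range
cv_scores = {}
for n in (50, 100):
    est = RandomForestClassifier(n_estimators=n, random_state=0)
    cv_scores[n] = cross_val_score(est, X_demo, y_demo, cv=3, scoring="accuracy").mean()

print(cv_scores)
```

A Bayesian tuner repeats this kind of evaluation many times, using earlier scores to decide which value to try next, which is why the tuning cell takes far longer than a single fit.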



In [27]:
# Start learning using the Bayes tuner
start = time.time()
result = bayes_cv_tuner.fit(
                           X_train,       # Note that we use normal train data
                           y_train       #  rather than lgb train-data matrix
                           #callback=status_print
                           )

end = time.time()
(end - start)/60







14.339911806583405

In [28]:
#  So what are the results?
#      Use the following estimator in future
bayes_cv_tuner.best_estimator_

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [29]:
# Which parameters did the best estimator use?
best_params = pd.Series(bayes_cv_tuner.best_params_)
best_params


n_estimators    100
dtype: int64

### Random Forest with best parameters

In [30]:
# n_estimators=100 is the tuned value; the other parameters reported by
# best_estimator_ are the library defaults, so they are not repeated here
model_rf = rf(n_estimators=100)

In [31]:
start = time.time()
model_rf = model_rf.fit(X_train, y_train)
end = time.time()
(end-start)/60

0.14395851294199627

In [34]:
classes = model_rf.predict(X_test)
classes

array([4, 4, 4, ..., 4, 4, 3])

In [33]:
(classes == y_test).sum()/y_test.size 

0.7604602510460251

### Accuracy improved from 72.96% to 76.05%

## Modelling with ExtraTreesClassifier

In [None]:

from sklearn.ensemble import ExtraTreesClassifier

In [None]:
model_etf = ExtraTreesClassifier()

In [None]:
start = time.time()
model_etf = model_etf.fit(X_train, y_train)
end = time.time()
(end-start)/60

In [None]:
classes = model_etf.predict(X_test)

classes

In [None]:
(classes == y_test).sum()/y_test.size

## Performing tuning using Bayesian Optimization.

## Modelling with KNeighborsClassifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
model_neigh = KNeighborsClassifier(n_neighbors=4)

In [None]:
start = time.time()
model_neigh = model_neigh.fit(X_train, y_train)
end = time.time()
(end-start)/60

In [None]:
classes = model_neigh.predict(X_test)

classes

In [None]:
(classes == y_test).sum()/y_test.size 

## Performing tuning using Bayesian Optimization.

## Modelling with GradientBoostingClassifier

In [23]:
from sklearn.ensemble import GradientBoostingClassifier as gbm


In [24]:
model_gbm = gbm()

In [25]:
start = time.time()
model_gbm = model_gbm.fit(X_train, y_train)
end = time.time()
(end-start)/60

0.5576042850812276

In [26]:
classes = model_gbm.predict(X_test)

classes

array([4, 4, 4, ..., 2, 2, 4])

In [27]:
(classes == y_test).sum()/y_test.size 

0.6929916317991632

## Performing tuning using Bayesian Optimization.

## Modelling with XGBClassifier

In [13]:
model_xgb = XGBClassifier()

In [58]:
start = time.time()
model_xgb = model_xgb.fit(X_train, y_train)
end = time.time()
(end-start)/60

0.351248820622762

In [59]:
classes = model_xgb.predict(X_test)

classes

array([4, 4, 4, ..., 2, 2, 4])

In [60]:
(classes == y_test).sum()/y_test.size 

0.6814853556485355

## Performing tuning using Bayesian Optimization.

In [18]:
# Bayesian optimization, first option: the bayes_opt package
from bayes_opt import BayesianOptimization

# Bayesian optimization, second option: scikit-optimize (skopt),
# a parameter-optimisation framework
from skopt import BayesSearchCV



In [23]:
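# Start learning using the Bayes tuner
# (bayes_cv_tuner is assumed to have been rebuilt with model_xgb as its estimator; that cell is not shown)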
start = time.time()
result = bayes_cv_tuner.fit(
                           X_train,       # Note that we use normal train data
                           y_train       #  rather than lgb train-data matrix
                           #callback=status_print
                           )

end = time.time()
(end - start)/60







37.979203498363496

In [24]:
bayes_cv_tuner.best_estimator_

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='multi:softprob', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [25]:
best_params = pd.Series(bayes_cv_tuner.best_params_)
best_params



n_estimators    100
dtype: int64

## Modelling with LightGBM

In [35]:
import lightgbm as lgb

In [43]:
model_lgb = lgb.LGBMClassifier(max_depth=-1, learning_rate=0.1, objective='multiclass',
                             random_state=None, silent=True, metric='None', 
                             n_jobs=4, n_estimators=5000, class_weight='balanced',
                             colsample_bytree =  0.93, min_child_samples = 95, num_leaves = 14, subsample = 0.96)

In [44]:
start = time.time()
model_lgb = model_lgb.fit(X_train, y_train)
end = time.time()
(end-start)/60

1.6763604283332825

In [45]:
classes = model_lgb.predict(X_test)

classes

array([4, 4, 1, ..., 4, 2, 4])

In [46]:
(classes == y_test).sum()/y_test.size 

0.7850418410041841

## Performing tuning using Bayesian Optimization.