# Costa Rican Household Poverty Level Prediction

Costa Rican Household Poverty Level Prediction is a Kaggle competition that is currently open for late submissions. The competition serves a social cause: helping the Inter-American Development Bank identify which households have the highest need for social welfare assistance. Many social programs have a hard time making sure the right people are given enough aid, and it is especially tricky when a program focuses on the poorest segment of the population. The world’s poorest typically can’t provide the income and expense records needed to prove that they qualify. Costa Rica is not alone: many other countries face the same problem of inaccurately assessing social need. If data scientists can improve on the current approach, many countries stand to benefit.

The dataset available on Kaggle is a set of household characteristics from a representative sample of Costa Rican households. It has one observation for each member of a household, but the classification is done at the household level. This is a supervised multi-class classification problem: we classify households according to their income levels (1 = extreme poverty, 2 = moderate poverty, 3 = vulnerable households, 4 = non-vulnerable households). We have identified multi-class classification algorithms that we think are well suited to this type of problem, such as KNN, Logistic Regression, Naive Bayes and Random Forest. We will fine-tune the hyperparameters of each model and compare the models using the F1 score and accuracy to find the best among them. We will then run significance tests to select the model used for the final predictions.
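As a rough illustration of the two metrics mentioned above, the sketch below computes accuracy and a macro-averaged F1 score by hand. Macro averaging is an assumption here (the text only says "f1-score"), and the labels are made up for the example.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def macro_f1(y_true, y_pred, labels=(1, 2, 3, 4)):
    """Unweighted mean of the per-class F1 scores (assumed averaging mode)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for label in labels:
        tp = np.sum((y_pred == label) & (y_true == label))
        fp = np.sum((y_pred == label) & (y_true != label))
        fn = np.sum((y_pred != label) & (y_true == label))
        denom = 2 * tp + fp + fn
        # A class with no true or predicted members contributes an F1 of 0.
        scores.append(2.0 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

# Hypothetical labels for six households, for illustration only.
y_true = [4, 4, 4, 2, 3, 1]
y_pred = [4, 4, 4, 4, 3, 1]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(acc, f1)
```

In this toy case the single missed "moderate poverty" household lowers the macro F1 more than it lowers accuracy, which is why looking at both metrics can be informative when the classes are unbalanced.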

Kaggle competition link:

https://www.kaggle.com/c/costa-rican-household-poverty-prediction
  

# Setup

First, let's make sure this notebook works well in both Python 2 and 3, import a few common modules, and ensure Matplotlib plots figures inline.

In [28]:
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

# Download the data and do some verifications

The data is located locally within this folder. 

In [29]:
%ls -l all/train.csv

-rwxr-xr-x 1 root root 3237288 Jul 19 03:51 all/train.csv*


In [30]:
import pandas as pd

In [31]:
full_train_data = pd.read_csv('all/train.csv')

In [32]:
train_data = pd.read_csv('all/train.csv')
test_data = pd.read_csv('all/test.csv')

In [33]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9557 entries, 0 to 9556
Columns: 143 entries, Id to Target
dtypes: float64(8), int64(130), object(5)
memory usage: 10.4+ MB


In [34]:
pd.options.display.max_columns = 150
train_data.head()

Unnamed: 0,Id,v2a1,hacdor,rooms,hacapo,v14a,refrig,v18q,v18q1,r4h1,r4h2,r4h3,r4m1,r4m2,r4m3,r4t1,r4t2,r4t3,tamhog,tamviv,escolari,rez_esc,hhsize,paredblolad,paredzocalo,paredpreb,pareddes,paredmad,paredzinc,paredfibras,paredother,pisomoscer,pisocemento,pisoother,pisonatur,pisonotiene,pisomadera,techozinc,techoentrepiso,techocane,techootro,cielorazo,abastaguadentro,abastaguafuera,abastaguano,public,planpri,noelec,coopele,sanitario1,sanitario2,sanitario3,sanitario5,sanitario6,energcocinar1,energcocinar2,energcocinar3,energcocinar4,elimbasu1,elimbasu2,elimbasu3,elimbasu4,elimbasu5,elimbasu6,epared1,epared2,epared3,etecho1,etecho2,etecho3,eviv1,eviv2,eviv3,dis,male,female,estadocivil1,estadocivil2,estadocivil3,estadocivil4,estadocivil5,estadocivil6,estadocivil7,parentesco1,parentesco2,parentesco3,parentesco4,parentesco5,parentesco6,parentesco7,parentesco8,parentesco9,parentesco10,parentesco11,parentesco12,idhogar,hogar_nin,hogar_adul,hogar_mayor,hogar_total,dependency,edjefe,edjefa,meaneduc,instlevel1,instlevel2,instlevel3,instlevel4,instlevel5,instlevel6,instlevel7,instlevel8,instlevel9,bedrooms,overcrowding,tipovivi1,tipovivi2,tipovivi3,tipovivi4,tipovivi5,computer,television,mobilephone,qmobilephone,lugar1,lugar2,lugar3,lugar4,lugar5,lugar6,area1,area2,age,SQBescolari,SQBage,SQBhogar_total,SQBedjefe,SQBhogar_nin,SQBovercrowding,SQBdependency,SQBmeaned,agesq,Target
0,ID_279628684,190000.0,0,3,0,1,1,0,,0,1,1,0,0,0,0,1,1,1,1,10,,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,21eb7fcc1,0,1,0,1,no,10,no,10.0,0,0,0,1,0,0,0,0,0,1,1.0,0,0,1,0,0,0,0,1,1,1,0,0,0,0,0,1,0,43,100,1849,1,100,0,1.0,0.0,100.0,1849,4
1,ID_f29eb3ddd,135000.0,0,4,0,1,1,1,1.0,0,1,1,0,0,0,0,1,1,1,1,12,,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0e5d7a658,0,1,1,1,8,12,no,12.0,0,0,0,0,0,0,0,1,0,1,1.0,0,0,1,0,0,0,0,1,1,1,0,0,0,0,0,1,0,67,144,4489,1,144,0,1.0,64.0,144.0,4489,4
2,ID_68de51c94,,0,8,0,1,1,0,,0,0,0,0,1,1,0,1,1,1,1,11,,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,1,1,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,2c7317ea8,0,1,1,1,8,no,11,11.0,0,0,0,0,1,0,0,0,0,2,0.5,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,92,121,8464,1,0,0,0.25,64.0,121.0,8464,4
3,ID_d671db89c,180000.0,0,5,0,1,1,1,1.0,0,2,2,1,1,2,1,3,4,4,4,9,1.0,4,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,2b58d945f,2,2,0,4,yes,11,no,11.0,0,0,0,1,0,0,0,0,0,3,1.333333,0,0,1,0,0,0,0,1,3,1,0,0,0,0,0,1,0,17,81,289,16,121,4,1.777778,1.0,121.0,289,4
4,ID_d56d6f5f5,180000.0,0,5,0,1,1,1,1.0,0,2,2,1,1,2,1,3,4,4,4,11,,4,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2b58d945f,2,2,0,4,yes,11,no,11.0,0,0,0,0,1,0,0,0,0,3,1.333333,0,0,1,0,0,0,0,1,3,1,0,0,0,0,0,1,0,37,121,1369,16,121,4,1.777778,1.0,121.0,1369,4


In [35]:
train_data.corr()

Unnamed: 0,v2a1,hacdor,rooms,hacapo,v14a,refrig,v18q,v18q1,r4h1,r4h2,r4h3,r4m1,r4m2,r4m3,r4t1,r4t2,r4t3,tamhog,tamviv,escolari,rez_esc,hhsize,paredblolad,paredzocalo,paredpreb,pareddes,paredmad,paredzinc,paredfibras,paredother,pisomoscer,pisocemento,pisoother,pisonatur,pisonotiene,pisomadera,techozinc,techoentrepiso,techocane,techootro,cielorazo,abastaguadentro,abastaguafuera,abastaguano,public,planpri,noelec,coopele,sanitario1,sanitario2,sanitario3,sanitario5,sanitario6,energcocinar1,energcocinar2,energcocinar3,energcocinar4,elimbasu1,elimbasu2,elimbasu3,elimbasu4,elimbasu5,elimbasu6,epared1,epared2,epared3,etecho1,etecho2,etecho3,eviv1,eviv2,eviv3,dis,male,female,estadocivil1,estadocivil2,estadocivil3,estadocivil4,estadocivil5,estadocivil6,estadocivil7,parentesco1,parentesco2,parentesco3,parentesco4,parentesco5,parentesco6,parentesco7,parentesco8,parentesco9,parentesco10,parentesco11,parentesco12,hogar_nin,hogar_adul,hogar_mayor,hogar_total,meaneduc,instlevel1,instlevel2,instlevel3,instlevel4,instlevel5,instlevel6,instlevel7,instlevel8,instlevel9,bedrooms,overcrowding,tipovivi1,tipovivi2,tipovivi3,tipovivi4,tipovivi5,computer,television,mobilephone,qmobilephone,lugar1,lugar2,lugar3,lugar4,lugar5,lugar6,area1,area2,age,SQBescolari,SQBage,SQBhogar_total,SQBedjefe,SQBhogar_nin,SQBovercrowding,SQBdependency,SQBmeaned,agesq,Target
v2a1,1.0,-0.091732,0.443461,-0.073509,0.033551,0.08897,0.278364,0.302292,-0.0819,-0.002401,-0.059548,-0.072791,0.048169,-0.007806,-0.103959,0.032631,-0.043574,-0.064566,-0.051396,0.288006,-0.086238,-0.064566,0.290062,-0.101152,-0.174037,-0.037311,-0.134678,-0.053664,-0.025562,-0.015824,0.296623,-0.2736,,,-0.031581,-0.085837,-0.001022,0.011072,-0.029893,,0.300238,0.055717,-0.055717,,0.121159,,,-0.121159,-0.0278,0.141261,-0.132268,-0.035019,-0.023344,-0.017993,0.247867,-0.23898,-0.051818,0.122205,-0.066894,-0.093337,-0.041608,,,-0.129924,-0.242453,0.298101,-0.081504,-0.187258,0.221301,-0.124955,-0.202315,0.254528,-0.009161,-0.01641,0.01641,-0.044223,-0.072324,0.132792,0.03823,-0.061174,-0.02692,-0.007141,0.015678,0.042345,-0.010566,-0.059103,-0.028405,-0.05851,0.020017,-0.005813,-0.009135,0.024459,-0.00479,0.004421,-0.113882,0.048189,-0.03653,-0.064566,0.426544,-0.052032,-0.105092,-0.10663,-0.098785,-0.027489,-0.001094,0.023615,0.289269,0.227575,0.297657,-0.233023,,0.226177,-0.226177,,,0.272942,0.041303,0.06536,0.108504,0.231539,-0.019377,-0.03903,-0.12115,-0.128111,-0.143357,0.112759,-0.112759,0.078897,0.358305,0.062343,-0.061309,0.36429,-0.082246,-0.191915,-0.061352,0.402561,0.062343,0.273559
hacdor,-0.091732,1.0,-0.233369,0.652594,-0.175011,-0.101965,-0.08468,-0.049262,0.232508,0.059313,0.184857,0.268978,0.142458,0.26462,0.328901,0.134909,0.304282,0.304884,0.350948,-0.122134,0.047466,0.304884,-0.136055,0.074927,-0.027202,0.147449,0.08362,0.15512,-0.007622,-0.007622,-0.168472,0.077334,-0.006109,0.078103,0.232319,0.055087,-0.042053,-0.026698,-0.011166,-0.009338,-0.177257,-0.053838,0.060714,-0.01189,0.018548,-0.003526,-0.009338,-0.016256,0.102032,-0.018513,0.011347,-0.024871,-0.009984,-0.008644,-0.074592,0.052816,0.052003,-0.039592,-0.002982,0.049567,-0.007622,,-0.007055,0.186977,0.053383,-0.165127,0.15078,0.046047,-0.144617,0.248834,0.035195,-0.18885,-0.013962,-0.010899,0.010899,0.101245,0.048446,-0.083512,-0.036129,-0.004175,-0.024646,-0.004303,-0.059319,-0.03526,0.01465,0.052828,0.019894,0.09521,-0.003598,-0.009773,-0.003996,-0.001394,0.03125,0.0344,0.316924,0.092055,-0.047062,0.304884,-0.108783,0.084845,0.043483,0.001882,-0.003503,-0.043254,-0.018611,-0.007391,-0.070758,-0.024871,-0.209735,0.670727,-0.063173,-0.066533,0.07356,0.151115,0.010075,-0.067109,-0.075837,-0.059923,0.004046,-0.042086,0.017444,0.025546,-0.010172,0.037182,0.005289,0.027721,-0.027721,-0.118168,-0.109862,-0.102725,0.350546,-0.082229,0.388043,0.794699,0.005278,-0.099153,-0.102725,-0.191714
rooms,0.443461,-0.233369,1.0,-0.213368,0.129183,0.130531,0.254256,0.208919,-0.066578,0.267627,0.195222,-0.032558,0.241989,0.168503,-0.064789,0.349206,0.245784,0.240137,0.254473,0.220284,-0.084857,0.240137,0.260981,-0.05774,-0.151392,-0.096855,-0.111619,-0.070527,-0.002567,-0.021199,0.266992,-0.203644,0.003253,-0.0409,-0.142969,-0.076674,0.044452,-0.033245,0.002973,0.066864,0.310513,0.078825,-0.066509,-0.04846,0.052554,-0.003486,-0.05489,-0.04266,-0.125488,0.076446,-0.034281,-0.059912,-0.032653,-0.067716,0.173359,-0.147766,-0.046279,0.116747,-0.026822,-0.111188,-0.032379,,-0.023074,-0.184362,-0.134222,0.240154,-0.149426,-0.08239,0.177085,-0.194106,-0.140775,0.250314,-0.028929,0.005056,-0.005056,-0.07493,-0.121746,0.097307,0.008742,-0.057087,0.011181,0.069616,-0.075551,-0.00897,0.046391,-0.038294,0.049933,0.048432,0.010197,0.016032,0.005382,0.019535,0.006617,0.003603,-0.002149,0.367481,0.142582,0.240137,0.284654,-0.083246,-0.102221,-0.050018,-0.007555,0.050232,0.007313,-0.001366,0.178667,0.092962,0.80407,-0.382364,0.170745,0.093062,-0.171118,-0.103335,-0.114921,0.243198,0.226308,0.07554,0.404278,0.223771,-0.073862,-0.130994,-0.051941,-0.099481,-0.050428,0.130286,-0.130286,0.077046,0.233679,0.068288,0.221595,0.19889,0.007952,-0.355526,-0.027575,0.250061,0.068288,0.226208
hacapo,-0.073509,0.652594,-0.213368,1.0,-0.150986,-0.124506,-0.067529,-0.037414,0.226378,0.126645,0.240056,0.241452,0.095545,0.212527,0.306722,0.152968,0.305857,0.306289,0.354389,-0.104846,0.081574,0.306289,-0.12655,0.057683,-0.036119,0.164714,0.083705,0.169976,-0.005961,-0.005961,-0.166538,0.073848,-0.004778,-0.005037,0.256174,0.055785,-0.069665,-0.020881,-0.008733,-0.007303,-0.169545,-0.056118,0.062234,-0.009299,-7.5e-05,-0.002758,-0.007303,0.0021,0.134452,-0.032506,-0.001171,0.047696,-0.007809,-0.00676,-0.078034,0.029575,0.112584,-0.067394,0.033138,0.05914,-0.005961,,-0.005518,0.14046,0.049903,-0.133336,0.117304,0.010489,-0.089237,0.247084,0.004543,-0.159887,-0.011846,0.006543,-0.006543,0.085617,0.042195,-0.075613,-0.028257,0.002304,-0.021135,-0.003549,-0.051028,-0.036599,-0.004663,0.026771,0.027286,0.11075,-0.008771,-0.007644,0.012008,0.003577,0.056286,0.037384,0.274774,0.143574,-0.008307,0.306289,-0.121145,0.079814,0.030126,0.008589,-0.008689,-0.040219,-0.015811,-0.008468,-0.058766,-0.019452,-0.12631,0.530401,-0.026624,-0.052036,0.035628,0.101824,0.006048,-0.052486,-0.064587,-0.024046,0.044267,-0.038896,0.023858,-0.005963,0.000737,0.023113,0.024699,0.008402,-0.008402,-0.087773,-0.092703,-0.075528,0.37372,-0.07117,0.367025,0.640096,0.014411,-0.103324,-0.075528,-0.138008
v14a,0.033551,-0.175011,0.129183,-0.150986,1.0,0.143143,0.036396,0.011255,-0.054769,0.018133,-0.015552,-0.00637,0.038997,0.02651,-0.039803,0.038295,0.007615,0.007328,0.010613,0.036796,-0.022039,0.007328,0.058186,-0.000611,0.005202,-0.134787,-0.059958,0.008349,0.002778,0.002778,0.086897,-0.02041,0.002227,-0.221964,-0.05909,-0.068411,0.063877,0.00973,0.00407,0.003403,0.080269,0.151276,-0.119326,-0.11747,0.014807,0.001285,-0.089524,0.02098,-0.532663,0.006009,0.086154,-0.049865,0.003639,-0.164112,0.053617,-0.014786,-0.056463,0.041178,0.01276,-0.056164,0.002778,,0.002571,-0.0616,-0.026673,0.063014,-0.067555,-0.021129,0.065251,-0.057482,-0.041248,0.073727,-0.013219,-0.014017,0.014017,-0.017293,-0.016929,0.017739,0.004915,0.006811,0.005333,0.00041,-0.00453,0.000759,-0.008685,-0.005207,0.007111,0.010258,0.007305,0.003562,0.008416,0.00407,0.008212,0.006788,-0.026192,0.041823,0.017499,0.007328,0.035245,-0.005382,-0.032671,-0.005774,0.004742,0.012084,0.009789,-0.002581,0.025052,0.009064,0.102891,-0.141635,0.014706,0.024248,-0.007214,-0.057659,-0.015246,0.010088,0.045742,0.053883,0.049947,0.027642,-0.001824,0.000709,-0.020328,-0.021421,-0.005146,-0.007297,0.007297,0.027193,0.036483,0.023831,0.0091,0.018897,-0.015193,-0.174969,0.005712,0.034711,0.023831,0.063382
refrig,0.08897,-0.101965,0.130531,-0.124506,0.143143,1.0,0.086002,-0.070318,-0.047087,-0.022819,-0.04686,-0.023502,0.027832,0.008038,-0.046136,0.001607,-0.025979,-0.026784,-0.02206,0.097733,-0.116235,-0.026784,0.134937,-0.035913,0.022801,-0.087432,-0.162714,-0.053502,0.008057,0.008057,0.169537,-0.068315,0.006459,-0.073521,-0.099486,-0.140878,0.014877,0.020343,0.011805,0.009872,0.171503,0.111854,-0.083479,-0.100841,0.064289,-0.025594,-0.223078,-0.011786,-0.07889,0.003371,0.066721,-0.180521,-0.051713,-0.158579,0.091731,0.015277,-0.213536,0.121866,-0.060374,-0.105971,0.008057,,0.007459,-0.155046,-0.01594,0.110077,-0.082299,-0.031353,0.084648,-0.146578,-0.060492,0.147383,-0.010466,-0.018935,0.018935,-0.010833,-0.038047,0.078181,0.011599,-0.018359,-0.024847,-0.024312,-0.016844,0.031554,-0.005736,-0.043091,-0.006117,0.011133,0.015982,0.010332,0.015341,-0.006766,-0.01799,-0.002702,-0.080347,0.053429,0.024403,-0.026784,0.150734,-0.028089,-0.073088,-0.030578,0.022845,0.035546,0.016639,0.018088,0.062335,0.026293,0.103223,-0.113158,0.005873,0.058247,0.012889,-0.080603,-0.054244,0.060654,0.080885,0.12718,0.122118,0.094955,-0.007976,-0.014484,-0.006559,-0.064096,-0.075008,0.078661,-0.078661,0.029801,0.097128,0.025846,-0.052195,0.082159,-0.108718,-0.123054,-0.03408,0.117406,0.025846,0.126792
v18q,0.278364,-0.08468,0.254256,-0.067529,0.036396,0.086002,1.0,,-0.024318,-0.014489,-0.026559,0.03825,0.013104,0.032096,0.009481,-0.001938,0.004031,0.002951,-0.004694,0.209119,-0.167031,0.002951,0.196557,-0.081683,-0.091061,-0.043031,-0.097789,-0.026127,-0.001587,-0.021038,0.194831,-0.168423,0.015474,-0.017776,-0.057329,-0.049979,0.026613,-0.009725,0.009074,-0.015184,0.243237,0.049887,-0.047023,-0.01616,0.077358,-0.009733,-0.025775,-0.071448,-0.034242,0.122161,-0.098631,-0.044471,-0.012696,-0.02386,0.124788,-0.080311,-0.097757,0.137801,-0.077757,-0.108982,-0.001587,,-0.019475,-0.133219,-0.148066,0.22195,-0.10999,-0.138019,0.201426,-0.138113,-0.143181,0.217204,-0.054806,-0.021221,0.021221,0.016375,-0.056591,0.073598,0.011725,-0.039931,-0.02973,-0.014127,-0.018236,0.015563,0.018008,-0.022385,0.009982,-0.018383,-0.010568,0.003388,-0.009603,-0.008658,0.002121,0.015403,-0.020992,0.029079,-0.06281,0.002951,0.335214,-0.036527,-0.073037,-0.110066,-0.031707,0.038729,-0.02363,-0.003069,0.217645,0.104639,0.16277,-0.12746,-0.024492,0.103293,0.015212,-0.028303,-0.077776,0.304465,0.127822,0.085801,0.163412,0.18054,-0.056727,-0.013468,-0.084894,-0.070486,-0.092596,0.168158,-0.168158,-0.041128,0.250477,-0.05467,-0.01643,0.282619,-0.050562,-0.125936,-0.071504,0.302763,-0.05467,0.238864
v18q1,0.302292,-0.049262,0.208919,-0.037414,0.011255,-0.070318,,1.0,0.111505,0.041959,0.096033,0.059638,0.039038,0.067011,0.118688,0.05443,0.113185,0.11419,0.132466,0.030058,-0.052453,0.11419,0.018374,-0.002916,-0.002357,-0.019504,-0.008882,-0.033015,-0.019504,,0.037956,-0.032769,-0.022526,,-0.029819,-0.005202,0.034117,-0.013179,-0.033827,-0.015921,0.063742,-0.006293,0.014081,-0.022526,0.089603,,,-0.089603,,0.064062,-0.055933,-0.039087,-0.019504,,0.078824,-0.085187,0.027439,0.033824,-0.040692,-0.01495,-0.019504,,,-0.003038,-0.076512,0.07408,-0.044252,-0.067027,0.084938,-0.021213,-0.061237,0.066166,-0.02831,0.016164,-0.016164,0.039097,-0.012454,-0.011787,-0.034846,-0.012558,-0.023493,0.016638,-0.037163,-0.006803,0.04234,0.03479,-0.006171,-0.035262,0.017973,-0.027601,-0.011772,-0.025191,0.043114,0.010634,0.09845,0.057619,0.020635,0.11419,0.125435,0.018376,0.016566,-0.057226,-0.020614,-0.046435,-0.010298,-0.016285,0.052753,0.055903,0.197126,-0.037889,-0.006238,0.054288,-0.004701,-0.042574,-0.051698,0.112456,0.030691,0.011255,0.21854,0.140997,-0.102879,-0.016688,-0.030016,-0.062846,-0.064972,0.086392,-0.086392,-0.035154,0.052477,-0.031046,0.076331,0.198832,0.092212,-0.062806,-0.033226,0.115522,-0.031046,-0.007334
r4h1,-0.0819,0.232508,-0.066578,0.226378,-0.054769,-0.047087,-0.024318,0.111505,1.0,-0.088267,0.495674,0.163783,0.091203,0.164578,0.758136,-0.00444,0.444779,0.445953,0.434021,-0.237926,-0.074824,0.445953,-0.109464,0.031732,0.062762,0.043925,0.031526,0.064618,0.002401,-0.021712,-0.123613,0.110078,0.022685,0.057722,0.050079,0.009221,-0.015006,-0.018912,0.017649,-0.013471,-0.131617,-0.04413,0.031646,0.043576,-0.016045,-0.010045,-0.026602,0.017304,0.009212,-0.017674,0.006309,0.025322,0.017624,-0.024625,-0.106574,0.090507,0.042377,-0.038889,0.022739,0.023866,0.038572,,0.014624,0.030014,0.091889,-0.105486,0.003909,0.06063,-0.058349,0.073034,0.037728,-0.080331,-0.06088,0.096324,-0.096324,0.333451,0.041808,-0.110591,-0.066196,-0.011113,-0.080106,-0.10746,-0.122275,-0.050686,0.083166,0.039636,0.018809,0.128566,-0.02165,0.006664,-0.01075,0.012154,0.027917,0.006578,0.590972,-0.014624,-0.154627,0.445953,-0.078597,0.228897,0.07846,-0.059489,-0.046593,-0.049877,-0.018539,-0.036599,-0.112389,-0.028386,-0.022305,0.413705,-0.126264,-0.000943,0.092525,0.017925,0.085984,-0.070371,-0.068004,0.032755,0.012057,-0.070627,-0.007368,0.026066,0.046167,0.017842,0.04655,-0.046722,0.046722,-0.31699,-0.186017,-0.27269,0.441126,-0.03124,0.565494,0.35566,-0.036977,-0.083552,-0.27269,-0.229889
r4h2,-0.002401,0.059313,0.267627,0.126645,0.018133,-0.022819,-0.014489,0.041959,-0.088267,1.0,0.821367,-0.092466,0.067927,-0.000539,-0.118484,0.763329,0.550945,0.548118,0.514151,0.032409,0.15336,0.548118,-0.036265,0.018298,-0.024242,-0.01735,0.053073,0.067619,-0.044418,0.026848,-0.029163,-0.000979,-0.000107,-0.026827,0.065288,0.025923,0.055648,-0.062424,-0.01764,0.01349,-0.027873,-0.02001,0.029713,-0.025452,-0.007607,-0.00956,-0.014536,0.008631,-0.019002,-0.089969,0.074633,0.058899,-0.016983,-0.039732,-0.03925,-0.005213,0.109722,-0.043347,0.035795,0.031637,-0.007465,,-0.019128,0.036478,0.01664,-0.038117,-0.009688,0.015961,-0.008088,0.017026,-0.001616,-0.009264,-0.031879,0.244961,-0.244961,-0.133086,0.023429,0.048689,-0.083016,-0.096759,-0.103304,0.154147,-0.172142,0.01652,0.097573,0.03608,0.070805,0.029581,-0.013849,0.006467,0.010566,0.040107,0.039132,0.019136,0.147794,0.659879,0.063032,0.548118,-0.037556,-0.095711,-0.028839,0.060979,0.084199,0.008866,0.026944,0.003002,-0.049213,-0.029693,0.360582,0.194403,0.105309,-0.018584,-0.085283,-0.000113,-0.046117,0.037087,0.091737,0.070229,0.478886,0.014907,0.043709,-0.066071,-0.04078,-0.007469,0.034428,-0.033211,0.033211,-0.020306,-0.017251,-0.054712,0.509087,0.07761,0.124701,0.144478,-0.157357,-0.062217,-0.054712,0.101253


### Number of null values per column

In [36]:
pd.options.display.max_rows = 150
train_data.isnull().sum()

Id                    0
v2a1               6860
hacdor                0
rooms                 0
hacapo                0
v14a                  0
refrig                0
v18q                  0
v18q1              7342
r4h1                  0
r4h2                  0
r4h3                  0
r4m1                  0
r4m2                  0
r4m3                  0
r4t1                  0
r4t2                  0
r4t3                  0
tamhog                0
tamviv                0
escolari              0
rez_esc            7928
hhsize                0
paredblolad           0
paredzocalo           0
paredpreb             0
pareddes              0
paredmad              0
paredzinc             0
paredfibras           0
paredother            0
pisomoscer            0
pisocemento           0
pisoother             0
pisonatur             0
pisonotiene           0
pisomadera            0
techozinc             0
techoentrepiso        0
techocane             0
techootro             0
cielorazo       

### Columns with null values, in descending order

1. rez_esc -- 7928 
2. v18q1   -- 7342
3. v2a1    -- 6860
4. meaneduc -- 5
5. SQBmeaned -- 5
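A ranking like the one above can be produced directly rather than read off the full listing. The following is a minimal sketch on a toy frame (the column values are made up and `toy` merely stands in for `train_data`).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_data; the values are made up.
toy = pd.DataFrame({
    'rooms': [3, 4, 8, 5],
    'v2a1': [190000.0, np.nan, np.nan, 180000.0],
    'rez_esc': [np.nan, np.nan, np.nan, 1.0],
})

null_counts = toy.isnull().sum()
# Keep only columns with at least one null, largest count first.
null_counts = null_counts[null_counts > 0].sort_values(ascending=False)
print(null_counts)
```

Applying the same two lines to `train_data` or `test_data` should give the ordered list without scrolling through every column.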
    
    

In [38]:
train_data['target'] = train_data['Target']
train_data['target']=train_data['target'].apply(str)

In [39]:
train_data['target'] = train_data['target'].map({'1': 'extreme poverty' ,'2' : 'moderate poverty', 
                              '3' :'vulnerable households', '4' : 'non vulnerable households'})

In [40]:
train_data.target.value_counts()

non vulnerable households    5996
moderate poverty             1597
vulnerable households        1209
extreme poverty               755
Name: target, dtype: int64
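The counts above suggest the classes are far from balanced. As a quick sketch, the shares can be computed from those same counts (copied by hand here; on the real frame, `value_counts(normalize=True)` would presumably give the same result).

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({
    'non vulnerable households': 5996,
    'moderate poverty': 1597,
    'vulnerable households': 1209,
    'extreme poverty': 755,
})

# Share of each class among all rows in the training data.
shares = (counts / counts.sum()).round(3)
print(shares)
```

Roughly 63% of the rows belong to the non-vulnerable class, so a model that always predicts that class would already reach that accuracy.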

In [44]:
pd.options.display.max_rows = 150
test_data.isnull().sum()

Id                     0
v2a1               17403
hacdor                 0
rooms                  0
hacapo                 0
v14a                   0
refrig                 0
v18q                   0
v18q1              18126
r4h1                   0
r4h2                   0
r4h3                   0
r4m1                   0
r4m2                   0
r4m3                   0
r4t1                   0
r4t2                   0
r4t3                   0
tamhog                 0
tamviv                 0
escolari               0
rez_esc                0
hhsize                 0
paredblolad            0
paredzocalo            0
paredpreb              0
pareddes               0
paredmad               0
paredzinc              0
paredfibras            0
paredother             0
pisomoscer             0
pisocemento            0
pisoother              0
pisonatur              0
pisonotiene            0
pisomadera             0
techozinc              0
techoentrepiso         0
techocane              0


In [None]:
### Columns with null values, in descending order

1. v18q1   -- 18126
2. v2a1    -- 17403
3. meaneduc -- 31
4. SQBmeaned -- 31

# Data Pre-processing

## Handling Missing Data

v18q -- owns a tablet
v18q1 -- number of tablets the household owns

v18q1 -- The training data has 7342 missing values. We can compare these missing values against v18q to see whether any of the affected households owns a tablet. If none does, we can replace the missing values directly with 0.


In [61]:
# v18q1: number of tablets owned by the household.

heads_train = train_data.loc[train_data['parentesco1'] == 1].copy()
heads_train.groupby('v18q')['v18q1'].apply(lambda x: x.isnull().sum())

# This shows that every household head with a null v18q1 has v18q == 0, so we can safely replace the missing values with 0.



v18q
0    2318
1       0
Name: v18q1, dtype: int64

In [60]:
heads_test = test_data.loc[test_data['parentesco1'] == 1].copy()
heads_test.groupby('v18q')['v18q1'].apply(lambda x: x.isnull().sum())


v18q
0    5726
1       0
Name: v18q1, dtype: int64

In [63]:
train_data['v18q1'] = train_data['v18q1'].fillna(0)
test_data['v18q1'] = test_data['v18q1'].fillna(0)
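The check and the fill can also be combined into one guarded step, so that the fill only happens when the assumption holds. This is a minimal sketch on a toy frame (the values are made up and `toy` stands in for `train_data`), not the notebook's own procedure.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_data; the values are made up.
toy = pd.DataFrame({
    'v18q': [0, 1, 0, 1],
    'v18q1': [np.nan, 1.0, np.nan, 2.0],
})

# Rows where the tablet count is missing although the household owns a tablet.
conflicts = toy[toy['v18q1'].isnull() & (toy['v18q'] == 1)]

# Only fill with 0 when no such conflicting rows exist.
if conflicts.empty:
    toy['v18q1'] = toy['v18q1'].fillna(0)
print(toy)
```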

### rez_esc --- Years behind school. 

In [64]:
train_data['rez_esc'] = train_data['rez_esc'].fillna(0)
test_data['rez_esc'] = test_data['rez_esc'].fillna(0)

In [None]:
### v2a1 -- 'Monthly rent payment' 
There could be a known reason for a missing monthly rent payment, or the value could simply be missing.

1. The person may own their home. We will set the value to 0 in such cases.
2. Other missing due to unknown reason, we will set the value to  
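The first case can be sketched as follows. The sketch assumes that `tipovivi1 == 1` marks a household that owns its home; that reading of the column is an assumption, not something stated above. Rows missing for unknown reasons are left untouched here, because no replacement value is given for them.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_data; the values are made up.
toy = pd.DataFrame({
    'tipovivi1': [1, 0, 0, 1],   # assumed: 1 = household owns its home
    'v2a1': [np.nan, 135000.0, np.nan, np.nan],
})

# Case 1: the rent is missing and the household owns its home, so rent is 0.
owns_home = toy['tipovivi1'] == 1
toy.loc[owns_home & toy['v2a1'].isnull(), 'v2a1'] = 0

# Case 2 (missing for an unknown reason) is deliberately left as NaN.
print(toy)
```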

