Project 2 - Income Qualification (New)

DESCRIPTION

Identify the level of income qualification needed for the families in Latin America

# Problem Statement Scenario:
Many social programs have a hard time making sure the right people are given enough aid. It’s especially tricky when a program focuses on the poorest segment of the population, because this segment often can’t provide the income and expense records needed to prove that they qualify.

In Latin America, a popular method called Proxy Means Test (PMT) uses an algorithm to verify income qualification. With PMT, agencies use a model that considers a family’s observable household attributes like the material of their walls and ceiling or the assets found in their homes to classify them and predict their level of need. While this is an improvement, accuracy remains a problem as the region’s population grows and poverty declines.

The Inter-American Development Bank (IDB) believes that new methods beyond traditional econometrics, based on a dataset of Costa Rican household characteristics, might help improve PMT’s performance.

# The following actions should be performed:
* Identify the output variable.
* Understand the type of data.
* Check if there are any biases in your dataset.
* Check whether all members of a household have the same poverty level.
* Check whether any household has no family head.
* Set the poverty level of the members and the head of the house to the same value within a family.
* Count the null values in each column.
* Remove rows where the target variable is null.
* Predict the poverty level using a random forest classifier and report its accuracy.
* Check the accuracy using a random forest with cross-validation.
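
The household checks in the list above can be sketched on a small toy frame. `idhogar` (the household identifier) and `Target` are columns of this dataset; treating `parentesco1 == 1` as the head-of-household flag is an assumption taken from the public data dictionary, and all values below are invented for illustration.

```python
import pandas as pd

# Toy data for illustration only; the values are invented.
# Assumption: `parentesco1 == 1` marks the head of the household.
toy = pd.DataFrame({
    "idhogar":     ["a", "a", "a", "b", "b", "c"],
    "parentesco1": [1, 0, 0, 1, 0, 0],
    "Target":      [4, 4, 2, 3, 3, 1],
})

# Households whose members do not all share one poverty level.
levels_per_house = toy.groupby("idhogar")["Target"].nunique()
inconsistent = levels_per_house[levels_per_house > 1].index.tolist()

# Households with no member flagged as head.
heads_per_house = toy.groupby("idhogar")["parentesco1"].sum()
no_head = heads_per_house[heads_per_house == 0].index.tolist()

# Give every member the head's poverty level; households without a
# head keep their original labels.
head_target = (
    toy.loc[toy["parentesco1"] == 1]
    .drop_duplicates("idhogar")
    .set_index("idhogar")["Target"]
)
fixed = toy.copy()
fixed["Target"] = (
    fixed["idhogar"].map(head_target).fillna(fixed["Target"]).astype(int)
)

print(inconsistent, no_head)
print(fixed)
```

Using the head's label as the reference is one possible choice; a majority vote per household would be another.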

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
train=pd.read_csv("train.csv")
test=pd.read_csv("test.csv")

In [3]:
train.shape, test.shape

((9557, 143), (23856, 142))
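
The two shapes differ by one column, which suggests that the training frame carries the output variable and the test frame does not. A minimal sketch of how that column could be found, using invented toy frames in place of `train.csv` and `test.csv`:

```python
import pandas as pd

# Toy stand-ins for the real files; names and values are illustrative.
toy_train = pd.DataFrame({"Id": ["ID_1", "ID_2"], "rooms": [3, 4], "Target": [4, 2]})
toy_test = pd.DataFrame({"Id": ["ID_3"], "rooms": [5]})

# Columns present in the training frame but missing from the test frame.
extra_cols = toy_train.columns.difference(toy_test.columns).tolist()
print(extra_cols)
```

On the real frames, the same call as `train.columns.difference(test.columns)` would be expected to return only `Target`.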

In [4]:
train

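| | Id | v2a1 | hacdor | rooms | hacapo | v14a | refrig | v18q | v18q1 | r4h1 | ... | SQBescolari | SQBage | SQBhogar_total | SQBedjefe | SQBhogar_nin | SQBovercrowding | SQBdependency | SQBmeaned | agesq | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ID_279628684 | 190000.0 | 0 | 3 | 0 | 1 | 1 | 0 | NaN | 0 | ... | 100 | 1849 | 1 | 100 | 0 | 1.000000 | 0.0000 | 100.0000 | 1849 | 4 |
| 1 | ID_f29eb3ddd | 135000.0 | 0 | 4 | 0 | 1 | 1 | 1 | 1.0 | 0 | ... | 144 | 4489 | 1 | 144 | 0 | 1.000000 | 64.0000 | 144.0000 | 4489 | 4 |
| 2 | ID_68de51c94 | NaN | 0 | 8 | 0 | 1 | 1 | 0 | NaN | 0 | ... | 121 | 8464 | 1 | 0 | 0 | 0.250000 | 64.0000 | 121.0000 | 8464 | 4 |
| 3 | ID_d671db89c | 180000.0 | 0 | 5 | 0 | 1 | 1 | 1 | 1.0 | 0 | ... | 81 | 289 | 16 | 121 | 4 | 1.777778 | 1.0000 | 121.0000 | 289 | 4 |
| 4 | ID_d56d6f5f5 | 180000.0 | 0 | 5 | 0 | 1 | 1 | 1 | 1.0 | 0 | ... | 121 | 1369 | 16 | 121 | 4 | 1.777778 | 1.0000 | 121.0000 | 1369 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9552 | ID_d45ae367d | 80000.0 | 0 | 6 | 0 | 1 | 1 | 0 | NaN | 0 | ... | 81 | 2116 | 25 | 81 | 1 | 1.562500 | 0.0625 | 68.0625 | 2116 | 2 |
| 9553 | ID_c94744e07 | 80000.0 | 0 | 6 | 0 | 1 | 1 | 0 | NaN | 0 | ... | 0 | 4 | 25 | 81 | 1 | 1.562500 | 0.0625 | 68.0625 | 4 | 2 |
| 9554 | ID_85fc658f8 | 80000.0 | 0 | 6 | 0 | 1 | 1 | 0 | NaN | 0 | ... | 25 | 2500 | 25 | 81 | 1 | 1.562500 | 0.0625 | 68.0625 | 2500 | 2 |
| 9555 | ID_ced540c61 | 80000.0 | 0 | 6 | 0 | 1 | 1 | 0 | NaN | 0 | ... | 121 | 676 | 25 | 81 | 1 | 1.562500 | 0.0625 | 68.0625 | 676 | 2 |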


In [13]:
col=train.columns
col.unique()

Index(['Id', 'v2a1', 'hacdor', 'rooms', 'hacapo', 'v14a', 'refrig', 'v18q',
       'v18q1', 'r4h1',
       ...
       'SQBescolari', 'SQBage', 'SQBhogar_total', 'SQBedjefe', 'SQBhogar_nin',
       'SQBovercrowding', 'SQBdependency', 'SQBmeaned', 'agesq', 'Target'],
      dtype='object', length=143)

In [10]:
train.describe()

| | v2a1 | hacdor | rooms | hacapo | v14a | refrig | v18q | v18q1 | r4h1 | r4h2 | ... | SQBescolari | SQBage | SQBhogar_total | SQBedjefe | SQBhogar_nin | SQBovercrowding | SQBdependency | SQBmeaned | agesq | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2697.0 | 9557.0 | 9557.0 | 9557.0 | 9557.0 | 9557.0 | 9557.0 | 2215.0 | 9557.0 | 9557.0 | ... | 9557.0 | 9557.0 | 9557.0 | 9557.0 | 9557.0 | 9557.0 | 9557.0 | 9552.0 | 9557.0 | 9557.0 |
| mean | 165231.6 | 0.038087 | 4.95553 | 0.023648 | 0.994768 | 0.957623 | 0.231767 | 1.404063 | 0.385895 | 1.559171 | ... | 74.222769 | 1643.774302 | 19.132887 | 53.500262 | 3.844826 | 3.249485 | 3.900409 | 102.588867 | 1643.774302 | 3.302292 |
| std | 150457.1 | 0.191417 | 1.468381 | 0.151957 | 0.072145 | 0.201459 | 0.421983 | 0.763131 | 0.680779 | 1.036574 | ... | 76.777549 | 1741.19705 | 18.751395 | 78.445804 | 6.946296 | 4.129547 | 12.511831 | 93.51689 | 1741.19705 | 1.009565 |
| min | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.04 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 80000.0 | 0.0 | 4.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 16.0 | 289.0 | 9.0 | 0.0 | 0.0 | 1.0 | 0.111111 | 36.0 | 289.0 | 3.0 |
| 50% | 130000.0 | 0.0 | 5.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 36.0 | 961.0 | 16.0 | 36.0 | 1.0 | 2.25 | 0.444444 | 81.0 | 961.0 | 4.0 |
| 75% | 200000.0 | 0.0 | 6.0 | 0.0 | 1.0 | 1.0 | 0.0 | 2.0 | 1.0 | 2.0 | ... | 121.0 | 2601.0 | 25.0 | 81.0 | 4.0 | 4.0 | 1.777778 | 134.56001 | 2601.0 | 4.0 |
| max | 2353477.0 | 1.0 | 11.0 | 1.0 | 1.0 | 1.0 | 1.0 | 6.0 | 5.0 | 8.0 | ... | 441.0 | 9409.0 | 169.0 | 441.0 | 81.0 | 36.0 | 64.0 | 1369.0 | 9409.0 | 4.0 |


In [14]:
train.Target

0       4
1       4
2       4
3       4
4       4
       ..
9552    2
9553    2
9554    2
9555    2
9556    2
Name: Target, Length: 9557, dtype: int64
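
The summary statistics for `Target` (median 4, lower quartile 3) hint that the highest class dominates, which bears on the "check for biases" task. A minimal sketch of a class-balance check, run here on invented toy labels instead of the real column:

```python
import pandas as pd

# Toy labels for illustration; these are not the real class frequencies.
toy_target = pd.Series([4, 4, 4, 4, 4, 4, 3, 2, 2, 1], name="Target")

# Rows per class and share of rows per class, ordered by class label.
class_counts = toy_target.value_counts().sort_index()
class_share = toy_target.value_counts(normalize=True).sort_index()
print(class_share)

# Ratio of the largest to the smallest class as a rough imbalance measure.
imbalance_ratio = class_counts.max() / class_counts.min()
print(imbalance_ratio)
```

A large ratio would suggest looking at per-class metrics or class weights instead of relying on plain accuracy.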

In [29]:
train.isnull().sum().sort_values(ascending=False)/train.shape[0]

rez_esc            0.829549
v18q1              0.768233
v2a1               0.717798
SQBmeaned          0.000523
meaneduc           0.000523
                    ...    
abastaguadentro    0.000000
cielorazo          0.000000
techootro          0.000000
techocane          0.000000
Target             0.000000
Length: 143, dtype: float64

Missing values appear in five columns: rez_esc, v18q1, v2a1, meaneduc and SQBmeaned.
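
To list only the affected columns together with the share of rows that are missing, `isnull().mean()` can be used, since it divides the null count by the number of rows. The sketch below uses an invented toy frame that borrows a few column names from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration; column names are borrowed from the dataset,
# the values are invented.
toy = pd.DataFrame({
    "v2a1":    [190000.0, np.nan, np.nan, 180000.0],
    "rez_esc": [np.nan, np.nan, np.nan, 1.0],
    "rooms":   [3, 4, 8, 5],
})

# isnull().mean() gives the fraction of missing rows per column.
missing_share = toy.isnull().mean()

# Keep only the columns that actually contain missing values.
missing_share = missing_share[missing_share > 0].sort_values(ascending=False)
print(missing_share)
```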

In [27]:
train.shape[1]

143

In [33]:
train.rez_esc.describe()

count    1629.000000
mean        0.459791
std         0.946550
min         0.000000
25%         0.000000
50%         0.000000
75%         1.000000
max         5.000000
Name: rez_esc, dtype: float64

In [34]:
train.v18q1.describe()

count    2215.000000
mean        1.404063
std         0.763131
min         1.000000
25%         1.000000
50%         1.000000
75%         2.000000
max         6.000000
Name: v18q1, dtype: float64
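
The 2215 non-null values of `v18q1` match the number of rows with `v18q == 1` implied by the `train.describe()` output (mean 0.231767 over 9557 rows). This suggests `v18q1` may be missing exactly where `v18q` is 0. The sketch below checks that on toy data and fills the gaps with 0. It assumes that `v18q` flags tablet ownership and `v18q1` counts tablets, as in the public data dictionary; the values are invented.

```python
import numpy as np
import pandas as pd

# Toy data; values are invented. Assumption: v18q flags tablet ownership
# and v18q1 counts the tablets a household owns.
toy = pd.DataFrame({
    "v18q":  [0, 1, 0, 1, 1],
    "v18q1": [np.nan, 1.0, np.nan, 1.0, 2.0],
})

# True if v18q1 is missing exactly on the rows where v18q is 0.
nan_only_without_tablet = bool(
    (toy["v18q1"].isnull() == (toy["v18q"] == 0)).all()
)

# Under that condition a missing count can be read as zero tablets.
if nan_only_without_tablet:
    toy["v18q1"] = toy["v18q1"].fillna(0)

print(nan_only_without_tablet)
print(toy)
```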

In [35]:
train.v2a1.describe()

count    2.697000e+03
mean     1.652316e+05
std      1.504571e+05
min      0.000000e+00
25%      8.000000e+04
50%      1.300000e+05
75%      2.000000e+05
max      2.353477e+06
Name: v2a1, dtype: float64

In [36]:
train.SQBmeaned.describe()

count    9552.000000
mean      102.588867
std        93.516890
min         0.000000
25%        36.000000
50%        81.000000
75%       134.560010
max      1369.000000
Name: SQBmeaned, dtype: float64

In [39]:
train.meaneduc

0       10.00
1       12.00
2       11.00
3       11.00
4       11.00
        ...  
9552     8.25
9553     8.25
9554     8.25
9555     8.25
9556     8.25
Name: meaneduc, Length: 9557, dtype: float64
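
The rows shown so far suggest that `SQBmeaned` is the square of `meaneduc` (10.00 gives 100.0, 8.25 gives 68.0625). If so, the two columns should be imputed together. A minimal sketch: the median fill is an assumed strategy, not one taken from this notebook, and the missing row is invented.

```python
import numpy as np
import pandas as pd

# Toy data; the first three pairs mirror values visible in the outputs,
# the missing row is invented.
toy = pd.DataFrame({
    "meaneduc":  [10.0, 12.0, 8.25, np.nan],
    "SQBmeaned": [100.0, 144.0, 68.0625, np.nan],
})

# Check the squared relationship on rows where both values are present.
known = toy.dropna()
is_square = bool(np.allclose(known["meaneduc"] ** 2, known["SQBmeaned"]))

# Assumed strategy: fill meaneduc with its median, then rebuild the square
# so the two columns stay consistent.
toy["meaneduc"] = toy["meaneduc"].fillna(toy["meaneduc"].median())
toy["SQBmeaned"] = toy["meaneduc"] ** 2

print(is_square)
print(toy)
```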

In [42]:
train.dtypes

Id                  object
v2a1               float64
hacdor               int64
rooms                int64
hacapo               int64
                    ...   
SQBovercrowding    float64
SQBdependency      float64
SQBmeaned          float64
agesq                int64
Target               int64
Length: 143, dtype: object

In [47]:
# Split the columns by dtype so each group can be checked for nulls separately.
flot=train.select_dtypes(include=np.float64)
num=train.select_dtypes(include=np.int64)
obj=train.select_dtypes(include='object')

In [48]:
obj.isnull().sum()

Id            0
idhogar       0
dependency    0
edjefe        0
edjefa        0
dtype: int64
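
`dependency`, `edjefe` and `edjefa` load as `object` even though their names suggest numeric content. In public copies of this dataset that is usually because numbers are mixed with the strings "yes" and "no". This section does not show the raw values, so the sketch below treats that as an assumption, together with the encoding yes = 1 and no = 0.

```python
import pandas as pd

# Toy values; assumption: the columns mix numeric strings with "yes"/"no".
toy = pd.DataFrame({
    "dependency": ["no", "8", "yes", "0.5"],
    "edjefe":     ["10", "12", "no", "yes"],
})

for column in ["dependency", "edjefe"]:
    raw = toy[column]
    # Parse what is already numeric; "yes"/"no" become NaN at this point.
    numeric = pd.to_numeric(raw, errors="coerce")
    # Assumed encoding: "yes" -> 1 and "no" -> 0.
    toy[column] = numeric.mask(raw == "yes", 1.0).mask(raw == "no", 0.0)

print(toy.dtypes)
print(toy)
```

Any other non-numeric string would stay `NaN` after this step and show up in the next null count.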

In [52]:
num.isnull().sum().sort_values(ascending=True)

hacdor            0
hogar_total       0
hogar_mayor       0
hogar_adul        0
hogar_nin         0
                 ..
techootro         0
techocane         0
techoentrepiso    0
pisomadera        0
Target            0
Length: 130, dtype: int64
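
The printout above is truncated, so the zeros for the hidden columns cannot be read off directly. A single total gives the same answer without relying on the display. The sketch uses an invented toy frame.

```python
import numpy as np
import pandas as pd

# Toy frame; values are invented.
toy = pd.DataFrame({
    "rooms":  [3, 4, 8],
    "hacdor": [0, 0, 1],
    "v2a1":   [190000.0, np.nan, 180000.0],
})

# Integer columns only, mirroring the `num` frame above.
toy_int = toy.select_dtypes(include=np.integer)

# One scalar for all integer columns avoids reading a truncated printout.
total_int_nulls = int(toy_int.isnull().sum().sum())
print(toy_int.columns.tolist(), total_int_nulls)
```

A total of zero is expected here: a NumPy integer column cannot hold `NaN`, so pandas would have loaded any column with gaps as `float64`.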

In [53]:
flot.isnull().sum().sort_values(ascending=False)

rez_esc            7928
v18q1              7342
v2a1               6860
meaneduc              5
SQBmeaned             5
overcrowding          0
SQBovercrowding       0
SQBdependency         0
dtype: int64