<p>
The Inter-American Development Bank is asking the Kaggle community for help with income qualification for some of the world's poorest families. Are you up for the challenge?

Here's the backstory: Many social programs have a hard time making sure the right people are given enough aid. It’s especially tricky when a program focuses on the poorest segment of the population. The world’s poorest typically can’t provide the necessary income and expense records to prove that they qualify.

In Latin America, one popular method uses an algorithm to verify income qualification. It’s called the Proxy Means Test (or PMT). With PMT, agencies use a model that considers a family’s observable household attributes, such as the material of their walls and ceiling or the assets found in the home, to classify them and predict their level of need.

While this is an improvement, accuracy remains a problem as the region’s population grows and poverty declines.

To improve on PMT, the IDB (the largest source of development financing for Latin America and the Caribbean) has turned to the Kaggle community. They believe that new methods beyond traditional econometrics, based on a dataset of Costa Rican household characteristics, might help improve PMT’s performance.

Beyond Costa Rica, many countries face this same problem of inaccurately assessing social need. If Kagglers can generate an improvement, the new algorithm could be implemented in other countries around the world.

This is a Kernels-Only Competition, so you must submit your code through Kernels, rather than uploading .csv predictions. You can create private Kernels and even share/edit your work with teammates by adding them as collaborators.
</p>

### Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

In [2]:
# Import custom helper libraries
import utils

#### Basic Analysis

In [3]:
base_path = os.path.join(utils.current_directory_path(),'Data')
train_data_path = os.path.join(base_path, 'train.csv')
test_data_path = os.path.join(base_path, 'test.csv')

In [6]:
# Read the training file into a DataFrame
data = utils.get_dataframe(train_data_path)
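
The `utils` module is a custom helper whose source is not shown in this notebook. A minimal sketch of what its two functions might look like, assuming `current_directory_path` returns the working directory and `get_dataframe` is a thin wrapper around `pandas.read_csv` (both bodies are assumptions, not the actual helper code):

```python
import io
import os

import pandas as pd


def current_directory_path():
    """Hypothetical stand-in: return the current working directory."""
    return os.getcwd()


def get_dataframe(path_or_buffer):
    """Hypothetical stand-in: read a CSV file (or file-like object) into a DataFrame."""
    return pd.read_csv(path_or_buffer)


# Demonstrate on a tiny in-memory CSV instead of the real train.csv.
sample_csv = io.StringIO("Id,rooms,Target\nID_a,3,4\nID_b,5,2\n")
sample = get_dataframe(sample_csv)
print(sample.shape)
```

If the real helper does more (for example, setting dtypes or an index column), the later outputs would reflect that rather than this sketch.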

In [7]:
# Get top 10 rows
data.head(10)

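| | Id | v2a1 | hacdor | rooms | hacapo | v14a | refrig | v18q | v18q1 | r4h1 | ... | SQBescolari | SQBage | SQBhogar_total | SQBedjefe | SQBhogar_nin | SQBovercrowding | SQBdependency | SQBmeaned | agesq | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ID_279628684 | 190000.0 | 0 | 3 | 0 | 1 | 1 | 0 | NaN | 0 | ... | 100 | 1849 | 1 | 100 | 0 | 1.0 | 0.0 | 100.0 | 1849 | 4 |
| 1 | ID_f29eb3ddd | 135000.0 | 0 | 4 | 0 | 1 | 1 | 1 | 1.0 | 0 | ... | 144 | 4489 | 1 | 144 | 0 | 1.0 | 64.0 | 144.0 | 4489 | 4 |
| 2 | ID_68de51c94 | NaN | 0 | 8 | 0 | 1 | 1 | 0 | NaN | 0 | ... | 121 | 8464 | 1 | 0 | 0 | 0.25 | 64.0 | 121.0 | 8464 | 4 |
| 3 | ID_d671db89c | 180000.0 | 0 | 5 | 0 | 1 | 1 | 1 | 1.0 | 0 | ... | 81 | 289 | 16 | 121 | 4 | 1.777778 | 1.0 | 121.0 | 289 | 4 |
| 4 | ID_d56d6f5f5 | 180000.0 | 0 | 5 | 0 | 1 | 1 | 1 | 1.0 | 0 | ... | 121 | 1369 | 16 | 121 | 4 | 1.777778 | 1.0 | 121.0 | 1369 | 4 |
| 5 | ID_ec05b1a7b | 180000.0 | 0 | 5 | 0 | 1 | 1 | 1 | 1.0 | 0 | ... | 121 | 1444 | 16 | 121 | 4 | 1.777778 | 1.0 | 121.0 | 1444 | 4 |
| 6 | ID_e9e0c1100 | 180000.0 | 0 | 5 | 0 | 1 | 1 | 1 | 1.0 | 0 | ... | 4 | 64 | 16 | 121 | 4 | 1.777778 | 1.0 | 121.0 | 64 | 4 |
| 7 | ID_3e04e571e | 130000.0 | 1 | 2 | 0 | 1 | 1 | 0 | NaN | 0 | ... | 0 | 49 | 16 | 81 | 4 | 16.0 | 1.0 | 100.0 | 49 | 4 |
| 8 | ID_1284f8aad | 130000.0 | 1 | 2 | 0 | 1 | 1 | 0 | NaN | 0 | ... | 81 | 900 | 16 | 81 | 4 | 16.0 | 1.0 | 100.0 | 900 | 4 |
| 9 | ID_51f52fdd2 | 130000.0 | 1 | 2 | 0 | 1 | 1 | 0 | NaN | 0 | ... | 121 | 784 | 16 | 81 | 4 | 16.0 | 1.0 | 100.0 | 784 | 4 |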


- Some features contain missing values, shown as NaN (for example `v2a1` and `v18q1`).
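
One way to quantify the missing values is to count them per column. The sketch below uses a toy frame built from the first four rows of three columns in the preview above; on the real data the same calls would be applied to `data`. The variable names are illustrative, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame: first four preview rows of v2a1, v18q1 and rooms.
toy = pd.DataFrame({
    "v2a1": [190000.0, 135000.0, np.nan, 180000.0],
    "v18q1": [np.nan, 1.0, np.nan, 1.0],
    "rooms": [3, 4, 8, 5],
})

# Count NaN entries per column, then keep only columns that have any.
missing_counts = toy.isnull().sum()
missing_only = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_only)
```

Sorting in descending order puts the columns with the most gaps first, which helps decide where imputation matters most.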


In [8]:
# Get the last 10 rows
data.tail(10)

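| | Id | v2a1 | hacdor | rooms | hacapo | v14a | refrig | v18q | v18q1 | r4h1 | ... | SQBescolari | SQBage | SQBhogar_total | SQBedjefe | SQBhogar_nin | SQBovercrowding | SQBdependency | SQBmeaned | agesq | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9547 | ID_198be48d1 | 46500.0 | 0 | 5 | 0 | 1 | 1 | 0 | NaN | 1 | ... | 121 | 529 | 25 | 4 | 4 | 5.444444 | 0.444444 | 100.0 | 529 | 4 |
| 9548 | ID_9df63c33e | 46500.0 | 0 | 5 | 0 | 1 | 1 | 0 | NaN | 1 | ... | 121 | 324 | 25 | 4 | 4 | 5.444444 | 0.444444 | 100.0 | 324 | 4 |
| 9549 | ID_aacac04a2 | 46500.0 | 0 | 5 | 0 | 1 | 1 | 0 | NaN | 1 | ... | 0 | 4 | 25 | 4 | 4 | 5.444444 | 0.444444 | 100.0 | 4 | 4 |
| 9550 | ID_90a399a51 | NaN | 0 | 3 | 0 | 1 | 1 | 0 | NaN | 0 | ... | 36 | 3721 | 4 | 0 | 0 | 4.0 | 1.0 | 9.0 | 3721 | 2 |
| 9551 | ID_79d39dddc | NaN | 0 | 3 | 0 | 1 | 1 | 0 | NaN | 0 | ... | 0 | 4489 | 4 | 0 | 0 | 4.0 | 1.0 | 9.0 | 4489 | 2 |
| 9552 | ID_d45ae367d | 80000.0 | 0 | 6 | 0 | 1 | 1 | 0 | NaN | 0 | ... | 81 | 2116 | 25 | 81 | 1 | 1.5625 | 0.0625 | 68.0625 | 2116 | 2 |
| 9553 | ID_c94744e07 | 80000.0 | 0 | 6 | 0 | 1 | 1 | 0 | NaN | 0 | ... | 0 | 4 | 25 | 81 | 1 | 1.5625 | 0.0625 | 68.0625 | 4 | 2 |
| 9554 | ID_85fc658f8 | 80000.0 | 0 | 6 | 0 | 1 | 1 | 0 | NaN | 0 | ... | 25 | 2500 | 25 | 81 | 1 | 1.5625 | 0.0625 | 68.0625 | 2500 | 2 |
| 9555 | ID_ced540c61 | 80000.0 | 0 | 6 | 0 | 1 | 1 | 0 | NaN | 0 | ... | 121 | 676 | 25 | 81 | 1 | 1.5625 | 0.0625 | 68.0625 | 676 | 2 |
| 9556 | ID_a38c64491 | 80000.0 | 0 | 6 | 0 | 1 | 1 | 0 | NaN | 0 | ... | 64 | 441 | 25 | 81 | 1 | 1.5625 | 0.0625 | 68.0625 | 441 | 2 |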


In [9]:
# Get number of data points
data.shape

(9557, 143)

- There are 9557 data points (rows).
- Each data point has 143 columns, including `Id` and `Target`.

In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9557 entries, 0 to 9556
Columns: 143 entries, Id to Target
dtypes: float64(8), int64(130), object(5)
memory usage: 10.4+ MB
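
The `info()` summary reports five `object` columns, which usually need inspection before modelling. A minimal sketch of how the non-numeric columns could be listed, shown on a toy frame (the `label` column is a hypothetical text column, not one from the dataset):

```python
import pandas as pd

# Toy frame with two text columns and two numeric columns.
toy = pd.DataFrame({
    "Id": ["ID_a", "ID_b"],
    "rooms": [3, 4],
    "label": ["yes", "no"],  # hypothetical text column
    "v2a1": [190000.0, 135000.0],
})

# Excluding numeric dtypes leaves the text-like columns.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

Applied to `data`, the same call would be expected to return the five non-numeric columns, `Id` among them.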


In [13]:
data.columns

Index(['Id', 'v2a1', 'hacdor', 'rooms', 'hacapo', 'v14a', 'refrig', 'v18q',
       'v18q1', 'r4h1',
       ...
       'SQBescolari', 'SQBage', 'SQBhogar_total', 'SQBedjefe', 'SQBhogar_nin',
       'SQBovercrowding', 'SQBdependency', 'SQBmeaned', 'agesq', 'Target'],
      dtype='object', length=143)

In [14]:
data.dtypes

Id                  object
v2a1               float64
hacdor               int64
rooms                int64
hacapo               int64
                    ...   
SQBovercrowding    float64
SQBdependency      float64
SQBmeaned          float64
agesq                int64
Target               int64
Length: 143, dtype: object

- From the output above, `Target` has dtype `int64`, so it does not appear to be a continuous value.
- `Target` takes discrete numerical values (the previews show 4 and 2).
- This suggests a classification problem.
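
To check that `Target` really is discrete, one option is to count its distinct values and their frequencies. The sketch below uses a toy series; the values 4 and 2 appear in the previews above, but the counts here are made up for illustration. On the real data the calls would be made on `data['Target']`.

```python
import pandas as pd

# Toy Target column with illustrative counts.
target = pd.Series([4, 4, 4, 2, 2, 4], name="Target")

# Number of distinct classes, then rows per class ordered by label.
n_classes = target.nunique()
class_counts = target.value_counts().sort_index()
print(n_classes)
print(class_counts)
```

A small number of distinct integer values supports treating the task as classification, and the per-class counts also reveal any class imbalance.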

### Exploratory data analysis