The following actions should be performed:

1. Identify the output variable.
2. Understand the type of data.
3. Check if there are any biases in your dataset.
4. Check whether all members of a household have the same poverty level.
5. Check if there is a household without a family head.
6. Set a consistent poverty level for the members and the head of the household within a family.
7. Count the null values in each column.
8. Remove rows where the target variable is null.
9. Measure the prediction accuracy of a random forest classifier.
10. Check the accuracy of the random forest with cross-validation.

In [55]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

In [4]:
# Import the train and test data

In [11]:
Income_train_data=pd.read_csv("C:\\Users\\mekhare\\Downloads\\Dataset for the project\\Dataset for the project\\train.csv")


In [16]:
Income_test_data=pd.read_csv("C:\\Users\\mekhare\\Downloads\\Dataset for the project\\Dataset for the project\\test.csv")

In [17]:
Income_test_data.shape

(23856, 142)

In [18]:
Income_train_data.shape

(9557, 143)

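Check if there is a household without a family head

According to the data dictionary, parentesco1 = 1 if the person is the household head. Count the rows for each value of parentesco1. Note that this counts individual rows, not households: rows with value 0 are simply members who are not the head.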

In [15]:
Income_train_data.parentesco1.value_counts()


0    6584
1    2973
Name: parentesco1, dtype: int64
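
The row counts above do not show directly which households lack a head. A minimal sketch of a per-household check follows, assuming one row per person with the idhogar and parentesco1 columns described in this notebook. The toy frame and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the training data (hypothetical values).
toy = pd.DataFrame({
    "idhogar":     ["a", "a", "b", "b", "c"],
    "parentesco1": [1,   0,   0,   0,   1],
})

# Number of members flagged as head in each household.
heads_per_household = toy.groupby("idhogar")["parentesco1"].sum()

# Households in which nobody is flagged as head.
no_head = heads_per_household[heads_per_household == 0].index.tolist()
print(no_head)
```

On the real data, the same two lines could be applied to Income_train_data in place of the toy frame.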

Identify the output variable. Assumption: it is present in the train data but not in the test data.

In [18]:
for i in Income_train_data.columns:
    if i not in Income_test_data.columns:
        print (i)

Target
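
The loop above can also be written as a set difference on the column indexes. This is only an alternative sketch; the column names in the toy indexes are placeholders standing in for the real 143 and 142 columns.

```python
import pandas as pd

# Toy column indexes standing in for the train and test frames.
train_cols = pd.Index(["Id", "rooms", "Target"])
test_cols = pd.Index(["Id", "rooms"])

# Columns present in train but missing from test.
only_in_train = train_cols.difference(test_cols).tolist()
print(only_in_train)
```

With the real frames this would be Income_train_data.columns.difference(Income_test_data.columns).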


In [20]:
print (Income_train_data["Target"])

0       4
1       4
2       4
3       4
4       4
       ..
9552    2
9553    2
9554    2
9555    2
9556    2
Name: Target, Length: 9557, dtype: int64


In [22]:
Income_train_data["Target"].value_counts()

4    5996
2    1597
3    1209
1     755
Name: Target, dtype: int64
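
These counts are a natural place to look for bias in the dataset (step 3 of the plan). The sketch below turns the counts printed above into class shares; on the full frame, Income_train_data["Target"].value_counts(normalize=True) should give the same proportions.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
target_counts = pd.Series({4: 5996, 2: 1597, 3: 1209, 1: 755}, name="Target")

# Share of each poverty level in the training data.
shares = (target_counts / target_counts.sum()).round(3)
print(shares)
```

Class 4 accounts for more than half of the rows, so plain accuracy may look better than the model really is on the smaller classes.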

Understand the type of Data

In [69]:
print(Income_train_data.dtypes.value_counts())

int64      130
float64      8
object       5
dtype: int64


In [70]:
print(Income_train_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9557 entries, 0 to 9556
Columns: 143 entries, Id to Target
dtypes: float64(8), int64(130), object(5)
memory usage: 10.4+ MB
None


In [29]:
# Let's explore the data type of each column and print the object columns
data_type = []
for i in Income_train_data.columns:
    n = Income_train_data[i].dtype
    data_type.append(n)
    if n == 'object':
        print(i)
 
print (data_type)
    

Id
idhogar
dependency
edjefe
edjefa
[dtype('O'), dtype('float64'), dtype('int64'), dtype('int64'), dtype('int64'), ...]

Below are descriptions of the object variables above:

Id: unique ID
idhogar: household-level identifier
dependency: dependency rate, calculated as (number of household members younger than 19 or older than 64) / (number of household members between 19 and 64)
edjefe: years of education of the male head of household, based on the interaction of escolari (years of education), head of household and gender; yes=1 and no=0
edjefa: years of education of the female head of household, based on the interaction of escolari (years of education), head of household and gender; yes=1 and no=0
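
As a worked example of the dependency formula, the sketch below computes the rate from a list of ages. The function name dependency_rate is hypothetical, and returning None when a household has no member aged 19 to 64 is an assumption made here; the data dictionary does not say how the dataset handles that case.

```python
def dependency_rate(ages):
    """Dependents (younger than 19 or older than 64) per member aged 19 to 64."""
    dependents = sum(1 for a in ages if a < 19 or a > 64)
    working_age = sum(1 for a in ages if 19 <= a <= 64)
    if working_age == 0:
        return None  # ratio is undefined; handling is an assumption
    return dependents / working_age

# Two adults (35, 33), one child (10) and one elderly member (70): 2 / 2
print(dependency_rate([35, 33, 10, 70]))
```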

In [30]:
print (Income_train_data["idhogar"])

0       21eb7fcc1
1       0e5d7a658
2       2c7317ea8
3       2b58d945f
4       2b58d945f
          ...    
9552    d6c086aa3
9553    d6c086aa3
9554    d6c086aa3
9555    d6c086aa3
9556    d6c086aa3
Name: idhogar, Length: 9557, dtype: object


In [36]:
print(Income_train_data["Id"].groupby(Income_train_data["idhogar"]).count())

idhogar
001ff74ca    2
003123ec2    4
004616164    2
004983866    2
005905417    3
            ..
ff9343a35    4
ff9d5ab17    3
ffae4a097    2
ffe90d46f    4
fff7d6be1    4
Name: Id, Length: 2988, dtype: int64


In [48]:
print("Number of unique households:", Income_train_data["idhogar"].nunique())

Number of unique households: 2988


Converting object columns to numeric data

In [53]:
def yes_no_to_float(value):
    # Map 'yes' to 1.0 and 'no' to 0.0; parse every other entry as a float.
    if value == 'yes':
        return 1.0
    elif value == 'no':
        return 0.0
    else:
        return float(value)

# The helper is not called map, so the built-in map() is not shadowed.
for column in ['edjefe', 'edjefa', 'dependency']:
    Income_train_data[column] = Income_train_data[column].apply(yes_no_to_float)
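
The same conversion can be sketched without a row-wise apply, using Series.replace followed by pd.to_numeric. This is an alternative to the cell above, not what the notebook ran; the toy values are made up for illustration.

```python
import pandas as pd

# Toy columns mixing 'yes'/'no' with numeric strings (hypothetical values).
toy = pd.DataFrame({
    "edjefe":     ["no", "10", "yes"],
    "dependency": ["yes", "0.5", "no"],
})

for col in ["edjefe", "dependency"]:
    # Swap the labels for numeric strings, then parse the whole column at once.
    toy[col] = pd.to_numeric(toy[col].replace({"yes": "1", "no": "0"}))

print(toy.dtypes)
```

pd.to_numeric raises an error on any entry it cannot parse, which makes unexpected labels visible instead of silently passing them through.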

In [54]:
# Check which columns are still of object dtype

for i in Income_train_data.columns:
    if Income_train_data[i].dtype == 'object':
        print(i)

Id
idhogar


In [30]:
Income_train_data.head(5)

Unnamed: 0,Id,v2a1,hacdor,rooms,hacapo,v14a,refrig,v18q,v18q1,r4h1,...,SQBescolari,SQBage,SQBhogar_total,SQBedjefe,SQBhogar_nin,SQBovercrowding,SQBdependency,SQBmeaned,agesq,Target
0,ID_279628684,190000.0,0,3,0,1,1,0,,0,...,100,1849,1,100,0,1.0,0.0,100.0,1849,4
1,ID_f29eb3ddd,135000.0,0,4,0,1,1,1,1.0,0,...,144,4489,1,144,0,1.0,64.0,144.0,4489,4
2,ID_68de51c94,,0,8,0,1,1,0,,0,...,121,8464,1,0,0,0.25,64.0,121.0,8464,4
3,ID_d671db89c,180000.0,0,5,0,1,1,1,1.0,0,...,81,289,16,121,4,1.777778,1.0,121.0,289,4
4,ID_d56d6f5f5,180000.0,0,5,0,1,1,1,1.0,0,...,121,1369,16,121,4,1.777778,1.0,121.0,1369,4


In [31]:
# Population variance (ddof=0, as in np.var) of every numeric column
var_df = Income_train_data.select_dtypes(include='number').var(ddof=0).to_frame('variance')
print('Below are columns with variance 0.')
col = list(var_df[var_df['variance'] == 0].index)
print(col)

Below are columns with variance 0.
['elimbasu5']


In [36]:
Income_train_data["elimbasu5"].unique()

array([0], dtype=int64)

In [41]:
n = len(pd.unique(Income_train_data["elimbasu5"]))
  
print("No.of.unique values :", 
      n)

No.of.unique values : 1


In [52]:
# Number of distinct values in each column (pd.unique counts NaN as a value)
Column_unique_values = []
for i in Income_train_data.columns:
    n = len(pd.unique(Income_train_data[i]))
    Column_unique_values.append(n)


In [64]:
print(Column_unique_values)

[9557, 158, 2, 11, 2, 2, 2, 2, 7, 6, 9, 9, 6, 7, 9, 7, 11, 13, 13, 14, 22, 7, 13, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2988, 10, 10, 4, 13, 31, 22, 22, 156, 2, 2, 2, 2, 2, 2, 2, 2, 2, 7, 38, 2, 2, 2, 2, 2, 2, 2, 2, 11, 2, 2, 2, 2, 2, 2, 2, 2, 97, 22, 97, 13, 22, 10, 38, 31, 156, 97, 4]


In [55]:
Column_unique_values.index(1)

62

In [57]:
Income_train_data.columns[62]

'elimbasu5'

Since the value of column elimbasu5 is the same for all rows, it carries no information and we can drop the column.

In [66]:
Income_train_data.drop('elimbasu5',axis=1,inplace=True)

In [67]:
Income_train_data.shape

(9557, 142)
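
The steps above (find the column with a single unique value, then drop it) can be sketched generically so that no column name has to be typed by hand. The toy frame is made up for illustration; dropna=False makes NaN count as a value, in line with the pd.unique counts used earlier.

```python
import pandas as pd

# Toy frame with two constant columns (hypothetical values).
toy = pd.DataFrame({"a": [0, 0, 0], "b": [1, 2, 1], "c": ["x", "x", "x"]})

# Columns whose values are identical in every row.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]

# Drop them without modifying the frame in place.
toy = toy.drop(columns=constant_cols)
print(constant_cols, toy.shape)
```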

Remove null value rows of the target variable.


In [9]:
Income_train_data['Target'].isna().sum()


0

In [14]:
# No null values in the Target column, so no rows need to be removed
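
For completeness, here is a minimal sketch of steps 7 and 8 of the plan: counting nulls per column and dropping rows with a null target. The toy frame deliberately contains missing values, which the real Target column does not, and its values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame with missing values (hypothetical data).
toy = pd.DataFrame({
    "v2a1":   [190000.0, np.nan, 180000.0],
    "rooms":  [3, 4, 5],
    "Target": [4, np.nan, 2],
})

# Null count per column, keeping only columns that have any nulls.
null_counts = toy.isna().sum()
null_counts = null_counts[null_counts > 0]
print(null_counts)

# Remove rows whose target is null; other columns may still contain nulls.
cleaned = toy.dropna(subset=["Target"])
print(cleaned.shape)
```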

Check for families without a family head

In [56]:
pd.crosstab(Income_train_data['edjefa'],Income_train_data['edjefe'])

edjefa \ edjefe,0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,...,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0,20.0,21.0
0.0,435,123,194,307,137,222,1845,234,257,486,...,113,103,208,285,134,202,19,14,7,43
1.0,69,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2.0,84,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3.0,152,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4.0,136,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5.0,176,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6.0,947,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7.0,179,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8.0,217,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9.0,237,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


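Interpretation: in the cross tab above, 435 rows have both edjefe = 0 and edjefa = 0. These are individual rows, not families, and after the yes/no conversion a value of 0 can also mean a head with zero years of education, so this count alone does not establish how many families lack a head. The parentesco1 counts are more direct: 2973 rows are flagged as household head but there are 2988 unique households, so at least 15 households have no member flagged as head.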
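
Steps 4 and 6 of the plan (checking that all members of a household share one poverty level, and making it consistent) can be sketched as below. Using the head's Target as the household value is an assumption made for this sketch, as is having at most one head per household. The toy values are made up.

```python
import pandas as pd

# Toy data: household "b" has members with different Target values.
toy = pd.DataFrame({
    "idhogar":     ["a", "a", "b", "b"],
    "parentesco1": [1,   0,   1,   0],
    "Target":      [4,   4,   2,   3],
})

# Households whose members do not all share one Target value.
n_levels = toy.groupby("idhogar")["Target"].nunique()
inconsistent = n_levels[n_levels > 1].index.tolist()
print(inconsistent)

# Target of each household's head (assumes at most one head per household).
head_target = toy.loc[toy["parentesco1"] == 1].set_index("idhogar")["Target"]

# Overwrite members with the head's value; keep the original where no head exists.
toy["Target"] = toy["idhogar"].map(head_target).fillna(toy["Target"]).astype(int)
print(toy)
```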



