In [1]:
from data_cleaner import *

df = load_training_df()\
.pipe(clean_targets)\
.pipe(clean_non_numerics)\
.pipe(clean_missing_values)

  return f(*args, **kwds)
  return f(*args, **kwds)


### New features for individuals

Before considering data at a household level there are some new features that may be useful to generate at an individual's level.

There are 9 columns used as a binary one-hot encoding of the individuals level of education. We can compress this down to a single value to represent how far through education this individual has been.

In [2]:
df = df.pipe(compress_columns, new_col='education-level', 
        cols_to_compress=['instlevel1', 'instlevel2', 'instlevel3', 'instlevel6', 'instlevel4', 'instlevel7', 
                          'instlevel5', 'instlevel8', 'instlevel9'])

### New features for households

All our new features from this point on will be descriptions at a household level so we'll append them all to a DataFrame indexed at household level.

In [3]:
hh_df = pd.DataFrame(index=df.index.get_level_values(0).drop_duplicates())

The data given to us calculates a dependency rate which looks at the number of adults between 19 and 64 (working age) vs the number of children or adults of 65+. This is likely to be due to the fact adults of working age will be supporting the household. Let's define a couple of terms:
 - `supporter` : Household member aged 19-64 who has not been marked as having a disability
 - `dependent` : Household member aged 0-19, 65+, or is disabled

We saw when cleaning the data that there are cases in which households have no supporters. We can add a couple of features to indicate whether there are no supporters in the household, or also no dependents in the household.

In [4]:
supporters = df[(df['age']>=18) & (df['age']<=64) & (df['dis']==0)]
dependents = df[(df['age']<=18) | (df['age']>=64) | (df['dis']==1)]

hh_df['num_supporters'] = supporters.groupby(household_id).size()
hh_df['num_supporters'] = hh_df['num_supporters'].fillna(0).astype(int)

hh_df['num_dependents'] = dependents.groupby(household_id).size()
hh_df['num_dependents'] = hh_df['num_dependents'].fillna(0).astype(int)

hh_df['0_supporters'] = (hh_df['num_supporters']==0).astype(int)
hh_df['0_dependents'] = (hh_df['num_dependents']==0).astype(int)

We already have our dependency calculation which was regenerated during the data cleanup, let's add this and the square of it's value as these were already present in the original data and are likely to be useful in this prediction. This value is consistent across all individuals so we'll just take the first one we see for each household.

In [5]:
hh_df['dependency'] = df['dependency'].groupby(household_id).first()
hh_df['SQBdependency'] = df['SQBdependency'].groupby(household_id).first()

It may be useful to know the gender breakdown of supporters since there is a gender driven pay gap in most countries and this may have some effect on the wealth of the family.

In [6]:
m_supporters = supporters[supporters['male']==1] 
f_supporters = supporters[supporters['female']==1] 

hh_df['num_m_supporters'] = m_supporters.groupby(household_id).size()
hh_df['num_m_supporters'] = hh_df['num_m_supporters'].fillna(0).astype(int)

hh_df['num_f_supporters'] = f_supporters.groupby(household_id).size()
hh_df['num_f_supporters'] = hh_df['num_f_supporters'].fillna(0).astype(int)

Education-level of household supporters is likely to have a large impact on the wealth of the family as well. We already have the mean education of adults in the household, but let's make a new value for supporters, and supporters broken down by gender.

In [7]:
hh_df['meaneduc_s'] = supporters['escolari'].groupby(household_id).mean().round(2)
hh_df['meaneduc_s'] = hh_df['meaneduc_s'].fillna(0)

hh_df['meaneduc_m'] = m_supporters['escolari'].groupby(household_id).mean().round(2)
hh_df['meaneduc_m'] = hh_df['meaneduc_m'].fillna(0)

hh_df['meaneduc_f'] = f_supporters['escolari'].groupby(household_id).mean().round(2)
hh_df['meaneduc_f'] = hh_df['meaneduc_f'].fillna(0)

hh_df['ed_lev_ad_s'] = supporters['education-level'].groupby(household_id).mean().round(2)
hh_df['ed_lev_ad_s'] = hh_df['ed_lev_ad_s'].fillna(0)

hh_df['ed_lev_ad_m'] = m_supporters['education-level'].groupby(household_id).mean().round(2)
hh_df['ed_lev_ad_m'] = hh_df['ed_lev_ad_m'].fillna(0)

hh_df['ed_lev_ad_f'] = f_supporters['education-level'].groupby(household_id).mean().round(2)
hh_df['ed_lev_ad_f'] = hh_df['ed_lev_ad_f'].fillna(0)

Since a member of the household has been assigned 'head-of-household' it's possible that details relating this individual offer significant information about the household. We can add extra features from combinations of details about them.

In [8]:
hoh = df[(df[head_of_household]==1)].groupby(household_id).first()

hh_df['male_hoh'] = (hoh['male']==1).astype(int)
hh_df['male_hoh'] = hh_df['male_hoh'].fillna(0)

hh_df['educ_hoh'] = hoh['escolari']
hh_df['educ_hoh'] = hh_df['educ_hoh'].fillna(0)

hh_df['ed_lev_hoh'] = hoh['education-level']
hh_df['ed_lev_hoh'] = hh_df['ed_lev_hoh'].fillna(0)

hh_df['hoh_is_sup'] = ((hoh['age']>=18) & (hoh['age']<=64) & (hoh['dis']==0)).astype(int)
hh_df['hoh_is_sup'] = hh_df['hoh_is_sup'].fillna(0)

Missing education is more significant for children as this indicates that they are falling behind rather than just showing the number of years they have been in education. Let's check for those under 18 who are falling behind in school. We'll only consider children without disabilities else the disability itself might be the cause of falling behind in school, rather than indicating it being due to wealth issues.

In [16]:
minors = df[(df['age']>=18) & (df['dis']==0)]

hh_df['rez_esc'] = minors['rez_esc'].groupby(household_id).mean().round(2)
hh_df['rez_esc'] = hh_df['rez_esc'].fillna(0)

hh_df['rez_esc_m'] = minors[minors['male']==1]['rez_esc'].groupby(household_id).mean().round(2)
hh_df['rez_esc_m'] = hh_df['rez_esc_m'].fillna(0)

hh_df['rez_esc_f'] = minors[minors['female']==1]['rez_esc'].groupby(household_id).mean().round(2)
hh_df['rez_esc_f'] = hh_df['rez_esc_f'].fillna(0)

In [17]:
nulls = hh_df.isnull().sum(axis=0)
nulls[nulls!=0]/len(hh_df)

Series([], dtype: float64)

In [18]:
hh_df.head()

Unnamed: 0_level_0,num_supporters,num_dependents,0_supporters,0_dependents,dependency,SQBdependency,num_m_supporters,num_f_supporters,meaneduc_s,meaneduc_m,...,ed_lev_ad_s,ed_lev_ad_m,ed_lev_ad_f,male_hoh,educ_hoh,ed_lev_hoh,hoh_is_sup,rez_esc,rez_esc_m,rez_esc_f
idhogar,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
21eb7fcc1,1,0,0,1,0.0,0.0,1,0,10.0,10.0,...,4.0,4.0,0.0,1.0,10.0,4.0,1.0,1.0,1.0,0.0
0e5d7a658,0,1,1,0,1.0,1.0,0,0,0.0,0.0,...,0.0,0.0,0.0,1.0,12.0,7.0,0.0,0.0,0.0,0.0
2c7317ea8,0,1,1,0,1.0,1.0,0,0,0.0,0.0,...,0.0,0.0,0.0,0.0,11.0,6.0,0.0,0.0,0.0,0.0
2b58d945f,2,2,0,0,0.5,0.25,1,1,11.0,11.0,...,6.0,6.0,6.0,1.0,11.0,6.0,1.0,0.0,0.0,0.0
d6dae86b7,2,2,0,0,0.5,0.25,1,1,10.0,9.0,...,5.0,4.0,6.0,1.0,9.0,4.0,1.0,1.0,2.0,0.0
