# HOMEWORK 4

#### By Matt Youngberg

### Imports

In [1]:
import pandas as pd
import numpy as np

### Test-Transformation Function

This is a record of all the changes I'm making to the training data. I will later apply this function to the test data so I don't have to merge them.

In [2]:
def test_transformation(df):
    """Apply the training-data cleaning steps to df in place."""
    # 'yes'/'no' stand in for 1/0 (this matches the SQB* columns), so map them and cast to numbers.
    df['dependency'] = df['dependency'].replace(to_replace='yes', value=1)
    df['dependency'] = df['dependency'].replace(to_replace='no', value=0)
    df['dependency'] = df['dependency'].astype(float)
    df['edjefe'] = df['edjefe'].replace(to_replace='yes', value=1)
    df['edjefe'] = df['edjefe'].replace(to_replace='no', value=0)
    df['edjefe'] = df['edjefe'].astype(int)
    df['edjefa'] = df['edjefa'].replace(to_replace='yes', value=1)
    df['edjefa'] = df['edjefa'].replace(to_replace='no', value=0)
    df['edjefa'] = df['edjefa'].astype(int)
    # Missing values here are treated as zero: not behind in school, no tablets, no rent.
    df['rez_esc'] = df['rez_esc'].replace(to_replace=np.nan, value=0)
    df['v18q1'] = df['v18q1'].replace(to_replace=np.nan, value=0)
    df['v2a1'] = df['v2a1'].replace(to_replace=np.nan, value=0)
    # Fill missing meaneduc with the mean of the frame passed in, then rebuild its square.
    df['meaneduc'] = df['meaneduc'].replace(to_replace=np.nan, value=df['meaneduc'].mean())
    df['SQBmeaned'] = df['meaneduc'] ** 2
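
The same steps can also be written as a function that returns a cleaned copy instead of modifying its argument. This is only a sketch: `clean_household_frame`, `meaneduc_fill`, and the toy `demo` frame are hypothetical names and values, not part of the notebook. The optional `meaneduc_fill` argument is an assumption that lets the training mean be reused on the test data, whereas the function above uses the mean of whichever frame it receives.

```python
import numpy as np
import pandas as pd

FLAG_VALUES = {'yes': 1, 'no': 0}
FLAG_COLUMNS = {'dependency': float, 'edjefe': int, 'edjefa': int}
ZERO_FILL_COLUMNS = ['rez_esc', 'v18q1', 'v2a1']


def clean_household_frame(df, meaneduc_fill=None):
    """Return a cleaned copy of df (hypothetical helper, not the notebook's function)."""
    out = df.copy()
    for col, dtype in FLAG_COLUMNS.items():
        # Swap 'yes'/'no' for 1/0, leave every other value alone, then cast.
        values = out[col].map(lambda v: FLAG_VALUES.get(v, v))
        out[col] = values.astype(float).astype(dtype)
    # Missing values in these columns are treated as zero.
    out[ZERO_FILL_COLUMNS] = out[ZERO_FILL_COLUMNS].fillna(0)
    # Assumption: the caller may pass the training mean; otherwise use this frame's mean.
    if meaneduc_fill is None:
        meaneduc_fill = out['meaneduc'].mean()
    out['meaneduc'] = out['meaneduc'].fillna(meaneduc_fill)
    out['SQBmeaned'] = out['meaneduc'] ** 2
    return out


# Toy rows for illustration only; these are not values from train.csv.
demo = pd.DataFrame({
    'dependency': ['yes', 'no', '.5'],
    'edjefe': ['no', '6', 'yes'],
    'edjefa': ['11', 'no', 'no'],
    'rez_esc': [np.nan, 1.0, np.nan],
    'v18q1': [np.nan, 2.0, np.nan],
    'v2a1': [190000.0, np.nan, np.nan],
    'meaneduc': [10.0, np.nan, 12.0],
    'SQBmeaned': [100.0, np.nan, 144.0],
})
cleaned = clean_household_frame(demo, meaneduc_fill=9.0)
print(cleaned)
```

Returning a copy keeps the raw frame available for comparison, at the cost of some extra memory.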

### Loading Data

In [3]:
train = pd.read_csv('train.csv')
train.head()

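| | Id | v2a1 | hacdor | rooms | hacapo | v14a | refrig | v18q | v18q1 | r4h1 | ... | SQBescolari | SQBage | SQBhogar_total | SQBedjefe | SQBhogar_nin | SQBovercrowding | SQBdependency | SQBmeaned | agesq | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ID_279628684 | 190000.0 | 0 | 3 | 0 | 1 | 1 | 0 | NaN | 0 | ... | 100 | 1849 | 1 | 100 | 0 | 1.0 | 0.0 | 100.0 | 1849 | 4 |
| 1 | ID_f29eb3ddd | 135000.0 | 0 | 4 | 0 | 1 | 1 | 1 | 1.0 | 0 | ... | 144 | 4489 | 1 | 144 | 0 | 1.0 | 64.0 | 144.0 | 4489 | 4 |
| 2 | ID_68de51c94 | NaN | 0 | 8 | 0 | 1 | 1 | 0 | NaN | 0 | ... | 121 | 8464 | 1 | 0 | 0 | 0.25 | 64.0 | 121.0 | 8464 | 4 |
| 3 | ID_d671db89c | 180000.0 | 0 | 5 | 0 | 1 | 1 | 1 | 1.0 | 0 | ... | 81 | 289 | 16 | 121 | 4 | 1.777778 | 1.0 | 121.0 | 289 | 4 |
| 4 | ID_d56d6f5f5 | 180000.0 | 0 | 5 | 0 | 1 | 1 | 1 | 1.0 | 0 | ... | 121 | 1369 | 16 | 121 | 4 | 1.777778 | 1.0 | 121.0 | 1369 | 4 |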


# Describing and Cleaning the Data

Let's start by looking at the different data types we have in each column. We will likely have to recast anything that isn't an integer or float.

In [4]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9557 entries, 0 to 9556
Columns: 143 entries, Id to Target
dtypes: float64(8), int64(130), object(5)
memory usage: 10.2+ MB


There are 5 `object` dtypes within the training data. Let's take a look to see which columns these are.

In [5]:
train.columns[train.dtypes == object]

Index(['Id', 'idhogar', 'dependency', 'edjefe', 'edjefa'], dtype='object')
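
As an aside, pandas can also pick out the non-numeric columns directly with `select_dtypes`. A small sketch on a made-up frame (the `demo` values are illustrative, not taken from `train.csv`):

```python
import pandas as pd

# Toy frame mixing identifier, integer, string-coded, and float columns.
demo = pd.DataFrame({
    'Id': ['ID_a', 'ID_b'],
    'rooms': [3, 4],
    'dependency': ['yes', '.5'],
    'v2a1': [190000.0, None],
})

# Keep every column that is not an integer or float dtype.
non_numeric = demo.select_dtypes(exclude='number').columns.tolist()
print(non_numeric)
```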

I'm completely fine with `Id` and `idhogar` remaining as objects since they are unique identifiers. The other three, `dependency`, `edjefe`, and `edjefa`, I will need to take care of. Let's put in a brief description of the three.  

`dependency`: dependency rate, calculated as (number of members of the household younger than 19 or older than 64) / (number of members of the household between 19 and 64)  
`edjefe`: years of education of male head of household, based on the interaction of escolari (years of education), head of household and gender, yes=1 and no=0  
`edjefa`: years of education of female head of household, based on the interaction of escolari (years of education), head of household and gender, yes=1 and no=0  

Let's start by seeing what values there are in `dependency`.

In [6]:
train['dependency'].value_counts().head()

yes    2192
no     1747
.5     1497
2       730
1.5     713
Name: dependency, dtype: int64

The `yes` and `no` values are what made these objects! Let's take a deeper look at the data to see if we can correlate the yes and no answers with different values in the dataset.

I found that there is another column that deals directly with `dependency`: the `SQBdependency` column. Surprisingly, that doesn't have missing values. There's an interesting result when you look at what is in the rows with the value 'yes' in `dependency`...

In [7]:
train[['dependency', 'SQBdependency']][train['dependency'] == 'yes']['SQBdependency'].value_counts()

1.0    2192
Name: SQBdependency, dtype: int64

All of the rows with 'yes' for `dependency` are 1 in `SQBdependency`! Since the number had to have been filled in, I can trust that the natural result for `dependency` should be 1 (the square root of 1 is 1). Let's apply that transformation to the set.

In [8]:
train['dependency'] = train['dependency'].replace(to_replace='yes', value=1)

Now let's look at the 'no' values in `dependency`. Another interesting result emerges...

In [9]:
train[['dependency', 'SQBdependency']][train['dependency'] == 'no']['SQBdependency'].value_counts()

0.0    1747
Name: SQBdependency, dtype: int64

All of the values are zero! We can safely replace the 'no' values with zero.

(Note that this is all dependent on the assumption that the 1s and 0s that we see in the 'yes' and 'no' cases are actually true. Since there is no other case to go off of, this is the best assumption I think we could make.)

In [10]:
train['dependency'] = train['dependency'].replace(to_replace='no', value=0)

Let's quickly cast the values in this series to floats so they can be processed more easily...

In [11]:
train['dependency'] = train['dependency'].astype(float)
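
The square-root reasoning used for `dependency` can also be cross-checked on the rows that already hold numbers. A minimal sketch, assuming `SQBdependency` really is the square of `dependency`; the `demo` values below are made up for illustration and are not rows from `train.csv`:

```python
import numpy as np
import pandas as pd

# Toy values for illustration only.
demo = pd.DataFrame({
    'dependency': ['yes', 'no', '.5', '2', '1.5'],
    'SQBdependency': [1.0, 0.0, 0.25, 4.0, 2.25],
})

# 'yes' and 'no' cannot be parsed as numbers, so they become NaN here.
numeric = pd.to_numeric(demo['dependency'], errors='coerce')
implied = np.sqrt(demo['SQBdependency'])

# Where dependency is already numeric, it should equal sqrt(SQBdependency).
numeric_rows = numeric.notna()
squares_match = bool(np.isclose(numeric[numeric_rows], implied[numeric_rows]).all())

# If the check passes, the square root can stand in for 'yes' and 'no'.
recovered = numeric.fillna(implied)
print(squares_match)
print(recovered.tolist())
```

If `squares_match` came out false on the real data, the yes=1 and no=0 replacement would deserve a second look.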

Okay. Now let's look at the column `edjefe`.

In [12]:
train['edjefe'].value_counts().head(15)

no     3762
6      1845
11      751
9       486
3       307
15      285
8       257
7       234
5       222
14      208
17      202
2       194
4       137
16      134
yes     123
Name: edjefe, dtype: int64

This series is considered an object for the same reason as `dependency`: the 'yes' and 'no' values. Let's see if we can find a commonality.

Funny enough, there is another corresponding column for `edjefe` called `SQBedjefe`. Let's take a look and see if we have the same case.

In [13]:
train[['edjefe', 'SQBedjefe']][train['edjefe'] == 'yes']['SQBedjefe'].value_counts()

1    123
Name: SQBedjefe, dtype: int64

It looks like we do! Let's apply the transformation and do it in the 'no' case as well.

In [14]:
train['edjefe'] = train['edjefe'].replace(to_replace='yes', value=1)

In [15]:
train[['edjefe', 'SQBedjefe']][train['edjefe'] == 'no']['SQBedjefe'].value_counts()

0    3762
Name: SQBedjefe, dtype: int64

In [16]:
train['edjefe'] = train['edjefe'].replace(to_replace='no', value=0)

In [17]:
train['edjefe'] = train['edjefe'].astype(int)

In [18]:
train['edjefa'].value_counts()

no     6230
6       947
11      399
9       237
8       217
15      188
7       179
5       176
3       152
4       136
14      120
16      113
10       96
2        84
17       76
12       72
yes      69
13       52
21        5
19        4
18        3
20        2
Name: edjefa, dtype: int64

In [19]:
train['edjefa'] = train['edjefa'].replace(to_replace='yes', value=1)
train['edjefa'] = train['edjefa'].replace(to_replace='no', value=0)
train['edjefa'] = train['edjefa'].astype(int)

Great. Now that we got the object series out of the way, let's start looking at the series that contain missing data.

In [20]:
train.isnull().sum().sort_values(ascending=False).head(6)

rez_esc      7928
v18q1        7342
v2a1         6860
meaneduc        5
SQBmeaned       5
techozinc       0
dtype: int64

Hmmmm. `rez_esc`, `v18q1`, and `v2a1` all have a lot of missing values. Let me go over briefly what each one is from the documentation.  

`rez_esc`: Years behind in school  
`v18q1`: number of tablets household owns  
`v2a1`: Monthly rent payment  

Let's start by looking at rez_esc. You really could only be behind in school if you're currently in school. You may have gotten held back when you were in school, but if you've graduated, I'd imagine it's not of importance. I have a sneaking suspicion that this is correlated to age. Let's take a look.

In [21]:
train[['rez_esc', 'age']][train['rez_esc'].notnull()].describe()

| | rez_esc | age |
|---|---|---|
| count | 1629.0 | 1629.0 |
| mean | 0.459791 | 12.258441 |
| std | 0.94655 | 3.218325 |
| min | 0.0 | 7.0 |
| 25% | 0.0 | 9.0 |
| 50% | 0.0 | 12.0 |
| 75% | 1.0 | 15.0 |
| max | 5.0 | 17.0 |


The fact that `rez_esc` only exists in the case of 7-17 year olds confirms my suspicion. So we'll put zero for everyone else, on the reasoning that children too young for school and adults who have left school can't be behind.

In [22]:
train['rez_esc'] = train['rez_esc'].replace(to_replace=np.nan, value=0)
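
The age argument can be checked from the other direction as well: every row outside the 7 to 17 range should have a missing `rez_esc`. A sketch on toy rows (the `demo` values are hypothetical, and the 7 to 17 bounds come from the `describe()` output above):

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only.
demo = pd.DataFrame({
    'age': [5, 7, 12, 17, 18, 43],
    'rez_esc': [np.nan, 0.0, 1.0, 2.0, np.nan, np.nan],
})

school_age = demo['age'].between(7, 17)  # inclusive on both ends

# True when no row outside the school-age range carries a rez_esc value.
all_missing_outside = bool(demo.loc[~school_age, 'rez_esc'].isnull().all())

# Only then is a blanket zero fill consistent with the age explanation.
if all_missing_outside:
    demo['rez_esc'] = demo['rez_esc'].fillna(0)
print(demo)
```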

In the case of `v18q1`, there is another variable related to it: `v18q`, which simply says whether or not the household has a tablet. Let's take a look at that.

In [23]:
train[['v18q', 'v18q1']][train['v18q1'].isnull()].describe()

| | v18q | v18q1 |
|---|---|---|
| count | 7342.0 | 0.0 |
| mean | 0.0 | NaN |
| std | 0.0 | NaN |
| min | 0.0 | NaN |
| 25% | 0.0 | NaN |
| 50% | 0.0 | NaN |
| 75% | 0.0 | NaN |
| max | 0.0 | NaN |


Since all the values are zero in the cases where `v18q1` is null, it must be the case that the household has zero tablets. So let's fill that in.

In [24]:
train['v18q1'] = train['v18q1'].replace(to_replace=np.nan, value=0)
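
The same check can be expressed as a single boolean before filling. A small sketch with made-up rows (`demo` and `consistent` are illustrative names):

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only.
demo = pd.DataFrame({
    'v18q': [0, 1, 0, 1],
    'v18q1': [np.nan, 1.0, np.nan, 3.0],
})

# Every row with a missing tablet count should report owning no tablet.
missing = demo['v18q1'].isnull()
consistent = bool((demo.loc[missing, 'v18q'] == 0).all())

if consistent:
    demo['v18q1'] = demo['v18q1'].fillna(0)
print(consistent, demo['v18q1'].tolist())
```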

`v2a1` is the amount someone pays monthly for their house. My suspicion is that those who own their house and have paid it off make up the `NaN`s in this dataset. Let's take a look.

In [25]:
train[['v2a1', 'tipovivi1']][train['v2a1'].isnull()]['tipovivi1'].value_counts()

1    5911
0     949
Name: tipovivi1, dtype: int64

So it makes up the vast majority, but not all. Let's see if we can dig deeper into the ones that don't own a fully paid-off home.

In [26]:
not_paying = train[['v2a1', 'tipovivi1', 'tipovivi2', 'tipovivi3', 'tipovivi4', 'tipovivi5']][train['v2a1'].isnull()]
not_paying.describe()

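| | v2a1 | tipovivi1 | tipovivi2 | tipovivi3 | tipovivi4 | tipovivi5 |
|---|---|---|---|---|---|---|
| count | 0.0 | 6860.0 | 6860.0 | 6860.0 | 6860.0 | 6860.0 |
| mean | NaN | 0.861662 | 0.0 | 0.0 | 0.023761 | 0.114577 |
| std | NaN | 0.34528 | 0.0 | 0.0 | 0.152315 | 0.318534 |
| min | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | NaN | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | NaN | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | NaN | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | NaN | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |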


So it seems to be the case that people with the `NaN` values are in special circumstances where they truly aren't paying rent. `tipovivi2` and `tipovivi3` are those that are either paying off a mortgage or those that are renting. `tipovivi4` and `tipovivi5` are those that are considered in special circumstances or borrowing. I think it's a safe assumption that they aren't paying anything in rent, so let's fill them in with zeros.

In [27]:
train['v2a1'] = train['v2a1'].replace(to_replace=np.nan, value=0)
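
Because the `tipovivi` columns are 0/1 indicators, summing them over the rows with missing rent gives a count per housing category, which can be easier to read than the means in `describe()`. A sketch on toy rows (the `demo` values are invented for illustration):

```python
import numpy as np
import pandas as pd

tenure_cols = ['tipovivi1', 'tipovivi2', 'tipovivi3', 'tipovivi4', 'tipovivi5']

# Toy rows for illustration only; each row has exactly one indicator set.
demo = pd.DataFrame({
    'v2a1': [np.nan, 120000.0, np.nan, 80000.0, np.nan],
    'tipovivi1': [1, 0, 0, 0, 0],
    'tipovivi2': [0, 1, 0, 0, 0],
    'tipovivi3': [0, 0, 0, 1, 0],
    'tipovivi4': [0, 0, 1, 0, 0],
    'tipovivi5': [0, 0, 0, 0, 1],
})

# Count, per category, the rows whose rent is missing.
missing_rent = demo['v2a1'].isnull()
counts = demo.loc[missing_rent, tenure_cols].sum()
print(counts)
```

If the categories are mutually exclusive, the counts should add up to the number of rows with missing rent.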

Now, onto the hard cases of `meaneduc` and `SQBmeaned`. Here are their definitions from the documentation.

`meaneduc`: average years of education for adults (18+)  
`SQBmeaned`: square of the mean years of education of adults (>=18) in the household  

So the latter is a function of the former. However, both are missing 5 and they are likely to correspond to the same cases. After sifting through this data several times, my best hunch (I thought) was that it corresponds to people who are heads of household but still going through school. So let's see if there is that commonality between the ones with missing `meaneduc`.

In [28]:
train[['age', 'parentesco1', 'rez_esc', 'instlevel4', 'instlevel5', 'instlevel6', 'instlevel7', 'instlevel8', 'instlevel9']][train['meaneduc'].isnull()]

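| | age | parentesco1 | rez_esc | instlevel4 | instlevel5 | instlevel6 | instlevel7 | instlevel8 | instlevel9 |
|---|---|---|---|---|---|---|---|---|---|
| 1291 | 18 | 1 | 0.0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1840 | 18 | 0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1841 | 18 | 1 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2049 | 19 | 1 | 0.0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2050 | 19 | 0 | 0.0 | 0 | 0 | 0 | 1 | 0 | 0 |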


I can't seem to track down why these 5 rows would be missing entries. There doesn't appear to be any type of commonality in what I've displayed above, and it definitely disproves my best hypothesis. So what I'm going to do is fill in these values with the mean so that the decision tree models I use later will have the least chance to sort these wrongly.

In [29]:
train['meaneduc'] = train['meaneduc'].replace(to_replace=np.nan, value=train['meaneduc'].mean())

In [30]:
train['SQBmeaned'] = train['meaneduc'] ** 2

Okay. Let's check to see if there is any missing data left.

In [31]:
train.isnull().sum().sort_values(ascending=False).head(1)

Target    0
dtype: int64

Awesome! Any object types?

In [32]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9557 entries, 0 to 9556
Columns: 143 entries, Id to Target
dtypes: float64(9), int32(2), int64(130), object(2)
memory usage: 10.3+ MB


Great. Let's move on.

# Modeling

In [35]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

Now that we've cleaned the data, we're ready to model. One important thing to note is that our training data contains people who are not heads of households while the information for the dataset explicitly states that our model would be tested against heads of household. So we'll try modeling with all of the data first, and if I still feel the model is lacking, then we'll work to combine household data.

In [37]:
X_train, X_test, y_train, y_test = train_test_split(train.drop(['Id', 'idhogar', 'Target'], axis=1).values, train['Target'].values, test_size=.3, random_state=42)
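
Since the competition scores heads of household only, one way to see how the model does on that group is to compute the F1 score on the rows where `parentesco1` is 1. A minimal sketch with invented labels and predictions (the `pred` column and all values in `demo` are hypothetical):

```python
import pandas as pd
from sklearn.metrics import f1_score

# Toy labels and predictions for illustration only.
demo = pd.DataFrame({
    'parentesco1': [1, 0, 0, 1, 0, 1],
    'Target': [4, 4, 4, 2, 2, 3],
    'pred': [4, 4, 3, 3, 1, 2],
})

# Keep only heads of household before scoring.
heads = demo[demo['parentesco1'] == 1]
f1_heads = f1_score(heads['Target'], heads['pred'], average='macro')
print(len(heads), f1_heads)
```

On the real data this would use the validation rows and the classifier's predictions in place of the toy columns.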

## Random Forest Model

For this, I'll largely be borrowing the parameter loop that the professor demonstrated in class. I liked how it iterated, especially since I didn't readily understand how to use the `GridSearchCV` class with a `RandomForestClassifier` instance. However, I've tweaked it where needed to apply nicely to the data we're working with.

In [42]:
n_estimators = [1000, 2500, 5000]
max_depth = [1, 3, 5, 10]
class_weights = ['balanced', None]
best_f1 = 0

for est in n_estimators:
    for depth in max_depth:
        for wgt in class_weights:
            clf = RandomForestClassifier(n_estimators=est, max_depth=depth, oob_score=True, class_weight=wgt, random_state=42)
            clf.fit(X_train, y_train)
            y_pred = clf.predict(X_train)
            f1 = f1_score(y_train, y_pred, average='macro', labels=np.unique(y_pred))
            if f1 > best_f1:
                best_f1 = f1
                best_params = (est, depth, wgt)
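
For comparison, here is roughly how the same search might look with `GridSearchCV`. This is a sketch under assumptions, not the method used in this notebook: it runs on synthetic data from `make_classification`, uses much smaller `n_estimators` values so it finishes quickly, and scores each combination by cross-validated macro F1 rather than by F1 on the training data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the cleaned training features and labels.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)

# Same three knobs as the loop above, with smaller values for speed.
param_grid = {
    'n_estimators': [25, 50],
    'max_depth': [3, 5],
    'class_weight': ['balanced', None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring='f1_macro',  # macro-averaged F1, as in the loop above
    cv=3,                # each combination is scored on held-out folds
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, best_score)
```

After fitting, `search.best_estimator_` holds a model refit on all of the supplied data with the winning parameters.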

In [43]:
print(best_f1)
print(best_params)

0.8732425232545373
(2500, 10, 'balanced')


Okay. That looks decent, keeping in mind that it is an F1 score on the same training data the model was fit on. Let's see how it does on our validation set.

In [45]:
clf = RandomForestClassifier(n_estimators=best_params[0], max_depth=best_params[1], oob_score=True, class_weight=best_params[2], random_state=42)

clf.fit(X_test, y_test)
test_pred = clf.predict(X_test)
f1 = f1_score(y_test, test_pred, average='macro', labels=np.unique(test_pred))

print(f1)

0.9414815463468462


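Doesn't look bad! Keep in mind, though, that the cell above refits `clf` on `X_test` before predicting on `X_test`, so this is a score on data the model *has* seen rather than a true validation score, which is likely why it comes out higher than the training score. Let's prepare a file for submission.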
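
A held-out estimate would come from fitting on the training split only and scoring on the untouched split. A minimal sketch on synthetic data (the data from `make_classification` and the parameter values are assumptions for illustration, not the notebook's data or tuned settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned features and labels.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

clf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
clf.fit(X_train, y_train)  # fit on the training split only

# In-sample score: the model has already seen these rows.
train_f1 = f1_score(y_train, clf.predict(X_train), average='macro')
# Held-out score: these rows played no part in fitting.
holdout_f1 = f1_score(y_test, clf.predict(X_test), average='macro')
print(train_f1, holdout_f1)
```

The gap between the two numbers gives a rough sense of how much the model overfits.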

In [46]:
test = pd.read_csv('test.csv')
test.head()

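| | Id | v2a1 | hacdor | rooms | hacapo | v14a | refrig | v18q | v18q1 | r4h1 | ... | age | SQBescolari | SQBage | SQBhogar_total | SQBedjefe | SQBhogar_nin | SQBovercrowding | SQBdependency | SQBmeaned | agesq |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ID_2f6873615 | NaN | 0 | 5 | 0 | 1 | 1 | 0 | NaN | 1 | ... | 4 | 0 | 16 | 9 | 0 | 1 | 2.25 | 0.25 | 272.25 | 16 |
| 1 | ID_1c78846d2 | NaN | 0 | 5 | 0 | 1 | 1 | 0 | NaN | 1 | ... | 41 | 256 | 1681 | 9 | 0 | 1 | 2.25 | 0.25 | 272.25 | 1681 |
| 2 | ID_e5442cf6a | NaN | 0 | 5 | 0 | 1 | 1 | 0 | NaN | 1 | ... | 41 | 289 | 1681 | 9 | 0 | 1 | 2.25 | 0.25 | 272.25 | 1681 |
| 3 | ID_a8db26a79 | NaN | 0 | 14 | 0 | 1 | 1 | 1 | 1.0 | 0 | ... | 59 | 256 | 3481 | 1 | 256 | 0 | 1.0 | 0.0 | 256.0 | 3481 |
| 4 | ID_a62966799 | 175000.0 | 0 | 4 | 0 | 1 | 1 | 1 | 1.0 | 0 | ... | 18 | 121 | 324 | 1 | 0 | 1 | 0.25 | 64.0 | NaN | 324 |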


In [50]:
test_transformation(test)

ids = test['Id'].values
X_submission = test.drop(['Id', 'idhogar'], axis=1).values
submission_preds = clf.predict(X_submission)

submission_preds

array([4, 4, 4, ..., 2, 2, 3], dtype=int64)
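
Because `.values` drops the column names, the prediction step relies on the test columns being in the same order as the training columns. One way to make that explicit is to reindex the test frame by the training feature columns. A sketch with toy frames (`train_demo`, `test_demo`, and their values are hypothetical):

```python
import pandas as pd

# Toy frames for illustration only; the test columns are deliberately shuffled.
train_demo = pd.DataFrame({
    'Id': ['ID_a', 'ID_b'],
    'idhogar': ['h1', 'h2'],
    'rooms': [3, 4],
    'age': [43, 67],
    'Target': [4, 2],
})
test_demo = pd.DataFrame({
    'Id': ['ID_c'],
    'age': [30],
    'idhogar': ['h3'],
    'rooms': [5],
})

# Feature columns in the order the model was trained on.
feature_cols = train_demo.drop(columns=['Id', 'idhogar', 'Target']).columns

# Reorder the test columns to match; a column absent from test would become all NaN.
X_submission = test_demo.reindex(columns=feature_cols)
missing_cols = X_submission.columns[X_submission.isnull().all()].tolist()
print(X_submission.values, missing_cols)
```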

In [51]:
sub = pd.DataFrame({'Id': ids, 'Target': submission_preds})
sub.head()

| | Id | Target |
|---|---|---|
| 0 | ID_2f6873615 | 4 |
| 1 | ID_1c78846d2 | 4 |
| 2 | ID_e5442cf6a | 4 |
| 3 | ID_a8db26a79 | 4 |
| 4 | ID_a62966799 | 4 |


In [52]:
sub.to_csv('submission1.csv', index=False)

## Kaggle Score: .42421

That would put me in 165th out of 619 for all the people that completed the challenge within the time frame. Not bad! I guess a lot of weak models voting does make for a singular strong model.
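
The intuition about weak models voting can be illustrated with a bit of arithmetic. The sketch below makes strong simplifying assumptions: a two-class problem, and voters that are independent and each correct with the same probability. Trees in a random forest are correlated and this problem has more than two classes, so the real gain is smaller. The function name is hypothetical.

```python
from math import comb


def majority_correct_probability(n_voters, p_correct):
    """Probability that more than half of n independent voters are right (odd n)."""
    needed = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p_correct ** k * (1 - p_correct) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )


# Each voter is right 60% of the time; the majority improves as voters are added.
for n in (1, 11, 101):
    print(n, round(majority_correct_probability(n, 0.6), 3))
```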

Note:  

I spent a good part of the day trying to run a boosting model in conjunction with `GridSearchCV`. However, it took forever to find the best parameters. I limited many of the arguments to speed it up, but even after a couple of hours the search hadn't finished. I also have to run the code on Kaggle to submit, and that would take additional time. Forgive me for only trying one model. I worked for a few hours on a second, but given how long the computer took to process things, I simply wasn't going to make the deadline.
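
One common way to bound the cost of such a search is `RandomizedSearchCV`, which tries a fixed number of parameter combinations instead of the full grid. A sketch under assumptions: the note above does not say which boosting model was used, so `GradientBoostingClassifier` is a stand-in, and the data and parameter ranges are synthetic and illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the cleaned training features and labels.
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)

# 18 possible combinations; only n_iter of them are actually tried.
param_distributions = {
    'n_estimators': [25, 50, 100],
    'max_depth': [2, 3],
    'learning_rate': [0.05, 0.1, 0.2],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions,
    n_iter=4,            # caps the number of combinations, and so the run time
    scoring='f1_macro',
    cv=3,
    random_state=42,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

The total work is `n_iter` times the number of folds, so the run time can be budgeted up front; both search classes also accept `n_jobs` to spread the fits across cores.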