Feature Selection
=================

30/03/2018

Feature selection: generate features in preparation for applying machine learning models.

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('./data/train-02-cf.csv', index_col=['PassengerId'])
df.head()

| PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Surname | Title | Fancy_title | U15_many_siblings | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Braund | Mr | False | False | False |
| 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | Cumings | Mrs | False | False | False |
| 3 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Heikkinen | Miss | False | False | True |
| 4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | Futrelle | Mrs | False | False | False |
| 5 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Allen | Mr | False | False | True |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 13 columns):
Survived             891 non-null int64
Pclass               891 non-null int64
Sex                  891 non-null object
Age                  714 non-null float64
SibSp                891 non-null int64
Parch                891 non-null int64
Fare                 891 non-null float64
Embarked             889 non-null object
Surname              891 non-null object
Title                891 non-null object
Fancy_title          891 non-null bool
U15_many_siblings    891 non-null bool
alone                891 non-null bool
dtypes: bool(3), float64(2), int64(4), object(4)
memory usage: 79.2+ KB


Note that the `Age` and `Embarked` columns have missing values (714 and 889 non-null entries out of 891, respectively).

# Tokenize Categories

Because sklearn models generally expect numeric input and cannot work with `strings` directly, we'll need to convert the categories into separate columns or represent them with numerical values.
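As a minimal sketch of the two options, the snippet below encodes a toy column first as integer codes and then as one indicator column per category with `pd.get_dummies`. The `toy` frame and its `port` values are made up for illustration and are not taken from the training file.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic training set
toy = pd.DataFrame({'port': ['S', 'C', 'Q', 'S']})

# Option 1: a single numeric column of integer codes
# (categories are sorted, so C -> 0, Q -> 1, S -> 2)
toy['port_code'] = pd.Categorical(toy['port']).codes

# Option 2: one indicator column per category (one-hot encoding)
dummies = pd.get_dummies(toy['port'], prefix='port')

print(toy)
print(dummies)
```

Integer codes keep the frame narrow but imply an ordering between categories; indicator columns avoid that at the cost of extra columns. Which one suits better depends on the model used later.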

## Gender

I'm going to represent the `Sex` column with `is_male`, where `True` (equivalent to `1`) means male.

In [4]:
def is_male(x):
    if x == 'male':
        return True
    else:
        return False
    
df['is_male'] = df.Sex.apply(is_male)
df.head()

| PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Surname | Title | Fancy_title | U15_many_siblings | alone | is_male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Braund | Mr | False | False | False | True |
| 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | Cumings | Mrs | False | False | False | False |
| 3 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Heikkinen | Miss | False | False | True | False |
| 4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | Futrelle | Mrs | False | False | False | False |
| 5 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Allen | Mr | False | False | True | True |
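The same kind of boolean column can also be built without a helper function, because comparing a pandas Series with a string is applied element-wise. A small sketch on made-up values (the `sex` Series here is hypothetical):

```python
import pandas as pd

# Hypothetical values standing in for the Sex column
sex = pd.Series(['male', 'female', 'female', 'male'])

def is_male(x):
    return x == 'male'

# Element-by-element through apply, as in the cell above
via_apply = sex.apply(is_male)

# Vectorized comparison, no helper needed
via_compare = sex == 'male'

print(via_compare.tolist())
```

Both approaches give the same result; the direct comparison is shorter and would work the same way for the title flags.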


## Embarked

There are three categories of embarkation port. Each harbour is represented by a numerical value:
```
0 -- C
1 -- Q
2 -- S
```

The `Embarked` column has 2 missing values; I'm going to fill them with the most frequent value.

In [5]:
df.Embarked = df.Embarked.fillna(value=df.Embarked.mode()[0])
df.Embarked = pd.Categorical(df.Embarked)
df['Embarked_codes'] = df.Embarked.cat.codes
df.head()

| PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Surname | Title | Fancy_title | U15_many_siblings | alone | is_male | Embarked_codes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Braund | Mr | False | False | False | True | 2 |
| 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | Cumings | Mrs | False | False | False | False | 0 |
| 3 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Heikkinen | Miss | False | False | True | False | 2 |
| 4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | Futrelle | Mrs | False | False | False | False | 2 |
| 5 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Allen | Mr | False | False | True | True | 2 |
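To check how the mode fill and the category codes behave, here is a small sketch on a made-up Series (the `embarked` values are hypothetical). It also shows one way to recover the code-to-label mapping from the categorical itself rather than relying on the list written by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value
embarked = pd.Series(['S', 'C', np.nan, 'Q', 'S'])

# mode() returns a Series (ties are possible), so take the first entry
most_common = embarked.mode()[0]
filled = embarked.fillna(most_common)

cat = pd.Categorical(filled)
codes = pd.Series(cat.codes, index=filled.index)

# Recover the code -> label mapping from the sorted categories
mapping = dict(enumerate(cat.categories))
print(mapping)
```

Filling before encoding matters: without the fill, `cat.codes` marks missing values as `-1`.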


## Title

Extracting the title `Master` allows us to identify the young boys onboard. This information can be used to fill the missing values in the `Age` column.

In [6]:
def is_master(x):
    if x == 'Master':
        return True
    return False

df['is_master'] = df.Title.apply(is_master)

df[df['is_master'] == True].Age.describe()

count    36.000000
mean      4.574167
std       3.619872
min       0.420000
25%       1.000000
50%       3.500000
75%       8.000000
max      12.000000
Name: Age, dtype: float64

The mean age of `Master` is about 4.6 years (median 3.5).

In [7]:
def is_mr(x):
    if x == 'Mr':
        return True
    return False

df['is_mr'] = df.Title.apply(is_mr)

df[df['is_mr'] == True].Age.describe()

count    398.000000
mean      32.368090
std       12.708793
min       11.000000
25%       23.000000
50%       30.000000
75%       39.000000
max       80.000000
Name: Age, dtype: float64

The mean age of `Mr` is about 32 years (median 30).

## Age

### Categorizing age

Categorizing `Age` simplifies the modelling by replacing a continuous variable with a categorical one. The ranges were determined using the visualizations in `titanic-02.ipynb`.

```
0 -- Infant -- 0-5
1 -- Child -- 6-15
2 -- Young Adult -- 16-30
3 -- Adult -- 31-60
4 -- Elder -- > 60
```

In [8]:
def cat_age(age):
    if age <= 5:
        return 0
    elif age > 5 and age <= 15:
        return 1
    elif age > 15 and age <=30:
        return 2
    elif age > 30 and age <=60:
        return 3
    elif age > 60:
        return 4
    else:
        return np.nan

df['cat_age'] = df.Age.apply(cat_age)
df['cat_age'].unique()

array([  2.,   3.,  nan,   0.,   1.,   4.])
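The same binning can be sketched with `pd.cut`, which assigns each value to an interval in one call. The example below runs on made-up ages chosen to sit on each boundary; the bin edges mirror the conditions in `cat_age`, and this is an alternative illustration rather than what the notebook ran.

```python
import numpy as np
import pandas as pd

# Hypothetical ages covering each boundary, plus one missing value
ages = pd.Series([0.42, 5, 6, 15, 16, 30, 31, 60, 61, np.nan])

# Right-inclusive bins: (-inf, 5], (5, 15], (15, 30], (30, 60], (60, inf]
bins = [-np.inf, 5, 15, 30, 60, np.inf]

# labels=False returns the integer bin index; missing ages stay NaN
binned = pd.cut(ages, bins=bins, labels=False)

print(binned.tolist())
```

As with the function above, missing ages remain `NaN`, which is why the result is a float column until the gaps are filled.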

I tried using a decision tree model to predict and fill in the missing values, but the accuracy was lower than filling them with the median. To keep things simple, I'm going to fill each missing age with the median age of passengers sharing the same `Title`, so that passengers with the title `Master` are not assigned an adult age.

In [9]:
df[df['is_master'] == True].cat_age.describe()

count    36.000000
mean      0.361111
std       0.487136
min       0.000000
25%       0.000000
50%       0.000000
75%       1.000000
max       1.000000
Name: cat_age, dtype: float64

Most `Master` passengers fall in age category `0`.

In [10]:
df[['Title', 'cat_age']].groupby('Title').median()

| Title | cat_age |
|---|---|
| Capt | 4.0 |
| Col | 3.0 |
| Don | 3.0 |
| Dr | 3.0 |
| Jonkheer | 3.0 |
| Lady | 3.0 |
| Major | 3.0 |
| Master | 0.0 |
| Miss | 2.0 |
| Mlle | 2.0 |


I'm going to fill the missing ages according to each passenger's `Title`.

In [11]:
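def fillna_median(x):
    # Replace missing values in a group with that group's median
    return x.fillna(x.median())

# Fill missing ages with the median age of passengers sharing the same title
df.Age = df.groupby('Title')['Age'].transform(fillna_median)

# Fill missing age categories the same way
df.cat_age = df.groupby('Title')['cat_age'].transform(fillna_median)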
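To see what the group-wise fill does, here is a minimal sketch on made-up data (the `toy` frame is hypothetical, not the training set). Each missing age takes the median of its own `Title` group rather than the overall median.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Titanic training set
toy = pd.DataFrame({
    'Title': ['Mr', 'Mr', 'Mr', 'Master', 'Master', 'Master'],
    'Age': [20.0, 40.0, np.nan, 2.0, 4.0, np.nan],
})

def fillna_median(x):
    return x.fillna(x.median())

# Each missing age is replaced by the median of its own Title group
toy['Age'] = toy.groupby('Title')['Age'].transform(fillna_median)

print(toy)
```

One thing to watch when the same fill is applied to category codes: the median of an even-sized group can be fractional (for example 2.5), and a later integer cast would truncate it. Deriving the category from the filled `Age` instead would sidestep that, though that is a suggestion and not what this notebook does.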

In [12]:
df.isnull().sum()

Survived             0
Pclass               0
Sex                  0
Age                  0
SibSp                0
Parch                0
Fare                 0
Embarked             0
Surname              0
Title                0
Fancy_title          0
U15_many_siblings    0
alone                0
is_male              0
Embarked_codes       0
is_master            0
is_mr                0
cat_age              0
dtype: int64

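Change the data type of the age category column from float to integer, since the categories are whole numbers. Note that this only saves memory if `int` maps to a type narrower than `float64`; a 64-bit integer takes the same space.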

In [13]:
df.cat_age = df.cat_age.astype(int)
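Whether the cast saves memory depends on the integer width. A quick sketch on a made-up column (the `codes_float` Series is hypothetical) compares `float64`, `int64` and `int8`; the five age categories fit comfortably in `int8`.

```python
import pandas as pd

# Hypothetical column of age categories stored as floats
codes_float = pd.Series([2.0, 3.0, 0.0, 1.0, 4.0] * 200)

# Bytes used by the values only (index excluded)
as_float = codes_float.memory_usage(index=False)
as_int64 = codes_float.astype('int64').memory_usage(index=False)
as_int8 = codes_float.astype('int8').memory_usage(index=False)

print(as_float, as_int64, as_int8)
```

On a frame of 891 rows the difference is negligible either way, so this is mainly a matter of tidiness.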

In [14]:
df.head()

| PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Surname | Title | Fancy_title | U15_many_siblings | alone | is_male | Embarked_codes | is_master | is_mr | cat_age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Braund | Mr | False | False | False | True | 2 | False | True | 2 |
| 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | Cumings | Mrs | False | False | False | False | 0 | False | False | 3 |
| 3 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Heikkinen | Miss | False | False | True | False | 2 | False | False | 2 |
| 4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | Futrelle | Mrs | False | False | False | False | 2 | False | False | 3 |
| 5 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Allen | Mr | False | False | True | True | 2 | False | True | 3 |


In [15]:
# save df
df.to_csv('./data/train-03-cfe.csv')

Up next: the model building can finally start!