# Feature Engineering

My aim in this file is to create useful features for modelling and to put those features in a form that a machine learning algorithm can accept. I will do this by:

- Creating categorical columns from text data.
- Changing the representation of current features (nominal data to ordinal, creating dummy variables, etc.).
- Creating interaction variables.

In [16]:
import numpy as np
import pandas as pd

from pathlib import Path
import re

In [17]:
DATA_PATH = Path('../data/processed/')
training_df = pd.read_csv(DATA_PATH/'cleaned_training.csv')
test_df = pd.read_csv(DATA_PATH/'cleaned_test.csv')

I will start by producing a deck column based on the cabin the passenger was staying in. From my research I have found there were decks A-G, which I will convert to numeric ordinal data (A = 1 through G = 7). Any missing values I will code as 8. We have one cabin in the training data on deck 'T'. This could refer to the bottom deck (tank top), though that deck didn't have cabins, or it could be a mistake. I will treat this cabin as a missing value.

In [18]:
def get_deck(dataframe):
    """
    Returns dataframe with ordinal column 'Deck' based on Cabin column and removes Cabin column.
    """
    decks = {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6, 'G': 7}
    letters = re.compile('[a-zA-Z]+')  # compiled once rather than once per row
    # Missing cabins become 'X'; the first run of letters is taken as the deck.
    dataframe['Deck'] = dataframe['Cabin'].fillna('X').apply(lambda x: letters.search(x).group())
    # Letters outside A-G (the 'X' placeholder and deck 'T') map to NaN, which is then coded as 8.
    dataframe['Deck'] = dataframe['Deck'].map(decks).fillna(8).astype('int32')

    return dataframe.drop('Cabin', axis=1)

In [19]:
training_df = get_deck(training_df)
test_df = get_deck(test_df)
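
As a quick sanity check of the deck coding described above, here is a minimal sketch on made-up cabin values (the `cabins` series is hypothetical, not taken from the data). It uses `Series.str.extract` in place of the row-wise `apply`, which is an alternative I am assuming gives the same codes, not the method used in `get_deck`.

```python
import pandas as pd

# Hypothetical cabin values chosen to cover each case: a plain cabin,
# a cabin with two letters, deck 'T', a missing value, and several cabins.
cabins = pd.Series(['C85', 'F G73', 'T', None, 'B57 B59 B63 B66'])

decks = {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6, 'G': 7}

# Take the first run of letters as the deck letter.
deck_letter = cabins.str.extract(r'([a-zA-Z]+)', expand=False)

# Missing cabins and unmapped letters such as 'T' become NaN, then 8.
deck_codes = deck_letter.map(decks).fillna(8).astype('int32')

print(deck_codes.tolist())
```

Note that a value like 'F G73' is coded from its first letter only, so it lands on deck F.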

Now I will combine SibSp and Parch to produce a relatives column.

In [20]:
training_df['Relatives'] = training_df['SibSp'] + training_df['Parch']
test_df['Relatives'] = test_df['SibSp'] + test_df['Parch']

Now let's analyse the name column and see if we can isolate the passenger's title. This could give us an idea of a passenger's 'importance', marital status, age or perhaps something else.

In [21]:
training_df['Name']

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

It appears that all the titles are followed by '.', so we can use this to try to isolate the title.

In [22]:
# Isolate all word characters directly before '.'
training_df['Name'].apply(lambda x: re.compile(r'\w+(?=\.)').search(x).group()).value_counts()

Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Mlle          2
Col           2
Major         2
Ms            1
Lady          1
Capt          1
Jonkheer      1
Countess      1
Mme           1
Sir           1
Don           1
Name: Name, dtype: int64

Most of our data is contained within the four most common titles, hence I will only include these as categories and group the other variations as 'Other'.

In [23]:
def get_passenger_title(dataframe):
    """
    Returns dataframe with 'Title' column created from Name and removes Name. Title distinguishes the four
    most common titles and categorises anything else as Other.
    """
    dataframe['Title'] = dataframe['Name'].apply(lambda x: re.compile(r'\w+(?=\.)').search(x).group())
    titles = {v:v for v in ['Mr', 'Miss', 'Mrs', 'Master']}
    dataframe['Title'] = dataframe['Title'].map(titles).fillna('Other')
    
    return dataframe.drop('Name', axis=1)

In [24]:
training_df = get_passenger_title(training_df)
test_df = get_passenger_title(test_df)
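
To illustrate the title grouping, here is a small sketch on a few names in the same "Surname, Title. Given names" layout. The `names` series and the `common` list are illustrative, and `str.extract` with `where` is an assumed vectorised alternative to the `apply`/`map` approach in `get_passenger_title`, not a replacement for it.

```python
import pandas as pd

# A few sample names in the same layout as the Name column.
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Montvila, Rev. Juozas',
])

# Capture the word characters that sit directly before the first '.'.
raw_titles = names.str.extract(r'(\w+)\.', expand=False)

# Keep the four common titles and label everything else 'Other'.
common = ['Mr', 'Miss', 'Mrs', 'Master']
titles = raw_titles.where(raw_titles.isin(common), 'Other')

print(titles.tolist())
```

Here 'Rev' falls outside the four common titles, so it is grouped under 'Other'.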

Let's create an indicator variable for being a child.

In [25]:
def get_child_indicator(dataframe):
    dataframe['Child'] = dataframe['Age'].apply(lambda x: 1 if x < 13 else 0)
    return dataframe

In [26]:
training_df = get_child_indicator(training_df)
test_df = get_child_indicator(test_df)
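
The threshold is strict, so a passenger aged exactly 13 is not flagged as a child. A minimal sketch on made-up ages (the `ages` series is hypothetical) shows the boundary, using a vectorised comparison that I am assuming is equivalent to the `apply` above:

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including the boundary value and a missing age.
ages = pd.Series([4.0, 12.0, 13.0, 35.0, np.nan])

# Comparisons with NaN evaluate to False, so a missing age would be coded 0.
child = (ages < 13).astype('int64')

print(child.tolist())
```

The cleaned data has no missing ages, so the NaN case is only shown to make the behaviour explicit.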

In [27]:
training_w_dummies = pd.get_dummies(training_df, drop_first=True)
test_w_dummies = pd.get_dummies(test_df, drop_first=True)

In [28]:
all(training_df.drop('Survived', axis=1).columns == test_df.columns)

True
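
The check above compares the dataframes before dummy encoding. Because `pd.get_dummies` is applied to each dataframe separately, the encoded columns could differ if a category appeared in only one of the two sets. Below is a sketch of one possible guard, on made-up data: fixing the category list from the training set before encoding. The `train`/`test` frames and the choice to take categories from training are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frames: the test set happens to contain only one port.
train = pd.DataFrame({'Fare': [7.25, 71.28, 8.05], 'Embarked': ['S', 'C', 'Q']})
test = pd.DataFrame({'Fare': [7.83, 9.69], 'Embarked': ['S', 'S']})

# Fix the category list from the training data so both frames share it.
categories = sorted(train['Embarked'].unique())
train_cat = train.assign(Embarked=pd.Categorical(train['Embarked'], categories=categories))
test_cat = test.assign(Embarked=pd.Categorical(test['Embarked'], categories=categories))

# With a shared categorical dtype, drop_first removes the same level in both.
train_d = pd.get_dummies(train_cat, drop_first=True)
test_d = pd.get_dummies(test_cat, drop_first=True)

print(list(train_d.columns) == list(test_d.columns))
```

Without the shared categories, `drop_first=True` would drop 'S' from the hypothetical test frame (its only level) while dropping 'C' from the training frame, so the two encodings would not line up.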

In [29]:
training_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
Survived     891 non-null int64
Pclass       891 non-null int64
Sex          891 non-null object
Age          891 non-null float64
SibSp        891 non-null int64
Parch        891 non-null int64
Fare         891 non-null float64
Embarked     891 non-null object
Deck         891 non-null int32
Relatives    891 non-null int64
Title        891 non-null object
Child        891 non-null int64
dtypes: float64(2), int32(1), int64(6), object(3)
memory usage: 80.2+ KB


## Summary

I will leave my feature engineering for now. I have created two versions of each dataframe, one with dummy variables created from the categorical data and one without. I want to keep the version without dummy variables so that I can test for association between the categories and the target using chi-squared tests. The features I created were:

- Ordinal Deck number from Cabin column.
- Discrete Relatives column from combining Parch and SibSp.
- Nominal Title column extracted from Name column.
- Binary indicator for 'Child' (Age < 13).

I had planned to create interaction variables, but I won't do this for now. Instead, I will write my modelling code so that it can use these terms if I do decide to create them.
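
If interaction terms are added later, one simple form is the product of two existing columns. The sketch below is only an illustration: the helper `add_interaction`, the column naming scheme and the choice of `Pclass` with `Child` are all hypothetical, not decisions made in this notebook.

```python
import pandas as pd

def add_interaction(dataframe, first, second):
    """Return a copy with a product column named '<first>_x_<second>'."""
    out = dataframe.copy()
    out[f'{first}_x_{second}'] = out[first] * out[second]
    return out

# Made-up rows with two of the numeric columns created above.
example = pd.DataFrame({'Pclass': [1, 3, 2], 'Child': [0, 1, 1]})
with_term = add_interaction(example, 'Pclass', 'Child')

print(with_term['Pclass_x_Child'].tolist())
```

Returning a copy leaves the input dataframe unchanged, which makes it easier to compare models with and without the extra term.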

In [30]:
PROCESSED_PATH = Path('../data/processed')
training_df.to_csv(PROCESSED_PATH/'final_training.csv', index=False)
test_df.to_csv(PROCESSED_PATH/'final_test.csv', index=False)
training_w_dummies.to_csv(PROCESSED_PATH/'final_training_w_dummies.csv', index=False)
test_w_dummies.to_csv(PROCESSED_PATH/'final_test_w_dummies.csv', index=False)