# Feature Engineering

My aim in this file is to create useful features for modelling and to put features in a form that a machine learning algorithm can accept. I will do this by:

- Creating categorical columns from text data.
- Changing the representation of existing features (continuous data to discrete, dummy variables, etc.).
- Creating interaction variables.
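
As a rough illustration of the second and third bullets, the sketch below bins a continuous column, one-hot encodes a categorical one, and multiplies two columns into an interaction term. The toy frame, the bin edges, and the `AgeBand` and `Age_x_Pclass` names are assumptions made for this example, not values used later in this file.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the Titanic set, values are made up.
toy = pd.DataFrame({
    'Age': [4.0, 22.0, 38.0, 61.0],
    'Pclass': [3, 3, 1, 2],
    'Embarked': ['S', 'C', 'S', 'Q'],
})

# Continuous -> discrete: assumed bin edges, right-inclusive by default.
toy['AgeBand'] = pd.cut(toy['Age'], bins=[0, 12, 30, 50, 100],
                        labels=['child', 'young', 'middle', 'senior'])

# Categorical -> dummy variables, one indicator column per category.
toy = pd.get_dummies(toy, columns=['Embarked'])

# Interaction variable: product of two existing numeric features.
toy['Age_x_Pclass'] = toy['Age'] * toy['Pclass']
```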

In [40]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from pathlib import Path
import re

In [45]:
DATA_PATH = Path('../data/processed/')
training_df = pd.read_csv(DATA_PATH/'cleaned_training.csv')
test_df = pd.read_csv(DATA_PATH/'cleaned_test.csv')

I will start by producing a deck column based on the cabin the passenger was staying in. From my research, the cabins were on decks A-G, which I will convert to numeric ordinal data (A = 1 through G = 7). Missing values will be assigned the value 8. One cabin in the training data is on deck 'T'; this could refer to the bottom deck (tank top), although that deck had no cabins, or it could be a mistake. I will treat this cabin as a missing value.

In [50]:
def get_deck(dataframe):
    """
    Returns dataframe with ordinal column 'Deck' based on Cabin column and removes Cabin column.
    """
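    decks = {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6, 'G': 7}
    # Take the first run of letters in the cabin string; missing cabins become 'X'.
    deck_pattern = re.compile('[a-zA-Z]+')
    dataframe['Deck'] = dataframe['Cabin'].fillna('X').apply(lambda x: deck_pattern.search(x).group())
    # Letters outside A-G ('X' for missing, 'T' for the anomalous cabin) map to NaN, then to 8.
    dataframe['Deck'] = dataframe['Deck'].map(decks).fillna(8).astype('int32')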
    
    return dataframe.drop('Cabin', axis=1)

In [51]:
training_df = get_deck(training_df)
test_df = get_deck(test_df)
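
To make the mapping concrete, here is a small self-contained sketch of the same rule applied to a handful of cabin strings. It uses `str.extract` to take the first letter instead of the `apply` call above, which is a swapped-in alternative that should agree with it whenever the cabin string starts with a single deck letter. The `deck_from_cabin` helper and the sample cabins are illustrative only.

```python
import pandas as pd

DECKS = {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6, 'G': 7}

def deck_from_cabin(cabin):
    """Map a Series of cabin strings to ordinal decks; 8 means unknown."""
    # First letter in the string; missing cabins stay missing.
    letter = cabin.str.extract(r'([A-Za-z])', expand=False)
    # Letters outside A-G (such as 'T') and missing values both become 8.
    return letter.map(DECKS).fillna(8).astype('int32')

sample = pd.Series(['C85', None, 'T', 'F G73', 'B57 B59'])
decks = deck_from_cabin(sample)
```

Entries that list several cabins, such as 'F G73', take the deck of the first letter found.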

Now I will combine SibSp and Parch to produce a relatives column.

In [53]:
training_df['Relatives'] = training_df['SibSp'] + training_df['Parch']
test_df['Relatives'] = test_df['SibSp'] + test_df['Parch']

Now let's analyse the Name column and see if we can isolate the passenger's title. This could give us an idea of a passenger's 'importance', marital status, age or perhaps something else.

In [87]:
training_df['Name']

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

It appears that every title is followed by '.', so we can use this to isolate the title.
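
A minimal sketch of that idea: the pattern `\w+(?=\.)` matches a run of word characters only when it is immediately followed by a full stop, and the lookahead keeps the full stop out of the match. The three names are copied from the output above; `extract_title` is a hypothetical helper name used only for this example.

```python
import re

# One or more word characters, followed by (but not including) a literal '.'
TITLE_PATTERN = re.compile(r'\w+(?=\.)')

def extract_title(name):
    """Return the first word that directly precedes a '.', or None."""
    match = TITLE_PATTERN.search(name)
    return match.group() if match else None

names = [
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Montvila, Rev. Juozas',
]
titles = [extract_title(n) for n in names]
```

Unlike calling `.group()` directly on the search result, this helper returns `None` for a name with no full stop rather than raising an `AttributeError`.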

In [101]:
# Isolate all word characters before '.'
training_df['Name'].apply(lambda x: re.compile(r'\w+(?=\.)').search(x).group()).value_counts()

Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Major         2
Col           2
Mlle          2
Countess      1
Lady          1
Ms            1
Mme           1
Sir           1
Jonkheer      1
Don           1
Capt          1
Name: Name, dtype: int64

The four most common titles cover 864 of the 891 passengers (about 97%), so I will keep only these as categories and group the remaining titles as 'Other'.

In [105]:
def get_passenger_title(dataframe):
    """
    Returns dataframe with 'Title' column created from Name and removes Name. Title distinguishes the four
    most common titles and categorises anything else as Other.
    """
    # Title is the first word directly followed by '.' in the name.
    title_pattern = re.compile(r'\w+(?=\.)')
    dataframe['Title'] = dataframe['Name'].apply(lambda x: title_pattern.search(x).group())
    # Identity map for the common titles; unmapped titles become NaN, then 'Other'.
    titles = {v: v for v in ['Mr', 'Miss', 'Mrs', 'Master']}
    dataframe['Title'] = dataframe['Title'].map(titles).fillna('Other')
    
    return dataframe.drop('Name', axis=1)

In [107]:
training_df = get_passenger_title(training_df)
test_df = get_passenger_title(test_df)
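
The grouping step can also be written with `Series.where` and `isin`, which avoids building an identity dictionary. This is an alternative sketch rather than the function used above, and the sample titles are a hand-picked subset of the counts shown earlier.

```python
import pandas as pd

COMMON_TITLES = ['Mr', 'Miss', 'Mrs', 'Master']

raw_titles = pd.Series(['Mr', 'Mrs', 'Dr', 'Master', 'Mlle', 'Miss', 'Rev'])

# Keep the four common titles; replace everything else with 'Other'.
grouped = raw_titles.where(raw_titles.isin(COMMON_TITLES), 'Other')
```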

In [108]:
training_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
Survived     891 non-null int64
Pclass       891 non-null int64
Sex          891 non-null object
Age          891 non-null float64
SibSp        891 non-null int64
Parch        891 non-null int64
Fare         891 non-null float64
Embarked     891 non-null object
Deck         891 non-null int32
Relatives    891 non-null int64
Title        891 non-null object
dtypes: float64(2), int32(1), int64(5), object(3)
memory usage: 73.2+ KB
