# Basic Feature Engineering for Titanic Dataset

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import re # regex 

In [2]:
train_data = pd.read_csv("/Users/mario/OneDrive/Repositories/Github/Titanic/data/train.csv")
train_data.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Titles

There is more information we can extract from the Name feature, such as each person's title. The titles found in the Name feature will all be reduced to Mrs, Miss, Mr, and Master. To do this we will use regular expressions (regex), a technique for searching text for patterns that are described by a sequence of characters. A tool like regex can greatly help a data scientist with feature engineering. Python provides a module called 're' for working with regex.

The following functions implement the regex logic that gets the titles from the Name feature.

In [3]:
# extract the title and reduce them all to Mrs, Miss, Mr and Master.
def get_title(txt):
    # find items that have ',' followed by some text and ending in '.'
    x = re.search(r"\,.*\.", txt)
    if x:
        value = x.group()
        # find text after ',' and ' '
        x2 = re.search(r"(?<=\,\s).*\.", value)
        value2 = x2.group()
        # find text till the first '.'
        x3 = re.search(r"\b\w+\b(?=\.)", value2)
        return x3.group()

# apply get_title() to each value of the Name feature and save the result to a new Title feature
train_data['Title'] = train_data['Name'].map(get_title)
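
The three searches in get_title() can also be folded into one pattern with a capture group. The sketch below is an alternative under stated assumptions, not the notebook's method: `TITLE_PATTERN` and `extract_title` are hypothetical names, the first two sample names come from the head of the data shown earlier, and the third is a made-up name that only illustrates a title preceded by another word.

```python
import re
import pandas as pd

# Two names from the data preview plus one made-up name (hypothetical)
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Doe, the Countess. of (Jane)",  # hypothetical example, not from the dataset
])

# After the first comma, skip as little text as possible, then capture
# the first word that is immediately followed by a '.'
TITLE_PATTERN = re.compile(r",\s*.*?(\w+)\.")

def extract_title(name):
    """Return the title found in a name, or None when the pattern does not match."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else None

titles = names.map(extract_title)
print(titles.tolist())  # ['Mr', 'Miss', 'Countess']
```

Returning None for names that do not match keeps unexpected name formats visible as missing values instead of raising an error.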

In [4]:
# list all unique titles from new Title feature
# these unique titles will be used to construct replace_titles() logic
print(train_data['Title'].unique())

['Mr' 'Mrs' 'Miss' 'Master' 'Don' 'Rev' 'Dr' 'Mme' 'Ms' 'Major' 'Lady'
 'Sir' 'Mlle' 'Col' 'Capt' 'Countess' 'Jonkheer']


In [5]:
# add empty column to dataframe
train_data['ReplacedTitles'] = np.nan

# replace the titles with reduced version
def replace_titles(x):
    title = x['Title']
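    # 'Sir' and 'Lady' are included so that every title ends up as Mr, Mrs, Miss or Master
    if title in ['Don', 'Major', 'Capt', 'Jonkheer', 'Rev', 'Col', 'Sir']:
        return 'Mr'
    elif title in ['Countess', 'Mme', 'Lady']: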
        return 'Mrs'
    elif title in ['Mlle', 'Ms']:
        return 'Miss'
    elif title =='Dr':
        # Sex values are lowercase in this dataset ('male' / 'female')
        if x['Sex']=='male':
            return 'Mr'
        else:
            return 'Mrs'
    else:
        return title

# apply replace_titles() to each row and save the result to the ReplacedTitles feature
train_data['ReplacedTitles']=train_data.apply(replace_titles, axis=1)
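
For comparison, the same kind of reduction can be written without a row-wise apply() by combining a lookup dictionary with a boolean mask for the 'Dr' case. This is a minimal sketch on a hypothetical mini-frame (`df`, `TITLE_MAP` and the sample rows are assumptions for illustration), and the map covers only some of the titles. It relies on the Sex values being lowercase, as in the data preview.

```python
import pandas as pd

# Hypothetical mini-frame standing in for train_data
df = pd.DataFrame({
    "Title": ["Mr", "Mlle", "Dr", "Dr", "Col", "Master"],
    "Sex": ["male", "female", "male", "female", "male", "male"],
})

# Illustrative lookup table; titles missing from it are left unchanged
TITLE_MAP = {
    "Don": "Mr", "Major": "Mr", "Capt": "Mr",
    "Jonkheer": "Mr", "Rev": "Mr", "Col": "Mr",
    "Countess": "Mrs", "Mme": "Mrs",
    "Mlle": "Miss", "Ms": "Miss",
}

reduced = df["Title"].replace(TITLE_MAP)

# 'Dr' depends on the Sex column, so it is handled with masks
is_dr = df["Title"] == "Dr"
reduced = reduced.mask(is_dr & (df["Sex"] == "male"), "Mr")
reduced = reduced.mask(is_dr & (df["Sex"] != "male"), "Mrs")

df["ReplacedTitles"] = reduced
print(df["ReplacedTitles"].tolist())  # ['Mr', 'Miss', 'Mr', 'Mrs', 'Mr', 'Master']
```

Keeping the mapping in a dictionary makes it easy to review which raw titles go to which reduced title.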

## Deck

Similarly, the Cabin feature has potential for additional feature engineering. In this data most recorded cabins belong to 1st class passengers, and rows without a cabin will be labeled 'unknown'. A cabin number is a letter followed by a number, such as 'C85'. Some rows hold more than one cabin number, such as 'C23 C25 C27'; in that case we extract the deck letter of the first cabin number. A quick glance at the data in Excel showed no row whose cabin numbers are on different decks, so the assumption moving forward is that we can safely pick the deck from the first cabin number. There are other odd artifacts in the cabin values: some deck letters have no number next to them, and we ignore a letter unless a number is attached to it. Example: for 'F G73', the 'F' is ignored and the 'G' is taken.

In [6]:
# add empty column to dataframe
train_data['Deck'] = np.nan

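def get_deck(txt):
    # missing cabins are NaN (a float), so only strings are searched
    if isinstance(txt, str):
        # first deck letter that is directly followed by a 1-3 digit cabin number
        x1 = re.search(r"[A-Z][0-9]{1,3}", txt)
        if x1 is not None:
            # the deck is the leading letter of the matched cabin number
            return x1.group()[0]
    return "unknown"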

train_data['Deck']=train_data['Cabin'].map(lambda x: get_deck(x))
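
The assumption that all cabin numbers in a row share one deck can also be checked in code rather than by eye. The following is a sketch on hypothetical Cabin values (`cabins`, `decks_in` and `mixed` are illustrative names); running the same check on train_data['Cabin'] would list any rows that break the assumption.

```python
import re
import pandas as pd

# Hypothetical Cabin values covering the cases described in the text
cabins = pd.Series(["C85", "C23 C25 C27", "F G73", None, "B57 B59"])

def decks_in(cabin):
    """Return the set of deck letters that are directly followed by a cabin number."""
    if not isinstance(cabin, str):
        return set()
    return set(re.findall(r"([A-Z])[0-9]{1,3}", cabin))

deck_sets = cabins.map(decks_in)

# Rows whose numbered cabins sit on more than one deck
mixed = cabins[deck_sets.map(len) > 1]
print(len(mixed))  # 0 for these sample values
```

An empty result means the first numbered cabin is a safe choice for the deck of every row that was checked.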

In [7]:
# testing an edge case: get_deck() returns 'G' for cabin 'F G73'
x = train_data[train_data['Cabin'] == 'F G73']
print(x)

     PassengerId  Survived  Pclass  \
75            76         0       3   
715          716         0       3   

                                           Name   Sex   Age  SibSp  Parch  \
75                      Moen, Mr. Sigurd Hansen  male  25.0      0      0   
715  Soholt, Mr. Peter Andreas Lauritz Andersen  male  19.0      0      0   

     Ticket  Fare  Cabin Embarked Title ReplacedTitles Deck  
75   348123  7.65  F G73        S    Mr             Mr    G  
715  348124  7.65  F G73        S    Mr             Mr    G  


In [8]:
# list all unique decks from the new Deck feature
print(train_data['Deck'].unique())

['unknown' 'C' 'E' 'G' 'D' 'A' 'B' 'F']


## Family Size

In a disaster like the Titanic, who survives may depend on family size, for example on whether a passenger travels with a family or alone. Such information can be extracted by creating a linear combination of features. A new feature for family size may be helpful for decision trees, which have a difficult time modeling such relationships.
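
One common way to build such a feature is to add the SibSp and Parch columns. The sketch below assumes that convention, with 1 added for the passenger; the `FamilySize` and `IsAlone` names and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows; SibSp and Parch are columns of the dataset
df = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Assumed convention: siblings/spouses + parents/children + the passenger
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# Flag solo travelers explicitly (1 = traveling alone)
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

print(df["FamilySize"].tolist())  # [2, 1, 6]
print(df["IsAlone"].tolist())     # [0, 1, 0]
```

A tree can then split on one FamilySize column instead of approximating the sum through separate splits on SibSp and Parch.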
