# Kaggle Titanic Competition - Feature Engineering
- Problem Description: The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew. While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others. In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).
- Author: Kimberly Gaddie
- Date Last Updated: 17 May 2021

#### Import Libraries and Datasets

In [None]:
import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

pd.options.mode.chained_assignment = None  # default='warn'
# VS Code editor setting, belongs in settings.json rather than in Python code:
# "notebook.output.textLineLimit": 500

df_full = pd.read_csv('df_full_cleaned.csv')

## Feature Engineering
- Family Size
- Ticket
    - Group Booking
- Dummy Variables
    - Sex
    - Title
    - Cabin
    - Embarked
- Fare per Person
- Scaled Class by Age
- Binned Age
- Adult v. Child

##### Family Size
- The data set records the number of siblings/spouses (`SibSp`) and parents/children (`Parch`) aboard for each passenger
- We can sum the two into one family-size count. The sum may overcount, but the bias is systematic
    and should affect all records in the same way
- We don't yet know whether family size helps predict survival, so we check first

In [None]:
f_cor = df_full[['Parch', 'SibSp', 'Survived']].corr()

# Correlations range from -1 to 1, so the color scale must span both ends
sns.heatmap(f_cor, vmax=1, vmin=-1, cmap="YlGnBu", annot=True)

In [None]:
print(df_full['Parch'].value_counts())
print(df_full['SibSp'].value_counts())

In [None]:
sns.barplot(x=df_full['Parch'].astype('str'), y=df_full['Survived'], hue=df_full['Sex'])
plt.show()

sns.barplot(x=df_full['SibSp'].astype('str'), y=df_full['Survived'], hue=df_full['Sex'])
plt.show()

In [None]:
df_full['Total Family'] = df_full['Parch'] + df_full['SibSp']

sns.barplot(x=df_full['Total Family'].astype('str'), y=df_full['Survived'], hue=df_full['Sex'])
plt.show()
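To read the same comparison as numbers rather than bars, here is a minimal sketch on made-up data (the `toy` frame and its values are hypothetical, not taken from the competition files). It sums `SibSp` and `Parch` and then takes the mean of `Survived` per family size, which is the quantity the bar plot draws.

```python
import pandas as pd

# Hypothetical toy data standing in for df_full
toy = pd.DataFrame({
    'SibSp':    [0, 0, 1, 1, 1],
    'Parch':    [0, 0, 0, 0, 2],
    'Survived': [0, 1, 1, 1, 0],
})

# Family size excluding the passenger themselves
toy['Total Family'] = toy['SibSp'] + toy['Parch']

# Mean of a 0/1 column is the survival rate within each group
rate = toy.groupby('Total Family')['Survived'].mean()
print(rate)
```

With real data, groups containing only a handful of passengers will give noisy rates, so it is worth checking the group sizes as well.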

##### Ticket -- Group Bookings
- Not all Ticket IDs are unique. We can add a feature that flags whether a ticket covers more than one passenger


In [None]:
non_unique = df_full['Ticket'][df_full['Ticket'].duplicated()]
len(non_unique)

In [None]:
df_full['is_group'] = np.where(df_full['Ticket'].isin(non_unique), 1, 0)
df_full['is_group'].describe()
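As a minimal sketch on made-up data (the `toy` frame and its ticket values are hypothetical), the same kind of flag can be built in one step with `duplicated(keep=False)`, which marks every row whose ticket appears more than once, including the first occurrence.

```python
import pandas as pd

# Hypothetical toy data: three passengers share ticket 'A1', the others travel alone
toy = pd.DataFrame({'Ticket': ['A1', 'B2', 'A1', 'C3', 'A1']})

# keep=False flags all members of a duplicated group, not just the repeats
toy['is_group'] = toy['Ticket'].duplicated(keep=False).astype(int)
print(toy)
```

This skips the intermediate list of repeated tickets, at the cost of not having that list around for inspection.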

##### Dummy Variables
- Sex
- Title 
- Cabin (There are a lot, can we combine??)
- Port of Embarkation 

Do not convert Pclass to dummy variables. It is an ordinal variable with a clear rank order!

In [None]:
# Sex Dummy Variables

df_full['female'] = np.where(df_full['Sex'] == 'female', 1, 0)
df_full['male'] = np.where(df_full['Sex'] == 'male', 1, 0)

df_full.head()

In [1]:
# Title Dummy Variables
labels = list(df_full['Title'].unique())

for title in labels:
    df_full[title] = np.where(df_full['Title'] == title, 1, 0)
    
df_full.head()

In [None]:
# Cabin Dummy Variable
labels = list(df_full['New Cabin'].unique())

for cabs in labels:
    df_full['cabin_' + str(cabs)] = np.where(df_full['New Cabin'] == cabs, 1, 0)
    
df_full.head()

In [None]:
# Embarked Dummy Variable
ports = df_full['Embarked'].unique()

# Use a separate loop variable so the array of ports is not overwritten
for port in ports:
    df_full['Embarked_' + port] = np.where(df_full['Embarked'] == port, 1, 0)
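The dummy-variable loops can also be sketched with `pd.get_dummies`. This is an alternative, not what the cells in this notebook do: it names columns `<column>_<value>` (so `Sex_female` rather than `female`) and drops the original columns it encodes. The `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for df_full
toy = pd.DataFrame({
    'Sex':      ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
    'Pclass':   [3, 1, 2],
})

# Encode only the nominal columns; Pclass is left alone so it stays ordinal
encoded = pd.get_dummies(toy, columns=['Sex', 'Embarked'], dtype=int)
print(encoded.columns.tolist())
```

Passing `columns=` explicitly keeps numeric-coded categories such as `Pclass` from being encoded by accident.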

##### Fare per Person on Group Tickets
- Fare is the total cost of the ticket, not normalized by the number of people on it. This could bias the feature for large families v. individuals
- Let's divide by the number of people per ticket to get a per-person cost (should be a better proxy for socio-economic status)

In [None]:
sns.barplot(x=df_full['is_group'].astype(str), y=df_full['Fare'])
plt.show()

In [None]:
# Count passengers per ticket once, then map the counts back onto each row
df_full['fare_divider'] = df_full['Ticket'].map(df_full['Ticket'].value_counts())
df_full['New Fare'] = df_full['Fare'] / df_full['fare_divider']
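To make the arithmetic concrete, here is a small worked sketch on made-up data (the `toy` frame is hypothetical). It assumes, as the notes above do, that `Fare` holds the whole ticket price on every row sharing that ticket. A grouped `transform('count')` gives the number of rows per ticket, aligned to the original rows.

```python
import pandas as pd

# Hypothetical toy data: ticket 'A1' cost 30.0 in total and covers three people
toy = pd.DataFrame({
    'Ticket': ['A1', 'A1', 'B2', 'A1'],
    'Fare':   [30.0, 30.0, 8.0, 30.0],
})

# Rows per ticket, broadcast back to every row of that ticket
toy['fare_divider'] = toy.groupby('Ticket')['Ticket'].transform('count')

# Per-person fare: 30.0 / 3 for the group, 8.0 / 1 for the solo traveller
toy['New Fare'] = toy['Fare'] / toy['fare_divider']
print(toy)
```

Counts taken from one file only see the passengers in that file, so a group split across train and test would be undercounted unless both are combined first.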

In [None]:
# What do we do with fares == 0?
df_full[(df_full['Fare'] < 1)]

# All males... Only 1 survived, multiple classes, booked on group tickets
# I think the replacement should be based on cabin, where they embarked, and the Pclass

In [None]:
# Select 'Fare' before aggregating so non-numeric columns are never passed to median()
fares = df_full.groupby(['Embarked', 'Pclass', 'New Cabin'])['Fare'].median()
# fares.reset_index(inplace=True)
fares

In [None]:
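# Index by the grouping keys so the group medians in `fares` align with each row
df_full.set_index(['Embarked', 'Pclass', 'New Cabin'], drop=False, inplace=True)

# Assign the result instead of calling fillna(inplace=True) on a column selection.
# Note: fillna only replaces NaN, so fares equal to 0 are left unchanged here.
df_full['Fare'] = df_full['Fare'].fillna(fares)
df_full.reset_index(drop=True, inplace=True)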
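If the goal is to replace zero fares as well as missing ones, one possible approach is sketched below on made-up data. Treating a fare of 0 as missing is an assumption here, not something the data confirms, and the `toy` frame groups on only two keys to stay short.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one zero fare and one missing fare
toy = pd.DataFrame({
    'Embarked': ['S', 'S', 'S', 'C', 'C'],
    'Pclass':   [3, 3, 3, 1, 1],
    'Fare':     [8.0, 0.0, 10.0, 80.0, np.nan],
})

# Assumption: a zero fare is really an unknown fare, so mark it as NaN
toy['Fare'] = toy['Fare'].replace(0, np.nan)

# Median of the remaining fares in each group, aligned row by row
group_median = toy.groupby(['Embarked', 'Pclass'])['Fare'].transform('median')
toy['Fare'] = toy['Fare'].fillna(group_median)
print(toy)
```

Because `transform` returns a Series aligned to the original rows, no `set_index` / `reset_index` round trip is needed. A group whose fares are all missing would still come out as NaN.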

In [None]:
df_full['Fare'].isna().sum()

In [None]:
sns.barplot(x=df_full['is_group'].astype(str), y=df_full['New Fare'])
plt.show()

##### Scaled Class by Age
- I'm not sure this is useful, but it doesn't hurt to test it out

In [None]:
df_full['class_age'] = df_full['Pclass'] * df_full['Age']

##### Bucket Age
- By bucketing age, we can absorb *some* of the error introduced by the earlier imputation.
    This might help with 'generated regressors'

In [None]:
df_full['binned_age'] = pd.cut(df_full['Age'], [0, 10, 20, 30, 40, 50, 60, 70, 80], labels=[0, 10, 20, 30, 40, 50, 60, 70])
df_full['binned_age'].head()

In [None]:
df_full[df_full['binned_age'].isna()]
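A small sketch of how `pd.cut` assigns these bins, using made-up ages. By default the intervals are closed on the right, so an age of exactly 10 lands in the first bin, and anything outside `(0, 80]` comes back as NaN, which is what the check above would surface.

```python
import pandas as pd

bins = [0, 10, 20, 30, 40, 50, 60, 70, 80]
labels = bins[:-1]  # label each bin by its lower edge

# Hypothetical ages chosen to sit on and around the bin edges
ages = pd.Series([0.5, 10, 10.5, 35, 80])
binned = pd.cut(ages, bins=bins, labels=labels)
print(binned.tolist())

# Ages of exactly 0 or above 80 fall outside every interval
outside = pd.cut(pd.Series([0, 81]), bins=bins, labels=labels)
print(outside.isna().tolist())
```

Passing `include_lowest=True` would pull an age of exactly 0 into the first bin if that case ever occurs.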

In [None]:
df_full['is_minor'] = np.where(df_full['Age'] < 14, 1, 0)