# Titanic (Classification)
Perhaps one of the most infamous shipwrecks in history, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 people on board. Interestingly, by analysing the probability of survival based on a few attributes such as gender, age, and social status, we can make quite accurate predictions about which passengers survived. Some groups, such as women, children, and the upper class, were more likely to survive than others, so the data also tells us something about society's priorities and privileges at the time.

The **data population** is **[TODO]**

# Assignment
From the [Kaggle Competition](https://www.kaggle.com/c/titanic/overview)

## 2.1 State Group Members [1 point]
You can choose to finish the project individually or in a group (up to 2 members). If you want to finish the whole project, including stage-1 and stage-2, in a group, please let the instructor know as soon as possible (before Oct 26, 2021). You are required to state the members' full names clearly.

## 2.2 Problem Formulation/Introduction [5 points]
At the beginning of your notebook, you are required to write an introduction of the problem, which will motivate the readers.

1. You need to carefully read the description of your chosen Kaggle competition and formulate the data problem you are going to solve, similar to what we have discussed in class. Points will be deducted if the formulation does not correctly describe the corresponding Kaggle problem, or if it is too unclear to understand or to judge against the previous point. [3 points]

2. According to the problem, what's the data population? [2 points]

## 2.3 Data [38 points]
Then, you can join the competition and download the data. You must include, but are not limited to, the following items in your notebook.

1. If you don't have a Kaggle account, you need to register one. Please include your account name in your notebook, which will be the one showing on the Leaderboard, when you submit your result in stage-2. [1 point]

2. After you download the data, you need to load it into your notebook, show several lines of each separate file, and describe what that data is for. [5 points]

3. Data Wrangling: this is an open component, and you need to transform the data into a data frame or data frames for analysis or visualization. However, you need to include a discussion of each data property (i.e., structure, granularity, scope, temporality, and faithfulness), using the data (e.g., visualizations or statistical summaries) as evidence. If a property cannot be identified using the data, you also need to show and state that it is not available. [25 points, 5 for each property]

4. Based on the data wrangling (and possibly some visualizations), is the data obtained representative of your data population for this problem? What assumptions are needed so that an analysis using the current data can solve the problem? [5 points]

5. Comparing the data population and the given data, guess what kind of sampling method may have been used during data collection, and explain why you guess so. [2 points]

## 2.4 Formats and Readability
Your notebook should be well organized and easy to read, so that any follow-up data scientist can follow and understand it. [0 points, -5 if hard to follow]

Bonus points [+2] will be given to the best notebook or notebooks, which will be shared with all students in this class.

I worked on the base project in late 2020, and I'm building on that notebook for this course. Here is a link to the [Udemy course](https://www.udemy.com/course/deployment-of-machine-learning-models/) I worked on.

# 2.1 State Group Members
**Nicole Guobadia [Individual]**

# 2.2 Problem Formulation/Introduction [5 points] **(TODO)**

# 2.3.1 Data Cleaning Phase
We'll start with the imports and a constant for the data path.

In [1]:
import re
import joblib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Sklearn module for some preprocessing
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# For imputation since there are a couple of missing values
from feature_engine.imputation import (
    CategoricalImputer,
    AddMissingIndicator,
    MeanMedianImputer)

# For encoding categorical values
from feature_engine.encoding import (
    RareLabelEncoder,
    OneHotEncoder
)

DATA_DIR = "data/train.csv"

In [2]:
# load the data
data = pd.read_csv(DATA_DIR)
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


This is really just a pet peeve of mine, but I really don't like that the column names are capitalized:

In [3]:
# lower-case every column name
data = data.rename(columns=str.lower)
data.head()

Unnamed: 0,passengerid,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
# replace question marks with NaN values
data = data.replace('?', np.nan)

In [5]:
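# retain only the first cabin if more than
# one is listed for a passenger
def get_first_cabin(value):
    try:
        return value.split()[0]
    except (AttributeError, IndexError):
        # missing values (NaN floats) have no .split(); empty strings have no first token
        return np.nan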
    
data['cabin'] = data['cabin'].apply(get_first_cabin)
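As a quick check of what this step does, here is a minimal sketch on a toy Series (the values are illustrative, not read from `train.csv`). It uses pandas' vectorised string methods as an alternative to `apply`; this is not the approach used in the cell above, just an equivalent way to keep the first whitespace-separated token while letting missing values stay missing.

```python
import numpy as np
import pandas as pd

# Toy cabin values (illustrative): several cabins, one cabin, and a missing entry
toy_cabins = pd.Series(["C23 C25 C27", "E46", np.nan])

# Split on whitespace and keep the first token; NaN propagates through both steps
first_cabins = toy_cabins.str.split().str[0]

print(first_cabins.tolist())
```

The first two entries become `C23` and `E46`, and the missing entry remains `NaN`.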

In [6]:
# extracts the title (Mr, Mrs, Miss, Master, or Other) from the name variable
def get_title(passenger):
    line = passenger
    if re.search('Mrs', line):
        return 'Mrs'
    elif re.search('Mr', line):
        return 'Mr'
    elif re.search('Miss', line):
        return 'Miss'
    elif re.search('Master', line):
        return 'Master'
    else:
        return 'Other'
    
data['title'] = data['name'].apply(get_title)
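The function above relies on substring search, which works here because `Mrs` is checked before `Mr`. As a hedged alternative sketch (not the method used in this notebook), the title can also be pulled out positionally, assuming names follow the `Surname, Title. Given names` pattern seen in the rows printed earlier. The fourth name below is hypothetical and only there to exercise the `Other` branch.

```python
import pandas as pd

toy_names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Th...",
    "Heikkinen, Miss. Laina",
    "Doe, Dr. Jane",  # hypothetical name, not from the dataset
])

# Capture the text between the first comma and the following period
raw_titles = toy_names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Keep the four titles the notebook distinguishes; everything else becomes 'Other'
known_titles = {"Mr", "Mrs", "Miss", "Master"}
titles = raw_titles.where(raw_titles.isin(known_titles), "Other")

print(titles.tolist())
```

On these four names the result is `Mr`, `Mrs`, `Miss`, and `Other`.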

In [7]:
# cast numerical variables as floats
data['fare'] = data['fare'].astype('float')
data['age'] = data['age'].astype('float')

In [8]:
# drop unnecessary variables
data = data.drop(labels=['name','ticket'], axis=1)
data.head()

Unnamed: 0,passengerid,survived,pclass,sex,age,sibsp,parch,fare,cabin,embarked,title
0,1,0,3,male,22.0,1,0,7.25,,S,Mr
1,2,1,1,female,38.0,1,0,71.2833,C85,C,Mrs
2,3,1,3,female,26.0,0,0,7.925,,S,Miss
3,4,1,1,female,35.0,1,0,53.1,C123,S,Mrs
4,5,0,3,male,35.0,0,0,8.05,,S,Mr


In [9]:
class ExtractLetterTransformer(BaseEstimator, TransformerMixin):
    # Extract the first letter of each listed variable
    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        # we need this step to fit the sklearn pipeline
        return self

    def transform(self, X):
        # so that we do not over-write the original dataframe
        X = X.copy()
        
        for feature in self.variables:
            X[feature] = X[feature].str[0]

        return X
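The core of `transform` is the `.str[0]` accessor. A minimal sketch on toy values (illustrative, not taken from the dataset) shows how it behaves, including on a missing value and on the `'Missing'` placeholder that I assume the categorical imputer writes by default:

```python
import numpy as np
import pandas as pd

# Toy cabin column: a real-looking cabin, an imputed placeholder, and a raw NaN
toy = pd.DataFrame({"cabin": ["C85", "Missing", np.nan]})

# Same operation the transformer applies to each listed variable
letters = toy["cabin"].str[0]

print(letters.tolist())
```

A raw `NaN` stays `NaN`, while an imputed `'Missing'` becomes `'M'`, so the order of imputation and letter extraction determines which of the two the later encoders see.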

# 2.3.2 Exploratory Data Analysis [TODO]

In [10]:
# list of variables to be used in the pipeline's transformers
NUMERICAL_VARIABLES = ['age', 'fare']
CATEGORICAL_VARIABLES = ['sex', 'cabin', 'embarked', 'title']
CABIN = ['cabin']

## Pipeline

- Impute missing categorical values with a placeholder "missing" category
- Add a binary missing indicator for numerical variables with missing data
- Fill NA in the original numerical variables with the median
- Extract the first letter from cabin
- Group rare categories
- Perform one-hot encoding
- Scale features with a standard scaler

In [11]:
# set up the pipeline
titanic_pipe = Pipeline([
    # ===== IMPUTATION =====
    # impute missing categorical values with a placeholder "missing" category
    ('categorical_imputation', CategoricalImputer(
        imputation_method='missing', variables=CATEGORICAL_VARIABLES)),

    # add missing indicator to numerical variables
    ('missing_indicator', AddMissingIndicator(variables=NUMERICAL_VARIABLES)),

    # impute numerical variables with the median
    ('median_imputation', MeanMedianImputer(
        imputation_method='median', variables=NUMERICAL_VARIABLES)),

    # Extract letter from cabin
    ('extract_letter', ExtractLetterTransformer(variables=CABIN)),

    # == CATEGORICAL ENCODING ======
    # group categories present in less than 5% of the observations (0.05)
    # into a single category called 'Rare'
    ('rare_label_encoder', RareLabelEncoder(
        tol=0.05, n_categories=1, variables=CATEGORICAL_VARIABLES)),

    # encode categorical variables using one hot encoding into k-1 variables
    ('categorical_encoder', OneHotEncoder(
        drop_last=True, variables=CATEGORICAL_VARIABLES)),

    # scale
    ('scaler', StandardScaler())
])
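To make the pipeline steps more concrete, here is a rough plain-pandas sketch of three of them (missing indicator plus median imputation, rare-label grouping, and one-hot encoding) on a toy frame. This is an approximation for illustration, not how `feature_engine` implements them: the toy values, the `age_na` column name, and the 0.3 threshold are assumptions chosen so the effect is visible on four rows (the pipeline itself uses 0.05).

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only)
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "embarked": ["S", "S", "S", "Q"],
})

# 1. Flag missing ages, then fill them with the median of the observed ages
toy["age_na"] = toy["age"].isna().astype(int)
toy["age"] = toy["age"].fillna(toy["age"].median())

# 2. Group categories whose relative frequency is below the threshold
tol = 0.3  # toy threshold; the pipeline above uses tol=0.05
freq = toy["embarked"].value_counts(normalize=True)
rare_labels = freq[freq < tol].index
toy["embarked"] = toy["embarked"].where(~toy["embarked"].isin(rare_labels), "Rare")

# 3. One-hot encode the grouped column (all k columns kept here for readability)
encoded = pd.get_dummies(toy, columns=["embarked"], dtype=int)

print(encoded)
```

Unlike the pipeline's `drop_last=True`, this sketch keeps every dummy column, and it skips the final scaling step.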