# Titanic - Part 3: Predicting Survival with Statistical Modeling

## 1. Importing Dataset and Necessary Packages

In [1]:
import pandas as pd

from scipy.stats import randint, uniform
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

from xgboost import XGBClassifier

In [2]:
# Load the test and training sets produced by the EDA step
df_test = pd.read_csv('testdata_after_eda.csv')
df_train = pd.read_csv('traindata_after_eda.csv')

In [3]:
df_template = pd.read_csv('gender_submission.csv')

In [4]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   PassengerId     418 non-null    int64  
 1   Pclass          418 non-null    int64  
 2   Sex             418 non-null    object 
 3   Age             332 non-null    float64
 4   Embarked        418 non-null    object 
 5   Nationality     418 non-null    object 
 6   Missing_Age     418 non-null    int64  
 7   SharedTicket    418 non-null    int64  
 8   Solo            418 non-null    int64  
 9   IndividualFare  418 non-null    float64
 10  DeckKnown       418 non-null    int64  
dtypes: float64(2), int64(6), object(3)
memory usage: 36.1+ KB


In [5]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   PassengerId      891 non-null    int64  
 1   Survived         891 non-null    int64  
 2   Pclass           891 non-null    int64  
 3   Sex              891 non-null    object 
 4   Age              714 non-null    float64
 5   Embarked         891 non-null    object 
 6   Missing_Age      891 non-null    int64  
 7   SharedTicket     891 non-null    int64  
 8   TicketGroupSize  891 non-null    int64  
 9   IndividualFare   891 non-null    float64
 10  Solo             891 non-null    int64  
 11  DeckKnown        891 non-null    int64  
dtypes: float64(2), int64(8), object(2)
memory usage: 83.7+ KB
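
The two listings do not contain the same columns, which matters once the fitted model is applied to the test set. The following is a minimal sketch of how the difference could be checked. The column names are copied from the `info()` output above; the empty frames `train_like` and `test_like` are stand-ins introduced here for illustration, and in the notebook `df_train` and `df_test` would be used instead.

```python
import pandas as pd

# Column names copied from the two info() listings above.
train_cols = ['PassengerId', 'Survived', 'Pclass', 'Sex', 'Age', 'Embarked',
              'Missing_Age', 'SharedTicket', 'TicketGroupSize',
              'IndividualFare', 'Solo', 'DeckKnown']
test_cols = ['PassengerId', 'Pclass', 'Sex', 'Age', 'Embarked', 'Nationality',
             'Missing_Age', 'SharedTicket', 'Solo', 'IndividualFare',
             'DeckKnown']

# Empty stand-in frames (hypothetical); only the column labels matter here.
train_like = pd.DataFrame(columns=train_cols)
test_like = pd.DataFrame(columns=test_cols)

# The target is expected to be absent from the test set, so exclude it.
only_in_train = sorted(set(train_like.columns) - set(test_like.columns) - {'Survived'})
only_in_test = sorted(set(test_like.columns) - set(train_like.columns))

print('Only in train:', only_in_train)
print('Only in test: ', only_in_test)
```

Any column that appears on only one side cannot be used as a predictor unless it is added to, or removed from, the other set first.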


In [6]:
df_template.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  418 non-null    int64
 1   Survived     418 non-null    int64
dtypes: int64(2)
memory usage: 6.7 KB


In [7]:
df_template['Survived'] = 0

## 2. Building and Training the Model

For this project, we will use gradient-boosted trees (XGBoost). The model is trained on the training set and then applied to the test set.

In [8]:
X = df_train.drop(columns=['PassengerId', 'Survived'])
y = df_train['Survived']

We are dealing with missing values and categorical features. 

Quantitative features:
- `Age`: continuous;
- `IndividualFare`: continuous;
- `TicketGroupSize`: discrete.

Dummy features:
- `Survived` (the target, stored in `y` rather than `X`);
- `Missing_Age`;
- `SharedTicket`;
- `Solo`;
- `DeckKnown`.

Categorical features:
- `Pclass`;
- `Sex`;
- `Embarked`.

The dummy features can be used as they are. The categorical features have not been encoded yet and will be one-hot encoded. The only missing values are in the `Age` column (177 of the 891 training rows); they will be imputed with the mean.

We will create a preprocessing pipeline to handle the missing values and the categorical features, and reuse it when training the model. To avoid data leakage, the imputation must be fitted inside each cross-validation fold, using only that fold's training portion.

In [9]:
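categorical_cols = ['Pclass', 'Sex', 'Embarked']

# Note: `remainder` defaults to 'drop', so columns that are not listed
# below (e.g. IndividualFare, Solo) are not passed on to the classifier.
preprocessor = ColumnTransformer(
    transformers=[
        # Replace missing ages with the mean age of the fitting data
        ('num', SimpleImputer(strategy='mean'), ['Age']),
        # One-hot encode the categorical columns
        ('cat', OneHotEncoder(), categorical_cols)
    ]
)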
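
To make the effect of the `remainder` setting concrete, here is a small sketch on a toy frame. The frame `toy` and the helper `build` are hypothetical and introduced only for illustration; the transformers mirror the ones used above. It compares the default `remainder='drop'` with `remainder='passthrough'`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): one missing age, one categorical, one dummy column.
toy = pd.DataFrame({
    'Age': [20.0, np.nan, 40.0],
    'Sex': ['male', 'female', 'female'],
    'Solo': [1, 0, 0],
})

def build(remainder):
    """Build a transformer like the one above with the given remainder policy."""
    return ColumnTransformer(
        transformers=[
            ('num', SimpleImputer(strategy='mean'), ['Age']),
            ('cat', OneHotEncoder(), ['Sex']),
        ],
        remainder=remainder,
    )

dropped = build('drop').fit_transform(toy)
kept = build('passthrough').fit_transform(toy)

print(dropped.shape)  # Age + two one-hot columns; Solo is discarded
print(kept.shape)     # same columns plus Solo
print(dropped[1, 0])  # the missing age, replaced by the mean of 20 and 40
```

With `remainder='passthrough'`, every unlisted column must be numeric and present in both the training and the test frame, so the choice depends on how well the two sets line up.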

The preprocessor and the XGBoost classifier are combined into a single pipeline, which serves as our model.

In [10]:
model = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('classifier', XGBClassifier(eval_metric='logloss', random_state=0))
    ]
)

The boosting model has several hyperparameters. We will tune two of them: the maximum tree depth and the learning rate.

In [11]:
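param_distributions = {
    # randint(low, high) excludes `high`: integers from 3 to 9
    'classifier__max_depth': randint(3, 10),
    # uniform(loc, scale) covers [loc, loc + scale]: 0.01 to 0.31
    'classifier__learning_rate': uniform(0.01, 0.3)
}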
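
The argument conventions of `randint` and `uniform` are easy to misread, so the following sketch draws samples from both distributions and reports the observed ranges. The variable names are chosen here for illustration only.

```python
from scipy.stats import randint, uniform

depth_dist = randint(3, 10)     # integers; the upper bound 10 is excluded
rate_dist = uniform(0.01, 0.3)  # loc=0.01, scale=0.3

# Draw reproducible samples to inspect the range each distribution covers.
depths = depth_dist.rvs(size=1000, random_state=0)
rates = rate_dist.rvs(size=1000, random_state=0)

print('max_depth range:    ', depths.min(), 'to', depths.max())
print('learning_rate range:', round(rates.min(), 3), 'to', round(rates.max(), 3))
```

The second argument of `uniform` is a width rather than an upper bound, so a search intended to stop at 0.3 would need `uniform(0.01, 0.29)`.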

We will tune the hyperparameters with a randomized search under cross-validation. The search runs 20 iterations; each one samples a random combination of hyperparameter values from the distributions defined above and scores the resulting model by cross-validated accuracy. Because `cv` is not set, scikit-learn's default of 5-fold cross-validation applies.

In [12]:
random_search = RandomizedSearchCV(
    model,
    param_distributions=param_distributions,
    n_iter=20,
    scoring='accuracy',
    n_jobs=-1,
    random_state=0
)

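We can now run the randomized search. With the default `refit=True`, the best configuration is refit on the full training set, so `random_search.predict` uses that final model.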

In [13]:
random_search.fit(X, y)
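
After fitting, the search object exposes the winning configuration and its cross-validated score. The sketch below shows how these could be inspected. It is an assumption-laden stand-in, not the notebook's model: it uses synthetic data from `make_classification` and scikit-learn's `GradientBoostingClassifier` in place of `XGBClassifier`, with fewer iterations so that it runs quickly.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (hypothetical), not the Titanic features.
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)

search = RandomizedSearchCV(
    GradientBoostingClassifier(n_estimators=20, random_state=0),
    param_distributions={
        'max_depth': randint(3, 10),
        'learning_rate': uniform(0.01, 0.3),
    },
    n_iter=5,            # fewer than the notebook's 20, to keep the sketch fast
    scoring='accuracy',
    random_state=0,
)
search.fit(X_toy, y_toy)

print(search.best_params_)                    # sampled values of the best candidate
print(search.best_score_)                     # its mean cross-validated accuracy
print(search.cv_results_['mean_test_score'])  # one score per sampled candidate
```

In the notebook, the same attributes are available on `random_search`, where the parameter names carry the `classifier__` prefix.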

In [14]:
y_pred = random_search.predict(df_test.drop(columns=['PassengerId']))

In [15]:
df_template['Survived'] = y_pred

In [16]:
df_template.to_csv("titanic_submission.csv", index=False)
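
Before uploading, it can be worth checking that the submission has the expected shape. The sketch below is one possible check; the helper `check_submission` and the two toy frames are hypothetical and not part of the original notebook.

```python
import pandas as pd

def check_submission(submission, expected_ids):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if list(submission.columns) != ['PassengerId', 'Survived']:
        problems.append('unexpected columns')
        return problems
    if list(submission['PassengerId']) != list(expected_ids):
        problems.append('PassengerId mismatch')
    if not submission['Survived'].isin([0, 1]).all():
        problems.append('Survived must be 0 or 1')
    return problems

# Toy frames (hypothetical) to show both outcomes.
good = pd.DataFrame({'PassengerId': [892, 893], 'Survived': [0, 1]})
bad = pd.DataFrame({'PassengerId': [892, 893], 'Survived': [0, 2]})

print(check_submission(good, [892, 893]))
print(check_submission(bad, [892, 893]))
```

In the notebook, the call would be `check_submission(df_template, df_test['PassengerId'])`.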