## Titanic

[Titanic on Kaggle](https://www.kaggle.com/c/titanic/overview)

Predict survival on the Titanic and get familiar with ML basics

#### Load dependencies

In [None]:
# This will help us to measure the time it took for the whole
# notebook to execute
import time
start_time = time.time()

import os
import importlib
import sys
sys.path.append('../../utils')
import datasets
importlib.reload(datasets)
import helpers
importlib.reload(helpers)

# Render plots inline in Jupyter notebooks
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

import pandas as pd

import sklearn
from sklearn.pipeline import Pipeline               # Allows you to create a sequence of transformations and a final estimator in a single, cohesive workflow
from sklearn.impute import SimpleImputer            # Used to fill in missing values in your dataset.
from sklearn.preprocessing import StandardScaler    # Standardizes features by removing the mean and scaling them to unit variance
from sklearn.preprocessing import OneHotEncoder     # Used to encode categorical variables as one-hot (or dummy) variables.
from sklearn.compose import ColumnTransformer       # Allows you to apply different transformations to different columns within a DataFrame
from sklearn.ensemble import RandomForestClassifier # Builds an ensemble of decision trees, averaging their predictions to reduce overfitting.
from sklearn.model_selection import cross_val_score # Provides robust model evaluation by using multiple train-test splits, leading to a more accurate estimate of model performance.
from sklearn.svm import SVC                         # A powerful classification algorithm that uses hyperplanes to separate classes, particularly effective in high-dimensional spaces or with complex class boundaries.

#### Get Titanic dataset

In [None]:
TITANIC_PATH = os.path.join("../../datasets", "titanic")
DOWNLOAD_URL = "https://raw.githubusercontent.com/ageron/handson-ml2/master/datasets/titanic/"
DATASET_FILES = ["train.csv", "test.csv"]

datasets.download(DOWNLOAD_URL, TITANIC_PATH, DATASET_FILES)
train_data = datasets.load_csv(TITANIC_PATH, DATASET_FILES[0])
test_data = datasets.load_csv(TITANIC_PATH, DATASET_FILES[1])
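
The `datasets` module is a local utility that is not shown in this notebook. As a rough idea of what `datasets.download` might be doing, here is a minimal sketch using only the standard library. The signature mirrors the call above, but the body (including the skip-if-present behaviour) is an assumption, not the actual helper.

```python
import os
import urllib.request


def download(base_url: str, target_dir: str, filenames: list[str]) -> list[str]:
    """Hypothetical stand-in for datasets.download.

    Fetches each file from base_url into target_dir and skips files
    that already exist locally. Returns the local file paths.
    """
    os.makedirs(target_dir, exist_ok=True)
    local_paths = []
    for filename in filenames:
        local_path = os.path.join(target_dir, filename)
        if not os.path.isfile(local_path):
            # base_url is expected to end with "/", as DOWNLOAD_URL does
            urllib.request.urlretrieve(base_url + filename, local_path)
        local_paths.append(local_path)
    return local_paths
```

Under the same assumption, `datasets.load_csv(path, filename)` is probably a thin wrapper around `pd.read_csv(os.path.join(path, filename))`.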

##### Dataset preview

The attributes have the following meaning:

| Attribute | Meaning | Key |
|:---|:---|:---|
| PassengerId | Unique identifier | |
| Survived | Indicates if passenger survived | 0 = No, 1 = Yes |
| Pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| Name | Name of the passenger | |
| Sex | Sex of the passenger | |
| Age | Age in years | |
| SibSp | # of siblings / spouses aboard the Titanic | |
| Parch | # of parents / children aboard the Titanic | |
| Ticket | Ticket number | |
| Fare | Passenger fare | |
| Cabin | Cabin number | |
| Embarked | Port of embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

In [None]:
train_data.head()

In [None]:
test_data.head()

In [None]:
# Set "PassengerId" as the index column
train_data = train_data.set_index("PassengerId")
test_data = test_data.set_index("PassengerId")

In [None]:
# Review data and see if anything is missing
train_data.info()
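
`info()` reports non-null counts, so missing values have to be inferred by comparing against the row count. Counting the nulls directly is sometimes easier to read. The sketch below uses a small made-up frame in place of `train_data`; the column names match the dataset but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_data (values are made up)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 8.05, 8.46],
})

missing = toy.isna().sum()          # number of missing entries per column
missing_share = toy.isna().mean()   # same, as a fraction of all rows
print(pd.DataFrame({"missing": missing, "share": missing_share}))
```

In the notebook itself, the same two calls can be applied to `train_data` and `test_data`.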

In [None]:
# Calculate the median age of female passengers
train_data[train_data["Sex"]=="female"]["Age"].median()

In [None]:
# Numerical attributes
train_data.describe()

##### Validate more data

Check how the target (`Survived`) and the categorical attributes are distributed.

In [None]:
train_data["Survived"].value_counts()

In [None]:
train_data["Pclass"].value_counts()

In [None]:
train_data["Sex"].value_counts()

In [None]:
train_data["Embarked"].value_counts()

#### Preprocessing pipeline

Build a preprocessing pipeline that takes the raw data and outputs numerical input features that can be fed to any Machine Learning model.

In [None]:
# Pipeline for numerical attributes
num_pipeline = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ])

# Pipeline for categorical attributes
cat_pipeline = Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("cat_encoder", OneHotEncoder(sparse_output=False)),
    ])

# Join pipelines
num_attribs = ["Age", "SibSp", "Parch", "Fare"]
cat_attribs = ["Pclass", "Sex", "Embarked"]

preprocess_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", cat_pipeline, cat_attribs),
    ])

# Extract the labels
y_train = train_data["Survived"]

# Fit the pipeline on the training data and preview the transformed features
X_train = preprocess_pipeline.fit_transform(
    train_data[num_attribs + cat_attribs])
X_train

#### Train a classifier

In [None]:
# Train the model
forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
forest_clf.fit(X_train, y_train)

In [None]:
# Make predictions on the test set
X_test = preprocess_pipeline.transform(test_data[num_attribs + cat_attribs])
y_pred = forest_clf.predict(X_test)

In [None]:
# Cross-validate the model
forest_scores = cross_val_score(forest_clf, X_train, y_train, cv=10)
forest_scores.mean()
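
To make clearer what `cross_val_score(..., cv=10)` does, the sketch below spells out the loop by hand. For a classifier and an integer `cv`, scikit-learn uses stratified folds, so the manual version uses `StratifiedKFold`. The data comes from `make_classification` and is purely synthetic; `X_demo`, `y_demo` and `clf` are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data, not the Titanic set
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
clf = RandomForestClassifier(n_estimators=20, random_state=42)

manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=10).split(X_demo, y_demo):
    fold_clf = clone(clf)                                  # fresh, unfitted copy per fold
    fold_clf.fit(X_demo[train_idx], y_demo[train_idx])     # train on 9 folds
    manual_scores.append(fold_clf.score(X_demo[val_idx], y_demo[val_idx]))  # accuracy on the held-out fold
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(clf, X_demo, y_demo, cv=10)
print(np.allclose(manual_scores, library_scores))
```

Each of the 10 scores is an accuracy measured on data the model did not see during that fold's training, which is why their mean is a more trustworthy estimate than the training accuracy.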

In [None]:
# Try with an SVC
svm_clf = SVC(gamma="auto")
svm_scores = cross_val_score(svm_clf, X_train, y_train, cv=10)
svm_scores.mean()

#### Plot models

The random forest classifier got a very high score on one of the 10 folds, but overall it had a lower mean score, as well as a bigger spread, so it looks like the SVM classifier is more likely to generalize well.
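
The comparison of mean and spread can also be put into numbers. A minimal sketch, using only the standard library; `summarize_scores` is a hypothetical helper and the scores in the example call are placeholders, not results from this notebook.

```python
import statistics


def summarize_scores(scores):
    """Return (mean, sample standard deviation) of a sequence of fold scores."""
    values = [float(s) for s in scores]   # also accepts NumPy arrays
    return statistics.mean(values), statistics.stdev(values)


# Placeholder scores, only to show the call
mean, std = summarize_scores([0.8, 0.9])
print(f"mean={mean:.3f}, std={std:.3f}")
```

In the notebook, `summarize_scores(svm_scores)` and `summarize_scores(forest_scores)` would give the two pairs to compare.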

In [None]:
plt.figure(figsize=(8, 4))
plt.plot([1]*10, svm_scores, ".")
plt.plot([2]*10, forest_scores, ".")
plt.boxplot([svm_scores, forest_scores], tick_labels=("SVM","Random Forest"))
plt.ylabel("Accuracy", fontsize=14)
plt.show()

---

## Total Time

This shows the total execution time of the notebook.

In [None]:
# Report the total execution time
end_time = time.time()
helpers.calculate_execution_time(start_time, end_time)
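
The `helpers` module is also a local utility that is not shown here. A helper with this signature could look roughly like the sketch below; the output format and the return value are assumptions, not the actual implementation.

```python
def calculate_execution_time(start_time: float, end_time: float) -> str:
    """Hypothetical stand-in: format the elapsed time between two time.time() values."""
    elapsed = end_time - start_time
    minutes, seconds = divmod(elapsed, 60)   # split off whole minutes
    hours, minutes = divmod(minutes, 60)     # split off whole hours
    message = f"Total execution time: {int(hours)}h {int(minutes)}m {seconds:.2f}s"
    print(message)
    return message
```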