<div style="background-color:rgba(255, 215, 0, 0.6);border-radius:5px;display:fill">
    <h1 style="text-align: center;padding: 12px 0px 12px 0px;">🚢 Titanic: XGBoost + Optuna</h1>
</div>
<img src="https://www.seekpng.com/png/full/200-2002376_predicting-titanic-survivors-old-photos-of-the-titanic.png" alt="Titanic" width="300"/>



# Understanding the Titanic Data

## Target - What we want to predict

For the Titanic dataset the target is `Survived`.

## Features

- `Pclass` - Ticket class (1st,2nd,3rd)
- `Name` - Full name
- `Sex` - Gender
- `Age` - Passenger's age
- `SibSp` - # of siblings / spouses aboard the Titanic
- `Parch` - # of parents / children aboard the Titanic
- `Ticket` - Ticket number
- `Fare` - What the passenger paid for a ticket
- `Cabin` - Cabin number
- `Embarked` - C = Cherbourg, Q = Queenstown, S = Southampton

## Evaluation Metric

$\text{Accuracy} = \frac{\text{True Positives (TP)} + \text{True Negatives (TN)}}{\text{TP} + \text{TN} + \text{False Positives (FP)} + \text{False Negatives (FN)}}$

- https://en.wikipedia.org/wiki/Accuracy_and_precision#In_binary_classification

Accuracy is the percentage of passengers whose outcome you predict correctly, counting both those who survived and those who did not.

- https://developers.google.com/machine-learning/crash-course/classification/accuracy
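
As a small illustration of the accuracy formula above, the sketch below counts TP, TN, FP and FN on a handful of made-up labels and compares the manual result with `accuracy_score`. The arrays `y_true` and `y_pred` are hypothetical and are not taken from the Titanic data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for eight passengers (1 = survived, 0 = did not survive)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# For binary 0/1 labels, confusion_matrix(...).ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
manual_accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
print(f"Manual accuracy: {manual_accuracy:.4f}")
print(f"accuracy_score : {accuracy_score(y_true, y_pred):.4f}")
```

Here six of the eight predictions match, so both numbers come out to 0.75.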



<div style="background-color:rgba(255, 215, 0, 0.6);border-radius:5px;display:fill"><h1 style="text-align: center;padding: 12px 0px 12px 0px;">Import Libraries</h1>
</div>

In [1]:
import os
import time
from pathlib import Path

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix


# Visualization Libraries
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

from itertools import cycle

plt.style.use("ggplot")  # ggplot, fivethirtyeight
color_pal = plt.rcParams["axes.prop_cycle"].by_key()["color"]
color_cycle = cycle(plt.rcParams["axes.prop_cycle"].by_key()["color"])

<div style="background-color:rgba(255, 215, 0, 0.6);border-radius:5px;display:fill"><h1 style="text-align: center;padding: 12px 0px 12px 0px;">Configuration</h1>
</div>

In [2]:
# Change for every project
data_dir = "../input/titanic"

### The target/dependent variable in the dataset

In [3]:
# Did the passenger survive?
# 0 = No, 1 = Yes
TARGET = "Survived"

<div style="background-color:rgba(255, 215, 0, 0.6);border-radius:5px;display:fill"><h1 style="text-align: center;padding: 12px 0px 12px 0px;">Library</h1>
</div>

Creating a few functions that we will reuse in each project.

In [4]:
def read_data(path):
    data_dir = Path(path)

    train = pd.read_csv(data_dir / "train.csv")
    test = pd.read_csv(data_dir / "test.csv")
    submission_df = pd.read_csv(data_dir / "gender_submission.csv")

    print(f"train data: Rows={train.shape[0]}, Columns={train.shape[1]}")
    print(f"test data : Rows={test.shape[0]}, Columns={test.shape[1]}")
    return train, test, submission_df

In [5]:
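def create_submission(model_name, target, preds):
    # Uses the global sample_submission frame loaded by read_data()
    sample_submission[target] = preds

    if len(model_name) > 0:
        # f-string so the model name is actually substituted into the file name
        fname = f"submission_{model_name}.csv"
    else:
        fname = "submission.csv"

    sample_submission.to_csv(fname, index=False)

    return sample_submission[:5]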

In [6]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


def show_scores(gt, yhat):
    accuracy = accuracy_score(gt, yhat)
    precision = precision_score(gt, yhat)
    recall = recall_score(gt, yhat)
    f1 = f1_score(gt, yhat)

    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"f1: {f1:.4f}")

In [7]:
from sklearn.preprocessing import LabelEncoder


def label_encoder(train, test, columns):
    for col in columns:
        train[col] = train[col].astype(str)
        test[col] = test[col].astype(str)
        # Fit on train only so train and test share one label-to-integer mapping
        encoder = LabelEncoder().fit(train[col])
        train[col] = encoder.transform(train[col])
        test[col] = encoder.transform(test[col])  # raises on labels unseen in train
    return train, test

In [8]:
from sklearn.preprocessing import OneHotEncoder


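def one_hot_encoder(train, test, columns):
    # Fit on train only so train and test end up with the same dummy columns
    encoder = OneHotEncoder(handle_unknown="ignore")
    encoder.fit(train[columns].astype(str))
    names = encoder.get_feature_names_out(columns)

    def encode(df):
        dummies = encoder.transform(df[columns].astype(str)).toarray()
        dummies = pd.DataFrame(dummies, columns=names, index=df.index)
        return pd.concat([df.drop(columns=columns), dummies], axis=1)

    return encode(train), encode(test)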

In [9]:
def show_missing_features(df):
    missing_vals = df.isna().sum()
    print(missing_vals[missing_vals > 0])

<div style="background-color:rgba(255, 215, 0, 0.6);border-radius:5px;display:fill"><h1 style="text-align: center;padding: 12px 0px 12px 0px;">Load Train/Test Data</h1>
</div>

## Load the following files

 - train.csv - Data used to build our machine learning model
 - test.csv - Data we make predictions on. Does not contain the `Survived` target variable
 - gender_submission.csv - A file in the proper format to submit test predictions

In [10]:
train, test, sample_submission = read_data(data_dir)

train data: Rows=891, Columns=12
test data : Rows=418, Columns=11


In [11]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [12]:
train.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [13]:
# Drop identifier and free-text columns that are not used as features
drop_cols = ["PassengerId", "Name", "Ticket", "Cabin"]
train = train.drop(columns=drop_cols).copy()
test = test.drop(columns=drop_cols).copy()

## Categorical/Numerical Variables

In [14]:
# Separate categorical and numerical features
cat_features = list(train.select_dtypes(include=["category", "object"]).columns)
# Taken from test so that the numeric target (Survived) is not picked up as a feature
num_features = list(test.select_dtypes(include=["number"]).columns)

FEATURES = cat_features + num_features

In [15]:
from sklearn.impute import SimpleImputer

# Categorical
imputer = SimpleImputer(strategy="most_frequent")

train[cat_features] = imputer.fit_transform(train[cat_features])
test[cat_features] = imputer.transform(test[cat_features])

# Numerical

# imputer = SimpleImputer(strategy="mean")
imputer = SimpleImputer(strategy="median")  # median is more robust to outliers

train[num_features] = imputer.fit_transform(train[num_features])
test[num_features] = imputer.transform(test[num_features])

<div style="background-color:rgba(255, 215, 0, 0.6);border-radius:5px;display:fill"><h1 style="text-align: center;padding: 12px 0px 12px 0px;">Feature Engineering</h1>
</div>

- [Titanic - Advanced Feature Engineering Tutorial](https://www.kaggle.com/gunesevitan/titanic-advanced-feature-engineering-tutorial)
- [Titanic Survival Predictions (Beginner)](https://www.kaggle.com/nadintamer/titanic-survival-predictions-beginner)
- [Exploring Survival on the Titanic](https://www.kaggle.com/mrisdal/exploring-survival-on-the-titanic)

In [16]:
# FEATURES = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Embarked"]

In [17]:
train[FEATURES].head()

Unnamed: 0,Sex,Embarked,Pclass,Age,SibSp,Parch,Fare
0,male,S,3.0,22.0,1.0,0.0,7.25
1,female,C,1.0,38.0,1.0,0.0,71.2833
2,female,S,3.0,26.0,0.0,0.0,7.925
3,female,S,1.0,35.0,1.0,0.0,53.1
4,male,S,3.0,35.0,0.0,0.0,8.05


# Extract Target and Drop Unused Columns

In [18]:
y = train[TARGET]



# X = train_df.drop(columns=["PassengerId", "Survived"], axis=1).copy()
# X = train[FEATURES].copy()

<div style="background-color:rgba(255, 215, 0, 0.6);border-radius:5px;display:fill"><h1 style="text-align: center;padding: 12px 0px 12px 0px;">Missing Values</h1>
</div>

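We have 891 rows of training data. Of the features we are using, `Age` is the main one with missing data; `Embarked` is also missing a few values. Because the imputation cell above has already filled these in, the checks below report no missing values.

Note that handling missing data is an entire subject that should be studied in detail. Kaggle offers a [course](https://www.kaggle.com/learn/data-cleaning) on it.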

- [sklearn.impute.SimpleImputer](https://scikit-learn.org/stable/modules/impute.html)

- [A Guide to Handling Missing values in Python](https://www.kaggle.com/parulpandey/a-guide-to-handling-missing-values-in-python)
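
A minimal sketch of how `SimpleImputer` fills gaps, assuming made-up ages in two toy frames (`toy_train` and `toy_test` are hypothetical names, not the notebook's data). The fill value is learned from the training frame only and then reused for the test frame.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames standing in for train/test (values are made up)
toy_train = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 35.0, np.nan]})
toy_test = pd.DataFrame({"Age": [np.nan, 40.0]})

imputer = SimpleImputer(strategy="median")

# Learn the median from the training data only ...
toy_train["Age"] = imputer.fit_transform(toy_train[["Age"]]).ravel()
# ... and reuse that same value for the test data
toy_test["Age"] = imputer.transform(toy_test[["Age"]]).ravel()

print("Learned fill value:", imputer.statistics_[0])
print("train:", toy_train["Age"].tolist())
print("test :", toy_test["Age"].tolist())
```

The median of the observed training ages (22, 26, 35) is 26, so every gap in both frames is filled with 26.0.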

In [19]:
# missing_vals = train.isna().sum()
# print(missing_vals[missing_vals > 0])
show_missing_features(train)

Series([], dtype: int64)


In [20]:
n = train["Age"].isna().sum()
print(f"Number missing: {n}")

Number missing: 0


### Manual Imputation of Age

In [21]:
# train_df["Age"].fillna(train_df["Age"].mean(), inplace = True)
m = train["Age"].mean()
print(f"Mean age of person on the Titanic: {m:0.2f}")

Mean age of person on the Titanic: 29.36


In [22]:
# Assign the result back instead of using inplace=True on a column selection
train["Age"] = train["Age"].fillna(train["Age"].median(skipna=True))
train["Embarked"] = train["Embarked"].fillna(train["Embarked"].value_counts().idxmax())

### Use SimpleImputer Function for Age

The SimpleImputer code is left in for reference. It should change nothing because the missing values were already filled in above.

In [23]:
impute_mean = SimpleImputer(missing_values=np.nan, strategy="mean")

# Fit on train only, then apply the same fill value to test
train["Age"] = impute_mean.fit_transform(train[["Age"]]).ravel()
test["Age"] = impute_mean.transform(test[["Age"]]).ravel()

### At this point we no longer have missing values

In [24]:
show_missing_features(train)

Series([], dtype: int64)


# Encoding Categorical Features

Categorical features must be converted into numerical features before the model can use them.

Two common ways:
- One-hot Encode
- Label Encode

### Encode `Embarked`

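The **drop_first=True** option of `pd.get_dummies` drops one dummy column per feature. Some machine learning models need this option while others do not. For linear models such as logistic regression, the full set of dummy columns is perfectly collinear with the intercept, so one level is usually dropped. The cell below uses label encoding and keeps the one-hot call as a commented-out alternative.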
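
To make the difference between the encodings concrete, here is a small sketch on a made-up `Embarked` column (the `toy` frame is hypothetical). It shows label encoding next to `pd.get_dummies` with and without `drop_first=True`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column with the three embarkation ports (values are made up)
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Label encoding: one integer per category, assigned in sorted order (C, Q, S)
label_encoded = LabelEncoder().fit_transform(toy["Embarked"])

# One-hot encoding: one column per category ...
full = pd.get_dummies(toy, columns=["Embarked"])
# ... or one fewer when the first level is dropped
reduced = pd.get_dummies(toy, columns=["Embarked"], drop_first=True)

print("Label encoded :", list(label_encoded))
print("All dummies   :", list(full.columns))
print("drop_first    :", list(reduced.columns))
```

With `drop_first=True` the `Embarked_C` column disappears; a row with zeros in both remaining columns then stands for Cherbourg.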

In [25]:
train, test = label_encoder(train, test, ["Embarked", "Sex"])
# X_test = pd.get_dummies(test[FEATURES], drop_first=True)

train.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3.0,1,22.0,1.0,0.0,7.25,2
1,1,1.0,0,38.0,1.0,0.0,71.2833,0
2,1,3.0,0,26.0,0.0,0.0,7.925,2
3,1,1.0,0,35.0,1.0,0.0,53.1,2
4,0,3.0,1,35.0,0.0,0.0,8.05,2


- [sklearn.linear_model.LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

In [26]:
y = train[TARGET]
X = train[FEATURES].copy()

X_test = test[FEATURES].copy()

## Train Model

<div style="background-color:rgba(255, 215, 0, 0.6);border-radius:5px;display:fill"><h1 style="text-align: center;padding: 12px 0px 12px 0px;">Train Model with Cross Validation</h1>
</div>

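Four out of five folds will be used for training. The fifth will be used for validation.

Each fold will have a turn at being the validation fold.

After each pass through the loop we store the out-of-fold (OOF) predictions for the validation fold, record the fold's accuracy, and predict the test data.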
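
The fold rotation described above can be sketched on toy data. The arrays `X_toy` and `y_toy` are made up for illustration; the splitter settings mirror the ones used in this notebook. The counter checks that every row serves as validation data exactly once.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data: 20 rows, 8 positives and 12 negatives (made-up values)
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([1] * 8 + [0] * 12)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
times_validated = np.zeros(len(y_toy), dtype=int)

for fold, (tr_idx, va_idx) in enumerate(skf.split(X_toy, y_toy), start=1):
    # Count how often each row lands in a validation fold
    times_validated[va_idx] += 1
    print(
        f"Fold {fold}: train rows={len(tr_idx)}, valid rows={len(va_idx)}, "
        f"positives in valid={y_toy[va_idx].sum()}"
    )

print("Times each row was used for validation:", times_validated)
```

Stratification keeps the share of positives in each validation fold close to the share in the full data.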


In [27]:
NFOLDS = 5

final_test_predictions = []
final_valid_predictions = {}
scores = []


kf = StratifiedKFold(n_splits=NFOLDS, random_state=42, shuffle=True)

for fold, (train_idx, valid_idx) in enumerate(kf.split(X=X, y=y)):
    print(10 * "=", f"Fold={fold+1}", 10 * "=")
    start_time = time.time()

    x_train = X.loc[train_idx, :]
    x_valid = X.loc[valid_idx, :]  # Validation Features

    y_train = y[train_idx]
    y_valid = y[valid_idx]  # Validation Target

    model = LogisticRegression(C=0.12, solver="liblinear")
    model.fit(x_train, y_train)
    #     preds_valid = model.predict_proba(x_valid)[:,1]
    preds_valid = model.predict(x_valid)

    # Predictions for OOF
    print("--- Predicting OOF ---")
    final_valid_predictions.update(dict(zip(valid_idx, preds_valid)))

    accuracy = accuracy_score(y_valid, preds_valid)
    scores.append(accuracy)

    run_time = time.time() - start_time

    # Predictions for Test Data
    print("--- Predicting Test Data ---")
    test_preds = model.predict_proba(X_test)[:, -1]

    final_test_predictions.append(test_preds)
    print(f"Fold={fold+1}, Accuracy: {accuracy:.8f}, Run Time: {run_time:.2f}\n")

--- Predicting OOF ---
--- Predicting Test Data ---
Fold=1, Accuracy: 0.79888268, Run Time: 0.01

--- Predicting OOF ---
--- Predicting Test Data ---
Fold=2, Accuracy: 0.76966292, Run Time: 0.01

--- Predicting OOF ---
--- Predicting Test Data ---
Fold=3, Accuracy: 0.79775281, Run Time: 0.01

--- Predicting OOF ---
--- Predicting Test Data ---
Fold=4, Accuracy: 0.77528090, Run Time: 0.01

--- Predicting OOF ---
--- Predicting Test Data ---
Fold=5, Accuracy: 0.83146067, Run Time: 0.01



<div style="background-color:rgba(255, 215, 0, 0.6);border-radius:5px;display:fill"><h1 style="text-align: center;padding: 12px 0px 12px 0px;">Scores</h1>
</div>

CV, or Cross Validation, Score.

We compute the mean and the standard deviation of the fold accuracies.

The Adjusted Score is the mean minus the standard deviation. It gives a single number for comparing different models and penalizes models whose accuracy varies a lot from fold to fold.
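
As a worked example, the sketch below recomputes the three numbers from the five fold accuracies printed by the cross-validation loop. The list `fold_scores` is copied by hand from that output, so the results should agree with the notebook's own values up to rounding.

```python
import numpy as np

# Fold accuracies copied from the cross-validation output of this notebook
fold_scores = [0.79888268, 0.76966292, 0.79775281, 0.77528090, 0.83146067]

mean_score = np.mean(fold_scores)
std_score = np.std(fold_scores)  # np.std defaults to the population form (ddof=0)
adjusted_score = mean_score - std_score

print(f"mean={mean_score:.8f}, std={std_score:.8f}, adjusted={adjusted_score:.8f}")
```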

In [28]:
print(
    f"Scores -> Adjusted: {np.mean(scores) - np.std(scores):.8f} , mean: {np.mean(scores):.8f}, std: {np.std(scores):.8f}"
)

Scores -> Adjusted: 0.77278106 , mean: 0.79460800, std: 0.02182694


<div style="background-color:rgba(255, 215, 0, 0.6);border-radius:5px;display:fill"><h1 style="text-align: center;padding: 12px 0px 12px 0px;">Save OOF Predictions</h1>
</div>

This is unused for this example but needed later for [Blending](https://towardsdatascience.com/ensemble-learning-stacking-blending-voting-b37737c4f483).

**General idea**: The values will be used to create new features in a blended model.

- [Stacking and Blending — An Intuitive Explanation](https://medium.com/@stevenyu530_73989/stacking-and-blending-intuitive-explanation-of-advanced-ensemble-methods-46b295da413c)
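
A rough sketch of how saved OOF predictions might later become features, assuming two hypothetical base models whose predictions are keyed by the training row `id`. The frames `oof_1`, `oof_2` and `target`, and the choice of a logistic regression as the blender, are illustrative assumptions and not part of this notebook.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical OOF predictions from two base models, keyed by training row id
ids = [0, 1, 2, 3, 4, 5]
oof_1 = pd.DataFrame({"id": ids, "pred_1": [0, 1, 1, 0, 1, 0]})
oof_2 = pd.DataFrame({"id": ids, "pred_2": [0, 1, 0, 0, 1, 1]})
target = pd.DataFrame({"id": ids, "Survived": [0, 1, 1, 0, 1, 0]})

# Join on id so each row holds every base model's prediction for that passenger
meta = target.merge(oof_1, on="id").merge(oof_2, on="id")

# The OOF columns are the features of the second-level (blending) model
meta_features = ["pred_1", "pred_2"]
blender = LogisticRegression()
blender.fit(meta[meta_features], meta["Survived"])

print(meta)
print("Blended predictions:", blender.predict(meta[meta_features]))
```

Because each OOF prediction comes from a model that never saw that row during training, the blender learns from honest estimates rather than from memorized labels.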

In [29]:
final_valid_predictions = pd.DataFrame.from_dict(
    final_valid_predictions, orient="index"
).reset_index()
final_valid_predictions.columns = ["id", "pred_1"]
final_valid_predictions.to_csv("train_pred_1.csv", index=False)

<div style="background-color:rgba(255, 215, 0, 0.6);border-radius:5px;display:fill"><h1 style="text-align: center;padding: 12px 0px 12px 0px;">Submission File</h1>
</div>

The sample file and our data are in the same row order.  This allows us to simply assign our predictions to the target column (`Survived`) in the sample submission.
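
The cross-validation loop stores `predict_proba` outputs for the test data, so the averaged column holds probabilities rather than 0/1 labels. If hard labels are needed, for example because the submission is scored on accuracy, one option is to threshold the averaged probabilities. The sketch below uses made-up per-fold probabilities (`fold_probs`), and the 0.5 cut-off is an assumption.

```python
import numpy as np

# Hypothetical per-fold survival probabilities for four test passengers
fold_probs = [
    np.array([0.10, 0.40, 0.70, 0.90]),
    np.array([0.20, 0.60, 0.60, 0.80]),
    np.array([0.30, 0.35, 0.80, 0.85]),
]

# Average across folds: one column per fold, mean taken along each row
mean_probs = np.mean(np.column_stack(fold_probs), axis=1)

# Convert probabilities to class labels (0.5 is an assumed cut-off)
labels = (mean_probs >= 0.5).astype(int)

print("Mean probabilities:", mean_probs)
print("Labels            :", labels)
```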

In [30]:
m = np.mean(np.column_stack(final_test_predictions), axis=1)

In [31]:
sample_submission["Survived"] = m
sample_submission.to_csv("submission_cv.csv", index=False)
sample_submission

Unnamed: 0,PassengerId,Survived
0,892,0.167286
1,893,0.452896
2,894,0.176650
3,895,0.175058
4,896,0.513189
...,...,...
413,1305,0.172808
414,1306,0.868598
415,1307,0.159228
416,1308,0.172808
