In [None]:
<a target="_blank" href="https://colab.research.google.com/github/lm2612/Tutorials/blob/main/2_supervised_learning_classification/2-Classification_Titanic.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Titanic: Machine learning from disaster

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this tutorial, we will use passenger data to predict who survived the shipwreck, and use our predictive model to answer the question: "What sorts of people were more likely to survive?" We will focus on passenger age, gender and socio-economic class. You can read more about the Titanic dataset [here](https://www.kaggle.com/c/titanic/overview).

First, import packages and load the data.

In [None]:
import sys 
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(0)

In [None]:
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
    filepath = "https://raw.githubusercontent.com/lm2612/Tutorials/refs/heads/main/2_supervised_learning_classification/titanic.csv"
    print(f"Notebook running in google colab. Using raw github filepath = {filepath}")

else:
    filepath = "./titanic.csv"
    print(f"Notebook running locally. Using local filepath = {filepath}")


In [None]:
df = pd.read_csv(filepath)
df.head()

We are interested in the "Survived" column, where there are two possible outcomes: survived (1) or did not survive (0). We want to build a classifier to predict this outcome. In this tutorial, we are going to compare different classification methods and use them to determine which factors influenced whether a passenger survived.
Specifically, we are going to investigate how passenger class, age and sex influenced survival.

For passenger class, we are going to use dummy variables to represent the three possible states: binary variables which take on the value 0 if not true and 1 if true.

Create dummy variables for classes 1 and 2. This implicitly means that the 3rd class will be the base case that we compare to.
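One possible way to build these two dummies is a direct comparison on the class column. The sketch below uses a small made-up frame in place of `df`; the new column names `FirstClass` and `SecondClass` are illustrative choices, not names from the dataset.

```python
import pandas as pd

# Made-up stand-in for the Titanic frame; only the "Pclass" column is needed here.
toy = pd.DataFrame({"Pclass": [1, 3, 2, 3, 1]})

# One 0/1 column per class we want to compare against the baseline.
# 3rd class gets no column, so it is the case where both dummies are 0.
toy["FirstClass"] = (toy["Pclass"] == 1).astype(int)   # hypothetical column name
toy["SecondClass"] = (toy["Pclass"] == 2).astype(int)  # hypothetical column name

print(toy)
```

`pd.get_dummies` would produce similar columns automatically, but the explicit comparison makes the choice of baseline visible.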

Create a dummy variable equal to 1 if the passenger was female.
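A minimal sketch of the female dummy, assuming the `Sex` column holds the strings `"male"` and `"female"` (check this with `df["Sex"].unique()` before relying on it). The frame and the column name `Female` are made up for illustration.

```python
import pandas as pd

# Made-up stand-in for the Titanic frame.
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# 1 if the passenger is recorded as female, 0 otherwise.
toy["Female"] = (toy["Sex"] == "female").astype(int)  # hypothetical column name

print(toy)
```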


For age, we are going to split the data into three segments: (i) those aged 16 or under; (ii) those over 16 and up to 60; and (iii) those over 60. Create dummy variables for categories (i) and (iii), so that category (ii) is the base case.
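The two age dummies can be sketched with plain comparisons. The frame below is made up, and the column names `Child` and `Senior` are illustrative. It includes a missing age to show one behaviour worth knowing: comparisons against `NaN` evaluate to `False`, so passengers with no recorded age would fall into the base case unless they are handled separately.

```python
import numpy as np
import pandas as pd

# Made-up ages, including the boundary values and one missing entry.
toy = pd.DataFrame({"Age": [4.0, 16.0, 35.0, 60.0, 71.0, np.nan]})

# Category (i): aged 16 or under. Category (iii): over 60.
toy["Child"] = (toy["Age"] <= 16).astype(int)   # hypothetical column name
toy["Senior"] = (toy["Age"] > 60).astype(int)   # hypothetical column name

# Rows where both dummies are 0 belong to the base case (or have a missing age).
print(toy)
```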

Clean up the data: drop all variables except 'Age', 'Sex' and 'Pclass' for our inputs and 'Survived' for our output.
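Selecting a list of columns is usually simpler than dropping everything else by name. In this sketch the frame is made up, and `Name` and `Fare` stand in for whatever extra columns the real file contains. If you built dummy columns from these variables in the earlier steps, the same pattern applies with those names added to the list.

```python
import pandas as pd

# Made-up stand-in for the Titanic frame with two extra, unwanted columns.
toy = pd.DataFrame({
    "Survived": [0, 1],
    "Pclass": [3, 1],
    "Name": ["Passenger A", "Passenger B"],  # hypothetical extra column
    "Sex": ["male", "female"],
    "Age": [22.0, 38.0],
    "Fare": [7.25, 71.28],                   # hypothetical extra column
})

# Keep only the output and the three inputs; .copy() avoids chained-assignment warnings later.
keep = ["Survived", "Pclass", "Sex", "Age"]
toy = toy[keep].copy()

print(toy.columns.tolist())
```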

Split the data into training, validation and test sets.
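A three-way split can be sketched with two calls to `train_test_split`. The 60/20/20 proportions are an assumption for illustration, not something the tutorial prescribes, and the frame is random made-up data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame with 20 rows standing in for the cleaned Titanic data.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Female": rng.integers(0, 2, size=20),
    "Survived": rng.integers(0, 2, size=20),
})

# First hold out 20% as the test set, then take 25% of the remainder
# (which is 20% of the original) as the validation set.
train_val, test = train_test_split(toy, test_size=0.2, random_state=0)
train, val = train_test_split(train_val, test_size=0.25, random_state=0)

print(len(train), len(val), len(test))
```

Fixing `random_state` keeps the split reproducible between runs, which matters when comparing models on the same validation data.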

Set up your X and y variables.
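As a rough sketch, `X` is the table of input columns and `y` is the `Survived` column. The frame and the feature names below are made up; in the notebook the same two lines would be repeated for each of the training, validation and test sets.

```python
import pandas as pd

# Made-up training frame with two illustrative dummy features.
toy = pd.DataFrame({
    "FirstClass": [1, 0, 0, 1],  # hypothetical feature name
    "Female": [1, 0, 1, 0],      # hypothetical feature name
    "Survived": [1, 0, 1, 0],
})

feature_cols = ["FirstClass", "Female"]
X = toy[feature_cols]   # 2-D inputs: one row per passenger, one column per feature
y = toy["Survived"]     # 1-D target: 0 or 1 per passenger

print(X.shape, y.shape)
```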


## Logistic regression
We will use `sklearn.linear_model.LogisticRegression`. Read the docs here: https://scikit-learn.org/1.5/modules/generated/sklearn.linear_model.LogisticRegression.html

In [None]:
import sklearn
from sklearn.linear_model import LogisticRegression

# Set up and fit the logistic regression model


Which variable does the model weight most heavily when making a prediction?
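One way to approach this question is to fit the model and then rank the fitted coefficients by absolute size. The sketch below does this on synthetic data, so the numbers it prints say nothing about the Titanic; the feature names and the data-generating rule are assumptions made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic training data with two illustrative 0/1 features.
rng = np.random.default_rng(0)
n = 200
X_train = pd.DataFrame({
    "Female": rng.integers(0, 2, size=n),      # hypothetical feature name
    "FirstClass": rng.integers(0, 2, size=n),  # hypothetical feature name
})
# Made-up rule for generating outcomes, used only so the model has something to fit.
logit = -1.0 + 2.0 * X_train["Female"] + 0.5 * X_train["FirstClass"]
y_train = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Fit with default settings (L2-regularised logistic regression).
model = LogisticRegression()
model.fit(X_train, y_train)

# coef_ has shape (1, n_features) for a binary problem; label each weight with its column.
coefs = pd.Series(model.coef_[0], index=X_train.columns)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

Because every input here is a 0/1 dummy, the coefficients are on the same scale and can be compared directly: a positive value raises the log-odds of survival and a negative value lowers them.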

## Validating our classification model
Predict on the validation dataset and estimate accuracy using one of the classification metrics from [here](https://scikit-learn.org/1.5/modules/model_evaluation.html#classification-metrics).
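The metric functions all take the true labels first and the predicted labels second. In this sketch the two lists are hand-written stand-ins for the validation labels and for the output of `model.predict` on the validation inputs.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hand-written stand-ins: true validation labels and a model's predictions.
y_val = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

# Accuracy: fraction of all passengers classified correctly (4 of 5 here).
acc = accuracy_score(y_val, y_pred)

# Recall: fraction of actual survivors the model identified (2 of 3 here).
rec = recall_score(y_val, y_pred)

print(f"accuracy = {acc:.2f}, recall = {rec:.2f}")
```

Accuracy alone can be misleading when one outcome is much more common than the other, which is why it is worth looking at a second metric such as recall or precision.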

## Decision tree

Now build a decision tree with `sklearn.tree.DecisionTreeClassifier`, following the docs here: https://scikit-learn.org/1.5/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier. Use the entropy criterion for splitting that we discussed during lectures.
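A minimal sketch of the tree set-up: the entropy criterion is selected with `criterion="entropy"`. The four-row training set and the feature names are made up so that the result is easy to check by eye.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up training data in which survival depends only on the "Female" column.
X_train = pd.DataFrame({
    "Female": [0, 0, 1, 1],      # hypothetical feature name
    "FirstClass": [0, 1, 0, 1],  # hypothetical feature name
})
y_train = [0, 0, 1, 1]

# criterion="entropy" chooses splits by information gain rather than the default Gini impurity.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X_train, y_train)

# Training accuracy; an unrestricted tree can fit this tiny data set perfectly.
print(tree.score(X_train, y_train))
```

On real data an unrestricted tree tends to overfit, so parameters such as `max_depth` are worth tuning against the validation set.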

In [None]:
from sklearn.tree import DecisionTreeClassifier


How does the decision tree compare against the logistic regression?

## Visualising how good the model is
We can look at how often the models produce false positives and false negatives using a [confusion matrix](https://scikit-learn.org/1.5/auto_examples/model_selection/plot_confusion_matrix.html). It shows:


```
                           | Predicted Negatives (0) | Predicted Positives (1) |
--------------------------------------------------------------------------------
Actual Negatives (0)       | True Negatives          | False Positives         |
Actual Positives (1)       | False Negatives         | True Positives          |
```

We want higher values along the upper left to lower right diagonal (more true negatives / true positives) and lower values in the opposite diagonal (fewer false negatives / false positives).
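The layout above matches what `confusion_matrix` returns: rows are the actual classes and columns are the predicted classes. The labels in this sketch are hand-written stand-ins for the validation labels and a model's predictions.

```python
from sklearn.metrics import confusion_matrix

# Hand-written stand-ins: actual outcomes and a model's predictions.
y_val = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]

# Rows = actual class (0 then 1), columns = predicted class (0 then 1).
cm = confusion_matrix(y_val, y_pred)
print(cm)

# For a binary problem, ravel() unpacks the four cells in reading order.
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

To draw the matrix rather than print it, pass it to `ConfusionMatrixDisplay(confusion_matrix=cm).plot()`. Doing this once per model gives a side-by-side view of where the regression and the tree make their mistakes.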

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Compare the confusion matrices on the validation data for the regression and tree.