<a target="_blank" href="https://colab.research.google.com/github/lm2612/Tutorials/blob/main/2_supervised_learning_classification/2-Advanced_Classification_Titanic.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Titanic: Machine learning from disaster. Advanced Tutorial - code your own

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this tutorial, we will use passenger data to predict who survived the shipwreck and use our predictive model to answer the question: "What sorts of people were more likely to survive?" We will focus on passenger age, gender and socio-economic class. You can read more about the Titanic dataset [here](https://www.kaggle.com/c/titanic/overview).

This is the advanced version of the tutorial, where we will learn how to build our own classifiers.
First, import packages and load the data.

In [1]:
import sys
import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt

np.random.seed(0)

In [None]:
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
    filepath = "https://raw.githubusercontent.com/lm2612/Tutorials/refs/heads/main/2_supervised_learning_classification/titanic.csv"
    print(f"Notebook running in google colab. Using raw github filepath = {filepath}")

else:
    filepath = "./titanic.csv"
    print(f"Notebook running locally. Using local filepath = {filepath}")


Notebook running locally. Using local filepath = ./titanic.csv


In [None]:
df = pd.read_csv(filepath)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


We are interested in the "Survived" column, where there are two possible outcomes: survived (1) or did not survive (0). We want to build a classifier to predict this outcome. Specifically, we are going to investigate how passenger class, age and sex influenced survival.

For passenger class, we are going to use dummy variables to represent the three possible states: binary variables which take on the value 0 if not true and 1 if true.

Create dummy variables for classes 1 and 2. This implicitly means that the 3rd class will be the base case that we compare to.

In [None]:
# Create new columns based on conditions


Create a dummy variable equal to 1 if the passenger was female.


Clean up the data: drop all variables except 'Class_1', 'Class_2', 'Sex' and 'Age' for our inputs, and 'Survived' for our output.
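The three preparation steps above could be sketched as follows. This is only one possible approach, shown on a few made-up rows (the `toy` frame and its values are hypothetical) so that it runs on its own; the same lines would apply to `df`.

```python
import pandas as pd

# Hypothetical toy rows using the same column names as the Titanic data
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 2, 3],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.93, 8.05],
})

# Class dummies: 1 if the passenger is in that class, 0 otherwise.
# 3rd class gets no dummy, so it becomes the base case.
toy["Class_1"] = (toy["Pclass"] == 1).astype(int)
toy["Class_2"] = (toy["Pclass"] == 2).astype(int)

# Overwrite 'Sex' with a dummy: 1 if female, 0 otherwise
toy["Sex"] = (toy["Sex"] == "female").astype(int)

# Keep only the output column and the four input columns
toy = toy[["Survived", "Class_1", "Class_2", "Sex", "Age"]]
```

Placing 'Survived' first is a choice made here so that the inputs can later be selected as "every column except the first".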

## Logistic Regression
Here we will create our own logistic regression classifier.


Write a Python function for the logistic function: $\theta = \frac{1}{1 + \exp(-x)}$. The function takes $x$ as its sole argument. Plot the function.


In [None]:
def logistic(x):
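A minimal sketch of one way to fill in this function, assuming NumPy so that it accepts both scalars and arrays. The grid `xs` is an arbitrary choice for illustration.

```python
import numpy as np

def logistic(x):
    """Logistic function 1 / (1 + exp(-x)); works on scalars and NumPy arrays."""
    return 1.0 / (1.0 + np.exp(-x))

# Evaluate on an evenly spaced grid (range chosen arbitrarily)
xs = np.linspace(-6, 6, 121)
ys = logistic(xs)
```

Passing `xs` and `ys` to `plt.plot` (with `plt` from the imports cell) should draw the familiar S-shaped curve that passes through 0.5 at $x = 0$.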


We are going to use the logistic function to represent the probability of a passenger surviving. But to do so, we need to write a function which returns the linear combination $\beta_0 + \beta_1 z_1 + \beta_2 z_2$. The function should take as inputs $(\beta_0, \beta_1, \beta_2)$ (the regression parameters) and the covariates $z_1$ and $z_2$.


In [None]:
def linear_combination(beta_0, beta_1, beta_2, z_1, z_2):


Write a function which returns the probability:

$$\theta_i = \text{Logistic}(\beta_0 + \beta_1 z_1 + \beta_2 z_2) $$

where $\text{Logistic}$ is the function you created above. The function should take as input $(\beta_0, \beta_1, \beta_2)$ (the regression parameters) and the covariates $z_1$ and $z_2$.

In [None]:
def probability(beta_0, beta_1, beta_2, z_1, z_2):
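One possible sketch of the two functions described above, chained together. The parameter values in the usage lines are illustrative only and are not fitted estimates.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def linear_combination(beta_0, beta_1, beta_2, z_1, z_2):
    # beta_0 + beta_1 * z_1 + beta_2 * z_2
    return beta_0 + beta_1 * z_1 + beta_2 * z_2

def probability(beta_0, beta_1, beta_2, z_1, z_2):
    # Pass the linear combination through the logistic function
    return logistic(linear_combination(beta_0, beta_1, beta_2, z_1, z_2))

# Illustrative parameter values only
p_base = probability(-1.0, 2.0, 1.0, z_1=0, z_2=0)  # both dummies switched off
p_z1 = probability(-1.0, 2.0, 1.0, z_1=1, z_2=0)    # first dummy switched on
```

With both dummies at zero the probability depends on $\beta_0$ alone, which is why the omitted category acts as the base case.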


We are now going to write a function which returns the likelihood for a single observation $i$, $(z_{1i}, z_{2i}, S_i)$, where $S_i \in \{0,1\}$ represents whether a passenger survived (1) or not (0). In logistic regression, we assume that $S_i \sim \text{Bernoulli}(\theta_i)$. This means that the likelihood, $L_i$, for a single observation is given by:

$$ L_i = \theta_i^{S_i} (1-\theta_i)^{1-S_i} $$

Write a function that takes as input $(\beta_0, \beta_1, \beta_2)$ (the regression parameters), the covariates $z_1$ and $z_2$ and (crucially!) a value of $S_i$, and returns the likelihood $L_i$.


In [None]:
def likelihood_i(S_i, beta_0, beta_1, beta_2, z_1, z_2):
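A sketch of the Bernoulli likelihood for one observation, assuming the `probability` helper is defined as the logistic of the linear combination (repeated here so the block runs on its own).

```python
import numpy as np

def probability(beta_0, beta_1, beta_2, z_1, z_2):
    return 1.0 / (1.0 + np.exp(-(beta_0 + beta_1 * z_1 + beta_2 * z_2)))

def likelihood_i(S_i, beta_0, beta_1, beta_2, z_1, z_2):
    theta_i = probability(beta_0, beta_1, beta_2, z_1, z_2)
    # theta_i if S_i == 1, (1 - theta_i) if S_i == 0
    return theta_i ** S_i * (1.0 - theta_i) ** (1 - S_i)
```

Because $S_i$ is either 0 or 1, exactly one of the two factors is active, and the likelihoods of the two outcomes sum to 1 for any parameter values.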



The overall likelihood of observations is given by the product of the individual likelihoods of each data point, since we assume that the data are (conditionally) independent given the parameters:

$$ L = \prod_{i=1}^{N} L_i $$

Write a function that takes as input your processed Titanic dataset and the parameters $\beta_0, \beta_1, \beta_2$
and returns the likelihood. In calculating the likelihood, specify that $z_1$ and $z_2$ should be your class dummies.

In [None]:
def likelihood(beta_0, beta_1, beta_2, df):
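The product over passengers could be computed in a vectorised way, as in the sketch below. It assumes the processed data frame has columns named 'Class_1', 'Class_2' and 'Survived'; the `toy` frame is made up for illustration.

```python
import numpy as np
import pandas as pd

def likelihood(beta_0, beta_1, beta_2, df):
    z_1 = df["Class_1"].to_numpy()
    z_2 = df["Class_2"].to_numpy()
    S = df["Survived"].to_numpy()
    # Survival probability for every passenger at once
    theta = 1.0 / (1.0 + np.exp(-(beta_0 + beta_1 * z_1 + beta_2 * z_2)))
    # Product of the individual Bernoulli likelihoods
    return np.prod(theta ** S * (1.0 - theta) ** (1 - S))

# Hypothetical toy data
toy = pd.DataFrame({
    "Survived": [1, 0, 1, 0],
    "Class_1":  [1, 0, 0, 0],
    "Class_2":  [0, 1, 0, 0],
})
L_toy = likelihood(0.0, 0.0, 0.0, toy)
```

Since $L$ is a product of many numbers below 1, it becomes very small on a full dataset. Summing $\log L_i$ instead is a common, numerically safer alternative that has the same maximiser.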



We are now going to estimate the parameters $\beta_1$ and $\beta_2$ by doing a grid search. We start by fixing $\beta_0=-1.14$ (this is the maximum likelihood value of the parameter). We then search across all combinations of the following values: $\beta_1 \in \{0, 1, 1.67, 2, 2.5\}$ and $\beta_2 \in \{-1, 0, 1, 2, 3\}$. For each of the 25 combinations, calculate the likelihood. In doing so, find parameters that are close to the maximum likelihood values.

In [None]:
beta_0 = -1.14
beta_1 = [0, 1, 1.67, 2, 2.5]
beta_2 = [-1, 0, 1, 2, 3]





Find the values of `beta_1` and `beta_2` in the grid with the maximum likelihood. (Note: if you are using `np.argmax`, the output index will be flattened, so `np.unravel_index` may be helpful.)
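The grid search and the lookup of its maximum could be sketched as follows. The `toy` data frame is hypothetical, so the values it selects say nothing about the real Titanic data; the grid values and the fixed `beta_0` are the ones given above.

```python
import numpy as np
import pandas as pd

def likelihood(beta_0, beta_1, beta_2, df):
    z_1 = df["Class_1"].to_numpy()
    z_2 = df["Class_2"].to_numpy()
    S = df["Survived"].to_numpy()
    theta = 1.0 / (1.0 + np.exp(-(beta_0 + beta_1 * z_1 + beta_2 * z_2)))
    return np.prod(theta ** S * (1.0 - theta) ** (1 - S))

# Hypothetical toy data standing in for the processed Titanic frame
toy = pd.DataFrame({
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0, 0],
    "Class_1":  [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    "Class_2":  [0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
})

beta_0 = -1.14
beta_1 = [0, 1, 1.67, 2, 2.5]
beta_2 = [-1, 0, 1, 2, 3]

# Rows index beta_1, columns index beta_2
grid = np.array([[likelihood(beta_0, b1, b2, toy) for b2 in beta_2]
                 for b1 in beta_1])

# np.argmax returns a flat index; unravel it into (row, column)
i, j = np.unravel_index(np.argmax(grid), grid.shape)
best_beta_1, best_beta_2 = beta_1[i], beta_2[j]
```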

Compare your results to what you would get with [`sklearn.linear_model.LogisticRegression`](https://scikit-learn.org/1.5/modules/generated/sklearn.linear_model.LogisticRegression.html).

What do your estimates suggest are the odds ratios for survival relative to 3rd class passengers for being in 1st and 2nd classes respectively?



What does your model predict is the change in probability of survival in moving from 3rd to 2nd class?


What is the change in probability in moving from 2nd to 1st class?
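For the three questions above, it may help to recall that in logistic regression the odds $\theta / (1 - \theta)$ equal the exponential of the linear combination, so the odds ratio relative to the base case is $\exp(\beta_k)$. The sketch below shows the arithmetic, assuming $z_1$ is the 1st class dummy and $z_2$ the 2nd class dummy. `beta_0` is the value fixed in the grid search, while `beta_1` and `beta_2` are placeholders to be replaced with your own estimates.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

# beta_1 and beta_2 are placeholders, not fitted values
beta_0, beta_1, beta_2 = -1.14, 2.0, 1.0

# Odds ratios relative to 3rd class (the base case)
odds_ratio_1st = math.exp(beta_1)
odds_ratio_2nd = math.exp(beta_2)

# Predicted survival probability for each class
p_3rd = logistic(beta_0)
p_2nd = logistic(beta_0 + beta_2)
p_1st = logistic(beta_0 + beta_1)

# Changes in probability between classes
change_3rd_to_2nd = p_2nd - p_3rd
change_2nd_to_1st = p_1st - p_2nd
```

Note that a constant odds ratio does not imply a constant change in probability, because the logistic function is non-linear.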


## Decision Trees

We are going to build our own decision tree. We will use five variables here: the two class dummies, the sex dummy and two age dummies. For age, we are going to split the data into three segments: (i) those aged 16 or under; (ii) those over 16 and up to 60; and (iii) those over 60. Create dummy variables for categories (i) and (iii).
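The two age dummies could be created as in this sketch. The column names `Age_child` and `Age_senior` are hypothetical, as is the small `toy` frame.

```python
import pandas as pd

# Hypothetical ages, including the boundary values 16 and 60
toy = pd.DataFrame({"Age": [4.0, 16.0, 30.0, 60.0, 71.0]})

# Category (i): aged 16 or under
toy["Age_child"] = (toy["Age"] <= 16).astype(int)
# Category (iii): aged over 60
toy["Age_senior"] = (toy["Age"] > 60).astype(int)
```

Passengers with both dummies equal to 0 fall into the middle category, which therefore acts as the base case for age.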

The entropy of a binary outcome variable, $X$, is given by:

$$ H = -p \log p - (1-p) \log (1-p) $$

where $ p = \Pr(X=1)$.


Write a function which can calculate the entropy of a binary vector. Use it to calculate the entropy of the survival variable in the full dataset.

In [None]:
# Function to calculate entropy
def entropy(v_binary):
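A minimal sketch of such a function, using the usual sign convention $H = -p \log p - (1-p) \log(1-p)$ so that entropy is non-negative. The natural logarithm is assumed here, and the edge cases $p = 0$ and $p = 1$ are treated as zero entropy.

```python
import numpy as np

def entropy(v_binary):
    v = np.asarray(v_binary, dtype=float)
    if v.size == 0:
        return 0.0  # convention chosen here for an empty group
    p = v.mean()    # fraction of ones
    if p == 0.0 or p == 1.0:
        return 0.0  # no uncertainty; avoids evaluating log(0)
    return float(-(p * np.log(p) + (1 - p) * np.log(1 - p)))
```

For example, `entropy([0, 1])` evaluates to $\log 2$, the maximum for a binary variable, while a vector of identical values has entropy 0.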



## Building a Decision Tree Classifier

We now start to build a decision tree classifier. We choose one of the five variables to split on based on the reduction in entropy it provides. To measure this, we calculate the conditional entropy, $H(X|V)$, where $V$ is a particular variable we have split on. At each step, we choose the variable to split on so that it results in the greatest reduction in entropy.

### Conditional Entropy for a Binary Variable

The conditional entropy for a binary variable, $V$, is given by:

$$
H(X|V) = \frac{|S(V=1)|}{|S(\varnothing)|} \times H(S(V=1)) + \frac{|S(V=0)|}{|S(\varnothing)|} \times H(S(V=0)),
$$

where $S(V=v)$ is the set of members of the random variable $X$ corresponding to $V=v$, $S(\varnothing)$ is the full, unsplit set, and $|\cdot|$ denotes the number of members of a set. For example, consider:


In [None]:
df_example = pd.DataFrame({'X': [0, 1, 0, 1, 0, 1],
                           'V': [1, 1, 1, 0, 0, 1]})
df_example

then $S(V=1)$ is the $X$ column from the subsetted data frame:

In [None]:
filtered_df = df_example[df_example['V'] == 1]
filtered_df

Write a function to calculate the conditional entropy of splitting on a particular variable for your dataset.

In [None]:
def conditional_entropy(variable_name, df):
    # Group by the given variable_name and calculate entropy for each group

    # Calculate the probabilities (ps)

    # Return the weighted sum of entropy
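One way to follow the three commented steps is sketched below. The extra `target` argument (defaulting to 'Survived') is an assumption added here so that the function can also be tried on the small `df_example` frame, whose outcome column is called 'X'.

```python
import numpy as np
import pandas as pd

def entropy(v_binary):
    v = np.asarray(v_binary, dtype=float)
    if v.size == 0:
        return 0.0
    p = v.mean()
    if p == 0.0 or p == 1.0:
        return 0.0
    return float(-(p * np.log(p) + (1 - p) * np.log(1 - p)))

def conditional_entropy(variable_name, df, target="Survived"):
    # Group by the given variable and calculate entropy for each group
    groups = df.groupby(variable_name)[target]
    group_entropy = groups.apply(entropy)
    # Fraction of rows falling in each group
    ps = groups.size() / len(df)
    # Weighted sum of the group entropies
    return float((ps * group_entropy).sum())

df_example = pd.DataFrame({'X': [0, 1, 0, 1, 0, 1],
                           'V': [1, 1, 1, 0, 0, 1]})
h_example = conditional_entropy('V', df_example, target='X')
```

In `df_example`, both subsets contain equal numbers of zeros and ones, so splitting on 'V' leaves the entropy unchanged at $\log 2$.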


Use the function you created in the previous question to determine the reduction in entropy from splitting on each of the five possible variables. Which column yields the greatest reduction in entropy?

In [None]:
# Extract column names except the first one

# Calculate entropy reduction for each variable

# Create the result DataFrame
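The steps in the comments above could be filled in roughly as follows. The `toy` frame and its column names `feature_a` and `feature_b` are hypothetical and do not reflect the Titanic results; the sketch assumes 'Survived' is the first column.

```python
import numpy as np
import pandas as pd

def entropy(v_binary):
    v = np.asarray(v_binary, dtype=float)
    if v.size == 0:
        return 0.0
    p = v.mean()
    if p == 0.0 or p == 1.0:
        return 0.0
    return float(-(p * np.log(p) + (1 - p) * np.log(1 - p)))

def conditional_entropy(variable_name, df, target="Survived"):
    groups = df.groupby(variable_name)[target]
    ps = groups.size() / len(df)
    return float((ps * groups.apply(entropy)).sum())

# Hypothetical data: feature_a matches the outcome exactly, feature_b only partly
toy = pd.DataFrame({
    "Survived":  [1, 1, 1, 0, 0, 0, 1, 0],
    "feature_a": [1, 1, 1, 0, 0, 0, 1, 0],
    "feature_b": [1, 0, 1, 0, 1, 0, 1, 0],
})

# Extract column names except the first one
variables = list(toy.columns[1:])

# Entropy reduction = entropy before the split minus conditional entropy after it
h_full = entropy(toy["Survived"])
reductions = [h_full - conditional_entropy(v, toy) for v in variables]

# Create the result DataFrame, largest reduction first
df_entropy_reduction = (
    pd.DataFrame({"variable": variables, "entropy_reduction": reductions})
    .sort_values("entropy_reduction", ascending=False)
    .reset_index(drop=True)
)
```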



In [None]:
df_entropy_reduction


Explain intuitively why splitting on that variable resulted in the greatest reduction in entropy.



Create a decision tree classifier using the variable you have identified. The classifier outputs a classification probability:
$$ \Pr (X=1|V=v)=\frac{1}{|S(V=v)|}\sum_{i\in S(V=v)} X_i $$

where $X$ denotes the survival variable, $V$ denotes the variable you split on and $|S(V=v)|$ is the number of passengers in the subset with $V=v$. The above just means your outputted probability of survival is the corresponding fraction surviving in the subset corresponding to that particular value of the variable $V$.

In [None]:
# Define the depth_one_classifier function in Python
def depth_one_classifier( ):
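Since the predicted probability is just the fraction surviving in each subset, a depth-one classifier can be sketched with a group-wise mean. The argument names and the `toy` data below are assumptions for illustration; the stub above leaves the signature open.

```python
import pandas as pd

def depth_one_classifier(variable_name, df, target="Survived"):
    """Return Pr(target = 1 | variable = v) for each value v, as a dict."""
    return df.groupby(variable_name)[target].mean().to_dict()

# Hypothetical data with a made-up binary feature
toy = pd.DataFrame({
    "Survived":  [1, 1, 0, 0, 0, 1],
    "feature_a": [1, 1, 1, 0, 0, 0],
})
probs = depth_one_classifier("feature_a", toy)
```

Here `probs[1]` is the fraction surviving among rows with `feature_a == 1`, and `probs[0]` the fraction among the rest.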


Create a decision tree classifier with depth 2 (i.e. it splits on two variables), where in each step it chooses which variable to split on based on the greatest reduction in entropy.


We now consider splitting on another of the remaining variables.

So, we next split on the `Class_1` variable.

In [None]:
def depth_two_classifier(  ):
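A simplified sketch of a depth-two classifier is given below. It assumes both branches of the first split use the same second variable, which need not hold for a tree that re-evaluates the entropy reduction within each branch. The argument names, the `p_survive` column and the `toy` data are all hypothetical.

```python
import pandas as pd

def depth_two_classifier(first_variable, second_variable, df, target="Survived"):
    """Fraction surviving in each of the (up to) four leaves."""
    leaves = df.groupby([first_variable, second_variable])[target].mean()
    return leaves.rename("p_survive")

# Hypothetical data with two made-up binary features
toy = pd.DataFrame({
    "Survived":  [1, 1, 1, 0, 0, 0, 1, 0],
    "feature_a": [1, 1, 1, 1, 0, 0, 0, 0],
    "feature_b": [1, 1, 0, 0, 1, 1, 0, 0],
})
leaf_probs = depth_two_classifier("feature_a", "feature_b", toy)

# Attach each row's leaf probability to get a prediction per individual
predictions = toy.merge(leaf_probs.reset_index(),
                        on=["feature_a", "feature_b"], how="left")
```

Sorting `leaf_probs` would then show which combinations of the two variables have the highest and lowest predicted survival probabilities.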



Use your classifier to output the probabilities of survival for each (type of) individual in your dataset. Which groups have the highest survival probabilities and the lowest survival probabilities?
