In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from IPython.display import display

pd.options.display.max_columns = 50
pd.options.display.max_rows = 200
plt.rcParams['figure.figsize'] = [16, 9]

## Exercise 10: Titanic
You may already know the Titanic data set from other courses. The goal is to predict from various features whether a Titanic passenger survived the disaster. Reading in the data, rough preprocessing, and the split into training and test sets are already provided below.

In [3]:
from sklearn.model_selection import train_test_split

df = pd.read_csv("../data/titanic.csv")
# We are removing some features to make the problem clearer
df = df.drop(["PassengerId", "Name", "Ticket", "Cabin"], axis=1)
# We remove those passengers for whom the age is not specified
df = df[df["Age"].notna()]

# Label: survived yes/no
y = df.pop("Survived")
# Features including one-hot encoding of the categorical
X = pd.get_dummies(df)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train

,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
328,3,31.0,1,1,20.5250,True,False,False,False,True
73,3,26.0,1,0,14.4542,False,True,True,False,False
253,3,30.0,1,0,16.1000,False,True,False,False,True
719,3,33.0,0,0,7.7750,False,True,False,False,True
666,2,25.0,0,0,13.0000,False,True,False,False,True
...,...,...,...,...,...,...,...,...,...,...
92,1,46.0,1,0,61.1750,False,True,False,False,True
134,2,25.0,0,0,13.0000,False,True,False,False,True
337,1,41.0,0,0,134.5000,True,False,True,False,False
548,3,33.0,1,1,20.5250,False,True,False,False,True


## Task 1: Train a decision tree for the example. 
Take a look at the notebook ``Exercise 11 - Decision Tree Examples.ipynb``. Visualize your result with ``plot_tree()``.
Experiment with the hyperparameters ``max_depth`` and ``min_samples_split``. What do these parameters mean? How does the performance (accuracy, precision, recall) change on the training and test data?
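As a starting point, the following sketch shows one possible way to compare training and test metrics for several values of ``max_depth``. It is not the reference solution: it uses synthetic data from ``make_classification`` as a stand-in for the Titanic features, and the names ``X_demo``, ``Xd_train``, ``results`` and so on are chosen only for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in data, NOT the Titanic data set
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=42
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

results = {}
for max_depth in [1, 3, 5, None]:  # None lets the tree grow without a depth limit
    tree = DecisionTreeClassifier(
        max_depth=max_depth, min_samples_split=2, random_state=42
    )
    tree.fit(Xd_train, yd_train)

    row = {}
    for part, X_part, y_part in [
        ("train", Xd_train, yd_train),
        ("test", Xd_test, yd_test),
    ]:
        pred = tree.predict(X_part)
        # Store (accuracy, precision, recall) for this split
        row[part] = (
            accuracy_score(y_part, pred),
            precision_score(y_part, pred, zero_division=0),
            recall_score(y_part, pred, zero_division=0),
        )
    results[max_depth] = row

    print(
        f"max_depth={max_depth}: "
        + ", ".join(
            f"{part} acc={m[0]:.2f} prec={m[1]:.2f} rec={m[2]:.2f}"
            for part, m in row.items()
        )
    )
```

On the real data you would pass ``X_train``, ``X_test``, ``y_train`` and ``y_test`` instead, and a call such as ``plot_tree(tree, feature_names=list(X_train.columns), filled=True)`` should draw the fitted tree. The same loop can be repeated over ``min_samples_split``.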

In [3]:
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, recall_score, precision_score

# Solution

## Task 2: Hyperparameter optimization.
a) Find the best value for the tree depth using hyperparameter tuning and 5-fold cross-validation.

b) [optional] Find the best *combination* of tree depth and minimum number of examples required for a split using hyperparameter tuning and 5-fold cross-validation.
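One possible approach, sketched here under assumptions, is to let ``GridSearchCV`` evaluate every parameter combination with a 5-fold ``KFold`` split. The data is again a synthetic stand-in from ``make_classification``, and the candidate values in ``param_grid`` are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in data, NOT the Titanic data set
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=42
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# 5-fold cross-validation on the training data only
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Illustrative candidate values; for part a) keep only the "max_depth" entry
param_grid = {
    "max_depth": [1, 2, 3, 5, 8, None],
    "min_samples_split": [2, 10, 50],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=cv,
    scoring="accuracy",
)
search.fit(Xd_train, yd_train)

print("best parameters:", search.best_params_)
print("mean CV accuracy:", round(search.best_score_, 3))
print("test accuracy:", round(search.score(Xd_test, yd_test), 3))
```

The test set is used only once at the end, so the reported test accuracy is not influenced by the hyperparameter search.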

In [4]:
from sklearn.model_selection import KFold

# Solution

## Task 3: Repeat Task 2 with a logistic regression.
Instead of the class ``LogisticRegression`` you can use ``LogisticRegressionCV``, which cross-validates the regularization parameter ``C`` itself. Create a model with ``penalty="l2"``, 10 candidate values for ``C`` (``Cs=10``), and 5-fold cross-validation (``cv=5``). What is the (mathematical) meaning of ``C``?
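The following is a minimal sketch of how such a model could be set up, again on synthetic stand-in data rather than the Titanic features. In scikit-learn, ``C`` is the inverse of the regularization strength, so smaller values mean stronger regularization. The sketch relies on L2 being the default penalty and does not pass ``penalty`` explicitly; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in data, NOT the Titanic data set
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=42
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Cs=10 builds a grid of 10 candidate values for C on a logarithmic scale,
# cv=5 evaluates each candidate with 5-fold cross-validation.
# L2 is the default penalty, so it is not passed explicitly here.
model = LogisticRegressionCV(Cs=10, cv=5, max_iter=1000)
model.fit(Xd_train, yd_train)

# C_ holds the selected value; np.ravel handles both array and scalar forms
best_C = float(np.ravel(model.C_)[0])

print("candidate values for C:", model.Cs_)
print("selected C:", best_C)
print("test accuracy:", round(model.score(Xd_test, yd_test), 3))
```

On the real data, scaling the features beforehand (for example with ``StandardScaler``) may help the solver converge, because ``Fare`` and ``Age`` have very different ranges.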

In [5]:
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression


# Solution

## Task 4: Compare both models in a ROC plot.
What statements can you derive from the plot?

Hint:
https://scikit-learn.org/dev/modules/generated/sklearn.metrics.roc_curve.html

https://matplotlib.org/stable/gallery/lines_bars_and_markers/step_demo.html#sphx-glr-gallery-lines-bars-and-markers-step-demo-py
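Following the two hints, a ROC comparison could look roughly like the sketch below. It is an illustration under assumptions: the data is synthetic (``make_classification``), the two models use fixed hyperparameters instead of the tuned ones from Tasks 2 and 3, and ``roc_auc_score`` is added only to label the curves.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in data, NOT the Titanic data set
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=42
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Illustrative models with fixed hyperparameters
models = {
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=42),
    "logistic regression": LogisticRegression(max_iter=1000),
}

fig, ax = plt.subplots()
aucs = {}
for name, model in models.items():
    model.fit(Xd_train, yd_train)
    # Predicted probability of the positive class serves as the ranking score
    scores = model.predict_proba(Xd_test)[:, 1]
    fpr, tpr, thresholds = roc_curve(yd_test, scores)
    aucs[name] = roc_auc_score(yd_test, scores)
    ax.step(fpr, tpr, where="post", label=f"{name} (AUC = {aucs[name]:.2f})")

# Diagonal: expected curve of a classifier that guesses at random
ax.plot([0, 1], [0, 1], linestyle="--", color="grey", label="chance")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.set_title("ROC curves on the test set")
ax.legend()
```

A shallow tree produces only a few distinct probability values, so its curve has few steps, while the logistic regression yields a finer curve. A curve that lies closer to the top-left corner indicates a better trade-off between true and false positives.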

In [6]:
from sklearn.metrics import roc_curve

# Solution