# Lab assignment: tuning decision trees for imbalanced data

<img src="img/surgery.jpg" style="width:640px;height:406px;">

In this assignment we will apply a decision tree to solve an imbalanced problem, where positive patterns are scarce and the cost of making a mistake in this class is high. We will tune the tree construction method to take these facts into account.

## Guidelines

Throughout this notebook you will find empty cells that you will need to fill with your own code. Follow the instructions in the notebook and pay special attention to the following symbols.

<img src="img/question.png" height="80" width="80" style="float: right;"/>

***

<font color=#ad3e26>
You will need to solve a question by writing your own code or answer in the cell immediately below or in a different file, as instructed.</font>

***

<img src="img/exclamation.png" height="80" width="80" style="float: right;"/>

***
<font color=#2655ad>
This is a hint or useful observation that can help you solve this assignment. You should pay attention to these hints to better understand the assignment.
</font>

***

<img src="img/pro.png" height="80" width="80" style="float: right;"/>

***
<font color=#259b4c>
This is an advanced exercise that can help you gain a deeper understanding of the topic. Good luck!</font>

***

To avoid missing packages and compatibility issues you should run this notebook under one of the [recommended Ensembles environment files](https://github.com/albarji/teaching-environments-ensembles).

Lastly, if you need any help on the usage of a Python function you can place the writing cursor over its name and press Shift+Tab to produce a pop-up with related documentation. This will only work inside code cells.

Let's go!

## Preliminaries

First of all, let's fix a random seed so all results are reproducible across different runs of the notebook.

In [None]:
import numpy as np
np.random.seed(12345)

The following code will embed any plots into the notebook instead of generating a new window:

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

## Data preparation

We will make use of the [Thoracic Surgery Dataset](https://www.kaggle.com/sid321axn/thoraric-surgery), collected by the Wroclaw Thoracic Surgery Centre and readily available at Kaggle. The dataset contains information on 470 patients who underwent major lung resections for primary lung cancer in the years 2007-2011. Given a series of preprocessed descriptor variables, the goal is to predict whether the patient will die within one year after the surgery, and to find explanations of the major factors influencing this outcome. This could help identify patients at risk in future surgeries.

<img src="img/question.png" height="80" width="80" style="float: right;"/>

***

<font color=#ad3e26>
    
The data is contained in the file <b>ThoraricSurgery.csv</b>. Process the data to perform the following tasks:
- Load all the data into a pandas DataFrame.
- Create dummies for all categorical and binary features.
- Group all explanatory features in a new DataFrame or array <b>X</b>.
- Extract the target as a new DataFrame or array <b>Y</b>.
- Split the data into a training and a testing subset. Use about 1/3 of the data as test data. Use a stratified split.
</font>

***

In [None]:
####### INSERT YOUR CODE HERE

<img src="img/question.png" height="80" width="80" style="float: right;"/>

***

<font color=#ad3e26>
How many samples are available for the positive class (Risk)? How many for the negative class (No risk)?
</font>

***

In [None]:
####### INSERT YOUR CODE HERE

## Objective function

This problem is imbalanced, not only in terms of class samples, but also in terms of costs: classifying a safe patient as a risky one might result in unneeded medical treatments, but classifying a patient at risk as a safe one might result in death or severe health consequences. Therefore, the cost of a False Negative is much higher than the cost of a False Positive. How much higher? Let's suppose we agree that a False Negative is 10 times worse than a False Positive.

In [None]:
FALSE_NEGATIVE_PENALTY = 10

With this, we can define a class-weighted accuracy metric as follows:

In [None]:
from sklearn.metrics import confusion_matrix

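def weighted_accuracy(y_values, y_preds, fn_penalty):
    """Percentage of the maximum cost-weighted loss that the predictions avoid."""
    # Rows are true classes, columns are predicted classes
    conf = confusion_matrix(y_values, y_preds)
    # Each false positive costs 1, each false negative costs fn_penalty
    loss = conf[0][1] + conf[1][0] * fn_penalty
    # Loss incurred if every pattern were misclassified
    maxloss = conf[0][0] + conf[0][1] + (conf[1][0] + conf[1][1]) * fn_penalty
    return (maxloss - loss) / maxloss * 100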
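
In terms of confusion-matrix counts, the function above can be read as the formula below. The symbols $TN$, $FP$, $FN$, $TP$ (true negatives, false positives, false negatives, true positives) and $c$ (standing for `fn_penalty`) are notation introduced here for illustration only:

$$
\text{weighted accuracy} = 100 \cdot \frac{TN + c \cdot TP}{(TN + FP) + c \cdot (FN + TP)}
$$

With $c = 1$ this reduces to plain accuracy expressed as a percentage.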

We can check that it works as expected with the following toy examples. First, an array of perfect predictions except for one false positive:

In [None]:
y_values = [0, 0, 1, 1]
y_preds =  [0, 1, 1, 1]
print(f"Weighted accuracy = {weighted_accuracy(y_values, y_preds, FALSE_NEGATIVE_PENALTY):.1f}%")

Now a similar example with no false positives but one false negative:

In [None]:
y_values = [0, 0, 1, 1]
y_preds =  [0, 0, 0, 1]
print(f"Weighted accuracy = {weighted_accuracy(y_values, y_preds, FALSE_NEGATIVE_PENALTY):.1f}%")

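As expected, this metric gives much more weight to mistakes made on patients at risk: a single false positive leaves the weighted accuracy at about 95.5%, while a single false negative drops it to about 54.5%. We will need to take this into account when creating our model.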
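
To get a feeling for the scale of this metric, it can help to evaluate two trivial classifiers on it. The following is a minimal, dependency-free sketch: `weighted_accuracy_from_counts` is a hypothetical helper that applies the same formula as `weighted_accuracy` directly to confusion-matrix counts, and the 8 vs. 2 class split is made up for illustration, not taken from the dataset.

```python
def weighted_accuracy_from_counts(tn, fp, fn, tp, fn_penalty):
    """Same formula as weighted_accuracy, written on raw confusion-matrix counts."""
    loss = fp + fn * fn_penalty
    maxloss = (tn + fp) + (fn + tp) * fn_penalty
    return (maxloss - loss) / maxloss * 100


# Hypothetical test set: 8 safe patients (class 0) and 2 patients at risk (class 1)
n_negative, n_positive = 8, 2
penalty = 10

# Always predicting "no risk": every positive pattern becomes a false negative
always_negative = weighted_accuracy_from_counts(
    tn=n_negative, fp=0, fn=n_positive, tp=0, fn_penalty=penalty
)
# Always predicting "risk": every negative pattern becomes a false positive
always_positive = weighted_accuracy_from_counts(
    tn=0, fp=n_negative, fn=0, tp=n_positive, fn_penalty=penalty
)

print(f"Always 'no risk': {always_negative:.1f}%")
print(f"Always 'risk':    {always_positive:.1f}%")
```

Under plain accuracy the "always no risk" classifier would look better on this toy split (80% against 20%), whereas the weighted metric reverses the ranking. These two numbers are useful baselines when judging a trained model.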

## Naive model

<img src="img/question.png" height="80" width="80" style="float: right;"/>

***

<font color=#ad3e26>
Build a decision tree model on the training data, then measure its weighted accuracy on the test data. You can make use of any pre-pruning, post-pruning or hyperparameter search method, but don't try to correct the class imbalance in any way. What is the best weighted accuracy you can obtain?
    
Also create a visualization of your best tree. Does it make sense?
</font>

***

In [None]:
####### INSERT YOUR CODE HERE

## Cost-sensitive model

A decision tree can take into account the misclassification costs of each class when deciding which splits to make. This can be done by weighting each pattern by the cost of its class in the computation of impurities. In scikit-learn this is easily implemented through the **class_weight** parameter. For instance, if the class labels were encoded as $0$-$1$, to create a decision tree that gives double the weight to positive class patterns we would need to write:

In [None]:
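from sklearn.tree import DecisionTreeClassifier

# Patterns of class 1 count twice as much as patterns of class 0
DecisionTreeClassifier(class_weight={0: 1, 1: 2})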
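
To see what weighting patterns "in the computation of impurities" means, here is a simplified sketch for a single node. It assumes the Gini criterion (the scikit-learn default) and uses a hypothetical helper, `weighted_gini`, that scales each class count by its weight before computing the impurity. It is an illustration of the idea, not the library's actual implementation, and the node counts are made up.

```python
def weighted_gini(class_counts, class_weight):
    """Gini impurity of a node after scaling each class count by its weight."""
    weighted = {
        label: count * class_weight[label] for label, count in class_counts.items()
    }
    total = sum(weighted.values())
    return 1.0 - sum((value / total) ** 2 for value in weighted.values())


# Hypothetical node holding 9 negative patterns and 1 positive pattern
node = {0: 9, 1: 1}

plain = weighted_gini(node, {0: 1, 1: 1})    # every pattern counts the same
doubled = weighted_gini(node, {0: 1, 1: 2})  # positives count double

print(f"Unweighted Gini:           {plain:.4f}")
print(f"Gini with double positive: {doubled:.4f}")
```

With equal weights the node looks almost pure, so the tree has little incentive to split it further. Once positives count double, the same node has a higher impurity, which pushes the tree towards splits that isolate the positive pattern.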

<img src="img/question.png" height="80" width="80" style="float: right;"/>

***

<font color=#ad3e26>
Build another decision tree model, but this time provide class weights according to the false positive / false negative costs defined previously. What is the best weighted accuracy you can obtain now? Does the visualization of the tree produce more sensible rules?
</font>

***

In [None]:
####### INSERT YOUR CODE HERE