# Foundations of Data Science (GDW) 2023



# Exercise IX: Entropy & Overfitting

This week, we will take a look at entropy and the phenomenon of overfitting, especially for the model class of Decision Trees.

## Part 1: Shannon Entropy

Suppose we want to measure the amount of information in a selection process by a function $H$. The
following properties look reasonable:
- the function $H$ should be continuous in the probabilities,
- if all elements are equally likely, $p_{element} = 1/n$, the function $H$ should increase monotonically with the number of possibilities $n$,
- if the selection consists of multiple steps, $H$ should be the weighted sum of the values of $H$ for the individual steps.

Following the proof from ..., the only possible function with these properties is, up to a positive
constant factor (equivalently, the choice of the logarithm base), the Shannon entropy

$$H = - \sum\limits_{x \in X} p(x) \log p(x)$$
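
The formula can be evaluated directly for any discrete distribution. The following is a minimal sketch, not part of the original exercise: the function name `shannon_entropy` is chosen here for illustration, and the default base 2 (entropy in bits) is an assumption, since the text does not fix the logarithm base.

```python
import math


def shannon_entropy(probs, base=2.0):
    """Return -sum(p * log(p)) for a list of probabilities (hypothetical helper)."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Outcomes with p == 0 are skipped: they contribute nothing in the limit p*log(p) -> 0.
    return -sum(p * math.log(p) / math.log(base) for p in probs if p > 0)


# Base 2 is an assumption; it expresses the entropy in bits.
print("fair coin:", shannon_entropy([0.5, 0.5]))
print("fair die: ", shannon_entropy([1 / 6] * 6))
```

With equally likely outcomes the entropy grows with the number of possibilities, which matches the second property in the list above.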

Now, let us take a look at an example:

A fair die is rolled at the same time as a fair coin is tossed. Let A be the number on the upper
surface of the die and let B describe the outcome of the coin toss, where B is equal to 1 if the result
is “head” and equal to 0 if the result is “tail”.

The random variables $X$ and $Y$ are given by $X = A + B$ and $Y = A − B$, respectively. Let $a$, $b$, $x$, and $y$ denote possible values of the random variables $A$, $B$, $X$, and $Y$, respectively.

The random variable $X = A + B$ can take the values 1 to 7. The probability masses $p(x)$ for the
values 1 and 7 are equal to 1/12, since they correspond to exactly one event. The probability masses
for the values 2 to 6 are equal to 1/6, since each of these values corresponds to two events {a, b}.

Similarly, the random variable $Y = A − B$ can take the values 0 to 6, where the probability masses
for the values 0 and 6 are equal to 1/12, while the probability masses for the values 1 to 5 are equal
to 1/6.
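
These probability masses can be checked by enumerating all twelve outcomes. The sketch below is illustrative only: it assumes that die and coin are independent, so that every pair $(a, b)$ has probability $1/12$, and the names `p_event`, `p_X`, and `p_Y` are chosen here.

```python
from collections import Counter
from fractions import Fraction

# Assumption: die and coin are independent, so each pair (a, b) has mass 1/6 * 1/2.
p_event = Fraction(1, 12)

p_X = Counter()  # probability masses of X = A + B
p_Y = Counter()  # probability masses of Y = A - B
for a in range(1, 7):   # number shown by the die
    for b in (0, 1):    # coin toss: 1 = head, 0 = tail
        p_X[a + b] += p_event
        p_Y[a - b] += p_event

for value in sorted(p_X):
    print(f"x = {value}: p_X(x) = {p_X[value]}")
for value in sorted(p_Y):
    print(f"y = {value}: p_Y(y) = {p_Y[value]}")
```

Using `Fraction` keeps the masses exact, so they can be compared directly with the values stated in the text.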

The following tables list the possible values of the random variables $X$ and $Y$, the associated events
$\{a, b\}$, and the probability masses $p_X(x)$ and $p_Y(y)$.

| $x$ | events $\{a,b\}$ | $p_X(x)$ |
|-----|------------------|----------|
| 1   | {1,0}            | 1/12     |
| 2   | {2,0}, {1,1}     | 1/6      |
| 3   | {3,0}, {2,1}     | 1/6      |
| 4   | {4,0}, {3,1}     | 1/6      |
| 5   | {5,0}, {4,1}     | 1/6      |
| 6   | {6,0}, {5,1}     | 1/6      |
| 7   | {6,1}            | 1/12     |

| $y$ | events $\{a,b\}$ | $p_Y(y)$ |
|-----|------------------|----------|
| 0   | {1,1}            | 1/12     |
| 1   | {1,0}, {2,1}     | 1/6      |
| 2   | {2,0}, {3,1}     | 1/6      |
| 3   | {3,0}, {4,1}     | 1/6      |
| 4   | {4,0}, {5,1}     | 1/6      |
| 5   | {5,0}, {6,1}     | 1/6      |
| 6   | {6,0}            | 1/12     |

### Task 1.1
Calculate the entropies $H(X)$ and $H(Y)$, the conditional entropies $H(X|Y)$ and $H(Y|X)$, the joint
entropy $H(X,Y)$ and the mutual information $I(X;Y)$.
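
As a reminder, and assuming the standard textbook definitions (they are not restated in this exercise), the requested quantities are connected by the following identities, where $p(x,y)$ denotes the joint probability mass of $X$ and $Y$:

$$H(X,Y) = - \sum\limits_{x,y} p(x,y) \log p(x,y)$$

$$H(X|Y) = H(X,Y) - H(Y), \qquad H(Y|X) = H(X,Y) - H(X)$$

$$I(X;Y) = H(X) + H(Y) - H(X,Y)$$

Once $H(X)$, $H(Y)$, and $H(X,Y)$ are known, the remaining quantities follow by subtraction.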

*write your solution here or on paper*

## Part 2: Overfitting Trees

Given the *iris* dataset, we can learn a classifier on the first two features with the following code.

In [None]:
import pandas as pd
from sklearn import datasets
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

iris = datasets.load_iris()

X = iris.data[:, :2] # features
y = iris.target # labels

# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Specify Model
iris_model = DecisionTreeRegressor(random_state=1, max_depth=None)
# Fit Model
iris_model.fit(train_X, train_y)

# Calculate the mean absolute error on the training instances
train_predictions = iris_model.predict(train_X)
train_mae = mean_absolute_error(train_y, train_predictions)
print(f"Training MAE {train_mae:.4f}")

# Make validation predictions and calculate the mean absolute error
val_predictions = iris_model.predict(val_X)
val_mae = mean_absolute_error(val_y, val_predictions)
print(f"Validation MAE {val_mae:.4f}")

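As you can observe, the training MAE is much smaller than the validation MAE: the model reproduces the training data almost perfectly but generalizes poorly to unseen data. This phenomenon is known as *overfitting*.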
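
One way to see where the gap comes from is to look at the size of the fitted tree. With `max_depth=None`, scikit-learn keeps splitting until the leaves are pure or too small to split further, so the tree can end up with a large number of leaves relative to the number of training samples. The sketch below repeats the setup from the cell above; the comparison depth of 2 is an arbitrary value chosen for illustration, not a recommended setting.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

iris = datasets.load_iris()
X = iris.data[:, :2]  # first two features, as above
y = iris.target
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Unrestricted tree versus a deliberately shallow one (depth 2 is an illustrative choice).
full_tree = DecisionTreeRegressor(random_state=1, max_depth=None).fit(train_X, train_y)
shallow_tree = DecisionTreeRegressor(random_state=1, max_depth=2).fit(train_X, train_y)

print("training samples:", len(train_X))
for name, tree in [("max_depth=None", full_tree), ("max_depth=2", shallow_tree)]:
    print(name, "-> depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```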

### Task 2.1
Reduce the model complexity by changing the parameter `max_depth` and observe how the training and validation MAE behave.
What do you think is the optimal tree depth when comparing the MAEs?

In [None]:
import matplotlib.pyplot as plt
import numpy as np

def train_with_depth(X, y, maxdepth):
    """Return training and validation MAE for tree depths 1..maxdepth on one random split."""
    t_mae = np.zeros(maxdepth)
    v_mae = np.zeros(maxdepth)
    # Split into validation and training data (no seed, so every call uses a new split)
    train_X, val_X, train_y, val_y = train_test_split(X, y)

    for depth in range(1, maxdepth + 1):
        # Specify model
        iris_model = DecisionTreeRegressor(max_depth=depth)
        # Fit model
        iris_model.fit(train_X, train_y)
        # Calculate mean absolute error on training instances
        train_predictions = iris_model.predict(train_X)
        t_mae[depth - 1] = mean_absolute_error(train_y, train_predictions)
        # Make validation predictions and calculate mean absolute error
        val_predictions = iris_model.predict(val_X)
        v_mae[depth - 1] = mean_absolute_error(val_y, val_predictions)

    return t_mae, v_mae

# maxdepth must exist outside the function because the plotting code uses it as well
maxdepth = 10
t_mae, v_mae = train_with_depth(X, y, maxdepth=maxdepth)

depths = range(1, maxdepth + 1)
plt.plot(depths, t_mae, c="blue", label="training MAE")
plt.plot(depths, v_mae, c="red", label="validation MAE")
plt.xlabel("max_depth")
plt.ylabel("MAE")
plt.legend()
plt.show()

However, since we set no seed, the results differ every time you rerun the code, because they depend strongly on how the data is split into training and validation sets.
This means that we cannot generalize our findings from a single run. Instead, we have to average over multiple
random splits to estimate the expected error values.
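
The averaging step itself can be sketched independently of the tree training. In the example below, `noisy_error_curve` is a hypothetical stand-in for one train/validation run, and its numbers are made up purely to show the mechanics; they are not results from the iris data.

```python
import random
from statistics import mean


def noisy_error_curve(rng):
    """Hypothetical stand-in for one run: one error value per depth, plus random noise."""
    underlying_curve = [0.5, 0.4, 0.35, 0.4, 0.45]  # made-up values, for illustration only
    return [error + rng.uniform(-0.1, 0.1) for error in underlying_curve]


rng = random.Random(0)
runs = [noisy_error_curve(rng) for _ in range(100)]

# zip(*runs) groups the values of all runs by depth; mean() averages each group.
average_curve = [mean(values) for values in zip(*runs)]

print("single run:   ", [round(error, 3) for error in runs[0]])
print("average curve:", [round(error, 3) for error in average_curve])
```

A single run can deviate noticeably from the underlying curve, while the average over many runs stays close to it.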

### Task 2.2
Run the code at least 100 times and average the resulting MAEs. Plot the averaged results in a graph.

In [None]:
# write your code here

### Task 2.3
Which is the preferred depth for our model? Define a variable `pref_depth` that holds your chosen value.

In [None]:
pref_depth=-1 # <- change this value

We may train and plot one of these trees by executing the code below:

In [None]:
from sklearn.tree import plot_tree
iris_model = DecisionTreeRegressor(max_depth=pref_depth)
# Fit Model
iris_model.fit(train_X, train_y)
plt.figure(figsize=(10, 7))
p = plot_tree(iris_model, label="none", feature_names=["length", "width"])

In the leaves you can see the predicted label. As you may notice, the label is continuous because we applied a
regression model, although predicting a discrete class is actually a classification problem.

You may train and plot a classification tree with the following code:

In [None]:
from sklearn.tree import DecisionTreeClassifier
iris_model = DecisionTreeClassifier(max_depth=pref_depth)
# Fit Model
iris_model.fit(train_X, train_y)
plt.figure(figsize=(10, 7))
p = plot_tree(iris_model, label="none", impurity=True, filled=True, feature_names=["length", "width"])
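
The difference between the two model types can also be seen in their predictions. The following sketch is an illustration under assumptions: it repeats the data setup from Part 2 and uses `max_depth=2` as an arbitrary example value rather than your `pref_depth`.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

iris = datasets.load_iris()
X = iris.data[:, :2]  # first two features, as in Part 2
y = iris.target
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# max_depth=2 is an arbitrary illustrative value.
regressor = DecisionTreeRegressor(max_depth=2, random_state=1).fit(train_X, train_y)
classifier = DecisionTreeClassifier(max_depth=2, random_state=1).fit(train_X, train_y)

reg_pred = regressor.predict(val_X)    # leaf averages of the numeric labels (floats)
clf_pred = classifier.predict(val_X)   # one of the class labels 0, 1, 2

print("distinct regressor outputs: ", np.unique(reg_pred))
print("distinct classifier outputs:", np.unique(clf_pred))
```

The regressor returns the mean label of the training samples in each leaf, which need not be a valid class, whereas the classifier always returns one of the class labels.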