# Homework III

Diogo Correia (ist199211) & Tomás Esteves (ist199341)

## I. Pen and Paper [12v]

**Given the following decision tree learnt from 20 observations using Shannon entropy, with leaf annotations (`#correct/#total`)**

![Decision Tree](./decision_tree.png)

### 1) [4v] Draw the training confusion matrix

<table>
  <tr>
    <td colspan="2" rowspan="2" style="border-top: none; border-left: none;"></td>
    <th colspan="2">True</th>
    <td rowspan="2" style="border-top: none; border-right: none;"></td>
  </tr>
  <tr>
    <th>Positive</th>
    <th>Negative</th>
  </tr>
  <tr>
    <th rowspan="2">Predicted</th>
    <th>Positive</th>
    <td>8</td>
    <td>4</td>
    <td>12</td>
  </tr>
  <tr>
    <th>Negative</th>
    <td>3</td>
    <td>5</td>
    <td>8</td>
  </tr>
  <tr>
    <th colspan="2" style="border-left: none; border-bottom: none;"></th>
    <td>11</td>
    <td>9</td>
    <td>20</td>
  </tr>
</table>

### 2) [3v] Identify the training F1 after a post-pruning of the given tree under a maximum depth of 1.

<table>
  <tr>
    <td colspan="2" rowspan="2" style="border-top: none; border-left: none;"></td>
    <th colspan="2">True</th>
    <td rowspan="2" style="border-top: none; border-right: none;"></td>
  </tr>
  <tr>
    <th>Positive</th>
    <th>Negative</th>
  </tr>
  <tr>
    <th rowspan="2">Predicted</th>
    <th>Positive</th>
    <td>5</td>
    <td>2</td>
    <td>7</td>
  </tr>
  <tr>
    <th>Negative</th>
    <td>6</td>
    <td>7</td>
    <td>13</td>
  </tr>
  <tr>
    <th colspan="2" style="border-left: none; border-bottom: none;"></th>
    <td>11</td>
    <td>9</td>
    <td>20</td>
  </tr>
</table>

In [None]:
true_positives = 5
false_positives = 2
false_negatives = 6

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)

f1_measure = (0.5 * (1 / precision + 1 / recall)) ** (-1)

f1_measure
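
As a cross-check of the cell above, the same value can be worked out by hand from the pruned confusion matrix (TP = 5, FP = 2, FN = 6). The symbols $P$ and $R$ are shorthand introduced here for precision and recall.

```latex
\begin{aligned}
P &= \frac{TP}{TP + FP} = \frac{5}{7}, \qquad
R = \frac{TP}{TP + FN} = \frac{5}{11} \\
F_1 &= \frac{2 P R}{P + R} = \frac{2\,TP}{2\,TP + FP + FN}
     = \frac{10}{18} \approx 0.556
\end{aligned}
```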

### 3) [2v] Identify two different reasons as to why the left tree path was not further decomposed.

The left tree path might not have been further decomposed because:

- Splitting further could overfit the model, given the very small sample size.
  The 2 negative observations in this branch may be outliers, so a tree that isolates them
  could end up being less accurate on unseen data.
- The information gain of splitting this branch, $IG(y_{out} | y_2, y_1 = A)$, might be very small,
  since the branch is already dominated by positive observations (5 out of 7).
  There might be no split of the left path that would correctly classify all of its observations.
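
The two reasons above correspond to common pre-pruning criteria. The sketch below is only an illustration: the function `should_split`, its thresholds (`min_samples_split=8`, `min_gain=0.1`) and the candidate gain of `0.05` are hypothetical values chosen for this example, not parameters known to have been used to learn the given tree.

```python
from math import log2


def entropy(counts):
    """Shannon entropy (in bits) of a node, given its per-class counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


def should_split(counts, best_gain, min_samples_split=8, min_gain=0.1):
    """Hypothetical pre-pruning rule; both thresholds are illustrative assumptions."""
    if sum(counts) < min_samples_split:
        return False  # too few observations: a split would likely fit noise
    if best_gain < min_gain:
        return False  # no candidate variable reduces the entropy enough
    return True


# Left branch (y1 = A): 5 positive and 2 negative observations
left_counts = [5, 2]
# The node entropy is an upper bound on any gain achievable inside this branch
left_entropy = entropy(left_counts)
# best_gain=0.05 is a made-up value, used only to exercise the rule
decision = should_split(left_counts, best_gain=0.05)

print(round(left_entropy, 3), decision)
```

With these assumed thresholds the branch is left as a leaf because it holds fewer than 8 observations; a low candidate gain would lead to the same decision even with a larger node.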

### 4) [3v] Compute the information gain of variable y1

In [None]:
from math import log2
import operator as op
from itertools import chain
from functools import reduce

In [None]:
# INPUT
total_positive_count = 11
total_negative_count = 9

branch_a_positive_count = 5
branch_a_negative_count = 2

branch_b_positive_count = 6
branch_b_negative_count = 7

In [None]:
# Functions
def entropy_by_count(counts):
    """
    Calculates the information entropy, I(X), of a set, given the count of each class of element
    """
    total = sum(counts)
    # Skip empty classes: 0 * log2(0) is taken as 0, whereas log2(0) would raise a ValueError
    return sum(-(x / total) * log2(x / total) for x in counts if x > 0)


def split_entropy_by_count(branch_counts):
    """
    Calculates the weighted entropy after branching on a variable
    """
    # branch_counts is a list of lists of ints: one list of class counts per branch
    total = sum(chain(*branch_counts))
    return sum((sum(x) / total) * entropy_by_count(x) for x in branch_counts)

In [None]:
entropy_y_out = entropy_by_count([total_positive_count, total_negative_count])
entropy_y_out_y1 = split_entropy_by_count(
    [
        [branch_a_positive_count, branch_a_negative_count],
        [branch_b_positive_count, branch_b_negative_count],
    ]
)

information_gain = entropy_y_out - entropy_y_out_y1

information_gain
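
For reference, the computation performed by the cells above can be written out by hand. The values are rounded to four decimal places, and $H$ is used here as shorthand for the Shannon entropy.

```latex
\begin{aligned}
H(y_{out}) &= -\tfrac{11}{20}\log_2\tfrac{11}{20} - \tfrac{9}{20}\log_2\tfrac{9}{20} \approx 0.9928 \\
H(y_{out} \mid y_1) &= \tfrac{7}{20}\,H\!\left(\tfrac{5}{7}, \tfrac{2}{7}\right)
  + \tfrac{13}{20}\,H\!\left(\tfrac{6}{13}, \tfrac{7}{13}\right) \\
  &\approx \tfrac{7}{20}\cdot 0.8631 + \tfrac{13}{20}\cdot 0.9957 \approx 0.9493 \\
IG(y_{out} \mid y_1) &= H(y_{out}) - H(y_{out} \mid y_1) \approx 0.9928 - 0.9493 = 0.0435
\end{aligned}
```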

## II. Programming and Critical Analysis [8v]

**Consider the following three regressors applied on kin8nm.arff data (available at the webpage):**

- linear regression with Ridge regularization term of 0.1
- two MLPs, 𝑀𝐿𝑃1 and 𝑀𝐿𝑃2, each with two hidden layers of size 10, hyperbolic tangent function as the activation function of all nodes, a maximum of 500 iterations, and a fixed seed (`random_state=0`). 𝑀𝐿𝑃1 should be parameterized with early stopping while 𝑀𝐿𝑃2 should not consider early stopping.

Remaining parameters (e.g., loss function, batch size, regularization term, solver) should be left at their default values.

Using a 70-30 training-test split with a fixed seed (random_state=0):

### 4) [4v] **Compute the MAE of the three regressors: linear regression, 𝑀𝐿𝑃1 and 𝑀𝐿𝑃2.**

In [None]:
import pandas as pd
from scipy.io.arff import loadarff
from sklearn import model_selection, metrics
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Reading the ARFF file
data = loadarff("../data/kin8nm.arff")
df = pd.DataFrame(data[0])

df.head()

In [None]:
# Separate the features from the target variable (y)
X = df.drop("y", axis=1)
y = df["y"]

y.head()

In [None]:
# Split the dataset into a training set (70%) and a testing set (30%)
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X.values, y.values, train_size=0.7, random_state=0
)

In [None]:
from sklearn.linear_model import Ridge

In [None]:
rr = Ridge(alpha=0.1)

In [None]:
from sklearn.neural_network import MLPRegressor

In [None]:
mlp1 = MLPRegressor(
    hidden_layer_sizes=(10, 10),
    activation="tanh",
    max_iter=500,
    random_state=0,
    early_stopping=True,
)

In [None]:
mlp2 = MLPRegressor(
    hidden_layer_sizes=(10, 10),
    activation="tanh",
    max_iter=500,
    random_state=0,
    early_stopping=False,
)

In [None]:
rr.fit(X_train, y_train)
mlp1.fit(X_train, y_train)
mlp2.fit(X_train, y_train)

In [None]:
rr_pred = rr.predict(X_test)
mlp1_pred = mlp1.predict(X_test)
mlp2_pred = mlp2.predict(X_test)

In [None]:
print("Ridge regression MAE:", metrics.mean_absolute_error(y_test, rr_pred))

In [None]:
print("MLP1 MAE:", metrics.mean_absolute_error(y_test, mlp1_pred))

In [None]:
print("MLP2 MAE:", metrics.mean_absolute_error(y_test, mlp2_pred))
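
The metric reported above is the mean of the absolute differences between the true and predicted targets. The following minimal sketch spells that definition out in plain Python; the helper name `mae_sketch` and the toy values are made up for illustration and are not taken from kin8nm.

```python
def mae_sketch(y_true, y_pred):
    """Mean absolute error: the average of |y_i - y_hat_i| over all observations."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values, chosen only to make the arithmetic easy to follow
toy_true = [1.0, 2.0, 3.0]
toy_pred = [1.5, 2.0, 2.0]

# Absolute errors are 0.5, 0.0 and 1.0, so the mean is 1.5 / 3
toy_mae = mae_sketch(toy_true, toy_pred)
print(toy_mae)  # 0.5
```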

### 5) [1.5v] **Plot the residues (in absolute value) using two visualizations: boxplots and histograms.**

Hint: consider using the `boxplot` and `hist` functions from `matplotlib.pyplot` for this purpose.

In [None]:
# y_test and the predictions are NumPy arrays, so the absolute residues are computed element-wise
rr_residues = np.abs(y_test - rr_pred)
mlp1_residues = np.abs(y_test - mlp1_pred)
mlp2_residues = np.abs(y_test - mlp2_pred)

In [None]:
df = pd.DataFrame({"Ridge": rr_residues, "MLP1": mlp1_residues, "MLP2": mlp2_residues})

df.head()

In [None]:
sns.boxplot(data=df)

plt.show()

In [None]:
sns.histplot(data=df)

plt.show()

### 6) [1v] **How many iterations were required for 𝑀𝐿𝑃1 and 𝑀𝐿𝑃2 to converge?**

In [None]:
print("MLP1 number of iterations:", mlp1.n_iter_)

In [None]:
print("MLP2 number of iterations:", mlp2.n_iter_)

### 7) [1.5v] **What can be motivating the unexpected differences on the number of iterations?**

**Hypothesize one reason underlying the observed performance differences between the MLPs.**

Read more about the MLP regressor at https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html
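
According to the documentation linked above, with `early_stopping=True` a fraction of the training data (10% by default, `validation_fraction`) is held out, and training stops once the validation score fails to improve by at least `tol` for `n_iter_no_change` consecutive epochs; otherwise the same test is applied to the training loss. The sketch below is a simplified, standalone illustration of that patience rule, not scikit-learn's actual code. The function name `epochs_until_stop` and both toy curves are assumptions made up for this example.

```python
def epochs_until_stop(curve, tol=1e-4, patience=10, higher_is_better=True):
    """Return the 1-based epoch at which a patience rule stops training.

    Returns len(curve) if the rule never triggers (the iteration budget runs out).
    """
    best = None
    stale = 0  # consecutive epochs without an improvement larger than tol
    for epoch, raw in enumerate(curve, start=1):
        value = raw if higher_is_better else -raw
        if best is None or value > best + tol:
            stale = 0
        else:
            stale += 1
        if best is None or value > best:
            best = value
        if stale >= patience:
            return epoch
    return len(curve)


# Toy validation score: improves for 3 epochs, then plateaus
toy_val_scores = [0.50, 0.60, 0.65] + [0.65] * 20
# Toy training loss: keeps decreasing by more than tol at every epoch
toy_train_losses = [1.0 / k for k in range(1, 24)]

stop_on_validation = epochs_until_stop(toy_val_scores)
stop_on_training = epochs_until_stop(toy_train_losses, higher_is_better=False)
print(stop_on_validation, stop_on_training)  # 13 23
```

In this toy setting the validation-based rule stops 10 epochs after the plateau begins, while the loss-based rule uses up all 23 epochs because the training loss never stalls.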