# Exercise solutions
First things first: we need to import the relevant libraries and the dataset loader.

In [89]:
from sklearn import tree
import graphviz
from sklearn.datasets import load_iris
import numpy as np

from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

We then load the dataset and split it into a training set and a test set, holding out 10 examples for testing.

In [14]:
# Load the iris dataset
iris = load_iris()

# Set the random seed for reproducibility
rngs = 0
np.random.seed(rngs)

# Create a random permutation of the example indices
indices = np.random.permutation(len(iris.data))
# Split the dataset: the last 10 shuffled indices form the test set
indices_train = indices[:-10]
indices_test = indices[-10:]

X_train = iris.data[indices_train]
X_test = iris.data[indices_test]
y_train = iris.target[indices_train]
y_test = iris.target[indices_test]

[ 88  70  87  36  21   9 103  67 117  47]


# Inflating the dataset
To solve the first point of the exercise, we need to *artificially inflate* the dataset. The idea is to create a new training set in which each example is repeated a number of times equal to the cost of its class.
To do that, we first specify a cost array whose $i^{th}$ element is the cost assigned to the $i^{th}$ class. We then build an array that specifies, for every training example, how many times it has to be repeated, and pass it to the `np.repeat` function. Rather than repeating the feature rows directly, we repeat the indices of the training instances and then use the inflated indices to select rows from the dataset.

In [80]:
costs = np.array([10, 1, 1])
inflation_factors = costs[y_train]
# ai stands for: artificially inflated
indices_train_ai = np.repeat(indices_train, inflation_factors)

# Build the actual inflated dataset by selecting rows with the inflated indices
X_train_ai = iris.data[indices_train_ai]
y_train_ai = iris.target[indices_train_ai]
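
To make the effect of `np.repeat` with a per-element repeat count more concrete, here is a minimal toy sketch. The arrays `toy_labels`, `toy_indices`, `toy_factors` and `toy_indices_ai` are hypothetical names introduced only for this illustration; the cost array is the same as in the cell above.

```python
import numpy as np

# Hypothetical toy data: four examples with labels from three classes
toy_labels = np.array([0, 1, 2, 0])
toy_indices = np.arange(len(toy_labels))

# Same cost array as in the exercise: class 0 costs 10, the others cost 1
costs = np.array([10, 1, 1])

# Indexing the cost array with the labels gives one repeat count per example
toy_factors = costs[toy_labels]

# Each index is repeated as many times as its repeat count says
toy_indices_ai = np.repeat(toy_indices, toy_factors)

print(toy_factors)                              # [10  1  1 10]
print(len(toy_indices_ai))                      # 22
print(np.bincount(toy_labels[toy_indices_ai]))  # [20  1  1]
```

The two class-0 examples each appear 10 times in the inflated index array, while the other examples appear once, so the class counts go from `[2, 1, 1]` to `[20, 1, 1]`.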

# Defining the models
Essentially, two models will be trained: one on the *artificially inflated* dataset, and one on the original training set with the `class_weight` hyperparameter set to the same costs used for the inflation.
According to the theory, the two models should output the same predictions.

In [78]:
seed = 300
impurity_measure = "entropy"
minimum_samples = 5
weights = {i: costs[i] for i in range(len(costs))}  # transform costs into the dict form expected by class_weight

clf_hp = tree.DecisionTreeClassifier(criterion=impurity_measure, random_state=seed, min_samples_leaf=minimum_samples, class_weight=weights)
clf_ai = tree.DecisionTreeClassifier(criterion=impurity_measure, random_state=seed, min_samples_leaf=minimum_samples)
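
The reason the two approaches are expected to agree is that the impurity of a node depends only on class proportions, and repeating every example of class $i$ `costs[i]` times shifts those proportions exactly as weighting each example by `costs[i]` does. The following is a minimal sketch of that argument, not the library's internal implementation: `entropy_from_counts` is a hypothetical helper and `node_labels` is made-up data for a single node.

```python
import numpy as np

def entropy_from_counts(counts):
    """Shannon entropy (in bits) of a class distribution given as counts or weights."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

costs = np.array([10, 1, 1])
node_labels = np.array([0, 0, 1, 1, 1, 2])  # hypothetical labels falling in one node

# Approach 1: physically repeat each example according to its class cost
inflated = np.repeat(node_labels, costs[node_labels])
h_inflated = entropy_from_counts(np.bincount(inflated, minlength=len(costs)))

# Approach 2: keep the examples as they are and sum their weights per class
class_mass = np.bincount(node_labels, weights=costs[node_labels], minlength=len(costs))
h_weighted = entropy_from_counts(class_mass)

print(h_inflated, h_weighted)
```

Both approaches yield the class masses `[20, 3, 1]`, hence the same entropy. Note that this only covers the impurity computation. Stopping rules based on raw sample counts, such as `min_samples_leaf`, see more rows in the inflated dataset than in the weighted one, so the two trees are not guaranteed to be identical in general.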

We now train the models and save their predictions on the test set.

In [86]:
clf_hp.fit(X_train, y_train)
clf_ai.fit(X_train_ai, y_train_ai)

results_hp = clf_hp.predict(X_test)
results_ai = clf_ai.predict(X_test)  # predictions of the model trained on the inflated dataset
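
To compare the two models directly instead of reading printed arrays, one option is to compute the fraction of test examples on which they agree. The sketch below is self-contained and rebuilds the split and the models with the same settings as above; the names `clf_w`, `clf_r`, `pred_w`, `pred_r` and `agreement` are introduced here purely for illustration.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()

# Rebuild the train/test split with a fixed seed (last 10 shuffled indices for testing)
rng = np.random.RandomState(0)
indices = rng.permutation(len(iris.data))
indices_train, indices_test = indices[:-10], indices[-10:]

costs = np.array([10, 1, 1])
y_train = iris.target[indices_train]
indices_train_ai = np.repeat(indices_train, costs[y_train])

params = dict(criterion="entropy", random_state=300, min_samples_leaf=5)

# Model with costs passed as class weights, trained on the original training set
class_weight = {i: int(c) for i, c in enumerate(costs)}
clf_w = tree.DecisionTreeClassifier(class_weight=class_weight, **params)
clf_w.fit(iris.data[indices_train], y_train)

# Model without class weights, trained on the inflated training set
clf_r = tree.DecisionTreeClassifier(**params)
clf_r.fit(iris.data[indices_train_ai], iris.target[indices_train_ai])

pred_w = clf_w.predict(iris.data[indices_test])
pred_r = clf_r.predict(iris.data[indices_test])

# Fraction of test examples on which the two models give the same class
agreement = np.mean(pred_w == pred_r)
print("Fraction of identical predictions:", agreement)
```

An agreement of 1.0 means the two models cannot be told apart on this test set; with only 10 test examples this is a sanity check rather than a proof of equivalence.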

As a last step, we evaluate the performance of the models.

In [91]:
print("Predictions:")
print(results_hp)
print(results_ai)
print("True classes:")
print(y_test) 
# We can see that results are in accordance with the theory

acc_hp = accuracy_score(y_test, results_hp)
acc_ai = accuracy_score(y_test, results_ai)
print("Accuracy score (Costs on hyperparameters): "+ str(acc_hp))
print("Accuracy score (Costs artificially inflating): "+ str(acc_ai))
f1_hp = f1_score(y_test, results_hp, average='macro')
f1_ai = f1_score(y_test, results_ai, average='macro')
print("F1 score (Costs on hyperparameters): "+str(f1_hp))
print("F1 score (Costs artificially inflating): "+str(f1_ai))

Predictions:
[1 2 1 0 0 0 2 1 2 0]
[1 2 1 0 0 0 2 1 2 0]
True classes:
[1 1 1 0 0 0 2 1 2 0]
Accuracy score (Costs on hyperparameters): 0.9
Accuracy score (Costs artificially inflating): 0.9
F1 score (Costs on hyperparameters): 0.8857142857142858
F1 score (Costs artificially inflating): 0.8857142857142858
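
As a cross-check of the scores above, the accuracy and the macro-averaged F1 can be recomputed by hand from the printed arrays. This is a minimal sketch that assumes every class is predicted at least once (otherwise the precision would divide by zero); the names `y_pred`, `y_true` and `f1_per_class` are introduced here for illustration.

```python
import numpy as np

# Predictions and true classes copied from the output above
y_pred = np.array([1, 2, 1, 0, 0, 0, 2, 1, 2, 0])
y_true = np.array([1, 1, 1, 0, 0, 0, 2, 1, 2, 0])

# Accuracy: fraction of correctly classified examples
accuracy = np.mean(y_pred == y_true)

# Macro F1: compute F1 separately for each class, then take the unweighted mean
f1_per_class = []
for c in np.unique(y_true):
    tp = np.sum((y_pred == c) & (y_true == c))
    precision = tp / np.sum(y_pred == c)
    recall = tp / np.sum(y_true == c)
    f1_per_class.append(2 * precision * recall / (precision + recall))

macro_f1 = np.mean(f1_per_class)

print(accuracy)      # 0.9
print(f1_per_class)  # roughly [1.0, 0.857, 0.8]
print(macro_f1)      # roughly 0.8857
```

The single mistake (a class-1 example predicted as class 2) lowers the recall of class 1 to 3/4 and the precision of class 2 to 2/3, which accounts for the macro F1 of about 0.886.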
