Evaluation of KD Experiments
========================

This Jupyter notebook covers the following topics:

* precision and recall
* nDCG
* confidence intervals

Precision and Recall
-----------------------------

Assume you trained a model to detect cats in images. The two lists below contain the true labels for a sequence of 20 images (`true_labels`) and the predictions of your model (`predicted_labels`).

In [None]:
true_labels = [
    'cat',
    'cat',
    'no_cat',
    'cat',
    'no_cat',
    'cat',
    'cat',
    'no_cat',
    'cat',
    'no_cat',
    'no_cat',
    'cat',
    'cat',
    'no_cat',
    'cat',
    'cat',
    'cat',
    'cat',
    'no_cat',
    'cat'
]
predicted_labels = [
    'cat',
    'no_cat',
    'no_cat',
    'no_cat',
    'no_cat',
    'cat',
    'no_cat',
    'cat',
    'cat',
    'no_cat',
    'no_cat',
    'cat',
    'cat',
    'no_cat',
    'cat',
    'cat',
    'no_cat',
    'cat',
    'cat',
    'cat'
]

**Task:** Complete the code below to calculate precision and recall for the given predictions.

*Note:* You're implementing the calculation as an exercise. In real applications one would rely on a library.
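
As a reminder of the definitions, precision is $TP / (TP + FP)$ and recall is $TP / (TP + FN)$. A minimal sketch with made-up confusion counts (hypothetical numbers, not the counts for the lists above) might look like this:

```python
# Hypothetical confusion counts, chosen only for illustration
example_tp, example_fp, example_fn = 6, 2, 4

# Share of predicted positives that are actually positive
example_precision = example_tp / (example_tp + example_fp)
# Share of actual positives that the model found
example_recall = example_tp / (example_tp + example_fn)

print(example_precision)  # 0.75
print(example_recall)     # 0.6
```

True negatives appear in neither formula, which is why both metrics stay informative when the negative class dominates.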

In [None]:
num_predictions = len(predicted_labels)

tp = 0
tn = 0
fp = 0
fn = 0
for i in range(num_predictions):
    ground_truth = true_labels[i]
    prediction = predicted_labels[i]
    # ...
        
precision = # ...
recall = # ...

print(precision)
print(recall)

Assume your model outputs its confidence of having detected a cat as a probability score $\in [0, 1]$. For the predicted labels above, a threshold of 0.5 was used.
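
To illustrate how a threshold turns scores into labels, here is a small sketch on made-up scores. The helper name `apply_threshold` and the choice that a score equal to the threshold counts as a cat are assumptions made for this example only.

```python
def apply_threshold(scores, threshold):
    """Map each score to 'cat' if it reaches the threshold, else 'no_cat'.

    Assumption: a score exactly equal to the threshold counts as 'cat'.
    """
    return ['cat' if score >= threshold else 'no_cat' for score in scores]


# Hypothetical scores, for illustration only
example_scores = [0.91, 0.35, 0.50, 0.62]

print(apply_threshold(example_scores, 0.5))  # ['cat', 'no_cat', 'cat', 'cat']
print(apply_threshold(example_scores, 0.7))  # ['cat', 'no_cat', 'no_cat', 'no_cat']
```

Raising the threshold makes the model more conservative: fewer images are labeled as cats, which typically trades recall for precision.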

**Task:** Complete the code below to generate a precision-recall curve.

*Note:* From here on we'll use functions from the *sklearn* library to calculate precision and recall. As the function expects *0*s and *1*s as class "labels", we have to convert the ground truth accordingly.

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import recall_score, precision_score

prediction_probabilities = [
    0.71,  # cat
    0.19,  # no_cat
    0.20,  # no_cat
    0.46,  # no_cat
    0.44,  # no_cat
    0.61,  # cat
    0.22,  # no_cat
    0.80,  # cat
    0.82,  # cat
    0.41,  # no_cat
    0.15,  # no_cat
    0.82,  # cat
    0.89,  # cat
    0.47,  # no_cat
    0.75,  # cat
    0.57,  # cat
    0.40,  # no_cat
    0.87,  # cat
    0.67,  # cat
    0.71   # cat
]

# convert ground truth labels to numerical values
cat_to_num = {
    'cat': 1,
    'no_cat': 0
}
ground_truths_numerical = [
    # ...
]

# calculate
precision_values = []
recall_values = []

for i in range(101):
    threshold = i/100
    
    # determine predicted classes (as numerical values)
    predictions_numerical = []
    # ...
    
    precision = precision_score(
        ground_truths_numerical,
        predictions_numerical,
        average='binary',
        pos_label=1,
        zero_division=1
    )
    recall = recall_score(
        ground_truths_numerical,
        predictions_numerical,
        average='binary',
        pos_label=1,
        zero_division=1
    )
    
    # ...

plt.plot(precision_values, recall_values, marker='.')
plt.xlabel('precision')
plt.ylabel('recall')
plt.show()

You can check your output against a precision-recall curve generated by *sklearn*:

In [None]:
from sklearn.metrics import precision_recall_curve

p_vals_check, r_vals_check, thresholds = precision_recall_curve(
    ground_truths_numerical,
    prediction_probabilities
)
plt.plot(p_vals_check, r_vals_check, marker='.')
plt.xlabel('precision')
plt.ylabel('recall')
plt.show()

Similarly you can, for example, easily generate a ROC curve:

In [None]:
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(ground_truths_numerical, prediction_probabilities)
plt.plot(fpr, tpr, marker='.')
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
plt.show()

nDCG
---------

In the evaluation of a recommender system, you obtained an ordered list of 20 documents with the following relevance scores.

In [None]:
relevance_scores = [
    0.7, 0.9, 0.3, 0.5, 0.1, 0.1, 0.8, 0.2, 0.4, 0, 0, 0, 0, 0, 0.1, 0.5, 0, 0.3, 0, 0.1
]

**Task:** Complete the implementation of the nDCG score below.
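
For reference, one common formulation of the score is sketched below. This is an assumption: the notebook does not state which variant it expects, and other definitions exist (for example with an exponential gain $2^{rel_i} - 1$). Here $rel_i$ is the relevance of the document at rank $i$ (1-based), and $rel^{*}_i$ is the relevance at rank $i$ after sorting all scores in descending order.

$$
\mathrm{DCG} = \sum_{i=1}^{n} \frac{rel_i}{\log_2(i + 1)}, \qquad
\mathrm{IDCG} = \sum_{i=1}^{n} \frac{rel^{*}_i}{\log_2(i + 1)}, \qquad
\mathrm{nDCG} = \frac{\mathrm{DCG}}{\mathrm{IDCG}}
$$

With a 0-based loop index, as in `range(num_documents)`, the denominator becomes $\log_2(i + 2)$.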

In [None]:
import math


num_documents = len(relevance_scores)
ideal_order_scores = sorted(relevance_scores, reverse=True)

dcg = 0
idcg = 0

for i in range(num_documents):
    # ...

ndcg = # ...
print(ndcg)

**Task:** Copy and modify the code above so that it calculates nDCG@5:

* a) with a focus purely on the *order* of the top 5 documents
* b) with a focus on the overall result quality, evaluated by means of the top 5 documents

In [None]:
# a) Order focus

# ...

In [None]:
# b) General result quality focus

# ...

Confidence intervals
-----------------------------

To get information on the basic properties of a list of values, we can use the *Pandas* library. When working with *Pandas*, you will often encounter DataFrames (you can think of these as tables). For a list of values (which you can think of as a column in a table), *Pandas* uses a so-called Series.

We'll start out by defining a standard Python list and creating a Series from it.

In [None]:
import pandas as pd

l = [1, 4, 3, 4, 0, 6]
s = pd.Series(l)

**Task:** Execute the cell below and take a look at `s`.

You'll see that every element has a number (called its *index*) associated with it. You can access a single element of `s` using `s.at[i]`, where `i` is the index of the value.

`s` has *a lot* more to offer than just `.at`. You can use Python's built-in function `dir()` to list all attributes of the Series `s` (enter `dir(s)` and execute). *Note:* Attributes starting with an underscore are typically meant for internal use by the class, not for use from "outside".

**Task:** Pick a few interesting or useful-sounding attributes of `s` and read up on them using `help(s.foo)`, where `foo` is the attribute's name. Showcase the use of two attributes and print some useful or interesting information.

In [None]:
s

Some of the values related to confidence intervals are directly available as attributes provided by Series objects. These are `var` (variance), `std` (standard deviation), and `sem` (standard error).

Let's assume the values in `l` are observations in an experiment. We can get their variance, standard deviation and standard error as follows.

In [None]:
delta_deg_of_freedom = 0  # not using Bessel's correction (done here just so that we get nice numbers)

variance = s.var(ddof=delta_deg_of_freedom)
std = s.std(ddof=delta_deg_of_freedom)
std_err = s.sem(ddof=delta_deg_of_freedom)

print(f'Variance: {variance}')
print(f'Standard deviation: {std}')
print(f'Standard error: {std_err}')
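
The effect of `ddof` can be cross-checked by hand: the sum of squared deviations from the mean is divided by `n - ddof`. The sketch below uses only the standard library; the helper name `variance_with_ddof` is hypothetical and introduced just for this illustration.

```python
import math


def variance_with_ddof(values, ddof):
    """Variance as the sum of squared deviations divided by (n - ddof)."""
    n_values = len(values)
    mean = sum(values) / n_values
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return squared_deviations / (n_values - ddof)


observations = [1, 4, 3, 4, 0, 6]  # same values as `l` above

for ddof in (0, 1):
    var = variance_with_ddof(observations, ddof)
    print(f'ddof={ddof}: variance={var:.4f}, std={math.sqrt(var):.4f}')
# ddof=0: variance=4.0000, std=2.0000
# ddof=1: variance=4.8000, std=2.1909
```

With `ddof=1` (Bessel's correction) the estimate is slightly larger, which compensates for estimating the mean from the same sample.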

**Task:** Complete the code below to calculate a confidence interval for `s`.

*Note:* You may use `std` but not `std_err`.
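
For reference, a two-sided confidence interval for the mean based on Student's t-distribution is commonly written as shown here. This is a sketch under the usual assumptions (roughly normally distributed observations, unknown population variance); $\bar{x}$ denotes the mean of the observations, $\sigma_x$ their standard deviation, $n$ the number of observations, and $t$ the critical value of the t-distribution for the chosen confidence level.

$$
\mathrm{SE} = \frac{\sigma_x}{\sqrt{n}}, \qquad
\bar{x} - t \cdot \mathrm{SE} \;\le\; \mu \;\le\; \bar{x} + t \cdot \mathrm{SE}
$$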

In [None]:
from scipy.stats import t

n = len(s)  # number of observations
confidence = 0.95
students_t = t.ppf((1 + confidence) / 2, n - 1)  # two-sided critical value, n - 1 degrees of freedom
print('Student\'s t = {:.4f}'.format(students_t))

std_error = # ...
# ...
upper_bound = # ...
lower_bound = # ...
print(f'Confidence interval (95%): {lower_bound:.4f} ≤ μ ≤ {upper_bound:.4f}')

A nice way to visualize a series of measurements for a given value is a box plot. The cell below generates 50 random values and draws a box plot with notches.

**Task:** Execute the cell below and identify the informative elements that appear in the plot.

In [None]:
import numpy as np
random_vals = np.random.normal(0, 2, 50)
random_vals[0] = 7
df_rand_vals = pd.DataFrame(random_vals, columns=['vals'])

print('Random values: {}'.format(
    ','.join(['{:.2f}'.format(v) for v in random_vals])
))
print('Median: {}'.format(
    df_rand_vals.vals.median()
))
print('95% confidence interval: {}'.format(
    t.interval(
        0.95,
        len(random_vals) - 1,  # degrees of freedom: n - 1
        loc=df_rand_vals.vals.mean(),
        scale=df_rand_vals.vals.sem()
    )
))
df_rand_vals.boxplot(notch=True, figsize=(15,10))