# Foundations of Data Science (GDW) 2023



# Exercise X: Classification Quality & SVMs

This week, we will take a look at precision and recall as quality measures for classification models, and additionally at the Support Vector Machine.

## Part 1: Precision and Recall

For binary classification problems, familiar quality measures such as the MSE are of little use.
Instead, performance is often measured in terms of precision and recall. These measures are
defined by the following ratios:

$Precision =\frac{TP}{TP + FP}$

and

$Recall = \frac{TP}{TP + FN}$

where $TP$ is the number of true positives, $FP$ is the number of false positives, and $FN$ is the number of
false negatives.
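
As a minimal sketch of these two ratios, the following helper computes both from raw counts. The function name `precision_recall` and the example counts are assumptions made for illustration; they are not taken from the exercise data.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) for the given outcome counts."""
    # Guard against empty denominators; returning 0.0 here is a convention, not a rule
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Hypothetical counts: 6 true positives, 2 false positives, 4 false negatives
precision, recall = precision_recall(tp=6, fp=2, fn=4)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Note that the number of true negatives does not enter either measure.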

Consider a binary classification problem, spam classification. Suppose further that we have projected the features onto a single value $x \in (0, 1)$ and have a decision function $f(x)$ that depends on this feature:

$f(x) = \begin{cases} \text{"spam"} & \text{if } x \geq \theta \\ \text{"no spam"} & \text{if } x < \theta \end{cases}$

Here $\theta \in (0, 1)$ is the classification threshold of the model.
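
A thresholded decision function of this kind, and the way each prediction is sorted into TP, FP, FN or TN, can be sketched as follows. The threshold value `0.5`, the helper names and the four sample points are hypothetical and chosen only for illustration.

```python
def classify(x, threshold=0.5):
    """Predict "spam" when the feature reaches the (assumed) threshold."""
    return "spam" if x >= threshold else "no spam"


def outcome(prediction, label):
    """Name the confusion-matrix cell for one prediction, treating "spam" as positive."""
    if prediction == "spam":
        return "TP" if label == "spam" else "FP"
    return "FN" if label == "spam" else "TN"


# Hypothetical (feature, true label) pairs
samples = [(0.1, "no spam"), (0.4, "spam"), (0.6, "no spam"), (0.9, "spam")]
outcomes = [outcome(classify(x), label) for x, label in samples]
print(outcomes)  # ['TN', 'FN', 'FP', 'TP']
```

Every point left of the threshold is predicted negative (TN or FN), and every point to its right is predicted positive (TP or FP).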

In the following figures, you can see the classification results of different models. The binary label $y$ is indicated by the color of the data
points (green for positive, red for negative), and the text below each point shows whether it is a $TP$, $TN$, $FN$ or $FP$ according to the model.

To generate these images, we first need to install the `Pillow` library.

In [None]:
!pip install Pillow

Afterwards, we can execute the script below, which generates images for classification results.

In [None]:
from PIL import Image, ImageDraw, ImageFont
from IPython.display import display

class_results = [["TN", "TN","TN","TN","TN","TN","TN","TN","TN", "TN","TN","TN","TN","TN","TN", "FN", 
                 "TN", "FN", "TN", "FN", "FP", "TP", "FP", "TP", "TP", "TP", "TP", "TP", "TP", "TP"],
                ["TN", "TN","TN","TN","TN","TN","TN","TN","TN", "TN","TN","TN","TN","TN","TN", "FN", 
                 "TN", "FN", "TN", "FN", "TN", "FN", "FP", "TP", "TP", "TP", "TP", "TP", "TP", "TP"],
                 ["TN", "TN","TN","TN","TN","TN","TN","TN","TN", "TN","TN","TN","TN","TN","TN", "FN", 
                 "TN", "FN", "FP", "TP", "FP", "TP", "FP", "TP", "TP", "TP", "TP", "TP", "TP", "TP"]]
class_thresholds = [20, 22, 18]  

def generate_class_colors(results):
    circle_colors = ["red", "red", "green", "green"]
    labels = ["TN", "FP", "TP", "FN"]
    result_colors = []
    for res in results:
        result_colors.append(circle_colors[labels.index(res)])
    
    return result_colors

def generate_image(circle_colors, labels, threshold, image_width=850, circle_radius=14, label_font_size=14):
    image_height = circle_radius*2 + label_font_size + 30
    image = Image.new("RGB", (image_width, image_height), "white")
    draw = ImageDraw.Draw(image)
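    try:
        font = ImageFont.truetype("Pillow/Tests/fonts/FreeMono.ttf", label_font_size)
    except OSError:
        # Fall back to Pillow's built-in font if FreeMono is not available locally
        font = ImageFont.load_default()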
    
    start_x = (image_width - len(circle_colors) * (circle_radius*2)) - 10
    
    for i, (color, label) in enumerate(zip(circle_colors, labels)):
            
        draw.ellipse(
            [(start_x, 10), (start_x + circle_radius*2, circle_radius*2 + 10)],
            fill=color,
            outline="black",
        )
        
        # draw.textsize was removed in Pillow 10; textlength measures the rendered width
        label_width = int(draw.textlength(label, font=font))
        label_position = (start_x + circle_radius - label_width // 2, circle_radius*2 + 20)
        draw.text(label_position, label, fill="black", font=font)
        
        if i == threshold-1:
            line_margin = 3
            line_x = start_x + circle_radius*2 + line_margin
            line_y_start = 0
            line_y_end = circle_radius*2 + 35
            draw.line([(line_x, line_y_start), (line_x, line_y_end)], fill="black")
            start_x += line_margin*2
        
        start_x += circle_radius*2
        
    display(image)

for i, (class_result, class_threshold) in enumerate(zip(class_results, class_thresholds)):
    result_colors = generate_class_colors(class_result)
    print(f"Classification model {i+1}:")
    generate_image(result_colors, class_result, threshold=class_threshold)

### Task 1.1
Compute precision and recall for each of these classifiers. The vertical lines denote the classification thresholds (leftmost = 0, rightmost = 1).

*write your answers here*

## Part 2: The Support Vector Machine
Before we turn to the support vector machine, it is important to understand classification with a linear
model. Next, we illustrate this using the well-known iris dataset.
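
Concretely, a linear model separates the feature space by a hyperplane (a line in two dimensions) and classifies a point by the side it falls on. Reading the code cell that follows, where `normal_plane` stores the offset first and the normal vector after it, the rule appears to be

$\hat{y} = \operatorname{sgn}(w^\top x + b), \quad b = 0, \quad w = (-1, 2)^\top$

so a point $x = (x_1, x_2)$ is assigned $+1$ when $2 x_2 > x_1$ and $-1$ when $2 x_2 < x_1$. The symbols $w$, $b$ and $\hat{y}$ are notation introduced here for illustration, not names from the exercise.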

In [None]:
import pandas as pd
from sklearn import datasets, svm
import numpy as np
import matplotlib.pyplot as plt

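# Hyperplane parameters: offset first, then the normal vector (w1, w2)
normal_plane = [0, -1, 2]

# sign function: returns -1 for negative x, 1 for positive x, and 0 on the boundary
def sgn(x):
    return 1 if x > 0 else -1 if x < 0 else 0

# signed (unnormalized) distance of point x from the hyperplane
def distance(x, normalvector):
    return np.dot(x, normalvector[1:]) + normalvector[0]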

iris = datasets.load_iris()
X = iris.data[:, [0 ,1]] # features
y = iris.target # labels
y = (y==0)*2 - 1

# classify every sample by the side of the hyperplane it lies on
y_pred = [sgn(distance(x, normal_plane)) for x in X]
    
# plt.scatter(X[:, 0], X[:, 1], c=y)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.show()

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# fit the model
for kernel in ('linear', 'rbf', 'poly'):
    clf = svm.SVC(kernel=kernel, gamma=10)
    clf.fit(X_train, y_train)
    plt.figure()
    plt.clf()
    plt.scatter(X[:, 0], X[:, 1], c=y, zorder=10, cmap=plt.cm.Paired, edgecolor='k', s=20)

    # Circle out the test data
    plt.scatter(X_test[:, 0], X_test[:, 1], s=80, facecolors='none', zorder=10, edgecolor='k')

    plt.axis('tight')
    x_min = X[:, 0].min()
    x_max = X[:, 0].max()
    y_min = X[:, 1].min()
    y_max = X[:, 1].max()

    XX, YY = np.mgrid[x_min:x_max:200j, y_min:y_max:200j]
    Z = clf.decision_function(np.c_[XX.ravel(), YY.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(XX.shape)
    # plt.pcolormesh(XX, YY, Z > 0, cmap=plt.cm.Paired)
    plt.contour(XX, YY, Z, colors=['k', 'k', 'k'], linestyles=['--','-','--'], levels=[-.5, 0, .5])
    plt.title(kernel)
    
plt.show()

### Task 2.1
For the code above, apply the following changes:

`X = iris.data[:, [0,3]] #features` and
`y = (y==0)*2 - 1 #labels`

What is the best SVM for this data?

In [None]:
# write your code here

*write your answer here*