# Calculating evaluation metrics with Python

In this week's lectures you have seen how to evaluate a model based on quality metrics.

Let's see how we can calculate those metrics on a model with Python. To illustrate, let's train a logistic regression on the Iris dataset, like you did in the Regression instruction.

First off, import the dataset and separate the descriptive attributes from the target.

In [None]:
import pandas as pd

iris = pd.read_csv(r"iris.csv")
X = iris.iloc[:,0:4]
y = iris.iloc[:,-1]

Then, let's separate the training and test data. We can do it with the `scikit-learn` function `train_test_split`, which splits the data _randomly_ into a training set and a test set, here in a 75%-25% proportion. The `stratify` parameter keeps the classes **balanced** across the two sets: each set has the same class proportions as `y`.

In [None]:
from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.25, stratify=y)
print(train_X.shape)
print(test_X.shape)
print(train_y.shape)
print(test_y.shape)
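
To make the idea of stratification concrete, here is a minimal pure-Python sketch of a stratified split. It is not how `train_test_split` is implemented; the function name `stratified_split`, the toy labels and the fixed seed are assumptions made only for this illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.25, seed=0):
    """Return (train_idx, test_idx), keeping class proportions in both parts.

    Hypothetical helper for illustration, not a scikit-learn function.
    """
    rng = random.Random(seed)  # a fixed seed makes the split reproducible

    # Group the sample indices by class label
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])   # the same fraction of every class
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Toy data: 40 samples for each of three made-up classes
labels = ["a"] * 40 + ["b"] * 40 + ["c"] * 40
train_idx, test_idx = stratified_split(labels)
print(Counter(labels[i] for i in test_idx))
```

Every class contributes a quarter of its samples to the test set, so the class proportions are the same in both parts.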

Now, let's fit a logistic regression model to the training data...

In [None]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
# train_y is already one-dimensional, so it can be passed as it is
classifier.fit(train_X, train_y)

...and predict the target value for the test data.

In [None]:
pred_y = classifier.predict(test_X)

At this point, we can calculate a confusion matrix for the real and predicted values of the test data target. It is very straightforward.

Notice that, since the split is random, it is not guaranteed that the confusion matrix will contain exactly the same values for each execution of this Jupyter Notebook.

In [None]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(test_y, pred_y)
print(cm)

...alright, maybe it IS straightforward, but it is not good looking.

A better alternative is the `crosstab` function from Pandas:

In [None]:
import numpy as np

# Work with plain NumPy arrays for both the true and the predicted labels
pred_y = np.array(pred_y)
true_y = np.array(test_y)

pd.crosstab(true_y, pred_y, rownames=['True'], colnames=['Predicted'], margins=True)
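
To see what a confusion matrix actually counts, here is a minimal pure-Python sketch that builds one by hand. The helper `confusion_counts` and the toy labels are assumptions for illustration only. As in scikit-learn, rows correspond to the true classes and columns to the predicted ones.

```python
def confusion_counts(true_labels, pred_labels):
    """Return (classes, matrix) where matrix[i][j] counts the samples
    of true class i that were predicted as class j.

    Hypothetical helper for illustration, not a scikit-learn function.
    """
    classes = sorted(set(true_labels) | set(pred_labels))
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, pred_labels):
        matrix[index[t]][index[p]] += 1
    return classes, matrix


# Made-up labels: one versicolor sample is mistaken for virginica
true_toy = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
pred_toy = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]

classes, matrix = confusion_counts(true_toy, pred_toy)
print(classes)
for row in matrix:
    print(row)
```

The diagonal holds the correctly classified samples; every off-diagonal cell is a specific kind of mistake.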


We can calculate the common metrics of our classification as follows.

Notice that **precision**, **recall** and **f-measure** need an additional parameter, `average`. Its default value (`'binary'`) works only for **binary** classification; for multiclass, we have to specify how the per-class scores are aggregated.

Passing `average=None`, we obtain the **class-wise** metrics. You can find the other options in the docs: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html

In [None]:
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score

print(precision_score(true_y, pred_y, average=None))
print(recall_score(true_y, pred_y, average=None))
print(accuracy_score(true_y, pred_y))
print(f1_score(true_y, pred_y, average=None))
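
As a sketch of what the class-wise metrics mean, the snippet below derives per-class precision and recall from a small, made-up 3x3 confusion matrix (rows are true classes, columns are predicted classes), and then takes their unweighted mean, which corresponds to what `average='macro'` does. The numbers are invented for illustration.

```python
# Made-up confusion matrix: rows = true class, columns = predicted class
matrix = [
    [2, 0, 0],
    [0, 1, 1],
    [0, 0, 2],
]
n = len(matrix)

precision, recall = [], []
for k in range(n):
    tp = matrix[k][k]                                  # correctly predicted as class k
    predicted_k = sum(matrix[i][k] for i in range(n))  # column sum: all predicted as k
    actual_k = sum(matrix[k])                          # row sum: all truly of class k
    precision.append(tp / predicted_k if predicted_k else 0.0)
    recall.append(tp / actual_k if actual_k else 0.0)

# Macro average: unweighted mean of the per-class scores
macro_precision = sum(precision) / n
macro_recall = sum(recall) / n

print(precision, recall)
print(macro_precision, macro_recall)
```

Precision looks down a column (how many of the samples predicted as a class really belong to it), while recall looks along a row (how many of the samples of a class were found).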

## Binary classification case

In the case of binary classification, we can of course use `sklearn.metrics.confusion_matrix` and flatten the matrix to get the true/false positives and negatives.
Let's do it again on the Iris dataset, this time removing the instances with label _Iris-setosa_ to turn it into a binary classification problem.

In [None]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

iris = pd.read_csv(r"iris.csv")
iris = iris.loc[iris['Species'] != 'Iris-setosa']
X = iris.iloc[:,0:4]
y = iris.iloc[:,-1]

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.25, stratify=y)

classifier = LogisticRegression()
# train_y is already one-dimensional, so it can be passed as it is
classifier.fit(train_X, train_y)

pred_y = classifier.predict(test_X)

import numpy as np
pred_y = np.array(pred_y)
true_y = np.array(test_y)

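# List the negative label first and the positive one second, so that the flattened
# matrix reads tn, fp, fn, tp with 'Iris-versicolor' as the positive class
cm = confusion_matrix(true_y, pred_y, labels=['Iris-virginica', 'Iris-versicolor'])
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)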

And we can calculate the common metrics as follows. In this case, we don't need to specify the aggregation function, since the classification is binary and binary metrics are the default behaviour. Since the labels are strings, however, we must specify which label is the _positive_ one; the other label is then the _negative_.

In [None]:
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score

print(precision_score(true_y, pred_y, pos_label='Iris-versicolor'))
print(recall_score(true_y, pred_y, pos_label='Iris-versicolor'))
print(accuracy_score(true_y, pred_y))
print(f1_score(true_y, pred_y, pos_label='Iris-versicolor'))
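
For reference, the same four metrics can be written directly in terms of the true/false positives and negatives. The counts below are hypothetical and only serve to illustrate the formulas.

```python
# Hypothetical counts, for illustration only
tn, fp, fn, tp = 11, 1, 2, 11

precision = tp / (tp + fp)                    # of the predicted positives, how many are right
recall = tp / (tp + fn)                       # of the actual positives, how many were found
accuracy = (tp + tn) / (tp + tn + fp + fn)    # share of all predictions that are right
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(precision, recall, accuracy, f1)
```

Swapping the positive label swaps `tp` with `tn` and `fp` with `fn`, which changes precision, recall and F1 but leaves accuracy untouched.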

### ROC and AUC

When a binary classifier returns a probability estimate or a degree of belief for a certain class, we can plot the ROC curve that you have seen in the lecture and then compute the AUC. Refer to the slides for the theory behind it. The only difference is that the Python implementation of the AUC score calculates the area with the trapezoidal rule rather than with rectangles: the resulting AUC score is more precise, but the formula is more complex.

Note that the `auc` function we use is a generic function that can calculate the area under any curve, given its x and y points.
There is also a way to calculate the AUC directly from the labels and probabilities, with the `sklearn.metrics.roc_auc_score` function.

In [None]:
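from sklearn.metrics import roc_curve, auc

# predict_proba orders its columns like classifier.classes_ (the sorted labels),
# so column 0 holds the probability of 'Iris-versicolor', our positive class
fpr, tpr, _ = roc_curve(true_y, classifier.predict_proba(test_X)[:,0], drop_intermediate=False, pos_label='Iris-versicolor')
roc_auc = auc(fpr, tpr)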

import matplotlib.pyplot as plt
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()


Rectangle rule: ![image.png](attachment:image.png)

Trapezoidal rule: ![image.png](attachment:image.png)
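
The difference between the two rules can be sketched on a handful of points. The `fpr`/`tpr` values below are made up, and since the rectangle rule can use either the left or the right endpoint of each interval, both variants are shown; which one matches the slides is an assumption left open here.

```python
# Made-up ROC points, sorted by increasing false positive rate
fpr = [0.0, 0.2, 0.6, 1.0]
tpr = [0.0, 0.6, 0.9, 1.0]

left_rect = right_rect = trapezoid = 0.0
for i in range(1, len(fpr)):
    width = fpr[i] - fpr[i - 1]
    left_rect += width * tpr[i - 1]                  # height taken at the left endpoint
    right_rect += width * tpr[i]                     # height taken at the right endpoint
    trapezoid += width * (tpr[i - 1] + tpr[i]) / 2   # mean of the two heights

print(round(left_rect, 2), round(right_rect, 2), round(trapezoid, 2))
```

On these points the trapezoidal value (0.74) lies halfway between the two rectangle estimates (0.60 and 0.88), because each trapezoid is the average of the left and right rectangles on its interval.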

## Cross validation

Cross-validating a model with K-fold cross-validation can be done through the `cross_val_score` function of scikit-learn. This function takes care of everything: you just have to specify the classifier, the data, the labels, the metric to be evaluated, and the number of folds. It returns one score per fold.

In [None]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

iris = pd.read_csv(r"iris.csv")
X = iris.iloc[:,0:4]
y = iris.iloc[:,-1]

classifier = LogisticRegression()

cross_val_score(classifier, X, y, scoring='accuracy', cv=5)
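
To illustrate what happens behind the scenes, here is a minimal sketch of how K-fold cross-validation partitions the sample indices. The helper `k_fold_indices` is hypothetical and deliberately simple: it uses plain, unshuffled folds, whereas `cross_val_score` with an integer `cv` and a classifier uses stratified folds.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds.

    Hypothetical helper for illustration, not a scikit-learn function.
    """
    # Spread the remainder over the first folds so the sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each sample ends up in the test set exactly once; the model is trained and scored once per fold, which is why one score per fold is returned.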