# Evaluation metrics in classification problems

We need an evaluation metric to measure how well our trained model performs on unseen data (the validation set). The following metrics are commonly used for classification problems:

1. Accuracy score
2. Precision
3. Recall
4. Area under the ROC (Receiver Operating Characteristic) curve, or AUC
5. Log loss

Knowing these metrics is useful, but we also need to know when to use each one. The right choice depends on the problem and mainly on the target we're trying to predict, for example how balanced its classes are.

Consider a **binary classification problem**, i.e. *the target variable has 2 classes*. An example is classifying chest X-ray images by whether they show pneumothorax. Pneumothorax is a condition in which the lung collapses, and it can be seen in a chest X-ray image.

![Normal vs Pneumothorax](https://assets.aboutkidshealth.ca/akhassets/Pneumothorax_XRAY_MEDIMG_PHO_EN.png?RenditionID=19)

Let's say we're given 100 images with an equal number of Pneumothorax and non-Pneumothorax images. For training, we divide the dataset into training and validation sets (7:3), keeping the same class ratio as in the original dataset.

Thus, we'll have 70 images in the training set and 30 images in the validation set. We then train our model on the 70 training images and evaluate it on the validation set. Let's look at our evaluation metrics for classification.
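As a brief aside before the metrics themselves, the class-preserving 7:3 split described above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper name `stratified_split`, the 0/1 label encoding, and the fixed seed are hypothetical and not part of the original notebook. In practice, `sklearn.model_selection.train_test_split` with its `stratify` argument serves the same purpose.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, valid_fraction, seed=42):
    """Return (train_idx, valid_idx) so each class keeps its share in both sets.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)

    # Group sample indices by their class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, valid_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for validation
        n_valid = round(len(indices) * valid_fraction)
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return sorted(train_idx), sorted(valid_idx)


# Assumed encoding: 1 = Pneumothorax, 0 = non-Pneumothorax (50 images of each)
labels = [1] * 50 + [0] * 50
train_idx, valid_idx = stratified_split(labels, valid_fraction=0.3)

print(len(train_idx), len(valid_idx))          # 70 30
print(Counter(labels[i] for i in valid_idx))   # 15 images of each class
```

Splitting per class rather than over the whole dataset is what guarantees the validation set mirrors the original class balance.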

## 1. Accuracy score

Accuracy score is the ratio of correct predictions to the total number of predictions. In the example above, suppose the model predicts 27 of the 30 validation images correctly. Then the model has 90% accuracy, or an accuracy score of 0.9 (27 / 30).

In [1]:
def accuracy_score_v1(y_true, y_pred):
    """ Computes the accuracy score
    :param y_true - Actual target values
    :param y_pred - Predicted values from the model
    :returns the accuracy score
    """
    # Counter for the number of predictions that match the actual values
    correct = 0
    for yt, yp in zip(y_true, y_pred):
        if yt == yp:
            correct += 1
    # Fraction of correct predictions (true division in Python 3)
    return correct / len(y_true)

In [2]:
# Let's check the accuracy score for the sample data
targets = [1, 0, 0, 1, 0, 0, 1, 1, 1, 0]
preds   = [0, 1, 0, 1, 0, 0, 1, 1, 1, 0]

accuracy_score_v1(targets, preds)

0.8

Above, we can see that our `accuracy_score_v1` function gives an accuracy score of 0.8, or 80%.

The `sklearn` Python package provides a function to calculate the accuracy score, which we can use to cross-check.

In [4]:
from sklearn.metrics import accuracy_score

accuracy_score(targets, preds)

0.8

The `sklearn` function returns the same accuracy score, which suggests that our implementation of accuracy score is correct.

Great!!!

Now consider another example, where we are given a dataset of 100 images in which only 10 images have Pneumothorax and the rest are non-Pneumothorax. If we divide the dataset into training and validation sets as 80:20 while keeping the class ratio the same in both, the training set will contain 72 non-Pneumothorax images and 8 Pneumothorax images. Similarly, the validation set will contain 18 non-Pneumothorax images and 2 Pneumothorax images.

In this case, if we always predict non-Pneumothorax for every image, we still get 90% accuracy without building a model at all. But would that be a good model? Definitely not: it never detects a single Pneumothorax case. So using `accuracy_score` as the evaluation metric for every problem doesn't work, especially when the target variable is skewed. This is where `precision`, `recall`, `F1-score`, etc. come to the rescue.
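To make the 90% figure concrete, here is a minimal sketch of that "always predict non-Pneumothorax" baseline on the skewed validation set. The 0/1 encoding (1 = Pneumothorax) and the variable names are assumptions made for illustration.

```python
# Validation labels from the skewed example:
# 18 non-Pneumothorax (0) and 2 Pneumothorax (1) images
valid_targets = [0] * 18 + [1] * 2

# A "model" that ignores its input and always predicts non-Pneumothorax
always_negative_preds = [0] * len(valid_targets)

# Accuracy: fraction of predictions that match the actual labels
correct = sum(yt == yp for yt, yp in zip(valid_targets, always_negative_preds))
baseline_accuracy = correct / len(valid_targets)
print(baseline_accuracy)  # 0.9

# Number of actual Pneumothorax cases the baseline detects
caught = sum(yt == 1 and yp == 1
             for yt, yp in zip(valid_targets, always_negative_preds))
print(caught)  # 0
```

The accuracy looks high, yet both Pneumothorax images are missed, which is exactly the failure that accuracy alone hides on skewed targets.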

Before we move on to precision and the others, we need to be familiar with some terms.

**True Positive or TP**: The model predicts `True` (the positive class, `1` in the code below) and the actual value is also `True`.  
**True Negative or TN**: The model predicts `False` (the negative class, `0` in the code below) and the actual value is also `False`.  
**False Positive or FP**: The model predicts `True` but the actual value is `False`.  
**False Negative or FN**: The model predicts `False` but the actual value is `True`.

In [5]:
def true_positive(y_true, y_pred):
    """ Computes count of True Positive
    :param y_true - Actual target values
    :param y_pred - Predicted values
    :return count of true positive
    """
    tp = 0
    for yt, yp in zip(y_true, y_pred):
        if yt == 1 and yp == 1:
            tp += 1
    return tp

def true_negative(y_true, y_pred):
    """ Computes count of True Negative
    :param y_true - Actual target values
    :param y_pred - Predicted values
    :return count of true negative
    """
    tn = 0
    for yt, yp in zip(y_true, y_pred):
        if yt == 0 and yp == 0:
            tn += 1
    return tn

def false_positive(y_true, y_pred):
    """ Computes count of False Positive
    :param y_true - Actual target values
    :param y_pred - Predicted values
    :return count of false positive
    """
    fp = 0
    for yt, yp in zip(y_true, y_pred):
        if yt == 0 and yp == 1:
            fp += 1
    return fp

def false_negative(y_true, y_pred):
    """ Computes count of False Negative
    :param y_true - Actual target values
    :param y_pred - Predicted values
    :return count of false negative
    """
    fn = 0
    for yt, yp in zip(y_true, y_pred):
        if yt == 1 and yp == 0:
            fn += 1
    return fn

In [7]:
# Let's reuse the same targets and preds as above.
targets = [1, 0, 0, 1, 0, 0, 1, 1, 1, 0]
preds   = [0, 1, 0, 1, 0, 0, 1, 1, 1, 0]

# Counting by hand, we expect TP = 4, TN = 4, FP = 1, FN = 1.
# Let's see what our functions give
print(f'TP:{true_positive(targets, preds)}')
print(f'TN:{true_negative(targets, preds)}')
print(f'FP:{false_positive(targets, preds)}')
print(f'FN:{false_negative(targets, preds)}')

TP:4
TN:4
FP:1
FN:1


Voila! We get the same values as we expected.
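These four counts also give another way to express accuracy: the correct predictions are exactly the true positives plus the true negatives, so accuracy = (TP + TN) / (TP + TN + FP + FN). The sketch below checks this on the same sample data. The names `confusion_counts` and `accuracy_from_counts` are hypothetical helpers introduced here, and binary 0/1 labels are assumed.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP and FN in one pass (assumes binary labels 0 and 1)."""
    tp = tn = fp = fn = 0
    for yt, yp in zip(y_true, y_pred):
        if yt == 1 and yp == 1:
            tp += 1
        elif yt == 0 and yp == 0:
            tn += 1
        elif yt == 0 and yp == 1:
            fp += 1
        elif yt == 1 and yp == 0:
            fn += 1
    return tp, tn, fp, fn


def accuracy_from_counts(y_true, y_pred):
    """Accuracy expressed through the four counts."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


targets = [1, 0, 0, 1, 0, 0, 1, 1, 1, 0]
preds   = [0, 1, 0, 1, 0, 0, 1, 1, 1, 0]

print(confusion_counts(targets, preds))      # (4, 4, 1, 1)
print(accuracy_from_counts(targets, preds))  # 0.8
```

The result, (4 + 4) / 10 = 0.8, matches the accuracy score computed earlier for the same data.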