# Evaluating Model Results

How do we decide whether a model is doing a good job of predicting outcomes? We use various metrics to evaluate its performance. Again, I focus on binary classifiers. (Multi-label classifiers and regression may call for different procedures.)

Let's first recall our lookup algorithm, split data into training and test data, and see how it does on test data.

In [1]:
import pandas as pd
from pandas import DataFrame
from sklearn.model_selection import train_test_split

In [2]:
titanic = pd.read_csv("titanic.csv")
titanic_train, titanic_test = train_test_split(titanic, test_size=0.1)

def table_lookup_predictor(x, table, age):
    """Implements the table-lookup algorithm, matching ages on the same side of a cutoff"""
    
    # Get most common label
    default = table.Survived.value_counts().idxmax()    # idxmax() returns the label; argmax() returns a position in recent pandas
    # Get similar individuals
    similar_tab = table.loc[(table["Pclass"] == x["Pclass"]) &\
                            (table["Sex"] == x["Sex"]) &\
                            (table["Siblings/Spouses Aboard"] == x["Siblings/Spouses Aboard"]) &\
                            (table["Parents/Children Aboard"] == x["Parents/Children Aboard"]) &\
                            ((table["Age"] < age) == (x["Age"] < age)) , "Survived"]
    if len(similar_tab) == 0:
        # If table is empty (no "similar" individuals), guess the most common label
        return default
    else:
        return similar_tab.value_counts().idxmax()    # Most common label among similar individuals

actual = titanic_test.Survived
predicted = titanic_test.apply(table_lookup_predictor, axis=1, table=titanic_train, age=10)

In [3]:
DataFrame({"actual": actual, "predicted": predicted})

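|     | actual | predicted |
|-----|--------|-----------|
| 417 | 0      | 1         |
| 364 | 1      | 1         |
| 656 | 0      | 1         |
| 652 | 0      | 0         |
| 229 | 1      | 1         |
| 822 | 0      | 0         |
| 730 | 0      | 0         |
| 160 | 1      | 1         |
| 294 | 0      | 0         |
| 351 | 0      | 0         |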


## Accuracy

An obvious metric to check is the algorithm's accuracy on the test set. We've already seen how to compute this.

In [4]:
from sklearn.metrics import accuracy_score

In [5]:
accuracy_score(y_true=actual, y_pred=predicted)

0.797752808988764

Accuracy alone does not give a complete picture of how well an algorithm makes predictions. The learning problem may be "easy": for example, if nearly everyone on the *Titanic* died, always predicting "did not survive" would achieve high accuracy, yet it would incorrectly predict that every survivor died.
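
To make the pitfall concrete, here is a minimal sketch with made-up labels (not the *Titanic* data). It assumes 90 of 100 hypothetical passengers died; a predictor that always answers "did not survive" still scores 90% accuracy while identifying no survivors.

```python
# Hypothetical, imbalanced labels: 0 = died, 1 = survived (not the Titanic data)
toy_actual = [0] * 90 + [1] * 10
# A "predictor" that ignores its input and always guesses the majority label
toy_predicted = [0] * len(toy_actual)

# Accuracy: fraction of predictions that match the true label
accuracy = sum(a == p for a, p in zip(toy_actual, toy_predicted)) / len(toy_actual)

# Number of actual survivors the predictor managed to identify
survivors_found = sum(a == 1 and p == 1 for a, p in zip(toy_actual, toy_predicted))

print(accuracy)         # 0.9
print(survivors_found)  # 0
```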

## Precision

**Precision** for a given label is the fraction of the model's predictions of that label that are correct. For example, the table lookup algorithm would have high precision for survivors if, nearly every time it predicts that a passenger survived, the passenger did in fact survive.

`precision_score()` from **sklearn** computes precision.

In [6]:
from sklearn.metrics import precision_score

In [7]:
precision_score(y_true=actual, y_pred=predicted)

0.76000000000000001

There is a precision score for every possible label. In particular, there is a precision score for both survivorship and death.

In [8]:
precision_score(y_true=actual, y_pred=predicted, pos_label=0)

0.8125

It seems that predictions of death are more precise than predictions of survival.
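
Precision can also be computed by hand from counts of right and wrong predictions. The sketch below assumes counts that I back-calculated from the scores and supports reported in this section (25 predicted survivors, 19 of them correct; 64 predicted deaths, 52 of them correct). They were not read from the data and will differ for another random split. The helper `precision` is my own illustration, not an **sklearn** function.

```python
def precision(true_pos, false_pos):
    """Fraction of predictions of a label that were actually that label."""
    return true_pos / (true_pos + false_pos)

# Assumed counts, back-calculated from the scores reported in this section
survived_correct, survived_wrong = 19, 6   # predicted "survived": right, wrong
died_correct, died_wrong = 52, 12          # predicted "died": right, wrong

print(precision(survived_correct, survived_wrong))  # 0.76
print(precision(died_correct, died_wrong))          # 0.8125
```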

## Recall

**Recall** for a given label is the fraction of cases that truly have that label which the model predicts correctly. In this example, recall is the fraction of *Titanic* survivors that the model predicted to be survivors. We would prefer recall to be close to 1.

`recall_score()` from **sklearn** computes recall.

In [9]:
from sklearn.metrics import recall_score

In [10]:
recall_score(y_true=actual, y_pred=predicted)

0.61290322580645162

Only about 61% of survivors were correctly predicted by our model to be survivors.

Recall depends on the label of interest. For example, here's the recall rate for those who did not survive the *Titanic* disaster.

In [11]:
recall_score(y_true=actual, y_pred=predicted, pos_label=0)

0.89655172413793105

We see that our algorithm does a good job at predicting deaths but a mediocre job at predicting survivors.
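
As a rough check, recall can be reproduced from counts. The counts below are assumptions back-calculated from the scores and supports reported in this section (31 actual survivors, 19 of them found; 58 actual deaths, 52 of them found), and the helper `recall` is my own illustration.

```python
def recall(true_pos, false_neg):
    """Fraction of actual members of a label that the model predicted as that label."""
    return true_pos / (true_pos + false_neg)

# Assumed counts, back-calculated from the scores and supports reported in this section
survivors_found, survivors_missed = 19, 12   # 31 actual survivors
deaths_found, deaths_missed = 52, 6          # 58 actual deaths

print(round(recall(survivors_found, survivors_missed), 4))  # 0.6129
print(round(recall(deaths_found, deaths_missed), 4))        # 0.8966
```

Note the difference from precision: recall divides by the number of passengers who actually have the label, while precision divides by the number the model predicted to have it.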

## F1 Score

The **F1 score** attempts to balance precision and recall in a single number, their harmonic mean. For a label $i$:

$$\text{F1}(i) = 2 \times \frac{\text{precision}(i) \times \text{recall}(i)}{\text{precision}(i) + \text{recall}(i)}$$

A score close to 1 is desirable, and a score close to 0 indicates an overall subpar model.
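
Since F1 is the harmonic mean of precision and recall, it can be checked by hand. This sketch plugs in the precision and recall for survivors reported earlier; the helper `f1` is a hypothetical name of my own. The second call uses made-up values to show that a lopsided pair is pulled toward the smaller number.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall for survivors, as reported earlier in this section
p_survived, r_survived = 0.76, 0.61290322580645162

print(round(f1(p_survived, r_survived), 4))  # 0.6786
print(round(f1(0.9, 0.1), 2))                # 0.18 (made-up, lopsided inputs)
print(f1(0.5, 0.5))                          # 0.5
```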

`f1_score()` from **sklearn** computes this metric.

In [12]:
from sklearn.metrics import f1_score

In [13]:
f1_score(y_true=actual, y_pred=predicted)

0.67857142857142849

In [14]:
# Again, depends on the label of interest
f1_score(y_true=actual, y_pred=predicted, pos_label=0)

0.85245901639344257

All of the metrics mentioned so far can be computed together in a nice bundle by the **sklearn** function `classification_report()`.

In [15]:
from sklearn.metrics import classification_report

In [16]:
print(classification_report(y_true=actual, y_pred=predicted))

             precision    recall  f1-score   support

          0       0.81      0.90      0.85        58
          1       0.76      0.61      0.68        31

avg / total       0.79      0.80      0.79        89
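
The last row of this report appears to be an average of the per-label scores, weighted by support. That is my assumption about how this version of the report is computed, though it is consistent with the numbers shown. A minimal sketch using the scores printed in the earlier cells:

```python
# Per-label scores and supports as printed by the cells above
supports = {0: 58, 1: 31}
precisions = {0: 0.8125, 1: 0.76}
recalls = {0: 0.89655172413793105, 1: 0.61290322580645162}
f1_scores = {0: 0.85245901639344257, 1: 0.67857142857142849}

def weighted_average(scores, supports):
    """Average per-label scores, weighting each label by its number of test cases."""
    total = sum(supports.values())
    return sum(scores[label] * supports[label] for label in supports) / total

for name, scores in [("precision", precisions), ("recall", recalls), ("f1-score", f1_scores)]:
    print(name, round(weighted_average(scores, supports), 2))
# precision 0.79
# recall 0.8
# f1-score 0.79
```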



## Bayes Factor

A **Bayes factor** is a metric to determine which of two models better fits a dataset. We can compute a Bayes factor to determine if our algorithm is doing a better job of predicting survivorship compared to a "dumb" predictor that predicts the most common label. (Computing the Bayes factor can also help decide between models in the modelling phase, though we're seeing it here presumably after modelling has been completed and a predictive algorithm is nearing deployment.)

Recall Bayes' Theorem:

$$P(M|D) = \frac{P(D|M)P(M)}{P(D)}$$

$P(M|D)$ is roughly interpreted as the probability the model $M$ is appropriate given a dataset $D$. Let $M_1$ and $M_2$ be two competing models. The Bayes factor is then:

$$K = \frac{P(D|M_1)}{P(D|M_2)} = \frac{P(M_1|D)}{P(M_2|D)}\frac{P(M_2)}{P(M_1)}$$
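
The second equality follows from applying Bayes' Theorem to each model and taking the ratio, in which $P(D)$ cancels:

$$\frac{P(M_1|D)}{P(M_2|D)} = \frac{P(D|M_1)P(M_1)/P(D)}{P(D|M_2)P(M_2)/P(D)} = \frac{P(D|M_1)}{P(D|M_2)}\frac{P(M_1)}{P(M_2)}$$

Multiplying both sides by $\frac{P(M_2)}{P(M_1)}$ gives the expression for $K$ above.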

Recall that $P(M_i)$ is the prior probability that the model $M_i$ is appropriate.

Let's make this concrete. $M_1$ will denote the event that our table lookup algorithm does better than the "naive" algorithm, while $M_2$ is the event that the "naive" algorithm does better than our algorithm. If $p_1$ and $p_2$ denote the accuracies of the two algorithms, $M_1$ corresponds to $p_1 > p_2$ while $M_2$ corresponds to $p_1 < p_2$.

We will use conjugate priors, assume that $p_1$ and $p_2$ are independent under the prior distribution, and both parameters follow the $\text{Beta}(3,3)$ distribution. It can be shown that under these conditions $P(M_1) = P(M_2) = \frac{1}{2}$ (this is not true in general); in other words, $M_1$ and $M_2$ are equally likely.
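
The claim $P(M_1) = P(M_2) = \frac{1}{2}$ follows from symmetry: $p_1$ and $p_2$ are independent draws from the same continuous distribution, so either is equally likely to be the larger, and ties have probability zero. A quick simulation can confirm this. The sketch uses the standard library's `random.betavariate` rather than **scipy**, and the names `n_sims` and `prior_m1` are my own.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable
n_sims = 100_000

# Draw p_1 and p_2 independently from the same Beta(3, 3) prior
# and count how often p_1 comes out larger
wins = sum(
    random.betavariate(3, 3) > random.betavariate(3, 3)
    for _ in range(n_sims)
)
prior_m1 = wins / n_sims
print(prior_m1)  # should be close to 0.5
```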

We then compute the parameters of the posterior distributions of $p_1$ and $p_2$. I denote a correct prediction as a "success". We then need the number of "successes" in the test set.

In [17]:
N = len(actual)    # Total sample size
M = (actual == predicted).sum()    # A shorthand for computing the number of "successes"
(M, N)

(71, 89)

In [18]:
post_params_lookup = (3 + M, 3 + N - M)
post_params_lookup

(74, 21)

In [19]:
ds = pd.Series(actual).value_counts()
ds

0    58
1    31
Name: Survived, dtype: int64

In [20]:
post_params_naive = (3 + ds[0], 3 + ds[1])
post_params_naive

(61, 34)
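
The two posterior computations above follow the usual conjugate update: a $\text{Beta}(a, b)$ prior combined with $s$ successes and $f$ failures gives a $\text{Beta}(a + s, b + f)$ posterior. A minimal sketch, with a hypothetical helper `beta_posterior` and the counts reported above:

```python
def beta_posterior(prior, successes, failures):
    """Update a Beta(a, b) prior with success/failure counts; returns the posterior (a, b)."""
    a, b = prior
    return (a + successes, b + failures)

prior = (3, 3)  # Beta(3, 3), as in the text

# Lookup algorithm: 71 correct predictions out of 89 test cases
lookup = beta_posterior(prior, successes=71, failures=89 - 71)
# Naive algorithm: always predicts "died", so it is right for the 58 deaths
naive = beta_posterior(prior, successes=58, failures=31)

print(lookup)  # (74, 21)
print(naive)   # (61, 34)

# The posterior mean of a Beta(a, b) distribution is a / (a + b)
print(round(lookup[0] / sum(lookup), 3))  # 0.779
print(round(naive[0] / sum(naive), 3))    # 0.642
```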

We can use the simulation trick seen in a previous section to estimate $P(M_1|D) = 1 - P(M_2|D)$.

In [21]:
from scipy.stats import beta

In [22]:
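n_sims = 10000    # Number of simulations (own name, so the test-set size N is not overwritten)
# Note: these literal Beta parameters differ from post_params_lookup (74, 21) and
# post_params_naive (61, 34) computed above; the output below corresponds to the literals.
p_1 = beta.rvs(67, 28, size=n_sims)
p_2 = beta.rvs(64, 31, size=n_sims)
trial = p_1 > p_2    # True wherever the simulated lookup accuracy beats the naive accuracy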

pm1 = trial.mean()
pm2 = 1 - pm1
(pm1, pm2)

(0.68769999999999998, 0.31230000000000002)
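
The `beta.rvs` calls above pass literal parameters. A variant that instead passes the posterior parameters computed earlier might look like the sketch below. It assumes the values (74, 21) and (61, 34) shown in the earlier outputs, and uses the standard library's `random.betavariate` in place of `scipy.stats.beta.rvs`. The name `n_sims` is my own. No output is shown, since the estimate depends on the random draws.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable
n_sims = 10000

# Posterior parameters as shown in the earlier outputs
post_params_lookup = (74, 21)
post_params_naive = (61, 34)

# Count how often a draw of the lookup accuracy exceeds a draw of the naive accuracy
wins = sum(
    random.betavariate(*post_params_lookup) > random.betavariate(*post_params_naive)
    for _ in range(n_sims)
)
pm1 = wins / n_sims   # estimate of P(M_1 | D)
pm2 = 1 - pm1         # estimate of P(M_2 | D)
print(pm1, pm2)
```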

Now compute the Bayes factor. Notice that $P(M_1) = P(M_2)$ so $\frac{P(M_2)}{P(M_1)} = 1$ and $K = \frac{P(M_1|D)}{P(M_2|D)}$.

In [23]:
K = pm1 / pm2
K

2.202049311559398

What do we make of this number? We can use Jeffreys' scale to give meaning to $K$:

|                  $K$                | Strength of Evidence |
|-------------------------------------|----------------------|
|                 $< 1$               |  Evidence for $M_2$  |
|  $1$ to $10^{1/2}$ ($\approx 3.2$)  |       Negligible     |
|         $10^{1/2}$ to $10$          |     Substantial      |
| $10$ to $10^{3/2}$ ($\approx 31.6$) |        Strong        |
|         $10^{3/2}$ to $100$         |     Very strong      |
|              $> 100$                |      Decisive        |

Our $K$ falls into the "negligible" range. There is little evidence that our algorithm predicts who survived the *Titanic* better than the naive algorithm does, so it is likely not worth the extra computational effort.
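
The table lookup can be written as a small function. This is a sketch with a hypothetical helper, `jeffreys_category`; the table does not say which category a boundary value belongs to, so treating each lower bound as inclusive is my assumption.

```python
def jeffreys_category(k):
    """Map a Bayes factor K to the strength-of-evidence labels in the table above."""
    if k < 1:
        return "Evidence for M_2"
    elif k < 10 ** 0.5:     # about 3.2
        return "Negligible"
    elif k < 10:
        return "Substantial"
    elif k < 10 ** 1.5:     # about 31.6
        return "Strong"
    elif k < 100:
        return "Very strong"
    else:
        return "Decisive"

print(jeffreys_category(2.202049311559398))  # Negligible
```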

In the next section we will see other algorithms that we hope will perform better.