# Evaluating Machine Learning Models

Once we have a machine learning model, how do we report how well it performs?  We need some way
to measure this.

## Evaluation metrics

- There are various numerical ways to measure the performance of our machine learning models.

- These generally correspond to our cost functions, $J(w)$, for whatever model we are using, though there
are others.

### Regression metrics

- For regression problems (predicting a number), we can use **mean squared error** (which we use to train
linear regression):

$$\text{MSE} = \frac{1}{m} \sum_{i=1}^m (\hat{y}^{(i)} - y^{(i)})^2$$

but this is not the only metric we can use.

- Some people recommend the **mean absolute error**:

$$\text{MAE} = \frac{1}{m} \sum_{i=1}^m \left\vert \hat{y}^{(i)} - y^{(i)} \right\vert$$

One advantage of MAE is that it is more "interpretable," because it has the same units as the target
value $y$.  For example, if we are predicting housing prices in dollars, then the MAE tells us, on average, how
far away our predictions are from the "true" price of a house.  On the other hand, MSE gives us a value 
in "square dollars" which is harder to interpret.

- We can also take the square root of the mean squared error to get the **root-mean-square error**:

$$\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{ \frac{1}{m} \sum_{i=1}^m (\hat{y}^{(i)} - y^{(i)})^2 }$$

This metric also has the property that the units match the units of $y$, though some people still recommend
MAE over RMSE because squaring the errors makes RMSE more sensitive to a few large errors.

- There is one more common metric, known as **mean absolute percentage error**:

$$\text{MAPE} = \frac{1}{m} \sum_{i=1}^m \left\vert \dfrac{\hat{y}^{(i)} - y^{(i)}}{y^{(i)}} \right\vert$$

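This metric has the advantage that it is a relative error with no units, so it can be 
interpreted as a percentage (a value of 0.05 means the predictions are off by 5% on average).  Note that it
is not bounded above by 1: it exceeds 1 when the errors are larger than the true values on average.  It is
also undefined whenever some $y^{(i)} = 0$.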

<hr>

For all of these metrics, lower numbers indicate better predictions.
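
The four regression metrics above can be sketched in a few lines of plain Python. This is only an illustration: the function names and the example house prices are made up here, and `mape` assumes that no true value is zero.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error: the average squared difference."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: the average absolute difference."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root-mean-square error: the square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))


def mape(y_true, y_pred):
    """Mean absolute percentage error (assumes no true value is 0)."""
    return sum(abs((p - t) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up house prices in thousands of dollars, for illustration only.
y_true = [200.0, 300.0, 400.0]
y_pred = [210.0, 290.0, 430.0]

for metric in (mse, mae, rmse, mape):
    print(metric.__name__, metric(y_true, y_pred))
```

On this toy data the MAE is about 16.7 (thousand dollars) and the MAPE is about 0.053, that is, roughly 5%, while the MSE is in "square thousand dollars."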

### Classification metrics

For binary classification problems, we can use any of the metrics above with 0 and 1 for our target
variable $y$.  For $\hat{y}$, we can use 0 or 1 if our model directly predicts the class, or if the model
predicts a probability between 0 and 1, we can use that as well.

However, there are a number of more common metrics:

- **Accuracy** is probably the simplest and most common.  It just measures the number of values
predicted correctly divided by the total number of values.

Accuracy, however, can be misleading when the classes are imbalanced, that is, when some classes have many
more data points than others.  Consider an example where we have 100 data points, but 95 of them are from the positive
class and only 5 are from the negative class.  Assume we have a machine learning model that always 
predicts the positive class ($f(x) = 1$), no matter what $x$ is.  The accuracy of this model
will be 95%, but it misclassifies every data point in the negative class, so it probably isn't
very useful.
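
The 95/5 example can be checked directly. The sketch below builds the labels from that example and a model that always predicts 1; the `accuracy` helper is a hypothetical name chosen for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# 95 positive examples and 5 negative examples, as in the text.
y_true = [1] * 95 + [0] * 5
# A "model" that always predicts the positive class.
y_pred = [1] * len(y_true)

print(accuracy(y_true, y_pred))  # 0.95
```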

We have a few ways to make better metrics:

Let's assume our classification problem is trying to predict whether someone has a disease or not.  
We build a binary classifier for this and will predict either 1 (has the disease) or 0 (does not 
have the disease).  There are some standardized terms we can use:

- A **true positive** is predicting 1 (has the disease) when the patient does have the disease.
- A **true negative** is predicting 0 (does not have the disease) when the patient does not have the disease.
- A **false positive** is predicting 1 when the patient does **not** have the disease.
- A **false negative** is predicting 0 when the patient **does** have the disease.

We will use TP, TN, FP, and FN as abbreviations for the total number of these occurrences in a data set.

A few common metrics using these values are:

- **True positive rate (TPR)**: $\dfrac{TP}{TP+FN}$.  

- **True negative rate (TNR)**: $\dfrac{TN}{TN+FP}$. 

- **Balanced accuracy**: $\dfrac{TPR + TNR}{2}$.

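The true positive rate is also called **recall** or **sensitivity**, and the true negative rate is also called
**specificity**.  There are other metrics that are useful in certain situations, such as precision and the F-score.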
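
As a rough sketch, the counts and rates above can be computed as follows. The helper names are hypothetical, and the code assumes labels are 0 or 1 and that each class appears at least once in `y_true` (otherwise a rate would divide by zero).

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels 0/1."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == 1 and t == 1:
            tp += 1
        elif p == 0 and t == 0:
            tn += 1
        elif p == 1 and t == 0:
            fp += 1
        else:  # p == 0 and t == 1
            fn += 1
    return tp, tn, fp, fn


def balanced_accuracy(y_true, y_pred):
    """Average of the true positive rate and the true negative rate."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return (tpr + tnr) / 2


# The imbalanced example from the accuracy discussion:
# 95 positives, 5 negatives, and a model that always predicts 1.
y_true = [1] * 95 + [0] * 5
y_pred = [1] * 100

print(confusion_counts(y_true, y_pred))   # (95, 0, 5, 0)
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

The always-positive model has a TPR of 1 and a TNR of 0, so its balanced accuracy is only 0.5 even though its plain accuracy is 95%.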

## Training sets and testing sets

- Recall that in machine learning, we are primarily concerned with making models that can **generalize**
to new data (data the model has not seen; data the model was not trained on).  Therefore, when we want
to evaluate a model, we must do it on a different data set than the data the model was trained on. 

- Traditionally, if we have enough data, we will split all of the data we have available to us into
a **training set** and a **testing set**.  The sizes of these sets can vary; you will see recommendations
from using 2/3 of your data for training and the last 1/3 for testing, to using 80% for training and 20%
for testing.

- The easiest thing to do is train your model on the training set, and then test on the testing set.
We will report metrics (from above, like accuracy or MSE), on both sets, but we are often more interested
in the testing set evaluation.  However, performing evaluation on both sets can often reveal other 
interesting things (for example, if we get a small number of errors on the training set, but a large number
of errors on the testing set, this often indicates overfitting).
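
A minimal sketch of such a split, assuming the data fits in a Python list. The helper name `split_train_test`, the default 20% test fraction, and the fixed seed are choices made for this illustration only.

```python
import random


def split_train_test(data, test_fraction=0.2, seed=0):
    """Shuffle the data and split it into (training set, testing set)."""
    shuffled = list(data)                  # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(100))  # stand-in for 100 data points
train, test = split_train_test(data, test_fraction=0.2)

print(len(train), len(test))  # 80 20
```

Shuffling before splitting matters when the data is stored in some meaningful order (for example, sorted by date or by class), since otherwise the testing set may not resemble the training set.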


## What if I don't have enough data?

While large data sets are now common in machine learning, we don't always have enough data to split
the data set into a large enough subset to train on, plus a large enough set to test on as well.  Or,
even if we do have enough data, we know that training a model on **more** data almost always results in a
better model, so it seems silly to let 1/3 or 20% of our data "go to waste" and not use it for training.

A common solution to this is called **cross-validation**, and there are a few ways to do it:

- **$k$-fold cross-validation**: This is a nice method if you have a pretty big data set but you don't feel like you can hold back 1/3 or 20% of it for testing.

  We partition the entire data set into $k$ subsets of (roughly) equal size.  We choose one of the $k$ subsets to be the testing set, and the other $k-1$ subsets combined form the training set.  We train our model on the training set and test it on the testing set as normal.  Then we repeat this process **using a different one of the $k$ subsets as the testing set each time**, until every subset has served as the testing set exactly once.  We usually then average the evaluations from the $k$ individual training/testing cycles.
  
- **Leave-$p$-out cross-validation**: This is even more extreme than $k$-fold cross-validation, but
is nice when your data set is very small.  

  In this method, we remove $p$ data points from the data set and use only those $p$ points for testing. The rest of the data is used for training.  We repeat this process with every possible set of $p$ data points from the original data, and average all of the evaluations on the different testing sets.
  
  This method quickly becomes unwieldy as $p$ grows, because a data set of $m$ points has $\binom{m}{p}$ possible testing sets, and each one requires training a new model.  It is very commonly used with $p=1$, known as **leave-one-out cross-validation (LOOCV)**.  Here, we cycle through our entire data set and use each individual data point in turn as the sole member of the testing set, training on all the rest of the data.  We then average all the results.
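
The $k$-fold procedure can be sketched as below. Everything here is illustrative: `k_fold_indices` and `cross_validate` are hypothetical helpers, and the "model" is a stand-in that always predicts the mean of the training targets. Leave-one-out falls out as the special case where $k$ equals the number of data points.

```python
def k_fold_indices(m, k):
    """Yield (train_idx, test_idx) for each of k folds over m data points."""
    indices = list(range(m))
    # Spread any remainder over the first m % k folds,
    # so that fold sizes differ by at most 1.
    fold_sizes = [m // k + (1 if i < m % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def cross_validate(xs, ys, k, fit, score):
    """Average the score over the k training/testing cycles."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model,
                            [xs[i] for i in test_idx],
                            [ys[i] for i in test_idx]))
    return sum(scores) / len(scores)


# Stand-in model for illustration: always predict the mean training target.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)


def mse_score(model, xs, ys):
    return sum((model - y) ** 2 for y in ys) / len(ys)


xs = [0, 1, 2, 3, 4, 5]
ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

print(cross_validate(xs, ys, k=3, fit=fit_mean, score=mse_score))
# Leave-one-out: one fold per data point.
print(cross_validate(xs, ys, k=len(xs), fit=fit_mean, score=mse_score))
```

This sketch takes the folds in the order the data is stored; in practice the data would usually be shuffled first, for the same reason as in a single training/testing split.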
  
<hr>

<b>The key idea in all of these methods is that, within any single training/testing cycle, we never use the same data point in both the testing and training sets.</b>
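
To make this concrete for leave-$p$-out, here is a small sketch that enumerates every split with `itertools.combinations` and checks that the training and testing indices never overlap. The generator name `leave_p_out_indices` is hypothetical. The sketch also shows how quickly the number of splits grows, using `math.comb` (Python 3.8 or later).

```python
from itertools import combinations
from math import comb


def leave_p_out_indices(m, p):
    """Yield (train_idx, test_idx) for every possible test set of size p."""
    for test_idx in combinations(range(m), p):
        held_out = set(test_idx)
        train_idx = [i for i in range(m) if i not in held_out]
        yield train_idx, list(test_idx)


splits = list(leave_p_out_indices(5, 2))
print(len(splits))  # 10 possible test sets of size 2 from 5 points

# No data point is ever in both sets of the same split.
print(all(set(tr).isdisjoint(te) for tr, te in splits))  # True

# Number of models to train for a data set of 100 points:
print(comb(100, 1))  # 100 (leave-one-out)
print(comb(100, 2))  # 4950
print(comb(100, 5))  # 75287520
```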

## A complication: validation sets

In many situations, we need a third data set, which is known as a **validation set**.  

Imagine the following situation: you are training a model with linear regression, but you don't know what features to use.  (This might be because you have a large set of features that you need to cut down, or maybe you think you might need to create some new features through feature engineering.)  

In this situation, you might need to create a bunch of different linear regression models with different combinations of features and evaluate how well they do.  If you simply use a single training set and a single testing set in this situation (and therefore evaluate all the different models using the same testing set), you are effectively now using the testing set for training.  This is because presumably you will choose the model from all the possible models you create that performs best on the testing data.  But that's exactly what we're trying to avoid --- having any overlap between the data we're using to create the model and the data we use to evaluate how well the model is doing.  (And here, "creating" the model also includes picking what features to use.)

To handle this, we introduce a third data set: the **validation set**.  This is another subset of your data that you will use if you need to evaluate multiple models.  You use the validation set as "testing data" for choosing the best model, then you use the "true" testing set to evaluate the best model once you've found it (but then you can't go back and switch to a different model based on the evaluation on the testing set).

The point of this is that the testing set is only used at the very end, once all the details of the model are fully specified.
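
The workflow can be sketched as follows. This is a minimal illustration under several assumptions: the helper names, the 60/20/20 split, and the `fit` and `score` callables are all hypothetical, and `score` is assumed to return an error where lower is better (such as MSE). Each candidate could be, for example, a different combination of features.

```python
import random


def three_way_split(data, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle and split data into (training, validation, testing) sets."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


def select_and_evaluate(candidates, train, val, test, fit, score):
    """Choose the best candidate on the validation set, then test it once."""
    best_config, best_model, best_val = None, None, float("inf")
    for config in candidates:
        model = fit(config, train)      # train on the training set only
        val_score = score(model, val)   # compare models on the validation set
        if val_score < best_val:
            best_config, best_model, best_val = config, model, val_score
    # The testing set is touched exactly once, after the choice is final.
    test_score = score(best_model, test)
    return best_config, best_val, test_score
```

The testing set never influences which candidate is chosen, so `test_score` remains an honest estimate of how the selected model performs on new data.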