# Evaluation Metrics

The goal of this lecture is to get comfortable with model evaluation metrics and the intricacies of model performance/selection.

* Bias Variance Trade Off
* Cross Validation
* Confusion Matrix
* ROC curves


## 1. Bias Variance Trade Off

We are assuming that our dataset is a random sample of the population. The question we want to answer is: does the model I'm building represent the whole population well (i.e. not just the sample dataset that I have)?

* Bias: If we built the model on many different random samples, would the average of those models be close to the true model?

    * A biased model would center around the incorrect solution. How you collect data can lead to bias (for example, you only get user data from San Francisco and try to use your model for your whole user base).
    * High bias can also come from underfitting, i.e., not fully representing the data you are given.

* Variance: Are the models built from different random samples all close together?

    * The main contributors to high variance are insufficient data, or a target that isn't actually correlated with your features.
    * High variance is also a result of overfitting to your sample dataset.
    
Note that both high bias and high variance are bad. High variance is worse than it sounds: you will only be constructing the model once, so with high variance there's a low probability that your model will be near the optimal one.

Looking at this in terms of the number of features:

* Increasing the number of features means:
    * Increase in variance
    * Decrease in bias
    
A graph can make this more clear:
    
![bias variance](images/bias_variance_graph.png)
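
To make the trade-off concrete, here is a minimal simulation sketch. It assumes a made-up true model (a sine curve), Gaussian noise, and polynomial degree as a stand-in for "number of features"; none of these come from the lecture data. Each simulated sample shares the same `x` values but draws fresh noise, so we can see how the fitted models move around.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_model(x):
    # Assumed ground truth, chosen only for this illustration
    return np.sin(2 * np.pi * x)

def bias_and_variance(degree, n_samples=20, n_repeats=200, noise=0.3):
    """Fit one polynomial per simulated sample and compare the fits on a grid."""
    x = np.linspace(0, 1, n_samples)
    grid = np.linspace(0.05, 0.95, 50)
    predictions = np.empty((n_repeats, grid.size))
    for i in range(n_repeats):
        # Same x values every time, fresh noise for each simulated sample
        y = true_model(x) + rng.normal(0, noise, size=n_samples)
        coeffs = np.polyfit(x, y, degree)
        predictions[i] = np.polyval(coeffs, grid)
    # Bias: how far is the *average* model from the true model?
    average_fit = predictions.mean(axis=0)
    bias_squared = float(np.mean((average_fit - true_model(grid)) ** 2))
    # Variance: how much do the individual models disagree with each other?
    variance = float(np.mean(predictions.var(axis=0)))
    return bias_squared, variance

bias_simple, var_simple = bias_and_variance(degree=1)
bias_flexible, var_flexible = bias_and_variance(degree=7)
print(f"degree 1: bias^2 = {bias_simple:.4f}, variance = {var_simple:.4f}")
print(f"degree 7: bias^2 = {bias_flexible:.4f}, variance = {var_flexible:.4f}")
```

Under these assumptions, the straight line should show the larger bias and the degree 7 polynomial the larger variance, which is the pattern the graph describes.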

## 2. Cross Validation

Cross validation is used to measure how well we do on a *different dataset*.

*We use cross validation as a means to get a sense of the error. Our final model will be built on all of the data so that we can have the best model possible.*


#### Validation Set

A validation (or hold out) set is a random sample of our data that we reserve for testing. We don't use this part of our data for building our model, just for assessing how well it did.

* A typical breakdown is:
    - 80% of our data in the training set (which we use to build the model)
    - 20% of our data in the test set (which we use to evaluate the model)
* Make sure that you randomize your data! It's really dangerous to pick the first 80% of the data to train and the last 20% to test since data is often sorted by a feature or the target! It would cause trouble if all the expensive houses were in your test set and never in the training set!
* Concerns:
    - *Variable:* Depending on which random sample we get, we will get a different estimate of the error.
    - *Underestimating:* We are slightly underestimating how well the final model will do, since we are testing a model built on just 80% of the dataset instead of the whole 100%.
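
A minimal sketch of a randomized 80/20 split using `sklearn`'s `train_test_split`. The data here is synthetic and deliberately sorted by the feature, to mimic the sorted-data danger described above; the variable names are only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in data, sorted by the feature like many real datasets
X = np.sort(rng.uniform(0, 10, size=100)).reshape(-1, 1)
y = 3.0 * X.ravel() + rng.normal(0, 1.0, size=100)

# train_test_split shuffles by default, so the test set is a random 20%
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
# For a regressor, `score` returns R^2 rather than accuracy
test_r2 = model.score(X_test, y_test)
print(len(X_train), len(X_test))
print(f"R^2 on the held-out 20%: {test_r2:.3f}")
```

Changing `random_state` gives a different split and a slightly different score, which is the *Variable* concern in practice.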


#### KFold Cross Validation

In K-fold cross validation, the data is split into **k** groups. One group
out of the k groups will be the test set, the rest (**k-1**) groups will
be the training set. In the next iteration, another group will be the test set,
and the rest will be the training set. The process repeats for k iterations (k-fold).
In each fold, a metric for accuracy will be calculated and
an overall average of that metric will be calculated over k-folds. 

![KFold Cross Validation](images/kfold.png)
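
The procedure can be sketched in plain Python. This is an illustrative implementation, not how any particular library does it: the helper names are hypothetical, the "model" simply predicts the mean of the training targets, and the metric is mean squared error.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and deal them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_val_mse(y, k=5, seed=0):
    """Average test MSE over k folds for a predict-the-training-mean model."""
    folds = k_fold_indices(len(y), k, seed)
    fold_errors = []
    for i, test_idx in enumerate(folds):
        # Every fold except fold i is used for training
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        prediction = mean(y[j] for j in train_idx)
        fold_errors.append(mean((y[j] - prediction) ** 2 for j in test_idx))
    return mean(fold_errors), fold_errors

y = [float(v) for v in range(20)]  # toy targets, for illustration only
avg_mse, per_fold = cross_val_mse(y, k=5)
print("per-fold MSE:", per_fold)
print("average MSE:", avg_mse)
```

Each data point lands in the test set exactly once, so every observation contributes to the error estimate.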


#### Cross Validation Example

To see how we can use Cross Validation to pick between models, let's consider the example where we are predicting the MPG of cars from the horsepower. Our data looks like this:

![scatter plot](images/data_scatter.png)

If we try building Linear Regression models on polynomial features of increasing degree, we will get better at predicting on the training set as we increase the degree, but at some point we will start overfitting and hurt our performance on the test set. We can see this graphically:

![error w.r.t. polynomial degree](images/data_error.png)

* *Underfitting:* On the left side of the graph, you can see that both the train and test errors are bad. This is because we are underfitting.
* *Overfitting:* On the right side of the graph, you can see that the train error is great, but the test error is bad. This is because we are overfitting.
* *Just right:* Around a degree 5 polynomial, we see the perfect balance between the two. This is the point where the test error is minimum.

In practice, we just care about minimizing the test error. We only look at the training error to get a better understanding of what's going on.
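
A rough sketch of this kind of model selection, using `numpy` only. The data is a synthetic stand-in (a curved trend plus noise, with the feature rescaled to [-1, 1]), not the actual horsepower/MPG dataset, so the best degree it finds will not match the degree 5 seen in the graph above.

```python
import numpy as np

rng = np.random.default_rng(1)
# Made-up data for illustration: a quadratic trend plus noise
x = rng.uniform(-1, 1, size=60)
y = 1.0 - 2.0 * x + 1.5 * x**2 + rng.normal(0, 0.3, size=60)

def cv_errors(x, y, degree, k=5, seed=0):
    """Mean train and test MSE over k folds for a polynomial of one degree."""
    shuffled = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(shuffled, k)
    train_mse, test_mse = [], []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        train_pred = np.polyval(coeffs, x[train_idx])
        test_pred = np.polyval(coeffs, x[test_idx])
        train_mse.append(np.mean((train_pred - y[train_idx]) ** 2))
        test_mse.append(np.mean((test_pred - y[test_idx]) ** 2))
    return float(np.mean(train_mse)), float(np.mean(test_mse))

degrees = range(1, 11)
results = {d: cv_errors(x, y, d) for d in degrees}
# Pick the degree with the lowest cross-validated test error
best_degree = min(results, key=lambda d: results[d][1])
for d in degrees:
    print(f"degree {d:2d}: train MSE {results[d][0]:.3f}, test MSE {results[d][1]:.3f}")
print("degree with lowest cross-validated test error:", best_degree)
```

The training error can only go down as the degree increases, while the test error is what tells us where to stop.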

## 3. Measuring Success


### Accuracy
The simplest measure is **accuracy**. This is the number of correct predictions over the total number of predictions, i.e. the percentage you predicted correctly. In `sklearn`, this is what the `score` method of a classifier calculates.

### Shortcomings of Accuracy
Accuracy is often a good first glance measure, but it has many shortcomings. If the classes are unbalanced, accuracy will not measure how well you did at predicting. Say you are trying to predict whether or not an email is spam. Only 2% of emails are in fact spam emails. You could get 98% accuracy by always predicting not spam. This is a great accuracy but a horrible model!
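
The spam example can be checked with a few lines of plain Python. The 1000-email dataset below is made up to match the 2% figure.

```python
# 1000 emails, of which 2% (20) are spam: 1 = spam, 0 = not spam
actual = [1] * 20 + [0] * 980
# A "model" that always predicts not spam
predicted = [0] * len(actual)

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
spam_caught = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))

print(f"accuracy: {accuracy:.2f}")  # accuracy: 0.98
print(f"spam emails caught: {spam_caught} of {sum(actual)}")  # 0 of 20
```

The accuracy looks excellent even though the model never identifies a single spam email.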

### Confusion Matrix
We can get a better picture of our model by looking at the confusion matrix. It gives us the following four counts:

* **True Positives (TP)**: Correct positive predictions
* **False Positives (FP)**: Incorrect positive predictions (false alarm)
* **True Negatives (TN)**: Correct negative predictions
* **False Negatives (FN)**: Incorrect negative predictions (a miss)

|            | Predicted Yes  | Predicted No   |
| ---------- | -------------- | -------------- |
| Actual Yes | True positive  | False negative |
| Actual No  | False positive | True negative  |

With logistic regression, we can visualize it as follows:

![logistic confusion matrix](images/logistic.png)

### Precision, Recall and F1

Instead of accuracy, there are some other scores we can calculate:

* **Precision**: A measure of how good your positive predictions are
    ```
    Precision = TP / (TP + FP)
             = TP / (predicted yes)
    ```
* **Recall**: A measure of how well you predict positive cases. Aka *sensitivity*.
    ```
    Recall = TP / (TP + FN) 
           = TP / (actual yes)
    ```
* **F1 Score**: The harmonic mean of Precision and Recall
    ```
    F1 = 2 / (1/Precision + 1/Recall)
       = 2 * Precision * Recall / (Precision + Recall)
       = 2TP / (2TP + FN + FP)
    ```

* Accuracy can also be written in this notation:
    ```
    Accuracy = (TP + TN) / (TP + FP + TN + FN)
    ```
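
As a worked example, the formulas above translate directly into code. The counts are invented for illustration, and the hypothetical helper assumes there is at least one predicted positive and one actual positive (otherwise a denominator would be zero).

```python
def classification_scores(tp, fp, tn, fn):
    """Compute the scores above from the four confusion matrix counts."""
    precision = tp / (tp + fp)  # TP / (predicted yes)
    recall = tp / (tp + fn)     # TP / (actual yes)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Made-up counts: 1000 predictions, 50 actual positives
scores = classification_scores(tp=30, fp=10, tn=940, fn=20)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Here the accuracy is 0.97, yet recall is only 0.6: the model misses 20 of the 50 actual positives.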

## 4. ROC Curves

The confusion matrix and the scores we calculate from it don't capture everything. With logistic regression, we are actually predicting probabilities, not just which class we think each point belongs to. By default, we put anything that has a probability of at least 0.5 in the positive class and everything else in the negative class. We could, however, choose any threshold from 0 to 1.

Each threshold corresponds to a different confusion matrix.

* Increasing the threshold will:
    - :grinning: decrease the number of False Positives,
    - :weary: decrease the number of True Positives,
    - :weary: increase the number of False Negatives, and
    - :grinning: increase the number of True Negatives.
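
A small sketch of this effect in plain Python. The labels and predicted probabilities are made up, and the helper name is hypothetical; a point is predicted positive when its probability is at least the threshold.

```python
def confusion_counts(labels, probabilities, threshold):
    """Count TP, FP, FN, TN when predicting positive for p >= threshold."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for label, p in zip(labels, probabilities):
        predicted_positive = p >= threshold
        if predicted_positive and label == 1:
            counts["TP"] += 1
        elif predicted_positive and label == 0:
            counts["FP"] += 1
        elif label == 1:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts

# Made-up true labels and predicted probabilities
labels = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
probabilities = [0.05, 0.1, 0.2, 0.35, 0.4, 0.55, 0.6, 0.7, 0.8, 0.95]

for threshold in (0.3, 0.5, 0.7):
    print(threshold, confusion_counts(labels, probabilities, threshold))
# 0.3 {'TP': 4, 'FP': 3, 'FN': 0, 'TN': 3}
# 0.5 {'TP': 3, 'FP': 2, 'FN': 1, 'TN': 4}
# 0.7 {'TP': 2, 'FP': 1, 'FN': 2, 'TN': 5}
```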
    
Now we have an infinite number of models to compare! How are we going to investigate them?

We use the Receiver Operating Characteristic Curve, more commonly called the ROC Curve.

We calculate the False Positive Rate and the True Positive Rate for every possible threshold and graph them all!

Here's how we calculate the FPR and TPR:
```
TPR = TP / P 
    = TP / (TP + FN)
    = Recall

FPR = FP / N 
    = FP / (FP + TN)
```
Here's what the ROC Curve looks like:

![](images/roc_curve.png)



Note that our ideal is the upper left corner, where our FPR is 0 and our TPR is 1.

While this graph shows us what FPR and TPR we can get, it does not tell us the threshold that yields a specific FPR and TPR. However, once we figure out what point on the graph we want, we can figure out the appropriate threshold (it's just not shown to us on the ROC curve).
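
One way to keep track of the thresholds is to compute the ROC points ourselves. The sketch below uses made-up labels and probabilities and a hypothetical helper; only the distinct predicted probabilities need to be tried as thresholds, since the confusion matrix cannot change between them.

```python
def roc_points(labels, probabilities):
    """Return (threshold, FPR, TPR) triples, from the strictest threshold down."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # A threshold above every probability predicts nothing as positive: (0, 0)
    thresholds = [float("inf")] + sorted(set(probabilities), reverse=True)
    points = []
    for t in thresholds:
        tp = sum(1 for y, p in zip(labels, probabilities) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(labels, probabilities) if p >= t and y == 0)
        points.append((t, fp / n_neg, tp / n_pos))
    return points

# Made-up true labels and predicted probabilities
labels = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
probabilities = [0.05, 0.1, 0.2, 0.35, 0.4, 0.55, 0.6, 0.7, 0.8, 0.95]

points = roc_points(labels, probabilities)
for t, fpr, tpr in points:
    print(f"threshold {t:>4}: FPR {fpr:.2f}, TPR {tpr:.2f}")

# Example choice: lowest FPR among thresholds that reach a TPR of at least 0.75
candidates = [pt for pt in points if pt[2] >= 0.75]
chosen = min(candidates, key=lambda pt: pt[1])
print("chosen threshold:", chosen[0])  # chosen threshold: 0.6
```

In `sklearn`, `sklearn.metrics.roc_curve` returns the thresholds alongside the FPR and TPR arrays, so the same lookup is possible there.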

