# Evaluation Metrics

In [1]:
# Import some helper functions (please ignore this!)
from utils import *

**Context:** At this point, we have a general framework for developing probabilistic models, as well as one way of fitting them to data, MLE. We've then instantiated this framework to develop two types of predictive models---regression and classification. We further learned how to build in expressivity using tools from deep learning---namely, neural networks---into these models. Are we finally ready to apply these models to real-life tasks? 

Unfortunately, there's one key piece we're still missing: so far, we've only used 1- and 2-dimensional input data and 1-dimensional output data. While in principle, we already have the tools to implement predictive models for higher dimensional data, we don't yet have the tools to *evaluate* them. We've purposefully worked with lower dimensional data because it is easy to visualize, and therefore easy to qualitatively evaluate. But as data becomes higher dimensional, it's much more difficult to get intuition using visualizations. As a result, we will have to rely on *metrics*. 

**Challenge:** There are many ways of measuring model performance. Which metrics should we use? What are the pros and cons of each metric? We will answer these questions here. Even though our motivation for developing evaluation metrics is our inability to visualize high-dimensional data, we will, in fact, focus on low-dimensional data again. This is because we need to gain intuition about each metric we introduce. 

**Outline:**
* What's underfitting/overfitting? How do we prevent it?
* Introduce log-likelihood
* Introduce metrics specific to regression
* Introduce metrics specific to classification
* Discuss broader challenges with model evaluation

**Data:** TODO
* Introduce motivating high-dimensional regression and classification data and tasks.
* Provide students with several choices of fitted models for each task to evaluate.

## Preventing Over- and Under-fitting

**Decomposition into Signal vs. Noise.** In probabilistic models, we often term one part of the model the "trend" or the "signal," and we term the rest "noise" or "observation error." For example, recall that in regression, our trend is captured by a function of the inputs, $\mu(\cdot; W)$, around which we assume a Gaussian observation error. Similarly, in the classification model we introduced, our trend is captured by a function of the inputs, $\rho(\cdot; W)$, which computes the probability of belonging to class 1 (as opposed to class 0). We then sample from a Bernoulli distribution with that probability---this represents the "noise" or "observation error." When fitting a probabilistic model, we're always at risk of learning models that mix up what's signal vs. what's noise. This can happen in two ways: 
* **Overfitting:** When the "signal" part of the model memorizes the "noise" part and, as a result, interpolates poorly. By *interpolation*, we mean how well the model fits data points *near* our training data. 
* **Underfitting:** When a model doesn't capture enough of the "signal" in the data (mistaking it for noise) and, as a result, interpolates poorly.
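
To make the signal/noise split concrete, here is a minimal sketch of the two data-generating processes described above. The specific trend functions (a sine wave for $\mu$ and a sigmoid for $\rho$), the noise scale, and all function names are assumptions chosen purely for illustration; they are not the models fitted elsewhere in this course.

```python
import math
import random

def mu(x):
    """Hypothetical regression trend ("signal"): a sine wave."""
    return math.sin(x)

def rho(x):
    """Hypothetical classification trend: probability of class 1 (a sigmoid)."""
    return 1.0 / (1.0 + math.exp(-x))

def sample_regression(xs, sigma, rng):
    """Signal mu(x) plus Gaussian observation error with std. dev. sigma."""
    return [mu(x) + rng.gauss(0.0, sigma) for x in xs]

def sample_classification(xs, rng):
    """Bernoulli draw with success probability rho(x)."""
    return [1 if rng.random() < rho(x) else 0 for x in xs]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
xs = [rng.uniform(-3.0, 3.0) for _ in range(5)]
ys_regression = sample_regression(xs, sigma=0.3, rng=rng)
ys_classification = sample_classification(xs, rng=rng)

for x, y_r, y_c in zip(xs, ys_regression, ys_classification):
    print(f"x={x:+.2f}  signal={mu(x):+.2f}  y_reg={y_r:+.2f}  "
          f"rho={rho(x):.2f}  y_cls={y_c}")
```

In both cases, the observed value differs from the trend only through the random draw. A model that overfits treats those random deviations as part of the trend; a model that underfits treats part of the trend as random deviation.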

Let's illustrate what this looks like in both regression and classification, starting with regression:

TODO regression figure and explanation

TODO classification figure and explanation

**Preventing Inappropriate Decomposition into Signal vs. Noise.** How can we tell if a model has overfit or underfit? Looking at the definitions above, we said a model over/under-fits if it *interpolates* poorly. This means that, to determine whether a model over/under-fits, we should check its fit *on data it has not seen yet*---specifically, on data that's similar to our training data. 

But where can we get such extra data points? Oftentimes, it's hard to collect additional data. For example, at the IHH, data collection is expensive; it's funded by large government grants and requires approval of an [Institutional Review Board (IRB)](https://en.wikipedia.org/wiki/Institutional_review_board) to ensure data collection and analysis are ethical. As a result, we often cannot collect additional data. Instead, we can *split our data into parts*---one part for training and one part for testing the model's interpolation. Specifically, after fitting the model on the training data, we compare the model's fit on the training data with its fit on the test data:
* If the fit on the training data is much better than that on the test data, we *overfit*.
* If the fit on neither the training data nor the test data is good, we *underfit.*
* If the fit on both the training and test data is comparably good, we fit just right!
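
The comparison above can be sketched in a few lines. As a stand-in for the models from earlier chapters, this example uses a $k$-nearest-neighbor regressor, which is an assumption made only because it needs no fitting code: with $k=1$ it memorizes the training data, and with $k$ equal to the training-set size it predicts the same average everywhere. The data-generating trend, noise level, and function names are likewise hypothetical.

```python
import math
import random

def make_data(n, rng, sigma=0.3):
    """Draw (x, y) pairs from a hypothetical trend sin(x) plus Gaussian noise."""
    xs = [rng.uniform(-3.0, 3.0) for _ in range(n)]
    return [(x, math.sin(x) + rng.gauss(0.0, sigma)) for x in xs]

def knn_predict(train, x, k):
    """Average the y-values of the k training points closest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared error on `data` of the k-NN predictor built from `train`."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in data) / len(data)

rng = random.Random(0)
data = make_data(200, rng)
train, test = data[:100], data[100:]  # points are i.i.d., so a prefix split is random

for k, label in [(1, "memorizes noise"), (10, "in between"), (100, "ignores signal")]:
    print(f"k={k:3d} ({label}): train MSE={mse(train, train, k):.3f}, "
          f"test MSE={mse(train, test, k):.3f}")
```

With $k=1$, the training error is zero while the test error is not, which is the overfitting pattern. With $k=100$, the error is high on both sets, which is the underfitting pattern.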

Let's illustrate this with a visualization:

TODO figure for regression

TODO figure for classification

**Model Selection Using a Validation Set.** In practice, we often aren't just interested in the fitted model---we also want to report some *metric* to quantify how well it fits. To accommodate this need, we split the data into three parts: a training set, a validation set, and a test set. 
* We fit the model to the *training* data.
* We use the *validation* set to determine over/under-fitting (by comparing the fit on the validation vs. training data, as described above).
* Finally, we report the model's performance on the *test* set.
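
A three-way split like the one described above might look as follows. The function name, the 60/20/20 proportions, and the seed are assumptions for illustration; appropriate proportions depend on how much data is available.

```python
import random

def train_val_test_split(data, frac_train=0.6, frac_val=0.2, seed=0):
    """Shuffle `data` and cut it into disjoint training/validation/test parts."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)  # shuffle so each part is representative
    n_train = round(frac_train * len(data))
    n_val = round(frac_val * len(data))
    train = [data[i] for i in indices[:n_train]]
    val = [data[i] for i in indices[n_train:n_train + n_val]]
    test = [data[i] for i in indices[n_train + n_val:]]  # whatever remains
    return train, val, test

data = list(range(10))  # placeholder for real (x, y) pairs
train, val, test = train_val_test_split(data)
print(len(train), len(val), len(test))
```

Because the test set plays no role in fitting or in choosing between models, the metric reported on it is not biased by those choices.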

**But which metric should we use?** That's a million-dollar question! We will next present several common metrics for evaluating model performance numerically. As you will see, model performance is multi-faceted---there's more to consider than just how well the model fits. For example, is the model fair? Will the model be understandable to a human? If we have multiple factors to consider, how can we possibly design a metric that gives us a single ranking of models from best to worst? The answer is: we can't. We will have to use multiple metrics to evaluate our models. At times, these metrics will present us with contradictory information, and we will have to determine what to do. 

## Log-Likelihood

**A General Purpose Metric.** Since we have so far fit our models by finding parameters that maximize the data log-likelihood, can we just use log-likelihood as an evaluation metric? If we trust this metric for *fitting* our model, surely we should trust it for *evaluation*. And what's nice about log-likelihood is that, unlike some of the other metrics we'll introduce next, it's not model-specific. We can evaluate the log-likelihood of *any* probabilistic model.
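
As a minimal sketch, here is how the log-likelihood could be computed for the two model families used so far, given the trend values a model predicts at each data point. For regression with Gaussian noise, each point contributes $\log \mathcal{N}(y_n; \mu_n, \sigma^2) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(y_n - \mu_n)^2}{2\sigma^2}$; for classification, each point contributes $y_n \log \rho_n + (1 - y_n)\log(1 - \rho_n)$. The function names and the example numbers below are made up for illustration.

```python
import math

def gaussian_log_likelihood(ys, mus, sigma):
    """Sum of log N(y_n | mu_n, sigma^2) over all data points."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * sigma**2) - (y - mu) ** 2 / (2.0 * sigma**2)
        for y, mu in zip(ys, mus)
    )

def bernoulli_log_likelihood(ys, rhos):
    """Sum of log Bernoulli(y_n | rho_n) over all data points, with y_n in {0, 1}."""
    return sum(
        math.log(rho) if y == 1 else math.log(1.0 - rho)
        for y, rho in zip(ys, rhos)
    )

# Made-up observations and two made-up sets of predicted trends.
ys = [1.0, 2.0, 3.0]
close_fit = [1.1, 1.9, 3.2]
poor_fit = [0.0, 0.0, 0.0]
print(gaussian_log_likelihood(ys, close_fit, sigma=0.5))
print(gaussian_log_likelihood(ys, poor_fit, sigma=0.5))
print(bernoulli_log_likelihood([1, 0, 1], [0.9, 0.2, 0.6]))
```

Higher (less negative) values indicate a better fit. Since the total is a sum over data points, dividing by the number of points makes values comparable across data sets of different sizes, such as a training set and a smaller validation set.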

````{admonition} Exercise: Getting Intuition for Log-Likelihood
**TODO:** 
* Load in the list of models
* Visualize each model and describe in what way it does not fit the data
* Evaluate each model's log-likelihood (provide function)
````

**Human-meaningful notions of fit.** TODO Cons: units are difficult to interpret.

## Metrics for Regression

TODO:
* Introduce MSE, R^2
* Pros: easy to interpret
* Cons: hard to know if scale of MSE is reasonable
* Exercise: ask students to compute MSE and log-likelihood and decide which model they prefer
* Exercise: add an additional sensitive feature to data---ask them to break down metrics on sensitive variable. Which model do they pick now? 
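
As a starting point for the two metrics named in this list, here is a minimal sketch using their standard definitions: the mean squared error, $\mathrm{MSE} = \frac{1}{N}\sum_n (y_n - \hat{y}_n)^2$, and the coefficient of determination, $R^2 = 1 - \frac{\sum_n (y_n - \hat{y}_n)^2}{\sum_n (y_n - \bar{y})^2}$, where $\hat{y}_n$ is the model's prediction and $\bar{y}$ is the mean of the observed outputs. The function names and toy numbers are assumptions for illustration.

```python
def mean_squared_error(ys, predictions):
    """Average squared gap between observations and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)

def r_squared(ys, predictions):
    """1 minus the ratio of residual variation to total variation in ys."""
    mean_y = sum(ys) / len(ys)
    ss_residual = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_residual / ss_total

# Made-up observations and predictions.
ys = [1.0, 2.0, 3.0, 4.0]
predictions = [1.0, 2.0, 3.0, 5.0]
print(mean_squared_error(ys, predictions))
print(r_squared(ys, predictions))
print(r_squared(ys, [2.5] * 4))  # always predicting the mean of ys
```

MSE is expressed in squared units of the output, so whether a given value is "small" depends on the scale of the data. $R^2$ addresses this by comparing against the baseline of always predicting the mean: that baseline scores 0, and a perfect fit scores 1.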

## Metrics for Classification

TODO:
* Introduce accuracy. Con: doesn't communicate false positives or negatives (use class imbalance example)
* Introduce confusion matrix. Cons: doesn't communicate predictive uncertainty
* Exercise: ask students to compute confusion matrices and decide which model they prefer
* Exercise: add an additional sensitive feature to data---ask them to break down metrics on sensitive variable. Which model do they pick now? 
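
The class-imbalance issue mentioned in this list can be sketched with a made-up data set in which only 5 of 100 points belong to class 1, and a hypothetical "model" that always predicts class 0. The function names and the dictionary layout of the confusion matrix are assumptions for illustration.

```python
def accuracy(ys, predictions):
    """Fraction of points whose predicted class matches the true class."""
    return sum(y == p for y, p in zip(ys, predictions)) / len(ys)

def confusion_matrix(ys, predictions):
    """Count true/false positives and negatives for binary labels in {0, 1}."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for y, p in zip(ys, predictions):
        if p == 1:
            counts["TP" if y == 1 else "FP"] += 1
        else:
            counts["TN" if y == 0 else "FN"] += 1
    return counts

# Made-up imbalanced data: 5 positives, 95 negatives.
ys = [1] * 5 + [0] * 95
always_negative = [0] * 100  # a "model" that never predicts class 1

print(accuracy(ys, always_negative))
print(confusion_matrix(ys, always_negative))
```

The accuracy looks high even though the model misses every positive case. The confusion matrix makes this visible: all five positives show up as false negatives.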

## Challenges with Evaluation

TODO:
* Model performance is multi-dimensional, but we can only use one "scoring system" to select
* This becomes even more tricky when fairness is involved
* How to be ethical? Participatory design.