# Bias Variance, Cross Validation, & Regularization

### 11/26/18

### Created by Roland Chin, Nichole Sun, Michelle Hao
##### with material from Ajay Raj, Nichole Sun, Rosa Choe and Data 100 Lecture

In [None]:
import matplotlib.pyplot as plt
import random
import numpy as np
import pandas as pd
import seaborn as sns
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from plotting import overfittingDemo, plot_multiple_linear_regression, ridgeRegularizationDemo, lassoRegularizationDemo
from scipy.optimize import curve_fit
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('display.precision', 2)

<a id='recap'></a>
## Introduction

![alt text](fit_graphs.png "Fit Graphs")

Bias corresponds to underfitting. In the first model, the points seem to follow some sort of curve, but our predictor is linear and is therefore unable to capture the trend. In this case, we have chosen a model that is not complex enough to capture the information in our data set.

Variance corresponds to overfitting. In the last model, the predictor is overly complex because it bends to get as close to every data point as possible. In this case, the model changes too much in response to small fluctuations caused by insignificant details (noise) in the data.

<a id='bv-tradeoff'></a>
## Bias-Variance Tradeoff

Today we'll perform **model evaluation**, where we'll judge how our linear regression models actually perform. We will require a **loss function**, which gives a numerical value for how far your model's predictions are from the true values. One such loss, for a line with slope $m$ and intercept $b$, is the residual sum of squares (RSS):

$$\textit{RSS} = \sum_{i=1}^n {e_i}^2 = \sum_{i=1}^n (y_i - mx_i - b)^2$$

Now we will generalize this formula: Say that there are $p$ features, or independent variables. Your task is to create a model $\hat{f}$, such that the loss is now:

$$\frac{1}{n} \sum_{i=1}^n (y_i - \hat{f}(x_i))^2$$

In this loss function, $y_i$ is a number, and $x_i$ is a $p$-vector, because there are $p$ features. This loss is called **mean squared error**, or **MSE**.
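
As a small illustration of the formula above, the sketch below computes the MSE by hand with NumPy and compares it with `sklearn.metrics.mean_squared_error`. The arrays `y` and `y_hat` are made-up values chosen only for this example.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical observed values and model predictions (made up for illustration)
y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

# MSE: the average of the squared residuals
mse_manual = np.mean((y - y_hat) ** 2)

# scikit-learn computes the same quantity
mse_sklearn = mean_squared_error(y, y_hat)

print(mse_manual, mse_sklearn)
```

Both values should agree, since they implement the same average of squared differences.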

Now, we'll talk about other ways to evaluate a model.

First, let's define some terms.

We can say that everything in the universe can be described with the following equation:

$$y = h(x) + \epsilon$$

- $y$ is the quantity you are trying to model
- $x$ are the features (independent variables)
- $h$ is the **true model** for $y$ in terms of $x$
- $\epsilon$ represents **noise**, a random number which has mean zero

Let $\hat{f}$ be your model for $y$ in terms of $x$.

### Bias

When evaluating a model, the most intuitive first step is to look at how well the model performs. For classification, this may be the percentage of data points correctly classified; for regression, it may be how close the predicted values are to the actual values. The **bias** of a model measures how far the prediction of the *average* model is from the actual value. Note that bias is not a measure of a single fitted model: it describes the scenario in which we collect many datasets, fit a model to each dataset, and average over all of the models. A low bias means that, on average, our predictions are close to the actual values.

### Variance
The **variance** of a model is the variance of the predictions across that same collection of models. In the previous section about bias, we envisioned collecting many datasets, fitting a model to each dataset, and averaging over all the datasets. Variance instead describes how much the individual predictions spread around that average. Even if we can predict a value very well on average, a very high variance makes this less helpful: in practice we train on only one dataset, and with high variance that single model may land far from the average. A low variance means that our model will not predict very different values for different datasets.

![alt text](BiasVariance.jpg "Bias Variance Visualization")

The image illustrates bias and variance with a simplified example. Suppose we would like to create a model that selects a point close to the center. The models in the top row have low bias, meaning the center of each cluster is close to the red dot on the target. The models in the left column have low variance: the clusters are quite tight, meaning our predictions are close together.

### The Tradeoff

We are trying to minimize **expected error**, or the average **MSE** over all datasets. It turns out (with some advanced probability gymnastics), that:

$$\text{Expected Error} = \text{Noise Variance} + \text{Bias}^2 + \text{Variance}$$

Note that $\text{Noise Variance}$ is constant: we assume there is some noise, and $\text{Noise Variance}$ is simply a value that describes how noisy your dataset will be on average.
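
For readers who want to see where the three terms come from, here is a sketch of the decomposition at a fixed input $x$. It assumes that $\epsilon$ has mean zero and variance $\sigma^2$ and is independent of the fitted model $\hat{f}$; the symbol $\sigma^2$ is introduced here only for illustration, and the expectations are taken over datasets and noise.

$$\mathbb{E}\left[(y - \hat{f}(x))^2\right] = \underbrace{\sigma^2}_{\text{Noise Variance}} + \underbrace{\left(\mathbb{E}[\hat{f}(x)] - h(x)\right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{Variance}}$$

The cross terms that appear when expanding the square vanish under these assumptions, because $\epsilon$ has mean zero and $\hat{f}(x) - \mathbb{E}[\hat{f}(x)]$ has mean zero by construction.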

This equation defines what is known as the **bias variance tradeoff**. 

![alt text](BiasVarianceTradeoff.png "Bias Variance Tradeoff")

Image from http://scott.fortmann-roe.com/docs/BiasVariance.html


Why is this true intuitively?

At some point as we decrease **bias**, instead of getting closer to the **true model** $h$, we go past and try to fit to the $\epsilon$ (noise) that is part of our current dataset. This is equivalent to making our model more noisy: which means that over all datasets, it has more **variance**.
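
The thought experiment of "many datasets, one model per dataset" can be simulated directly. The sketch below is only an illustration under assumed settings: the true model `h`, the noise level `noise_sd`, the evaluation point `x0`, and the polynomial degrees are all made up for this example. It estimates the squared bias and the variance of the prediction at `x0` for polynomial fits of increasing degree.

```python
import numpy as np

rng = np.random.default_rng(0)

def h(x):
    # Assumed "true model", chosen for this illustration only
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0, 1, 30)  # fixed inputs shared by every simulated dataset
x0 = 0.3                         # point at which we examine the predictions
noise_sd = 0.3                   # standard deviation of the noise epsilon
n_datasets = 500

def bias2_and_variance(degree):
    preds = np.empty(n_datasets)
    for k in range(n_datasets):
        # Draw a fresh dataset y = h(x) + epsilon and fit a polynomial to it
        y = h(x_train) + rng.normal(scale=noise_sd, size=x_train.size)
        coefs = np.polyfit(x_train, y, degree)
        preds[k] = np.polyval(coefs, x0)
    bias2 = (preds.mean() - h(x0)) ** 2  # squared gap between average prediction and truth
    variance = preds.var()               # spread of predictions across datasets
    return bias2, variance

results = {d: bias2_and_variance(d) for d in (1, 3, 7)}
for d, (b2, var) in results.items():
    print(f"degree {d}: bias^2 = {b2:.4f}, variance = {var:.4f}")
```

In this setup the degree 1 fit should show the largest squared bias, while the degree 7 fit should show a larger variance than the degree 1 fit.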

**Questions for understanding**:
> 1. Where do underfitting and overfitting lie in the graph above? How do they relate to bias and variance?
> 2. Why can't we usually just make a bunch of models with low bias and high variance and average them?
> 3. Why is low variance important in models?

### Polynomial Regression

Let's look at a polynomial regression problem with a single input $x$.

In this case, if our model has degree $d$, we have $d + 1$ features: $[x^0, x^1, \ldots, x^d]$. Now, we have a linear model with $d + 1$ features:

$$\hat{f}(x) = \sum_{i=0}^{d} a_i x^i$$

Model complexity in this case is the degree of the polynomial. As we saw last week, as $d$ increases, model complexity increases. The model gets better, but then gets erratic. This directly corresponds to the bias-variance graph above.
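
To make the feature construction concrete, the sketch below builds the $d + 1$ polynomial features with `PolynomialFeatures` and fits a linear model for several degrees. The data is synthetic (a noisy cubic made up for this example), so the numbers say nothing about the datasets used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)

# Synthetic data: a noisy cubic, assumed only for illustration
x = np.sort(rng.uniform(-1, 1, size=40)).reshape(-1, 1)
y = x.ravel() ** 3 - 0.5 * x.ravel() + rng.normal(scale=0.1, size=40)

train_mse = {}
for d in range(1, 9):
    # Columns are x^0, x^1, ..., x^d, so the constant column acts as the intercept
    X_poly = PolynomialFeatures(degree=d).fit_transform(x)
    model = LinearRegression(fit_intercept=False).fit(X_poly, y)
    train_mse[d] = mean_squared_error(y, model.predict(X_poly))

for d, err in train_mse.items():
    print(f"degree {d}: training MSE = {err:.4f}")
```

Because each higher-degree model contains the lower-degree ones, the training error can only stay the same or decrease as $d$ grows. This is why training error alone cannot tell us when the model has started to overfit.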

In [None]:
overfittingDemo()

Looking at these models, we can tell the best model is a degree 3 model.

In [None]:
mpg = pd.read_csv("mpg.csv", index_col='name')  # load mpg dataset
mpg = mpg.loc[mpg["horsepower"] != '?'].astype(float)  # remove rows with missing horsepower values
mpg_train, mpg_test = train_test_split(mpg, test_size=.2, random_state=0)  # split into training set and test set
mpg_train, mpg_validation = train_test_split(mpg_train, test_size=.5, random_state=0)  # split the training set into training and validation sets
mpg_train.head()

## Cross-Validation

Our goal is to find some kind of happy medium between underfitting and overfitting, so our model fits well against **the underlying distribution of our data** and also **generalizes well against other data points**.

We want an accurate estimate of the test error in order to manage the bias-variance tradeoff. As we have seen, training error is misleadingly low because we fit our model on the training set. When we predict on the test set afterwards, the error might be much higher, since the model hasn't seen anything other than the training data.

A solution to this is to use cross validation, in which we split our original data into training, test, and validation data. This lets us repeatedly estimate our model error by testing our model multiple times.

## Why? Because Bias-Variance Tradeoff

Cross-validation helps us manage the bias-variance tradeoff more accurately.

The validation error estimates test error by checking the model's performance on a dataset that isn't the training data, which allows us to estimate model bias and model variance.

## Train-Validation-Test Split

The original dataset is split into 3 subsets:

* Training set: models are fit on this data (~70%)
* Validation set: we select the best features/hyperparameters (things which you set that cannot be learned) from here (~15%)
* Test set: the final set used for determining the model's accuracy (~15%)

We want to have a good balance between the 3 data sets: a larger training set generally improves the model's accuracy, but it leaves smaller validation and test sets, which may then not be representative of the original data.

Then we select a model and a set of features by doing the following:

1. For every potential set of features, fit a model using the training set. The error of a model on the training set is its *training error*.
1. Check the error of each model on the validation set, which is its *validation error*. Select the model that achieves the lowest validation error. This will be our choice of features, hyperparameters, and model.
1. Calculate the *test error*, which is just the error of the best model we've chosen on the last data set, the test set. From here, we cannot adjust the features or model to decrease test error, since this changes the test set into a validation set. What we can do is find a new test set.
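
The three steps above can be sketched as follows. This is a minimal illustration on synthetic data (a noisy quadratic made up for this example), with polynomial degree standing in for "choice of features"; the helper `fit_degree` is hypothetical and not part of any library.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)

# Synthetic data, assumed only for illustration
X = rng.uniform(-2, 2, size=(300, 1))
y = 1 + 2 * X.ravel() - X.ravel() ** 2 + rng.normal(scale=0.5, size=300)

# 70% training, 15% validation, 15% test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

def fit_degree(d):
    # Hypothetical helper: fit a degree-d polynomial model on the training set
    trans = PolynomialFeatures(degree=d)
    model = LinearRegression(fit_intercept=False).fit(trans.fit_transform(X_train), y_train)
    return trans, model

# Steps 1 and 2: fit each candidate on the training set, score it on the validation set
val_errors = {}
for d in range(1, 8):
    trans, model = fit_degree(d)
    val_errors[d] = mean_squared_error(y_val, model.predict(trans.transform(X_val)))
best_degree = min(val_errors, key=val_errors.get)

# Step 3: report the chosen model's error on the untouched test set
trans, model = fit_degree(best_degree)
test_error = mean_squared_error(y_test, model.predict(trans.transform(X_test)))
print(best_degree, test_error)
```

Note that the test set is used exactly once, after the degree has been chosen.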

This process allows us to more accurately determine the model to use than using the training error alone. By using cross-validation, we can test our model on data that it wasn't fit on, simulating test error without using the test set. This gives us a sense of how our model performs on unseen data.  

**Mean Squared Error:** The average of the squared differences between the predicted outputs and the actual outputs. This is the error "function" we use to calculate training and test errors. To get a better idea of what **MSE** is, take a look at the following picture:

<img src='mean_squared_error.png' width=400 height=400>

## Training Error and Test Error

A model is "bad" if it fails to generalize to unseen data (the test set). The test error is how we measure the accuracy of our model on data it has never seen before.

The training error decreases as we make our model more complex with additional features and such, but having a really low training error doesn't always mean our model is perfect, since our model may just be overfitting the more complex it is.

We can see how the test and training error change as complexity changes below.

![feature_train_test_error.png](https://raw.githubusercontent.com/DS-100/textbook/master/assets/feature_train_test_error.png)

## Even Better: K-Fold Cross-Validation

We can improve on the **train-validation-test split** method. Splitting the data three ways can leave too little data for training.

Thus we can run the train-validation split multiple times on the same dataset. The training dataset is divided into *k* equally-sized subsets, and the train-validation split is repeated *k* times. Every time, one of the *k* subsets, or folds, is used as the validation set, and the remaining *k - 1* folds are used for its training. 

The model's validation error is the average of its $k$ validation errors. 

Here is an example with 5 folds.

![feature_5_fold_cv.jpg](https://github.com/DS-100/textbook/blob/master/assets/feature_5_fold_cv.jpg?raw=true)

A benefit of this is that every data point is used for validation exactly once and for training *k-1* times.

If *k* is small, the error estimate has lower variance (many validation points) but higher bias (fewer training points).

If *k* is large, the error estimate has lower bias but higher variance. 

A disadvantage of this method is that it takes more computation time, but it computes a more accurate validation error in the end.

The `scikit-learn` library provides a convenient [`sklearn.model_selection.KFold`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) class to implement $k$-fold cross-validation.
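
As a quick look at what `KFold` produces, the sketch below splits ten toy data points into 5 folds and counts how often each point lands in a validation fold. The array `X` is a placeholder made up for this example.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # ten toy data points
kf = KFold(n_splits=5)

validation_counts = np.zeros(len(X), dtype=int)
for fold, (train_idx, valid_idx) in enumerate(kf.split(X)):
    # split() yields integer index arrays, not the data itself
    print(f"fold {fold}: train={train_idx}, validation={valid_idx}")
    validation_counts[valid_idx] += 1

print(validation_counts)
```

Every entry of `validation_counts` should be 1, matching the claim that each point is used for validation exactly once.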

## Example: Model Selection for Ice Cream Ratings
Here we have a simple example in which we utilize cross-validation to select a model in order to predict ice cream ratings from ice cream sweetness.

First we visualize the data below:

In [None]:
# ignore this cell
ice = pd.read_csv('icecream.csv')
transformer = PolynomialFeatures(degree=2)
X = transformer.fit_transform(ice[['sweetness']])

clf = LinearRegression(fit_intercept=False).fit(X, ice[['overall']])
xs = np.linspace(3.5, 12.5, 300).reshape(-1, 1)
rating_pred = clf.predict(transformer.transform(xs))

temp = pd.DataFrame(xs, columns = ['sweetness'])
temp['overall'] = rating_pred

np.random.seed(42)
x_devs = np.random.normal(scale=0.2, size=len(temp))
y_devs = np.random.normal(scale=0.2, size=len(temp))
temp['sweetness'] = np.round(temp['sweetness'] + x_devs, decimals=2)
temp['overall'] = np.round(temp['overall'] + y_devs, decimals=2)

ice = pd.concat([temp, ice])
ice

In [None]:
# ignore this cell
plt.scatter(ice['sweetness'], ice['overall'])
plt.title('Ice Cream Rating vs. Sweetness')
plt.xlabel('Sweetness')
plt.ylabel('Rating');

We can use a degree 10 polynomial on 9 random points to create a perfectly accurate model for these points. But this is an example of overfitting, which means the model fails to generalize to unseen data.

In [None]:
# ignore this cell
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

ice2 = pd.read_csv('icecream.csv')
trans_ten = PolynomialFeatures(degree=10)
X_ten = trans_ten.fit_transform(ice2[['sweetness']])
y = ice2['overall']

clf_ten = LinearRegression(fit_intercept=False).fit(X_ten, y)

In [None]:
# ignore this cell
np.random.seed(1)
x_devs = np.random.normal(scale=0.4, size=len(ice2))
y_devs = np.random.normal(scale=0.4, size=len(ice2))

plt.figure(figsize=(10, 5))

plt.subplot(121)
plt.scatter(ice2['sweetness'], ice2['overall'])
xs = np.linspace(3.5, 12.5, 1000).reshape(-1, 1)
ys = clf_ten.predict(trans_ten.transform(xs))
plt.plot(xs, ys)
plt.title('Degree 10 polynomial fit')
plt.ylim(3, 7);

plt.subplot(122)
ys = clf_ten.predict(trans_ten.transform(xs))
plt.plot(xs, ys)
plt.scatter(ice2['sweetness'] + x_devs, ice2['overall'] + y_devs, c='g')
plt.title('Degree 10 poly, second set of data')
plt.ylim(3, 7);

Let's follow what we learned. First, we partition our data into training and test datasets using `scikit-learn`'s [`sklearn.model_selection.train_test_split`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) method to perform a 70/30% train-test split. The validation sets will come from cross-validation on the training set.

In [None]:
from sklearn.model_selection import train_test_split

test_size = 92

X_train, X_test, y_train, y_test = train_test_split(ice[['sweetness']], ice['overall'], test_size=test_size, random_state=0)

print(f'  Training set size: {len(X_train)}')
print(f'      Test set size: {len(X_test)}')

Let's try to fit polynomial regression models using the training set, from polynomial degrees 1-10.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# First, we add polynomial features to X_train
transformers = [PolynomialFeatures(degree=deg) for deg in range(1, 11)]
X_train_polys = [transformer.fit_transform(X_train) for transformer in transformers]

We will then perform 5-fold cross-validation on the 10 featurized datasets. To do so, we will create a function that:
1. Uses the [`KFold.split`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) function to get 5 splits on the training data. (`split` returns the indices of the data for that split)
2. For each split, select the rows and columns based on the split indices and features.
3. Fit a linear model on the training split.
4. Compute the mean squared error on the validation split.
5. Return the average error across all cross validation splits.

This may sound complicated, but is actually a pretty standard procedure which sklearn allows us to easily do.

In [None]:
from sklearn.model_selection import KFold

def mse_cost(y_pred, y_actual):
    return np.mean((y_pred - y_actual) ** 2)

def compute_CV_error(model, X_train, Y_train):
    kf = KFold(n_splits=5)
    validation_errors = []
    
    for train_idx, valid_idx in kf.split(X_train):
        # split the data
        split_X_train, split_X_valid = X_train[train_idx], X_train[valid_idx]
        split_Y_train, split_Y_valid = Y_train.iloc[train_idx], Y_train.iloc[valid_idx]

        # Fit the model on the training split
        model.fit(split_X_train, split_Y_train)
        
        # Compute the MSE on the validation split
        error = mse_cost(model.predict(split_X_valid), split_Y_valid)
        
        # Add this split's error to a list with all of the errors
        validation_errors.append(error)
    
    #average all the validation errors
    return np.mean(validation_errors)

In [None]:
# We fit a linear regression model for each featurized dataset and perform cross-validation
cross_validation_errors = [compute_CV_error(LinearRegression(fit_intercept=False), X_train_poly, y_train) for X_train_poly in X_train_polys]

We can see that as we use higher degree polynomial features, the validation error decreases and increases again.

In [None]:
# ignore this cell
from IPython.display import display
cv_df = pd.DataFrame({'Validation Error': cross_validation_errors}, index=range(1, 11))
cv_df.index.name = 'Degree'
pd.options.display.max_rows = 20
display(cv_df)
pd.options.display.max_rows = 7

plt.figure(figsize=(10, 5))

plt.subplot(121)
plt.plot(cv_df.index, cv_df['Validation Error'])
plt.scatter(cv_df.index, cv_df['Validation Error'])
plt.title('Validation Error vs. Polynomial Degree')
plt.xlabel('Polynomial Degree')
plt.ylabel('Validation Error');

plt.subplot(122)
plt.plot(cv_df.index, cv_df['Validation Error'])
plt.scatter(cv_df.index, cv_df['Validation Error'])
plt.ylim(0.044925, 0.05)
plt.title('Zoomed In')
plt.xlabel('Polynomial Degree')
plt.ylabel('Validation Error')

plt.tight_layout();

After examining the validation errors, we see that the degree 2 polynomial features give the lowest validation error. Thus, we select the degree 2 polynomial model as our final model and fit it on the training data.

Finally, we compute its error on the test set.

In [None]:
best_trans = transformers[1]
best_model = LinearRegression(fit_intercept=False).fit(X_train_polys[1], y_train)

training_error = mse_cost(best_model.predict(X_train_polys[1]), y_train)
validation_error = cross_validation_errors[1]
test_error = mse_cost(best_model.predict(best_trans.transform(X_test)), y_test)

print('Degree 2 polynomial')
print(f'  Training error: {training_error:0.5f}')
print(f'Validation error: {validation_error:0.5f}')
print(f'      Test error: {test_error:0.5f}')

Note that the test error is higher than the validation error which is higher than the training error.

This makes sense because the model is fit on the training data, which minimizes the mean squared error for that dataset. The validation error and the test error are usually higher than the training error because these errors are computed on data the model was not fit on.

`scikit-learn` also provides a [`cross_val_predict`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html) function that performs the cross-validation splits automatically, so in the future we don't have to methodically break the data into training and validation sets ourselves.
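
A related helper, `sklearn.model_selection.cross_val_score`, returns one score per fold. The sketch below uses it to get a 5-fold validation MSE on synthetic data (a noisy quadratic made up for this example, not the ice cream data). Note that scikit-learn reports the *negative* MSE under the `"neg_mean_squared_error"` scoring name, so we flip the sign.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(3)

# Synthetic data, assumed only for illustration
x = rng.uniform(0, 10, size=(100, 1))
y = 0.5 * x.ravel() ** 2 - 3 * x.ravel() + rng.normal(scale=2.0, size=100)

X_poly = PolynomialFeatures(degree=2).fit_transform(x)

# One (negative) MSE per fold
scores = cross_val_score(
    LinearRegression(fit_intercept=False),
    X_poly,
    y,
    cv=KFold(n_splits=5),
    scoring="neg_mean_squared_error",
)

cv_mse = -scores.mean()  # average validation MSE across the 5 folds
print(cv_mse)
```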

## Cross Validation Summary

The end goal of the cross-validation technique is to manage the bias-variance tradeoff and find an optimal model.

**In conclusion**, we can see that validation is a useful process for creating a model that fits our data without overfitting or underfitting. We also see that different measurable quantities, such as training error and test error, indicate how well our model fits our data at different stages of the modeling process.

<a id='regularization'></a>
## Regularization 
As we add a lot of features, this typically increases the variance of the model, resulting in worse performance overall. However, the features may contain important information about our data, so we may not want to throw them out completely. Regularization is a method that allows us to penalize complexity, while still allowing us to incorporate as much information as possible. 

Recall that the ordinary least squares model is the following, where $\hat{\theta}$ is the vector of model weights, and $x$ is the vector of features:
$$f_{\hat{\theta}}(x) = \hat{\theta} \cdot x$$

When we find the best fit for our model, we want to minimize the sum of squared differences, so we want to find $\hat{\theta}$ such that: 
$$\hat{\boldsymbol{\theta}} = \arg\!\min_{\boldsymbol{\theta}} \sum_{i=1}^n (y_i - f_{\boldsymbol{\theta}}(x_i))^2$$

This equation above is the ordinary least squares model __without regularization__.

We know that having large weights for more features increases variance. Because of this, we want to add regularization in order to discourage our model from overfitting. This is done by adding an $R(\boldsymbol{\theta})$ term to penalize large weight values. 

Here's what our ordinary least squares model looks like __with regularization__:

$$\hat{\boldsymbol{\theta}} = \arg\!\min_{\boldsymbol{\theta}} \sum_{i=1}^n (y_i - f_{\boldsymbol{\theta}}(x_i))^2 + \lambda R(\boldsymbol{\theta})$$

The new term after the summation is the **regularization** term. The $\lambda$ parameter in front of it dictates how limiting our regularization term is – the higher $\lambda$ is, the more we penalize large weights, and the more the regularization makes our weights deviate from OLS. 

**Question**: What happens when $\lambda = 0$?

So, what is $R(\theta)$? It could be a lot of things! Today we'll talk about two of the most common regularization functions – ridge and LASSO. 

<table>
    <tr><td>Ridge</td><td>L2 Norm</td><td>$R(\boldsymbol{\theta}) = \sum\limits_{j=1}^p \theta_j^2$</td></tr>
    <tr><td>LASSO</td><td>L1 Norm</td><td>$R(\boldsymbol{\theta}) = \sum\limits_{j=1}^p \lvert\theta_j\rvert$</td></tr>
</table>

<a id='ridge'></a>


<a id='L2 Regression: Ridge Regression'></a>
### L2 Regression: Ridge Regression
One way we could penalize the weights is penalizing the sum of the squared weights.  
For **ridge regression** the penalty is,
$$R(\boldsymbol{\theta}) = \sum\limits_{j=1}^p \theta_j^2$$

Substituting this regularization function into the least squares model from before, our regularized model now looks as follows, where the first sum runs over the $n$ data points and the second over the $p$ weights:
$$\hat{\boldsymbol{\theta}} = \arg\!\min_{\boldsymbol{\theta}} \sum_{i=1}^n (y_i - f_{\boldsymbol{\theta}}(x_i))^2 + \lambda \sum\limits_{j=1}^p \theta_j^2$$



Something interesting about Ridge Regression is that, for $\lambda > 0$, there is always a unique solution that can be found using a known closed-form formula. The solution involves linear algebra, so you don't need to know it, but the existence of this formula also makes it computationally easy to solve.
$$\hat{\boldsymbol{\theta}} = \left(\boldsymbol{X}^T \boldsymbol{X} + \lambda\boldsymbol{I}\right)^{-1}\boldsymbol{X}^T\boldsymbol{Y}$$
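
For those curious, the formula above can be checked numerically. The sketch below applies it to synthetic data (the matrix `X`, the targets `Y`, and the value of `lam` are made up for this example) and compares the result with `sklearn.linear_model.Ridge`. It assumes no separate intercept term, which is why `fit_intercept=False` is passed.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)

# Synthetic data, assumed only for illustration
X = rng.normal(size=(50, 3))
Y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
lam = 2.0  # regularization strength lambda

# theta = (X^T X + lambda I)^{-1} X^T Y, solved without forming the inverse explicitly
theta_closed = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

# Ridge with alpha=lam and no intercept minimizes the same objective
theta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, Y).coef_

# Ordinary least squares weights, for comparison
theta_ols = np.linalg.lstsq(X, Y, rcond=None)[0]

print(theta_closed, theta_sklearn, theta_ols)
```

The two ridge solutions should match, and the ridge weights should have a smaller norm than the OLS weights, which is the shrinkage the penalty is meant to produce.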

In order to truly see the effect of the regularization, let's first create a __regular linear regression model__ that we can compare to the regularized models. 
We will use a polynomial model with `displacement` up to degree 12 (the 13 columns produced by `np.vander` in the next cell, including the constant term).

In [None]:
from sklearn.linear_model import LinearRegression

x_train = np.vander(mpg_train["displacement"], 13)
y_train = mpg_train[["mpg"]]

x_validation = np.vander(mpg_validation["displacement"], 13)
y_validation = mpg_validation[["mpg"]]

# instantiate your model
linear_model = ...

# fit the model
...
# make predictions on validation set
linear_prediction = ...
# find mean squared error
linear_loss = ...

print("Mean Squared Error of linear model: {:.2f}".format(linear_loss))

Using what you did above as reference, do the same using a Ridge regression model.

In [None]:
from sklearn.linear_model import Ridge

...
ridge_loss = ... # mean squared error of ridge model

print("Mean Squared Error of linear model: {:.2f}".format(linear_loss))
print("Mean Squared Error of ridge model: {:.2f}".format(ridge_loss))

<a id='L1 Regularization: LASSO'></a>
### L1 Regularization: LASSO
In **LASSO**, we penalize the sum of absolute values of the weights. So,
$$R(\boldsymbol{\theta}) = \sum\limits_{j=1}^p \lvert\theta_j\rvert$$


If there's one thing you should know about LASSO, it is that it is *sparsity inducing*. This just means that it drives some weights to exactly zero, leaving you with fewer explanatory variables in the resulting model than you put in. Unlike ridge regression, LASSO doesn't necessarily have a unique solution, and there's no closed-form formula that determines what the optimal weights should be.
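
The sparsity effect is easy to observe on synthetic data. In the sketch below, only the first two of ten features actually influence `y` (the data and the `alpha` values are made up for this example). Keep in mind that `alpha` plays the role of $\lambda$ here, but scikit-learn scales the Lasso objective differently from the formula above, so the two values are not directly comparable.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(5)

# Synthetic data: ten features, only the first two matter (assumed for illustration)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

# Count weights that are exactly zero in each model
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print("lasso weights:", lasso.coef_)
print("ridge weights:", ridge.coef_)
print(n_zero_lasso, n_zero_ridge)
```

LASSO should set most of the irrelevant weights to exactly zero, while ridge shrinks them toward zero without eliminating them.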

In [None]:
from sklearn.linear_model import Lasso

...
lasso_loss = ... # mean squared error of lasso model

print("Mean Squared Error of linear model: {:.2f}".format(linear_loss))
print("Mean Squared Error of lasso model: {:.2f}".format(lasso_loss))

### Visualizing Ridge and LASSO
We just told you a lot of things about ridge and lasso, but here are some visualizations to help you understand the intuition behind some of the characteristics of these two regularization methods. Another way to describe the modified minimization function above is that it's the same loss function as before, with the *additional constraint* that $R(\boldsymbol{\theta}) \leq t$. Now, $t$ is related to $\lambda$ but the exact relationship between the two parameters depends on your data. Regardless, let's take a look at what this means in the two-dimensional case. For ridge,

$$\theta_0^2 + \theta_1^2 \leq t$$

Does this look familiar to you? What if it's in the form $x^2 + y^2 \leq t$? Or how about now:
<img src='http://vikingsseason5i.com/wp-content/uploads/2018/08/circle-equation-circle-equation-unit-circle.jpg' width=400 />

For lasso, the constraint is of the form $$\left|\theta_0\right| + \left|\theta_1\right| \leq t$$ This one's a little harder to interpret; perhaps this will help inspire you:
<img src='https://cdn.kastatic.org/ka-perseus-graphie/3b9b8f4b4dac19e1197e9dd94553d0822f9fe69a.png' />

#### Norm Balls
<img src='https://upload.wikimedia.org/wikipedia/commons/f/f8/L1_and_L2_balls.svg' width=400/>
<img src='norm_balls.png' width=400/>

The rhombus and circle are visualizations of the regularization term, while the blue curves are the contour curves of the loss function over the weights. You want to minimize the sum of the loss and the penalty. In the constrained view, the solution is the point where the lowest-valued loss contour touches the constraint region.


**Question**: Based on these visualizations, could you explain why LASSO is sparsity-inducing?

It turns out that the L2 norm ball is always some sort of smooth surface, from a circle in 2D to a sphere in 3D. On the other hand, the L1 norm ball used by LASSO always has sharp corners, and those corners lie on the axes, where some weights are exactly zero. This is exactly the feature that makes it sparsity-inducing. As you might imagine, just as humans are more likely to bump into sharp corners than smooth surfaces, the loss contours are also most likely to touch the L1 norm ball at one of its corners.

### Regularization and Bias Variance
As we mentioned earlier, bias describes the error of the average model across multiple models of the same family (e.g. same degree polynomial) trained on separate datasets. Variance describes how much those models, and therefore their weight vectors (the coefficients on your features), vary from dataset to dataset. 

Without the regularization term, we’re just minimizing bias; the regularization term means we won’t get the lowest possible bias, but we’re exchanging that for some lower variance so that our model does better at generalizing to data points outside of our training data.


### Lambda

We said that $\lambda$ is how much we care about the regularization term, but what does that look like? Let's return to the polynomial example from last week, and see what the resulting models look like with different values of $\lambda$ given a degree 8 polynomial.

In [None]:
ridgeRegularizationDemo([0, 0.5, 1.0, 5.0, 10.0], 9)

How do we know what to use for $\lambda$ (or `alpha` in the `sklearn.linear_model` constructors)?

That's right, it's validation! 
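
As a rough sketch of the idea, the code below tries several candidate values of `alpha` for ridge regression on synthetic data and keeps the one with the lowest validation error. The data, the candidate list `alphas`, and the 50/50 split are all assumptions made for this example, not the settings used for the `mpg` data.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)

# Synthetic data: 30 features, only two of which matter (assumed for illustration)
X = rng.normal(size=(120, 30))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=1.0, size=120)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

alphas = [0.01, 0.1, 1, 10, 100, 1000]  # hypothetical candidate values
val_errors = []
for alpha in alphas:
    # Fit on the training split, evaluate on the validation split
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    val_errors.append(mean_squared_error(y_val, model.predict(X_val)))

best_alpha = alphas[int(np.argmin(val_errors))]
print(dict(zip(alphas, val_errors)), best_alpha)
```

The same loop works with `Lasso` in place of `Ridge`; only the model class changes.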

### Validation on Lambda
Let's try to find the best $\lambda$ for the degree 12 polynomial on `displacement` from above.

In [None]:
lambdas = np.arange(0, 200) # create a list of potential lambda values

# create a list containing the corresponding mean_squared_error for each lambda using both ridge and lasso regression
ridge_errors = [] 
lasso_errors = []

# 

# finds the index of the minimum value in each list
ridge_errors.index(min(ridge_errors)), lasso_errors.index(min(lasso_errors))

## Sanity Check

1. What happens as $\lambda$ increases?
    1. bias increases, variance increases
    2. bias increases, variance decreases
    3. bias decreases, variance increases
    4. bias decreases, variance decreases
2. **True** or **False**? Bias is how much error your model makes.
3. What is **sparsity**?
4. For each of the following, choose **ridge**, **lasso**, **both**, or **neither**:
    1. L1-norm
    2. L2-norm
    3. Induces sparsity
    4. Has analytic (mathematical) solution
    5. Increases bias
    6. Increases variance

## Precision & Recall

There are two kinds of errors our model can make:
- False positive (FP), for example a good email gets flagged as spam and filtered out of the inbox
- False negative (FN), for example a spam email gets mislabeled as good and ends up in the inbox

These definitions depend both on the true labels and the predicted labels. False positives and false negatives may be of differing importance. For example, a false positive for cancer is less impactful than a false negative for cancer, since treatment is crucial if someone does have cancer.

Going back to our example with normal and spam emails:

**Precision** measures the proportion $\frac{\text{TP}}{\text{TP} + \text{FP}}$ of emails flagged as spam that are actually spam.

**Recall** measures the proportion $\frac{\text{TP}}{\text{TP} + \text{FN}}$ of spam emails that were correctly flagged as spam. 

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/Precisionrecall.svg/700px-Precisionrecall.svg.png" width="500px">

**A system with high recall but low precision returns many results, but most of its predicted labels are incorrect.**

**A system with high precision but low recall returns very few results, but most of its predicted labels are correct.**

**An ideal model yields high precision and high recall, which will return many results, with almost all results labeled correctly.**

Say we have 5 emails, whose true labels are the following: 1, 0, 0, 1, 1; here 1 is spam, and 0 is normal.

Our model outputs the following predicted labels, say, based on the words in the body of each email: 0, 1, 0, 1, 0.

In order to calculate the precision and recall, we need the number of true positives, false positives, and false negatives.
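
As a cross-check on counting by hand, scikit-learn can produce these counts and both ratios directly from the label arrays. The sketch below uses the same five labels listed above with `confusion_matrix`, `precision_score`, and `recall_score` from `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

true_label = np.array([1, 0, 0, 1, 1])
predicted_label = np.array([0, 1, 0, 1, 0])

# For binary 0/1 labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(true_label, predicted_label).ravel()

precision = precision_score(true_label, predicted_label)  # tp / (tp + fp)
recall = recall_score(true_label, predicted_label)        # tp / (tp + fn)

print(tp, fp, fn)         # 1 1 2
print(precision, recall)  # 0.5 and 1/3
```

Here only the fourth email is a true positive, the second is a false positive, and the first and fifth are false negatives, which gives a precision of 1/2 and a recall of 1/3.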

In [None]:
true_label = np.array([1, 0, 0, 1, 1])
predicted_label = np.array([0, 1, 0, 1, 0])

We can use basic array arithmetic to find each of the above.

In [1]:
true_positives = np.sum(true_label * predicted_label)
false_positives = np.count_nonzero((true_label - predicted_label) == -1)
false_negatives = np.count_nonzero((predicted_label - true_label) == -1)

print("true positives:\t\t{}".format(true_positives))
print("false positives:\t{}".format(false_positives))
print("false negatives:\t{}".format(false_negatives))


In [None]:
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)

print("precision:\t{}".format(precision))
print("recall:\t\t{}".format(recall))