# Regularization, Cross Validation, Overfitting/Underfitting

## Regularization
Recall from Lesson 2 that linear regression minimizes the following cost function:

$$
J(\vec{\theta}) = \frac{1}{2m} \sum_{i = 1}^{m}{(y^{(i)} - \vec{\theta}^T \mathbf{x}^{(i)})^2}
$$
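
To make the formula concrete, here is a minimal NumPy sketch that evaluates $J(\vec{\theta})$. The function name `cost` and the tiny dataset are made up for illustration; the first column of `X` is assumed to be a constant 1 so that the first parameter acts as an intercept.

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum over examples of (y_i - theta^T x_i)^2."""
    m = len(y)
    residuals = y - X @ theta
    return float(residuals @ residuals) / (2 * m)

# Made-up data that lies exactly on the line y = 1 + 2x.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

print(cost(np.array([1.0, 2.0]), X, y))  # parameters that fit the data exactly
print(cost(np.array([0.0, 0.0]), X, y))  # all-zero parameters fit poorly
```

The first call returns zero because every residual vanishes; the second returns $(1 + 9 + 25) / 6$.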

The cost function can be optimized with the normal equation or with the more generally applicable gradient descent algorithm. Let us ignore the normal equation and suppose we do all cost minimization with gradient descent.

The entries of the $\vec{\theta}$ vector are called the **parameters** of the machine learning model because they are found by minimizing the cost function.

The polynomial degree, on the other hand, is called a **hyperparameter** because we cannot determine its optimal value by minimizing the cost function - it must be provided before the minimization.

**Exercise 1: Can you think of any other hyperparameters that we need to set to fit a linear regression model? Hint: What is the formula for gradient descent?**

The way we solved this problem was by using a validation set and trying all the polynomial degrees from a set $H$ (we chose $H = \{1, \dots, 20\}$). Using a validation set is a generally applicable technique for picking hyperparameters, but it has its limitations. Namely, it is computationally expensive.

**Exercise 2: Assume that the average compute time required to minimize the cost function is $C$. What is the compute time required to find the optimal polynomial degree, assuming we try all degrees in the set $H_{degree}$?**

To get around this problem, we can use a technique called **regularization**, wherein we modify the cost function to encode our preference for simpler models. That is, instead of minimizing $J(\vec{\theta})$, we minimize:

$$
\tilde{J}(\vec{\theta}) = \alpha \Omega(\vec{\theta}) + J(\vec{\theta})
$$

where $\Omega(\vec{\theta})$ measures how "complicated" our model is and $\alpha \geq 0$ is a hyperparameter that controls how heavily we penalize that complexity. Thus, the model is forced to minimize the original cost function without becoming too complicated.

With this framework in hand, we now need to figure out what particular $\Omega(\vec{\theta})$ we would like to use.

### L1 Regularization
What does it mean for a polynomial to be simple? One interpretation is that most of its coefficients are zero. One popular way to encode this preference is using something called **L1-regularization**. That is, we set:

$$
\Omega(\vec{\theta}) = ||\vec{\theta}||_1 = \sum_{j=1}^{d}|\theta_j|
$$

Linear regression with this regularization function is called **Lasso** regression.
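
As a rough illustration, the sketch below evaluates the L1-regularized cost for two parameter vectors that fit the same data equally well. The names `l1_penalty` and `lasso_cost`, the data, and the value `alpha=0.1` are assumptions made for this example; it also penalizes every entry of the parameter vector, which is one possible convention.

```python
import numpy as np

def l1_penalty(theta):
    """Omega(theta) = sum of |theta_j| (the L1 norm)."""
    return float(np.sum(np.abs(theta)))

def lasso_cost(theta, X, y, alpha):
    """alpha * ||theta||_1 + J(theta), with J the squared-error cost."""
    m = len(y)
    residuals = y - X @ theta
    return alpha * l1_penalty(theta) + float(residuals @ residuals) / (2 * m)

# Made-up data with two identical features, so several theta fit it exactly.
X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

sparse_theta = np.array([2.0, 0.0])    # one coefficient is exactly zero
tangled_theta = np.array([3.0, -1.0])  # same predictions, larger coefficients

print(lasso_cost(sparse_theta, X, y, alpha=0.1))
print(lasso_cost(tangled_theta, X, y, alpha=0.1))
```

Both vectors have zero squared error, so the difference in cost comes entirely from the penalty term, which is smaller for the sparse vector.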

### L2 Regularization
Another interpretation of simplicity is that the model's coefficients should not be too large in magnitude. One popular way to encode this preference is through **L2-regularization**. That is, we set:

$$
\Omega(\vec{\theta}) = ||\vec{\theta}||_2^2 = \sum_{j=1}^{d}|\theta_j|^2
$$

Linear regression with this regularization function is called **Ridge** regression.
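
The following sketch minimizes the L2-regularized cost with gradient descent for a few values of $\alpha$, to show how a larger penalty shrinks the coefficients. The function names, the data, and the settings `step_size=0.05` and `num_steps=5000` are assumptions chosen for this toy problem, not recommended defaults.

```python
import numpy as np

def ridge_gradient(theta, X, y, alpha):
    """Gradient of alpha * ||theta||_2^2 + J(theta) with respect to theta."""
    m = len(y)
    return 2 * alpha * theta - X.T @ (y - X @ theta) / m

def fit_ridge(X, y, alpha, step_size=0.05, num_steps=5000):
    """Batch gradient descent on the regularized cost, starting from zeros."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_steps):
        theta = theta - step_size * ridge_gradient(theta, X, y, alpha)
    return theta

# Made-up data with two identical features.
X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

norms = []
for alpha in [0.0, 0.1, 1.0]:
    theta = fit_ridge(X, y, alpha)
    norms.append(float(np.linalg.norm(theta)))
    print(alpha, theta, norms[-1])
```

On this data the norm of the fitted parameter vector decreases as `alpha` grows, which is the trade-off the penalty term is meant to encode.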

### L1 and L2 Regularization
If we combine Ridge regression and Lasso regression together, we get something called **Elastic Net** regression, which lets us trade off between the two forms of regularization.

**Exercise 3: Let $\lambda_1 \in [0, 1]$ and $\lambda_2 \in [0, 1]$ be hyperparameters indicating how much L1 and L2 regularization we want, respectively. Can you write an expression for $\tilde{J}(\vec{\theta})$ that combines both L1 and L2 regularization with the original cost function $J(\vec{\theta})$?**

### Parting Thoughts
Usually, the best models are ones that have high capacity, but are well regularized.

## Cross Validation
Most learning algorithms have hyperparameters. For example:

1. The $\lambda_1$ and $\lambda_2$ in Elastic Net regression
2. The depth of a decision tree model
3. The learning rate of gradient descent

We saw how we can pick a hyperparameter with a validation set. To refresh your memory, we split the dataset into two pieces: a training set and a validation set. Typically, the training set is 60-90% of the data. We run our optimization algorithm (e.g. gradient descent) on the training set for each candidate hyperparameter setting and select the setting that performs best on the validation set.
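
A minimal sketch of this procedure for choosing a polynomial degree is shown below. The synthetic data, the 80/20 split, and the candidate degrees are assumptions for illustration, and `np.polyfit` (a least-squares solver) stands in for gradient descent to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (made up): a noisy sine curve.
x = rng.uniform(-1, 1, size=100)
y = np.sin(3 * x) + rng.normal(scale=0.1, size=100)

# Shuffle the example indices, then keep 80% for training.
indices = rng.permutation(len(x))
n_train = int(0.8 * len(x))
train_idx, val_idx = indices[:n_train], indices[n_train:]

def validation_error(degree):
    """Fit on the training set only, then compute the cost on the validation set."""
    coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
    predictions = np.polyval(coeffs, x[val_idx])
    return float(np.mean((y[val_idx] - predictions) ** 2) / 2)

degrees = range(1, 11)
errors = {degree: validation_error(degree) for degree in degrees}
best_degree = min(errors, key=errors.get)
print(best_degree, errors[best_degree])
```

Note that the validation examples are never passed to `np.polyfit`; they are used only to compare the candidate degrees.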

The problem with this approach, however, is that we do not get to train on the validation set, which seems like a waste of data. To get around this issue, we can use a technique called **k-fold cross validation**. Basically, the algorithm works like this:


1. Split the dataset into $k$ pieces.
2. Pick some setting of the hyperparameters - we want to see how well these hyperparameters do.
3. For each of the $k$ pieces, train the model using all of the pieces EXCEPT the given piece, and then evaluate the trained model on the given piece - call the result of the evaluation on the $i$-th piece $M_i$.
4. Average the evaluation metric over all $k$ pieces: $\frac{1}{k} \sum_{i=1}^{k}{M_i}$

With the above technique, we are able to use all of the data for both training and validation! If $k$ = number of training examples, we call it **leave-one-out cross validation**.
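
The steps above can be sketched as follows for one hyperparameter, the polynomial degree. The function name `k_fold_score`, the synthetic data, and the use of `np.polyfit` as the training step are assumptions for this example.

```python
import numpy as np

def k_fold_score(x, y, degree, k=5, seed=0):
    """Average validation cost of a degree-`degree` polynomial over k folds."""
    rng = np.random.default_rng(seed)
    # Step 1: split the (shuffled) example indices into k pieces.
    folds = np.array_split(rng.permutation(len(x)), k)
    scores = []
    for i in range(k):
        # Step 3: train on every piece except piece i, evaluate on piece i.
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        predictions = np.polyval(coeffs, x[val_idx])
        scores.append(np.mean((y[val_idx] - predictions) ** 2) / 2)
    # Step 4: average the k evaluation results.
    return float(np.mean(scores))

# Synthetic data (made up): a noisy sine curve.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=100)
y = np.sin(3 * x) + rng.normal(scale=0.1, size=100)

# Step 2: try several settings of the hyperparameter.
for degree in range(1, 7):
    print(degree, k_fold_score(x, y, degree, k=5))
```

Passing `k=len(x)` to the same function gives leave-one-out cross validation.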

**Exercise 4: k-fold Cross Validation is a great way to use the entire dataset for training and validation. However, it doesn't come for free. What is one disadvantage that k-fold Cross Validation has compared to a static train-validation split?**

## Grid Search and Random Search
Suppose that we have two hyperparameters, $h_1$ and $h_2$. We have decided on some possible values to try for $h_1$ - call this set $H_1$ (define $H_2$ analogously). How do we select the $(h_1, h_2)$ pair that we want to evaluate with cross validation?

One approach is to use **grid search**. This is when we consider all possible pairs $(h_1, h_2)$ where $h_1 \in H_1$ and $h_2 \in H_2$, one after the other.
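
A short sketch of grid search is given below. The function `validation_error` is a hypothetical stand-in for running cross validation on a real model, and the candidate sets `H1` and `H2` are made up; only the enumeration of pairs is the point here.

```python
import math
from itertools import product

def validation_error(degree, alpha):
    """Hypothetical stand-in for a cross-validation score (lower is better)."""
    return (degree - 3) ** 2 + (math.log10(alpha) + 2) ** 2

H1 = [1, 2, 3, 4, 5]           # candidate values for h_1 (e.g. polynomial degree)
H2 = [1e-3, 1e-2, 1e-1, 1.0]   # candidate values for h_2 (e.g. regularization weight)

# Consider every pair (h_1, h_2), one after the other.
grid = list(product(H1, H2))
best_pair = min(grid, key=lambda pair: validation_error(*pair))
print(best_pair)
```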

This approach is not ideal, however. For example, suppose there is one choice of $h_1 \in H_1$ which is so bad that it completely ruins the algorithm's performance. We will end up using this value $|H_2|$ times, which is just wasted computation. For this reason, we consider another approach...

A better alternative to grid search is **random search**. This is when we randomly sample $h_1$ from $H_1$ and randomly sample $h_2$ from $H_2$ and evaluate the pair $(h_1, h_2)$. Then, we sample again and repeat.
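
Random search can be sketched in the same spirit. Here `validation_error` is again a hypothetical stand-in for cross validation, and `num_trials` is an assumed budget that we choose ourselves, independent of the sizes of $H_1$ and $H_2$.

```python
import math
import random

def validation_error(degree, alpha):
    """Hypothetical stand-in for a cross-validation score (lower is better)."""
    return (degree - 3) ** 2 + (math.log10(alpha) + 2) ** 2

def random_search(H1, H2, num_trials, seed=0):
    """Sample (h_1, h_2) pairs at random and keep the best one seen so far."""
    rng = random.Random(seed)
    best_pair, best_error = None, math.inf
    for _ in range(num_trials):
        pair = (rng.choice(H1), rng.choice(H2))
        error = validation_error(*pair)
        if error < best_error:
            best_pair, best_error = pair, error
    return best_pair, best_error

H1 = [1, 2, 3, 4, 5]
H2 = [1e-3, 1e-2, 1e-1, 1.0]
print(random_search(H1, H2, num_trials=8))
```

With a small budget the result is not guaranteed to be the best pair in the grid; the trade-off is that the number of evaluations is fixed in advance.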

**Exercise 5: Suppose that we have $h$ hyperparameters, where each one has $S$ possible values. How many hyperparameter configurations will grid search evaluate? If we use $k$-fold cross validation to evaluate a hyperparameter setting, how many times will we have to train a machine learning model to pick the best hyperparameters?**

## Overfitting and Underfitting
The key to good machine learning is the ability to **generalize**. That is, we want to develop a model that will perform well on examples that it has not seen before.

However, there are two ways that our model could fail to do well on new data.

**Overfitting** is the situation where our model has fit the training data too tightly in such a way that it has not learned patterns that will perform well on new data. A model that overfits is said to have **high variance**.

**Underfitting** is the situation where our model has not even fit the training data well, which will cause it to perform poorly on new data. A model that underfits is said to have **high bias**.
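
One way to see both failure modes is to compare training and validation error for polynomials of different degree. The sketch below uses made-up data from a noisy sine curve, and the degrees 1, 4, and 15 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_function(x):
    return np.sin(3 * x)

# Small training set and a larger validation set (both synthetic).
x_train = np.linspace(-1, 1, 20)
y_train = true_function(x_train) + rng.normal(scale=0.2, size=x_train.shape)
x_val = np.linspace(-0.95, 0.95, 200)
y_val = true_function(x_val) + rng.normal(scale=0.2, size=x_val.shape)

def half_mse(coeffs, x, y):
    """The cost J evaluated for a fitted polynomial."""
    return float(np.mean((y - np.polyval(coeffs, x)) ** 2) / 2)

train_error, val_error = {}, {}
for degree in (1, 4, 15):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_error[degree] = half_mse(coeffs, x_train, y_train)
    val_error[degree] = half_mse(coeffs, x_val, y_val)
    print(degree, train_error[degree], val_error[degree])
```

A straight line has high error on both sets, which is the signature of underfitting. The training error can only go down as the degree increases, so a high-degree fit whose validation error is much larger than its training error is a sign of overfitting.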

**Exercise 6: For each of the following problems, indicate whether the problem will cause underfitting or overfitting**

1. Training set is too small
2. Model is incredibly complicated (i.e. has a ton of parameters)
3. Training data looks nothing like the testing data (i.e. they are drawn from different distributions)
4. We only run for a few steps of gradient descent
5. Model does not use regularization

**Exercise 7: MNIST is a popular dataset where the goal is to examine handwritten digits and identify the digit (0-9) that is indicated in the image. For decades, people have trained models on MNIST's training set and evaluated their models on MNIST's test set. However, if you take the state-of-the-art model on MNIST and apply it to regular handwriting recognition, it won't perform nearly as well as it does on MNIST's test set. How is this possible? After all, none of the researchers ever set their parameters or hyperparameters using the MNIST test set.**

## Solutions

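**Exercise 1**: The step size (learning rate) of gradient descent, often written $\alpha$, is another hyperparameter that we must supply. Note that this is a different quantity from the regularization weight $\alpha$ in $\tilde{J}(\vec{\theta})$.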

**Exercise 2**: $|H_{degree}| \cdot C$

**Exercise 3**: $\tilde{J}(\vec{\theta}) = \lambda_1 ||\vec{\theta}||_1 + \lambda_2 ||\vec{\theta}||_2^2 + J(\vec{\theta})$

**Exercise 4**: Although you get to train on all of your data, the price of cross validation is more computation. Even with the simplest choice, $k = 2$, you have to train the model twice instead of once. In general you train the model $k$ times, once for each held-out piece, so a larger $k$ makes evaluating each hyperparameter setting proportionally more expensive than a single train-validation split.

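**Exercise 5**: Grid search will evaluate $S^h$ hyperparameter configurations, since each of the $h$ hyperparameters independently takes one of $S$ values. With $k$-fold cross validation, each configuration requires training the model $k$ times, so we train the model $k \cdot S^h$ times in total - that's a lot of computation!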

**Exercise 6**:
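1. Overfitting - With too few training examples, the model can fit the quirks and noise of the particular examples it saw instead of patterns that generalize, so it does well on the training set but poorly on the validation set.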
2. Overfitting - A model with lots of parameters has enough capacity to tailor those parameters to the particular training set, especially when there is not enough data to pin all of them down to the right settings for the general case.
3. Overfitting - The model will be fit to the distribution of the training data, and will therefore perform poorly on the new data.
4. Underfitting - Gradient descent takes time to converge on local minima. If it's only been run for a few steps, it may not have had time to converge, and will perform poorly on both the training and testing data.
5. Overfitting - If the model is not regularized, it may fit the data too tightly (as there is no cost to fitting it tighter). We could instead control complexity only by searching over hyperparameters such as the polynomial degree, but that takes more compute time than adding a penalty term to the cost function.

**Exercise 7**:
Although the researchers may not have been trying to fit MNIST, it's possible that the MNIST data does not accurately portray real-life data. It's possible they had bias when collecting data (for example, collecting handwriting from only one area, or from only well educated people) or just that they were unable to collect enough varied samples.