# Regularization, Cross Validation

## Regularization
Recall from Lesson 2 that linear regression minimizes the following cost function:

$$
J(\vec{\theta}) = \frac{1}{2m} \sum_{i = 1}^{m}{(y^{(i)} - \vec{\theta}^T \mathbf{x}^{(i)})^2}
$$
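
As a concrete reference, here is a minimal sketch of this cost function in plain Python. The helper names `predict` and `cost` and the toy data are assumptions made for illustration; they are not part of Lesson 2.

```python
def predict(theta, x):
    """Linear model prediction: the dot product theta^T x."""
    return sum(t * v for t, v in zip(theta, x))


def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum over examples of (y_i - theta^T x_i)^2."""
    m = len(y)
    return sum((y_i - predict(theta, x_i)) ** 2 for x_i, y_i in zip(X, y)) / (2 * m)


# Toy data (assumed): three examples with two features each,
# generated exactly by theta = [2, 3].
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [2.0, 3.0, 5.0]

print(cost([2.0, 3.0], X, y))  # 0.0, since these parameters fit the data exactly
print(cost([0.0, 0.0], X, y))  # (4 + 9 + 25) / 6, roughly 6.33
```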

The cost function can be optimized with the normal equation or with the more generally applicable gradient descent algorithm. Let us ignore the normal equation and suppose we do all cost minimization with gradient descent.

The entries of the $\vec{\theta}$ vector are called the **parameters** of the machine learning model because they are found by minimizing the cost function.

The polynomial degree, on the other hand, is called a **hyperparameter** because we cannot determine its optimal value by minimizing the cost function; it must be fixed before the minimization begins.

**Exercise 1: Can you think of any other hyperparameters that we need to set to fit a linear regression model?
Hint: What is the formula for gradient descent?**

We solved the problem of choosing the polynomial degree by using a validation set and trying all the polynomial degrees from a set $H$ (we chose $H = \{1, ..., 20\}$). Using a validation set is a generally applicable technique for picking hyperparameters, but it has a limitation: it is more computationally expensive than fitting a single model.
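
The validation-set procedure described above can be sketched as a simple loop over the candidate values. This is only a sketch: `fit` and `validation_error` are hypothetical callables standing in for whatever training and scoring routines you already have, and the function name `select_hyperparameter` is likewise an assumption.

```python
def select_hyperparameter(candidates, fit, validation_error):
    """Return the candidate whose fitted model has the lowest validation error.

    fit(h) trains a model on the training set with hyperparameter value h.
    validation_error(model) scores that model on the held-out validation set.
    Both are hypothetical callables supplied by the caller.
    """
    best_value, best_error = None, float("inf")
    for h in candidates:
        model = fit(h)                   # minimize the cost function for this h
        error = validation_error(model)  # evaluate on data not used for fitting
        if error < best_error:
            best_value, best_error = h, error
    return best_value
```

With $H = \{1, ..., 20\}$, the call would look like `select_hyperparameter(range(1, 21), fit, validation_error)`.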

**Exercise 2: Assume that the average compute time required to minimize the cost function is $C$. What is the compute time required to find the optimal polynomial degree, assuming we try all degrees in the set $H$?**

To get around this problem, we can use a technique called **regularization**, wherein we modify the cost function to encode our preference for simpler models. That is, instead of minimizing $J(\vec{\theta})$, we minimize:

$$
\tilde{J}(\vec{\theta}) = \alpha \Omega(\vec{\theta}) + J(\vec{\theta})
$$

where $\Omega(\vec{\theta})$ measures how "complicated" our model is and $\alpha \geq 0$ is a hyperparameter that controls how strongly complexity is penalized. The minimization must therefore trade off fitting the training data against keeping the model simple.
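
To make the structure of $\tilde{J}$ concrete, here is a small sketch in which the complexity measure is passed in as a function. The names `regularized_cost` and `count_nonzero` and the toy data are assumptions for illustration; `count_nonzero` is only a placeholder for $\Omega$, not a recommended choice.

```python
def cost(theta, X, y):
    """Original cost J(theta) = 1/(2m) * sum of squared prediction errors."""
    m = len(y)
    return sum(
        (y_i - sum(t * v for t, v in zip(theta, x_i))) ** 2
        for x_i, y_i in zip(X, y)
    ) / (2 * m)


def regularized_cost(theta, X, y, alpha, omega):
    """J_tilde(theta) = alpha * Omega(theta) + J(theta)."""
    return alpha * omega(theta) + cost(theta, X, y)


def count_nonzero(theta):
    """Placeholder complexity measure (assumed): number of nonzero parameters."""
    return sum(1 for t in theta if t != 0)


# Toy data (assumed), generated exactly by theta = [2, 3].
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [2.0, 3.0, 5.0]

print(regularized_cost([2.0, 3.0], X, y, 0.0, count_nonzero))  # 0.0, alpha = 0 recovers J
print(regularized_cost([2.0, 3.0], X, y, 0.5, count_nonzero))  # 1.0 = 0.5 * 2 + 0.0
```

Setting `alpha` to zero recovers the original cost, while larger values make the penalty term dominate.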

With this framework in hand, we now need to figure out what particular $\Omega(\vec{\theta})$ we would like to use.

### L1 Regularization
What does it mean for a polynomial to be simple? One interpretation is that most of its coefficients are zero. One popular way to encode this preference is using something called **L1-regularization**. That is, we set:

$$
\Omega(\vec{\theta}) = ||\vec{\theta}||_1 = \sum_{j=1}^{d}|\theta_j|
$$

Linear regression with this regularization function is called **Lasso** regression.
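
As a quick illustration, the L1 penalty and the resulting Lasso cost can be computed as follows. The function names `l1_penalty` and `lasso_cost`, the toy data, and the value `alpha = 0.1` are assumptions chosen for the example.

```python
def l1_penalty(theta):
    """Omega(theta) = ||theta||_1, the sum of absolute parameter values."""
    return sum(abs(t) for t in theta)


def lasso_cost(theta, X, y, alpha):
    """alpha * ||theta||_1 + J(theta), with J the squared-error cost."""
    m = len(y)
    squared_errors = sum(
        (y_i - sum(t * v for t, v in zip(theta, x_i))) ** 2
        for x_i, y_i in zip(X, y)
    )
    return alpha * l1_penalty(theta) + squared_errors / (2 * m)


# Toy data (assumed), generated exactly by theta = [2, 3].
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [2.0, 3.0, 5.0]

print(l1_penalty([2.0, -3.0, 0.0]))         # 5.0
print(lasso_cost([2.0, 3.0], X, y, 0.1))    # penalty 0.1 * 5 plus a data-fit cost of 0
```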

### L2 Regularization
Another interpretation of simplicity is that the model's coefficients should not be too large in magnitude. One popular way to encode this preference is through **L2-regularization**. That is, we set:

$$
\Omega(\vec{\theta}) = ||\vec{\theta}||_2^2 = \sum_{j=1}^{d}|\theta_j|^2
$$

Linear regression with this regularization function is called **Ridge** regression.
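
Because the L2 penalty is differentiable, Ridge regression can be fit with ordinary gradient descent. The sketch below assumes the penalty is weighted by $\alpha$ as in $\tilde{J}$ and applies to every entry of $\vec{\theta}$; the function names, the toy data, the learning rate, and the step count are all illustrative assumptions.

```python
def ridge_gradient(theta, X, y, alpha):
    """Gradient of alpha * ||theta||_2^2 + J(theta) with respect to theta."""
    m = len(y)
    grad = [2 * alpha * t for t in theta]  # gradient of the L2 penalty
    for x_i, y_i in zip(X, y):
        residual = sum(t * v for t, v in zip(theta, x_i)) - y_i
        for j, v in enumerate(x_i):
            grad[j] += residual * v / m    # gradient of J
    return grad


def fit_ridge(X, y, alpha, learning_rate=0.1, steps=2000):
    """Plain gradient descent on the Ridge cost, starting from theta = 0."""
    theta = [0.0] * len(X[0])
    for _ in range(steps):
        grad = ridge_gradient(theta, X, y, alpha)
        theta = [t - learning_rate * g for t, g in zip(theta, grad)]
    return theta


# Toy data (assumed), generated exactly by theta = [2, 3].
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [2.0, 3.0, 5.0]

theta_plain = fit_ridge(X, y, alpha=0.0)  # approaches [2, 3]
theta_ridge = fit_ridge(X, y, alpha=1.0)  # coefficients are pulled toward zero
print(theta_plain, theta_ridge)
```

On this toy data, the regularized fit ends up with noticeably smaller coefficients than the unregularized one, which is the behavior the L2 penalty is meant to encourage.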

### L1 and L2 Regularization
If we combine Ridge regression and Lasso regression, we get something called **Elastic Net** regression, which lets us trade off between the two forms of regularization.

**Exercise 3: Let $\lambda_1 \in [0, 1]$ and $\lambda_2 \in [0, 1]$ be hyperparameters indicating how much L1 and L2 regularization we want, respectively. Can you write an expression for $\tilde{J}(\vec{\theta})$ that combines both L1 and L2 regularization with the original cost function $J(\vec{\theta})$?**

## Cross Validation
In some cases, we cannot 
