# Using regularization to reduce overfitting

- Previously, we discussed the concepts of overfitting and underfitting.  Let's review.

- **Underfitting** is where our model is not powerful enough to capture 
all of the details of the training data.

  - For example, this could be using linear regression with not enough features (for instance, our
  data fits a quadratic curve but we don't include quadratic features of our data).

- **Overfitting** is where our model is too powerful, and so captures noise in the 
training data, or extraneous details of the data. 

  - For example, this could be using linear regression with too many features (for instance, our
  data fits a quadratic curve but we include too many higher-order polynomial features).
  
  - This can also happen when we include too many features in general: regression (or classification)
  can start finding trends in the data that are just coincidental and don't serve as good predictors
  in general.
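
The polynomial example above can be sketched in a few lines of NumPy. This is only an illustration under assumed conditions: the data set is made up (a quadratic trend plus random noise), and `train_mse` is a hypothetical helper, not something defined earlier. It fits polynomials of degree 1, 2, and 9 and reports the error on the training data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a quadratic trend plus noise.
x_train = np.linspace(-1.0, 1.0, 12)
y_train = 1.0 + 2.0 * x_train - 3.0 * x_train**2 + rng.normal(0.0, 0.3, x_train.size)


def train_mse(degree):
    """Fit a polynomial of the given degree by least squares; return the training MSE."""
    coeffs = np.polyfit(x_train, y_train, degree)
    residuals = np.polyval(coeffs, x_train) - y_train
    return float(np.mean(residuals**2))


mse_by_degree = {d: train_mse(d) for d in (1, 2, 9)}
for d, mse in mse_by_degree.items():
    print(f"degree {d}: training MSE = {mse:.4f}")
```

The training error can only go down as the degree goes up, because each higher-degree model contains the lower-degree ones. That is why a low training error by itself doesn't tell us the model is good: the degree-9 fit is partly fitting the noise.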

## What can we do to address overfitting?

Let's say you fit a model to your training set and you have overfitting.  What do you do?

- One solution is to collect more training data.  This will tend to smooth out any 
overfitting in your model, because the model has to take into account more data and can't
"wiggle around" to fit every training example anymore.

- This is probably the number-one best strategy to address overfitting.  But sometimes, 
this isn't an option.  Collecting more data can be expensive (in terms of money, time, or effort),
or there just might not be any more.

- Another strategy is to reduce the number of features you use.  If you have many features
but not a lot of data, overfitting is very likely (this is due to the way mathematics works in
high-dimensional spaces, which we will see later).  

- Reducing the number of features (e.g., by using your intuition to focus on only the few you think
have the most correlation with the variable you're trying to predict) is called **feature selection**.
But it's sometimes hard to do!

- Another problem with reducing the number of features is that sometimes you're forced to 
disregard useful features!  We will see later that there are algorithms to perform this feature
selection process for you.

## Regularization

Another strategy is called **regularization**.  The point of regularization is to reduce the size
(magnitude) of the parameters $w_j$ in the vector $\boldsymbol{w}$.

We do this because if you look at an overfit model, the parameters are often very large.

If you eliminate a feature, like feature selection does, that's equivalent to 
setting the corresponding parameter to zero.

Regularization is a way of "softening" this feature removal process, by not
setting a parameter to zero, but rather making sure it doesn't grow too large.  This allows us
to have the best of both worlds: we don't remove the feature entirely by setting the
parameter to zero, but we also don't let the parameter
grow to whatever gradient descent "thinks" it should be (which can lead to overfitting).

It turns out that even if you fit a high-order polynomial to a small data set, limiting the
size of the parameters will smooth out a lot of the "wiggliness" that you would see in an 
overfit model.

So regularization lets you keep all of your features, but prevents them
from having overly-large effects on your model.

By convention, we usually only regularize $w_j$ for $j \geq 1$; in other words, we don't regularize
$w_0$.  Remember $w_0$ is equivalent to our original $b$ parameter, and can be thought of 
in algebraic terms as the "y-intercept" on an x-y graph, and therefore doesn't usually need to
be regularized.  That said, it often doesn't hurt if you regularize it, so some people will do it anyway.

## Summary so far

To reduce overfitting:
  - Collect more data.
  - Select features, drop others.
  - Use regularization.

## Adjusting the cost function with regularization

Let's examine regularization by returning to linear regression (though the same principles apply to logistic regression).

Here's the cost function for linear regression:

$$J(w)   = \frac{1}{2m}\sum_{i=1}^m \left( f_w(x^{(i)})-{y}^{(i)} \right)^2$$

What regularization does is add the squares of the weights $w_j$ to the cost function,
scaled by an (often large) constant.  Because we are trying to minimize the cost function,
this penalizes large weights, and tends to cause gradient descent to keep
those weights small.

This tends to reduce overfitting.

This also works the same way if you have lots of unrelated features (rather than a series of
higher-order features all derived from the same piece of data).  If we penalize *all* the weights
for getting large, this will tend to keep them small, and make it less likely for our model 
to overfit.

We then don't have to worry about picking which features to include or exclude --- just regularize
everything and let gradient descent handle it.

### New cost function

$$J(w)  = \frac{1}{2m}\sum_{i=1}^m \left( f_w(x^{(i)})-{y}^{(i)} \right)^2 + \dfrac{\lambda}{2m}
\sum_{j=1}^n w_j^2$$

Note the new term at the end.

The $\lambda$ (lambda) is called the "regularization parameter."  Like $\alpha$, it's another constant
we have to pick a value for.  

We also divide $\lambda$ by $2m$, which is a little silly, because we could just build that factor into
the $\lambda$ constant itself, but it makes the derivative nicer later (the 2 cancels when we
differentiate $w_j^2$).  Scaling both pieces of the cost equation in the same fashion also makes it
easier to pick a good $\lambda$ in the first place.  The reason is that as the data set grows
(and $m$ gets larger), we usually want regularization to have less influence, and dividing by $m$
does that scaling for us.

Also note that the summation for the regularization term doesn't include $w_0$.
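
As a rough sketch, the new cost function could be computed like this in NumPy. It assumes that `X` is an $m \times (n+1)$ matrix whose first column is all ones, so that `w[0]` plays the role of $w_0$. The name `regularized_cost` and the tiny data set are made up for illustration.

```python
import numpy as np


def regularized_cost(X, y, w, lam):
    """Squared-error cost plus the regularization term.

    Assumes X has a leading column of ones, so w[0] is the intercept w_0.
    """
    m = X.shape[0]
    errors = X @ w - y
    fit_term = np.sum(errors**2) / (2 * m)
    reg_term = lam / (2 * m) * np.sum(w[1:] ** 2)  # the sum starts at j = 1, skipping w_0
    return fit_term + reg_term


# Hypothetical data that the weights below fit exactly (y = 2x).
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
w = np.array([0.0, 2.0])

cost_without = regularized_cost(X, y, w, lam=0.0)
cost_with = regularized_cost(X, y, w, lam=3.0)
print(cost_without, cost_with)
```

Because these weights fit the data perfectly, the first cost is zero. With $\lambda = 3$ the cost becomes $\frac{3}{2 \cdot 3} \cdot 2^2 = 2$, which comes entirely from the penalty on $w_1$.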

The two terms (components) of the cost equation have names: we call the left part the **mean squared
error** and the right part the **regularization term**.

These two terms have somewhat opposing goals when we minimize $J$:
  - Minimizing the mean squared error encourages gradient descent to find a model that fits the 
  data well.
  
  - Minimizing the regularization term encourages gradient descent to find a model that keeps
  the $w_j$'s small (and therefore reduces overfitting).
  
The $\lambda$ parameter controls the balance between these two opposing goals.

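As an example, if $\lambda = 0$ (or is too small), nothing holds the weights back and the model
is likely to overfit.  If $\lambda$ is too big, the weights are pushed toward zero no matter what
the data says, and the model will underfit.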
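
To see this balance in numbers, here is a small sketch that fits a degree-8 polynomial to ten made-up data points for three values of $\lambda$. To keep it short, it does not use gradient descent: it solves for the minimum of $J$ directly with a linear solve, which is a swapped-in technique for this illustration only. The data, the $\lambda$ values, and the helper `fit_regularized` are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: a quadratic trend plus noise.
x = np.linspace(-1.0, 1.0, 10)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0.0, 0.3, x.size)

# Columns are x^0, x^1, ..., x^8; column 0 is all ones (the w_0 column).
X = np.vander(x, N=9, increasing=True)


def fit_regularized(X, y, lam):
    """Solve (X^T X + lam * D) w = X^T y, where D is the identity with D[0, 0] = 0."""
    penalty = np.eye(X.shape[1])
    penalty[0, 0] = 0.0  # do not regularize w_0
    return np.linalg.solve(X.T @ X + lam * penalty, X.T @ y)


results = []
for lam in (0.0, 0.1, 100.0):
    w = fit_regularized(X, y, lam)
    fit_error = float(np.mean((X @ w - y) ** 2) / 2)  # left term of J
    weight_size = float(np.sum(w[1:] ** 2))           # sum of squared weights, j >= 1
    results.append((lam, fit_error, weight_size))
    print(f"lambda={lam:6.1f}  fit error={fit_error:.4f}  sum of w_j^2={weight_size:.4f}")
```

As $\lambda$ grows, the sum of squared weights shrinks while the error on the training data rises, which is the trade-off described above.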


## Math time!

## Regularized linear regression

When we add regularization to our cost function, nothing changes about the overall
method of approaching linear regression: we're still going to use gradient descent,
and so we still have to minimize our cost function $J$.

Find the parameter vector $w$ to minimize:

$$J(w)  = \frac{1}{2m}\sum_{i=1}^m \left( f_w(x^{(i)})-{y}^{(i)} \right)^2 + \dfrac{\lambda}{2m}
\sum_{j=1}^n w_j^2$$

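Taking the partial derivative of $J$ with respect to $w_j$ now picks up an extra $\frac{\lambda}{m}w_j$
from the regularization term, so our new update equation for gradient descent is: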
    
$$w_j = w_j - \alpha  \left[ \frac{1}{m} \sum_{i=1}^m  \left( f_{\boldsymbol{w}}(\boldsymbol{x}^{(i)}) - y^{(i)} \right)  x_j^{(i)} + \dfrac{\lambda}{m}w_j \right] \qquad \text{for $j>0$}$$


(for $w_0$ the equation remains the same as before, because we don't regularize $w_0$)
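
Here is a minimal NumPy sketch of one way to code this update. It assumes `X` has a leading column of ones so that `w[0]` is $w_0$. The names `gradient_step` and `run`, the toy data, and the values of `alpha` and `lam` are illustrative choices rather than anything fixed by the equations.

```python
import numpy as np


def gradient_step(X, y, w, alpha, lam):
    """One gradient descent update for regularized linear regression."""
    m = X.shape[0]
    errors = X @ w - y              # f_w(x^(i)) - y^(i) for every example
    grad = X.T @ errors / m         # unregularized gradient for every w_j
    grad[1:] += (lam / m) * w[1:]   # add (lambda/m) * w_j for j > 0 only
    return w - alpha * grad


# Hypothetical data lying exactly on the line y = 1 + 2x.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])


def run(lam, steps=5000, alpha=0.1):
    w = np.zeros(2)
    for _ in range(steps):
        w = gradient_step(X, y, w, alpha, lam)
    return w


w_plain = run(lam=0.0)
w_reg = run(lam=4.0)
print("lambda = 0:", w_plain)
print("lambda = 4:", w_reg)
```

With $\lambda = 0$ the weights converge to roughly $(1, 2)$, the line that generated the data. With $\lambda = 4$ the slope $w_1$ is pulled down to about $1.11$, and the unregularized $w_0$ rises to about $2.33$ to compensate.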

### A little mathematical explanation

Another way to write the $w_j$ update equation is:
    
$$w_j = w_j - \alpha\dfrac{\lambda}{m}w_j -  \alpha\frac{1}{m} \sum_{i=1}^m  \left( f_{\boldsymbol{w}}(\boldsymbol{x}^{(i)}) - y^{(i)} \right)  x_j^{(i)}  $$

which can be rewritten as

$$w_j = w_j\left(1 - \alpha\dfrac{\lambda}{m}\right) -  \alpha\frac{1}{m} \sum_{i=1}^m  \left( f_{\boldsymbol{w}}(\boldsymbol{x}^{(i)}) - y^{(i)} \right)  x_j^{(i)}  $$

It is interesting to think about that $\left(1 - \alpha\dfrac{\lambda}{m}\right)$ term: it
will usually be a decimal number slightly smaller than one.  Multiplying $w_j$ by a number like that
(e.g., 0.999) has the effect of slightly shrinking $w_j$ on each update.
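
As a quick worked example, take the illustrative values $\alpha = 0.01$, $\lambda = 1$, and $m = 50$ (these numbers are assumptions, chosen only to make the arithmetic easy):

$$1 - \alpha\dfrac{\lambda}{m} = 1 - 0.01 \cdot \dfrac{1}{50} = 1 - 0.0002 = 0.9998$$

A single update barely changes $w_j$, but the effect compounds.  Ignoring the gradient part of the update, 1000 updates would multiply $w_j$ by

$$0.9998^{1000} \approx 0.82$$

so the weight steadily decays unless the data keeps pushing it back up.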

## Regularized logistic regression

We apply the same regularization term to our cost function $J$ for logistic regression:

$$J(\boldsymbol{w}) = -\frac{1}{m}\sum_{i=1}^m  \left[ y^{(i)}\log\left( f_w(x^{(i)}) \right)+
        (1-y^{(i)})\log\left( 1-f_w(x^{(i)}) \right) \right] + \dfrac{\lambda}{2m}
\sum_{j=1}^n w_j^2$$

And similarly to linear regression, our new update equation changes to:
    
$$w_j = w_j - \alpha  \left[ \frac{1}{m} \sum_{i=1}^m  \left( f_{\boldsymbol{w}}(\boldsymbol{x}^{(i)}) - y^{(i)} \right)  x_j^{(i)} + \dfrac{\lambda}{m}w_j \right] \qquad \text{for $j>0$}$$

Just recall that $f$ here is the logistic (sigmoid) model, which is different from the $f$ we used for linear regression.
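
The following sketch shows how the cost and the update might look in NumPy for the logistic case. As before, it assumes `X` has a leading column of ones. The function names, the four-point data set, and the settings `alpha = 0.1` and `lam = 1.0` are hypothetical choices for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def logistic_cost(X, y, w, lam):
    """Cross-entropy cost plus the regularization term (w[0] is not penalized)."""
    m = X.shape[0]
    f = sigmoid(X @ w)
    cross_entropy = -np.mean(y * np.log(f) + (1 - y) * np.log(1 - f))
    return cross_entropy + lam / (2 * m) * np.sum(w[1:] ** 2)


def logistic_step(X, y, w, alpha, lam):
    """One gradient descent update; same form as linear regression, but f is the sigmoid."""
    m = X.shape[0]
    errors = sigmoid(X @ w) - y
    grad = X.T @ errors / m
    grad[1:] += (lam / m) * w[1:]   # regularize w_j for j > 0 only
    return w - alpha * grad


# Hypothetical data: negative x values are class 0, positive x values are class 1.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w = np.zeros(2)
costs = [logistic_cost(X, y, w, lam=1.0)]
for _ in range(200):
    w = logistic_step(X, y, w, alpha=0.1, lam=1.0)
    costs.append(logistic_cost(X, y, w, lam=1.0))

print("final weights:", w)
print("first and last cost:", costs[0], costs[-1])
```

The only line that differs from the linear regression update is the one computing `errors`, which passes `X @ w` through the sigmoid first.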