# Intuition for Regularization

Okay, so you understand regression like a champion, but something tells you that these models are not a cure-all that explains all data. Big data sets are notorious for growing complex and producing extreme explanations for the data they describe. Even if these extreme explanations fit the data better than a simpler model, they are less likely to generalize. Such models explain data from the past well, but are less likely to explain future data they have never seen before. This issue, as you have probably heard, is called overfitting.

Take this trivial example: 

In [1]:
import pandas as pd 
import numpy as np

index = ['Bo', 'Spot', 'Mo', 'Lefty', 'Curly', 'Tom', 'Jerry', 'Sylvester']
columns = ['animal', 'size', 'label']
data = np.array([['snake', 'dog', 'cat', 'cat', 'dog', 'snail', 'dog', 'cat'],
                 ['small', 'small', 'small', 'small', 'large', 'small', 'large', 'large'], 
                ['friendly', 'friendly', 'enemy', 'friendly', 'friendly', 'friendly', 'enemy', 'enemy']]).T
df = pd.DataFrame(data=data, index=index, columns=columns)
df

| | animal | size | label |
|---|---|---|---|
| Bo | snake | small | friendly |
| Spot | dog | small | friendly |
| Mo | cat | small | enemy |
| Lefty | cat | small | friendly |
| Curly | dog | large | friendly |
| Tom | snail | small | friendly |
| Jerry | dog | large | enemy |
| Sylvester | cat | large | enemy |


Our model builds the rule:

    "pets with up to four-letter names are enemies, as are large dogs with names beginning with 'J', except that small snakes are not enemies"

The model is dumb. Even when a rule like this fits the training data closely, it's clunky. Intuitively, we know it will not generalize to all the pets in the real world. There must be a simpler rule. Consider this:

    "large dogs and cats are enemies"

It does not fit the data perfectly, but it fits reasonably well. It's simpler, and we may suppose that this rule will hold up better across thousands more examples.
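To put a number on "reasonably well", here is a small sketch that rebuilds the table and scores the simpler rule. It reads the rule as "large dogs and large cats are enemies", which is one possible interpretation, and the helper name `simple_rule` is made up for illustration.

```python
import pandas as pd

# Same toy data as the table above
df = pd.DataFrame(
    {
        "animal": ["snake", "dog", "cat", "cat", "dog", "snail", "dog", "cat"],
        "size": ["small", "small", "small", "small", "large", "small", "large", "large"],
        "label": ["friendly", "friendly", "enemy", "friendly",
                  "friendly", "friendly", "enemy", "enemy"],
    },
    index=["Bo", "Spot", "Mo", "Lefty", "Curly", "Tom", "Jerry", "Sylvester"],
)

def simple_rule(row):
    """Hypothetical helper: large dogs and large cats are enemies."""
    is_large_dog_or_cat = row["size"] == "large" and row["animal"] in ("dog", "cat")
    return "enemy" if is_large_dog_or_cat else "friendly"

predictions = df.apply(simple_rule, axis=1)
accuracy = (predictions == df["label"]).mean()
print(f"accuracy: {accuracy:.2f}")  # 6 of 8 pets are classified correctly
```

Under this reading the rule misses only Mo (a small cat that is an enemy) and Curly (a large dog that is friendly).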

Regularization prevents this overfitting. Loosely, regularization is a technique to discourage complexity and improve generalization, and it sometimes comes at the cost of accuracy on the training data.

"Okay," you say. "I'm ready for level 2."

### Underfitting 
First, a quick note on underfitting. This is the condition where our model is not able to find the relationship between inputs and outputs: it cannot find any correlation between a pet's characteristics and its friendliness. Underfitting is easy to catch: just look at the classification accuracy on the training data. If the accuracy is low, the model may be too simple, and therefore underfit. Increase the complexity (the number of parameters in the model) to get a better fit. As we will see, the strength of regularization matters here too: a penalty that is too strong can itself cause underfitting.

### Overfitting 
On the other end of the spectrum, overfitting is more difficult to detect. You must watch for a model that performs _too_ well on the training data and poorly on the testing data.

Regularization will help the model find a happy medium between these two states, and minimize the overall prediction error. It tweaks the model to avoid being too simple, avoid being too complex, and find something juuuuust right. 

<img src="https://raw.githubusercontent.com/momonala/DS_tutorials/master/files/bias_variance_tradeoff_e.jpg">
<center>
    ** Figure 1 **
</center>
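The trade-off in Figure 1 can be reproduced on a toy problem. The sketch below is only an illustration under made-up assumptions (noisy samples of a sine wave, polynomial models of increasing degree standing in for "model complexity"); it prints training and testing error side by side so the gap between them is visible.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of a sine wave (a made-up toy problem)."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = make_data(15)
x_test, y_test = make_data(200)

def mse(coeffs, x, y):
    """Mean squared error of a polynomial with the given coefficients."""
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

train_err, test_err = {}, {}
for degree in (1, 3, 9):  # higher degree = more complex model
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err[degree] = mse(coeffs, x_train, y_train)
    test_err[degree] = mse(coeffs, x_test, y_test)
    print(f"degree {degree}: train MSE {train_err[degree]:.3f}, "
          f"test MSE {test_err[degree]:.3f}")
```

Training error can only go down as the degree increases, so it says little on its own. The number to watch is the gap between training and testing error.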

### Regularization 

So how do we implement regularization? It is achieved by penalizing the weights in our model. Let's take a high-level look at our model: 

<img src="https://raw.githubusercontent.com/momonala/DS_tutorials/master/files/model_matrix.png" width='600'>
<center>
    ** Figure 2 **
</center>

In this infographic, there are four components to our model: 

    1) x: The input data, with i (height) data points and j (width) features
    2) y: The labels 
    3) ŷ: The predictions (what our model outputs)
    4) W: The weights, one coefficient for each of the "j" features

<img src='https://raw.githubusercontent.com/momonala/DS_tutorials/master/files/prediction_eq.png'>



In mathematical terms, our prediction, ŷ, is the weighted sum of each data point's features. 
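As a minimal sketch of that weighted sum, with made-up numbers and the intercept (bias) term left out for brevity:

```python
import numpy as np

# Made-up numbers: i = 3 data points, j = 2 features
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
W = np.array([0.5, -1.0])  # one weight per feature

# Each prediction is the weighted sum of one row's features
y_hat = X @ W
print(y_hat)  # [-1.5 -2.5 -3.5]
```

The first prediction, for example, is 0.5 * 1.0 + (-1.0) * 2.0 = -1.5.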

### Penalizing Weights

Regularization penalizes the weights, W, of our model above. How does this help? In Figure 1 we learned that, after a point, increasing model complexity is bad for prediction accuracy. A complex model overfits by contouring to very small deviations in the data. It turns out that **the size of the coefficient weights, W, tends to grow rapidly with model complexity**.  

The magnitude of a coefficient represents the emphasis we are putting on that feature as a predictor. If we think animal size is important to friendliness, we will give that feature a large weight. When a weight becomes too large, the algorithm starts creating intricate relations to model the output, which is not desirable. Another way to think of it is that features with large weights can almost singlehandedly control the output prediction. Think about that first model for animal friendliness (the dumb one). 

Penalizing the weights constrains them, and helps reduce model complexity. 
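To see the link between complexity and weight size, here is a rough sketch on made-up data (noisy samples of a sine wave). It fits polynomials of increasing degree without any penalty and prints the largest coefficient magnitude for each. The data and the choice of polynomial models are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

largest_weight = {}
for degree in (1, 3, 9):  # higher degree = more complex model
    coeffs = np.polyfit(x, y, degree)  # ordinary least squares, no penalty
    largest_weight[degree] = np.max(np.abs(coeffs))
    print(f"degree {degree}: largest |weight| = {largest_weight[degree]:.1f}")
```

The more flexible fits need much larger coefficients to bend around the noise, which is exactly what a penalty on W pushes back against.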

# L2 Ridge Regression

The L2 penalty, used in Ridge regression, is the most common type of regularization penalty. It is calculated as the sum of the squares of the weights. Here we see the equation for the L2 penalty. Later I will cover how to add it to the model. 

<img src='https://raw.githubusercontent.com/momonala/DS_tutorials/master/files/l2_eq.png'>

We can gain an intuition for it in Python: 

In [2]:
penalty = 0  # initialize to zero; this is R(W) in the equation above
W = np.zeros(shape=(10, 10))  # fake weights matrix

for i in np.arange(0, W.shape[0]):
    for j in np.arange(0, W.shape[1]):
        penalty += W[i][j] ** 2  # square each weight and add it to the total

Here, we are looping over the entire weights matrix and summing the squares of the weights. We can do this more efficiently with linear algebra, but this is just for demonstration. Because the penalty is quadratic, you can see that L2 penalizes large weights in our W matrix much more heavily than small ones, and so prefers smaller weights. This encourages W values that draw on all features of the data rather than leaning heavily on a few. 
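For reference, the same sum can be written without loops. The sketch below also compares two made-up weight vectors that carry the same total absolute weight, to show how squaring punishes one large weight more than several small ones. The function name `l2_penalty` is just an illustrative choice.

```python
import numpy as np

def l2_penalty(W):
    """Sum of squared weights, without explicit loops."""
    return np.sum(W ** 2)

# Two made-up weight vectors with the same total absolute weight (4.0)
spread_out = np.array([1.0, 1.0, 1.0, 1.0])
concentrated = np.array([4.0, 0.0, 0.0, 0.0])

print(l2_penalty(spread_out))    # 4.0
print(l2_penalty(concentrated))  # 16.0
```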

We should note that L2 is primarily used to prevent overfitting. The result keeps all the features in the model, but may reduce their weights. This is not true of all types of regularization. 

# L1 Lasso Regression

L1 takes the absolute value rather than the square:

<img src='https://raw.githubusercontent.com/momonala/DS_tutorials/master/files/l1_eq.png'>

L1 regularization shrinks coefficients, like L2, but it may also reduce some coefficients all the way to zero, which L2 generally does not do. A coefficient of zero essentially means that feature has been removed from the model. This is useful in scenarios where we have hundreds or thousands of features, like image classification, but may be excessive for smaller needs. 

We can gain an intuition for L1 in Python: 

In [3]:
penalty = 0  # initialize to zero; this is R(W) in the equation above
W = np.zeros(shape=(10, 10))  # fake weights matrix

for i in np.arange(0, W.shape[0]):
    for j in np.arange(0, W.shape[1]):
        penalty += abs(W[i][j])  # take the absolute value of each weight and add it
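Why does L1 push some weights all the way to zero while L2 only shrinks them? One way to get a feel for it is a deliberately simplified one-weight problem: start from an unpenalized weight `b` and find the `w` that stays close to `b` while paying the penalty. The setup, the numbers, and the helper names `shrink_l1` and `shrink_l2` below are assumptions for illustration, not part of any library.

```python
import numpy as np

def l1_penalty(W):
    """Sum of absolute weights, without explicit loops."""
    return np.sum(np.abs(W))

def shrink_l1(b, lam):
    """Minimizer of 0.5 * (w - b)**2 + lam * |w| (soft-thresholding)."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def shrink_l2(b, lam):
    """Minimizer of 0.5 * (w - b)**2 + lam * w**2."""
    return b / (1.0 + 2.0 * lam)

b = np.array([3.0, -0.4, 0.2, -2.0])  # made-up unpenalized weights
lam = 0.5

w_l1 = shrink_l1(b, lam)
w_l2 = shrink_l2(b, lam)
print("L1:", w_l1, "nonzero weights:", np.count_nonzero(w_l1))
print("L2:", w_l2, "nonzero weights:", np.count_nonzero(w_l2))
```

In this toy setting, L1 subtracts a fixed amount from every weight, so the small ones hit zero, while L2 rescales every weight by the same factor and leaves all of them nonzero.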

There are a few other types of regularization (such as Elastic Net and Dropout) used with other models. The key takeaway is that we have calculated a penalty term, R(W). 

# Loss Function

We must now add this term to our loss function (also called the cost function or objective function). The loss function maps our model performance to some "real-world" interpretation, where we can measure its error as a "cost". Our goal is to minimize that cost, that is, to minimize the error. Quantitatively, the loss function answers the following: 

    "The real label was 1, but I predicted 0. Is that bad?" 
    "Yeah. That's bad. 500 bad, to be specific."

Or "X bad", where X is the measure of the cost. Mathematically, this is the most basic form of the loss function: 

<img src='https://raw.githubusercontent.com/momonala/DS_tutorials/master/files/loss_func.png' width='150'>

To implement regularization, we multiply our penalty term, R(W), by a coefficient, λ, and add the result to the loss function. 

<img src='https://raw.githubusercontent.com/momonala/DS_tutorials/master/files/loss_func_add_reg.png' width='220'>

λ is a hyperparameter, which means that it is a variable that _**we**_ set, unlike the weights, W, or the predictions, ŷ. We can tune λ like a radio dial, and set it to the value that gives us the best results. If cleaning up data is the thing you do most in data science, then tuning hyperparameters is the thing you do second most. 
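Putting the pieces together, a regularized loss might be sketched as follows. This assumes mean squared error as the data loss, which is only one common choice. The function names and the numbers are made up for illustration.

```python
import numpy as np

def data_loss(X, y, W):
    """Mean squared error between predictions and labels (assumed loss)."""
    y_hat = X @ W
    return np.mean((y_hat - y) ** 2)

def l2(W):
    return np.sum(W ** 2)

def l1(W):
    return np.sum(np.abs(W))

def regularized_loss(X, y, W, lam, penalty):
    """Data loss plus lam times the penalty R(W)."""
    return data_loss(X, y, W) + lam * penalty(W)

# Made-up data and weights
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
W = np.array([1.0, -0.5])

print(regularized_loss(X, y, W, lam=0.0, penalty=l2))  # data loss only
print(regularized_loss(X, y, W, lam=0.1, penalty=l2))  # adds 0.1 * sum of squares
print(regularized_loss(X, y, W, lam=0.1, penalty=l1))  # adds 0.1 * sum of |weights|
```

Setting `lam=0.0` turns the penalty off entirely, and turning the dial up makes large weights progressively more expensive.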

And that's it! We update the weights to minimize the loss, which now includes both the prediction error and the penalty. In doing this, we reduce the magnitude of the coefficients, and therefore the model complexity. The trick comes in identifying when and how to implement regularization. 
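For the L2 penalty with a squared-error loss, the minimizing weights can be written in closed form, which makes the shrinking effect easy to check. The sketch below uses made-up data and the sum (not the mean) of squared errors as the data loss. `ridge_fit` is an illustrative helper, not a library function.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Weights minimizing ||X W - y||^2 + lam * ||W||^2 (closed form)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                    # made-up inputs
true_W = np.array([2.0, -1.0, 0.5, 0.0, 3.0])   # made-up "true" weights
y = X @ true_W + rng.normal(scale=0.5, size=50)

norms = []
for lam in (0.0, 1.0, 10.0, 100.0):
    W = ridge_fit(X, y, lam)
    norms.append(np.linalg.norm(W))
    print(f"lambda = {lam:>5}: ||W|| = {norms[-1]:.3f}")
```

As λ grows, the overall size of W shrinks toward zero. With `lam=0.0` the result is the ordinary least-squares fit.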

# Conclusion

Regularization is a technique to optimize model fit, most commonly to prevent overfitting. It works by penalizing the weights matrix, W, based on the size of your feature coefficients. Applying the penalty has the effect of generalizing the model to new data, sometimes at the cost of lower training accuracy. L2 regularization, or Ridge, is the most common for preventing overfitting. L1 regularization, or Lasso, can remove features from the model, and is better suited to data with a very large number of features. Regularization comes with the need to tune the λ hyperparameter, which is the strength of the penalty. 

In practice, regularization has the following workflow:

    1) Determine if regularization should be applied. Is there a discrepancy between training and testing accuracy? 
    2) Which type to use? Is the model overfit? How big is the dataset? 
    3) Tune λ, the strength of the regularization, to optimize testing accuracy. 
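Step 3 of the workflow above might look like the following sketch: try a handful of candidate λ values, fit on the training data, and keep the value with the lowest error on held-out data. Everything here is an assumption for illustration: the made-up data, the candidate grid, and the closed-form L2 fit in the hypothetical helper `ridge_fit`.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Weights minimizing ||X W - y||^2 + lam * ||W||^2 (closed form)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(X, y, W):
    return np.mean((X @ W - y) ** 2)

rng = np.random.default_rng(42)
n_features = 20
true_W = np.zeros(n_features)
true_W[:3] = [2.0, -1.0, 0.5]  # made-up: only three features matter

def make_data(n):
    X = rng.normal(size=(n, n_features))
    y = X @ true_W + rng.normal(scale=1.0, size=n)
    return X, y

X_train, y_train = make_data(30)    # small training set, easy to overfit
X_val, y_val = make_data(200)       # held-out data used only for scoring

candidates = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
val_errors = []
for lam in candidates:
    W = ridge_fit(X_train, y_train, lam)
    val_errors.append(mse(X_val, y_val, W))
    print(f"lambda = {lam:>6}: train MSE {mse(X_train, y_train, W):.3f}, "
          f"held-out MSE {val_errors[-1]:.3f}")

best_lam = candidates[int(np.argmin(val_errors))]
print("best lambda:", best_lam)
```

The same loop works for any penalty. Only the fitting step changes.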

Sources: 

1) https://www.quora.com/What-is-regularization-in-machine-learning

2) http://www.pyimagesearch.com/2016/09/19/understanding-regularization-for-image-classification-and-machine-learning/

3) https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-ridge-lasso-regression-python/