## Learning Objectives

We firm up what regularization is and when to use it. And we show how a computer can pick the optimal amount of regularization through grid search.

## Regularization

As we saw regularization was one of the tools that we had to fight the bias variance tradeoff. But we did not see how to effectively use it. We saw that it was able to, with fine grain precision, determine the size of our hypothesis set. We of course know that the size of the hypothesis set is crucial to the final out of sample performance. 

But how do we determine what is the best size for the hypothesis set?

We can determine this using by iterating over a set of regularization parameters and then determining which had the best results. An important note here is that we will be evaluating the final performance on a validation set and this will be able to tell us which hypothesis will do best on outside data. 

Let's go ahead and try this with our decision tree to see if we can get better out of sample performance: 

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

breast_cancer_data = load_breast_cancer()

# we make our test set
X_train, X_test, y_train, y_test = train_test_split(breast_cancer_data['data'], breast_cancer_data['target'], test_size=0.2, random_state=1)

# and we make our validation set
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)

In [2]:
from sklearn.tree import DecisionTreeClassifier

cls = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10)

Notice that I show two ways to regularize my decision tree (there are more). And notice that these fit quite nicely into the paradigm of reducing the hypothesis set. Instead we would be looking at trees with a max depth of five or looking where their leaves can only be so small.

This is perhaps even easier to understand than weight penalization (the most common form of regularization), where you would add to the loss function the sum of the absolute values or squares of the weights.

So what we would do is iterate over different regularization strengths and choose the one that did best on the validation set. Let's show how you would do this below:

In [5]:
from sklearn.metrics import roc_auc_score

# here is where we do grid search
for max_depth in range(2, 10, 2):
    for min_samples_leaf in [5, 10, 20]:
        cls = DecisionTreeClassifier(max_depth=max_depth, min_samples_leaf=min_samples_leaf)
        cls.fit(X_train, y_train)
        
        error = roc_auc_score(cls.predict(X_val), y_val)
        
        print('max depth {}'.format(max_depth))
        print('min samples leaf {}'.format(min_samples_leaf))
        print('error {}'.format(error))
        

max depth 2
min samples leaf 5
error 0.90873015873
max depth 2
min samples leaf 10
error 0.90873015873
max depth 2
min samples leaf 20
error 0.925683060109
max depth 4
min samples leaf 5
error 0.952455590387
max depth 4
min samples leaf 10
error 0.90873015873
max depth 4
min samples leaf 20
error 0.925683060109
max depth 6
min samples leaf 5
error 0.952455590387
max depth 6
min samples leaf 10
error 0.90873015873
max depth 6
min samples leaf 20
error 0.925683060109
max depth 8
min samples leaf 5
error 0.952455590387
max depth 8
min samples leaf 10
error 0.90873015873
max depth 8
min samples leaf 20
error 0.925683060109


Notice that we achieve the best out of sample error somewhere in the middle, where the regularization is not highest, but is high enough.

Of course we should be concerned if we are doing this too much. We don't want to run too too many experiments. If we run too many the errors that we get on the validation set may not be reflective or reality. But the above has shown us that we can get great performance just by changing the hyperparameter (those not learned) of the model. 

## In summary

Regularization is a way to automatically control the size of the hypothesis set, and we can use a computer in order to find the best regularization needed. Generally speaking the models in sklearn will come with docstrings that explain the way to regularize them. 

Regularization may sometimes be confused with inductive bias (which we will talk about a bit later on).

## Learning Objectives

We firm up what regularization is and when to use it. And we show how a computer can pick the optimal amount of regularization through grid search.

## Comprehension Questions

1.	How do we know what the best regularization is?
2.	Why is putting a penalty on large weights regularization?
3.	Is changing models a type of regularization?
