## Regularization

As we saw, regularization is one of the tools we have for managing the bias-variance tradeoff, but we did not see how to use it effectively. Regularization lets us control the size of our hypothesis set with fine-grained precision, and we know that the size of the hypothesis set is crucial to the final out-of-sample performance.

But how do we determine what is the best size for the hypothesis set?

We can determine this by iterating over a set of regularization parameters and checking which gives the best results. An important note: we evaluate each candidate on a validation set, which tells us which hypothesis is likely to do best on unseen data.

Let's go ahead and try this with our decision tree to see if we can get better out of sample performance: 

In [1]:
import pandas as pd
import numpy as np

# we now consolidate the preprocessing
def billionaire_preprocess():
    data = pd.read_csv('../data/billionaires.csv')

    del data['was founder']
    del data['inherited']
    del data['from emerging']

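    # -1 and 0 stand in for missing values here; mark them as NaN
    # (assignment instead of inplace=True, np.nan instead of the removed np.NaN alias)
    data['age'] = data['age'].replace(-1, np.nan)
    data['founded'] = data['founded'].replace(0, np.nan)
    data['gdp'] = data['gdp'].replace(0, np.nan)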
    
    del data['company.name']
    del data['name']
    del data['country code']
    del data['citizenship']
    del data['rank']
    del data['relationship']
    del data['sector']
    
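    # one-hot encode every non-float column (pass the column names, not the frame)
    categorical_columns = data.select_dtypes(exclude=['float64']).columns
    dummy_data = pd.get_dummies(data, dummy_na=True, columns=categorical_columns, drop_first=True)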
    
    return dummy_data

In [2]:
from sklearn.model_selection import train_test_split

# now we get the data
data = billionaire_preprocess()

# we parse out the target
y = data['worth in billions']
del data['worth in billions']

# we make our test set
X_train, X_test, y_train, y_test = train_test_split(data, y, test_size=0.2, random_state=1)

# and we make our validation set
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)

print(X_train.shape, X_val.shape, X_test.shape)

(1672, 70) (419, 70) (523, 70)


In [3]:
from sklearn.preprocessing import StandardScaler
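# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22;
# SimpleImputer is its replacement, aliased so the code below is unchanged
from sklearn.impute import SimpleImputer as Imputer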
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectKBest

# We then consolidate the data munging code
def billionaire_feature_eng(X, y, quantitative_pipeline, aggregated_pipeline, training=False):
    data = X.copy()

    qualitative_features = data.select_dtypes(exclude=['float64'])
    quantitative_features = data.select_dtypes(include=['float64'])
    
    # notice how we only fit on the training data!
    if training:
        quant_X = quantitative_pipeline.fit_transform(quantitative_features)
    else:
        quant_X = quantitative_pipeline.transform(quantitative_features)

    X = np.concatenate([quant_X, qualitative_features], axis=1)
    
    if training:
        X = aggregated_pipeline.fit_transform(X, y)
    else:
        X = aggregated_pipeline.transform(X)
    
    return X, y

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import mutual_info_regression

# and we can abstract out specific parts of the pipeline
quantitative_pipeline = Pipeline([
    ('imputer', Imputer(strategy='median'))
])

aggregated_pipeline = Pipeline([
    ('var_threshold', VarianceThreshold(threshold=0.0))
])

In [5]:
X_train, y_train = billionaire_feature_eng(X_train, y_train, quantitative_pipeline, aggregated_pipeline, training=True)

X_val, y_val = billionaire_feature_eng(X_val, y_val, quantitative_pipeline, aggregated_pipeline)

In [6]:
from sklearn.tree import DecisionTreeRegressor

reg = DecisionTreeRegressor(max_depth=5, min_samples_leaf=10)

Notice that I show two ways to regularize the decision tree (there are more), and that both fit neatly into the paradigm of shrinking the hypothesis set: instead of considering every possible tree, we only consider trees with a depth of at most five and with at least ten samples in every leaf.
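
To make the effect of these two constraints concrete, here is a minimal sketch on synthetic data. The data and the names `X_demo`, `y_demo`, `unconstrained` and `constrained` are assumptions made up for illustration, not part of the billionaires pipeline. It fits one tree with no limits and one with the settings above, then compares their depth, leaf count and smallest leaf.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative assumption, not the billionaires set)
rng = np.random.RandomState(0)
X_demo = rng.uniform(size=(500, 3))
y_demo = X_demo[:, 0] + rng.normal(scale=0.1, size=500)

# One tree with no limits, one with the two constraints from the text
unconstrained = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)
constrained = DecisionTreeRegressor(
    max_depth=5, min_samples_leaf=10, random_state=0
).fit(X_demo, y_demo)

# apply() returns the index of the leaf each training sample lands in,
# so counting the indices gives the number of samples per leaf
leaf_sizes = np.bincount(constrained.apply(X_demo))
smallest_leaf = leaf_sizes[leaf_sizes > 0].min()

print('unconstrained: depth {}, leaves {}'.format(
    unconstrained.get_depth(), unconstrained.get_n_leaves()))
print('constrained:   depth {}, leaves {}, smallest leaf {}'.format(
    constrained.get_depth(), constrained.get_n_leaves(), smallest_leaf))
```

The constrained tree can never exceed depth five or produce a leaf with fewer than ten training samples, so it is drawn from a much smaller family of trees than the unconstrained one.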

This is perhaps even easier to understand than weight penalization (the most common form of regularization), where you would add to the loss function the sum of the absolute values or squares of the weights.
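
For reference, weight penalization can be written as follows. The symbols are assumptions introduced here for illustration: $L(w)$ is the unregularized training loss, $w_j$ are the model weights, and $\lambda \ge 0$ is the regularization strength.

$$
L_{\text{L1}}(w) = L(w) + \lambda \sum_j |w_j|, \qquad L_{\text{L2}}(w) = L(w) + \lambda \sum_j w_j^2
$$

A larger $\lambda$ restricts the model to smaller weights, which plays the same role as a smaller max depth or a larger min samples leaf does for a tree.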

So we iterate over different regularization strengths and choose the one that does best on the validation set. Here is how you would do this:

In [12]:
from sklearn.metrics import mean_absolute_error

# here is where we do grid search
for max_depth in range(2, 10, 2):
    for min_samples_leaf in [5, 10, 20]:
        reg = DecisionTreeRegressor(max_depth=max_depth, min_samples_leaf=min_samples_leaf)
        reg.fit(X_train, y_train)
        
        # scikit-learn metrics take (y_true, y_pred); MAE is symmetric, so the value is unchanged
        error = mean_absolute_error(y_val, reg.predict(X_val))
        
        print('max depth {}'.format(max_depth))
        print('min samples leaf {}'.format(min_samples_leaf))
        print('error {}'.format(error))
        

max depth 2
min samples leaf 5
error 2.23172876882
max depth 2
min samples leaf 10
error 2.18984332729
max depth 2
min samples leaf 20
error 2.1827550218
max depth 4
min samples leaf 5
error 2.22345650613
max depth 4
min samples leaf 10
error 2.24114147033
max depth 4
min samples leaf 20
error 2.20703925261
max depth 6
min samples leaf 5
error 2.24137734624
max depth 6
min samples leaf 10
error 2.26024532185
max depth 6
min samples leaf 20
error 2.22405848899
max depth 8
min samples leaf 5
error 2.22950176744
max depth 8
min samples leaf 10
error 2.210377466
max depth 8
min samples leaf 20
error 2.29416057636


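Notice that the lowest validation error (about 2.18) comes from a max depth of 2 and a min samples leaf of 20, which is the most heavily regularized setting in this grid, and that the deepest trees do no better. The errors differ by only about 0.1 across the whole grid, so the ranking of nearby settings should not be over-interpreted. Because the winner sits on the edge of the grid, it may also be worth trying even stronger regularization.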
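
Reading the winner off a printed log is error-prone. A minimal sketch, assuming we record the validation errors in a dictionary keyed by (max depth, min samples leaf), is shown here with the values copied from the output above. The name `val_errors` is made up for illustration. In practice you would fill the dictionary inside the grid search loop instead of typing the numbers in.

```python
# Validation MAE from the run above, keyed by (max_depth, min_samples_leaf)
val_errors = {
    (2, 5): 2.23172876882,
    (2, 10): 2.18984332729,
    (2, 20): 2.1827550218,
    (4, 5): 2.22345650613,
    (4, 10): 2.24114147033,
    (4, 20): 2.20703925261,
    (6, 5): 2.24137734624,
    (6, 10): 2.26024532185,
    (6, 20): 2.22405848899,
    (8, 5): 2.22950176744,
    (8, 10): 2.210377466,
    (8, 20): 2.29416057636,
}

# The setting with the lowest validation error wins
best_params = min(val_errors, key=val_errors.get)
worst_params = max(val_errors, key=val_errors.get)
spread = val_errors[worst_params] - val_errors[best_params]

print('best   {} -> {:.4f}'.format(best_params, val_errors[best_params]))   # (2, 20) -> 2.1828
print('worst  {} -> {:.4f}'.format(worst_params, val_errors[worst_params])) # (8, 20) -> 2.2942
print('spread {:.4f}'.format(spread))                                       # 0.1114
```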

Of course, we should be careful not to overdo this. If we run too many experiments, the best error we see on the validation set may no longer be reflective of reality, because we have effectively fit the hyperparameters to the validation set. Still, the above shows that we can improve out-of-sample performance just by changing the hyperparameters (the settings that are not learned from the data) of the model.
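
The risk of running too many experiments can be illustrated with a toy simulation. Everything in it is an assumption made up for illustration: the set sizes, the number of experiments, and the use of classification accuracy instead of the regression error above. Each "model" guesses at random, so its true accuracy is 0.5, yet the best of many such models looks much better than that on a small validation set.

```python
import random

random.seed(0)

N_VAL = 50           # small validation set
N_TEST = 2000        # large held-out test set
N_EXPERIMENTS = 200  # number of hyperparameter settings "tried"

def coin_flip_accuracy(n):
    """Accuracy on n examples of a model that guesses a balanced binary label at random."""
    return sum(random.random() < 0.5 for _ in range(n)) / float(n)

# Each experiment is a model with no real skill, scored on both sets
experiments = [(coin_flip_accuracy(N_VAL), coin_flip_accuracy(N_TEST))
               for _ in range(N_EXPERIMENTS)]

# Pick the winner by validation accuracy, exactly as in a grid search
best_val, test_of_best = max(experiments, key=lambda pair: pair[0])

print('best validation accuracy:    {:.3f}'.format(best_val))
print('test accuracy of that model: {:.3f}'.format(test_of_best))
```

The gap between the two numbers comes entirely from selecting on the validation set. This is why the test set is kept aside until the very end.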