## Validation Set

One thing we have not covered yet is the validation set. We saw how much larger the error was on the test set, and we learned that we cannot trust the error measured on the training set. The problem is that we can use the test set only once. That is fine if we are testing a single hypothesis, but it does not work if we want to compare several hypotheses against each other (consider KNN with K set to 3 versus 4). So what do we do?

In practice we split our data into three sets: one used for training, one used to experiment on, and one held out until the end to test on.

So what gives? We just showed that the quality of the estimate decreases as we try more hypotheses.

As we saw in the previous lesson, and as we will see later on, training on data generally amounts to testing infinitely many hypotheses, and in that case the error bound becomes far worse. The validation set sits in between: we only try a handful of hypotheses on it, so its estimate degrades only slightly.
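To get a feel for how the bound degrades with the number of hypotheses, here is a minimal sketch. It assumes the bound from the previous lesson is the Hoeffding inequality combined with a union bound over `M` hypotheses, which may not be exactly the form used there. The function name `hoeffding_epsilon`, the sample size of 500 and the confidence level `delta=0.05` are made up for illustration.

```python
import math

def hoeffding_epsilon(n_samples, n_hypotheses=1, delta=0.05):
    """Width of the error interval from Hoeffding plus a union bound.

    Solves 2 * M * exp(-2 * eps**2 * N) = delta for eps, so that with
    probability at least 1 - delta every one of the M hypotheses has a
    measured error within eps of its true error.
    """
    return math.sqrt(math.log(2.0 * n_hypotheses / delta) / (2.0 * n_samples))

# one hypothesis (test set), a few (validation set), many (moving toward training)
for m in (1, 3, 1000):
    print(m, round(hoeffding_epsilon(500, m), 3))
```

The interval widens only logarithmically in `M`, which is why trying a few values of K on a validation set costs little, while the effectively unbounded search done during training needs a different argument altogether.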

So we have three data splits for three cases:

1. Train data: we use this to test infinitely many hypotheses
2. Validation data: we use this to test a finite number of hypotheses
3. Test data: we use this to test a single hypothesis

Let's walk through doing this below:

In [140]:
import pandas as pd
import numpy as np

# we now consolidate the preprocessing
def billionaire_preprocess():
    data = pd.read_csv('../data/billionaires.csv')

    del data['was founder']
    del data['inherited']
    del data['from emerging']

    # replace sentinel values with NaN (plain assignment avoids chained in-place edits)
    data['age'] = data['age'].replace(-1, np.nan)
    data['founded'] = data['founded'].replace(0, np.nan)
    data['gdp'] = data['gdp'].replace(0, np.nan)
    
    del data['company.name']
    del data['name']
    del data['country code']
    del data['citizenship']
    del data['rank']
    del data['relationship']
    del data['sector']
    
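    # one-hot encode every non-float column, passing column names rather than a DataFrame
    categorical_columns = data.select_dtypes(exclude=['float64']).columns
    dummy_data = pd.get_dummies(data, dummy_na=True, columns=categorical_columns, drop_first=True)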
    
    return dummy_data

In [141]:
# now we get the data
data = billionaire_preprocess()

In [142]:
# we parse out the target (this time classification)
y = data['worth in billions'] > 2
del data['worth in billions']

In [143]:
# now we split the data
from sklearn.model_selection import train_test_split

# we make our test set
X_train, X_test, y_train, y_test = train_test_split(data, y, test_size=0.2, random_state=1)

# and we make our validation set
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)

In [144]:
print(X_train.shape, X_val.shape, X_test.shape)

(1672, 70) (419, 70) (523, 70)
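Because the second split takes 20% of what is left after the first one, the overall proportions are roughly 64% train, 16% validation and 20% test. The sketch below reproduces the shapes above with plain arithmetic. It assumes the held-out portion is rounded up at each split, which is consistent with the numbers printed above but is an assumption about the rounding rule; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, held_out_fraction):
    """Return (kept, held_out) row counts, rounding the held-out part up."""
    n_held_out = int(math.ceil(held_out_fraction * n_samples))
    return n_samples - n_held_out, n_held_out

n_total = 1672 + 419 + 523                   # total rows, taken from the shapes above
n_rest, n_test = split_sizes(n_total, 0.2)   # first split: carve off the test set
n_train, n_val = split_sizes(n_rest, 0.2)    # second split: carve off the validation set

print(n_train, n_val, n_test)
print(round(n_train / float(n_total), 2),
      round(n_val / float(n_total), 2),
      round(n_test / float(n_total), 2))
```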


## In practice

So let's say that we wanted to try KNN with multiple values for K. What could we do?

In this case we would train multiple models and try them all on the validation set!

First let's do some feature engineering (remember, we only fit our feature engineering on the training data!)

In [145]:
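from sklearn.preprocessing import StandardScaler
# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22;
# SimpleImputer is its replacement, aliased here so later cells still work
from sklearn.impute import SimpleImputer as Imputer
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectKBest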

# We then consolidate the data munging code
def billionaire_feature_eng(X, y, quantitative_pipeline, aggregated_pipeline, training=False):
    data = X.copy()

    qualitative_features = data.select_dtypes(exclude=['float64'])
    quantitative_features = data.select_dtypes(include=['float64'])
    
    # notice how we only fit on the training data!
    if training:
        quant_X = quantitative_pipeline.fit_transform(quantitative_features)
    else:
        quant_X = quantitative_pipeline.transform(quantitative_features)

    X = np.concatenate([quant_X, qualitative_features], axis=1)
    
    if training:
        X = aggregated_pipeline.fit_transform(X, y)
    else:
        X = aggregated_pipeline.transform(X)
    
    return X, y
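The `training` flag above follows the usual fit-versus-transform pattern: statistics are learned from the training rows only and then reused unchanged on any other split. Here is a minimal pure-Python sketch of that idea. `TinyStandardizer` and the age values are hypothetical stand-ins for illustration, not the `StandardScaler` used in this notebook.

```python
import statistics

class TinyStandardizer:
    """Toy scaler: learns mean and standard deviation in fit, reuses them in transform."""

    def fit(self, values):
        self.mean_ = statistics.mean(values)
        self.std_ = statistics.pstdev(values)  # population standard deviation
        return self

    def transform(self, values):
        # uses the statistics stored by fit, never those of `values` itself
        return [(v - self.mean_) / self.std_ for v in values]

train_ages = [40.0, 50.0, 60.0]   # made-up training values
val_ages = [70.0]                 # made-up validation value

scaler = TinyStandardizer().fit(train_ages)   # fit on training data only
train_scaled = scaler.transform(train_ages)
val_scaled = scaler.transform(val_ages)       # validation data is only transformed

print(scaler.mean_, val_scaled)
```

If we refit on the validation rows instead, the single value 70 would become its own mean, and the validation score would no longer tell us how the model behaves on data it has never seen.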

In [146]:
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import mutual_info_classif

# and we can abstract out specific parts of the pipeline
quantitative_pipeline = Pipeline([
    ('imputer', Imputer(strategy='median')),
    ('standardize', StandardScaler()) 
])

aggregated_pipeline = Pipeline([
    ('var_threshold', VarianceThreshold(threshold=0.0)),
    ('k_best', SelectKBest(mutual_info_classif, k=5))
])

In [147]:
X_train, y_train = billionaire_feature_eng(X_train, y_train, quantitative_pipeline, aggregated_pipeline, training=True)

In [148]:
# let's check out the features we are using
data.columns[aggregated_pipeline.steps[0][1].get_support()][aggregated_pipeline.steps[1][1].get_support()]

Index([u'age', u'company.type_new, privitization', u'gender_male',
       u'industry_Diversified financial', u'region_South Asia'],
      dtype='object')

In [149]:
from sklearn.neighbors import KNeighborsClassifier

# we start with a single neighbor; the cells below try 2 and 3
cls = KNeighborsClassifier(n_neighbors=1, weights='uniform')

In [150]:
cls.fit(X_train, y_train)

# pretty bad accuracy, not the worst
cls.score(X_train, y_train)

0.55801435406698563

In [151]:
# Now we try it on the validation data

# first we apply the (already fitted) feature engineering to the validation data
X_val, y_val = billionaire_feature_eng(X_val, y_val, quantitative_pipeline, aggregated_pipeline)

In [152]:
cls.score(X_val, y_val)

0.45346062052505964

In [153]:
from sklearn.neighbors import KNeighborsClassifier

# now we try 2 neighbors
cls = KNeighborsClassifier(n_neighbors=2, weights='uniform')

cls.fit(X_train, y_train)

# training accuracy first, then validation accuracy
print(cls.score(X_train, y_train))

print(cls.score(X_val, y_val))

0.57476076555
0.491646778043


In [154]:
from sklearn.neighbors import KNeighborsClassifier

# and now 3 neighbors
cls = KNeighborsClassifier(n_neighbors=3, weights='uniform')

cls.fit(X_train, y_train)

# training accuracy first, then validation accuracy
print(cls.score(X_train, y_train))

print(cls.score(X_val, y_val))

0.590909090909
0.541766109785
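The three cells above repeat the same steps for each K, so the comparison can be written as a loop. The sketch below does that on synthetic data from `make_classification` (an assumption, used so the block runs on its own; it is not the billionaires data). It then touches the test set exactly once, with the K that won on validation. The candidate list and variable names are chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in data, split the same way as above
X, y = make_classification(n_samples=600, n_features=5, random_state=1)
X_rest, X_te, y_rest, y_te = train_test_split(X, y, test_size=0.2, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X_rest, y_rest, test_size=0.2, random_state=1)

# a finite set of hypotheses: one model per candidate K, scored on validation
val_scores = {}
for k in (1, 2, 3, 4, 5):
    model = KNeighborsClassifier(n_neighbors=k, weights='uniform')
    model.fit(X_tr, y_tr)
    val_scores[k] = model.score(X_va, y_va)

# keep the K with the best validation accuracy
best_k = max(val_scores, key=val_scores.get)

# a single hypothesis: the chosen model is scored on the test set once
final_model = KNeighborsClassifier(n_neighbors=best_k, weights='uniform').fit(X_tr, y_tr)
test_score = final_model.score(X_te, y_te)

print(best_k, val_scores[best_k], test_score)
```

The validation score of the winning K tends to be a little optimistic because we picked it for being the best, which is why the final number we report comes from the untouched test set.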
