![banner](../img/cdips_2017_logo.png)

# Grid Search Cross-Validation for Neural Networks

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import sklearn as skl
import numpy as np

import scripts.load_data as load

import seaborn as sns
sns.set(font_scale=1)

%matplotlib inline

In [None]:
import sklearn.decomposition
import sklearn.random_projection
import sklearn.neural_network
from sklearn.model_selection import train_test_split

In [None]:
X, y  = load.load_training_spectra()

The results of the grid search are stored in a CSV file, which we load into a pandas DataFrame.

A few notes about the results:

- In an attempt to match the [Kaggle competition](https://www.kaggle.com/c/afsis-soil-properties#evaluation),
we used a different evaluation metric than $R^2$,
which is the default for scikit-learn regressors.
The evaluation method is
mean columnwise root mean squared error (MCRMSE)
-- the average, across the five targets, of our root mean squared error.
Though the two are closely related,
there's not an exact transformation between the two.
- GridSearchCV uses a "score",
rather than an "error" --
a score goes up when you do better,
whereas an error goes down.
To match this,
we ran the GridSearch with negative MCRMSE as the
"score".
The cells below load the data and then multiply the score
values by `-1` so that they become MCRMSE values.
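
As a rough illustration of the metric and the sign convention described above, here is a minimal sketch of MCRMSE and a negated scorer. It assumes the targets are arranged as one column per soil property; the names `mcrmse` and `neg_mcrmse_scorer` and the toy arrays are ours, and this is not necessarily how the original grid search defined its scorer.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer


def mcrmse(y_true, y_pred):
    """Mean columnwise RMSE: RMSE per target column, then the mean over columns."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    columnwise_rmse = np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0))
    return float(np.mean(columnwise_rmse))


# greater_is_better=False makes the scorer return -MCRMSE,
# so that "higher is better" holds, as GridSearchCV expects.
neg_mcrmse_scorer = make_scorer(mcrmse, greater_is_better=False)

# Toy check with two targets: the column RMSEs are 1 and 2.
y_true = np.zeros((2, 2))
y_pred = np.array([[1.0, 2.0], [1.0, 2.0]])
toy_error = mcrmse(y_true, y_pred)

# Apply the scorer to a baseline model on made-up data.
X_toy = np.arange(8, dtype=float).reshape(4, 2)
Y_toy = np.array([[0.0, 1.0], [1.0, 3.0], [2.0, 5.0], [3.0, 7.0]])
baseline = DummyRegressor(strategy='mean').fit(X_toy, Y_toy)
toy_score = neg_mcrmse_scorer(baseline, X_toy, Y_toy)
```

Multiplying `toy_score` by `-1` recovers the MCRMSE, which is the same flip applied to the saved grid search results.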

In [None]:
grid_search_results = pd.read_csv('../data/model_params/GridSearch_07262017.csv',index_col=0)

In [None]:
columns_to_keep = [column for column in grid_search_results.columns
                                            if not column.endswith('_time')]

grid_search_results = grid_search_results[columns_to_keep]

In [None]:
columns_to_flip = [column for column in grid_search_results.columns
                                          if column.endswith('_score')
                                              and not (column.startswith('std')
                                                      or column.startswith('rank'))]

grid_search_results[columns_to_flip] = grid_search_results[columns_to_flip]*-1

In [None]:
columns_to_rename = [column for column in grid_search_results.columns
                                            if column.endswith('_score')]

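# Swap the trailing 'score' for 'error'. Slicing removes exactly that suffix,
# whereas str.rstrip('score') strips any trailing run of those characters.
renamed_columns = [column[:-len('score')]+'error' for column in columns_to_rename]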
grid_search_results[renamed_columns] = grid_search_results[columns_to_rename]
grid_search_results = grid_search_results.drop(columns_to_rename,axis=1)

First, let's look at the basic descriptive statistics of the test and train error.

In [None]:
columns_to_describe = ['mean_test_error','mean_train_error']

grid_search_results[columns_to_describe].describe()

The distribution appears to be highly skewed,
with the maximum 11 orders of magnitude above the minimum
and 10 orders of magnitude above the 75th percentile.

These bad performers will make it difficult to visualize
the remaining data, so we drop them
by `query`ing for models that have reasonable performance.

In [None]:
# Copy the result so that columns added later don't trigger a SettingWithCopyWarning.
filtered_results = grid_search_results.query('mean_test_error<1')
filtered_results = filtered_results.query('mean_train_error<1').copy()
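
To get a feel for what this kind of filter does, here is a small sketch on made-up numbers. The values in `toy` are hypothetical and chosen only to mimic a few reasonable models plus one that diverged; only the column names and the threshold of 1 come from the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical error values: most are small, one has diverged.
toy = pd.DataFrame({'mean_test_error': [0.45, 0.50, 0.62, 3.0e10],
                    'mean_train_error': [0.30, 0.41, 0.55, 2.5e10]})

# Orders of magnitude spanned by each column (log10 of max/min).
spread = np.log10(toy.max() / toy.min())

# Both conditions can be combined into a single query string.
kept = toy.query('mean_test_error < 1 and mean_train_error < 1')
n_dropped = len(toy) - len(kept)
```

Checking `n_dropped` against the size of the full table is a quick way to confirm that the filter removes only a handful of outliers rather than a large share of the search.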

Now we can visualize our results.

In [None]:
plt.figure(figsize=(12,6))
# norm_hist=True normalizes the histogram; the older 'normed' histogram
# keyword is no longer accepted by recent matplotlib releases.
sns.distplot(filtered_results.mean_test_error, kde=False, norm_hist=True,
             hist_kws={'edgecolor':'k',
                       'linewidth':4,
                       'histtype':'stepfilled'},
             label='test');

sns.distplot(filtered_results.mean_train_error, kde=False, norm_hist=True,
             hist_kws={'edgecolor':'k',
                       'linewidth':4,
                       'histtype':'stepfilled'},
             label='train');
plt.legend(fontsize=20)
plt.xlabel('MCRMS Error',fontsize=24);

Interestingly, there appears to be a hard lower bound on our error on the test set,
while error on the training set can fall to nearly zero.

Though it's difficult to compare precisely,
since the test set used in the Kaggle competition 
is now unavailable,
this limit to our performance appears close to
the best performance possible --
within a few hundredths, at least.

In addition to the test error,
the difference between the test and train error
is also an important criterion for model selection.
In this notebook we call that difference the *generalization error*
(it is also commonly called the generalization gap).

Below, we calculate it and plot its distribution.

In [None]:
for split_index in ['0','1','2']:
    test_error = filtered_results['split'+split_index+'_test_error']
    train_error = filtered_results['split'+split_index+'_train_error']
    filtered_results['generalization_error'+split_index] = test_error-train_error
    
filtered_results['mean_generalization_error'] =  1/3*(filtered_results['generalization_error0'] +
                                                filtered_results['generalization_error1'] +
                                                filtered_results['generalization_error2'])
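
As a sanity check on this calculation, the sketch below uses a made-up table with the same column naming to show that averaging the per-split differences gives the same number as subtracting the unweighted mean train error from the unweighted mean test error. The values are hypothetical. The equivalence may not carry over exactly to the stored `mean_test_error` column if scikit-learn weighted the splits by fold size, as some older configurations did.

```python
import numpy as np
import pandas as pd

# Hypothetical per-split errors for two hyperparameter settings.
toy = pd.DataFrame({
    'split0_test_error': [0.50, 0.70], 'split0_train_error': [0.40, 0.30],
    'split1_test_error': [0.55, 0.80], 'split1_train_error': [0.45, 0.35],
    'split2_test_error': [0.60, 0.75], 'split2_train_error': [0.35, 0.25],
})
test_cols = [f'split{i}_test_error' for i in range(3)]
train_cols = [f'split{i}_train_error' for i in range(3)]

# Route 1: test minus train within each split, then average over splits.
per_split_gap = toy[test_cols].to_numpy() - toy[train_cols].to_numpy()
mean_gap = per_split_gap.mean(axis=1)

# Route 2: average each side over splits first, then subtract.
gap_from_means = (toy[test_cols].mean(axis=1)
                  - toy[train_cols].mean(axis=1)).to_numpy()
```

Both routes agree here because the mean is linear; the per-split version used in the notebook has the advantage of also keeping the individual split values around for inspection.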

In [None]:
plt.figure(figsize=(12,6))
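# norm_hist=True normalizes the histogram; the older 'normed' histogram
# keyword is no longer accepted by recent matplotlib releases.
sns.distplot(filtered_results.mean_generalization_error, kde=False, norm_hist=True,
             hist_kws={'edgecolor':'k',
                       'linewidth':4,
                       'histtype':'stepfilled'},
            label='generalization');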

plt.legend(fontsize=20)
plt.xlabel('MCRMS Error',fontsize=24);

Generalization performance is very spread out,
with most models generalizing very well,
but some models generalizing very poorly.

In order to pick the best model,
we need to look at the relationship between generalization error
and test error.

The cell below uses seaborn's `pairplot`
to plot the pairwise relationships between training error,
test error, and generalization error.

In [None]:
columns_to_plot = ["mean_test_error","mean_train_error","mean_generalization_error"]

sns.pairplot(filtered_results,vars=columns_to_plot,
                plot_kws={'alpha':0.1,},);

Training and test error are approximately linearly related, as might be expected.

The most important plot here is in the top-right corner,
and indicates the relationship between generalization error and test error.

The cell below plots just this relationship using `jointplot`.
To get a clearer sense of the distribution,
we plot it with a
[kernel density estimate](https://en.wikipedia.org/wiki/Kernel_density_estimation),
which is a sort of "smoothed histogram", loosely speaking.

In [None]:
# stat_func is omitted: the argument was deprecated and is not accepted
# by newer seaborn releases.
sns.jointplot(x='mean_generalization_error',y='mean_test_error',
             data=filtered_results,
             kind='kde');
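
To make the "smoothed histogram" idea concrete, here is a one-dimensional sketch that compares a normalized histogram with a Gaussian kernel density estimate on synthetic data. It uses `scipy.stats.gaussian_kde` with its default bandwidth, which is our choice for illustration and not necessarily what seaborn does internally; the joint plot above is the two-dimensional analogue.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic "errors": 500 draws centered on 0.5.
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.5, scale=0.1, size=500)

# A normalized histogram: bar heights times bar widths sum to 1.
hist_density, bin_edges = np.histogram(samples, bins=20, density=True)
hist_area = float((hist_density * np.diff(bin_edges)).sum())

# A KDE places a small Gaussian bump on every sample and adds them up.
kde = gaussian_kde(samples)
grid = np.linspace(samples.min() - 0.5, samples.max() + 0.5, 400)
kde_density = kde(grid)

# Like the histogram, the KDE is a density: its area is approximately 1.
dx = grid[1] - grid[0]
kde_area = float(kde_density.sum() * dx)
```

Both estimates describe the same distribution; the KDE simply trades the histogram's hard bin edges for a smooth curve whose smoothness is set by the kernel bandwidth.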

The best models are in the bottom-left of this chart --
they have low test error and low generalization error.

It appears that the majority of models tested fall in a single cluster,
with more variability along the test error axis
than along the generalization error axis.

With a rough sense of the overall performance in hand,
we now look at the hyperparameters of the best performers.

## Top Performers on Test Set

First, let's take a look at the top performers on the test set.

We can sort by the `rank_test_error` column
to pull out the best-performing hyperparameter settings.

In [None]:
filtered_results = filtered_results.sort_values(by='rank_test_error')
filtered_results.head(25)

A few things pop out:
for example,
early stopping is rare among the top performers
and rectified linear units (`relu`s)
seem to outperform logistic units.
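
Impressions like these can be quantified by counting how often each setting appears among the top rows. The sketch below does this on a made-up table; the column names `param_activation` and `param_early_stopping` are assumptions based on how `GridSearchCV` usually names parameter columns, and the values are hypothetical.

```python
import pandas as pd

# Hypothetical top rows of a grid search results table.
toy_top = pd.DataFrame({
    'param_activation': ['relu', 'relu', 'logistic', 'relu'],
    'param_early_stopping': [False, False, True, False],
    'mean_test_error': [0.41, 0.42, 0.44, 0.43],
})

# Share of the top rows that use each activation function.
activation_share = toy_top['param_activation'].value_counts(normalize=True)

# Fraction of the top rows that use early stopping.
early_stopping_rate = toy_top['param_early_stopping'].mean()
```

Comparing these shares with the same shares over the full table helps separate a real preference from a setting that was simply tried more often.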

We can get a closer look by
plotting the distribution of test performances for each hyperparameter.

In [None]:
best_results = filtered_results.head(100)

In [None]:
columns_to_describe = [column for column in best_results.columns
                                          if column.startswith('param_')
                                          and not (column.endswith('_max_iter')
                                                  or column.endswith('_learning_rate_init')
                                                  or column.endswith('_alpha')
                                                  or column.endswith('_early_stopping'))
                                              ]                        

def make_error_plot(dataframe, error_column, error_name, columns_to_describe):
    plt.figure(figsize=(12,24))
    cols = 2
    # plt.subplot needs an integer row count; np.ceil returns a float
    rows = int(np.ceil(len(columns_to_describe)/cols))
    for idx,column in enumerate(columns_to_describe):
        plt.subplot(rows,cols,idx+1)
        sns.stripplot(x=dataframe[column],y=dataframe[error_column],
                      jitter=True,size=12,alpha=0.75,color='gray'
                     )
        sns.violinplot(x=dataframe[column],y=dataframe[error_column],
                      )
        plt.ylabel(error_name +' MCRMS Error',fontsize='xx-large')
        plt.xlabel(column[6:])
    
    plt.tight_layout()
    plt.suptitle(error_name + ' Error Distribution For Different Hyperparameters',y=1.01,
                fontweight='bold',fontsize='xx-large');
    
make_error_plot(best_results, 'mean_test_error', 'Test', columns_to_describe)

For best test performance,
we want rectified linear units
with a batch size smaller than 128,
100 nodes in the hidden layer,
and no early stopping.

Interestingly, these are essentially
the same parameters that were discovered
by the hand-tuning procedure.

## Networks with Lowest Generalization Error from Best Performers

In [None]:
best_results = best_results.sort_values(by='mean_generalization_error')
best_results[columns_to_describe+['mean_generalization_error','mean_test_error','mean_train_error']].head(10)

Very different patterns arise for the generalization error
than for the test error!
For example, logistic units seem to be better,
and batch size doesn't seem to be as important.

Plotting the distribution of the generalization error confirms these trends.

In [None]:
columns_to_describe = [column for column in best_results.columns
                                          if column.startswith('param_')
                                          and not (column.endswith('_max_iter')
                                                  or column.endswith('_learning_rate_init')
                                                  or column.endswith('_alpha')
                                                  or column.endswith('_early_stopping'))
                                              ]    

make_error_plot(best_results, 'mean_generalization_error', 'Generalization', columns_to_describe)

So which hyperparameter setting do we choose?

Over-fitting is a serious issue for machine learning methods
applied in the real world.
Importantly,
the method of separating a data set collected at one time
into a training and a test set tends to underestimate
what the generalization error will be
when the model is deployed on totally new data.

Heuristically, a large generalization error on the
internally split training and test set
is a sign that the generalization error
on a totally novel set will be even larger.
This would motivate us to select the hyperparameter settings
that minimize generalization error while still
keeping the test error low.
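
One way to turn this heuristic into a rule is sketched below: keep only the settings whose test error is within some tolerance of the best, then pick the one with the smallest generalization error. The tolerance of `0.025` and the toy values are hypothetical, and this is one possible selection rule among many rather than the procedure used to pick the final model.

```python
import pandas as pd

# Hypothetical results for four hyperparameter settings.
toy = pd.DataFrame({
    'mean_test_error': [0.40, 0.41, 0.42, 0.60],
    'mean_generalization_error': [0.30, 0.10, 0.05, 0.01],
}, index=['a', 'b', 'c', 'd'])

# Hypothetical tolerance: how much test error we are willing to give up.
tolerance = 0.025

# Step 1: keep settings whose test error is close to the best.
best_test_error = toy['mean_test_error'].min()
candidates = toy[toy['mean_test_error'] <= best_test_error + tolerance]

# Step 2: among those, choose the smallest generalization error.
chosen = candidates['mean_generalization_error'].idxmin()
```

In this toy table, setting `d` generalizes best but is excluded for its high test error, and `c` is chosen over `a` even though `a` has the lowest test error.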