# Beginning Machine Learning with scikit-learn

## Exploring hyperparameters

In the introductory lesson and at various points in earlier lessons, we passed **hyperparameters** to models to tune their performance.  In this lesson we will look at hyperparameters a bit more systematically, and especially at *grid search*, which is a convenient API for exploring hyperparameter space.

While there is overlap in the hyperparameters used by different models, the same name often has a somewhat different meaning because the underlying mathematical process is different.  Moreover, different models usually have mostly different collections of hyperparameters that pertain to them.  Learning the available hyperparameters is a matter of learning about the individual model class.

In [None]:
# Some libraries tend to be in flux for their dependency versions
import warnings
warnings.simplefilter("ignore")

### The Wisconsin breast cancer dataset

For this lesson, we will look at another sample dataset included with scikit-learn.  The cancer dataset has 30 features and a binary target of "malignant" or "benign."  This dataset is moderately sized, with 569 samples.

Our goal in this lesson is not to identify the *optimal* classifier and hyperparameters, but simply to explore how to work with the parametric space.

In [None]:
from sklearn import datasets

In [None]:
cancer = datasets.load_breast_cancer()
cancer.target_names

In [None]:
print(cancer.DESCR)

## Naive classification

For now, we use K Nearest Neighbors (KNN) classification, mostly because it is easy to understand.  The general idea of KNN is simply to identify the K points that are "closest" to a test point or newly observed point, and let the plurality win.$^1$  KNN does quite well for numerous classification and regression problems.

<hr/>
<small>$^1$<i>The winner may not be a majority.  For example with 8 nearest neighbors and four classes, we might have a predicted point equally close to 2 points from each class.  The tie is broken arbitrarily by the order of the training data.  Even if we had 9 neighbors and the count of those nearby were `{A:3, B:2, C:2, D:2}`, letting A win would still be with only 1/3 of neighbors "voting" for A.</i></small> 
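The plurality vote described in the footnote can be sketched in a few lines of plain Python. This is only a conceptual illustration, not how scikit-learn implements KNN internally; the list `neighbor_labels` is a hypothetical set of 9 neighbor labels chosen to match the `{A:3, B:2, C:2, D:2}` example.

```python
from collections import Counter

# Hypothetical labels of the 9 nearest neighbors (matches the footnote's example)
neighbor_labels = ["A", "B", "A", "C", "D", "B", "C", "A", "D"]

# Count one vote per neighbor and pick the most common label
votes = Counter(neighbor_labels)
winner, count = votes.most_common(1)[0]
share = count / len(neighbor_labels)

print(votes)
print(f"{winner} wins with {share:.0%} of the votes")
```

Class A wins here with only a third of the votes, which is a plurality rather than a majority.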

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=1)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
knn.score(X_test, y_test)

A 93% score might strike you as quite good, especially having seen much worse ones in other examples.  However, the thing we are trying to predict is literally a life or death matter, which makes the number seem less impressive.  Moreover, the mean accuracy score does not distinguish between false positives and false negatives.  Presumably, in this domain we would rather have more false positives than false negatives, because unnecessary treatment (or unnecessary additional testing) is less bad than a missed diagnosis.
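As a brief preview, one way to separate the two kinds of error is a confusion matrix. The sketch below repeats the split and model from above so that it is self-contained; the variable names `missed_malignant` and `false_alarms` are illustrative choices, and the labeling relies on this dataset encoding "malignant" as 0 and "benign" as 1.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=1)

knn = KNeighborsClassifier().fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Label 0 is "malignant" and label 1 is "benign" in this dataset.
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
missed_malignant = cm[0, 1]   # truly malignant, predicted benign
false_alarms = cm[1, 0]       # truly benign, predicted malignant

print(cm)
print("Missed malignant cases:", missed_malignant)
print("False alarms:", false_alarms)
```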

In a later lesson we look at metrics in more detail. For this lesson, we will use only the model's `.score()` method as our optimization goal.

## Exploring one hyperparameter

The most obvious hyperparameter for KNN classification is the number of neighbors used.  Many aspects of the data—from number of samples, to number of dimensions, to multi-modality in univariate distribution of features—can greatly affect the "right" answer.  Moreover, if we really want to arrive at the best classification, we should look at scaling issues that will be glossed over here but discussed in a later lesson on feature engineering.

In [None]:
import pandas as pd

scores = []
for k in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    score = knn.score(X_test, y_test)
    scores.append(score)
    
pd.Series(scores, index=range(1,40), name="Score")

It is easier to see a pattern if we visualize the trend.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.figure(figsize=(10,5))
plt.plot(range(1, 40), scores, marker='o', markerfacecolor='red', markersize=5)
plt.xlabel('n_neighbors')
plt.ylabel('Accuracy')
plt.title('Model response to number of neighbors');

## Exploring many hyperparameters

While `n_neighbors` is the most obvious hyperparameter, others can significantly affect the accuracy (or other metrics) as well.  All scikit-learn models have default values for their hyperparameters, but how good those choices are is very domain specific.

The additional hyperparameters of interest to KNN are `weights` and `metric`.  There are a few other hyperparameters, but those affect computational performance rather than the fundamental behavior of trained models.

By default, `weights` is `"uniform"`, meaning that each of the K closest neighbors gets one "vote."  But `"distance"` is also quite plausible: it weights each neighbor by the inverse of its distance, while still considering only the K nearest neighbors.
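To make the difference concrete, here is a minimal sketch of the two voting schemes in plain Python. It is a conceptual illustration rather than scikit-learn's implementation; the helper `vote` and the `neighbors` list of (distance, label) pairs are hypothetical.

```python
from collections import defaultdict

# Hypothetical (distance, label) pairs for the K=3 nearest neighbors
neighbors = [(0.5, "A"), (2.0, "B"), (2.0, "B")]

def vote(neighbors, weights="uniform"):
    """Return the winning label under uniform or inverse-distance weighting."""
    tally = defaultdict(float)
    for distance, label in neighbors:
        # Assumes distance > 0; a zero distance would need special handling
        tally[label] += 1.0 if weights == "uniform" else 1.0 / distance
    return max(tally, key=tally.get)

print(vote(neighbors, "uniform"))   # B: two votes against one
print(vote(neighbors, "distance"))  # A: weight 2.0 against 0.5 + 0.5
```

The single close neighbor outvotes the two distant ones only when votes are weighted by inverse distance.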

By default, `metric` is `"minkowski"`, a generalization of Euclidean (Pythagorean) and Manhattan distance controlled by an exponent `p`; with the default `p=2` it is identical to Euclidean distance.  `"manhattan"` distance is also often useful; it measures the "city blocks" needed to get from point to point (i.e. the sum of the distances along each dimension).  Others are available and are occasionally better choices.

| identifier | distance function
|------------|----------------------
| euclidean  | $$ \sqrt{\sum (x-y)^2} $$
| manhattan  | $$ \sum \lvert x-y \rvert $$
| chebyshev  | $$ \max \lvert x-y \rvert $$
| minkowski  | $$ \Big(\sum \lvert x-y \rvert^p\Big)^{1/p} $$
| wminkowski | $$ \Big(\sum \lvert w \cdot (x-y) \rvert^p\Big)^{1/p} $$
| seuclidean | $$ \sqrt{\sum \frac{(x-y)^2}{V}} $$
| mahalanobis| $$ \sqrt{(x-y)^{\top} \cdot V^{-1} \cdot (x-y)} $$

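The first few distance functions in the table can be checked by hand with NumPy. This is a small sketch using two arbitrary 2-D points; the helper `minkowski` is defined here purely for illustration and is not a scikit-learn function.

```python
import numpy as np

# Two arbitrary points forming a 3-4-5 right triangle
x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])
diff = np.abs(x - y)

def minkowski(diff, p):
    """Minkowski distance of order p from absolute coordinate differences."""
    return (diff ** p).sum() ** (1 / p)

euclidean = np.sqrt((diff ** 2).sum())
manhattan = diff.sum()
chebyshev = diff.max()

print(euclidean, manhattan, chebyshev)
print(minkowski(diff, 2), minkowski(diff, 1))
```

With `p=2` the Minkowski distance matches the Euclidean one, and with `p=1` it matches the Manhattan one.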

Let us try combining a couple of these hyperparameters in the same search.

In [None]:
import numpy as np

metrics = ['minkowski', 'manhattan', 'euclidean', 'chebyshev']
K = range(1, 18, 2)
scores = np.empty((len(metrics),len(K)))

for x, k in enumerate(K):
    for y, metric in enumerate(metrics):
        knn = KNeighborsClassifier(n_neighbors=k, metric=metric)
        knn.fit(X_train, y_train)
        score = knn.score(X_test, y_test)
        scores[y, x] = score 

In [None]:
import seaborn as sns
ax = sns.heatmap(scores, 
                 annot=True, 
                 xticklabels=K,
                 yticklabels=metrics,
                 )
ax.set_title('Model response to 2 hyperparameters');

## GridSearchCV

So far, so good.  It would not be *too* hard to keep track of the best model discovered within the inner loop.  And any Python programmer could construct more nested loops to search over 3, or 4, or 5 different hyperparameters.
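For instance, locating the best cell of a two-dimensional score array takes only a couple of lines. In this sketch the `scores` values are placeholders invented for illustration, not results from the cancer dataset.

```python
import numpy as np

metrics = ["manhattan", "euclidean", "chebyshev"]
K = [1, 3, 5, 7]

# Placeholder accuracies for illustration only, not real results
scores = np.array([
    [0.90, 0.93, 0.94, 0.92],
    [0.89, 0.92, 0.93, 0.93],
    [0.88, 0.91, 0.92, 0.91],
])

# argmax works on the flattened array; unravel_index recovers (row, column)
row, col = np.unravel_index(scores.argmax(), scores.shape)
best_metric, best_k = metrics[row], K[col]
print(best_metric, best_k, scores[row, col])
```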

We could store all the scores in a parameter grid of N dimensions.  
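One way to avoid writing ever deeper nested loops is to enumerate the cross-product of hyperparameter values with `itertools.product`. The sketch below only builds the combinations; the dictionary `parameters` is an assumed example grid, and fitting a model for each combination is left out.

```python
from itertools import product

# An assumed example grid of hyperparameter values
parameters = {
    "n_neighbors": range(1, 18, 2),
    "weights": ["uniform", "distance"],
    "metric": ["minkowski", "manhattan", "euclidean", "chebyshev"],
}

# One dict per point in the N-dimensional grid
names = list(parameters)
combinations = [dict(zip(names, values))
                for values in product(*parameters.values())]

print(len(combinations))   # 9 * 2 * 4 combinations
print(combinations[0])
```

Each dictionary could be unpacked into a model constructor, e.g. `KNeighborsClassifier(**combo)`.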

Maybe while we are at it, it would be nice to remember the training and scoring times different hyperparameters take. 

It could be useful to allow different scoring metrics to be used within the nested search of hyperparameters, or even to compute several scoring functions at once, since each may rate a given set of hyperparameters differently.

We might also want our code to perform more robust and configurable train/test split strategies.

But really, it is much easier to use the `GridSearchCV` class from scikit-learn, which does all of this for us and is well tested against the pitfalls, bugs, and edge cases we might overlook.

In [None]:
from sklearn.model_selection import GridSearchCV

parameters = {'n_neighbors': range(1, 18, 2),
              'weights': ['uniform', 'distance'],
              'metric': ['minkowski', 'manhattan', 'euclidean', 'chebyshev']
             }

grid = GridSearchCV(KNeighborsClassifier(), parameters)
# Best fit over cross-product of parameter space, cross-validated
model = grid.fit(cancer.data, cancer.target)
model
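The wished-for features listed earlier, multiple scoring functions and a configurable splitting strategy, are available through the `scoring`, `refit`, and `cv` arguments. The following is a sketch under some assumptions: the grid is deliberately small so it runs quickly, and the scorer name `malignant_recall` is our own label, built with `make_scorer` because "malignant" is encoded as 0 in this dataset.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

cancer = load_breast_cancer()

# A deliberately small grid so the sketch runs quickly
small_grid = {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]}

scoring = {
    "accuracy": "accuracy",
    # Label 0 is "malignant" in this dataset
    "malignant_recall": make_scorer(recall_score, pos_label=0),
}
# Shuffled, stratified 5-fold splits instead of the default strategy
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# With several scorers, refit names the one used to pick the best model
search = GridSearchCV(KNeighborsClassifier(), small_grid,
                      scoring=scoring, refit="accuracy", cv=cv)
search.fit(cancer.data, cancer.target)

print(search.best_params_)
print(search.cv_results_["mean_test_malignant_recall"])
```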

### Identifying the best hyperparameters

One additional nice detail of `GridSearchCV` is that, by default, once the best set of hyperparameters is identified, the model is refit on the entire dataset passed to `.fit()` rather than only on a single cross-validation training fold.  The hyperparameters are chosen using held-out folds, which guards against overfitting in the choice itself, while the final model still benefits from training on all the data.

The refit model is available as the `best_estimator_` attribute, and the grid search object itself can also be used directly for prediction, since it delegates to that refit estimator (provided the argument `refit` is kept at its default value of `True`).

In [None]:
print(model.best_params_,'\n')
print(model.best_estimator_,'\n')
print(model.best_score_)

In [None]:
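best_model = model.best_estimator_
# The grid search object delegates .predict() to the refit best estimator,
# so this is equivalent to best_model.predict(cancer.data)
model.predict(cancer.data)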

### Examining the search space

Some of the information collected about the search is the time taken for each step.  Because a search across a large, multi-dimensional hyperparameter space can involve many combinations, the fitting can take considerable time. `KNeighborsClassifier` was chosen for this lesson in part because it is a very fast model.

Moreover, while KNN performs pretty much equally quickly across a range of hyperparameters, that is definitely **not true** of many other models.  In some cases, a hyperparameter chooses among different algorithms with very different performance characteristics.  In others, a hyperparameter chooses among threshold-type values that can greatly affect convergence rates or other computational details.  Being able to know not only which combination of hyperparameters has better *accuracy*, but also what the relative *performance* of each is, can be important.

In [None]:
(pd.DataFrame(grid.cv_results_)
   .set_index('rank_test_score')
   .sort_index()
)
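The full results table is wide, so it can help to select only the columns relating accuracy to timing. This sketch runs its own small search so that it is self-contained; the names `small_grid`, `search`, and `summary` are illustrative, and the column names are those found in `cv_results_` for a single-metric search.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

cancer = load_breast_cancer()

# A deliberately small grid so the sketch runs quickly
small_grid = {"n_neighbors": [3, 5, 7], "metric": ["manhattan", "euclidean"]}
search = GridSearchCV(KNeighborsClassifier(), small_grid)
search.fit(cancer.data, cancer.target)

# Keep the parameters, the mean score, and the mean fit/score times
columns = ["params", "mean_test_score", "mean_fit_time", "mean_score_time"]
summary = (pd.DataFrame(search.cv_results_)
             .set_index("rank_test_score")
             .sort_index()[columns])
print(summary)
```

Times are reported in seconds and will vary from machine to machine.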

## Next lesson

Please also attend the companion course **Intermediate Machine Learning with scikit-learn**.  We will look at clustering, feature engineering and feature selection, pipelines, and robust train/test splits.