<h1 style="color: rebeccapurple;">Cross-validation</h1>

<span style="text-transform: uppercase;
        font-size: 14px;
        letter-spacing: 1px;
        font-family: 'Segoe UI', sans-serif;">
    Author
</span><br>
efrén cruz cortés
<hr style="border: none; height: 1px; background: linear-gradient(to right, transparent 0%, #ccc 10%, transparent 100%); margin-top: 10px;">

Now we arrive at a very important machine learning step: cross-validation.

## <span style="color:darkorange">Conceptual Intermezzo - cross-validation</span>

See slides

## <span style="color:darkorchid"> Imports</span>

In [1]:
# Scikit-learn specifics:
from sklearn import datasets
from sklearn import preprocessing
from sklearn.pipeline import Pipeline
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Helper modules
import numpy as np

## <span style="color:darkorchid"> Validating Alice's methods</span>

In [2]:
# Load data, split into train/test
X, y = datasets.load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

Remember our test data is sacred, and cannot be touched until the very end!

In [3]:
# Create the classifier: standardize the features, then fit an SVM
wine_clf = Pipeline(
    [
        ("preprocessor", preprocessing.StandardScaler()),
        ("classifier", svm.SVC())
    ]
)
wine_clf

### <span style="color:teal"> Cross-validation scores</span>

And then, without fitting the classifier ourselves (`cross_val_score` fits a fresh copy on each fold internally):

In [4]:
# Get cross-validation scores (one for each of the k folds)
k = 5
cv_scores = cross_val_score(
    estimator=wine_clf,
    X=X_train,
    y=y_train,
    cv=k
)

In [5]:
# Print the info we get
summary = f"k-fold scores: {cv_scores}\n\nAverage cv score: {cv_scores.mean():.3f}"
print(summary)

k-fold scores: [0.96551724 0.96551724 1.         0.96428571 0.96428571]

Average cv score: 0.972


This score gives us a measure of how good, on average, our classifier is at generalizing.
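To make explicit what `cross_val_score` did with our unfitted pipeline, here is a rough hand-written equivalent. It is a sketch, assuming that an integer `cv` on a classifier means stratified k-fold splitting without shuffling (scikit-learn's documented default); the names `pipe`, `fold_clf`, and `manual_scores` are made up for this illustration.

```python
import numpy as np
from sklearn import datasets, preprocessing, svm
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

# Same data and split as in the cells above
X, y = datasets.load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

pipe = Pipeline(
    [
        ("preprocessor", preprocessing.StandardScaler()),
        ("classifier", svm.SVC())
    ]
)

manual_scores = []
for fit_idx, val_idx in StratifiedKFold(n_splits=5).split(X_train, y_train):
    # A fresh, unfitted copy for every fold, so no fold sees another fold's fit
    fold_clf = clone(pipe)
    fold_clf.fit(X_train[fit_idx], y_train[fit_idx])
    # Score on the held-out fold (accuracy, the classifier's default metric)
    manual_scores.append(fold_clf.score(X_train[val_idx], y_train[val_idx]))

manual_scores = np.array(manual_scores)
print(manual_scores, manual_scores.mean())
```

Note that the scaler is refit inside every fold as well, so the held-out fold never influences the preprocessing. That is the main reason to pass the whole pipeline rather than pre-scaled data.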

### <span style="color:teal"> Choosing a scoring metric</span>

Remember our discussion about metrics? The scores returned by `cross_val_score` use the default scoring metric of our classifier, which in this case is accuracy. If you want a different metric, you can indicate it with the `scoring` parameter:

In [6]:
cv_scores_prec = cross_val_score(
    estimator=wine_clf,
    X=X_train, y=y_train, cv=k,
    scoring='f1_macro' # <- average f1 over all classes
)

In [7]:
# Print the info we get
summary = f"k-fold scores: {cv_scores_prec}\n\nAverage cv score: {cv_scores_prec.mean():.3f}"
print(summary)

k-fold scores: [0.96444444 0.96705882 1.         0.96451914 0.96328502]

Average cv score: 0.972


Since our data is nice and balanced, we don't really see a difference. In the exercise you will check an imbalanced dataset :-)

There is a related function called `cross_validate`, which is a bit more versatile than `cross_val_score`, but we won't look into it here.

### <span style="color:teal"> Hyperparameter Search</span>

Now, how is our score above useful? Well, it becomes useful when we want to compare different models! In this specific case, all the models we compare are SVMs, and they differ only in their hyperparameters.

As a reminder, the SVM classifier had two hyperparameters:
- `C`, a regularization parameter (default $1$)
- `gamma`, a local influence parameter (default `'scale'`, which is computed from the number of features and the variance of the data)
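To make the `'scale'` default concrete: scikit-learn documents it as `1 / (n_features * X.var())`. The sketch below computes that value by hand and checks that passing it explicitly behaves like `gamma='scale'`. It fits on the full wine data purely for illustration, and `gamma_scale`, `clf_named`, and `clf_value` are names invented here.

```python
import numpy as np
from sklearn import datasets, svm

X, y = datasets.load_wine(return_X_y=True)

# gamma='scale' corresponds to 1 / (n_features * variance of all entries in X)
gamma_scale = 1.0 / (X.shape[1] * X.var())
print(f"gamma for 'scale' on this data: {gamma_scale:.6f}")

# Two classifiers: one with the named default, one with the number computed above
clf_named = svm.SVC(C=1, gamma='scale').fit(X, y)
clf_value = svm.SVC(C=1, gamma=gamma_scale).fit(X, y)

# Both should produce the same decision values
same_decisions = np.allclose(
    clf_named.decision_function(X),
    clf_value.decision_function(X)
)
print(same_decisions)
```

Because the value depends on the variance of the data, `'scale'` changes whenever the features are rescaled, while a fixed number such as `.5` does not.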

When we created our classifier above, we didn't indicate the hyperparameters, so the defaults were used. We could have set them explicitly:

In [8]:
clf_1 = svm.SVC(C=1, gamma='scale')

Now, let's say we want to know if a lower value of `C` (say, $.01$), will yield better performance. What about other values of `gamma` (say, $.5$)?

In [9]:
# We can create a bunch of other classifiers:
clf_2 = svm.SVC(C=.01, gamma='scale')
clf_3 = svm.SVC(C=1, gamma=.5)
clf_4 = svm.SVC(C=.01, gamma=.5)

In [10]:
# And then we compute our average cross-validation score for all of them:
cv_1 = cross_val_score(clf_1, X_train, y_train, cv=k).mean()
cv_2 = cross_val_score(clf_2, X_train, y_train, cv=k).mean()
cv_3 = cross_val_score(clf_3, X_train, y_train, cv=k).mean()
cv_4 = cross_val_score(clf_4, X_train, y_train, cv=k).mean()

In [12]:
print(f"C=1, g=scale: {cv_1:.3f}\nC=.01, g=scale: {cv_2:.3f}\nC=1, g=.5: {cv_3:.3f}\nC=.01, g=.5: {cv_4:.3f}")

C=1, g=scale: 0.649
C=.01, g=scale: 0.401
C=1, g=.5: 0.401
C=.01, g=.5: 0.401


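Nice, now we have a better idea of which `C` to use: `C=1` with the default `gamma` clearly beats the rest. For `gamma` the picture is mixed: with `C=.01` both values score the same, but with `C=1` the default `'scale'` does better than `.5`. Note also that these are bare `SVC` objects without the `StandardScaler` step of our pipeline, which likely explains why the scores are far below the $0.972$ we got earlier. In any case, our search was not very comprehensive.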

<span style="color:red">**WARNING**</span>

Remember that, when comparing performance for different hyperparameters, we should not use the test set! Only the training set. The test set is the ultimate performance result for our classifier, *once the classifier has been set in stone*.

Anyway, back to our hyperparameters. Wouldn't it be nice to search over a larger set of hyperparameter combinations? Yes it would!

We can either write our own loops by hand, or we can use sklearn's `GridSearchCV` and `RandomizedSearchCV`. These are classes that do the whole hyperparameter search using cross-validation for us, reducing everything to a couple of lines. Let's try them out.
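For comparison, this is roughly what the hand-written loop would look like for the four combinations we tried earlier. It is only a sketch: the names `C_values`, `gamma_values`, and `results` are made up here, and it uses bare `SVC` objects exactly as in the cells above.

```python
from itertools import product

from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score, train_test_split

# Same data and split as before
X, y = datasets.load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

C_values = [.01, 1]
gamma_values = ['scale', .5]

# Average cross-validation score for every (C, gamma) pair
results = {}
for C, gamma in product(C_values, gamma_values):
    clf = svm.SVC(C=C, gamma=gamma)
    results[(C, gamma)] = cross_val_score(clf, X_train, y_train, cv=5).mean()

# Pick the pair with the highest average score
best_C, best_gamma = max(results, key=results.get)
print(f"Best: C={best_C}, gamma={best_gamma}, score={results[(best_C, best_gamma)]:.3f}")
```

The search classes do the same bookkeeping, and additionally refit the best model on the full training data.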

When to use one vs the other? As the name suggests, `GridSearchCV` will search over the full grid of hyperparameter combinations. On the other hand, if you have many hyperparameters (a high-dimensional search space) or you want to explore many values, you can use `RandomizedSearchCV`, which evaluates only a specified number of points sampled from the hyperparameter space.

![hyperparameter-grid](images/cv-grid.png){width=75%}

Let's do a grid search over our SVM hyperparameters (random should be similar):

In [13]:
# Setting up the parameter values
C_space = np.arange(.01, 10, .5)
gamma_space = np.arange(.01, 2, .05)
parameter_space = {
    'C': C_space,
    'gamma': gamma_space
}

# Create a simple classifier object (you can also do this with full pipelines)
clf_svc = svm.SVC()
# create the cross validation object:
clf_cross_val_grid = GridSearchCV(
    estimator=clf_svc,
    param_grid=parameter_space,
)
clf_cross_val_grid

Time to fit! This evaluates every combination of parameters: $20$ values of `C` times $40$ values of `gamma`, so $800$ combinations in total, each of them cross-validated (with $5$ folds by default). This is why `RandomizedSearchCV` should be your go-to in a realistic setting.
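The count can be checked before fitting anything. The sketch below rebuilds the same `parameter_space` and uses `ParameterGrid` to enumerate it; it assumes the default of 5 folds (we did not pass `cv`) and the default `refit=True`, which adds one final fit on the whole training set.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same parameter values as in the cell above
parameter_space = {
    'C': np.arange(.01, 10, .5),
    'gamma': np.arange(.01, 2, .05)
}

n_combinations = len(ParameterGrid(parameter_space))
n_folds = 5  # GridSearchCV's default when `cv` is not given

# One fit per combination and fold, plus the final refit of the best model
n_fits = n_combinations * n_folds + 1
print(f"{n_combinations} combinations, {n_fits} model fits")
```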

In [14]:
clf_cross_val_grid.fit(X_train, y_train)

Notice the new object diagram has a "best_estimator" as opposed to an "estimator" as before! If you click on it, it will show you the best combination of parameters. We can also obtain them programmatically:

In [15]:
clf_cross_val_grid.best_params_

{'C': np.float64(1.51), 'gamma': np.float64(0.01)}

With this knowledge, we could train a new classifier using the full training data and the best hyperparameters found. However, `GridSearchCV` does this for us automatically (its `refit` parameter defaults to `True`); that's what `best_estimator_` is. We can therefore call methods like `predict` and `score` directly on our CV object:

In [16]:
clf_cross_val_grid.score(X_test, y_test)

0.6666666666666666
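The search above used a bare `SVC`. As the comment in the setup cell mentioned, the same approach works with a full pipeline: parameter names are then prefixed with the step name and a double underscore. The sketch below uses a deliberately small grid whose values were chosen arbitrarily for illustration; `pipe`, `pipe_param_grid`, and `pipe_search` are hypothetical names. It only touches the training data.

```python
from sklearn import datasets, preprocessing, svm
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

X, y = datasets.load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

pipe = Pipeline(
    [
        ("preprocessor", preprocessing.StandardScaler()),
        ("classifier", svm.SVC())
    ]
)

# "<step name>__<parameter name>" addresses a parameter inside a pipeline step
pipe_param_grid = {
    "classifier__C": [.1, 1, 10],
    "classifier__gamma": ["scale", .01, .1]
}

pipe_search = GridSearchCV(estimator=pipe, param_grid=pipe_param_grid, cv=5)
pipe_search.fit(X_train, y_train)

print(pipe_search.best_params_)
print(f"Best average cv score: {pipe_search.best_score_:.3f}")

# The search object delegates predict/score to the refit best_estimator_
same_predictions = (
    pipe_search.predict(X_train) == pipe_search.best_estimator_.predict(X_train)
).all()
print(same_predictions)
```

Here `best_score_` is the average cross-validation score of the winning combination, so it can be used for model comparison without looking at the test set.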

### <span style="color:teal"> When to use Random Search vs Grid Search</span>

We're dealing with a small dataset, low dimensionality, and only a few hyperparameters. Hence, sklearn was able to evaluate all $800$ combinations in a couple of minutes. However, if your number of hyperparameters grows, or fitting the model is time-consuming, grid search becomes prohibitive. In that case we should use random search and choose a reasonable number of sample points.

Let's compare our results above with a random search approach. Pay attention to how long the `fit()` method takes this time compared to before.

In [20]:
# Create a simple classifier object (you can also do this with full pipelines)
clf_svc = svm.SVC()
# create the cross validation object:
clf_cross_val_random = RandomizedSearchCV(
    estimator=clf_svc,
    param_distributions=parameter_space,
    n_iter=80       # Indicate how many points to sample
)
clf_cross_val_random

In [21]:
# fit it
clf_cross_val_random.fit(X_train, y_train)

In [22]:
clf_cross_val_random.score(X_test, y_test)

0.6666666666666666

Much faster! We evaluated ten times fewer parameter combinations than with the brute-force grid search, and in this run the test score is the same.
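In the cell above, random search sampled from the same fixed lists of values as the grid. `RandomizedSearchCV` can also sample from continuous distributions, which is a common variation. The sketch below is one way to do that, assuming SciPy is installed; the ranges and `n_iter=20` are arbitrary choices for illustration, and `random_state` is set only so the sampled points are reproducible.

```python
from scipy.stats import loguniform
from sklearn import datasets, svm
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = datasets.load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

# Sample C and gamma on a log scale instead of from fixed lists
param_distributions = {
    "C": loguniform(1e-2, 1e1),
    "gamma": loguniform(1e-3, 1e0)
}

random_search = RandomizedSearchCV(
    estimator=svm.SVC(),
    param_distributions=param_distributions,
    n_iter=20,        # number of sampled points
    random_state=0    # makes the sampled points reproducible
)
random_search.fit(X_train, y_train)

print(random_search.best_params_)
print(f"Best average cv score: {random_search.best_score_:.3f}")
```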