<h1 style="color: rebeccapurple;">Cross-validation</h1>

<span style="text-transform: uppercase;
        font-size: 14px;
        letter-spacing: 1px;
        font-family: 'Segoe UI', sans-serif;">
    Author
</span><br>
efrén cruz cortés
<hr style="border: none; height: 1px; background: linear-gradient(to right, transparent 0%, #ccc 10%, transparent 100%); margin-top: 10px;">

Now we are at a very important machine learning step: cross-validation.

## <span style="color:darkorange">Conceptual Intermezzo - cross-validation</span>

See slides

## <span style="color:darkorchid"> Imports</span>

In [25]:
# Scikit-learn specifics:
from sklearn import datasets
from sklearn import preprocessing
from sklearn.pipeline import Pipeline
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Helper modules
import numpy as np
import pandas as pd

In [24]:
# :: DATA ::

try:
    import google.colab
    !wget https://raw.githubusercontent.com/nuitrcs/scikit-learn-workshop-july2025/refs/heads/main/data/red_wine_binary.csv
    !wget https://raw.githubusercontent.com/nuitrcs/scikit-learn-workshop-july2025/refs/heads/main/data/red_wine_binary_imbalanced.csv
    red_wine_directory = "red_wine_binary.csv"
    red_wine_imbalanced_directory = "red_wine_binary_imbalanced.csv"
    print("Successfully loaded files to Colab. Check folder on left column.")
except ModuleNotFoundError:
    red_wine_directory = "data/red_wine_binary.csv"
    red_wine_imbalanced_directory = "data/red_wine_binary_imbalanced.csv"
    print("Data should be in your local directory. Under the 'data' folder.")

Data should be in your local directory. Under the 'data' folder.


## <span style="color:darkorchid"> Validating Alice's methods</span>

Let's get back to the Italian wine dataset.

Alice's arch-nemesis and wine snob Dr. Mac G. Uffin (more on him in the clustering section) claims that she can't show her models will perform well without using the test dataset, at which point it is too late to change or tune them.

Alice knows she must hide the test set until the very end, but still would like to have a sense of how well her models generalize and be able to tune hyperparameters. Alice figures out she can beat Mac G. Uffin using cross-validation!

![macguffin](images/villanous_macguffin.png){width=40%}

In [2]:
# Load data, split into train/test
X, y = datasets.load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

Remember our test data is sacred, and cannot be touched until the very end!

In [3]:
# Create  classifier
wine_clf = Pipeline(
    [
        ("preprocessor", preprocessing.StandardScaler()),
        ("classifier", svm.SVC())
    ]
)
wine_clf

### <span style="color:teal"> Cross-validation scores</span>

We don't need to call `fit` ourselves to get the cross-validation scores. `cross_val_score` fits a fresh copy of the classifier on each fold internally:

In [4]:
# Get cross-validation scores (one for each of the k folds)
k = 5
cv_scores = cross_val_score(
    estimator=wine_clf,
    X=X_train,
    y=y_train,
    cv=k
)

In [6]:
summary = f"k-fold scores: {cv_scores}\n\nAverage cv score: {cv_scores.mean():.3f}"
print(summary)

k-fold scores: [0.96551724 0.96551724 1.         0.96428571 0.96428571]

Average cv score: 0.972


This score gives us a measure of how good, on average, our classifier is at generalizing.
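
To make the k-fold procedure less of a black box, here is a minimal sketch of the same computation written as an explicit loop. It assumes that, for a classifier and an integer `cv`, scikit-learn splits the data with unshuffled stratified folds; the names `folds`, `fold_clf`, `manual_scores` and `builtin_scores` are made up for this illustration.

```python
import numpy as np
from sklearn import datasets, preprocessing, svm
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

# Same data, split and pipeline as in the cells above
X, y = datasets.load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)
wine_clf = Pipeline(
    [
        ("preprocessor", preprocessing.StandardScaler()),
        ("classifier", svm.SVC())
    ]
)

# Assumption: an integer cv with a classifier means stratified folds, no shuffling
folds = StratifiedKFold(n_splits=5)

manual_scores = []
for fit_idx, val_idx in folds.split(X_train, y_train):
    fold_clf = clone(wine_clf)                         # fresh, unfitted copy for this fold
    fold_clf.fit(X_train[fit_idx], y_train[fit_idx])   # train on the other k-1 folds
    manual_scores.append(fold_clf.score(X_train[val_idx], y_train[val_idx]))  # accuracy on the held-out fold
manual_scores = np.array(manual_scores)

builtin_scores = cross_val_score(wine_clf, X_train, y_train, cv=5)
print("manual loop:    ", manual_scores)
print("cross_val_score:", builtin_scores)
```

Note that the scaler is re-fit inside each fold, using only that fold's training portion. This is one reason to cross-validate the whole pipeline rather than pre-scaled data.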

### <span style="color:teal"> Choosing a scoring metric</span>

Remember our discussion about metrics? By default, the scores returned by `cross_val_score` come from the estimator's own `score` method, which for our classifier is accuracy. If you want a different metric, you can indicate it with the `scoring` parameter. See the documentation [here](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-string-names) for a comprehensive list.

Now, before we had used the *decision function*, but some of the metrics below can only be computed from *probabilities*. For that, we'll need to set the `probability` parameter of the classifier in our pipeline:

In [12]:
# Let's try these metrics:
cv_metrics = ['f1_macro', 'average_precision', 'roc_auc_ovr']
# (ovr stands for "one-vs-rest", this is how performance is computed when doing multi-class classification, which is our case)

# For the above to work we actually need to enable the probabilities for our classifier:
wine_clf = Pipeline(
    [
        ("preprocessor", preprocessing.StandardScaler()),
        ("classifier", svm.SVC(probability=True))
    ]
)

# Now let's compute the scores for all metrics we are interested in:
cv_scores_list = []
for met in cv_metrics:
    cv_scores = cross_val_score(
        estimator=wine_clf,
        X=X_train, y=y_train, cv=k,
        scoring=met
    )
    cv_scores_list.append(cv_scores)

# Print the info we get
summaries = [
    f"k-fold scores ({met}): {cv_scores}\nAverage cv score ({met}): {cv_scores.mean():.3f}\n\n"
    for met, cv_scores in zip(cv_metrics, cv_scores_list)
]
for summary in summaries:
    print(summary)

k-fold scores (f1_macro): [0.96444444 0.96705882 1.         0.96451914 0.96328502]
Average cv score (f1_macro): 0.972


k-fold scores (average_precision): [1.         0.99415954 1.         0.99284512 1.        ]
Average cv score (average_precision): 0.997


k-fold scores (roc_auc_ovr): [1.         0.99651416 1.         0.99613414 1.        ]
Average cv score (roc_auc_ovr): 0.999




Since our data is nice and balanced, we don't really see a difference. In the exercise you will check an imbalanced dataset :-)

There is a related function called `cross_validate`, which is a bit more versatile than `cross_val_score`. But we won't look into it.
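
For the curious, here is a small sketch of what `cross_validate` looks like in practice. It is not part of the workshop flow. It evaluates several metrics in a single pass instead of looping over them, and also records fit and score times. The variable name `cv_results` is just an illustrative choice.

```python
from sklearn import datasets, preprocessing, svm
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import Pipeline

X, y = datasets.load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

wine_clf = Pipeline(
    [
        ("preprocessor", preprocessing.StandardScaler()),
        ("classifier", svm.SVC(probability=True))   # probabilities needed for roc_auc_ovr
    ]
)

# One call, several metrics: the result is a dict of arrays, one entry per fold
cv_results = cross_validate(
    estimator=wine_clf,
    X=X_train, y=y_train, cv=5,
    scoring=["f1_macro", "roc_auc_ovr"]
)

print(sorted(cv_results.keys()))
for name in ["test_f1_macro", "test_roc_auc_ovr"]:
    print(f"{name}: {cv_results[name].mean():.3f}")
```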

### <span style="color:red">Exercise</span>
<hr>

Repeat the above using any of the Portuguese wine datasets.
- Remember to set `probability` to `True`.
- Name your classifier `port_wine_clf` or something like that to differentiate it from the Italian wine one above.
- Discuss the results with your neighbor(s).

<hr>

### <span style="color:teal"> Hyperparameter Search</span>

Now, how is our score above useful? Well, it becomes useful when we want to compare different hyperparameter values!

As a reminder, the SVM classifier had two hyperparameters:
- `C`, a regularization parameter (default $1$)
- `gamma`, a local influence parameter (default *scale* - function of variance)
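
As a quick check of the defaults listed above, this sketch reads them from an `SVC` object and computes the value that `gamma='scale'` resolves to, using the formula given in the scikit-learn documentation, `1 / (n_features * X.var())`. The names `defaults` and `gamma_scale` are made up for the example.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

X, y = datasets.load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

# Default hyperparameters of an SVC that was created without arguments
defaults = svm.SVC().get_params()
print(f"C: {defaults['C']}, gamma: {defaults['gamma']}")

# Documented formula for gamma='scale', applied to the data the classifier is fit on
n_features = X_train.shape[1]
gamma_scale = 1.0 / (n_features * X_train.var())
print(f"gamma='scale' on the unscaled training data: {gamma_scale:.2e}")
```

Because the value depends on the variance of the data, `'scale'` resolves to a different number on raw features than on standardized ones.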

When we created our classifier above, we didn't specify the hyperparameters, so the defaults were used. We could have set them explicitly:

In [13]:
clf_1 = svm.SVC(C=1, gamma='scale')

Now, let's say we want to know if a lower value of `C` (say, $.01$), will yield better performance. What about other values of `gamma` (say, $.5$)?

In [14]:
# We can create a bunch of other classifiers:
clf_2 = svm.SVC(C=.01, gamma='scale')
clf_3 = svm.SVC(C=1, gamma=.5)
clf_4 = svm.SVC(C=.01, gamma=.5)

In [15]:
# And then we compute our average cross-validation score for all of them:
cv_1 = cross_val_score(clf_1, X_train, y_train, cv=k).mean()
cv_2 = cross_val_score(clf_2, X_train, y_train, cv=k).mean()
cv_3 = cross_val_score(clf_3, X_train, y_train, cv=k).mean()
cv_4 = cross_val_score(clf_4, X_train, y_train, cv=k).mean()

In [16]:
print(f"C=1, g=scale: {cv_1:.3f}\nC=.01, g=scale: {cv_2:.3f}\nC=1, g=.5: {cv_3:.3f}\nC=.01, g=.5: {cv_4:.3f}")

C=1, g=scale: 0.649
C=.01, g=scale: 0.401
C=1, g=.5: 0.401
C=.01, g=.5: 0.401


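Nice, now we have a better idea of which `C` to use: with `gamma='scale'`, `C=1` clearly beats `C=.01`. The value of `gamma` matters too, at least when `C=1` ($0.649$ with `'scale'` versus $0.401$ with `.5`); with `C=.01` both values of `gamma` give the same score. Keep in mind that these classifiers are bare `SVC` objects without the scaling step of our pipeline, so their scores are not directly comparable to the $0.972$ we got earlier. In any case, our search was not super comprehensive.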

**WARNING**

Remember that, when comparing performance for different hyperparameters, we should not use the test set, only the training set! The test set gives the final performance estimate for our classifier, *once the classifier has been set in stone*.

Anyway, back to our hyperparameters. Wouldn't it be nice to search over a larger set of hyperparameter combinations? Yes it would!

We can either write our own loops by hand or use sklearn's `GridSearchCV` and `RandomizedSearchCV`. These are classes that do the whole hyperparameter search using cross-validation for us, simplifying everything to a couple of lines. Let's try them out.
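
For reference, this is roughly what the hand-written version could look like. It is only a sketch: the candidate values in `C_values` and `gamma_values` are arbitrary choices for illustration, and `results` and `best_combo` are names invented here.

```python
from itertools import product

from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score, train_test_split

X, y = datasets.load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

# Illustrative candidate values (assumption: any small set would do here)
C_values = [0.01, 1, 10]
gamma_values = ["scale", 0.01, 0.5]

# Average cross-validation score for every (C, gamma) combination
results = {}
for C, gamma in product(C_values, gamma_values):
    clf = svm.SVC(C=C, gamma=gamma)
    results[(C, gamma)] = cross_val_score(clf, X_train, y_train, cv=5).mean()

# Keep the combination with the highest average score
best_combo = max(results, key=results.get)
print(f"Best (C, gamma): {best_combo}, average cv score: {results[best_combo]:.3f}")
```

The loop works, but the bookkeeping (storing scores, picking the best, refitting on the full training data) is ours to write.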

When to use one vs the other? As the name suggests, `GridSearchCV` will search over the full grid of hyperparameter combinations. On the other hand, if you have many hyperparameters (a high-dimensional search space) or want to explore many values for each, the full grid becomes too large and you can use `RandomizedSearchCV` instead. This one evaluates only a number of sample points, which you specify, drawn from the hyperparameter space.

(The code for the figures below is in the `support_materials.ipynb` file).

![hyperparameter-grid](images/cv-grid.png){width=75%}
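
To see the two strategies side by side in code, the sketch below only enumerates the candidate points, without fitting anything. It uses sklearn's `ParameterGrid` and `ParameterSampler` helpers; the search space and the number of samples are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Illustrative search space: 20 values for C, 40 for gamma
parameter_space = {
    "C": np.arange(.01, 10, .5),
    "gamma": np.arange(.01, 2, .05)
}

# Grid search visits every combination
grid_points = list(ParameterGrid(parameter_space))

# Random search visits only n_iter combinations, picked at random
random_points = list(ParameterSampler(parameter_space, n_iter=80, random_state=42))

print(f"Grid points: {len(grid_points)}, random points: {len(random_points)}")
print("A few random points:", random_points[:3])
```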

Let's do a grid search over our SVM hyperparameters (random should be similar):

In [None]:
# Setting up the parameter values
C_space = np.arange(.01, 10, .5)
gamma_space = np.arange(.01, 2, .05)
parameter_space = {
    'C': C_space,
    'gamma': gamma_space
}

# Create a simple classifier object (you can also do this with full pipelines)
clf_svc = svm.SVC()
# create the cross validation object:
clf_cross_val_grid = GridSearchCV(
    estimator=clf_svc,
    param_grid=parameter_space
)
clf_cross_val_grid

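Time to fit! This will evaluate every combination of parameters: $20$ possibilities for `C`, $40$ for `gamma`, so $800$ combinations in total. Each combination is cross-validated (with $5$ folds by default), so the model is actually fit $4000$ times, plus one final refit. This is why `RandomizedSearchCV` should be your go-to in a realistic setting.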

In [18]:
clf_cross_val_grid.fit(X_train, y_train)

Take note of the time it took to run the above.

Notice the new object diagram has a "best_estimator" as opposed to an "estimator" as before (although some versions of sklearn may not show it)! If you click on it, it will show you the best combination of parameters. We can also obtain them programmatically:

In [19]:
clf_cross_val_grid.best_params_

{'C': np.float64(1.51), 'gamma': np.float64(0.01)}
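
Besides `best_params_`, a fitted search object also stores the best average score in `best_score_` and the full table of results in `cv_results_`. Here is a sketch of how one might inspect them, using a deliberately tiny grid (the values in `small_grid` are arbitrary) so it runs quickly.

```python
import pandas as pd
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = datasets.load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

# Tiny illustrative grid (assumption: chosen only to keep the example fast)
small_grid = {"C": [0.1, 1, 10], "gamma": [0.001, 0.01, 0.1]}
search = GridSearchCV(estimator=svm.SVC(), param_grid=small_grid, cv=5)
search.fit(X_train, y_train)

print(f"Best parameters: {search.best_params_}")
print(f"Best average cv score: {search.best_score_:.3f}")

# cv_results_ is a dict of arrays, one row per combination; pandas makes it readable
results = pd.DataFrame(search.cv_results_)
columns = ["param_C", "param_gamma", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score").head())
```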

With this knowledge, we could train a new pipeline using the full training data and the best hyperparameters found. However, `GridSearchCV` does it for us automatically; that's what `best_estimator_` is. We can therefore call methods like `predict` and `score` directly on our CV object:

In [51]:
clf_cross_val_grid.score(X_test, y_test)

0.6666666666666666
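
The search above ran on a bare `SVC`, with no scaling step. As the comment in the setup cell mentions, the same search also works on a full pipeline. A minimal sketch follows, assuming a small illustrative grid; the one thing to remember is that hyperparameter names are prefixed with the step name and a double underscore (`classifier__C`). The names `pipeline_grid` and `pipeline_search` are made up for the example.

```python
from sklearn import datasets, preprocessing, svm
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

X, y = datasets.load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

wine_clf = Pipeline(
    [
        ("preprocessor", preprocessing.StandardScaler()),
        ("classifier", svm.SVC())
    ]
)

# Step name + double underscore + parameter name (grid values are illustrative)
pipeline_grid = {
    "classifier__C": [0.1, 1, 10],
    "classifier__gamma": ["scale", 0.01, 0.1]
}

pipeline_search = GridSearchCV(estimator=wine_clf, param_grid=pipeline_grid, cv=5)
pipeline_search.fit(X_train, y_train)

print(f"Best parameters: {pipeline_search.best_params_}")
print(f"Best average cv score: {pipeline_search.best_score_:.3f}")
print(f"Test score: {pipeline_search.score(X_test, y_test):.3f}")
```

Comparing this test score with the one above gives a sense of how much the scaling step matters for an SVM on this dataset.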

### <span style="color:teal"> When to use Random Search vs Grid Search</span>

We're dealing with a small dataset, low dimensionality, and only a few hyperparameters. Hence, sklearn was able to evaluate all $800$ combinations in a couple of minutes. However, if your number of hyperparameters grows, or fitting the model is time consuming, grid search becomes prohibitive. In that case we should use random search and choose a reasonable number of sample points.

Let's compare our results above with a random search approach. Pay attention to how long the `fit()` method takes this time compared to before.

In [52]:
# Create a simple classifier object (you can also do this with full pipelines)
clf_svc = svm.SVC()
# create the cross validation object:
seed = 42
clf_cross_val_random = RandomizedSearchCV(
    estimator=clf_svc,
    param_distributions=parameter_space,
    n_iter=80,       # Indicate how many points to sample
    random_state=seed
)
clf_cross_val_random

In [53]:
# fit it
clf_cross_val_random.fit(X_train, y_train)

In [54]:
clf_cross_val_random.score(X_test, y_test)

0.6666666666666666

Much faster! Random search evaluated only $80$ combinations, ten times fewer than the $800$ of the brute-force grid search.
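
`RandomizedSearchCV` can also sample from continuous distributions instead of fixed lists of values, since `param_distributions` accepts any object with an `rvs` method. The sketch below uses `scipy.stats.loguniform`; the ranges and the number of iterations are assumptions chosen for illustration, not recommendations.

```python
from scipy.stats import loguniform
from sklearn import datasets, svm
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = datasets.load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

# Sample C and gamma on a log scale (illustrative ranges)
distributions = {
    "C": loguniform(1e-2, 1e1),
    "gamma": loguniform(1e-4, 1e0)
}

random_search = RandomizedSearchCV(
    estimator=svm.SVC(),
    param_distributions=distributions,
    n_iter=20,          # number of points to sample
    random_state=42
)
random_search.fit(X_train, y_train)

print(f"Best parameters: {random_search.best_params_}")
print(f"Best average cv score: {random_search.best_score_:.3f}")
```

A log scale spreads the samples evenly across orders of magnitude, which is often a sensible choice for parameters like `C` and `gamma`.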

**Scoring with hyperparameter search**

Both `GridSearchCV` and `RandomizedSearchCV` accept the `scoring` parameter, so you can try using something besides accuracy, like the $F_1$ score.
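
A minimal sketch of what that looks like, assuming a small illustrative grid and the macro-averaged $F_1$ score (`f1_macro`), since the Italian wine problem has more than two classes:

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = datasets.load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

# Illustrative grid; the combinations are now ranked by macro F1 instead of accuracy
small_grid = {"C": [0.1, 1, 10], "gamma": [0.001, 0.01, 0.1]}
f1_search = GridSearchCV(
    estimator=svm.SVC(),
    param_grid=small_grid,
    scoring="f1_macro",
    cv=5
)
f1_search.fit(X_train, y_train)

print(f"Best parameters: {f1_search.best_params_}")
print(f"Best average cv f1_macro: {f1_search.best_score_:.3f}")
# score() on the search object now also reports f1_macro
print(f"Test f1_macro: {f1_search.score(X_test, y_test):.3f}")
```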

### <span style="color:red">Exercise - Bo's Ultimate Undertaking</span>

Bo now has a lot of tools at their disposal: all types of pre-processors, performance metrics, and the cross-validation framework.
- Use the digits dataset we used back in section $006$ and create a full classification pipeline, from the train/test split to hyperparameter tuning using cross-validation.
- Feel free to make any decisions that come up along the way, from the test data size to the scoring metric.
- Discuss the results with your neighbors:
    - What were the best hyperparameters found?
    - How did the classifier perform on the test set?
    - Did your results differ from those of your neighbor? If so, discuss the differences.