# Lab 2: Cross Validation with scikit-learn

In this session, you will learn how to use cross-validation to get robust estimates of model performance. You will use small built-in datasets from scikit-learn and basic models.

**Instructions:**
- Fill in the code cells marked with 'To complete'.
- Use only scikit-learn and pandas.
- Try to understand what cross validation is and why it is useful.

## 1. Load the dataset
We will use the `wine` dataset from scikit-learn. This is a small classification dataset with 178 samples, 13 numeric features, and three classes.

In [53]:
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.dummy import DummyClassifier

In [54]:
wine = load_wine(as_frame=True)
data = wine.data
target = wine.target
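
Before exploring the features, it can help to check how many samples each class has, since the class balance determines what a trivial baseline can score. A minimal sketch, reusing the same `load_wine` call as above (the variable names `class_counts` and `majority_fraction` are chosen here for illustration):

```python
from sklearn.datasets import load_wine

wine = load_wine(as_frame=True)
target = wine.target

# Number of samples per class, sorted by class label
class_counts = target.value_counts().sort_index()
print(class_counts)

# Fraction of the most frequent class: the accuracy that an
# "always predict the majority class" baseline reaches on the full dataset
majority_fraction = class_counts.max() / class_counts.sum()
print(f"Majority class fraction: {majority_fraction:.2f}")
```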

## 2. Explore the data
Look at the first few rows and basic statistics.

In [55]:
data.head()

| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 |


In [56]:
data.describe()

| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 |
| mean | 13.000618 | 2.336348 | 2.366517 | 19.494944 | 99.741573 | 2.295112 | 2.02927 | 0.361854 | 1.590899 | 5.05809 | 0.957449 | 2.611685 | 746.893258 |
| std | 0.811827 | 1.117146 | 0.274344 | 3.339564 | 14.282484 | 0.625851 | 0.998859 | 0.124453 | 0.572359 | 2.318286 | 0.228572 | 0.70999 | 314.907474 |
| min | 11.03 | 0.74 | 1.36 | 10.6 | 70.0 | 0.98 | 0.34 | 0.13 | 0.41 | 1.28 | 0.48 | 1.27 | 278.0 |
| 25% | 12.3625 | 1.6025 | 2.21 | 17.2 | 88.0 | 1.7425 | 1.205 | 0.27 | 1.25 | 3.22 | 0.7825 | 1.9375 | 500.5 |
| 50% | 13.05 | 1.865 | 2.36 | 19.5 | 98.0 | 2.355 | 2.135 | 0.34 | 1.555 | 4.69 | 0.965 | 2.78 | 673.5 |
| 75% | 13.6775 | 3.0825 | 2.5575 | 21.5 | 107.0 | 2.8 | 2.875 | 0.4375 | 1.95 | 6.2 | 1.12 | 3.17 | 985.0 |
| max | 14.83 | 5.8 | 3.23 | 30.0 | 162.0 | 3.88 | 5.08 | 0.66 | 3.58 | 13.0 | 1.71 | 4.0 | 1680.0 |


## 3. Train/test split
Before using cross validation, let's see what happens if we just split the data once.

In [57]:
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.3, random_state=0)

In [58]:
clf = DummyClassifier(strategy='most_frequent')
clf.fit(X_train, y_train)
test_score = clf.score(X_test, y_test)
print(f'Accuracy on test set: {test_score:.2f}')

Accuracy on test set: 0.41
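
A single split depends on which samples happen to land in the test set. The sketch below repeats the same split with a few different `random_state` values and prints the test accuracy of the same baseline each time. The seeds 0 to 4 and the names `split_scores` and `baseline` are arbitrary choices for illustration, not part of the lab.

```python
from sklearn.datasets import load_wine
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

wine = load_wine(as_frame=True)
data, target = wine.data, wine.target

split_scores = []
for seed in range(5):  # arbitrary seeds, for illustration only
    X_tr, X_te, y_tr, y_te = train_test_split(
        data, target, test_size=0.3, random_state=seed
    )
    baseline = DummyClassifier(strategy="most_frequent")
    baseline.fit(X_tr, y_tr)
    split_scores.append(float(baseline.score(X_te, y_te)))
    print(f"random_state={seed}: accuracy = {split_scores[-1]:.2f}")

# The gap between the best and worst split shows how much a
# single train/test split can move the reported score
print(f"Spread: {min(split_scores):.2f} to {max(split_scores):.2f}")
```

If the printed accuracies differ from seed to seed, that difference comes from the split alone, since the model never changes.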


## 4. Cross validation
Now let's use cross validation to get a more robust estimate of model performance.

In [59]:
scores = cross_val_score(clf, data, target, cv=5)
print('Cross-validation scores:', scores)
print('Mean cross-validation score:', scores.mean())

Cross-validation scores: [0.38888889 0.38888889 0.38888889 0.4        0.42857143]
Mean cross-validation score: 0.3990476190476191
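
To see what `cross_val_score` does internally, the loop can be written out by hand. The sketch below assumes the documented default: for a classifier, an integer `cv` is turned into a `StratifiedKFold` without shuffling. The names `manual_scores` and `auto_scores` are illustrative.

```python
from sklearn.base import clone
from sklearn.datasets import load_wine
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

wine = load_wine(as_frame=True)
data, target = wine.data, wine.target
clf = DummyClassifier(strategy="most_frequent")

# For a classifier, cv=5 is equivalent to StratifiedKFold(n_splits=5)
cv = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, test_idx in cv.split(data, target):
    model = clone(clf)  # fresh, unfitted copy for each fold
    model.fit(data.iloc[train_idx], target.iloc[train_idx])
    fold_score = model.score(data.iloc[test_idx], target.iloc[test_idx])
    manual_scores.append(float(fold_score))

auto_scores = cross_val_score(clf, data, target, cv=5)
print("Manual loop:    ", [round(s, 4) for s in manual_scores])
print("cross_val_score:", [round(float(s), 4) for s in auto_scores])
```

Each sample is used for testing exactly once and for training in the other four folds, which is why the five scores together cover the whole dataset.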


## 5. Discussion
- Why is cross validation better than a single train/test split?
- What do you observe about the scores?

**Sample answer:**
- Cross validation gives a more reliable estimate of model performance because it averages over several splits, reducing the risk of a lucky or unlucky split.
- The scores here are nearly identical (0.39 to 0.43) because the DummyClassifier always predicts the most frequent class, so each fold's score is simply the fraction of that class in the fold. With a real model, you would see more variation.
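
The baseline scores from Section 4 can be checked without fitting any model: for a classifier that always predicts the majority class, the accuracy on a fold is the fraction of that class among the fold's test samples. A minimal sketch, assuming the same stratified 5-fold split that `cv=5` resolves to for a classifier (`fold_fractions` is an illustrative name):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold

wine = load_wine(as_frame=True)
data, target = wine.data, wine.target

# Label of the most frequent class in the full dataset
majority_class = target.value_counts().idxmax()

cv = StratifiedKFold(n_splits=5)
fold_fractions = []
for _, test_idx in cv.split(data, target):
    y_test_fold = target.iloc[test_idx]
    # Share of test samples that belong to the majority class
    fold_fractions.append(float((y_test_fold == majority_class).mean()))
    print(f"fold size = {len(y_test_fold)}, "
          f"majority fraction = {fold_fractions[-1]:.4f}")
```

The small differences between folds should come only from rounding: 178 samples cannot be divided into five folds of equal size and identical class counts.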

## 6. Logistic Regression to the rescue

Repeat the experiment with a LogisticRegression classifier.

In [60]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(max_iter=10000, random_state=0)
clf.fit(X_train, y_train)
test_score = clf.score(X_test, y_test)
print(f'Accuracy on test set: {test_score:.2f}')

Accuracy on test set: 0.98


In [61]:
scores = cross_val_score(clf, data, target, cv=5)
print('Cross-validation scores:', scores)
print('Mean cross-validation score:', scores.mean())

Cross-validation scores: [0.97222222 0.91666667 0.91666667 1.         1.        ]
Mean cross-validation score: 0.961111111111111
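
When comparing models it is often useful to report the spread of the fold scores and to look at training scores as well. One way to do this is `cross_validate`, which returns a dictionary of per-fold results. The sketch below is one possible way to summarize them, not a required part of the lab.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

wine = load_wine(as_frame=True)
data, target = wine.data, wine.target

clf = LogisticRegression(max_iter=10000, random_state=0)

# return_train_score=True also records the accuracy on each training fold
cv_results = cross_validate(clf, data, target, cv=5, return_train_score=True)

test_scores = cv_results["test_score"]
train_scores = cv_results["train_score"]
print(f"Test accuracy:  {test_scores.mean():.3f} +/- {test_scores.std():.3f}")
print(f"Train accuracy: {train_scores.mean():.3f} +/- {train_scores.std():.3f}")
```

A training accuracy far above the test accuracy would suggest overfitting; similar values suggest the model generalizes about as well as it fits.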


## 7. Stratified Cross-Validation

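In classification problems, especially with imbalanced datasets (where some classes have far fewer samples than others), plain K-Fold cross-validation can produce folds in which some classes are underrepresented or missing entirely. The resulting scores are then noisier and may not reflect performance on the true class distribution.

Stratified K-Fold cross-validation addresses this by ensuring that each fold has approximately the same percentage of samples of each target class as the complete set. Note that when `cv` is an integer and the estimator is a classifier, `cross_val_score` already uses `StratifiedKFold` (without shuffling), so the earlier sections were stratified implicitly. To compare the two strategies explicitly, we use the `breast_cancer` dataset, which has a binary target with a mild imbalance (about 63% / 37%).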

In [62]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

breast_cancer = load_breast_cancer(as_frame=True)
data_bc = breast_cancer.data
target_bc = breast_cancer.target

# Check target distribution
print("Original target distribution:")
print(target_bc.value_counts(normalize=True))

# Define a simple classifier
clf_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000, random_state=0))

print("\nScores with KFold (non-stratified):")
cv_kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores_kfold = cross_val_score(clf_lr, data_bc, target_bc, cv=cv_kfold)
print(scores_kfold)
print(f"Mean accuracy: {scores_kfold.mean():.3f} +/- {scores_kfold.std():.3f}")

print("\nScores with StratifiedKFold:")
cv_stratified = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores_stratified = cross_val_score(clf_lr, data_bc, target_bc, cv=cv_stratified)
print(scores_stratified)
print(f"Mean accuracy: {scores_stratified.mean():.3f} +/- {scores_stratified.std():.3f}")

Original target distribution:
target
1    0.627417
0    0.372583
Name: proportion, dtype: float64

Scores with KFold (non-stratified):
[0.96491228 0.98245614 0.95614035 0.95614035 1.        ]
Mean accuracy: 0.972 +/- 0.017

Scores with StratifiedKFold:
[0.95614035 0.97368421 0.98245614 1.         0.98230088]
Mean accuracy: 0.979 +/- 0.014
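
The difference between the two splitters is easier to see in the folds themselves than in the scores. The sketch below prints the fraction of class 1 in each test fold for both strategies. The helper `positive_fraction_per_fold` is hypothetical, written here only for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, StratifiedKFold

breast_cancer = load_breast_cancer(as_frame=True)
data_bc, target_bc = breast_cancer.data, breast_cancer.target


def positive_fraction_per_fold(cv):
    """Return the fraction of class 1 in each test fold produced by `cv`."""
    return [
        float(target_bc.iloc[test_idx].mean())
        for _, test_idx in cv.split(data_bc, target_bc)
    ]


kfold_fracs = positive_fraction_per_fold(
    KFold(n_splits=5, shuffle=True, random_state=0)
)
strat_fracs = positive_fraction_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
)
print("KFold:          ", [f"{f:.3f}" for f in kfold_fracs])
print("StratifiedKFold:", [f"{f:.3f}" for f in strat_fracs])
```

With stratification, every fold stays close to the overall proportion of about 0.627. Because this dataset is only mildly imbalanced, plain K-Fold is unlikely to drift far from it either, which is consistent with the two mean accuracies above being so close.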


## 8. Effect of Preprocessing Steps (Scaling)

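Many machine learning algorithms perform better when numerical features are on comparable scales. This is especially true for algorithms based on distances (like SVMs and K-Nearest Neighbors) and for models fitted with iterative, gradient-based solvers (like Logistic Regression and Neural Networks), which tend to converge faster on scaled data. Regularized models are also affected, because the penalty treats all coefficients alike regardless of feature scale.

Let's see the effect of `StandardScaler`, which rescales each feature to zero mean and unit variance, on a `LogisticRegression` model using the `wine` dataset. Putting the scaler in a pipeline ensures it is fitted on the training folds only, so no information from the test fold leaks into preprocessing.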

In [63]:
# Model without scaling
clf_no_scale = LogisticRegression(max_iter=10000, random_state=0)
scores_no_scale = cross_val_score(clf_no_scale, data, target, cv=5)
print("Scores without scaling:", scores_no_scale)
print(f"Mean accuracy without scaling: {scores_no_scale.mean():.3f} +/- {scores_no_scale.std():.3f}")

# Model with scaling using a pipeline
clf_with_scale = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000, random_state=0))
scores_with_scale = cross_val_score(clf_with_scale, data, target, cv=5)
print("\nScores with scaling:", scores_with_scale)
print(f"Mean accuracy with scaling: {scores_with_scale.mean():.3f} +/- {scores_with_scale.std():.3f}")

Scores without scaling: [0.97222222 0.91666667 0.91666667 1.         1.        ]
Mean accuracy without scaling: 0.961 +/- 0.038

Scores with scaling: [0.97222222 0.97222222 1.         0.97142857 1.        ]
Mean accuracy with scaling: 0.983 +/- 0.014
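
To see why scaling matters for this dataset, it helps to compare the spread of a few features before and after standardization. This is a minimal sketch; the three columns shown are picked only as examples of small, medium, and large raw scales, and here the scaler is fitted on the full dataset purely for inspection, not for model evaluation.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

data = load_wine(as_frame=True).data
cols = ["hue", "alcohol", "proline"]  # example columns, chosen for illustration

# Raw standard deviations differ by orders of magnitude
raw_std = data.std()
print(raw_std[cols].round(3))

# StandardScaler returns a NumPy array, so wrap it back into a DataFrame
scaled = pd.DataFrame(StandardScaler().fit_transform(data), columns=data.columns)

# After scaling, every feature has mean 0 and (population) std 1
print(scaled[cols].mean().round(2))
print(scaled[cols].std(ddof=0).round(2))
```

In the raw data, `proline` varies on a scale of hundreds while `hue` varies on a scale of tenths, so an unscaled linear model has to compensate with very different coefficient magnitudes.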
