# Lab 2: Cross Validation with scikit-learn

In this session, you will learn how to use cross validation to estimate model performance more reliably than with a single train/test split. You will work with small datasets bundled with scikit-learn and basic models.

**Instructions:**
- Fill in the code cells marked with 'To complete'.
- Use only scikit-learn and pandas.
- Try to understand what cross validation is and why it is useful.

## 1. Load the dataset
We will use the `wine` dataset from scikit-learn, a small classification dataset with 178 samples, 13 numeric features, and 3 classes.

In [None]:
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

In [None]:
# To complete: Load the wine dataset as a pandas DataFrame
# Assign the data to a variable 'data' and the target to 'target'

## 2. Explore the data
Look at the first few rows and basic statistics.

In [None]:
# To complete: Display the first 5 rows of the data

In [None]:
# To complete: Display summary statistics for the features


## 3. Train/test split
Before using cross validation, let's see what happens if we just split the data once.
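
To see why a single split can be fragile, the sketch below repeats the split with several seeds and records the test accuracy each time. It uses synthetic data from `make_classification` as a stand-in (an assumption, not the lab's `wine` data), and the seeds and `max_iter` value are illustrative choices. The scores may differ from one seed to the next, which is the variability that cross validation averages over.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration only)
X, y = make_classification(n_samples=150, n_features=5, random_state=0)

scores = []
for seed in range(5):
    # Same data and same model; only the random split changes
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

# One accuracy per seed; compare them to judge how much a single split can vary
print(scores)
```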

In [None]:
# To complete: Split the data into train and test sets (test_size=0.3, random_state=0)
# Use train_test_split from sklearn.model_selection

In [None]:
# To complete: Create a DummyClassifier with strategy='most_frequent' (imported above)
# Fit it on the training data and print the accuracy on the test set

## 4. Cross validation
Now let's use cross validation to get a more robust estimate of model performance.
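
As a rough picture of what happens under the hood, the sketch below writes the 5-fold loop by hand and compares it with a single `cross_val_score` call. It runs on synthetic data rather than `wine` (an assumption made so the exercise below stays open), and the explicit `StratifiedKFold` object is used only so that both approaches share the same folds.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (an assumption, not the lab's wine dataset)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

model = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5)

# Manual loop: fit on four folds, score on the held-out fold, repeat five times
manual_scores = []
for train_idx, test_idx in cv.split(X, y):
    model.fit(X[train_idx], y[train_idx])
    manual_scores.append(model.score(X[test_idx], y[test_idx]))

# cross_val_score performs the same procedure in one call
auto_scores = cross_val_score(model, X, y, cv=cv)

print("manual:", manual_scores)
print("auto:  ", list(auto_scores))
print("mean:  ", auto_scores.mean())
```

Each sample is used for testing exactly once, so the mean summarizes performance over the whole dataset instead of one arbitrary test set.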

In [None]:
# To complete: Use cross_val_score with 5-fold cross validation
# Print the individual scores and their mean

## 5. Discussion
- Why is cross validation better than a single train/test split?
- What do you observe about the scores?

*Write your answers below.*

## 6. Logistic Regression to the rescue

Repeat the experiment with a LogisticRegression classifier.

In [None]:
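# To complete: Fit a LogisticRegression on the training data and print the accuracy on the test set
# (a ConvergenceWarning may appear on unscaled features; Section 8 revisits scaling)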


In [None]:
# To complete: Use cross_val_score with 5-fold cross validation for Logistic Regression


## 7. Stratified Cross-Validation

In classification problems, especially with imbalanced datasets (where some classes have far fewer samples than others), plain K-Fold cross-validation can create folds in which some classes are underrepresented or entirely missing. The scores on such folds no longer reflect performance on the full class distribution.

Stratified K-Fold cross-validation addresses this by ensuring that each fold has approximately the same percentage of samples of each target class as the complete set. Note that `cross_val_score` already uses `StratifiedKFold` when `cv` is an integer and the estimator is a classifier, so pass an explicit `KFold` object to see the non-stratified behavior. Let's demonstrate this using the `breast_cancer` dataset, which has a binary target.
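
The difference is easiest to see on a deliberately extreme toy target. The sketch below builds a hypothetical label vector (90 samples of class 0 followed by 10 of class 1, an assumption chosen for illustration) and reports the share of the minority class in each test fold. The helper `minority_share_per_fold` is a name introduced here, not part of scikit-learn.

```python
import pandas as pd
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced, sorted target: 90 zeros followed by 10 ones
y = pd.Series([0] * 90 + [1] * 10)
X = pd.DataFrame({"feature": range(100)})  # placeholder feature, unused by the splitters

def minority_share_per_fold(cv):
    """Return the fraction of class-1 samples in each test fold."""
    return [float(y.iloc[test_idx].mean()) for _, test_idx in cv.split(X, y)]

# Without shuffling, KFold cuts the data into consecutive blocks
kfold_shares = minority_share_per_fold(KFold(n_splits=5))
# StratifiedKFold keeps the 10% minority share in every fold
strat_shares = minority_share_per_fold(StratifiedKFold(n_splits=5))

print("KFold:          ", kfold_shares)
print("StratifiedKFold:", strat_shares)
```

With plain `KFold`, four of the five test folds contain no minority samples at all and the last one is half minority, whereas every stratified fold keeps the original 10% share.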

In [None]:
# To complete: compare KFold with StratifiedKFold
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

breast_cancer = load_breast_cancer(as_frame=True)
data_bc = breast_cancer.data
target_bc = breast_cancer.target

# Check target distribution
print("Original target distribution:")
print(target_bc.value_counts(normalize=True))

# Define a simple classifier

print("\nScores with KFold (non-stratified):")

print("\nScores with StratifiedKFold:")


## 8. Effect of Preprocessing Steps (Scaling)

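Many machine learning algorithms perform better when numerical features are on comparable scales. This is especially true for algorithms that rely on distance calculations (like SVMs and K-Nearest Neighbors) or on iterative, gradient-based optimization (like Logistic Regression and Neural Networks), where unscaled features can slow down or prevent convergence. `StandardScaler` standardizes each feature to zero mean and unit variance.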

Let's see the effect of `StandardScaler` on a `LogisticRegression` model using the `wine` dataset.
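
Before running the comparison, it can help to see what the scaler does to the data. The sketch below uses a hypothetical two-column toy frame (the column names and values are assumptions for illustration) to show that each feature ends up with zero mean and unit variance, and that `make_pipeline` chains the scaler and the classifier into one estimator.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales
X = pd.DataFrame({
    "small": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "large": [1000.0, 2000.0, 1500.0, 4000.0, 5000.0, 6000.0],
})
y = pd.Series([0, 0, 0, 1, 1, 1])

# After standardization each column has mean 0 and (population) std 1
scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))

# A pipeline applies the scaler and then the classifier as a single estimator
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X, y)
print(list(pipeline.named_steps))
```

Passing the whole pipeline to `cross_val_score` means the scaler is refit on the training folds of each split, so no statistics from the held-out fold leak into preprocessing.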

In [None]:
# To complete: Compare Logistic Regression performance with and without StandardScaler
# Model without scaling

# Model with scaling using a pipeline
