# Lab 3: Validation Curves and Learning Curves

In this lab, you will learn how to use **validation curves** and **learning curves** to:
- Choose the best hyperparameters for a model
- Determine whether adding more training samples would improve model performance
- Diagnose overfitting and underfitting

**Objectives:**
- Understand the difference between validation curves and learning curves
- Use validation curves to optimize hyperparameters
- Use learning curves to assess data needs
- Interpret curves to make informed decisions about model complexity and data collection

## 1. Import Libraries and Load Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import (
    ValidationCurveDisplay, 
    LearningCurveDisplay,
    StratifiedKFold
)
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Set random seed for reproducibility
np.random.seed(42)

We will use the **digits dataset** from scikit-learn, which contains 8x8 images of handwritten digits (0-9). This is a classic machine learning dataset with 1,797 samples and 64 features.

In [None]:
# Load the digits dataset
digits = load_digits()
X = digits.data
y = digits.target

print(f"Dataset shape: {X.shape}")
print(f"Number of classes: {len(np.unique(y))}")
print(f"Classes: {np.unique(y)}")

## 2. Understanding Validation Curves

A **validation curve** shows how a model's performance varies as we change a single hyperparameter.

The curve plots:
- **Training score** (blue): Performance on the training data
- **Validation score** (orange): Performance on the held-out validation folds (via cross-validation); scikit-learn's default plot legend labels this curve "Test"

This helps us:
1. **Identify underfitting**: When both training and validation scores are low and similar
2. **Identify overfitting**: When training score is high but validation score is low (large gap)
3. **Find the sweet spot**: Where the validation score is highest (best generalization)
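
The three readings above can also be made from the raw numbers. The following is a minimal sketch, assuming a plain `DecisionTreeClassifier` and a short list of `max_depth` values chosen only for illustration (neither is part of the exercises). It uses the `validation_curve` function, which returns the per-fold scores instead of drawing them.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

# Illustrative hyperparameter grid (an assumption, not a recommended range)
depths = np.array([1, 2, 4, 8, 12])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Each returned array has shape (n_param_values, n_cv_folds)
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=42),
    X,
    y,
    param_name="max_depth",
    param_range=depths,
    cv=cv,
    scoring="accuracy",
)

# Average over folds to get one point per hyperparameter value
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

for d, tr, va in zip(depths, train_mean, val_mean):
    print(f"max_depth={d:>2}  train={tr:.3f}  validation={va:.3f}  gap={tr - va:.3f}")

# The "sweet spot" is the value with the highest mean validation score
best_depth = depths[np.argmax(val_mean)]
print(f"Best max_depth by mean validation accuracy: {best_depth}")
```

Low scores with a small gap point to underfitting, a large gap points to overfitting, and the row with the highest validation score is the candidate to keep.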

### 2.1 Validation Curve for k-Nearest Neighbors (k parameter)

In KNN, the `n_neighbors` parameter (k) controls model complexity:
- **Small k** (e.g., k=1): Model is very flexible, can overfit
- **Large k** (e.g., k=50): Model is simple, might underfit
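
To see the two extremes side by side before plotting anything, the sketch below fits KNN with k=1 and k=50 and prints training and held-out accuracy. It assumes a single stratified train/test split for brevity, which is a simplification of the cross-validation used in the rest of the lab.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

# One stratified hold-out split (an assumption; the lab itself uses 5-fold CV)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

results = {}
for k in (1, 50):
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    knn.fit(X_train, y_train)
    # Store (training accuracy, held-out accuracy) for this k
    results[k] = (knn.score(X_train, y_train), knn.score(X_test, y_test))
    print(f"k={k:>2}  train={results[k][0]:.3f}  test={results[k][1]:.3f}")
```

Comparing the two rows shows how the training score and the train/test gap change as k grows.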

In [None]:
# To complete: Create a KNN pipeline with StandardScaler and KNeighborsClassifier
# Use make_pipeline()

In [None]:
# To complete: Create a validation curve for the KNN pipeline
# - Define k_range using np.arange(1, 31, 1)
# - Use ValidationCurveDisplay.from_estimator() with:
#   - param_name="kneighborsclassifier__n_neighbors"
#   - param_range=k_range
#   - cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
#   - scoring="accuracy"
#   - n_jobs=-1
# - Set appropriate title and labels
# - Display the plot with plt.show()
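
The `param_name` used above follows the naming scheme of `make_pipeline`, which names each step after its lowercased class name and exposes nested parameters as `<step>__<parameter>`. A short sketch for listing the valid names of a pipeline:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Step names generated automatically by make_pipeline
step_names = [name for name, _ in pipe.steps]
print(step_names)

# Nested parameters use the "<step>__<parameter>" form
tunable = sorted(p for p in pipe.get_params() if "__" in p)
print(tunable)
```

Any name in the second list can be passed as `param_name`.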

**Question 1:** Looking at the validation curve:
- What happens with small k values (1-5)?
- What is the optimal k range?
- What happens with large k values (>15)?

*Write your answer here:*

### 2.2 Validation Curve for Random Forest (max_depth parameter)

In Random Forest, the `max_depth` parameter controls tree depth:
- **Small max_depth**: Trees are shallow and simple (underfitting risk)
- **Large max_depth**: Trees are deep and complex (overfitting risk)
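
As a quick check of what `max_depth` actually constrains, the sketch below fits two small forests and reports the deepest tree in each along with the training accuracy. The forest size of 20 trees and the two depth settings are arbitrary illustration values, not the settings used in the exercise.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)

depth_summary = {}
for max_depth in (2, None):  # None lets each tree grow until its leaves are pure
    forest = RandomForestClassifier(
        n_estimators=20, max_depth=max_depth, random_state=42
    )
    forest.fit(X, y)
    # Depth actually reached by each fitted tree in the ensemble
    tree_depths = [tree.get_depth() for tree in forest.estimators_]
    depth_summary[max_depth] = (max(tree_depths), forest.score(X, y))
    print(
        f"max_depth={max_depth}: deepest tree={max(tree_depths)}, "
        f"training accuracy={forest.score(X, y):.3f}"
    )
```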

In [None]:
# To complete: Create a Random Forest classifier
# Use: RandomForestClassifier with n_estimators=100, min_samples_split=10, 
#      min_samples_leaf=4, random_state=42
# Note: min_samples_split and min_samples_leaf limit how finely each tree can split, which helps reduce overfitting

In [None]:
# To complete: Create a validation curve for Random Forest
# - Define depth_range using np.arange(1, 21, 1)
# - Use ValidationCurveDisplay.from_estimator() with param_name="max_depth"
# - Use the same cv and scoring as before
# - Set appropriate title and labels

**Question 2:** Looking at the validation curve:
- What is the effect of small max_depth values?
- What is the optimal max_depth range?
- Why don't we see perfect 1.0 training accuracy?

*Write your answer here:*

### 2.3 Validation Curve for SVM (gamma parameter)

In an SVM with an RBF kernel, the `gamma` parameter controls how far the influence of a single training point reaches:
- **Small gamma**: Each point influences a wide region, giving a smooth decision boundary (simple model)
- **Large gamma**: Each point influences only its immediate neighborhood, giving a complex, wiggly decision boundary (complex model)
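
The effect of `gamma` can be read directly from the RBF kernel, which scores the similarity of two points as exp(-gamma * ||x - x'||^2). The sketch below evaluates it for one fixed pair of points and three gamma values; the points and the gamma values are chosen only for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two illustrative points whose squared Euclidean distance is 2
a = np.array([[0.0, 0.0]])
b = np.array([[1.0, 1.0]])

gammas = [0.001, 0.1, 10.0]
# rbf_kernel returns a (1, 1) matrix here; take its single entry
similarities = [rbf_kernel(a, b, gamma=g)[0, 0] for g in gammas]

for g, s in zip(gammas, similarities):
    print(f"gamma={g:<6}  kernel similarity={s:.6f}")
```

The larger the gamma, the faster the similarity falls toward zero, so each training point affects only points very close to it.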

In [None]:
# To complete: Create an SVM pipeline with StandardScaler and SVC
# Use SVC with kernel="rbf" and random_state=42

In [None]:
# To complete: Create a validation curve for SVM
# - Define gamma_range using np.logspace(-3, 1, num=20) for logarithmic scale
# - Use ValidationCurveDisplay.from_estimator() with param_name="svc__gamma"
# - Use the same cv and scoring as before
# - Set appropriate title (mention logarithmic scale) and labels

**Question 3:** Looking at the validation curve:
- What is the effect of small gamma values?
- What is the optimal gamma range?
- What happens with large gamma values?

*Write your answer here:*

## 3. Understanding Learning Curves

A **learning curve** shows how a model's performance changes as we increase the amount of training data.

The curve plots:
- **Training score**: Performance on the training data (typically high, and often drops slightly as the training set grows)
- **Validation score**: Performance on validation folds (typically improves with more data)

Different patterns tell us:
1. **Converging curves (small gap) at a low score**: Model has high bias; adding more data won't help much
2. **Persistent large gap**: Model has high variance; adding more data is likely to help
3. **Both curves plateau early at a low score**: Model is likely underfitting (a more complex model is needed)
4. **Validation score still rising at the largest training size**: Collecting more data would likely help
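
These patterns can be checked numerically as well as visually. The following is a minimal sketch, assuming an unpruned `DecisionTreeClassifier` and five training sizes picked only for illustration. It uses the `learning_curve` function, which returns the absolute training sizes and the per-fold scores.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# sizes: absolute number of training samples used at each step
# train_scores / val_scores: shape (n_train_sizes, n_cv_folds)
sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=42),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=cv,
    scoring="accuracy",
)

train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
gap = train_mean - val_mean

for n, tr, va, g in zip(sizes, train_mean, val_mean, gap):
    print(f"n_train={n:>4}  train={tr:.3f}  validation={va:.3f}  gap={g:.3f}")
```

Reading down the rows shows whether the gap shrinks as data is added and whether the validation score is still increasing at the largest size.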

### 3.1 Learning Curve for KNN (with optimal k from validation curve)

In [None]:
# To complete: Create a KNN pipeline with optimal k=5 (from validation curve)
# Use make_pipeline with StandardScaler and KNeighborsClassifier(n_neighbors=5)

In [None]:
# To complete: Create a learning curve for KNN
# - Use LearningCurveDisplay.from_estimator()
# - Use cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# - Use scoring="accuracy"
# - Use train_sizes=np.linspace(0.1, 1.0, 10)
# - Use n_jobs=-1
# - Set appropriate title and labels

**Question 4:** Looking at the learning curve:
- What happens to the training score as we add more data?
- What happens to the validation score as we add more data?
- Would collecting more data help this model?

*Write your answer here:*

### 3.2 Learning Curve for Random Forest (with optimal depth from validation curve)

In [None]:
# To complete: Create a Random Forest with optimal parameters
# Use: RandomForestClassifier with n_estimators=100, max_depth=8,
#      min_samples_split=10, min_samples_leaf=4, random_state=42

In [None]:
# To complete: Create a learning curve for Random Forest
# Use the same parameters as for KNN above

**Question 5:** Compare the gap between training and validation scores.
- Is the model well-tuned?
- Would more data help?

*Write your answer here:*

### 3.3 Learning Curve for SVM (with optimal gamma from validation curve)

In [None]:
# To complete: Create an SVM pipeline with optimal gamma=0.01
# Use make_pipeline with StandardScaler and SVC(kernel="rbf", gamma=0.01, random_state=42)

In [None]:
# To complete: Create a learning curve for SVM
# Use the same parameters as before

**Question 6:** Observe the validation score at the right end of the learning curve.
- Is it still rising or has it plateaued?
- What does this tell us about data collection?

*Write your answer here:*

## 4. Practical Decision-Making Guide

### How to use these curves to make decisions:

#### **Step 1: Use Validation Curves to Find Optimal Parameters**
- Generate validation curves for key hyperparameters
- Choose the parameter value where validation score is highest
- Watch for signs of overfitting (large gap) or underfitting (both scores low)

#### **Step 2: Use Learning Curves to Assess Data Needs**
- Generate a learning curve with the optimal parameters
- Look at the validation score trend:
  - **Still rising at the end?** → Collect more data
  - **Plateau?** → Either stop or use a more complex model
  - **Gap closing as data increases?** → Good sign of improving generalization
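
The checks in Step 2 can be turned into a small helper. The sketch below is hypothetical: the function name `summarize_learning_curve` and the two tolerances are assumptions made for illustration, not part of scikit-learn and not recommended thresholds. It expects score arrays shaped like those returned by `learning_curve`.

```python
import numpy as np


def summarize_learning_curve(train_scores, val_scores, rise_tol=0.01, gap_tol=0.05):
    """Hypothetical helper: summarize the right end of a learning curve.

    Both inputs have shape (n_train_sizes, n_cv_folds). The tolerances are
    arbitrary illustration values, not recommendations.
    """
    train_mean = np.asarray(train_scores, dtype=float).mean(axis=1)
    val_mean = np.asarray(val_scores, dtype=float).mean(axis=1)
    if len(val_mean) < 3:
        raise ValueError("need at least three training sizes")

    # Gap between the curves at the largest training size
    final_gap = train_mean[-1] - val_mean[-1]
    # Validation gain over the last two steps of the curve
    recent_gain = val_mean[-1] - val_mean[-3]

    return {
        "final_val": float(val_mean[-1]),
        "final_gap": float(final_gap),
        "recent_gain": float(recent_gain),
        "still_rising": bool(recent_gain > rise_tol),
        "large_gap": bool(final_gap > gap_tol),
    }


# Made-up scores (4 training sizes x 2 folds), used only to exercise the helper
train_demo = [[1.00, 1.00], [0.99, 0.99], [0.98, 0.98], [0.98, 0.98]]
val_demo = [[0.70, 0.72], [0.80, 0.82], [0.86, 0.88], [0.90, 0.92]]

summary = summarize_learning_curve(train_demo, val_demo)
print(summary)
```

A result with `still_rising` set suggests collecting more data; a flat curve with a small gap suggests trying a more complex model instead.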

#### **Step 3: Interpret Different Patterns**

### 4.1 Example: Diagnosing a Poorly-Tuned Model (Underfitting)

In [None]:
# To complete: Create a KNN pipeline with k=30 (too high, will underfit)
# Then create a learning curve to see the underfitting pattern
# What do you observe about both training and validation scores?

**Question 7:** Diagnosis of this model:
- Are both training and validation scores high or low?
- What is the gap between them?
- What remedies would you suggest?

*Write your answer here:*

### 4.2 Example: Diagnosing an Overfitting Model (High Variance)

In [None]:
# To complete: Create a KNN pipeline with k=1 (too low, will overfit)
# Then create a learning curve to see the overfitting pattern
# What do you observe about the gap between training and validation?

**Question 8:** Diagnosis of this model:
- What is the training score?
- What is the validation score?
- What is the gap between them?
- What remedies would you suggest?

*Write your answer here:*

### 4.3 Example: Well-Tuned Model

In [None]:
# To complete: Create a KNN pipeline with k=7 (well-balanced)
# Then create a learning curve
# How does this compare to the overfitting and underfitting examples?

**Question 9:** Characteristics of a well-tuned model:
- What do you observe about the scores?
- What is the gap between curves?
- What would be your next steps?

*Write your answer here:*

## 5. Practical Exercise: Complete Analysis Workflow

### 5.1 Compare multiple models using learning curves

In [None]:
models = {
    'KNN (k=5)': make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    'Random Forest': RandomForestClassifier(
        n_estimators=100, 
        max_depth=10, 
        min_samples_split=10,
        min_samples_leaf=4,
        random_state=42
    ),
    'SVM': make_pipeline(StandardScaler(), SVC(kernel='rbf', gamma=0.01, random_state=42))
}

In [None]:
# To complete: Create a figure with 3 subplots (1 row, 3 columns)
# For each model, create a learning curve in its subplot
# Hint: Use plt.subplots(1, 3, figsize=(16, 5)) and iterate over models
# Use enumerate() to get the index for the subplot

**Question 10:** Compare the three models:
- Which model has the best final validation score?
- Which model has the smallest gap (best bias-variance tradeoff)?
- Which model would you choose for production and why?

*Write your answer here:*

## 6. Summary and Key Takeaways

### Validation Curves:
- **Purpose**: Find optimal hyperparameter values
- **How to read**: Look for the peak in the validation score (orange line)
- **Patterns**:
  - Large gap: overfitting (high variance)
  - Both scores low: underfitting (high bias)
  - Curves close together at a high score: well-tuned

### Learning Curves:
- **Purpose**: Determine if more data would help
- **How to read**: Check whether the validation score is still climbing
- **Patterns**:
  - Both curves plateau: more data is unlikely to help; focus on model complexity
  - Validation still rising: more data would help
  - Large gap: model has high variance; more data usually helps

### Decision Workflow:
1. **Start with validation curve**: Find best hyperparameters
2. **Create learning curve**: Check if you need more data
3. **Interpret patterns**: Decide: more data? new model? more features?
4. **Iterate**: Adjust model and retest

### Key Metrics to Track:
- Final validation score (absolute performance)
- Gap between training and validation (overfitting indicator)
- Slope of the validation score on the learning curve (data efficiency)
- Whether curves have plateaued (saturation point)