# Day 2: Regression Model Evaluation (2.2)

## Table of Contents
1. [Metrics: MSE, MAE, R-squared (2.2.1)](#evaluation-metrics)
   - Why We Need Evaluation Metrics
   - Mean Squared Error (MSE)
   - Root Mean Squared Error (RMSE)
   - Mean Absolute Error (MAE)
   - Comparing MSE, RMSE, and MAE
   - R-squared (Coefficient of Determination)
   - Adjusted R-squared
   - Contextual Interpretation of Metrics
2. [Detecting Overfitting & Underfitting (2.2.2)](#overfitting-underfitting)
   - The Goal: Generalization
   - Understanding the Bias-Variance Tradeoff
   - Underfitting
   - Overfitting
   - Good Fit
   - How to Detect Problems
   - Visualizing Underfitting and Overfitting
   - Strategies to Address Overfitting and Underfitting
3. [Interactive Learning: Evaluating Models](#interactive-learning)
   - Metrics Calculation Exercise
   - Metric Choice Scenarios
   - Detecting Over/Underfitting
4. [Practice Questions](#practice-questions)

<a id="evaluation-metrics"></a>

## 2.2.1 Metrics: MSE, MAE, R-squared

### Why We Need Evaluation Metrics

Building a machine learning model is only half the battle. We need ways to quantitatively measure how well our model is performing. Evaluation metrics serve several critical purposes:

1. **Assess Model Performance**: Understand how well our model fits the data and makes predictions
2. **Compare Different Models**: Determine which model works better for our problem
3. **Tune Hyperparameters**: Optimize model settings to improve performance
4. **Communicate Results**: Explain model performance to stakeholders in understandable terms
5. **Detect Problems**: Identify issues like overfitting and underfitting

For regression problems specifically, we need metrics that measure the difference between predicted values and actual values. Let's explore the most common regression evaluation metrics.

### Mean Squared Error (MSE)

**Definition**: The average of the squared differences between predicted and actual values.

**Formula**: 
$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

Where:
- $n$ is the number of observations
- $y_i$ is the actual value for observation $i$
- $\hat{y}_i$ is the predicted value for observation $i$

**Characteristics**:
- Always non-negative: MSE ≥ 0
- Perfect predictions give MSE = 0
- Larger errors are penalized more heavily due to squaring
- Units are squared (e.g., dollars² for house price prediction)

**Code Implementation**:

In [None]:
from sklearn.metrics import mean_squared_error

# y_true: actual target values; y_pred: model predictions (array-likes of equal length)
mse = mean_squared_error(y_true, y_pred)
print(f"MSE: {mse:.4f}")

**Advantages**:
- Mathematically convenient for optimization
- Differentiable (important for gradient-based optimization)
- Heavily penalizes large errors

**Disadvantages**:
- Squared units make interpretation difficult
- Sensitive to outliers

### Root Mean Squared Error (RMSE)

**Definition**: The square root of the Mean Squared Error.

**Formula**: 
$RMSE = \sqrt{MSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$

**Characteristics**:
- Always non-negative: RMSE ≥ 0
- Perfect predictions give RMSE = 0
- Units match the original target variable (e.g., dollars for house price prediction)
- Equals the standard deviation of the residuals when their mean is zero; otherwise it is larger

**Code Implementation**:

In [None]:
from sklearn.metrics import mean_squared_error
import numpy as np

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"RMSE: {rmse:.4f}")

**Advantages**:
- More interpretable than MSE (same units as target variable)
- Commonly used in practice
- Represents the "typical" error magnitude

**Disadvantages**:
- Still sensitive to outliers
- Not as mathematically convenient as MSE
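The link between RMSE and the standard deviation of the residuals can be checked numerically. The sketch below uses made-up residuals (an assumption, not data from this lesson) and relies on the identity RMSE² = (standard deviation of residuals)² + (mean residual)², so the two quantities coincide only when the residuals average to zero.

```python
import math
import statistics

# Illustrative residuals (actual minus predicted); these values are assumptions.
residuals = [20, -20, -20, 50, 10]

# RMSE computed directly from its definition.
rmse = math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))

# Mean and population standard deviation of the same residuals.
mean_residual = statistics.fmean(residuals)
std_residual = statistics.pstdev(residuals)

# Rebuild RMSE from the identity RMSE^2 = std^2 + mean^2.
reconstructed = math.sqrt(std_residual ** 2 + mean_residual ** 2)

print(f"RMSE:              {rmse:.2f}")
print(f"Std of residuals:  {std_residual:.2f}")
print(f"Mean residual:     {mean_residual:.2f}")
print(f"sqrt(std² + mean²): {reconstructed:.2f}")
```

Because the mean residual here is not zero, the RMSE comes out somewhat larger than the standard deviation of the residuals.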

### Mean Absolute Error (MAE)

**Definition**: The average of the absolute differences between predicted and actual values.

**Formula**: 
$MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$

**Characteristics**:
- Always non-negative: MAE ≥ 0
- Perfect predictions give MAE = 0
- Units match the original target variable
- Represents the average absolute error

**Code Implementation**:

In [None]:
from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(y_true, y_pred)
print(f"MAE: {mae:.4f}")

**Advantages**:
- Intuitive interpretation: "average absolute error"
- Less sensitive to outliers compared to MSE/RMSE
- Same units as the target variable

**Disadvantages**:
- Not differentiable at zero (can be an issue for some optimization methods)
- Doesn't penalize large errors as heavily as MSE

### Comparing MSE, RMSE, and MAE

![Comparison of MAE vs MSE Sensitivity](attachment:4fc63d0f-23f4-43ca-aab2-705a457c6b0b.png)

**Error Treatment**:
- MSE and RMSE square errors, making them more sensitive to outliers
- MAE uses absolute values, making it more robust to outliers

**Interpretation**:
- MAE: "On average, our predictions are off by X units"
- RMSE: "A typical prediction error is about X units, with large errors weighted more heavily"

**When to use which**:
- **RMSE**: When large errors are particularly undesirable; when you want your metric to match the loss function used in training
- **MAE**: When you need robustness to outliers; when you want a more intuitive metric

**Example Scenario**:
Consider a house price prediction model with the following errors (in $1000s):
- Prediction 1: Off by `$5K` (small error)
- Prediction 2: Off by `$5K` (small error)
- Prediction 3: Off by `$5K` (small error)
- Prediction 4: Off by `$85K` (large error/outlier)

MAE = (5 + 5 + 5 + 85)/4 = `$25K`

RMSE = sqrt((5² + 5² + 5² + 85²)/4) = sqrt(7300/4) = sqrt(1825) ≈ `$42.7K`

The RMSE is much higher because it penalizes the large $85K error more heavily.
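The arithmetic in this scenario is easy to reproduce. The short sketch below recomputes MAE and RMSE for the four errors and also reports how much of the total squared error comes from the single outlier; the variable names are illustrative.

```python
import math

# Absolute errors from the scenario above, in thousands of dollars.
errors = [5, 5, 5, 85]

mae = sum(abs(e) for e in errors) / len(errors)
mse = sum(e ** 2 for e in errors) / len(errors)
rmse = math.sqrt(mse)

print(f"MAE:  {mae:.1f}")
print(f"MSE:  {mse:.1f}")
print(f"RMSE: {rmse:.1f}")

# Share of the total squared error contributed by the one large error.
outlier_share = max(errors) ** 2 / sum(e ** 2 for e in errors)
print(f"Outlier share of squared error: {outlier_share:.1%}")
```

Nearly all of the squared error is due to the one large miss, which is why RMSE reacts to it so much more strongly than MAE.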

### R-squared (Coefficient of Determination)

**Definition**: The proportion of the variance in the dependent variable that is predictable from the independent variables.

![image.png](attachment:aa86fbab-e4c6-4867-969b-7cbb915b799d.png)

**Formula**: 
$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - \frac{MSE}{Var(y)}$

Where:
- $\bar{y}$ is the mean of the actual target values
- $Var(y)$ is the population variance of the actual target values (dividing by $n$, so the $\frac{1}{n}$ factors in $MSE$ and $Var(y)$ cancel)

![image.png](attachment:ce1b921c-50c1-49b0-b23e-401ff62aa3f5.png)

**Intuitive Explanation**:
- The denominator $\sum_{i=1}^{n} (y_i - \bar{y})^2$ represents how much variation there is in the target variable (total sum of squares)
- The numerator $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ represents how much variation remains unexplained after fitting our model (residual sum of squares)
- R² measures the proportion of variance explained by the model

**Characteristics**:
- R² = 1: Perfect predictions, model explains all variance
- R² = 0: Model is no better than predicting the mean value
- R² < 0: Model is performing worse than predicting the mean (possible on held-out data, or for a model fit without an intercept)
- Typically ranges from 0 to 1 for reasonable models
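To make these three cases concrete, here is a minimal sketch that computes R² straight from the residual and total sums of squares. The function name `r_squared` and the toy values are assumptions chosen for illustration.

```python
def r_squared(y_true, y_pred):
    """R² = 1 - RSS / TSS, computed directly from the definition."""
    y_mean = sum(y_true) / len(y_true)
    rss = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    tss = sum((y - y_mean) ** 2 for y in y_true)
    return 1 - rss / tss

# Toy target values (illustrative only); their mean is 6.0.
y_true = [3.0, 5.0, 7.0, 9.0]

perfect = list(y_true)        # predictions equal to the targets
mean_only = [6.0] * 4         # always predict the mean of y_true
reversed_trend = [9.0, 7.0, 5.0, 3.0]  # predictions trend the wrong way

print("Perfect predictions:", r_squared(y_true, perfect))
print("Mean-only baseline: ", r_squared(y_true, mean_only))
print("Reversed trend:     ", r_squared(y_true, reversed_trend))
```

The reversed predictions have a larger residual sum of squares than the mean-only baseline, which is what pushes R² below zero.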

**Code Implementation**:

In [None]:
from sklearn.metrics import r2_score

r2 = r2_score(y_true, y_pred)
print(f"R²: {r2:.4f}")

**Advantages**:
- Unitless measure, allowing comparison across different target variables
- Intuitive interpretation as "proportion of variance explained"
- Well-known metric, widely used and understood

**Disadvantages**:
- Can increase just by adding more features, even if they're not useful
- Doesn't tell you if predictions are biased
- Can be misleadingly high, for example when computed on the training data of an overfit model

### Adjusted R-squared

**Definition**: A modified version of R² that adjusts for the number of predictors in the model.

**Formula**: 
$\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$

Where:
- $n$ is the number of observations
- $p$ is the number of predictors (features)

**Characteristics**:
- Increases only if a new feature improves the model more than would be expected by chance
- Can decrease if unnecessary predictors are added
- Always ≤ R² (equal only when p = 0 or R² = 1)
- Addresses the problem of R² automatically increasing with more predictors
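The formula can also be applied directly when only R², `n`, and `p` are known. The helper below is a small sketch (the name `adjusted_r2` and the example numbers are assumptions) showing a case where adding many features raises R² slightly yet lowers adjusted R².

```python
def adjusted_r2(r2, n, p):
    """Adjusted R² from plain R², n observations, and p predictors."""
    if n - p - 1 <= 0:
        raise ValueError("Adjusted R² needs n > p + 1")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical comparison on the same 50 observations.
small_model = adjusted_r2(r2=0.80, n=50, p=5)
large_model = adjusted_r2(r2=0.82, n=50, p=20)

print(f"5 features:  R² = 0.80, adjusted R² = {small_model:.4f}")
print(f"20 features: R² = 0.82, adjusted R² = {large_model:.4f}")
```

In this made-up comparison the larger model wins on R² but loses on adjusted R², because the extra fifteen features buy only a small reduction in unexplained variance.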

**Code Implementation**:

In [None]:
# Using statsmodels for Adjusted R²
import statsmodels.api as sm

X_with_const = sm.add_constant(X)  # Adds a constant term for intercept
model = sm.OLS(y, X_with_const).fit()
adjusted_r2 = model.rsquared_adj
print(f"Adjusted R²: {adjusted_r2:.4f}")

**When to Use Adjusted R²**:
- When comparing models with different numbers of features
- When doing feature selection
- When concerned about overfitting due to too many predictors

### Contextual Interpretation of Metrics

**Business Context is Crucial:**
- Whether an RMSE of 0.5 is "good" depends entirely on the problem context
- For house prices in thousands, an RMSE of 50 means predictions are typically off by $50,000
- This might be acceptable for multi-million dollar homes but terrible for lower-priced markets

**Relative Performance:**
- Always compare your model metrics to a baseline model
- Common baselines:
  - Predicting the mean value of the target
  - Using only the most important feature
  - A simpler model (linear vs. complex)
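A mean-value baseline takes only a few lines to set up. The sketch below uses hypothetical training targets, test targets, and model predictions (all values are assumptions) and compares the model's RMSE with that of a baseline that always predicts the training mean.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error for two equal-length sequences."""
    squared = [(y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical data, in thousands of dollars.
y_train = [200, 260, 310, 150, 420]
y_test = [250, 300, 150, 500, 400]
model_pred = [230, 320, 170, 450, 390]

# The baseline uses only training information: the mean training target.
baseline_value = sum(y_train) / len(y_train)
baseline_pred = [baseline_value] * len(y_test)

model_rmse = rmse(y_test, model_pred)
baseline_rmse = rmse(y_test, baseline_pred)
improvement = 1 - model_rmse / baseline_rmse

print(f"Baseline RMSE: {baseline_rmse:.2f}")
print(f"Model RMSE:    {model_rmse:.2f}")
print(f"Relative improvement over baseline: {improvement:.1%}")
```

Computing the baseline from the training targets only keeps the comparison fair, since the model itself never sees the test targets.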

**Example of Contextual Interpretation:**
Consider a model predicting weekly product sales with RMSE = $500 and R² = 0.7
- Is this good? It depends!
- For a product that typically sells `$50,000/week`, being off by `$500` is excellent (1% error)
- For a product that typically sells `$1,000/week`, being off by `$500` is terrible (50% error)
- The R² of 0.7 tells us we're explaining 70% of the variance, which might be good in a noisy retail environment
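The percentages in this example are just RMSE divided by a typical value of the target. A tiny sketch of that calculation, with `relative_rmse` as an illustrative helper name:

```python
def relative_rmse(rmse, typical_value):
    """RMSE expressed as a fraction of a typical target value."""
    return rmse / typical_value

rmse = 500  # dollars per week, from the example above

for weekly_sales in (50_000, 1_000):
    share = relative_rmse(rmse, weekly_sales)
    print(f"Typical sales ${weekly_sales:,}/week -> relative error {share:.0%}")
```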

**Cost of Errors:**
- Consider asymmetric costs: Is over-prediction more costly than under-prediction?
- Example: In medical resource allocation, under-predicting demand could cost lives
- Example: In inventory management, over-predicting demand leads to waste, under-predicting leads to stockouts
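One way to express asymmetric costs is a weighted absolute error that charges more for one direction of error. The function below is a sketch under assumed weights (under-prediction three times as costly as over-prediction); the weights and names are hypothetical and would need to come from the actual business context.

```python
def asymmetric_cost(y_true, y_pred, under_weight=3.0, over_weight=1.0):
    """Mean absolute error with separate weights for under- and over-prediction."""
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        residual = y - y_hat  # positive residual: the model under-predicted
        weight = under_weight if residual > 0 else over_weight
        total += weight * abs(residual)
    return total / len(y_true)

# Hypothetical demand figures.
demand = [100, 120, 80]
under_forecast = [90, 110, 70]   # every forecast is 10 units too low
over_forecast = [110, 130, 90]   # every forecast is 10 units too high

print("Cost when under-predicting:", asymmetric_cost(demand, under_forecast))
print("Cost when over-predicting: ", asymmetric_cost(demand, over_forecast))
```

Both forecasts have the same MAE of 10, yet the weighted cost separates them, which is the distinction a symmetric metric cannot make.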

<a id="overfitting-underfitting"></a>

## 2.2.2 Detecting Overfitting & Underfitting

### The Goal: Generalization

The ultimate goal of machine learning is to create models that not only perform well on the data they were trained on but also generalize well to new, unseen data. Generalization is the ability of a model to make accurate predictions on data it hasn't seen during training.

### Understanding the Bias-Variance Tradeoff

The bias-variance tradeoff is a fundamental concept that helps us understand the balance between underfitting and overfitting:

**Bias**: Error due to overly simplistic assumptions in the learning algorithm. High bias can cause an algorithm to miss relevant relations between features and target (underfitting).

**Variance**: Error due to excessive sensitivity to small fluctuations in the training data. High variance can cause an algorithm to model random noise in the training data (overfitting).

![Bias-Variance Tradeoff](attachment:7252fec2-95d3-4369-88d4-2fd257230e85.png)

![Underfitting vs. Good Fit vs. Overfitting](attachment:95febd4e-69d9-4ce3-b218-e6239ec92587.png)

As model complexity increases:
- Bias tends to decrease
- Variance tends to increase
- The goal is to find the sweet spot that minimizes total error
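These trends can be estimated with a small simulation. The sketch below is one possible setup, not a definitive recipe: it assumes a true function `sin(πx)`, Gaussian noise, and polynomial fits via `np.polyfit`, then repeatedly redraws the noise to estimate squared bias and variance for three degrees.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # Assumed ground-truth relationship for this simulation.
    return np.sin(np.pi * x)

x_train = np.linspace(-1, 1, 20)   # fixed training inputs
x_eval = np.linspace(-1, 1, 50)    # points where predictions are compared
noise_sd = 0.3
n_repeats = 200

results = {}
for degree in (1, 3, 7):
    preds = np.empty((n_repeats, x_eval.size))
    for r in range(n_repeats):
        # Each repeat is a fresh noisy training set from the same true function.
        y_train = true_fn(x_train) + rng.normal(0, noise_sd, x_train.size)
        coefs = np.polyfit(x_train, y_train, degree)
        preds[r] = np.polyval(coefs, x_eval)

    mean_pred = preds.mean(axis=0)
    bias_sq = float(np.mean((mean_pred - true_fn(x_eval)) ** 2))
    variance = float(np.mean(preds.var(axis=0)))
    results[degree] = (bias_sq, variance)
    print(f"degree {degree}: bias² = {bias_sq:.4f}, variance = {variance:.4f}")
```

In this setup the straight line has the largest squared bias and the smallest variance, and the degree-7 polynomial shows the reverse.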

### Underfitting

**What it is:** The model is too simple to capture the underlying patterns in the data. It performs poorly on *both* the training set and the test set.

**Symptoms:**
- High training error AND high test error
- Low R-squared on both training and test sets
- Residual plots show clear patterns
- Model predictions are systematically off

**Causes:**
- Model is not complex enough (e.g., using linear regression for a highly non-linear relationship)
- Important features are missing
- Too much regularization (penalties applied to model complexity)

**Example:** Using a linear model to fit data with a clearly quadratic relationship.

![Underfitting Example](attachment:52fc8499-d1b8-4d3e-aa56-56b5f8a47e41.png)

### Overfitting

**What it is:** The model learns the training data *too* well, including noise and random fluctuations. It performs extremely well on the training set but poorly on the test set. It fails to generalize.

**Symptoms:**
- Very low training error BUT high test error
- The gap between training and test performance is large
- Training R² is much higher than test R²
- Model is complex compared to data size

**Causes:**
- Model is too complex (e.g., too many features, high-degree polynomial regression)
- Insufficient training data
- Noisy data
- Training for too many iterations

**Example:** Using a high-degree polynomial that perfectly captures every wiggle in the training data, but performs poorly on new data.

### Good Fit

**What it looks like:** The model captures the underlying pattern well. Training error and test error are both low and relatively close to each other.

**Characteristics:**
- Good performance on both training and test sets
- Small gap between training and test metrics
- Model complexity appropriate for data size
- Residuals show no systematic patterns

**Example:** A model that follows the true underlying pattern in the data without fitting to noise.

### How to Detect Problems

The key is to compare performance (using metrics like MSE, RMSE, MAE, R²) on the **training set** vs. the **test set**.

**Performance Pattern Interpretation:**
- `High Train Error, High Test Error` -> Likely Underfitting
- `Low Train Error, High Test Error` -> Likely Overfitting
- `Low Train Error, Low Test Error (close to train)` -> Good Fit
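The three patterns can be turned into a rough rule of thumb. In the sketch below, `high_error` (what counts as a high error for the problem) and `max_gap_ratio` (how large a train-test gap is tolerated) are hypothetical thresholds that have to be chosen per problem; the function is a heuristic, not a formal test.

```python
def diagnose_fit(train_error, test_error, high_error, max_gap_ratio=0.25):
    """Classify a train/test error pair using the patterns listed above."""
    if train_error > high_error:
        # The model cannot even fit the data it was trained on.
        return "likely underfitting"
    gap = test_error - train_error
    if test_error > high_error or gap > max_gap_ratio * train_error:
        # Training error is fine, but it does not carry over to new data.
        return "likely overfitting"
    return "good fit"

# Hypothetical (train, test) error pairs, with 40 treated as a high error.
for train_err, test_err in [(60, 62), (10, 45), (20, 22)]:
    verdict = diagnose_fit(train_err, test_err, high_error=40)
    print(f"train={train_err}, test={test_err} -> {verdict}")
```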

**Learning Curves:** Plotting training and test error as a function of training set size (a learning curve) or model complexity (often called a validation curve) makes underfitting and overfitting easier to spot:

![Learning Curves](attachment:37f48c21-1b6a-4f4e-8e33-279a93e34bff.png)

**Typical Underfitting Patterns:**
- Both training and test error are high
- Error curves are close together
- Adding model complexity helps both train and test performance

**Typical Overfitting Patterns:**
- Training error continues to decrease with complexity
- Test error initially decreases, then increases
- Large gap between training and test errors
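The data behind such a complexity curve can be generated with a short experiment. The following sketch assumes synthetic data from a quadratic function with Gaussian noise and fits polynomials of degree 1 to 10 with `np.polyfit`; the data-generating function, sample sizes, and seed are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_data(n):
    # Assumed true relationship: quadratic plus noise.
    x = rng.uniform(-1, 1, n)
    y = 1.0 + 2.0 * x - 3.0 * x ** 2 + rng.normal(0, 0.3, n)
    return x, y

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

train_mse, test_mse = {}, {}
for degree in range(1, 11):
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse[degree] = mse(y_train, np.polyval(coefs, x_train))
    test_mse[degree] = mse(y_test, np.polyval(coefs, x_test))
    print(f"degree {degree:2d}: train MSE = {train_mse[degree]:.4f}, "
          f"test MSE = {test_mse[degree]:.4f}")
```

Training MSE can only go down as the degree rises, because each higher-degree polynomial contains the lower-degree ones as special cases. Test MSE drops sharply from degree 1 to degree 2, where the model first matches the quadratic shape; beyond that point, any further drop in training error mostly reflects fitting noise.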

### Visualizing Underfitting and Overfitting

The following plot illustrates the classic case of underfitting, good fit, and overfitting:

![Underfitting vs. Good Fit vs. Overfitting](attachment:4d28ae45-9f9b-4825-869c-fed366b669e9.png)

**Real-world Code Example:**

In [1]:
# Plotting training vs. test performance for evaluation
import matplotlib.pyplot as plt

def plot_train_test_performance(model_name, train_scores, test_scores):
    plt.figure(figsize=(10, 6))
    complexity = range(len(train_scores))
    
    plt.plot(complexity, train_scores, 'o-', color='blue', label='Training')
    plt.plot(complexity, test_scores, 'o-', color='orange', label='Test')
    
    plt.xlabel('Model Complexity')
    plt.ylabel('Performance (R² Score)')
    plt.title(f'{model_name}: Training vs. Test Performance')
    plt.legend()
    plt.grid(True)
    plt.show()
    
    # Analyzing the gap
    gap = [train - test for train, test in zip(train_scores, test_scores)]
    largest_gap_idx = gap.index(max(gap))
    
    print(f"Largest gap at complexity level {largest_gap_idx}")
    print(f"Training score: {train_scores[largest_gap_idx]:.4f}")
    print(f"Test score: {test_scores[largest_gap_idx]:.4f}")
    print(f"Gap: {gap[largest_gap_idx]:.4f}")
    
    if gap[largest_gap_idx] > 0.1:  # Arbitrary threshold
        print("WARNING: Significant gap between training and test performance suggests overfitting.")

### Strategies to Address Overfitting and Underfitting

While we'll cover these topics in more depth in future sessions, here's a preview of strategies:

**For Underfitting:**
- Increase model complexity (add more features, higher-degree polynomials)
- Reduce regularization
- Feature engineering to better capture patterns
- Try a more flexible model class

**For Overfitting:**
- Collect more training data
- Reduce model complexity
- Use regularization techniques (Ridge, Lasso - coming in future sessions)
- Early stopping
- Feature selection
- Cross-validation for more robust evaluation

<a id="interactive-learning"></a>

## Interactive Learning: Evaluating Models

### Metrics Calculation Exercise

Let's try calculating metrics for a simple example:

Imagine we have a model predicting house prices (in $1000s) with the following results:

| House | Actual Price (y) | Predicted Price (ŷ) |
|-------|-----------------|---------------------|
| 1     | 250             | 230                 |
| 2     | 300             | 320                 |
| 3     | 150             | 170                 |
| 4     | 500             | 450                 |
| 5     | 400             | 390                 |

**Task 1:** Calculate by hand (or in code):
1. MSE
2. RMSE
3. MAE
4. R²

**Solution:**

In [3]:
# Set up actual and predicted values
y_true = [250, 300, 150, 500, 400]
y_pred = [230, 320, 170, 450, 390]

# Calculate residuals
residuals = [y - y_hat for y, y_hat in zip(y_true, y_pred)]
print("Residuals:", residuals)

# 1. Calculate MSE
squared_residuals = [r**2 for r in residuals]
mse = sum(squared_residuals) / len(y_true)
print(f"MSE: {mse:.2f}")

# 2. Calculate RMSE
rmse = mse**0.5
print(f"RMSE: {rmse:.2f}")

# 3. Calculate MAE
absolute_residuals = [abs(r) for r in residuals]
mae = sum(absolute_residuals) / len(y_true)
print(f"MAE: {mae:.2f}")

# 4. Calculate R²
# First, calculate the mean of actual values
y_mean = sum(y_true) / len(y_true)
# Calculate total sum of squares
tss = sum([(y - y_mean)**2 for y in y_true])
# Calculate residual sum of squares
rss = sum(squared_residuals)
# Calculate R²
r2 = 1 - (rss / tss)
print(f"R²: {r2:.4f}")

Residuals: [20, -20, -20, 50, 10]
MSE: 760.00
RMSE: 27.57
MAE: 24.00
R²: 0.9479


### Metric Choice Scenarios

Let's consider different scenarios and decide which metric(s) would be most appropriate:

**Scenario 1: Medical Cost Prediction**
- Task: Predict patient treatment costs for hospital resource planning
- Data: Historical patient records with treatment costs ranging from `$1,000` to `$500,000`
- Considerations: Some extremely expensive outlier cases exist, but overall accuracy is important

**Which metrics would you choose and why?**

<details>
<summary>Click for solution</summary>

**Solution:** 
- **RMSE and MAE together**: RMSE will highlight the impact of expensive outlier cases, while MAE will show typical error. 
- **R²**: To understand overall predictive power.
- The context requires balancing sensitivity to outliers (critical expensive cases) with typical performance.
- Report RMSE/MAE as percentages of average cost for easier interpretation.
</details>

**Scenario 2: House Price Prediction for a Loan Company**
- Task: Predict house values to determine loan amounts
- Considerations: Under-prediction could lead to insufficient loans (customer dissatisfaction), while over-prediction could lead to default risk
- Business decision: Different costs for over vs. under prediction

**Which metrics would you choose and why?**

<details>
<summary>Click for solution</summary>

**Solution:**
- **Custom weighted error**: Consider a metric that penalizes over-prediction more than under-prediction based on business risk assessment
- **RMSE**: Still valuable for overall error magnitude
- **Separate mean positive and negative error** metrics to understand bias direction
- Could also use plots showing error distribution to identify any systematic bias
</details>
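The "separate positive and negative error" idea from the solution can be sketched as follows. The helper name `signed_error_summary` is hypothetical, and the example reuses the house prices from the metrics calculation exercise; residuals follow the convention actual minus predicted, so a positive residual means the model under-predicted.

```python
def signed_error_summary(y_true, y_pred):
    """Summarize under- and over-prediction separately."""
    residuals = [y - y_hat for y, y_hat in zip(y_true, y_pred)]
    under = [r for r in residuals if r > 0]    # predicted too low
    over = [-r for r in residuals if r < 0]    # predicted too high
    return {
        "mean_residual": sum(residuals) / len(residuals),
        "mean_under": sum(under) / len(under) if under else 0.0,
        "mean_over": sum(over) / len(over) if over else 0.0,
        "share_over": len(over) / len(residuals),
    }

# House prices in $1000s from the exercise above.
y_true = [250, 300, 150, 500, 400]
y_pred = [230, 320, 170, 450, 390]

summary = signed_error_summary(y_true, y_pred)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

A mean residual far from zero points to systematic bias in one direction, which matters here because the two directions carry different business risks.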

### Detecting Over/Underfitting

**Scenario 3: Polynomial Regression Degree Selection**
- Task: Choose the appropriate polynomial degree for modeling a relationship
- Available data: 100 training samples, 50 test samples
- Models: Linear (degree=1) through 10th degree polynomials

Given the following performance metrics, identify which model is likely underfitting, which is overfitting, and which might be the best choice:

| Polynomial Degree | Training R² | Test R² |
|-------------------|-------------|---------|
| 1 (Linear)        | 0.45        | 0.42    |
| 2                 | 0.67        | 0.65    |
| 3                 | 0.75        | 0.74    |
| 4                 | 0.79        | 0.76    |
| 5                 | 0.82        | 0.77    |
| 6                 | 0.85        | 0.75    |
| 7                 | 0.89        | 0.71    |
| 8                 | 0.92        | 0.65    |
| 9                 | 0.94        | 0.62    |
| 10                | 0.97        | 0.58    |

<details>
<summary>Click for solution</summary>

**Solution:**
- **Underfitting**: Degrees 1-2 are likely underfitting, as both training and test R² are relatively low.
- **Good Fit**: Degrees 3-5 show good performance, with degree 5 achieving the highest test R² (0.77).
- **Overfitting**: Degrees 6-10 show clear signs of overfitting, with increasing training R² but decreasing test R².
- **Best Choice**: Degree 5 maximizes test R² (0.77). Degree 4 is nearly as good (0.76) with a smaller train-test gap, so choosing the simpler model is also defensible.
- **Analysis**: As complexity increases, the gap between training and test performance widens, a classic sign of overfitting.
</details>
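The same reading of the table can be done programmatically. This sketch copies the table values into a dictionary (the variable names are illustrative), then finds the degree with the highest test R² and the train-test gap at each degree.

```python
# (training R², test R²) for each polynomial degree, copied from the table above.
results = {
    1: (0.45, 0.42),
    2: (0.67, 0.65),
    3: (0.75, 0.74),
    4: (0.79, 0.76),
    5: (0.82, 0.77),
    6: (0.85, 0.75),
    7: (0.89, 0.71),
    8: (0.92, 0.65),
    9: (0.94, 0.62),
    10: (0.97, 0.58),
}

# Degree with the highest test R².
best_degree = max(results, key=lambda d: results[d][1])

# Train-test gap per degree, rounded to avoid floating-point noise.
gaps = {d: round(train - test, 2) for d, (train, test) in results.items()}

print(f"Best test R² at degree {best_degree}: {results[best_degree][1]}")
for degree, gap in gaps.items():
    print(f"degree {degree:2d}: gap = {gap:.2f}")
```

The gap stays at or below 0.05 through degree 5 and then grows with every additional degree, matching the overfitting pattern described in the solution.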

<a id="practice-questions"></a>

## Practice Questions

1. Compare and contrast MSE, RMSE, and MAE as evaluation metrics. When might you prefer one over the others?

2. If your model has an R² of 0.75, what does this tell you about how well your model explains the variance in the target variable?

3. Why might adjusted R² be preferred over regular R² when comparing models with different numbers of features? Provide an example scenario.

4. You've trained two linear regression models on housing data:
   - Model A: Training RMSE = 50,000, Test RMSE = 52,000
   - Model B: Training RMSE = 35,000, Test RMSE = 75,000
   
   Which model is likely overfitting? Which would you choose to deploy, and why?

5. Explain the bias-variance trade-off in your own words. How does it relate to model complexity?

6. For each of the following scenarios, identify whether the model is likely underfitting, overfitting, or a good fit:
   - Training R² = 0.95, Test R² = 0.65
   - Training RMSE = 15.3, Test RMSE = 16.1
   - Training MAE = 8.7, Test MAE = 25.4
   - Training R² = 0.42, Test R² = 0.40

7. How would you determine the appropriate level of model complexity for a regression problem to avoid both underfitting and overfitting?

8. A colleague says, "My model has an R² of 0.99 on the training data, so it's excellent!" What questions would you ask or concerns might you have?

9. Why is it important to consider both absolute metrics (like RMSE, MAE) and relative metrics (like R²) when evaluating regression models?

10. You're predicting stock prices with a model that achieves RMSE = $5 on the test set. Is this good or bad? What additional information would you need to determine if this is acceptable performance?