In [None]:
from IPython.display import HTML
HTML(open("../style.css", "r").read())

# Overfitting in Linear Regression (with Scikit-Learn)

In this notebook, we demonstrate **overfitting** using the **Hitters** dataset (baseball statistics).

We will incrementally add features to our model, starting with the most "important" ones, to see how the model's performance changes on the **Training Set** versus the **Test Set**.

We will use the **Scikit-Learn** (`sklearn`) library, one of the most widely used machine learning libraries in Python.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import urllib.request

### 1. Data Loading and Preprocessing

We use the `pandas` library to load the data. `pandas` is excellent for handling tabular data (DataFrames).

In [None]:
# Download the file
url = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/ISLR/Hitters.csv"
urllib.request.urlretrieve(url, "Hitters.csv")

# Load into a DataFrame
df = pd.read_csv("Hitters.csv")
df

Drop the `rownames` column. It holds the player names and is not a usable feature.

In [None]:
df = df.drop(columns=["rownames"])

Drop rows where the target `Salary` is missing, i.e. has the value `NaN`.

In [None]:
df = df.dropna(subset=['Salary'])
df

#### One-Hot Encoding
Machine Learning models generally require numerical input. Our dataset contains categorical text data (e.g., `League` is 'A' or 'N'). We use `pd.get_dummies` to convert these into numbers (0 or 1).
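As a small illustration of what one-hot encoding does, the sketch below applies `pd.get_dummies` to a toy DataFrame. The values are made up for this example and are not taken from the Hitters data. With `drop_first=True`, the first category of each text column is dropped, so a two-valued column such as `League` becomes a single indicator column.

```python
import pandas as pd

# Toy frame with hypothetical values: one numeric and one categorical column
toy = pd.DataFrame({"Hits": [81, 130, 141], "League": ["A", "N", "A"]})

# drop_first=True drops the first category ('A'), so only 'League_N' remains.
# dtype=int requests 0/1 integers instead of booleans.
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded)
```

Dropping one category per column avoids redundant columns: if `League_N` is 0, the league must be 'A'.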

In [None]:
# dtype=int yields 0/1 integers (recent pandas versions return booleans by default)
df = pd.get_dummies(df, drop_first=True, dtype=int)
df.head()

### 2. Feature Sorting by Importance

Before we train, we want to determine which features are the most important.
A simple heuristic for "importance" in linear regression is the **correlation** between a feature and the target variable.

- We calculate the correlation matrix.
- We sort the features based on the absolute value of their correlation with `Salary`.
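To make the heuristic concrete, here is a minimal sketch on toy data (hypothetical values and column names, not the Hitters data). It computes the Pearson correlation $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$ by hand, compares it with the value from `DataFrame.corr()`, and ranks the features by $|r|$.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'strong' tracks the target closely, 'weak' does not
toy = pd.DataFrame({
    "strong": [1.0, 2.0, 3.0, 4.0, 5.0],
    "weak":   [2.0, 1.0, 4.0, 1.0, 3.0],
    "target": [1.1, 1.9, 3.2, 3.9, 5.1],
})

def pearson(x, y):
    """Pearson correlation coefficient computed from its definition."""
    xc = x - x.mean()
    yc = y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

manual = pearson(toy["strong"], toy["target"])
from_pandas = toy.corr().loc["strong", "target"]
print(manual, from_pandas)

# Rank the features by absolute correlation with the target
ranking = toy.corr()["target"].abs().drop("target").sort_values(ascending=False)
print(ranking)
```

Note that correlation only measures the linear relationship of each feature with the target in isolation, so this ranking is a heuristic rather than a definitive measure of importance.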

Calculate the correlation matrix, keep only the `Salary` column, and sort it by absolute value.

In [None]:
correlations = df.corr()['Salary'].abs().sort_values(ascending=False)
correlations

Drop the target `Salary` from the list (its correlation with itself is 1), so that only the features remain.

In [None]:
sorted_features = correlations.drop('Salary').index.tolist()

In [None]:
print("Features sorted by importance (Correlation with Salary):")
for i, f in enumerate(sorted_features):
    print(f"{i+1}. {f} ({correlations[f]:.4f})")

### 3. Splitting the Data

We use `train_test_split` from `sklearn.model_selection`.

**Explanation of the function:**
- `train_test_split(X, y, train_size=..., random_state=...)`: This function randomly shuffles the data and splits it into a training set and a test set.
- `train_size=50`: An integer is interpreted as the absolute number of training samples; all remaining rows form the test set. We deliberately use a **small training set** (and consequently a large test set) to make it easier to overfit the model for this demonstration.
- `random_state=42`: Ensures the split is reproducible (we get the same random split every time we run the code).
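The behaviour of these parameters can be checked on a toy array. The following sketch uses made-up data (`X_toy`, `y_toy` are names introduced only for this example) and shows that an integer `train_size` fixes the number of training rows and that the same `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# An integer train_size is the absolute number of training rows
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=4, random_state=42
)

# Repeating the call with the same random_state yields the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, train_size=4, random_state=42
)

print(X_tr.shape, X_te.shape)
print(np.array_equal(y_tr, y_tr2))
```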

In [None]:
from sklearn.model_selection import train_test_split

X = df[sorted_features]  # Features ordered by importance
y = df['Salary']         # Target

# Split the data
# We keep only 50 samples for training to simulate a 'low data' scenario where overfitting is common
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=50, random_state=42)

print(f"Training samples: {X_train.shape[0]}")
print(f"Test samples:     {X_test.shape[0]}")

### 4. The Experiment

We will now loop through our sorted features. In each iteration $k$, we use the top $k$ features to train a linear regression model.

**Explanation of Scikit-Learn functions used:**
1.  `LinearRegression()`: Creates an instance of the model. It fits ordinary least squares with an intercept; whenever $X^\top X$ is invertible, the result coincides with the solution of the Normal Equation.
2.  `.fit(X, y)`: This trains the model. It finds the weights $\beta$ that minimize the sum of squared residuals on the given data `X` and `y`.
3.  `.score(X, y)`: This evaluates the model. For regression, it returns the $R^2$ score (Coefficient of Determination). $1.0$ is perfect, $0.0$ is equivalent to always predicting the mean of `y`, and the score can become negative on unseen data if the model predicts worse than the mean.
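The following sketch checks these statements on synthetic data (the data and all variable names are assumptions made for this illustration). It compares the fitted coefficients with the Normal Equation solution $\hat{\beta} = (X^\top X)^{-1} X^\top y$, where $X$ is augmented by a column of ones for the intercept, and recomputes $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$ by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: a known linear relationship plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 + X @ np.array([1.5, -0.5, 3.0]) + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)

# Normal Equation: prepend a column of ones so beta[0] is the intercept
X1 = np.column_stack([np.ones(len(X)), X])
beta = np.linalg.solve(X1.T @ X1, X1.T @ y)

# R^2 computed from its definition
y_hat = model.predict(X)
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(beta[0], model.intercept_)
print(beta[1:], model.coef_)
print(r2_manual, model.score(X, y))
```

Scikit-Learn does not solve the Normal Equation literally; it uses a least-squares solver, which is numerically more robust when features are nearly collinear.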

In [None]:
from sklearn.linear_model import LinearRegression

train_scores = []
test_scores = []
num_features = []

# Iterate from using 1 feature to using all 19 features
for k in range(1, len(sorted_features) + 1):
    # Select the top k features
    top_k_features = sorted_features[:k]
    
    X_train_k = X_train[top_k_features]
    X_test_k  = X_test[top_k_features]
    
    # 1. Create the model
    model = LinearRegression()
    
    # 2. Train the model (Fit)
    model.fit(X_train_k, y_train)
    
    # 3. Evaluate the model (Score)
    # We record the R^2 score on both the data it studied (Train) and the data it hasn't seen (Test)
    r2_train = model.score(X_train_k, y_train)
    r2_test  = model.score(X_test_k, y_test)
    
    train_scores.append(r2_train)
    test_scores.append(r2_test)
    num_features.append(k)

### 5. Visualization

We plot the training and test scores. 

**What to look for:**
- The **Training Score** (Blue) never decreases. Each model contains all features of the previous one, so least squares can always fit the training data at least as well as before.
- The **Test Score** (Red) typically peaks and then drops. This drop indicates **overfitting**: the model is using the additional (less important) features to memorize noise in the training set, which hurts its ability to predict salaries of players it has not seen.
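The same pattern can be reproduced on synthetic data where we know which features matter. In the sketch below (all data and names are assumptions for this illustration), only the first two columns carry signal and the remaining ones are pure noise. The training $R^2$ cannot decrease as columns are added, while the test $R^2$ has no such guarantee.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_train, n_test, n_feat = 30, 200, 20

X = rng.normal(size=(n_train + n_test, n_feat))
# Only the first two columns influence y; the other 18 are noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n_train + n_test)

X_tr, X_te = X[:n_train], X[n_train:]
y_tr, y_te = y[:n_train], y[n_train:]

train_r2, test_r2 = [], []
for k in range(1, n_feat + 1):
    # Fit on the first k columns only, so the models are nested
    m = LinearRegression().fit(X_tr[:, :k], y_tr)
    train_r2.append(m.score(X_tr[:, :k], y_tr))
    test_r2.append(m.score(X_te[:, :k], y_te))

for k, (tr, te) in enumerate(zip(train_r2, test_r2), start=1):
    print(f"k={k:2d}  train R^2={tr:.3f}  test R^2={te:.3f}")
```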

In [None]:
plt.figure(figsize=(10, 6))

plt.plot(num_features, train_scores, 'o-', color='blue', label='Training Score ($R^2$)')
plt.plot(num_features, test_scores, 'o-', color='red', label='Test Score ($R^2$)')

plt.title('Overfitting Analysis: Baseball Salaries', fontsize=16)
plt.xlabel('Number of Features (Sorted by Importance)', fontsize=12)
plt.ylabel('$R^2$ Score', fontsize=12)
plt.xticks(num_features)
plt.grid(True, alpha=0.3)
plt.legend(fontsize=12)

# Annotate the "Sweet Spot"
best_k = np.argmax(test_scores) + 1
plt.axvline(x=best_k, color='green', linestyle='--', alpha=0.7)
plt.text(best_k + 0.5, 0.4, 'Sweet Spot', color='green', fontsize=12)

plt.show()