In [None]:
from IPython.display import HTML
HTML(open("../style.css", "r").read())

# Overfitting in Linear Regression (with Scikit-Learn)

In this notebook, we demonstrate **overfitting** using the **Hitters** dataset (baseball statistics).

We will incrementally add features to our model, starting with the most "important" ones, to see how the model's performance changes on the **Training Set** versus the **Test Set**.

We will use the **Scikit-Learn** (`sklearn`) library, which is the industry standard for machine learning in Python.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

To read the data from the web, we import the module `request` from `urllib`.

In [None]:
import urllib.request

### 1. Data Loading and Preprocessing

The data is available at https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/ISLR/Hitters.csv.
We download it via the function `urlretrieve`.

In [None]:
url = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/ISLR/Hitters.csv"
urllib.request.urlretrieve(url, "Hitters.csv")

The data is now stored locally in the file `'Hitters.csv'`. 

In [None]:
!cat Hitters.csv || type Hitters.csv

We put the data into a *data frame*.

In [None]:
df = pd.read_csv("Hitters.csv")
df

Next, we drop the player names as they are useless for our statistical investigation.

In [None]:
df = df.drop(columns=["rownames"])

As our aim is to predict a player's *salary* from the other attributes, we have to drop rows where the target `Salary` is missing, i.e. has the value `NaN`.

In [None]:
df = df.dropna(subset=['Salary'])
df

Observe that out of 322 rows only 263 have survived.

## One-Hot Encoding with `pd.get_dummies`

One-hot encoding is a technique used to convert categorical data (variables that contain label values rather than numbers) into a numerical format that machine learning algorithms can understand.

In the context of the Python library **pandas**, the function `pd.get_dummies` is the standard tool for performing this transformation. It takes a column of categorical data and expands it into multiple new columns, one for each unique category found in the original column.

### How It Works

1. **Identify Unique Categories:** The function scans the specified column to find all unique values (e.g., "Red", "Blue", "Green").
2. **Create New Columns:** It creates a new binary column for *each* unique category.
3. **Assign Binary Values:**
   * It places a **1** (or `True`) in the column corresponding to the observation's category.
   * It places a **0** (or `False`) in all other category columns for that row.



---

### Example: Baseball Divisions

Let's look at a concrete example based on the `Division` attribute for baseball players.

#### 1. The Original Data

Imagine you have a DataFrame of baseball players. One of the columns is `Division`, and it contains three possible values representing the division the team plays in:

* **W** (Western)
* **C** (Central)
* **E** (Eastern)

Here is what the raw data looks like:

| Player | Division |
| --- | --- |
| Player A | W |
| Player B | C |
| Player C | E |
| Player D | W |

#### 2. Applying `pd.get_dummies`

When you run `pd.get_dummies()` on the `Division` column, pandas creates three new columns, usually prefixed with the original column name and ordered alphabetically by category.

The resulting table looks like this:

| Player | Division_C | Division_E | Division_W |
| --- | --- | --- | --- |
| Player A | 0 | 0 | **1** |
| Player B | **1** | 0 | 0 |
| Player C | 0 | **1** | 0 |
| Player D | 0 | 0 | **1** |

#### 3. Interpretation

* **Player A** was in the Western division, so `Division_W` is 1, while `Division_C` and `Division_E` are 0.
* **Player B** was in the Central division, so `Division_C` is 1.
* **Player C** was in the Eastern division, so `Division_E` is 1.

The algorithm can now treat these columns as independent numerical features rather than a single text string.
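
The encoding of the toy table can be reproduced with a few lines of pandas. This is a minimal sketch: the data frame `toy` and the player names are placeholders invented for illustration, and `dtype=int` is passed so that the output shows `0`/`1` rather than booleans. Note that pandas lists the new columns in alphabetical order of the category labels.

```python
import pandas as pd

# Toy data mirroring the example table (player names are placeholders).
toy = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C", "Player D"],
    "Division": ["W", "C", "E", "W"],
})

# Encode only the Division column; dtype=int yields 0/1 instead of True/False.
encoded = pd.get_dummies(toy, columns=["Division"], dtype=int)
print(encoded)
```

Each row of `encoded` contains exactly one `1` among the three `Division_*` columns.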

---

### Important Note: The Dummy Variable Trap

In many statistical models (like linear regression), including all three columns creates a problem called perfect **multicollinearity** (or the "Dummy Variable Trap"). This happens because the three columns always sum to 1, so one of them is redundant: if you know a player is *not* in the West and *not* in the Central, you automatically know they *must* be in the East.

To solve this, we often drop one column to serve as the "baseline" or "reference" category. Pandas supports this via the `drop_first=True` parameter, which drops the first category in sorted order, here `Division_C`.

If we used `drop_first=True` on the example above, the result would look like this:

| Player | Division_E | Division_W |
| --- | --- | --- |
| Player A | 0 | 1 |
| Player B | 0 | 0 |
| Player C | 1 | 0 |
| Player D | 0 | 1 |

* **Player B (Central)** is now represented by 0s in both columns. The model infers that if it's not West and not East, it is Central.
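
The redundancy can be checked numerically. The sketch below is an illustration on the toy divisions only; the helper `design_matrix` is a hypothetical name introduced here. It prepends an intercept column of ones to the dummy columns and compares the number of columns with the rank of the resulting matrix. A rank smaller than the number of columns means the columns are linearly dependent.

```python
import numpy as np
import pandas as pd

division = pd.Series(["W", "C", "E", "W"], name="Division")

def design_matrix(dummies):
    """Prepend an intercept column of ones to the dummy columns."""
    ones = np.ones((len(dummies), 1))
    return np.hstack([ones, dummies.to_numpy(dtype=float)])

# All three dummy columns plus the intercept: 4 columns.
full = design_matrix(pd.get_dummies(division))
# One dummy column dropped: intercept plus 2 dummy columns.
reduced = design_matrix(pd.get_dummies(division, drop_first=True))

print("full:   ", full.shape[1], "columns, rank", np.linalg.matrix_rank(full))
print("reduced:", reduced.shape[1], "columns, rank", np.linalg.matrix_rank(reduced))
```

In the full encoding the intercept column equals the sum of the three dummy columns, so the rank falls short of the column count. After dropping one dummy column the remaining columns are linearly independent.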

The dataset we are inspecting has only two values for the attribute `Division`: `W` and `E`. There is no player from a Central division.
Hence, in our case, the attribute `Division` is replaced with just one new binary attribute `Division_W`. If this attribute is `True`, the player is
in the West division, otherwise he is in the East division. The categorical attributes `League` and `NewLeague` are encoded in the same way.

In [None]:
df = pd.get_dummies(df, drop_first=True)
df.head()

### 2. Feature Sorting by Importance

Before we train, we want to determine which features are the most important.
A simple heuristic for "importance" in linear regression is the **correlation** between a feature and the target variable.
Therefore, we 
  * calculate the correlation matrix and then
  * sort the features based on the absolute value of their correlation with `Salary`.

In [None]:
correlations = df.corr()['Salary'].abs().sort_values(ascending=False)
correlations

Since `Salary` is the target variable and not a feature, we drop the entry `Salary` from this list.

In [None]:
sorted_features = correlations.drop('Salary').index.tolist()
sorted_features

In [None]:
print("Features sorted by importance (Correlation with Salary):")
for i, f in enumerate(sorted_features):
    print(f"{i+1}. {f} ({correlations[f]:.4f})")

### 3. Splitting the Data

We use `train_test_split` from `sklearn.model_selection`.

**Explanation of the function:**
- `train_test_split(X, y, train_size=..., random_state=...)`: This function randomly shuffles the data and splits it into a training set and a test set.
- `train_size=50`: Since this is an integer, it is the absolute number of training samples. We only take 50 samples for training to intentionally make it easier to overfit the model for this demonstration.
- `random_state=42`: This is used to seed the random number generator and ensures that the train/test split is reproducible 
   (we get the same random split every time we run the code).
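
The effect of these two parameters can be seen on a small synthetic example. This is a sketch on made-up arrays (`X_toy` and `y_toy` are illustrative names, not part of the Hitters data): an integer `train_size` fixes the number of training rows, and repeating the call with the same `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Two calls with identical arguments, including the seed.
split_a = train_test_split(X_toy, y_toy, train_size=4, random_state=42)
split_b = train_test_split(X_toy, y_toy, train_size=4, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print("train shape:", X_tr.shape, "test shape:", X_te.shape)

# Compare the four returned arrays pairwise.
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("identical splits:", same_split)
```

Every sample ends up in exactly one of the two sets, so the training and test sets together always cover the original data.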

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = df[sorted_features]  # Features ordered by importance
X

In [None]:
y = df['Salary']         # Target
y

Next we split the data. We keep only 50 samples for training to simulate a 'low data' scenario where overfitting is 
easy to observe.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=50, random_state=42)

In [None]:
print(f"Training samples: {X_train.shape[0]}")
print(f"Test samples:     {X_test.shape[0]}")

### 4. The Experiment

We will now loop through our sorted features. In each iteration $k$, we utilize the top $k$ features to train a linear regression model.

**Explanation of Scikit-Learn functions used:**
1.  `LinearRegression()`: Creates an instance of an ordinary least squares model. Its solution coincides with the solution of the *normal equation* whenever that solution is unique.
2.  `.fit(X, y)`: This function trains the model, i.e. it computes the least squares solution and thereby 
    finds the weights and the intercept that minimize the *mean squared error* on the given data `X` and `y`.
3.  `.score(X, y)`: This evaluates the model. For regression, it returns the $R^2$ score (Coefficient of Determination). $1.0$ is perfect, $0.0$ is equivalent to always predicting the mean of `y`, and negative values mean the model is worse than that, which can happen on test data.
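
To make the connection to the normal equation and to $R^2$ concrete, here is a minimal sketch on synthetic data. The data, the true weights and all variable names are assumptions chosen for illustration. It solves the normal equation $A^\top A\,\theta = A^\top y$ directly, where $A$ is the feature matrix with a prepended column of ones, and computes $R^2 = 1 - \mathrm{SS}_{\mathrm{res}} / \mathrm{SS}_{\mathrm{tot}}$ by hand, so that both can be compared with what Scikit-Learn returns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 40 samples, 3 features, known weights plus a little noise.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(40, 3))
y_syn = X_syn @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=40)

model_syn = LinearRegression().fit(X_syn, y_syn)

# Normal equation with an explicit intercept column of ones.
A = np.hstack([np.ones((X_syn.shape[0], 1)), X_syn])
theta = np.linalg.solve(A.T @ A, A.T @ y_syn)

# R^2 by hand: 1 - (residual sum of squares) / (total sum of squares).
residuals = y_syn - model_syn.predict(X_syn)
r2_manual = 1.0 - np.sum(residuals**2) / np.sum((y_syn - y_syn.mean())**2)

print("intercepts agree:", np.isclose(theta[0], model_syn.intercept_))
print("weights agree:   ", np.allclose(theta[1:], model_syn.coef_))
print("R^2 agrees:      ", np.isclose(r2_manual, model_syn.score(X_syn, y_syn)))
```

Solving the normal equation explicitly is fine for a small, well-conditioned example like this one. For real data it is usually safer to leave the numerical details to the library.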

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
train_scores = []
test_scores  = []
num_features = list(range(1, len(sorted_features) + 1))  # 1, 2, ..., 19

We iterate over all 19 features.  The $k^{\mathrm{th}}$ iteration uses the $k$ most important features.

In [None]:
for k in range(1, len(sorted_features) + 1):
    # Select the top k features
    top_k_features = sorted_features[:k]
    X_train_k = X_train[top_k_features]
    X_test_k  = X_test[top_k_features]
    # 1. Create the model
    model = LinearRegression()
    # 2. Train the model with the training data
    model.fit(X_train_k, y_train)
    # 3. Evaluate the model (Score)
    # We first record the score on the training data.
    r2_train = model.score(X_train_k, y_train)
    train_scores.append(r2_train)
    # Next, we record the score on the test data.
    # Note that the model hasn't seen the test data during training.
    r2_test = model.score(X_test_k, y_test)
    test_scores.append(r2_test)

### 5. Visualization

We plot the training and test scores. 

**What to look for:**
- The **Training Score** (Blue) never decreases as features are added: every additional feature gives the least squares fit more freedom to explain the specific training data.
- The **Test Score** (Red) typically peaks and then drops. This drop indicates **overfitting**: the model is using the additional (less important) features to memorize noise in the training set, which hurts its ability to predict real salaries.

In [None]:
plt.figure(figsize=(10, 6))
plt.plot(num_features, train_scores, 'o-', color='blue', label='Training Score ($R^2$)')
plt.plot(num_features, test_scores,  'o-', color='red' , label='Test Score ($R^2$)')

plt.title('Overfitting Analysis: Baseball Salaries', fontsize=16)
plt.xlabel('Number of Features (Sorted by Importance)', fontsize=12)
plt.ylabel('$R^2$ Score', fontsize=12)
plt.xticks(num_features)
plt.grid(True, alpha=0.3)
plt.legend(fontsize=12)

# Annotate the "Sweet Spot"
best_k = np.argmax(test_scores) + 1
plt.axvline(x=best_k, color='green', linestyle='--', alpha=0.7)
plt.text(best_k + 0.5, 0.4, 'Sweet Spot', color='green', fontsize=12)

plt.show()
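
The behaviour of the training score can also be checked on purely synthetic data. The following sketch is an assumption-laden illustration, not part of the Hitters experiment: the target depends on a single informative feature, and 15 additional columns are pure noise. The helper `make_data` and all sizes are hypothetical choices. For nested least squares models the training $R^2$ cannot decrease when a feature is added, whereas nothing guarantees this for the test $R^2$.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n_noise = 15  # number of pure-noise features (illustrative choice)

def make_data(n):
    """One informative feature followed by n_noise columns of pure noise."""
    signal = rng.normal(size=(n, 1))
    noise = rng.normal(size=(n, n_noise))
    y = 2.0 * signal[:, 0] + rng.normal(size=n)
    return np.hstack([signal, noise]), y

X_syn_train, y_syn_train = make_data(30)
X_syn_test, y_syn_test = make_data(200)

train_r2, test_r2 = [], []
for k in range(1, X_syn_train.shape[1] + 1):
    # Fit on the first k columns only, as in the experiment above.
    m = LinearRegression().fit(X_syn_train[:, :k], y_syn_train)
    train_r2.append(m.score(X_syn_train[:, :k], y_syn_train))
    test_r2.append(m.score(X_syn_test[:, :k], y_syn_test))

for k, (tr, te) in enumerate(zip(train_r2, test_r2), start=1):
    print(f"{k:2d} features: train R^2 = {tr:.3f}, test R^2 = {te:.3f}")
```

Comparing the two columns of the printout shows how much of the apparent improvement on the training data carries over to unseen data.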