# Classification

This notebook introduces the Scikit-Learn interface for classification models.

## Setup

Load the packages and configure environment.

In [None]:
import numpy as np
import pandas as pd

Using the `Default` data set (credit card default data, also used in ISL Chapter 4).

In [None]:
# download the data set directly from the web using pandas
url = "https://raw.githubusercontent.com/olearydj/INSY7120/refs/heads/main/notebooks/data/Default.csv"
default = pd.read_csv(url)

default.head()

## Simple Logistic Regression

First, we'll predict `default` based only on credit card `balance`, using logistic regression.

We'll use several components from SKL to achieve this.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

Get `X` and `y`. Note that the method below for `y` produces the same 0/1 values as using `get_dummies`, as we've done before (the difference is that `get_dummies` returns a one-column DataFrame rather than a Series):

```python
y = pd.get_dummies(default['default'], drop_first=True, dtype=int)
```

For this simple binary case, the code below creates a boolean array based on the comparison `default['default'] == 'Yes'`. In Python, `True` is equivalent to `1` and `False` to `0`, and `astype(int)` converts them to that representation.
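
As a quick check of the two approaches, here is a minimal sketch on a made-up four-row frame. The `toy` data and the names `y_compare` and `y_dummies` are hypothetical and stand in for the real `default` column.

```python
import pandas as pd

# Hypothetical toy data standing in for the `default` column
toy = pd.DataFrame({"default": ["No", "Yes", "No", "Yes"]})

# Approach 1: boolean comparison converted to 0/1 (returns a Series)
y_compare = (toy["default"] == "Yes").astype(int)

# Approach 2: dummy encoding, dropping the first level "No"
# (returns a DataFrame with a single column named "Yes")
y_dummies = pd.get_dummies(toy["default"], drop_first=True, dtype=int)

# Same values, different container types
print(y_compare.tolist())
print(y_dummies["Yes"].tolist())
print(type(y_compare).__name__, type(y_dummies).__name__)
```

The values agree, but the containers differ, which matters if later code expects a 1D target.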

In [None]:
# Extract feature and target
X = default[['balance']].values  # We need 2D array for sklearn
y = (default['default'] == 'Yes').astype(int)  # Convert to binary 0/1

# use value_counts to see class counts
y.value_counts()

There are 10,000 observations: 9,667 are labeled 0 (no default, 96.7%) and 333 are labeled 1 (default, 3.3%).

For this example, we'll use simple train-test split validation.

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

Scikit-learn's `LogisticRegression` is fit with an iterative solver and applies L2 regularization by default, both of which are sensitive to differences in scale between the features. Normalize the features by z-score standardization using `StandardScaler`. During fit, we take care to calculate the scaler parameters ($\mu$ and $\sigma$) using only the training data. Those parameters are then used to transform both the training and the test data. This avoids leaking information about the test data (e.g. its scale) into training.
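
To make the train-only fitting concrete, here is a small sketch on hypothetical toy arrays (`X_train_toy`, `X_test_toy`, and `scaler_toy` are illustrative names, not part of the notebook). It shows that the stored mean and standard deviation come from the training rows alone and are then reused on the test row.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature: the test value lies outside the training range
X_train_toy = np.array([[0.0], [10.0], [20.0]])
X_test_toy = np.array([[30.0]])

scaler_toy = StandardScaler()
X_train_toy_scaled = scaler_toy.fit_transform(X_train_toy)  # learns mean and std from training rows only
X_test_toy_scaled = scaler_toy.transform(X_test_toy)        # reuses the training mean and std

print(scaler_toy.mean_[0], scaler_toy.scale_[0])  # training mean and (population) std
print(X_train_toy_scaled.ravel())                 # centered on 0
print(X_test_toy_scaled.ravel())                  # (30 - training mean) / training std
```

Note that the scaled test value is not centered on zero, and that is expected: it is expressed relative to the training distribution.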

In [None]:
# Standardize features
scaler = StandardScaler()

# fit the scaler using training data AND apply that transformation to the same data
X_train_scaled = scaler.fit_transform(X_train)

# use the scaler parameters from training to transform the test data
X_test_scaled = scaler.transform(X_test)

It is simple to fit the `LogisticRegression` model.

In [None]:
# Train logistic regression model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

Finally, make predictions. Two methods are used below: `model.predict` generates class labels, whereas `model.predict_proba` generates class probabilities (one column per class, so `[:, 1]` selects the probability of the positive class). SKL's binary classifiers use a threshold of 0.5 for label assignment: observations with a predicted probability greater than 50% get the positive label.
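
The relationship between the two methods can be checked directly. The sketch below uses hypothetical synthetic data (`X_toy`, `y_toy`) rather than the `Default` data, and confirms that the labels from `predict` match the positive-class probability thresholded at 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data: the label is more likely to be 1 as the feature grows
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 1))
y_toy = (X_toy[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = LogisticRegression().fit(X_toy, y_toy)

labels = clf.predict(X_toy)
probs = clf.predict_proba(X_toy)  # column 0 = P(class 0), column 1 = P(class 1)

# Each row of probabilities sums to 1
rows_sum_to_one = np.allclose(probs.sum(axis=1), 1.0)
# Labels agree with thresholding the positive-class probability at 0.5
labels_match = np.array_equal(labels, (probs[:, 1] > 0.5).astype(int))
print(rows_sum_to_one, labels_match)
```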

In [None]:
# Make predictions
y_pred = model.predict(X_test_scaled)
y_pred_prob = model.predict_proba(X_test_scaled)[:, 1]

Let's look at the relationship between probabilities, predictions, and the true labels.

In [None]:
# Create a DataFrame to show samples with predictions and probabilities

# Get a sample of predictions
results_df = pd.DataFrame({
    'actual': y_test,
    'predicted': y_pred,
    'probability': y_pred_prob
})

# Sort by probability to show examples around the threshold
results_df = results_df.sort_values('probability')

# Select informative examples (some below and some above the threshold)
threshold_examples = pd.concat([
    # Examples just below threshold (predict 0)
    results_df[(results_df['probability'] > 0.3) & (results_df['probability'] < 0.5)].head(3),
    # Examples just above threshold (predict 1)
    results_df[(results_df['probability'] >= 0.5) & (results_df['probability'] < 0.7)].head(3)
])

# Display with formatting
pd.set_option('display.precision', 4)
print("Prediction Examples (threshold = 0.5):")
print(threshold_examples)

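You should be thinking, "what if I don't want to use a 50% threshold?" Good question, and we will come back to it. TL;DR: `predict` has no threshold argument, so a different cutoff has to be applied to the `predict_proba` output yourself (scikit-learn 1.5 and later also provide `FixedThresholdClassifier` for this). It is an important consideration in model development.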
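
One way to try a different cutoff is to compare the positive-class probabilities against a threshold of your choosing. The sketch below is illustrative only: the data is synthetic, the names (`p_pos`, `counts`) are hypothetical, and the 0.2 cutoff is an arbitrary choice, not a recommendation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Hypothetical imbalanced synthetic data (roughly 10% positives)
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(1000, 1))
y_toy = (X_toy[:, 0] + rng.normal(scale=0.7, size=1000) > 1.5).astype(int)

clf = LogisticRegression().fit(X_toy, y_toy)
p_pos = clf.predict_proba(X_toy)[:, 1]  # probability of the positive class

counts = {}
for threshold in (0.5, 0.2):  # 0.2 is an arbitrary illustrative cutoff
    y_hat = (p_pos >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_toy, y_hat, labels=[0, 1]).ravel()
    counts[threshold] = (fp, fn)
    print(f"threshold={threshold}: FP={fp}, FN={fn}")
```

Lowering the threshold can only move predictions from negative to positive, so false negatives cannot increase and false positives cannot decrease. Which trade-off is acceptable depends on the cost of each error type.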

Let's summarize the predictions:

In [None]:
# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

You should be thinking, "what are all those numbers?" Good question; we'll come back to that. They are included now for completeness.

### Interpreting the Predictions

These predictions are summarized above as a **Confusion Matrix**. In the case of binary classification it looks like this:

|            |                  |        **Predicted**         |        **Predicted**        |
| ---------- | ---------------- | :--------------------------: | :-------------------------: |
|            |                  |       **Negative (0)**       |      **Positive (1)**       |
| **Actual** | **Negative (0)** | True Negative (TN)<br>(2896) | False Positive (FP)<br>(10) |
| **Actual** | **Positive (1)** | False Negative (FN)<br>(70)  | True Positive (TP)<br>(24)  |

This table follows the scikit-learn format where:
- Rows represent actual values (0=No Default, 1=Default), the true labels from the test set
- Columns represent predicted values (0=No Default, 1=Default)

The four cells correspond to:

- **Top-Left (2896)**: True Negatives (TN) - Correctly predicted as "No Default"
- **Top-Right (10)**: False Positives (FP) - Incorrectly predicted as "Default"
- **Bottom-Left (70)**: False Negatives (FN) - Incorrectly predicted as "No Default"
- **Bottom-Right (24)**: True Positives (TP) - Correctly predicted as "Default"

This provides a complete picture of how the model's predictions align with the actual values, showing where the model succeeds and where it makes errors.
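
The cell layout can be confirmed on a tiny example. The labels below are made up for illustration (`y_true_toy` and `y_pred_toy` are hypothetical), chosen so that each cell count is easy to verify by eye.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 6 actual negatives followed by 4 actual positives
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 0, 0, 1, 0, 0, 0, 1])

cm_toy = confusion_matrix(y_true_toy, y_pred_toy)
print(cm_toy)

# ravel() flattens row by row: [[TN, FP], [FN, TP]] -> TN, FP, FN, TP
tn, fp, fn, tp = cm_toy.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```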

How would you interpret these results?

#### Answers - Hide Me!

The data has many more observations with a negative label (default = no) than positive. The first row of the CM shows 2896 + 10 = 2906 actual negatives, and the second row shows only 70 + 24 = 94 actual positives. That's a total of 3,000 observations, which matches the 30% test split we used.

If we think only in terms of the number of correct predictions, which are on the diagonal, this does quite well: (2896 + 24) / 3000 = 0.9733 → 97.3% correct.

It is also very good at identifying non-defaults (first row): 2896 / (2896 + 10) = 0.9966 → 99.7% correct.

It struggles to identify actual defaults (second row): 24 / (70 + 24) = 0.2553 → 25.5% correct.

There are two types of errors, each with different frequencies and implications:

- False Positives (10) - predicting default when there is none
- False Negatives (70) - missing actual defaults

In these results, False Negatives are 7x more common than False Positives. In a lending business, failing to identify potential defaults (FNs) is typically more costly than incorrectly flagging good borrowers (FPs). The high number of FNs suggests the model might not adequately address the business need despite seeming very accurate. It is a poor discriminator for this scenario.
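
The arithmetic in this answer can be reproduced from the four cell counts. This is a sketch using the numbers from the confusion matrix above. The metric names (specificity, recall, precision) are the conventional ones, and precision is added here as one more ratio from the same counts.

```python
# Counts copied from the confusion matrix above
tn, fp, fn, tp = 2896, 10, 70, 24

total = tn + fp + fn + tp
accuracy = (tn + tp) / total   # share of all predictions that are correct
specificity = tn / (tn + fp)   # share of actual non-defaults identified
recall = tp / (tp + fn)        # share of actual defaults identified
precision = tp / (tp + fp)     # share of predicted defaults that are real

print(f"total       = {total}")
print(f"accuracy    = {accuracy:.4f}")
print(f"specificity = {specificity:.4f}")
print(f"recall      = {recall:.4f}")
print(f"precision   = {precision:.4f}")
print(f"FN / FP     = {fn / fp:.1f}")
```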

Why might this model be biased towards predicting no default?

### Visualize the Results

Some fancy matplotlib work courtesy of Claude... Note that the actual and predicted dots should both appear on the 1.0 and 0.0 lines. They are shown stacked above and below those for clarity.

In [None]:
# Visualize the results with better annotations
import matplotlib.pyplot as plt
import matplotlib.patches as patches

plt.figure(figsize=(12, 7))

# Create small vertical offsets for clarity
offset_actual = 0.02  # Offset for actual values
offset_pred = -0.02   # Offset for predicted values

# Plot actual values (blue dots)
plt.scatter(X_test, y_test + offset_actual, color='blue', alpha=0.5, label='Actual')

# Plot predicted values (red dots)
plt.scatter(X_test, y_pred + offset_pred, color='red', alpha=0.5, label='Predicted')

# Plot the decision boundary
balance_range = np.linspace(X[:, 0].min(), X[:, 0].max(), 100).reshape(-1, 1)
balance_range_scaled = scaler.transform(balance_range)
y_prob = model.predict_proba(balance_range_scaled)[:, 1]
plt.plot(balance_range, y_prob, 'g-', label='Probability of Default')
plt.axhline(y=0.5, color='k', linestyle='--', label='Decision Boundary')

# Create annotation boxes for False Negatives and False Positives
# False Negatives (top region)
fn_rect = patches.Rectangle((1000, 0.99), 930, 0.06, linewidth=2, edgecolor='r', facecolor='none')
plt.gca().add_patch(fn_rect)
plt.text(1500, 1.08, "False Negatives", color='red', fontsize=16, ha='center')

# False Positives (bottom right region)
fp_rect = patches.Rectangle((1955, -0.01), 600, 0.06, linewidth=2, edgecolor='r', facecolor='none')
plt.gca().add_patch(fp_rect)
plt.text(2250, -0.07, "False Positives", color='red', fontsize=16, ha='center')

# Adjust y-axis to accommodate offsets and annotations
plt.ylim(-0.15, 1.15)

plt.xlabel('Credit Card Balance')
plt.ylabel('Default Probability')
plt.title('Logistic Regression: Predicting Default based on Balance')
plt.legend()
plt.tight_layout()
plt.show()

We can observe a few things here:

- The decision boundary (dashed line) at 0.5 probability corresponds to a balance of approximately \$2000, where the model transitions from predicting "No Default" to "Default".
- Most accounts with balances below \$2k have low default probabilities and are classified as non-defaults (red dots at bottom)
- Most accounts with balances above \$2k have high default probabilities and are classified as defaults (red dots at top)
- There are several misclassifications evident:
  - False negatives (actual = default, prediction = non-default): blue dots at y=1 without matching red dot
  - False positives (actual = non-default, prediction = default): blue dots at y=0 without matching red dot

Most FNs occur with balances between ≈\$1k and \$2k. FPs are less common and occur at higher balances.

### Interpret Model Coefficients

Extract model coefficients from the fitted model and interpret.

In [None]:
# Print the model coefficients
print(f"Intercept: {model.intercept_[0]:.4f}")
print(f"Coefficient for balance: {model.coef_[0][0]:.4f}")

These don't match the results in ISL Chapter 4:

![](images/isl-tbl-4.1.png)

Why? We've standardized the features, so each is measured in units of its own standard deviation. Where ISL's result is interpreted in terms of \$1 changes to the balance, our results must be interpreted as follows:

> For every one *standard deviation* increase in `balance`, the *log-odds* of `default` increase by 2.6678.
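
Log-odds are hard to reason about directly, so it can help to exponentiate the coefficient, which turns an additive change in log-odds into a multiplicative change in odds. The sketch below uses the value quoted above. The variable names are hypothetical.

```python
import math

# Standardized coefficient quoted in the interpretation above
beta_1 = 2.6678

# exp(coefficient) converts a change in log-odds into a multiplier on the odds
odds_ratio = math.exp(beta_1)
print(f"Odds multiplier per 1 SD increase in balance: {odds_ratio:.1f}")
```

In other words, a one standard deviation increase in `balance` multiplies the odds of default (not the probability) by a factor of roughly 14.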

If we want to interpret the coefficients in the original scale, we need to reverse the effects of the `StandardScaler`. Here's the mathematical approach:

#### The Standardization Process

A feature $X$ is standardized using:

$$X_{scaled} = \frac{X - \mu_X}{\sigma_X}$$

Where $\mu_X$ is the mean and $\sigma_X$ is the standard deviation of $X$.

#### Logistic Regression with Standardized Data

For a model fit on standardized data:

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_{scaled}$$

Substituting the standardization equation:

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \frac{X - \mu_X}{\sigma_X}$$

Rearranging:

$$\log\left(\frac{p}{1-p}\right) = \left[\beta_0 - \beta_1\frac{\mu_X}{\sigma_X}\right] + \frac{\beta_1}{\sigma_X}X$$

#### Coefficients in Original Scale

From this rearrangement, we can see:

1. The coefficient in original scale: $\beta_{1,orig} = \frac{\beta_1}{\sigma_X}$

2. The intercept in original scale: $\beta_{0,orig} = \beta_0 - \beta_1\frac{\mu_X}{\sigma_X}$

#### Interpretation

After transformation, we can properly interpret:
- For each 1-unit increase in the original feature, the log-odds increase by $\beta_{1,orig}$
- When the original feature equals zero, the log-odds equal $\beta_{0,orig}$

This manual transformation ensures we maintain the statistical benefits of standardization during model training while enabling interpretable coefficients in the original units.
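
The rearrangement can be verified numerically. The sketch below assumes synthetic data on a dollar-like scale (`x_raw`, `y_toy`, and the other names are hypothetical). It fits on standardized values, back-transforms the coefficients with the formulas above, and checks that both forms give the same log-odds.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic feature on a "dollar-like" scale
rng = np.random.default_rng(0)
x_raw = rng.normal(loc=800.0, scale=500.0, size=(500, 1))
y_toy = (x_raw[:, 0] + rng.normal(scale=300.0, size=500) > 1200).astype(int)

sc = StandardScaler()
x_scaled = sc.fit_transform(x_raw)
clf = LogisticRegression().fit(x_scaled, y_toy)

# Back-transform using the formulas derived above
beta_1_orig = clf.coef_[0][0] / sc.scale_[0]
beta_0_orig = clf.intercept_[0] - clf.coef_[0][0] * sc.mean_[0] / sc.scale_[0]

# Log-odds from the fitted model (scaled input) vs. the back-transformed equation (raw input)
log_odds_model = clf.decision_function(x_scaled)
log_odds_manual = beta_0_orig + beta_1_orig * x_raw[:, 0]
same_log_odds = np.allclose(log_odds_model, log_odds_manual)
print(same_log_odds)
```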

In [None]:
# Get the scaling parameters from the scaler
# first index, only one feature
std_dev_balance = scaler.scale_[0]
mean_balance = scaler.mean_[0]

# Convert the coefficient back to the original scale
coef_original = model.coef_[0][0] / std_dev_balance

# Convert the intercept back to the original scale
intercept_original = model.intercept_[0] - (model.coef_[0][0] * mean_balance / std_dev_balance)

print(f"Standard deviation of balance: {std_dev_balance:.4f}")
print(f"Mean of balance: {mean_balance:.4f}")
print(f"Coefficient in original scale: {coef_original:.4f}")
print(f"Intercept in original scale: {intercept_original:.4f}")
print(f"For every $1 increase in balance, log-odds of default increases by {coef_original:.4f}")
print(f"The log-odds of default when balance = 0 is {intercept_original:.4f}")

# The equation in the original scale would be:
print("\nLogistic regression equation in original scale:")
print(f"log-odds(default) = {intercept_original:.4f} + {coef_original:.4f} × balance")

These results are close to those obtained by ISL (-10.6513 + 0.0055 × balance). The difference in the intercept could come down to some combination of the splitting method, randomization, precision limits, the optimization method used, and the L2 regularization that `LogisticRegression` applies by default.
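
An equation in the original scale is easy to use by hand. As a sketch, the code below plugs the ISL coefficients quoted above into the logistic function and solves for the balance at which the predicted probability crosses 0.5 (where the log-odds equal zero). The helper `default_probability` is a hypothetical name introduced for illustration.

```python
import math

# Coefficients quoted from ISL in the text above
beta_0, beta_1 = -10.6513, 0.0055

def default_probability(balance):
    """Logistic function applied to the linear log-odds."""
    log_odds = beta_0 + beta_1 * balance
    return 1.0 / (1.0 + math.exp(-log_odds))

# p = 0.5 corresponds to log-odds = 0, i.e. balance = -beta_0 / beta_1
boundary = -beta_0 / beta_1
print(f"Balance at p = 0.5: {boundary:.0f}")

for balance in (1000, 2000):
    print(f"P(default | balance={balance}) = {default_probability(balance):.4f}")
```

The crossover lands a little under \$2,000, which is consistent with the boundary seen in the plot earlier.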

## Multiple Logistic Regression

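This section adds `income` and `student` as predictors. It also introduces SKL pipelines, which bundle preprocessing and model fitting into a single object.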

In [None]:
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Identify categorical and numerical columns
categorical_cols = ['student']
numerical_cols = ['balance', 'income']

# Create preprocessor for mixed data types
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(drop='first'), categorical_cols)
    ])

# Create a pipeline with preprocessing and logistic regression
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42))
])

# Prepare X and y
X = default[numerical_cols + categorical_cols]
y = (default['default'] == 'Yes').astype(int)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Fit the model pipeline
model_pipeline.fit(X_train, y_train)

# Make predictions
y_pred = model_pipeline.predict(X_test)
y_pred_prob = model_pipeline.predict_proba(X_test)[:, 1]

# Evaluate the model
print("Confusion Matrix:")
cm = confusion_matrix(y_test, y_pred)
print(cm)

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Extract and interpret coefficients
# Get the column names after one-hot encoding
encoded_feature_names = numerical_cols + ['student_Yes']

# Get the coefficients from the model
coefficients = model_pipeline.named_steps['classifier'].coef_[0]

# Create a DataFrame to display the coefficients
coef_df = pd.DataFrame({
    'Feature': encoded_feature_names,
    'Coefficient': coefficients
})

print("\nModel Coefficients:")
print(coef_df)

# Get the intercept (on the scale of the preprocessed features, not the original units)
intercept = model_pipeline.named_steps['classifier'].intercept_[0]
print(f"\nIntercept: {intercept:.4f}")


Note: only one prediction changes from FN to TP.