# Regression, continued.

## Setup

Load the packages and configure the environment.

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

## Binary Qualitative Predictors

This section uses the Credit data set from *An Introduction to Statistical Learning* (ISL).

In [None]:
# download the data set directly from the web using pandas
url = "https://raw.githubusercontent.com/olearydj/INSY7120/refs/heads/main/notebooks/data/Credit.csv"
credit = pd.read_csv(url)

In [None]:
# Basic structure
print(credit.head())
print("Dataset shape:", credit.shape)
print("\nData types:\n", credit.dtypes)

# Check for missing values
print("\nMissing values:\n", credit.isnull().sum())

# Basic statistics
print("\nSummary statistics:\n")
credit.describe()

In [None]:
# Convert all column names to lowercase
credit.columns = credit.columns.str.lower()

In [None]:
credit

In [None]:
import seaborn as sns

# Set figure size for better visualization
plt.figure(figsize=(10, 6))

# 1. Bar plots for individual categorical variables
plt.subplot(2, 2, 1)
sns.countplot(x='own', data=credit)
plt.title('Housing Ownership Status')
plt.xlabel('Owns Home')
plt.ylabel('Count')

plt.subplot(2, 2, 2)
sns.countplot(x='student', data=credit)
plt.title('Student Status')
plt.xlabel('Is Student')
plt.ylabel('Count')

plt.subplot(2, 2, 3)
sns.countplot(x='married', data=credit)
plt.title('Marital Status')
plt.xlabel('Is Married')
plt.ylabel('Count')

plt.subplot(2, 2, 4)
sns.countplot(x='region', data=credit)
plt.title('Region Distribution')
plt.xlabel('Region')
plt.ylabel('Count')
plt.xticks(rotation=45)

plt.tight_layout()
plt.savefig('categorical_variables.png')
plt.show()


### Dummy Variables

In [None]:
# Create dummy variables for every level of each categorical column (no levels dropped yet)
pd.get_dummies(credit)

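**Multicollinearity** occurs when one predictor can be predicted almost perfectly from the others. The extreme case, perfect multicollinearity, occurs when a predictor is an exact linear function of the others. In this result, `own_No` is simply the negation of `own_Yes`, so the pair is perfectly collinear.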
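The redundancy can be checked numerically. The sketch below uses a small made-up DataFrame (the `toy` frame and its values are illustrative assumptions, not the Credit data): the two `own` dummies always sum to one, so a design matrix that holds both of them plus an intercept column has fewer independent columns than it appears to.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Credit data set
toy = pd.DataFrame({"own": ["Yes", "No", "No", "Yes", "No"]})

# Full dummy encoding: one column per level
full = pd.get_dummies(toy, dtype=int)
print(full)

# The two dummies always sum to 1, which duplicates an intercept column
row_sums = full.sum(axis=1)
print(row_sums.tolist())

# Design matrix with an explicit intercept column of ones
design = np.column_stack([np.ones(len(full)), full.to_numpy()])
rank = np.linalg.matrix_rank(design)
print(f"columns: {design.shape[1]}, rank: {rank}")
```

A rank lower than the number of columns means at least one column carries no new information, which is the situation that dropping one dummy is meant to avoid.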

When creating dummies for categorical variables with $n$ categories, you only need $n-1$ dummies to capture all the information without redundancy. We can use the `drop_first` option to address this.

We also need to convert the True / False values into 1 / 0. We can accomplish this by specifying the `int` datatype.

In [None]:
credit_enc = pd.get_dummies(credit, drop_first=True, dtype=int)
credit_enc

How has `region`, a factor with three levels, been represented by two columns? First, each level of `region` is given its own column, where `1` is used to indicate membership.

```text
# Step 1: Initial dummy creation (internally)
region_East   region_South   region_West
    1             0             0        # East region
    0             1             0        # South region
    0             0             1        # West region
```

From this we can see multicollinearity: the three columns always sum to one, so any one of them can be perfectly predicted from the other two. For example, if `region_East` and `region_South` are zero, `region_West` must be one. When `drop_first` is used, the `region_East` column is dropped and the East level is represented by zeroes in both `region_South` and `region_West`. No information is lost, and the remaining predictors are no longer perfectly collinear.

```text
# Step 2: After drop_first=True
region_South   region_West
    0             0            # East region (both 0, so the row must be East)
    1             0            # South region
    0             1            # West region
```

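When using dummies in this manner, the implicit level (e.g. East) is the baseline: the intercept is the predicted response for that level when all other predictors are zero, and each remaining coefficient measures the difference between its level and the baseline.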
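One way to see what "baseline" means is to compare a fitted model with plain group means. The sketch below uses a made-up three-level `region` column and made-up `balance` values (both are assumptions for illustration, not the Credit data). With only the dummies as predictors, the intercept should equal the mean of the omitted level, and each coefficient should equal the gap between that level's mean and the baseline mean.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data, not the Credit data set
toy = pd.DataFrame({
    "region": ["East", "East", "South", "South", "West", "West"],
    "balance": [500.0, 540.0, 480.0, 520.0, 515.0, 545.0],
})

# drop_first=True omits the first level (East), making it the baseline
toy_enc = pd.get_dummies(toy, drop_first=True, dtype=int)

X_toy = toy_enc[["region_South", "region_West"]]
y_toy = toy_enc["balance"]  # 1D target, so intercept_ is a scalar
model = LinearRegression().fit(X_toy, y_toy)

# Mean balance for each region, for comparison
means = toy.groupby("region")["balance"].mean()

print(f"intercept:  {model.intercept_:.2f}   East mean:    {means['East']:.2f}")
print(f"South coef: {model.coef_[0]:.2f}   South - East: {means['South'] - means['East']:.2f}")
print(f"West coef:  {model.coef_[1]:.2f}   West - East:  {means['West'] - means['East']:.2f}")
```

The exact agreement with group means holds only because the dummies are the sole predictors here; with additional predictors in the model, the coefficients are differences after adjusting for those predictors.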

### Simple Linear Regression

Ownership as a single predictor for credit balance.

In [None]:
X = credit_enc[['own_Yes']]
y = credit_enc[['balance']]

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
slr_own = LinearRegression()
slr_own.fit(X, y)

In [None]:
# look at the estimated model parameters
print(f"Model Coefficients: {slr_own.coef_}")
print(f"Model Intercept: {slr_own.intercept_}")

These results match those from the text. The square brackets are an artifact of the array shapes scikit-learn returns. Because `y` was passed as a 2D, single-column DataFrame, scikit-learn treats the fit as a regression with one target column: `coef_` has shape `(n_targets, n_features)` and prints as `[[19.73]]`, while `intercept_` has shape `(n_targets,)` and prints as `[509.80]`. Had `y` been a 1D Series, `coef_` would be a 1D array and `intercept_` a plain scalar. Either way, both represent scalar values in this single-predictor regression.
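The link between the shape of `y` and the shapes of the fitted attributes can be checked on synthetic data. In this sketch the arrays `X_demo` and `y_demo` are made up for illustration; the same single-predictor model is fitted once with a 2D target and once with a 1D target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic single-predictor data (illustrative, not the Credit data)
X_demo = np.array([[0], [1], [0], [1], [1]])
y_demo = np.array([500.0, 530.0, 510.0, 520.0, 540.0])

# 2D target, shape (n_samples, 1), like credit_enc[['balance']]
fit_2d = LinearRegression().fit(X_demo, y_demo.reshape(-1, 1))
print(fit_2d.coef_.shape, fit_2d.intercept_.shape)

# 1D target, shape (n_samples,), like credit_enc['balance']
fit_1d = LinearRegression().fit(X_demo, y_demo)
print(fit_1d.coef_.shape, np.ndim(fit_1d.intercept_))
```

Both fits estimate the same numbers; only the array nesting differs, which is why the indexing `coef_[0][0]` and `intercept_[0]` is needed when the target is 2D.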

In [None]:
# Make predictions
y_pred = slr_own.predict(X)

# Evaluate the model
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y, y_pred)

print(f"Simple Linear Regression Model:")
print(f"balance = {slr_own.intercept_[0]:.2f} + {slr_own.coef_[0][0]:.2f} * own_Yes")
print(f"Mean Squared Error: {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R² Score: {r2:.4f}")

Interpret these results.

### Multiple Linear Regression

Encoding `region` as dummy variables produces two predictor columns, so predicting from it is a multiple linear regression (MLR) task.

In [None]:
X = credit_enc[['region_South', 'region_West']]
y = credit_enc[['balance']]

In [None]:
mlr_region = LinearRegression()
mlr_region.fit(X, y)

In [None]:
# look at the estimated model parameters
print(f"Model Coefficients: {mlr_region.coef_}")
print(f"Model Intercept: {mlr_region.intercept_}")

$balance = 531 - 18.69 \times region\_South - 12.50 \times region\_West$

In [None]:
# Make predictions
y_pred = mlr_region.predict(X)

# Evaluate the model
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y, y_pred)

print(f"Multiple Linear Regression Model:")
print(f"Mean Squared Error: {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R² Score: {r2:.4f}")