# Regression

## Setup

Load the packages and configure the environment.

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

## Multiple Linear Regression

This section uses the Advertising data set from *An Introduction to Statistical Learning* (ISL).

In [None]:
# download the data set directly from the web using pandas
url = "https://raw.githubusercontent.com/olearydj/INSY7120/refs/heads/main/notebooks/data/Advertising.csv"
sales = pd.read_csv(url)

In [None]:
# Basic structure
print(sales.head())
print("Dataset shape:", sales.shape)
print("\nData types:\n", sales.dtypes)

# Check for missing values
print("\nMissing values:\n", sales.isnull().sum())

# Basic statistics
print("\nSummary statistics:\n")
sales.describe()

What is the first column?

In [None]:
# recall: the columns attribute of a DataFrame holds the column names (an Index that supports positional lookup)
col_name = sales.columns[0]

# drop the column by name
# drop() targets rows by default (axis=0), so specify axis=1 to drop a column
sales.drop(col_name, axis=1, inplace=True)

# inplace=True modifies sales in place; alternatively reassign the modified object
# sales = sales.drop(col_name, axis=1)

sales.tail()

Caution: the previous cell is destructive. It identifies the column to delete by position alone, so each additional run deletes whatever column is then first, and the predictors start to disappear!
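One way to make the cell safe to re-run is to drop the column by an explicit name instead of by position. The sketch below assumes the first column is an unnamed row index, which pandas labels `Unnamed: 0`. The toy CSV and the `INDEX_COL` name are illustrative only, so check `sales.columns` for the real label before relying on it.

```python
import io

import pandas as pd

# Toy stand-in for the real file: the empty first header field mimics an
# exported row index (an assumption about the CSV layout).
csv_text = ",TV,radio,newspaper,sales\n1,230.1,37.8,69.2,22.1\n2,44.5,39.3,45.1,10.4\n"
demo = pd.read_csv(io.StringIO(csv_text))

INDEX_COL = "Unnamed: 0"  # label pandas assigns to a column with an empty header

# Dropping by name with errors="ignore" is idempotent: the second call is a no-op.
demo = demo.drop(columns=[INDEX_COL], errors="ignore")
demo = demo.drop(columns=[INDEX_COL], errors="ignore")

print(list(demo.columns))
```

Passing `index_col=0` to `pd.read_csv` is another option that avoids the drop entirely.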

That leaves 200 rows, each with advertising spend in three channels (the predictors) and the associated sales (the response).

### Modelling

Use the same process as before, condensed into a single section.

**Step 1** - split the data into features (`X`) and target (`y`)

In [None]:
X = sales[['TV', 'radio', 'newspaper']]
y = sales['sales']

**Steps 2 and 3** - initialize and fit the model, review results

In [None]:
from sklearn.linear_model import LinearRegression
mlr = LinearRegression()
mlr.fit(X, y)

In [None]:
# look at the estimated model parameters
print(f"Model Coefficients: {mlr.coef_}")
print(f"Model Intercept: {mlr.intercept_}")

The MLR takes the form:

$$y =  \beta_0 + \beta_1 \times TV + \beta_2 \times radio + \beta_3 \times newspaper + \epsilon$$

Which we estimate (because $\epsilon$ is unaccounted for) as:

$$\hat{y} =  2.939 + 0.046 \times TV + 0.189 \times radio - 0.001 \times newspaper$$

where the coefficients and thus the response are estimates:

$$\hat{\beta}_0 = 2.939,\quad \hat{\beta}_1 = 0.046,\quad \text{etc.}$$

What does this tell us about the relative importance of the features? Can we compare them directly?
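A raw coefficient is measured in units of response per unit of its predictor, so its size depends on the predictor's scale. One common way to put coefficients on a comparable footing is to standardize each predictor before fitting. The sketch below uses synthetic data and a small helper, `ols`, both of which are illustrative and not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic predictors on very different scales (not the Advertising data).
big = rng.normal(150, 80, n)    # wide spread
small = rng.normal(20, 5, n)    # narrow spread
X_syn = np.column_stack([big, small])
y_syn = 3 + 0.05 * big + 0.2 * small + rng.normal(0, 1, n)

def ols(X, y):
    """Return (intercept, coefficients) from ordinary least squares."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[0], beta[1:]

_, raw_coef = ols(X_syn, y_syn)

# Rescale each feature to mean 0 and standard deviation 1, then refit.
X_std = (X_syn - X_syn.mean(axis=0)) / X_syn.std(axis=0)
_, std_coef = ols(X_std, y_syn)

print("raw coefficients:         ", raw_coef.round(3))
print("standardized coefficients:", std_coef.round(3))
```

In this toy setup the predictor with the smaller raw coefficient has the larger standardized one, because a one-standard-deviation change in it moves the response further. Each standardized coefficient equals the raw coefficient multiplied by that predictor's standard deviation.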

**Step 4** - evaluate model performance

We'll use $R^2$ and $RMSE$, both of which are computed from the residual sum of squares ($RSS$), the total squared prediction error.
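To make the link to $RSS$ concrete, here is a minimal sketch that computes both metrics by hand with NumPy. The function name `r2_and_rmse` and the four-point example are illustrative; the scikit-learn functions used in the next cell should give the same numbers.

```python
import numpy as np

def r2_and_rmse(y_true, y_hat):
    """Compute R^2 and RMSE directly from the residual sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    rss = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
    tss = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1 - rss / tss                           # share of variance explained
    rmse = np.sqrt(rss / len(y_true))            # typical error, in units of y
    return r2, rmse

r2_demo, rmse_demo = r2_and_rmse([1, 2, 3, 4], [1.5, 2, 3, 3.5])
print(f"R² = {r2_demo:.3f}, RMSE = {rmse_demo:.3f}")  # R² = 0.900, RMSE = 0.354
```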

In [None]:
# import r2 and mse functions
from sklearn.metrics import r2_score, mean_squared_error

# generate predictions
y_pred = mlr.predict(X)

# use predictions to score
r2 = r2_score(y, y_pred)
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)

print(f"R² Score: {r2:.4f}")
print(f"RMSE: {rmse:.2f}")

In this case, what are we assessing the performance of? What are we generating predictions for? What do these results tell us?

**Step 5** - generate predictions for new data

In [None]:
new_data = np.array([[0, 0, 180],    # all newspaper
                     [0, 180, 0],    # all radio
                     [180, 0, 0],    # all TV
                     [60, 60, 60]])  # equal mix

new_data = pd.DataFrame(new_data, columns=['TV', 'radio', 'newspaper'])
predicted_sales = mlr.predict(new_data)

print(predicted_sales)

All newspaper gives the worst results. All radio is the best of these options, with all TV performing worse than a mixture. As you might expect, an equal mix is somewhere in the middle.
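These predictions can be reproduced by hand, since a linear model's prediction is the intercept plus a weighted sum of the inputs. The sketch below uses the rounded estimates from the fitted equation above, so it only approximates what `mlr.predict` returns; the names `intercept_hat`, `coef_hat`, and `scenarios` are illustrative.

```python
import numpy as np

# Rounded estimates copied from the fitted equation, so results are approximate.
intercept_hat = 2.939
coef_hat = np.array([0.046, 0.189, -0.001])  # TV, radio, newspaper

scenarios = np.array([[0, 0, 180],    # all newspaper
                      [0, 180, 0],    # all radio
                      [180, 0, 0],    # all TV
                      [60, 60, 60]])  # equal mix

# Intercept plus the dot product of each row with the coefficient vector.
manual_pred = intercept_hat + scenarios @ coef_hat
print(manual_pred.round(2))
```

With these rounded coefficients the four scenarios come out to roughly 2.76, 36.96, 11.22, and 16.98, which matches the ordering described above.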

**Step 6** - interpret results (added)

These results suggest that our model explains about 90% of the variance in sales. Per unit of spend, radio has the largest estimated effect on sales, followed by TV. Newspaper has a slightly negative coefficient.

But there are several critical limitations:

- the model was fitted and evaluated on the same data, so these are training performance metrics
- we didn't confirm that each predictor is linearly related to sales, or investigate interactions between predictors
- no inferential statistics
- no domain expertise
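The first limitation is usually addressed by holding out part of the data for evaluation. The sketch below shows the idea with scikit-learn's `train_test_split` on synthetic data; the 25% test fraction, the random seeds, and all variable names are assumptions made for illustration, not choices from the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in with 200 rows and 3 predictors (not the Advertising data).
X_demo = rng.uniform(0, 100, size=(200, 3))
y_demo = 3 + X_demo @ np.array([0.05, 0.2, 0.0]) + rng.normal(0, 1.5, 200)

# Hold out 25% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

holdout_model = LinearRegression().fit(X_train, y_train)

# Score separately on the rows used for fitting and on the held-out rows.
train_r2 = r2_score(y_train, holdout_model.predict(X_train))
test_r2 = r2_score(y_test, holdout_model.predict(X_test))
print(f"train R²: {train_r2:.3f}, test R²: {test_r2:.3f}")
```

The test score is the better guide to how the model would perform on data it has not seen.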

Look at relationships between predictors and outcome:

In [None]:
import seaborn as sns

# Create a figure with 3 subplots in one row
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 5))

# Radio plot
sns.regplot(data=sales, x='radio', y='sales', ax=ax1)
ax1.set_title('Radio vs Sales')

# TV plot 
sns.regplot(data=sales, x='TV', y='sales', ax=ax2)
ax2.set_title('TV vs Sales')

# Newspaper plot
sns.regplot(data=sales, x='newspaper', y='sales', ax=ax3)
ax3.set_title('Newspaper vs Sales')

# Adjust layout to prevent overlap
plt.tight_layout()
plt.show()

Each simple linear regression (SLR) line has a positive slope. How do these results differ from the MLR estimates?

In [None]:
# Calculate correlations between all variables
correlations = sales[['TV', 'radio', 'newspaper', 'sales']].corr()

# Round to 3 decimal places
correlations_rounded = correlations.round(3)

print(correlations_rounded)

Note the correlation between radio and newspaper (0.35). This indicates that markets with high newspaper advertising also tend to have high radio advertising.

In the SLR setting, sales increase with newspaper spend, but the MLR shows a negligible effect. Newspaper advertising acts as a surrogate for radio advertising: the SLR chart above is really showing the effect of the increased radio spend that tends to accompany increased newspaper spend.
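The surrogate effect is easy to reproduce in a simulation. The sketch below generates synthetic budgets in which newspaper spend rises with radio spend but only radio affects sales, then compares the SLR and MLR estimates. All numbers and names here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Synthetic budgets: newspaper tends to rise with radio, but only radio drives sales.
radio = rng.normal(25, 10, n)
newspaper = 0.5 * radio + rng.normal(15, 10, n)
sales_sim = 3 + 0.2 * radio + rng.normal(0, 1, n)

def ols_coefs(columns, y):
    """Least-squares slopes (intercept dropped) for the given predictor columns."""
    design = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1:]

slr_newspaper = ols_coefs([newspaper], sales_sim)[0]
mlr_radio, mlr_newspaper = ols_coefs([radio, newspaper], sales_sim)

print(f"SLR newspaper slope:       {slr_newspaper:.3f}")
print(f"MLR radio coefficient:     {mlr_radio:.3f}")
print(f"MLR newspaper coefficient: {mlr_newspaper:.3f}")
```

On its own, newspaper picks up a clearly positive slope. Once radio is in the model, the newspaper coefficient falls to roughly zero.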

To understand this better we must turn to inferential methods, for which we use `statsmodels`.

**Note:** scikit-learn focuses on prediction, while statsmodels focuses on inference. This split is further evidence of the difference in emphasis between ML and Statistical Learning.

In [None]:
import statsmodels.api as sm

# Add a constant (intercept) to the predictors
X_sm = sm.add_constant(X)

# Fit the model
mlr_sm = sm.OLS(y, X_sm).fit()

The results of this are summarized in three tables. We are primarily interested in the first two: overall regression results and feature-level details.

In [None]:
# overall regression results
mlr_sm.summary().tables[0]  # regression results

Most importantly, this tells us that the overall model is significant, with a p-value near zero (1.58e-96). Specifically, the null hypothesis of the F-test is that all coefficients other than the intercept are zero. We reject this in favor of the alternative, which is that at least one coefficient is non-zero. In other words, the model explains significantly more variance than one with no predictors (just the intercept - a horizontal line), so at least one predictor has a real relationship with sales. The $R^2$ value matches the one obtained with scikit-learn.
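The F-statistic can be written in terms of $R^2$, the number of rows $n$, and the number of predictors $p$: it is the explained variance per predictor divided by the unexplained variance per residual degree of freedom. A minimal sketch, with the function name `f_statistic` chosen here for illustration:

```python
def f_statistic(r2: float, n: int, p: int) -> float:
    """Overall F-statistic for a linear model with p predictors fit to n rows."""
    explained_per_predictor = r2 / p
    unexplained_per_dof = (1 - r2) / (n - p - 1)
    return explained_per_predictor / unexplained_per_dof

# Toy check: R² = 0.5 with 10 rows and 1 predictor gives (0.5 / 1) / (0.5 / 8).
print(f_statistic(0.5, n=10, p=1))  # 8.0
```

Plugging in this model's own $R^2$ with $n = 200$ and $p = 3$ should reproduce the F-statistic in the table, up to rounding.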

The second table gives feature-level details.

In [None]:
# coefficients and significance
mlr_sm.summary().tables[1]

Again, the coefficients match those found by scikit-learn. In addition, this table tells us a lot about the uncertainty associated with each coefficient estimate: its standard error, t-statistic, p-value, and confidence interval.

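Of particular interest, it shows that the newspaper coefficient is not statistically significant: with TV and radio already in the model, we cannot reject the hypothesis that it is zero.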
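Each t-statistic in the table is a coefficient divided by its standard error, and the p-value measures how surprising that ratio would be if the true coefficient were zero. The sketch below computes standard errors and t-statistics by hand on synthetic data, where one predictor matters and one does not. The data and variable names are illustrative; statsmodels reports these quantities directly.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200

# Synthetic data: x_signal affects y, x_noise does not (illustrative only).
x_signal = rng.normal(0, 1, n)
x_noise = rng.normal(0, 1, n)
y_sim = 2 + 1.5 * x_signal + rng.normal(0, 1, n)

design = np.column_stack([np.ones(n), x_signal, x_noise])
beta_hat, *_ = np.linalg.lstsq(design, y_sim, rcond=None)

# Residual variance estimate uses n minus the number of fitted parameters.
residuals = y_sim - design @ beta_hat
dof = n - design.shape[1]
sigma2_hat = residuals @ residuals / dof

# Standard errors are the square roots of the diagonal of sigma^2 (X'X)^-1.
std_err = np.sqrt(np.diag(sigma2_hat * np.linalg.inv(design.T @ design)))
t_stats = beta_hat / std_err

for name, b, se, t in zip(["const", "x_signal", "x_noise"], beta_hat, std_err, t_stats):
    print(f"{name:>8}: coef={b:7.3f}  se={se:.3f}  t={t:7.2f}")
```

As a rule of thumb for samples of this size, a t-statistic with absolute value below about 2 corresponds to a p-value above 0.05.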

Both results (coefficient and p-value) suggest simplifying the model by removing the newspaper predictor.

In [None]:
X2 = sales[['TV', 'radio']]
y2 = sales['sales']

mlr2 = LinearRegression()
mlr2.fit(X2, y2)

print(f"Model Coefficients: {mlr2.coef_}")
print(f"Model Intercept: {mlr2.intercept_}")

# generate predictions
y2_pred = mlr2.predict(X2)

# use predictions to score
r2_small = r2_score(y2, y2_pred)
mse_small = mean_squared_error(y2, y2_pred)
rmse_small = np.sqrt(mse_small)  # use the reduced model's MSE, not the full model's

print(f"R² Score: {r2_small:.4f}")
print(f"RMSE: {rmse_small:.2f}")

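Very little change. Neither $R^2$ nor $RMSE$ improved, which is expected: on the training data, dropping a predictor can never reduce $RSS$. Despite the lack of statistical significance, an ML workflow focused purely on prediction would likely keep all three predictors.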