Q4. Train & Test RSS - Linear/Nonlinear Models
I collect a set of data (n = 100 observations) containing a single predictor and a quantitative response.
I then fit a linear regression model to the data: $Y = \beta_0 + \beta_1 X + \epsilon$
as well as a separate cubic regression: $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon$

(a) Suppose that the true relationship between X and Y is linear,
i.e. $Y = \beta_0 + \beta_1 X + \epsilon$. Consider the training residual sum of
squares (RSS) for the linear regression, and also the training
RSS for the cubic regression. Would we expect one to be lower
than the other, would we expect them to be the same, or is there
not enough information to tell? Justify your answer.

In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

np.random.seed(0)
n = 100

# Generate linear data
X = np.random.uniform(-3, 3, n).reshape(-1, 1)
y_linear = 2 + 3*X.flatten() + np.random.normal(0, 2, n)

# Linear model
lin = LinearRegression()
lin.fit(X, y_linear)
y_train_pred_lin = lin.predict(X)
rss_train_lin = np.sum((y_linear - y_train_pred_lin)**2)

# Cubic model
poly = PolynomialFeatures(degree=3, include_bias=False)
X_cubic = poly.fit_transform(X)
cub = LinearRegression()
cub.fit(X_cubic, y_linear)
y_train_pred_cub = cub.predict(X_cubic)
rss_train_cub = np.sum((y_linear - y_train_pred_cub)**2)

print("=== (a) Linear True — TRAINING RSS ===")
print(f"Linear RSS: {rss_train_lin:.2f}")
print(f"Cubic RSS : {rss_train_cub:.2f}")
print("Justification:")
print("I would expect the two training RSS values to be similar, with the cubic regression slightly lower. "
      "The linear model is a special case of the cubic model (set the X^2 and X^3 coefficients to zero), "
      "so the least-squares cubic fit can never have a higher training RSS than the linear fit. "
      "Because the true relationship is linear, the extra terms can only fit noise in this sample, "
      "so the reduction should be small.\n")


=== (a) Linear True — TRAINING RSS ===
Linear RSS: 396.98
Cubic RSS : 388.46
Justification:
I would expect the two training RSS values to be similar, with the cubic regression slightly lower. The linear model is a special case of the cubic model (set the X^2 and X^3 coefficients to zero), so the least-squares cubic fit can never have a higher training RSS than the linear fit. Because the true relationship is linear, the extra terms can only fit noise in this sample, so the reduction should be small.
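
The nesting argument in part (a) can be checked by simulation. The sketch below is illustrative only: it uses NumPy least squares instead of scikit-learn, and the helper `training_rss`, the seed, and the count of 200 datasets are assumptions introduced here, not part of the original notebook.

```python
import numpy as np

def training_rss(x, y, degree):
    """Fit a polynomial of the given degree by least squares; return the training RSS."""
    # Design matrix with columns 1, x, ..., x^degree
    A = np.vander(x, N=degree + 1, increasing=True)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch
gaps = []
for _ in range(200):
    x = rng.uniform(-3, 3, 100)
    y = 2 + 3 * x + rng.normal(0, 2, 100)  # truly linear data, as in part (a)
    # Gap = linear training RSS minus cubic training RSS
    gaps.append(training_rss(x, y, 1) - training_rss(x, y, 3))
gaps = np.array(gaps)

print(f"Smallest gap (linear - cubic): {gaps.min():.4f}")
print(f"Average gap                  : {gaps.mean():.2f}")
```

The gap should never be negative, up to floating-point error. Under these assumptions its average should sit near 2σ² = 8, because the cubic model has two extra coefficients with which to absorb noise.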



(b) Answer (a) using test rather than training RSS.

In [23]:
from sklearn.model_selection import train_test_split

# Split data into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y_linear, test_size=0.3, random_state=1)

# Linear model
lin.fit(X_train, y_train)
y_test_pred_lin = lin.predict(X_test)
rss_test_lin = np.sum((y_test - y_test_pred_lin)**2)

# Cubic model
X_train_c = poly.fit_transform(X_train)
X_test_c = poly.transform(X_test)
cub.fit(X_train_c, y_train)
y_test_pred_cub = cub.predict(X_test_c)
rss_test_cub = np.sum((y_test - y_test_pred_cub)**2)

print("=== (b) Linear True — TEST RSS ===")
print(f"Linear RSS: {rss_test_lin:.2f}")
print(f"Cubic RSS : {rss_test_cub:.2f}")
print("Justification:")
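print("Here I would expect the linear regression to have a slightly lower test RSS than the cubic regression. "
      "The true relationship is linear, so the linear model is unbiased, while the cubic model spends two extra "
      "coefficients fitting noise in the training data, which adds variance without reducing bias. "
      "I say slightly because the fitted cubic will still be close to a straight line "
      "(it’s not going to fit a wild cubic relationship if the data mostly follow a straight line). "
      "On any single train/test split the ordering can go either way; the expectation is about the average over many samples.\n")


=== (b) Linear True — TEST RSS ===
Linear RSS: 104.02
Cubic RSS : 111.65
Justification:
Here I would expect the linear regression to have a slightly lower test RSS than the cubic regression. The true relationship is linear, so the linear model is unbiased, while the cubic model spends two extra coefficients fitting noise in the training data, which adds variance without reducing bias. I say slightly because the fitted cubic will still be close to a straight line (it’s not going to fit a wild cubic relationship if the data mostly follow a straight line). On any single train/test split the ordering can go either way; the expectation is about the average over many samples.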
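
The "slightly" in part (b) can be made concrete with a short derivation. This is a sketch under stated assumptions: a fixed design, homoskedastic noise with variance $\sigma^2$, and prediction error measured at the training inputs with fresh noise. That is only an approximation to the random test split used above. The symbols $\mathbf{Z}$ (design matrix with $p$ columns), $H$ (hat matrix), and $y_i^{\text{new}}$ are introduced here for illustration.

```latex
\begin{aligned}
\hat{y} &= H y, \qquad H = \mathbf{Z} (\mathbf{Z}^\top \mathbf{Z})^{-1} \mathbf{Z}^\top, \qquad \operatorname{tr}(H) = p \\
\frac{1}{n} \sum_{i=1}^{n} \operatorname{Var}(\hat{y}_i) &= \frac{\sigma^2}{n} \operatorname{tr}(H) = \frac{\sigma^2 p}{n} \\
\frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\!\left[ (y_i^{\text{new}} - \hat{y}_i)^2 \right] &= \sigma^2 \left( 1 + \frac{p}{n} \right) \quad \text{(no bias when the truth is linear)} \\
\text{linear } (p = 2):\ \sigma^2 \left( 1 + \tfrac{2}{n} \right), &\qquad \text{cubic } (p = 4):\ \sigma^2 \left( 1 + \tfrac{4}{n} \right)
\end{aligned}
```

With $\sigma = 2$ and the 70 training points left after the 30% split, this gives roughly 4.11 versus 4.23 per observation. The gap is much smaller than the noise in a 30-point test set, which is why a single split can favor either model.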



(c) Suppose that the true relationship between X and Y is not linear,
but we don’t know how far it is from linear. Consider the training
RSS for the linear regression, and also the training RSS for the
cubic regression. Would we expect one to be lower than the
other, would we expect them to be the same, or is there not
enough information to tell? Justify your answer.

In [26]:
# Generate non-linear data
y_nl = 2 + X.flatten()**2 + np.random.normal(0, 2, n)

# Linear model
lin.fit(X, y_nl)
y_train_pred_lin_nl = lin.predict(X)
rss_train_lin_nl = np.sum((y_nl - y_train_pred_lin_nl)**2)

# Cubic model
X_cubic = poly.fit_transform(X)
cub.fit(X_cubic, y_nl)
y_train_pred_cub_nl = cub.predict(X_cubic)
rss_train_cub_nl = np.sum((y_nl - y_train_pred_cub_nl)**2)

print("=== (c) Non-linear True — TRAINING RSS ===")
print(f"Linear RSS: {rss_train_lin_nl:.2f}")
print(f"Cubic RSS : {rss_train_cub_nl:.2f}")
print("Justification:")
print("As in part (a), I expect the cubic regression to have a lower training RSS.\n"
      "If the relationship is strongly non-linear and can be well approximated by a cubic fit, "
      "I would expect the cubic training RSS to be much lower.\n"
      "If, however, the relationship is very close to linear or cannot be well approximated by a cubic fit, "
      "we might expect more similar results.\n"
      "Either way, the cubic training RSS cannot exceed the linear training RSS: the linear model is nested in the cubic model, "
      "so adding flexibility never increases training error under least squares.\n")


=== (c) Non-linear True — TRAINING RSS ===
Linear RSS: 1353.45
Cubic RSS : 364.90
Justification:
As in part (a), I expect the cubic regression to have a lower training RSS.
If the relationship is strongly non-linear and can be well approximated by a cubic fit, I would expect the cubic training RSS to be much lower.
If, however, the relationship is very close to linear or cannot be well approximated by a cubic fit, we might expect more similar results.
Either way, the cubic training RSS cannot exceed the linear training RSS: the linear model is nested in the cubic model, so adding flexibility never increases training error under least squares.
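
To illustrate how the size of the training-RSS gap in part (c) depends on the strength of the non-linearity, the sketch below sweeps a curvature coefficient `c` in `y = 2 + c*x^2 + noise`. The values of `c`, the seed, and the helper `training_rss` are hypothetical choices made for this example; `c = 1.0` matches the form of the data generated above.

```python
import numpy as np

def training_rss(x, y, degree):
    """Fit a polynomial of the given degree by least squares; return the training RSS."""
    A = np.vander(x, N=degree + 1, increasing=True)  # columns 1, x, ..., x^degree
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

rng = np.random.default_rng(1)  # seed chosen arbitrarily for this sketch
x = rng.uniform(-3, 3, 100)
noise = rng.normal(0, 2, 100)   # same noise reused for every curvature value

curvatures = [0.0, 0.05, 0.25, 1.0]  # hypothetical curvature strengths
results = {}
for c in curvatures:
    y = 2 + c * x**2 + noise
    results[c] = (training_rss(x, y, 1), training_rss(x, y, 3))
    lin_rss, cub_rss = results[c]
    print(f"c = {c:<4}  linear RSS = {lin_rss:8.2f}  cubic RSS = {cub_rss:8.2f}  gap = {lin_rss - cub_rss:8.2f}")
```

The cubic RSS should stay at or below the linear RSS for every `c`. The gap should be small when `c` is near zero and large once the curvature dominates the noise.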



(d) Answer (c) using test rather than training RSS.

In [29]:
# Split non-linear data into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y_nl, test_size=0.3, random_state=1)

# Linear model
lin.fit(X_train, y_train)
y_test_pred_lin_nl = lin.predict(X_test)
rss_test_lin_nl = np.sum((y_test - y_test_pred_lin_nl)**2)

# Cubic model
X_train_c = poly.fit_transform(X_train)
X_test_c = poly.transform(X_test)
cub.fit(X_train_c, y_train)
y_test_pred_cub_nl = cub.predict(X_test_c)
rss_test_cub_nl = np.sum((y_test - y_test_pred_cub_nl)**2)

print("=== (d) Non-linear True — TEST RSS ===")
print(f"Linear RSS: {rss_test_lin_nl:.2f}")
print(f"Cubic RSS : {rss_test_cub_nl:.2f}")
print("Justification:")
print("There is not enough information to tell; it depends on how far the true relationship is from linear. "
      "If the relationship is only mildly non-linear, the small bias of the linear model can be outweighed by "
      "the extra variance of the cubic model, so the linear regression may have the lower test RSS.\n"
      "If the relationship is strongly non-linear and closer to cubic, the bias of the linear model dominates "
      "and we would expect the cubic regression to have a lower test RSS.\n")


=== (d) Non-linear True — TEST RSS ===
Linear RSS: 479.23
Cubic RSS : 132.77
Justification:
There is not enough information to tell; it depends on how far the true relationship is from linear. If the relationship is only mildly non-linear, the small bias of the linear model can be outweighed by the extra variance of the cubic model, so the linear regression may have the lower test RSS.
If the relationship is strongly non-linear and closer to cubic, the bias of the linear model dominates and we would expect the cubic regression to have a lower test RSS.

