# Part 1: Linear Regression and Model Complexity

In [None]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

import warnings
warnings.filterwarnings('ignore')

In this lecture, we will work with the `vehicles` dataset, which we load from seaborn's built-in `mpg` dataset.

In [None]:
vehicles = sns.load_dataset("mpg").rename(columns={"horsepower":"hp"}).dropna().sort_values("hp")
vehicles.head()

In [None]:
vehicles.info()

We will attempt to predict a car's "mpg" from transformations of its "hp".



In [None]:
# Copy the column so that adding new features does not write into a slice of `vehicles`
X = vehicles[["hp"]].copy()
X["hp^2"] = vehicles["hp"]**2
X["hp^3"] = vehicles["hp"]**3
X["hp^4"] = vehicles["hp"]**4

Y = vehicles["mpg"]

Test Sets

To perform a train-test split, we can use the `train_test_split` function of the `sklearn.model_selection` module.

In [None]:
from sklearn.model_selection import train_test_split

# `test_size` specifies the proportion of the full dataset that should be allocated to testing.
# `random_state` makes our results reproducible for educational purposes.
# `shuffle` is True by default and randomizes the order of the data before splitting.

X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
                                                    test_size=0.2,
                                                    random_state=100,
                                                    shuffle=True)

print(f"Size of full dataset: {X.shape[0]} points")
print(f"Size of training set: {X_train.shape[0]} points")
print(f"Size of test set: {X_test.shape[0]} points")

We then fit the model using the training set...



In [None]:
import sklearn.linear_model as lm

model = lm.LinearRegression()

**Insert a code block below to train the model with the training set.**

In [None]:
from sklearn.metrics import mean_squared_error

**Insert a code block below, make predictions on both training set and test set, and print the mean squared error.**

**Insert a text block below and answer the question: Why does the model perform more poorly on the test data?**

Validation Sets


To assess model performance on unseen data, and then use that information to fine-tune the model, we introduce a validation set. You can think of this as splitting the training set into a validation set and a "mini" training set.
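As a minimal sketch of how the two successive splits divide the data, the example below uses a hypothetical 100-row toy array (not the `vehicles` data) and the same `test_size=0.2` at both stages. The names `X_toy`, `X_mini_toy`, and so on are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, one feature (not the vehicles dataset)
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.arange(100)

# First split: hold out 20% of all rows as the test set
X_tr_toy, X_te_toy, y_tr_toy, y_te_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0)

# Second split: hold out 20% of the remaining training rows for validation
X_mini_toy, X_val_toy, y_mini_toy, y_val_toy = train_test_split(
    X_tr_toy, y_tr_toy, test_size=0.2, random_state=0)

print(len(X_mini_toy), len(X_val_toy), len(X_te_toy))  # 64 16 20
```

With these settings, the mini training set holds 64% of the original rows, the validation set 16%, and the test set 20%.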



In [None]:
# Split X_train further into X_train_mini and X_val.
X_train_mini, X_val, Y_train_mini, Y_val = train_test_split(X_train, Y_train, test_size=0.2, random_state=100)


**Insert a code cell below to print the size of original training set, mini training set, and validation set.**

In the cell below, we fit several models of increasing complexity, then compute their errors. Here, we find the model's errors on the validation set to understand how model complexity influences performance on unseen data.

In [None]:
fig, ax = plt.subplots(1, 3, dpi=200, figsize=(12, 3))

for order in [2, 3, 4]:
    model = lm.LinearRegression()
    model.fit(X_train_mini.iloc[:, :order], Y_train_mini)
    val_predictions = model.predict(X_val.iloc[:, :order])

    # Copy the slice so that adding the prediction column does not touch X_val
    output = X_val.iloc[:, :order].copy()
    output["y_hat"] = val_predictions
    output = output.sort_values("hp")

    ax[order-2].scatter(X_val["hp"], Y_val, edgecolor="white", lw=0.5)
    ax[order-2].plot(output["hp"], output["y_hat"], "tab:red")
    ax[order-2].set_title(f"Model with degree {order}")
    ax[order-2].set_xlabel("hp")
    ax[order-2].set_ylabel("mpg")
    ax[order-2].annotate(f"Validation MSE: {np.round(mean_squared_error(Y_val, val_predictions), 3)}", (90, 30))

plt.subplots_adjust(wspace=0.3);

Let's repeat this process:

1. Fit a degree-x model to the mini training set.
2. Evaluate the fitted model's MSE when making predictions on the validation set.

We use the model's performance on the validation set as a guide to selecting the best combination of features. We are not limited in the number of times we use the validation set; we just never use this set to fit the model.
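The two steps above can be sketched on synthetic data. This is only an illustration under stated assumptions: the data are a made-up noisy quadratic, the hold-out is a simple slice, and `np.polyfit` stands in for the sklearn model used elsewhere in this notebook.

```python
import numpy as np

# Hypothetical synthetic data: a noisy quadratic (not the vehicles dataset)
rng = np.random.default_rng(0)
x_toy = rng.uniform(-1, 1, size=200)
y_toy = 1 + 2 * x_toy - 3 * x_toy**2 + rng.normal(scale=0.1, size=200)

# The points are already in random order, so the last 40 serve as a validation set
x_fit, x_val_toy = x_toy[:160], x_toy[160:]
y_fit, y_val_toy = y_toy[:160], y_toy[160:]

val_mse = {}
for degree in range(0, 6):
    coeffs = np.polyfit(x_fit, y_fit, deg=degree)   # Step 1: fit on the mini training set
    preds = np.polyval(coeffs, x_val_toy)           # Step 2: predict on the validation set
    val_mse[degree] = np.mean((y_val_toy - preds) ** 2)

# Select the degree with the lowest validation MSE
best_degree = min(val_mse, key=val_mse.get)
print(best_degree, val_mse)
```

The validation points are used only to score each candidate degree; they are never passed to the fitting step.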

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

def fit_model_dataset(degree):
    pipelined_model = Pipeline([
            ('polynomial_transformation', PolynomialFeatures(degree)),
            ('linear_regression', lm.LinearRegression())
        ])

    pipelined_model.fit(X_train_mini[["hp"]], Y_train_mini)
    return mean_squared_error(Y_val, pipelined_model.predict(X_val[["hp"]]))

errors = [fit_model_dataset(degree) for degree in range(0, 18)]
MSEs_and_k = pd.DataFrame({"k": range(0, 18), "MSE": errors})

plt.figure(dpi=120)
plt.plot(range(0, 18), errors)
plt.xlabel("Model Complexity (degree of polynomial)")
plt.ylabel("Validation MSE")
plt.xticks(range(0, 18));

**Insert a text block and answer the question: Looking at the figure above, which polynomial degrees lead to underfitting, and which lead to overfitting?**

In [None]:
MSEs_and_k.rename(columns={"k":"Degree"}).set_index("Degree")


From this model selection process, we might choose to create a model with degree 8.



In [None]:
print(f'Polynomial degree with lowest validation error: {MSEs_and_k.sort_values("MSE").head(1)["k"].values}')


After this choice has been finalized, and we are completely finished with the model design process, we finally assess model performance on the test set. We typically use the entire training set (both the "mini" training set and validation set) to fit the final model.



In [None]:
# Update our training and test sets to include polynomial features of degree 5 through 8
for degree in range(5, 9):
    X_train[f"hp^{degree}"] = X_train["hp"]**degree
    X_test[f"hp^{degree}"] = X_test["hp"]**degree

**Insert code blocks below, train a linear regression model, and show the mean squared error on the test set.**

# Part 2: Cross-Validation


The validation set gave us an opportunity to understand how the model performs on a single set of unseen data. The specific validation set we drew was fixed – we used the same validation points every time.

It's possible that we may have, by random chance, selected a set of validation points that was not representative of other unseen data that the model might encounter (for example, if we happened to have selected all outlying data points for the validation set).

Different train/validation splits lead to different validation errors:

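**Add code to the code block below. The for-loop iterates `i` over `range(1, 4)`, that is, 1, 2, and 3. For each `i`, split the training set (`X_train`, `Y_train`) into a mini training set and a validation set with random state equal to `i`, create a linear regression model, train the model, and predict on the validation set so that the mean squared error can be printed (the provided print statement expects the names `y_hat` and `Y_val`).**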

In [None]:
for i in range(1, 4):
    ### Add your code here ###

    ## Split the training set

    ## Create a linear regression model

    ## Train

    ## Predict
    y_hat = ...

    ### End of your code ###
    print(f"Val error from train/validation split #{i}: {mean_squared_error(Y_val, y_hat)}")

To apply cross-validation, we use the KFold class of sklearn.model_selection. KFold will return the indices of each cross-validation fold. Then, we iterate over each of these folds to designate it as the validation set, while training the model on the remaining folds.
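To make the idea of "indices of each fold" concrete, here is a small sketch on a hypothetical 10-point array. Shuffling is left off so the folds are contiguous and easy to read; the names `toy` and `kf_toy` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import KFold

toy = np.arange(10)            # hypothetical 10-point dataset
kf_toy = KFold(n_splits=5)     # no shuffling, so each fold is a contiguous block

seen_in_validation = []
for fold, (toy_train_idx, toy_valid_idx) in enumerate(kf_toy.split(toy)):
    # Each iteration yields two integer index arrays
    print(f"fold {fold}: train={toy_train_idx}, valid={toy_valid_idx}")
    seen_in_validation.extend(toy_valid_idx.tolist())
```

Every point lands in the validation fold exactly once across the five iterations, and in the training folds the other four times.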

In [None]:
from sklearn.model_selection import KFold
np.random.seed(25) # Ensures reproducibility of this notebook

# n_splits sets the number of folds to create
kf = KFold(n_splits=5, shuffle=True)
validation_errors = []

for train_idx, valid_idx in kf.split(X_train):
    # Split the data
    split_X_train, split_X_valid = X_train.iloc[train_idx], X_train.iloc[valid_idx]
    split_Y_train, split_Y_valid = Y_train.iloc[train_idx], Y_train.iloc[valid_idx]

    # Fit the model on the training split
    model.fit(split_X_train, split_Y_train)

    # Compute the MSE on the held-out fold (sklearn's argument order is y_true, y_pred)
    error = mean_squared_error(split_Y_valid, model.predict(split_X_valid))

    validation_errors.append(error)

print(f"Cross-validation error: {np.mean(validation_errors)}")
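
As a possible shortcut, sklearn's `cross_val_score` helper can run the same kind of loop. The sketch below uses hypothetical synthetic data rather than `X_train`, and compares the helper against a manual loop over the same `KFold` object. Note that the `"neg_mean_squared_error"` scorer returns negated MSE values, so the sign is flipped before averaging.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical synthetic data (not the vehicles dataset)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# A fixed random_state makes every call to split() return the same folds
kf_toy = KFold(n_splits=5, shuffle=True, random_state=0)

# Manual loop: fit on four folds, score on the held-out fold
manual_errors = []
for tr_idx, va_idx in kf_toy.split(X_toy):
    fold_model = LinearRegression().fit(X_toy[tr_idx], y_toy[tr_idx])
    manual_errors.append(
        mean_squared_error(y_toy[va_idx], fold_model.predict(X_toy[va_idx])))

# Helper: one score per fold, reported as negative MSE
neg_mse = cross_val_score(LinearRegression(), X_toy, y_toy,
                          cv=kf_toy, scoring="neg_mean_squared_error")

print(np.mean(manual_errors), -neg_mse.mean())
```

Both approaches should report the same cross-validation error, since they use identical folds.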

**Open text blocks below, and answer the following questions:**  
1. Into how many folds do we split the training set?
2. Based on the number of folds, what percentage of the training set (`X_train`) goes into `split_X_train`, and what percentage into `split_X_valid`? Add code to validate your answer.

# Part 3: Regularization

L1 (LASSO) Regularization  
To apply L1 regularization, we use the Lasso model class of sklearn. Lasso functions just like LinearRegression. The difference is that now the model will apply a regularization penalty. We specify the strength of regularization using the alpha parameter.
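
For reference, the objective that sklearn documents for `Lasso` can be written as below. The symbols are introduced here for illustration: $n$ is the number of training rows, $X$ the feature matrix, $y$ the target vector, and $w$ the coefficient vector; the intercept is omitted for brevity.

```latex
\min_{w} \; \frac{1}{2n} \, \lVert y - Xw \rVert_2^2 \; + \; \alpha \, \lVert w \rVert_1
```

The first term is the usual squared-error loss and the second is the L1 penalty, so a larger `alpha` puts more weight on keeping the coefficients small.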

In [None]:
import sklearn.linear_model as lm

lasso_model = lm.Lasso(alpha=0.1) # In sklearn, alpha represents the lambda hyperparameter
lasso_model.fit(X_train, Y_train)

lasso_model.coef_

To increase the strength of regularization (decrease model complexity), we increase the λ hyperparameter by changing alpha.

**Insert a code block below, create a Lasso model named "lasso_model_large_lambda" with alpha=10, train the model, and print the coefficients.**

Notice that these model coefficients are very small (some are effectively 0). This reflects L1 regularization's tendency to set the parameters of unimportant features to 0. We can use this in feature selection.
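
A minimal sketch of using LASSO for feature selection, assuming hypothetical synthetic data in which only the first of five features actually influences the target. The names `X_toy`, `lasso_toy`, and `selected` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical synthetic data: only feature 0 drives the target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = 3 * X_toy[:, 0] + rng.normal(scale=0.5, size=200)

lasso_toy = Lasso(alpha=0.5).fit(X_toy, y_toy)

# Indices of features whose coefficients were not shrunk to exactly 0
selected = np.flatnonzero(lasso_toy.coef_)
print(lasso_toy.coef_)
print("Selected feature indices:", selected)
```

Features with a coefficient of exactly 0 drop out of the model's predictions, so the remaining indices act as a selected feature subset.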

The features in our dataset are on wildly different numerical scales. To see this, compare the values of hp to the values of hp^8.

In [None]:
X_train.head()

For the feature hp to contribute in any meaningful way to the model, LASSO is "forced" to spend a disproportionate share of its parameter "budget" on assigning a large value to the model parameter for hp. Notice how the parameter for hp is much, much greater in magnitude than the parameter for hp^8.



In [None]:
pd.DataFrame({"Feature":X_train.columns, "Parameter":lasso_model.coef_})


We typically scale data before regularization such that all features are measured on the same numeric scale. One way to do this is by standardizing the data such that it has mean 0 and standard deviation 1.
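
One alternative way to standardize is sklearn's `StandardScaler`, sketched here on hypothetical toy arrays. Two details are worth noting: `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas' `.std()` uses the sample standard deviation (`ddof=1`) by default, so the scaled values differ slightly between the two approaches; and the scaler is fit on the training data only, then reused on other data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features on very different scales
train_toy = np.array([[1.0, 100.0], [2.0, 400.0], [3.0, 900.0], [4.0, 1600.0]])
test_toy = np.array([[5.0, 2500.0]])

# Learn the column means and SDs from the training data only
scaler = StandardScaler().fit(train_toy)

train_scaled = scaler.transform(train_toy)
test_scaled = scaler.transform(test_toy)   # reuses the training statistics

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
```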



In [None]:
# Center the data to have mean 0
X_train_centered = X_train - X_train.mean()

# Scale the centered data to have SD 1
X_train_standardized = X_train_centered/X_train_centered.std()

X_train_standardized.head()

When we re-fit a LASSO model, the coefficients are no longer as uneven in magnitude as they were before.



**Insert a code cell below, create a Lasso model with alpha=0.1, train the model on the standardized set, and print the coefficients.**

L2 (Ridge) Regression


We perform ridge regression using sklearn's Ridge class.
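
As a rough check on what `Ridge` computes, the sketch below compares its coefficients with the closed-form ridge solution on hypothetical synthetic data. It assumes the default `fit_intercept=True`, under which the penalized problem is solved on centered data; the names `X_toy`, `alpha_toy`, and `w_closed` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical synthetic data (not the vehicles dataset)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)
alpha_toy = 0.1

# Center X and y so the (unpenalized) intercept drops out of the problem
Xc = X_toy - X_toy.mean(axis=0)
yc = y_toy - y_toy.mean()

# Closed-form ridge solution: (Xc^T Xc + alpha * I)^(-1) Xc^T yc
w_closed = np.linalg.solve(Xc.T @ Xc + alpha_toy * np.eye(3), Xc.T @ yc)

w_sklearn = Ridge(alpha=alpha_toy).fit(X_toy, y_toy).coef_

print(w_closed)
print(w_sklearn)
```

The two coefficient vectors should agree up to numerical precision.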

In [None]:
ridge_model = lm.Ridge(alpha=0.1)
ridge_model.fit(X_train_standardized, Y_train)

ridge_model.coef_

**Insert a code cell below, print the mean squared error of the ridge model on the training set.**

# Part 4: Using cross-validation to optimize regularization parameters

**Add code in the code cell below, using cross-validation to find the best regularization parameter `alpha`**:


In [None]:
from sklearn.model_selection import KFold
np.random.seed(25) # Ensures reproducibility of this notebook

# n_splits sets the number of folds to create
kf = KFold(n_splits=5, shuffle=True, random_state=0)

for alpha in [0.001, 0.01, 0.1, 1, 10, 100, 1000]:
    validation_errors = []

    ## Add your code here: create a Ridge model, with alpha=alpha
    model = ...

    for train_idx, valid_idx in kf.split(X_train):
        # Split the data
        split_X_train, split_X_valid = X_train.iloc[train_idx], X_train.iloc[valid_idx]
        split_Y_train, split_Y_valid = Y_train.iloc[train_idx], Y_train.iloc[valid_idx]


        # Add your code here, fit the model on the training split


        # Add your code here, calculate the mean squared error on the validation set
        error = ...

        validation_errors.append(error)

    print(f"Cross-validation error for alpha = {alpha}: {np.mean(validation_errors)}")

Make sure all cells are visible and have been run (rerun if necessary).

The code below converts the .ipynb file to a PDF and saves it in the same folder as this .ipynb file.

In [None]:
# Enter the path to your notebook file below, e.g. "/content/drive/MyDrive/ECEN250/ECEN250_Lab6.ipynb".
# Do not change the lines after the assignment, and make sure you do not have multiple notebooks with the same path.
NOTEBOOK_PATH = ...  # Replace ... with the path to your notebook, as a string
! pip install playwright
! jupyter nbconvert --to webpdf --allow-chromium-download "$NOTEBOOK_PATH"

Download your notebook as an .ipynb file, then upload it along with the PDF file (saved in the same Google Drive folder as this notebook) to Canvas for Lab 4. Make sure that the PDF file matches your .ipynb file.