# `AA Workshop 6` — Coding Challenge

Complete the tasks below to practice implementing regression modeling from `W6_Regression_Advanced.ipynb`.

Guidelines:
- Work in order. Run each cell after editing with Shift+Enter.
- Keep answers short; focus on making things work.
- If a step fails, read the error and fix it.

By the end you will have exercised:
- implementing linear regression modeling with polynomial features
- implementing L1 & L2 regularization
- evaluating using proper cross-validation

## Task 1 - Predicting Shared Bike Demand in Seoul

You have been provided with a dataset (`SeoulBikeData.csv` in the `data` directory) that contains counts of public bicycles rented per hour in the Seoul Bike Sharing System, with corresponding weather data and holiday information. The dataset is publicly available and you can find further information [here](https://archive.ics.uci.edu/dataset/560/seoul+bike+sharing+demand). Your task is to predict the hourly count of shared bikes based on temperature. Do the following:
- Load and inspect the data; perform any cleaning steps if necessary.
- Define x (temperature) and y (rented bike count) and create a scatter plot of the two.
- Perform a train-holdout-test split.
- Train simple, Ridge, and Lasso regression models with polynomial degrees from 1 to 5 and alphas of 0.01, 0.1, 1, 10 (for Ridge and Lasso). Tune hyperparameters based on holdout performance, select the best performing model based on the mean squared error, and calculate final performance on the test set.
- Create a scatter plot of the true values and predictions of the best performing model.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# load and inspect data
data = pd.read_csv("../data/SeoulBikeData.csv")

data.head()

In [None]:
# filter out out-of-service observations
data = data[data["Functioning Day"] == "Yes"].copy()
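
Before filtering, it can help to count missing values and to see how many rows each `Functioning Day` value covers. The following is a minimal sketch on a hypothetical toy frame; the values are made up for illustration and are not taken from `SeoulBikeData.csv`, and the names `toy` and `clean` are assumptions.

```python
import pandas as pd

# Hypothetical toy frame (made-up values, NOT the real dataset)
toy = pd.DataFrame({
    "Rented Bike Count": [254, 0, 173, None],
    "Functioning Day": ["Yes", "No", "Yes", "Yes"],
})

# Count missing values per column
print(toy.isna().sum())

# Count how many rows fall into each service status
print(toy["Functioning Day"].value_counts())

# Keep in-service rows only, then drop rows with missing values
clean = toy[toy["Functioning Day"] == "Yes"].dropna().copy()
print(len(clean))
```

Whether dropping rows with missing values is appropriate depends on what the inspection of the real data shows.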

In [None]:
# define and plot x and y
x = data["Temperature(C)"].values.reshape((-1,1)) # remember: if we pass a 1-feature array we need to re-shape it!
y = data["Rented Bike Count"]

plt.figure(figsize = (8,6))
plt.scatter(x, y, marker="x")
plt.xlabel("Temperature (°C)")
plt.ylabel("Bike Count")
plt.show()

In [None]:
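# perform train-holdout-test split (roughly 50% train, 20% holdout, 30% test)
## first, set aside 30% of all observations as the test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=1)
## then, set aside 20% of all observations (0.2/0.7 of the remaining 70%) as the holdout set
x_train, x_hold, y_train, y_hold = train_test_split(x_train, y_train, test_size=(0.2/0.7), random_state=1)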

print(len(x_train), len(x_hold), len(x_test))
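
The second `test_size` of `0.2/0.7` is easy to misread, so here is a small sketch of the arithmetic on synthetic stand-in data (1,000 dummy rows, an assumption; the `_demo` names are not part of the solution). Taking 0.2/0.7 of the 70% that remains after the test split corresponds to about 20% of all rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the Seoul bike data)
x_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.arange(1000)

# Step 1: hold back 30% of all rows as the test set
x_rest, x_te, y_rest, y_te = train_test_split(
    x_demo, y_demo, test_size=0.30, random_state=1
)

# Step 2: take 0.2 / 0.7 of the remaining 70% (about 20% of all rows) as holdout
x_tr, x_ho, y_tr, y_ho = train_test_split(
    x_rest, y_rest, test_size=0.2 / 0.7, random_state=1
)

# Shares of the full dataset that end up in each split
n_total = len(x_demo)
shares = (len(x_tr) / n_total, len(x_ho) / n_total, len(x_te) / n_total)
print(shares)
```

The shares may deviate from 0.5 / 0.2 / 0.3 by a row or so, because the split sizes are rounded to whole observations.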

In [None]:
# configure model search
degrees = range(1, 6)
alphas = [0.01, 0.1, 1, 10]
models = ["linear", "Ridge", "Lasso"]
results = []

In [None]:
for model in models:
    for degree in degrees:
        # create polynomial features
        poly = PolynomialFeatures(degree=degree, include_bias=False)
        X_train_poly = poly.fit_transform(x_train)
        X_hold_poly = poly.transform(x_hold)

        # standardize (fit on the training split only so no holdout information leaks into the model)
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train_poly)
        X_hold_scaled = scaler.transform(X_hold_poly)

        # for Ridge or Lasso, loop over alphas
        if model in ["Ridge", "Lasso"]:
            for alpha in alphas:
                # initialize
                if model == "Ridge":
                    reg = Ridge(alpha=alpha)
                else:
                    reg = Lasso(alpha=alpha, max_iter=10000)

                # fit
                reg.fit(X_train_scaled, y_train)

                # predict on validation and calculate error
                y_hold_pred = reg.predict(X_hold_scaled)
                mse_val = mean_squared_error(y_hold, y_hold_pred)

                results.append({
                    "model": model,
                    "degree": degree,
                    "alpha": alpha,
                    "mse": mse_val,
                    "poly": poly,
                    "scaler": scaler,
                    "reg": reg
                })

        # linear regression without regularization
        else:
            # initialize
            reg = LinearRegression()

            # fit
            reg.fit(X_train_scaled, y_train)

            # predict on validation and calculate error
            y_hold_pred = reg.predict(X_hold_scaled)
            mse_val = mean_squared_error(y_hold, y_hold_pred)

            results.append({
                "model": model,
                "degree": degree,
                "alpha": None,
                "mse": mse_val,
                "poly": poly,
                "scaler": scaler,
                "reg": reg
            })

results = pd.DataFrame(results)

results
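
The manual loop above tunes on a single holdout split. As a point of comparison, the same kind of search can be written with k-fold cross-validation using scikit-learn's `Pipeline` and `GridSearchCV`. This is only a sketch under stated assumptions: it uses synthetic one-feature data rather than the bike data, searches Ridge only, and the names `pipe`, `param_grid`, and `search` are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic one-feature data with a quadratic trend (an assumption)
rng = np.random.default_rng(1)
x_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = 2.0 + x_demo[:, 0] ** 2 + rng.normal(scale=0.5, size=200)

# Chain the same three steps as the loop: polynomial features, scaling, model.
# Inside cross-validation, each step is refit on the training folds only.
pipe = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),
    ("scaler", StandardScaler()),
    ("reg", Ridge()),
])

# Same degree and alpha grid as in the task description
param_grid = {
    "poly__degree": list(range(1, 6)),
    "reg__alpha": [0.01, 0.1, 1, 10],
}

# scikit-learn maximizes scores, so MSE is passed in negated form
search = GridSearchCV(pipe, param_grid, scoring="neg_mean_squared_error", cv=5)
search.fit(x_demo, y_demo)

print(search.best_params_)
print("CV MSE:", -search.best_score_)
```

With this approach a separate holdout set is not needed for tuning, but the test set should still be kept aside for the final evaluation.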

In [None]:
# select best model
best_idx = results["mse"].idxmin()
best_model = results.loc[best_idx]

print(best_model)

In [None]:
# evaluate on test set
## extract best model components
best_poly = best_model["poly"]
best_scaler = best_model["scaler"]
best_reg = best_model["reg"]

## transform test set
X_test_poly = best_poly.transform(x_test)
X_test_scaled = best_scaler.transform(X_test_poly)

## predict
y_test_pred = best_reg.predict(X_test_scaled)

## evaluate
mse_test = mean_squared_error(y_test, y_test_pred)
r2_test = r2_score(y_test, y_test_pred)

print("\nTest MSE:", mse_test)
print("Test R2:", r2_test)
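
The MSE is expressed in squared units (here, squared bike counts), which makes it hard to interpret directly. Taking the square root gives the RMSE in the original unit. A minimal sketch with made-up numbers (the `_demo` arrays are assumptions, not model output):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true values and predictions, for illustration only
y_true_demo = np.array([100.0, 200.0, 300.0])
y_pred_demo = np.array([110.0, 190.0, 320.0])

# Errors are 10, 10, and 20, so the squared errors are 100, 100, and 400
mse_demo = mean_squared_error(y_true_demo, y_pred_demo)

# RMSE is the square root of the MSE, in the same unit as y
rmse_demo = np.sqrt(mse_demo)

print("MSE:", mse_demo)
print("RMSE:", rmse_demo)
```

Applied to the cell above, the same step would be `np.sqrt(mse_test)`.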

In [None]:
# plot true values of all three splits together with the test-set predictions
plt.figure(figsize = (8,6))
plt.scatter(x_train, y_train, marker="x", alpha=0.4, color="green", label="train")
plt.scatter(x_hold, y_hold, marker="x", alpha=0.4, color="orange", label="holdout")
plt.scatter(x_test, y_test, marker="x", alpha=0.4, color="blue", label="test")
plt.scatter(x_test, y_test_pred, marker="x", alpha=0.4, color="red", label="predictions")
plt.xlabel("Temperature (°C)")
plt.ylabel("Bike Count")
plt.legend()
plt.show()
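
Scattered prediction points can hide the shape of the fitted polynomial. One option is to predict on an evenly spaced, sorted grid of x values and draw the result as a line. The sketch below only computes such a grid; it assumes synthetic data and a degree-3 unregularized model, neither of which reflects the actual best model from the search.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic temperature-like data (an assumption; not the Seoul bike data)
rng = np.random.default_rng(0)
x_demo = rng.uniform(-10, 35, size=(300, 1))
y_demo = (
    500
    + 30 * x_demo[:, 0]
    - 0.5 * x_demo[:, 0] ** 2
    + rng.normal(scale=50, size=300)
)

# Same preprocessing order as in the solution: polynomial features, then scaling
curve_model = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),
    StandardScaler(),
    LinearRegression(),
)
curve_model.fit(x_demo, y_demo)

# Evenly spaced, sorted grid over the observed x range
x_grid = np.linspace(x_demo.min(), x_demo.max(), 200).reshape(-1, 1)
y_grid = curve_model.predict(x_grid)

print(x_grid.shape, y_grid.shape)
```

The pair `x_grid`, `y_grid` could then be passed to `plt.plot` on top of the scatter plot. In the notebook itself, the grid would go through `best_poly.transform`, `best_scaler.transform`, and `best_reg.predict` instead of the demo pipeline.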

---