## Data 335 &mdash; Winter 2025 &mdash; Assignment 1

#### Due: 2025.02.03 at 23:59

#### Instructions

- Save a copy of this file, changing its name to include your name and student ID as indicated.

- Solve the problems, adding code and markdown cells as needed.

- Your code should be runnable! Before submitting your work, *Restart* your environment and *Run All* to verify that everything works.

- Please submit a `.pdf` export of your notebook (with the same name stem) to the D2L dropbox in addition to the `.ipynb` file. If you're using VS Code, click the three dots (button bar, top right) and select *Export*.

#### Grading
- All problems have equal weight.

- Half-credit will be awarded for substantial progress towards a solution.

In [473]:
import numpy as np
import pandas as pd

### 1. The advantage of cross-validation

In this problem, we'll identify an advantage of the $5$-fold cross-validation estimate of predictive error over a simple average of test error quantities over 5 random 80%/20% train/test splits.

**To do:** Write two functions `f` and `g`.
They should each take, as input,
- a feature matrix `X` with `n` rows and a target vector `y` of length `n`;
- an integer `n_repeats` (default value `1000`);
- an integer `n_splits` (default value `5`);
- a random seed (default value `None`)

and produce, as output,

- a matrix of shape `(n_repeats, n_splits)`.

Each row of the output of `f` should consist of the `n_splits` test error quantities for a linear regression model fit to/predicted on the `n_splits` training/testing splits of `X` obtained through the cross-validation procedure. (Use `sklearn.model_selection.KFold`.)

Each row of the output of `g` should consist of the `n_splits` test error quantities for a linear regression model fit to/predicted on `n_splits` independent training/testing splits of `X`. (Use `sklearn.model_selection.train_test_split`.)

Repeated applications of `f` with the same inputs and a fixed seed should produce the same outputs: the randomization of the splits should be determined entirely by the specified seed. The same holds for `g`.
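One way to meet this reproducibility requirement, sketched under assumptions: draw one integer per repeat from the seeded generator and hand it to the splitter as its `random_state`. The helper name `draw_split_seeds` is hypothetical, and the snippet shows only the seeding, not the model fitting.

```python
import numpy as np


def draw_split_seeds(seed, n_repeats):
    """Return one integer seed per repeat, fully determined by `seed`."""
    rng = np.random.default_rng(seed)
    # Integers in [0, 2**32 - 1) are valid `random_state` values in scikit-learn.
    return rng.integers(0, 2**32 - 1, size=n_repeats)


seeds_a = draw_split_seeds(seed=0, n_repeats=3)
seeds_b = draw_split_seeds(seed=0, n_repeats=3)
print(np.array_equal(seeds_a, seeds_b))  # same seed, same sequence of draws
```

Each entry `s` could then be passed along, for example as `KFold(n_splits, shuffle=True, random_state=int(s))` or `train_test_split(..., random_state=int(s))`. With `seed=None` the draws differ from call to call, so reproducibility only holds for a fixed seed.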

The means of the rows of the outputs of `f` and of `g` are estimates of the expected predictive error of a linear regression model fit to a random subset of the data of size `0.8*n`.

**To do:**

- Run your functions `f` and `g` on the `auto_preprocessed` dataset.

- Compute and compare the means and standard deviations of these row-means, and plot their histograms. What do the results suggest about the cross-validation splits versus random train/test splits? Can you explain your observations?

- How large do you need to set `n_splits` in `g` to match the efficiency (i.e., the standard deviation) of the 5-fold cross-validation estimator?
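The summary statistics of the row-means can be computed from either output matrix along the following lines. The function name `summarize_row_means` and the toy matrix are illustrative assumptions, not part of the required interface.

```python
import numpy as np


def summarize_row_means(errors):
    """Mean and sample standard deviation of the per-repeat (row) means."""
    row_means = errors.mean(axis=1)  # one error estimate per repeat
    return row_means.mean(), row_means.std(ddof=1)


# Toy stand-in for the (n_repeats, n_splits) output of `f` or `g`.
toy_errors = np.array([[1.0, 3.0], [2.0, 4.0], [3.0, 5.0]])
mean_of_means, sd_of_means = summarize_row_means(toy_errors)
print(mean_of_means, sd_of_means)
```

Applying the same function to the outputs of `f` and `g` gives the two pairs of numbers to compare. The choice of `ddof=1` (sample standard deviation) is a convention, not something the assignment prescribes.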

In [None]:
def f(X, y, *, n_repeats=1000, n_splits=5, seed=None):
    rng = np.random.default_rng(seed)
    kfold_mses = np.zeros((n_repeats, n_splits))

    # Your stuff here.

    return kfold_mses


def g(X, y, *, n_repeats=1000, n_splits=5, seed=None):
    rng = np.random.default_rng(seed)
    train_test_split_mses = np.zeros((n_repeats, n_splits))

    # Your stuff here.

    return train_test_split_mses

In [460]:
X = pd.read_csv("../data/auto_preprocessed.csv")
y = X.pop("mpg")

### 2. Bigger models: Worth it?

**To do:** Produce two feature-engineered versions, `X0` and `X1`, of the feature matrix `X` for the `data/auto_preprocessed.csv` dataset.

`X0` should include:
- all features in `X`, and
- the squares of all the non-binary features of `X`. (Why exclude the squares of the binary features?) The binary features are the origin-indicators, `is_european` and `is_japanese`.

`X1` should include:
- `horsepower`, `weight`, `acceleration`, `year`, `is_european`, `is_japanese`; and
- `horsepower**2`, `weight**2`, `acceleration**2`, `year**2`

(`X1` removes all features involving `cylinders` or `displacement` from `X0`.)
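A minimal sketch of appending squared columns to a feature matrix, written with plain NumPy arrays and a list of column names. The helper `add_squares` and the toy values are assumptions for illustration; with a `pandas` DataFrame the same idea applies through column selection.

```python
import numpy as np


def add_squares(A, names, to_square):
    """Append the squares of the columns named in `to_square` to `A`."""
    idx = [names.index(name) for name in to_square]
    A_new = np.column_stack([A, A[:, idx] ** 2])
    new_names = names + [f"{name}**2" for name in to_square]
    return A_new, new_names


# Toy data (made up): one numeric column and one binary column.
A = np.array([[2.0, 0.0], [3.0, 1.0], [4.0, 1.0]])
names = ["weight", "is_european"]
A_sq, names_sq = add_squares(A, names, to_square=["weight"])
print(names_sq)
print(A_sq)
```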

**To do:** Use 5-fold cross-validation, repeated 1000 times, to estimate the predictive errors of linear regression models fit to `(X0, y)` and `(X1, y)`. Which is better?

**To do:** Use an *F*-test to compare the two models. Precisely state the null-hypothesis of the test.
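Because the columns of `X1` are a subset of those of `X0`, the two models are nested, and the classical *F* statistic for nested least-squares fits applies. The sketch below computes that statistic with NumPy on synthetic data. The helper names `fit_rss` and `nested_f_statistic` are hypothetical, and the data are made up.

```python
import numpy as np


def fit_rss(A, y):
    """Residual sum of squares and parameter count for least squares with intercept."""
    A1 = np.column_stack([np.ones(len(y)), A])
    beta, *_ = np.linalg.lstsq(A1, y, rcond=None)
    resid = y - A1 @ beta
    return float(resid @ resid), A1.shape[1]


def nested_f_statistic(A_small, A_big, y):
    """F statistic for testing a smaller model nested inside a bigger one."""
    rss_small, p_small = fit_rss(A_small, y)
    rss_big, p_big = fit_rss(A_big, y)
    q = p_big - p_small        # number of coefficients set to zero under H0
    df_resid = len(y) - p_big  # residual degrees of freedom of the bigger model
    f_stat = ((rss_small - rss_big) / q) / (rss_big / df_resid)
    return f_stat, q, df_resid


# Synthetic data (made up): the third column matters, so H0 is false here.
rng = np.random.default_rng(0)
A_big = rng.normal(size=(50, 3))
y = A_big @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=50)
f_stat, q, df_resid = nested_f_statistic(A_big[:, :2], A_big, y)
print(f_stat, q, df_resid)
```

Under the null hypothesis and the usual normal-error assumptions, the statistic follows an *F* distribution with `(q, df_resid)` degrees of freedom. The *p*-value is the upper tail probability, available for instance via `scipy.stats.f.sf(f_stat, q, df_resid)`.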

### 3. "Forward" variable selection 

We continue working with the `data/auto_preprocessed.csv` dataset.

Let `X0` be the submatrix of the feature matrix `X` containing only the binary origin columns `is_european` and `is_japanese`.

**(a)** Which numerical (non-binary) feature, when added to `X0`, yields the largest decrease in predictive error, as estimated by 5-fold cross-validation, repeated 100 times? 

**(b)** Let `X1` denote the feature matrix obtained by adding the feature identified in (a). Which numerical (non-binary) feature, when added to `X1`, yields the largest decrease in predictive error, as estimated by 5-fold cross-validation, repeated 100 times?

...

**(f)** Let `X5` denote the feature matrix obtained by adding the feature identified in (e). Which numerical (non-binary) feature, when added to `X5`, yields the largest decrease in predictive error, as estimated by 5-fold cross-validation, repeated 100 times?
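A single step of this procedure can be sketched as follows. This is a NumPy-only stand-in that uses a hand-written K-fold loop in place of scikit-learn, on synthetic data. The helper names `cv_mse` and `forward_step` are hypothetical, and the defaults (20 repeats, seed 0) are arbitrary choices for the illustration.

```python
import numpy as np


def cv_mse(A, y, n_splits, rng):
    """K-fold cross-validated MSE of least squares with an intercept."""
    n = len(y)
    folds = np.array_split(rng.permutation(n), n_splits)
    A1 = np.column_stack([np.ones(n), A])
    fold_mses = []
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False
        beta, *_ = np.linalg.lstsq(A1[train_mask], y[train_mask], rcond=None)
        resid = y[test_idx] - A1[test_idx] @ beta
        fold_mses.append(np.mean(resid**2))
    return float(np.mean(fold_mses))


def forward_step(A, y, selected, candidates, n_repeats=20, n_splits=5, seed=0):
    """Return the candidate column whose addition gives the lowest CV error."""
    scores = {}
    for j in candidates:
        rng = np.random.default_rng(seed)  # reuse the same splits for every candidate
        cols = selected + [j]
        scores[j] = float(
            np.mean([cv_mse(A[:, cols], y, n_splits, rng) for _ in range(n_repeats)])
        )
    return min(scores, key=scores.get), scores


# Synthetic data (made up): column 0 is already selected, column 2 carries the signal.
rng = np.random.default_rng(1)
A = rng.normal(size=(200, 4))
y = A[:, 0] + 3.0 * A[:, 2] + rng.normal(scale=0.5, size=200)
best, scores = forward_step(A, y, selected=[0], candidates=[1, 2, 3])
print(best, scores)
```

Resetting the generator for each candidate means all candidates are scored on identical splits, which removes one source of noise from the comparison. Repeating the step with `selected + [best]` and the remaining candidates gives the next part of the exercise.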


In [472]:
X = pd.read_csv("../data/auto_preprocessed.csv")
y = X.pop("mpg")

X0 = X[["is_european", "is_japanese"]].copy()

### 4. "Backward" variable selection 

We continue working with the `data/auto_preprocessed.csv` dataset.

**(a)** Which numerical (non-binary) feature, when removed from `X`, yields the smallest increase in predictive error, as estimated by 5-fold cross-validation, repeated 100 times? 

**(b)** Let `X1` denote the feature matrix obtained by removing the feature identified in (a). Which numerical (non-binary) feature, when removed from `X1`, yields the smallest increase in predictive error, as estimated by 5-fold cross-validation, repeated 100 times?

...

**(f)** Let `X5` denote the feature matrix obtained by removing the feature identified in (e). Which numerical (non-binary) feature, when removed from `X5`, yields the smallest increase in predictive error, as estimated by 5-fold cross-validation, repeated 100 times?

**(g)** Both this exercise and exercise 3 rank features by importance. How do these rankings compare?

In [None]:
X = pd.read_csv("../data/auto_preprocessed.csv")
y = X.pop("mpg")

### 5. Lab 3, Exercise 2

Here's a fake dataset describing the relationship between cholesterol level (a heart disease risk factor), age (ordinal, four categories, ages 10-30, 30-50, 50-70, and 70-90), and weekly hours of exercise.

In [None]:
def make_data():
    n = 100
    np.random.seed(0)
    age = np.random.choice([0, 1, 2, 3], size=n)
    exercise = 2 * age + 3 * np.random.normal(size=n) + 6
    cholesterol = 200 + 30 * age - 5 * exercise + 10 * np.random.normal(size=n)
    df = pd.DataFrame({"age": age, "exercise": exercise, "cholesterol": cholesterol})
    return df


data = make_data()
data.head()

- Plot the (simple) linear regression of `cholesterol` on `exercise`, overlaid on a scatterplot of the data. Do the results surprise you?

- Fit a multiple linear regression of `cholesterol` onto `exercise` and `age`. Plot the regression lines corresponding to each of the age groups, overlaid with a scatterplot of the data. Use a different color for each age group. Comment.

This exercise demonstrates a phenomenon known as *Simpson's Paradox*. The inspiration for this exercise comes from &sect;1.2 of **Causal Inference in Statistics** by Pearl, Glymour, and Jewell.
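For the per-group regression lines, the mechanics can be sketched on noiseless toy data. The sketch assumes the group variable enters the model as a single numeric column, matching the 0 to 3 coding used in `make_data`. With dummy coding, each group would instead get its own intercept offset. All values below are made up.

```python
import numpy as np

# Noiseless toy data (made up): y = 1 + 2*x + 3*group, with groups coded 0..3.
group = np.repeat([0, 1, 2, 3], 5)
x = np.tile(np.arange(5.0), 4)
y = 1.0 + 2.0 * x + 3.0 * group

# Least-squares fit of y on an intercept, x, and the numeric group code.
design = np.column_stack([np.ones_like(x), x, group])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
a, b, c = beta

# Within group k the fitted line is y = (a + c*k) + b*x: shared slope, shifted intercept.
intercepts = a + c * np.arange(4)
for k, intercept in enumerate(intercepts):
    print(f"group {k}: intercept {intercept:.2f}, slope {b:.2f}")
```

Each `(intercept, slope)` pair defines one line to draw over the scatterplot, restricted to the points of that group.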

### 6. Lab 4, Exercise 1

In this exercise, we use data about gas mileage, horsepower, and other information for 392 vehicles. See [here](https://islp.readthedocs.io/en/latest/datasets/Auto.html) for details. Use the file `data/auto.csv`.

##### a)
Without looking at the data, guess the signs of the slopes of the regressions of `mpg` onto `acceleration` and of `mpg` onto `year`. Then fit the two simple linear regression models and compare the regression slopes with your guesses. Discuss.

##### b)
Compute the means of `mpg` grouped by `origin`. Fit a regression of `mpg` onto the categorical predictor `origin` *without* an intercept term. What do you observe about the coefficient estimates? Can you explain this? Now fit a regression of `mpg` onto `origin` *with* an intercept. Show how the coefficients of this regression fit can be expressed in terms of the intercept-free regression coefficients, and vice versa.

##### c)
Let `X` be the matrix of dummy variables associated to `origin`:
```
X = pd.get_dummies(auto["origin"])
```
Find a number `a` and a vector `b` of shape `(3,)` such that `b.sum() == 0` and such that `a + X @ b` coincides with the predictions from the models fit in (b).

##### d)
Let `X` be as in (c) and let `n` be the vector of row counts associated to the three origins. Find a number `a` and a vector `b` of shape `(3,)` such that `np.sum(n*b) == 0` and such that `a + X @ b` coincides with the predictions from the models fit in (b). Observe that `a` equals the overall mean `auto["mpg"].mean()`. The entries of `b` are "treatment effects" associated to the origins.
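A candidate pair `(a, b)` for (c) or (d) can be checked numerically with a small helper like the one below. The function name `check_candidate` and the toy one-hot matrix are assumptions for illustration. Floating-point comparisons use tolerances rather than exact equality.

```python
import numpy as np


def check_candidate(a, b, X, preds, weights=None):
    """Check the (weighted) sum-to-zero constraint on `b` and that `a + X @ b` matches `preds`."""
    w = np.ones(len(b)) if weights is None else np.asarray(weights, dtype=float)
    constraint_ok = np.isclose(np.sum(w * b), 0.0)
    preds_ok = np.allclose(a + X @ b, preds)
    return bool(constraint_ok and preds_ok)


# Toy one-hot matrix for four rows in three groups (values are made up).
X = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
preds = np.array([10.0, 10.0, 20.0, 30.0])
print(check_candidate(20.0, np.array([-10.0, 0.0, 10.0]), X, preds))
```

Passing the group counts as `weights` switches the check from the unweighted constraint in (c) to the weighted one in (d).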

##### e)
Fit the linear regression of `mpg` onto all the quantitative features (i.e., all the features except for `origin` and `name`) and note which non-intercept coefficient estimates have *p*-values < 0.05. Repeat with all the quantitative features except for `weight`. Compare and discuss.

##### f)
Perform an *F*-test of the null hypothesis that the means of the `mpg` variable are equal across all three origins after adjusting for all the continuous covariates.

