# Assignment 2

This assignment is due on __Wednesday March 20, by 11:59PM__. It pertains to content taught in classes 4-5.

This assignment should be completed in Python, and a PDF file should be submitted, containing both code and written answers. If you like, you may create your own Jupyter Notebook file from scratch, but it is likely easier to modify this one.

As before, questions that require identification and/or interpretation will not be penalized for brevity of response: if a question can be answered with 'yes/no', or a numeric value, you may simply state as much. If you incorporate code from the internet (which is not required and generally not advisable), please cite the source within your code (providing a URL is sufficient).

If you like, you may collaborate with others in the class. If you choose to do so, please indicate with whom you have worked at the top of your PDF. Separate submissions are required.

Any questions can be addressed to Kamilah ([kamilah.ebrahim@mail.utoronto.ca]()) and/or Ananya ([ananya.jha@mail.utoronto.ca]()) and/or Vishnou ([vishnouvina@cs.toronto.edu]()) before the due date. Please submit your assignments through your Drive Folder.

### Set up

In [None]:
# Import standard libraries
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib.pyplot import subplots
import statsmodels.api as sm

# Import specific objects
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence \
     import variance_inflation_factor as VIF
from statsmodels.stats.anova import anova_lm

%pip install l0bnb
from l0bnb import fit_path
from ISLP import load_data
from ISLP.models import (ModelSpec as MS,
                         summarize,
                         poly)

# Install, import, and load specific package
# "> nul 2>&1" means that the install messages have been suppressed
%pip install faraway > nul 2>&1
import faraway as fw
import faraway.datasets.fat

### Question 1: Leave-one-out cross-validation (LOOCV)

First, we'll use the `swiss` dataset, which is built into R but can be loaded into Python. As always, start by reviewing a description of the dataset, by typing `swiss?` in the console. To perform model selection via "best subsets", we will use the `regsubsets` function in the `leaps` package.

In [None]:
# Import
swiss = sm.datasets.get_rdataset("swiss")
df = pd.DataFrame(swiss.data)

# Explore the dataset

Answer the following questions:

_(i)_ What will be the size (number of observations) of each LOOCV training sample?

_(ii)_ What will be the size (number of observations) of each LOOCV testing sample?

_(iii)_ How many "folds" (i.e., k) will our LOOCV model have?  

_(iv)_ Now, fit a linear model, with `Fertility` as the response variable, and all other variables as predictors. Use the `sm.OLS` function.

In [None]:
# Add your code here

_(v)_ Next, perform LOOCV, using the appropriate function.  

In [None]:
# Add your code here

_(vi)_ What is the MSE for the LOOCV?  

In [None]:
# Add your code here

_(vii)_ Run the LOOCV for a second time (no need to repeat the code; simply run your existing code in v and vi again). Do you obtain different results? Why or why not?  

_(viii)_ Manually compute MSE for the linear model (without LOOCV) that you fit with the `sm.OLS` function, in iv. (Hint: recall that MSE is defined as the sum of squared residuals, divided by n. You can "look inside" your linear model object to find residual values). 
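
For reference, the definition in the hint can be written out as below. The symbols are notation chosen here rather than taken from the assignment: $y_i$ is the observed response, $\hat{y}_i$ is the fitted value, and $n$ is the number of observations.

$$
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
$$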

In [None]:
# Add your code here

_(ix)_ Does the LOOCV-linear model, or the non-validated linear model, appear to have greater error? Why might this be the case?   

Imagine that the `swiss` dataset has just announced a major new release, which will include data from all provinces of Europe (not just those in Switzerland), and records all the way to the present day (not just 1888).  

_(x)_ Would you choose LOOCV as a validation method for this new release? Why or why not?  

_(xi)_ What validation method might you choose instead?  


In [None]:
# Write your answer here

### Question 2: Regularization via best subset selection

Now, let's use the `fat` dataset, in the `faraway` library. Please make sure you have installed the `faraway` library and loaded specific objects listed at the start of this notebook.

In [None]:
# Load data
fat = faraway.datasets.fat.load()
fat

# Explore dataset

_(i)_ Using the `l0bnb` library, fit a best subset model with `brozek` (body fat) as the response, and all variables except for `free`, `siri`, and `density` as predictors. Provide the `max_nonzeros` argument of `fit_path` (the counterpart of `nvmax` in R's `regsubsets`), with a value equal to the number of predictors.

In [None]:
# Add your code here

The plot below shows (unadjusted) $R^2$ estimates for all subset models.

![](https://drive.google.com/uc?id=1omSUpJrARF6gYnZ2hWv61EWbdlWklBdG)

In [None]:
# example code

# import matplotlib.pyplot as plt  # needed for the plt.* calls below
# plt.plot(best_sub['num_vars'], best_sub['rsq'], marker='o', linestyle='-')
# plt.xlabel('Number of Variables/Predictors')
# plt.ylabel('R-squared value')
# plt.title('R-squared vs Number of Variables/Predictors')
# plt.show()

_(ii)_ Why can't we use (unadjusted) $R^2$ estimates to select the best model? 

_(iii)_ Create a plot similar to that above, but showing adjusted $R^2$. Add a coloured point, highlighting the number of variables/predictors with the most desirable adjusted $R^2$ value.
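
As a reminder, one common definition of adjusted $R^2$ for a model with $d$ predictors fitted to $n$ observations is given below. The symbols $d$, $\mathrm{RSS}$ (residual sum of squares) and $\mathrm{TSS}$ (total sum of squares) are notation assumed here, not defined in the assignment.

$$
R^2_{\mathrm{adj}} = 1 - \frac{\mathrm{RSS} / (n - d - 1)}{\mathrm{TSS} / (n - 1)}
$$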

In [None]:
# Add your code here

_(iv)_ Write code to pull out the highest and lowest $R^2$ values. (Hint: use the `max` and `min` functions.) Does the difference in percent variance explained (i.e., $R^2$) appear to be meaningful? (No statistics needed: interpret in the context of the `fat` dataset.)

In [None]:
# Add your code here

_(v)_ What is the best model according to BIC?

In [None]:
# Add your code here

_(vi)_ What is the best model according to $C_p$? 

In [None]:
# Add your code here

_(vii)_ Are you surprised that BIC and $C_p$ compute differing estimates of prediction error? Why or why not?

Let's be more rigorous, and compute a direct (cf. indirect) estimate of prediction error via k-fold cross validation. The `predict_regsubsets` function provided below achieves something comparable to `predict` (in brief, it extracts the fitted coefficients for each model size and then multiplies them by the corresponding predictors for each test observation). <u>This code does not need to be edited.</u>

In [None]:
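def predict_regsubsets(object, newdata, id):
    form = object['call'][1]  # pull out the formula (kept for reference; not used below)
    mat = pd.get_dummies(newdata, drop_first=True)  # make a matrix, dummy-coding categorical columns
    coefs = object['coef'][id]  # pull out coefficients for model size `id`
    vars = list(coefs.keys())  # pull out associated predictors
    beta = np.array([coefs[v] for v in vars], dtype=float)  # coefficient vector, same order as `vars`
    result = np.dot(mat[vars].to_numpy(dtype=float), beta)  # matrix multiplication
    return result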
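
The assignment does not show the structure `predict_regsubsets` expects, so the cell below is only a sketch under assumptions: `toy_fit` is a hypothetical dictionary with a `'call'` entry and a `'coef'` entry that maps each model size to a pandas Series of coefficients, and `toy_test` is made-up test data. The function is repeated in condensed form so that the cell runs on its own.

```python
import numpy as np
import pandas as pd

def predict_regsubsets(object, newdata, id):
    # Condensed copy of the provided helper, so this example is self-contained
    mat = pd.get_dummies(newdata, drop_first=True)
    coefs = object['coef'][id]
    vars = list(coefs.keys())
    beta = np.array([coefs[v] for v in vars], dtype=float)
    return np.dot(mat[vars].to_numpy(dtype=float), beta)

# Hypothetical fitted object: coefficients keyed by model size (assumed layout)
toy_fit = {
    'call': ('regsubsets', 'y ~ x1 + x2'),
    'coef': {
        1: pd.Series({'x1': 2.0}),
        2: pd.Series({'x1': 2.0, 'x2': -1.0}),
    },
}

# Made-up test observations with the same column names as the coefficients
toy_test = pd.DataFrame({'x1': [1.0, 2.0], 'x2': [3.0, 4.0]})

pred_1 = predict_regsubsets(toy_fit, toy_test, id=1)  # one-predictor model
pred_2 = predict_regsubsets(toy_fit, toy_test, id=2)  # two-predictor model
print(pred_1, pred_2)
```

Note that the helper only multiplies the columns named in the coefficient Series; an intercept would need its own column in the test data under this assumed layout.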

Now, we need to define several variables/objects for our k-fold cross validation. In the code chunk below:  

_(viii)_ Define a variable `k`, with a value of 5. 

_(ix)_ Define a variable `n`, with a value reflecting the number of observations in our data.

_(x)_ Define a variable `p`, with a value reflecting the maximum number of predictors in our subset selection.

_(xi)_ Define a variable `folds`. This variable is a vector, containing randomly selected integers from 1 through k, and should be of length n. (Hint: in R this would use the `sample` and `rep` functions; NumPy counterparts include `np.random.permutation` and `np.tile`.)

_(xii)_ Define a variable `cv_error`. This variable is a matrix. Its number of rows should be equal to k, and its number of columns should be equal to p. (You may choose to fill the matrix with NaNs or 0 values or something else; these will be overwritten.)


In [None]:
# define k

# define n

# define p

# define folds

# define cv_error

Great! Now that we have our required variables/objects, and the (provided) `predict_regsubsets` function, we must write a for loop that will (a) fit our model on all but the held-out (test) fold, (b) predict the response in the held-out fold, and (c) compute MSE in the held-out fold.

The image below shows most of this code, with some crucial bits occluded. The occluded bits are 5 variables defined above: `k`, `regsubsets`, `p`, `folds`, and `cv_error`. ![](https://drive.google.com/uc?id=1MdbaIhq82sU-230WX8qi1vR7T8hCJBV7)

_(xiii)_ Type this code into the chunk below, filling in the missing bits. (You are free to omit comments and change the code structure -- as long as it works!)

In [None]:
# Add your code here

_(xiv)_ Review your `cv_error` matrix. It should contain 14 columns (one for each number of predictors) and 5 rows (one for each of the k-folds). The contained values are MSE estimates. Find the mean of the MSE estimates, for each number of predictors.

In [None]:
# Add your code here

_(BONUS)_ The "one-standard-error rule" states that if there are several models with similar estimates of test MSE, we can choose between them by: (a) calculating the standard error of the MSE for each model (number of predictors), (b) considering all models with an MSE within one standard error of the model with the smallest MSE, and (c) from the models within this band, selecting the one with the smallest number of predictors. Perform computations over `cv_error` to select the ideal number of predictors, in accordance with the "one-standard-error rule".
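
The three steps of the rule can be sketched on a small made-up matrix. Everything here is illustrative: `toy_cv_error` holds invented fold-wise MSE values (5 folds, 4 model sizes), and the band in step (b) is assumed to be the smallest mean MSE plus the standard error of that same model, which is one common reading of the rule.

```python
import numpy as np

# Made-up fold-wise MSE values: rows are folds, columns are 1..4 predictors
toy_cv_error = np.array([
    [30.0, 20.0, 19.0, 18.5],
    [32.0, 21.0, 20.0, 19.5],
    [28.0, 19.0, 18.0, 18.0],
    [31.0, 22.0, 21.0, 20.5],
    [29.0, 18.0, 17.0, 17.5],
])

n_folds = toy_cv_error.shape[0]

# Mean MSE per model size, and (a) its standard error across folds
mean_mse = toy_cv_error.mean(axis=0)
se_mse = toy_cv_error.std(axis=0, ddof=1) / np.sqrt(n_folds)

# (b) models whose mean MSE lies within one standard error of the minimum
best = int(np.argmin(mean_mse))
threshold = mean_mse[best] + se_mse[best]
within_band = np.flatnonzero(mean_mse <= threshold)

# (c) smallest model inside the band (column index 0 means 1 predictor)
chosen_size = int(within_band[0]) + 1
print(chosen_size)
```

With these toy numbers the four-predictor model has the lowest mean MSE, but the three-predictor model falls inside the band and is the one selected.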

In [None]:
# Add your code here