# Assignment 3

This assignment is due the following Monday. It pertains to content taught in classes 7-9.

This assignment should be completed in Python, and a PDF file should be submitted, containing both code and written answers. If you like, you may create your own Jupyter Notebook file from scratch, but it is likely easier to modify this one.

As before, questions that require identification and/or interpretation will not be penalized for brevity of response: if a question can be answered with 'yes/no', or a numeric value, you may simply state as much. If you incorporate code from the internet (which is not required and generally not advisable), please cite the source within your code (providing a URL is sufficient).

If you like, you may collaborate with others in the class. If you choose to do so, please indicate with whom you have worked at the top of your PDF. Separate submissions are required.

### Question 1: Regularization via best subset selection

First, we'll use the `swiss` dataset, which is a built-in dataset in R, but can be loaded in Python. As always, start by reviewing a description of the dataset, for example by printing `swiss.__doc__` once the dataset has been loaded.  To perform model selection via "best subsets", we will use the `regsubsets` function in the `leaps` package.

In [None]:
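# Import the libraries needed to load the dataset
import pandas as pd
import statsmodels.api as sm

# Load the swiss dataset and store it as a DataFrame
swiss = sm.datasets.get_rdataset("swiss")
df = pd.DataFrame(swiss.data)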

# Explore the dataset

Answer the following questions:

_(i)_ What will be the size (number of observations) of each LOOCV training sample?


In [None]:
# Your answer here


_(ii)_ What will be the size (number of observations) of each LOOCV testing sample?


In [None]:
# Your answer here


_(iii)_ How many "folds" (i.e., k) will our LOOCV model have?  


In [None]:
# Your answer here


_(iv)_ Now, fit a linear model, with `Fertility` as the response variable, and all other variables as predictors. Use the `sm.OLS` function.

In [None]:
# Add your code here

_(v)_ Next, perform LOOCV, using the appropriate function.  

In [None]:
# Add your code here
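To make the mechanics of LOOCV concrete, here is a minimal sketch on synthetic data (not the `swiss` dataset). The variable names and the simulated coefficients are illustrative assumptions, and the loop uses plain NumPy least squares rather than any particular library's cross-validation helper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the swiss dataset): 20 observations, 3 predictors.
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # first column is the intercept
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=n)

squared_errors = []
for i in range(n):
    train = np.arange(n) != i  # boolean mask: every row except row i
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    prediction = X[i] @ beta  # predict the single held-out row
    squared_errors.append((y[i] - prediction) ** 2)

# LOOCV MSE: average of the held-out squared errors.
loocv_mse = float(np.mean(squared_errors))

# In-sample (non-validated) MSE for comparison: fit once on all rows.
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
train_mse = float(np.mean((y - X @ beta_full) ** 2))

print(f"LOOCV MSE: {loocv_mse:.4f}, in-sample MSE: {train_mse:.4f}")
```

Each pass through the loop refits the model without one observation and scores it on that observation alone, which is the idea the questions in this section probe.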

_(vi)_ What is the MSE for the LOOCV?  

In [None]:
# Add your code here

_(vii)_ Run the LOOCV for a second time (no need to repeat the code; simply run your existing code in v and vi again). Do you obtain different results? Why or why not?  

_(viii)_ Manually compute MSE for the linear model (without LOOCV) that you fit with the `sm.OLS` function, in iv. (Hint: recall that MSE is defined as the sum of squared residuals, divided by n. You can "look inside" your linear model object to find residual values). 
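The hint above can be written out as a formula. The notation is an assumption introduced here for illustration: $y_i$ is the observed response, $\hat{y}_i$ the fitted value, $e_i$ the residual, and $n$ the number of observations.

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} e_i^2 = \frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2
$$

In `statsmodels`, the residuals $e_i$ of a fitted OLS results object are available through its `resid` attribute.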

In [None]:
# Add your code here

_(ix)_ Does the LOOCV-linear model, or the non-validated linear model, appear to have greater error? Why might this be the case?   

Imagine that the `swiss` dataset has just announced a major new release, which will include data from all provinces of Europe (not just those in Switzerland), and records all the way to the present day (not just 1888).  

_(x)_ Would you choose LOOCV as a validation method for this new release? Why or why not?  

_(xi)_ What validation method might you choose instead?  


In [None]:
# Add your code here

### Question 2: Regularization via Shrinkage

For this question, we'll use the built-in `Credit` dataset from the `ISLP` library.


In [None]:
# Import standard libraries
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib.pyplot import subplots
import statsmodels.api as sm

# Import specific objects
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence \
     import variance_inflation_factor as VIF
from statsmodels.stats.anova import anova_lm

%pip install l0bnb
from l0bnb import fit_path
from ISLP import load_data
from ISLP.models import (ModelSpec as MS,
                         summarize,
                         poly)

# Install, import, and load specific package
%pip install faraway > nul 2>&1 # "> nul 2>&1" means that the install messages have been suppressed
import faraway as fw
import faraway.datasets.fat

It is very important to understand whether our data has missing values, which pandas and NumPy represent as `NaN` (the counterpart of R's `NA`). Below, show that there are 0 missing values in the dataset. (Hint: you can use the function `np.isnan()` on numeric columns to search for missing values, and wrap that with a sum, to provide a total count.)
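As a minimal sketch of counting missing values, the example below uses a hypothetical toy DataFrame (not the Credit data) that deliberately contains one `NaN`. The names `toy`, `n_missing_pandas`, and `n_missing_numpy` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the Credit data) with one missing entry.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

# pandas route: per-column counts of missing values, then the grand total.
n_missing_pandas = int(toy.isna().sum().sum())

# NumPy route: works here because every column is numeric.
n_missing_numpy = int(np.isnan(toy.to_numpy()).sum())

print(n_missing_pandas, n_missing_numpy)
```

Note that `np.isnan()` raises an error on non-numeric (string) columns, so for a mixed DataFrame the pandas `isna()` route is likely the safer choice.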

In [None]:
# Add your code here

It is also very important to visualize our data before modeling. The `sns.pairplot()` function visualizes the pairwise relationships between all variables as a matrix of scatterplots.

In [None]:
numeric_columns = my_df.select_dtypes(include=['float64', 'int64'])

# Create a scatterplot matrix
sns.set_style("ticks")
sns.pairplot(numeric_columns)

For much of our modelling, we'll make use of a separate training and testing set. Choose your favourite method to split `my_df` into equally-sized training and testing sets. For clarity, call your training set `train`, and your testing set `test`.
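One possible way to produce two equally-sized, non-overlapping halves is sketched below on a hypothetical toy DataFrame. The approach (random sampling with pandas, then dropping the sampled rows) is only one of several reasonable options, and the seed value is arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for my_df.
toy = pd.DataFrame({"x": np.arange(10), "y": np.arange(10) * 2.0})

# Randomly sample half of the rows for training (seed fixed for reproducibility).
train = toy.sample(frac=0.5, random_state=1)

# Everything that was not sampled becomes the testing set.
test = toy.drop(train.index)

print(len(train), len(test))
```

Fixing `random_state` means the same split is obtained every time the cell is rerun.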

In [None]:
# Add your code here


Shrinkage methods can "extend" or improve upon linear model fits by pushing coefficients towards zero (ridge regression) or exactly to zero (lasso), thus reducing variance. Let's perform ridge regression, using the `skl.ElasticNet()` function (where `skl` refers to `sklearn.linear_model`, imported as `import sklearn.linear_model as skl`).
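The difference between the two penalties can be illustrated with a small sketch on synthetic data (not the Credit data). The simulated coefficients and the penalty strengths `alpha=0.1` and `alpha=0.5` are arbitrary assumptions chosen for illustration. In scikit-learn, `l1_ratio=0` gives a pure ridge (L2) penalty and `l1_ratio=1` a pure lasso (L1) penalty, while `alpha` plays the role of lambda. scikit-learn may emit a convergence warning for `l1_ratio=0`.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: the second predictor has a true coefficient of zero.
X = rng.normal(size=(100, 4))
y = X @ np.array([3.0, 0.0, -2.0, 1.0]) + rng.normal(size=100)

# Standardize predictors so the penalty treats them on a common scale.
X_std = StandardScaler().fit_transform(X)

# Ridge: pure L2 penalty, shrinks coefficients towards zero.
ridge = ElasticNet(alpha=0.1, l1_ratio=0).fit(X_std, y)

# Lasso: pure L1 penalty, can set coefficients exactly to zero.
lasso = ElasticNet(alpha=0.5, l1_ratio=1).fit(X_std, y)

print("ridge coefficients:", ridge.coef_)
print("lasso coefficients:", lasso.coef_)
```

Comparing the two printed vectors shows which method shrinks and which method selects.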

In [None]:
# Load data

Use our `my_df` dataset (deriving from Credit). Let's use `Balance` as the response variable, and all other variables as predictors. 

_(i)_ A necessary first step is to get our data into the format expected. Specifically, we must provide predictor variables in a matrix, and the response variable in a vector. For clarity, call the predictor matrix `x`, and the response vector `y`. (Hint: your `x` matrix should have 400 rows and 11 columns. Verify that this is true, using built-in functions of your choice).  

In [None]:
# create a matrix of the predictor variables 

# create a vector of the response variable

Let's check out how `ModelSpec()` has transformed our data. Compare the names of variables in our matrix `x` with those in `my_df` (hint: use the `columns` attribute), and answer:

_(ii)_ Which "type" of variables (numeric, character, factor, etc.) have a new name in `x`?  

_(iii)_ Which variable in `x` has two columns dedicated to it? Why? 

_(iv)_ What variable in `my_df` is missing in x? Why might this be?    
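To see how a design matrix can differ from the original DataFrame, here is a minimal sketch using `pd.get_dummies()` on a hypothetical toy frame. This is a stand-in technique, not `ModelSpec()` itself, and the column names and values are made up for illustration. The naming scheme that `ModelSpec()` produces may differ.

```python
import pandas as pd

# Hypothetical toy frame: one numeric column and two categorical columns.
toy = pd.DataFrame({
    "Income": [14.9, 106.0, 104.6, 148.9],
    "Student": ["No", "Yes", "No", "No"],
    "Region": ["South", "West", "East", "South"],
})

# Expand categorical columns into indicator columns, dropping the first
# level of each so that it serves as the baseline.
x_toy = pd.get_dummies(toy, drop_first=True)

print(list(toy.columns))
print(list(x_toy.columns))
```

Comparing the two printed lists shows which columns keep their names and which are expanded or renamed.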

Now that we understand how our data is represented, we can move on to modelling. Fit a ridge regression model, using `skl.ElasticNet()`. (Hint: remember to set `l1_ratio=0` for the ridge penalty! In scikit-learn, `alpha` is the penalty strength, i.e., lambda.)

In [None]:
# Add your code here

_(v)_ An essential part of ridge regression (and shrinkage methods more broadly) is to identify an 'ideal' lambda value. Use the appropriate function from `sklearn` to identify this lambda value via cross-validation. (Hint: remember that `x` and `y` should not consist of the complete dataset!)

In [None]:
# Add your code here
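A minimal sketch of cross-validating the ridge penalty is given below, on synthetic data rather than the Credit data. The lambda grid, the number of folds, and all variable names are assumptions for illustration. An explicit grid is passed through `alphas` because scikit-learn appears not to generate one automatically when `l1_ratio=0`. Convergence warnings may be printed.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic stand-in for a training set.
X = rng.normal(size=(120, 4))
y = X @ np.array([3.0, 0.0, -2.0, 1.0]) + rng.normal(size=120)
X_std = StandardScaler().fit_transform(X)

# Hypothetical grid of candidate lambda values (scikit-learn calls them alphas).
lambdas = np.logspace(2, -3, 30)

# 5-fold cross-validation over the grid, with a pure ridge penalty.
ridge_cv = ElasticNetCV(alphas=lambdas, l1_ratio=0, cv=5)
ridge_cv.fit(X_std, y)

# mse_path_ holds one test MSE per (lambda, fold); average across folds.
mean_cv_mse = ridge_cv.mse_path_.mean(axis=1)

print("candidate lambdas:", ridge_cv.alphas_)
print("selected lambda:", ridge_cv.alpha_)
print("smallest mean CV error:", mean_cv_mse.min())
```

Plotting `mean_cv_mse` against `ridge_cv.alphas_` on a log scale is one way to visualize the cross-validation curve.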

_(vi)_ By default, cross-validation via `skl.ElasticNetCV()` considers n=100 lambda values (scikit-learn calls them `alphas`). The cross-validated model object that you created in the step above stores the lambda values it considered. Print them here (Hint: use dot notation to "look inside" your model; fitted attributes end with an underscore.)

In [None]:
# Add your code here

_(vii)_ Visualize your cross-validation results using `plot`. 

In [None]:
# Add your code here

_(viii)_ Now, look inside your cross-validated object to pull out the lambda value with the smallest error (Hint: the value will be that shown by the first, left-most vertical dotted line.)

In [None]:
# Add your code here

_(ix)_ In your plot, what does the second (right-most) vertical dotted line represent? (Hint: read the help documentation pertaining to `l1_ratio=0`.)

_(x)_ We can now refit ridge regression, for the entire dataset, with the ideal lambda value. Use the lambda value with the smallest error. Print the estimated coefficients. (Hint: in scikit-learn, a fitted model stores them in its `coef_` attribute.)

In [None]:
# Add your code here

_(xi)_ Did you expect any coefficients to be exactly 0? Why or why not?  

_(xii)_ The plot created above shows that the ideal 'tuning' (penalty) provided by lambda is comparatively small (one of the smallest considered by `skl.ElasticNet()`, if not the smallest). What might this suggest? In your answer, consider the nature of the `Credit` dataset.