Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = "" # put your full name here
COLLABORATORS = [] # list anyone you collaborated with on this workbook

## Lab 8: Model selection and regularization

**This lab was distributed Monday 10/21/2019 and should be completed by Monday 10/28/2019 at 11:59PM.**

Welcome to the eighth lab of the semester!

In this lab, we'll cover model selection and regularization (ISLR 6.1-6.2). Before we get into that, let's review what we've done so far in terms of modeling:
* in lab 5, we covered linear regression by using ordinary least-squares (OLS) optimization to find the intercept and slope coefficient for a linear relationship
* in lab 7, we covered gradient descent, which is a way to find the coefficients of a model (in our case, a linear model) by iteratively updating the coefficients using the gradient (the partial derivatives of the model loss with respect to each coefficient) until we find a set of coefficients that minimizes the loss function
* also in lab 7, we covered cross-validation, which is a way to find out how well a model performs with new data

Model selection and regularization relate in some way to all of these topics. Model selection and regularization approaches give us a method to select which variables (i.e. features) to include in our model, as well as to select the values of the coefficients that should be associated with those variables.

Cross-validation, which we covered most recently, also allows us to select variables by comparing the cross-validation error of different models that include different numbers of variables (homework 7 is a good example of this). However, while cross-validation works by running a separate cross-validation for each potential combination of variables, the model selection approaches (like subset selection) covered in ISLR 6.1-6.2 work by iteratively and strategically adding variables to the model until the "best" model is achieved.
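
As a rough illustration of "iteratively and strategically adding variables", here is a minimal sketch of greedy forward selection. It is not part of the lab exercises: the data is synthetic, the names `X_demo`, `y_demo`, and `cv_mse` are made up for this example, and scoring each candidate by 5-fold cross-validated MSE is just one of several possible criteria.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption, not the lab dataset):
# only columns 0 and 3 actually influence the response.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
y_demo = 5.0 * X_demo[:, 0] - 3.0 * X_demo[:, 3] + rng.normal(scale=0.5, size=200)

def cv_mse(columns):
    """Average 5-fold cross-validated MSE of OLS using only the given columns."""
    scores = cross_val_score(LinearRegression(), X_demo[:, columns], y_demo,
                             cv=5, scoring="neg_mean_squared_error")
    return -scores.mean()

selected = []                               # features chosen so far
remaining = list(range(X_demo.shape[1]))    # features not yet in the model

# Greedily add the single feature that lowers the CV error the most, twice.
for _ in range(2):
    best = min(remaining, key=lambda j: cv_mse(selected + [j]))
    selected.append(best)
    remaining.remove(best)

print("selected columns:", selected)
```

Each step fits only as many models as there are remaining features, rather than one model for every possible combination.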

Model selection and regularization are also related to OLS in that they provide an alternative approach to choosing model coefficients. Rather than just minimizing the loss, techniques like ridge or lasso regression add a penalty term that penalizes the model for having large coefficients or too many features, pushing some of the coefficients to zero or close to zero.

Regularization methods like ridge or lasso can also work *with* cross-validation - for instance, in this lab, we're going to see what happens when we change the tunable parameter $\lambda$ in ridge and lasso regression. We're going to see how different values of $\lambda$ perform on a random train-test split of the data. Although we don't do it in this lab, a more systematic and rigorous way of evaluating different values of $\lambda$ would be to compare them using leave-one-out or k-fold cross-validation.
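
For reference, a k-fold comparison of this kind might look roughly like the sketch below. It is only an illustration on synthetic data (`X_demo`, `y_demo`, and `candidate_alphas` are names invented here), and it uses scikit-learn's `alpha` argument to set the regularization strength.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption, not the lab dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 10))
true_coefs = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0, 0, 0])
y_demo = X_demo @ true_coefs + rng.normal(scale=1.0, size=100)

candidate_alphas = [0.01, 0.1, 1, 10, 100]
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# For each regularization strength, average the test MSE over the 5 folds.
cv_mse = {}
for a in candidate_alphas:
    scores = cross_val_score(Ridge(alpha=a), X_demo, y_demo,
                             cv=kfold, scoring="neg_mean_squared_error")
    cv_mse[a] = -scores.mean()  # scikit-learn reports negated MSE

best_alpha = min(cv_mse, key=cv_mse.get)
print(cv_mse)
print("alpha with lowest cross-validated MSE:", best_alpha)
```

Because every observation is used for testing exactly once, the resulting error estimate depends less on the luck of a single train-test split.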

In this lab, we'll focus on ridge and lasso regression: how to implement them, and how their prediction error and coefficients compare.

### Setup

In [None]:
# Run this block.
import numpy as np
import pandas as pd
import sklearn

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import seaborn as sns
sns.set_context("talk")
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

### Section 1: EDA and data filtering

You will be working with the Novotny et al. land-use regression dataset used in HW5.  Here's a refresher about the data:

* The data is an accumulation of GIS land-use characteristics from EPA land-monitoring, and in situ NO2 measurements from satellite sensors.
* The goal of land-use regression (LUR) is to estimate outdoor air pollution geospatially across the contiguous United States.
* The reason for the high number of data points is that the data keeps track of readings from monitors at a high resolution, up to ~30 meters.

We will only be working with a small subset of this data in the lab.

In [None]:
#run to load the dataset we'll be working with
df = pd.read_csv('data/BechleLUR_2006_allmodelbuildingdata.csv')

In [None]:
df.head()

**Question 1.1** What is the column name of the target (response) variable for our model? Display the Pandas series containing the response variable.

Reminder: the target variable will allow us to estimate atmospheric $NO_2$ levels at different points in space.

In [None]:
# Your code here

**Question 1.2** Create a dataframe `df_model` that contains only the response and predictor variables (i.e. you should drop Monitor_ID, State, Latitude, Longitude, and Predicted_NO2_ppb).

In [None]:
df_model = ...

**Question 1.3** We have a lot of potential features in our dataset, and it's hard to visualize all of them in relation to our response variable. To gain some familiarity with the data, however, create a plot with 4 subplots below, and generate 4 scatterplots, each showing a different potential feature on the x-axis and the response variable on the y-axis. Do you observe any trends or relationships? Visually, would you expect a model selection algorithm to prioritize or minimize any of these features? Why?

The code below is mostly written for you - you just need to choose four features to plot. You're welcome to change the formatting if you'd like.

In [None]:
y = df['Observed_NO2_ppb'] # y axis variable
ylab = "Observed atmospheric $NO_2$ (ppb)" # y axis label

msize = 70 # marker size
afsize = 15 # axis font size
tfsize = 20 # title font size

plt.figure(figsize = (20,10))

plt.subplot(221)
plt.scatter(..., y, s = msize)
plt.xlabel(..., fontsize = afsize)
plt.ylabel(ylab, fontsize = afsize)

plt.subplot(222)
plt.scatter(..., y, s = msize)
plt.xlabel(..., fontsize = afsize)
plt.ylabel(ylab, fontsize = afsize)

plt.subplot(223)
plt.scatter(..., y, s = msize)
plt.xlabel(..., fontsize = afsize)
plt.ylabel(ylab, fontsize = afsize)

plt.subplot(224)
plt.scatter(..., y, s = msize)
plt.xlabel(..., fontsize = afsize)
plt.ylabel(ylab, fontsize = afsize)

plt.suptitle(..., fontsize = tfsize)

# leave room at the top of the figure so the title and plots don't overlap
plt.tight_layout(rect=[0, 0, 1, 0.95])

*Your answer here*

---

Now that we've loaded and done some basic exploration of the data, we can think about how to choose which features to include in the model. Features can provide important information and predictive power. However, adding features to the model means we risk increasing its variance (meaning our model performs poorly on test data relative to training data) and reducing its interpretability (it's harder to make sense of a model with lots of features). Rather than throwing out features entirely, we can turn to a technique called regularization to reduce the variance of our model while still incorporating as much information about the data as possible.
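
To make the variance point concrete, here is a small sketch on synthetic data (not the lab dataset; all names are invented for this example). One column carries real signal and the rest are pure noise, and we compare training and test MSE as more noise columns are added to an unregularized linear model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption): one informative feature, 40 irrelevant ones.
rng = np.random.default_rng(42)
n = 60
signal = rng.normal(size=(n, 1))
y_demo = 2.0 * signal[:, 0] + rng.normal(scale=1.0, size=n)
noise = rng.normal(size=(n, 40))

results = {}  # number of noise features -> (train MSE, test MSE)
for n_noise in [0, 10, 40]:
    X_demo = np.hstack([signal, noise[:, :n_noise]])
    # Same random_state, so every model sees the same train/test rows.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.25, random_state=0)
    model = LinearRegression().fit(X_tr, y_tr)
    results[n_noise] = (
        mean_squared_error(y_tr, model.predict(X_tr)),
        mean_squared_error(y_te, model.predict(X_te)),
    )

for n_noise, (train_mse, test_mse) in results.items():
    print(f"{n_noise:2d} noise features: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

The training error can only go down as features are added, while the test error typically gets worse once the model starts fitting noise. That widening gap is the variance problem regularization is meant to control.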

More generally, we can adopt the framework of regularized loss minimization.

$$ \large \hat{\theta} = \arg \min_\theta \frac{1}{n} \sum_{i=1}^n \textbf{Loss}\left(y_i, \hat{y_i}\right) + \lambda \textbf{R}(\theta) $$

The regularization term $\textbf{R}(\theta)$ penalizes $\theta$ values that result in more complex, and therefore higher-variance, models. The regularization parameter $\lambda$ determines the degree of regularization to apply and is typically determined through cross-validation.

The two regularization methods that we're exploring in this lab (ridge regression and lasso regression) use different functions for the loss function and for the regularization term $\textbf{R}(\theta)$.
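
The objective above can be written out directly in NumPy. The sketch below assumes a squared-error loss and a linear model with no intercept, and it uses the sum-of-squares and sum-of-absolute-values penalties that ridge and lasso rely on. The function name `regularized_loss` and the toy arrays are made up for illustration.

```python
import numpy as np

def regularized_loss(theta, X, y, lam, penalty="l2"):
    """Mean squared error of the linear model X @ theta, plus lam * R(theta)."""
    residuals = y - X @ theta
    mse = np.mean(residuals ** 2)
    if penalty == "l2":        # ridge-style penalty: sum of squared coefficients
        reg = np.sum(theta ** 2)
    elif penalty == "l1":      # lasso-style penalty: sum of absolute coefficients
        reg = np.sum(np.abs(theta))
    else:
        raise ValueError("penalty must be 'l1' or 'l2'")
    return mse + lam * reg

# Toy data that theta_toy fits perfectly, so the MSE part is exactly 0
# and only the penalty contributes to the objective.
X_toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_toy = np.array([1.0, 2.0, 3.0])
theta_toy = np.array([1.0, 2.0])

print(regularized_loss(theta_toy, X_toy, y_toy, lam=0.0))                # no penalty
print(regularized_loss(theta_toy, X_toy, y_toy, lam=0.1, penalty="l2"))  # 0.1 * (1 + 4)
print(regularized_loss(theta_toy, X_toy, y_toy, lam=0.1, penalty="l1"))  # 0.1 * (1 + 2)
```

Even a perfect fit pays a price when $\lambda > 0$, which is why the minimizer of the regularized objective trades some training accuracy for smaller coefficients.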

### Section 2: L2 Regularization with Ridge Regression


$L_2$ regularization is a method of penalizing large weights in a cost function in order to lower model variance.

Ridge regression (L2 regularization) uses the *penalty* term $\large R_{L^2}(\theta) = \sum_{k=1}^p (\theta_k)^2$, where $p$ is the number of model features.

Note that $\lambda$ is a tunable parameter - as the person creating the model, you can choose to increase or decrease $\lambda$ based on how strongly you want to penalize large coefficients. The higher the value of $\lambda$, the more the model is penalized for large coefficient values, and the more the coefficients shrink toward zero. This penalization decreases the model's variance at the cost of increasing its bias.
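
The shrinking effect of $\lambda$ can be seen with the closed-form ridge solution. This is a sketch under simplifying assumptions: synthetic data, no intercept, and the un-averaged sum of squared errors as the loss (which only rescales $\lambda$). The helper `ridge_closed_form` is a name invented for this example.

```python
import numpy as np

# Synthetic data (an assumption, not the lab dataset).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([4.0, -3.0, 0.5]) + rng.normal(scale=0.5, size=50)

def ridge_closed_form(X, y, lam):
    """Minimizer of ||y - X theta||^2 + lam * ||theta||^2 (no intercept)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

lambdas = [0.0, 1.0, 10.0, 100.0, 1000.0]
norms = []
for lam in lambdas:
    theta = ridge_closed_form(X_demo, y_demo, lam)
    norms.append(np.linalg.norm(theta))
    print(f"lambda = {lam:7.1f}  theta = {np.round(theta, 3)}")
```

With `lam = 0` this reduces to ordinary least squares. As `lam` grows, the overall size of the coefficient vector shrinks toward zero, but the individual coefficients do not become exactly zero.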

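In scikit-learn, the regularization strength is passed in through the argument `alpha`, which plays the role of $\lambda$: a larger `alpha` means a stronger penalty. (`Ridge` minimizes the un-averaged sum of squared errors plus `alpha` times the penalty, so relative to the formula above, `alpha` and $\lambda$ differ only by a constant factor.)
$$\alpha \propto \lambda$$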

**Question 2.1** Separate the `df_model` dataframe into train and test sets, with 20% of the data in the test set. Begin by setting `X` to the matrix of predictor variables (all quantitative columns in the dataframe except the response variable) and set `y` equal to the response variable `Observed_NO2_ppb`. Then apply `train_test_split` to `X` and `y` to split the data.

In [None]:
from sklearn.model_selection import train_test_split

X = ...
y = ...

X_train, X_test, y_train, y_test = ...

In [None]:
#run this to make sure you split the data correctly
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

**Question 2.2** Import and create a Ridge regression model with `alpha` value set to 1. Fit the model to the training data, then return a list of the coefficients that the model estimates for each feature in the training data. The [scikit-learn documentation for Ridge()](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html) is helpful here.

In [None]:
from sklearn.linear_model import Ridge

ridge = ...
ridge.fit(...)
ridge_coefficients = ...

print(ridge_coefficients)

**Question 2.3** Now fit a `LinearRegression` model without regularization and print the resulting list of coefficients.

In [None]:
from sklearn.linear_model import LinearRegression

lm = ...
lm.fit(...)
lm_coefficients = ...

print(lm_coefficients)

**Question 2.4** Run the code below to generate a bar chart that shows the coefficient values from simple linear regression in blue, and from ridge regression in red. Then, in the markdown cell below, comment on the results. Can you explain your observations based on your understanding of L2 ridge regression?

In [None]:
# run this cell
ind = np.arange(len(lm_coefficients))
width = 0.5

plt.figure(figsize = (15,7))

plt.bar(ind-(width/2), width = width, height = lm_coefficients, label = "simple linear regression")
plt.bar(ind+(width/2), width = width, height = ridge_coefficients, label = r"ridge regression, $\alpha$ = 1")
plt.xlabel("feature number")
plt.ylabel("coefficient")
plt.title("Coefficient values with simple linear regression and ridge regression")
plt.legend()
plt.show()

*Your answer here*

**Question 2.5**: We just observed how the Ridge Regression model generates coefficients when `alpha` is set to one. Complete the following code, which generalizes the fitting process from Question 2.2 to various values of `alpha` and uses each fitted model to predict on the test set.

The code then calculates the mean squared error (MSE) between our predictions and the test dataset. The MSE in this case is a measure of the accuracy of our predictions.
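
As a quick reminder of what the MSE measures, the sketch below computes it by hand on a tiny made-up example and checks the result against scikit-learn's `mean_squared_error`. The arrays `y_true_toy` and `y_pred_toy` are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up observed values and predictions.
y_true_toy = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_toy = np.array([2.0, 5.0, 8.0, 12.0])

# Errors are 1, 0, -1, -3, so the squared errors are 1, 0, 1, 9.
manual_mse = np.mean((y_true_toy - y_pred_toy) ** 2)      # (1 + 0 + 1 + 9) / 4
sklearn_mse = mean_squared_error(y_true_toy, y_pred_toy)  # argument order: (y_true, y_pred)

print(manual_mse, sklearn_mse)  # both are 2.75
```

Because the errors are squared, one large miss (the last point) dominates the average.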

In [None]:
from sklearn.metrics import mean_squared_error
alphas = [1e-2, 1e-1, 1, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8, 1e9]
mses = []

for a in alphas:
    model = ...
    model.fit(...)
    y_pred = model.predict(...)
    mses.append(mean_squared_error(y_test, y_pred))

print(mses)

a_log = np.log10(alphas)

plt.figure(figsize = (15,5))
sns.barplot(x = a_log, y = mses, color = 'cadetblue') # recent seaborn versions require keyword arguments for x and y
plt.xlabel(r'$log_{10}(\alpha)$')
plt.ylabel('MSE')
plt.title('Ridge regression MSE for each value of alpha');

**Question 2.6** What is the lowest MSE observed and which value of alpha did it come from? What value of $\lambda$ does that correspond to? Does the value of $\lambda$ that minimizes MSE more heavily or less heavily penalize additional coefficients than our initial value of $\lambda$ that we used to produce the plot in question 2.4?

In [None]:
# Your code here

*Your answer here*

**Question 2.7** How does ridge regression using the value of `alpha` identified in question 2.6 perform relative to simple linear regression with respect to the mean squared error?

In [None]:
# Your code here

*Your answer here*

### Section 3: L1 Regularization with Lasso Regression

While ridge regression shrinks coefficients, it still incorporates *all* the features into your model. It won't actually drive any coefficients to exactly 0 (unless $\lambda = \infty$!). This can make your model less *interpretable* - for instance, in the case of the model we created in Section 2, we have over 130 non-zero coefficients and thus over 130 features.

Lasso regression (also called L1 Regularization) avoids the issue of including too many unimportant variables by using a model formulation that can drive some coefficients to 0.

Lasso regression uses the *penalty* term $\large R_{L^1}(\theta) = \sum_{k=1}^p \Big|\theta_k\Big|$, where $p$ is the number of model features.
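
One way to see why the L1 penalty produces exact zeros is the simplified special case discussed in ISLR 6.2, where each coefficient can be solved for separately (an identity design matrix and no intercept). In that setting, with the un-averaged sum of squared errors as the loss, ridge rescales every least-squares coefficient by $1/(1+\lambda)$, while lasso subtracts $\lambda/2$ from each magnitude and clips at zero. The sketch below assumes that special case. The function names and the example coefficients are invented for illustration.

```python
import numpy as np

def ridge_shrink(beta_ols, lam):
    """Ridge solution in the special case: proportional shrinkage."""
    return beta_ols / (1.0 + lam)

def lasso_shrink(beta_ols, lam):
    """Lasso solution in the special case: soft-thresholding at lam / 2."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam / 2.0, 0.0)

beta_ols = np.array([3.0, -1.5, 0.4, -0.2])  # hypothetical least-squares coefficients
lam = 1.0

ridge_result = ridge_shrink(beta_ols, lam)   # every coefficient halved, none zero
lasso_result = lasso_shrink(beta_ols, lam)   # magnitudes below 0.5 become exactly zero

print("ridge:", ridge_result)
print("lasso:", lasso_result)
print("coefficients zeroed by lasso:", int(np.sum(lasso_result == 0)))
```

Real datasets do not satisfy this special case, so the fitted values will differ, but the qualitative behavior carries over: ridge shrinks all coefficients smoothly, and lasso sets small ones to exactly zero.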

**Question 3.1** Let's repeat the steps we did above for Ridge Regression, this time for Lasso Regression. Create a Lasso model with an `alpha` of 1 and fit on the X_train and y_train dataset.

In [None]:
from sklearn.linear_model import Lasso

lasso = ...
lasso.fit(...)
lasso_coefficients = ...
print(lasso_coefficients)

**Question 3.2** Output a plot that shows the coefficients from the simple linear regression in part 2, the ridge regression in part 2, and the lasso regression above side-by-side. You can adapt the code from question 2.4 or write your own, and can choose whatever plot format makes the most sense.

In [None]:
# Your code here

**Question 3.3:** Comment on the results in question 3.2. Can you explain your observations based on your understanding of L1 lasso regression?

*Your answer here*

**Question 3.4** What proportion of the dataset's features are "ignored" by this lasso model? What are the column names of the features that are **not** ignored by this lasso model?

In [None]:
# Your code here

**Question 3.5**: Look back to the features you plotted in question 1.3. Were any of those features ignored or included by the lasso model?

*Your answer here*

**Question 3.6**: Remember how we calculated the test MSE for different values of $\alpha$ in question 2.5? Now, we're going to write a function that automates that process, taking as input a list of alphas `alphas` and a model (`Ridge` or `Lasso`). Complete the function below, and then define a list of alphas and call the function using the `Lasso` model to return a list of MSEs.

In [None]:
def calculate_mses(alphas, Model):
    """
    Input:
        alphas (array): contains floats of various alpha values
        Model (sklearn model): the type of sklearn model on which to fit the data
    Output:
        an array of floats containing the mean-squared-errors from the predictions
    """
    mses = []

    for a in alphas:
        ... # Your code here

    return mses

In [None]:
# output lasso MSEs

**Question 3.7**: How well does the Lasso Regression model perform against the Ridge Regression model from before? Calculate the ridge MSEs and the lasso MSEs using the same set of `alphas`, then plot the two series against each other using whatever type of plot makes the most sense.

In [None]:
alphas = ...
lasso_mses = ...
ridge_mses = ...

# plot lasso vs ridge MSEs

**Question 3.8** Explain the plot we generated above. Which model performs more consistently on the test data across various values of alpha? Why might this be the case?

*Your answer here*

# Hooray, you're done! 

Please remember to submit your lab work, after clicking Kernel -> Restart & Run All, in .html and .ipynb format on bCourses.

Further Reading:

Regularization - https://www.textbook.ds100.org/ch/16/reg_intro.html
    
Notebook developed by Alex McMurry, Kevin Marroquin, and Melissa Ly

Data Science Modules: http://data.berkeley.edu/education/modules