Homework 4
==========

In this homework, we...

* extend the zeolite exercise by fitting a multivariable linear regression with Lasso regularization

* compare Lasso and Ridge regression

Problem Statement
-----------------

1. Why do we apply regularization techniques? Comment on the differences between Ridge and Lasso regularization.

<br/>

2. We provide standardized data from the zeolite discussion, augmented with random covariates. Fit this data using the following models and report the resulting $R^2$ values. Which model produces the highest $R^2$?
    
    - **Hint:** Apply the `LinearRegression`, `Ridge`, and `Lasso` classes in `sklearn.linear_model`.
    
    - **Another Hint:** `sklearn` uses the `alpha` argument as the regularization parameter, instead of $\lambda$. In the case of `sklearn.linear_model.Ridge()`, the `alpha` argument is the same as $\lambda$. In the case of `sklearn.linear_model.Lasso()`, the `alpha` argument should be set to $\lambda/2$. This is based on how the cost function is defined in the [`sklearn.linear_model.Lasso()` function](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html).

    (a) Standard linear regression

    (b) Ridge regression ($\lambda = 1$)

    (c) Lasso regression ($\lambda = 1$)

<br/>

3. Randomly isolate 50 % of the zeolite examples in what is called the "training set." Assign the remaining 50 % of examples to a "test set." You will learn more about training/test splits during week 5. **For each model** from part 2:

    (a) Fit the model to the **training set**

    (b) Evaluate $R^2$ of the fitted model on the training data

    (c) Apply this fitted model to the testing data, and report the $R^2$

    - **Note:** We are expecting six $R^2$ values to be reported for this problem.

<br/>

4. Let's compare the $R^2$ values we obtain from standard regression and Lasso regression. How do the $R^2$ values of standard and Lasso regression differ for:

    (a) the training data

    (b) the test data

<br/>

5. Why could applying Lasso regularization to your regression model cause a change in $R^2$ evaluated on test data?

    - **Hint:** You may or may not see this change in $R^2$, depending on the random train/test split that you obtain. Regardless, consider how Lasso regularization affects model generalizability.

<br/>

6. Re-fit the Lasso regression model from problem 2 on the training set for the following values of $\lambda$: 0.1, 0.25, 0.5, 0.75, 0.9. Evaluate $R^2$ of each model on the test set. Describe the effect of $\lambda$ on model generalizability.

<br/>

7. For each Lasso regression model from problem 6, plot the magnitude of coefficients for the top 20 contributing covariates (i.e., the 20 covariates with the largest magnitude coefficients). How does changing $\lambda$ affect the magnitude of these top coefficients?
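The second hint in problem 2 can be traced back to how scikit-learn scales its objectives. As documented for `Ridge` and `Lasso`, and written here with the target $\mathbf{x}$, the covariate matrix $\mathbf{T}$, the coefficients $\boldsymbol{\theta}$, and $n$ examples (this notation is chosen for illustration and may differ from the lecture notes), the two classes minimize:

$$
\begin{aligned}
\text{Ridge:}\quad & \min_{\boldsymbol{\theta}} \; \lVert \mathbf{x} - \mathbf{T}\boldsymbol{\theta} \rVert_2^2 + \alpha \lVert \boldsymbol{\theta} \rVert_2^2 \\
\text{Lasso:}\quad & \min_{\boldsymbol{\theta}} \; \frac{1}{2n} \lVert \mathbf{x} - \mathbf{T}\boldsymbol{\theta} \rVert_2^2 + \alpha \lVert \boldsymbol{\theta} \rVert_1
\end{aligned}
$$

Assuming the course writes the Lasso cost as $\frac{1}{n} \lVert \mathbf{x} - \mathbf{T}\boldsymbol{\theta} \rVert_2^2 + \lambda \lVert \boldsymbol{\theta} \rVert_1$ (an assumption; check it against the definition used in the discussion), factoring out 2 gives

$$
\frac{1}{n} \lVert \mathbf{x} - \mathbf{T}\boldsymbol{\theta} \rVert_2^2 + \lambda \lVert \boldsymbol{\theta} \rVert_1
= 2 \left[ \frac{1}{2n} \lVert \mathbf{x} - \mathbf{T}\boldsymbol{\theta} \rVert_2^2 + \frac{\lambda}{2} \lVert \boldsymbol{\theta} \rVert_1 \right],
$$

so both costs share the same minimizer when $\alpha = \lambda / 2$. The intercept is left out of this sketch because it is not penalized.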

Import modules

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.model_selection import train_test_split

Load and standardize data from the DFT training set of: Evans, Jack D., and François-Xavier Coudert. "Predicting the mechanical properties of zeolite frameworks by machine learning." *Chemistry of Materials* 29, no. 18 (2017): 7833-7839.

For demonstration purposes, we add 1000 normally distributed random covariates to the data set.
These covariates carry no information about the target, so including them in the model is expected to cause overfitting.

In [2]:
# Load the zeolite data; `x` is the target (column g_gbr) and `t` holds the covariates
data = pd.read_csv("data/zeolite_mech.csv", low_memory=False)
covariate_name = ['density', 'spg', 'volume', 'SiOSi_average', 'SiO_average', 'max_dim',
                  'largest_free_sphere', 'VolFrac', 'ASA', 'AV']
x = np.array(data['g_gbr'])
t = np.array(data[covariate_name])

# Add normally distributed, random features (no seed is set, so results vary between runs)
len_rand_coefs = 1000
fake_data = np.random.normal(size=(len(data), len_rand_coefs))
t = np.concatenate((t, fake_data), axis=1)
fake_coefs = ["rand_" + str(i) for i in range(len_rand_coefs)]  # names of the random covariates
covariate_name = covariate_name + fake_coefs

# Standardize: subtract the mean and divide by the sample standard deviation (ddof=1)
n_e, n_t = t.shape  # number of examples, number of covariates
x_cen = (x - x.mean()) / x.std(ddof=1)
t_cen = (t - t.mean(axis=0)) / t.std(axis=0, ddof=1)
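With standardized arrays in hand, the fitting and train/test scoring described in problems 2 and 3 could look roughly like the sketch below. It is not a solution to the homework: it runs on synthetic stand-in data (`t_demo`, `x_demo`, and the value of `lam` are hypothetical), follows the hint's `alpha` convention for `Ridge` and `Lasso`, and uses each estimator's `score` method, which returns $R^2$.

```python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Synthetic stand-in for (t_cen, x_cen): 5 informative + 200 pure-noise covariates
rng = np.random.default_rng(0)
n_samples, n_informative, n_noise = 120, 5, 200
t_demo = rng.normal(size=(n_samples, n_informative + n_noise))
true_coefs = np.zeros(n_informative + n_noise)
true_coefs[:n_informative] = [1.0, -0.8, 0.6, 0.5, -0.4]
x_demo = t_demo @ true_coefs + 0.3 * rng.normal(size=n_samples)

lam = 0.2  # hypothetical regularization strength, chosen only for this demo
models = {
    "linear": linear_model.LinearRegression(),
    "ridge": linear_model.Ridge(alpha=lam),       # alpha = lambda, per the hint
    "lasso": linear_model.Lasso(alpha=lam / 2),   # alpha = lambda / 2, per the hint
}

# 50/50 split into training and test sets
t_train, t_test, x_train, x_test = train_test_split(
    t_demo, x_demo, test_size=0.5, random_state=0
)

# Fit on the training set only, then score on both sets
scores = {}
for name, model in models.items():
    model.fit(t_train, x_train)
    scores[name] = (model.score(t_train, x_train), model.score(t_test, x_test))
    print(f"{name:>6}: train R^2 = {scores[name][0]:.3f}, test R^2 = {scores[name][1]:.3f}")
```

Because this toy problem has more covariates than training examples, the unregularized fit can reproduce the training targets essentially exactly, which is why the training $R^2$ alone says little about generalization.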

Answers
-------

1. Why do we apply regularization techniques? Comment on the differences between Ridge and Lasso regularization.

2. We provide standardized data from the zeolite discussion, augmented with random covariates. Fit this data using the following models and report the resulting $R^2$ values. Which model produces the highest $R^2$?
    
    - **Hint:** Apply the `LinearRegression`, `Ridge`, and `Lasso` classes in `sklearn.linear_model`.
    
    - **Another Hint:** `sklearn` uses the `alpha` argument as the regularization parameter, instead of $\lambda$. In the case of `sklearn.linear_model.Ridge()`, the `alpha` argument is the same as $\lambda$. In the case of `sklearn.linear_model.Lasso()`, the `alpha` argument should be set to $\lambda/2$. This is based on how the cost function is defined in the [`sklearn.linear_model.Lasso()` function](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html).

    (a) Standard linear regression

    (b) Ridge regression ($\lambda = 1$)

    (c) Lasso regression ($\lambda = 1$)

3. Randomly isolate 50 % of the zeolite examples in what is called the "training set." Assign the remaining 50 % of examples to a "test set." You will learn more about training/test splits during week 5. **For each model** from part 2:

    (a) Fit the model to the **training set**

    (b) Evaluate $R^2$ of the fitted model on the training data

    (c) Apply this fitted model to the testing data, and report the $R^2$

    - **Note:** We are expecting six $R^2$ values to be reported for this problem.

4. Let's compare the $R^2$ values we obtain from standard regression and Lasso regression. How do the $R^2$ values of standard and Lasso regression differ for:

    (a) the training data

    (b) the test data

5. Why could applying Lasso regularization to your regression model cause a change in $R^2$ evaluated on test data?

    - **Hint:** You may or may not see this change in $R^2$, depending on the random train/test split that you obtain. Regardless, consider how Lasso regularization affects model generalizability.

6. Re-fit the Lasso regression model from problem 2 on the training set for the following values of $\lambda$: 0.1, 0.25, 0.5, 0.75, 0.9. Evaluate $R^2$ of each model on the test set. Describe the effect of $\lambda$ on model generalizability.

7. For each Lasso regression model from problem 6, plot the magnitude of coefficients for the top 20 contributing covariates (i.e., the 20 covariates with the largest magnitude coefficients). How does changing $\lambda$ affect the magnitude of these top coefficients?
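The $\lambda$ sweep and coefficient plots from problems 6 and 7 could be organized along the lines of the sketch below. It again uses synthetic stand-in data rather than the zeolite set (`t_demo`, `x_demo`, and the covariate names `cov_i` are hypothetical), keeps the hint's `alpha = lambda / 2` convention, and ranks covariates by the absolute value of `coef_`.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 5 informative covariates, the rest are noise
rng = np.random.default_rng(1)
n_samples, n_features = 120, 100
t_demo = rng.normal(size=(n_samples, n_features))
true_coefs = np.zeros(n_features)
true_coefs[:5] = [1.0, -0.8, 0.6, 0.5, -0.4]
x_demo = t_demo @ true_coefs + 0.3 * rng.normal(size=n_samples)
names = np.array([f"cov_{i}" for i in range(n_features)])  # hypothetical names

t_train, t_test, x_train, x_test = train_test_split(
    t_demo, x_demo, test_size=0.5, random_state=1
)

lambdas = [0.1, 0.25, 0.5, 0.75, 0.9]
top_k = 20
test_r2 = {}   # lambda -> R^2 on the test set
coef_l1 = {}   # lambda -> sum of absolute coefficient values

fig, axes = plt.subplots(1, len(lambdas), figsize=(4 * len(lambdas), 5), sharex=True)
for ax, lam in zip(axes, lambdas):
    model = linear_model.Lasso(alpha=lam / 2).fit(t_train, x_train)
    test_r2[lam] = model.score(t_test, x_test)

    magnitudes = np.abs(model.coef_)
    coef_l1[lam] = magnitudes.sum()

    # Indices of the top_k largest-magnitude coefficients, largest first
    order = np.argsort(magnitudes)[::-1][:top_k]

    # Reverse so the largest bar is drawn at the top of the horizontal bar chart
    ax.barh(names[order][::-1], magnitudes[order][::-1])
    ax.set_title(f"lambda = {lam}")
    ax.set_xlabel("|coefficient|")

fig.tight_layout()
```

Sharing the x-axis across panels makes the shrinkage of the coefficients directly comparable between values of $\lambda$; `test_r2` holds the corresponding test-set $R^2$ values for the discussion in problem 6.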