# Module 2 homework

**This homework has 4 questions.**

In [9]:
import numpy as np

from sklearn.datasets import make_regression
from sklearn.linear_model import (
    Lasso,
    LinearRegression
)
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
import sklearn.linear_model
import sklearn.metrics
import sklearn.model_selection
import sklearn.pipeline
import sklearn.preprocessing

Let's simulate some data for a regression problem.

In [10]:
n = 1000  # number of observations in the data
p = 100   # number of predictors in the data
k = 10    # number of relevant predictors in the data

In [20]:
X, y, coef = make_regression(n_samples=n,
                             n_features=p,
                             n_informative=k,
                             noise=0.005,  # add a little Gaussian noise to the data
                             coef=True,
                             random_state=42)

Keep in mind that *we* generate the data here, therefore we know the "truth" about the data generating process: only 10 predictors are relevant out of the $p = 100$ predictors available in the data.

In fact, here are the true coefficients of the data generating process (only the 10 non-zero coefficients are printed):

In [21]:
print('\n'.join([f'beta_{i}: {coef[i]}' for i in range(p) if coef[i] != 0.0]))

beta_6: 8.88918861576965
beta_15: 19.365117777553486
beta_47: 48.24477094388765
beta_56: 82.84660305904961
beta_66: 1.8559304095025264
beta_73: 29.506960083972224
beta_80: 68.28011328758366
beta_85: 78.46795175040386
beta_87: 25.37933359399561
beta_90: 57.00013284018336


We would expect that given a sample of $n = 1000$ data points from this data generating process, a LASSO model will be able to select exactly 10 predictors and discard the remaining ones (i.e. set their $\widehat{\beta}$ estimates to 0).

Let's verify empirically whether this is in fact the case.

## Question 1 (3 points)

Set up a model pipeline that uses the `StandardScaler` to normalize the data and feed them to a `Lasso` model.

- For this, you can use the `make_pipeline` utility in the `sklearn` library.

- Assign the pipeline object to a variable named `lasso_pipeline`.

## Answer 1

In [22]:
lasso_pipeline = sklearn.pipeline.make_pipeline(sklearn.preprocessing.StandardScaler(),
                                                sklearn.linear_model.Lasso())
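
The parameter key `lasso__alpha` that appears in Question 2 comes from the step names that `make_pipeline` generates automatically. The following is a minimal sketch of how to inspect them; `demo_pipeline`, `step_names`, and `lasso_alpha_key` are illustrative names that are not part of the homework.

```python
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Rebuild the same two-step pipeline as in Answer 1 (illustrative copy).
demo_pipeline = make_pipeline(StandardScaler(), Lasso())

# make_pipeline names each step after its lowercased class name.
step_names = list(demo_pipeline.named_steps)
print(step_names)

# Step parameters are addressed as '<step name>__<parameter name>'.
lasso_alpha_key = 'lasso__alpha'
pipeline_params = demo_pipeline.get_params()
print(lasso_alpha_key in pipeline_params)
print(pipeline_params[lasso_alpha_key])  # the default alpha of Lasso
```

This is why the grid passed to `GridSearchCV` uses `lasso__alpha` rather than plain `alpha`.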

Here we create a grid of candidate values for the regularization parameter $\alpha$. By default, `np.logspace` returns 50 values, equally spaced on the log scale; here they range from $10^{-4}$ to $10^{-2}$.

In [16]:
alpha_candidates = np.logspace(-4.0, -2.0)
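
As a quick check of what this grid contains, the sketch below inspects its size, its endpoints, and its spacing. The name `alpha_grid` is illustrative and stands in for `alpha_candidates`.

```python
import numpy as np

# Same call as above: num defaults to 50 and base defaults to 10.
alpha_grid = np.logspace(-4.0, -2.0)

print(len(alpha_grid))
print(alpha_grid[0], alpha_grid[-1])

# Equal spacing on the log scale means a constant ratio between neighbours.
ratios = alpha_grid[1:] / alpha_grid[:-1]
constant_ratio = bool(np.allclose(ratios, ratios[0]))
print(constant_ratio)
```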

## Question 2 (3 points)

- Use `GridSearchCV` with 2-fold cross-validation to find the best value of $\alpha$ for these data.

- Assign the fitted `GridSearchCV` object to a variable named `alpha_search`.

This will look something like

```
GridSearchCV(lasso_pipeline,
             {'lasso__alpha': alpha_candidates},
             cv=<number of folds>).fit(<data>)
```

where `<number of folds>` and `<data>` are placeholders for code that you need to write.

## Answer 2

In [23]:
alpha_search = sklearn.model_selection.GridSearchCV(lasso_pipeline,
                                                    {'lasso__alpha': alpha_candidates},
                                                    cv=2).fit(X, y)
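
Once the search is fitted, the selected value of $\alpha$ and its cross-validated score can be read from the fitted object. The sketch below rebuilds the data and the search so that it runs on its own; `X_demo`, `y_demo`, `alpha_grid`, and `search` are illustrative names, and the score is the default for regressors ($R^2$).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same simulated data as in the notebook.
X_demo, y_demo = make_regression(n_samples=1000, n_features=100,
                                 n_informative=10, noise=0.005,
                                 random_state=42)
alpha_grid = np.logspace(-4.0, -2.0)

search = GridSearchCV(make_pipeline(StandardScaler(), Lasso()),
                      {'lasso__alpha': alpha_grid},
                      cv=2).fit(X_demo, y_demo)

# The alpha with the highest mean cross-validated score.
best_alpha = search.best_params_['lasso__alpha']
print(best_alpha)

# Mean cross-validated R^2 of that alpha.
print(search.best_score_)

# One mean score per candidate alpha.
n_candidates = len(search.cv_results_['mean_test_score'])
print(n_candidates)
```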

Here we count how many of the model coefficients for the best LASSO model (i.e. the one
corresponding to the optimal value of $\alpha$ selected via cross-validation) are non-zero.

In [25]:
try:
    print(np.sum(alpha_search.best_estimator_[1].coef_ != 0))
except NameError:
    print('The object `alpha_search` does not exist! Did you forget to create it in Question 2?')

10


In particular, *which* coefficients are non-zero?

In [27]:
try:
    print('\n'.join([f'beta_hat_{i}: {alpha_search.best_estimator_[1].coef_[i]}'
           for i in range(p)
           if alpha_search.best_estimator_[1].coef_[i] != 0.0]))
except NameError:
    print('The object `alpha_search` does not exist! Did you forget to create it in Question 2?')

beta_hat_6: 8.687958221359063
beta_hat_15: 19.726802408850183
beta_hat_47: 48.347385382805285
beta_hat_56: 84.70297559430583
beta_hat_66: 1.905424924453863
beta_hat_73: 28.963729026956308
beta_hat_80: 62.93734556951447
beta_hat_85: 76.56912272269763
beta_hat_87: 25.280346286439105
beta_hat_90: 58.505816871254616


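For reference, here are the true $\beta$ values that these coefficients are trying to estimate. Because the LASSO was fitted on standardized predictors, its estimates are expressed per standard deviation of each predictor, so they are not expected to match the true values exactly: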

In [28]:
print('\n'.join([f'beta_{i}: {coef[i]}' for i in range(p) if coef[i] != 0.0]))

beta_6: 8.88918861576965
beta_15: 19.365117777553486
beta_47: 48.24477094388765
beta_56: 82.84660305904961
beta_66: 1.8559304095025264
beta_73: 29.506960083972224
beta_80: 68.28011328758366
beta_85: 78.46795175040386
beta_87: 25.37933359399561
beta_90: 57.00013284018336
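
To compare the two lists more directly, one can check that every truly relevant predictor is selected and convert the estimates back to the original scale of the predictors. The sketch below does this under one loud assumption: it fits the LASSO with a fixed `alpha=0.01` for illustration, not with the cross-validated value. All variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same simulated data as in the notebook.
X_demo, y_demo, true_coef = make_regression(n_samples=1000, n_features=100,
                                            n_informative=10, noise=0.005,
                                            coef=True, random_state=42)

# ASSUMPTION: alpha=0.01 is an illustrative choice, not the tuned value.
pipe = make_pipeline(StandardScaler(), Lasso(alpha=0.01)).fit(X_demo, y_demo)
scaler = pipe.named_steps['standardscaler']
lasso = pipe.named_steps['lasso']

# Indices of the truly relevant predictors and of the selected ones.
true_support = set(np.flatnonzero(true_coef))
selected = set(np.flatnonzero(lasso.coef_))
all_relevant_selected = true_support <= selected
print(all_relevant_selected)

# The LASSO coefficients are per standard deviation of each predictor;
# dividing by the fitted standard deviations gives per-unit coefficients.
coef_original_scale = lasso.coef_ / scaler.scale_
print(np.max(np.abs(coef_original_scale - true_coef)))
```

The same division by `scale_` could be applied to `alpha_search.best_estimator_`, whose first step is the fitted `StandardScaler`.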


Let's see what would happen if we fitted a conventional multiple regression model on these data instead.

In [29]:
linear_regression = LinearRegression().fit(X, y)

## Question 3 (2 points)

Count how many coefficients are non-zero in the multiple linear regression model.

You can use the same code as above with minor modifications:

```
np.sum(<model>.coef_ != 0)

```

`<model>` is a placeholder for code that you need to write.

## Answer 3

In [31]:
try:
    non_zero_coefficients = np.sum(linear_regression.coef_ != 0)
    print(non_zero_coefficients)
except NameError:
    print('The object `linear_regression` does not exist! Did you forget to run the cell above Question 3?')

100
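
Counting non-zero coefficients says nothing about their size. The sketch below, which uses illustrative variable names and rebuilds the data so that it runs on its own, looks at how large the least-squares estimates are for the predictors whose true coefficient is zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Same simulated data as in the notebook.
X_demo, y_demo, true_coef = make_regression(n_samples=1000, n_features=100,
                                            n_informative=10, noise=0.005,
                                            coef=True, random_state=42)
ols = LinearRegression().fit(X_demo, y_demo)

# Boolean mask of the predictors that play no role in generating y.
irrelevant = true_coef == 0

n_nonzero = int(np.sum(ols.coef_ != 0))
max_irrelevant = float(np.max(np.abs(ols.coef_[irrelevant])))
min_relevant = float(np.min(np.abs(ols.coef_[~irrelevant])))

print(n_nonzero)
print(max_irrelevant)  # largest estimate among the irrelevant predictors
print(min_relevant)    # smallest estimate among the relevant predictors
```

Least squares has no mechanism that sets a coefficient to exactly zero, so the irrelevant predictors receive small but non-zero estimates.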


## Question 4 (2 points)

Which of the two models better learns the data-generating process that produced the sample data? Why?

## Answer 4

Your answer here.

In [None]:
There are k = 10 relevant predictors out of p = 100 predictors in total.
This means that 90 of the features carry no information about the target `y`.

- The Lasso model learns the data-generating process better because
it ignores the 90 non-informative features.
The Lasso sets the coefficients of the non-informative features to exactly zero, as seen above.
Keeping only the 10 relevant predictors matches the true structure of the data.
As described for Lasso models:
    "The `alpha` parameter controls the strength of regularization. A grid search over `alpha` values was 
    conducted to find the best setting that minimizes cross-validated loss. This process helps in identifying 
    a model that captures the underlying data-generating process by keeping the model complexity 
    (number of non-zero coefficients) in check."
- The Linear Regression model is worse because
it uses all 100 predictors, including the 90 that are not informative, as seen above.
With 100 non-zero coefficients instead of 10, the model can overfit, which can lead to poorer predictions on new data.
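
Whether the 90 extra coefficients actually hurt prediction depends on the noise level and the sample size, so it is best checked on held-out data. The sketch below is one way to do that, not part of the graded answer: the 75/25 split, `random_state=0`, and the fixed `alpha=0.01` are all assumptions made for illustration, and the variable names are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same simulated data as in the notebook.
X_demo, y_demo = make_regression(n_samples=1000, n_features=100,
                                 n_informative=10, noise=0.005,
                                 random_state=42)

# ASSUMPTION: an illustrative 75/25 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

# ASSUMPTION: alpha=0.01 is an illustrative choice, not the tuned value.
lasso_model = make_pipeline(StandardScaler(), Lasso(alpha=0.01))
lasso_model.fit(X_train, y_train)
ols_model = LinearRegression().fit(X_train, y_train)

lasso_pred = lasso_model.predict(X_test)
ols_pred = ols_model.predict(X_test)

# Mean squared error and R^2 on data that neither model saw during fitting.
lasso_mse = mean_squared_error(y_test, lasso_pred)
ols_mse = mean_squared_error(y_test, ols_pred)
lasso_r2 = r2_score(y_test, lasso_pred)
ols_r2 = r2_score(y_test, ols_pred)

print(f'Lasso test MSE: {lasso_mse}, R^2: {lasso_r2}')
print(f'OLS   test MSE: {ols_mse}, R^2: {ols_r2}')
```

Comparing the two test errors shows how much, if at all, the non-informative coefficients cost in predictive accuracy for this particular sample.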
