Copyright 2020 Andrew M. Olney and made available under [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0) for text and [Apache-2.0](http://www.apache.org/licenses/LICENSE-2.0) for code.


# Ridge and Lasso Regression: Problem solving

In this session, we'll use the `boston` dataset, which has been used to examine the relationship between clean air and house prices:


| Variable | Type | Description |
|:----|:-----|:----------|
|crim | Ratio | per capita crime rate by town | 
|zn | Ratio | proportion of residential land zoned for lots over 25,000 sq. ft. | 
|indus | Ratio | proportion of non-retail business acres per town | 
|chas | Nominal (binary) | Charles River dummy variable (=1 if tract bounds river, =0 otherwise) | 
|nox | Ratio | nitrogen oxides concentration (parts per 10 million) | 
|rm | Ratio | average number of rooms per dwelling | 
|age | Ratio | proportion of owner-occupied units built prior to 1940 | 
|dis | Ratio | weighted mean of distances to five Boston employment centers | 
|rad | Ordinal | index of accessibility to radial highways | 
|tax | Ratio | full-value property tax rate per \$10,000 | 
|ptratio | Ratio | pupil-teacher ratio by town | 
|lstat | Ratio | percent lower status of population (defined as non-high school graduate, manual labor) | 
|medv | Ratio | median value of owner-occupied homes in \$1000s | 

<div style="text-align:center;font-size: smaller">
    <b>Source:</b> This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
</div>

As before, we'll try to predict `medv` using the rest of the variables.

**Because `medv` is a ratio variable, we will use linear regression, not logistic regression.**

## Load data

Import `pandas` to load a dataframe.

Load the dataframe.
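A minimal sketch of the loading step. The notebook's file path is not given here, so this example reads from an in-memory string instead; the three rows are illustrative placeholders in the column layout described above, and `dataframe` is an assumed variable name.

```python
import io

import pandas as pd

# Stand-in for the real CSV file: three illustrative rows in the column
# layout from the table above. In the notebook, pass the dataset's file
# path to pd.read_csv instead of this in-memory buffer.
csv_text = """crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,lstat,medv
0.006,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,4.98,24.0
0.027,0.0,7.07,0,0.469,6.421,78.9,4.97,2,242,17.8,9.14,21.6
0.027,0.0,7.07,0,0.469,7.185,61.1,4.97,2,242,17.8,4.03,34.7
"""

dataframe = pd.read_csv(io.StringIO(csv_text))
print(dataframe.head())
```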

## Explore data

Describe the data.
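One way to get the summary statistics is `describe()`. The sketch below uses a tiny made-up dataframe (not the real data) so that it runs on its own; `summary` is an assumed variable name.

```python
import pandas as pd

# Tiny stand-in dataframe; the values are illustrative, not the real data.
dataframe = pd.DataFrame({
    "rm": [6.5, 6.4, 7.2, 7.0],
    "lstat": [5.0, 9.1, 4.0, 2.9],
    "medv": [24.0, 21.6, 34.7, 33.4],
})

# describe() reports count, mean, std, min, quartiles, and max per column.
summary = dataframe.describe()

# Pull out just the rows the question below asks about.
print(summary.loc[["min", "mean", "max"]])
```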

-----------
**QUESTION:**

Do the min, mean, and max look reasonable to you, given what these variables mean (see the data description above)?

**ANSWER: (click here to edit)**


<hr>

Make a correlation heatmap.

First import `plotly.express`.

Then show the heatmap.

----------------------

**QUESTION:**

Do we have strong positive correlations, strong negative correlations, or both?

**ANSWER: (click here to edit)**


<hr>

**QUESTION:**

Given the nature of these variables, do these correlations surprise you? 

**ANSWER: (click here to edit)**


<hr>

## Prepare train/test sets

Do the imports needed to split the dataframe into `X` and `Y`.

Create `X` by dropping the response variable `medv` from the dataframe.

Create `Y` by pulling just `medv` from the dataframe.

Now do the splits.
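A sketch of the three steps above on synthetic stand-in data. The name `splits` follows the hints later in this notebook; `random_state=0` and the default 25% test size are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn import model_selection

# Synthetic stand-in for the real dataframe (values are random).
rng = np.random.default_rng(0)
dataframe = pd.DataFrame({
    "rm": rng.normal(6.0, 0.7, size=40),
    "lstat": rng.normal(12.0, 7.0, size=40),
    "medv": rng.normal(22.0, 9.0, size=40),
})

# Predictors: everything except the response variable.
X = dataframe.drop(columns=["medv"])
# Response: just medv, kept as a one-column dataframe.
Y = dataframe[["medv"]]

# Returns [Xtrain, Xtest, Ytrain, Ytest]; by default 25% is held out.
splits = model_selection.train_test_split(X, Y, random_state=0)
print([part.shape for part in splits])
```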

## Model 1: Linear regression

Do the imports needed to build and evaluate a linear regression model that uses scaling.

### Training

Make a pipeline to scale and train in one step.

**Hint: `with linear_model create LinearRegression using`**

Fit the pipeline.
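The scale-and-train pipeline might look like the sketch below. It uses `make_regression` to generate stand-in data with 12 predictors; the name `std_clf` matches the coefficient-printing cell later in this section, where `std_clf[1]` is the regression step.

```python
from sklearn import datasets, linear_model, model_selection, pipeline, preprocessing

# Stand-in regression data with 12 predictors (not the real dataset).
X, Y = datasets.make_regression(
    n_samples=200, n_features=12, noise=10.0, random_state=0
)
splits = model_selection.train_test_split(X, Y, random_state=0)

# Step 0 standardizes each predictor; step 1 fits ordinary least squares.
std_clf = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    linear_model.LinearRegression(),
)

# Fit on the training split only: splits[0] is Xtrain, splits[2] is Ytrain.
std_clf.fit(splits[0], splits[2])
print(std_clf[1].coef_.shape)
```

Because the predictors are standardized first, the fitted coefficients are on a common scale and can be compared with each other.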

### Evaluation

Get the $r^2$ on the test splits (see linear regression notebooks).

**Hint: `do score using a list containing Xtest and Ytest` (from splits)**
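To make clear what `score` returns, this sketch compares it with $r^2$ computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The data is synthetic stand-in data, and `r2_manual` is an illustrative name.

```python
import numpy as np
from sklearn import datasets, linear_model, model_selection, pipeline, preprocessing

# Stand-in regression data (not the real dataset).
X, Y = datasets.make_regression(
    n_samples=200, n_features=12, noise=10.0, random_state=0
)
Xtrain, Xtest, Ytrain, Ytest = model_selection.train_test_split(
    X, Y, random_state=0
)

std_clf = pipeline.make_pipeline(
    preprocessing.StandardScaler(), linear_model.LinearRegression()
)
std_clf.fit(Xtrain, Ytrain)

# For regressors, score() returns r^2 on the data it is given.
r2 = std_clf.score(Xtest, Ytest)

# The same quantity by hand: 1 - SS_residual / SS_total.
predictions = std_clf.predict(Xtest)
ss_res = np.sum((Ytest - predictions) ** 2)
ss_tot = np.sum((Ytest - np.mean(Ytest)) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2, r2_manual)
```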

------------------------
**QUESTION:**

Is this a good $r^2$?

**ANSWER: (click here to edit)**


<hr>

Print the coefficients of the model.

In [15]:
print(pd.DataFrame( {"variable":X.columns, "coefficient":np.ravel(std_clf[1].coef_) }).to_string())

   variable  coefficient
0      crim    -1.162316
1        zn     1.530809
2     indus     0.062541
3      chas     0.526039
4       nox    -1.941847
5        rm     2.283061
6       age    -0.042285
7       dis    -3.423062
8       rad     2.575558
9       tax    -2.146754
10  ptratio    -1.872294
11    lstat    -4.001713


------------------------
**QUESTION:**

What are the two variables that most positively impact house price?
What are the two variables that most negatively impact house price?

**ANSWER: (click here to edit)**


<hr>

<!-- TODO: might be worth doing diagnostics for each model, though I'm concerned about how long that would take -->

## Model 2: Lasso regression (alpha=.25)

### Training

Make a pipeline to scale and train in one step.

**Hint: `with linear_model create Lasso using alpha=.25`**

**Unlike `C`, which is the inverse of regularization strength, `alpha` scales the penalty directly, so bigger values mean bigger penalties.**

Fit the pipeline.
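A sketch of the Lasso pipeline, again on synthetic stand-in data. The name `std_clf_lasso25` matches the coefficient-printing cell later in this section; the only change from Model 1 is the estimator in the second step.

```python
from sklearn import datasets, linear_model, model_selection, pipeline, preprocessing

# Stand-in regression data with 12 predictors (not the real dataset).
X, Y = datasets.make_regression(
    n_samples=200, n_features=12, noise=10.0, random_state=0
)
splits = model_selection.train_test_split(X, Y, random_state=0)

# Same scaling step as before, but with an L1-penalized regression.
std_clf_lasso25 = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    linear_model.Lasso(alpha=0.25),
)
std_clf_lasso25.fit(splits[0], splits[2])

print(std_clf_lasso25[1].alpha)
```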



### Evaluation

Get the $r^2$ on the test splits.

------------------------
**QUESTION:**

How does this compare to the previous $r^2$? 
Should we be concerned?

**ANSWER: (click here to edit)**


<hr>

Print how many coefficients are not zero.
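One way to count the surviving coefficients is `np.count_nonzero`. In this sketch the stand-in data is built so that only the first two of six predictors influence the response, and the model is fit on all rows for brevity rather than on a training split.

```python
import numpy as np
from sklearn import linear_model, pipeline, preprocessing

# Synthetic data: only the first two of six predictors affect Y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
Y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

std_clf_lasso25 = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    linear_model.Lasso(alpha=0.25),
)
std_clf_lasso25.fit(X, Y)

# Flatten the coefficient array, then count the entries that are not zero.
coefficients = np.ravel(std_clf_lasso25[1].coef_)
n_nonzero = np.count_nonzero(coefficients)
print(n_nonzero, "of", coefficients.size, "coefficients are not zero")
```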

Print the coefficients of the model.

In [20]:
print(pd.DataFrame( {"variable":X.columns, "coefficient":np.ravel(std_clf_lasso25[1].coef_) }).to_string())

   variable  coefficient
0      crim    -0.579396
1        zn     0.696999
2     indus    -0.000000
3      chas     0.428924
4       nox    -0.820872
5        rm     2.586669
6       age    -0.000000
7       dis    -1.716571
8       rad     0.000000
9       tax    -0.000000
10  ptratio    -1.630620
11    lstat    -3.965615


------------------------
**QUESTION:**

What are the two variables that most positively impact house price?
What are the two variables that most negatively impact house price?
How is this different from before?

**ANSWER: (click here to edit)**


<hr>

## Model 3: Lasso regression (alpha=.75)

### Training

Make a pipeline to scale and train in one step, using alpha=.75.

Fit the pipeline.
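To see the effect of the larger penalty, this sketch fits both Lasso models on the same synthetic stand-in data and compares the total absolute size of their coefficients. `fit_lasso` is a hypothetical helper introduced only for this example; the pipeline names match the coefficient-printing cells in this notebook.

```python
import numpy as np
from sklearn import datasets, linear_model, model_selection, pipeline, preprocessing

# Stand-in regression data with 12 predictors (not the real dataset).
X, Y = datasets.make_regression(
    n_samples=200, n_features=12, noise=10.0, random_state=0
)
splits = model_selection.train_test_split(X, Y, random_state=0)


def fit_lasso(alpha):
    """Scale the predictors, then fit a Lasso model with the given alpha."""
    model = pipeline.make_pipeline(
        preprocessing.StandardScaler(), linear_model.Lasso(alpha=alpha)
    )
    return model.fit(splits[0], splits[2])


std_clf_lasso25 = fit_lasso(0.25)
std_clf_lasso75 = fit_lasso(0.75)

# Sum of absolute coefficients: the quantity the Lasso penalty shrinks.
l1_norm_25 = np.abs(std_clf_lasso25[1].coef_).sum()
l1_norm_75 = np.abs(std_clf_lasso75[1].coef_).sum()
print(l1_norm_25, l1_norm_75)
```

The larger alpha should give the smaller total, since a heavier penalty pulls coefficients further toward zero.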



### Evaluation

Get the $r^2$ on the test splits.

------------------------
**QUESTION:**

How does this compare to the previous $r^2$? 
Should we be concerned?

**ANSWER: (click here to edit)**


<hr>

Print how many coefficients are not zero.

Print the coefficients of the model.

In [25]:
print(pd.DataFrame( {"variable":X.columns, "coefficient":np.ravel(std_clf_lasso75[1].coef_) }).to_string())

   variable  coefficient
0      crim    -0.123574
1        zn     0.000000
2     indus    -0.000000
3      chas     0.075127
4       nox    -0.000000
5        rm     2.578374
6       age    -0.000000
7       dis    -0.000000
8       rad    -0.000000
9       tax    -0.000000
10  ptratio    -1.427197
11    lstat    -3.653555


------------------------
**QUESTION:**

What are the two variables that most positively impact house price?
What are the two variables that most negatively impact house price?
How is this different from before?

**ANSWER: (click here to edit)**


<hr>

### Comparing Lasso models

------------------------
**QUESTION:**

Which model do you prefer?
Why?

**ANSWER: (click here to edit)**


<hr>

------------------------
**QUESTION:**

Is there any model that you don't trust?
Why?

**ANSWER: (click here to edit)**


<hr>

**QUESTION:**

If you were to investigate multicollinearity more seriously in this situation, what other things could you do?

**ANSWER: (click here to edit)**


<hr>

**QUESTION:**

What is the interpretation of the four most positive/negative coefficients in model 3 and their impact on house price?

**ANSWER: (click here to edit)**


<hr>