Regularization imposes a penalty on the size of the coefficients in
the model.



## Sample data



As is our usual process, let's generate some fake data.  This time,
our target is not classification but regression.



In [1]:
import numpy as np

dimension = 25 # should be a perfect square
N = 1000

y = np.random.normal( 0, 1, size=N )

vs = np.linspace(0.01,2,dimension)
np.random.shuffle(vs)

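xs = []
for v in vs:
    # each feature is a noisy copy of y (noise std v) times one random scale factor
    xs.append( np.random.normal(0,1) * np.random.normal( y, v ) )
X = np.array( xs ).transpose()  # shape (N, dimension)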

What does the code above actually achieve?  As always, **look at your
data**.  (Here we also see how to use `matplotlib` to produce a **grid**
of plots, which can be useful.)



In [1]:
import matplotlib.pyplot as plt

grid = int(np.sqrt(dimension))
for i in range(dimension):
    plt.subplot(grid, grid, 1+i)
    plt.scatter( y, X[:,i], s=1 )
plt.show()

We've baked in a great deal of multicollinearity: every feature is a
rescaled, noisy copy of `y`, so the features are correlated with one
another.

This is also an invitation to **high-dimensional data**; our input data
is 25-dimensional.
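
One way to check the multicollinearity claim numerically is to look at the correlation matrix of the features. The sketch below rebuilds the same kind of data so that it runs on its own. The seeded `np.random.default_rng(0)` generator and the name `off_diagonal` are assumptions made for reproducibility and illustration; they are not part of the cells above.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: fixed seed, only for reproducibility

dimension = 25
N = 1000
y = rng.normal(0, 1, size=N)
vs = np.linspace(0.01, 2, dimension)
rng.shuffle(vs)

# Same construction as above: a random scale factor times a noisy copy of y.
X = np.array([rng.normal(0, 1) * rng.normal(y, v) for v in vs]).transpose()

# Pairwise correlations between the columns (features) of X.
corr = np.corrcoef(X, rowvar=False)

# Keep only the entries that compare two *different* features.
off_diagonal = np.abs(corr[~np.eye(dimension, dtype=bool)])
print("largest |correlation| between distinct features:", off_diagonal.max())
print("median  |correlation| between distinct features:", np.median(off_diagonal))
```

Features built with a small noise level `v` are nearly perfectly correlated with one another (up to sign), while the noisiest features are only weakly correlated.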



## Linear regression



Let's do some linear regression.



In [1]:
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit( X, y )

This works well, at least as measured by $R^2$ on the training data.



In [1]:
model.score(X, y)

But it is arguably upsetting that the model coefficients involve **all** the features of the data.



In [1]:
model.coef_

From the plots, we can see that `y` can be explained by looking at just
**one** feature of the data.
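
To make the "one feature" observation concrete, here is a minimal sketch that picks the single feature most correlated with `y` and fits a regression on that column alone. It regenerates comparable data with a seeded generator (an assumption, for reproducibility), and the names `corr_with_y`, `best`, and `single` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)  # assumption: fixed seed, only for reproducibility

dimension = 25
N = 1000
y = rng.normal(0, 1, size=N)
vs = np.linspace(0.01, 2, dimension)
rng.shuffle(vs)
X = np.array([rng.normal(0, 1) * rng.normal(y, v) for v in vs]).transpose()

# Absolute correlation of each individual feature with the target.
corr_with_y = np.array(
    [abs(np.corrcoef(X[:, i], y)[0, 1]) for i in range(dimension)]
)
best = int(np.argmax(corr_with_y))

# Fit using only that one column (double brackets keep X two-dimensional).
single = LinearRegression().fit(X[:, [best]], y)
r2_single = single.score(X[:, [best]], y)

print("best feature index:", best, "with noise level v =", vs[best])
print("R^2 using only that feature:", r2_single)
```

The selected feature is the one generated with the smallest noise level, which is why a single column is enough to explain `y` almost completely.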



## Lasso



Our usual linear regression involves the cost function
$||X w - y||_2^2$.

For "lasso" we instead minimize the cost function

$\frac{1}{2n} ||X w - y||_2^2 + \alpha ||w||_1$.

Here $n$ is the number of samples, $\alpha$ is the regularization
parameter, and $||w||_1$ is the sum of the absolute values of the
coefficients of the model.  By adding this term to the cost function,
we are effectively penalizing the model based on the size of the
coefficients.
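
The cost function can be written out directly, which may help in seeing how the two terms trade off. This is a minimal sketch of the formula as stated, with no intercept term; the function name `lasso_cost` and the tiny `X_demo`, `y_demo` arrays are hypothetical and chosen only so the arithmetic is easy to follow by hand.

```python
import numpy as np

def lasso_cost(X, y, w, alpha):
    """Evaluate (1 / (2n)) * ||Xw - y||_2^2 + alpha * ||w||_1 (no intercept)."""
    n = X.shape[0]
    residual = X @ w - y
    squared_error = residual @ residual / (2 * n)
    l1_penalty = alpha * np.sum(np.abs(w))
    return squared_error + l1_penalty

# Tiny hand-checkable example (hypothetical data).
X_demo = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
y_demo = np.array([1.0, -1.0])

w_exact = np.array([1.0, -1.0])  # fits y_demo perfectly, but has L1 norm 2
w_zero = np.zeros(2)             # ignores the data, but has L1 norm 0

cost_exact = lasso_cost(X_demo, y_demo, w_exact, alpha=0.5)
cost_zero = lasso_cost(X_demo, y_demo, w_zero, alpha=0.5)
print("cost of the exact fit:", cost_exact)  # 0 + 0.5 * 2 = 1.0
print("cost of all zeros:    ", cost_zero)   # 2 / 4 + 0 = 0.5
```

With `alpha=0.5` the all-zero coefficient vector is cheaper than the exact fit, which illustrates how the penalty can push coefficients to zero. Note also that the plain sum of the entries of `w_exact` is 0, so the absolute values in $||w||_1$ matter.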



In [1]:
from sklearn.linear_model import Lasso
model = Lasso().fit( X, y )

Does this really change the coefficients at all?



In [1]:
model.coef_
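
How many coefficients survive depends on the regularization parameter; `Lasso()` with no arguments uses scikit-learn's default `alpha=1.0`. The sketch below counts nonzero coefficients for a few values of `alpha`. The particular values tried, the seeded generator, the raised `max_iter`, and the name `nonzero_counts` are assumptions for illustration, not choices made in the cells above.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)  # assumption: fixed seed, only for reproducibility

dimension = 25
N = 1000
y = rng.normal(0, 1, size=N)
vs = np.linspace(0.01, 2, dimension)
rng.shuffle(vs)
X = np.array([rng.normal(0, 1) * rng.normal(y, v) for v in vs]).transpose()

nonzero_counts = {}
for alpha in [0.001, 0.01, 0.1, 1.0]:
    # max_iter is raised from the default to give small alphas more room to converge.
    lasso = Lasso(alpha=alpha, max_iter=10_000).fit(X, y)
    nonzero_counts[alpha] = int(np.count_nonzero(lasso.coef_))
    print(f"alpha={alpha}: {nonzero_counts[alpha]} of {dimension} coefficients are nonzero")
```

Comparing these counts with the 25 nonzero coefficients that ordinary linear regression typically produces gives a rough sense of how strongly each `alpha` prunes the model.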

Consider how important this could be, because not only does this make
the model "simpler" (perhaps avoiding overfitting?) but it also makes
the model more **explainable** in that we are identifying the key pieces
of the input that explain the output.

A part of data science involves creating models that you can actually
"sell" in that someone else (who might know nothing of data) will
believe the patterns you've discovered.  As first noted by Hamming,
the purpose of computing is not numbers but **insight**.

