### Week 7 studio assignment: Linear Regression and Regularization
This is a notebook exploring sk-learn's tools for linear regression. It covers:
- using different metrics to evaluate regression performance
- fitting with Ordinary Least Squares and Stochastic Gradient Descent
- the effect of varying loss functions (L1, L2, and Huber)
- how to introduce additional polynomial features
- the effect of Lasso and Ridge regularization

Copyright: Julieta Gruszko (2025) 
Some materials from Viviana Acquaviva (2023)

License: [BSD-3-clause](https://opensource.org/license/bsd-3-clause/)

In [None]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import sklearn
from sklearn import metrics
from sklearn.model_selection import train_test_split, cross_validate, cross_val_predict
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn import linear_model #New!

font = {'size'   : 16}
matplotlib.rc('font', **font)
matplotlib.rc('xtick', labelsize=14) 
matplotlib.rc('ytick', labelsize=14) 
#matplotlib.rcParams.update({'figure.autolayout': False})
matplotlib.rcParams['figure.dpi'] = 300

#### We begin by generating some data. We'll start with just 1 feature, and an underlying linear model.

In [None]:
np.random.seed(16) #set seed for reproducibility purposes

x = np.arange(100) 

yp = 3*x + 3 + 2*(np.random.poisson(3*x+3,100)-(3*x+3)) #generate some data with scatter following Poisson distribution 
                                                    #with exp value = y from linear model, centered around 0

In [None]:
#Let's take a look!

plt.scatter(x, yp);

#### Here comes the linear regression model. 

In [None]:
model = linear_model.LinearRegression()

### Questions:
Take a quick look at the documentation for the model used above. 
- What fit method does it use?
- What loss function does this model use? If it can use multiple different loss functions as options, list them.

In [None]:
model

I can fit the model (right now, I will do it using the entire data set just to compare with the analytic solution). When only one predictor is present, I need to reshape it to column form.

In [None]:
model.fit(x.reshape(-1,1),yp) 

The fitted model has attributes "coef_", "intercept_":

In [None]:
slope, intercept = model.coef_, model.intercept_

In [None]:
print(slope, intercept)

We can plot the original and the fitted line.

In [None]:
plt.figure(figsize = (10,6))
plt.scatter(x,yp, s = 20, c = 'gray', label = 'Data')
plt.plot(x, slope*x + intercept, c ='k', label = 'Ordinary least squares fit')
plt.plot(x, 3*x + 3, c = 'r', label = 'True regression line')
plt.legend(fontsize = 14)
plt.xlabel('X')
plt.ylabel('Y')

This matches the analytic prediction of the coefficients. Below, they're calculated from the data points with two methods: the expanded formula we derived in class to get the slope and intercept, and using numpy's variance and covariance functions to get the slope. 
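
For reference, these are the standard closed-form least-squares expressions for a single feature. The notation is assumed here: $\bar{x}$ and $\bar{y}$ are the sample means, and $\hat{\theta}_0$, $\hat{\theta}_1$ correspond to the variables `theta0` and `theta1` in the next cell.

$$
\hat{\theta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)}, \qquad \hat{\theta}_0 = \bar{y} - \hat{\theta}_1 \bar{x}
$$

The ratio only comes out the same if the covariance and the variance use the same normalization (both $1/n$ or both $1/(n-1)$).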

In [None]:
#the analytic formula

theta1 = np.sum((x - np.mean(x))*(yp - np.mean(yp)))/np.sum((x - np.mean(x))*(x - np.mean(x)))

theta0 = np.mean(yp) - theta1*np.mean(x)

print('Theta_0, Theta_1:', theta0, theta1)

In [None]:
# using numpy's variance/covariance functions (note: a small difference is due to 1/n vs 1/(n-1) in the definition; use bias = True for consistency)
print('Sample Cov / Sample var:', np.cov(x,yp, bias=True)[0,1]/np.var(x))

#### We can (and should!) do cross validation and all the nice things we have learned to do for classification problems.

In [None]:
cv = KFold(n_splits = 5 , shuffle = True , random_state = 10)

In [None]:
scores = cross_validate(model, x.reshape(-1,1), yp, cv = cv, return_train_score = True)

In [None]:
scores

In [None]:
print('Test scores:', '{:.3f}'.format(scores['test_score'].mean()), '{:.3f}'.format(scores['test_score'].std()))
print('Train scores:', '{:.3f}'.format(scores['train_score'].mean()), '{:.3f}'.format(scores['train_score'].std()))

### Questions: 

- What are the scores that are being printed out? 

Note: If you don't set a scoring method parameter, cross\_validate returns whatever the default is for the model you passed it. You'll probably need to check documentation to find out what that is.

- How are the scores? 

- Does it suffer from high variance? High bias?

- What would happen to the scores if we increased the scatter (noise)?

### <font color='green'> Scoring in regression problems. </font>

### Here is a way to visualize all the available scorers.

In [None]:
print(sorted(sklearn.metrics.get_scorer_names()))

### Do you recognize some of them?

Let's try out the MSE.

In [None]:
scores = cross_validate(model, x.reshape(-1,1), yp, cv = cv, scoring = 'neg_mean_squared_error', return_train_score = True)

In [None]:
print('Test scores:','{:.3f}'.format(scores['test_score'].mean()), '{:.3f}'.format(scores['test_score'].std()))
print('Train scores:','{:.3f}'.format(scores['train_score'].mean()), '{:.3f}'.format(scores['train_score'].std()))

Something to note is that estimators of performance of the "error" type (in other words, the lower, the better) receive a negative sign in sklearn. This is just to maintain consistency with the "higher score = better" framework.
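
As a quick check of the sign convention, the sketch below compares the scorer output with an MSE computed by hand on the same folds. The data and the names ending in `_demo` are made up for illustration and are not the notebook's `x` and `yp`.

```python
import numpy as np
from sklearn import linear_model, metrics
from sklearn.model_selection import KFold, cross_validate

# Synthetic data, assumed for illustration only
rng = np.random.default_rng(0)
x_demo = np.arange(50, dtype=float).reshape(-1, 1)
y_demo = 3 * x_demo.ravel() + 3 + rng.normal(0, 5, size=50)

cv_demo = KFold(n_splits=5, shuffle=True, random_state=10)
model_demo = linear_model.LinearRegression()

# Scorer output: the negated MSE, one value per fold
neg_mse = cross_validate(model_demo, x_demo, y_demo, cv=cv_demo,
                         scoring='neg_mean_squared_error')['test_score']

# The (positive) MSE computed by hand on the same folds
manual_mse = []
for train_idx, test_idx in cv_demo.split(x_demo):
    fold_model = linear_model.LinearRegression().fit(x_demo[train_idx], y_demo[train_idx])
    y_pred = fold_model.predict(x_demo[test_idx])
    manual_mse.append(metrics.mean_squared_error(y_demo[test_idx], y_pred))
manual_mse = np.array(manual_mse)

print('Scorer output:', neg_mse)
print('Manual MSE:   ', manual_mse)
```

The two arrays should agree up to the sign, so flipping the sign of the scorer output recovers the usual MSE.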

We can also try the Mean Absolute Error:

In [None]:
scores = cross_validate(model, x.reshape(-1,1), yp, cv = cv, scoring = 'neg_mean_absolute_error', return_train_score = True)

In [None]:
print('Test scores:','{:.3f}'.format(scores['test_score'].mean()), '{:.3f}'.format(scores['test_score'].std()))
print('Train scores:','{:.3f}'.format(scores['train_score'].mean()), '{:.3f}'.format(scores['train_score'].std()))

Finally, by plotting the residuals, we can see that they are not independent of x (the assumptions of the probabilistic linear model are not satisfied). But that doesn't mean we can't create a model.

In [None]:
plt.scatter(x, slope*x + intercept - yp, color = 'b', label = 'Residuals')

plt.legend();

#### Note: as we already discussed, so far we have not changed the loss function (MSE), or the coefficients of the model. We have only looked at different evaluation metrics.

### Questions:
- Would the best fit line change if we optimize a different loss function?
- Will the fit method used above (the one used by sk-learn's $\texttt{LinearRegression}$ model) work for a different loss function?
- What are some options for implementing a fit using a different loss function? Give at least 2 methods. 

You already got the experience of writing a grid search in Week 1's studio, so we'll focus our attention on more efficient methods and use the built-in options from sk-learn. 

We'll be using stochastic gradient descent, via sk-learn's SGDRegressor model, to try out different loss functions. 

However, because these data are so regular, it's kind of boring, so before trying the different losses let's inject some outliers in the data.

### What happens when we add outliers?

In [None]:
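np.random.seed(12) #set seed for reproducibility
out = np.random.choice(100,15) #select 15 outlier indices (drawn with replacement, so repeats are possible)
yp_wo = np.copy(yp)
np.random.seed(12) #reset the seed before drawing the outlier sizes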
yp_wo[out] = yp_wo[out] + 5*np.random.rand(15)*yp[out]

In [None]:
plt.scatter(x,yp_wo, label = 'Data + outliers')
plt.scatter(x,yp, label = 'Original data')
plt.legend();

We can see the effect for the MSE loss right away, still using the OLS method as before:

In [None]:
model.fit(x.reshape(-1,1),yp_wo)

slope, intercept = model.coef_, model.intercept_

print(slope, intercept)

In [None]:
plt.figure(figsize = (10,6))
plt.scatter(x,yp_wo, s = 20, c = 'gray', label = 'Data')
plt.plot(x, slope*x + intercept, c ='k', label = 'Ordinary least squares fit')
plt.legend(fontsize = 14)
plt.xlabel('X')
plt.ylabel('Y')

Let's also compare the result using SGD for MSE loss, to check for consistency. 

### Note:

sk-learn's SGD struggles to find the best fit when the features or targets span a large range of values, since the default settings are designed for scaled data. Really, you should apply StandardScaler to $\textbf{both}$ the features and targets first. 
This is different than what we've done in the past! Before, we only scaled the features, so we could rely on the pipeline object to do it for us. Now, we'd also want to scale the targets. 

The annoying thing about doing that is that it will force you to invert the scaling to get back coefficients that are meaningful in your original units. 


Another option is to tweak the SGD step size (initial learning rate), the number of iterations, and possibly the learning rate schedule to compensate for this. When you can get away with it, this is conceptually simpler, but it incurs a bigger computational cost. Changing other settings (like the loss function used) will change the initial values you need, so beware if you decide to do this! Using the verbose mode of SGDRegressor is useful for checking if the fit is actually working.
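
To make the first option concrete, here is a minimal sketch of scaling both the feature and the target, fitting with SGD, and converting the coefficients back to the original units. The data and the names ending in `_demo` are assumptions made for illustration; the choice of `max_iter=200, tol=None` is just one way to run a fixed number of epochs.

```python
import numpy as np
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler

# Synthetic data, assumed for illustration only
rng = np.random.default_rng(1)
x_demo = np.arange(100, dtype=float).reshape(-1, 1)
y_demo = 3 * x_demo.ravel() + 3 + rng.normal(0, 20, size=100)

# Scale features and targets separately (the target scaler needs a 2D array)
x_scaler = StandardScaler().fit(x_demo)
y_scaler = StandardScaler().fit(y_demo.reshape(-1, 1))
x_s = x_scaler.transform(x_demo)
y_s = y_scaler.transform(y_demo.reshape(-1, 1)).ravel()

# With scaled data the default step size is usable; tol=None runs all max_iter epochs
sgd_demo = linear_model.SGDRegressor(penalty=None, max_iter=200, tol=None, random_state=0)
sgd_demo.fit(x_s, y_s)

# Invert the scaling: y = mu_y + sd_y * (w_s * (x - mu_x) / sd_x + b_s)
mu_x, sd_x = x_scaler.mean_[0], x_scaler.scale_[0]
mu_y, sd_y = y_scaler.mean_[0], y_scaler.scale_[0]
slope_orig = sgd_demo.coef_[0] * sd_y / sd_x
intercept_orig = mu_y + sd_y * sgd_demo.intercept_[0] - slope_orig * mu_x

# Reference: ordinary least squares on the unscaled data
ols_demo = linear_model.LinearRegression().fit(x_demo, y_demo)
print('SGD (original units):', slope_orig, intercept_orig)
print('OLS:                 ', ols_demo.coef_[0], ols_demo.intercept_)
```

The slope picks up a factor of `sd_y / sd_x`, and the intercept absorbs both mean shifts, which is the bookkeeping the note above warns about.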


In [None]:
sgdmodel = linear_model.SGDRegressor(penalty=None, eta0=.0000001, max_iter=10000, tol=1e-5) 
# The default loss function used by SGDRegressor is the squared error (MSE)
# We're turning off the regularization with penalty=None (the Python None object, not the string 'None')
# We're using a very small initial learning rate and extra iterations (defaults are .01 and 1000) since we didn't scale the data
# We're using the default learning rate schedule ('invscaling'), where the step size decays as eta0 / t^0.25, with t the iteration number

In [None]:
sgdmodel.fit(x.reshape(-1, 1), yp_wo.flatten())


In [None]:
sgdslope, sgdintercept =sgdmodel.coef_, sgdmodel.intercept_
print("SGD MSE coefficients:", sgdslope, sgdintercept)


Not exactly the same, but pretty close! As a numerical method, SGD won't match the analytic solution exactly.

Now that we have our SGD set up, we can try some different loss functions.

### Varying loss functions

We'll try out these loss functions:
- L1 loss (MAE)
- Huber loss (hybrid between L1 and L2)

SGD doesn't explicitly have an L1 loss option, but Huber loss's epsilon parameter, which modifies where the transition occurs between L2 and L1 error, can be set to a very small value to mimic using an L1 loss. 

### Question:
Why do you think it is that sk-learn's gradient descent method doesn't implement a purely-L1 loss? What could go wrong if it did?

In [None]:
# First, using L1-style loss:
sgdmodel_l1 = linear_model.SGDRegressor(loss = "huber", epsilon = .001, eta0=.1, max_iter=10000, penalty=None, tol=1e-5) 
sgdmodel_l1.fit(x.reshape(-1, 1), yp_wo.flatten()) 

l1slope, l1intercept =sgdmodel_l1.coef_, sgdmodel_l1.intercept_
print("L1-style SGD coefficients:", l1slope, l1intercept)


In [None]:
# Now using Huber loss with epsilon of 500:
sgdmodel_huber500 = linear_model.SGDRegressor(loss = "huber", eta0=.000001, max_iter=10000, penalty=None, epsilon=500) 
sgdmodel_huber500.fit(x.reshape(-1, 1), yp_wo.flatten()) 

huber500slope, huber500intercept =sgdmodel_huber500.coef_, sgdmodel_huber500.intercept_
print("Huber SGD coefficients (epsilon = 500):", huber500slope, huber500intercept)


In [None]:
plt.figure(figsize = (10,6))
plt.scatter(x,yp_wo, s = 20, c = 'gray', label = 'Data')
plt.plot(x, sgdslope*x + sgdintercept, c ='b', label = 'SGD MSE fit')
plt.plot(x, l1slope*x + l1intercept, c ='r', label = 'L1 fit')
plt.plot(x, huber500slope*x + huber500intercept, c ='m', label = 'Huber fit, epsilon = 500')
plt.legend(fontsize = 14)
plt.xlabel('X')
plt.ylabel('Y')

Note: the Huber loss is a hybrid between MSE and MAE (behaves like MAE when the error is larger than a certain amount, often called epsilon, so it's less sensitive to outliers). One possibility is to use the std of the y values to set epsilon.
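
To see the two regimes numerically, here is a small sketch of the Huber loss under one common convention (assumed here): quadratic, $\frac{1}{2}r^2$, for residuals with $|r| \le \epsilon$, and linear, $\epsilon|r| - \frac{1}{2}\epsilon^2$, beyond that. The function name `huber_loss` is made up for this example.

```python
import numpy as np

def huber_loss(residual, epsilon):
    """Quadratic for |residual| <= epsilon, linear beyond it."""
    r = np.abs(residual)
    return np.where(r <= epsilon, 0.5 * r**2, epsilon * r - 0.5 * epsilon**2)

residuals = np.array([0.5, 1.0, 3.0, 10.0])
huber_values = huber_loss(residuals, epsilon=1.0)
squared_values = 0.5 * residuals**2

print('Residuals:          ', residuals)
print('Huber (epsilon = 1):', huber_values)
print('Squared error / 2:  ', squared_values)
```

For the small residuals the two losses coincide, while for the large ones the Huber loss grows only linearly, which is why a single outlier pulls the fit less.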

### Exercise and Question: 
Try a different value of epsilon in the Huber loss and check what happens to the resulting best fit. Make sure the fit is still working correctly with your new value (e.g. if you use verbose mode and see that there's < ~10 iterations, you may need to tweak settings).

Explain what you see.




Epsilon is a hyperparameter that you can optimize like any other one. 

### New features and the effect of regularization

To study how regularization works, we need to have more features. 

We'll add one new feature, plus some correlated features generated with the PolynomialFeatures transformer.

In [None]:
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures

In [None]:
x0 = x #let's call our old feature x0
#and add a new feature:
x1 = np.logspace(2,3,num=100) 

#make the target values depend on both features
ypb = 3*x0 + 2*x0**2 + 15*x1 + 3 + 5*(np.random.poisson(3*x0+2*x0**2 + 15*x1,100)-(3*x0 + 2*x0**2 + 15*x1)) 
                                                    #generate some data with scatter following Poisson distribution 
                                                    #with exp value = y from linear model, centered around 0

xb = np.vstack((x0,x1)).T #stack the features into 2D array
xb.shape

### Let's take a look at plots of the target vs. each of the features:

In [None]:
fig, ax = plt.subplots(1, 2, sharey=True, figsize=(7, 3))
ax[0].scatter(x0, ypb, marker='.')
ax[0].set_xlabel('x0')
ax[0].set_ylabel('target')
ax[1].scatter(x1, ypb, marker='.')
ax[1].set_xlabel('x1')

### Add correlated features (polynomial transformation)

In [None]:
poly = PolynomialFeatures(2, include_bias=False)

In [None]:
new_xb = poly.fit_transform(xb)
new_xb.shape

In [None]:
poly.get_feature_names_out()  # a convenient function to check the ordering of new features

### Questions
- What features did we add using $\texttt{PolynomialFeatures}$?
- Which of those new features do you think may be useful in doing a LinearRegression fit on the data?


sk-learn has many options for regression with regularization, including:
- Using $\texttt{SGDRegressor}$, set $\texttt{penalty = \{`l2', `l1', `elasticnet'\}}$ : optimizes using stochastic gradient descent, with whichever regularizer you set (elasticnet is a combination of l1 and l2 with a parameter that governs the mix you use)

- $\texttt{Ridge}$ and $\texttt{RidgeCV}$ are specifically for Ridge regularization, with many different optimizer options built-in. You can choose your optimizer manually, or use the default 'auto' option, which chooses an appropriate solver based on your data. The "CV" version has built-in cross-validation. 

- $\texttt{Lasso}$ and $\texttt{LassoCV}$: same as above, but for Lasso regularization

### Let's start with Ridge regression, and compare the coefficients of the linear model for different amounts of regularization.


This time we will apply scaling to the features to help the fits converge correctly, but we'll keep the coefficient re-scaling problem simpler by 1) not scaling y values and 2) not shifting the mean of each feature to 0 as part of scaling. That leaves us with just 1 re-scaling parameter we have to correct for. If you try this and have trouble with convergence, you can improve the scaling as needed.
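
The single re-scaling parameter is the per-feature standard deviation stored in the scaler's `scale_` attribute. Since each scaled feature is `x / scale_`, a coefficient fitted on scaled features converts back to original units as `coef_ / scale_`, and the intercept is unchanged. The sketch below checks this for an unregularized fit; the data and the names ending in `_demo` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data on very different scales, assumed for illustration only
rng = np.random.default_rng(2)
X_demo = np.column_stack([rng.uniform(0, 100, 200), rng.uniform(100, 1000, 200)])
y_demo = 3 * X_demo[:, 0] + 15 * X_demo[:, 1] + 3 + rng.normal(0, 10, 200)

# Scale by the standard deviation only (no mean shift), then fit
pipe_demo = make_pipeline(StandardScaler(with_mean=False), LinearRegression())
pipe_demo.fit(X_demo, y_demo)

# x_scaled = x / scale_, so w_original = w_scaled / scale_; the intercept is unchanged
coef_original_units = pipe_demo[1].coef_ / pipe_demo[0].scale_

# Reference fit directly on the unscaled features
raw_demo = LinearRegression().fit(X_demo, y_demo)
print('Converted from scaled fit:', coef_original_units, pipe_demo[1].intercept_)
print('Fit on raw features:      ', raw_demo.coef_, raw_demo.intercept_)
```

With Ridge or Lasso the same division converts the coefficients to original units, although the regularized fit itself depends on how the features were scaled.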



#### Let's pick alpha = 1000.

In [None]:
#Let's use Ridge to try something new, plus it should converge better than SGD
ridgemodel = make_pipeline(StandardScaler(with_mean=False), Ridge(alpha=1000))

ridgemodel.fit(new_xb,ypb)

For a pipeline model, you can access each part of the pipeline, in order, by index.

In [None]:
#To access the Scaler object:
print(ridgemodel[0])
#To access the Ridge model object:
print(ridgemodel[1])

In [None]:
print("Scaled coefficients: ", ridgemodel[1].coef_) # the scaled coefficients of the Ridge model

coef_alpha_1000 = np.hstack([ridgemodel[1].coef_/ridgemodel[0].scale_, ridgemodel[1].intercept_])
                         
print("The coefficients and y intercept: ", coef_alpha_1000)

#### Now let's see for alpha = 1.0.

### Question: 
Make a prediction. Will the coefficients be larger or smaller?

In [None]:
#Note: the notation changed in sklearn 1.2 and higher; 
#To reproduce book results, we need to use alpha = alpha * n_samples

newmodel = make_pipeline(StandardScaler(with_mean=False), Ridge(alpha=1))

newmodel.fit(new_xb,ypb)

coef_alpha_1 = np.hstack([newmodel[1].coef_/newmodel[0].scale_, newmodel[1].intercept_])
                         
print(coef_alpha_1)

#### Finally, below we use a trick to get coefficients for "zero" alpha (no regularization); we could have also used LinearRegression.

In [None]:
newmodel = make_pipeline(StandardScaler(with_mean=False), Ridge(alpha=1e-11))

newmodel.fit(new_xb,ypb)

coef_alpha_noreg = np.hstack([newmodel[1].coef_/newmodel[0].scale_, newmodel[1].intercept_])
                         
print(coef_alpha_noreg)

Comparison with the same procedure for the linear model (no regularization) shows consistency.

In [None]:
lmodel = make_pipeline(StandardScaler(with_mean=False),linear_model.LinearRegression())
lmodel.fit(new_xb,ypb)
print(lmodel[1].coef_/lmodel[0].scale_, lmodel[1].intercept_)

### We can compare the Ridge coefficients of the linear model for different amounts of regularization.

In [None]:
plt.figure(figsize = (12,6))
plt.bar(np.arange(6)-0.2, np.abs(coef_alpha_1000), color = 'maroon',width=0.05, label = 'Ridge, alpha = 1000')
plt.bar(np.arange(6)-0.1, np.abs(coef_alpha_1), color = 'orangered',width=0.05, label = 'Ridge, alpha = 1.0')
plt.bar(range(6), np.abs(coef_alpha_noreg), color = 'grey',width=0.05, label = 'Linear (no regularization)')
plt.yscale('log')

plt.xticks(np.arange(6), ['x0','x1', 'x0^2','x0x1','x1^2', 'Intercept'])  # Set text labels.

plt.xlabel('Feature',fontsize=14)

plt.ylabel('Coefficients (absolute value)',fontsize=14)

plt.legend(fontsize=13);

### Now let's take a look at LASSO.

In [None]:
from sklearn.linear_model import Lasso, LassoCV 

Let's look at the coefficients for alpha = 10,000, alpha = 1,000 and alpha = 1. Lasso regularization tends to induce sparse coefficients, so we can check that that's true!

In [None]:
L10k = make_pipeline(StandardScaler(with_mean=False), Lasso(alpha = 10000, max_iter = 1000000, tol = 0.005))

L10k.fit(new_xb, ypb)

coef_L10k =  np.hstack([L10k[1].coef_/L10k[0].scale_, L10k[1].intercept_]) # divide by scale_ to get original units, as for Ridge

print(coef_L10k)

In [None]:
L1000 = make_pipeline(StandardScaler(with_mean=False), Lasso(alpha = 1000, max_iter = 1000000, tol = 0.005))

L1000.fit(new_xb, ypb)

coef_L1000 =  np.hstack([L1000[1].coef_/L1000[0].scale_, L1000[1].intercept_]) # divide by scale_ to get original units, as for Ridge

print(coef_L1000)

In [None]:
L1 = make_pipeline(StandardScaler(with_mean=False), Lasso(alpha = 1, max_iter = 1000000, tol = 0.005))

L1.fit(new_xb, ypb)

coef_L1 =  np.hstack([L1[1].coef_/L1[0].scale_, L1[1].intercept_])

print(coef_L1)

### Finally, we can plot all the coefficients together.

In [None]:
plt.figure(figsize = (12,6))
plt.bar(np.arange(6)-0.2, np.abs(coef_alpha_1000), color = 'maroon',width=0.05, label = 'Ridge, alpha = 1000')
plt.bar(np.arange(6)-0.1, np.abs(coef_alpha_1), color = 'orangered',width=0.05, label = 'Ridge, alpha = 1.0')
plt.bar(range(6), np.abs(coef_alpha_noreg), color = 'grey',width=0.05, label = 'Linear (no regularization)')
plt.bar(np.arange(6)+0.1, np.abs(coef_L1), color = 'tab:cyan',width=0.05, label = 'Lasso, alpha = 1.0')
plt.bar(np.arange(6)+0.2, np.abs(coef_L1000), color = 'tab:blue', width=0.05, label = 'Lasso, alpha = 1000')
plt.bar(np.arange(6)+0.3, np.abs(coef_L10k), color = 'xkcd:indigo', width=0.05, label = 'Lasso, alpha = 10000')

plt.yscale('log')

plt.xticks(np.arange(6), ['x0','x1', 'x0^2','x0x1','x1^2', 'Intercept'])  # Set text labels.

plt.xlabel('Feature',fontsize=14)

plt.ylabel('Coefficients (absolute value)',fontsize=14)

plt.legend(fontsize=13, bbox_to_anchor=(1.05, 1));


### Questions:
- In general, what was the effect of increasing alpha in Ridge regularization?
- What was the effect of applying Lasso regularization with a high alpha (= 1000) parameter? What about applying it with very high alpha (= 10000)? In each case, were any coefficients set to 0? Which?


### Cross-validation for regularization:

To actually pick an alpha, we should use cross-validation. sk-learn has built-in methods for this. 

In [None]:
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import LassoCV

In [None]:
regm = make_pipeline(
    StandardScaler(with_mean=False),
    RidgeCV(alphas=np.logspace(-6,6,13),
            cv = KFold(n_splits=5, shuffle=True, random_state=1),
            scoring = 'neg_mean_squared_error'))

regm.fit(new_xb,ypb); 

print('The best alpha is', regm[1].alpha_)

In [None]:
#Note: LassoCV re-orders alphas in DESCENDING order! To match scores to alphas, use the fitted alphas_ attribute rather than the list you passed in

lassomodel = make_pipeline(
    StandardScaler(with_mean = False),
    LassoCV(alphas = np.logspace(-1,4,6),
            cv = KFold(n_splits=5, shuffle=True, random_state=1),
            max_iter = 10000000, tol = 1e-6))

lassomodel.fit(new_xb,ypb)

print('Best alpha:', lassomodel[1].alpha_)

# Example of how to print all the results
#print('Alphas', lassomodel[1].alphas_)
#for i, alpha in enumerate(lassomodel[1].alphas_):
#    print('Score for alpha', alpha, np.mean(lassomodel[1].mse_path_[i,:])) #for each alpha (row), mean of CV estimate of MSE
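
The re-ordering mentioned in the note can be checked on a small example. This is a sketch on made-up data (the names ending in `_demo` and `alphas_in` are assumptions for illustration): the alphas are passed in ascending order, and each row of `mse_path_` is paired with the corresponding entry of the fitted `alphas_`.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold

# Synthetic data, assumed for illustration only
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(80, 3))
y_demo = 2 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(0, 0.5, 80)

alphas_in = np.logspace(-3, 1, 5)  # passed in ASCENDING order
lasso_cv_demo = LassoCV(alphas=alphas_in,
                        cv=KFold(n_splits=5, shuffle=True, random_state=1))
lasso_cv_demo.fit(X_demo, y_demo)

# mse_path_ has one row per entry of alphas_ and one column per fold
mean_mse = lasso_cv_demo.mse_path_.mean(axis=1)
for alpha, mse in zip(lasso_cv_demo.alphas_, mean_mse):
    print('alpha =', alpha, ' mean CV MSE =', mse)

print('Input order: ', alphas_in)
print('Fitted order:', lasso_cv_demo.alphas_)
print('Best alpha:  ', lasso_cv_demo.alpha_)
```

Pairing `mean_mse` with `alphas_in` instead of `alphas_` would attach each score to the wrong alpha.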

In this case, the Ridge and Lasso regularization are giving very different answers! This might be a case where we'd want to check how ElasticNet regularization behaves, since it allows us to vary smoothly between the two, optimizing the mix with a hyperparameter. You'll try that out on Homework 3.

### Acknowledgement Statement:

### Once you're done, upload the completed studio to Gradescope.