# CS342 Machine Learning
# Lab 4: Linear regression

## Department of Computer Science, University of Warwick

This lab focuses on the use of regularization for linear regression.

# Data files for the lab

If working on one of the DCS machines, the data may be found here:

`/modules/cs342/2020/lab4/data/prostate_data.csv`

You may load the data directly from that directory.

Alternatively, the data is also available on the CS342 website.

The prostate dataset (file *prostate_data.csv*; see https://web.stanford.edu/~hastie/ElemStatLearn//datasets/prostate.data) will be used to predict the numerical target variable *lpsa* from 8 features (*lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45*). There are 97 samples in total. The last column, *train*, is a Train/Predict flag used to separate the samples into two subsets. The *train = T* subset will be used for model fitting and cross-validation (CV), while the *train = F* subset will be used for testing after model selection and training.

Import the data into a Pandas data frame and standardize the features and the targets. 
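
Standardizing a column means subtracting its mean and dividing by its standard deviation. One detail worth knowing is which standard deviation is used. The sketch below, on a made-up toy data frame (`toy` and its values are assumptions, not the prostate data), compares the pandas default with scikit-learn's `StandardScaler`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the real features (values are made up)
toy = pd.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0], "x2": [10.0, 0.0, 5.0, 25.0]})

# z-score with pandas: DataFrame.std() defaults to the sample std (ddof=1)
z_pandas = (toy - toy.mean()) / toy.std()

# z-score with scikit-learn: StandardScaler divides by the population std (ddof=0)
z_sklearn = StandardScaler().fit_transform(toy)

# The two results differ only by the constant factor sqrt(n / (n - 1))
n = len(toy)
print(np.allclose(z_pandas.to_numpy() * np.sqrt(n / (n - 1)), z_sklearn))
```

Either convention gives zero-mean columns. The difference is a small rescaling that shrinks as the number of samples grows.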

In [14]:
# Import the prostate dataset
import pandas as pd
prostate = pd.read_csv('prostate_data.csv')

# Standardize features and targets: every column except the 'train' flag
# is shifted to zero mean and scaled to unit (sample) standard deviation
cols = [col for col in prostate.columns if col != 'train']
prostate[cols] = (prostate[cols] - prostate[cols].mean()) / prostate[cols].std()

print(prostate)


      lcavol   lweight       age      lbph       svi       lcp   gleason  \
0  -1.637356 -2.006212 -1.862426 -1.024706 -0.522941 -0.863171 -1.042157   
1  -1.988980 -0.722009 -0.787896 -1.024706 -0.522941 -0.863171 -1.042157   
2  -1.578819 -2.188784  1.361163 -1.024706 -0.522941 -0.863171  0.342627   
3  -2.166917 -0.807994 -0.787896 -1.024706 -0.522941 -0.863171 -1.042157   
4  -0.507874 -0.458834 -0.250631 -1.024706 -0.522941 -0.863171 -1.042157   
..       ...       ...       ...       ...       ...       ...       ...   
92  1.255920  0.577607  0.555266 -1.024706  1.892548  1.073572  0.342627   
93  2.096506  0.625489 -2.668323 -1.024706  1.892548  1.679542  0.342627   
94  1.321402 -0.543304 -1.593794 -1.024706  1.892548  1.890377  0.342627   
95  1.300290  0.338384  0.555266  1.004813  1.892548  1.242632  0.342627   
96  1.800367  0.807764  0.555266  0.232904  1.892548  2.205279  0.342627   

       pgg45      lpsa train  
0  -0.864467 -2.520226     T  
1  -0.864467 -2.287827   

### Non-regularized linear regression 

Scikit-learn provides a wide range of linear models: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model. The module includes every implementation needed for this lab, including ones that use cross-validation to select the regularization hyperparameter.

Fit a non-regularized linear regression model to the *train = T* subset using least squares. Once fitted, use the model to predict the target variable in the *train = F* subset. Use the coefficient of determination, $R^2$, to evaluate the model's performance on the test data. The best possible score is 1.0, and the score can be negative because a model can be arbitrarily worse than a constant baseline. A constant model that always predicts the expected value of $\mathbf{y}$, disregarding the input features, would get a score of 0.0. See http://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model for details about how $R^2$ is computed.
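
To make the three reference values of $R^2$ concrete, here is a minimal sketch that computes $R^2 = 1 - SS_{res}/SS_{tot}$ by hand on made-up numbers. The helper `r_squared` is a hypothetical name introduced for illustration; scikit-learn's `score` method and `sklearn.metrics.r2_score` compute the same quantity.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r_squared(y_true, y_true)                      # 1.0: predictions match exactly
constant = r_squared(y_true, np.full(4, y_true.mean()))  # 0.0: always predict the mean
bad = r_squared(y_true, y_true[::-1])                    # -3.0: worse than the constant model

print(perfect, constant, bad)
```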

In [37]:
# Fit and test a non-regularized linear regression model using least squares
from sklearn.linear_model import LinearRegression

# Training subset: rows flagged train == 'T'
train = prostate[prostate['train'] == 'T']
X = train.drop(columns=['train', 'lpsa'])
y = train['lpsa']

# Fit by ordinary least squares
reg = LinearRegression().fit(X, y)

# Test subset: rows flagged train == 'F'
test = prostate[prostate['train'] == 'F']
X2 = test.drop(columns=['train', 'lpsa'])
y2 = test['lpsa']

# Predict on the test subset, then report R^2 (score() predicts internally)
y2_pred = reg.predict(X2)
reg.score(X2, y2)

0.5033798502381805

### L2-regularized linear regression (ridge regression)

Fit an L2-regularized linear regression model (using least squares) to the *train = T* subset after using 3-fold CV to select the hyperparameter value for regularization. The range of hyperparameter values to select from is: $alpha=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]$. Use the model fitted with the best hyperparameter value to predict the target variable in the *train = F* subset. Use the coefficient of determination, $R^2$, to evaluate the model performance on the test data.
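
The selection procedure can also be written out by hand, which shows what a cross-validated estimator automates. The sketch below is one possible way to do it, run on synthetic data from `make_regression` (an assumption, not the prostate data): each candidate alpha is scored by its mean validation $R^2$ over 3 folds, and the best one is refitted on all the training samples.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (not the prostate dataset)
X_demo, y_demo = make_regression(n_samples=60, n_features=8, noise=10.0, random_state=0)

alphas = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
kf = KFold(n_splits=3)

# Mean validation R^2 over the 3 folds for each candidate alpha
mean_scores = [
    cross_val_score(Ridge(alpha=a), X_demo, y_demo, cv=kf).mean() for a in alphas
]

# Keep the alpha with the highest mean score and refit on all training samples
best_alpha = alphas[int(np.argmax(mean_scores))]
final_model = Ridge(alpha=best_alpha).fit(X_demo, y_demo)
print(best_alpha)
```

With `RidgeCV`, the selected value is available after fitting as the `alpha_` attribute.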

In [40]:
# Fit and test an L2-regularized linear regression model using least squares
from sklearn.model_selection import KFold
from sklearn.linear_model import RidgeCV

folds = 3
alphas = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
kf = KFold(n_splits=folds)

# RidgeCV selects alpha by 3-fold CV on the training subset;
# a separate name keeps this model available for later comparison
ridge = RidgeCV(alphas=alphas, cv=kf).fit(X, y)
ridge.predict(X2)
ridge.score(X2, y2)

0.5370542358730208

### L1-regularized linear regression (lasso regression)

Fit an L1-regularized linear regression model (using least squares) to the *train = T* subset after using 3-fold CV to select the hyperparameter value for regularization. The range of hyperparameter values to select from is: $alpha=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]$. Note that for L1-regularization we use a set of low values, as this type of regularization can easily force weights to be exactly $0$ when alpha is relatively large. Use the model fitted with the best hyperparameter value to predict the target variable in the *train = F* subset. Use the coefficient of determination, $R^2$, to evaluate the model performance on the test data.
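
The effect of alpha on sparsity can be checked directly. The following sketch uses synthetic data from `make_regression` (an assumption, not the prostate data), standardizes it, and counts how many lasso weights remain non-zero for a few illustrative alpha values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic stand-in data: 8 features, only 3 of which carry signal
X_demo, y_demo = make_regression(
    n_samples=60, n_features=8, n_informative=3, noise=5.0, random_state=0
)

# Standardize features and target (population std, ddof=0)
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
y_demo = (y_demo - y_demo.mean()) / y_demo.std()

# Count the non-zero weights for increasing regularization strength
nonzero_counts = {}
for alpha in [0.01, 0.1, 0.5, 1.0, 2.0]:
    model = Lasso(alpha=alpha).fit(X_demo, y_demo)
    nonzero_counts[alpha] = int(np.sum(model.coef_ != 0))
    print(alpha, nonzero_counts[alpha])
```

With data standardized this way, every weight is zero once alpha exceeds the largest absolute correlation between a feature and the target, which is why values much above 1 are of little use here.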

In [46]:
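# Fit and test an L1-regularized linear regression model using least squares
from sklearn.linear_model import LassoCV

alphas2 = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
kf = KFold(n_splits=folds)

# LassoCV selects alpha by 3-fold CV on the training subset
lasso = LassoCV(alphas=alphas2, cv=kf).fit(X, y)
lasso.predict(X2)
lasso.score(X2, y2)

# Display the learned weights, in the same order as the feature columns
coeffs = lasso.coef_
print(coeffs)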

[ 0.47383974  0.18554689 -0.          0.07233373  0.13220524  0.
  0.          0.05012802]


1. Which model performs the best for this dataset?
2. Which features are irrelevant for this task? **Hint:** display the learned coefficients (weights) for each model and recall which type of regularization allows for feature selection.
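
One way to follow the hint is to collect the coefficients of all three models in a single table, one row per feature. The sketch below does this on synthetic data; the feature names `f0` to `f7` and the fixed alpha values are hypothetical placeholders, not results for the prostate data.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic stand-in data with hypothetical feature names
feature_names = [f"f{i}" for i in range(8)]
X_demo, y_demo = make_regression(
    n_samples=60, n_features=8, n_informative=3, noise=5.0, random_state=0
)

models = {
    "least squares": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=1.0),
}

# One column of learned weights per model, one row per feature
coef_table = pd.DataFrame(
    {name: model.fit(X_demo, y_demo).coef_ for name, model in models.items()},
    index=feature_names,
)
print(coef_table.round(3))
```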