# CS342 Machine Learning
# Lab 4: Linear regression

## Department of Computer Science, University of Warwick

This lab focuses on the use of regularization for linear regression.

# Data files for the lab

If working on one of the DCS machines, the data may be found here:

`/modules/cs342/2020/lab4/data/prostate_data.csv`

You may load the data directly from that directory.

Alternatively, the data is also available on the CS342 website.

The prostate dataset (file *prostate_data.csv*; see https://web.stanford.edu/~hastie/ElemStatLearn//datasets/prostate.data) will be used to predict the numerical target variable *lpsa* from 8 features (*lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45*). There are 97 samples in total. The last column, *train*, is a flag that separates the samples into two subsets: the *train = T* subset will be used for model fitting and cross-validation (CV), while the *train = F* subset will be used for testing after model selection and training.
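
The split described above can be sketched with a boolean mask on the flag column. This is a minimal illustration on a toy frame with made-up values (the column `x` and all numbers are hypothetical, not taken from the prostate data); it assumes the flag is read in as the strings `'T'` and `'F'`.

```python
import pandas as pd

# Toy stand-in frame (hypothetical values, NOT the prostate data)
toy = pd.DataFrame({
    "x": [0.1, 0.4, 0.3, 0.9, 0.7],
    "lpsa": [1.0, 1.2, 0.8, 2.1, 1.9],
    "train": ["T", "T", "F", "T", "F"],
})

# Boolean mask: True for rows flagged as training samples
is_train = toy["train"] == "T"

# Drop the flag column once it has been used for the split
train_df = toy[is_train].drop(columns="train")
test_df = toy[~is_train].drop(columns="train")

# Separate features from the target in each subset
X_train, y_train = train_df.drop(columns="lpsa"), train_df["lpsa"]
X_test, y_test = test_df.drop(columns="lpsa"), test_df["lpsa"]

print(len(train_df), len(test_df))
```

The same masking pattern should carry over to the full data frame once it is loaded, provided the flag column is named `train`.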

Import the data into a Pandas data frame and standardize the features and the targets. 

In [12]:
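#import prostate dataset
import pandas as pd
prostate = pd.read_csv('prostate_data.csv')

#standardize features and targets: subtract each column's mean and divide by
#its standard deviation; the 'train' flag column is left unchanged
#(the raw values stay in 'prostate', the standardized ones go to 'prostate_std')
prostate_std = prostate.copy()
for col in prostate.columns:
    if col != 'train':
        mean = prostate[col].mean()
        std = prostate[col].std()
        prostate_std[col] = (prostate[col] - mean) / std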

      lcavol   lweight  age      lbph  svi       lcp  gleason  pgg45  \
0  -0.579818  2.769459   50 -1.386294    0 -1.386294        6      0   
1  -0.994252  3.319626   58 -1.386294    0 -1.386294        6      0   
2  -0.510826  2.691243   74 -1.386294    0 -1.386294        7     20   
3  -1.203973  3.282789   58 -1.386294    0 -1.386294        6      0   
4   0.751416  3.432373   62 -1.386294    0 -1.386294        6      0   
..       ...       ...  ...       ...  ...       ...      ...    ...   
92  2.830268  3.876396   68 -1.386294    1  1.321756        7     60   
93  3.821004  3.896909   44 -1.386294    1  2.169054        7     40   
94  2.907447  3.396185   52 -1.386294    1  2.463853        7     10   
95  2.882564  3.773910   68  1.558145    1  1.558145        7     80   
96  3.471966  3.974998   68  0.438255    1  2.904165        7     20   

        lpsa train  
0  -0.430783     T  
1  -0.162519     T  
2  -0.162519     T  
3  -0.162519     T  
4   0.371564     T  
..       

### Non-regularized linear regression 

Scikit-learn has a plethora of linear models: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model. It includes all the implementations needed for this lab, including implementations that employ cross-validation to select hyperparameter values for regularization.

Fit a non-regularized linear regression model to the *train = T* subset using least squares. Once fitted, use the model to predict the target variable in the *train = F* subset. Use the coefficient of determination, $R^2$, to evaluate the model performance on the test data. The best possible score is 1.0, and the score can be negative because a model can be arbitrarily worse than a constant prediction. A constant model that always predicts the mean of the observed $\mathbf{y}$, disregarding the input features, would get a score of 0.0. See http://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model for details about how $R^2$ is computed.
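
As a rough illustration of the $R^2$ score, the sketch below fits `LinearRegression` on synthetic stand-in data (not the prostate dataset; the sample sizes, weights and noise level are arbitrary assumptions) and computes $R^2 = 1 - SS_{res}/SS_{tot}$ by hand next to scikit-learn's `r2_score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in data (NOT the prostate dataset): y is linear in 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=90)
X_train, X_test = X[:60], X[60:]
y_train, y_test = y[:60], y[60:]

# Ordinary least squares fit on the training part, prediction on the test part
ols = LinearRegression().fit(X_train, y_train)
y_pred = ols.predict(X_test)

# R^2 = 1 - SS_res / SS_tot, where SS_tot uses the mean of the test targets
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot
r2_sklearn = r2_score(y_test, y_pred)

# Predicting the mean of the test targets everywhere gives a score of 0
r2_constant = r2_score(y_test, np.full_like(y_test, y_test.mean()))

print(r2_manual, r2_sklearn, r2_constant)
```

Calling `ols.score(X_test, y_test)` should return the same $R^2$ value as `r2_score` here.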

In [6]:
#Fit and test a non-regularized linear regression model using least squares


### L2-regularized linear regression (ridge regression)

Fit an L2-regularized linear regression model (using least squares) to the *train = T* subset after using 3-fold CV to select the hyperparameter value for regularization. The range of hyperparameter values to select from is: $\alpha \in \{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12\}$. Use the model fitted with the best hyperparameter value to predict the target variable in the *train = F* subset. Use the coefficient of determination, $R^2$, to evaluate the model performance on the test data.
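
One possible way to combine the CV-based selection and the final fit is scikit-learn's `RidgeCV`. The sketch below runs it on synthetic stand-in data (not the prostate dataset; the true weights, noise level and split sizes are arbitrary assumptions) with the candidate grid from the text and `cv=3`.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic stand-in data (NOT the prostate dataset): 8 features, 3 informative
rng = np.random.default_rng(1)
X = rng.normal(size=(90, 8))
true_w = np.array([2.0, -1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_w + rng.normal(scale=0.5, size=90)
X_train, X_test = X[:60], X[60:]
y_train, y_test = y[:60], y[60:]

# Candidate regularization strengths from the text: 1, 2, ..., 12
alphas = [float(a) for a in range(1, 13)]

# 3-fold CV picks the best alpha, then the model is refit on all training data
ridge = RidgeCV(alphas=alphas, cv=3).fit(X_train, y_train)

best_alpha = ridge.alpha_                # selected hyperparameter value
test_r2 = ridge.score(X_test, y_test)    # R^2 on the held-out part

print(best_alpha, test_r2)
```

An explicit `GridSearchCV` over `Ridge` would be an alternative; `RidgeCV` is used here only because it keeps the sketch short.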

In [7]:
#Fit and test an L2-regularized linear regression model using least squares


### L1-regularized linear regression (lasso regression)

Fit an L1-regularized linear regression model (using least squares) to the *train = T* subset after using 3-fold CV to select the hyperparameter value for regularization. The range of hyperparameter values to select from is: $\alpha \in \{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1\}$. Note that for L1-regularization, we use a set of low values as this type of regularization can easily force the weights to be $0$ if $\alpha$ is relatively large. Use the model fitted with the best hyperparameter value to predict the target variable in the *train = F* subset. Use the coefficient of determination, $R^2$, to evaluate the model performance on the test data.
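
The following sketch shows one way this could look with scikit-learn's `LassoCV`, again on synthetic stand-in data (not the prostate dataset; weights, noise level and split sizes are arbitrary assumptions). It also fits a plain `Lasso` with a deliberately large, hypothetical value `alpha=10.0` to illustrate how strong L1 regularization drives every weight to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

# Synthetic stand-in data (NOT the prostate dataset): 8 features, 3 informative
rng = np.random.default_rng(2)
X = rng.normal(size=(90, 8))
true_w = np.array([2.0, -1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_w + rng.normal(scale=0.5, size=90)
X_train, X_test = X[:60], X[60:]
y_train, y_test = y[:60], y[60:]

# Candidate regularization strengths from the text: 0.1, 0.2, ..., 1.0
alphas = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

# 3-fold CV picks the best alpha, then the model is refit on all training data
lasso = LassoCV(alphas=alphas, cv=3).fit(X_train, y_train)
best_alpha = lasso.alpha_
test_r2 = lasso.score(X_test, y_test)

# Number of weights set exactly to zero by the selected model
n_zero = int(np.sum(lasso.coef_ == 0.0))

# A much larger (hypothetical) alpha removes every feature from the model
lasso_big = Lasso(alpha=10.0).fit(X_train, y_train)
all_zero = bool(np.all(lasso_big.coef_ == 0.0))

print(best_alpha, test_r2, n_zero, all_zero)
```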

In [8]:
#Fit and test an L1-regularized linear regression model using least squares


1. Which model performs the best for this dataset?
2. Which features are irrelevant for this task? **Hint:** display the learned coefficients (weights) for each model and recall which type of regularization allows for feature selection.
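
To follow the hint, the learned weights of several models can be placed side by side in one table. The sketch below does this on synthetic stand-in data where only the first two features matter; the feature names `x0` to `x4` and the values `alpha=5.0` and `alpha=0.2` are hypothetical choices for illustration, not the values selected by CV in the lab.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic stand-in data (NOT the prostate dataset): only x0 and x1 affect y
rng = np.random.default_rng(3)
feature_names = [f"x{i}" for i in range(5)]  # hypothetical feature names
X = rng.normal(size=(80, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.3, size=80)

# Hypothetical regularization strengths, chosen only for illustration
models = {
    "ols": LinearRegression(),
    "ridge": Ridge(alpha=5.0),
    "lasso": Lasso(alpha=0.2),
}

# One column of coefficients per model, one row per feature
coefs = pd.DataFrame(
    {name: model.fit(X, y).coef_ for name, model in models.items()},
    index=feature_names,
)

print(coefs.round(3))
```

In a table like this, ridge typically shrinks weights without making them exactly zero, while lasso can set the weights of uninformative features to exactly zero.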