# Case 1 - Elastic-Net Regression

1. **Importing Libraries**

2. **Loading Data**

3. **Elastic Net Regression**
- We use Elastic Net regression because the data has many features (p = 116) relative to the number of training samples (n = 80), and we do not know in advance which features are informative. Elastic Net combines the penalties of Lasso regression (L1) and Ridge regression (L2). The L1 penalty can shrink coefficients exactly to 0, which is useful when some features are irrelevant. The L2 penalty tends to perform better when most features carry some signal. Since we do not know which situation applies, Elastic Net lets cross-validation choose the mix.
- Find the optimal model parameters, lambda_1 and lambda_2 (exposed in scikit-learn as `alpha` and `l1_ratio`), using 3-fold cross-validation (`cv=3` in the code).
- Compute the root mean squared error (RMSE) by applying the model with the optimal parameters to the test data.
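
For reference, the penalized least-squares objective behind these steps can be written out explicitly. The form below follows the parameterization documented for scikit-learn's `ElasticNet`, with $\rho$ standing for `l1_ratio`. The mapping to $\lambda_1$ and $\lambda_2$ is our reading of the notation in the list above, so treat it as an assumption rather than the notebook's definition.

$$
\hat{w} = \arg\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \lambda_1 \lVert w \rVert_1 + \frac{\lambda_2}{2} \lVert w \rVert_2^2,
\qquad \lambda_1 = \alpha \rho, \quad \lambda_2 = \alpha (1 - \rho)
$$

With $\rho = 1$ the L2 term vanishes and the model reduces to the Lasso. As $\rho$ approaches 0 the penalty approaches a pure Ridge penalty.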

## 1. Importing Libraries

In [1]:
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()  # Use seaborn's default plot theme

from sklearn.linear_model import ElasticNetCV

from sklearn.metrics import mean_squared_error

import warnings

# Set seeds for reproducibility (Python's random module and NumPy's global RNG)
import random
random.seed(42)
np.random.seed(42)

## 2. Loading Data

In [2]:
# Loading the data into numpy arrays
X_train = np.loadtxt('../data/case1Data_Xtrain.csv', delimiter=',')
X_test = np.loadtxt('../data/case1Data_Xtest.csv', delimiter=',')
y_train = np.loadtxt('../data/case1Data_ytrain.csv', delimiter=',')
y_test = np.loadtxt('../data/case1Data_ytest.csv', delimiter=',')

### Summary of the data

The shapes and missing-value counts below should match the summary in case_1_data_wrangling.ipynb.

In [3]:
# Printing the shape of the data
print("X_train: ", X_train.shape)
print("X_test: ", X_test.shape)
print("y_train: ", y_train.shape)
print("y_test: ", y_test.shape)

# Size of the training and test data
n_train = X_train.shape[0]
n_test = X_test.shape[0]
p = X_train.shape[1]

# Printing the size of the training and test data
print("n_train: ", n_train) # number of training samples
print("n_test: ", n_test) # number of test samples
print("p: ", p) # number of features/variables/columns/parameters

# Checking for missing values in the wrangled data
missing_values_X_train = np.isnan(X_train)
print("Number of missing values in X_train: ", np.sum(missing_values_X_train))
missing_values_X_test = np.isnan(X_test)
print("Number of missing values in X_test: ", np.sum(missing_values_X_test))
missing_values_y_train = np.isnan(y_train)
print("Number of missing values in y_train: ", np.sum(missing_values_y_train))
missing_values_y_test = np.isnan(y_test)
print("Number of missing values in y_test: ", np.sum(missing_values_y_test))

X_train:  (80, 116)
X_test:  (20, 116)
y_train:  (80,)
y_test:  (20,)
n_train:  80
n_test:  20
p:  116
Number of missing values in X_train:  0
Number of missing values in X_test:  0
Number of missing values in y_train:  0
Number of missing values in y_test:  0


## 3. Elastic Net Regression

In [4]:
# Setting a range of alphas to test
alphas = np.logspace(-4, 0, 100)

# Setting a range of l1_ratios
l1_ratios = np.logspace(-10, 0, 100)

with warnings.catch_warnings():  # suppress the convergence warnings raised during the grid search
    warnings.simplefilter("ignore")

    # Fitting the Elastic Net model on the training data with 3-fold cross-validation
    model = ElasticNetCV(cv=3, l1_ratio=l1_ratios, alphas=alphas, fit_intercept=False).fit(X_train, y_train)

# Printing the optimal alpha and l1_ratio
print(f'Optimal alpha: {model.alpha_}\n')
print('For l1_ratio = 0 the penalty is an L2 penalty. For l1_ratio = 1 it is an L1 penalty. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.')
print(f'Optimal l1_ratio: {model.l1_ratio_}')

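# Plotting the cross-validated mean squared error for the selected l1_ratio (left disabled).
# With several l1_ratios, model.mse_path_ has shape (n_l1_ratios, n_alphas, n_folds),
# so the row of the selected l1_ratio must be picked before averaging across folds.
#best_l1 = np.argmin(np.abs(l1_ratios - model.l1_ratio_))
#plt.close('all')
#plt.figure()
#plt.semilogx(model.alphas_, model.mse_path_[best_l1].mean(axis=-1), 'k', label='Average across folds', linewidth=2)
#plt.xlabel(r'$\alpha$ (Regularization strength)')
#plt.ylabel('Mean squared error')
#plt.title(f'Cross-validated MSE of Elastic Net Fit (Optimal alpha = {model.alpha_:.3f})')
#plt.show()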

Optimal alpha: 0.016681005372000592

For l1_ratio = 0 the penalty is an L2 penalty. For l1_ratio = 1 it is an L1 penalty. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.
Optimal l1_ratio: 1.0
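
To make the grid search easier to inspect, the following is a minimal sketch of how the cross-validation errors stored on a fitted `ElasticNetCV` can be read back. It runs on small synthetic data, not the case data, and every name ending in `_demo` or starting with `demo_` is hypothetical. It relies on `mse_path_` having one axis each for `l1_ratio`, alpha and fold when several `l1_ratio` values are passed, which is how the scikit-learn documentation describes it.

```python
import warnings
import numpy as np
from sklearn.linear_model import ElasticNetCV

# Synthetic stand-in data (hypothetical): 60 samples, 10 features, 3 of them informative
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 10))
true_coef = np.zeros(10)
true_coef[:3] = [2.0, -1.0, 0.5]
y_demo = X_demo @ true_coef + 0.1 * rng.normal(size=60)

# Small illustrative grids, much coarser than the ones used in the notebook
demo_alphas = np.logspace(-3, 0, 5)
demo_l1_ratios = [0.1, 0.5, 1.0]

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    demo_model = ElasticNetCV(cv=3, l1_ratio=demo_l1_ratios, alphas=demo_alphas,
                              fit_intercept=False).fit(X_demo, y_demo)

# Average the per-fold MSE: one row per l1_ratio, one column per alpha
mean_mse = demo_model.mse_path_.mean(axis=-1)

# alphas_ may be 1-D or 2-D depending on how the grid was given; broadcast to the MSE grid
alpha_grid = np.broadcast_to(demo_model.alphas_, mean_mse.shape)

# Locate the grid cell with the lowest mean cross-validation error
i_l1, i_alpha = np.unravel_index(np.argmin(mean_mse), mean_mse.shape)
print("mse_path_ shape:", demo_model.mse_path_.shape)
print("best l1_ratio:", demo_l1_ratios[i_l1], "best alpha:", alpha_grid[i_l1, i_alpha])
print("non-zero coefficients:", np.count_nonzero(demo_model.coef_))
```

In the notebook itself, the same indexing applied to `model` would show how flat the error surface is around the selected `alpha`, and `np.count_nonzero(model.coef_)` would show how many of the 116 features the selected model keeps.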


In [5]:
# Using the ElasticNetCV model, refit with the optimal alpha and l1_ratio, to predict the target values on the test data
y_hat = model.predict(X_test)

# Calculating the RMSE of the ElasticNetCV model
rmse = np.sqrt(mean_squared_error(y_test, y_hat))

# Printing the RMSE
print('Test RMSE of the Elastic Net model with alpha and l1_ratio selected by cross-validation:')
print(f'RMSE: {rmse}')

Test RMSE of the Elastic Net model with alpha and l1_ratio selected by cross-validation:
RMSE: 0.39768204205170693
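
An RMSE value is easier to judge next to a trivial baseline, such as always predicting the mean of the training targets. The sketch below illustrates the comparison with made-up numbers and a hypothetical helper `rmse`. It is not computed on the case data and is not part of the original analysis.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two 1-D arrays (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up values for illustration only
y_train_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_test_demo = np.array([2.0, 4.0])
y_hat_demo = np.array([2.5, 3.5])

model_rmse = rmse(y_test_demo, y_hat_demo)

# Baseline: predict the training mean for every test sample
baseline_pred = np.full_like(y_test_demo, y_train_demo.mean())
baseline_rmse = rmse(y_test_demo, baseline_pred)

print("model RMSE:", model_rmse)
print("baseline RMSE:", baseline_rmse)
```

Applied to the notebook's variables, the baseline would be `rmse(y_test, np.full_like(y_test, y_train.mean()))`, to be compared with the RMSE reported above.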


## 3.1 Predicting $\hat{y}$ in the new data set (case1Data_Xnew.csv)

### 3.1.1 Loading case1Data_Xnew_wrangled.csv

In [12]:
# Loading the data into numpy arrays
X_new = np.loadtxt('../data/case1Data_Xnew_wrangled.csv', delimiter=',')
print("X_new: ", X_new.shape)


X_new:  (1000, 116)


### 3.1.2 Predicting and saving predictions in a new file

In [13]:
# Predicting y_hat for the wrangled new data (case1Data_Xnew_wrangled.csv)
y_hat_new = model.predict(X_new)

# Printing the shape of the predictions
print(y_hat_new.shape)

# Saving the predictions to a csv file
np.savetxt('../results/sample_predictions_s183220_s225001.csv', y_hat_new, delimiter='\n')

(1000,)


### 3.1.3 Calculating the expected prediction error

In [None]:
# The new data in case1Data_Xnew.csv has no true y-values, so the prediction error cannot be computed on it directly.
# The expected prediction error decomposes into squared bias + variance + irreducible noise.
# The bias is the difference between the expected value of the predictions and the true value.
# The RMSE on the held-out test set (cell In [5]) can serve as an estimate of the prediction error on the new data.
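
Because the test set has only 20 samples, a single RMSE figure is a noisy estimate of the expected prediction error. One way to convey that uncertainty is a bootstrap over the held-out squared errors. The following is a sketch under that assumption: `bootstrap_rmse` is a hypothetical helper, the data are synthetic, and this is not necessarily the method intended for this cell.

```python
import numpy as np

def bootstrap_rmse(y_true, y_pred, n_boot=2000, seed=0):
    """RMSE point estimate with a 95% percentile bootstrap interval (hypothetical helper)."""
    sq_err = (np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)) ** 2
    rng = np.random.default_rng(seed)
    # Resample the squared errors with replacement, n_boot times
    idx = rng.integers(0, sq_err.size, size=(n_boot, sq_err.size))
    boot = np.sqrt(sq_err[idx].mean(axis=1))
    lower, upper = np.percentile(boot, [2.5, 97.5])
    return float(np.sqrt(sq_err.mean())), float(lower), float(upper)

# Synthetic stand-in for 20 held-out targets and predictions (not the case data)
rng = np.random.default_rng(1)
y_true_demo = rng.normal(size=20)
y_pred_demo = y_true_demo + 0.4 * rng.normal(size=20)

estimate, lower, upper = bootstrap_rmse(y_true_demo, y_pred_demo)
print(f"RMSE estimate: {estimate:.3f} (95% interval: {lower:.3f} to {upper:.3f})")
```

In the notebook, the call would be `bootstrap_rmse(y_test, y_hat)`. The interval only reflects sampling variability in the test set, not the variability from refitting the model on different training data.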
