# Linear-Regression Model Exploration (CSC311 Machine Learning Project)

## Introduction

In this notebook, we explore a linear regression model using the `sklearn` library.
The goal is to determine the optimal hyperparameters for our model and to evaluate its performance on the class-provided dataset.

## Data Splitting

For this model I decided to try three different splits of the data:
 1. 60% Training, 20% Testing, 20% Validation
 2. 70% Training, 15% Testing, 15% Validation
 3. 80% Training, 10% Testing, 10% Validation

Reasoning: in-class experience, supported by several online articles (_link articles in footnotes_), suggested that the largest share of the data should be devoted to training.

Then, with _N = 1468_ data points (the total of the sample counts printed in the output further down), that gives us:
 1. `n_train, n_test, n_valid = 0.6 * N, 0.2 * N, 0.2 * N`
 2. `n_train, n_test, n_valid = 0.7 * N, 0.15 * N, 0.15 * N`
 3. `n_train, n_test, n_valid = 0.8 * N, 0.1 * N, 0.1 * N`
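
As a quick check of the arithmetic, the sizes can be computed directly. This is a minimal sketch, not the project's code: the helper name `split_sizes` is hypothetical, it assumes fractional sizes are truncated to integers with the remainder going to validation, and it uses 1468 samples, the total of the sample counts printed in the output further down.

```python
def split_sizes(n, train_frac, test_frac):
    """Return (n_train, n_test, n_valid) for n samples.

    Assumption: fractional sizes are truncated, and validation takes
    whatever is left over so the three sizes always sum to n.
    """
    n_train = int(train_frac * n)
    n_test = int(test_frac * n)
    n_valid = n - n_train - n_test
    return n_train, n_test, n_valid


N = 1468  # total of the sample counts printed by the notebook run
for train_frac, test_frac in [(0.6, 0.2), (0.7, 0.15), (0.8, 0.1)]:
    print((train_frac, test_frac), split_sizes(N, train_frac, test_frac))
```

Under these assumptions the sizes come out as 880/293/295, 1027/220/221 and 1174/146/148, which matches the counts in the printed output.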

The data is split in the following way (this is implemented in `split_data()` in `/model/encoding.py`):
- `x_train, y_train = x[:n_train], y[:n_train]`
- `x_test, y_test = x[n_train:n_train + n_test], y[n_train:n_train + n_test]`
- `x_valid, y_valid = x[n_train + n_test:], y[n_train + n_test:]`
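
The slicing can be sketched as a small function. The real `split_data()` lives in `/model/encoding.py` and is not shown in this notebook, so the name `split_data_sketch`, the integer truncation, and the absence of shuffling are all assumptions; only the return order is taken from the call in `process()`. The sketch treats the first `n_train` rows as the training set, as the sample counts in the output suggest.

```python
def split_data_sketch(x, y, train_frac, test_frac):
    """Split x and y into contiguous train/test/validation slices.

    Hypothetical stand-in for the project's split_data(); works on any
    sliceable sequence (lists, NumPy arrays). No shuffling is done.
    """
    n = len(x)
    n_train = int(train_frac * n)
    n_test = int(test_frac * n)

    x_train, y_train = x[:n_train], y[:n_train]
    x_test, y_test = x[n_train:n_train + n_test], y[n_train:n_train + n_test]
    x_valid, y_valid = x[n_train + n_test:], y[n_train + n_test:]
    return x_train, y_train, x_test, y_test, x_valid, y_valid


# Toy data: 8 samples, each target is twice its feature
x = list(range(8))
y = [2 * v for v in x]
parts = split_data_sketch(x, y, 0.5, 0.25)
print([len(p) for p in parts])
```

Because the slices are contiguous, the outcome depends on the row order of the dataset; whether `encode` shuffles the rows is not stated here.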

## Hyperparameter Tuning
TBD

In [None]:
import sys
sys.path.append('../../') # Path to root directory

from model.encoding import encode, split_data
from sklearn.linear_model import LinearRegression as LR
from sklearn.metrics import mean_squared_error as mse

FILE_NAME = "clean_dataset.csv"
FILE_PATH = "../../model/"

# Encode the data
X, t = encode(FILE_PATH + FILE_NAME)
print("Number of data points: " + str(X.shape[0]))

In [48]:
def process(splits=((0.6, 0.2), (0.7, 0.15), (0.8, 0.1), (0.9, 0.05), (0.95, 0.025))):
    for split in splits:
        # Split the data
        print("\tSplit:", split)
        X_train, t_train, X_test, t_test, X_valid, t_valid = split_data(X, t, split[0], split[1])
        
        # Print in a nice format
        print(f"Training data: {X_train.shape[0]} samples")
        print(f"Testing data: {X_test.shape[0]} samples")
        print(f"Validation data: {X_valid.shape[0]} samples")
        
        # Fit on the training set, then predict on all three subsets
        linreg = LR(fit_intercept = False)
        linreg.fit(X_train, t_train)
        
        y_train_pred = linreg.predict(X_train)
        y_test_pred = linreg.predict(X_test)
        y_valid_pred = linreg.predict(X_valid)
        
        # Calculate MSE for training, testing and validation predictions
        train_mse = mse(t_train, y_train_pred)
        test_mse = mse(t_test, y_test_pred)
        valid_mse = mse(t_valid, y_valid_pred)
        
        # print(linreg.coef_) # weights
        print("\t\tTraining MSE:", train_mse)
        print("\t\tTesting MSE:", test_mse)
        print("\t\tValidation MSE:", valid_mse)
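
For reference, the metric reported by `process()` is the mean of the squared differences between targets and predictions. The sketch below is a plain-Python illustration with a hypothetical helper name, `mean_squared_error_sketch`; for one-dimensional targets it computes the same quantity as `sklearn.metrics.mean_squared_error` with default arguments.

```python
def mean_squared_error_sketch(targets, predictions):
    """Average of squared residuals for two equal-length, non-empty sequences."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    if len(targets) == 0:
        raise ValueError("at least one sample is required")
    squared = [(t - y) ** 2 for t, y in zip(targets, predictions)]
    return sum(squared) / len(squared)


targets = [1.0, 2.0, 3.0, 4.0]
predictions = [1.0, 2.5, 2.0, 4.0]
print(mean_squared_error_sketch(targets, predictions))
```

Since the errors are squared, a few badly predicted samples can dominate the average, which is worth keeping in mind when one subset's MSE is far larger than the others.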

In [49]:
process()

	Split: (0.6, 0.2)
Training data: 880 samples
Testing data: 293 samples
Validation data: 295 samples
		Training MSE: 0.05380860790365482
		Testing MSE: 0.5629065305453214
		Validation MSE: 51.332363194031785
	Split: (0.7, 0.15)
Training data: 1027 samples
Testing data: 220 samples
Validation data: 221 samples
		Training MSE: 0.06427158602302369
		Testing MSE: 0.7265653181467783
		Validation MSE: 0.07251590801827494
	Split: (0.8, 0.1)
Training data: 1174 samples
Testing data: 146 samples
Validation data: 148 samples
		Training MSE: 0.0641254242836685
		Testing MSE: 1.2763779565186741
		Validation MSE: 0.06821763648267096
	Split: (0.9, 0.05)
Training data: 1321 samples
Testing data: 73 samples
Validation data: 74 samples
		Training MSE: 0.06595748219133354
		Testing MSE: 0.06259121488700696
		Validation MSE: 0.07745315543323072
	Split: (0.95, 0.025)
Training data: 1394 samples
Testing data: 36 samples
Validation data: 38 samples
		Training MSE: 0.06539519485981833
		Testing MSE: 0.091797