# Ridge Regression
This notebook contains the code for running multiple models on the ML Higgs boson data using Ridge regression with normal equations.
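
For orientation, ridge regression with normal equations has a closed-form solution, so no iterative optimisation is needed. The cell below is a minimal sketch of that idea and not the code in `implementations`, which is not shown here. The function name `ridge_regression_sketch` and the `2 * N * lambda_` scaling of the penalty are assumptions made for illustration.

```python
import numpy as np

def ridge_regression_sketch(y, tx, lambda_):
    """Solve (X^T X + 2 N lambda I) w = X^T y for w (hypothetical helper)."""
    n_samples, n_features = tx.shape
    # Scaling the penalty by 2N is one common convention; it is an assumption here.
    penalty = 2 * n_samples * lambda_ * np.eye(n_features)
    # Solving the linear system avoids forming an explicit matrix inverse.
    w = np.linalg.solve(tx.T @ tx + penalty, tx.T @ y)
    residual = y - tx @ w
    mse = residual @ residual / (2 * n_samples)
    return w, mse

# Tiny synthetic check: noiseless linear data and a negligible penalty.
rng = np.random.default_rng(0)
tx_demo = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y_demo = tx_demo @ w_true
w_hat, mse_hat = ridge_regression_sketch(y_demo, tx_demo, lambda_=1e-8)
```

With a penalty this small, `w_hat` should be very close to `w_true`; larger values of `lambda_` shrink the weights toward zero.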

In [2]:
# Useful starting lines
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
%load_ext autoreload
%autoreload 2
import sys
sys.path.append('..')
from helpers import *
from implementations import *

## Load train and test data

In [3]:
DATA_TRAIN_PATH = '../data/train.csv'
y_train, X_train, ids = load_csv_data(DATA_TRAIN_PATH)

In [4]:
DATA_TEST_PATH = '../data/test.csv'
_, X_test, ids_test = load_csv_data(DATA_TEST_PATH)

In [5]:
X_train.shape

(250000, 30)

## Baseline model using raw data
First, let's run the ridge regression algorithm on the raw data without any preprocessing. We will use K-fold cross-validation to report the metrics on the held-out folds and grid search to tune the hyperparameters.
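
The combination of grid search and K-fold cross-validation can be sketched as follows. This is an illustration only: `ridge_regression_cv` is not shown in this notebook, so every name below (`k_fold_indices`, `expand_grid`, `grid_search_cv_sketch`) and the choice of 5 folds are assumptions. Scalar grid entries such as `'max_iters': 1000` are treated as single-value lists.

```python
import itertools
import numpy as np

def k_fold_indices(n_samples, k, seed=1):
    """Shuffle the row indices and cut them into k roughly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def expand_grid(param_grid):
    """Yield one dict per combination of the values in param_grid."""
    keys = list(param_grid)
    # np.atleast_1d wraps scalars so that every entry can be iterated over.
    values = [np.atleast_1d(param_grid[key]).tolist() for key in keys]
    for combo in itertools.product(*values):
        yield dict(zip(keys, combo))

def grid_search_cv_sketch(y, tx, param_grid, fit, loss, k=5):
    """Return the lowest mean held-out loss and the parameters that produced it."""
    folds = k_fold_indices(len(y), k)
    best_loss, best_params = np.inf, None
    for params in expand_grid(param_grid):
        fold_losses = []
        for i, val_idx in enumerate(folds):
            # Train on every fold except the i-th, evaluate on the i-th.
            train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
            w = fit(y[train_idx], tx[train_idx], **params)
            fold_losses.append(loss(y[val_idx], tx[val_idx], w))
        mean_loss = float(np.mean(fold_losses))
        if mean_loss < best_loss:
            best_loss, best_params = mean_loss, params
    return best_loss, best_params

# Demo with a plain ridge fit and a halved mean squared error as the loss.
def fit_demo(y, tx, lambda_):
    return np.linalg.solve(tx.T @ tx + lambda_ * np.eye(tx.shape[1]), tx.T @ y)

def loss_demo(y, tx, w):
    return float(np.mean((y - tx @ w) ** 2) / 2)

rng = np.random.default_rng(0)
tx_demo = rng.normal(size=(100, 2))
y_demo = tx_demo @ np.array([1.0, -1.0]) + 0.1 * rng.normal(size=100)
grid_demo = {"lambda_": np.logspace(-4, 0, 5)}
best_loss, best_params = grid_search_cv_sketch(y_demo, tx_demo, grid_demo, fit_demo, loss_demo)
```

Each parameter combination is scored only on rows it was not trained on, so the selected `lambda_` is not simply the one that fits the training rows best.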

In [6]:
param_grid = {
    'lambda_': np.logspace(-4,0,5) ,
    'max_iters': 1000,
    'gamma': [0.01, 0.1]
}
metrics, params = ridge_regression_cv(y_train, X_train, param_grid=param_grid, transform=False)

In [7]:
metrics, params

({'loss': 0.3397855566120538,
  'accuracy': 74.42840000000001,
  'f1_score': 0.5688096990110691},
 {'lambda_': 0.0001, 'max_iters': 1000, 'gamma': 0.01})

Ridge regression reaches about 74.4% accuracy and an F1 score of about 0.57 on the raw data, which gives us a baseline to improve on.

## Baseline model using lightly feature engineered data
Now let's preprocess our data a bit to handle the missing values (-999) in various ways.

### All features with NaN values imputed
First, let's impute all missing values with the median of their respective columns. We set `imputable_th` to `1`, which means that every column with a NaN ratio below 1 is imputed, or in other words all columns.
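
As a rough illustration of median imputation (not the actual `preprocess` code, which is not shown here), the sketch below replaces the `-999` sentinel with the column median computed on the training set only, and reuses those medians for the test set. The name `impute_median_sketch` is hypothetical.

```python
import numpy as np

MISSING = -999.0  # sentinel used for missing values in this dataset

def impute_median_sketch(x_train, x_test, missing=MISSING):
    """Fill missing entries with per-column medians of the training data."""
    train = np.where(x_train == missing, np.nan, x_train).astype(float)
    test = np.where(x_test == missing, np.nan, x_test).astype(float)
    # nanmedian ignores the missing entries when computing each column median.
    medians = np.nanmedian(train, axis=0)
    train_filled = np.where(np.isnan(train), medians, train)
    test_filled = np.where(np.isnan(test), medians, test)
    return train_filled, test_filled

x_train_demo = np.array([[1.0, -999.0], [3.0, 4.0], [-999.0, 8.0]])
x_test_demo = np.array([[-999.0, -999.0]])
train_filled, test_filled = impute_median_sketch(x_train_demo, x_test_demo)
```

Taking the medians from the training set alone keeps information from the test set out of the preprocessing step. A column that is missing in every training row would need separate handling, since its median is undefined.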

In [8]:
tX_train, ty_train, tX_test, ty_test, cont_features = preprocess(X_train, y_train, X_test, imputable_th=1, encodable_th=0)

In [9]:
tX_train.shape, tX_test.shape

((236483, 31), (568238, 31))

All columns are now imputed, and there is one extra column for the bias.

In [10]:
param_grid = {
    'lambda_': np.logspace(-4,0,5) ,
    'max_iters': 1000,
    'gamma': [0.01, 0.1]
}
metrics, params = ridge_regression_cv(ty_train, tX_train, param_grid=param_grid, transform=False)

In [11]:
metrics, params

({'loss': 0.3256406318235878,
  'accuracy': 76.44029093369419,
  'f1_score': 0.6129394851260843},
 {'lambda_': 0.0001, 'max_iters': 1000, 'gamma': 0.01})

Accuracy and f1_score have gone up and the loss has gone down compared to the raw data, indicating that this preprocessing is useful.

### All features with NaN values encoded
Now let's instead encode the features with NaN values as indicator features: the new feature takes the value 1 if the original value is missing and 0 otherwise.
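
A minimal sketch of this encoding is shown below. It assumes that the indicator replaces the original column rather than being appended next to it. The unchanged column count in the output is consistent with that, but it remains an inference, and `encode_missing_sketch` is a hypothetical name.

```python
import numpy as np

def encode_missing_sketch(x, missing=-999.0):
    """Replace each column containing missing values by a 0/1 indicator."""
    mask = x == missing
    # Columns with at least one missing entry are the ones to encode.
    cols = mask.any(axis=0)
    encoded = x.astype(float)  # astype returns a copy, so x is left untouched
    encoded[:, cols] = mask[:, cols].astype(float)
    return encoded, np.flatnonzero(cols)

x_demo = np.array([[1.0, -999.0], [2.0, 5.0], [3.0, -999.0]])
encoded_demo, encoded_cols = encode_missing_sketch(x_demo)
```

Here the second column becomes `[1, 0, 1]`, while the first column, which has no missing values, is left unchanged.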

In [12]:
tX_train, ty_train, tX_test, ty_test, cont_features = preprocess(X_train, y_train, X_test, imputable_th=0, encodable_th=1)

In [13]:
tX_train.shape, tX_test.shape

((243430, 31), (568238, 31))

In [14]:
param_grid = {
    'lambda_': np.logspace(-4,0,5) ,
    'max_iters': 1000,
    'gamma': [0.01, 0.1]
}
metrics, params = ridge_regression_cv(ty_train, tX_train, param_grid=param_grid, transform=False)

In [15]:
metrics, params

({'loss': 0.33508274817893047,
  'accuracy': 75.14151912254036,
  'f1_score': 0.604671339546958},
 {'lambda_': 0.0001, 'max_iters': 1000, 'gamma': 0.01})

Accuracy and f1_score have gone up and the loss has gone down compared to the raw data, indicating that this preprocessing is useful, but the improvement is smaller than with imputation alone. This suggests that a mix of the two might be more useful.

### Mixed imputing and encoding approach
Finally, let's try a more reasonable approach to the imputing and encoding scheme. As we saw in the exploration notebook, some features have less than 15% of their values missing, some around 40% and some more than 70%. Let's impute the columns in the first group, encode the ones in the second group and completely drop the ones in the third group.
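
The way the two thresholds could sort columns into these three groups is sketched below. The exact comparison rules inside `preprocess` are not shown in this notebook, so the strict inequalities and the name `plan_columns_sketch` are assumptions chosen to match the behaviour described above.

```python
import numpy as np

def plan_columns_sketch(x, imputable_th, encodable_th, missing=-999.0):
    """Assign every column with missing values to impute, encode or drop."""
    ratio = (x == missing).mean(axis=0)  # fraction of missing rows per column
    has_missing = ratio > 0
    impute = has_missing & (ratio < imputable_th)
    encode = has_missing & ~impute & (ratio < encodable_th)
    drop = has_missing & ~impute & ~encode
    return {
        "impute": np.flatnonzero(impute).tolist(),
        "encode": np.flatnonzero(encode).tolist(),
        "drop": np.flatnonzero(drop).tolist(),
    }

# Four columns with 0%, 10%, 40% and 80% of their values missing.
x_demo = np.ones((10, 4))
x_demo[:1, 1] = -999.0
x_demo[:4, 2] = -999.0
x_demo[:8, 3] = -999.0
plan_mixed = plan_columns_sketch(x_demo, imputable_th=0.3, encodable_th=0.7)
plan_impute_all = plan_columns_sketch(x_demo, imputable_th=1, encodable_th=0)
plan_encode_all = plan_columns_sketch(x_demo, imputable_th=0, encodable_th=1)
```

Under these assumptions, the settings `(1, 0)` and `(0, 1)` reproduce the impute-everything and encode-everything runs from the previous two sections.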


In [16]:
tX_train, ty_train, tX_test, ty_test, cont_features = preprocess(X_train, y_train, X_test, imputable_th=0.3, encodable_th=0.7)

In [17]:
tX_train.shape, tX_test.shape

((242240, 24), (568238, 24))

In [18]:
param_grid = {
    'lambda_': np.logspace(-4,0,5) ,
    'max_iters': 1000,
    'gamma': [0.01, 0.1]
}
metrics, params = ridge_regression_cv(ty_train, tX_train, param_grid=param_grid, transform=False)

In [19]:
metrics, params

({'loss': 0.3360989798895204,
  'accuracy': 75.44129788639366,
  'f1_score': 0.61151010562385},
 {'lambda_': 0.0001, 'max_iters': 1000, 'gamma': 0.01})

It seems that we get the best performance when we impute all of the NaN values (`imputable_th=1`, `encodable_th=0`). Let's continue our feature engineering with these preprocessing thresholds fixed.

## Baseline model using heavily feature engineered data
In this step, we are going to apply more feature engineering. First, we will apply polynomial features of some degree that we will tune through grid search and cross validation.
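
For reference, a simple form of polynomial feature expansion raises every feature to the powers 1 through `degree` and adds a bias column. The sketch below shows that variant only. Whether the notebook's own transform also adds interaction terms, or handles the bias column differently, is not shown here, and `build_poly_sketch` is a hypothetical name.

```python
import numpy as np

def build_poly_sketch(x, degree):
    """Return [1, x, x^2, ..., x^degree] per feature, without cross terms."""
    bias = np.ones((x.shape[0], 1))
    powers = [x ** d for d in range(1, degree + 1)]
    return np.hstack([bias] + powers)

x_demo = np.array([[2.0, 3.0]])
poly_demo = build_poly_sketch(x_demo, degree=3)
```

With two input features and `degree=3`, each row grows from 2 values to 7: the bias, the two original features, their squares and their cubes.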

In [34]:
tX_train, ty_train, tX_test, ty_test, cont_features = preprocess(X_train, y_train, X_test, imputable_th=1, encodable_th=0)

In [35]:
tX_train.shape, tX_test.shape

((236483, 31), (568238, 31))

In [36]:
param_grid = {
    'lambda_': np.logspace(-4,0,5) ,
    'degree': list(range(1, 4)),
    'max_iters': 1000,
    'gamma': [0.01, 0.1]
}
metrics, params = ridge_regression_cv(ty_train, tX_train, param_grid=param_grid)

In [37]:
metrics, params

({'loss': 0.2940684456343238,
  'accuracy': 79.8680649526387,
  'f1_score': 0.6785882304073018},
 {'lambda_': 0.0001, 'degree': 3, 'max_iters': 1000, 'gamma': 0.01})

It looks like polynomial feature expansion is useful for ridge regression: `degree=3` was chosen, and accuracy has gone up from about 76.4% to about 79.9%.

Next, we are going to split our datasets based on the number of jets (`PRI_jet_num`) and create 3 subsets of the data for observations with 0, 1 and more than 1 jet respectively. Each subset will also only have the relevant columns (based on the original paper). All other missing values in the new subsets will be imputed with median values.
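
The row-splitting part of this step can be sketched as below. This is not the `split_by_jet_num` helper used in the next cell: the name `split_by_jet_sketch` and the column index passed as `jet_col` are hypothetical, and the per-subset column selection is left out.

```python
import numpy as np

def split_by_jet_sketch(x, y, jet_col):
    """Split rows into subsets with 0, 1 and 2 or more jets."""
    jets = x[:, jet_col]
    masks = {"zero": jets == 0, "one": jets == 1, "many": jets >= 2}
    # Boolean indexing keeps x and y aligned within each subset.
    return {name: (x[mask], y[mask]) for name, mask in masks.items()}

# Toy data: the second column plays the role of PRI_jet_num.
x_demo = np.array([[0.1, 0], [0.2, 1], [0.3, 2], [0.4, 3], [0.5, 0]])
y_demo = np.array([1, -1, 1, -1, 1])
subsets_demo = split_by_jet_sketch(x_demo, y_demo, jet_col=1)
```

Every row ends up in exactly one subset, so a separate model can be trained per subset and the test predictions can later be put back together using the row ids.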

In [24]:
X_train_zero, y_train_zero, X_train_one, y_train_one, X_train_many, y_train_many = split_by_jet_num(DATA_TRAIN_PATH, X_train, y_train)
X_test_zero, ids_test_zero, X_test_one, ids_test_one, X_test_many, ids_test_many = split_by_jet_num(DATA_TRAIN_PATH, X_test, ids_test)

In [25]:
X_train_zero.shape, X_train_one.shape, X_train_many.shape

((99913, 15), (77544, 22), (72543, 29))

In [26]:
def train(X_train, y_train, X_test):
    tX_train, ty_train, tX_test, ty_test, cont_features = preprocess(X_train, y_train, X_test, imputable_th=1, encodable_th=0)

    param_grid = {
        'lambda_': np.logspace(-4,0,5) ,
        'degree': list(range(1, 4)),
        'max_iters': 1000,
        'gamma': [0.01, 0.1]
    }
    metrics, params = ridge_regression_cv(ty_train, tX_train, param_grid=param_grid)
    return metrics, params

In [27]:
metrics_zero, params_zero = train(X_train_zero, y_train_zero, X_test_zero)

In [28]:
metrics_one, params_one = train(X_train_one, y_train_one, X_test_one)

In [29]:
metrics_many, params_many = train(X_train_many, y_train_many, X_test_many)

In [30]:
metrics_zero, params_zero

({'loss': 0.24364324945993907,
  'accuracy': 83.20020560267284,
  'f1_score': 0.6209447368524985},
 {'lambda_': 0.0001, 'degree': 3, 'max_iters': 1000, 'gamma': 0.01})

In [31]:
metrics_one, params_one

({'loss': 0.31804954150970566,
  'accuracy': 78.1462314225053,
  'f1_score': 0.6755361956263567},
 {'lambda_': 0.0001, 'degree': 3, 'max_iters': 1000, 'gamma': 0.01})

In [32]:
metrics_many, params_many

({'loss': 0.2968016621447154,
  'accuracy': 80.30970648271979,
  'f1_score': 0.7802551375347928},
 {'lambda_': 0.0001, 'degree': 3, 'max_iters': 1000, 'gamma': 0.01})

In [33]:
# Weight each subset's accuracy by its number of training rows
a = X_train_zero.shape[0]
b = X_train_one.shape[0]
c = X_train_many.shape[0]
avg_accuracy = ((metrics_zero['accuracy']*a) + (metrics_one['accuracy']*b) + (metrics_many['accuracy']*c))/(a+b+c)

print(f"Average accuracy with jet_num training is {avg_accuracy}")

Average accuracy with jet_num training is 80.79384219673017


Splitting by jet_num raises the weighted average accuracy to about 80.8%, up from about 79.9% for the single model with polynomial features.