# Least Squares using Gradient Descent
This notebook contains the code for running multiple models on the ML Higgs boson data using the Least Squares Gradient Descent algorithm.

In [1]:
# Useful starting lines
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
%load_ext autoreload
%autoreload 2
import sys
sys.path.append('..')
from helpers import *
from implementations import *

## Load train and test data

In [2]:
DATA_TRAIN_PATH = '../data/train.csv'
y_train, X_train, ids = load_csv_data(DATA_TRAIN_PATH)

In [3]:
DATA_TEST_PATH = '../data/test.csv'
_, X_test, ids_test = load_csv_data(DATA_TEST_PATH)

In [4]:
X_train.shape

(250000, 30)

## Baseline model using raw data
First, let's run the Least Squares GD algorithm on our raw data without any preprocessing. We will use K-fold cross validation to report the metrics on the held-out folds and grid search to tune our hyperparameters.

In [23]:
param_grid = {
    'max_iters': 1000,
    'gamma': [0.01, 0.1]
}
metrics, params = least_squares_GD_cv(y_train, X_train, param_grid=param_grid, transform=False)

  return (1 / len(y)) * (tx.T @ ((tx @ w) - y))
  return (1 / len(y)) * (tx.T @ ((tx @ w) - y))
  recall = tp / (tp + fn)
  return 2 * precision * recall / (precision + recall)


In [24]:
metrics, params

({'loss': nan, 'accuracy': 0.0, 'f1_score': nan},
 {'max_iters': 1000, 'gamma': 0.01})

The raw data does not work with the LS Gradient Descent algorithm: the loss and F1 score are NaN and the accuracy is 0. This is because we have some outliers and undefined values in our data. Let's preprocess our data and handle the missing values.
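To illustrate how badly scaled inputs can make gradient descent blow up, here is a minimal sketch on made-up toy data. The loss and gradient follow the expressions visible in the warnings above; the function names and the toy data are assumptions and not necessarily what `implementations.py` contains.

```python
import numpy as np

def compute_mse(y, tx, w):
    """Mean squared error with the 1/(2N) convention."""
    e = y - tx @ w
    return (1 / (2 * len(y))) * np.sum(e ** 2)

def compute_gradient(y, tx, w):
    """Gradient of the MSE loss with respect to w."""
    return (1 / len(y)) * (tx.T @ ((tx @ w) - y))

def gradient_descent(y, tx, initial_w, max_iters, gamma):
    """Plain gradient descent; returns the final weights and loss."""
    w = initial_w
    for _ in range(max_iters):
        w = w - gamma * compute_gradient(y, tx, w)
    return w, compute_mse(y, tx, w)

# Toy data (made up): the same feature on its natural scale and multiplied by 1000.
rng = np.random.default_rng(0)
x_small = rng.normal(size=200)
y_toy = 2.0 * x_small + 0.1 * rng.normal(size=200)
tx_scaled = np.c_[np.ones(200), x_small]
tx_raw = np.c_[np.ones(200), 1000.0 * x_small]

# Silence the overflow warnings that the badly scaled run produces.
with np.errstate(over="ignore", invalid="ignore"):
    _, loss_scaled = gradient_descent(y_toy, tx_scaled, np.zeros(2), 1000, 0.1)
    _, loss_raw = gradient_descent(y_toy, tx_raw, np.zeros(2), 1000, 0.1)

print(loss_scaled, loss_raw)
```

With the same `gamma`, the run on the well-scaled feature converges while the run on the large-scale feature ends with a non-finite loss. This matches the kind of failure seen on the raw data, where values such as -999 sit far outside the range of the other entries.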

## Baseline model using lightly feature engineered data
Now let's preprocess our data a bit to handle the missing values (-999) in various ways.

### All features with NaN values imputed
First, let's impute all missing values with the median of their respective columns. We set `imputable_th` to `1`, which means imputing every column whose ratio of NaN values is less than 1, in other words all columns.

In [80]:
tX_train, ty_train, tX_test, ty_test, cont_features = preprocess(X_train, y_train, X_test, imputable_th=1, encodable_th=0)

In [81]:
tX_train.shape, tX_test.shape

((236483, 31), (568238, 31))

All the columns are now imputed, and there is one more column for the bias.
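As a rough illustration of median imputation followed by adding a bias column, here is a minimal sketch. `impute_median` and `add_bias` are hypothetical helpers, not the functions behind `preprocess`, and the sketch does not attempt any row filtering that `preprocess` may perform.

```python
import numpy as np

MISSING = -999.0  # missing-value marker used in this dataset

def impute_median(X_fit, X_apply):
    """Replace MISSING entries in X_apply with column medians computed on X_fit."""
    X_fit = np.where(X_fit == MISSING, np.nan, X_fit)
    medians = np.nanmedian(X_fit, axis=0)  # medians ignore the missing entries
    X_out = X_apply.astype(float).copy()
    rows, cols = np.where(X_out == MISSING)
    X_out[rows, cols] = medians[cols]
    return X_out

def add_bias(X):
    """Prepend a column of ones for the intercept term."""
    return np.c_[np.ones(len(X)), X]

# Toy matrix (made up) with one missing entry per column.
X_toy = np.array([[1.0, -999.0], [3.0, 4.0], [-999.0, 8.0]])
X_imputed = add_bias(impute_median(X_toy, X_toy))
print(X_imputed)
```

Computing the medians on one matrix and applying them to another lets the training medians be reused for the test set, which is one common design choice; whether `preprocess` does this is an assumption.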

In [82]:
param_grid = {
    'max_iters': 1000,
    'gamma': [0.01, 0.1]
}
metrics, params = least_squares_GD_cv(ty_train, tX_train, param_grid=param_grid, transform=False)

In [83]:
metrics, params

({'loss': 0.32577149508425773,
  'accuracy': 76.43690798376186,
  'f1_score': 0.6127606134039422},
 {'max_iters': 1000, 'gamma': 0.1})
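For readers who want to see the idea behind the tuning step, here is a minimal sketch of grid search combined with K-fold cross validation. It is not the code of `least_squares_GD_cv`: the helper names, the number of folds and the scoring function are all assumptions made for illustration.

```python
import itertools
import numpy as np

def k_fold_indices(n_samples, k, seed=1):
    """Shuffle the row indices and split them into k roughly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def grid_search_cv(y, tx, fit, score, param_grid, k=5):
    """Return (best mean validation score, best params) over the grid.

    `fit(y, tx, **params)` returns a model and `score(y, tx, model)` returns a
    number where higher is better. Both are hypothetical callables.
    """
    # Wrap scalar entries so that every parameter has a list of candidates.
    grid = {name: v if isinstance(v, list) else [v] for name, v in param_grid.items()}
    folds = k_fold_indices(len(y), k)
    best_score, best_params = -np.inf, None
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        scores = []
        for i in range(k):
            val_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            model = fit(y[train_idx], tx[train_idx], **params)
            scores.append(score(y[val_idx], tx[val_idx], model))
        if np.mean(scores) > best_score:
            best_score, best_params = np.mean(scores), params
    return best_score, best_params

def fit_gd(y, tx, max_iters, gamma):
    """Least squares fitted with plain gradient descent."""
    w = np.zeros(tx.shape[1])
    for _ in range(max_iters):
        w = w - gamma * (tx.T @ (tx @ w - y)) / len(y)
    return w

def neg_mse(y, tx, w):
    """Negative mean squared error, so that higher is better."""
    return -np.mean((y - tx @ w) ** 2)

# Toy regression data (made up), only to exercise the search.
rng = np.random.default_rng(0)
tx_toy = np.c_[np.ones(100), rng.normal(size=(100, 2))]
y_toy = tx_toy @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
best_score, best_params = grid_search_cv(
    y_toy, tx_toy, fit_gd, neg_mse, {"max_iters": 200, "gamma": [0.001, 0.1]}
)
print(best_params)
```

On this toy problem the larger step size wins because the smaller one has not converged after 200 iterations, which is the same pattern as `gamma = 0.1` being selected in the results above.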

### All features with NaN values encoded
Now let's instead encode the features that have NaN values as indicator features: the new feature takes the value 1 if the original value is missing and 0 otherwise.

In [84]:
tX_train, ty_train, tX_test, ty_test, cont_features = preprocess(X_train, y_train, X_test, imputable_th=0, encodable_th=1)

In [85]:
tX_train.shape, tX_test.shape

((243430, 31), (568238, 31))
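The encoding step can be sketched as follows. This assumes that the indicator replaces the original column, which is consistent with the column count staying at 31; `encode_missing` is a hypothetical helper and not the code inside `preprocess`.

```python
import numpy as np

MISSING = -999.0  # missing-value marker used in this dataset

def encode_missing(X):
    """Replace each column containing MISSING with a 0/1 missing indicator."""
    X = X.astype(float).copy()
    has_missing = (X == MISSING).any(axis=0)
    # Columns without missing values are left untouched.
    X[:, has_missing] = (X[:, has_missing] == MISSING).astype(float)
    return X

# Toy matrix (made up): only the second column has a missing value.
X_toy = np.array([[1.0, -999.0], [2.0, 5.0]])
X_encoded = encode_missing(X_toy)
print(X_encoded)
```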

In [86]:
param_grid = {
    'max_iters': 1000,
    'gamma': [0.01, 0.1]
}
metrics, params = least_squares_GD_cv(ty_train, tX_train, param_grid=param_grid, transform=False)

In [87]:
metrics, params

({'loss': 0.33525773755082533,
  'accuracy': 75.19369017787454,
  'f1_score': 0.6050729505315338},
 {'max_iters': 1000, 'gamma': 0.1})

### Mixed imputing and encoding approach
Finally, let's try a more reasonable approach that mixes imputing and encoding. As we saw in the exploration notebook, some features have less than 15% of their values missing, some around 40%, and some more than 70%. Let's impute the columns in the first group, encode the ones in the second group, and drop the ones in the third group completely.


In [88]:
tX_train, ty_train, tX_test, ty_test, cont_features = preprocess(X_train, y_train, X_test, imputable_th=0.3, encodable_th=0.7)

In [89]:
tX_train.shape, tX_test.shape

((242240, 24), (568238, 24))
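One way to read the two thresholds is sketched below: columns are grouped by their ratio of missing values, with imputation taking precedence, then encoding, then dropping. The exact boundary handling inside `preprocess` is not shown in this notebook, so the comparisons used here are an assumption, and `group_columns` is a hypothetical helper.

```python
import numpy as np

MISSING = -999.0  # missing-value marker used in this dataset

def group_columns(X, imputable_th, encodable_th):
    """Split incomplete columns into impute / encode / drop index arrays."""
    ratio = (X == MISSING).mean(axis=0)  # fraction of missing values per column
    incomplete = ratio > 0
    impute = incomplete & (ratio < imputable_th)
    encode = incomplete & ~impute & (ratio < encodable_th)
    drop = incomplete & ~impute & ~encode
    return np.flatnonzero(impute), np.flatnonzero(encode), np.flatnonzero(drop)

# Toy matrix (made up): columns with 0%, 10%, 40% and 80% missing values.
X_toy = np.ones((10, 4))
X_toy[:1, 1] = MISSING
X_toy[:4, 2] = MISSING
X_toy[:8, 3] = MISSING
impute, encode, drop = group_columns(X_toy, imputable_th=0.3, encodable_th=0.7)
print(impute, encode, drop)
```

Under this reading, `imputable_th=1, encodable_th=0` imputes every incomplete column and `imputable_th=0, encodable_th=1` encodes every incomplete column, which agrees with how the two earlier experiments are described.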

In [90]:
param_grid = {
    'max_iters': 1000,
    'gamma': [0.01, 0.1]
}
metrics, params = least_squares_GD_cv(ty_train, tX_train, param_grid=param_grid, transform=False)

In [91]:
metrics, params

({'loss': 0.3361204296770073,
  'accuracy': 75.41074966974901,
  'f1_score': 0.6110440853653278},
 {'max_iters': 1000, 'gamma': 0.1})

It seems we get the best performance when we impute all of the NaN values (76.4% accuracy, versus 75.2% with encoding and 75.4% with the mixed approach). Let's continue our feature engineering with these preprocessing thresholds fixed.

## Baseline model using heavily feature engineered data
In this step, we are going to apply more feature engineering. First, we will add polynomial features up to a degree that we tune through grid search and cross validation.

In [5]:
tX_train, ty_train, tX_test, ty_test, cont_features = preprocess(X_train, y_train, X_test, imputable_th=1, encodable_th=0)

In [6]:
tX_train.shape, tX_test.shape

((236483, 31), (568238, 31))

In [7]:
param_grid = {
    'max_iters': 500,
    'degree': list(range(1, 4)),
    'gamma': [0.01, 0.1],
    'cont_features': [cont_features]
}
metrics, params = least_squares_GD_cv(ty_train, tX_train, param_grid=param_grid)

  return (1/ (2 * N)) * np.sum(e ** 2)
  return (1 / len(y)) * (tx.T @ ((tx @ w) - y))
  precision = tp / (tp + fp)
  recall = tp / (tp + fn)
  return (1 / len(y)) * (tx.T @ ((tx @ w) - y))


In [8]:
metrics, params

({'loss': 0.3261084584963338,
  'accuracy': 76.38700947225979,
  'f1_score': 0.610534309696581},
 {'max_iters': 500,
  'degree': 1,
  'gamma': 0.1,
  'cont_features': (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30)})
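For reference, a polynomial expansion restricted to the continuous columns could look like the sketch below. `build_poly` is a hypothetical helper; how `least_squares_GD_cv` actually uses `degree` and `cont_features` is an assumption.

```python
import numpy as np

def build_poly(X, cont_features, degree):
    """Append powers 2..degree of the selected continuous columns to X."""
    cols = list(cont_features)
    powers = [X[:, cols] ** d for d in range(2, degree + 1)]
    return np.hstack([X] + powers)

# Toy matrix (made up): column 0 is the bias, columns 1 and 2 are continuous.
X_toy = np.array([[1.0, 2.0, 3.0], [1.0, 4.0, 5.0]])
X_poly = build_poly(X_toy, cont_features=(1, 2), degree=3)
print(X_poly.shape)
```

With `degree=1` the matrix is returned unchanged, so a selected degree of 1, as in the result above, corresponds to adding no polynomial terms under this sketch.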

Next, we are going to split our datasets based on the number of jets (`PRI_jet_num`) and create 3 subsets of the data for observations with 0, 1, and more than 1 jet respectively. Each subset will also keep only the relevant columns (based on the original paper). All other missing values in the new subsets will be imputed with median values.

In [96]:
X_train_zero, y_train_zero, X_train_one, y_train_one, X_train_many, y_train_many = split_by_jet_num(DATA_TRAIN_PATH, X_train, y_train)
X_test_zero, ids_test_zero, X_test_one, ids_test_one, X_test_many, ids_test_many = split_by_jet_num(DATA_TRAIN_PATH, X_test, ids_test)

In [97]:
X_train_zero.shape, X_train_one.shape, X_train_many.shape

((99913, 15), (77544, 22), (72543, 29))
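The splitting idea can be sketched with boolean masks on the jet-count column. This is not the code of `split_by_jet_num`: `split_by_jets`, `drop_all_missing` and the `jet_col` index are hypothetical, and dropping columns that are entirely missing in a subset is only a stand-in for the paper-based column selection, so it will not reproduce the exact shapes above.

```python
import numpy as np

MISSING = -999.0  # missing-value marker used in this dataset

def split_by_jets(X, y, jet_col):
    """Return {name: (X_subset, y_subset)} for 0, 1 and more than 1 jet."""
    jets = X[:, jet_col]
    masks = {"zero": jets == 0, "one": jets == 1, "many": jets > 1}
    return {name: (X[mask], y[mask]) for name, mask in masks.items()}

def drop_all_missing(X):
    """Drop columns whose values are all MISSING within this subset."""
    keep = ~(X == MISSING).all(axis=0)
    return X[:, keep]

# Toy data (made up): column 0 plays the role of the jet count.
X_toy = np.array([
    [0.0, -999.0, 1.0],
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 5.0],
    [3.0, 6.0, 7.0],
])
y_toy = np.array([1.0, -1.0, 1.0, -1.0])
subsets = split_by_jets(X_toy, y_toy, jet_col=0)
X_zero = drop_all_missing(subsets["zero"][0])
print({name: X_sub.shape for name, (X_sub, _) in subsets.items()}, X_zero.shape)
```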

In [107]:
def train(X_train, y_train, X_test):
    """Preprocess one jet subset, then tune degree and gamma with cross validation."""
    tX_train, ty_train, tX_test, ty_test, cont_features = preprocess(X_train, y_train, X_test, imputable_th=1, encodable_th=0)
    param_grid = {
        'max_iters': 100,
        'degree': list(range(1, 4)),
        'gamma': [0.01, 0.1],
        'cont_features': [cont_features]
    }
    metrics, params = least_squares_GD_cv(ty_train, tX_train, param_grid=param_grid)
    return metrics, params

In [108]:
metrics_zero, params_zero = train(X_train_zero, y_train_zero, X_test_zero)

In [109]:
metrics_one, params_one = train(X_train_one, y_train_one, X_test_one)

In [110]:
metrics_many, params_many = train(X_train_many, y_train_many, X_test_many)

In [111]:
metrics_zero, params_zero

({'loss': 0.2557358491565016,
  'accuracy': 82.53302492932407,
  'f1_score': 0.5970975358666013},
 {'max_iters': 100, 'degree': 2, 'gamma': 0.1})

In [112]:
metrics_one, params_one

({'loss': 0.36305163913716787,
  'accuracy': 72.48274946921444,
  'f1_score': 0.5664581549105681},
 {'max_iters': 100, 'degree': 1, 'gamma': 0.1})

In [113]:
metrics_many, params_many

({'loss': 0.35838994126340273,
  'accuracy': 73.46237770113324,
  'f1_score': 0.7029992679585275},
 {'max_iters': 100, 'degree': 1, 'gamma': 0.1})

In [None]:
# Weight each subset's accuracy by its number of training observations
a = X_train_zero.shape[0]
b = X_train_one.shape[0]
c = X_train_many.shape[0]
avg_accuracy = (metrics_zero['accuracy'] * a + metrics_one['accuracy'] * b + metrics_many['accuracy'] * c) / (a + b + c)

print(f"Average accuracy with jet_num training is {avg_accuracy}")