# Least Squares using Gradient Descent
This notebook contains the code for running multiple models on the ML Higgs boson data using the Least Squares Gradient Descent algorithm.

In [42]:
# Useful starting lines
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
%load_ext autoreload
%autoreload 2
import sys
sys.path.append('..')
from helpers import *
from implementations import *

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Load train and test data

In [17]:
DATA_TRAIN_PATH = '../data/train.csv'
y_train, X_train, ids = load_csv_data(DATA_TRAIN_PATH)

In [18]:
DATA_TEST_PATH = '../data/test.csv'
_, X_test, ids_test = load_csv_data(DATA_TEST_PATH)

In [19]:
X_train.shape

(250000, 30)

## Baseline model using raw data
First, let's run the Least Squares GD algorithm on the raw data without any preprocessing. We will use K-fold cross validation to report metrics on the held-out folds and grid search to tune the hyperparameters.
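
As a rough illustration of what K-fold cross validation combined with grid search involves, here is a minimal sketch. The names `expand_grid`, `k_fold_indices` and `grid_search_cv` are hypothetical and are not the helpers in `implementations`; the `fit` and `score` callables are assumptions supplied by the caller, and a higher score is assumed to be better.

```python
import itertools
import numpy as np

def expand_grid(param_grid):
    """Yield one dict per hyperparameter combination; scalars act as one-value lists."""
    keys = list(param_grid)
    value_lists = [v if isinstance(v, (list, tuple)) else [v] for v in param_grid.values()]
    for combo in itertools.product(*value_lists):
        yield dict(zip(keys, combo))

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def grid_search_cv(y, tx, param_grid, fit, score, k=5):
    """Return (best mean validation score, best params) over all combinations."""
    folds = k_fold_indices(len(y), k)
    best_score, best_params = -np.inf, None
    for params in expand_grid(param_grid):
        scores = []
        for i in range(k):
            val_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            model = fit(y[train_idx], tx[train_idx], **params)
            scores.append(score(y[val_idx], tx[val_idx], model))
        mean_score = float(np.mean(scores))
        # A NaN mean never compares greater, so diverged runs are not selected.
        if mean_score > best_score:
            best_score, best_params = mean_score, params
    return best_score, best_params
```

With the grid used in this notebook, `expand_grid` yields two combinations, one per value of `gamma`, each with `max_iters` fixed at 1000.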

In [23]:
param_grid = {
    'max_iters': 1000,
    'gamma': [0.01, 0.1]
}
metrics, params = least_squares_GD_cv(y_train, X_train, param_grid=param_grid, transform=False)

  return (1 / len(y)) * (tx.T @ ((tx @ w) - y))
  return (1 / len(y)) * (tx.T @ ((tx @ w) - y))
  recall = tp / (tp + fn)
  return 2 * precision * recall / (precision + recall)


In [24]:
metrics, params

({'loss': nan, 'accuracy': 0.0, 'f1_score': nan},
 {'max_iters': 1000, 'gamma': 0.01})
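
The `nan` F1 score above is consistent with a division by zero in the recall or F1 formula, as the warning lines suggest. A minimal sketch of a guarded F1 computation follows; the function name, the `positive=1` label convention and the choice to return 0.0 when there are no true positives are assumptions, not the behaviour of the helper used in this notebook.

```python
import numpy as np

def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class; returns 0.0 when there are no true positives."""
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    if tp == 0:
        # Precision and recall are both 0 (or undefined), so avoid dividing by zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return float(2 * precision * recall / (precision + recall))
```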

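The raw data didn't work with the LS Gradient Descent algorithm: the loss is NaN and the accuracy is 0, which suggests that the updates diverged on the unscaled features. Let's preprocess our data and handle the missing values.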
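
To illustrate why unscaled features with `-999` placeholders can make gradient descent blow up, here is a small sketch on synthetic data. The gradient expression matches the one shown in the warning lines above; the loss definition, the `gradient_descent` helper and the random data are assumptions made for illustration only.

```python
import numpy as np

def compute_gradient(y, tx, w):
    """Gradient of the mean squared error loss."""
    return (1 / len(y)) * (tx.T @ ((tx @ w) - y))

def compute_loss(y, tx, w):
    """Mean squared error with the conventional 1/2 factor."""
    e = y - tx @ w
    return float(e @ e) / (2 * len(y))

def gradient_descent(y, tx, gamma, max_iters):
    """Plain gradient descent from w = 0; overflow warnings are silenced."""
    w = np.zeros(tx.shape[1])
    with np.errstate(all="ignore"):
        for _ in range(max_iters):
            w = w - gamma * compute_gradient(y, tx, w)
        return w, compute_loss(y, tx, w)

def add_bias(X):
    """Prepend a column of ones."""
    return np.c_[np.ones(len(X)), X]

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=20.0, size=(200, 3))
X[rng.random(X.shape) < 0.2] = -999.0          # synthetic missing-value placeholders
y = rng.choice([-1.0, 1.0], size=200)

# Raw features: the step size is far too large for values of this magnitude.
_, loss_raw = gradient_descent(y, add_bias(X), gamma=0.01, max_iters=1000)

# Standardized features: the same algorithm stays finite.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
_, loss_std = gradient_descent(y, add_bias(X_std), gamma=0.1, max_iters=1000)

print(np.isfinite(loss_raw), np.isfinite(loss_std))
```

On the raw synthetic data the weights overflow and the loss is no longer finite, while on the standardized data the loss stays finite.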

## Baseline model using lightly feature engineered data
Now let's preprocess our data a bit to handle the missing values (-999) in various ways.

### All features with NaN values imputed
First, let's impute all missing values with the median of their respective columns. We set `imputable_th` to `1`, which means that every column with a NaN ratio below 1 is imputed, or in other words all columns.
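
A minimal sketch of threshold-based median imputation is shown below. The function `impute_median` is hypothetical and only mirrors the description above; the internals of `preprocess` may differ, for example in how `-999` is converted to NaN.

```python
import numpy as np

MISSING = -999.0  # placeholder used for missing values in the dataset

def impute_median(X, imputable_th=1.0, missing=MISSING):
    """Replace missing entries by the column median of the observed values,
    for every column whose missing ratio is below imputable_th."""
    X = X.astype(float).copy()
    mask = X == missing
    ratios = mask.mean(axis=0)
    for j in np.flatnonzero((ratios > 0) & (ratios < imputable_th)):
        X[mask[:, j], j] = np.median(X[~mask[:, j], j])
    return X

X = np.array([[1.0, -999.0],
              [3.0, 4.0],
              [-999.0, 6.0]])
print(impute_median(X))
```

In this toy example the missing entry in the first column becomes 2.0 (the median of 1 and 3) and the one in the second column becomes 5.0 (the median of 4 and 6).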

In [25]:
tX_train, ty_train, tX_test, ty_test, cont_features = preprocess(X_train, y_train, X_test, imputable_th=1, encodable_th=0)

In [26]:
tX_train.shape, tX_test.shape

((250000, 31), (568238, 31))

All columns are now imputed, and there is one extra column for the bias.
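
The extra column is simply a column of ones. A minimal sketch, with `add_bias_column` as a hypothetical helper name:

```python
import numpy as np

def add_bias_column(X):
    """Prepend a column of ones so the model can learn an intercept."""
    return np.c_[np.ones(X.shape[0]), X]

X = np.zeros((5, 30))
print(add_bias_column(X).shape)  # (5, 31)
```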

In [28]:
param_grid = {
    'max_iters': 1000,
    'gamma': [0.01, 0.1]
}
metrics, params = least_squares_GD_cv(ty_train, tX_train, param_grid=param_grid, transform=False)

In [29]:
metrics, params

({'loss': 0.33068260944513705,
  'accuracy': 76.1336,
  'f1_score': 0.6118115383641042},
 {'max_iters': 1000, 'gamma': 0.1})

### All features with NaN values encoded
Now let's instead encode the features that have NaN values as indicator features: the new feature takes the value 1 if the original value is missing and 0 otherwise.
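
One way to read this encoding is sketched below. The helper `encode_missing` is hypothetical, and it assumes that each affected column is replaced by its indicator, which is consistent with the column count staying the same; the actual `preprocess` may do this differently.

```python
import numpy as np

def encode_missing(X, missing=-999.0):
    """Replace every column that contains missing values by a 0/1 indicator
    that is 1 where the value was missing."""
    X = X.astype(float).copy()
    mask = X == missing
    has_missing = mask.any(axis=0)
    X[:, has_missing] = mask[:, has_missing].astype(float)
    return X

X = np.array([[1.0, -999.0],
              [2.0, 5.0],
              [3.0, -999.0]])
print(encode_missing(X))
```

Here the first column has no missing values and is left unchanged, while the second becomes the indicator column 1, 0, 1.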

In [30]:
tX_train, ty_train, tX_test, ty_test, cont_features = preprocess(X_train, y_train, X_test, imputable_th=0, encodable_th=1)

In [31]:
tX_train.shape, tX_test.shape

((250000, 31), (568238, 31))

In [32]:
param_grid = {
    'max_iters': 1000,
    'gamma': [0.01, 0.1]
}
metrics, params = least_squares_GD_cv(ty_train, tX_train, param_grid=param_grid, transform=False)

In [33]:
metrics, params

({'loss': 0.3417458297187233,
  'accuracy': 74.7828,
  'f1_score': 0.5931444471916829},
 {'max_iters': 1000, 'gamma': 0.1})

### Mixed imputing and encoding approach
Finally, let's try a more reasonable mix of imputing and encoding. As we saw in the exploration notebook, some features have fewer than 15% of their values missing, some around 40%, and some more than 70%. Let's impute the columns in the first group, encode the ones in the second group, and drop the ones in the third group completely.
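
The grouping by missing ratio can be sketched as follows. The function `plan_columns` and its strict "less than" comparisons are assumptions based on the description of the thresholds in this notebook, not the code of `preprocess`.

```python
import numpy as np

def plan_columns(X, imputable_th=0.3, encodable_th=0.7, missing=-999.0):
    """Assign each column index to keep / impute / encode / drop by missing ratio."""
    ratios = (X == missing).mean(axis=0)
    plan = {"keep": [], "impute": [], "encode": [], "drop": []}
    for j, ratio in enumerate(ratios):
        if ratio == 0:
            plan["keep"].append(j)      # nothing missing
        elif ratio < imputable_th:
            plan["impute"].append(j)    # few missing values: fill with the median
        elif ratio < encodable_th:
            plan["encode"].append(j)    # many missing values: use an indicator
        else:
            plan["drop"].append(j)      # mostly missing: discard the column
    return plan

X = np.ones((10, 4))
X[:1, 1] = -999.0   # 10% missing
X[:4, 2] = -999.0   # 40% missing
X[:8, 3] = -999.0   # 80% missing
print(plan_columns(X))
# {'keep': [0], 'impute': [1], 'encode': [2], 'drop': [3]}
```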


In [34]:
tX_train, ty_train, tX_test, ty_test, cont_features = preprocess(X_train, y_train, X_test, imputable_th=0.3, encodable_th=0.7)

In [35]:
tX_train.shape, tX_test.shape

((250000, 24), (568238, 24))

In [36]:
param_grid = {
    'max_iters': 1000,
    'gamma': [0.01, 0.1]
}
metrics, params = least_squares_GD_cv(ty_train, tX_train, param_grid=param_grid, transform=False)

In [37]:
metrics, params

({'loss': 0.3431351336922191,
  'accuracy': 74.9112,
  'f1_score': 0.5947335599969834},
 {'max_iters': 1000, 'gamma': 0.1})

It seems we get the best performance when we impute all of the NaN values (76.13% accuracy, versus 74.78% with encoding and 74.91% with the mixed approach). Let's continue our feature engineering with these preprocessing thresholds fixed.

## Baseline model using heavily feature engineered data
In this step, we are going to apply more feature engineering. First, we will apply polynomial features of some degree that we will tune through grid search and cross validation.
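
A minimal sketch of a polynomial feature expansion is given below. The name `build_poly` is hypothetical, and the sketch only raises each feature to powers 1 through `degree` without cross terms, which is an assumption about what the `degree` parameter does here. It expects the features without a bias column and adds one itself.

```python
import numpy as np

def build_poly(X, degree):
    """Stack X, X**2, ..., X**degree column-wise and prepend a bias column."""
    powers = [X ** d for d in range(1, degree + 1)]
    return np.c_[np.ones(X.shape[0]), np.hstack(powers)]

X = np.array([[2.0, 3.0]])
print(build_poly(X, degree=2))  # [[1. 2. 3. 4. 9.]]
```

With 30 input features and `degree=2`, this layout would give 61 columns, so the cost of each gradient step grows with the degree.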

In [43]:
tX_train, ty_train, tX_test, ty_test, cont_features = preprocess(X_train, y_train, X_test, imputable_th=1, encodable_th=0)

In [44]:
tX_train.shape, tX_test.shape

((250000, 31), (568238, 31))

In [45]:
tX_train

array([[ 1.00000000e+00,  6.83319669e-02,  4.07680272e-01, ...,
        -2.62877883e-01,  1.14262161e+00, -2.52683989e+00],
       [ 1.00000000e+00,  5.52504823e-01,  5.40136414e-01, ...,
        -1.59461192e-01,  4.89238340e-04, -1.23840760e-04],
       [ 1.00000000e+00,  3.19515553e+00,  1.09655998e+00, ...,
        -1.59461192e-01,  4.89238340e-04, -1.23840760e-04],
       ...,
       [ 1.00000000e+00,  3.19316447e-01, -1.30863670e-01, ...,
        -1.59461192e-01,  4.89238340e-04, -1.23840760e-04],
       [ 1.00000000e+00, -8.45323970e-01, -3.02973380e-01, ...,
        -1.59461192e-01,  4.89238340e-04, -1.23840760e-04],
       [ 1.00000000e+00,  6.65336083e-01, -2.53522760e-01, ...,
        -1.59461192e-01,  4.89238340e-04, -1.23840760e-04]])

In [49]:
param_grid = {
    'max_iters': 1000,
    'degree': 2,
    'gamma': [0.01, 0.1]
}
metrics, params = least_squares_GD_cv(ty_train, tX_train, param_grid=param_grid)

  return (1 / len(y)) * (tx.T @ ((tx @ w) - y))
  w = w - gamma * gradient
KeyboardInterrupt

More specifically, we are going to split our datasets based on the number of jets (`PRI_jet_num`) and create 3 subsets of the data for observations with 0, 1 and more than 1 jet respectively. Each subset will contain only the relevant columns (based on the original paper). All other missing values in the new subsets will be imputed with median values.
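
The split can be sketched roughly as below. The function `split_by_jets` and its `jet_col` argument are hypothetical and do not reproduce `split_by_jet_num`; in particular, this sketch decides which columns are relevant by dropping the jet column and any column that is entirely missing within a subset, which is a simplification of the column selection described above.

```python
import numpy as np

def split_by_jets(X, y, jet_col, missing=-999.0):
    """Split rows into zero / one / many jet subsets and drop unusable columns."""
    jets = X[:, jet_col]
    masks = {"zero": jets == 0, "one": jets == 1, "many": jets > 1}
    subsets = {}
    for name, row_mask in masks.items():
        X_sub = X[row_mask]
        # Keep columns that have at least one observed value in this subset.
        keep = ~(X_sub == missing).all(axis=0)
        keep[jet_col] = False  # the jet count is constant or redundant after the split
        subsets[name] = (X_sub[:, keep], y[row_mask])
    return subsets

X = np.array([[0.0, 1.0, -999.0],
              [0.0, 2.0, -999.0],
              [1.0, 3.0, 5.0],
              [2.0, 4.0, 6.0],
              [3.0, -999.0, 7.0]])
y = np.array([1, -1, 1, -1, 1])
for name, (X_sub, y_sub) in split_by_jets(X, y, jet_col=0).items():
    print(name, X_sub.shape)
```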

In [55]:
X_train_zero, y_train_zero, X_train_one, y_train_one, X_train_many, y_train_many = split_by_jet_num(DATA_TRAIN_PATH, X_train, y_train)
X_test_zero, ids_test_zero, X_test_one, ids_test_one, X_test_many, ids_test_many = split_by_jet_num(DATA_TRAIN_PATH, X_test, ids_test)

In [56]:
X_train_zero.shape, X_train_one.shape, X_train_many.shape

((99913, 15), (77544, 22), (72543, 29))

In [58]:
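def train(X_train, y_train, X_test):
    # Impute every missing value with its column median (imputable_th=1).
    tX_train, ty_train, tX_test, _, cont_features = preprocess(X_train, y_train, X_test, imputable_th=1, encodable_th=0)
    # No hyperparameters are tuned here, so the grid is empty.
    metrics, params = least_squares_cv(ty_train, tX_train, param_grid={})
    return metrics, params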

In [59]:
metrics_zero, params_zero = train(X_train_zero, y_train_zero, X_test_zero)

In [60]:
metrics_one, params_one = train(X_train_one, y_train_one, X_test_one)

In [61]:
metrics_many, params_many = train(X_train_many, y_train_many, X_test_many)

In [62]:
metrics_zero, params_zero

({'loss': 0.26577795854224545,
  'accuracy': 81.73255930337304,
  'f1_score': 0.5723234495722914},
 {})

In [63]:
metrics_one, params_one

({'loss': 0.36171118013474135,
  'accuracy': 72.63992777921072,
  'f1_score': 0.5725637044826553},
 {})

In [64]:
metrics_many, params_many

({'loss': 0.35360466832131593,
  'accuracy': 74.21974083264406,
  'f1_score': 0.7081008607677863},
 {})

So the model does best on the observations with zero jets (about 81.7% accuracy), and across the three subsets we now reach roughly 76% accuracy on average.
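
The average can be taken in two ways, and the short calculation below uses the subset sizes and accuracies reported above to compare them. Weighting by subset size is an assumption about how an overall accuracy would be combined; it gives about 76.7%, while the plain mean of the three values gives about 76.2%.

```python
# Subset sizes and cross-validated accuracies copied from the outputs above.
sizes = {"zero": 99913, "one": 77544, "many": 72543}
accuracies = {
    "zero": 81.73255930337304,
    "one": 72.63992777921072,
    "many": 74.21974083264406,
}

total = sum(sizes.values())
# Weight each subset's accuracy by its share of the observations.
weighted = sum(sizes[k] * accuracies[k] for k in sizes) / total
# Plain mean that treats the three subsets equally.
simple = sum(accuracies.values()) / len(accuracies)

print(round(weighted, 2), round(simple, 2))  # 76.73 76.2
```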