# Case Study 6

<font size=4>Brady Arendale, Pankaj Kumar, Kay Ayala  
7/29/20</font>

# Introduction

We were asked to replicate a neural network architecture from the paper ["Searching for Exotic Particles in High-Energy Physics with Deep Learning"](https://arxiv.org/pdf/1402.4735.pdf). We will create the architecture in TensorFlow as originally formulated, and then suggest improvements that could be made based on developments in deep learning since the paper was published in 2014.

# Question 1

In [1]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.optimizers.schedules import ExponentialDecay
from tensorflow import random_normal_initializer
from tensorflow.keras.callbacks import EarlyStopping
import numpy as np
import pandas as pd
import tensorflow

In [8]:
model = Sequential()
learning_rate = 0.05
opt = SGD(learning_rate=learning_rate, momentum=0.9)
init_first = random_normal_initializer(stddev=0.1)
init_hidden = random_normal_initializer(stddev=0.05)
init_output = random_normal_initializer(stddev=0.001)
es = EarlyStopping(monitor='val_loss', min_delta=0.00001, patience=10)

model.add(Dense(300, activation='tanh', kernel_initializer=init_first))
model.add(Dense(300, activation='tanh', kernel_initializer=init_hidden))
model.add(Dense(300, activation='tanh', kernel_initializer=init_hidden))
model.add(Dense(300, activation='tanh', kernel_initializer=init_hidden))
model.add(Dense(300, activation='tanh', kernel_initializer=init_hidden))
model.add(Dense(1, activation='sigmoid', kernel_initializer=init_output))

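# Pass the AUC metric object explicitly; the bare string 'auc' is not resolved by every TensorFlow version.
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=[tensorflow.keras.metrics.AUC(name='auc')])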

The authors of the paper used a neural network with 5 hidden layers of 300 units each. The final layer outputs a classification. Each hidden layer used a tanh activation function. The activation function for the last layer was not specified, so we used the standard sigmoid activation. The first hidden layer was initialized with normally distributed weights with a mean of 0 and a standard deviation of 0.1. The other hidden layers used a mean of 0 and a standard deviation of 0.05, and the output layer used a mean of 0 and a standard deviation of 0.001.
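
As a rough check on the size of this architecture, the trainable parameters can be counted by hand. The sketch below assumes 28 input features (the complete low-level plus high-level feature set of the HIGGS benchmark), which is not stated above. `count_dense_params` is a hypothetical helper, not part of Keras.

```python
def count_dense_params(n_inputs, layer_sizes):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    fan_in = n_inputs
    for units in layer_sizes:
        total += fan_in * units + units  # weight matrix plus bias vector
        fan_in = units
    return total

# Assumption: 28 input features; five hidden layers of 300 units and one output unit.
n_params = count_dense_params(28, [300, 300, 300, 300, 300, 1])
print(n_params)  # 370201
```

Almost all of these parameters sit in the four 300-by-300 hidden-to-hidden weight matrices.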

The authors used stochastic gradient descent to optimize the neural network. They used an initial learning rate of 0.05 with a decay defined below. They used a momentum of 0.9 that increased linearly to 0.99 over 200 epochs. We did not find a straightforward way to change the momentum mid-training in TensorFlow, so we fixed the momentum at 0.9. They did not specify a loss function, so we used the standard binary cross-entropy loss. AUC was used as the evaluation metric. They used early stopping to prevent overfitting. The early stopping criterion was no decrease in validation loss by more than a factor of 0.00001 over 10 epochs. Note that Keras' `min_delta` is an absolute rather than a relative threshold, so our setting only approximates this criterion.
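
The momentum schedule described in the paper can at least be written down as a plain function of the epoch number. This is a minimal sketch: `momentum_at_epoch` is a hypothetical helper name, and it assumes the ramp starts at epoch 0 and holds at 0.99 once 200 epochs have passed.

```python
def momentum_at_epoch(epoch, start=0.9, end=0.99, ramp_epochs=200):
    """Linearly interpolate momentum from `start` to `end` over `ramp_epochs` epochs."""
    # Clamp the epoch so the value stays at `end` after the ramp finishes.
    clamped = min(max(epoch, 0), ramp_epochs)
    return start + (end - start) * clamped / ramp_epochs

# Momentum at a few points along the ramp (assumed schedule, see the note above).
for epoch in (0, 100, 200, 500):
    print(epoch, round(momentum_at_epoch(epoch), 4))
```

Whether such a value can be assigned to the optimizer during training depends on the TensorFlow version, so the model above does not rely on it.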

In [9]:
class LearningRateCallback(tensorflow.keras.callbacks.Callback):
    """Divide the learning rate by a constant factor after each batch, down to a floor."""

    def __init__(self, initial_lr=0.05, lr_decay_factor=1.0000002, update_freq=None, min_lr=1e-6):
        super().__init__()
        self._update_freq = update_freq
        # Take the starting rate as an argument instead of reading the global `learning_rate`.
        self.learning_rate = initial_lr
        self.lr_decay_factor = lr_decay_factor
        self.min_lr = min_lr

    def on_batch_end(self, batch, logs=None):
        # Optionally update only every `update_freq` batches.
        if self._update_freq and batch % self._update_freq != 0:
            return

        # Decay the rate, but never let it fall below the floor.
        self.learning_rate = max(self.learning_rate / self.lr_decay_factor, self.min_lr)
        tensorflow.keras.backend.set_value(self.model.optimizer.lr, self.learning_rate)

The learning rate decays by a factor of 1.0000002 per batch, and stops decaying upon reaching the minimum learning rate of 10^-6.
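
Because the rate is divided by a constant factor each batch, after n batches it is approximately 0.05 / 1.0000002^n, floored at 10^-6. The sketch below uses hypothetical helper names to estimate how long the decay takes. It assumes the rate is updated on every batch (no `update_freq`).

```python
import math

def lr_after_batches(n, initial_lr=0.05, decay=1.0000002, min_lr=1e-6):
    """Closed-form learning rate after n per-batch decays, floored at min_lr."""
    # exp(-n * log(decay)) equals decay ** -n and cannot overflow for large n.
    return max(initial_lr * math.exp(-n * math.log(decay)), min_lr)

def batches_until_min_lr(initial_lr=0.05, decay=1.0000002, min_lr=1e-6):
    """Smallest number of batches after which the floor is reached."""
    return math.ceil(math.log(initial_lr / min_lr) / math.log(decay))

n_floor = batches_until_min_lr()
print(n_floor)
print(lr_after_batches(1_000_000))
```

With these defaults the floor is reached only after roughly 54 million batches, so the decay is very gradual.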

In [11]:
learning_rate_callback = LearningRateCallback()

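# EarlyStopping monitors val_loss, so validation data is needed; X_val and y_val are placeholders for a held-out set.
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=1000, batch_size=100, callbacks=[es, learning_rate_callback])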

Here we define the model fitting call. We use a batch size of 100, as the authors did, and pass in our early stopping and learning rate decay callbacks. We train for up to 1000 epochs, which is the maximum number of epochs the authors said their networks trained for. However, our early stopping criterion is likely to be reached before that.

# Question 2

Deep learning has evolved considerably over the last 6 years, so we can make some recommendations based on research since the original paper and on more powerful processors. We can experiment with more layers, more neurons per layer, and other combinations of activation functions, as well as different batch sizes. Increasing model capacity in this way may give a better approximation of the underlying functions. We can also experiment with the learning rate to help address the vanishing gradient problem. Currently, three feature sets are used: low-level, high-level, and complete. We could instead use partial subsets, excluding some low- and high-level features, and train deeper models on them to see whether results improve. Finally, we can experiment with different or custom loss functions, including features like label smoothing. We hope these changes would help training converge to a good minimum faster.
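
As one concrete example of the loss-function idea, label smoothing for a binary target replaces hard 0/1 labels with slightly softened ones before computing cross-entropy. The sketch below is plain Python with hypothetical helper names, made-up toy predictions, and an assumed smoothing value of 0.1. It is not something the paper's authors used.

```python
import math

def smooth_labels(labels, eps=0.1):
    """Pull binary labels toward 0.5: 0 becomes eps/2 and 1 becomes 1 - eps/2."""
    return [y * (1.0 - eps) + 0.5 * eps for y in labels]

def binary_cross_entropy(y_true, y_pred, tiny=1e-7):
    """Mean binary cross-entropy, clipping predictions away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, tiny), 1.0 - tiny)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(y_true)

# Toy labels and predictions, for illustration only.
labels = [0, 1, 1, 0]
preds = [0.02, 0.98, 0.60, 0.30]
hard_loss = binary_cross_entropy(labels, preds)
soft_loss = binary_cross_entropy(smooth_labels(labels), preds)
print(hard_loss, soft_loss)
```

In Keras, `tensorflow.keras.losses.BinaryCrossentropy` accepts a `label_smoothing` argument that applies the same idea.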

# Question 3

In addition to the above, there are several practices that have become standard since the paper was written. First is the use of different activation functions, particularly ReLU. Another is the creation of more advanced optimizers like Adam. Adam adapts the learning rate for each parameter, rather than relying on hand-tuned learning rate and momentum schedules like those in the paper. The combination of ReLU and Adam has been very effective at improving convergence and combating vanishing and exploding gradients.
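
To illustrate what per-parameter adaptation means, here is a minimal sketch of a single Adam update written in plain Python. `adam_step` is a hypothetical helper using the commonly cited default hyperparameters, and the two gradient values are made up for illustration.

```python
import math

def adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update and return (params, m, v); t is the 1-based step count."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # running mean of gradients
        v_i = beta2 * v_i + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m_i / (1 - beta1 ** t)             # bias correction for early steps
        v_hat = v_i / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v

# Two parameters whose gradients differ by a factor of 100,000.
params, m, v = adam_step([1.0, 1.0], [100.0, 0.001], [0.0, 0.0], [0.0, 0.0], t=1)
print(params)
```

Despite the very different gradient magnitudes, both parameters move by about `lr` on the first step, because each update is scaled by that parameter's own gradient history.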

Although the authors experimented with dropout, there are now many regularization techniques in common use. These include L1 and L2 regularization and batch normalization. These techniques have been shown to improve generalization in many cases, and may improve the performance of the model.

One last standard practice nowadays is the use of initialization techniques such as Glorot or Kaiming initialization. These have also been shown to improve convergence compared to sampling from arbitrary uniform or normal distributions. These techniques use input and/or output shapes to determine either the range of a uniform distribution or the standard deviation of a normal distribution to sample from.
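
For a sense of scale, the standard formulas can be evaluated for a 300-unit layer feeding another 300-unit layer. This is a small sketch with hypothetical function names. It computes only the scale parameters and does not sample any weights.

```python
import math

def glorot_normal_std(fan_in, fan_out):
    """Glorot (Xavier) normal: std = sqrt(2 / (fan_in + fan_out))."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def glorot_uniform_limit(fan_in, fan_out):
    """Glorot (Xavier) uniform: sample from [-limit, limit], limit = sqrt(6 / (fan_in + fan_out))."""
    return math.sqrt(6.0 / (fan_in + fan_out))

def he_normal_std(fan_in):
    """Kaiming (He) normal: std = sqrt(2 / fan_in)."""
    return math.sqrt(2.0 / fan_in)

print(round(glorot_normal_std(300, 300), 4))     # 0.0577
print(round(glorot_uniform_limit(300, 300), 4))  # 0.1
print(round(he_normal_std(300), 4))              # 0.0816
```

For comparison, the hidden layers in the replicated model use a fixed standard deviation of 0.05, which is close to the Glorot value for this layer size.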

# Question 4

We can get a good idea of whether we successfully replicated the authors' paper by comparing AUC scores. The authors reported an AUC score of 0.885 on their best-performing model. We would expect to see something similar. However, we will not replicate the exact model because of differences in things like random initialization and random batch shuffling.
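
AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it directly from that definition on made-up toy data. `pairwise_auc` is a hypothetical helper that compares every positive-negative pair, so it is only practical for small samples.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs that are ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Toy data, for illustration only.
print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

On a dataset of realistic size, a library metric such as the Keras AUC metric would be the practical choice.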

# Conclusion

We replicated the authors' paper as accurately as possible within the TensorFlow/Keras framework. We suggested possible improvements, such as a different number of layers and neurons and different feature selection. We mentioned modern standard practices like ReLU and Adam that could also be used to improve the model. With these improvements, we may be able to achieve a higher AUC and contribute to the success of collider searches for exotic particles.