In [None]:
%run clone_git_on_colab.py

In [None]:
from course_settings import set_tf_nthreads
set_tf_nthreads(1) # best setting for this tutorial at CIP

# Higgs Challenge Example using Neural Networks -- continued


This notebook continues the [Higgs Challenge Example using Neural Networks](HiggsChallenge-NN_DL.ipynb).
The workflow is essentially the same, but here we use a neural network
with a deeper structure (deeper = more hidden layers)
to squeeze out a bit more performance.

## Load the data and preprocessing

In [None]:
# the usual setup: 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# load training data
df = pd.read_csv('data/atlas-higgs-challenge-2014-v2.csv.gz')

In [None]:
df["PRI_jet_subleading_pt"]

In [None]:
# map y values to integers
df['Label'] = df['Label'].map({'b':0, 's':1})

In [None]:
# let's create separate arrays
eventID = df['EventId']
X = df.loc[:,'DER_mass_MMC':'PRI_jet_all_pt']
y = df['Label']
weight = df['Weight']

In [None]:
#now split into testing and training samples
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test, eventID_train, eventID_test, weight_train, weight_test = train_test_split(
    X, y, eventID, weight, test_size=0.33, random_state=42)

We will again use the [approximate median significance][1] (AMS) from the Kaggle competition to judge how good a solution is. Note that if you do not use the full data set (i.e. you split it into training and testing samples), you have to reweight the events so that the yield of the subsample matches the total yield. We will do this below.

[1]: AMS.ipynb
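
For reference, the AMS of the Kaggle challenge is $\sqrt{2\left((s+b+b_r)\ln\left(1+\frac{s}{b+b_r}\right)-s\right)}$ with a regularization term $b_r=10$, where $s$ and $b$ are the weighted signal and background yields after the selection. The following is a minimal sketch under that assumption. The name `ams_sketch` is hypothetical and chosen so that it does not clash with the `ams` function loaded from `ams.py`, whose implementation may differ in detail.

```python
import numpy as np

def ams_sketch(s, b, b_reg=10.0):
    """Approximate median significance for signal yield s and background yield b."""
    s = np.asarray(s, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.sqrt(2.0 * ((s + b + b_reg) * np.log1p(s / (b + b_reg)) - s))

# For s << b the AMS approaches the familiar s / sqrt(b) estimate
print(ams_sketch(10.0, 10000.0), 10.0 / np.sqrt(10000.0))
```

Because the AMS depends on absolute yields, it only makes sense if $s$ and $b$ are normalized to the full data set, which is why the subsample yields have to be rescaled.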

In [None]:
# load function to compute approximate median significance (AMS)
%pycat ams.py
%run ams.py

In [None]:
# calculate the total weights (yields)
sigall  = weight.dot(y)
backall = weight.dot(y == 0)

sigtrain  = weight_train.dot(y_train)
backtrain = weight_train.dot(y_train == 0)

sigtest  = weight_test.dot(y_test)
backtest = weight_test.dot(y_test == 0)



## Rescaling
Neural networks are quite sensitive to feature scaling, so we rescale the features. Missing values are encoded as -999 in this data set; we set them to 0 before scaling.

In [None]:
from sklearn.preprocessing import RobustScaler

X_train[X_train==-999.] = 0.
X_test[X_test==-999.] = 0.

scaler = RobustScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
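
With its default settings, `RobustScaler` subtracts the median of each feature and divides by its interquartile range, which makes it less sensitive to outliers than a scaling based on mean and standard deviation. A minimal NumPy sketch of that transformation on a single toy column (the numbers are made up for illustration):

```python
import numpy as np

# Toy feature column with the -999 sentinel for missing values (made-up numbers)
x = np.array([50.0, -999.0, 120.0, 80.0, -999.0, 95.0])

# Step 1: replace the sentinel, as done for X_train and X_test above
x_clean = np.where(x == -999.0, 0.0, x)

# Step 2: centre on the median and divide by the interquartile range
median = np.median(x_clean)
q25, q75 = np.percentile(x_clean, [25, 75])
x_scaled = (x_clean - median) / (q75 - q25)

print(x_scaled)
```

Note that the replacement has to happen before the scaling: otherwise the -999 entries would distort the median and the quantiles.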

# Neural networks with Keras
scikit-learn has simple NNs, but if you want to train deep NNs, or train on GPUs, you probably want to use something like Keras instead.

Example of a deep NN using Keras (thanks to N. Hartmann for providing the Keras model):

In [None]:
np.random.seed(1337)  # for reproducibility

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, BatchNormalization


In [None]:
# create the model
from tensorflow.keras import regularizers

model = Sequential()
model.add(Dense(units = 100, input_shape=(30,)))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(Dense(units = 100))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(Dense(units = 100))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(Dense(units =   1, activation='sigmoid'))

* `Dense`: "Just your regular densely-connected NN layer."
  * implements the operation `output = activation(dot(input, kernel) + bias)`
    * `kernel` is a weights matrix created by the layer
    * `bias` is a bias vector created by the layer (only applicable if `use_bias` is True)
  * `units`: dimensionality of the output array
  * `input_shape`: expected shape of the input arrays (only needed for the first layer; here 30 input features)
  * `activation`: element-wise activation function
  * `kernel_regularizer`: regularizer function applied to the kernel weights matrix (not used in the model above; the related weight constraints are described in [constraints][1])
* `BatchNormalization`: normalizes the outputs of the previous layer over each mini-batch, which stabilizes and usually speeds up the training (see [BatchNormalization][2])
* `Activation`: applies the specified activation function (see [activation discussion](NN_Activation.ipynb))
  
  
[1]: https://keras.io/constraints/
[2]: https://www.dlology.com/blog/one-simple-trick-to-train-keras-model-faster-with-batch-normalization/
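
To make the `Dense` → `BatchNormalization` → `Activation` sequence more concrete, here is a NumPy sketch of one such block in training mode. It is an illustration only, not the Keras implementation: the helper names and the value of `eps` are assumptions, and at inference time Keras uses moving averages instead of the statistics of the current batch.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, kernel, bias):
    # dot(input, kernel) + bias; the activation is applied separately below
    return x @ kernel + bias

def batch_norm(z, gamma, beta, eps=1e-3):
    # normalize each unit over the mini-batch, then rescale and shift
    mean = z.mean(axis=0)
    var = z.var(axis=0)
    return gamma * (z - mean) / np.sqrt(var + eps) + beta

def relu(z):
    return np.maximum(z, 0.0)

# toy mini-batch: 64 events with 30 features, one hidden layer with 100 units
x = rng.normal(size=(64, 30))
kernel = rng.normal(scale=0.1, size=(30, 100))
bias = np.zeros(100)

z = batch_norm(dense(x, kernel, bias), gamma=np.ones(100), beta=np.zeros(100))
h = relu(z)
print(h.shape)  # (64, 100)
```

In the real layer `kernel`, `bias`, `gamma` and `beta` are trainable parameters; here they are simply fixed to toy values.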

In [None]:
# visualize model
from tensorflow.keras.utils import plot_model
plot_model(model)

In [None]:
# compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) # or weighted metrics

* `optimizer`: name of optimizer or optimizer instance. See [optimizers][1].
  * _Adam_: an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments ([paper][2], a short [summary][4])
* `loss`: name of the objective function, or the objective function itself. See [losses][3].
  * _binary crossentropy_: 
    $$H_p(q) = -\frac{1}{N}\sum_{i=1}^N [{y_i} \log(\hat{y}_i)+(1-y_i) \log(1-\hat{y}_i)]$$
    * a measure of dissimilarity, used here to define the loss function that should be minimized: "The cross entropy between two probability distributions p and q over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set if a coding scheme used for the set is optimized for an estimated probability distribution q, rather than the true distribution p."
       * here the true labels are $y_i=1$ for the positive class and $y_i=0$ for the negative class
       * the estimated probabilities are $\hat y_{i}$
       * the sum runs over all $N$ samples
* `metrics`: list of metrics to be evaluated by the model during training and testing (typically accuracy)

[1]: https://keras.io/optimizers/
[2]: https://arxiv.org/abs/1412.6980v8
[3]: https://keras.io/losses/
[4]: https://medium.com/@nishantnikhil/adam-optimizer-notes-ddac4fd7218
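
The binary crossentropy formula above can be checked with a few lines of NumPy. This is a sketch for illustration: the function name and the clipping constant `eps` (which only protects against `log(0)`) are assumptions, not the Keras internals.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary crossentropy between labels y_true and probabilities y_pred."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

y = np.array([1, 0, 1, 0])
print(binary_crossentropy(y, [0.9, 0.1, 0.8, 0.2]))  # confident and correct: small loss
print(binary_crossentropy(y, [0.1, 0.9, 0.2, 0.8]))  # confident and wrong: large loss
```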

### Introducing Weights

Another new ingredient here is the reweighting of the events. We do three things:
1. Apply the event-based weights stored in `weight_train` (and `weight_test`). In the computation of the loss function, this gives more weight to background events that have larger cross sections and are therefore more important to suppress than others.
1. Reweight signal and background such that their total weights are again about the same. Note that the unweighted sample has a ratio of about 1:2 for signal:background events, and we had seen that after applying the weights this ratio is reduced to about 1:500. Such a drastic difference in the weights can cause problems in the training, so we restore a roughly equal total weight by multiplying with the two (global) weights for signal and background that we compute in `class_weight`.
1. Normalize the weights such that the mean weight is 1. This avoids an overall shift in the loss value, which would mean we also have to adjust optimization parameters (like the learning rate).

In [None]:
class_weight = {0: y_train.shape[0]/backtrain, 1:y_train.shape[0]/sigtrain}
class_weight

In [None]:
weight_train_tot = np.array(weight_train*np.array(list(class_weight.values()))[y_train.astype(int)])
weight_test_tot = np.array(weight_test*np.array(list(class_weight.values()))[y_test.astype(int)])
weight_train_tot /= weight_train_tot.mean()
weight_test_tot /= weight_test_tot.mean()
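
The effect of the three reweighting steps is easy to verify on a toy sample. The numbers below are made up, and `np.where` is used here instead of the dictionary lookup above purely for readability.

```python
import numpy as np

# Toy sample (made-up numbers): three background events, two signal events
y = np.array([0, 0, 0, 1, 1])
w = np.array([2.0, 3.0, 5.0, 0.01, 0.03])

n = len(y)
back = w[y == 0].sum()   # total background weight
sig = w[y == 1].sum()    # total signal weight

# Steps 1 and 2: event weight times a global per-class factor
class_factor = np.where(y == 1, n / sig, n / back)
w_tot = w * class_factor

# Step 3: normalize to a mean weight of 1
w_tot /= w_tot.mean()

print(w_tot[y == 0].sum(), w_tot[y == 1].sum())  # equal total weight per class
print(w_tot.mean())
```

Within each class the relative event weights are preserved; only the overall balance between signal and background changes.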

In [None]:
from tensorflow.keras.callbacks import EarlyStopping
history = model.fit(
    X_train_scaled,
    y_train,
    epochs=100,
    batch_size=64,
    sample_weight=weight_train_tot,
    validation_data=(X_test_scaled, y_test, weight_test_tot),
    callbacks=[EarlyStopping(verbose=True, patience=3)]
)

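* `batch_size`: number of samples per gradient update
* `epochs`: maximum number of epochs to train the model. An epoch is one iteration over the entire x and y data provided.
* `sample_weight`: per-event weights applied in the computation of the loss
* `callbacks`: here `EarlyStopping` with `patience=3`, which stops the training once the monitored quantity (`val_loss` by default) has not improved for 3 epochs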
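
The interplay of `batch_size`, `epochs` and the `EarlyStopping` callback used in the `fit` call can be sketched in plain Python. Both helper functions are hypothetical and only illustrate the idea; the actual Keras callback has more options (e.g. a minimum improvement).

```python
import math

def updates_per_epoch(n_samples, batch_size):
    # the last batch may be smaller than batch_size
    return math.ceil(n_samples / batch_size)

def early_stopping_epoch(val_losses, patience):
    """Return the index of the epoch at which training would stop,
    or None if the patience is never exhausted."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

print(updates_per_epoch(1000, 64))  # 16
print(early_stopping_epoch([1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.6], patience=3))  # 5
```

In the second example the training stops before the later improvement to 0.6 is ever seen, which is the price of a small `patience`.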


In [None]:
# visualize training history returned by model.fit

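# Plot training & validation accuracy values
# (the history key is 'accuracy' in recent TensorFlow versions, 'acc' in older ones)
acc_key = 'accuracy' if 'accuracy' in history.history else 'acc'
plt.plot(history.history[acc_key])
plt.plot(history.history['val_' + acc_key])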
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()

# Plot training & validation loss values
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()

In [None]:
y_train_prob_keras = model.predict(X_train_scaled)[:, 0]
y_test_prob_keras = model.predict(X_test_scaled)[:, 0]

In [None]:
# Run the AMS scan
from sklearn.metrics import roc_curve
def ams_scan(y, y_prob, weights, label):
    fpr, tpr, thr = roc_curve(y, y_prob, sample_weight=weights)
    ams_vals = ams(tpr * sigall, fpr * backall)
    print("{}: Maximum AMS {:.3f} for pcut {:.3f}".format(label, ams_vals.max(), thr[np.argmax(ams_vals)]))
    return thr, ams_vals
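
`ams_scan` multiplies the weighted efficiencies by `sigall` and `backall`, so that yields computed on a subsample correspond to the full data set. The following sketch spells this out for a single cut value, using the same convention as `roc_curve` (events with a score greater than or equal to the threshold are selected). The function name and the toy numbers are made up for illustration.

```python
import numpy as np

def yields_after_cut(y, y_prob, weights, pcut, sig_total, back_total):
    """Signal and background yields after the cut, normalized to the total yields."""
    y = np.asarray(y)
    y_prob = np.asarray(y_prob)
    weights = np.asarray(weights, dtype=float)
    selected = y_prob >= pcut
    # weighted efficiencies for signal (tpr) and background (fpr)
    tpr = weights[selected & (y == 1)].sum() / weights[y == 1].sum()
    fpr = weights[selected & (y == 0)].sum() / weights[y == 0].sum()
    return tpr * sig_total, fpr * back_total

s, b = yields_after_cut(
    y=[1, 1, 0, 0], y_prob=[0.9, 0.4, 0.6, 0.1], weights=[1.0, 3.0, 2.0, 2.0],
    pcut=0.5, sig_total=100.0, back_total=1000.0)
print(s, b)  # 25.0 500.0
```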

In [None]:
plt.plot(*ams_scan(y_train, y_train_prob_keras, weight_train, "Train"), label="Train")
plt.plot(*ams_scan(y_test, y_test_prob_keras, weight_test, "Test"), label="Test")
plt.xlim(0.8, 1.)
plt.legend()

There are a few things we can easily vary: the number of hidden layers, the activation function, and the regularization ($\alpha$).

# Your tasks
Problems (you can do these with either MLPClassifier or Keras):
1. Vary the structure of the network (number of hidden layers, number of neurons).
1. Vary the activation function. (In Keras you can set it per layer, in MLPClassifier only for all hidden layers at once.)
1. Vary the regularization. You may have to retune it as the structure changes.
1. Try using derived variables only or primary variables only.
1. Missing data is represented by -999 before scaling. Is there a better value to use in the training?
1. Try using the event weights to better match the background and signal shapes in the training. Note, though, that you should still treat background and signal separately; don't scale the signal down by the weight.
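
As a starting point for the task on derived and primary variables: the feature names carry a `DER_` or `PRI_` prefix, so the two groups can be separated by name. A small sketch with a hypothetical helper, shown on three of the column names that appear above:

```python
# Hypothetical helper; relies on the DER_/PRI_ prefix convention of the column names
def split_feature_names(columns):
    derived = [c for c in columns if c.startswith("DER_")]
    primary = [c for c in columns if c.startswith("PRI_")]
    return derived, primary

derived, primary = split_feature_names(
    ["DER_mass_MMC", "PRI_jet_subleading_pt", "PRI_jet_all_pt"])
print(derived)  # ['DER_mass_MMC']
print(primary)  # ['PRI_jet_subleading_pt', 'PRI_jet_all_pt']
```

With the DataFrame from above one would pass `X.columns` and then select, for example, `X[derived]` before the scaling step. Remember that `input_shape` of the first layer has to match the number of selected features.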