In [None]:
%run clone_git_on_colab.py

# Higgs Challenge Example using Neural Networks
In this part we continue to work with the data from the **[Higgs Boson ML Challenge][1]** on Kaggle and attempt a solution using neural networks (NN). See the [previous notebook][2] to get started.

We start with some introductory information on [Neural Networks][3].

[1]: https://www.kaggle.com/c/Higgs-boson
[2]: HiggsChallenge.ipynb
[3]: NN_Activation.ipynb

## Neural Networks to discover the Higgs

Let's now apply a NN to the Higgs Challenge data. We start with scikit-learn and then try **[Keras](https://keras.io/)**.

### Load the data and preprocessing

In [None]:
# the usual setup: 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# load training data
df = pd.read_csv('data/atlas-higgs-challenge-2014-v2.csv.gz')

In [None]:
df.iloc[:5]

In [None]:
df.PRI_jet_leading_pt[df.PRI_jet_leading_pt>0].hist(bins=50)
plt.yscale('log')

f=plt.figure()
df.DER_mass_MMC[(df.DER_mass_MMC>0)&(df.DER_mass_MMC<250)].hist(bins=50);

In [None]:
# map y values to integers
df['Label'] = df['Label'].map({'b':0, 's':1})

In [None]:
df.iloc[:5]

In [None]:
# let's create separate arrays
eventID = df['EventId']
X = df.loc[:,'DER_mass_MMC':'PRI_jet_all_pt']
y = df['Label']
weight = df['Weight']

In [None]:
#now split into testing and training samples
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test, eventID_train, eventID_test, weight_train, weight_test = train_test_split(
    X, y, eventID, weight, test_size=0.33, random_state=42)

# Neural networks (MLP) in sklearn

In [None]:
# now let's first look at a NN in sklearn
from sklearn.neural_network import MLPClassifier # Multi-layer Perceptron classifier.
mlp = MLPClassifier(verbose=True, early_stopping=True)

In [None]:
# and train
mlp.fit(X_train, y_train)

In [None]:
mlp.score(X_test, y_test)

We will again use the [approximate median significance][1] (AMS) from the Kaggle competition to determine how good a solution is. Note that if you do not use the full data set (i.e. you split into training and testing), you have to reweight the inputs so that the subsample yield matches the total yield, which we will do below.

[1]: AMS.ipynb
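
As a rough illustration of the two ideas above, here is a minimal sketch of the AMS as defined for the challenge (with regularisation term $b_\mathrm{reg}=10$) and of rescaling subsample weights to the full yield. The names `ams_sketch` and `rescale_to_full_yield` and the toy numbers are made up for this example; the `ams` function loaded from `ams.py` may be implemented differently.

```python
import numpy as np

def ams_sketch(s, b, b_reg=10.0):
    """Approximate median significance for signal yield s and background yield b."""
    s = np.asarray(s, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.sqrt(2.0 * ((s + b + b_reg) * np.log1p(s / (b + b_reg)) - s))

def rescale_to_full_yield(weights_sub, total_yield):
    """Scale subsample weights so that they sum to the yield of the full sample."""
    weights_sub = np.asarray(weights_sub, dtype=float)
    return weights_sub * (total_yield / weights_sub.sum())

# Toy numbers (hypothetical): a subsample carrying only part of the signal weight
w_signal_sub = np.array([0.2, 0.3, 0.5])
w_signal_full = rescale_to_full_yield(w_signal_sub, total_yield=3.0)

# AMS for 10 expected signal and 100 expected background events
example_ams = ams_sketch(10.0, 100.0)
print(w_signal_full.sum(), example_ams)
```

The scan further down reaches the same goal in a different way: it takes the weighted signal and background efficiencies from `roc_curve` and multiplies them by the full yields `sigall` and `backall`.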

In [None]:
# load function to compute approximate median significance (AMS)
%pycat ams.py
%run ams.py

In [None]:
# Determine probability scores
y_train_prob = mlp.predict_proba(X_train)[:, 1]
y_test_prob = mlp.predict_proba(X_test)[:, 1]

In [None]:
# add the probability to the original data frame
df['Prob']=mlp.predict_proba(X)[:, 1]


In [None]:
kwargs = dict(histtype='stepfilled', alpha=0.3, density=True, bins=40)

df[df.Label==0].Prob.hist(label='Background',**kwargs)
df[df.Label==1].Prob.hist(label='Signal',**kwargs)
plt.legend();


In [None]:
# calculate the total weights (yields)
sigall  = weight.dot(y)
backall = weight.dot(y == 0)

In [None]:
# Run the AMS scan
from sklearn.metrics import roc_curve
def ams_scan(y, y_prob, weights, label):
    fpr, tpr, thr = roc_curve(y, y_prob, sample_weight=weights)
    ams_vals = ams(tpr * sigall, fpr * backall)
    print("{}: Maximum AMS {:.3f} for pcut {:.3f}".format(label, ams_vals.max(), thr[np.argmax(ams_vals)]))
    return thr, ams_vals

In [None]:
plt.plot(*ams_scan(y_train, y_train_prob, weight_train, "Train"), label="Train")
plt.plot(*ams_scan(y_test, y_test_prob, weight_test, "Test"), label="Test")
plt.xlim(0., 1.)
plt.grid()
plt.xlabel('Pcut')
plt.ylabel('AMS')
plt.legend();

How did we do? Worse than the BDT from [HiggsChallenge.ipynb](HiggsChallenge.ipynb).
![Comparison with submissions](figures/tr150908_davidRousseau_TMVAFuture_HiggsML.001.png)

## Rescaling
Neural networks are quite sensitive to feature scaling, so let's try to scale the features.
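
To make the idea concrete, here is a small NumPy sketch of what a robust scaling does to one feature: centre on the median and divide by the interquartile range (25th to 75th percentile), which is the default behaviour of `RobustScaler`. The helper `robust_scale` and the toy values are assumptions for illustration only.

```python
import numpy as np

def robust_scale(column):
    """Centre a feature on its median and divide by its interquartile range."""
    column = np.asarray(column, dtype=float)
    q25, median, q75 = np.percentile(column, [25, 50, 75])
    return (column - median) / (q75 - q25)

# Toy feature (made-up values) with one large outlier
x = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])
x_scaled = robust_scale(x)
print(x_scaled)
```

Because median and quartiles barely move when an outlier is added, the bulk of the distribution ends up at order one, while the outlier stays far out instead of squeezing everything else towards zero.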

In [None]:
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
X_train.columns

In [None]:
X_train.DER_mass_MMC.hist(bins=50);

In [None]:
a=plt.hist(X_train_scaled[:,0],bins=50)

In [None]:
# and train a new network
mlp_scaled = MLPClassifier(verbose=True, early_stopping=True)
mlp_scaled.fit(X_train_scaled, y_train)

In [None]:
mlp_scaled.score(X_test_scaled, y_test)

In [None]:
# Determine probability scores
y_train_prob_scaled = mlp_scaled.predict_proba(X_train_scaled)[:, 1]
y_test_prob_scaled = mlp_scaled.predict_proba(X_test_scaled)[:, 1]

In [None]:
plt.plot(*ams_scan(y_train, y_train_prob_scaled, weight_train, "Train"), label="Train")
plt.plot(*ams_scan(y_test, y_test_prob_scaled, weight_test, "Test"), label="Test")
plt.xlim(0., 1.)
plt.grid()
plt.xlabel('Pcut')
plt.ylabel('AMS')
plt.legend()

We improved quite a bit by using the same classifier but with rescaled data!

# Neural networks with Keras
Scikit-learn has simple NNs, but if you want to do deep NNs, or train on GPUs, you probably want to use something like Keras instead.

Let's try to create a simple NN using Keras.

In [None]:
np.random.seed(1337)  # for reproducibility

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense


In [None]:
# create the model
from tensorflow.keras import regularizers

model = Sequential()
model.add(Dense(units = 100, activation='relu', input_shape=(30,), kernel_regularizer=regularizers.l2(0.0001)))
model.add(Dense(units =   1, activation='sigmoid'))

* `Dense`: "Just your regular densely-connected NN layer."
  * implements the operation: output = activation(dot(input, kernel) + bias)
    * kernel is a weights matrix created by the layer
    * bias is a bias vector created by the layer (only applicable if `use_bias` is True)
  * `units`: dimensionality of the output array (note: we do not need to specify the size of the input array, except...)
  * `input_shape`: expected shape of the input arrays (...only needed for the first layer)
  * `activation`: element-wise activation function
  * `kernel_regularizer`: regularizer function applied to the kernel weights matrix (see [regularizers][2])
  
  
[1]: https://keras.io/constraints/
[2]: https://keras.io/api/layers/regularizers/
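
The operation `output = activation(dot(input, kernel) + bias)` can be written out in a few lines of NumPy. This is only a sketch of the forward pass with randomly initialised toy weights; `dense_forward` and `relu` are illustrative helpers, not part of Keras.

```python
import numpy as np

def relu(z):
    """Element-wise rectified linear unit."""
    return np.maximum(z, 0.0)

def dense_forward(inputs, kernel, bias, activation):
    """Forward pass of a dense layer: activation(inputs . kernel + bias)."""
    return activation(inputs @ kernel + bias)

rng = np.random.default_rng(0)
batch = rng.normal(size=(4, 30))     # 4 events with 30 features each
kernel = rng.normal(size=(30, 100))  # weight matrix: 30 inputs -> 100 units
bias = np.zeros(100)                 # one bias per unit

hidden = dense_forward(batch, kernel, bias, relu)
print(hidden.shape)
```

The shape of `kernel` is fixed by the number of input features and by `units`, which is why only the first layer needs to be told the input shape.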

In [None]:
# visualize model
from tensorflow.keras.utils import plot_model
plot_model(model)

In [None]:
# compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

* `optimizer`: name of optimizer or optimizer instance. See [optimizers][1].
  * _Adam_: an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments ([paper][2], a short [summary][4])
* `loss`: name of the objective function, or the objective function itself. See [losses][3].
  * _binary crossentropy_: 
    $$H_p(q) = -\frac{1}{N}\sum_{i=1}^N [{y_i} \log(\hat{y}_i)+(1-y_i) \log(1-\hat{y}_i)]$$
    * a measure of dissimilarity, used here to define the loss function that should be minimized: 
    
        "The cross entropy between two probability distributions p and q over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set if a coding scheme used for the set is optimized for an estimated probability distribution q, rather than the true distribution p."
        
        (The minimum number of bits to encode an independent event that occurs with probability $y_i$ is $-\log_2(y_i)$.)
    * here the true labels are $y_i=1$ for the positive class and $y_i=0$ for the negative class
    * the estimated probabilities are $\hat y_{i}$
    * $N$ runs over all samples
* `metrics`: list of metrics to be evaluated by the model during training and testing (typically accuracy)
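
For the curious, a single Adam update for one parameter can be sketched as follows, using the default hyperparameters suggested in the paper. The function `adam_step` and the toy problem are assumptions for illustration; Keras applies the same idea to all weights at once and its defaults may differ slightly.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m and v are running estimates of the first and second moment."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** t)  # bias correction for the zero initialisation
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Toy problem: minimise f(x) = x^2, whose gradient is 2x
x, m, v = 1.0, 0.0, 0.0
x_first, m, v = adam_step(x, 2.0 * x, m, v, t=1)
print(x_first)
```

In the first step the bias-corrected ratio `m_hat / sqrt(v_hat)` is essentially the sign of the gradient, so the parameter moves by about one learning rate regardless of the gradient's magnitude.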

[1]: https://keras.io/optimizers/
[2]: https://arxiv.org/abs/1412.6980v8
[3]: https://keras.io/losses/
[4]: https://medium.com/@nishantnikhil/adam-optimizer-notes-ddac4fd7218
[5]: https://datascience.stackexchange.com/questions/9302/the-cross-entropy-error-function-in-neural-networks
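
The binary cross-entropy formula above is easy to evaluate by hand. The sketch below uses the natural logarithm (so the result is in nats rather than bits) and clips the predictions to avoid `log(0)`; the function name and the clipping value `eps` are choices made for this example.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between labels y_true and probabilities y_pred."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

# One signal and one background event
loss_uninformed = binary_crossentropy([1, 0], [0.5, 0.5])  # classifier that cannot decide
loss_confident = binary_crossentropy([1, 0], [0.9, 0.1])   # confident and correct
print(loss_uninformed, loss_confident)
```

A classifier that always answers 0.5 gives a loss of $\ln 2$; confident correct predictions push the loss towards zero, while confident wrong ones are penalised heavily.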

In [None]:
# train Keras NN (much faster than the MLP)
#model.fit(X_train_scaled, y_train, epochs=5, batch_size=128, sample_weight=weight_train)
history = model.fit(X_train_scaled, y_train, epochs=5, batch_size=64)


* `batch_size`: number of samples per gradient update
* `epochs`: number of epochs to train the model. An epoch is an iteration over the entire training dataset provided. 
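
The relation between the two parameters can be illustrated with a short sketch: each epoch is split into mini-batches, and every mini-batch triggers one gradient update. The helper names and the sample count of 1000 are hypothetical.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of gradient updates needed to see every sample once."""
    return math.ceil(n_samples / batch_size)

def iterate_minibatches(n_samples, batch_size):
    """Yield (start, stop) index pairs covering the data set in order."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)

n_updates = updates_per_epoch(1000, 64)
batches = list(iterate_minibatches(1000, 64))
total_updates = 5 * n_updates  # five epochs
print(n_updates, batches[-1], total_updates)
```

The last mini-batch is smaller whenever the sample count is not a multiple of `batch_size`. Smaller batches mean more (and noisier) updates per epoch.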


In [None]:
# visualize training history returned by model.fit

# Plot training & validation accuracy values
# the key is 'accuracy' in TensorFlow 2 and 'acc' in older Keras versions
acc_key = 'accuracy' if 'accuracy' in history.history else 'acc'
plt.plot(history.history[acc_key])
#plt.plot(history.history['val_' + acc_key]) -- only available if we do validation split
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train'], loc='upper left')  # add 'Test' when the validation curve is plotted
plt.show()

# Plot training & validation loss values
plt.plot(history.history['loss'])
#plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train'], loc='upper left')  # add 'Test' when the validation curve is plotted
plt.show()

In [None]:
y_train_prob_keras = model.predict(X_train_scaled)[:, 0]
y_test_prob_keras = model.predict(X_test_scaled)[:, 0]

In [None]:
plt.plot(*ams_scan(y_train, y_train_prob_keras, weight_train, "Train"), label="Train")
plt.plot(*ams_scan(y_test, y_test_prob_keras, weight_test, "Test"), label="Test")
plt.xlim(0., 1.)
plt.grid()
plt.xlabel('Pcut')
plt.ylabel('AMS')
plt.legend();

We only made a NN with a single hidden layer in Keras. However, you can easily change the structure of the network. As an assignment, try adding an extra hidden layer and changing the number of neurons.
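
Before changing the structure, it can help to see how quickly the number of trainable parameters grows. The sketch below counts weights and biases for a stack of dense layers; `count_dense_params` is a hypothetical helper, and the second architecture is just one possible choice.

```python
def count_dense_params(layer_sizes):
    """Weights plus biases for consecutive dense layers, e.g. [30, 100, 1]."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

params_current = count_dense_params([30, 100, 1])      # the model defined above
params_deeper = count_dense_params([30, 100, 100, 1])  # one extra hidden layer
print(params_current, params_deeper)
```

The extra hidden layer roughly quadruples the parameter count here, which is one reason the regularization may need retuning when the structure changes.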

#### Variations of MLP *(optional)*


There are a few things we can easily vary: the number of hidden layers, the activation function, and the regularization ($\alpha$). Let's go back to MLPClassifier (scaled) and play with some of them.

In [None]:
mlp_play = MLPClassifier(activation='relu', hidden_layer_sizes=(100,100), alpha=0.01, verbose=True, early_stopping=True)
mlp_play.fit(X_train_scaled, y_train)

In [None]:
mlp_play.score(X_test_scaled, y_test)

In [None]:
y_train_prob_play = mlp_play.predict_proba(X_train_scaled)[:, 1]
y_test_prob_play = mlp_play.predict_proba(X_test_scaled)[:, 1]

In [None]:
plt.plot(*ams_scan(y_train, y_train_prob_play, weight_train, "Train"), label="Train")
plt.plot(*ams_scan(y_test, y_test_prob_play, weight_test, "Test"), label="Test")
plt.xlim(0., 1.)
plt.grid()
plt.xlabel('Pcut')
plt.ylabel('AMS')
plt.legend();

# Your tasks
Problems (you can use either MLPClassifier or Keras):
1. Vary the structure of the network (number of hidden layers, number of neurons)
1. Vary the activation. (In Keras you can set it per layer; in MLPClassifier only for all layers at once.)
1. Vary the regularization. You may have to do this as the structure changes.
1. Try using derived variables only or primary variables only.
1. Missing data is represented by -999 before scaling. Is there a better value to use in the training?
1. Try using the event weights to better match the background and signal shapes in the training. Note, though, that you should still treat background and signal separately; don't scale the signal down by the weight.
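
As a possible starting point for the last two tasks, here is a hedged sketch of two helpers: one replaces the -999 placeholder by the column median (only one of several reasonable choices), the other normalises the event weights separately for background and signal so that each class sums to one. Both function names and the toy arrays are assumptions made for this example and are not part of the notebook.

```python
import numpy as np

MISSING = -999.0  # placeholder used for missing values in the data set

def impute_missing_with_median(features, missing=MISSING):
    """Replace the placeholder by the median of the valid entries in each column."""
    features = np.array(features, dtype=float)  # work on a copy
    features[features == missing] = np.nan
    medians = np.nanmedian(features, axis=0)
    rows, cols = np.where(np.isnan(features))
    features[rows, cols] = medians[cols]
    return features

def balance_class_weights(weights, labels):
    """Normalise weights within each class (0 = background, 1 = signal) to sum to one."""
    weights = np.asarray(weights, dtype=float)
    labels = np.asarray(labels)
    balanced = np.empty_like(weights)
    for cls in (0, 1):
        mask = labels == cls
        balanced[mask] = weights[mask] / weights[mask].sum()
    return balanced

# Toy inputs (made up)
toy_X = np.array([[1.0, -999.0], [3.0, 10.0], [5.0, 20.0]])
toy_filled = impute_missing_with_median(toy_X)
toy_w = balance_class_weights([2.0, 6.0, 0.5, 1.5], [0, 0, 1, 1])
print(toy_filled, toy_w)
```

The balanced weights keep the shape information within each class without letting the much smaller signal weights make the signal invisible to the training; they could be passed via `sample_weight` in `model.fit`.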