In this simple notebook we use a fully connected neural network to solve a previously seen problem, the particle ID classification (see notebooks of Chapter 4 for more detail).

It accompanies Chapter 8 of the book (1 of 2).

Copyright: Viviana Acquaviva (2023).

Modifications by Julieta Gruszko (2025).

License: [BSD-3-clause](https://opensource.org/license/bsd-3-clause/)

In [None]:
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.utils import shuffle

In [None]:
import matplotlib
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_colwidth', 150)

font = {'size'   : 16}
matplotlib.rc('font', **font)
matplotlib.rc('xtick', labelsize=14) 
matplotlib.rc('ytick', labelsize=14) 
matplotlib.rcParams['figure.dpi'] = 300

PyTorch is one of the most commonly used libraries for developing deep learning models. It is an open-source Python library that can also run on GPUs through platforms such as CUDA.

Keras is a high-level API (Application Programming Interface) that is built on top of a deep learning library. It was long used mainly with TensorFlow, but these days it supports multiple back-ends, including PyTorch. We can think of it as the equivalent of the sklearn library for neural networks: it is less general and less customizable than working with the back-end directly, but it is very user-friendly.

Another option, built directly into PyTorch, would be to use `torch.nn`. This is the more common choice, but we'd lose access to some of the nice hyperparameter tuning tools that interface with Keras (and would have to replace them with equivalent PyTorch-compatible libraries, which do exist if you want to go that route).

In [None]:
import torch

In [None]:
torch.__version__

This is how you set Keras to use the PyTorch back-end. The environment variable must be set before `keras` is imported:

In [None]:
import os
os.environ["KERAS_BACKEND"] = "torch"

In [None]:
import keras

from keras.models import Sequential #the model is built adding layers one after the other

from keras.layers import Dense #fully connected layers: every output talks to every input

from keras.layers import Dropout #for regularization

from keras import Input

We begin with the 4top vs ttbar problem, and we use the configuration where we added the features "number of leptons", "number of jets" etc., but not the particle type for every track.

For reference, the linear SVM with those features achieved 94.2% +/- 0.6% accuracy. Note that those numbers had not been run through <b> nested </b> cross validation so they might be slightly optimistic. 

Luckily, we saved the data with our added feature engineering back then, so we can just load it now:

In [None]:
X = pd.read_csv('../Data/Features_lim_2.csv')

In [None]:
y = np.genfromtxt('../Data/Labels_lim_2.txt')

In [None]:
X.values.shape

Just to remind ourselves what's in there:

In [None]:
display(X)

There is no "built-in" cross validation process here, so we would need to build it ourselves. For now, we can build three sets: train, validation (for parameter optimization), and test (for final evaluation). We should ideally build this as a cross-validation structure.
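
The cross-validation structure mentioned above could be sketched as follows. This is a minimal numpy-only illustration, not the procedure used in this notebook: the helper name `kfold_indices`, the toy sample count, and the choice of 5 folds are all assumptions made for the example. In practice, sklearn's `KFold` (imported above) takes care of this bookkeeping.

```python
import numpy as np

def kfold_indices(n_samples, n_folds=5, seed=10):
    """Yield (train_idx, val_idx) pairs for a shuffled k-fold split.

    Hypothetical helper, for illustration only.
    """
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)       # shuffle once, up front
    folds = np.array_split(indices, n_folds)   # near-equal chunks
    for k in range(n_folds):
        val_idx = folds[k]
        train_idx = np.concatenate(
            [folds[j] for j in range(n_folds) if j != k]
        )
        yield train_idx, val_idx

# Toy example with 10 samples: each sample lands in a validation fold exactly once.
for train_idx, val_idx in kfold_indices(10, n_folds=5):
    print(len(train_idx), len(val_idx))
```

With a neural network, each fold would need a freshly built model, and any preprocessing would have to be derived from that fold's training part only.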

In [None]:
#Always shuffle first

X,y = shuffle(X,y, random_state = 10)

In [None]:
X_train = X.values[:3000,:]
y_train = y[:3000]

In [None]:
X_val = X.values[3000:4000,:]
y_val = y[3000:4000]

In [None]:
X_test = X.values[4000:,:]
y_test = y[4000:]

In [None]:
X_train.shape, X_val.shape, X_test.shape

### Building the network

Let's think about the model architecture.

For a fully-connected neural net, we always start with one neuron per feature at the input layer. At the output, we want to perform a classification task. One nice way to build a classifier is to output the probability of belonging to the positive class as the target value. 

### Question:
- How many neurons should our input layer have, in this case?
- How many neurons should our output layer have, in this case?
- What would be a reasonable differentiable activation function for our output layer? Hint: think about what range of values our network should return to work as a classifier.

In addition to the input and output layers, we will add two hidden layers. Here we are making both sizes = 20 (we should optimize this hyperparameter!). We can also reserve the possibility of adding a dropout layer after each one, in case we decide we could benefit from regularization. The dropout fraction should also be optimized through CV.

Other decisions that we have to make are: 
- which nonlinearities we use
- which optimizer we use
- the loss function




The commands below can be used to explore possible choices.

In [None]:
dir(keras.activations)

We'll use ReLU for hidden layers, sigmoid for the output layer.
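
To see what these two choices do to a neuron's raw output, here is a small numpy sketch. The function names and the input values are ours, chosen for illustration; Keras applies its own implementations internally.

```python
import numpy as np

def relu(z):
    # Passes positive values through unchanged, sets negative values to zero
    return np.maximum(0.0, z)

def sigmoid(z):
    # Squashes any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))      # [0. 0. 3.]
print(sigmoid(z))   # every value lies strictly between 0 and 1
```

Because the sigmoid output always lies between 0 and 1, it can be read as a probability of belonging to the positive class.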

In [None]:
dir(keras.optimizers)

AdamW is a good "default option" to start with, so we'll use that.

In [None]:
dir(keras.losses)

A standard choice for a case like ours, where the labels are 0/1 but we can predict a probability, is the binary cross-entropy or log loss:

$L = - \frac{1}{N} \sum_{i=1}^N \left[ y_i \cdot \log(p(y_i)) + (1-y_i) \cdot \log (1 - p(y_i)) \right]$

Here $p(y_i)$ is the predicted probability that object $i$ belongs to the positive class. The loss penalizes positive examples that receive a low predicted probability and negative examples that receive a high one.
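
As a worked example of the log loss, the sketch below evaluates it with numpy on a handful of made-up labels and probabilities. The function name and the clipping constant `eps` are assumptions for illustration; Keras' own implementation may differ in its details.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    # Clip probabilities so that log(0) never occurs (eps is an assumed constant)
    p = np.clip(p_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

y_true = np.array([1.0, 0.0, 1.0, 0.0])

# Made-up predictions: one set mostly right, one set mostly wrong
mostly_right = np.array([0.9, 0.1, 0.8, 0.2])
mostly_wrong = np.array([0.1, 0.9, 0.2, 0.8])

print(binary_cross_entropy(y_true, mostly_right))
print(binary_cross_entropy(y_true, mostly_wrong))
```

The second set of predictions gives a much larger loss, because it assigns low probability to the positive examples and high probability to the negative ones.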

We also have to make initial choices for our training hyperparameters:
- which starting learning rate we adopt: we'll start with 0.001, but again this should be decided through CV
- the number of epochs: e.g. 100; we can plot quantities of interest to check that we have enough
- the batch size for the gradient descent step: here 200, but can explore! Smaller batches make each update cheaper to compute, but the gradient estimates are noisier, so convergence is less stable. 

### Question:
What is the difference between the number of epochs and the batch size? Describe how each is used.

### This is how we build a fully connected neural network in keras.

The pytorch back-end throws a warning because of a bug here, but you can safely ignore it.

In [None]:
model = Sequential()

# Tell subsequent layers what shape to expect. Keras can also infer this from the data, but it's slower
model.add(keras.Input(shape=(24,)))

# Add the first hidden layer (20 neurons); it receives the 24 original features

model.add(Dense(20, activation='relu', input_shape=(24,)))

# Add the second hidden layer and specify its size

model.add(Dense(20, activation='relu'))

# Add an output layer 

model.add(Dense(1, activation='sigmoid'))

optimizer = keras.optimizers.AdamW(learning_rate=0.001)

model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics = ['accuracy']) 

The `metrics` keyword here serves to specify other quantities we would like to monitor during training. The value of the loss itself is not easy to interpret, so we'll keep an eye on the accuracy.
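
As a rough sketch of what the accuracy reports for a sigmoid output: each predicted probability is turned into a 0/1 label by thresholding, and the labels are compared with the truth. The helper name, the 0.5 threshold (which we believe matches the Keras default for binary accuracy), and the toy numbers are assumptions for illustration.

```python
import numpy as np

def accuracy_from_probabilities(y_true, p_pred, threshold=0.5):
    # Turn probabilities into hard 0/1 labels, then count the matches
    y_hat = (p_pred > threshold).astype(int)
    return np.mean(y_hat == y_true)

y_true = np.array([1, 0, 1, 0, 1])
p_pred = np.array([0.9, 0.2, 0.4, 0.6, 0.7])   # made-up network outputs

print(accuracy_from_probabilities(y_true, p_pred))  # 0.6
```

Note that the accuracy ignores how confident each prediction is, while the loss does not; the two curves can therefore behave differently during training.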

### Question:
- How are the neuron weights being initialized for each layer?
- How are the biases being initialized for each layer?

Hint: You'll probably need to look at the Keras API documentation, found here: https://keras.io/api/

### Ready to fit?

In [None]:
mynet = model.fit(X_train, y_train, validation_data= (X_val, y_val), epochs = 100,  batch_size=200)

The train and validation accuracy don't look so good.

It's helpful to plot how training and validation loss vary throughout the epochs.

In [None]:
plt.figure(figsize=(14,5))

plt.subplot(121)

plt.plot(mynet.history['loss'], label = 'train')
plt.plot(mynet.history['val_loss'],'-.m', label = 'validation')
plt.ylabel('Loss', fontsize = 14)
plt.xlabel('Epoch', fontsize = 14)
plt.legend(loc='upper right', fontsize = 12)

plt.subplot(122)

plt.plot(mynet.history['accuracy'], label = 'train')
plt.plot(mynet.history['val_accuracy'], '-.m', label = 'validation')
plt.ylabel('Accuracy', fontsize = 14)
plt.xlabel('Epoch', fontsize = 14)
plt.legend(fontsize = 12)
plt.subplots_adjust(wspace=0.5)

#plt.show()

#plt.savefig('FirstNN.png', dpi= 300)

There's no sign of any convergence/improvement here!

### Question:
Any ideas to fix it? What step(s) did we skip?

When in doubt, take a look at the data again!

In [None]:
X.describe()

### Yep, we forgot scaling (and our features have wildly different ranges).

We can still use our sklearn tools to fix this!

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()

As usual, we only use the training set to derive the scaling.

In [None]:
scaler.fit(X_train)

In [None]:
Xst_train = scaler.transform(X_train)
Xst_val = scaler.transform(X_val)
Xst_test = scaler.transform(X_test)
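
For intuition, the scaling applied above amounts to subtracting the training mean and dividing by the training standard deviation, feature by feature. The numpy sketch below shows this on made-up toy data (the array names and values are ours, not the notebook's).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two features with very different ranges (values are made up)
toy_train = rng.normal(loc=[0.0, 500.0], scale=[1.0, 100.0], size=(200, 2))
toy_val = rng.normal(loc=[0.0, 500.0], scale=[1.0, 100.0], size=(50, 2))

# The statistics come from the training set only
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)   # population standard deviation (ddof=0)

toy_train_st = (toy_train - mean) / std
toy_val_st = (toy_val - mean) / std   # same mean and std, no refitting

print(toy_train_st.mean(axis=0), toy_train_st.std(axis=0))
print(toy_val_st.mean(axis=0), toy_val_st.std(axis=0))
```

The scaled training features have mean 0 and standard deviation 1 by construction; the validation features are only approximately standardized, which is expected since they played no role in deriving the scaling.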

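We can now train our neural network again. Note that calling `fit` on the same `model` object continues from the weights left by the previous (unscaled) run rather than starting from scratch; rebuild and compile the model again if you want a fresh initialization.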

In [None]:
mynet = model.fit(Xst_train, y_train, validation_data= (Xst_val, y_val), epochs=100, batch_size=200)

In [None]:
plt.figure(figsize=(14,5))

plt.subplot(121)

plt.plot(mynet.history['loss'], label = 'train')
plt.plot(mynet.history['val_loss'],'-.m', label = 'validation')
plt.ylabel('Loss', fontsize = 14)
plt.xlabel('Epoch', fontsize = 14)
plt.legend(loc='upper right', fontsize = 12)

plt.subplot(122)

plt.plot(mynet.history['accuracy'], label = 'train')
plt.plot(mynet.history['val_accuracy'], '-.m', label = 'validation')
plt.ylabel('Accuracy', fontsize = 14)
plt.xlabel('Epoch', fontsize = 14)
plt.legend(fontsize = 12)
plt.subplots_adjust(wspace=0.5)
#plt.show()

#plt.savefig('ScaledNN.png', dpi= 300)

### Questions:
- Do you observe high bias? What about high variance?
- How does the validation accuracy compare to our SVM model with these features (reminder, that model had a test accuracy of 94.2% +/- 0.6%)?

#### As you can see, this network is much better behaved, and it achieves a final accuracy similar to the one found by SVMs (this is common for tabular data like ours). 

#### We do see some signs of high variance in the training/validation accuracy curves; some regularization technique, such as a Dropout layer, may help. Early stopping (not shown here) would also be worth trying.
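
The idea behind early stopping can be sketched in a few lines of plain Python: stop once the validation loss has failed to improve for a set number of epochs. The helper below and its loss values are hypothetical, written only to illustrate the logic. Keras ships a `keras.callbacks.EarlyStopping` callback (with `monitor` and `patience` arguments) that can be passed to `fit` through its `callbacks` argument.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return the index of the epoch at which training would stop.

    Hypothetical helper: stops once the validation loss has failed to
    improve on its best value for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: all epochs were run

# Made-up validation losses: they improve, then slowly get worse
val_losses = [0.60, 0.45, 0.38, 0.35, 0.36, 0.37, 0.38, 0.40, 0.41]
print(early_stopping_epoch(val_losses, patience=3))  # 6
```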

In [None]:

model = Sequential()
# Tell subsequent layers what shape to expect
model.add(keras.Input(shape=(24,)))
# Add the first hidden layer (20 neurons); it receives the 24 original features

model.add(Dense(20, activation='relu', input_shape=(24,)))

model.add(Dropout(0.2)) #This is the dropout fraction

# Add the second hidden layer and specify its size

model.add(Dense(20, activation='relu'))

model.add(Dropout(0.2)) #This is the dropout fraction

# Add an output layer 

model.add(Dense(1, activation='sigmoid'))

optimizer = keras.optimizers.AdamW(learning_rate=0.001)

model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics = ['accuracy']) 
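
For intuition about what the `Dropout(0.2)` layers do during training, here is a numpy sketch of so-called inverted dropout: a random 20% of the activations are set to zero and the survivors are rescaled so that the expected value is unchanged. The function name and the toy activations are assumptions for illustration, not Keras internals.

```python
import numpy as np

def dropout_forward(activations, rate, rng, training=True):
    # At inference time dropout does nothing
    if not training:
        return activations
    # Keep each unit with probability (1 - rate) ...
    keep_mask = rng.random(activations.shape) >= rate
    # ... and rescale the survivors so the expected value is unchanged
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
a = np.ones((1000, 20))   # pretend output of a 20-unit hidden layer
dropped = dropout_forward(a, rate=0.2, rng=rng)

print((dropped == 0).mean())   # fraction zeroed, close to 0.2
print(dropped.mean())          # close to 1.0 thanks to the rescaling
```

Because dropout is only active during training, the training loss and accuracy reported by `fit` are computed with units switched off, while the validation metrics use the full network.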

In [None]:
mynet = model.fit(Xst_train, y_train, validation_data= (Xst_val, y_val), epochs=100, batch_size=200)

In [None]:
plt.figure(figsize=(14,5))

plt.subplot(121)

plt.plot(mynet.history['loss'], label = 'train')
plt.plot(mynet.history['val_loss'],'-.m', label = 'validation')
plt.ylabel('Loss', fontsize = 14)
plt.xlabel('Epoch', fontsize = 14)
plt.legend(loc='upper right', fontsize = 12)

plt.subplot(122)

plt.plot(mynet.history['accuracy'], label = 'train')
plt.plot(mynet.history['val_accuracy'], '-.m', label = 'validation')
plt.ylabel('Accuracy', fontsize = 14)
plt.xlabel('Epoch', fontsize = 14)
plt.legend(fontsize = 12)
plt.subplots_adjust(wspace=0.5)

#plt.savefig('RegularizedNN.png', dpi= 300)
#plt.show()

### The final evaluation of the model is always done on the test set; the reason is that the validation fold is used for hyperparameter optimization (which we haven't done here), while the test set is blind to it.


In [None]:
scores = model.evaluate(Xst_test, y_test, verbose=1)

print("Accuracy: %.2f%%" % (scores[1]*100))

In [None]:
# "scores" contains the test loss and the accuracy, which we are monitoring.

scores

### Questions:
- How does the performance of this fully-connected neural network with two hidden layers compare to the performance of the SVMs you tested?
- Which one would you prefer to use: this fully-connected NN or the SVM? Give at least 2 arguments for why your chosen method is preferable.

### Acknowledgement Statement:

Once you're done with this notebook, move on to the photometric redshift neural network example.