## 1 - Introduction
In this notebook, we explore a data set of simulated data from the ATLAS experiment, which is located at the Large Hadron Collider at CERN. We'll look at the distribution of the data and labels, and explore existing techniques for data augmentation in deep learning for computer vision. Then we'll train a model to see how well it can distinguish between the two labels: black holes and sphalerons, two theoretical phenomena in fundamental physics with similar properties. The data set was generated by Aurora Grefsrud, a PhD student in the HVL ATLAS group, a research group at the Western Norway University of Applied Sciences.

## 2 - Setup
We use PyTorch, a framework well suited to building and training deep neural networks such as convolutional neural networks (CNNs).

In [104]:
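# Import necessary libraries
from pathlib import Path
import sys

import numpy as np  # used below as np.concatenate / np.shape

import torch
from torch.utils.data import DataLoader
import torch.nn as nn
import torch.nn.functional as F  # used below as F.relu / F.max_pool2d
from torch import Tensor

from sklearn.model_selection import train_test_split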


### 2.1 - Set up the data directories
This section covers the setup of the environment for our machine learning project. We import the necessary modules, set up the directory structure, and load the data for our analysis.

The methods folder contains two files: dataloader.py and plotCreator.py. The dataloader file provides an interface to load and preprocess data into a format that can be fed into a neural network for training and validation.

The plotCreator file provides a module with functions for creating visualizations of the data, such as histograms and grayscale images. These are useful for analyzing and understanding the data used in a neural network.

Keeping them in separate files improves the modularity, organization, and maintainability of the code.

In [105]:
# Constructs a path to a directory that contains dataloader.py and plotCreator.py
module_path = str(Path.cwd().parents[0] / "methods")

# Checks to see if the directory is already in sys.path to avoid adding it multiple times.
if module_path not in sys.path:
    sys.path.append(module_path)

# Imports all the functions defined in the dataloader.py
from dataloader import *

# Creates two file paths pointing to two HDF5 files
data_path0 = str(Path.cwd().parents[0] / "data" / "BH_n4_M10_res50_15000_events.h5")
data_path1 = str(Path.cwd().parents[0] / "data" / "PP13-Sphaleron-THR9-FRZ15-NB0-NSUBPALL_res50_15000_events.h5")

In [106]:
# Reads the two HDF5 data files and creates two NumPy arrays
bhArray = dataToArray(data_path0)
sphArray = dataToArray(data_path1)

## 3 - Inspect data
What are the dimensions of our two NumPy arrays, and how does this affect what we know about the dataset?

In [107]:
# Prints the shape of the arrays
print(bhArray.shape)
print(sphArray.shape)

(15000, 50, 50, 3)
(15000, 50, 50, 3)


Each array has the shape (15000, 50, 50, 3). The first element is the number of simulated ATLAS events, each stored as a 2D histogram, so there are 15,000 histograms per class and the two classes are balanced. The second and third elements say that each histogram has a size of 50x50 pixels, and the final element says that there are 3 channels, which lets the histograms be displayed as RGB images. The channels correspond to the sub-detectors at ATLAS: EMCal, HCal, and tracks, represented respectively by the red, green, and blue color channels.
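To make the channel layout concrete, here is a minimal sketch on a small synthetic array with the same layout as the arrays above. The variable names are illustrative, and the channel order (EMCal, HCal, tracks) is assumed from the description in the text rather than read from the data files.

```python
import numpy as np

# Synthetic stand-in for bhArray: 4 events, 50x50 pixels, 3 channels.
rng = np.random.default_rng(0)
events = rng.random((4, 50, 50, 3))

# Channel order assumed from the text: 0 = EMCal, 1 = HCal, 2 = tracks.
channel_names = ("EMCal", "HCal", "tracks")

first_event = events[0]  # shape (50, 50, 3)
per_channel = {
    name: first_event[:, :, i]  # one 50x50 histogram per sub-detector
    for i, name in enumerate(channel_names)
}

# Sum of the pixel contents per event and channel.
channel_sums = events.sum(axis=(1, 2))  # shape (4, 3)
print(per_channel["HCal"].shape, channel_sums.shape)
```

Slicing on the last axis like this is also how a single sub-detector could be plotted on its own as a grayscale image.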

## 4 - Data preparation

In [108]:
# Combines the two arrays of data into a single array. It creates a combined dataset
# that can be used for training the machine learning model to distinguish 
# between the two classes.
dataArray = np.concatenate((bhArray, sphArray), axis=0)

In [109]:
np.shape(dataArray)

(30000, 50, 50, 3)

The shape of dataArray confirms that the concatenation of 'bhArray' and 'sphArray' was performed correctly: it has the expected number of data points (15,000 + 15,000) and the same remaining dimensions.

In [110]:
# Creating an array of length 30,000 with the first 15,000 elements set to 0 and the
# second 15,000 elements set to 1. This corresponds to the two classes of data: black
# hole (class 0) and sphaleron (class 1).
labelsArray = np.concatenate((np.zeros(15_000),np.ones(15_000)),axis=0)

In [111]:
np.shape(labelsArray)

(30000,)

The output from the above cell shows that we now have two separate arrays with one entry per event. The dataArray contains the features, which is the actual data, while the labelsArray contains only the labels. Pairing each feature with its label is what makes supervised machine learning possible.
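The pairing of features and labels only works if the two arrays stay aligned. A small sanity check on synthetic stand-ins (5 events per class instead of 15,000; the names `bh`, `sph`, `data` and `labels` are illustrative) could look like this. It also derives the label counts from the array lengths instead of hard-coding them.

```python
import numpy as np

# Small synthetic stand-ins for bhArray and sphArray (5 events each).
bh = np.zeros((5, 50, 50, 3))
sph = np.ones((5, 50, 50, 3))

data = np.concatenate((bh, sph), axis=0)
labels = np.concatenate((np.zeros(len(bh)), np.ones(len(sph))), axis=0)

# One label per event, and each label points at the right source array.
assert data.shape[0] == labels.shape[0]
assert np.all(data[labels == 0] == 0)
assert np.all(data[labels == 1] == 1)

# Class balance: number of events per label.
values, counts = np.unique(labels, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))
```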

In [112]:
# Checks whether a CUDA-enabled GPU is available on the system where the code is running.
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("Running on the GPU")
else:
    device = torch.device("cpu")
    print("Running on the CPU")

Running on the CPU


### 4.1 - Create training and test sets


In [113]:
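# Randomly splits the dataArray and labelsArray into two sets. The trainData and trainLabels
# will be used to train the machine learning model, while the testData and testLabels will be 
# used to evaluate the performance of the model. Since no test_size is given, scikit-learn
# holds out 25% of the samples for the test set. random_state=42 makes the split reproducible.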
trainData, testData, trainLabels, testLabels = train_test_split(dataArray, labelsArray, random_state=42)
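The split sizes follow from scikit-learn's default of holding out 25% when no `test_size` is passed, which for 30,000 events means 22,500 for training and 7,500 for testing. The sketch below shows this on a small synthetic data set, and also shows the optional `stratify` argument, which the notebook does not use but which keeps the class proportions identical in both splits. All names here are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 200 samples, balanced between class 0 and class 1.
X = np.arange(200 * 2, dtype=np.float32).reshape(200, 2)
y = np.concatenate((np.zeros(100), np.ones(100)))

# With no test_size given, 25% of the samples are held out.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(len(X_train), len(X_test))

# stratify=y keeps the class proportions the same in both splits.
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X, y, random_state=42, stratify=y
)
print(int(ys_test.sum()), len(ys_test))
```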

### 4.2 - Data loaders

In [114]:
# Converts the NumPy arrays into PyTorch tensors, preparing the data for use in PyTorch models.
trainData = torch.from_numpy(trainData)
testData = torch.from_numpy(testData)
trainLabels = torch.from_numpy(trainLabels)
testLabels = torch.from_numpy(testLabels)
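`torch.from_numpy` keeps both the dtype and the axis order of the NumPy array. Two things are therefore worth checking before training, sketched below on synthetic stand-ins: the labels created with `np.zeros` / `np.ones` are float64, while `nn.CrossEntropyLoss` expects integer class indices, and `nn.Conv2d` expects channels-first input of shape (batch, channels, height, width). Whether the notebook's real arrays need these conversions depends on `dataToArray` and on the training loop, neither of which is shown here, so treat this as an assumption-laden sketch.

```python
import numpy as np
import torch

# Synthetic stand-ins with the same layout as the notebook's arrays.
data_np = np.random.rand(8, 50, 50, 3)                  # float64, channels last
labels_np = np.concatenate((np.zeros(4), np.ones(4)))   # float64

data = torch.from_numpy(data_np)
labels = torch.from_numpy(labels_np)
print(data.dtype, labels.dtype)  # from_numpy keeps the NumPy dtype

# Move channels to position 1 and match the default float32 model weights.
data_nchw = data.permute(0, 3, 1, 2).float()
# Class indices as 64-bit integers.
labels_idx = labels.long()
print(data_nchw.shape, labels_idx.dtype)
```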

In [115]:
# Creates PyTorch 'TensorDataset' objects that can be passed as input to PyTorch data loaders.
train = torch.utils.data.TensorDataset(trainData, trainLabels)
test = torch.utils.data.TensorDataset(testData, testLabels)

In [116]:
# Creates PyTorch 'DataLoader' objects from the train and test datasets.
trainLoader = DataLoader(train, shuffle=True, batch_size=50)
testLoader = DataLoader(test, shuffle=True, batch_size=50)

'shuffle = True' specifies that the data should be randomly reshuffled at the start of each epoch. The model then sees the same examples in a different order, and grouped into different batches, in each epoch, which can improve generalization and reduce overfitting. Shuffling is not needed for the test loader, since evaluation metrics such as accuracy do not depend on the order of the samples.

'batch_size = 50' specifies that the data should be divided into batches of size 50. Batching the data improves the efficiency of training by allowing the model to process multiple examples in parallel.
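The number of batches per epoch is the number of samples divided by the batch size, rounded up. With the default 25% hold-out that gives 22,500 / 50 = 450 training batches here. The following sketch uses a small synthetic data set whose size is deliberately not a multiple of 50, to show what happens to the last batch. The names are illustrative.

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in: 120 samples, so batches of 50 do not divide evenly.
features = torch.zeros(120, 3, 50, 50)
targets = torch.zeros(120, dtype=torch.long)
dataset = TensorDataset(features, targets)

loader = DataLoader(dataset, shuffle=True, batch_size=50)

# Number of batches per epoch is ceil(n_samples / batch_size).
print(len(loader), math.ceil(len(dataset) / 50))

# The last batch is smaller unless drop_last=True is passed.
batch_sizes = [x.shape[0] for x, _ in loader]
print(batch_sizes)
```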

## 5 - Neural network models

### 5.1 - A feedforward neural network
Defines a simple feedforward neural network with three fully connected layers and ReLU activation functions between them. It is designed for classification tasks on input data with three features: nn.Linear acts on the last dimension of its input, which must therefore have size 3.

In [117]:
class LinearModel(nn.Module):
    def __init__(self, resolution, num_classes, stride=1):
        # resolution, num_classes and stride are accepted but not used by this model.
        super(LinearModel, self).__init__()
        self.fc1 = nn.Linear(3, 1000)
        self.fc2 = nn.Linear(1000, 100)
        self.fc3 = nn.Linear(100, 2)

    def forward(self, x:Tensor):
        # Chain the layers: each one consumes the previous layer's output.
        out = F.relu(self.fc1(x))
        out = F.relu(self.fc2(out))
        out = self.fc3(out)
        return out
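Because `nn.Linear` acts on the last dimension, a layer with 3 input features applied to a channels-last batch of shape (N, 50, 50, 3) treats every pixel as its own sample and returns one output vector per pixel. If one prediction per event is wanted instead, a common alternative is to flatten each histogram into 50 * 50 * 3 = 7,500 features first. The class below is a hypothetical variant written for illustration (the name `FlatLinearModel` and its arguments are not part of the notebook).

```python
import torch
import torch.nn as nn
from torch import Tensor


class FlatLinearModel(nn.Module):
    """Hypothetical variant: one logit vector per event, not per pixel."""

    def __init__(self, resolution: int = 50, channels: int = 3, num_classes: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),  # (N, H, W, C) -> (N, H*W*C)
            nn.Linear(resolution * resolution * channels, 1000),
            nn.ReLU(),
            nn.Linear(1000, 100),
            nn.ReLU(),
            nn.Linear(100, num_classes),
        )

    def forward(self, x: Tensor) -> Tensor:
        return self.net(x)


batch = torch.zeros(4, 50, 50, 3)
per_pixel = nn.Linear(3, 2)(batch)    # Linear acts on the last dimension
per_event = FlatLinearModel()(batch)
print(per_pixel.shape, per_event.shape)
```

The first shape has one pair of outputs for each of the 50x50 pixels, while the second has one pair per event.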

### 5.2 - A convolutional neural network (CNN)
Defines a convolutional neural network (CNN) with two convolutional layers, two max-pooling layers, and two fully connected layers. It is designed for classification tasks on 2D image data with 3 channels (RGB images). Networks of this type are commonly used for image classification because they are effective at learning spatial features and patterns from image data.

In [118]:
class ConvModel(nn.Module):
    def __init__(self):

        super(ConvModel, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=0)
        self.conv2 = nn.Conv2d(in_channels=16, out_channels=64, kernel_size=3, padding=0)

        self.fc1 = nn.Linear(11*11*64, 128)
        self.fc2 = nn.Linear(128,2)


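    def forward(self, x:Tensor):
        # Shape comments assume channels-first input of shape (N, 3, 50, 50).
        x = self.conv1(x)         # -> (N, 16, 48, 48)
        x = F.relu(x)             # ReLU activation on the conv1 output

        x = F.max_pool2d(x,2)     # -> (N, 16, 24, 24)

        x = self.conv2(x)         # -> (N, 64, 22, 22)
        x = F.relu(x)

        x = F.max_pool2d(x,2)     # -> (N, 64, 11, 11)

        x = torch.flatten(x, 1)   # -> (N, 11*11*64)

        x = self.fc1(x)           # -> (N, 128)
        x = F.relu(x)

        x = self.fc2(x)           # -> (N, 2), one logit per class

        return x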
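The input size of `fc1`, 11*11*64, follows from how each layer changes the side length of a 50x50 input. A convolution with kernel size k, padding p and stride s maps a side length n to floor((n + 2p - k) / s) + 1, and 2x2 max pooling halves it (rounding down). The short sketch below walks through that arithmetic in plain Python. The helper function names are illustrative.

```python
def conv2d_out(size: int, kernel: int, padding: int = 0, stride: int = 1) -> int:
    """Output side length of a square convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def max_pool_out(size: int, kernel: int) -> int:
    """Output side length of max pooling with stride equal to the kernel."""
    return (size - kernel) // kernel + 1


side = 50                              # input is 50x50
side = conv2d_out(side, kernel=3)      # conv1: 50 -> 48
side = max_pool_out(side, kernel=2)    # pool:  48 -> 24
side = conv2d_out(side, kernel=3)      # conv2: 24 -> 22
side = max_pool_out(side, kernel=2)    # pool:  22 -> 11

flat_features = side * side * 64       # 64 channels come out of conv2
print(side, flat_features)
```

If the input resolution or the padding changes, the same arithmetic gives the new value to use for the first argument of `fc1`.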