## Setup

### Packages

- `awkward`: For dealing with nested, variable-sized data.
- `pennylane`: Quantum machine learning.
- `lightning`: Simplifying training process.
- `pytorch_geometric`: Graph neural network package.
- `wandb`: Monitoring training process.

In [1]:
# basic packages
import os, glob, random
from itertools import product
import matplotlib.pyplot as plt

# data
import awkward as ak
from d_hep_data import JetEvents

# qml
import pennylane as qml
from pennylane import numpy as np

# pytorch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn import functional as F

# pytorch_lightning
import lightning as L
import lightning.pytorch as pl
import torchmetrics

# pytorch_geometric
import networkx as nx
import torch_geometric.nn as geom_nn
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader

# wandb
import wandb
from lightning.pytorch.loggers import WandbLogger
wandb.login()

# reproducibility
L.seed_everything(3020616)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# faster calculation on GPU but less precision
torch.set_float32_matmul_precision("medium")

#--------------------------------------------------------------------------
#                         FastJet release 3.4.0
#                 M. Cacciari, G.P. Salam and G. Soyez                  
#     A software package for jet finding and analysis at colliders      
#                           http://fastjet.fr                           
#	                                                                      
# Please cite EPJC72(2012)1896 [arXiv:1111.6097] if you use this package
# for scientific work and optionally PLB641(2006)57 [hep-ph/0512210].   
#                                                                       
# FastJet is provided without warranty under the GNU GPL v2 or higher.  
# It uses T. Chan's closest pair algorithm, S. Fortune's Voronoi code,
# CGAL and 3rd party plugin jet algorithms. See COPYING file for details.
#--------------------------------------------------------------------------


  from .autonotebook import tqdm as notebook_tqdm
ERROR:wandb.jupyter:Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33mntuyianchen[0m. Use [1m`wandb login --relogin`[0m to force relogin
INFO: Global seed set to 3020616
INFO:lightning.fabric.utilities.seed:Global seed set to 3020616


### Configurations

Hyperparameters and configurations for:
- Data (channel, .etc)
- Training process (Trainer, .etc)
- Model architecture (input/output dimension, .etc)

In [2]:
# configuration dictionary
cf = {}
cf["wandb"]    = True
cf["project"]  = "g_4vec_2pcqnn"

# data infotmation
cf["num_events"]    = "5000"
cf["sig_channel"]   = "ZprimeToZhToZinvhbb"
cf["bkg_channel"]   = "QCD_HT2000toInf"
cf["jet_type"]      = "fatjet"
cf["cut"]           = f"({cf['jet_type']}_pt>=500)&({cf['jet_type']}_pt<=1500)"
cf["subjet_radius"] = 0.1

# traning configuration
cf["num_train_ratio"] = 0.4
cf["num_test_ratio"]  = 0.4
cf["batch_size"]      = 64
cf["num_workers"]     = 12
cf["max_epochs"]      = 10
cf["accelerator"]     = "gpu"
cf["num_data"]        = 1000
cf["fast_dev_run"]    = False

# model hyperparameters
cf["loss_function"]  = nn.BCEWithLogitsLoss()
cf["optimizer"]      = optim.AdamW
cf["learning_rate"]  = 1E-3
cf["gnn_hidden"]     = 16
cf["gnn_num_layers"] = 3
cf["mlp_hidden"]     = 64
cf["mlp_num_layers"] = 2

### Data Module

In this project, we train with data containing only the four momentum of particles. In order to reduce the size of the data (due to the long training time for quantum machine learning), we reduce the size of data by `fastjet` package by clustering particles again by `anti-kt algorithm` with smaller radius.

The detail (source code) for creating fastjet reclustering events is in the `d_hep_data` file.

To test the power of QML for learning space structure of data (geometric angles, e.g. $p_t$, $\eta$, $\phi$), we will use four momentum only (or z-boosted invariant variables $p_t$, $\eta$, $\phi$).

In [3]:
class JetDataModule(pl.LightningDataModule):
    def __init__(self):
        super().__init__()
        # jet events
        self.sig_data_list = self._create_data_list(cf["sig_channel"], 1)
        self.bkg_data_list = self._create_data_list(cf["bkg_channel"], 0)

        # count the number of training, validation, and testing
        assert len(self.sig_data_list) >= cf["num_data"], f"sig data not enough: {len(self.sig_data_list)} < {cf['num_data']}"
        assert len(self.bkg_data_list) >= cf["num_data"], f"bkg data not enough: {len(self.bkg_data_list)} < {cf['num_data']}"
        num_train = int(cf["num_data"] * cf["num_train_ratio"])
        num_test  = int(cf["num_data"] * cf["num_test_ratio"])
        num_valid = cf["num_data"] - num_train - num_test
        print(f"DataLog: {cf['sig_channel']} has {len(self.sig_data_list)} events and {cf['bkg_channel']} has {len(self.bkg_data_list)} events.")
        print(f"Choose num_data for each channel to be {cf['num_data']} | Each channel  has num_train = {num_train}, num_valid = {num_valid}, num_test = {num_test}")

        # prepare dataset for dataloader
        train_idx = num_train
        valid_idx = num_train + num_valid
        test_idx  = num_train + num_valid + num_test
        self.train_dataset = self.sig_data_list[:train_idx] + self.bkg_data_list[:train_idx]
        self.valid_dataset = self.sig_data_list[train_idx:valid_idx] + self.bkg_data_list[train_idx:valid_idx]
        self.test_dataset  = self.sig_data_list[valid_idx:test_idx] + self.bkg_data_list[valid_idx:test_idx]
    
    def _create_data_list(self, channel, y):
        # use fastjet to recluster jet events into subjet events
        jet_events     = JetEvents(channel, cf["num_events"], cf["jet_type"], cf["cut"])
        fastjet_events = jet_events.fastjet_events(R=cf["subjet_radius"])
        # list for saving pytorch_geometric "Data"
        data_list = []
        for e in fastjet_events:
            # create pytorch_geometric "Data" object
            x = torch.tensor([e["pt"], e["delta_eta"], e["delta_phi"]], dtype=torch.float32)
            x = torch.transpose(x, 0, 1)
            edge_index = list(product(range(len(e)), range(len(e))))
            edge_index = torch.transpose(torch.tensor(edge_index), 0, 1)
            x.requires_grad, edge_index.requires_grad = False, False
            data_list.append(Data(x=x, edge_index=edge_index, y=y))
        random.shuffle(data_list)
        return data_list

    def train_dataloader(self):
        return DataLoader(self.train_dataset, batch_size=cf["batch_size"], num_workers=cf["num_workers"],  shuffle=True)

    def val_dataloader(self):
        return DataLoader(self.valid_dataset, batch_size=cf["batch_size"], num_workers=cf["num_workers"])

    def test_dataloader(self):
        return DataLoader(self.test_dataset, batch_size=cf["batch_size"], num_workers=cf["num_workers"])

## Models

To compare classical GNN with quantum GNN, we use `GraphConv` and `MessagePassing` with `pennylane` for classical and quantum repectively.

- Why using `nn.ModuleList` instead of `nn.Sequential`?
Both `nn.ModuleList` and `nn.Sequential` trace the trainable parameters autometically. However, since we are using "gnn" layers, we need to feed into additional argument `edge_index`. In order to check whether we are using "gnn" layers or not, we use `isinstance` to check the class type (Since all PyTorch Geometric graph layer inherit the class `MessagePassing`). For detail, see [When should I use nn.ModuleList and when should I use nn.Sequential?](https://discuss.pytorch.org/t/when-should-i-use-nn-modulelist-and-when-should-i-use-nn-sequential/5463/3)

### Classical GNN Model

In [4]:
class GraphConvModel(nn.Module):
    def __init__(self, gnn_in, gnn_hidden, gnn_out, gnn_num_layers, mlp_hidden, mlp_num_layers):
        super().__init__()
        # graph neural network
        if gnn_num_layers == 1:
            gnn_layers = [geom_nn.GraphConv(in_channels=gnn_in, out_channels=gnn_out), nn.ReLU()]
        else:
            gnn_layers = [geom_nn.GraphConv(in_channels=gnn_in, out_channels=gnn_hidden), nn.ReLU()]
            for _ in range(gnn_num_layers-2):
                gnn_layers += [geom_nn.GraphConv(in_channels=gnn_hidden, out_channels=gnn_hidden), nn.ReLU()]
            gnn_layers += [geom_nn.GraphConv(in_channels=gnn_hidden, out_channels=gnn_out)]
        self.gnn_layers = nn.ModuleList(gnn_layers)

        # multi-layer perceptron
        if mlp_num_layers == 1:
            mlp_layers = [nn.Linear(gnn_out, 1), nn.ReLU()]
        else:
            mlp_layers = [nn.Linear(gnn_out, mlp_hidden), nn.ReLU()]
            for _ in range(mlp_num_layers-2):
                mlp_layers += [nn.Linear(mlp_hidden, mlp_hidden), nn.ReLU()]
            mlp_layers += [nn.Linear(mlp_hidden, 1)]
        self.mlp_layers = nn.Sequential(*mlp_layers)
        
    def forward(self, x, edge_index, batch):
        # gnn message passing
        for layer in self.gnn_layers:
            if isinstance(layer, geom_nn.MessagePassing):
                x = layer(x, edge_index)
            else:
                x = layer(x)
        
        # gnn graph aggregation and mlp
        x = geom_nn.global_mean_pool(x, batch)
        x = self.mlp_layers(x)
        return x

### Lightning Module

Most of the hyperparameters are defined at `cf` configuration dictionary.

Note that when using `nn.BCEWithLogitsLoss`, the first argument should not be paased to `sigmoid`.

In [5]:
class LitModel(L.LightningModule):
    def __init__(self, model):
        super().__init__()
        self.save_hyperparameters(ignore=['model'])
        self.model = model
        self.loss_function = cf["loss_function"]

    def forward(self, data):
        # predict y
        x, edge_index, batch = data.x, data.edge_index, data.batch
        x = self.model(x, edge_index, batch)
        x = x.squeeze(dim=-1)

        # calculate loss and accuracy
        y_pred = (x>0).float()
        y_true = data.y.float()
        loss   = self.loss_function(x, y_true)
        acc    = (y_pred == data.y).float().mean()
        return loss, acc

    def configure_optimizers(self):
        optimizer = cf["optimizer"](self.parameters(), lr=cf["learning_rate"])
        return optimizer

    def training_step(self, data, batch_idx):
        loss, acc = self.forward(data)
        self.log("train_loss", loss, on_epoch=True, batch_size=len(data.x))
        self.log("train_acc", acc, on_epoch=True, batch_size=len(data.x))
        return loss

    def validation_step(self, data, batch_idx):
        loss, acc = self.forward(data)
        self.log("train_loss", loss, on_epoch=True, batch_size=len(data.x))
        self.log("train_acc", acc, on_epoch=True, batch_size=len(data.x))
        return loss

    def test_step(self, data, batch_idx):
        _, acc = self.forward(data)
        self.log("test_acc", acc, on_epoch=True, batch_size=len(data.x))

## Train/Test the Model

### Training procedure

In [6]:
def train(model, data_module, name_prefix="", name_suffix=""):
    # setup id and path for saving
    name     = f"{name_prefix}_{model.__class__.__name__}_{name_suffix}"
    save_dir = f"./result/"
    if not os.path.isdir(save_dir):
        os.makedirs(save_dir)
    
    # wandb logger setup
    if cf["wandb"]:
        project     = cf['project']
        group       = f"{cf['num_data']}_{cf['sig_channel']}_{cf['bkg_channel']}_{cf['jet_type']}_{cf['cut']}"
        job_type    = model.__class__.__name__
        wandb_logger = WandbLogger(project=project, group=group, job_type=job_type, name=name, id=name, save_dir=save_dir)
        wandb_logger.experiment.config.update(cf)
        wandb_logger.watch(model, log="all")
    
    # start lightning training
    logger   = wandb_logger if cf["wandb"] else None
    trainer  = L.Trainer(logger=logger, accelerator=cf["accelerator"], max_epochs=cf["max_epochs"], fast_dev_run=cf["fast_dev_run"])
    litmodel = LitModel(model)
    trainer.fit(litmodel, datamodule=data_module)
    trainer.test(litmodel, datamodule=data_module)

    # finish wandb monitoring
    if cf["wandb"]:
        wandb.finish()

### Load data and set up sub-config dictionaries

In [7]:
# load data
data_module = JetDataModule()
input_dim   = data_module.train_dataset[0].x.shape[1]

# classical GrapgConv model
cf_graph_conv = {
    "gnn_in"        : input_dim, 
    "gnn_hidden"    : cf["gnn_hidden"], 
    "gnn_out"       : cf["gnn_hidden"], 
    "gnn_num_layers": cf["gnn_num_layers"],
    "mlp_hidden"    : cf["mlp_hidden"], 
    "mlp_num_layers": cf["mlp_num_layers"],
}

DataLog: Successfully create ZprimeToZhToZinvhbb with 2980 events.
DataLog: Finish reclustering ZprimeToZhToZinvhbb with anti-kt algorithm.
DataLog: Successfully create QCD_HT2000toInf with 4917 events.
DataLog: Finish reclustering QCD_HT2000toInf with anti-kt algorithm.
DataLog: ZprimeToZhToZinvhbb has 2980 events and QCD_HT2000toInf has 4917 events.
Choose num_data for each channel to be 1000 | Each channel  has num_train = 400, num_valid = 200, num_test = 400


### Start training each model

In [8]:
train(GraphConvModel(**cf_graph_conv), data_module=data_module)

[34m[1mwandb[0m: logging graph, to disable use `wandb.watch(log_graph=False)`
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:lightning.pytorch.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO: 
  | Name          | Type              | Params
----------------------------------------------------
0 | model         | GraphConvModel    | 2.3 K 
1 | loss_function | BCEWithLogitsLoss | 0     
----------------------------------------------------
2.3 K     Trainable params
0         Non-trainable params
2.3 K     Total params
0.009     Total estimated model params size (MB)
INFO:lightning.pytorch.callbacks.model_summary:
  | Name          | Type        

                                                                           

  rank_zero_warn(


Epoch 9: 100%|██████████| 13/13 [00:00<00:00, 14.08it/s, v_num=del_]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=10` reached.


Epoch 9: 100%|██████████| 13/13 [00:00<00:00, 14.00it/s, v_num=del_]


INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:lightning.pytorch.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Testing DataLoader 0: 100%|██████████| 13/13 [00:00<00:00, 202.73it/s]


0,1
epoch,▁▁▂▂▂▂▃▃▃▄▄▅▅▅▅▆▆▆▇▇▇▇█
test_acc,▁
train_acc,▄▁▁▁▄▁▁▇▄█
train_acc_epoch,▃▂▁▄▄▅▄▆▇█
train_acc_step,▁█
train_loss,█▅▂▂▁▁▁▁▁▁
train_loss_epoch,█▃▂▁▁▁▁▁▁▁
train_loss_step,█▁
trainer/global_step,▁▁▂▂▃▃▃▃▃▄▄▅▅▆▆▆▆▆▇▇███

0,1
epoch,10.0
test_acc,0.62911
train_acc,0.67743
train_acc_epoch,0.62894
train_acc_step,0.8125
train_loss,37.93479
train_loss_epoch,44.50123
train_loss_step,18.68076
trainer/global_step,130.0
