Repository for performing hypothesis tests using the outputs of various Machine Learning methods, associated with the paper 'A simple guide from Machine Learning outputs to statistical criteria' arXiv:2203.03669 (2022). Authors: Charanjit K. Khosa, Veronica Sanz and Michael Soughton.
Currently there is much interest in using Machine Learning methods for detecting and identifying signals within high-energy physics (HEP). However, little has so far been done towards the seemingly straightforward task of incorporating the results from these methods into statistical tests such as those used for the discovery of new particles. Our paper demonstrates how the outputs of supervised classifiers or unsupervised anomaly detection methods can be used in Log-Likelihood Ratio hypothesis tests and in obtaining separation and discovery significances.
We train supervised Machine Learning methods (CNN and DNN classifiers) on types of signals and backgrounds that may be found within the LHC. The CNN was trained on images (in
We also train an unsupervised VAE on the same Standard Model background as the DNN classifier, but now it has no knowledge of what the SMEFT signal looks like. With the goal of using the outputs of the VAE to obtain a significance of finding SMEFT signals within the data, we calculate the Reconstruction Error
Events are generated through MadGraph, along with Pythia and Delphes for showering and detector effects.
The code is run in Python 3.8.5. The following packages are required:
numpy=1.19.1
scipy=1.5.0
matplotlib=3.3.1
seaborn=0.11.0
keras=2.4.3
scikit-learn=0.23.2
scikit-image=0.17.2
These can be installed manually or via the conda yaml file using
conda env create --name <env name> -f environment.yml
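After creating the environment, it can be useful to confirm that the installed packages match the pinned versions. The following is a minimal sketch of such a check and is not part of the repository: the `REQUIRED` dictionary and the `check_versions` helper are hypothetical names introduced here, and the pins are copied from the list above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the package list in this README (hypothetical helper data).
REQUIRED = {
    "numpy": "1.19.1",
    "scipy": "1.5.0",
    "matplotlib": "3.3.1",
    "seaborn": "0.11.0",
    "keras": "2.4.3",
    "scikit-learn": "0.23.2",
    "scikit-image": "0.17.2",
}


def check_versions(required):
    """Return {package: (wanted, found)} for every pin that is not met.

    `found` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in required.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


# Report every package whose installed version differs from the pin.
for name, (wanted, found) in check_versions(REQUIRED).items():
    print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

If nothing is printed, every pinned version is satisfied. Newer versions may well work too; the check only flags differences from the versions listed here.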
The code is split into three main directories.
The first, jet-cnn, contains code to train the CNN classifier on QCD and top jet images, find the probabilities of new images being either a top or QCD jet, and then perform a simple hypothesis test on data which contains a small number of top jets with a dominant QCD background, against data comprised of only a QCD background. The hypothesis test is performed using a number of toy experiments to find the significance levels/separation significance. This code also supports bootstrapping to gauge the variance in outputs arising from training.
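The toy-experiment hypothesis test described above can be illustrated with a short, self-contained sketch. This is not the code in jet-cnn: the classifier scores are synthetic stand-ins drawn from Beta distributions, and the binning, signal fraction, toy sizes and the way the significance is extracted (the p-value of the median signal-plus-background toy under the background-only LLR distribution, converted to a Gaussian Z) are all assumptions made for illustration.

```python
import numpy as np
from scipy.stats import norm


def binned_pdf(scores, bins):
    """Histogram classifier scores into a normalised, strictly positive PDF."""
    counts, _ = np.histogram(scores, bins=bins)
    counts = counts + 1e-3  # small floor so empty bins do not give log(0)
    return counts / counts.sum()


def toy_llr(scores, pdf_h1, pdf_h0, bins):
    """Sum of per-event log-likelihood ratios for one toy experiment."""
    idx = np.clip(np.digitize(scores, bins) - 1, 0, len(pdf_h0) - 1)
    return np.sum(np.log(pdf_h1[idx] / pdf_h0[idx]))


def separation_significance(llr_h0, llr_h1):
    """Z-score of the median H1 toy under the empirical H0 LLR distribution."""
    p_value = np.mean(llr_h0 >= np.median(llr_h1))
    p_value = max(p_value, 1.0 / len(llr_h0))  # cannot resolve below 1/n_toys
    return norm.isf(p_value)


rng = np.random.default_rng(0)
# Synthetic stand-ins for classifier outputs P(signal | event) on labelled sets.
bkg_scores = rng.beta(2, 5, size=20000)
sig_scores = rng.beta(5, 2, size=20000)

bins = np.linspace(0.0, 1.0, 26)
signal_fraction = 0.1  # assumed signal fraction under H1
pdf_b = binned_pdf(bkg_scores, bins)
pdf_s = binned_pdf(sig_scores, bins)
pdf_h0 = pdf_b  # H0: background only
pdf_h1 = signal_fraction * pdf_s + (1 - signal_fraction) * pdf_b  # H1: mixture

n_events, n_toys = 100, 2000
n_sig = int(signal_fraction * n_events)
llr_h0 = np.empty(n_toys)
llr_h1 = np.empty(n_toys)
for t in range(n_toys):
    # One toy per hypothesis: resample events and sum their log-likelihood ratios.
    toy_b = rng.choice(bkg_scores, size=n_events)
    toy_sb = np.concatenate([
        rng.choice(sig_scores, size=n_sig),
        rng.choice(bkg_scores, size=n_events - n_sig),
    ])
    llr_h0[t] = toy_llr(toy_b, pdf_h1, pdf_h0, bins)
    llr_h1[t] = toy_llr(toy_sb, pdf_h1, pdf_h0, bins)

z = separation_significance(llr_h0, llr_h1)
print(f"separation significance ~ {z:.2f} sigma")
```

Because the p-value is estimated from a finite number of toys, the reported significance saturates at the value corresponding to 1/n_toys; more toys, or a fit to the LLR distributions, would be needed to resolve larger separations.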
The second, eft-dnn, does the same but with a DNN classifier using data of
The third, eft-vae, trains a VAE on only the
There is also a fourth directory, misc, which contains scripts that were used to produce plots for demonstration purposes but are not otherwise used.
For instructions on running the code see the respective directories.
Please cite the paper as follows in your publications if it helps your research:
@article{Khosa:2022vxb,
author = "Khosa, Charanjit Kaur and Sanz, Veronica and Soughton, Michael",
title = "{A simple guide from machine learning outputs to statistical criteria in particle physics}",
eprint = "2203.03669",
archivePrefix = "arXiv",
primaryClass = "hep-ph",
doi = "10.21468/SciPostPhysCore.5.4.050",
journal = "SciPost Phys. Core",
volume = "5",
pages = "050",
year = "2022"
}
VS is supported by PROMETEO/2021/083 from the Generalitat Valenciana, and by PID2020-113644GB-I00 from the Spanish Ministerio de Ciencia e Innovación. MS acknowledges support by the Data Intensive Science Center in the South East Physics Network (DISCnet), an extension of the STFC, under grant number ST/P006760/1.