Using Machine Learning to Simulate Distributions of Observables at the Large Hadron Collider
arXiv link: https://arxiv.org/abs/2310.08994
- MlHEPsim
- Table of contents:
- Setup
- Configuration
- Running
- Documentation
Clone repository:
git clone https://github.com/j-gavran/MlHEPsim.git
Make virtual environment:
pip install pip --upgrade
pip install virtualenv
python3 -m venv ml_hep_sim_env
Activate env:
source ml_hep_sim_env/bin/activate
Install dependencies:
pip install -r requirements.txt
source setup.sh
You need to change VENV_PATH to the location of your virtual environment.
MLflow runs are saved in mlruns/ and can be accessed with:
mlflow ui
All model parameters can be set inside the YAML files in ml_hep_sim/conf/ using the Hydra library.
- Directly from model files
- Using a pipeline
- ml_hep_sim/data_utils/toy_datasets.py, collection of 2D point datasets for testing and debugging:
TOY_DATASETS = ["swissroll", "circles", "rings", "moons", "4gaussians", "8gaussians", "pinwheel", "2spirals", "checkerboard", "line", "cos", "fmf_normal", "fmf_uniform", "einstein"]
- ml_hep_sim/data_utils/mnist, see preprocess_mnist.py for more info.
- ml_hep_sim/data_utils/higgs, download and preprocess Higgs dataset:
python3 ml_hep_sim/data_utils/higgs/process_higgs_dataset.py
- ml_hep_sim/data_utils/dataset_utils.py, rescale features using one of the following methods:
  - normal: zero mean and unit variance
  - robust: removes the median and scales the data according to the quantile range
  - sigmoid: [0, 1] range
  - tanh: [-1, 1] range
  - logit: [0, 1] -> [-inf, inf] ranges
  - logit_normal: [0, 1] -> [-inf, inf] -> normal ranges
  - Gauss scaler: https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/discussion/44629#250927
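The rescaling methods listed above can be sketched in a few lines of plain Python. This is only an illustration of the transformations, not the code in dataset_utils.py: the function names are hypothetical, the "normal" method is assumed to use the population standard deviation, and the "robust" method is assumed to use the interquartile range.

```python
import math
import statistics


def normal_scale(xs):
    """Shift to zero mean and scale to unit variance (population std assumed)."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]


def robust_scale(xs):
    """Remove the median and divide by the interquartile range (assumed range)."""
    q1, med, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return [(x - med) / (q3 - q1) for x in xs]


def sigmoid(x):
    """Map a real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    """Inverse of the sigmoid: map (0, 1) onto the real line."""
    return math.log(p / (1.0 - p))


def logit_normal_scale(ps):
    """Apply the logit first, then standardize the result."""
    return normal_scale([logit(p) for p in ps])


features = [1.0, 2.0, 3.0, 4.0, 5.0]
print(robust_scale(features))  # [-1.0, -0.5, 0.0, 0.5, 1.0]
```

The tanh method works the same way as the sigmoid one, with math.tanh mapping values into the (-1, 1) range.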
- Autoencoder
- Multi layer perceptron
- Residual networks
- U-net
- Classifiers
- Binary classifier
- Multi label class classifier
- N-dim tests
  - Classification (train a classifier)
- 1-dim tests
  - Two sample $\chi^2$ test
  - Two sample Kolmogorov-Smirnov test
- pyhf upper limits and specs
- Wasserstein distance
- Maximum mean discrepancy
- Statistics plots
- Two sample plot
- N sample plot
- HEP plot (standard stacked plot)
- Style (colors, font sizes, etc.)
- Matplotlib setup (latex)
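As a rough illustration of two of the 1-dim comparisons listed above, the two-sample Kolmogorov-Smirnov statistic and the Wasserstein distance can be written in plain Python. This is a minimal sketch, not the repository's implementation: the function names are hypothetical, and the Wasserstein version assumes two samples of equal size.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max |F_a(x) - F_b(x)|."""
    a, b = sorted(a), sorted(b)
    # Empirical CDFs only change at sample points, so checking those is enough.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def wasserstein_1d(a, b):
    """1-Wasserstein distance between two equally sized 1D samples."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equally sized samples")
    # For equal sizes the optimal transport plan pairs the sorted values.
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


reference = [0.0, 1.0, 2.0]
generated = [1.0, 2.0, 3.0]
print(wasserstein_1d(reference, generated))  # 1.0
```

Both quantities are zero for identical samples and grow as the generated distribution drifts away from the reference one.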
Roughly, the ML pipeline consists of 3 stages built out of blocks holding intermediate results. The blocks are connected to each other as a directed acyclic graph, which can be thought of as a composition of functions. The graph can be visualized as a tree with the draw_pipeline_tree() method. The pipeline is run by calling the fit() method after composing all blocks with the compose() method. The blocks are:
- ConfigBuilderBlock
- ModelBuilderBlock
- DatasetBuilderBlock
- ModelTrainerBlock
- ModelLoaderBlock
- DataGeneratorBlock
- GeneratedDataVerifierBlock
- ClassifierRunnerBlock
- ReferenceDataLoaderBlock
- DistanceMetricRunnerBlock
- PCARunnerBlock
- StatTestRunnerBlock
- CouplingModelTestingBlock
- MADEMOGModelTestingBlock
- ScalingTestBlock
Post stage 3, analysis specific:
- VariableExtractBlock
- RatioHighestValuesCutBlock
- CutBlock
- CutByIndexBlock
- RedoRescaleDataBlock
- GCBlock
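The idea of blocks connected as a directed acyclic graph, composed and then run with fit(), can be illustrated with a toy sketch. ToyBlock and ToyPipeline are hypothetical stand-ins written for this example; they do not reproduce the repository's Pipeline or block classes, and the check performed in compose() is an assumption about what a sensible implementation might do.

```python
class ToyBlock:
    """Hypothetical block: wraps a function and remembers its parent blocks."""

    def __init__(self, name, func):
        self.name = name
        self.func = func
        self.parents = []
        self.result = None

    def __call__(self, *parents):
        # Calling a block with other blocks records the incoming graph edges.
        self.parents = list(parents)
        return self

    def run(self):
        # A block consumes the intermediate results held by its parents.
        self.result = self.func(*(p.result for p in self.parents))


class ToyPipeline:
    """Hypothetical pipeline that runs composed blocks in the given order."""

    def __init__(self):
        self.blocks = []

    def compose(self, *blocks):
        seen = set()
        for block in blocks:
            # Every parent must come earlier, so the order is topological.
            missing = [p.name for p in block.parents if id(p) not in seen]
            if missing:
                raise ValueError(f"{block.name} is composed before {missing}")
            seen.add(id(block))
        self.blocks = list(blocks)
        return self

    def fit(self):
        for block in self.blocks:
            block.run()
        return self


x1 = ToyBlock("config", lambda: {"scale": 2})()
x2 = ToyBlock("model", lambda cfg: lambda v: cfg["scale"] * v)(x1)
x3 = ToyBlock("dataset", lambda cfg: [1, 2, 3])(x1)
x4 = ToyBlock("trainer", lambda model, data: [model(v) for v in data])(x2, x3)

pipeline = ToyPipeline().compose(x1, x2, x3, x4).fit()
print(x4.result)  # [2, 4, 6]
```

Here the trainer block depends on both the model and dataset blocks, which in turn depend on the config block, so the four blocks form a small diamond-shaped graph.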
Training a HIGGS classifier with default parameters.
from ml_hep_sim.pipeline.blocks import (
    ConfigBuilderBlock,
    DatasetBuilderBlock,
    ModelBuilderBlock,
    ModelTrainerBlock,
)
from ml_hep_sim.pipeline.distributed_pipes import Pipeline

class_train_pipeline = Pipeline(pipeline_name="classifier_train_pipeline", pipeline_path="ml_pipeline/")

# each block is called with the blocks it depends on
x1 = ConfigBuilderBlock(config_path="../conf", config_name="classifier_config", model_name="BinaryClassifier")()
x2 = ModelBuilderBlock(model_type="other")(x1)
x3 = DatasetBuilderBlock()(x1)
x4 = ModelTrainerBlock()(x2, x3)

# compose all blocks, then run the pipeline and save it
class_train_pipeline.compose(x1, x2, x3, x4)
class_train_pipeline.fit().save()
See prebuilt directory for more info.
CP = ClassifierPipeline(run_name, override, pipeline_path="ml_pipeline/test/") # all the magic is in the override argument that changes the predefined hydra config in conf/ directory
CP.build_train_pipeline()
CP.fit(force=True) # if model with this name already exists force training again
CP.build_inference_pipeline(test_dataset) # test classification, e.g. "higgs_bkg"
res = CP.infer(return_results=True) # returns classification scores for test_dataset
class_train_pipeline, class_infer_pipeline = CP.pipeline["train_pipeline"], CP.pipeline["inference_pipeline"]
The only changes are in model_name, which can be any implemented flow model, and in N_gen (the number of generated events).
FP = FlowPipeline(run_name, model_name, override, pipeline_path="ml_pipeline/test/")
FP.build_train_pipeline()
FP.fit(force=True)
FP.build_inference_pipeline(N_gen=10 ** 5) # inference == generation
res = FP.infer(return_results=True) # returns flow generated results
flow_train_pipeline, flow_infer_pipeline = FP.pipeline["train_pipeline"], FP.pipeline["inference_pipeline"]
Most of the script files are also available as Jupyter notebooks. The notebooks are used for analysis and plotting. The .py files are used for running the analysis in a pipeline. Some quick generator tests are given in generators.py.
Scripts should generally be run in the following order:
- Generator pipeline
- Cut pipeline
- Histogram pipeline
- Upper limit pipeline
- Pull plots (example pipeline tree)
- CLs pipeline
- Spurious signal pipeline
- utils.py
- SigBkgBlock
- hists_pipeline.py
- MakeHistsFromSamples
- MakeHistsFromSamplesLumi
- ul_pipeline.py
- UpperLimitScannerBlock
- PullBlock
- cls_pipeline.py
- CLsBlock
- CLsBlockResultsParser
- spur_pipeline.py
- SpurBlock
- SpurBlockResultsParser