MlHEPsim

Using Machine Learning to Simulate Distributions of Observables at the Large Hadron Collider

Arxiv link: https://arxiv.org/abs/2310.08994

Setup

Clone repository:

git clone https://github.com/j-gavran/MlHEPsim.git

Make virtual environment:

pip install pip --upgrade
pip install virtualenv
python3 -m venv ml_hep_sim_env

Activate env:

source ml_hep_sim_env/bin/activate

Install dependencies:

pip install -r requirements.txt

Setup script

source setup.sh

Before sourcing the script, change VENV_PATH in setup.sh to the location of your virtual environment.

Logging

MLflow runs are saved in mlruns/ and can be accessed with:

mlflow ui

Configuration

All model parameters can be set inside YAML files in ml_hep_sim/conf/ using the Hydra library.

Running

Directly from model files
Using a pipeline

Documentation

Datasets

ml_hep_sim/data_utils/toy_datasets.py, collection of 2D point datasets for testing and debugging:

TOY_DATASETS = ["swissroll", "circles", "rings", "moons", "4gaussians", "8gaussians", "pinwheel", "2spirals", "checkerboard", "line", "cos", "fmf_normal", "fmf_uniform", "einstein"]

ml_hep_sim/data_utils/mnist, see preprocess_mnist.py for more info.
ml_hep_sim/data_utils/higgs, download and preprocess Higgs dataset:
```
python3 ml_hep_sim/data_utils/higgs/process_higgs_dataset.py
```

Feature rescaling

ml_hep_sim/data_utils/dataset_utils.py, rescale features using one of the following methods:

- normal: zero mean and unit variance
- robust: removes the median and scales the data according to the quantile range
- sigmoid: maps features to the [0, 1] range
- tanh: maps features to the [-1, 1] range
- logit: maps the [0, 1] range to [-inf, inf]
- logit_normal: logit followed by normal, [0, 1] -> [-inf, inf] -> zero mean and unit variance
- Gauss scaler: https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/discussion/44629#250927
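
As a rough illustration of the normal, robust and logit methods listed above, here is a minimal standard-library sketch. It is not the code in dataset_utils.py: the function names, the use of the interquartile range for the robust scaler, and the eps clipping in the logit transform are all assumptions made for this example.

```python
import math
import statistics


def normal_scale(values):
    """Shift to zero mean and scale to unit (population) variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def robust_scale(values):
    """Subtract the median and divide by the interquartile range (assumed quantile range)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


def logit_scale(values, eps=1e-6):
    """Map values in [0, 1] to the real line; eps (an assumption) keeps 0 and 1 finite."""
    clipped = [min(max(v, eps), 1.0 - eps) for v in values]
    return [math.log(v / (1.0 - v)) for v in clipped]


x = [1.0, 2.0, 3.0, 4.0, 5.0]
print(normal_scale(x))
print(robust_scale(x))
print(logit_scale([0.1, 0.5, 0.9]))
```

The robust variant is less sensitive to outliers than the normal one because the median and quantiles barely move when a few extreme values are added.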

Neural networks

Autoencoder
Multi layer perceptron
Residual networks
U-net
Classifiers
- Binary classifier
- Multi label class classifier

VAEs

Normalizing flows

Statistics

Tests

N-dim tests
- Classification (train a classifier)
1-dim tests
- Two sample $\chi^2$ test
- Kolmogorov-Smirnov test
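
For orientation, the two 1-dim test statistics listed above can be sketched in plain Python as follows. This is only a sketch of the textbook definitions, not the repository's implementation: the function names are hypothetical, the histograms are assumed to be unweighted with identical binning, and no p-values are computed.

```python
import bisect


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def chi2_statistic(counts_a, counts_b):
    """Two-sample chi-square statistic for histograms with equal binning."""
    n_a, n_b = sum(counts_a), sum(counts_b)
    # Scale factors account for different total numbers of events.
    k_a, k_b = (n_b / n_a) ** 0.5, (n_a / n_b) ** 0.5
    total = 0.0
    for r, s in zip(counts_a, counts_b):
        if r + s > 0:  # bins empty in both histograms carry no information
            total += (k_a * r - k_b * s) ** 2 / (r + s)
    return total


print(ks_statistic([1, 2, 3], [4, 5, 6]))
print(chi2_statistic([10, 20, 30], [12, 18, 30]))
```

Both statistics are zero for identical inputs and grow as the two distributions separate.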

Other

pyhf upper limits and specs
Wasserstein distance
Maximum mean discrepancy
Statistics plots
- Two sample plot
- N sample plot
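
The two distance measures in the list above, Wasserstein distance and maximum mean discrepancy, can be illustrated for 1D samples with a short sketch. The function names, the equal-sample-size restriction, the Gaussian kernel and its bandwidth are assumptions chosen for brevity and may differ from what the repository uses.

```python
import math


def wasserstein_1d(sample_a, sample_b):
    """1D Wasserstein-1 distance for samples of equal size."""
    if len(sample_a) != len(sample_b):
        raise ValueError("this sketch assumes equal sample sizes")
    # In 1D the optimal transport plan matches sorted points pairwise.
    pairs = zip(sorted(sample_a), sorted(sample_b))
    return sum(abs(a - b) for a, b in pairs) / len(sample_a)


def mmd2_biased(sample_a, sample_b, bandwidth=1.0):
    """Biased estimate of the squared MMD with a Gaussian kernel."""

    def mean_kernel(xs, ys):
        total = sum(
            math.exp(-((x - y) ** 2) / (2.0 * bandwidth**2)) for x in xs for y in ys
        )
        return total / (len(xs) * len(ys))

    return (
        mean_kernel(sample_a, sample_a)
        + mean_kernel(sample_b, sample_b)
        - 2.0 * mean_kernel(sample_a, sample_b)
    )


print(wasserstein_1d([0.0, 1.0, 2.0], [1.0, 2.0, 3.0]))
print(mmd2_biased([0.0, 1.0, 2.0], [1.0, 2.0, 3.0]))
```

The double loop in mmd2_biased scales quadratically with sample size, so this form is only practical for small samples.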

Plotting

HEP plot (standard stacked plot)
Style (colors, font sizes, etc.)
Matplotlib setup (latex)

Using pipeline

Roughly, the ML pipeline consists of 3 stages built out of blocks that hold intermediate results. The blocks are connected to each other as a directed acyclic graph, which can be thought of as a composition of functions. The graph can be visualized as a tree with the draw_pipeline_tree() method. The pipeline is run by calling the fit() method after composing all blocks with the compose() method. The blocks are:

Stage 1 blocks: Model building, training and saving

ConfigBuilderBlock
ModelBuilderBlock
DatasetBuilderBlock
ModelTrainerBlock

Stage 2 blocks: Model loading, generation and verification

ModelLoaderBlock
DataGeneratorBlock
GeneratedDataVerifierBlock
ClassifierRunnerBlock

Stage 3 blocks: Testing

ReferenceDataLoaderBlock
DistanceMetricRunnerBlock
PCARunnerBlock
StatTestRunnerBlock
CouplingModelTestingBlock
MADEMOGModelTestingBlock
ScalingTestBlock

Other blocks

Post stage 3, analysis specific:

VariableExtractBlock
RatioHighestValuesCutBlock
CutBlock
CutByIndexBlock
RedoRescaleDataBlock
GCBlock
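
To make the idea of blocks as a composition of functions more concrete, here is a toy sketch of the pattern. ToyBlock and ToyPipeline are hypothetical stand-ins written for this example and are not the classes in ml_hep_sim.pipeline. The sketch also assumes blocks are passed to compose() in dependency order.

```python
class ToyBlock:
    """Hypothetical stand-in for a block: wraps a function and remembers its inputs."""

    def __init__(self, name, func):
        self.name = name
        self.func = func
        self.parents = []
        self.result = None

    def __call__(self, *parents):
        # Calling a block with other blocks records the edges of the graph.
        self.parents = list(parents)
        return self

    def run(self):
        # Feed the stored results of the parent blocks into this block.
        inputs = [parent.result for parent in self.parents]
        self.result = self.func(*inputs)


class ToyPipeline:
    """Hypothetical stand-in for a pipeline: runs blocks in the order given."""

    def __init__(self):
        self.blocks = []

    def compose(self, *blocks):
        self.blocks = list(blocks)
        return self

    def fit(self):
        for block in self.blocks:
            block.run()
        return self


config = ToyBlock("config", lambda: {"scale": 2})()
data = ToyBlock("data", lambda cfg: [1, 2, 3])(config)
model = ToyBlock("model", lambda cfg: (lambda x: cfg["scale"] * x))(config)
train = ToyBlock("train", lambda m, d: [m(x) for x in d])(model, data)

ToyPipeline().compose(config, data, model, train).fit()
print(train.result)  # [2, 4, 6]
```

Each block keeps its own result, so downstream blocks can reuse intermediate outputs without recomputing them.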

Example: building a pipeline

Training a HIGGS classifier with default parameters.

from ml_hep_sim.pipeline.blocks import (
    ConfigBuilderBlock,
    DatasetBuilderBlock,
    ModelBuilderBlock,
    ModelTrainerBlock,
)
from ml_hep_sim.pipeline.distributed_pipes import Pipeline


class_train_pipeline = Pipeline(pipeline_name="classifier_train_pipeline", pipeline_path="ml_pipeline/")

x1 = ConfigBuilderBlock(config_path="../conf", config_name="classifier_config", model_name="BinaryClassifier")()
x2 = ModelBuilderBlock(model_type="other")(x1)
x3 = DatasetBuilderBlock()(x1)
x4 = ModelTrainerBlock()(x2, x3)

class_train_pipeline.compose(x1, x2, x3, x4)
class_train_pipeline.fit().save()

Example: running a prebuilt classifier pipeline

See the prebuilt directory for more info.

CP = ClassifierPipeline(run_name, override, pipeline_path="ml_pipeline/test/") # all the magic is in the override argument that changes the predefined hydra config in conf/ directory

CP.build_train_pipeline()
CP.fit(force=True) # if model with this name already exists force training again

CP.build_inference_pipeline(test_dataset) # test classification, e.g. "higgs_bkg"

res = CP.infer(return_results=True) # returns classification scores for test_dataset

class_train_pipeline, class_infer_pipeline = CP.pipeline["train_pipeline"], CP.pipeline["inference_pipeline"]

Example: running a prebuilt flow pipeline

The only changes from the classifier pipeline are model_name, which can be any implemented flow model, and N_gen (the number of generated events).

FP = FlowPipeline(run_name, model_name, override, pipeline_path="ml_pipeline/test/")

FP.build_train_pipeline()
FP.fit(force=True)

FP.build_inference_pipeline(N_gen=10 ** 5) # inference == generation

res = FP.infer(return_results=True) # returns flow generated results

flow_train_pipeline, flow_infer_pipeline = FP.pipeline["train_pipeline"], FP.pipeline["inference_pipeline"]

Analysis setup

Most of the script files are also available as Jupyter notebooks. The notebooks are used for analysis and plotting, while the .py files are used for running the analysis in a pipeline. Some quick generator tests are given in generators.py.

Workflow

Scripts should generally be run in the following order:

Generator pipeline
Cut pipeline
Histogram pipeline
Upper limit pipeline
- Pull plots (example pipeline tree)
CLs pipeline
Spurious signal pipeline

List of analysis blocks

utils.py
- SigBkgBlock
hists_pipeline.py
- MakeHistsFromSamples
- MakeHistsFromSamplesLumi
ul_pipeline.py
- UpperLimitScannerBlock
- PullBlock
cls_pipeline.py
- CLsBlock
- CLsBlockResultsParser
spur_pipeline.py
- SpurBlock
- SpurBlockResultsParser
