
Using ML to Simulate Distributions of Observables at the LHC


j-gavran/MlHEPsim


MlHEPsim

Code style: black PyTorch Lightning

Using Machine Learning to Simulate Distributions of Observables at the Large Hadron Collider

Arxiv link: https://arxiv.org/abs/2310.08994


Setup

Clone repository:

git clone https://github.com/j-gavran/MlHEPsim.git

Make virtual environment:

pip install pip --upgrade
pip install virtualenv
python3 -m venv ml_hep_sim_env

Activate env:

source ml_hep_sim_env/bin/activate

Install dependencies:

pip install -r requirements.txt

Setup script

source setup.sh

You need to change VENV_PATH to the location of your own virtual environment.

Logging

MLflow runs are saved in mlruns/ and can be accessed with:

mlflow ui

Configuration

All model parameters can be set inside the YAML files in ml_hep_sim/conf/ using the Hydra library.

Running

  1. Directly from model files
  2. Using a pipeline

Documentation

Datasets

  • ml_hep_sim/data_utils/toy_datasets.py, collection of 2D point datasets for testing and debugging:
    TOY_DATASETS = ["swissroll", "circles", "rings", "moons", "4gaussians", "8gaussians", "pinwheel", "2spirals", "checkerboard", "line", "cos", "fmf_normal", "fmf_uniform", "einstein"]
    
  • ml_hep_sim/data_utils/mnist, see preprocess_mnist.py for more info.
  • ml_hep_sim/data_utils/higgs, download and preprocess Higgs dataset:
    python3 ml_hep_sim/data_utils/higgs/process_higgs_dataset.py
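
For orientation, a toy dataset such as "8gaussians" from the list above could be generated along the following lines. This is a minimal sketch and not the code in toy_datasets.py: the function name `sample_8gaussians` and the `radius`, `std` and `seed` parameters are assumptions made for illustration.

```python
import math
import random


def sample_8gaussians(n, radius=2.0, std=0.1, seed=0):
    """Draw n 2D points from 8 Gaussians whose centers lie on a circle."""
    rng = random.Random(seed)  # local generator, so results are reproducible
    centers = [
        (radius * math.cos(2 * math.pi * k / 8), radius * math.sin(2 * math.pi * k / 8))
        for k in range(8)
    ]
    points = []
    for _ in range(n):
        cx, cy = rng.choice(centers)  # pick one mixture component uniformly
        points.append((rng.gauss(cx, std), rng.gauss(cy, std)))
    return points


points = sample_8gaussians(1000)
```

Low-dimensional samples like these are convenient for debugging a generative model, because the learned density can be checked by eye on a scatter plot.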

Feature rescaling

  • ml_hep_sim/data_utils/dataset_utils.py, rescale features using one of the following methods:
    - normal: zero mean and unit variance
    - robust: removes the median and scales the data according to the quantile range
    - sigmoid: [0, 1] range
    - tanh: [-1, 1] range
    - logit: [0, 1] -> [-inf, inf] ranges
    - logit_normal: [0, 1] -> [-inf, inf] -> normal ranges
    - Gauss scaler: https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/discussion/44629#250927
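
As a rough illustration of what some of these transformations do, here is a minimal pure-Python sketch. It is not the code from dataset_utils.py: the function names (`standardize`, `robust_scale`, `logit`, `sigmoid`) and the clipping constant `eps` are assumptions made for this example, and the interquartile range is just one possible choice of quantile range.

```python
import math
import statistics


def standardize(xs):
    """'normal': shift to zero mean and scale to unit variance."""
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs)  # population standard deviation
    return [(x - mean) / std for x in xs]


def robust_scale(xs):
    """'robust': subtract the median and divide by the interquartile range."""
    median = statistics.median(xs)
    q1, _, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return [(x - median) / (q3 - q1) for x in xs]


def logit(p, eps=1e-6):
    """'logit': map (0, 1) to the real line; eps (assumed) keeps the log finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(x):
    """Inverse of logit: map the real line back to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


features = [1.0, 2.0, 3.0, 4.0, 10.0]
scaled = standardize(features)
robust = robust_scale(features)
```

Note how the outlier at 10.0 inflates the standard deviation used by `standardize`, while `robust_scale` depends only on the median and the quartiles.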
    

Neural networks

VAEs

Normalizing flows

Statistics

Tests

  1. N-dim tests
    • Classification (train a classifier)
  2. 1-dim tests
    • Two sample $\chi^2$ test
    • Kolmogorov-Smirnov test
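
The two 1-dim tests compare a reference sample with a generated sample one feature at a time. The sketch below shows how the two test statistics can be computed in plain Python. It is not the repository's implementation; the function names and the half-open binning convention are assumptions made for illustration.

```python
import bisect


def ks_two_sample_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max distance between empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for x in a_sorted + b_sorted:
        # Empirical CDF value = fraction of the sample that is <= x.
        cdf_a = bisect.bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect.bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def histogram(values, edges):
    """Count values in bins [edges[i], edges[i+1]); values outside are dropped."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        i = bisect.bisect_right(edges, v) - 1
        if 0 <= i < len(counts):
            counts[i] += 1
    return counts


def chi2_two_sample(counts_a, counts_b):
    """Two-sample chi-square statistic for two histograms with common binning."""
    n_a, n_b = sum(counts_a), sum(counts_b)
    chi2 = 0.0
    for h_a, h_b in zip(counts_a, counts_b):
        if h_a + h_b == 0:
            continue  # empty bins carry no information
        chi2 += (n_b * h_a - n_a * h_b) ** 2 / (h_a + h_b)
    return chi2 / (n_a * n_b)


reference = [0.1, 0.4, 0.6, 0.9]
generated = [0.2, 0.3, 0.7, 0.8]
edges = [0.0, 0.5, 1.0]
ks = ks_two_sample_statistic(reference, generated)
chi2 = chi2_two_sample(histogram(reference, edges), histogram(generated, edges))
```

Turning either statistic into a p-value requires the corresponding reference distribution, which is omitted here.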

Other

Plotting

  • HEP plot (standard stacked plot)
  • Style (colors, font sizes, etc.)
  • Matplotlib setup (latex)

Using pipeline

Roughly, the ML pipeline consists of 3 stages built out of blocks that hold intermediate results. The blocks are connected in a directed acyclic graph, which can be thought of as a composition of functions. The graph can be visualized as a tree with the draw_pipeline_tree() method. The pipeline is run by calling the fit() method after composing all blocks with the compose() method. The blocks are:

Stage 1 blocks: Model building, training and saving

  • ConfigBuilderBlock
  • ModelBuilderBlock
  • DatasetBuilderBlock
  • ModelTrainerBlock

Stage 2 blocks: Model loading, generation and verification

  • ModelLoaderBlock
  • DataGeneratorBlock
  • GeneratedDataVerifierBlock
  • ClassifierRunnerBlock

Stage 3 blocks: Testing

  • ReferenceDataLoaderBlock
  • DistanceMetricRunnerBlock
  • PCARunnerBlock
  • StatTestRunnerBlock
  • CouplingModelTestingBlock
  • MADEMOGModelTestingBlock
  • ScalingTestBlock

Other blocks

Post stage 3, analysis specific:

  • VariableExtractBlock
  • RatioHighestValuesCutBlock
  • CutBlock
  • CutByIndexBlock
  • RedoRescaleDataBlock
  • GCBlock
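
To make the idea of blocks connected in a directed acyclic graph concrete, here is a toy version of the pattern. The `Block` and `MiniPipeline` classes are hypothetical and written only for this illustration; they are not the classes in ml_hep_sim.pipeline, and they assume that blocks are passed to `compose()` in an order where parents come before children.

```python
class Block:
    """Hypothetical block: holds a result and remembers its parent blocks."""

    def __init__(self, name, func):
        self.name = name
        self.func = func
        self.parents = []
        self.result = None

    def __call__(self, *parents):
        # Calling a block on other blocks wires the edges of the graph.
        self.parents = list(parents)
        return self

    def run(self):
        # Apply the block's function to the results of its parents.
        self.result = self.func(*(p.result for p in self.parents))
        return self


class MiniPipeline:
    """Hypothetical pipeline: runs composed blocks in the given order."""

    def __init__(self):
        self.blocks = []

    def compose(self, *blocks):
        self.blocks = list(blocks)
        return self

    def fit(self):
        # Blocks are assumed to be passed in topological order,
        # so every parent has a result before its children run.
        for block in self.blocks:
            block.run()
        return self


x1 = Block("config", lambda: {"scale": 2})()
x2 = Block("model", lambda cfg: (lambda v: cfg["scale"] * v))(x1)
x3 = Block("data", lambda cfg: [1, 2, 3])(x1)
x4 = Block("train", lambda model, data: [model(v) for v in data])(x2, x3)

pipeline = MiniPipeline().compose(x1, x2, x3, x4).fit()
```

Here `x4` depends on both `x2` and `x3`, which in turn both depend on `x1`, so the result of `x4` is the composition of the functions along those edges.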

Example: building a pipeline

Training a HIGGS classifier with default parameters.

from ml_hep_sim.pipeline.blocks import (
    ConfigBuilderBlock,
    DatasetBuilderBlock,
    ModelBuilderBlock,
    ModelTrainerBlock,
)
from ml_hep_sim.pipeline.distributed_pipes import Pipeline


class_train_pipeline = Pipeline(pipeline_name="classifier_train_pipeline", pipeline_path="ml_pipeline/")

x1 = ConfigBuilderBlock(config_path="../conf", config_name="classifier_config", model_name="BinaryClassifier")()
x2 = ModelBuilderBlock(model_type="other")(x1)
x3 = DatasetBuilderBlock()(x1)
x4 = ModelTrainerBlock()(x2, x3)

class_train_pipeline.compose(x1, x2, x3, x4)
class_train_pipeline.fit().save()

Example: running a prebuilt classifier pipeline

See the prebuilt directory for more info.

# The override argument changes the predefined Hydra config in the conf/ directory.
CP = ClassifierPipeline(run_name, override, pipeline_path="ml_pipeline/test/")

CP.build_train_pipeline()
CP.fit(force=True)  # force retraining even if a model with this name already exists

CP.build_inference_pipeline(test_dataset)  # test_dataset is the dataset to classify, e.g. "higgs_bkg"

res = CP.infer(return_results=True)  # returns classification scores for test_dataset

class_train_pipeline, class_infer_pipeline = CP.pipeline["train_pipeline"], CP.pipeline["inference_pipeline"]

Example: running a prebuilt flow pipeline

The only changes are model_name, which can be any implemented flow model, and N_gen (the number of generated events).

FP = FlowPipeline(run_name, model_name, override, pipeline_path="ml_pipeline/test/")

FP.build_train_pipeline()
FP.fit(force=True)

FP.build_inference_pipeline(N_gen=10 ** 5) # inference == generation

res = FP.infer(return_results=True) # returns flow generated results

flow_train_pipeline, flow_infer_pipeline = FP.pipeline["train_pipeline"], FP.pipeline["inference_pipeline"]

Analysis setup

Most of the script files are also available as Jupyter notebooks. The notebooks are used for analysis and plotting, while the .py files are used for running the analysis in a pipeline. Some quick generator tests are given in generators.py.

Workflow

Scripts should generally be run in the following order:

  1. Generator pipeline
  2. Cut pipeline
  3. Histogram pipeline
  4. Upper limit pipeline
  5. CLs pipeline
  6. Spurious signal pipeline

List of analysis blocks

  • utils.py
    • SigBkgBlock
  • hists_pipeline.py
    • MakeHistsFromSamples
    • MakeHistsFromSamplesLumi
  • ul_pipeline.py
    • UpperLimitScannerBlock
    • PullBlock
  • cls_pipeline.py
    • CLsBlock
    • CLsBlockResultsParser
  • spur_pipeline.py
    • SpurBlock
    • SpurBlockResultsParser

Miscellaneous
