# MadMiner example

In this tutorial we'll demonstrate how to use MadMiner to generate train and test samples for the ML methods introduced in ["Constraining Effective Field Theories With Machine Learning"](https://arxiv.org/abs/1805.00013) and ["A Guide to Constraining Effective Field Theories With Machine Learning"](https://arxiv.org/abs/1805.00020), both by Johann Brehmer, Gilles Louppe, Juan Pavez, and Kyle Cranmer.

Before you execute this notebook, make sure you have working installations of MadGraph, Pythia, and Delphes. Note that, at least for now, the MG-Pythia interface and Delphes require custom patches (available upon request). In addition, MadMiner and [DelphesMiner](https://github.com/johannbrehmer/delphesminer) have to be in your `PYTHONPATH`.

In [1]:
from __future__ import absolute_import, division, print_function, unicode_literals

import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

from madminer.goldmine import GoldMine
from madminer.refinery import combine_and_shuffle
from madminer.refinery import Refinery
from madminer.refinery import constant_benchmark_theta, multiple_benchmark_thetas
from madminer.refinery import constant_morphing_theta, multiple_morphing_thetas, random_morphing_thetas
from madminer.tools.plots import plot_2d_morphing_basis
from delphesprocessor.delphesprocessor import DelphesProcessor

Please enter here the path to your MG5 root directory. This notebook assumes that you installed Delphes and Pythia through MG5.

In [2]:
mg_dir = '/Users/johannbrehmer/work/projects/madminer/MG5_aMC_v2_6_2'

## 1. Define parameter space

After creating a `GoldMine` instance, the first important step is the definition of the parameter space. Each model parameter is characterized by a name as well as the LHA block and ID.

If morphing is used, one also has to specify the maximal power with which the parameter contributes to the squared matrix element. For instance, a parameter that contributes to only one vertex will typically have `morphing_max_power=2`, while a parameter that contributes to two vertices usually has `morphing_max_power=4`. Exceptions arise, for instance, when the interference effects between the SM and dimension-six operators are modelled but the square of the dimension-six amplitude (subleading in 1/Lambda) is not taken into account, in which case `morphing_max_power=1`. Finally, the `parameter_range` argument defines the range of parameter values that is used for the automatic optimization of the morphing basis.
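
As a short illustration of where these powers come from, consider a single parameter $\theta$ that enters one vertex, so that the amplitude is linear in $\theta$. The symbols $\mathcal{M}_{\mathrm{SM}}$ and $\mathcal{M}_{\mathrm{BSM}}$ are notation introduced here for this sketch, not names used by MadMiner:

$$
\mathcal{M}(\theta) = \mathcal{M}_{\mathrm{SM}} + \theta \, \mathcal{M}_{\mathrm{BSM}}
\quad \Rightarrow \quad
|\mathcal{M}(\theta)|^2 = |\mathcal{M}_{\mathrm{SM}}|^2
+ 2 \, \theta \, \mathrm{Re}\left( \mathcal{M}_{\mathrm{SM}}^{*} \, \mathcal{M}_{\mathrm{BSM}} \right)
+ \theta^2 \, |\mathcal{M}_{\mathrm{BSM}}|^2 .
$$

The highest power of $\theta$ is 2. If the parameter enters two vertices, the amplitude contains terms up to $\theta^2$ and the squared matrix element terms up to $\theta^4$. Keeping only the interference term drops the $\theta^2$ contribution above, which leaves a maximal power of 1.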

In [None]:
miner = GoldMine()

miner.add_parameter(
    lha_block='dim6',
    lha_id=2,
    parameter_name='CWL2',
    morphing_max_power=2,
    parameter_range=(-10.,10.)
)
miner.add_parameter(
    lha_block='dim6',
    lha_id=5,
    parameter_name='CPWL2',
    morphing_max_power=2,
    parameter_range=(-10.,10.)
)

## 2. Define benchmark points (evaluation points for |M|^2)

The next step is the definition of all the points at which the weights (squared matrix elements) should be evaluated by MadGraph. We call these points "benchmarks".

### 2a. Set benchmarks by hand

One can define benchmarks by hand:

In [None]:
miner.add_benchmark(
    {'CWL2':0., 'CPWL2':0.},
    'sm'
)
miner.add_benchmark(
    {'CWL2':1., 'CPWL2':0.},
    'bsm'
)

### 2b. Benchmarks for morphing

If morphing is used, the function `set_benchmarks_from_morphing` has to be called. With the option `keep_existing_benchmarks=True`, MadMiner will keep all the benchmark points defined beforehand and run a simple optimization algorithm to fix the remaining ones for the basis (which may be none). Otherwise, MadMiner will optimize the full basis and forget about all previously defined benchmark points. The argument `n_trials` determines the number of random candidate bases that the optimization algorithm goes through.

In [None]:
miner.set_benchmarks_from_morphing(
    keep_existing_benchmarks=True,
    n_trials=1000,
    max_overall_power=2
)
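
To get a feeling for how many benchmarks such a morphing basis needs, the following sketch counts the monomials in the parameters that are allowed by the per-parameter `morphing_max_power` values and by `max_overall_power`. It assumes that one benchmark is needed per monomial; the function name `morphing_components` is made up for this illustration and is not part of the MadMiner API.

```python
from itertools import product


def morphing_components(max_powers, max_overall_power):
    """Return all exponent tuples (one exponent per parameter) that respect
    the per-parameter maximal powers and the overall maximal power."""
    ranges = [range(power + 1) for power in max_powers]
    return [
        powers
        for powers in product(*ranges)
        if sum(powers) <= max_overall_power
    ]


# Two parameters with morphing_max_power=2 each, and max_overall_power=2
components = morphing_components(max_powers=[2, 2], max_overall_power=2)
print(len(components))
print(components)
```

Under these assumptions, the setup in this notebook has six components (1, both linear terms, both squares, and the mixed term), so two hand-picked benchmarks would leave four to be chosen by the optimization.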

Let's have a look at the resulting morphing basis and the "morphing error", i.e. the sum of squared morphing weights as a function of the parameter space:

In [None]:
fig = plot_2d_morphing_basis(
    miner.morpher,
    xlabel=r'$c_{W} / \Lambda^2$ [TeV$^{-2}$]',
    ylabel=r'$c_{\tilde{W}} / \Lambda^2$ [TeV$^{-2}$]',
    xrange=(-10.,10),
    yrange=(-10.,10.)
)
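
The idea behind the sum of squared morphing weights can be sketched in a one-dimensional toy case. With one parameter and maximal power 2, the squared matrix element is a quadratic polynomial in the parameter, so three basis points suffice and the morphing weights reduce to Lagrange interpolation weights. This is a simplified stand-in, not MadMiner's implementation, and the basis points below are hypothetical.

```python
def morphing_weights(theta, basis):
    """Weights w_i(theta) with sum_i w_i * f(basis[i]) == f(theta) for every
    polynomial f of degree < len(basis) (Lagrange interpolation weights)."""
    weights = []
    for i, theta_i in enumerate(basis):
        w = 1.0
        for j, theta_j in enumerate(basis):
            if j != i:
                w *= (theta - theta_j) / (theta_i - theta_j)
        weights.append(w)
    return weights


def morphing_error(theta, basis):
    """Sum of squared morphing weights at the parameter point theta."""
    return sum(w * w for w in morphing_weights(theta, basis))


basis = [-5.0, 0.0, 5.0]  # hypothetical 1D basis points
for theta in (0.0, 2.5, 10.0):
    print(theta, morphing_error(theta, basis))
```

At a basis point the weights are one-hot and the sum of squares is 1. Far outside the range spanned by the basis, large weights of opposite sign have to cancel, which is why a large value signals regions where statistical fluctuations of the benchmark weights are amplified.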

## 3. Save settings and run MadGraph

The parameter space, benchmark points, and morphing setup are saved in an HDF5 file:

In [None]:
miner.save('data/madminer_example.h5')

In the next step, MadMiner starts MadGraph and Pythia to generate events and calculate the weights. You have to provide paths to the process card, run card, param card (the entries corresponding to the parameters of interest will be automatically adapted), and an empty reweight card.

The `sample_benchmark` option can be used to specify which benchmark should be used for sampling. If it is not used, MadMiner will automatically use the benchmark that was added first. Finally, if MadGraph is supposed to run in a different Python environment or requires other setup steps, you can use the `initial_command` argument.

In [None]:
miner.run(
    mg_directory=mg_dir,
    proc_card_file='cards/proc_card.dat',
    param_card_template_file='cards/param_card_template.dat',
    reweight_card_template_file='cards/reweight_card_template.dat',
    run_card_file='cards/run_card.dat',
    pythia8_card_file='cards/pythia8_card.dat',
    sample_benchmark='sm',
    initial_command='source activate python2'
)

## 4. Run detector simulation and extract observables

The detector simulation and the calculation of observables are not part of MadMiner. The reason is that different users might have very different requirements here: while a phenomenologist might be content with the fast detector simulation from Delphes, an experimental analysis might require the full simulation through Geant4.

We provide the DelphesMiner package, which wraps around Delphes and allows for the fast extraction of observables into the HDF5 file.

Any user is free to replace the DelphesMiner step with a tool of their choice. 

In [None]:
dm = DelphesProcessor()

After creating the DelphesProcessor object, one can add a number of HepMC event samples...

In [None]:
dm.add_hepmc_sample('MG_process/Events/run_01/tag_1_pythia8_events.hepmc.gz')

... and have it run Delphes:

In [None]:
dm.run_delphes(delphes_directory=mg_dir + '/Delphes',
               delphes_card='cards/delphes_card.dat',
               initial_command='source activate python2')

The next step is the definition of observables through a name and a Python expression. For the latter, you can use the objects `j[i]`, `e[i]`, `mu[i]`, `a[i]`, and `met`, where the indices `i` refer to an ordering by transverse momentum. All of these objects are scikit-hep [LorentzVectors](http://scikit-hep.org/api/math.html#vector-classes); see the link for documentation of their properties.

There is an optional keyword `required`. If `required=True`, we will only keep events where the observable can be parsed, i.e. all involved particles have been detected. If `required=False`, un-parseable observables will be filled with `np.nan`.

In [None]:
dm.add_observable('pt_e1', 'e[0].pt', required=True)
dm.add_observable('pt_mu1', 'mu[0].pt', required=True)
dm.add_observable('delta_eta_ll', 'abs(e[0].eta - mu[0].eta)', required=True)
dm.add_observable('delta_phi_ll', 'abs(e[0].phi() - mu[0].phi())', required=True)
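
The effect of the `required` flag can be sketched with plain Python objects. This is only an illustration of the behaviour described above, not DelphesMiner's code: the function `evaluate_observable`, the toy events, and the use of `SimpleNamespace` in place of LorentzVectors are all assumptions made for this example.

```python
import math
from types import SimpleNamespace


def evaluate_observable(expression, event, required):
    """Evaluate an observable expression for one event.

    Returns (keep_event, value). If a particle referenced in the expression
    is missing, the value is nan and the event is kept only if the
    observable is not required."""
    try:
        # Only abs() is exposed; particle lists are looked up in the event.
        value = eval(expression, {"__builtins__": {"abs": abs}}, event)
    except IndexError:
        return (not required), math.nan
    return True, value


# Toy events: the second one has no reconstructed electron
events = [
    {"e": [SimpleNamespace(pt=45.0, eta=0.3)], "mu": [SimpleNamespace(pt=30.0, eta=-1.1)]},
    {"e": [], "mu": [SimpleNamespace(pt=52.0, eta=0.8)]},
]

for required in (True, False):
    kept = []
    for event in events:
        keep, value = evaluate_observable("e[0].pt", event, required)
        if keep:
            kept.append(value)
    print(required, kept)
```

With `required=True` the event without an electron is dropped; with `required=False` it stays in the sample with a nan entry.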

The function `analyse_delphes_samples` extracts all these observables from the Delphes ROOT file(s) generated before:

In [None]:
dm.analyse_delphes_samples()

The values of the observables and the weights are then saved in the HDF5 file. It is possible to overwrite the same file, or to leave the original file intact and save all the data into a new file as follows:

In [None]:
dm.save('data/madminer_example_with_data.h5', 'data/madminer_example.h5')

It's easy to check some distributions at this stage:

In [None]:
fig = plt.figure(figsize=(5,5))

for weights in dm.weights:
    plt.hist(dm.observations['pt_e1'], range=(0.,400.), bins=20, histtype='step', weights=weights)

plt.show()

## 5. Combine and shuffle different event samples

To reduce disk usage, you can generate several small event samples with the steps given above, and combine them now. Note that (for now) it is essential that all of them are generated with the same setup, including the same benchmark points / morphing basis!

In [None]:
combine_and_shuffle(
    ['data/madminer_example_with_data.h5'],
    'data/madminer_example_shuffled.h5'
)
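
Conceptually, combining and shuffling amounts to concatenating the per-event observables and weights of all samples and applying one common random permutation to both. The sketch below uses plain Python lists instead of HDF5 files; `combine_and_shuffle_lists` is a hypothetical helper for illustration and does not reflect how MadMiner stores or reads the data.

```python
import random


def combine_and_shuffle_lists(samples, seed=None):
    """Concatenate (observations, weights) pairs and shuffle the events.

    Each sample is a pair of lists with one row per event. All samples must
    have the same number of benchmark weights per event."""
    n_benchmarks = {len(row) for _, weights in samples for row in weights}
    if len(n_benchmarks) > 1:
        raise ValueError("Samples were generated with different benchmarks")

    all_observations, all_weights = [], []
    for observations, weights in samples:
        if len(observations) != len(weights):
            raise ValueError("Observations and weights differ in length")
        all_observations.extend(observations)
        all_weights.extend(weights)

    # One permutation for both lists keeps each event's rows together
    order = list(range(len(all_observations)))
    random.Random(seed).shuffle(order)
    return [all_observations[i] for i in order], [all_weights[i] for i in order]


sample_a = ([[1.0], [2.0]], [[0.1, 0.2], [0.3, 0.4]])
sample_b = ([[3.0]], [[0.5, 0.6]])
x_all, w_all = combine_and_shuffle_lists([sample_a, sample_b], seed=0)
print(x_all)
print(w_all)
```

The check on the number of benchmark weights mirrors the requirement that all samples share the same setup.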

## 6. Make (unweighted) training and test samples

The last important MadMiner class is the `Refinery`. From all the data we now have in the HDF5 file, it extracts unweighted samples, including the augmented data ("gold") that is needed as training and evaluation data for the machine learning algorithms.
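
One common way to turn weighted events into an unweighted sample is to draw events with replacement, with probabilities proportional to their weights. The sketch below shows this idea with the standard library only. It is an assumption about the general technique rather than a description of MadMiner's internals, and `unweight` is a name chosen for this example.

```python
import random


def unweight(observations, weights, n_samples, seed=None):
    """Draw n_samples events with replacement, with probability
    proportional to each event's weight."""
    rng = random.Random(seed)
    indices = rng.choices(range(len(observations)), weights=weights, k=n_samples)
    return [observations[i] for i in indices]


observations = [10.0, 20.0, 30.0]
weights = [0.0, 1.0, 3.0]  # toy event weights for one parameter point

sample = unweight(observations, weights, n_samples=10000, seed=1)
print(sample.count(30.0) / len(sample))
```

Using the weights of a different benchmark (or morphed weights for an arbitrary parameter point) in the same way yields an unweighted sample for that parameter point from the same set of events.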

In [3]:
refinery = Refinery('data/madminer_example_shuffled.h5', debug=True)

23:02  
23:02  ------------------------------------------------------------
23:02  |                                                          |
23:02  |  MadMiner                                                |
23:02  |                                                          |
23:02  |  Version from July 19, 2018                              |
23:02  |                                                          |
23:02  |           Johann Brehmer, Kyle Cranmer, and Felix Kling  |
23:02  |                                                          |
23:02  ------------------------------------------------------------
23:02  
23:02  
    

The `Refinery` class defines four different high-level functions to generate train or test samples:
- `extract_samples_train_plain()`, which only saves observations x, for instance for histograms or ABC;
- `extract_samples_train_local()` for methods like SALLY and SALLINO;
- `extract_samples_train_ratio()` for techniques like CARL, ROLR, CASCAL, and RASCAL; and
- `extract_samples_test()` for the evaluation of any method.

For the arguments `theta`, `theta0`, or `theta1`, you can use the helper functions `constant_benchmark_theta()`, `multiple_benchmark_thetas()`, `constant_morphing_theta()`, `multiple_morphing_thetas()`, and `random_morphing_thetas()`, all defined in the `refinery` module.
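
Further below, `random_morphing_thetas()` is called with one prior tuple per parameter, such as `('gaussian', 0., 0.5)` or `('flat', -0.8, 0.8)`. The following sketch shows one plausible reading of these tuples, namely (mean, standard deviation) for the Gaussian prior and (lower, upper) bounds for the flat prior. That reading is an assumption, and `draw_thetas` is a hypothetical stand-in rather than the MadMiner function.

```python
import random


def draw_thetas(priors, n_thetas, seed=None):
    """Draw parameter points with one independent prior per parameter.

    priors: list of ('gaussian', mean, std) or ('flat', lower, upper)."""
    rng = random.Random(seed)
    thetas = []
    for _ in range(n_thetas):
        theta = []
        for kind, a, b in priors:
            if kind == "gaussian":
                theta.append(rng.gauss(a, b))  # a: mean, b: standard deviation
            elif kind == "flat":
                theta.append(rng.uniform(a, b))  # a: lower bound, b: upper bound
            else:
                raise ValueError("Unknown prior type: {}".format(kind))
        thetas.append(theta)
    return thetas


thetas = draw_thetas([("gaussian", 0.0, 0.5), ("flat", -0.8, 0.8)], n_thetas=5, seed=42)
for theta in thetas:
    print(theta)
```

Under this reading, the first component is unbounded while the second always stays within the flat prior's range.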

In [6]:
x, theta, t_xz = refinery.extract_samples_train_local(
    theta=constant_morphing_theta(np.array([1.e-5,0.])),
    n_samples=1000,
    folder='./data/samples',
    filename='train_sally'
)

23:04  Extracting training sample for local score regression. Sampling and score evaluation according to ('theta', array([1.e-05, 0.e+00]))
23:04  Starting sample extraction
23:04  New theta configuration
23:04    Sampling theta: morphing, [1.e-05 0.e+00]
23:04    Auxiliary theta: None, None
23:04    # samples: 1000


In [7]:
x, theta0, theta1, y, r_xz, t_xz = refinery.extract_samples_train_ratio(
    theta0=random_morphing_thetas(None, [('gaussian', 0., 0.5), ('flat', -0.8, 0.8)]),
    theta1=constant_benchmark_theta('sm'),
    n_samples=1000,
    folder='./data/samples',
    filename='train_rascal'
)

23:04  Extracting training sample for ratio-based methods. Numerator hypothesis: ('random', (None, [('gaussian', 0.0, 0.5), ('flat', -0.8, 0.8)])), denominator hypothesis: ('benchmark', 'sm')
23:04  Starting sample extraction
23:04  New theta configuration
23:04    Sampling theta: morphing, [-0.24872974  0.25641442]
23:04    Auxiliary theta: benchmark, sm
23:04    # samples: 1
23:04  New theta configuration
23:04    Sampling theta: morphing, [-0.34692512  0.70023703]
23:04    Auxiliary theta: benchmark, sm
23:04    # samples: 1
[... the log continues with one such block per sampled theta ...]

In [8]:
x, theta = refinery.extract_samples_test(
    theta=constant_benchmark_theta('sm'),
    n_samples=1000,
    folder='./data/samples',
    filename='test'
)

23:04  Extracting evaluation sample. Sampling according to ('benchmark', 'sm')
23:04  Starting sample extraction
23:04  New theta configuration
23:04    Sampling theta: benchmark, sm
23:04    Auxiliary theta: None, None
23:04    # samples: 1000


For debugging, you can also access the full list of observables and benchmark weights (in fb) in the HDF5 file:

In [9]:
all_x, all_weights = refinery.extract_raw_data()

print(all_x)
print(all_weights)

[[1.77766159e+02 1.70798584e+02 2.23781466e-02 3.41418093e+00]
 [8.98204575e+01 1.06333748e+02 1.26335031e+00 2.97818649e+00]
 [1.08448730e+03 6.14256958e+02 9.63677540e-01 3.19618416e+00]
 ...
 [4.36720001e+02 2.33174133e+02 2.86702240e+00 2.91675136e+00]
 [1.53468201e+03 5.98130066e+02 7.03580171e-01 3.10173911e+00]
 [7.01861115e+01 3.50527832e+02 1.94378757e+00 3.60562098e+00]]
[[0.00013943 0.00014308 0.00014411 0.00014399 0.00013628 0.00013715]
 [0.00013892 0.0001438  0.00014376 0.00014404 0.00013629 0.00013715]
 [0.00013898 0.00014388 0.00014389 0.00014416 0.00013628 0.00013715]
 ...
 [0.00013882 0.00014403 0.00014373 0.00014411 0.00013628 0.00013715]
 [0.00013897 0.00014396 0.00014392 0.00014421 0.00013627 0.00013715]
 [0.00013875 0.00014422 0.00014374 0.0001442  0.00013628 0.00013715]]


Let's have a look at some distributions and correlations in this test sample:

In [None]:
import corner

labels = [r'$p_{T,e}$ [GeV]', r'$p_{T,\mu}$ [GeV]', r'$\Delta \eta_{\ell\ell}$', r'$\Delta \phi_{\ell\ell}$']
ranges = [(0., 500.), (0., 500.), (0.,3.), (0.,6.2)]

_ = corner.corner(x, color='C0', labels=labels, range=ranges)