# Tutorial 01 - Data Preparation

This tutorial walks you through creating the training data used in this study.

### Downloading inputs

The original inputs are available from Zenodo for various samples:

- QCD background from official LHCO dataset: https://zenodo.org/records/4536377/files/events_anomalydetection_v2.features.h5
- Extra QCD background: https://zenodo.org/records/8370758/files/events_anomalydetection_qcd_extra_inneronly_features.h5
- Parametric W->X(qq)Y(qq) signal: https://zenodo.org/records/11188685/files/events_anomalydetection_Z_XY_qq_parametric.h5
- Parametric W->X(qq)Y(qqq) signal: https://zenodo.org/records/11188685/files/events_anomalydetection_Z_XY_qqq_parametric.h5

You can download the inputs manually and place them at `<datadir>/original/<input>.h5`.
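As a minimal sketch of the manual route, the snippet below uses only the Python standard library to place a file under `<datadir>/original/`. The helper names `target_path` and `fetch` are illustrative and not part of paws; the URL is the QCD entry from the list above.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve

# URL taken from the list above; add the other samples as needed
QCD_URL = "https://zenodo.org/records/4536377/files/events_anomalydetection_v2.features.h5"

def target_path(datadir: str, url: str) -> Path:
    """Return <datadir>/original/<input>.h5 for a given download URL (hypothetical helper)."""
    filename = Path(urlparse(url).path).name
    return Path(datadir) / "original" / filename

def fetch(datadir: str, url: str) -> Path:
    """Download url into the expected layout, skipping files that already exist."""
    dest = target_path(datadir, url)
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urlretrieve(url, dest)
    return dest

# show where the file would be stored without downloading anything
print(target_path("datasets", QCD_URL))
# call fetch("datasets", QCD_URL) to actually download the file
```

Keeping the path construction separate from the download makes it easy to check the layout before transferring any data.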

A more straightforward way to download the inputs and ensure the correct file structure is to use the paws command line interface (CLI).

In [1]:
!paws download_data --help

Usage: paws download_data [OPTIONS]

  Download datasets used in this study.

Options:
  -s, --samples [QCD|extra_QCD|W_qq|W_qqq]
                                  List of data samples to download (separated
                                  by commas). Available samples are: ['QCD',
                                  'extra_QCD', 'W_qq', 'W_qqq']. By default,
                                  all samples will be downloaded.
  -d, --datadir TEXT              Base directory for storing datasets. The
                                  downloaded data will be stored in
                                  <datadir>/raw.  [default: datasets]
  --help                          Show this message and exit.


In [2]:
# to download all input samples with datadir = "datasets"
!paws download_data -d datasets

[INFO] Downloading sample "QCD" from https://zenodo.org/records/4536377/files/events_anomalydetection_v2.features.h5
--2024-06-01 19:36:24--  https://zenodo.org/records/4536377/files/events_anomalydetection_v2.features.h5
Resolving zenodo.org (zenodo.org)... 2001:1458:d00:3b::100:200, 2001:1458:d00:9::100:195, 2001:1458:d00:3a::100:33a, ...
Connecting to zenodo.org (zenodo.org)|2001:1458:d00:3b::100:200|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 74315238 (71M) [application/octet-stream]
Saving to: ‘datasets/original/events_anomalydetection_v2.features.h5’


2024-06-01 19:36:29 (18.2 MB/s) - ‘datasets/original/events_anomalydetection_v2.features.h5’ saved [74315238/74315238]

[INFO] File downloaded to datasets/original/events_anomalydetection_v2.features.h5
[INFO] Downloading sample "extra_QCD" from https://zenodo.org/records/8370758/files/events_anomalydetection_qcd_extra_inneronly_features.h5
--2024-06-01 19:36:29--  https://zenodo.org/records/8370758/fi

Alternatively, you may use the paws API:

In [3]:
from paws import PathManager
from paws.data_preparation import download_file
from paws.settings import Sample, SampleURLs

# modify your base data directory here
datadir = "datasets"

# choose the samples to download, you will most likely download all of them
samples = ['QCD', 'extra_QCD', 'W_qq', 'W_qqq']

print(f"datadir = {datadir}")
print(f"samples = {samples}")

datadir = datasets
samples = ['QCD', 'extra_QCD', 'W_qq', 'W_qqq']


In [4]:
# check the urls for each sample
SampleURLs

{<Sample.QCD: 0>: 'https://zenodo.org/records/4536377/files/events_anomalydetection_v2.features.h5',
 <Sample.EXTRA_QCD: 1>: 'https://zenodo.org/records/8370758/files/events_anomalydetection_qcd_extra_inneronly_features.h5',
 <Sample.W_QQ: 2>: 'https://zenodo.org/records/11188685/files/events_anomalydetection_Z_XY_qq_parametric.h5',
 <Sample.W_QQQ: 3>: 'https://zenodo.org/records/11188685/files/events_anomalydetection_Z_XY_qqq_parametric.h5'}

In [5]:
# the actual output directory is handled by PathManager
path_manager = PathManager(directories={"dataset": datadir})
outdir = path_manager.get_directory('original_dataset')
print(f"outdir = {outdir}")

outdir = datasets/original


In [None]:
# now download the inputs
for sample in samples:
    url = SampleURLs[Sample.parse(sample)]
    print(f'Downloading sample "{sample}" from {url}')
    download_file(url, outdir)

### Prepare dedicated datasets

Dedicated datasets are the processed datasets in tfrecord format per sample per mass point, i.e. (mX, mY). For the background, the dataset is repeated for each mass point.

These datasets are used to train the dedicated supervised and weakly supervised models, each of which is trained on signal at a specific mass point.

To use the CLI:

In [8]:
!paws create_dedicated_datasets --help

Usage: paws create_dedicated_datasets [OPTIONS]

  Create dedicated datasets for model training.

Options:
  -s, --samples [QCD|extra_QCD|W_qq|W_qqq]
                                  List of data samples to process (separated
                                  by commas). Available samples are: ['QCD',
                                  'extra_QCD', 'W_qq', 'W_qqq']. By default,
                                  all samples will be processed.
  -d, --datadir TEXT              Base directory for storing datasets.
                                  [default: datasets]
  --cache / --no-cache            Whether to cache existing results.
                                  [default: cache]
  --parallel INTEGER              Parallelize job across the N workers
                                  Case  0: Jobs are run sequentially (for debugging)
                                  Case -1: Jobs are run across N_CPU workers.  [default: -1]
                                  Verbosity level.  [default

In [None]:
# note: each command uses all available CPUs by default; limit the CPU usage by setting --parallel N_CPU
!paws create_dedicated_datasets -s QCD -d datasets
!paws create_dedicated_datasets -s extra_QCD -d datasets
!paws create_dedicated_datasets -s W_qq -d datasets
!paws create_dedicated_datasets -s W_qqq -d datasets

To use the API:

In [None]:
from paws.data_preparation import create_high_level_dedicated_datasets

datadir = "datasets"
samples = ['QCD', 'extra_QCD', 'W_qq', 'W_qqq']

for sample in samples:
    create_high_level_dedicated_datasets(sample, datadir=datadir, parallel=-1)

### Prepare parameterised datasets

Parameterised datasets are the combined dedicated datasets for each decay mode scenario. The datasets are pre-shuffled with a given seed.

Note that for each sample and mass point, the dataset is split into N = 100 shards. The training, validation and test splits are defined by the set of shard indices assigned to each split, which ensures that they are orthogonal and reproducible. In the weakly supervised training, the validation and test datasets are used as the training dataset instead (and the training dataset is split into validation and test) to avoid bias from the supervised training.
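The shard-based split can be illustrated with a small sketch. This is not the paws implementation: the function `split_shards`, the split fractions, and the seed are all assumptions chosen for illustration. Only the shard count of 100 and the role swap for the weakly supervised training come from the text.

```python
import random

NUM_SHARDS = 100  # number of shards per sample and mass point, as stated above

def split_shards(num_shards, fractions=(0.5, 0.25, 0.25), seed=2023):
    """Assign shard indices to train/val/test (fractions and seed are assumed values)."""
    indices = list(range(num_shards))
    random.Random(seed).shuffle(indices)  # a fixed seed keeps the split reproducible
    n_train = int(fractions[0] * num_shards)
    n_val = int(fractions[1] * num_shards)
    return {
        "train": sorted(indices[:n_train]),
        "val": sorted(indices[n_train:n_train + n_val]),
        "test": sorted(indices[n_train + n_val:]),
    }

supervised = split_shards(NUM_SHARDS)

# weakly supervised training: train on the supervised val + test shards and
# divide the supervised train shards between val and test
half = len(supervised["train"]) // 2
weakly = {
    "train": sorted(supervised["val"] + supervised["test"]),
    "val": supervised["train"][:half],
    "test": supervised["train"][half:],
}

for name, split in (("supervised", supervised), ("weakly", weakly)):
    print(name, {key: len(value) for key, value in split.items()})
```

Because each shard index belongs to exactly one split, no event can appear in two splits, and the weakly supervised model never trains on shards the supervised model was trained on.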

These datasets are used to train the parameterised supervised models, which are trained over a parametric set of signal mass points.

To use the CLI:

In [10]:
!paws create_param_datasets --help

Usage: paws create_param_datasets [OPTIONS]

  Create parameterised datasets for model training.

Options:
  -s, --samples [QCD|extra_QCD|W_qq|W_qqq]
                                  List of data samples to include (separated
                                  by commas). Available samples are: ['QCD',
                                  'extra_QCD', 'W_qq', 'W_qqq']. By default,
                                  all samples will be included. Note that for
                                  two-prong / three-prong training, only the
                                  two-prong / three-prong signals should be
                                  included.
  -d, --datadir TEXT              Base directory for storing datasets.
                                  [default: datasets]
  --shards TEXT                   Process datasets with the specific shard
                                  indices (separated by commas). By default,
                                  all shards will be processed. Use

In [None]:
# two-prong dataset
!paws create_param_datasets -s QCD,extra_QCD,W_qq -d datasets
# three-prong dataset
!paws create_param_datasets -s QCD,extra_QCD,W_qqq -d datasets

To use the API:

In [None]:
import numpy as np
    
from paws.data_preparation import create_parameterised_datasets
from paws.settings import NUM_SHARDS

datadir = "datasets"

two_prong_samples = ["W_qq", "QCD", "extra_QCD"]
three_prong_samples = ["W_qqq", "QCD", "extra_QCD"]

# you can process a subset of shards in each job to potentially speed up the procedure
# process all shards in one go for now
shard_indices = np.arange(NUM_SHARDS)

# you might run into memory issues if you process too many shards at the same time; reduce the number of parallel workers accordingly

# two-prong dataset
create_parameterised_datasets(shard_indices, sample=two_prong_samples, datadir=datadir, parallel=16)
# three-prong dataset
create_parameterised_datasets(shard_indices, sample=three_prong_samples, datadir=datadir, parallel=16)
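The comments above mention processing a subset of shards in each job. As a rough sketch, the snippet below groups the shard indices into chunks and prints one CLI command per chunk, using the comma-separated format described in the `--shards` help text. The helper `shard_chunks` and the chunk size of 25 are assumptions for illustration, and the commands are only printed, not executed.

```python
def shard_chunks(num_shards, chunk_size):
    """Yield consecutive groups of shard indices with at most chunk_size entries each."""
    for start in range(0, num_shards, chunk_size):
        yield list(range(start, min(start + chunk_size, num_shards)))

NUM_SHARDS = 100  # as stated above; paws.settings.NUM_SHARDS holds the packaged value

for chunk in shard_chunks(NUM_SHARDS, chunk_size=25):
    # the --shards option takes shard indices separated by commas
    shards_arg = ",".join(str(index) for index in chunk)
    print(f"paws create_param_datasets -s QCD,extra_QCD,W_qq -d datasets --shards {shards_arg}")
```

Smaller chunks lower the peak memory of each job at the cost of launching more jobs.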