Dual Readout TMVA

This project's goal is to facilitate the particle identification process by applying Machine Learning algorithms. The hardware setup is outlined below. The incident particle interacts with the scintillation crystal and may produce two types of signals. Depending on the particle type, either only Cerenkov radiation is produced (e.g. for pions), or the Cerenkov radiation is accompanied by scintillation radiation (for e- or gamma).

Particle identification setup schematics

A Tektronix MDO4034C oscilloscope was set up with an external trigger using the PyVISA library. A total of 16064 spectra for Cerenkov-only particles (experimental sample "Cube 6") and 33056 spectra for the Cerenkov and scintillation signal (sample "Cube 9") were acquired by Vladimir Berdnikov. Waveforms are stored on the pristine@jlab.org machine under the /misc/cuanas/Data folder.

Sample waveforms for each signal type are visualized below. The top histograms represent the original waveforms from the oscilloscope; the bottom images show the zoomed-in signals. On the left, we observe a Cerenkov-only signal. On the right, the signal is a superposition of the Cerenkov signal with a longer scintillation "tail".

Two types of signals for the binary classification

We observe clear differences in the waveform shapes for the two particle types. Machine Learning (ML) can therefore be applied to classify particles into the two known groups.

Program Description

This program demonstrates that ML techniques can be successfully applied to perform the classification of "unknown" experimental spectra. Generally speaking, the ML workflow consists of three stages:

  • Preparation. Oscilloscope waveforms are processed and converted into a ROOT tree of a certain format to be loaded into the ML algorithm(s).
  • Training. Known input data that corresponds to each classification group is given to the ML algorithm.
  • Classification. Unknown spectra are analyzed by the trained algorithm and classified into groups.

The program should be compiled and executed on the JLab computing farm. The first two stages are the most compute-intensive ones. Classification can be carried out on a local computer. Below we elaborate on the program workflow and syntax, and provide the results of the classification.

Preparation Stage

Waveform CSV files for each classification group hosted on pristine@jlab.org are copied to the /w/hallc-scshelf2102/kaon/petrs/Data/Cubes folder for processing. A group rename operation is performed to give a unique name to each .csv file (with the help of the rename '' 'Mar14_' *.csv command).

It is necessary to exclude a subset of "testing" data from the ML training process. Therefore, 600 randomly selected spectra from each category (a total of 1200 files) are moved to the /w/hallc-scshelf2102/kaon/petrs/Data/Cubes-processed/samples-testing folder. The rest of the spectra from each category are consolidated in the following folders:

  • Cerenkov-only spectra are stored under:
    /w/hallc-scshelf2102/kaon/petrs/Data/Cubes-processed/sample6-learning/
  • Cerenkov and scintillation are copied to:
    /w/hallc-scshelf2102/kaon/petrs/Data/Cubes-processed/sample9-learning/

In this experiment, the trigger plates (top and bottom) are larger than the scintillation crystal. Therefore, some trigger events produce a very weak or empty signal. We will refer to these as the "baseline" waveforms. A few of the "baseline" waveforms are visualized in the figure below:

Example set of baseline spectra to be classified with AI ROOT TMVA

On the other hand, when an incident particle hits both trigger plates and the scintillation crystal, we obtain a waveform containing Cerenkov and (possibly) scintillation information. We will refer to these waveforms as the "event" waveforms.

During the preparation stage, the .csv files are imported into ROOT histograms. To separate the "baseline" spectra from the actual "event" signals (i.e. when a particle signal is registered), the program creates and visualizes a ROOT file storing, for each waveform, its name, the minimum registered voltage value (the signal is negative), and the peak position. Next, the "baseline" waveforms are manually filtered out of the data set using the following criteria (a code sketch of this selection is given after the list):

  • The minimum registered voltage of an "event" waveform should be below -0.03 V, i.e. the signal amplitude exceeds the 0.03 V threshold.
  • The peak position should lie in the range of -10 to 20 ns.
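For illustration, here is a minimal C++/ROOT sketch of such a selection, assuming each waveform is imported into a TH1D with the time axis in nanoseconds and the amplitude axis in volts. The function name and the interpretation of the 0.03 V value as a cut on the minimum voltage are assumptions, not the project's actual code.

#include <TH1D.h>

// Returns true if the waveform passes the "event" criteria listed above.
// The signal is negative, so the peak corresponds to the minimum bin.
bool isEventWaveform(const TH1D* waveform) {
  Int_t minBin = waveform->GetMinimumBin();
  Double_t minVoltage = waveform->GetBinContent(minBin);  // volts, negative for real events
  Double_t peakPosition = waveform->GetBinCenter(minBin); // ns

  bool amplitudeOk = minVoltage < -0.03;                  // amplitude above the 0.03 V threshold
  bool positionOk = peakPosition > -10 && peakPosition < 20;
  return amplitudeOk && positionOk;
}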

The image below visualizes the criteria for filtering out the "baseline" spectra (shown for the Cube 9 sample set).

Criteria for filtering out baseline waveforms

The ratio of "event" to "baseline" waveforms for the two groups of samples is as follows:

  • For Cerenkov-only spectra:
    Identified 51% "event" waveforms (8226 files), 49% baseline waveforms (7837 files).
  • For Cerenkov and scintillation spectra:
    Identified 56% "event" waveforms (18606 files), 44% baseline waveforms (14449 files).

A further improvement of the program would be the implementation of AI-based classification of the baseline spectra into a separate group.

Now that the "baseline" waveforms are filtered out, the "event" signals are prepared as input for the ML algorithm. To my knowledge, some ML algorithms cannot work with negative variable data. Therefore, the original waveforms (the oscilloscope gives a negative signal) are inverted. Next, any remaining negative values are set to zero. Additionally, the waveforms are cropped to exclude insignificant data. This processing is performed in the HistUtils::prepHistForTMVA() method.
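A simplified sketch of this preprocessing is shown below. The actual implementation lives in HistUtils::prepHistForTMVA(); the function name, signature, and crop interface here are illustrative assumptions.

#include <TH1D.h>
#include <TString.h>

// Sketch: invert the negative signal, zero out negative bins, and crop the
// waveform to the bin range of interest. Returns a new, shorter histogram.
TH1D* prepWaveformSketch(TH1D* hist, Int_t firstBin, Int_t lastBin) {
  // Invert the waveform so that the peak becomes positive
  hist->Scale(-1.);

  // Set the remaining negative values to zero
  for (Int_t i = 1; i <= hist->GetNbinsX(); i++) {
    if (hist->GetBinContent(i) < 0) hist->SetBinContent(i, 0);
  }

  // Crop: copy only the bins in [firstBin, lastBin] into a new histogram
  TH1D* cropped = new TH1D(TString::Format("%s_prep", hist->GetName()), hist->GetTitle(),
                           lastBin - firstBin + 1,
                           hist->GetXaxis()->GetBinLowEdge(firstBin),
                           hist->GetXaxis()->GetBinUpEdge(lastBin));
  for (Int_t i = firstBin; i <= lastBin; i++) {
    cropped->SetBinContent(i - firstBin + 1, hist->GetBinContent(i));
  }
  return cropped;
}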

Next, the data is ready to be written into a ROOT tree in a special format for the TMVA analysis. There are two approaches to creating ROOT tree data for machine learning. Starting with ROOT v6.20, a modern method for the ML tree preparation is available, where all the histogram bin values are written into a single tree branch as an array. Refer to the image below.

Creating ROOT tree with data for TMVA Machine Learning

Unfortunately, this method failed to provide correct classification results. An error in the ROOT code was found and reported in this Pull Request. The tree structures for both the modern and the traditional approach are visualized below. As a temporary workaround that allows running the program on the JLab farm, the input data was formatted in the traditional way, where every ML variable (histogram bin) is stored in a separate tree branch.
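For reference, a minimal sketch of the traditional layout is given below. The branch naming scheme is an assumption; the tree name follows the description later in this document.

#include <TFile.h>
#include <TH1D.h>
#include <TString.h>
#include <TTree.h>
#include <vector>

// Sketch: write waveforms into a TTree where every histogram bin is a
// separate branch (the "traditional" layout used as a workaround).
void writeTraditionalTree(const std::vector<TH1D*>& waveforms, Int_t nBins) {
  TFile file("tmva-input.root", "RECREATE");
  TTree tree("treeS", "Cerenkov and scintillation waveforms");

  std::vector<Double_t> vars(nBins);
  for (Int_t i = 0; i < nBins; i++) {
    // One branch per ML variable (histogram bin)
    tree.Branch(TString::Format("bin_%d", i), &vars[i], TString::Format("bin_%d/D", i));
  }
  // The "modern" layout would instead keep all bins in a single array branch,
  // e.g. tree.Branch("waveform", vars.data(), "waveform[1024]/D");

  for (TH1D* hist : waveforms) {
    for (Int_t i = 0; i < nBins; i++) vars[i] = hist->GetBinContent(i + 1);
    tree.Fill();
  }
  tree.Write();
  file.Close();
}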

Training Stage

Currently, two ML algorithms are implemented in the code: boosted decision trees (BDT) and a deep neural network (DNN). The DNN algorithm requires splitting the input data into "training" and "test" events. A training-to-test ratio of 80/20 was selected. Results of the training stage are presented in the graphs below:

TMVA Overtraining Check

TMVA Cut Efficiencies

TMVA Signal Efficiency

TMVA Training History

The training stage is rather resource-intensive. However, it needs to be run only once. As the result of the training stage, TMVA outputs the so-called "weight" files containing the ML training information. A set of weight files, one for each implemented ML algorithm (BDT and DNN), is used to classify the "unknown" waveforms without the need to re-train the model.
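For illustration, a minimal sketch of how such a training (producing the weight files) could be set up with the TMVA Factory is shown below. The option strings are simplified examples adapted from the standard ROOT TMVA tutorials, not the project's exact configuration; the variable layout follows the traditional per-bin format described above, and the event counts in the split options are placeholders.

#include <TFile.h>
#include <TString.h>
#include <TTree.h>
#include <TMVA/DataLoader.h>
#include <TMVA/Factory.h>
#include <TMVA/Types.h>

// Sketch: book BDT and DNN classifiers with an 80/20 training-to-test split.
void trainSketch(TTree* treeS, TTree* treeB, Int_t nBins) {
  TFile* outputFile = TFile::Open("ClassificationOutput.root", "RECREATE");
  TMVA::Factory factory("TMVA_CNN_Classification", outputFile,
                        "!V:ROC:!Correlations:AnalysisType=Classification");
  TMVA::DataLoader loader("dataset");

  // Every histogram bin is a separate input variable (stored internally as float)
  for (Int_t i = 0; i < nBins; i++) {
    loader.AddVariable(TString::Format("bin_%d", i), 'F');
  }
  loader.AddSignalTree(treeS, 1.0);      // Cerenkov and scintillation
  loader.AddBackgroundTree(treeB, 1.0);  // Cerenkov only

  // Example of an 80/20 split: the event counts below are placeholders and
  // should be adjusted to the actual data set sizes
  loader.PrepareTrainingAndTestTree("",
      "nTrain_Signal=8000:nTest_Signal=2000:nTrain_Background=8000:nTest_Background=2000:"
      "SplitMode=Random:NormMode=NumEvents:!V");

  factory.BookMethod(&loader, TMVA::Types::kBDT, "BDT",
      "!H:!V:NTrees=400:MaxDepth=2:BoostType=AdaBoost");
  factory.BookMethod(&loader, TMVA::Types::kDL, "DNN",
      "!H:!V:ErrorStrategy=CROSSENTROPY:WeightInitialization=XAVIER:"
      "Layout=DENSE|64|RELU,DENSE|64|RELU,DENSE|1|LINEAR:"
      "TrainingStrategy=LearningRate=1e-3,Momentum=0.9,ConvergenceSteps=10,"
      "BatchSize=100,MaxEpochs=20:Architecture=CPU");

  factory.TrainAllMethods();
  factory.TestAllMethods();
  factory.EvaluateAllMethods();
  outputFile->Close();
}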

Classification Stage

At this stage, the program takes a set of the "unknown" spectra and applies the trained ML algorithm to determine if a spectrum shape corresponds to the Cerenkov-only or Cerenkov with scintillation category.

A set of "unknown" spectra which is a random mix of Cerenkov and Cerenkov+scintillation spectra excluded from the training stage is analyzed by the AI algorithm. "Unknown" spectra are segregated under /w/hallc-scshelf2102/kaon/petrs/Data/Cubes-processed/samples-testing folder.

The output of the classification stage for a particular spectrum is a floating-point number in the range [0, 1]. Classification results for the unknown spectra are stored in histograms and presented in the image below.

TMVA Classification results

Additionally, the program prints the classification results to the Terminal. There are two classification results for each spectrum, one for the BDT and one for the DNN classifier.

Entry: 1
Filename: Mar28_DataLog_10236_7f6b_812b
MVA response for "TMVA_CNN_Classification_BDT.weights": 0.486826
MVA response for "TMVA_CNN_Classification_DNN.weights": 0.999035

Entry: 2
Filename: Mar28_DataLog_10237_7f6d_812b
MVA response for "TMVA_CNN_Classification_BDT.weights": 0.56271
MVA response for "TMVA_CNN_Classification_DNN.weights": 0.999979

Entry: 3
Filename: Mar28_DataLog_10264_7fb0_812c
MVA response for "TMVA_CNN_Classification_BDT.weights": 0.378375
MVA response for "TMVA_CNN_Classification_DNN.weights": 0.999851

...
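For reference, the evaluation of a single spectrum can be sketched with the TMVA::Reader class as follows. The variable names and the weight file path follow the per-bin layout described earlier and are assumptions rather than the project's actual code.

#include <TH1D.h>
#include <TString.h>
#include <TMVA/Reader.h>
#include <vector>

// Sketch: apply a trained weight file to one "unknown" waveform and
// return the MVA response.
Double_t classifySketch(TH1D* waveform, Int_t nBins, const char* weightFile) {
  TMVA::Reader reader("!Color:!Silent");

  // Variables must be Float_t and registered in the same order as in training
  std::vector<Float_t> vars(nBins);
  for (Int_t i = 0; i < nBins; i++) {
    reader.AddVariable(TString::Format("bin_%d", i), &vars[i]);
  }
  reader.BookMVA("BDT", weightFile);  // e.g. dataset/weights/TMVA_CNN_Classification_BDT.weights.xml

  for (Int_t i = 0; i < nBins; i++) vars[i] = waveform->GetBinContent(i + 1);
  return reader.EvaluateMVA("BDT");   // MVA response for this waveform
}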

All information output by the program is currently stored in the /w/hallc-scshelf2102/kaon/petrs/Data/Cubes-results/TMVA-Jun23 folder. The next section describes how to reproduce the obtained results.

Program Build and Run

To reproduce the obtained results, the program code needs to be checked out and built in the JLab computing environment.

  • Log in to the computing farm: ssh <your-username>@login.jlab.org.
  • Connect to one of the ifarm nodes: ssh ifarm.
  • Clone the program code: git clone https://github.com/petrstepanov/dual-readout-tmva.
  • Source the environment: source /site/12gev_phys/softenv.csh 2.5.
  • Create a folder for the out-of-source build: mkdir dual-readout-tmva-build && cd dual-readout-tmva-build.
  • Generate the makefile with CMake: cmake ../dual-readout-tmva.
  • Build the source code: make -j`nproc`.

The dual-readout-tmva executable will be generated inside the current folder. The program mode (preparation, training, or classification) and the paths to the source directories containing the input data are passed as command-line parameters.

Preparation Stage

First, we run the program in the preparation mode, providing the paths to the source folders with known waveform types:

./dual-readout-tmva --mode prepare --background <cerenkov-waveforms-path> --signal <cerenkov-and-scintillation-path>

where <cerenkov-waveforms-path> and <cerenkov-and-scintillation-path> are the folder paths of the Cube 6 and Cube 9 waveforms, respectively.

The program outputs the tmva-input.root file containing the processed "event" waveforms written into the treeB (background, Cerenkov only) and treeS (signal, Cerenkov and scintillation) ROOT trees.
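The prepared file can be quickly inspected in a ROOT session, for example with the following sketch (tree names are taken from the description above):

#include <cstdio>
#include <TFile.h>
#include <TTree.h>

// Sketch: open the prepared input and print the tree structure and entry counts
void inspectInputSketch() {
  TFile* file = TFile::Open("tmva-input.root");
  TTree* treeS = (TTree*)file->Get("treeS");
  TTree* treeB = (TTree*)file->Get("treeB");
  treeS->Print();  // lists the per-bin branches
  printf("Signal entries: %lld, background entries: %lld\n",
         treeS->GetEntries(), treeB->GetEntries());
  file->Close();
}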

Training Stage

Next, we train the ML algorithms by providing them with the prepared "known" waveforms from the two categories:

./dual-readout-tmva --mode train <path-to-tmva-input-file>

During the training, the program outputs the ClassificationOutput.root file containing the training plot data, along with the weight files. To run the TMVA GUI and view the plots with the training history, one can use the following command:

./dual-readout-tmva --mode tmva-gui <path-to-classification-output-file>

where <path-to-classification-output-file> is the ClassificationOutput.root file path.
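Alternatively, the same set of plots can be opened with the standard TMVA GUI directly from a ROOT macro or prompt (a sketch, assuming ClassificationOutput.root is in the current folder):

#include <TMVA/TMVAGui.h>

void openGuiSketch() {
  // Opens the standard TMVA GUI window for the training output file
  TMVA::TMVAGui("ClassificationOutput.root");
}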

Classification Stage

Finally, to proceed with the classification stage, the program must be run with the following command line parameters:

./dual-readout-tmva --mode classify --weight <weight-folder> --test <test-folder>

where <weight-folder> is the directory path where the weight files are stored, and <test-folder> is the directory path containing the "unknown" waveforms to be classified.

The program outputs the classification information in the Terminal and additionally saves classification results in the output TMVApp.root file.

Conclusion

In this work, we successfully applied Machine Learning (ML) techniques to perform binary classification of the oscilloscope spectra based on their shape.

  • The first group of spectra is obtained from a crystal without scintillation centers (CUA sample #6). These waveforms contain only the Cerenkov signal.

  • The second group of spectra is obtained from a scintillator block (CUA sample #9). These waveforms represent the superimposed Cerenkov and scintillation tail signals.

The program utilizes the CERN ROOT TMVA framework to perform the classification. Currently, two classifier techniques are implemented: Boosted Decision Trees (BDT) and Deep Neural Networks (DNN). For each waveform from the testing set, the program outputs two floating-point numbers (one per classifier) corresponding to the probability of belonging to each group.

Classifier outputs for the DNN and BDT algorithms agree with each other for the vast majority of the analyzed spectra. Therefore, we can conclude that ML techniques can be successfully used for differentiating between the spectra shapes and can be applied in the particle analysis procedure.

One possible improvement of the approach outlined in this article is upgrading the algorithm to support multi-class classification. For instance, the Machine Learning algorithms can be trained to recognize the "baseline" spectra and attribute them to a third class of signals.

This would lift the necessity of filtering the "baseline" spectra out of the full data set before the training stage of the ML analysis.
