


# Creating Data

There are two ways of creating the tuning data file. You can either:
- [[Recommended] Run the shell createData.py executable](#Using-the-createData.py-executable); or
- [Interact with the python code](#Interacting-with-the-python-code).

A third way, which will make it possible to create the tuning data directly on the GRID, is under development.


## Optimal GRID PhysVal download

The standalone approaches assume that you have already downloaded the reconstruction output (or obtained it in some other standalone way).

If you need to download the PhysVals from the GRID, you can use the `scripts/grid_scripts/run_dump.py` script. This simple script downloads the PhysVals in batches and skims them, keeping only the needed TTrees and thus reducing their size.

A usage example that downloads a dataset and removes all trigger information:

```bash
run_dump.py --inDS user.wsfreund.mc14_13TeV.129160.Pythia8_AU2CTEQ6L1_perf_JF17.e3084_s2044_s2008_r5988.rr0003.ph0002_PhysVal/ --outFolder mergedJet --triggerList '' --numberOfSamplesPerPackage 250
```
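
The batching controlled by `--numberOfSamplesPerPackage` can be pictured with the short sketch below. It is only an illustration: the `split_into_packages` helper and the file names are hypothetical, and `run_dump.py` may organize its batches differently.

```python
def split_into_packages(files, samples_per_package):
    """Split a list of file names into consecutive batches of limited size."""
    if samples_per_package < 1:
        raise ValueError("samples_per_package must be a positive integer")
    return [files[i:i + samples_per_package]
            for i in range(0, len(files), samples_per_package)]

# Hypothetical file names, used for illustration only
files = ["PhysVal_%04d.root" % i for i in range(600)]
packages = split_into_packages(files, 250)

# Two full packages and a last, smaller one
print([len(p) for p in packages])  # [250, 250, 100]
```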


## Using the createData.py executable

This is the preferred and supported way of creating the data files. You will need the xAOD/PhysVal files with the rings in order to generate the data files. You can run the command after setting up the environment with `source setrootcore.sh` in the RootCore packages directory.

The help output should be self-explanatory; the most important options are `--sgnInputFiles`, `--bkgInputFiles`, `--operation`, `--reference` and `--treePath`.

If you want to run an $\eta$ and $E_T$ dependent discrimination, you can specify the bins through `--etaBins` and `--etBins`, respectively.
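
To make the role of the bin edges concrete, here is a minimal sketch of how a list of edges maps a value to a bin index. The `find_bin` helper is hypothetical, and the half-open convention `[low, high)` is an assumption made for illustration; the tool itself may treat the edges differently.

```python
from bisect import bisect_right

# Bin edges used in the examples of this document
et_bins = [0, 30, 50, 20000]          # E_T edges in GeV -> 3 bins
eta_bins = [0, 0.8, 1.37, 1.54, 2.5]  # |eta| edges      -> 4 bins

def find_bin(value, edges):
    """Return the index i such that edges[i] <= value < edges[i+1], or None."""
    if value < edges[0] or value >= edges[-1]:
        return None  # outside the binned range
    return bisect_right(edges, value) - 1

print(find_bin(25.0, et_bins), find_bin(1.0, eta_bins))  # 0 1
print(find_bin(60.0, et_bins), find_bin(2.0, eta_bins))  # 2 3
```

With N edges there are N - 1 bins, so the edges above give 3 $E_T$ bins and 4 $\eta$ bins.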


In [3]:
%%bash
createData.py

usage: createData.py -s SignalInputFiles [SignalInputFiles ...] -b BackgroundInputFiles [BackgroundInputFiles ...] -op OPERATION [-t TreePath [TreePath ...]]
                     [--reference {Off_CutID,Off_Likelihood,Truth} [{Off_CutID,Off_Likelihood,Truth} ...]] [-tEff EfficienciyTreePath [EfficienciyTreePath ...]] [-l1 L1EMCLUSCUT] [-l2 L2ETCUT]
                     [-off OFFETCUT] [--getRatesOnly] [--etBins ETBINS [ETBINS ...]] [--etaBins ETABINS [ETABINS ...]] [--ringConfig RINGCONFIG [RINGCONFIG ...]] [-nC NCLUSTERS] [-o OUTPUT]

Create TuningTool data from PhysVal.

optional arguments:
                        The output level for the main logger

Required arguments:

  -s SignalInputFiles [SignalInputFiles ...], --sgnInputFiles SignalInputFiles [SignalInputFiles ...]
                        The signal files that will be used to tune the discriminators
  -b BackgroundInputFiles [BackgroundInputFiles ...], --bkgInputFiles BackgroundInputFiles [BackgroundInputFiles ...]
           

Running one example:

In [2]:
%%bash
createData.py \
  -s /tmp/jodafons/TriggerTuning2016/samples/user.jodafons.mc14_13TeV.147406.PowhegPythia8_AZNLO_Zee.recon.RDO.rel20.7.3.6.e3059_s1982_s2008_r5993_rr0001_p1_PhysVal  \
  -b /tmp/jodafons/TriggerTuning2016/samples/user.jodafons.mc14_13TeV.129160.Pythia8_AU2CTEQ6L1_perf_JF17.recon.RDO.rel20.7.3.6.e3084_s2044_s2008_r5988.rr0001_p1_PhysVal \
  -op L2 \
  -o mc14_13TeV.147406.129160.sgn.offLikelihood.bkg.truth.trig.e24_lhmedium_nod0_L1EM20VH \
  --reference Off_Likelihood Truth \
  -l1 0 \
  -l2 19 \
  -t Trigger/HLT/Egamma/ZeeNtuple/e24_lhmedium_nod0_L1EM20VH \
     Trigger/HLT/Egamma/BackgroundNtuple/e24_lhmedium_nod0_L1EM20VH \
  --crossFile /afs/cern.ch/work/w/wsfreund/private/crossValid.pic.gz \
  --nClusters 1000 \
  --etBins 0 30  50  20000 \
  --etaBins 0  0.8   1.37  1.54  2.5
  #--output-level VERBOSE

Py.CreateData                           INFO Extracting signal dataset information...
Py.FilterEvents                         INFO There is available a total of 470350 entries.
Py.FilterEvents                         INFO L2CaloAccept_etBin0_etaBin0 : 100.000000 (60/60)
Py.FilterEvents                         INFO L2ElAccept_etBin0_etaBin0 : 100.000000 (60/60)
Py.FilterEvents                         INFO EFCaloAccept_etBin0_etaBin0 : 91.666667 (55/60)
Py.FilterEvents                         INFO EFElAccept_etBin0_etaBin0 : 91.666667 (55/60)
Py.FilterEvents                         INFO L2CaloAccept_etBin0_etaBin1 : 100.000000 (38/38)
Py.FilterEvents                         INFO L2ElAccept_etBin0_etaBin1 : 100.000000 (38/38)
Py.FilterEvents                         INFO EFCaloAccept_etBin0_etaBin1 : 71.052632 (27/38)
Py.FilterEvents                         INFO EFElAccept_etBin0_etaBin1 : 71.052632 (27/38)
Py.FilterEvents                         INFO L2CaloAccept_etBin0_etaBin2 : 100.0000

Note that the file `mc14_13TeV.147406.129160.sgn.offLikelihood.bkg.truth.trig.e24_lhmedium_nod0_L1EM20VH.npz` was created, containing the tuning data to be used by `TuningJob.py`. This file also contains the efficiency benchmarks, as we will see in [the information available on the `TuningDataArchieve`](#The-information-available-on-the-TuningDataArchieve).
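
The percentages printed in the log above are accepted/total counts per bin. The sketch below reproduces that bookkeeping with a hypothetical `EffCounter` class; it is a stand-in written for illustration and not the collector class used by TuningTools.

```python
class EffCounter(object):
    """Hypothetical pass/total counter (not the TuningTools implementation)."""

    def __init__(self):
        self.passed = 0
        self.total = 0

    def update(self, accepted):
        """Count one candidate and whether the benchmark accepted it."""
        self.total += 1
        if accepted:
            self.passed += 1

    def efficiency(self):
        """Efficiency in percent; 0 when nothing was counted."""
        return 100.0 * self.passed / self.total if self.total else 0.0

    def as_text(self):
        """Format in the same style as the log lines: percent (passed/total)."""
        return "%f (%d/%d)" % (self.efficiency(), self.passed, self.total)

counter = EffCounter()
for accepted in [True] * 55 + [False] * 5:
    counter.update(accepted)
print(counter.as_text())  # 91.666667 (55/60)
```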

In [3]:
%%bash
ls

CreateData.ipynb
mc14_13TeV.147406.129160.sgn.offLikelihood.bkg.truth.trig.e24_lhmedium_nod0_L1EM20VH.npz
tuningData.npz


## Interacting with the python code


If, instead, you want to call the Python code directly from within a script, use the `__call__` method of the `CreateData` class:

In [4]:
from TuningTools.CreateData import createData
help(createData.__call__)

Help on method __call__ in module TuningTools.CreateData:

__call__(self, sgnFileList, bkgFileList, ringerOperation, **kw) method of TuningTools.CreateData.CreateData instance
    Creates a numpy file ntuple with rings and its targets
    Arguments:
      - sgnFileList: A python list or a comma separated list of the root files
          containing the TuningTool TTree for the signal dataset
      - bkgFileList: A python list or a comma separated list of the root files
          containing the TuningTool TTree for the background dataset
      - ringerOperation: Set Operation type to be used by the filter
    Optional arguments:
      - output ['tuningData']: Name for the output file
      - referenceSgn [Reference.Truth]: Filter reference for signal dataset
      - referenceBkg [Reference.Truth]: Filter reference for background dataset
      - treePath [Set using operation]: set tree name on file, this may be set to
        use different sources then the default.
          Default for:


The following example gives an idea of how to use the `CreateData` class and its default object `createData`:

In [5]:
# %load ../scripts/skeletons/create_data.py
#!/usr/bin/env python
from TuningTools.FilterEvents import *
from TuningTools.CreateData import createData
from RingerCore.FileIO import save, load, expandFolders
from RingerCore.Logger import LoggingLevel

from TuningTools.CrossValid import CrossValidArchieve
with CrossValidArchieve( "/afs/cern.ch/work/w/wsfreund/private/crossValid.pic.gz" ) as CVArchieve:
  crossVal = CVArchieve
del CVArchieve

RatesOnly=False
etaBins  = [0, 0.8 , 1.37, 1.54, 2.5]
etBins   = [0,30, 50, 20000]# in GeV
output   = 'mc14_13TeV.147406.129160.sgn.offLikelihood.bkg.truth.trig.e24_lhmedium_nod0_L1EM20VH'
#basepath = '/afs/cern.ch/work/j/jodafons/public/Online/PhysVal/'
#bkgName  = 'user.jodafons.mc14_13TeV.129160.Pythia8_AU2CTEQ6L1_perf_JF17.recon.RDO.rel20.1.0.4.e3084_s2044_s2008_r5988.rr0104_a0001_PhysVal.root'
#sgnName  = 'user.jodafons.mc14_13TeV.147406.PowhegPythia8_AZNLO_Zee.recon.RDO.rel20.1.0.4.e3059_s1982_s2008_r5993_rr0104_a0001_PhysVal.root'
basepath = '/tmp/jodafons/TriggerTuning2016/samples'
bkgName  = 'user.jodafons.mc14_13TeV.129160.Pythia8_AU2CTEQ6L1_perf_JF17.recon.RDO.rel20.7.3.6.e3084_s2044_s2008_r5988.rr0001_p1_PhysVal'
sgnName  = 'user.jodafons.mc14_13TeV.147406.PowhegPythia8_AZNLO_Zee.recon.RDO.rel20.7.3.6.e3059_s1982_s2008_r5993_rr0001_p1_PhysVal'


createData( basepath + '/' + sgnName, 
            basepath + '/' + bkgName,
            RingerOperation.L2,
            referenceSgn    = Reference.Off_Likelihood,
            referenceBkg    = Reference.Truth,
            treePath        = ['Trigger/HLT/Egamma/ZeeNtuple/e24_lhmedium_nod0_L1EM20VH', \
                               'Trigger/HLT/Egamma/BackgroundNtuple/e24_lhmedium_nod0_L1EM20VH'],
            l1EmClusCut     = 20,
            l2EtCut         = 19,
            #level           = LoggingLevel.VERBOSE,
            nClusters       = 2000,
            getRatesOnly    = RatesOnly,
            etBins          = etBins,
            etaBins         = etaBins,
            crossVal        = crossVal )




Py.CreateData                           INFO Extracting signal dataset information...
Py.FilterEvents                         INFO There is available a total of 470350 entries.
Py.FilterEvents                         INFO L2CaloAccept_etBin0_etaBin0 : 100.000000 (116/116)
Py.FilterEvents                         INFO L2ElAccept_etBin0_etaBin0 : 100.000000 (116/116)
Py.FilterEvents                         INFO EFCaloAccept_etBin0_etaBin0 : 94.827586 (110/116)
Py.FilterEvents                         INFO EFElAccept_etBin0_etaBin0 : 93.965517 (109/116)
Py.FilterEvents                         INFO L2CaloAccept_etBin0_etaBin1 : 100.000000 (79/79)
Py.FilterEvents                         INFO L2ElAccept_etBin0_etaBin1 : 100.000000 (79/79)
Py.FilterEvents                         INFO EFCaloAccept_etBin0_etaBin1 : 81.012658 (64/79)
Py.FilterEvents                         INFO EFElAccept_etBin0_etaBin1 : 79.746835 (63/79)
Py.FilterEvents                         INFO L2CaloAccept_etBin0_etaBin2 : 

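This time the default output file, `tuningData.npz`, was created, since the `output` variable defined in the script is not passed to `createData`. You can check this with the `ls` command.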


In [6]:
%%bash
ls

CreateData.ipynb
mc14_13TeV.147406.129160.sgn.offLikelihood.bkg.truth.trig.e24_lhmedium_nod0_L1EM20VH.npz
tuningData.npz


### Using FilterEvents

In some very special cases, however, you might want to run the `FilterEvents` class directly, to obtain the rings of only some of the datasets.

Its documentation should cover all configuration options:

In [None]:
from TuningTools.FilterEvents import filterEvents
help(filterEvents.__call__)

In this case, consider the example:

In [None]:
%load ../scripts/analysis_scripts/Trigger_20_0_1_4/FilterEvents.py

Note that the previous script saved each bin in a separate file. It also exported them to MATLAB format:

In [25]:
%%bash
ls

CreateData.ipynb
mc14_13TeV.147406.129160.sgn.offLikelihood.bkg.truth.trig.e24_lhmedium_L1EM20VH_etBin_0_etaBin_0.mat
mc14_13TeV.147406.129160.sgn.offLikelihood.bkg.truth.trig.e24_lhmedium_L1EM20VH_etBin_0_etaBin_0.npz
mc14_13TeV.147406.129160.sgn.offLikelihood.bkg.truth.trig.e24_lhmedium_L1EM20VH_etBin_0_etaBin_1.mat
mc14_13TeV.147406.129160.sgn.offLikelihood.bkg.truth.trig.e24_lhmedium_L1EM20VH_etBin_0_etaBin_1.npz
mc14_13TeV.147406.129160.sgn.offLikelihood.bkg.truth.trig.e24_lhmedium_L1EM20VH_etBin_0_etaBin_2.mat
mc14_13TeV.147406.129160.sgn.offLikelihood.bkg.truth.trig.e24_lhmedium_L1EM20VH_etBin_0_etaBin_2.npz
mc14_13TeV.147406.129160.sgn.offLikelihood.bkg.truth.trig.e24_lhmedium_L1EM20VH_etBin_0_etaBin_3.mat
mc14_13TeV.147406.129160.sgn.offLikelihood.bkg.truth.trig.e24_lhmedium_L1EM20VH_etBin_0_etaBin_3.npz
mc14_13TeV.147406.129160.sgn.offLikelihood.bkg.truth.trig.e24_lhmedium_L1EM20VH_etBin_1_etaBin_0.mat
mc14_13TeV.147406.129160.sgn.offLikelihood.bkg.truth.trig.e24_lhmedium_L1E

## The information available on the TuningDataArchieve

We can explore the information available in the `TuningDataArchieve` by calling the `RingerCore.FileIO` `load` method directly on the file and checking its raw content:

In [7]:
from RingerCore.FileIO import load
f = load('tuningData.npz')
print f
f.keys()

<numpy.lib.npyio.NpzFile object at 0x275e850>


['signal_rings_etBin_0_etaBin_2',
 'signal_rings_etBin_0_etaBin_3',
 'signal_rings_etBin_0_etaBin_0',
 'signal_rings_etBin_0_etaBin_1',
 'background_rings_etBin_2_etaBin_0',
 'background_rings_etBin_2_etaBin_1',
 'background_rings_etBin_2_etaBin_2',
 'background_rings_etBin_2_etaBin_3',
 'signal_rings_etBin_2_etaBin_0',
 'signal_rings_etBin_2_etaBin_1',
 'signal_rings_etBin_2_etaBin_2',
 'signal_rings_etBin_2_etaBin_3',
 'background_efficiencies',
 'version',
 'signal_rings_etBin_1_etaBin_1',
 'signal_rings_etBin_1_etaBin_0',
 'signal_rings_etBin_1_etaBin_3',
 'signal_rings_etBin_1_etaBin_2',
 'type',
 'et_bins',
 'eta_bins',
 'background_cross_efficiencies',
 'background_rings_etBin_0_etaBin_2',
 'background_rings_etBin_0_etaBin_3',
 'background_rings_etBin_0_etaBin_0',
 'background_rings_etBin_0_etaBin_1',
 'background_rings_etBin_1_etaBin_1',
 'background_rings_etBin_1_etaBin_0',
 'background_rings_etBin_1_etaBin_3',
 'background_rings_etBin_1_etaBin_2',
 'signal_efficiencies',
 'si

In this version, the `signal` and `background` rings of each bin are saved in a dedicated entry, so that the rings of one bin can be retrieved without loading the full data into memory. The remaining information, which takes little memory, is not split into separate per-bin entries. Besides the rings, you will find the `et_bins` and `eta_bins` limits and the `background` and `signal` benchmark efficiencies. For the trigger, the benchmarks are `L2CaloAccept`, `L2ElAccept`, `EFCaloAccept` and `EFElAccept`; for the offline, they are `CutIDLoose`, `CutIDMedium`, `CutIDTight`, `LHLoose`, `LHMedium` and `LHTight`. These efficiencies are measured over the whole dataset and given with respect to the chosen reference. If you also provide the `CrossValid` object that will be used to tune the Ringer selector, the file additionally stores the `background` and `signal` efficiencies on the cross-validation datasets, so that efficiencies can be compared on exactly the same datasets.
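
The per-bin entries follow the naming pattern visible in the key listing above. A small sketch of building and parsing such keys is given below; the `ring_key` and `parse_ring_key` helpers are hypothetical, and the pattern is inferred only from that listing.

```python
import re

# Pattern inferred from the keys shown above (an assumption, not a documented format)
KEY_PATTERN = re.compile(r"^(signal|background)_rings_etBin_(\d+)_etaBin_(\d+)$")

def ring_key(kind, et_bin, eta_bin):
    """Build the key of the rings entry for one dataset kind and one bin."""
    return "%s_rings_etBin_%d_etaBin_%d" % (kind, et_bin, eta_bin)

def parse_ring_key(key):
    """Return (kind, et_bin, eta_bin) for a rings key, or None for other keys."""
    match = KEY_PATTERN.match(key)
    if match is None:
        return None
    return match.group(1), int(match.group(2)), int(match.group(3))

keys = ['signal_rings_etBin_0_etaBin_2', 'background_efficiencies',
        'et_bins', 'background_rings_etBin_2_etaBin_0']

# Keep only the rings entries; the other keys are not per-bin entries
rings = sorted(k for k in map(parse_ring_key, keys) if k is not None)
print(rings)  # [('background', 2, 0), ('signal', 0, 2)]
print(ring_key('signal', 1, 3))  # signal_rings_etBin_1_etaBin_3
```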

One more important note: when only one bin is requested, the context manager loads and keeps only that bin's information in memory, as the following code shows:

In [16]:
from TuningTools.CreateData import TuningDataArchieve
with TuningDataArchieve("tuningData.npz", eta_bin = 0, et_bin = 0) as data: pass
print 'Background L2Calo bin(et=0,eta=0) efficiency is: ', data['background_efficiencies']['L2CaloAccept'].eff_str()
print 'Background HLT bin(et=0,eta=0) efficiency is: ', data['background_efficiencies']['EFElAccept'].eff_str()
print 'Signal L2Calo bin(et=0,eta=0) efficiency is: ', data['signal_efficiencies']['L2CaloAccept'].eff_str()
print 'Signal HLT bin(et=0,eta=0) efficiency is: ', data['signal_efficiencies']['EFElAccept'].eff_str()
data

Background L2Calo bin(et=0,eta=0) efficiency is:  18.529412 (63/340)
Background HLT bin(et=0,eta=0) efficiency is:  0.882353 (3/340)
Signal L2Calo bin(et=0,eta=0) efficiency is:  100.000000 (116/116)
Signal HLT bin(et=0,eta=0) efficiency is:  93.965517 (109/116)


{'background_cross_efficiencies': OrderedDict([('L2CaloAccept',
               <TuningTools.FilterEvents.BranchCrossEffCollector at 0x637e510>),
              ('L2ElAccept',
               <TuningTools.FilterEvents.BranchCrossEffCollector at 0x6385fd0>),
              ('EFCaloAccept',
               <TuningTools.FilterEvents.BranchCrossEffCollector at 0x5bf78d0>),
              ('EFElAccept',
               <TuningTools.FilterEvents.BranchCrossEffCollector at 0x56752d0>)]),
 'background_efficiencies': OrderedDict([('L2CaloAccept',
               <TuningTools.FilterEvents.BranchEffCollector at 0x2e1c4d0>),
              ('L2ElAccept',
               <TuningTools.FilterEvents.BranchEffCollector at 0x2e1c5d0>),
              ('EFCaloAccept',
               <TuningTools.FilterEvents.BranchEffCollector at 0x2e1c790>),
              ('EFElAccept',
               <TuningTools.FilterEvents.BranchEffCollector at 0x2e1c650>)]),
 'background_rings': array([[  2.13978439e+02,   5.42900635e+02,   1

We can also skip requesting a specific bin and load the full data for all bins in the file instead:

In [17]:
from TuningTools.CreateData import TuningDataArchieve
with TuningDataArchieve("tuningData.npz") as data: pass
print 'Background L2Calo bin(et=1,eta=0) efficiency is: ', data['background_efficiencies']['L2CaloAccept'][1][0].eff_str()
print 'Background HLT bin(et=1,eta=0) efficiency is: ', data['background_efficiencies']['EFElAccept'][1][0].eff_str()
print 'Signal L2Calo bin(et=1,eta=0) efficiency is: ', data['signal_efficiencies']['L2CaloAccept'][1][0].eff_str()
print 'Signal HLT bin(et=1,eta=0) efficiency is: ', data['signal_efficiencies']['EFElAccept'][1][0].eff_str()
data

Background L2Calo bin(et=1,eta=0) efficiency is:  25.490196 (52/204)
Background HLT bin(et=1,eta=0) efficiency is:  0.980392 (2/204)
Signal L2Calo bin(et=1,eta=0) efficiency is:  100.000000 (502/502)
Signal HLT bin(et=1,eta=0) efficiency is:  98.804781 (496/502)


{'background_cross_efficiencies': OrderedDict([('L2CaloAccept',
               [[<TuningTools.FilterEvents.BranchCrossEffCollector at 0x6b21610>,
                 <TuningTools.FilterEvents.BranchCrossEffCollector at 0x5653410>,
                 <TuningTools.FilterEvents.BranchCrossEffCollector at 0x72e89d0>,
                 <TuningTools.FilterEvents.BranchCrossEffCollector at 0x72ff3d0>],
                [<TuningTools.FilterEvents.BranchCrossEffCollector at 0x72f2d90>,
                 <TuningTools.FilterEvents.BranchCrossEffCollector at 0x73037d0>,
                 <TuningTools.FilterEvents.BranchCrossEffCollector at 0x731d1d0>,
                 <TuningTools.FilterEvents.BranchCrossEffCollector at 0x731cb90>],
                [<TuningTools.FilterEvents.BranchCrossEffCollector at 0x7307550>,
                 <TuningTools.FilterEvents.BranchCrossEffCollector at 0x7310f50>,
                 <TuningTools.FilterEvents.BranchCrossEffCollector at 0x72f7910>,
                 <TuningTools.Fi

Please note that this is generally not needed, since the information of each bin is treated separately during tuning.
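
A bin-by-bin loop is one way to work with the data while holding a single bin in memory at a time. The sketch below only illustrates the iteration order: `iterate_bins` and `fake_load_bin` are hypothetical helpers, and the loader is a stand-in so that the example runs without TuningTools installed.

```python
from itertools import product

def iterate_bins(n_et_bins, n_eta_bins, load_bin):
    """Yield (et_bin, eta_bin, data), loading one bin at a time."""
    for et_bin, eta_bin in product(range(n_et_bins), range(n_eta_bins)):
        yield et_bin, eta_bin, load_bin(et_bin, eta_bin)

def fake_load_bin(et_bin, eta_bin):
    # Stand-in loader. A real one could open the archive for a single bin,
    # as done earlier with TuningDataArchieve(..., eta_bin=..., et_bin=...).
    return {"et_bin": et_bin, "eta_bin": eta_bin}

# 3 E_T bins and 4 eta bins, as in the examples of this document
visited = [(et, eta) for et, eta, _ in iterate_bins(3, 4, fake_load_bin)]
print(len(visited))  # 12
print(visited[:5])   # [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0)]
```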
