# Tuning the Discriminator

*This documentation is still being written.*

The Ringer discriminator can be tuned either standalone or on the GRID. The latter is only available if you have installed the TuningTools with cvmfs access.

To run a standalone tuning, you can either:

- [Recommended] use the [executable](#Using-the-tuning-shell-command); or
- run a [python script](#Running-through-python-script).

To send the job to the GRID, the only option is to run the [executable command that uploads the job](#Dispatching-the-job-to-the-GRID).


# Tuning on the GRID

Running the tuning job on the GRID requires more steps than running it standalone. It is not possible to configure each job through job arguments, so a configuration file must be created for each GRID job. This allows the panda job to divide the job into subsets.

To do so, we first have to [create the configuration data](#Creating-configuration-data) and then export it so that it is [available on the GRID](#Exporting-data-to-the-GRID). Only after these steps is it possible to [dispatch the job to the GRID](#Dispatching-the-job-to-the-GRID).

## Creating configuration data

The configuration data is used by the panda pilot to divide the job into subsets. The user needs to specify the discriminator parameter range (e.g. the number of neurons in the hidden layer, in the case of neural networks), the number of initializations, the number of cross-validation sorts, and how many of each will be run in a single job on the GRID. In addition, the tuning process needs a single CrossValidation object shared by all jobs, which is also provided through a configuration file. Finally, the pre-processing is currently also provided via a configuration file.
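To make the splitting concrete, here is a minimal sketch of how a parameter grid could be divided into sub-jobs. It is not the TuningTools implementation: the function names `chunk_bounds` and `split_into_jobs` are hypothetical, and treating every bound as inclusive is an assumption (it mirrors the lower/upper initialization fields seen in the file names of Example 1).

```python
from itertools import product


def chunk_bounds(first, last, per_job):
    """Split the inclusive range [first, last] into inclusive (lower, upper) chunks."""
    return [(lo, min(lo + per_job - 1, last))
            for lo in range(first, last + 1, per_job)]


def split_into_jobs(neuron_bounds, n_sorts, n_inits,
                    neurons_per_job=1, sorts_per_job=1, inits_per_job=50):
    """Enumerate one dict per sub-job covering the whole parameter grid.

    Assumption: neuron_bounds is an inclusive (lower, upper) pair, while
    sorts and initializations are numbered from 0.
    """
    neuron_chunks = chunk_bounds(neuron_bounds[0], neuron_bounds[1], neurons_per_job)
    sort_chunks = chunk_bounds(0, n_sorts - 1, sorts_per_job)
    init_chunks = chunk_bounds(0, n_inits - 1, inits_per_job)
    return [{"neurons": n, "sorts": s, "inits": i}
            for n, s, i in product(neuron_chunks, sort_chunks, init_chunks)]


jobs = split_into_jobs((5, 7), n_sorts=10, n_inits=100)
print(len(jobs))  # 3 neuron chunks x 10 sort chunks x 2 init chunks
print(jobs[0])
print(jobs[1])
```

Choosing smaller chunks gives more, shorter sub-jobs; larger chunks give fewer, longer ones.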

To generate all this information, the user can use the `createTuningJobFiles.py` executable.


In [1]:
%%bash
createTuningJobFiles.py 

usage: createTuningJobFiles.py [-h] [--compress [COMPRESS]] [-outJobConfig JOBCONFIFILESOUTPUTFOLDER] [--neuronBounds NEURONBOUNDS [NEURONBOUNDS ...]] [--sortBounds SORTBOUNDS [SORTBOUNDS ...]]
                               [--nInits [NINITS]] [--nNeuronsPerJob NNEURONSPERJOB] [--nSortsPerJob NSORTSPERJOB] [--nInitsPerJob NINITSPERJOB] [-outCross CROSSVALIDOUTPUTFILE] [-ns NSORTS]
                               {ConfigFiles,CrossValidFile,_ignoreCase,all,ppFile} [{ConfigFiles,CrossValidFile,_ignoreCase,all,ppFile} ...]

Generate input file for TuningTool on GRID

positional arguments:
  {ConfigFiles,CrossValidFile,_ignoreCase,all,ppFile}
                        Which kind of files to create. You can choose one or more of the available choices, just don't use all with the other available choices.

optional arguments:
  -h, --help            show this help message and exit
  --compress [COMPRESS]
                        Whether to compress files or not.

JobConfig Files Creation Options

If more information is needed, check the [available example](#Example-1:-Creating-configuration-files).

## Exporting data to the GRID

After creating the files, they must be uploaded to the GRID so that the panda pilot and the sub-jobs can retrieve their information. The `add_container.sh` script helps with this task:


In [12]:
%%bash
add_container.sh -h

Usage: add_container.sh [-hv[verbosity_level=1]] 
                -f|--file[=] INPUTFILE | -f|--file INPUTFILES
                --dataset[=] DATASET
                [--rse[=] RSE 'CERN-PROD_SCRATCHDISK']
                [--useDQ2=1]

Create dataset on grid containing input local files at specified rse.
IMPORTANT: You need to have grid environment set.

    -h             display this help and exit
    -f INPUTFILES  files to upload to grid container. If one file is a directory,
                   it will be expanded using all non diretory files inside it.
    -v             verbose mode. Can be used multiple times for increased
                   verbosity.
    --dataset      the dataset name. It must be specified using
                   user.account.datasetname.
    --rse          The rse to upload the files and put the dataset
    --useDQ2       If set to true, then DQ2 will be used instead of rucio.


A usage example:

```
add_container.sh --file jobConfig --dataset config.nn5to7_sorts50_1by1_inits100_50by50 --rse BNL-OSG2_SCRATCHDISK
```

Documentation on rucio usage, focused on the commands needed by the TuningTools, is [available here](http://nbviewer.jupyter.org/gist/wsfreund/249b5db998fb4594f800).
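The help text of `add_container.sh` states that a directory passed to `-f` is expanded into the non-directory files inside it. If you want to preview which files would be picked up before uploading, a small sketch such as the one below may help. The helper `expand_inputs` is hypothetical, and it assumes the expansion is not recursive, which the help text does not spell out.

```python
import os
import tempfile


def expand_inputs(paths):
    """Return the files to upload: plain paths are kept as given, directories
    are replaced by the non-directory entries directly inside them."""
    files = []
    for path in paths:
        if os.path.isdir(path):
            for name in sorted(os.listdir(path)):
                candidate = os.path.join(path, name)
                if not os.path.isdir(candidate):
                    files.append(candidate)
        else:
            files.append(path)
    return files


# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    job_dir = os.path.join(tmp, "jobConfig")
    os.makedirs(os.path.join(job_dir, "nested"))  # sub-directory: skipped
    for name in ("job.a.pic.gz", "job.b.pic.gz"):
        open(os.path.join(job_dir, name), "w").close()
    expanded = [os.path.basename(p) for p in expand_inputs([job_dir])]
    print(expanded)
```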

## Dispatching the job to the GRID

This is done via the command `runGRIDtuning.py`. The basic arguments that must be provided are:
- `--dataDS`: the uploaded container with the data file;
- `--crossValidDS`: the container with the CrossValidation file;
- `--configFileDS`: the container with the job configuration files, used by the pilot to divide the jobs;
- `--ppDS`: the container with the pre-processing chain.



In [13]:
!runGRIDtuning.py

usage: runGRIDtuning.py [-h] [--show-evo SHOW_EVO] [--max-fail MAX_FAIL] [--epochs EPOCHS] [--do-perf DO_PERF] [--batch-size BATCH_SIZE] [--algorithm-name ALGORITHM_NAME] [--network-arch NETWORK_ARCH]
                        [--cost-function COST_FUNCTION] [--shuffle SHUFFLE] [--seed SEED] [--do-multi-stop DO_MULTI_STOP] -d DATA [DATA ...] -c Config_DS [Config_DS ...] -pp PP_DS [PP_DS ...] -x
                        CrossValid_DS [CrossValid_DS ...] [--et-bins ET_BINS [ET_BINS ...]] [--eta-bins ETA_BINS [ETA_BINS ...]] [--secondaryDSs GRID_SECONDARYDS [GRID_SECONDARYDS ...]] --outDS
                        GRID_OUTDS [--site [GRID_SITE]] [--excludedSite [GRID_EXCLUDEDSITE]] [--debug] [--excludeFile [GRID_EXCLUDEFILE]] [--disableAutoRetry] [--extFile [GRID_EXTFILE]]
                        [--maxNFilesPerJob [GRID_MAXNFILESPERJOB]] [--cloud [GRID_CLOUD]] [--nGBPerJob [GRID_NGBPERJOB]] [--skipScout] [--memory GRID_MEMORY] [--useNewCode] [--dry-run]

Tune discriminators using input 

An example is available [here](#Example-3:-Dispatching-job-to-the-GRID).
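Since the dispatch needs four input containers plus an output dataset, it can be convenient to assemble the argument list in Python and check that nothing is missing before calling the command. The sketch below only uses the short options that appear in the usage text above (`-d`, `-c`, `-pp`, `-x`, `--outDS`); the function `build_dispatch_args` and the dataset names are hypothetical placeholders, and the command itself is not executed.

```python
def build_dispatch_args(data_ds, config_ds, pp_ds, cross_valid_ds, out_ds, extra=()):
    """Assemble the argument list for runGRIDtuning.py, refusing empty containers."""
    required = {
        "-d": data_ds,          # container with the data file
        "-c": config_ds,        # container with the job configuration files
        "-pp": pp_ds,           # container with the pre-processing chain
        "-x": cross_valid_ds,   # container with the CrossValidation file
        "--outDS": out_ds,      # output dataset name
    }
    missing = [flag for flag, value in required.items() if not value]
    if missing:
        raise ValueError("missing containers for: " + ", ".join(missing))
    args = ["runGRIDtuning.py"]
    for flag, value in required.items():
        args += [flag, value]
    return args + list(extra)


# Placeholder dataset names, for illustration only.
args = build_dispatch_args(
    data_ds="user.someone.data",
    config_ds="user.someone.config",
    pp_ds="user.someone.pp",
    cross_valid_ds="user.someone.crossValid",
    out_ds="user.someone.tuned",
    extra=["--debug"],
)
print(" ".join(args))
```

The resulting list could be handed to `subprocess.call(args)` once the grid environment is set.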

# Tuning on Standalone

Most of the TuningTools functionality can be accessed through python scripts or a shell command. The executable is the recommended way to run a standalone job.

## Using the tuning shell command

This is the recommended way to interact with the tuning job. Use the command `runTuning.py`:

In [15]:
%%bash
runTuning.py

usage: runTuning.py -d data [-x CROSSFILE] [-c CONFFILELIST [CONFFILELIST ...]] [--neuronBounds NEURONBOUNDS [NEURONBOUNDS ...]] [--sortBounds SORTBOUNDS [SORTBOUNDS ...]]
                    [--initBounds INITBOUNDS [INITBOUNDS ...]] [--ppFileList PPFILELIST [PPFILELIST ...]] [--et-bins ET_BINS [ET_BINS ...]] [--eta-bins ETA_BINS [ETA_BINS ...]] [--no-compress]
                    [--show-evo SHOW_EVO] [--max-fail MAX_FAIL] [--epochs EPOCHS] [--do-perf DO_PERF] [--batch-size BATCH_SIZE] [--algorithm-name ALGORITHM_NAME] [--network-arch NETWORK_ARCH]

Tune discriminators using input data.

Required arguments:

  -d data, --data data  The data file that will be used to tune the discriminators

Optional arguments:

  --no-compress         Don't compress output files.

Cross-validation configuration:

  -x CROSSFILE, --crossFile CROSSFILE
                        The cross-validation file path, pointing to a file created with the create tuning job files

Looping configuration:

  -c CONFFI

An example is available [here](#Example-2:-Standalone-job-via-executable-command).

## Running through python script

Alternatively, you can access the `TuningJob` class directly and call it from a python script. The `__call__` method documentation covers all available options:

In [1]:
from TuningTools.TuningJob import TuningJob
help(TuningJob.__call__)

Help on method __call__ in module TuningTools.TuningJob:

__call__(self, dataLocation, **kw) unbound TuningTools.TuningJob.TuningJob method
    Run discrimination tuning for input data created via CreateData.py
    Arguments:
      - dataLocation: A string containing a path to the data file written
        by CreateData.py
    Mutually exclusive optional arguments: Either choose the cross (x) or
      circle (o) of the following block options.
     -------
      x crossValid [CrossValid( nSorts=50, nBoxes=10, nTrain=6, nValid=4, 
                                seed=crossValidSeed )]:
        The cross-validation sorts object. The files can be generated using a
        CreateConfFiles instance which can be accessed via command line using
        the createTuningJobFiles.py script.
      x crossValidSeed [None]: Only used when not specifying the crossValid option.
        The seed is used by the cross validation random sort generator and
        when not specified or specified as None, 

See an example [here](#Example-4:-Standalone-job-via-python-script).
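The help text above mentions a default cross-validation object with `nSorts=50`, `nBoxes=10`, `nTrain=6` and `nValid=4`. As an illustration of what such a configuration means, the sketch below draws seeded sorts that assign box indices to training and validation. It is a simplified stand-in, not the `CrossValid` class: the function `make_sorts` is hypothetical and the shuffling strategy is an assumption.

```python
import random


def make_sorts(n_sorts=50, n_boxes=10, n_train=6, n_valid=4, seed=None):
    """Return one (train_boxes, valid_boxes) pair of box indices per sort."""
    if n_train + n_valid != n_boxes:
        raise ValueError("n_train + n_valid must equal n_boxes")
    rng = random.Random(seed)  # private generator, global random state untouched
    sorts = []
    for _ in range(n_sorts):
        boxes = list(range(n_boxes))
        rng.shuffle(boxes)
        sorts.append((sorted(boxes[:n_train]), sorted(boxes[n_train:])))
    return sorts


sorts = make_sorts(seed=66)
train_boxes, valid_boxes = sorts[0]
print(len(sorts), len(train_boxes), len(valid_boxes))
```

Using the same seed reproduces the same sorts, which is why a single cross-validation object can be shared by every job.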

# The information available on the TunedDiscrArchieve

*Still to be written*

# Examples

## Example 1: Creating configuration files

Suppose we need to tune neural networks with the number of hidden layer neurons $N_H\in\{5,6,7\}$, using a total of 10 sorts on 10 data subsets, 6 of them reserved for training and 4 for validation. The sorts will have the pseudo-random number generator seeded by the integer 10. A total of 100 initializations should be tuned to avoid local optima. Each sub-job should run 1 hidden neuron configuration, a single data sort and 50 initializations. Finally, the pre-processing chain will use the MapStd normalization.

This can be achieved by using the following parameters:

In [9]:
%%bash
createTuningJobFiles.py all \
                        --neuronBounds 5 7 \
                        --sortBounds 10 \
                        --nInits 100 \
                        --nNeuronsPerJob 1 \
                        --nSortsPerJob 1 \
                        --nInitsPerJob 50 \
                        -ns 50 \
                        -nb 10 \
                        -ntr 6 \
                        -nval 4 \
                        -seed 10 \
                        -ppCol "[[MapStd()]]"

Py.__main__                             INFO Creating configuration files at folder jobConfig
Py.CreateTuningJobFiles                 INFO Saved job option configuration at path: jobConfig/job.hn0005.s0000.il0000.iu0049.pic.gz
Py.CreateTuningJobFiles                 INFO Saved job option configuration at path: jobConfig/job.hn0005.s0000.il0050.iu0099.pic.gz
Py.CreateTuningJobFiles                 INFO Saved job option configuration at path: jobConfig/job.hn0005.s0001.il0000.iu0049.pic.gz
Py.CreateTuningJobFiles                 INFO Saved job option configuration at path: jobConfig/job.hn0005.s0001.il0050.iu0099.pic.gz
Py.CreateTuningJobFiles                 INFO Saved job option configuration at path: jobConfig/job.hn0005.s0002.il0000.iu0049.pic.gz
Py.CreateTuningJobFiles                 INFO Saved job option configuration at path: jobConfig/job.hn0005.s0002.il0050.iu0099.pic.gz
Py.CreateTuningJobFiles                 INFO Saved job option configuration at path: jobConfig/job.hn0005.s0

This created the `jobConfig` folder in the current path, which contains the sub-job parameters, the `crossValid.pic.gz` file containing the CrossValidation sorts, and the `ppFile_*.pic.gz` file with the pre-processing information.

In [11]:
%%bash
ls jobConfig crossValid.pic* ppFile*.pic*

crossValid.pic.gz
ppFile_MapStd.pic.gz

jobConfig:
job.hn0005.s0000.il0000.iu0049.pic.gz
job.hn0005.s0000.il0050.iu0099.pic.gz
job.hn0005.s0001.il0000.iu0049.pic.gz
job.hn0005.s0001.il0050.iu0099.pic.gz
job.hn0005.s0002.il0000.iu0049.pic.gz
job.hn0005.s0002.il0050.iu0099.pic.gz
job.hn0005.s0003.il0000.iu0049.pic.gz
job.hn0005.s0003.il0050.iu0099.pic.gz
job.hn0005.s0004.il0000.iu0049.pic.gz
job.hn0005.s0004.il0050.iu0099.pic.gz
job.hn0005.s0005.il0000.iu0049.pic.gz
job.hn0005.s0005.il0050.iu0099.pic.gz
job.hn0005.s0006.il0000.iu0049.pic.gz
job.hn0005.s0006.il0050.iu0099.pic.gz
job.hn0005.s0007.il0000.iu0049.pic.gz
job.hn0005.s0007.il0050.iu0099.pic.gz
job.hn0005.s0008.il0000.iu0049.pic.gz
job.hn0005.s0008.il0050.iu0099.pic.gz
job.hn0005.s0009.il0000.iu0049.pic.gz
job.hn0005.s0009.il0050.iu0099.pic.gz
job.hn0006.s0000.il0000.iu0049.pic.gz
job.hn0006.s0000.il0050.iu0099.pic.gz
job.hn0006.s0001.il0000.iu0049.pic.gz
job.hn0006.s0001.il0050.iu0099.pic.gz
job.hn0006.s0002.il0000.iu0049.pic.gz
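When inspecting a folder like the one listed above, it can help to recover the parameters encoded in each file name. The sketch below parses the naming pattern seen in this listing. Reading `hn` as the hidden neuron count, `s` as the sort and `il`/`iu` as the lower and upper initialization indices is inferred from the example, and `parse_job_file` is a hypothetical helper; names produced with other per-job settings may look different.

```python
import re

# Pattern inferred from the file names in the listing above.
JOB_FILE_PATTERN = re.compile(
    r"job\.hn(?P<neuron>\d{4})\.s(?P<sort>\d{4})"
    r"\.il(?P<init_lower>\d{4})\.iu(?P<init_upper>\d{4})\.pic\.gz$"
)


def parse_job_file(name):
    """Extract the integer fields encoded in a job configuration file name."""
    match = JOB_FILE_PATTERN.search(name)
    if match is None:
        raise ValueError("not a job configuration file name: %r" % name)
    return {key: int(value) for key, value in match.groupdict().items()}


info = parse_job_file("jobConfig/job.hn0005.s0001.il0050.iu0099.pic.gz")
print(info)
```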

## Example 2: Standalone job via executable command

Suppose that we want to tune the data created [here](http://nbviewer.jupyter.org/github/wsfreund/TuningTools/blob/master/doc/CreateData.ipynb#Example-1:-Creating-data-using-the-command-line) using 16 hidden layer neurons, 4 initializations and the first sort. In this job we also want to tune the $E_T$ bins from 1 to 2 and the $\eta$ bins from 1 to 3. The command would be the following:

In [36]:
 !runTuning.py -d tuningtoolData.npz \
    --output-level INFO \
    --neuronBounds 16 16 \
    --initBounds 0 4 \
    --sortBounds 0 1 \
    --et-bins 1 2 \
    --eta-bins 1 3

TuningToolPyWrapper                     INFO Changing pseudo-random number generator seed to (1457565886).
Py.TuningJob                            INFO Opening data (etBin=1,etaBin=1) ...
Py.TuningJob                            INFO Tunning Et bin: array([ 30.,  50.], dtype=float32)
Py.TuningJob                            INFO Tunning eta bin: array([ 0.80000001,  1.37      ], dtype=float32)
Py.TuningJob                            INFO Running configuration file number 0 (etBin=1,etaBin=1) 
Py.TuningJob                            INFO Extracting cross validation sort 0 (etBin=1,etaBin=1) 
Py.CrossValid                           INFO Train      #Events/class: [926, 505]
Py.CrossValid                           INFO Validation #Events/class: [617, 336]
Py.TuningJob                            INFO Tuning pre-processing chain (Norm1)...
Py.TuningJob                            INFO Applying pre-processing chain...
Py.TuningJob                            INFO Training <Neuron = 16, sort = 0, 

## Example 3: Dispatching job to the GRID

Consider the follow

In [None]:
!runGRIDtuning.py -d user.wsfreund.mc14_13TeV.147406.129160.sgn.offLikelihood.bkg.truth.trig.e24_lhmedium_nod0_l1etcut20_l2etcut19_efetcut24_binned.pic.npz \
                -pp user.jodafons:user.jodafons.Norm1 \
                -c user.wsfreund.config.nn5to20_sorts50_1by1_inits100_100by100_upload \
                -x user.jodafons.crossVal_50sorts_20160302.pic.gz \
                -o user.wsfreund.nn.mc14_13TeV.147406.129160.sgn.offLH.bkg.truth.trig.wf.e24_lhmedium_nod0_l1et20_l2et19_efet24_binned_debug \
                -otar workarea.tgz \
                --debug

## Example 4: Standalone job via python script

Here, we run a job using 5 initializations of neural networks containing 15 hidden layer neurons. We divide the training and validation datasets using sort number `0` of the "default" CrossValidation object seeded by the integer `66`. The neural networks are trained for a total of 100 epochs. Since the goal is to measure time performance, no attempt is made to stop the training early to avoid over-training.
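To see why setting `maxFail` equal to `epochs` keeps the training running for every epoch, consider the sketch below. It assumes that `maxFail` is the number of consecutive epochs without improvement on the validation set that is tolerated before stopping; this reading is an assumption, and `train_with_max_fail` is a hypothetical helper working on a synthetic error curve rather than on a real network.

```python
def train_with_max_fail(validation_errors, epochs, max_fail):
    """Return (epochs_run, best_epoch) for a sequence of per-epoch validation errors."""
    best_error = float("inf")
    best_epoch = -1
    fails = 0
    epochs_run = 0
    for epoch, error in enumerate(validation_errors[:epochs]):
        epochs_run = epoch + 1
        if error < best_error:
            # Improvement: remember it and reset the failure counter.
            best_error, best_epoch, fails = error, epoch, 0
        else:
            fails += 1
            if fails >= max_fail:
                break
    return epochs_run, best_epoch


# Synthetic validation curve: improves for 10 epochs, then degrades.
errors = [1.0 / (e + 1) for e in range(10)] + [0.2 + 0.01 * e for e in range(90)]
early = train_with_max_fail(errors, epochs=100, max_fail=5)
full = train_with_max_fail(errors, epochs=100, max_fail=100)
print(early)  # (15, 9): stops after 5 consecutive failures
print(full)   # (100, 9): the failure limit is never reached
```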

In [14]:
# %load ../scripts/skeletons/time_test.py
#!/usr/bin/env python

# TODO Improve skeleton documentation

from timeit import default_timer as timer

start = timer()

DatasetLocationInput = '/afs/cern.ch/work/j/jodafons/public/validate_tuningtool/mc14_13TeV.147406.129160.sgn.offLikelihood.bkg.truth.trig.e24_lhmedium_L1EM20VH_etBin_0_etaBin_0.npz'

#try:
from RingerCore.Logger import Logger, LoggingLevel
mainLogger = Logger.getModuleLogger(__name__)
mainLogger.info("Entering main job.")

from TuningTools.TuningJob import TuningJob
tuningJob = TuningJob()

from TuningTools.PreProc import *

basepath = '/afs/cern.ch/work/j/jodafons/public'

tuningJob( DatasetLocationInput, 
           neuronBoundsCol = [15, 15], 
           sortBoundsCol = [0, 1],
           initBoundsCol = 5, 
           #confFileList = basepath + '/user.wsfreund.config.nn5to20_sorts50_1by1_inits100_100by100/job.hn0015.s0040.il0000.iu0099.pic.gz',
           #ppFileList = basepath+'/user.wsfreund.Norm1/ppFile_pp_Norm1.pic.gz',
           #crossValidFile = basepath+'/user.wsfreund.CrossValid.50Sorts.seed_0/crossValid.pic.gz',
           epochs = 100,
           showEvo = 0,
           #algorithmName= 'rprop',
           #doMultiStop = True,
           #doPerf = True,
           maxFail = 100,
           #seed = 0,
           ppCol = PreProcCollection( PreProcChain( MapStd() ) ),
           crossValidSeed = 66,
           level = LoggingLevel.DEBUG )

mainLogger.info("Finished.")

end = timer()

print 'execution time is: ', (end - start)      


Py.__main__                             INFO Entering main job.
Py.CrossValid                          DEBUG Retrieved the following configuration:
[('nBoxes', 10), ('nSorts', 50), ('nTrain', 6), ('nValid', 4), ('seed', 66)]
Py.TuningJob                            INFO Opening data...
Py.TuningJob                            INFO Running configuration file number 0
Py.TuningJob                            INFO Extracting cross validation sort 0
Py.CrossValid                           INFO Train      #Events/class: [11626, 3386]
Py.CrossValid                           INFO Validation #Events/class: [7751, 2257]
Py.TuningJob                            INFO Tuning pre-processing chain (MapStd)...
Py.TuningJob                           DEBUG Done tuning pre-processing chain!
Py.TuningJob                            INFO Applying pre-processing chain...
Py.TuningJob                           DEBUG Done applying the pre-processing chain!
Py.TuningWrapper                       DEBUG Set batchSiz

execution time is:  93.2350001335
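The log above shows the pre-processing chain being tuned on the data and then applied. A minimal sketch of that tune-then-apply pattern is given below, assuming that MapStd behaves like the usual standardization to zero mean and unit standard deviation per feature. The functions `fit_standardizer` and `apply_standardizer` are hypothetical and do not reproduce the TuningTools `MapStd` class.

```python
from statistics import mean, pstdev


def fit_standardizer(train_rows):
    """Compute the per-feature mean and standard deviation from the training rows."""
    columns = list(zip(*train_rows))
    means = [mean(col) for col in columns]
    # Constant features have zero deviation: use 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def apply_standardizer(rows, means, stds):
    """Shift and scale each feature with the parameters tuned on the training set."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


train = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
valid = [[4.0, 10.0]]
means, stds = fit_standardizer(train)
print(apply_standardizer(train, means, stds))
print(apply_standardizer(valid, means, stds))
```

The parameters are computed from the training rows only and then reused for the validation rows, so no validation information leaks into the normalization.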

