# How to Use Orchestra?

This notebook shows how to set up and use the Orchestra service on the LPS caloba cluster.

## 1) Before we start:

- Log in to the caloba cluster using your account;
- Create an `images` directory in your home directory;
- Download the ringer container into the `images` directory using this command: `singularity pull docker://jodafons/ringer:base`
- Download the orchestra container into the `images` directory using this command: `singularity pull docker://jodafons/orchestra:latest`
- Create the configuration file.

The configuration file must be named `.orchestra.json` and live in your home directory. If you don't have an account
in the LPS database, please request one.

**NOTE**: The configuration file (`json` extension) must follow this format:

In [35]:
cat $HOME/.orchestra.json

{
  "username" : "jodafons",
  "postgres" : "postgres://jodafons:YOUR_DB_PASSWORD@146.164.147.10:5432/jodafons_db",
  "email"    : "jodafons@lps.ufrj.br",
  "password" : "YOUR_EMAIL_PASSWORD",
  "job_complete_file_name" : ".complete"
}
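A quick sanity check of the configuration file can catch a missing key or a malformed database URL before any `maestro.py` command is run. The sketch below is only an illustration: the helper name `check_orchestra_config` is hypothetical, and treating all five keys from the example as required is an assumption.

```python
import json
from pathlib import Path
from urllib.parse import urlsplit

# Keys taken from the example file above. Treating all of them as
# required is an assumption, not a documented Orchestra rule.
EXPECTED_KEYS = ("username", "postgres", "email", "password", "job_complete_file_name")


def check_orchestra_config(path):
    """Return a list of human-readable problems found in the config file."""
    config = json.loads(Path(path).read_text())
    problems = [f"missing key: {key}" for key in EXPECTED_KEYS if key not in config]

    # The example uses a URL of the form postgres://user:password@host:port/database.
    url = urlsplit(config.get("postgres", ""))
    if url.scheme != "postgres":
        problems.append("postgres URL should start with postgres://")
    if not url.hostname or not url.path.strip("/"):
        problems.append("postgres URL needs a host and a database name")
    return problems
```

Calling `check_orchestra_config(Path.home() / ".orchestra.json")` returns an empty list when nothing looks wrong.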


### 1.1) Start the image:

After everything is set up, enter the container so that all dependencies are available.

- `singularity run $PWD/images/ringer_base.sif`
- Run the command `source /setup_all_here.sh ringer-atlas` to set up all dependencies.

## 2) Initialize the Database:

In [1]:
!maestro.py user -h

Using all sub packages with ROOT dependence
2021-02-04 19:34:55.407656: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
usage: maestro.py user [-h] {create,delete,list,init} ...

positional arguments:
  {create,delete,list,init}

optional arguments:
  -h, --help            show this help message and exit


In [2]:
!maestro.py user init

Using all sub packages with ROOT dependence
2021-02-04 19:35:14.054182: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-02-04 19:35:25,348 | Py.UserParser                           INFO Successfully initialized.


## 3) Create a User:

In [3]:
!maestro.py user create -h

Using all sub packages with ROOT dependence
2021-02-04 19:35:40.500699: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
usage: maestro.py create [-h] -n NAME -e EMAIL

optional arguments:
  -h, --help            show this help message and exit
  -n NAME, --name NAME  The name of the user.
  -e EMAIL, --email EMAIL
                        The user email.


In [4]:
!maestro.py user create -n jodafons -e jodafons@lps.ufrj.br

Using all sub packages with ROOT dependence
2021-02-04 19:36:05.731443: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-02-04 19:36:07,614 | Py.UserParser                           INFO Successfully created.


## 4) Create a Node:

In [5]:
!maestro.py node -h

Using all sub packages with ROOT dependence
2021-02-04 19:36:33.461629: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
usage: maestro.py node [-h] {create,delete,list,stop} ...

positional arguments:
  {create,delete,list,stop}

optional arguments:
  -h, --help            show this help message and exit


In [6]:
!maestro.py node create -h

Using all sub packages with ROOT dependence
2021-02-04 19:36:50.006109: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
usage: maestro.py create [-h] -n NAME -ec ENABLEDCPUSLOTS -mc
                         MAXNUMBEROFCPUSLOTS -eg ENABLEDGPUSLOTS -mg
                         MAXNUMBEROFGPUSLOTS

optional arguments:
  -h, --help            show this help message and exit
  -n NAME, --name NAME  The name of the node.
  -ec ENABLEDCPUSLOTS, --enabledCPUSlots ENABLEDCPUSLOTS
                        The number of CPU enabled slots.
  -mc MAXNUMBEROFCPUSLOTS, --maxNumberOfCPUSlots MAXNUMBEROFCPUSLOTS
                        The total number of CPU slots for this node.
  -eg ENABLEDGPUSLOTS, --enabledGPUSlots ENABLEDGPUSLOTS
                        The number of GPU enabled slots.
  -mg MAXNUMBEROFGPUSLOTS, --maxNumberOfGPUSlots MAXNUMBEROFGPUSLOTS
                        The total number of GPU slots for this node.
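The flags in the help output above can also be assembled from Python, which makes it easier to register several nodes in a loop. This is a minimal sketch: the helper names are hypothetical, the check that enabled slots never exceed the total is an assumption, and `is_local_node` only mirrors the requirement in this section that the node name match the hostname.

```python
import shlex
import socket


def node_create_command(name, enabled_cpu, max_cpu, enabled_gpu, max_gpu):
    """Build the argument list for `maestro.py node create`."""
    # Assumption: enabled slots can never exceed the total slots of a node.
    for kind, enabled, total in (("CPU", enabled_cpu, max_cpu), ("GPU", enabled_gpu, max_gpu)):
        if not 0 <= enabled <= total:
            raise ValueError(f"{kind}: enabled slots ({enabled}) must be between 0 and {total}")
    return [
        "maestro.py", "node", "create",
        "-n", name,
        "-ec", str(enabled_cpu), "-mc", str(max_cpu),
        "-eg", str(enabled_gpu), "-mg", str(max_gpu),
    ]


def is_local_node(name):
    """True when `name` matches the short hostname of this machine."""
    return socket.gethostname().split(".")[0] == name


# CPU-only node with 10 slots, all enabled.
print(shlex.join(node_create_command("caloba51", 10, 10, 0, 0)))
```

The returned list can be passed to `subprocess.run` as-is, which avoids shell quoting issues.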


Let's create one CPU-only machine with 10 slots, all of them enabled. This node (caloba51) will
then process up to 10 jobs at the same time.

**NOTE**: The node name must match the machine's hostname, for example caloba21, caloba22, ..., caloba25.

In [7]:
!maestro.py node create -ec 10 -mc 10 -eg 0 -mg 0 -n caloba51

Using all sub packages with ROOT dependence
2021-02-04 19:37:28.391286: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-02-04 19:37:29,723 | Py.NodeParser                           INFO Successfully created.


Now, let's create a node for the GPU queue only. The command below creates a node named
caloba21 (which has two RTX 2080 GPUs available) with two GPU slots, both enabled.

In [31]:
!maestro.py node create -ec 0 -mc 0 -eg 2 -mg 2 -n caloba21

Using all sub packages with ROOT dependence
2021-02-04 20:37:48.075111: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-02-04 20:37:49,477 | Py.NodeParser                           INFO Successfully created.


In [32]:
!maestro.py node list

Using all sub packages with ROOT dependence
2021-02-04 20:37:52.274901: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
+----------+-----------+-----------+-------+---------+
|   Node   | GPU Slots | CPU slots |  Type |  Status |
+----------+-----------+-----------+-------+---------+
| caloba51 |    0/0    |   10/10   | slave | offline |
| caloba21 |    2/2    |    0/0    | slave | offline |
+----------+-----------+-----------+-------+---------+
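If the node list needs to be inspected from a script, the printed table can be parsed into dictionaries. The sketch below is an illustration only: the helper names are hypothetical, reading a slot cell as `enabled/total` is an assumption, and ANSI color codes are stripped first since terminal output may carry them.

```python
import re

ANSI_CODES = re.compile(r"\x1b\[[0-9;]*m")


def parse_table(text):
    """Turn an ASCII table (as printed by `maestro.py node list`) into dicts."""
    text = ANSI_CODES.sub("", text)
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in text.splitlines()
        if line.strip().startswith("|")
    ]
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


def split_slots(value):
    """Split a slot cell such as '10/10' into two integers."""
    # Assumption: the cell reads 'enabled/total'; the notebook does not say so explicitly.
    left, right = value.split("/")
    return int(left), int(right)


table = """
+----------+-----------+-----------+-------+---------+
|   Node   | GPU Slots | CPU slots |  Type |  Status |
+----------+-----------+-----------+-------+---------+
| caloba51 |    0/0    |   10/10   | slave | offline |
| caloba21 |    2/2    |    0/0    | slave | offline |
+----------+-----------+-----------+-------+---------+
"""

for node in parse_table(table):
    print(node["Node"], split_slots(node["CPU slots"]), split_slots(node["GPU Slots"]))
```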


## 5) Register some datasets in the database:

In [9]:
!maestro.py castor -h

Using all sub packages with ROOT dependence
2021-02-04 19:38:04.162315: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
usage: maestro.py castor [-h] {registry,unregistry,list} ...

positional arguments:
  {registry,unregistry,list}

optional arguments:
  -h, --help            show this help message and exit


In [10]:
!maestro.py castor registry -h

Using all sub packages with ROOT dependence
2021-02-04 19:39:04.210786: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
usage: maestro.py registry [-h] -d DATASETNAME -p PATH

optional arguments:
  -h, --help            show this help message and exit
  -d DATASETNAME, --dataset DATASETNAME
                        The dataset name used to registry into the database.
                        (e.g: user.jodafons...)
  -p PATH, --path PATH  The path to the dataset
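In the examples of this section, every dataset name has the form `user.<username>.<file or directory name>`. The sketch below builds the `castor registry` arguments under that convention. The helper names are hypothetical, and the naming pattern is read off this notebook's examples rather than taken from a documented Orchestra rule.

```python
import shlex
from pathlib import PurePosixPath


def dataset_name(username, path):
    """Name a dataset `user.<username>.<file or directory name>`."""
    # Assumption: this pattern is inferred from the examples in this
    # notebook; it is not stated as a rule that Orchestra enforces.
    return f"user.{username}.{PurePosixPath(path).name}"


def registry_command(username, path):
    """Build the argument list for `maestro.py castor registry`."""
    return ["maestro.py", "castor", "registry", "-d", dataset_name(username, path), "-p", str(path)]


print(shlex.join(registry_command(
    "jodafons", "/home/jodafons/tasks/jobs/job_config.Zee_v10.10sorts.10inits.r3"
)))
```

Deriving the name from the path keeps the `-d` and `-p` arguments in sync, which matters with file names as long as the ones used here.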


### 5.1) Register the data file:

In [12]:
!maestro.py castor registry -d user.jodafons.data17_13TeV.AllPeriods.sgn.probes_lhmedium_EGAM1.bkg.VProbes_EGAM7.GRL_v97_et0_eta0.npz \
    -p /home/jodafons/public/cern_data/files/Zee/data17_13TeV.AllPeriods.sgn.probes_lhmedium_EGAM1.bkg.VProbes_EGAM7.GRL_v97/data17_13TeV.AllPeriods.sgn.probes_lhmedium_EGAM1.bkg.VProbes_EGAM7.GRL_v97_et0_eta0.npz

Using all sub packages with ROOT dependence
2021-02-04 19:40:47.995783: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-02-04 19:40:49,481 | Py.DatasetParser                        INFO Registry /home/jodafons/public/cern_data/files/Zee/data17_13TeV.AllPeriods.sgn.probes_lhmedium_EGAM1.bkg.VProbes_EGAM7.GRL_v97/data17_13TeV.AllPeriods.sgn.probes_lhmedium_EGAM1.bkg.VProbes_EGAM7.GRL_v97_et0_eta0.npz into user.jodafons.data17_13TeV.AllPeriods.sgn.probes_lhmedium_EGAM1.bkg.VProbes_EGAM7.GRL_v97_et0_eta0.npz
2021-02-04 19:40:49,722 | Py.DatasetParser                        INFO Successfully uploaded.


### 5.2) Register the reference file:

In [13]:
!maestro.py castor registry -d user.jodafons.data17_13TeV.AllPeriods.sgn.probes_lhmedium_EGAM1.bkg.VProbes_EGAM7.GRL_v97_et0_eta0.ref.pic.gz \
    -p /home/jodafons/public/cern_data/files/Zee/data17_13TeV.AllPeriods.sgn.probes_lhmedium_EGAM1.bkg.VProbes_EGAM7.GRL_v97/references/data17_13TeV.AllPeriods.sgn.probes_lhmedium_EGAM1.bkg.VProbes_EGAM7.GRL_v97_et0_eta0.ref.pic.gz

Using all sub packages with ROOT dependence
2021-02-04 19:42:07.212435: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-02-04 19:42:08,494 | Py.DatasetParser                        INFO Registry /home/jodafons/public/cern_data/files/Zee/data17_13TeV.AllPeriods.sgn.probes_lhmedium_EGAM1.bkg.VProbes_EGAM7.GRL_v97/references/data17_13TeV.AllPeriods.sgn.probes_lhmedium_EGAM1.bkg.VProbes_EGAM7.GRL_v97_et0_eta0.ref.pic.gz into user.jodafons.data17_13TeV.AllPeriods.sgn.probes_lhmedium_EGAM1.bkg.VProbes_EGAM7.GRL_v97_et0_eta0.ref.pic.gz
2021-02-04 19:42:08,617 | Py.DatasetParser                        INFO Successfully uploaded.


### 5.3) Register the job files:

In [14]:
!maestro.py castor registry -d user.jodafons.job_config.Zee_v10.10sorts.10inits.r3 \
    -p /home/jodafons/tasks/jobs/job_config.Zee_v10.10sorts.10inits.r3

Using all sub packages with ROOT dependence
2021-02-04 19:43:31.777281: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-02-04 19:43:33,233 | Py.DatasetParser                        INFO Registry /home/jodafons/tasks/jobs/job_config.Zee_v10.10sorts.10inits.r3/job_config.ID_0000.ml0.mu0_sl0.su0_il0.iu0.18-Jan-2021-14.47.30.pic.gz into user.jodafons.job_config.Zee_v10.10sorts.10inits.r3
2021-02-04 19:43:33,234 | Py.DatasetParser                        INFO Registry /home/jodafons/tasks/jobs/job_config.Zee_v10.10sorts.10inits.r3/job_config.ID_0001.ml0.mu0_sl0.su0_il1.iu1.18-Jan-2021-14.47.30.pic.gz into user.jodafons.job_config.Zee_v10.10sorts.10inits.r3
2021-02-04 19:43:33,234 | Py.DatasetParser                        INFO Registry /home/jodafons/tasks/jobs/job_config.Zee_v10.10sorts.10inits.r3/job_config.ID_0002.ml0.mu0_sl0.su0_il2.iu2.18-Jan-2021-14.47.30.pic.gz into user.jodafons.job

## 6) List all datasets:

In [16]:
!maestro.py castor list -u jodafons

Using all sub packages with ROOT dependence
2021-02-04 19:44:12.055614: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
+----------+---------------------------------------------------------------------------------------------------------------+-------+
| Username |                                                    Dataset                                                    | Files |
+----------+---------------------------------------------------------------------------------------------------------------+-------+
| jodafons |     user.jodafons.data17_13TeV.AllPeriods.sgn.probes_lhmedium_EGAM1.bkg.VProbes_EGAM7.GRL_v97_et0_eta0.npz    |   1   |
| jodafons | user.jodafons.data17_13TeV.AllPeriods.sgn.probes_lhmedium_EGAM1.bkg.VProbes_EGAM7.GRL_v97_et0_eta0.ref.pic.gz |   1   |
| jodafons |                              user.jodafons.job_config.Zee_v10.10sorts.10inits.r3                         

## 7) Create a Task:

In [17]:
!maestro.py task -h

Using all sub packages with ROOT dependence
2021-02-04 19:45:19.594627: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
usage: maestro.py task [-h] {create,repro,retry,delete,list,kill,queue} ...

positional arguments:
  {create,repro,retry,delete,list,kill,queue}

optional arguments:
  -h, --help            show this help message and exit


In [18]:
!maestro.py task create -h

Using all sub packages with ROOT dependence
2021-02-04 19:45:37.991401: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
usage: maestro.py create [-h] -v VOLUME -t TASKNAME -c CONFIGFILE -d DATAFILE
                         [--sd SECONDARYDS] --exec EXECCOMMAND --queue QUEUE
                         [--dry_run] [--bypass]

optional arguments:
  -h, --help            show this help message and exit
  -v VOLUME, --volume VOLUME
                        The volume
  -t TASKNAME, --task TASKNAME
                        The task name to be append into the db.
  -c CONFIGFILE, --configFile CONFIGFILE
                        The job config file that will be used to configure the
                        job (sort and init).
  -d DATAFILE, --dataFile DATAFILE
                        The data/target file used to train the model.
  --sd SECONDARYDS, --secondaryDS SECONDARYDS
                        The secondary datasets to be ap
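The `--exec` string used with `task create` in the next cell contains placeholders (`%IN`, `%DATA`, `%REF`, `%OUT`), and `--sd` maps `%REF` to a dataset name through a Python-style dictionary string. The sketch below shows one way such a template could be expanded. It is only a guess at the mechanism: the function name and file names are hypothetical, and the meaning assigned to each placeholder is an assumption rather than Orchestra's actual implementation.

```python
import ast


def render_exec(template, config, data, output, secondary="{}"):
    """Fill the % placeholders of an --exec template with concrete values."""
    # Assumption: %IN, %DATA and %OUT stand for the job config, the data
    # file and the output location; the --sd dictionary supplies the rest.
    mapping = {"%IN": config, "%DATA": data, "%OUT": output}
    # --sd uses single quotes, so it is parsed as a Python literal, not as JSON.
    mapping.update(ast.literal_eval(secondary))
    # Replace longer keys first so that one placeholder cannot clobber
    # another that shares its prefix.
    for key in sorted(mapping, key=len, reverse=True):
        template = template.replace(key, mapping[key])
    return template


command = render_exec(
    "run_tuning.py -c %IN -d %DATA -r %REF -v %OUT -t v10 -b zee -p r3",
    config="job.pic.gz",  # hypothetical file names, for illustration only
    data="data.npz",
    output="out",
    secondary="{'%REF': 'ref.pic.gz'}",
)
print(command)
```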

In [20]:
!maestro.py task create \
  -v $PWD \
  -t user.jodafons.data17_13TeV.AllPeriods.sgn.probes_lhmedium_EGAM1.bkg.VProbes_EGAM7.GRL_v97.v10_et0_eta0.r3 \
  -c user.jodafons.job_config.Zee_v10.10sorts.10inits.r3 \
  -d user.jodafons.data17_13TeV.AllPeriods.sgn.probes_lhmedium_EGAM1.bkg.VProbes_EGAM7.GRL_v97_et0_eta0.npz \
  --sd "{'%REF':'user.jodafons.data17_13TeV.AllPeriods.sgn.probes_lhmedium_EGAM1.bkg.VProbes_EGAM7.GRL_v97_et0_eta0.ref.pic.gz'}" \
  --exec "run_tuning.py -c %IN -d %DATA -r %REF -v %OUT -t v10 -b zee -p r3" \
  --queue "gpu"

Using all sub packages with ROOT dependence
2021-02-04 19:48:24.890547: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-02-04 19:48:26,167 | Py.TaskParser                           INFO Creating the task dir in /home/jodafons/user.jodafons.data17_13TeV.AllPeriods.sgn.probes_lhmedium_EGAM1.bkg.VProbes_EGAM7.GRL_v97.v10_et0_eta0.r3
2021-02-04 19:48:27,727 | Py.TaskParser                           INFO Succefully created.


## 8) Start Orchestra:

Open a new terminal and log in to the caloba cluster once again. Let's launch Orchestra using the
`slurm` commands. To start the server, you need one master and one or more slaves. The `master` node works as
both task manager and consumer, while a `slave` node works only as a consumer. The two scripts below launch
the master and slave nodes.

**NOTE**: Create the master node with two GPU slots.

**NOTE**: You cannot launch more than one node as master.

### 8.1) Create the master node:

In [27]:
!cat tasks/scripts/run_slurm_master.sh

#!/bin/bash
#SBATCH --nodes=1            	# Number of nodes
#SBATCH --ntasks-per-node=1  	# Number of tasks/node
#SBATCH --cpus-per-task=16   	# Number of threads/task
#SBATCH --partition=gpu      	# The partition name: gpu or cpu
#SBATCH --job-name=orchestra 	# job name
#SBATCH --exclusive         	# Reserve this node only for you
#SBATCH --account=jodafons	# account name

echo $SLURM_JOB_NODELIST
nodeset -e $SLURM_JOB_NODELIST

# set cuda and cudnn for gpu queue
module purge
module load cudnn/7.6.5 cuda/10.1

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# the container path
export IMG=/home/jodafons/images/orchestra_latest.sif
# allocate the node with two gpus, then launch singularity and orchestra as master
srun -N 1 -n 1 -c $SLURM_CPUS_PER_TASK --gpus 2 singularity run --nv --writable-tmpfs $IMG pilot run --master

wait


### 8.2) Create the slave node:

**NOTE**: Create one slave node with two GPUs.

In [28]:
!cat tasks/scripts/run_slurm.sh

#!/bin/bash
#SBATCH --nodes=1            	# Number of nodes
#SBATCH --ntasks-per-node=1  	# Number of tasks/node
#SBATCH --cpus-per-task=16   	# Number of threads/task
#SBATCH --partition=gpu      	# The partition name: gpu or cpu
#SBATCH --job-name=orchestra 	# job name
#SBATCH --exclusive         	# Reserve this node only for you
#SBATCH --account=jodafons	# account name

echo $SLURM_JOB_NODELIST
nodeset -e $SLURM_JOB_NODELIST

# set cuda and cudnn for gpu queue
module purge
module load cudnn/7.6.5 cuda/10.1

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# the container path
export IMG=/home/jodafons/images/orchestra_latest.sif
# allocate the node with two gpus, then launch singularity and orchestra as slave
srun -N 1 -n 1 -c $SLURM_CPUS_PER_TASK --gpus 2 singularity run --nv --writable-tmpfs $IMG pilot run

wait
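The two scripts above differ only in the `--master` flag passed to `pilot run`, so both can be produced from a single template. This is a minimal sketch: the function name, its parameters, and the template layout are hypothetical, and the inline comments of the original scripts are omitted for brevity.

```python
SLURM_TEMPLATE = """#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task={cpus}
#SBATCH --partition={partition}
#SBATCH --job-name=orchestra
#SBATCH --exclusive
#SBATCH --account={account}

module purge
module load cudnn/7.6.5 cuda/10.1

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export IMG={image}
srun -N 1 -n 1 -c $SLURM_CPUS_PER_TASK --gpus {gpus} singularity run --nv --writable-tmpfs $IMG {pilot}

wait
"""


def render_slurm_script(account, image, master=False, cpus=16, gpus=2, partition="gpu"):
    """Render a launch script; only the master gets the --master flag."""
    pilot = "pilot run --master" if master else "pilot run"
    return SLURM_TEMPLATE.format(
        account=account, image=image, cpus=cpus, gpus=gpus, partition=partition, pilot=pilot
    )


print(render_slurm_script("jodafons", "/home/jodafons/images/orchestra_latest.sif", master=True))
```

Writing the result to a file and submitting it with `sbatch` would then replace maintaining two nearly identical scripts by hand.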
