# Tutorial on workflow generation with PyCBC

## Workflows

A *pipeline* specifies a set of scientific tasks to be executed. A *workflow* is an instance of a pipeline, with the settings or configuration needed to execute a particular analysis.

A canonical example is the diamond workflow. An input file `f.a` goes to a process `preprocess`, which creates two files, `f.b1` and `f.b2`. Each of those files is then run through the process `findrange`. A final process `analyze` takes the output from both `findrange` processes and produces a final file called `f.d`. A graphical depiction is shown below (image credit: Pegasus documentation).

<img src="img_diamond_wf.png" alt="workflow_image" style="width: 200px;"/>
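
The dependency structure of the diamond workflow can also be written down as a small directed acyclic graph. The following is a minimal sketch in plain Python, not how Pegasus represents workflows internally; the labels `findrange_1` and `findrange_2` and the function `execution_order` are names chosen here for illustration. It computes one order in which the jobs could run so that every job starts only after its parents have finished.

```python
from collections import deque

# Each job maps to the list of jobs that must finish before it can start.
# "findrange_1" and "findrange_2" are illustrative labels for the two
# findrange instances in the diamond example.
parents = {
    "preprocess": [],
    "findrange_1": ["preprocess"],
    "findrange_2": ["preprocess"],
    "analyze": ["findrange_1", "findrange_2"],
}


def execution_order(parents):
    """Return one valid execution order using Kahn's algorithm."""
    # number of unfinished parents for each job
    remaining = {job: len(deps) for job, deps in parents.items()}
    # invert the mapping: parent -> list of children
    children = {job: [] for job in parents}
    for job, deps in parents.items():
        for dep in deps:
            children[dep].append(job)
    # jobs with no parents can start immediately
    ready = deque(sorted(job for job, n in remaining.items() if n == 0))
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for child in sorted(children[job]):
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)
    if len(order) != len(parents):
        raise ValueError("graph has a cycle, so it is not a valid workflow")
    return order


order = execution_order(parents)
print(order)
```

A workflow management system does essentially this bookkeeping for you, in addition to locating data and scheduling the jobs on real resources.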

Generating and executing workflows is very common in scientific computing, and there is an active field of research on how to manage their execution using *workflow management systems*:
  * A workflow management system automatically locates the necessary input data and computational resources, and manages storage space for executing data-intensive workflows on storage-constrained resources.

In PyCBC, we have chosen to use the [Pegasus](https://pegasus.isi.edu/) workflow management system, which uses the [HTCondor](https://research.cs.wisc.edu/htcondor/) scheduler. For an overview of different workflow management systems, see [R. F. Silva et al. 2017](https://www.sciencedirect.com/science/article/pii/S0167739X17302510).

In this tutorial we will:
  * Demonstrate how to generate a workflow to execute PyCBC Inference and some plotting codes with PyCBC.
  * Demonstrate how to generate a workflow to execute a study using PyCBC Inference for a population of simulated binary black hole mergers in simulated data.
  * Walk through an example of creating a simple workflow in PyCBC.


## Setup: SciServer terminal and acquire tutorial

Earlier, you went through how to estimate posteriors with PyCBC Inference on the command line and how to make several plots. There are workflow generators that create a workflow which analyzes these signals and generates a set of plots on HTML pages.

We will run the example on SciServer. To set up a container with a terminal, do:

  1. Go to the compute URL at: https://apps.sciserver.org/compute/
  2. Click `Create Container`.
  3. Give the container a name and select the `Python + R` image from the menu.
  4. Click on the name of the container that should now appear on the list at: https://apps.sciserver.org/compute/
  5. A Jupyter notebook should appear. Click `New`->`Terminal`.

You should now have a terminal open.

Create an Anaconda environment with:
```
conda create --yes --name pycbc-2p7 python=2.7
```

Load the new Anaconda environment called `pycbc-2p7` with:
```
source activate pycbc-2p7
```

Install prerequisites with:
```
# install LALSuite and PyCBC
pip install lalsuite pycbc
# install Pegasus Python API
pip install http://download.pegasus.isi.edu/pegasus/4.8.1/pegasus-python-source-4.8.1.tar.gz
```

To test your install, the following commands should work:
```
python -c "import pycbc"
python -c "import Pegasus"
```
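
The same check can be done from a single Python script. This is a minimal sketch using only the standard library; the helper name `check_modules` is made up for this example, and the result depends on your environment, so no expected output is shown.

```python
import importlib


def check_modules(names):
    """Try to import each module and report whether it succeeded."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except Exception:
            # any error raised at import time is counted as a failed install
            status[name] = False
    return status


# "pycbc" and "Pegasus" are the two modules tested above
for name, ok in sorted(check_modules(["pycbc", "Pegasus"]).items()):
    print("{0}: {1}".format(name, "ok" if ok else "MISSING"))
```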

Now, get data for this tutorial in the workspace with:
```
cd workspace
git clone http://github.com/gwastro/PyCBCInferenceWorkshopMay2019.git
```

And change into this tutorial with:
```
cd PyCBCInferenceWorkshopMay2019/pycbc_inference_workflow
```

## Generating workflow for binary black hole with GWOSC data

We will use data files from [GWOSC](https://www.gw-openscience.org) for GW150914, and run a workflow that uses ``pycbc_inference`` and generates a set of plots in HTML pages. There is a script already in the repository that demonstrates how to generate the workflow.

This script does the following:
  1. Download two data files from GWOSC with `wget`.
  2. Set variables associated with the data, e.g. the trigger time, similar to the previous tutorials.
  3. Make a call to `pycbc_make_inference_workflow`, which is the *workflow generator*. It reads in configuration files that describe how to run `pycbc_inference` and outputs an abstract representation of the workflow.

To execute this script do:
```
bash run_ex_bbh.sh
```

You should notice a number of log messages in the terminal after executing this script and a new directory called `bbh` is created.
 
Looking inside the script, we see that `pycbc_make_inference_workflow` takes the following options:
  * `--workflow-name`: A short name to identify this workflow.
  * `--config-file`: A workflow configuration file.
  * `--inference-config-file`: A PyCBC Inference configuration file, e.g. from the earlier tutorials.
  * `--output-map`: Maps logical file names to physical file names on the filesystem. 
  * `--transformation-catalog`: Maps logical executable names to physical executables on the system. E.g. references to `inference` in the configuration file get mapped to `/your/path/to/pycbc_inference`.
  * `--gps-end-time`: Event's coalescence time as reported from search or *a priori* estimate.
  * `--config-overrides`: A listing of options in the workflow file that will be overwritten from the command line.
 
## Running workflow for binary black hole mergers
 
Note that the SciServer container does not contain an HTCondor cluster, which is required to execute the workflow. **Therefore, SciServer cannot run this command to completion.** On a machine with access to a cluster, one would submit the workflow with:
```
cd ${OUTPUT_DIR}
pycbc_submit_dax \
   --no-create-proxy \
   --no-grid \
   --dax ${WORKFLOW_NAME}.dax \
   --enable-shared-filesystem \
   --force-no-accounting-group
```

This command (``pycbc_submit_dax``) will submit the workflow to the HTCondor cluster for execution.

## Input: Workflow configuration file

The workflow configuration file has several sections:
  * ``[workflow]``: Information about the workflow.
  * ``[workflow-*]``: Controls certain aspects of the workflow. E.g. ``[workflow-inference]`` gives details on what to plot.
  * ``[executables]``: Gives path to executables to use.
  * ``[*]``: The remaining sections give command line options to the executables. Names of these sections are the logical names given as keys in ``[executables]``.
  * ``[pegasus_profile-*]``: Gives options to Pegasus on what type of resources to acquire. E.g. how many nodes or memory requirements.

So for example, ``pycbc_inference`` options are set with:
```
[executables]
inference = ${which:pycbc_inference}

[inference]
seed = 39392
sample-rate = 2048
data-conditioning-low-freq = 20
strain-high-pass = 15
pad-data = 8
psd-estimation = median
psd-segment-length = 16
psd-segment-stride = 8
psd-inverse-length = 4
```
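
To make the mapping from a section to command line options concrete, here is a minimal sketch written for Python 3's standard library. It is not PyCBC's own implementation; the function `section_to_args` is a name invented here, and treating an empty value as a flag without an argument is an assumption of this sketch.

```python
import configparser

# A shortened copy of the [inference] section shown above.
CONFIG_TEXT = """
[inference]
seed = 39392
sample-rate = 2048
psd-estimation = median
"""


def section_to_args(parser, section):
    """Turn each ``key = value`` pair into ``--key value`` arguments."""
    args = []
    for key, value in parser.items(section):
        args.append("--" + key)
        if value:
            # assumption: an empty value means a flag with no argument
            args.append(value)
    return args


# RawConfigParser performs no interpolation, so values are kept verbatim
parser = configparser.RawConfigParser()
parser.read_string(CONFIG_TEXT)
args = section_to_args(parser, "inference")
print(" ".join(args))
# prints: --seed 39392 --sample-rate 2048 --psd-estimation median
```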

And recall, we added overrides on the command line. The corresponding options added to ``pycbc_inference`` were:
```
                       inference:psd-start-time:${PSD_START_TIME} \
                       inference:psd-end-time:${PSD_END_TIME} \
                       inference:psd-segment-length:${PSD_SEG_LEN} \
                       inference:psd-segment-stride:${PSD_STRIDE} \
                       inference:psd-inverse-length:${PSD_INVLEN} \
                       inference:nprocesses:${NPROCS} \
```
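
Each override has the form `section:option:value`. The sketch below shows one way such strings could be applied to a parsed configuration, again using Python 3's standard library rather than PyCBC's own code. The function `apply_overrides` and the numeric values (standing in for `${PSD_SEG_LEN}` and `${NPROCS}`) are hypothetical.

```python
import configparser


def apply_overrides(parser, overrides):
    """Apply ``section:option:value`` strings to a parsed configuration."""
    for item in overrides:
        # split on the first two colons only, so the value may contain ":"
        section, option, value = item.split(":", 2)
        if not parser.has_section(section):
            parser.add_section(section)
        # replaces an existing option or adds a new one
        parser.set(section, option, value)


parser = configparser.RawConfigParser()
parser.read_string("[inference]\npsd-segment-length = 16\n")

# hypothetical values standing in for the shell variables
apply_overrides(parser, ["inference:psd-segment-length:8",
                         "inference:nprocesses:4"])
print(dict(parser.items("inference")))
```

In this sketch, an option that already exists in the file is replaced and a new option is added to the section.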

## Output: DAX

When we ran our workflow generator we put all the parts together to describe how the workflow should be constructed.

One of the output files is called a *Directed Acyclic Graph in XML* (DAX). The DAX is an XML file that gives an abstract representation of the workflow. It does not specify the paths to files or executables, but rather gives logical names to represent those files.

There are several XML elements:
  * ``<file>``: Describes an input file to the workflow. Here, you should notice the frame files (data) and the configuration file.
  * ``<job>``: Describes how a node will be run. E.g. the command line options.
  * ``<child>``: Gives parent-child relationships.
  
In this example, there are two subworkflows: (1) one that analyzes and plots, and (2) one that generates an HTML page. The primary workflow's DAX is ``bbh/event.dax``.

The DAX which contains the ``pycbc_inference`` node is in ``bbh/event-main.dax``. Do:
```
less bbh/event-main.dax
```

You should notice the input files, jobs, and parent-child relationships.
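
If you prefer to inspect these elements programmatically, the sketch below parses a small XML fragment with Python's standard library. The fragment is a simplified, hypothetical stand-in for a DAX: the root element, IDs, job names, file name, and argument are made up for illustration. A real DAX may also declare an XML namespace, in which case the tag lookups would need to include it.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical fragment: the element names follow the list
# above, but all attribute values are invented for this example.
DAX_TEXT = """
<adag name="example">
  <file name="config.ini"/>
  <job id="ID1" name="inference">
    <argument>--example-option 1</argument>
  </job>
  <job id="ID2" name="plot_posterior"/>
  <child ref="ID2">
    <parent ref="ID1"/>
  </child>
</adag>
"""

root = ET.fromstring(DAX_TEXT)

# map job IDs to their logical executable names
jobs = {job.get("id"): job.get("name") for job in root.iter("job")}

# collect (parent, child) pairs from the <child> elements
edges = []
for child in root.iter("child"):
    for parent in child.iter("parent"):
        edges.append((jobs[parent.get("ref")], jobs[child.get("ref")]))

print(sorted(jobs.values()))
print(edges)
```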

Do you see your ``[pegasus_profile-inference]`` options? These will be stored in the transformation catalog, e.g. ``bbh/event-main.tc.txt``.

## Viewing the workflow output

**Note that we have only created the logical representation of the workflow; we have not executed it.**

We will not be able to execute the workflow in the SciServer container because it requires an HTCondor cluster. At the end of this tutorial I give instructions to download a virtual machine with a single HTCondor node, inside which this workflow can be executed.

However, I have attached an example set of the HTML pages that this workflow generates at [Results](example_pages/index.html).

There are several tabs to view different information:
  * ``Summary``: Gives an overview of the results. Notice the table of credible intervals and the corner plot of the samples of the posterior.
  * ``Detector Sensitivity``: Shows the power spectral density.
  * ``Priors``: Samples of the priors used in the analysis.
  * ``Posteriors``: Posteriors to be plotted denoted in the ``[workflow-inference]`` section.
  * ``Samples``: The chains of the samples from the analysis.
  * ``Workflow``: Configuration files and logs from running workflow.
  
The caption button below each image or table gives a description of the file. The command line button gives the command line used to generate the file.

## Running example workflow for simulated population of binary black hole mergers

There is also an example that generates a workflow which creates and analyzes a population of simulated binary black hole mergers.

To run this example, use:
```
bash run_ex_pp.sh
```

This script is similar to our previous example, in that it calls the workflow generator.

## Create your own workflow generator

Workflow generators like ``pycbc_make_inference_workflow`` and ``pycbc_make_inference_inj_workflow`` use building blocks within PyCBC's workflow module to build the analysis.

Inside the ``pycbc.workflow`` subpackage there are several classes:
  * ``Workflow``: Represents the workflow and tracks parent-child relationships.
  * ``Executable``: Represents an executable, excluding the command line options.
  * ``Node``: Represents an executable with a particular set of command line options.
  * ``File``: Represents a file.

There is a script, ``create_workflow.py``, that contains an example. It creates a simple workflow with one node that runs ``echo``. We go through it here.

Import modules. We need the ``pycbc.workflow`` module for building the workflow:
```
#! /usr/bin/env python

from pycbc import workflow as wf
```

Create a Python object that contains options for the ``Workflow`` class. These are the options the class requires: its short name, configuration file, command line overrides, desired output map path, and desired transformation catalog path. To create the object, do:
```
# options
class Options:
    pass
opts = Options()
opts.workflow_name = "wfex"
opts.config_files = ["ex_wfex.ini"]
opts.config_delete = None
opts.config_overrides = None
opts.output_file = "wfex.dax"
opts.output_map = "output.map"
opts.transformation_catalog = "wfex.tc.txt"
```

Create the directory where output files will be written:
```
# set output directory
out_dir = "wfex_results"

# create output directory
wf.makedir(out_dir)
```

Create an instance of the ``Workflow`` class:
```
# create workflow
container = wf.Workflow(opts, opts.workflow_name)
```

In our workflow, we will simply call ``echo``. To create the ``Executable`` instance, do:
```
# create executable
exe_echo = wf.Executable(container.cp, "echo",
                         ifos=container.ifos, out_dir=out_dir)
```

To create a ``Node`` instance with its command line options, do:
```
# create node
node_echo = exe_echo.create_node()

# add options
node_echo.add_opt("--option-1", container.analysis_time[0])
node_echo.add_opt("--option-2", container.analysis_time[1])
```

To add the node to the workflow, do:
```
# add node to workflow
container += node_echo
```

Finally, you will want to write your DAX, output map, and transformation catalog. Therefore, do:
```
# save
container.save(filename=opts.output_file,
               output_map_path=opts.output_map,
               transformation_catalog_path=opts.transformation_catalog)
```

You should now see:
  * A new DAX file called ``wfex.dax`` which contains our ``echo`` node.
  * A new transformation catalog that displays your physical path to ``echo``.
  
Look in the configuration file, ``ex_wfex.ini``. It contains the minimum needed to use the workflow module:
```
[workflow]
start-time = 0
end-time = 1

[workflow-ifos]
h1 =
l1 =

[executables]
echo = ${which:echo}

[echo]
```

This should look familiar, e.g. notice the ``[workflow]``, ``[workflow-*]``, ``[executables]``, and ``[*]`` sections, as in the examples above.

## Additional exercises

### Get virtual machine with HTCondor node

This is a more advanced example, where there is an OVA file you can download to run a virtual machine with a virtual HTCondor cluster. The documentation from Pegasus is [here](https://pegasus.isi.edu/documentation/vm_virtualbox.php).

On your local machine, try doing:
  * Install VirtualBox
  * Download the virtual machine at: http://download.pegasus.isi.edu/pegasus/4.9.1/PegasusTutorialVM-4.9.1.ova
  * Launch VirtualBox
  * ``File``->``Import Appliance``, and then select your downloaded OVA file.
  * Double-click the new icon for ``PegasusTutorialVM-4.9.1`` in VirtualBox to launch the virtual machine.

This virtual machine does not have all the requirements to run the example. You will need to:
  * Install pip, e.g. ``curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && python get-pip.py --user``
  * Install the Python development package to obtain the Python header file, e.g. ``sudo yum install -y python-devel``.
  * Install git, e.g. ``sudo yum install -y git``
  * Install PyCBC and LALSuite and get the tutorial, e.g. ``pip install lalsuite pycbc`` and ``git clone http://github.com/gwastro/PyCBCInferenceWorkshopMay2019.git``.
  * You will not be able to submit from ``/tmp``, so for ``pycbc_submit_dax`` use the ``--local-dir`` option. E.g. use ``mkdir -p ${HOME}/tmp && pycbc_submit_dax --local-dir ${HOME}/tmp``.
  * Remove the ``[pegasus_profile-inference]`` options. In the virtual machine you will only have a single node, and if you request more than one node, your ``pycbc_inference`` node will not run.
  * You may need to use ``numpy==1.16.1``.

Can you:
  * Run the examples in the virtual machine?
  * Design your own workflow generator and execute your own workflow in the virtual machine or on an HPC cluster?
  * Add dependencies or ``File`` instances to your workflow generator? See the executables in the PyCBC repository ``bin/workflow`` as examples.