# Pegasus Tutorial For HTC Workflows on ACCESS Resources

Welcome to the Pegasus tutorial notebook, which is intended for new users who want to get a quick overview of Pegasus concepts and usage. This tutorial covers:

 - Using the Pegasus API to generate an abstract workflow
 - Using the API to plan the abstract workflow into an executable workflow
 - Pegasus catalogs for sites, transformations, and data
 - Debugging and recovering from failures (02-Debugging notebook)
 - Command line tools (03-Command-Line-Tools notebook)
 
For a quick overview of Pegasus, please see this short YouTube video:

[![A 5 Minute Introduction](../images/youtube-pegasus-intro.png)](https://www.youtube.com/watch?v=MNN80OHMQUQ "A 5 Minute Introduction")

## Allocation Optional for Tutorial Workflows

Running workflows with ACCESS Pegasus normally requires you to link your own allocations. However, the first notebooks in this tutorial are preconfigured to run on a small resource bundled with ACCESS Pegasus. More complex sample workflows, such as Variant Calling, require your own allocation.

If you prefer to run the workflow using your own allocation, provision resources as described in the documentation and comment out the `+run_on_test_cluster` property below. The following resources are currently supported:

* Purdue Anvil
* SDSC Expanse
* PSC Bridges2
* IU Jetstream2

![Pegasus ACCESS Overview](../images/pegasus-access-overview.png)


## Diamond Workflow

This notebook will generate the **diamond workflow** illustrated below, then plan and execute the workflow on the local condorpool. Rectangles represent input/output files, and ovals represent compute jobs. The arrows represent file dependencies between each compute job. 

![Diamond Workflow](../images/diamond.svg)

The abstract workflow description that you give to Pegasus is portable: it usually does not contain the locations of physical input files, executables, or the cluster endpoints where jobs are executed. During planning, Pegasus resolves these details using three information catalogs, as illustrated below:

![Catalogs](../images/catalogs.png)
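
To make the role of the three catalogs concrete, here is a purely conceptual sketch in plain Python. It is not how Pegasus plans a workflow; the dictionaries, the `plan_job` helper, and the `/path/to/...` values are hypothetical stand-ins chosen for illustration.

```python
# Toy stand-ins for the three catalogs. All names and paths are placeholders.
replica_catalog = {"f.a": "/path/to/f.a"}  # logical file name -> physical location
transformation_catalog = {"preprocess": "/usr/bin/pegasus-keg"}  # logical executable -> path
site_catalog = {"condorpool": {"scratch": "/path/to/scratch"}}  # site -> directory layout

# An abstract job only refers to logical names.
abstract_job = {"transformation": "preprocess", "inputs": ["f.a"]}


def plan_job(job, site):
    """Resolve the logical names in an abstract job against the toy catalogs."""
    return {
        "site": site,
        "executable": transformation_catalog[job["transformation"]],
        "inputs": {lfn: replica_catalog[lfn] for lfn in job["inputs"]},
        "work_dir": site_catalog[site]["scratch"],
    }


plan = plan_job(abstract_job, "condorpool")
print(plan)
```

The point of the sketch is only that the abstract job stays free of physical details, which are looked up at planning time.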

## 1. Import Python API

Pegasus 5.0 introduces a new Python API, which is fully documented in the [Pegasus reference guide](https://pegasus.isi.edu/documentation/reference-guide/api-reference.html). A high level overview of the components:
<br>
```
from Pegasus.api.mixins import EventType, Namespace
from Pegasus.api.properties import Properties
from Pegasus.api.replica_catalog import File, ReplicaCatalog
from Pegasus.api.site_catalog import (
    OS,
    Arch,
    Directory,
    FileServer,
    Grid,
    Operation,
    Scheduler,
    Site,
    SiteCatalog,
)
from Pegasus.api.transformation_catalog import (
    Container,
    Transformation,
    TransformationCatalog,
    TransformationSite,
)
from Pegasus.api.workflow import Job, SubWorkflow, Workflow
from Pegasus.client._client import PegasusClientError
```

While you can import just parts of the API, the most convenient way is to just import it all:

In [None]:
from Pegasus.api import *

## 2. Configure Logging

While logging is **not required**, configuring it is useful for seeing output from tools such as `pegasus-plan` and `pegasus-analyzer` when using these Python wrappers. This cell also includes a few other imports needed further down.

In [None]:
from pathlib import Path

import logging

logging.basicConfig(level=logging.DEBUG)
BASE_DIR = Path(".").resolve()

## 3. Configure Pegasus Properties

The `pegasus.properties` file can now be generated using the `Properties()` object as shown below. To see a list of the most commonly used properties, you can use `Properties.ls(prefix)`. By default, `pegasus-plan` looks in the current working directory for a `pegasus.properties` file.

In [None]:
# --- Properties ---------------------------------------------------------------
props = Properties()
props["pegasus.monitord.encoding"] = "json"                                                                    
props["pegasus.catalog.workflow.amqp.url"] = "amqp://friend:donatedata@msgs.pegasus.isi.edu:5672/prod/workflows"
props["pegasus.mode"] = "tutorial" # speeds up tutorial workflows - remove for production ones

# Allow the jobs to run on the test cluster. You do not need to provision
# resources from your own allocations in this case, but the cluster is small
# and should not be used for production workloads.
props.add_site_profile("condorpool", "condor", "+run_on_test_cluster", "true")

props.write() # written to ./pegasus.properties 

In [None]:
Properties.ls("condor.request")
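
If you want to check which settings ended up in a properties file, a small reader can help. The following is a minimal sketch, assuming the file consists of simple `key = value` lines; the `read_properties` helper is hypothetical (not part of the Pegasus API), and the sample text is written by hand rather than copied from a real run.

```python
from typing import Dict


def read_properties(text: str) -> Dict[str, str]:
    """Parse simple `key = value` lines, skipping blank lines and comments."""
    parsed = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        key, sep, value = line.partition("=")
        if sep:
            parsed[key.strip()] = value.strip()
    return parsed


# Hand-written sample for illustration only.
sample = """\
# sample properties
pegasus.monitord.encoding = json
pegasus.mode = tutorial
"""

settings = read_properties(sample)
print(settings["pegasus.mode"])
```

To inspect the real file, the same helper could be fed `Path("pegasus.properties").read_text()`.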

## 4. Create a Replica Catalog (Specify Initial Input Files)

Any initial input files given to the workflow should be specified in the `ReplicaCatalog`. This object tells Pegasus where each input file is physically located. First, we create a file that will be used as input to this workflow. 

In [None]:
with open("f.a", "w") as f:
    f.write("This is the contents of the input file for the diamond workflow!")

The file `./f.a` will be used as input to this workflow, so we create a corresponding `File` object. Metadata may also be added to the file, as shown below.

Next, a `ReplicaCatalog` object is created so that the physical location of each input file can be cataloged. This is done using the `ReplicaCatalog.add_replica(site, file, path)` function. First, because the file `f.a` resides on the submit machine, we use the reserved keyword `local` for the `site` parameter. Second, the `File` object is passed in for the `file` parameter. Finally, the absolute path to the file is given. A `pathlib.Path` may be used as long as it is an absolute path.

By default, `pegasus-plan` looks in the current working directory for a `replicas.yml` file.

In [None]:
# --- Replicas -----------------------------------------------------------------
fa = File("f.a").add_metadata(creator="ryan")
rc = ReplicaCatalog()\
    .add_replica("local", fa, Path(".").resolve() / "f.a")\
    .write() # written to ./replicas.yml 

In [None]:
!cat replicas.yml
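
Because the replica catalog expects an absolute path, it can help to validate input paths before cataloging them. This is a minimal sketch using only `pathlib`; the `absolute_input_path` helper is a hypothetical name, and the demo runs in a temporary directory instead of the notebook directory.

```python
import tempfile
from pathlib import Path


def absolute_input_path(name: str, base_dir: Path) -> Path:
    """Return an absolute path for an input file, failing early if it is missing."""
    candidate = (base_dir / name).resolve()
    if not candidate.is_file():
        raise FileNotFoundError(f"input file not found: {candidate}")
    return candidate


# Demo in a throwaway directory so the sketch does not touch real files.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "f.a").write_text("demo input")
    pfn = absolute_input_path("f.a", demo_dir)
    print(pfn.is_absolute())
```

A path checked this way could then be passed as the last argument of `add_replica`.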

## 5. Create a Transformation Catalog (Specify Executables Used)

Executables used by the workflow (referred to as ***transformations***) need to be specified in the `TransformationCatalog`. This is done by creating `Transformation` objects, which represent executables. Once created, these must be added to the `TransformationCatalog` object. 

By default, `pegasus-plan` will look in `cwd` for a `transformations.yml` file.

For ACCESS, we recommend that users specify containers for their jobs to run in. This gives jobs
a similar environment irrespective of the ACCESS resource on which 
they are launched.

In Pegasus, you can either use a different container for each executable or the same container for all executables. When using containers with Pegasus, you have two options:

1. The container has your executables preinstalled. In that case, in your transformation catalog you specify the PFN as the path inside the container where your executable is accessible.

2. You use a generic baseline container and let Pegasus stage your executables in at runtime. To do that, mark the executable as **stageable** (`is_stageable=True`), and Pegasus will stage the executable into the container as part of executable staging.

In the example below, we indicate that the preprocess, findrange, and analyze executables need a container named `base-container` to run. However, we let Pegasus stage them into the container from their location on the site `condorpool` when your workflow runs.

In [None]:
# --- Container ----------------------------------------------------------

base_container = Container(
                  "base-container",
                  Container.SINGULARITY,
                  image="docker://karanvahi/pegasus-tutorial-minimal"
    
                  # Uncomment the image location below (and comment out the one above)
                  # if you run into Docker Hub pull rate limits. Do this if your
                  # workflow fails on the first try and stage-in jobs fail
                  # with an error like "ERROR: toomanyrequests: Too Many Requests." or
                  # "You have reached your pull rate limit. You may increase
                  # the limit by authenticating and upgrading:
                  # https://www.docker.com/increase-rate-limits.
                  # You must authenticate your pull requests."
                  #
                  # This is why Pegasus supports tar files of containers,
                  # and also ensures the pull from Docker Hub happens only
                  # once per workflow.
    
                  #image="http://download.pegasus.isi.edu/pegasus/tutorial/pegasus-tutorial-minimal.tar.gz"
               )


# --- Transformations ----------------------------------------------------------
preprocess = Transformation(
                "preprocess",
                site="condorpool",
                pfn="/usr/bin/pegasus-keg",
                is_stageable=True,
                container=base_container,
                arch=Arch.X86_64,
                os_type=OS.LINUX
            ).add_profiles(Namespace.CONDOR, request_disk="120MB")

findrange = Transformation(
                "findrange",
                site="condorpool",
                pfn="/usr/bin/pegasus-keg",
                is_stageable=True,
                container=base_container,
                arch=Arch.X86_64,
                os_type=OS.LINUX
            ).add_profiles(Namespace.CONDOR, request_disk="120MB")

analyze = Transformation(
                "analyze",
                site="condorpool",
                pfn="/usr/bin/pegasus-keg",
                is_stageable=True,
                container=base_container,
                arch=Arch.X86_64,
                os_type=OS.LINUX
            ).add_profiles(Namespace.CONDOR, request_disk="120MB")

tc = TransformationCatalog()\
    .add_containers(base_container)\
    .add_transformations(preprocess, findrange, analyze)\
    .write() # written to ./transformations.yml

In [None]:
!cat transformations.yml

As you can see above, the container is listed once, and multiple transformations can refer to the same container.

Some attributes to keep an eye out for:
- *name* - the name assigned to the container, used as a reference handle when describing executables in a `Transformation`

- *type* - the type of container, usually Docker or Singularity

- *image* - URL of an image in a Docker or Singularity hub, or URL of an existing Docker image exported as a tar file, or of a Singularity image

## 6. Create a Site Catalog

A Site Catalog allows you to describe to Pegasus what your sites look like. By default, Pegasus always
creates two default sites:

* local - indicates the workflow submit node from which you issue Pegasus commands. The **local** site is usually used to run only the data management tasks that Pegasus adds to the workflow. User compute jobs are not executed on this site.
* condorpool - indicates the default execution site, which consists of HTCondor workers. For this tutorial, the **condorpool** site is composed of HTCondor workers launched by pilot jobs on ACCESS sites, as described in Section 10.

In this tutorial, we define the local site explicitly, mainly to specify a job environment setup file. This file is sourced before a job runs on an ACCESS resource and loads the modules the job needs (namely Singularity).

In [None]:
# --- Sites -----------------------------------------------------------------
# add a local site with an optional job env file to use for compute jobs
shared_scratch_dir = "{}/work".format(BASE_DIR)
local_storage_dir = "{}/storage".format(BASE_DIR)

local = Site("local") \
    .add_directories(
    Directory(Directory.SHARED_SCRATCH, shared_scratch_dir)
        .add_file_servers(FileServer("file://" + shared_scratch_dir, Operation.ALL)),
    Directory(Directory.LOCAL_STORAGE, local_storage_dir)
        .add_file_servers(FileServer("file://" + local_storage_dir, Operation.ALL)))

job_env_file = Path(str(BASE_DIR) + "/../tools/job-env-setup.sh").resolve()
local.add_pegasus_profile(pegasus_lite_env_source=job_env_file)

sc = SiteCatalog()\
   .add_sites(local)\
   .write() # written to ./sites.yml

In [None]:
!cat sites.yml

## 7. Create the Workflow

The `Workflow` object is used to store jobs and dependencies between each job. Typical job creation is as follows:

```
# Define job Input/Output files
input_file = File("input.txt")
output_file1 = File("output1.txt")
output_file2 = File("output2.txt")

# Define job, passing in the transformation (executable) it will use
j = Job(transformation_obj)

# Specify command line arguments (if any) which will be passed to the transformation when run
j.add_args("arg1", "arg2", input_file, "arg3", output_file1)

# Specify input files (if any)
j.add_inputs(input_file)

# Specify output files (if any)
j.add_outputs(output_file1, output_file2)

# Add profiles to the job
j.add_env(FOO="bar")
j.add_profiles(Namespace.PEGASUS, key="checkpoint.time", value=1)

# Add the job to the workflow object
wf.add_jobs(j)
```

By default, dependencies between jobs are inferred based on input and output files. 

In [None]:
# --- Workflow -----------------------------------------------------------------
wf = Workflow("blackdiamond")

fb1 = File("f.b1")
fb2 = File("f.b2")
job_preprocess = Job(preprocess)\
                    .add_args("-a", "preprocess", "-T", "3", "-i", fa, "-o", fb1, fb2)\
                    .add_inputs(fa)\
                    .add_outputs(fb1, fb2)

fc1 = File("f.c1")
job_findrange_1 = Job(findrange)\
                    .add_args("-a", "findrange", "-T", "3", "-i", fb1, "-o", fc1)\
                    .add_inputs(fb1)\
                    .add_outputs(fc1)

fc2 = File("f.c2")
job_findrange_2 = Job(findrange)\
                    .add_args("-a", "findrange", "-T", "3", "-i", fb2, "-o", fc2)\
                    .add_inputs(fb2)\
                    .add_outputs(fc2)

fd = File("f.d")
job_analyze = Job(analyze)\
                .add_args("-a", "analyze", "-T", "3", "-i", fc1, fc2, "-o", fd)\
                .add_inputs(fc1, fc2)\
                .add_outputs(fd)

wf.add_jobs(job_preprocess, job_findrange_1, job_findrange_2, job_analyze)
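
The idea of inferring dependencies from files can be sketched in plain Python: a job depends on whichever job produces one of its inputs. This is an illustration of the principle only, not the Pegasus implementation; the `jobs` dictionary and the `infer_edges` helper are hypothetical and simply mirror the diamond workflow defined above.

```python
# Each job lists the logical files it reads and writes (mirrors the diamond workflow).
jobs = {
    "preprocess": {"inputs": ["f.a"], "outputs": ["f.b1", "f.b2"]},
    "findrange_1": {"inputs": ["f.b1"], "outputs": ["f.c1"]},
    "findrange_2": {"inputs": ["f.b2"], "outputs": ["f.c2"]},
    "analyze": {"inputs": ["f.c1", "f.c2"], "outputs": ["f.d"]},
}


def infer_edges(jobs):
    """Return sorted (parent, child) pairs where the child reads a file the parent writes."""
    producer = {}
    for name, spec in jobs.items():
        for lfn in spec["outputs"]:
            producer[lfn] = name

    edges = set()
    for name, spec in jobs.items():
        for lfn in spec["inputs"]:
            # Files with no producer (such as f.a) are initial inputs, not dependencies.
            if lfn in producer:
                edges.add((producer[lfn], name))
    return sorted(edges)


edges = infer_edges(jobs)
for parent, child in edges:
    print(f"{parent} -> {child}")
```

Note that `f.a` creates no edge because no job produces it, which is why it has to come from the replica catalog instead.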

## 8. Visualizing the Workflow

Once you have defined your abstract workflow, you can use `pegasus-graphviz` to visualize it. `Workflow.graph()` will invoke `pegasus-graphviz` internally and render your workflow using one of the available formats such as `png`. **Note that Workflow.write() must be invoked before calling Workflow.graph().**

In [None]:
try:
    wf.write()
    wf.graph(include_files=True, label="xform-id", output="graph.png")
except PegasusClientError as e:
    print(e)

In [None]:
# view rendered workflow
from IPython.display import Image
Image(filename='graph.png')

## 9. Run the Workflow

When working in Python, you can use the reference to the `Workflow` object to plan, run, and monitor the workflow directly. These methods are wrappers around the Pegasus CLI tools, and as such, the same arguments may be passed to them. 

**Note that the Pegasus binaries must be added to your PATH for this to work.**

Please wait for the progress bar to indicate that the workflow has finished.

In [None]:
try:
    wf.plan(submit=True)\
        .wait()
except PegasusClientError as e:
    print(e)


Note the line in the output that starts with `pegasus-status`: it contains the command you can use to monitor the status of the workflow. We will cover this command line tool in the next couple of notebooks. The path it contains is the path to the submit directory, where all of the files required to submit and monitor the workflow are stored. For now, we will just continue to use the Python `Workflow` object.

## 10. Optional: Launch Pilot Jobs on ACCESS Resources

If you opted to use the included test cluster, the jobs should now start running. Please read the rest of this section to understand how provisioning works in the production case, but no other action is necessary. 

If you opted to run the example using your allocation, you should now have some idle jobs in the queue. They are idle because there are no resources yet to execute on. Resources can be brought in with the HTCondor Annex tool, by sending pilot jobs (also called glideins) to the ACCESS resource providers. These pilots have the following properties:

* A pilot can run multiple user jobs - it stays active until no more user jobs are available or until end of life has been reached, whichever comes first.
* A pilot is partitionable - job slots will dynamically be created based on the resource requirements in the user jobs. This means you can fit multiple user jobs on a compute node at the same time.
* A pilot will only run jobs for the user who started it.

The process of starting pilots is described in the [ACCESS Pegasus Documentation](https://xsedetoaccess.ccs.uky.edu/confluence/redirect/ACCESS+Pegasus.html).
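
To give a feel for what "partitionable" means, here is a toy model of carving job slots out of one pilot. It is a deliberately simplified first-fit sketch, not HTCondor's actual matchmaking; the `Pilot` class, the `carve_slots` helper, and all resource numbers are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Pilot:
    """Resources offered by one pilot (hypothetical numbers in the demo below)."""
    cpus: int
    memory_mb: int


def carve_slots(pilot: Pilot, requests: List[Tuple[str, int, int]]) -> List[str]:
    """Place jobs first-come, first-served while enough CPUs and memory remain."""
    free_cpus, free_mem = pilot.cpus, pilot.memory_mb
    placed = []
    for name, cpus, mem in requests:
        if cpus <= free_cpus and mem <= free_mem:
            free_cpus -= cpus
            free_mem -= mem
            placed.append(name)
    return placed


# (job name, requested CPUs, requested memory in MB)
requests = [
    ("preprocess", 1, 1000),
    ("findrange_1", 1, 1000),
    ("findrange_2", 1, 1000),
    ("analyze", 2, 2000),
]
placed = carve_slots(Pilot(cpus=4, memory_mb=8000), requests)
print(placed)
```

In this toy run, three single-core jobs share the pilot at once, while the two-core job has to wait because only one core is left.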


## 11. Statistics

Depending on whether the workflow finished successfully, you have different options for what to do next. If the workflow failed, you can use `wf.analyze()` to get help finding out what went wrong. If the workflow finished successfully, you can pull some statistics from the provenance database:

In [None]:
try:
    wf.statistics()
except PegasusClientError as e:
    print(e)

## 12. Container Setup on a Worker Node

Now that we have been able to run the workflow successfully, let's look beneath the covers to see how a job that has to run in a container gets set up on a worker node. The container setup for a job happens within PegasusLite, a lightweight Pegasus remote execution engine that wraps the user task on the remote worker node when a job is scheduled to the node. 

PegasusLite is responsible for figuring out the appropriate job directory in which the job executes, staging-in datasets that a job requires, launching the job, staging-out data, and cleaning up the job directory.

![Container Setup in PegasusLite](../images/container-host.png)
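
The lifecycle described above (pick a job directory, stage in, launch, stage out, clean up) can be mimicked with a few lines of plain Python. This is only a conceptual sketch and does not reflect how PegasusLite is implemented; `run_wrapped`, the `analyze` stand-in task, and the file contents are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def run_wrapped(task, inputs, outputs, dest_dir):
    """Run `task` in a private job directory, staging files in and out around it."""
    job_dir = Path(tempfile.mkdtemp(prefix="job-"))  # 1. choose a job directory
    try:
        for src in inputs:  # 2. stage in the inputs
            shutil.copy(src, job_dir / Path(src).name)
        task(job_dir)  # 3. launch the user task
        for name in outputs:  # 4. stage out the outputs
            shutil.copy(job_dir / name, Path(dest_dir) / name)
    finally:
        shutil.rmtree(job_dir, ignore_errors=True)  # 5. clean up the job directory
    return job_dir


def analyze(job_dir: Path) -> None:
    """Stand-in user task: concatenate the two inputs into f.d."""
    merged = (job_dir / "f.c1").read_text() + (job_dir / "f.c2").read_text()
    (job_dir / "f.d").write_text(merged)


# Demo in a throwaway workspace.
workspace = Path(tempfile.mkdtemp(prefix="demo-"))
(workspace / "f.c1").write_text("left\n")
(workspace / "f.c2").write_text("right\n")
used_dir = run_wrapped(analyze, [workspace / "f.c1", workspace / "f.c2"], ["f.d"], workspace)
result = (workspace / "f.d").read_text()
shutil.rmtree(workspace)
print(result)
```

The `finally` clause makes sure the job directory is removed even if the task fails, which is the same reason a wrapper is useful on a shared worker node.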

To see how Pegasus handled the container in this case, let's look at some of the plumbing for the `analyze` job. The HTCondor submit file can be seen with:

```bash
$ cat `find scitech/pegasus/blackdiamond/run0001 -name analyze_ID0000004.sub`
```

Look at the `transfer_input_files` attribute line, specifically the `base-container` entry. It is transferred together with all the other inputs for the job.


transfer_input_files = analyze,f.c2,f.c1,**base-container**,..
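
The same check can be scripted. Below is a minimal sketch that extracts the `transfer_input_files` entries from submit file text and tests whether the container is among them. The `transferred_inputs` helper is hypothetical, and the sample text is hand-written from the entries shown above; a real submit file contains more attributes and more entries.

```python
def transferred_inputs(submit_text):
    """Return the comma-separated entries of the transfer_input_files attribute."""
    for line in submit_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip().lower() == "transfer_input_files":
            return [item.strip() for item in value.split(",") if item.strip()]
    return []


# Hand-written sample; other submit attributes are omitted.
sample_submit = """\
# other submit attributes omitted
transfer_input_files = analyze,f.c2,f.c1,base-container
"""

inputs = transferred_inputs(sample_submit)
print("base-container" in inputs)
```

For a real run, the text could come from reading the `.sub` file located with the `find` command above.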

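Looking at the corresponding `.sh` file, we can see how Pegasus executed the container by invoking the container runtime on a script written out at runtime. Because the container in this notebook is declared with `Container.SINGULARITY`, expect a Singularity invocation there rather than `docker run`.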

## 13. What's Next?

To continue exploring Pegasus, and specifically learn how to debug failed workflows, please open the notebook in `02-Debugging/`