# Pegasus Tutorial

Welcome to the Pegasus tutorial notebook, which is intended for new users who want to get a quick overview of Pegasus concepts and usage. This tutorial covers:

 - Using the Pegasus API to generate an abstract workflow
 - Using the API to plan the abstract workflow into an executable workflow
 - Pegasus catalogs for sites, transformations, and data
 - Debugging and recovering from failures (02-Debugging notebook)
 - Command line tools (03-Command-Line-Tools notebook)
 
For a quick overview of Pegasus, please see this short YouTube video:

[![A 5 Minute Introduction](../images/youtube-pegasus-intro.png)](https://www.youtube.com/watch?v=MNN80OHMQUQ "A 5 Minute Introduction")


## Diamond Workflow

This notebook will generate the **diamond workflow** illustrated below, then plan and execute the workflow on the local condorpool. Rectangles represent input/output files, and ovals represent compute jobs. Arrows represent file dependencies between compute jobs.

![Diamond Workflow](../images/diamond.svg)

The abstract workflow description that you specify to Pegasus is portable, and usually does not contain the locations of physical input files, executables, or cluster endpoints where jobs are executed. Instead, Pegasus uses three information catalogs during the planning process to resolve them. The following picture illustrates this process:

![Catalogs](../images/catalogs.png)
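
To make the role of the catalogs more concrete, here is a toy model in plain Python. It is only a sketch: the dictionaries, the `resolve` helper, and every path in it are hypothetical stand-ins, and the real Pegasus planner does far more than a dictionary lookup.

```python
# Toy stand-ins for the three catalogs. All names and paths are hypothetical.
replica_catalog = {"f.a": "/home/user/f.a"}                      # logical file -> physical path
transformation_catalog = {"preprocess": "/usr/bin/pegasus-keg"}  # logical name -> executable
site_catalog = {"local": {"scratch": "/home/user/LOCAL/work"}}   # site -> directories

# An abstract job only refers to logical names, never to physical locations.
abstract_job = {"transformation": "preprocess", "inputs": ["f.a"]}


def resolve(job, exec_site):
    """Map an abstract job to concrete locations using the toy catalogs."""
    return {
        "executable": transformation_catalog[job["transformation"]],
        "inputs": [replica_catalog[name] for name in job["inputs"]],
        "work_dir": site_catalog[exec_site]["scratch"],
    }


concrete_job = resolve(abstract_job, "local")
print(concrete_job)
```

Because the abstract job never mentions a path, the same description could be resolved against different catalogs without being rewritten.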

## 0. Set Jupyter Environment

We set some environment variables and extend the Python module search path so that the Pegasus libraries can be imported successfully.
This is temporary until the Jupyter Notebook setup is fixed.

In [None]:
import sys
sys.path.append("/usr/lib64/python3.6/site-packages")

In [None]:
%env LANG=en_US.utf-8

## 1. Import Python API

Pegasus 5.0 introduces a new Python API, which is fully documented in the [Pegasus reference guide](https://pegasus.isi.edu/documentation/reference-guide/api-reference.html). A high-level overview of the components:
<br>
```
from Pegasus.api.mixins import EventType, Namespace
from Pegasus.api.properties import Properties
from Pegasus.api.replica_catalog import File, ReplicaCatalog
from Pegasus.api.site_catalog import (
    OS,
    Arch,
    Directory,
    FileServer,
    Grid,
    Operation,
    Scheduler,
    Site,
    SiteCatalog,
)
from Pegasus.api.transformation_catalog import (
    Container,
    Transformation,
    TransformationCatalog,
    TransformationSite,
)
from Pegasus.api.workflow import Job, SubWorkflow, Workflow
from Pegasus.client._client import PegasusClientError
```

While you can import just parts of the API, the most convenient way is to just import it all:

In [None]:
from Pegasus.api import *

## 2. Configure Logging

Configure logging. While this is **not required**, it is useful for seeing output from tools such as `pegasus-plan`, `pegasus-analyzer`, etc. when using these Python wrappers. Here we also include a few other imports we might need further down.

In [None]:
from pathlib import Path

import logging

logging.basicConfig(level=logging.DEBUG)
BASE_DIR = Path(".").resolve()

## 3. Configure Pegasus Properties

The `pegasus.properties` file can now be generated using the `Properties()` object as shown below. To see a list of the most commonly used properties, you can use `Properties.ls(prefix)`. By default, `pegasus-plan` will look in `cwd` for a `pegasus.properties` file.

In [None]:
# --- Properties ---------------------------------------------------------------
props = Properties()
props["pegasus.monitord.encoding"] = "json"                                                                    
props["pegasus.catalog.workflow.amqp.url"] = "amqp://friend:donatedata@msgs.pegasus.isi.edu:5672/prod/workflows"
props["pegasus.mode"] = "tutorial" # speeds up tutorial workflows - remove for production ones
props.write() # written to ./pegasus.properties 

In [None]:
Properties.ls("condor.request")

## 4. Create a Replica Catalog (Specify Initial Input Files)

Any initial input files given to the workflow should be specified in the `ReplicaCatalog`. This object tells Pegasus where each input file is physically located. First, we create a file that will be used as input to this workflow. 

In [None]:
with open("f.a", "w") as f:
    f.write("This is the contents of the input file for the diamond workflow!")

The file `./f.a` will be used in this workflow, and so we create a corresponding `File` object. Metadata may also be added to the file as shown below.

Next, a `ReplicaCatalog` object is created so that the physical locations of each input file can be cataloged. This is done using the `ReplicaCatalog.add_replica(site, file, path)` function. First, as the file `f.a` resides here on the submit machine, we use the reserved keyword `local` for the `site` parameter. Second, the `File` object is passed in for the `file` parameter. Finally, the absolute path to the file is given. `pathlib.Path` may be used as long as an absolute path is given.
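
Since the path must be absolute, it can help to check this before cataloging a file. The snippet below is a small standard-library sketch. The `to_absolute` helper is a hypothetical name used for illustration and is not part of the Pegasus API.

```python
from pathlib import Path


def to_absolute(path):
    """Hypothetical helper: resolve a path against the current working directory."""
    return Path(path).resolve()


relative = Path("f.a")
absolute = to_absolute(relative)

# A bare file name is relative; after resolve() it is anchored at the filesystem root.
print(relative.is_absolute())
print(absolute.is_absolute())
```

The resolved path is the kind of value that would be passed as the `path` argument of `add_replica`.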

By default, `pegasus-plan` will look in `cwd` for a `replicas.yml` file.

In [None]:
# --- Replicas -----------------------------------------------------------------
fa = File("f.a").add_metadata(creator="ryan")
rc = ReplicaCatalog()\
    .add_replica("local", fa, Path(".").resolve() / "f.a")\
    .write() # written to ./replicas.yml 

In [None]:
!cat replicas.yml

## 5. Create a Transformation Catalog (Specify Executables Used)

Any executables (referred to as ***transformations***) used by the workflow need to be specified in the `TransformationCatalog`. This is done by creating `Transformation` objects, which represent executables. Once created, these must be added to the `TransformationCatalog` object.

By default, `pegasus-plan` will look in `cwd` for a `transformations.yml` file.

In [None]:
# --- Transformations ----------------------------------------------------------
preprocess = Transformation(
                "preprocess",
                site="local",
                pfn="/usr/bin/pegasus-keg",
                is_stageable=True,
                arch=Arch.X86_64,
                os_type=OS.LINUX
            )

findrange = Transformation(
                "findrange",
                site="local",
                pfn="/usr/bin/pegasus-keg",
                is_stageable=True,
                arch=Arch.X86_64,
                os_type=OS.LINUX
            )

analyze = Transformation(
                "analyze",
                site="local",
                pfn="/usr/bin/pegasus-keg",
                is_stageable=True,
                arch=Arch.X86_64,
                os_type=OS.LINUX
            )

tc = TransformationCatalog()\
    .add_transformations(preprocess, findrange, analyze)\
    .write() # written to ./transformations.yml

In [None]:
!cat transformations.yml

## 6. Create a Site Catalog

A Site Catalog lets you describe to Pegasus what your sites look like. By default, Pegasus always creates two default sites:

 - `local`: indicates the workflow submit node from which you issue Pegasus commands. The `local` site is usually used to run only the data management tasks that Pegasus adds to the workflow. The user's compute jobs are not executed on this site.
 - `slurm`: refers to the local SLURM cluster (in our case, discovery) to which we will submit the workflow.

In [None]:
# --- Sites -----------------------------------------------------------------
# add a local site with an optional job env file to use for compute jobs
shared_scratch_dir = "{}/LOCAL/work".format(BASE_DIR)
local_storage_dir = "{}/LOCAL/storage".format(BASE_DIR)

# some variables for slurm cluster. you may wish to update
# them for your needs
slurm_partition="main"
slurm_account="ttrojan_123"

local = Site("local") \
    .add_directories(
    Directory(Directory.SHARED_SCRATCH, shared_scratch_dir)
        .add_file_servers(FileServer("file://" + shared_scratch_dir, Operation.ALL)),
    Directory(Directory.LOCAL_STORAGE, local_storage_dir)
        .add_file_servers(FileServer("file://" + local_storage_dir, Operation.ALL)))

slurm_scratch_dir = "{}/SLURM/work".format(BASE_DIR)
slurm_storage_dir = "{}/SLURM/storage".format(BASE_DIR)

slurm = Site("slurm")\
    .add_directories(
    Directory(Directory.SHARED_SCRATCH, slurm_scratch_dir)
        .add_file_servers(FileServer("file://" + slurm_scratch_dir, Operation.ALL)),
    Directory(Directory.LOCAL_STORAGE, slurm_storage_dir)
        .add_file_servers(FileServer("file://" + slurm_storage_dir, Operation.ALL)))

slurm.add_pegasus_profile(
                        style="glite",
                        queue=slurm_partition,
                        project=slurm_account,
                        data_configuration="nonsharedfs",
                        auxillary_local="true",
                        nodes=1,
                        ppn=1,
                        runtime=1800,
                        clusters_num=2
                    )
slurm.add_condor_profile(grid_resource="batch slurm")

sc = SiteCatalog()
sc.add_sites(local)
sc.add_sites(slurm)
sc.write() # written to ./sites.yml

In [None]:
!cat sites.yml

## 7. Create the Workflow

The `Workflow` object is used to store jobs and dependencies between each job. Typical job creation is as follows:

```
# Define job Input/Output files
input_file = File("input.txt")
output_file1 = File("output1.txt")
output_file2 = File("output2.txt")

# Define job, passing in the transformation (executable) it will use
j = Job(transformation_obj)

# Specify command line arguments (if any) which will be passed to the transformation when run
j.add_args("arg1", "arg2", input_file, "arg3", output_file1)

# Specify input files (if any)
j.add_inputs(input_file)

# Specify output files (if any)
j.add_outputs(output_file1, output_file2)

# Add profiles to the job
j.add_env(FOO="bar")
j.add_profiles(Namespace.PEGASUS, key="checkpoint.time", value=1)

# Add the job to the workflow object
wf.add_jobs(j)
```

By default, dependencies between jobs are inferred based on input and output files.
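
The idea behind this inference can be sketched in plain Python: a job that consumes a file depends on the job that produces it. This is an illustrative model only, not Pegasus's implementation. The `jobs` table and the `infer_dependencies` function are hypothetical names, with file names taken from the diamond workflow.

```python
# Hypothetical job table for the diamond workflow: job name -> (inputs, outputs)
jobs = {
    "preprocess": (["f.a"], ["f.b1", "f.b2"]),
    "findrange_1": (["f.b1"], ["f.c1"]),
    "findrange_2": (["f.b2"], ["f.c2"]),
    "analyze": (["f.c1", "f.c2"], ["f.d"]),
}


def infer_dependencies(jobs):
    """Return sorted (parent, child) pairs where child reads a file parent writes."""
    # Record which job produces each file.
    producer = {}
    for name, (_, outputs) in jobs.items():
        for file_name in outputs:
            producer[file_name] = name

    # A job depends on the producer of each of its inputs, if one exists.
    edges = set()
    for name, (inputs, _) in jobs.items():
        for file_name in inputs:
            if file_name in producer:
                edges.add((producer[file_name], name))
    return sorted(edges)


edges = infer_dependencies(jobs)
for parent, child in edges:
    print(f"{parent} -> {child}")
```

Note that `f.a` has no producer in the table, so it creates no edge. It is an initial input, which is why it is listed in the replica catalog instead.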

In [None]:
# --- Workflow -----------------------------------------------------------------
wf = Workflow("blackdiamond")

fb1 = File("f.b1")
fb2 = File("f.b2")
job_preprocess = Job(preprocess)\
                    .add_args("-a", "preprocess", "-T", "3", "-i", fa, "-o", fb1, fb2)\
                    .add_inputs(fa)\
                    .add_outputs(fb1, fb2)

fc1 = File("f.c1")
job_findrange_1 = Job(findrange)\
                    .add_args("-a", "findrange", "-T", "3", "-i", fb1, "-o", fc1)\
                    .add_inputs(fb1)\
                    .add_outputs(fc1)

fc2 = File("f.c2")
job_findrange_2 = Job(findrange)\
                    .add_args("-a", "findrange", "-T", "3", "-i", fb2, "-o", fc2)\
                    .add_inputs(fb2)\
                    .add_outputs(fc2)

fd = File("f.d")
job_analyze = Job(analyze)\
                .add_args("-a", "analyze", "-T", "3", "-i", fc1, fc2, "-o", fd)\
                .add_inputs(fc1, fc2)\
                .add_outputs(fd)

wf.add_jobs(job_preprocess, job_findrange_1, job_findrange_2, job_analyze)

## 7. Visualizing the Workflow

Once you have defined your abstract workflow, you can use `pegasus-graphviz` to visualize it. `Workflow.graph()` will invoke `pegasus-graphviz` internally and render your workflow using one of the available formats such as `png`. **Note that Workflow.write() must be invoked before calling Workflow.graph().**
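
For readers unfamiliar with Graphviz, the sketch below shows what a DOT description of the diamond's job dependencies could look like. It does not reproduce the output of `pegasus-graphviz`. The `to_dot` function and the edge list are hypothetical and exist only to illustrate the format.

```python
# Hypothetical (parent, child) edges for the diamond workflow.
edges = [
    ("preprocess", "findrange_1"),
    ("preprocess", "findrange_2"),
    ("findrange_1", "analyze"),
    ("findrange_2", "analyze"),
]


def to_dot(name, edges):
    """Build a Graphviz DOT digraph string from (parent, child) pairs."""
    lines = [f"digraph {name} {{"]
    for parent, child in edges:
        lines.append(f'    "{parent}" -> "{child}";')
    lines.append("}")
    return "\n".join(lines)


dot_text = to_dot("blackdiamond", edges)
print(dot_text)
```

Text in this format can be rendered to an image by the Graphviz `dot` command, if it is installed.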

In [None]:
try:
    wf.write()
    wf.graph(include_files=True, label="xform-id", output="graph.png")
except PegasusClientError as e:
    print(e)

In [None]:
# view rendered workflow
from IPython.display import Image
Image(filename='graph.png')

## 8. Run the Workflow

When working in Python, you can use the reference to the `Workflow` object to plan, run, and monitor the workflow directly. These methods are wrappers around Pegasus CLI tools, and as such, the same arguments may be passed to them.

**Note that the Pegasus binaries must be added to your PATH for this to work.**

Please wait for the progress bar to indicate that the workflow has finished.

In [None]:
try:
    wf.plan(sites=["slurm"], verbose=1, submit=True)\
        .wait()
except PegasusClientError as e:
    print(e)


Note that the line in the output that starts with `pegasus-status` contains the command you can use to monitor the status of the workflow. We will cover this command line tool in the next couple of notebooks. The path it contains is the path to the submit directory, where all of the files required to submit and monitor the workflow are stored. For now, we will just continue to use the Python `Workflow` object.

## 9. Statistics

Depending on whether the workflow finished successfully, you have options on what to do next. If the workflow failed, you can use `wf.analyze()` to get help finding out what went wrong. If the workflow finished successfully, you can pull out some statistics from the provenance database:

In [None]:
try:
    wf.statistics()
except PegasusClientError as e:
    print(e)

## 10. What's Next?

To continue exploring Pegasus, and specifically to learn how to debug failed workflows, please open the notebook in `02-Debugging/`.