# Workflow Debugging

When you run complex computations (such as workflows) on complex computing infrastructure (for example, HPC clusters), things will go wrong. It is therefore important to understand how to detect and debug issues as they appear. The good news is that Pegasus does a good job of detecting failures, for example by checking exit codes, and provides tooling to help you debug. In this notebook, we will use the same workflow as in the previous one, but introduce an error and see whether we can detect it.
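
Exit codes are the most basic failure signal: by convention, a process that exits with status 0 succeeded and any other status indicates a failure. The sketch below only illustrates that convention using Python's standard library. The helper name `run_and_check` is made up for this example and is not part of the Pegasus API, and exit codes are just one of the signals a workflow system can use.

```python
import subprocess
import sys


def run_and_check(cmd):
    """Run a command and return (succeeded, exit_code)."""
    completed = subprocess.run(cmd, capture_output=True, text=True)
    # By convention, exit code 0 means success; anything else is a failure.
    return completed.returncode == 0, completed.returncode


# Simulate a failing job: a child Python process that exits with status 3.
ok, code = run_and_check([sys.executable, "-c", "import sys; sys.exit(3)"])
print(ok, code)
```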

First, let's clean up some files so that we can run this notebook multiple times:

In [None]:
!rm -f f.a

![Diamond Workflow](../images/diamond.svg)

In [None]:
import logging

from pathlib import Path

from Pegasus.api import *

logging.basicConfig(level=logging.DEBUG)
BASE_DIR = Path(".").resolve()

# --- Properties ---------------------------------------------------------------
props = Properties()
props["pegasus.monitord.encoding"] = "json"                                                                    
props["pegasus.catalog.workflow.amqp.url"] = "amqp://friend:donatedata@msgs.pegasus.isi.edu:5672/prod/workflows"
props["pegasus.mode"] = "tutorial" # speeds up tutorial workflows - remove for production ones

# If you do not want your workflow to run on the TestPool, uncomment this line
#props.add_site_profile("condorpool", "condor", "requirements", "TestPool =!= True")

props.write() # written to ./pegasus.properties 

# --- Replicas -----------------------------------------------------------------
with open("f-problem.a", "w") as f:
   f.write("This is sample input to KEG")

fa = File("f.a").add_metadata(creator="ryan")
rc = ReplicaCatalog().add_replica("local", fa, Path(".").resolve() / "f.a")

# --- Container ----------------------------------------------------------

base_container = Container(
                  "base-container",
                  Container.SINGULARITY,
                  image="docker://karanvahi/pegasus-tutorial-minimal"
    
                  # Uncomment the location below (and comment out the location above)
                  # if you run into Docker pull rate limits. Do this if your
                  # workflow fails on the first try because stage-in jobs fail
                  # with an error like "ERROR: toomanyrequests: Too Many Requests." or
                  # "You have reached your pull rate limit. You may increase
                  # the limit by authenticating and upgrading:
                  # https://www.docker.com/increase-rate-limits.
                  # You must authenticate your pull requests."
                  #
                  # This is why Pegasus supports tar files of containers,
                  # and also ensures the pull from Docker Hub happens only
                  # once per workflow.
    
                  #image="http://download.pegasus.isi.edu/pegasus/tutorial/pegasus-tutorial-minimal.tar.gz"
               )


# --- Transformations ----------------------------------------------------------
preprocess = Transformation(
                "preprocess",
                site="condorpool",
                pfn="/usr/bin/pegasus-keg",
                is_stageable=True,
                container=base_container,
                arch=Arch.X86_64,
                os_type=OS.LINUX
            ).add_profiles(Namespace.CONDOR, request_disk="120MB")

findrange = Transformation(
                "findrange",
                site="condorpool",
                pfn="/usr/bin/pegasus-keg",
                is_stageable=True,
                container=base_container,
                arch=Arch.X86_64,
                os_type=OS.LINUX
            ).add_profiles(Namespace.CONDOR, request_disk="120MB")

analyze = Transformation(
                "analyze",
                site="condorpool",
                pfn="/usr/bin/pegasus-keg",
                is_stageable=True,
                container=base_container,
                arch=Arch.X86_64,
                os_type=OS.LINUX
            ).add_profiles(Namespace.CONDOR, request_disk="120MB")

tc = TransformationCatalog()\
    .add_containers(base_container)\
    .add_transformations(preprocess, findrange, analyze)\
    .write() # written to ./transformations.yml

# --- Sites -----------------------------------------------------------------
# add a local site with an optional job env file to use for compute jobs
shared_scratch_dir = "{}/work".format(BASE_DIR)
local_storage_dir = "{}/storage".format(BASE_DIR)

local = Site("local") \
    .add_directories(
    Directory(Directory.SHARED_SCRATCH, shared_scratch_dir)
        .add_file_servers(FileServer("file://" + shared_scratch_dir, Operation.ALL)),
    Directory(Directory.LOCAL_STORAGE, local_storage_dir)
        .add_file_servers(FileServer("file://" + local_storage_dir, Operation.ALL)))

job_env_file = (BASE_DIR / ".." / "tools" / "job-env-setup.sh").resolve()
local.add_pegasus_profile(pegasus_lite_env_source=job_env_file)

sc = SiteCatalog()\
   .add_sites(local)\
   .write() # written to ./sites.yml

# --- Workflow -----------------------------------------------------------------
r'''
                     [f.b1] - (findrange) - [f.c1]
                     /                             \
[f.a] - (preprocess)                               (analyze) - [f.d]
                     \                             /
                     [f.b2] - (findrange) - [f.c2]

'''
wf = Workflow("blackdiamond")

fb1 = File("f.b1")
fb2 = File("f.b2")
job_preprocess = Job(preprocess)\
                     .add_args("-a", "preprocess", "-T", "3", "-i", fa, "-o", fb1, fb2)\
                     .add_inputs(fa)\
                     .add_outputs(fb1, fb2)

fc1 = File("f.c1")
job_findrange_1 = Job(findrange)\
                     .add_args("-a", "findrange", "-T", "3", "-i", fb1, "-o", fc1)\
                     .add_inputs(fb1)\
                     .add_outputs(fc1)

fc2 = File("f.c2")
job_findrange_2 = Job(findrange)\
                     .add_args("-a", "findrange", "-T", "3", "-i", fb2, "-o", fc2)\
                     .add_inputs(fb2)\
                     .add_outputs(fc2)

fd = File("f.d")
job_analyze = Job(analyze)\
               .add_args("-a", "analyze", "-T", "3", "-i", fc1, fc2, "-o", fd)\
               .add_inputs(fc1, fc2)\
               .add_outputs(fd)

wf.add_jobs(job_preprocess, job_findrange_1, job_findrange_2, job_analyze)
wf.add_replica_catalog(rc)


## 2. Run the Workflow



In [None]:
try:
    wf.plan(submit=True)\
        .wait()
except PegasusClientError as e:
    print(e)


Remember, the process of starting pilots is described in the [ACCESS Pegasus Documentation](https://xsedetoaccess.ccs.uky.edu/confluence/redirect/ACCESS+Pegasus.html).

## 3. Analyze

If the workflow failed, you can use `wf.analyze()` to get help finding out what went wrong.

In [None]:
try:
    wf.analyze()
except PegasusClientError as e:
    print(e)

In the output, we can see `Expected local file does not exist: /home/scitech/notebooks/02-Debugging/f.a`, which tells us that an input file is missing. This is because we created it with the wrong name (`f-problem.a`) instead of the intended name (`f.a`).
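
A missing input like this can also be caught before planning with a simple pre-flight check. The following is a minimal sketch under stated assumptions: `find_missing_inputs` is a hypothetical helper written for this example, not a Pegasus function, and it only checks local files. The demonstration runs in a temporary directory so it does not touch the notebook's own files.

```python
import tempfile
from pathlib import Path


def find_missing_inputs(paths):
    """Return the paths that do not exist as regular files."""
    return [Path(p) for p in paths if not Path(p).is_file()]


# Demonstration in a throwaway directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    # Reproduce the mistake: the file is written under the wrong name.
    (tmp_dir / "f-problem.a").write_text("This is sample input to KEG")
    missing = find_missing_inputs([tmp_dir / "f.a"])
    missing_names = [p.name for p in missing]

print(missing_names)
```

In this notebook, the equivalent check would be `find_missing_inputs([BASE_DIR / "f.a"])`, run before the workflow is planned.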

## 4. Resolving the issue

Let's resolve the issue by renaming the wrongly named input file:

In [None]:
!mv f-problem.a f.a

## 5. Restart the workflow

We can now restart the workflow from where it stopped. As an alternative to `run()`, you could `plan()` a new instance, but in that case the workflow would start over from the beginning.

In [None]:
try:
    wf.run() \
      .wait()
except PegasusClientError as e:
    print(e)
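
The difference between restarting and planning a new instance can be pictured with a toy model. The sketch below is purely illustrative and is not how Pegasus implements restarts: the function `run_jobs`, the flat job list, and the simulated failure in `findrange_2` are all assumptions made for this example. It shows only the idea that a restart skips work recorded as finished, while a fresh start has no such record.

```python
def run_jobs(jobs, done, fail_on=None):
    """Run jobs in order, skipping those in `done`; stop at the first failure.

    Returns (jobs_executed_in_this_attempt, succeeded).
    """
    executed = []
    for name in jobs:
        if name in done:
            continue  # finished in an earlier attempt, so not re-run
        if name == fail_on:
            return executed, False  # simulated failure: stop here
        done.add(name)
        executed.append(name)
    return executed, True


jobs = ["preprocess", "findrange_1", "findrange_2", "analyze"]

# First attempt fails part-way through; `done` remembers what finished.
done = set()
first, first_ok = run_jobs(jobs, done, fail_on="findrange_2")

# Restart: the same `done` record is reused, so only the remaining jobs run.
second, second_ok = run_jobs(jobs, done)

# Fresh start: an empty record means every job runs again.
fresh, fresh_ok = run_jobs(jobs, set())

print(first, first_ok)
print(second, second_ok)
print(fresh, fresh_ok)
```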

## What's Next?

To continue exploring Pegasus, and specifically to learn about its command-line tools, please open the notebook in `03-Command-Line-Tools/`.