# Workflow Debugging

When running complex computations (such as workflows) on complex computing infrastructure (for example, HPC clusters), things will go wrong. It is therefore important to understand how to detect and debug issues as they appear. The good news is that Pegasus handles much of the detection for you, for example by checking job exit codes, and provides tooling to help you debug. In this notebook, we will use the same workflow as in the previous one, but introduce an error and see whether we can detect it.
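As a rough illustration of what exit-code-based detection means in general, the sketch below runs two small commands and classifies each as succeeded or failed from its exit code. This is not how Pegasus implements detection internally; `run_and_check` is a hypothetical helper written only for this example.

```python
import subprocess
import sys


def run_and_check(command):
    """Run a command and return (succeeded, exit_code).

    By convention, an exit code of 0 means success and any other
    value signals a failure.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0, result.returncode


# A command that finishes normally (exit code 0) ...
ok, code = run_and_check([sys.executable, "-c", "print('hello')"])
print(f"succeeded={ok} exit_code={code}")

# ... and one that deliberately exits with a non-zero code.
ok, code = run_and_check([sys.executable, "-c", "import sys; sys.exit(3)"])
print(f"succeeded={ok} exit_code={code}")
```

A workflow system can apply the same convention to every job it launches, which is why a well-behaved job script should exit with a non-zero code when something goes wrong.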

To introduce the error, let's rename the input file so that it no longer matches the path registered in the workflow description:

In [None]:
!mv inputs/Alices_Adventures_in_Wonderland_by_Lewis_Carroll.txt inputs/Alices_Adventures_in_Wonderland_by_Lewis_Carroll.txt.BADNAME

Now plan and run the workflow:

In [None]:
import logging
import os
import time

from pathlib import Path

from Pegasus.api import *

logging.basicConfig(level=logging.INFO)


class LLMRAGBooks:
    
    BASE_DIR = Path(".").resolve()
    
    
    def __init__(self):
        
        self.props = Properties()

        self.wf = Workflow("llm-rag-books")
        self.tc = TransformationCatalog()
        self.sc = SiteCatalog()
        self.rc = ReplicaCatalog()

        self.wf.add_transformation_catalog(self.tc)
        self.wf.add_site_catalog(self.sc)
        self.wf.add_replica_catalog(self.rc)
        
        self.wf_dir = str(Path(".").resolve())
        self.shared_scratch_dir = os.path.join(self.wf_dir, "scratch")
        self.local_storage_dir = os.path.join(self.wf_dir, "output")
    
    
    # --- Write files in directory -------------------------------------------------
    def write(self):
        self.props.write()
        self.sc.write()
        self.rc.write()
        self.tc.write()
        
        try:
            self.wf.write()
        except PegasusClientError as e:
            print(e)


    # --- Plan and Submit the workflow ----------------------------------------------
    def plan_submit(self):
        try:
            self.wf.plan(submit=True)
        except PegasusClientError as e:
            print(e)
            
            
    # --- Get status of the workflow -----------------------------------------------
    def status(self):
        try:
            self.wf.status(long=True)
        except PegasusClientError as e:
            print(e)

    # --- Start the workflow  -----------------------------------------------
    def run(self):
        try:
            self.wf.run()
        except PegasusClientError as e:
            print(e)
            
    # --- Wait for the workflow to finish -----------------------------------------------
    def wait(self):
        try:
            self.wf.wait()
        except PegasusClientError as e:
            print(e)
     
    
    # --- Analyze a failed workflow ------------------------------------------------
    def analyze(self):
        try:
            self.wf.analyze()
        except PegasusClientError as e:
            print(e)
    
    
    # --- Get statistics of the workflow -----------------------------------------------
    def statistics(self):
        try:
            self.wf.statistics()
        except PegasusClientError as e:
            print(e)
            
            
    # --- Configuration (Pegasus Properties) ---------------------------------------
    def create_pegasus_properties(self):
        
        # Help Pegasus developers by sharing performance data (optional)
        self.props["pegasus.monitord.encoding"] = "json"
        self.props["pegasus.catalog.workflow.amqp.url"] = "amqp://friend:donatedata@msgs.pegasus.isi.edu:5672/prod/workflows"

        # nicer looking submit dirs
        self.props["pegasus.dir.useTimestamp"] = "true"
        
        # fail fast - save time when doing the tutorial
        self.props["pegasus.mode"] = "tutorial"

        
    # --- Site Catalog -------------------------------------------------------------
    def create_sites_catalog(self, exec_site_name="condorpool"):
        self.sc = SiteCatalog()

        local = (Site("local")
                    .add_directories(
                        Directory(Directory.SHARED_SCRATCH, self.shared_scratch_dir)
                            .add_file_servers(FileServer("file://" + self.shared_scratch_dir, Operation.ALL)),
                        Directory(Directory.LOCAL_STORAGE, self.local_storage_dir)
                            .add_file_servers(FileServer("file://" + self.local_storage_dir, Operation.ALL))
                    )
                )

        condorpool = (Site(exec_site_name)
                        .add_condor_profile(universe="container")
                        .add_pegasus_profile(
                            style="condor"
                        )
                    )
        condorpool.add_profiles(Namespace.ENV, LANG='C')
        condorpool.add_profiles(Namespace.ENV, PYTHONUNBUFFERED='1')
        
        # exclude the ACCESS Pegasus TestPool 
        #condorpool.add_condor_profile(requirements="TestPool =!= True")

        # If you want to run on OSG, please specify your OSG ProjectName. For testing, feel
        # free to use the USC_Deelman project (the PI of the Pegasus project). For
        # production work, please use your own project.
        #condorpool.add_profiles(Namespace.CONDOR, key="+ProjectName", value="\"USC_Deelman\"")
        
        self.sc.add_sites(local, condorpool)
        

    # --- Transformation Catalog (Executables and Containers) ----------------------
    def create_transformation_catalog(self, exec_site_name="condorpool"):
        self.tc = TransformationCatalog()
        
        llm_rag_container = Container("llm_rag_container",
            container_type = Container.SINGULARITY,
            image = "http://download.pegasus.isi.edu/containers/llm-rag/llm-rag-v2.sif",
            image_site = "web"
        )
        
        # main job wrapper
        # note how gpus and other resources are requested
        wrapper = Transformation("llm-wrapper", 
                                 site="local", 
                                 pfn=self.wf_dir+"/bin/llm-wrapper.sh", 
                                 is_stageable=True, 
                                 container=llm_rag_container)\
                  .add_pegasus_profiles(cores=1, gpus=1, memory="10 GB", diskspace="15 GB")\
                  .add_profiles(Namespace.CONDOR, key="require_gpus", value="Capability >= 8.0")

        
        self.tc.add_containers(llm_rag_container)
        self.tc.add_transformations(wrapper)

    
    # --- Replica Catalog ----------------------------------------------------------
    def create_replica_catalog(self):
        self.rc = ReplicaCatalog()

        # Add inference dependencies
        self.rc.add_replica("local", "llm-rag.py",
                            os.path.join(self.wf_dir, "bin/llm-rag.py"))
        self.rc.add_replica("local", "Alices_Adventures_in_Wonderland_by_Lewis_Carroll.txt",
                            os.path.join(self.wf_dir, "inputs/Alices_Adventures_in_Wonderland_by_Lewis_Carroll.txt"))
     

    # --- Create Workflow ----------------------------------------------------------
    def create_workflow(self):
        self.wf = Workflow(name="llm-rag-books", infer_dependencies=True)
        
        # existing files - already listed in the replica catalog
        llm_rag_py = File("llm-rag.py")
        book = File("Alices_Adventures_in_Wonderland_by_Lewis_Carroll.txt")
        
        # these will be generated by the workflow
        answers_txt = File(f"{book}-answers.txt")
        ollama_log = File(f"{book}-ollama.log")
        
        job = (Job("llm-wrapper")
                  .add_args(book)
                  .add_inputs(llm_rag_py)
                  .add_inputs(book)
                  .add_outputs(answers_txt, stage_out=True)
                  .add_outputs(ollama_log, stage_out=True)
              )
        
        self.wf.add_jobs(job)

            
workflow = LLMRAGBooks()

print("Creating execution sites...")
workflow.create_sites_catalog("condorpool")

print("Creating workflow properties...")
workflow.create_pegasus_properties()

print("Creating transformation catalog...")
workflow.create_transformation_catalog("condorpool")

print("Creating replica catalog...")
workflow.create_replica_catalog()

print("Creating workflow dag...")
workflow.create_workflow()

workflow.write()
print("Workflow has been generated!")

## Run the Workflow

In [None]:
workflow.plan_submit()
workflow.wait()

Watch the status bar and the state of the individual jobs. Because the input file was renamed, the workflow is expected to fail.

## Analyze

When the workflow fails, we can use the Pegasus analyze tool to pinpoint the failure.

In [None]:
workflow.analyze()

In the output we can see `ERROR:  Expected local file does not exist: /home/rynge/git/ACCESS-Pegasus-Examples/04-Tutorial-Debugging-Statistics/inputs/Alices_Adventures_in_Wonderland_by_Lewis_Carroll.txt'` (the leading directory will differ on your system). This tells us that Pegasus expected the input file at the path registered in the replica catalog, but no file exists there.
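This kind of failure can also be caught before submitting. The sketch below checks that every path you intend to register as a replica actually exists on disk. It is only an illustration: `find_missing_inputs` and the `replicas` dictionary are hypothetical names that mirror the replica catalog entries of this notebook, not part of the Pegasus API.

```python
from pathlib import Path


def find_missing_inputs(replicas):
    """Return the (lfn, path) pairs whose physical file does not exist."""
    return [(lfn, path) for lfn, path in replicas.items()
            if not Path(path).is_file()]


# Hypothetical mapping of logical file names to local paths, mirroring
# the entries added in create_replica_catalog() above.
wf_dir = Path(".").resolve()
book_name = "Alices_Adventures_in_Wonderland_by_Lewis_Carroll.txt"
replicas = {
    "llm-rag.py": wf_dir / "bin" / "llm-rag.py",
    book_name: wf_dir / "inputs" / book_name,
}

# Report every registered input that is not present on disk.
for lfn, path in find_missing_inputs(replicas):
    print(f"Missing input for {lfn}: {path}")
```

If the loop prints nothing, all registered inputs are present. While the input file still carries the `.BADNAME` suffix, the book entry would be reported as missing.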

## Resolving the issue

The cause of the problem is a mismatch between the actual input file (`inputs/Alices_Adventures_in_Wonderland_by_Lewis_Carroll.txt.BADNAME`) and the path we registered in the replica catalog (`inputs/Alices_Adventures_in_Wonderland_by_Lewis_Carroll.txt`). We renamed the file in the input directory at the start of this notebook to cause this failure for demonstration purposes.

Let's resolve the issue by renaming the wrongly named input file:

In [None]:
!mv inputs/Alices_Adventures_in_Wonderland_by_Lewis_Carroll.txt.BADNAME inputs/Alices_Adventures_in_Wonderland_by_Lewis_Carroll.txt

## Restart the workflow

We can now restart the workflow from where it stopped by calling `run()`. Alternatively, you could call `plan_submit()` to create a new instance, but in that case the workflow would start over from the beginning.

In [None]:
workflow.run()
time.sleep(30)  # give the workflow some time to get started again
workflow.wait()

## Statistics

Pegasus collects provenance information during the workflow execution. By default, Pegasus launches all jobs through a process called `kickstart`, which captures runtime provenance data for each job. This data includes details about the execution environment, input and output files, execution parameters, and performance metrics. 

The collected provenance information is stored in a relational database, allowing users to analyze and summarize workflow executions. Pegasus provides tools such as `pegasus-statistics` to facilitate this analysis. To get a high-level summary of the data, run `workflow.statistics()`:

In [None]:
workflow.statistics()