# Running a Complete Workflow

**Objective:** Familiarize users with the Pegasus workflow structure using an end-to-end LLM-RAG book summarization example.

Welcome to the first Pegasus tutorial notebook, which is intended for new users who want to get a quick overview of running a Pegasus workflow. 

In this tutorial, a full workflow is provided. In later tutorials, we will learn how to use the API, the provided debugging/statistics tools, and how to provision resources for the workflow to execute on. The outline of those tutorials is:

 - 01 - Running a Complete Workflow (this one)
 - 02 - API
 - 03 - Catalogs
 - 04 - Debugging / Statistics
 - 05 - Provisioning

To get started, work through the cells below in order.

## Defining the Workflow

Pegasus workflows are created using an API, making it easy to build, manage, and run workflows in a flexible and scalable way. While it might feel unnecessary for small workflows, this approach works well for creating workflows dynamically based on data, parameters, or triggers, which is essential for automating tasks and handling large projects.

The following example is organized as a Python class. While this isn’t strictly necessary, it helps keep the different parts of the workflow well-structured.

Pegasus workflows are portable, meaning you can execute the same workflow on different infrastructures at different times. To enable this portability, Pegasus uses an abstract workflow model and relies on "catalogs" to describe the execution environment, software, and input data. The abstract workflow description that you give to Pegasus is portable and usually does not contain:

* locations of physical input files
* locations of the executables referred to by the jobs
* cluster endpoints where jobs are executed

Pegasus uses three information catalogs during the planning process: the site catalog, the transformation catalog, and the replica catalog.

<img src="../03-Tutorial-Catalogs/images/catalogs.png"/>

These catalogs are explained in detail in a later tutorial (03 - Catalogs). The [Pegasus documentation](https://pegasus.isi.edu/documentation/user-guide/creating-workflows.html#catalogs) also provides more details about catalogs.
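
To make the separation between the abstract workflow and the catalogs more concrete, here is a minimal plain-Python sketch. It does not use the Pegasus API: the dictionaries, the `resolve` helper, and all paths are hypothetical stand-ins that only illustrate how logical names might be mapped to physical locations at planning time.

```python
# Illustrative only: plain dictionaries standing in for Pegasus catalogs.
# All names and paths below are hypothetical placeholders.

# Abstract job: logical names only, no paths or hosts
abstract_job = {
    "transformation": "llm-wrapper",
    "inputs": ["llm-rag.py", "book.txt"],
}

# Replica-catalog-like table: logical file name -> (site, physical location)
replicas = {
    "llm-rag.py": ("local", "/home/user/workflow/bin/llm-rag.py"),
    "book.txt": ("local", "/home/user/workflow/inputs/book.txt"),
}

# Transformation-catalog-like table: logical executable -> (site, physical path)
transformations = {
    "llm-wrapper": ("local", "/home/user/workflow/bin/llm-wrapper.sh"),
}


def resolve(job, replicas, transformations):
    """Return the job with its logical names mapped to physical locations."""
    missing = [lfn for lfn in job["inputs"] if lfn not in replicas]
    if missing:
        raise KeyError(f"no replica registered for: {missing}")
    return {
        "executable": transformations[job["transformation"]][1],
        "inputs": {lfn: replicas[lfn][1] for lfn in job["inputs"]},
    }


resolved = resolve(abstract_job, replicas, transformations)
print(resolved["executable"])
```

Because the abstract job never mentions a path, moving to a different infrastructure in this sketch only means swapping the lookup tables, which is the idea behind keeping the catalogs separate.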


For now, focus on the workflow definition within the `create_workflow(self)` method.

This workflow is simple: it includes just one job. The job takes a Gutenberg book in plain text format, processes it using an LLM (Large Language Model) with Retrieval-Augmented Generation (RAG), and produces two output files: an answer file and a log file.

The script, `llm-rag.py`, contains the code for this job. Inside the script, you’ll find prompts for the LLM, such as:

 - *Please tell me what kind of LLM you are, and describe what data you were trained on.*
 - *Please provide a one paragraph summary of the book.*
 - *Who is the protagonist in the book?*
 - *Who is the antagonist in the book?*
 - *What time period is the book set in?*
 
The goal is to create a workflow that runs this single job in a container with the LLM model, uses a GPU for processing, and returns the answers.

<div class="alert alert-block alert-info">
<b>Note:</b> The environment where this notebook runs is separate from the environment where the compute job executes. This notebook runs on pegasus.access-ci.org, while the job is executed on any available HTCondor execution point. For this tutorial, a small number of execution points are provided automatically. For larger workflows, the <code>Provisioning</code> tutorial shows how to allocate additional resources using your allocations.
The following figure shows how workflows are defined using the Pegasus API in Jupyter, planned into an executable
HTCondor DAGMan workflow, and how jobs flow to the remote execution sites (TestPool in this case).
</div>

<img src="../images/access-pegasus-jobflow.png"/>

The `Workflow` object stores the jobs and the dependencies between them. A typical job is created as follows:

```
        # input files
        llm_rag_py = File("llm-rag.py")
        book = File("Alices_Adventures_in_Wonderland_by_Lewis_Carroll.txt")
        
        # output files - these will be generated by the job
        answers_txt = File(f"{book}-answers.txt")
        ollama_log = File(f"{book}-ollama.log")
        
        # define the job
        job = Job("llm-wrapper")
        
        # specify command line arguments (if any)
        job.add_args(book)
        
        # associate the input files
        job.add_inputs(llm_rag_py, book)
        
        # associate the output files
        job.add_outputs(answers_txt, ollama_log)

        # add the job to the workflow
        self.wf.add_jobs(job)
```

By default, dependencies between jobs are inferred based on input and output files. 
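
The workflow in this tutorial has only one job, so there is nothing to infer yet. As a rough illustration of the idea, the plain-Python sketch below (not the Pegasus implementation) derives parent/child edges by matching each job's inputs against the outputs of other jobs. The job names, file names, and the `infer_dependencies` helper are hypothetical.

```python
# Plain-Python illustration of file-based dependency inference.
# Job names and file names are hypothetical.

def infer_dependencies(jobs):
    """Return sorted (parent, child) pairs: parent writes a file the child reads."""
    # Map every output file to the job that produces it
    producer = {}
    for name, spec in jobs.items():
        for lfn in spec["outputs"]:
            producer[lfn] = name

    # A job depends on whichever job produces one of its inputs
    edges = set()
    for name, spec in jobs.items():
        for lfn in spec["inputs"]:
            parent = producer.get(lfn)
            if parent is not None and parent != name:
                edges.add((parent, name))
    return sorted(edges)


jobs = {
    "summarize": {"inputs": ["book.txt"], "outputs": ["answers.txt"]},
    "translate": {"inputs": ["answers.txt"], "outputs": ["answers-fr.txt"]},
}
print(infer_dependencies(jobs))  # [('summarize', 'translate')]
```

Files such as `book.txt` that no job produces are treated as workflow inputs in this sketch, so they create no edge.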

Let's run a full workflow. This example includes several helper methods that wrap common operations (write, plan, status, wait, statistics) with basic error handling.

In [None]:
import logging
import os

from pathlib import Path

from Pegasus.api import *

logging.basicConfig(level=logging.INFO)


class LLMRAGBooks:
    
    BASE_DIR = Path(".").resolve()
    
    
    def __init__(self):
        
        self.props = Properties()

        self.wf = Workflow("llm-rag-books")
        self.tc = TransformationCatalog()
        self.sc = SiteCatalog()
        self.rc = ReplicaCatalog()

        self.wf.add_transformation_catalog(self.tc)
        self.wf.add_site_catalog(self.sc)
        self.wf.add_replica_catalog(self.rc)
        
        self.wf_dir = str(Path(".").resolve())
        self.shared_scratch_dir = os.path.join(self.wf_dir, "scratch")
        self.local_storage_dir = os.path.join(self.wf_dir, "output")
    
    
    # --- Write files in directory -------------------------------------------------
    def write(self):
        self.props.write()
        self.sc.write()
        self.rc.write()
        self.tc.write()
        
        try:
            self.wf.write()
            # also graph the workflow
            self.wf.graph(include_files=True,  label="xform", output="graph.png")
        except PegasusClientError as e:
            print(e)


    # --- Plan and Submit the workflow ----------------------------------------------
    def plan_submit(self):
        try:
            self.wf.plan(submit=True)
        except PegasusClientError as e:
            print(e)
            
            
    # --- Get status of the workflow -----------------------------------------------
    def status(self):
        try:
            self.wf.status(long=True)
        except PegasusClientError as e:
            print(e)

            
    # --- Wait for the workflow to finish -----------------------------------------------
    def wait(self):
        try:
            self.wf.wait()
        except PegasusClientError as e:
            print(e)
            
            
    # --- Get statistics of the workflow -----------------------------------------------
    def statistics(self):
        try:
            self.wf.statistics()
        except PegasusClientError as e:
            print(e)
            
            
    # --- Configuration (Pegasus Properties) ---------------------------------------
    def create_pegasus_properties(self):
        
        # Help Pegasus developers by sharing performance data (optional)
        self.props["pegasus.monitord.encoding"] = "json"
        self.props["pegasus.catalog.workflow.amqp.url"] = "amqp://friend:donatedata@msgs.pegasus.isi.edu:5672/prod/workflows"

        # nicer looking submit dirs
        self.props["pegasus.dir.useTimestamp"] = "true"

        
    # --- Site Catalog -------------------------------------------------------------
    def create_sites_catalog(self, exec_site_name="condorpool"):
        self.sc = SiteCatalog()

        local = (Site("local")
                    .add_directories(
                        Directory(Directory.SHARED_SCRATCH, self.shared_scratch_dir)
                            .add_file_servers(FileServer("file://" + self.shared_scratch_dir, Operation.ALL)),
                        Directory(Directory.LOCAL_STORAGE, self.local_storage_dir)
                            .add_file_servers(FileServer("file://" + self.local_storage_dir, Operation.ALL))
                    )
                )

        condorpool = (Site(exec_site_name)
                        .add_condor_profile(universe="container")
                        .add_pegasus_profile(
                            style="condor"
                        )
                    )
        condorpool.add_profiles(Namespace.ENV, LANG='C')
        condorpool.add_profiles(Namespace.ENV, PYTHONUNBUFFERED='1')
        
        # exclude the ACCESS Pegasus TestPool 
        #condorpool.add_condor_profile(requirements="TestPool =!= True")

        # If you want to run on OSG, please specify your OSG ProjectName. For testing, feel
        # free to use the USC_Deelman project (it belongs to the PI of the Pegasus project).
        # For production work, please use your own project.
        #condorpool.add_profiles(Namespace.CONDOR, key="+ProjectName", value="\"USC_Deelman\"")
        
        self.sc.add_sites(local, condorpool)
        

    # --- Transformation Catalog (Executables and Containers) ----------------------
    def create_transformation_catalog(self, exec_site_name="condorpool"):
        self.tc = TransformationCatalog()
        
        llm_rag_container = Container("llm_rag_container",
            container_type = Container.SINGULARITY,
            image = "https://usgs2.osn.mghpcc.org/pegasus-tutorials/containers/llm-rag-v2.sif",
            image_site = "web"
        )
        
        # main job wrapper
        # note how gpus and other resources are requested
        wrapper = Transformation("llm-wrapper", 
                                 site="local", 
                                 pfn=self.wf_dir+"/bin/llm-wrapper.sh", 
                                 is_stageable=True, 
                                 container=llm_rag_container)\
                  .add_pegasus_profiles(cores=1, gpus=1, memory="10 GB", diskspace="15 GB")\
                  .add_profiles(Namespace.CONDOR, key="require_gpus", value="Capability >= 8.0")

        
        self.tc.add_containers(llm_rag_container)
        self.tc.add_transformations(wrapper)

    
    # --- Replica Catalog ----------------------------------------------------------
    def create_replica_catalog(self):
        self.rc = ReplicaCatalog()

        # Add inference dependencies
        self.rc.add_replica("local", "llm-rag.py", \
                                     os.path.join(self.wf_dir, "bin/llm-rag.py"))
        self.rc.add_replica("local", "Alices_Adventures_in_Wonderland_by_Lewis_Carroll.txt", \
                                     os.path.join(self.wf_dir, "inputs/Alices_Adventures_in_Wonderland_by_Lewis_Carroll.txt"))
     

    # --- Create Workflow ----------------------------------------------------------
    def create_workflow(self):
        self.wf = Workflow(name="llm-rag-books", infer_dependencies=True)
        
        # existing files - already listed in the replica catalog
        llm_rag_py = File("llm-rag.py")
        book = File("Alices_Adventures_in_Wonderland_by_Lewis_Carroll.txt")
        
        # these will be generated by the workflow
        answers_txt = File(f"{book}-answers.txt")
        ollama_log = File(f"{book}-ollama.log")
        
        job = (Job("llm-wrapper")
                  .add_args(book)
                  .add_inputs(llm_rag_py, book)
                  .add_outputs(answers_txt, stage_out=True)
                  .add_outputs(ollama_log, stage_out=True)
              )
        
        self.wf.add_jobs(job)

            
workflow = LLMRAGBooks()

print("Creating execution sites...")
workflow.create_sites_catalog("condorpool")

print("Creating workflow properties...")
workflow.create_pegasus_properties()

print("Creating transformation catalog...")
workflow.create_transformation_catalog("condorpool")

print("Creating replica catalog...")
workflow.create_replica_catalog()

print("Creating workflow dag...")
workflow.create_workflow()

workflow.write()
print("Workflow has been generated!")
        
        


## Planning the Workflow

The next step is to plan the workflow, which means Pegasus takes an abstract workflow - a high-level representation of tasks, dependencies, and required resources - and transforms it into an executable workflow tailored for the target execution environment.
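
As a very simplified mental model of what "planning" means, the plain-Python sketch below expands one abstract job into an ordered list of concrete steps, adding data movement around the compute step. This is not how the Pegasus planner is implemented, and the real planner does considerably more; the `plan` helper, the step labels, and the file names are hypothetical.

```python
# Simplified illustration of planning; this is not the Pegasus planner.
# File names and step labels are hypothetical.

def plan(job):
    """Expand one abstract job into an ordered list of executable steps."""
    steps = [f"stage-in {lfn}" for lfn in job["inputs"]]
    steps.append(f"run {job['transformation']}")
    # Only outputs flagged for stage-out are copied back
    steps += [f"stage-out {lfn}" for lfn, keep in job["outputs"].items() if keep]
    return steps


abstract_job = {
    "transformation": "llm-wrapper",
    "inputs": ["llm-rag.py", "book.txt"],
    # output name -> whether it should be staged out to the output directory
    "outputs": {"book.txt-answers.txt": True, "book.txt-ollama.log": True},
}

for step in plan(abstract_job):
    print(step)
```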

To save a step, we will plan and submit the workflow in one go.

In [None]:
workflow.plan_submit()

The abstract workflow was also written out as an image (`graph.png`) for easy inspection. In this case, the workflow is very simple (only one job), but we can check that the expected inputs and outputs look correct.

In [None]:
from IPython.display import Image
Image(filename='graph.png')

The output of the planning phase includes example commands for monitoring and interacting with the workflow from the command line. While these commands are always available, you can also use the Python `workflow` object directly within the notebook. This object reports the workflow's status, including the number of jobs in each state (e.g., idle, running, or completed).

In [None]:
workflow.status()

## Waiting for the Workflow to Finish

Instead of checking the status repeatedly, we can simply block until the workflow finishes:

In [None]:
workflow.wait()

## Examining the Results

Once the workflow has finished, we can look at the answers file for our results:

In [None]:
!cat output/Alices_Adventures_in_Wonderland_by_Lewis_Carroll.txt-answers.txt
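
If you prefer to stay in Python rather than use a shell command, a small helper along these lines could read the same file. This is only a sketch: `read_answers` is a hypothetical helper, and it assumes the outputs were staged to the `output` directory next to this notebook, as configured in the site catalog above.

```python
from pathlib import Path


def read_answers(output_dir, book_name):
    """Return the answers text for a book, or None if the file is absent."""
    path = Path(output_dir) / f"{book_name}-answers.txt"
    if not path.is_file():
        return None
    return path.read_text(encoding="utf-8")


text = read_answers("output", "Alices_Adventures_in_Wonderland_by_Lewis_Carroll.txt")
print(text if text is not None else "No answers file yet; has the workflow finished?")
```

Returning `None` instead of raising makes it easy to re-run the cell while the workflow is still in progress.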

## What's Next?

To continue exploring Pegasus, the next tutorial notebook provides an overview of how to interact with the Pegasus API.