# Specifying Executables 

**Objective:** Learn how to specify the executables your jobs invoke, and the containers they run in, by defining a Transformation Catalog.

In the previous notebook, we executed a simple **Hello World** workflow illustrated below, where the input data is retrieved from a remote location. 


![Hello World Workflow](../images/pipeline.svg)

The Python script that we execute as part of the workflow is simple. However, real-world scientific codes have complex software dependencies. Increasingly, **containers** are an attractive way to package executables together with their software dependencies.

In this notebook, we will build on the previous notebook and use the Pegasus Workflow API to define a Transformation Catalog and specify the executable a job requires and the container in which it should execute. 

<div class="alert alert-block alert-info">
<b>Note:</b> Pegasus treats <b>containers</b> as a data dependency for the job. Pegasus automatically deploys the container to the node where the job runs.
</div>


In this example, Pegasus will

* pick up the inputs from the Pegasus website.
* run the executable invoked by each job inside an application container.
* place the generated outputs in a directory named **output**.


## 1. Import Python API

Pegasus 5.0 introduces a new Python API, which is fully documented in the [Pegasus reference guide](https://pegasus.isi.edu/documentation/reference-guide/api-reference.html). 

We will mainly use the following classes in this example:

<br>

```
from Pegasus.api.workflow import Job, Workflow
from Pegasus.client._client import PegasusClientError
from Pegasus.api.replica_catalog import File, ReplicaCatalog
from Pegasus.api.transformation_catalog import (
    Container,
    Transformation,
    TransformationCatalog,
    TransformationSite,
)
```

The `TransformationCatalog` object is used to specify containers and executables.


In [1]:
from Pegasus.api import *
import sys
from pathlib import Path

import logging

logging.basicConfig(level=logging.DEBUG)

# Directories used by the workflow:
# - EXECUTABLES_DIR: where the executables that the workflow uses are placed.
# - OUTPUT_DIR: where the outputs should be placed.

BASE_DIR = Path(".").resolve()
EXECUTABLES_DIR = Path(BASE_DIR / ".." /  "executables").resolve()
OUTPUT_DIR = Path(BASE_DIR /  "output").resolve() 

# --- Replicas -----------------------------------------------------------------
fin = File("f.in").add_metadata(creator="vahi")
rc = ReplicaCatalog()\
    .add_replica("remote", fin, "http://download.pegasus.isi.edu/tutorial/inputs/f.in")\
    .write() # written to ./replicas.yml 

## 2. Create a Transformation Catalog (Specify Executables)

The Transformation Catalog serves as a repository of metadata that describes the transformations or executables used within a workflow. Each transformation represents a computational task, such as a container, script, or binary, that will be executed as part of the workflow. The catalog provides essential information about these transformations, including their unique names, versions, and the locations where they are installed or can be accessed. Additionally, the Transformation Catalog can include details about how transformations should be staged, for example, whether they should be transferred to the execution site or executed in place.

In this notebook, we specify a base container containing the LLM model and code (used in later notebooks). The container is hosted on Open Storage Network (OSN), and is a good example of how Pegasus can transfer data as part of the workflow. The container is pulled down by a data transfer job in the workflow.

The code we want to run is defined as a `Transformation()` that references the container. We also set profiles to specify our resource requirements (1 CPU core, 1 GB RAM, and 1 GB disk).

In [2]:
tc = TransformationCatalog()
        
wf_container = Container("wf_container",
    container_type = Container.SINGULARITY,
    image = "http://download.pegasus.isi.edu/containers/hello-world/hello-world.sif",
    image_site = "web"
)

# For each type of job in the workflow specify a transformation
# When you instantiate a Job() object, you specify a transformation name
# which is a logical identifier for the executable you want to run
# when the job is launched on a remote node. 
#
# In this workflow, we have two transformations, "hello" and "world",
# each mapping to the same executable that is installed in
# the container. The is_stageable parameter is set to False to indicate
# that the executable is already installed in the container.
# Note how CPU cores and other resources are requested via add_pegasus_profiles().
hello = Transformation("hello", 
                         site="web", 
                         pfn="/opt/pegasus-tutorial/pegasus-keg.py", 
                         is_stageable=False, 
                         container=wf_container)\
          .add_pegasus_profiles(cores=1, memory="1 GB", diskspace="1 GB")

world = Transformation("world", 
                         site="web", 
                         pfn="/opt/pegasus-tutorial/pegasus-keg.py", 
                         is_stageable=False, 
                         container=wf_container)\
          .add_pegasus_profiles(cores=1, memory="1 GB", diskspace="1 GB")

tc.add_containers(wf_container)
tc.add_transformations(hello, world)
tc.write()

In [3]:
!cat transformations.yml

x-pegasus:
  apiLang: python
  createdBy: vahi
  createdOn: 10-28-25T23:18:32Z
pegasus: 5.0.4
transformations:
- name: hello
  sites:
  - name: web
    pfn: /opt/pegasus-tutorial/pegasus-keg.py
    type: installed
    container: wf_container
  profiles:
    pegasus:
      cores: 1
      memory: 1024
      diskspace: 15360
- name: world
  sites:
  - name: web
    pfn: /opt/pegasus-tutorial/pegasus-keg.py
    type: installed
    container: wf_container
  profiles:
    pegasus:
      cores: 1
      memory: 1024
      diskspace: 15360
containers:
- name: wf_container
  type: singularity
  image: http://download.pegasus.isi.edu/containers/hello-world/hello-world.sif
  image.site: web
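
To make the structure of the catalog above more concrete, here is a minimal sketch that mirrors `transformations.yml` as plain Python dictionaries and looks up what a logical transformation name points to. It does not use the Pegasus API; the helper names `site_entry`, `resolve`, and `undeclared_containers` are hypothetical and exist only for illustration, and the values are copied from the output above.

```python
# Plain-Python mirror of transformations.yml (not the Pegasus API).
# Helper names below are hypothetical, for illustration only.

IMAGE = "http://download.pegasus.isi.edu/containers/hello-world/hello-world.sif"
PFN = "/opt/pegasus-tutorial/pegasus-keg.py"


def site_entry(name):
    """Build one transformation entry shaped like the YAML above."""
    return {
        "name": name,
        "sites": [
            {"name": "web", "pfn": PFN, "type": "installed", "container": "wf_container"}
        ],
    }


catalog = {
    "transformations": [site_entry("hello"), site_entry("world")],
    "containers": [
        {"name": "wf_container", "type": "singularity", "image": IMAGE, "image.site": "web"}
    ],
}


def resolve(catalog, name):
    """Return (pfn, container image or None) for a logical transformation name."""
    images = {c["name"]: c["image"] for c in catalog.get("containers", [])}
    for tr in catalog["transformations"]:
        if tr["name"] == name:
            site = tr["sites"][0]  # this tutorial catalog lists one site per entry
            return site["pfn"], images.get(site.get("container"))
    raise KeyError(name)


def undeclared_containers(catalog):
    """Container names referenced by a transformation but missing from 'containers'."""
    declared = {c["name"] for c in catalog.get("containers", [])}
    referenced = {
        s["container"]
        for tr in catalog["transformations"]
        for s in tr["sites"]
        if "container" in s
    }
    return referenced - declared


pfn, image = resolve(catalog, "hello")
print(pfn)
print(image)
print(undeclared_containers(catalog))
```

A check like `undeclared_containers` is a quick way to notice a transformation that names a container you forgot to pass to `tc.add_containers(...)`.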


## 3. Define and Execute the Workflow

In [4]:
# The execution site where you want your jobs to run.
# "local" means the jobs run on ACCESS Pegasus itself.
# "condorpool" means the jobs run on a node provisioned from an ACCESS site such as Jetstream.
EXEC_SITE="condorpool"

# --- Workflow -----------------------------------------------------------------
wf = Workflow("hello-world")


finter = File("f.inter")
fout = File("f.out")

job_hello = Job("hello")\
                    .add_args("-T", "3", "-i", fin, "-o", finter)\
                    .add_inputs(fin)\
                    .add_outputs(finter)

job_world = Job("world")\
                    .add_args("-T", "3", "-i", finter, "-o", fout)\
                    .add_inputs(finter)\
                    .add_outputs(fout)

wf.add_jobs(job_hello, job_world)    

# --- Run the Workflow ---------------------------------------------------
# We omit the transformations_dir argument because the Transformation
# Catalog already specifies the locations of the executables.
try:
    wf.write()
    wf.plan(sites=[EXEC_SITE], output_dir=OUTPUT_DIR, submit=True)\
      .wait()      
except PegasusClientError as e:
    print(e)

INFO:Pegasus.api.workflow:hello-world added Job(_id=ID0000001, transformation=hello)
INFO:Pegasus.api.workflow:hello-world added Job(_id=ID0000002, transformation=world)
INFO:Pegasus.api.workflow:inferring hello-world dependencies
INFO:Pegasus.api.workflow:workflow hello-world with 2 jobs generated and written to workflow.yml

################
# pegasus-plan #
################
2025.10.28 23:18:42.546 UTC:
2025.10.28 23:18:42.552 UTC:   -----------------------------------------------------------------------
2025.10.28 23:18:42.557 UTC:   File for submitting this DAG to HTCondor           : hello-world-0.dag.condor.sub
2025.10.28 23:18:42.562 UTC:   Log of DAGMan debugging messages                   : hello-world-0.dag.dagman.out
2025.10.28 23:18:42.567 UTC:   Log of HTCondor library output                     : hello-world-0.dag.lib.out
2025.10.28 23:18:42.573 UTC:   Log of HTCondor library error messages             : hello-world-0.dag.lib.err
2025.10.28 23:18:42.578 UTC:   Log of the 

[#########################] 100.0% ..Success (Unready: 0, Completed: 12, Queued: 0, Running: 0, Failed: 0)
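
The log line "inferring hello-world dependencies" refers to the fact that we never declared an edge between the two jobs: the ordering follows from `f.inter` being an output of `hello` and an input of `world`. The sketch below illustrates that idea in plain Python. It is not the Pegasus implementation, and the function name `infer_edges` and the dictionary layout are assumptions made for this example.

```python
# Illustrative sketch of file-based dependency inference (not Pegasus code).

def infer_edges(jobs):
    """Derive (parent, child) edges from declared inputs and outputs.

    jobs maps a job name to {"inputs": set_of_files, "outputs": set_of_files}.
    """
    # Record which job produces each file.
    producers = {}
    for name, io in jobs.items():
        for filename in io["outputs"]:
            producers[filename] = name

    # A job depends on whichever job produces one of its inputs.
    edges = set()
    for name, io in jobs.items():
        for filename in io["inputs"]:
            parent = producers.get(filename)
            if parent is not None and parent != name:
                edges.add((parent, name))
    return sorted(edges)


jobs = {
    "hello": {"inputs": {"f.in"}, "outputs": {"f.inter"}},
    "world": {"inputs": {"f.inter"}, "outputs": {"f.out"}},
}
print(infer_edges(jobs))  # [('hello', 'world')]
```

Here `f.in` has no producer inside the workflow, so it creates no edge; it is an external input that the Replica Catalog locates.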


## 4. Inspecting the generated output of the workflow

Now let's review the output of the workflow to check where the jobs ran. The displayed hostname will have a prefix of `testpool-cpu-*`. The test pool is made up of nodes provisioned from Indiana University's **Jetstream2**, which is an ACCESS cloud resource.

In [None]:
! cat output/f.out
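
If you prefer to check the hostname from Python rather than reading the whole file, a small helper along these lines may be enough. This is a sketch that makes no assumption about the layout of `f.out`; it only returns the lines that mention the `testpool-cpu-` prefix described above. The function name `lines_mentioning` is hypothetical.

```python
from pathlib import Path


def lines_mentioning(path, needle="testpool-cpu-"):
    """Return the lines of a text file that contain `needle`.

    Returns an empty list if the file does not exist, for example when the
    workflow has not finished yet.
    """
    p = Path(path)
    if not p.is_file():
        return []
    text = p.read_text(errors="replace")
    return [line for line in text.splitlines() if needle in line]


for line in lines_mentioning("output/f.out"):
    print(line)
```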