# Catalogs

**Objective:** Learn how Pegasus uses catalogs to map an abstract workflow to an executable workflow.

The abstract workflow description that you give to Pegasus is portable: it usually does not contain the locations of physical input files, executables, or the cluster endpoints where jobs are executed. During planning, Pegasus resolves these details using three information catalogs: the Site Catalog, the Transformation Catalog, and the Replica Catalog.

<img src="images/catalogs.png"/>

The Pegasus documentation provides more details in its [catalogs section](https://pegasus.isi.edu/documentation/user-guide/creating-workflows.html#catalogs).
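
To make the role of the three catalogs concrete, here is a toy sketch in plain Python. It is not the Pegasus API: the dictionaries, the `plan_job` helper, and all paths are hypothetical placeholders, and the real planner does far more (for example, adding data transfer jobs).

```python
# Toy model of planning: NOT the Pegasus API. All names and paths are hypothetical.

# Abstract job: only logical names, no paths or hosts.
abstract_job = {"transformation": "wrapper", "inputs": ["llm-rag.py"]}

# Site Catalog: the sites that jobs can be mapped to.
site_catalog = {"local": {}, "condorpool": {}}

# Transformation Catalog: logical executable name -> physical location.
transformation_catalog = {
    "wrapper": {"site": "local", "pfn": "/home/user/wf/bin/wrapper.sh"},
}

# Replica Catalog: logical file name -> (site, physical file name).
replica_catalog = {"llm-rag.py": ("local", "/home/user/wf/bin/llm-rag.py")}


def plan_job(job, exec_site):
    """Resolve every logical name in an abstract job to a physical one."""
    if exec_site not in site_catalog:
        raise KeyError(f"unknown site: {exec_site}")
    executable = transformation_catalog[job["transformation"]]["pfn"]
    inputs = {lfn: replica_catalog[lfn][1] for lfn in job["inputs"]}
    return {"site": exec_site, "executable": executable, "inputs": inputs}


executable_job = plan_job(abstract_job, "condorpool")
print(executable_job)
```

The abstract job never mentions a path or a host. Everything physical comes from one of the three lookups, which is what keeps the abstract workflow portable.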


## Site Catalog

The Site Catalog defines the computational environments where the workflow's tasks will execute. Each "site" in the catalog represents a distinct resource, such as a local machine, a high-performance computing cluster, or a cloud platform. The catalog describes the resources at each site, including paths for shared and local storage, so that Pegasus knows where input, output, and intermediate data are stored and how to stage them in and out of the site.

The Site Catalog from the previous workflow is defined as follows:

In [7]:
    # --- Site Catalog -------------------------------------------------------------
    def create_sites_catalog(self, exec_site_name="condorpool"):
        self.sc = SiteCatalog()

        local = (Site("local")
                    .add_directories(
                        Directory(Directory.SHARED_SCRATCH, self.shared_scratch_dir)
                            .add_file_servers(FileServer("file://" + self.shared_scratch_dir, Operation.ALL)),
                        Directory(Directory.LOCAL_STORAGE, self.local_storage_dir)
                            .add_file_servers(FileServer("file://" + self.local_storage_dir, Operation.ALL))
                    )
                )

        condorpool = (Site(exec_site_name)
                        .add_condor_profile(universe="container")
                        .add_pegasus_profile(
                            style="condor"
                        )
                    )
        condorpool.add_profiles(Namespace.ENV, LANG='C')
        condorpool.add_profiles(Namespace.ENV, PYTHONUNBUFFERED='1')
        
        # exclude the ACCESS Pegasus TestPool 
        #condorpool.add_condor_profile(requirements="TestPool =!= True")

        # If you want to run on OSG, please specify your OSG ProjectName. For testing, feel
        # free to use the USC_Deelman project (the PI of the Pegasus project). For
        # production work, please use your own project.
        condorpool.add_profiles(Namespace.CONDOR, key="+ProjectName", value="\"USC_Deelman\"")
        
        self.sc.add_sites(local, condorpool)

The _local_ site refers to the submit host and is always required. Here we specify two directories: a scratch directory that the workflow uses during execution, and a long-term storage directory used for outputs.
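
The `FileServer` URLs in the example are built by prefixing a directory path with `file://`, which only yields a well-formed URL when the path is absolute. A minimal sketch of that check in plain Python follows; `file_url` is a hypothetical helper, not part of the Pegasus API, and the paths are placeholders.

```python
import posixpath


def file_url(directory):
    """Build a file:// URL from an absolute POSIX directory path."""
    if not posixpath.isabs(directory):
        raise ValueError(f"expected an absolute path, got {directory!r}")
    # normpath drops trailing slashes and resolves "." and ".." segments
    return "file://" + posixpath.normpath(directory)


print(file_url("/home/user/wf/scratch"))
print(file_url("/home/user/wf/output/"))
```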

The _condorpool_ site refers to the HTCondor pool we want to run our jobs in. There is no need to specify directories in this case, as the work directory is automatically assigned by HTCondor.

Note how we also set profiles on the _condorpool_ site. Both environment and HTCondor profiles are specified, and they will be applied to all jobs on that site. For details, see the [Configuration](https://pegasus.isi.edu/documentation/reference-guide/configuration.html) chapter in the Pegasus manual.

## Transformation Catalog

The Transformation Catalog serves as a repository of metadata that describes the transformations or executables used within a workflow. Each transformation represents a computational task, such as a container, script, or binary, that will be executed as part of the workflow. The catalog provides essential information about these transformations, including their unique names, versions, and the locations where they are installed or can be accessed. Additionally, the Transformation Catalog can include details about how transformations should be staged, for example, whether they should be transferred to the execution site or executed in place.
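
The staging decision can be pictured with a small toy model in plain Python. This is a simplified sketch, not the planner's actual logic, and `TransformationEntry` and `needs_staging` are hypothetical names. The assumption is that an installed executable is usable only on the site where it lives, while a stageable one can be copied to wherever the job runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformationEntry:
    name: str
    site: str           # site where the executable currently lives
    pfn: str            # physical path or URL of the executable (placeholder)
    is_stageable: bool  # True: may be copied to the execution site


def needs_staging(entry, exec_site):
    """Return True if the executable must be transferred before the job runs."""
    if entry.site == exec_site:
        return False  # already present on the execution site
    if entry.is_stageable:
        return True   # copy it to the execution site
    raise LookupError(f"{entry.name} is installed on {entry.site}, not {exec_site}")


wrapper = TransformationEntry("wrapper", "local", "/home/user/wf/bin/wrapper.sh", True)
print(needs_staging(wrapper, "condorpool"))
```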

In our example workflow, a container holds the LLM and the code. The container is hosted on the Open Storage Network (OSN) and is a good example of how Pegasus can transfer data as part of the workflow: the container is pulled down by a data transfer job in the workflow.

The code we want to run is defined as a `Transformation()` that references the container. We also set profiles to specify our resource requirements (1 CPU core, 1 GPU, 10 GB RAM, and 15 GB disk), as well as the type of GPU we need.

In [8]:
    # --- Transformation Catalog (Executables and Containers) ----------------------
    def create_transformation_catalog(self, exec_site_name="condorpool"):
        self.tc = TransformationCatalog()
        
        llm_rag_container = Container("llm_rag_container",
            container_type = Container.SINGULARITY,
            image = "https://usgs2.osn.mghpcc.org/pegasus-tutorials/containers/llm-rag-v2.sif",
            image_site = "web"
        )
        
        # main job wrapper
        # note how gpus and other resources are requested
        wrapper = Transformation("wrapper", 
                                 site="local", 
                                 pfn=self.wf_dir+"/bin/wrapper.sh", 
                                 is_stageable=True, 
                                 container=llm_rag_container)\
                  .add_pegasus_profiles(cores=1, gpus=1, memory="10 GB", diskspace="15 GB")\
                  .add_profiles(Namespace.CONDOR, key="require_gpus", value="Capability >= 8.0")

        
        self.tc.add_containers(llm_rag_container)
        self.tc.add_transformations(wrapper)
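
The `require_gpus` profile above asks for a GPU whose `Capability` (the CUDA compute capability) is at least 8.0. The toy filter below, in plain Python, illustrates what that constraint means. The GPU list is made up, and the real matching is performed by HTCondor, not by code like this.

```python
# Hypothetical GPUs advertised by execution points (values are illustrative).
available_gpus = [
    {"slot": "slot1@node-a", "Capability": 7.5},
    {"slot": "slot1@node-b", "Capability": 8.0},
    {"slot": "slot1@node-c", "Capability": 9.0},
]

MIN_CAPABILITY = 8.0  # mirrors value="Capability >= 8.0" in the profile


def matching_gpus(gpus, minimum=MIN_CAPABILITY):
    """Keep only GPUs that satisfy the 'Capability >= minimum' constraint."""
    return [gpu for gpu in gpus if gpu["Capability"] >= minimum]


for gpu in matching_gpus(available_gpus):
    print(gpu["slot"], gpu["Capability"])
```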

## Replica Catalog

The Replica Catalog maps **logical file names (LFNs)** to their corresponding **physical file names (PFNs)** across various storage systems. It is the component that lets Pegasus locate the data needed for a workflow's execution. Logical file names are workflow-specific identifiers used to reference input, intermediate, or output files, while physical file names are the actual paths or URLs of those files on storage resources, such as local disks, shared filesystems, cloud storage, or remote servers. With this mapping, Pegasus can find and retrieve the required data regardless of where it is stored. PFNs can use multiple storage protocols, such as HTTP, S3, GridFTP, and SCP, which makes the catalog adaptable to diverse infrastructures.
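
Conceptually, the catalog behaves like a lookup table from each LFN to one or more (site, PFN) pairs. The sketch below is a toy model in plain Python, not the Pegasus `ReplicaCatalog` class; the file names, URLs, and the `find_replica` helper are hypothetical.

```python
from urllib.parse import urlparse

# LFN -> list of (site, PFN). One logical file may have several physical copies.
replicas = {
    "book.txt": [
        ("local", "file:///home/user/wf/inputs/book.txt"),
        ("web", "https://example.org/data/book.txt"),
    ],
}


def find_replica(lfn, preferred_site):
    """Return (site, PFN, protocol), preferring a copy on preferred_site."""
    candidates = replicas.get(lfn)
    if not candidates:
        raise KeyError(f"no replica registered for {lfn!r}")
    # Fall back to the first registered copy if the preferred site has none.
    site, pfn = next((c for c in candidates if c[0] == preferred_site), candidates[0])
    return site, pfn, urlparse(pfn).scheme


print(find_replica("book.txt", "web"))
print(find_replica("book.txt", "condorpool"))
```

The workflow only ever refers to `book.txt`. Which physical copy is used, and over which protocol, is decided at lookup time.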


In [9]:
    # --- Replica Catalog ----------------------------------------------------------
    def create_replica_catalog(self):
        self.rc = ReplicaCatalog()

        # Add inference dependencies
        self.rc.add_replica("local", "llm-rag.py",
                            os.path.join(self.wf_dir, "bin/llm-rag.py"))
        self.rc.add_replica("local", "Alices_Adventures_in_Wonderland_by_Lewis_Carroll.txt",
                            os.path.join(self.wf_dir, "inputs/Alices_Adventures_in_Wonderland_by_Lewis_Carroll.txt"))
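
It can be useful to confirm that each local path exists before registering it in the catalog. The sketch below uses only the standard library; `missing_replicas` is a hypothetical helper, the temporary directory stands in for `self.wf_dir`, and `inputs/book.txt` is a placeholder file name.

```python
import os
import tempfile


def missing_replicas(wf_dir, relative_paths):
    """Return the relative paths that do not exist as files under wf_dir."""
    return [p for p in relative_paths if not os.path.isfile(os.path.join(wf_dir, p))]


# Demo with a throwaway directory standing in for the workflow directory.
with tempfile.TemporaryDirectory() as wf_dir:
    os.makedirs(os.path.join(wf_dir, "bin"))
    with open(os.path.join(wf_dir, "bin", "llm-rag.py"), "w") as handle:
        handle.write("# placeholder\n")

    expected = ["bin/llm-rag.py", "inputs/book.txt"]  # second path is deliberately absent
    missing = missing_replicas(wf_dir, expected)
    print(missing)
```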