# Site Catalog

**Objective:** Learn how Pegasus uses a Site Catalog to obtain information about the layout of the resource on which you want to execute your workflow.


The Site Catalog defines the computational environments where the workflow's tasks will execute. Each "site" in the catalog represents a distinct resource, such as a local machine, a high-performance computing cluster, or a cloud platform. The catalog provides detailed information about the resources at each site, including:

- paths for shared and local storage, which define where input, output, and intermediate data will be stored.
- file server endpoints, which define how data will be staged in and out of the site.
- any profiles (such as environment variables or HTCondor attributes) that need to be applied to each job that executes on that resource.
- any default job resource requirements, such as how much memory a job running on a node requires.

By default, Pegasus assumes two sites, which is why you have not needed a Site Catalog in the notebooks so far:

- *local* : The _local_ site refers to the submit host, which is the machine on which Pegasus is installed and where you run the notebook or any of the Pegasus command-line tools such as `pegasus-plan` and `pegasus-status`.

- *condorpool* : The _condorpool_ site refers to the HTCondor pool we want to run our jobs in. There is no need to specify directories in this case, as the work directory is automatically assigned by HTCondor.

In the ACCESS Pegasus setup, the _condorpool_ site is backed by a TestPool that is provisioned automatically from ACCESS sites such as *Jetstream2*.

However, you may find yourself defining a **site** when you need to:

- override or associate certain environment variables that should apply to your jobs.
- execute your workflow on a remote HPC cluster such as SDSC Expanse, where you need to use the shared filesystem.

The code example below shows how you can define a Site Catalog and override the **condorpool** site.


In [None]:
    # --- Site Catalog -------------------------------------------------------------
    def create_sites_catalog(self):
        self.sc = SiteCatalog()
        condorpool = (Site("condorpool")
                        .add_condor_profile(universe="container")
                        .add_pegasus_profile(
                            style="condor"
                        )
                    )
        condorpool.add_profiles(Namespace.ENV, LANG='C')
        condorpool.add_profiles(Namespace.ENV, PYTHONUNBUFFERED='1')
        
        # exclude the ACCESS Pegasus TestPool 
        #condorpool.add_condor_profile(requirements="TestPool =!= True")

        # If you want to run on OSG, please specify your OSG ProjectName. For testing,
        # feel free to use the USC_Deelman project (Deelman is the PI of the Pegasus
        # project). For production work, please use your own project.
        condorpool.add_profiles(Namespace.CONDOR, key="+ProjectName", value="\"USC_Deelman\"")
        
        self.sc.add_sites(condorpool)



Note how we also set profiles on the _condorpool_ site. Both environment and HTCondor profiles are specified, and they will be applied to all jobs on that site. For details, see the [Configuration](https://pegasus.isi.edu/documentation/reference-guide/configuration.html) chapter in the Pegasus manual.
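
The `+ProjectName` value in the cell above is wrapped in escaped double quotes because HTCondor treats custom job attributes (those whose name starts with `+`) as expressions, so a string has to arrive as a quoted literal. The sketch below shows one way to build such a value. The helper name `condor_string_literal` is hypothetical and not part of the Pegasus API, and the sketch simply rejects embedded quotes rather than trying to escape them.

```python
def condor_string_literal(value: str) -> str:
    """Wrap a plain string in double quotes for use as an HTCondor attribute value.

    Hypothetical helper for illustration only; it is not part of Pegasus.
    """
    # Keep the sketch simple: refuse input that would need escaping.
    if '"' in value or "\n" in value:
        raise ValueError("value must not contain double quotes or newlines")
    return '"' + value + '"'


# Produces the same string as the hand-escaped literal used in the cell above.
project_name = condor_string_literal("USC_Deelman")
print(project_name)
```

The result could then be passed as the `value` argument in the `add_profiles(Namespace.CONDOR, key="+ProjectName", ...)` call shown earlier, in place of the hand-escaped string.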

### Running on a shared filesystem on an ACCESS Resource

If you want to run your workflow on a particular ACCESS Resource using the shared filesystem on that resource, you need to tie in your **allocation** for the resource and provision nodes for your workflows using that allocation. These concepts are covered in subsequent notebooks on provisioning and running on a shared filesystem.

The example below shows how you would describe such a site, using the SDSC Expanse resource as an example.

In [None]:
    # --- Site Catalog -------------------------------------------------------------
    def create_sites_catalog(self):
        sc = SiteCatalog()
        expanse = (Site("expanse")
                    .add_pegasus_profile(style="condor")
                  )
        
        # Note: update the variables below to refer to your username on
        # Expanse. Usernames are of the form uxXXXXX.
        # This snippet also assumes `import os` earlier in the notebook.
        cluster_shared_dir = "/expanse/lustre/scratch/EXPANSE_USERNAME/temp_project"
        cluster_home_dir = "/home/EXPANSE_USERNAME"
        
        exec_site_shared_scratch_dir = os.path.join(cluster_shared_dir, "pegasuswfs/scratch")
        exec_site_shared_storage_dir = os.path.join(cluster_home_dir, "pegasuswfs/outputs")
        
        expanse.add_directories(
            Directory(Directory.SHARED_SCRATCH, exec_site_shared_scratch_dir)
            .add_file_servers(FileServer("file://" + exec_site_shared_scratch_dir, Operation.ALL)),
            Directory(Directory.LOCAL_STORAGE, exec_site_shared_storage_dir)
            .add_file_servers(FileServer("file://" + exec_site_shared_storage_dir, Operation.ALL))
        )
        expanse.add_profiles(Namespace.ENV, LANG='C')
        expanse.add_profiles(Namespace.ENV, PYTHONUNBUFFERED='1')

        # exclude the ACCESS Pegasus TestPool
        # we want it to run on our provisioned resources.
        expanse.add_condor_profile(requirements="TestPool =!= True")
        
        sc.add_sites(expanse)
        return sc
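
The path handling in the cell above can be tried out on its own, without Pegasus installed. The following is a minimal sketch that derives the same scratch and storage directories, and the `file://` URLs passed to `FileServer`, from a username. The function name `expanse_site_paths`, the dictionary keys, and the placeholder username are assumptions made for illustration; the directory layout is copied from the cell above.

```python
import posixpath


def expanse_site_paths(username: str) -> dict:
    """Return the directories and file:// URLs used for the Expanse site.

    Hypothetical helper; it mirrors the path logic of the notebook cell.
    """
    if not username or "/" in username:
        raise ValueError("username must be a non-empty name without slashes")

    # Same layout as cluster_shared_dir and cluster_home_dir in the cell.
    cluster_shared_dir = posixpath.join("/expanse/lustre/scratch", username, "temp_project")
    cluster_home_dir = posixpath.join("/home", username)

    scratch_dir = posixpath.join(cluster_shared_dir, "pegasuswfs/scratch")
    storage_dir = posixpath.join(cluster_home_dir, "pegasuswfs/outputs")

    return {
        "scratch_dir": scratch_dir,
        "scratch_url": "file://" + scratch_dir,
        "storage_dir": storage_dir,
        "storage_url": "file://" + storage_dir,
    }


# "EXPANSE_USERNAME" is a placeholder; substitute your own username.
for name, value in expanse_site_paths("EXPANSE_USERNAME").items():
    print(name, "=", value)
```

Because the paths are absolute, prefixing them with `file://` yields URLs with three slashes, which is the form the `FileServer` entries in the cell above receive.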

### Running on a local HPC cluster

If you are doing this training and want to set up Pegasus to run on your local campus cluster, please refer to the [documentation](https://pegasus.isi.edu/documentation/user-guide/deployment-scenarios.html#hpc-clusters-system-install) for full details.

In general, a local SLURM cluster is described as follows.

In [None]:
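    # --- Site Catalog -------------------------------------------------------------
    def create_sites_catalog_for_local_slurm(self):
        # BASE_DIR, slurm_partition and slurm_account are not defined in this
        # snippet; they are assumed to be set elsewhere for your cluster.
        sc = SiteCatalog()
        slurm_scratch_dir = "{}/SLURM/work".format(BASE_DIR)
        slurm_storage_dir = "{}/SLURM/storage".format(BASE_DIR)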

        slurm = Site("slurm")\
            .add_directories(
            Directory(Directory.SHARED_SCRATCH, slurm_scratch_dir)
                .add_file_servers(FileServer("file://" + slurm_scratch_dir, Operation.ALL)),
            Directory(Directory.LOCAL_STORAGE, slurm_storage_dir)
                .add_file_servers(FileServer("file://" + slurm_storage_dir, Operation.ALL)))

        slurm.add_pegasus_profile(
                                style="glite",
                                queue=slurm_partition,
                                project=slurm_account,
                                data_configuration="nonsharedfs",
                                auxillary_local="true",
                                nodes=1,
                                ppn=1,
                                runtime=1800,
                                clusters_num=2
                            )
        slurm.add_condor_profile(grid_resource="batch slurm")
        
        sc.add_sites(slurm)
        return sc

## Conclusion

By going through these notebooks in order, you have now learnt how Pegasus uses catalogs to map an abstract workflow to an executable workflow.

The Abstract Workflow description that you specify to Pegasus is portable, and usually does not contain the locations of physical input files, executables, or the cluster endpoints where jobs are executed. Pegasus uses three information catalogs during the planning process.

<img src="images/catalogs.png"/>

The Pegasus documentation provides more details about catalogs [here](https://pegasus.isi.edu/documentation/user-guide/creating-workflows.html#catalogs).

## What's Next?


We will now progress to running a real LLM RAG workflow that leverages all the concepts learnt so far and describes all three catalogs.