# Discovering Data

**Objective:** Learn how to point Pegasus to remote datasets by specifying a Replica Catalog.

In the previous notebook, we executed a simple **Hello World** workflow illustrated below, where the input data for the workflow was present in a local directory. 


![Hello World Workflow](../images/pipeline.svg)

In real-world scenarios, the raw input datasets that your workflow requires often reside on remote file servers accessible via storage protocols such as HTTP, S3, GridFTP, and SCP. You do not need to download these datasets yourself before running a Pegasus workflow. Instead, you can specify their remote locations in a catalog named the **Replica Catalog**, and Pegasus will download the datasets as part of the workflow execution.
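To make the idea of a remote location concrete, here is a minimal sketch that classifies a location string by its URL scheme. The helper name `is_remote_location` and the scheme set are assumptions for illustration, derived from the protocols named above (`gsiftp` is the scheme conventionally used for GridFTP URLs). It is not part of the Pegasus API.

```python
from urllib.parse import urlparse

# Assumption: URL schemes for the protocols named in the text above.
# "gsiftp" is the scheme conventionally used for GridFTP URLs.
REMOTE_SCHEMES = {"http", "https", "s3", "gsiftp", "scp"}


def is_remote_location(location: str) -> bool:
    """Return True if the location uses one of the remote schemes above."""
    return urlparse(location).scheme.lower() in REMOTE_SCHEMES


# The tutorial input lives on an HTTP server, so it counts as remote.
print(is_remote_location("http://download.pegasus.isi.edu/tutorial/inputs/f.in"))  # True

# A plain filesystem path (hypothetical) has no scheme, so it is not remote.
print(is_remote_location("/hypothetical/inputs/f.in"))  # False
```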

The Replica Catalog acts as a mapping between **logical file names (LFNs)** and their corresponding **physical file names (PFNs)** across various storage systems. Logical file names are workflow-specific identifiers used to reference input, intermediate, or output files, while physical file names represent their actual paths or URLs on storage resources, such as local disks, shared filesystems, cloud storage, or remote servers.

The Replica Catalog ensures that Pegasus can find and retrieve the required data efficiently, regardless of where it is stored. 
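The LFN-to-PFN mapping described above can be pictured as a dictionary keyed by logical file name. The sketch below is only a toy model for intuition: the class `ToyReplicaCatalog` and its methods are hypothetical and are not how Pegasus implements its catalog.

```python
from typing import Dict, List, Tuple


class ToyReplicaCatalog:
    """Toy model: maps each LFN to a list of (site, PFN) replicas."""

    def __init__(self) -> None:
        self._replicas: Dict[str, List[Tuple[str, str]]] = {}

    def add_replica(self, site: str, lfn: str, pfn: str) -> None:
        # One logical file may have several physical copies.
        self._replicas.setdefault(lfn, []).append((site, pfn))

    def lookup(self, lfn: str) -> List[Tuple[str, str]]:
        # Return a copy so callers cannot mutate the catalog by accident.
        return list(self._replicas.get(lfn, []))


toy_rc = ToyReplicaCatalog()
toy_rc.add_replica(
    "remote", "f.in", "http://download.pegasus.isi.edu/tutorial/inputs/f.in"
)

print(toy_rc.lookup("f.in"))     # one (site, PFN) pair for the LFN "f.in"
print(toy_rc.lookup("missing"))  # [] because nothing is registered
```

The workflow only ever refers to the LFN (`f.in`); the catalog is what turns that name into a concrete location at planning time.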

In this notebook, we will build on the previous notebook and use the Pegasus Workflow API to define a Replica Catalog and specify a remote location for the input file *f.in*.


In this example, Pegasus will 

* pick up the inputs from the Pegasus website.
* pick up the executables from a directory named **executables**.
* place the generated outputs in a directory named **output**.


## 1. Import Python API

Pegasus 5.0 introduces a new Python API, which is fully documented in the [Pegasus reference guide](https://pegasus.isi.edu/documentation/reference-guide/api-reference.html). 

We will mainly use the following classes in this simple quickstart example:

<br>

```
from Pegasus.api.replica_catalog import File, ReplicaCatalog
from Pegasus.api.workflow import Job, Workflow
from Pegasus.client._client import PegasusClientError
```

The `ReplicaCatalog` object is used to store the file locations.

```
# Define the location of the inputs a Workflow requires.
input_file = File("input.txt") 

rc = ReplicaCatalog()\
    .add_replica("remote", input_file, "http://example.com/inputs/input.txt")\
    .write()
```

In [None]:
from Pegasus.api import *
import sys
from pathlib import Path

import logging

logging.basicConfig(level=logging.DEBUG)

# we specify directories for executables and outputs
# - EXECUTABLES_DIR: directory where the executables that the workflow uses are placed.
# - OUTPUT_DIR: directory where the outputs should be placed.

BASE_DIR = Path(".").resolve()
EXECUTABLES_DIR = Path(BASE_DIR / ".." /  "executables").resolve()
OUTPUT_DIR = Path(BASE_DIR /  "output").resolve() 

## 2. Create a Replica Catalog (Specify Initial Input Files)

Any initial input files given to the workflow should be specified in the `ReplicaCatalog`. This object tells Pegasus where each input file is physically located. First, we describe the file that will be used as input to this workflow.

The file `f.in` is the input to this workflow, so we create a corresponding `File` object. Metadata may also be added to the file as shown below.

Next, a `ReplicaCatalog` object is created so that the physical location of each input file can be cataloged. This is done using the `ReplicaCatalog.add_replica(site, file, path)` function. First, as the file `f.in` resides on a remote HTTP server (the Pegasus website), we use `remote` for the `site` parameter. Second, the `File` object is passed in for the `file` parameter. Finally, the remote URL is passed in for the `path` parameter.

By default, `pegasus-plan` looks for a `replicas.yml` file in the current working directory.
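Before planning, it can be handy to confirm that a `replicas.yml` file is actually present where you expect it. The sketch below only mirrors the default lookup location described above using the standard library. The helper name `find_default_replica_catalog` is hypothetical, and the demo uses a temporary directory so that it does not touch your working directory.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_default_replica_catalog(directory: Path) -> Optional[Path]:
    """Return the path to replicas.yml in `directory`, or None if absent."""
    candidate = directory / "replicas.yml"
    return candidate if candidate.is_file() else None


# Demo in a scratch directory (stand-in for the current working directory).
with tempfile.TemporaryDirectory() as scratch:
    scratch_dir = Path(scratch)
    print(find_default_replica_catalog(scratch_dir))  # None

    # Create an empty placeholder file; real contents come from rc.write().
    (scratch_dir / "replicas.yml").touch()
    found = find_default_replica_catalog(scratch_dir)
    print(found.name)  # replicas.yml
```

In the notebook itself you would call the helper with `Path(".")` after the Replica Catalog has been written.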

In [None]:
# --- Replicas -----------------------------------------------------------------
fin = File("f.in").add_metadata(creator="vahi")
rc = ReplicaCatalog()\
    .add_replica("remote", fin, "http://download.pegasus.isi.edu/tutorial/inputs/f.in")\
    .write() # written to ./replicas.yml 

In [None]:
!cat replicas.yml

## 3. Define and Execute the Workflow

In [None]:
# the execution site where you want your jobs to run.
# local means the jobs run on ACCESS Pegasus itself.
# condorpool means jobs will run on a node provisioned from an ACCESS site such as jetstream
EXEC_SITE="condorpool"

# --- Workflow -----------------------------------------------------------------
wf = Workflow("hello-world")


finter = File("f.inter")
fout = File("f.out")

job_hello = Job("hello")\
                    .add_args("-T", "3", "-i", fin, "-o", finter)\
                    .add_inputs(fin)\
                    .add_outputs(finter)

job_world = Job("world")\
                    .add_args("-T", "3", "-i", finter, "-o", fout)\
                    .add_inputs(finter)\
                    .add_outputs(fout)

wf.add_jobs(job_hello, job_world)    

# --- Run the Workflow ---------------------------------------------------
# we have omitted the input_dirs option because the Replica Catalog
# already specifies the locations of the input files. You can also mix and
# match: keep some inputs in a Replica Catalog and others available locally.
try:
    wf.write()
    wf.plan(sites=[EXEC_SITE], transformations_dir=EXECUTABLES_DIR,\
            output_dir=OUTPUT_DIR, submit=True)\
      .wait()      
except PegasusClientError as e:
    print(e)

## 4. Inspecting the generated output of the workflow

Now let's review the output of the workflow to check where the jobs ran. The displayed hostname will have a prefix of `testpool-cpu-`. The test pool is made up of nodes provisioned from Indiana University's **Jetstream2**, which is an ACCESS cloud resource.

In [None]:
! cat output/f.out
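
If you want to check a hostname against the `testpool-cpu-*` pattern programmatically rather than by eye, a minimal sketch follows. The helper name `ran_on_test_pool` and the sample hostnames are hypothetical, and the sketch makes no assumption about the exact layout of `f.out`: you would pass in whatever hostname string you read from it.

```python
from fnmatch import fnmatchcase


def ran_on_test_pool(hostname: str, pattern: str = "testpool-cpu-*") -> bool:
    """Return True if the hostname matches the test pool naming pattern."""
    # fnmatchcase compares case-sensitively on every platform.
    return fnmatchcase(hostname, pattern)


# Hypothetical hostnames, for illustration only.
print(ran_on_test_pool("testpool-cpu-example-node"))  # True
print(ran_on_test_pool("login-node"))                 # False
```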