# Gladier Flows Tutorial
### Gladier: The Globus Architecture for Data-Intensive Experimental Research.

Gladier is a programmable data capture, storage, and analysis architecture for experimental facilities. The architecture leverages a data and computing substrate based on agents deployed across computer and storage systems at APS, ALCF, and elsewhere, all managed by cloud-hosted Globus services. In particular, we leverage [Globus Connect](https://www.globus.org/globus-connect) and [funcX](https://funcx.org) agents to facilitate secure, reliable remote data access and computation, and we employ the [Globus Flows](https://www.globus.org/platform/services/flows) platform to orchestrate distributed data management tasks into reliable pipelines.

## Gladier Toolkit
The Gladier toolkit provides tools and capabilities to simplify and accelerate the development of these automations. The toolkit manages the dynamic creation of flows, automatically registers funcX functions, and assists in validating inputs. 

Here we demonstrate how the Gladier toolkit can be used to create a simple yet powerful client that automates data management tasks.

Although installing it is not necessary to use this notebook, the Gladier toolkit is available on PyPI and can be installed with:

    $ pip install gladier

Documentation is available [here.](https://gladier.readthedocs.io/en/latest/index.html)


In [None]:
# General Imports
import pprint
import json
import os
import pathlib
import random

# Gladier Imports
from gladier import GladierBaseClient, GladierBaseTool, generate_flow_definition

# Set this so Gladier knows it should login on a remote system
os.environ['SSH_TTY'] = 'JUPYTERHUB_REMOTE'

### Globus Search

For our little experiment here, we will be publishing data to Globus Search so we can later display it in a portal. We need to set up a Globus Search index in order to do the publishing step below.

See "Metadata Search and Discovery" for more information on Globus Search.

In [None]:
import pickle, base64, os, pprint, globus_sdk

# Name of the search index to use for this notebook
index_name = 'gladier-tutorial'

# Load a search client using a token from the Jupyterhub login
data = pickle.loads(base64.b64decode(os.getenv('GLOBUS_DATA')))
search_token = data['tokens']['search.api.globus.org']['access_token']
search = globus_sdk.SearchClient(authorizer=globus_sdk.AccessTokenAuthorizer(search_token))

# Fetch all trial indices with this name that the user owns
indices = [si for si in search.get("/v1/index_list").data['index_list']
           if si['is_trial'] 
           and si['display_name'] == index_name
           and 'owner' in si['permissions']
          ]

# If an index was found with the criteria above, re-use it. Otherwise,
# create a new index.
if indices:
    tutorial_index = indices[0]
    print('Found existing index!')
else:
    index_doc = {
        "display_name": index_name, 
        "description": 'A trial index for running my search tutorial'
    }
    tutorial_index = search.post("/beta/index", json_body=index_doc).data
    print('New search index created successfully.')

search_index = tutorial_index['id']

print(tutorial_index['display_name'])
print(tutorial_index['description'])

## Gladier Tools

Gladier Tools are the glue that holds together Globus Flows and funcX functions. Tools bundle everything the funcX function needs to run, so the Gladier Client can register the function, check the requirements, and run it inside the flow.

We need three Flow States below to run our full experiment:

1. RunExperiment -- A function which will do the experimental work
2. GatherMetadata -- A function to gather results of the experiment
3. PublishMetadata -- A flow state to ingest the metadata into Globus Search

The first two Gladier Tools are funcX functions, and use the `@generate_flow_definition` decorator to create the flow state for each function. The final publication state does not use the decorator, and instead uses a static flow definition. Gladier will chain these together into a single flow and run them one after the other.

In [None]:
def run_experiment(**data):
    """Run an 'experiment' on our remote execution environment."""
    import pathlib
    import random
    experiment = pathlib.Path(data['experiment'])
    # Say hello a bunch of times
    experiment.write_text('Hello Gladier!' * random.randint(1, 100))
    return str(experiment)


@generate_flow_definition
class RunExperiment(GladierBaseTool):
    funcx_functions = [run_experiment]
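
Because a funcX function is an ordinary Python function, it can be tried locally before it is registered and run on a remote endpoint. The sketch below repeats the body of `run_experiment` so that it is self-contained, and calls it against a temporary file. The temporary directory and the `target`, `result`, and `contents` names are illustrative assumptions, not part of Gladier.

```python
import pathlib
import tempfile


def run_experiment(**data):
    """Same body as the tool function above: write a greeting to a file."""
    # Imports live inside the function so it remains self-contained
    # when it is shipped to and executed on a remote endpoint.
    import pathlib
    import random
    experiment = pathlib.Path(data['experiment'])
    experiment.write_text('Hello Gladier!' * random.randint(1, 100))
    return str(experiment)


# Call the function locally with the same keyword the flow input provides.
with tempfile.TemporaryDirectory() as tmp:
    target = pathlib.Path(tmp) / 'local_test.txt'
    result = run_experiment(experiment=str(target))
    contents = pathlib.Path(result).read_text()
    print(result, len(contents))
```

A local run like this only checks the function logic; it says nothing about whether the remote endpoint has the same packages or file system layout.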

In [None]:
def gather_metadata(**data):
    import pathlib
    import hashlib
    from datetime import datetime
    experiment = pathlib.Path(data['experiment'])  
    search_document = {
        'search_index': data['search_index'],
        'content': {    
            'dc': {
                'creators': [{'creatorName': 'ALCF Researcher'}],
                'dates': [{'date': datetime.now().isoformat(),
                           'dateType': 'Created'}],
                'formats': ['text/plain'],
                'publicationYear': '2021',
                'publisher': 'ALCF Researcher',
                'resourceType': {'resourceType': 'Dataset',
                                 'resourceTypeGeneral': 'Dataset'},
                'subjects': [{'subject': 'Globus'}, {'subject': 'Flows'}, {'subject': 'ALCF'}],
                'titles': [{'title': f'My Experiment {experiment.name}'}],
                'version': 1,
                
            },
            'files': [{
                'filename': experiment.name,
                'length': experiment.stat().st_size,
                'mime_type': 'text/plain',
                'md5': hashlib.md5(experiment.read_bytes()).hexdigest(),
                'sha256': hashlib.sha256(experiment.read_bytes()).hexdigest(),
            }],
        },
        'subject': data['subject'],
        'visible_to': ['public'],
    }
    # Clean up our 'experiment' file.
    experiment.unlink()
    return search_document


@generate_flow_definition
class GatherMetadata(GladierBaseTool):
    funcx_functions = [gather_metadata]
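
`gather_metadata` reads the whole file into memory once per checksum, which is fine for the small text file used here. For larger experiment outputs, one possible alternative is to read the file once in chunks and update both digests together. This is a minimal sketch; `file_checksums` and `chunk_size` are hypothetical names and not part of Gladier.

```python
import hashlib
import pathlib
import tempfile


def file_checksums(path, chunk_size=1024 * 1024):
    """Return md5 and sha256 hex digests of a file, reading it once in chunks."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(path, 'rb') as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            md5.update(chunk)
            sha256.update(chunk)
    return {'md5': md5.hexdigest(), 'sha256': sha256.hexdigest()}


# Try the helper on a small temporary file.
with tempfile.TemporaryDirectory() as tmp:
    sample = pathlib.Path(tmp) / 'sample.txt'
    sample.write_bytes(b'Hello Gladier!' * 3)
    checksums = file_checksums(sample)
    print(checksums)
```

If a helper like this were used inside a funcX function, it would need to be defined within the function body, in the same way the imports are.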

In [None]:
class PublishMetadata(GladierBaseTool):
    flow_definition = {
        'Comment': 'Publish metadata to Globus Search, with data from the result.',
        'StartAt': 'PublishMetadata',
        'States': {
            'PublishMetadata': {
                'Comment': 'Ingest a Globus Search document',
                'Type': 'Action',
                'ActionUrl': 'https://actions.globus.org/search/ingest',
                'ExceptionOnActionFailure': True,
                'InputPath': '$.GatherMetadata.details.result[0]',
                'ResultPath': '$.PublishMetadata',
                'WaitTime': 300,
                'End': True
            }
        },
    }
    funcx_functions = []
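
The `InputPath` above selects which part of the run state is handed to the ingest action: the first element of the result list stored under `GatherMetadata`. The sketch below illustrates that selection with a tiny path resolver. It is a simplified illustration that handles only `.key` and `[index]` steps, not the implementation used by Globus Flows, and the `run_state` contents are made up for the example.

```python
import re


def resolve_path(document, path):
    """Resolve a simple path such as '$.a.b[0]' against nested dicts and lists."""
    if not path.startswith('$'):
        raise ValueError(f'path must start with "$": {path!r}')
    current = document
    # Each token is either a ".key" step or an "[index]" step.
    for key, index in re.findall(r'\.([^.\[\]]+)|\[(\d+)\]', path[1:]):
        current = current[key] if key else current[int(index)]
    return current


# Hypothetical run state after the GatherMetadata step has finished.
run_state = {
    'input': {'search_index': 'my-index-id'},
    'GatherMetadata': {
        'details': {
            'result': [
                {'subject': 'https://example.com/item', 'visible_to': ['public']},
            ],
        },
    },
}

ingest_input = resolve_path(run_state, '$.GatherMetadata.details.result[0]')
print(ingest_input)
```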


## Gladier Clients

Gladier Clients manage a collection of Gladier Tools and a Globus Flow that links them together into a pipeline. Clients handle both registering the funcX functions for each tool and registering the flow that orchestrates each tool's execution. The checksums of the flow and the funcX functions are checked prior to each invocation to ensure they are always up to date. Further, the client checks that the necessary inputs to each tool are present before the flow is invoked.
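
The input check can be pictured as a comparison between the keys a tool needs and the keys present in the flow input. The following is a minimal sketch of that idea, not Gladier's own validation code; `find_missing_inputs` and the `required_keys` list are hypothetical.

```python
def find_missing_inputs(flow_input, required_keys):
    """Return the required keys that are absent from flow_input['input']."""
    provided = flow_input.get('input', {})
    return [key for key in required_keys if key not in provided]


# Hypothetical requirements for the tools in this tutorial.
required_keys = ['experiment', 'subject', 'funcx_endpoint_compute', 'search_index']

# An incomplete input document: two required keys are missing.
example_input = {
    'input': {
        'experiment': '/tmp/file_1.txt',
        'subject': 'https://example.com/file_1.txt',
    }
}

missing = find_missing_inputs(example_input, required_keys)
if missing:
    print(f'Missing flow input: {missing}')
```

Failing early on the client side avoids starting a run that would only fail once it reaches the state that needs the missing value.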

Once a tool has been created it can be imported and used by a client. The client can then dynamically create a flow using the list of tools.

Here we define `MyGladierClient` and specify the three tools created above.

In [None]:
@generate_flow_definition
class MyGladierClient(GladierBaseClient):
    gladier_tools = [
        RunExperiment,
        GatherMetadata,
        PublishMetadata,
    ]

The `@generate_flow_definition` decorator prompts the client to dynamically create a flow that serially combines each tool used by the client. The resulting flow definition is then saved and can be inspected.

More information on flow generation can be found [here.](https://gladier.readthedocs.io/en/latest/flow_generation.html)
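
To make the idea of serially combining tools concrete, here is a simplified sketch that joins several small flow definitions by replacing each `End` marker with a `Next` pointer to the following flow's first state. It illustrates the concept only and is not Gladier's generation code; `chain_flows` and `make_state` are hypothetical helpers, and each input definition is assumed to have exactly one `End` state.

```python
import copy


def chain_flows(flow_definitions):
    """Combine flow definitions (one End state each) into one serial flow."""
    combined = {'StartAt': None, 'States': {}}
    previous_end = None
    for definition in flow_definitions:
        definition = copy.deepcopy(definition)  # leave the caller's dicts untouched
        if combined['StartAt'] is None:
            combined['StartAt'] = definition['StartAt']
        if previous_end is not None:
            # Redirect the previous terminal state to this flow's first state.
            previous_end.pop('End', None)
            previous_end['Next'] = definition['StartAt']
        for name, state in definition['States'].items():
            combined['States'][name] = state
            if state.get('End'):
                previous_end = state
    return combined


def make_state(name):
    """Build a one-state placeholder flow definition for the example."""
    return {'StartAt': name, 'States': {name: {'Type': 'Action', 'End': True}}}


combined_flow = chain_flows([
    make_state('RunExperiment'),
    make_state('GatherMetadata'),
    make_state('PublishMetadata'),
])
for name, state in combined_flow['States'].items():
    print(name, '->', state.get('Next', 'End'))
```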

## Flow Input

In the generated flow definition, the input arguments for each tool are defined dynamically. In this case, each funcX tool requires a `funcx_endpoint_compute` and a funcX function id (for example, `run_experiment_funcx_id` for the `RunExperiment` tool), and the entire `input` document is passed as the function payload. These values can be overridden in the flow or defined in the Tool definition.

It is important to note that the funcX function id is automatically populated by the Client at runtime. This allows the client to check whether the function definition has changed and re-register the function with funcX if necessary. As such, you do not need to specify the function id as input to the flow.
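
The change detection described here can be sketched as a checksum comparison: hash the function source, compare it with the checksum recorded at the last registration, and re-register only when they differ. This is an illustration of the idea under simplifying assumptions, not Gladier's implementation; `needs_registration` and the in-memory `registry` dict are hypothetical.

```python
import hashlib


def needs_registration(name, source, registry):
    """Return True and record the new checksum if the source has changed."""
    checksum = hashlib.sha256(source.encode('utf-8')).hexdigest()
    if registry.get(name) == checksum:
        return False
    registry[name] = checksum
    return True


# Hypothetical local cache: function name -> checksum of its source.
registry = {}
original = "def run_experiment(**data):\n    return data['experiment']\n"
edited = original.replace("data['experiment']", "str(data['experiment'])")

first = needs_registration('run_experiment', original, registry)   # never seen before
second = needs_registration('run_experiment', original, registry)  # unchanged source
third = needs_registration('run_experiment', edited, registry)     # source was edited
print(first, second, third)
```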

Here we define the input to include a pathname for the tools to act on, a subject and search index for the published record, and a public funcX endpoint to perform the execution.

In [None]:
experiment = pathlib.Path(f'/tmp/file_{random.randint(1, 1000000)}.txt')
# Build the subject URL as a string; pathlib would collapse the '//' in 'https://'
subject = f'https://example.com/my-gladier-experiment/{experiment.name}'

flow_input = {
    'input': {
        'experiment': str(experiment),
        'subject': str(subject),
        'funcx_endpoint_compute': '4b116d3c-1703-4f8f-9f6f-39921e5864df',
        'search_index': search_index,
    }
}

### Check Existing Flows

For new users, the Flows service allows only one deployed flow per person. If you have deployed a flow before, you can delete it using the commented line below.

In [None]:
my_gladier_client = MyGladierClient()
for flow in my_gladier_client.flows_client.list_flows()['flows']:
    print(f"{flow['title']}: {flow['id']}")
    
# You can delete flows you don't want with the following line
# my_gladier_client.flows_client.delete_flow('')

## Running the flow

Now that the input has been created, we can use the client to start and monitor the flow.

This will prompt you to authenticate and grant permission to the flow to perform a funcX invocation on your behalf. You may need to log in twice: once for the static scopes required by Gladier, and a second time for the new flow you have just deployed.

In [None]:
my_gladier_client = MyGladierClient()
flow = my_gladier_client.run_flow(flow_input=flow_input, label=f'Experiment {experiment.name}')
print(f"https://app.globus.org/flows/{flow['flow_id']}/runs/{flow['run_id']}")
my_gladier_client.progress(flow['run_id'])
my_gladier_client.get_status(flow['run_id'])
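
For scripts that need to wait for a run with a timeout, a polling loop is one option. The sketch below takes any callable that returns a status document, so it runs without Globus; in practice `my_gladier_client.get_status` could be passed in. It assumes that the status document has a `status` key and that `SUCCEEDED` and `FAILED` are the terminal values. `wait_for_run` is a hypothetical helper, not part of Gladier.

```python
import time


def wait_for_run(get_status, run_id, interval=2.0, timeout=600.0):
    """Poll get_status(run_id) until the run reports SUCCEEDED or FAILED."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status(run_id)
        # Assumption: these two values mark a finished run.
        if status.get('status') in ('SUCCEEDED', 'FAILED'):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f'run {run_id} did not finish within {timeout} seconds')
        time.sleep(interval)


# Stand-in for a real status call, so the sketch runs without Globus.
responses = iter([{'status': 'ACTIVE'}, {'status': 'ACTIVE'}, {'status': 'SUCCEEDED'}])
final = wait_for_run(lambda run_id: next(responses), 'example-run-id', interval=0.01)
print(final['status'])
```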