# Globus data publication platform tutorial

In this tutorial we demonstrate how the Globus platform can be used to build automated pipelines that publish arbitrary data with flexible access control, descriptive metadata, and persistent identifiers.

We will walk through the following data publication flow:
1. Authenticate with Globus and get tokens for accessing various services
1. Assemble a dataset and move the data to a remote, immutable endpoint, with restricted access
1. Define some metadata for our dataset
1. Use the Globus Identifier service to mint a persistent identifier for the data
1. Index descriptive metadata in Globus Search so that the dataset is discoverable by other users

The basic tutorial flow is illustrated below.  

<img src="img/publication_flow.png" alt="Automated data publication flow" align="CENTER" style="width: 85%;"/>

## Prerequisites

To complete this tutorial you will need to make sure you are in the [Tutorial Users Group](https://www.globus.org/app/groups/50b6a29c-63ac-11e4-8062-22000ab68755).

In [None]:
from __future__ import print_function  # for python 2
import globus_sdk
from identifier_api import IdentifierClient
import json

# Globus Endpoint for storing publication (Petrel#Testbed)
publication_endpoint = "e56c36e4-1063-11e6-a747-22000bf2d559"

# Globus Group which can view publications
access_group = "50b6a29c-63ac-11e4-8062-22000ab68755"

# HTTPS URL for the publication endpoint
http_base_url = "https://testbed.petrel.host/"

# Search index ID to store metadata
search_index = "f702761b-3a05-4ba1-af2b-c0e07850c6f1"

# ID of namespace where we create identifiers
identifiers_namespace = "HHxPIZaVDh9u"

# python2/3 safe simple input reading
get_input = getattr(__builtins__, 'raw_input', input)

# 1. Authentication

Before implementing the automated publication flow we must authenticate with Globus and request access tokens to use the transfer, search, and identifier services. Here we get the tokens available in JupyterHub and create clients for interacting with Globus services.

In [None]:
import pickle, base64, os, pprint

data = pickle.loads(base64.b64decode(os.getenv('GLOBUS_DATA')))

transfer_token = data['tokens']['transfer.api.globus.org']['access_token']
search_token = data['tokens']['search.api.globus.org']['access_token']
identifiers_token = data['tokens']['identifiers.globus.org']['access_token']

# to pass tokens to clients, wrap them in GlobusAuthorizers and pass the
# results to client objects; authorizers are generic objects which support
# multiple authentication methods (access tokens are just one)
transfer = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token))
search = globus_sdk.SearchClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(search_token))
identifiers = IdentifierClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(identifiers_token))
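
The token lookup above relies on the `GLOBUS_DATA` environment variable holding a base64-encoded pickle whose `tokens` dictionary is keyed by resource server. As a minimal sketch of that decoding step, the example below round-trips a made-up payload instead of real tokens. The helper name `access_tokens_from_env_value` and the fake token strings are hypothetical and are not part of the Globus SDK.

```python
import base64
import pickle


def access_tokens_from_env_value(encoded):
    """Decode a base64-encoded pickle and map resource server -> access token.

    Assumes the payload has the shape used in the cell above:
    {'tokens': {<resource server>: {'access_token': <str>, ...}, ...}}
    """
    # NOTE: only unpickle data that comes from a source you trust
    data = pickle.loads(base64.b64decode(encoded))
    return {
        server: token_info['access_token']
        for server, token_info in data['tokens'].items()
    }


# Made-up payload for illustration only; these are not real tokens
fake_payload = {
    'tokens': {
        'transfer.api.globus.org': {'access_token': 'fake-transfer-token'},
        'search.api.globus.org': {'access_token': 'fake-search-token'},
    }
}
encoded_payload = base64.b64encode(pickle.dumps(fake_payload))

tokens_by_server = access_tokens_from_env_value(encoded_payload)
print(sorted(tokens_by_server))
```

Collecting the tokens into one dictionary makes it easy to see which resource servers a session can reach before constructing any clients.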

# 2. Assemble a dataset

In the first stage of the publication workflow we move the data to a location that is immutable, accessible only to authorized users (i.e., those in the Tutorial Users group), and able to scale as needed. We use a Globus shared endpoint for this purpose, as it allows us to dynamically manage access to data.

To isolate users' datasets from each other we create a unique directory on our shared endpoint (to avoid conflicts we will name the directory using a UUID).

Note: to follow these instructions you will need to make sure you are in the [Tutorial Users Group](https://www.globus.org/app/groups/50b6a29c-63ac-11e4-8062-22000ab68755).

In [None]:
import uuid

# create a unique id for the directory name
share_path = '/' + str(uuid.uuid4()) + '/'
r = transfer.operation_mkdir(publication_endpoint, path=share_path)

print("Publication path: %s" % share_path)
print("https://www.globus.org/app/transfer?origin_id=%s&origin_path=%s" % (publication_endpoint, share_path))
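
The cell above interpolates the path straight into the web app link, which is fine for a UUID-only path. For paths that may contain spaces or other special characters, one option is to percent-encode the query values. The sketch below assumes Python 3 (`urllib.parse`); the helper names `new_share_path` and `transfer_app_link` are hypothetical.

```python
import uuid
from urllib.parse import urlencode


def new_share_path():
    """Return a directory path of the form '/<uuid4>/'."""
    return '/%s/' % uuid.uuid4()


def transfer_app_link(endpoint_id, path):
    """Build a Globus web app link, percent-encoding the query values."""
    query = urlencode({'origin_id': endpoint_id, 'origin_path': path})
    return 'https://www.globus.org/app/transfer?' + query


example_path = new_share_path()
example_link = transfer_app_link(
    'e56c36e4-1063-11e6-a747-22000bf2d559', example_path)
print(example_link)
```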

Having created the directory we now need to populate it with our dataset. For simplicity, we will move sample Globus data from the "Globus Tutorial Endpoint." You are welcome to use any data you like; just update `source_endpoint` and `source_directory`.

In [None]:
# define the source endpoint and directory to be copied for publication
# (Globus Tutorial Endpoint 1):/share/godata/
source_endpoint = 'ddb59aef-6d04-11e5-ba46-22000b92c6ec'
source_directory = '/share/godata/'

# TransferData is a helper class for building well-formed Transfer Task documents for the Globus Transfer service
tdata = globus_sdk.TransferData(
    transfer, source_endpoint, publication_endpoint,
    label='Tutorial copy data', sync_level='checksum')

# you can add multiple files and directories to transfer -- for our case, just add one
tdata.add_item(source_directory, share_path, recursive=True)

# once it's built, submit the transfer and get a task document to describe it
task_description = transfer.submit_transfer(tdata)

We now wait for the transfer to complete using the Globus SDK `task_wait` function. To confirm that the data was transferred correctly we perform an `ls` operation on the shared endpoint. Note: in this example we also record the last file name in the publication directory so that we can associate metadata with it later in the tutorial.

In [None]:
# NOTE: It's technically possible for the task to terminate with a failure. This code does not handle this condition.

# wait up to 100s, checking every 1s
completed = transfer.task_wait(
    task_description['task_id'], timeout=100, polling_interval=1)

share_file = None

if not completed:
    print('Transfer still not completed!')
else:
    for f in transfer.operation_ls(publication_endpoint, path=share_path):
        print(f['name'])
        share_file = f['name']
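
As the note in the cell above says, a task can also end in failure, and `task_wait` only reports whether the task finished. One way to tell the outcomes apart is to poll the task document and inspect its `status` field. The sketch below assumes the client offers `get_task(task_id)` returning a mapping with a `status` key, as `globus_sdk.TransferClient` does. The helper `wait_for_task` and the `FakeTransferClient` stand-in are hypothetical and exist only so the loop can run without network access.

```python
import time


def wait_for_task(client, task_id, timeout=100, polling_interval=1):
    """Poll a transfer task until it succeeds, fails, or the timeout expires.

    Returns the last status seen: 'SUCCEEDED', 'FAILED', or a non-terminal
    status such as 'ACTIVE' if the timeout was reached first.
    """
    deadline = time.time() + timeout
    while True:
        status = client.get_task(task_id)['status']
        if status in ('SUCCEEDED', 'FAILED'):
            return status
        if time.time() >= deadline:
            return status  # task has not reached a terminal state yet
        time.sleep(polling_interval)


class FakeTransferClient(object):
    """Stand-in that replays scripted statuses (illustration only)."""

    def __init__(self, statuses):
        self._statuses = list(statuses)

    def get_task(self, task_id):
        # Return each scripted status once, then keep repeating the last one
        if len(self._statuses) > 1:
            return {'status': self._statuses.pop(0)}
        return {'status': self._statuses[0]}


fake = FakeTransferClient(['ACTIVE', 'ACTIVE', 'FAILED'])
final_status = wait_for_task(
    fake, 'hypothetical-task-id', timeout=5, polling_interval=0)
print(final_status)
```

With the real client you would pass `transfer` and `task_description['task_id']`, and list the directory only when the returned status is `SUCCEEDED`.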

Now that the data is placed on a shared endpoint, and in a unique publication directory, we can share the published data with individuals or groups of users. Below we share the published data with the "Tutorial Users Group" so that other tutorial participants will be able to view and download your published data. 

In [None]:
# this is a rule which
# - allows Read access, permissions="r"
# - on the directory we generated above, share_path
# - for the Tutorial Users Group, access_group
rule_data = {
    'DATA_TYPE': 'access',
    'principal_type': 'group', 
    'principal': access_group,
    'path': share_path,
    'permissions': 'r'
}

result = transfer.add_endpoint_acl_rule(publication_endpoint, rule_data)
print(result['message'])
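
If you plan to share several directories, it can help to build rule documents with a small function that catches obvious mistakes before the request is sent. The sketch below is a hypothetical helper, not part of the Globus SDK. Its checks (permissions limited to `r` or `rw`, and a path that starts and ends with `/`) are assumptions about what a directory rule should look like.

```python
def make_access_rule(principal_type, principal, path, permissions='r'):
    """Build an access rule document shaped like the one in the cell above."""
    if permissions not in ('r', 'rw'):
        raise ValueError("permissions must be 'r' or 'rw'")
    if not (path.startswith('/') and path.endswith('/')):
        raise ValueError("path should be a directory path such as '/data/'")
    return {
        'DATA_TYPE': 'access',
        'principal_type': principal_type,
        'principal': principal,
        'path': path,
        'permissions': permissions,
    }


# '/example-dir/' is a placeholder path used only for illustration
example_rule = make_access_rule(
    'group', '50b6a29c-63ac-11e4-8062-22000ab68755', '/example-dir/')
print(example_rule['permissions'])
```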

# 3. Create Metadata to describe our dataset

We will define simple metadata which describes our dataset. This metadata will be used for registering the identifier and also for loading into our search index to enable discovery of the published dataset.

You should update the metadata below to reflect your publication. Add your name as a contributor and update the title, date, and keywords. 

In [2]:
metadata = {
    'title': 'My GlobusWorld Tour Publication',
    'contributors': ['John Smith', 'Jane Doe', 'Zaphod Beeblebrox'],
    'date': '2018-12-12',
    'keywords': ['Hitchhiker', 'Blanket', 'Panic']
}
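
Because this metadata is reused for both the identifier and the search index, a quick sanity check before moving on can save a round trip. The sketch below is one possible check. The list of required fields and the `YYYY-MM-DD` date format are assumptions taken from the example dictionary above, not requirements imposed by Globus.

```python
from datetime import datetime

# Fields used by the example metadata in this tutorial (an assumption,
# not a schema enforced by any Globus service)
REQUIRED_FIELDS = ('title', 'contributors', 'date', 'keywords')


def check_metadata(candidate):
    """Return a list of human-readable problems; empty means it looks fine."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not candidate.get(field):
            problems.append('missing or empty field: %s' % field)
    date_value = candidate.get('date')
    if date_value:
        try:
            datetime.strptime(date_value, '%Y-%m-%d')
        except ValueError:
            problems.append('date is not in YYYY-MM-DD form: %s' % date_value)
    return problems


good = {
    'title': 'Example dataset',
    'contributors': ['Jane Doe'],
    'date': '2018-12-12',
    'keywords': ['example'],
}
bad = {'contributors': ['Jane Doe'], 'date': '12/12/2018', 'keywords': ['x']}

print(check_metadata(good))
print(len(check_metadata(bad)))
```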

# 4. Associate an Identifier

Next we associate a persistent and unambiguous identifier with the dataset. This allows others to refer to a permanent name rather than a potentially volatile storage location reference. 

The Globus Identifier service allows users to create identifiers within user-managed namespaces. Namespaces abstract use of an external persistent identifier (PID) provider and a valid account (or shoulder) within that provider. 

When minting an identifier the following information must be provided:
* One or more locations to access the data, such as a URL representing a particular path on a Globus endpoint
* Metadata describing a mixture of publication-specific attributes (e.g., creator, checksum) and optionally extensible, user-defined attributes
* Access policies governing which users can access the identifier

First, we'll introspect the namespace to confirm it is the correct namespace for our publication. The result of introspection includes the administrators of the namespace, the creators who are able to mint identifiers, and various metadata fields that describe the namespace (e.g., name and description).

In [None]:
identifier_namespace = identifiers.get_namespace(identifiers_namespace)

print("Name: %s" % identifier_namespace.data['display_name'])
print("Description: %s" % identifier_namespace.data['description'])
print("Provider: %s" % identifier_namespace.data['provider_type'])

Now, we will create an identifier in the namespace. We set the location for the data to a Globus URI that includes the endpoint and folder where we stored the data. We also associate basic metadata about the dataset. Finally, we will set the identifier to be visible to all users ('public').

The Identifier service returns a JSON description of the identifier, including the metadata we defined above and the newly minted identifier. In this case our namespace is configured to create an ARK using the test shoulder. 

We also add the newly minted identifier to the dataset's metadata so that we can load it into our search index. 

In [1]:
# define a location for accessing the data
dataset_location = "globus://%s%s" % (publication_endpoint, share_path)
visible_to = ['public']
dataset_identifier = identifiers.create_identifier(
    namespace=identifiers_namespace,
    location=[dataset_location],
    metadata={
        'title': metadata['title'],
        'date': metadata['date'],
        'contributors': metadata['contributors']
    },
    visible_to=visible_to)

metadata['identifier'] = dataset_identifier.data['identifier']

print("Identifier %s" % dataset_identifier.data['identifier'])
print("location %s" % dataset_identifier.data['location'])
print("Metadata %s" % dataset_identifier.data['metadata'])

Now that we have minted the identifier, we can resolve it to view its metadata and retrieve a link to the data. For this purpose we use an online resolver: the Name-to-Thing resolver (n2t.net).

Note: registration takes a few moments to propagate. If the identifier doesn't resolve, please wait a few seconds and try again.

In [None]:
print('https://n2t.net/{}'.format(metadata['identifier']))

# 5. Index descriptive metadata

In this stage of the flow we index the metadata that describes our published dataset. For this purpose we use Globus Search, a flexible, schema-agnostic search platform with fine-grained access control on data and metadata. Globus Search provides powerful free-text search capabilities through which others can discover our published dataset.

Globus Search supports user-managed indexes in which an administrator may create an index and define policies regarding its use, including who can manage the index, ingest metadata, and query the index.

Complete documentation for using Globus Search is available at https://docs.globus.org/api/search/.

We have created an index for this tutorial. You can use the Globus SDK to retrieve information about the index as follows:

In [None]:
tutorial_index = search.get_index(index_id=search_index)
print(tutorial_index['display_name'])
print(tutorial_index['description'])

## Indexing Data

Globus Search supports scalable indexing of arbitrary entries into a selected index. An entry is comprised of three types of information:
1. A subject, which represents a name or target for the entry (e.g., a URL for a Globus-accessible file or directory)
1. Arbitrary metadata represented as a collection of attributes in nested JSON structure
1. A visibility policy that defines which users or groups are able to view and query the subject and its metadata

To index metadata we construct a JSON object that includes this information, and use the `ingest` function to add it to the index:

In [None]:
subject = "globus://%s%s%s" % (publication_endpoint, share_path, share_file)
ingest_data = {
    "ingest_type": "GMetaEntry",
    "ingest_data": {
        "subject": subject,
        "visible_to": ["public"],
        "content": metadata
    }
}
result = search.ingest(search_index, ingest_data)
print("Documents indexed: %s" % result['num_documents_ingested'])
print("Subject: %s" % subject)
print(metadata)
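
The ingest document has the same outer shape every time, so it can be convenient to wrap its construction in a function. The sketch below builds the same structure as the cell above without contacting the service. The helper name `make_gmeta_entry` and the sample subject are hypothetical.

```python
import json


def make_gmeta_entry(subject, content, visible_to=('public',)):
    """Wrap a subject, its metadata, and a visibility policy for ingest."""
    return {
        'ingest_type': 'GMetaEntry',
        'ingest_data': {
            'subject': subject,
            'visible_to': list(visible_to),
            'content': content,
        },
    }


entry = make_gmeta_entry(
    'globus://e56c36e4-1063-11e6-a747-22000bf2d559/example-dir/file1.txt',
    {'title': 'Example dataset', 'keywords': ['example']})
print(json.dumps(entry, indent=2, sort_keys=True))
```

The returned dictionary can be passed to `search.ingest` in place of the hand-written `ingest_data` document.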

## Search

Globus Search implements a flexible query model that supports two types of queries: simple, free-text queries and complex, structured queries.

Simple queries perform basic sub-string matching against any metadata fields that are visible to the querying user.
As with web search, the results of a simple search are ordered based on the computed "best match" for the query. 

A simple query is as easy as passing a string to the `search` function.  The results are an ordered list of result objects. 

Update the following free text query to discover your dataset. 

In [None]:
query='john'

search_results = search.search(index_id=search_index, q=query)

print("Count: %s" % search_results['count'])
for i in search_results['gmeta']:
    print("Subject: %s" % i['subject'])
    print("Content: %s" % json.dumps(i['content']))

Globus Search also supports an advanced query mode in which more precise queries can be expressed: for example, queries that search specific attributes, range expressions, exact matches, and so forth.

First we search for your published dataset using the minted identifier; we then query for all publications with a specific contributor.

In [None]:
search_results = search.search(search_index, q='identifier: "%s"' % metadata['identifier'], advanced=True)

print("Count: %s" % search_results['count'])
for i in search_results['gmeta']:
    print("Subject: %s" % i['subject'])
    print("Content: %s" % i['content'])  

In [None]:
search_results = search.search(search_index, 'contributors: "John Smith"', advanced=True)

print("Count: %s" % search_results['count'])
for i in search_results['gmeta']:
    print("Subject: %s" % i['subject'])
    print("Content: %s" % json.dumps(i['content']))
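
Both advanced queries above follow the pattern `field: "value"`. When the value comes from user input, building the string in one place makes it easier to handle embedded quotes. The sketch below is a hypothetical helper. The backslash escaping it applies is an assumption about the query syntax and should be checked against the Globus Search documentation before you rely on it.

```python
def field_query(field, value):
    """Build an advanced query string of the form: field: "value"."""
    # Assumed escaping: protect backslashes first, then double quotes
    escaped = value.replace('\\', '\\\\').replace('"', '\\"')
    return '%s: "%s"' % (field, escaped)


contributor_query = field_query('contributors', 'John Smith')
print(contributor_query)
```

The result can be passed as the `q` argument together with `advanced=True`, as in the cells above.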

## Complex queries

Complex queries take the form of a structured JSON document, and are more commonly used when the query is created programmatically. They may reference specific metadata fields, and may apply criteria such as value ranges, wildcards, and regular expressions.

For example, to conduct the same free-text search as above, but to limit results to publications dated between 2000 and 2020, we can add a filter to the query.

Note: We use the Globus SDK SearchQuery to construct complex queries. We also show the resulting JSON query object used to execute the query. 

In [None]:
structured_query = (globus_sdk.SearchQuery(q=query)
                    .add_filter('date', [{'from': 2000, 'to': 2020}], type='range'))
search_results = search.post_search(search_index, structured_query)

print("Structured Query Object: %s\n" % json.dumps(structured_query))
print("Count: %s" % search_results['count'])
for i in search_results['gmeta']:
    print("Subject: %s" % i['subject'])
    print("Content: %s\n" % json.dumps(i['content']))
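
To make the shape of a structured query more concrete, the sketch below builds a comparable document as a plain dictionary. The key names (`filters`, `type`, `field_name`, `values`) reflect our understanding of the Globus Search query format and are an assumption here. Compare them with the "Structured Query Object" printed by the cell above, which is the authoritative form. The helper `range_filtered_query` is hypothetical.

```python
import json


def range_filtered_query(q, field, start, end):
    """Return a query document with one range filter on `field`."""
    return {
        'q': q,
        'filters': [{
            'type': 'range',
            'field_name': field,
            'values': [{'from': start, 'to': end}],
        }],
    }


example_query = range_filtered_query('john', 'date', 2000, 2020)
print(json.dumps(example_query, sort_keys=True))
```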

Complex queries may also specify facets, a method for generating categories and associated frequencies for particular metadata fields. For example, here is a query to produce keyword facets:

In [None]:
structured_query = (globus_sdk.SearchQuery(q='*').add_facet('Publication Keywords', 'keywords'))
search_results = search.post_search(search_index, structured_query)

print("Structured Query Object: %s\n" % json.dumps(structured_query))

print("Results\nCount: %s" % search_results['count'])
#for i in search_results['gmeta']:
#    print("Subject: %s" % i['subject'])
#    print("Content: %s" % json.dumps(i['content']))

print("\nFacets")
for i in search_results['facet_results']:
    for j in i['buckets']:
        print("%s (%s)" % (j['value'], j['count']))
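
Facet results are nested (facets contain buckets), so flattening them into a dictionary can make them easier to reuse, for example when rendering a sidebar of keyword counts. The sketch below works on a made-up response whose `buckets`, `value`, and `count` keys match those read by the cell above. The `name` key on each facet and the helper `facet_counts` are assumptions for illustration.

```python
def facet_counts(search_response):
    """Map each facet name to a {bucket value: count} dictionary."""
    counts = {}
    for facet in search_response.get('facet_results', []):
        name = facet.get('name')  # assumed key; None if absent
        counts[name] = {
            bucket['value']: bucket['count'] for bucket in facet['buckets']
        }
    return counts


# Made-up response fragment for illustration only
sample_response = {
    'facet_results': [{
        'name': 'Publication Keywords',
        'buckets': [
            {'value': 'Panic', 'count': 3},
            {'value': 'Blanket', 'count': 1},
        ],
    }]
}

keyword_counts = facet_counts(sample_response)
print(keyword_counts['Publication Keywords'])
```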

# Advanced indexing

One of the benefits of the Globus Search model is that you can associate visibility policies with records and metadata. Here we demonstrate how you can add a new metadata entry to a record and make it visible only to a particular group of users. 

Update the metadata added below, and confirm that the queries now show the updated metadata. Note: When querying over these entities the results will collapse metadata for the same root subject. 

In [None]:
ingest_data = {
    "ingest_type": "GMetaEntry",
    "ingest_data": {
        "subject": "globus://%s%s%s" % (publication_endpoint, share_path, share_file),
        "id": "rating",
        "visible_to": ['urn:globus:groups:id:%s' % access_group],
        "content": {
            "rating": "good",
        }
    }
}
result = search.ingest(search_index, ingest_data)
print("Documents indexed: %s" % result['num_documents_ingested'])

search_results = search.search(search_index, q='identifier: "%s"' % metadata['identifier'], advanced=True)

print("Count: %s" % search_results['count'])
for i in search_results['gmeta']:
    print("Subject: %s" % i['subject'])
    print("Content: %s" % i['content'])                                   
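
The `visible_to` value in the cell above identifies the group with a URN of the form `urn:globus:groups:id:<group UUID>`. A malformed group ID would produce a principal that matches no group, so it can be worth validating the ID when building the URN. The helper below is a hypothetical sketch of that check.

```python
import uuid


def group_principal(group_id):
    """Return the 'urn:globus:groups:id:<uuid>' form used in visible_to."""
    # uuid.UUID raises ValueError if group_id is not a well-formed UUID
    return 'urn:globus:groups:id:%s' % uuid.UUID(group_id)


tutorial_group = group_principal('50b6a29c-63ac-11e4-8062-22000ab68755')
print(tutorial_group)
```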