# Globus data publication platform tutorial

In this tutorial we demonstrate how the Globus platform can be used to build flexible publication pipelines for publishing arbitrary data with configurable access control, descriptive metadata, and persistent identifiers.

We will walk through an example data publication workflow as follows. First, we move data to a remote endpoint, make that data immutable, and share it with those who are permitted to access it. We then use the Globus Identifier service to mint a persistent identifier for the data, and index descriptive metadata in Globus Search so that the data is discoverable by other users. Finally, we demonstrate an example web portal for discovering and accessing the published datasets.

The basic tutorial flow is illustrated below.  

![title](publication-flow.png)

In [None]:
from __future__ import print_function  # for python 2
import globus_sdk
from identifier_api import IdentifierClient
import json

# Globus Endpoint for storing publication (Petrel#Testbed)
publication_endpoint = "e56c36e4-1063-11e6-a747-22000bf2d559"

# Globus Group which can view publications
access_group = "50b6a29c-63ac-11e4-8062-22000ab68755"

# HTTPS URL for the publication endpoint
http_base_url = "https://testbed.petrel.host/"

# Search index ID to store metadata
search_index = "3e117028-2513-4f5b-b53c-90fda3cd328b"

# ID of namespace where we create identifiers
identifiers_namespace = "2lxaZcq_7D4j"

# ID of this tutorial notebook as a Globus App
CLIENT_ID = 'd61ed2e0-b4f9-4fe9-9433-41e2528a807d'

# python2/3 safe simple input reading
get_input = getattr(__builtins__, 'raw_input', input)

# 1) Authentication

Before implementing the publication workflow we must first authenticate with Globus and request access tokens for the transfer, search, and identifier services.

We follow a standard OAuth 2 authentication flow for native applications. The first step is to create a unique link through which a user can authenticate. We then capture the resulting authorization code as input in the notebook.

In [None]:
# create a client which is responsible for managing interactions with the Globus Auth service
# it manages the entire OAuth2 login flow
native_auth_client = globus_sdk.NativeAppAuthClient(CLIENT_ID)

# start a flow with a specific set of requested scopes -- levels of access to Globus Apps & Services
# When you login, you will be prompted to accept this App's access to these services
transfer_scope = 'urn:globus:auth:scope:transfer.api.globus.org:all'
search_scope = 'urn:globus:auth:scope:search.api.globus.org:all'
identifiers_scope = 'https://auth.globus.org/scopes/identifiers.globus.org/create_update'
native_auth_client.oauth2_start_flow(
    requested_scopes=[
        transfer_scope,
        search_scope,
        identifiers_scope
    ]
)

# Login link for an authorization code
# this is like a one-time password used to fetch longer-lasting credentials (tokens)
print("Login Here:\n\n{0}".format(native_auth_client.oauth2_get_authorize_url()))
print(("\n\nNote that this link can only be used once! "
       "If login or a later step in the flow fails, you must restart it."))

# fill this line in with the code that you got
auth_code = get_input("Enter resulting code:")

# and exchange it for a response object containing your token(s)
# we'll use this "tokens" object in later steps
tokens = native_auth_client.oauth2_exchange_code_for_tokens(auth_code)

In the second phase of the OAuth2 flow we use the authorization code to obtain an access token for each service, and then create a Python client for each service using the Globus SDK.

In [None]:
# extract the Access Tokens for Globus Transfer, Globus Search, and Globus Identifiers
#
# for full detail, see the SDK documentation:
# http://globus-sdk-python.readthedocs.io/en/stable/responses/auth/#globus_sdk.auth.token_response.OAuthTokenResponse
transfer_tokdata = tokens.by_scopes[transfer_scope]
search_tokdata = tokens.by_scopes[search_scope]
identifiers_tokdata = tokens.by_scopes[identifiers_scope]

# pull out Access Tokens -- 48-hour credentials which can be used to access the services
transfer_token = transfer_tokdata['access_token']
search_token = search_tokdata['access_token']
identifiers_token = identifiers_tokdata['access_token']

# to pass tokens to clients, wrap them in GlobusAuthorizers
# these are generic objects which support multiple authentication methods -- Access Tokens are just one
# and pass the results to client objects
transfer = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token))
search = globus_sdk.SearchClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(search_token))
identifiers = IdentifierClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(identifiers_token))
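Access tokens expire, so a long-running notebook session may eventually need a fresh login. The following is a minimal sketch of how one might check the remaining lifetime of a token. It assumes each token entry carries an `expires_at_seconds` field (a POSIX timestamp); the helper `token_is_fresh` and the example data are hypothetical and not part of the Globus SDK.

```python
import time


def token_is_fresh(tokdata, margin_seconds=60, now=None):
    """Return True if the token has more than `margin_seconds` left before expiry."""
    now = time.time() if now is None else now
    return tokdata['expires_at_seconds'] - now > margin_seconds


# hypothetical token data, shaped like one entry of tokens.by_scopes
example_tokdata = {
    'access_token': 'PLACEHOLDER',
    'expires_at_seconds': 1000000,
}

# one hour before expiry the token is still usable; 30 seconds before it is not
fresh = token_is_fresh(example_tokdata, now=1000000 - 3600)
stale = token_is_fresh(example_tokdata, now=1000000 - 30)
print(fresh, stale)
```

In the notebook, one could call `token_is_fresh(transfer_tokdata)` before submitting a long transfer and restart the login flow if it returns `False`.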

# 2) Assemble a dataset

In the first stage of the publication workflow we move the data to be published to a location that is immutable, accessible to those who might wish to access the data, and able to scale to the required data size. For this purpose we use a Globus shared endpoint, as it allows us to dynamically manage access to data. 

To isolate users' publications from each other we create a unique directory for our publication on our shared endpoint. To avoid conflicts we will name the directory using a UUID.

Note: to follow these instructions you will need to make sure you are in the [Tutorial Users Group](https://www.globus.org/app/groups/50b6a29c-63ac-11e4-8062-22000ab68755).

In [None]:
import uuid

# create a unique id for the directory name
share_path = '/' + str(uuid.uuid4()) + '/'
r = transfer.operation_mkdir(publication_endpoint, path=share_path)

print("Publication path: %s" % share_path)
print("https://www.globus.org/app/transfer?origin_id=%s&origin_path=%s" % (publication_endpoint, share_path))

Having created the directory we now need to populate it with our dataset. For simplicity, we will move sample Globus data from the "Globus Tutorial Endpoint." You are welcome to use any data you like; just update `source_endpoint` and `source_directory`.

In [None]:
# define the source endpoint and directory to be copied for publication
# (Globus Tutorial Endpoint 1):/share/godata/
source_endpoint = 'ddb59aef-6d04-11e5-ba46-22000b92c6ec'
source_directory= '/share/godata/'

# TransferData is a helper class for building good Transfer Task documents for the Globus Transfer Service
tdata = globus_sdk.TransferData(
    transfer, source_endpoint, publication_endpoint,
    label='Tutorial copy data', sync_level='checksum')

# you can add multiple files and directories to transfer -- for our case, just add one
tdata.add_item(source_directory, share_path, recursive=True)

# once it's built, submit the transfer and get a task document to describe it
task_description = transfer.submit_transfer(tdata)

We now wait for the transfer to complete using the Globus SDK `task_wait` function. To confirm that the data was transferred correctly we perform an `ls` operation on the shared endpoint. Note: in this example we also record the last file name in the publication directory so that we can associate metadata with it later in the tutorial. 

In [None]:
# NOTE: It's technically possible for the task to terminate with a failure. This code does not handle this condition.

# wait up to 100s, checking every 1s
completed = transfer.task_wait(
    task_description['task_id'], timeout=100, polling_interval=1)

share_file = None

if not completed:
    print('Transfer still not completed!')
else:
    for f in transfer.operation_ls(publication_endpoint, path=share_path):
        print(f['name'])
        share_file = f['name']
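As the note in the cell above says, a task can also terminate with a failure. The following sketch shows one possible way to distinguish success, failure, and timeout by polling a status value. It is written against a generic `get_status` callable so that it runs without network access; in the notebook one might pass `lambda: transfer.get_task(task_description['task_id'])['status']`, assuming the task document exposes a `status` field with values such as `ACTIVE`, `SUCCEEDED`, and `FAILED`. The helper `wait_for_task` is hypothetical.

```python
import time


def wait_for_task(get_status, timeout=100, polling_interval=1, sleep=time.sleep):
    """Poll until the task succeeds, fails, or the timeout elapses.

    Returns the last status seen.
    """
    waited = 0
    status = get_status()
    while status not in ('SUCCEEDED', 'FAILED') and waited < timeout:
        sleep(polling_interval)
        waited += polling_interval
        status = get_status()
    return status


# stand-in for the Transfer service: reports ACTIVE twice, then FAILED
fake_statuses = iter(['ACTIVE', 'ACTIVE', 'FAILED'])
final_status = wait_for_task(lambda: next(fake_statuses), sleep=lambda s: None)
print(final_status)
```

A caller can then branch on the returned status and only list the directory when it is `SUCCEEDED`.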

Now that the data is placed on a shared endpoint, and in a unique publication directory, we can share the published data with individuals or groups of users. Below we share the published data with the "Tutorial Users Group" so that other tutorial participants will be able to view and download your published data. 

In [None]:
# this is a rule which
# - allows Read access, permissions="r"
# - on the directory we generated above, share_path
# - for the Tutorial Users Group, access_group
rule_data = {
    'DATA_TYPE': 'access',
    'principal_type': 'group', 
    'principal': access_group,
    'path': share_path,
    'permissions': 'r'
}

result = transfer.add_endpoint_acl_rule(publication_endpoint, rule_data)
print(result['message'])
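If you create several access rules, it can help to build the rule document with a small function that catches common mistakes before the request is sent. The sketch below is one way to do that. It assumes that rule paths must be directories that begin and end with `/` and that permissions are either `r` or `rw`; the helper `make_acl_rule` is hypothetical.

```python
def make_acl_rule(principal_type, principal, path, permissions='r'):
    """Build an access rule document, rejecting obviously malformed input."""
    if permissions not in ('r', 'rw'):
        raise ValueError("permissions must be 'r' or 'rw'")
    if not (path.startswith('/') and path.endswith('/')):
        raise ValueError("path must be a directory that starts and ends with '/'")
    return {
        'DATA_TYPE': 'access',
        'principal_type': principal_type,
        'principal': principal,
        'path': path,
        'permissions': permissions,
    }


# same rule as above, using the tutorial group ID and a placeholder path
example_rule = make_acl_rule(
    'group', '50b6a29c-63ac-11e4-8062-22000ab68755', '/example-publication/')
print(example_rule)
```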

# 3) Create Metadata to describe our dataset

We will define simple metadata which describes our dataset. This metadata will be used for registering the identifier and also for loading into our search index to enable discovery of the published dataset.

You should update the metadata below to reflect your publication. Add your name as a contributor and update the title, date, and keywords. 

In [None]:
metadata = {
    'title': 'My Publication for GW18',
    'contributors': ['John Smith', 'Jane Doe', 'Zaphod Beeblebrox'],
    'date': '2018-12-12',
    'keywords': ['Hitchhiker', 'Blanket', 'Panic']
}
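Because the same metadata is reused for the identifier and for the search index, a quick sanity check before continuing can save a failed step later. The sketch below checks for the four fields used in this tutorial and for an ISO-style date. The list of required fields is our own assumption for this example (Globus Search itself is schema-agnostic), and `validate_metadata` is a hypothetical helper.

```python
from datetime import datetime

# fields used by this tutorial; not a requirement of any Globus service
REQUIRED_FIELDS = ('title', 'contributors', 'date', 'keywords')


def validate_metadata(md):
    """Return a list of problems found; an empty list means the metadata looks fine."""
    problems = ['missing field: %s' % f for f in REQUIRED_FIELDS if f not in md]
    if 'date' in md:
        try:
            datetime.strptime(md['date'], '%Y-%m-%d')
        except ValueError:
            problems.append('date is not in YYYY-MM-DD format')
    return problems


good = {
    'title': 'My Publication for GW18',
    'contributors': ['John Smith'],
    'date': '2018-12-12',
    'keywords': ['Panic'],
}
bad = {'title': 'Untitled', 'date': '12/12/2018'}

good_problems = validate_metadata(good)
bad_problems = validate_metadata(bad)
print(good_problems)
print(bad_problems)
```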

# 4) Associate an Identifier

The next stage of our workflow is to associate a persistent and unambiguous identifier with the dataset. This is advantageous as it allows others to refer to a permanent name rather than a potentially volatile reference to a storage location. 

The Globus Identifier service allows users to create identifiers within user-managed namespaces. Namespaces abstract use of an external persistent identifier (PID) provider and a valid account (or shoulder) within that provider. 

When minting an identifier in a namespace the following information must be provided: 1) one or more locations to access the data such as a URL representing a particular path on a Globus endpoint; 2) metadata describing a mixture of publication-specific attributes (e.g., creator, checksum) and optionally extensible, user-defined attributes; 3) access policies governing which users can access the identifier. 

First, we'll introspect the namespace to confirm it is the correct namespace for our publication. The result of introspection includes the administrators of the namespace, creators who are able to mint identifiers as well as various metadata fields that describe the namespace (e.g., name and description). 

In [None]:
identifier_namespace = identifiers.get_namespace(identifiers_namespace)

print("Name: %s" % identifier_namespace.data['display_name'])
print("Description: %s" % identifier_namespace.data['description'])
print("Provider: %s" % identifier_namespace.data['provider_type'])

Now, we will create an identifier in the namespace. We set the location for the data to a Globus URI that includes the endpoint and folder where we stored the data. We also associate basic metadata about the dataset. Finally, we will set the identifier to be visible to all users ('public').

The Identifier service returns a JSON description of the identifier including the metadata we defined above and the newly minted identifier.  In this case our namespace is configured to create an ARK using the test shoulder. 

We also add the newly minted identifier to the dataset's metadata so that we can load it into our search index. 

In [None]:
# define a location for accessing the data
dataset_location = "globus://%s%s" % (publication_endpoint, share_path)
visible_to = ['public']
dataset_identifier = identifiers.create_identifier(
    namespace=identifiers_namespace,
    location=[dataset_location],
    properties={
        'title': metadata['title'],
        'date': metadata['date'],
        'contributors': metadata['contributors']
    },
    visible_to=visible_to)

metadata['identifier'] = dataset_identifier.data['identifier']

print("Identifier %s" % dataset_identifier.data['identifier'])
print("location %s" % dataset_identifier.data['location'])
print("Metadata %s" % dataset_identifier.data['properties'])

Now that we have minted the identifier, we can resolve it to view its metadata and retrieve a link to the data. For this purpose we use an online resolver: the Name-to-Thing resolver (n2t.net). 

Note: registration takes a few moments to propagate. If the identifier doesn't resolve, please wait a few seconds and try again.

In [None]:
print('https://n2t.net/{}'.format(metadata['identifier']))

# 5) Index descriptive metadata

In the third stage of the workflow we index the metadata that describes our published dataset. For this purpose we use Globus Search: a flexible, schema-agnostic search platform with fine-grained access control on published records and metadata. Globus Search provides powerful free-text search capabilities through which others can discover our published dataset.

Globus Search supports user-managed indexes: an administrator may create an index and define policies regarding its use, including who can manage the index, ingest metadata, and query the index. 

Complete documentation for using Globus Search is available online: https://docs.globus.org/api/search/

We have created an index for this tutorial. You can use the Globus SDK to retrieve information about the index as follows:

In [None]:
tutorial_index = search.get_index(index_id=search_index)
print(tutorial_index['display_name'])
print(tutorial_index['description'])

## Indexing Data

Globus Search supports scalable indexing of arbitrary entries into a selected index. An entry comprises three types of information: 1) a subject, which represents a name or target for the entry (e.g., a URL for a Globus-accessible file or directory); 2) arbitrary metadata represented as a collection of attributes in a nested JSON structure; and 3) a visibility policy that defines which users or groups are able to view and query the subject and its metadata.

To index metadata we construct a JSON object that includes this information, as follows, and use the `ingest` function to add it to the index:

In [None]:
subject =  "globus://%s%s%s" % (publication_endpoint, share_path, share_file)
ingest_data = {
    "ingest_type": "GMetaEntry",
    "ingest_data": {
        "subject": subject,
        "visible_to": ["public"],
        "content": metadata
    }
}
result = search.ingest(search_index, ingest_data)
print("Documents indexed: %s" % result['num_documents_ingested'])
print("Subject: %s" % subject)
print(metadata)
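The cell above ingests a single entry. When publishing many files at once, it may be more convenient to send several entries in one request. The sketch below builds such a document, assuming the `GMetaList` ingest type, which wraps a list of entries under a `gmeta` key. The helpers `make_entry` and `make_gmeta_list` and the example subjects are hypothetical.

```python
def make_entry(subject, content, visible_to=('public',)):
    """Build one entry: a subject, its metadata, and who may see it."""
    return {
        'subject': subject,
        'visible_to': list(visible_to),
        'content': content,
    }


def make_gmeta_list(entries):
    """Wrap several entries in a single ingest document."""
    return {
        'ingest_type': 'GMetaList',
        'ingest_data': {'gmeta': list(entries)},
    }


# placeholder subjects; in the notebook these would be globus:// URLs for real files
example_ingest = make_gmeta_list([
    make_entry('globus://example-endpoint/pub/file1.txt', {'title': 'File 1'}),
    make_entry('globus://example-endpoint/pub/file2.txt', {'title': 'File 2'}),
])
print(len(example_ingest['ingest_data']['gmeta']))
```

The resulting document could then be passed to `search.ingest(search_index, example_ingest)` in the same way as the single-entry document.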

## Search

Globus Search implements a flexible query model that supports two types of queries: simple, free-text queries and complex, structured queries.

Simple queries perform basic sub-string matching against any metadata fields that are visible to the querying user.
As with web search, the results of a simple search are ordered based on the computed "best match" for the query. 

A simple query is as easy as passing a string to the `search` function.  The results are an ordered list of result objects. 

Update the following free text query to discover your dataset. 

In [None]:
query='john'

search_results = search.search(index_id=search_index, q=query)

print("Count: %s" % search_results['count'])
for i in search_results['gmeta']:
    print("Subject: %s" % i['subject'])
    print("Content: %s" % json.dumps(i['content']))

Globus Search also supports an advanced query mode in which more precise queries can be expressed, for example queries that target specific attributes, use range expressions, or require exact matches.

First we search for your published dataset using the minted identifier; we then query for all publications with a specific contributor. 

In [None]:
search_results = search.search(search_index, q='identifier: "%s"' % metadata['identifier'], advanced=True)

print("Count: %s" % search_results['count'])
for i in search_results['gmeta']:
    print("Subject: %s" % i['subject'])
    print("Content: %s" % i['content'])  

In [None]:
search_results = search.search(search_index, 'contributors: "John Smith"', advanced=True)

print("Count: %s" % search_results['count'])
for i in search_results['gmeta']:
    print("Subject: %s" % i['subject'])
    print("Content: %s" % json.dumps(i['content']))

## Complex queries

Complex queries, in contrast, take the form of a structured JSON document, and are more commonly used when queries are created programmatically. They may reference specific metadata fields, and may apply criteria such as value ranges, wildcards, and regular expressions. 

For example, to conduct the same free-text search as above but limit the results to publications from 2000 to 2020, we can add a range filter to the query.

Note: we use the Globus SDK SearchQuery to construct complex queries.  We also show the resulting JSON query object used to execute the query. 

In [None]:
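# NOTE: this filter assumes indexed records carry a numeric `year` field;
# the example metadata defined earlier in this notebook uses `date` instead
structured_query = (globus_sdk.SearchQuery(q=query)
                    .add_filter('year', [{'from': 2000, 'to': 2020}], type='range'))
# store the response in `search_results` so the loop below prints this query's results
search_results = search.post_search(search_index, structured_query)

print("Structured Query Object: %s\n" % json.dumps(structured_query))
print("Count: %s" % search_results['count'])
for i in search_results['gmeta']:
    print("Subject: %s" % i['subject'])
    print("Content: %s\n" % json.dumps(i['content']))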
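To make the shape of a structured query concrete, the sketch below builds an equivalent query document by hand as a plain dictionary. The key names (`filters`, `field_name`, `values`) reflect our understanding of the document that `SearchQuery` produces and should be checked against the Globus Search documentation; the helper `range_filter_query` is hypothetical.

```python
import json


def range_filter_query(q, field_name, low, high):
    """Build a free-text query restricted by a range filter on one field."""
    return {
        'q': q,
        'filters': [{
            'type': 'range',
            'field_name': field_name,
            'values': [{'from': low, 'to': high}],
        }],
    }


example_query = range_filter_query('john', 'year', 2000, 2020)
print(json.dumps(example_query, indent=2))
```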

Complex queries may also specify facets---a method for generating categories and associated frequencies for particular metadata fields.  For example, here is a query to produce keyword facets:

In [None]:
structured_query = (globus_sdk.SearchQuery(q='*').add_facet('Publication Keywords', 'keywords'))
search_results = search.post_search(search_index, structured_query)

print("Structured Query Object: %s\n" % json.dumps(structured_query))

print("Results\nCount: %s" % search_results['count'])
#for i in search_results['gmeta']:
#    print("Subject: %s" % i['subject'])
#    print("Content: %s" % json.dumps(i['content']))

print("\nFacets")
for i in search_results['facet_results']:
    for j in i['buckets']:
        print("%s (%s)" % (j['value'], j['count']))
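For use in a portal or report, it is often handier to turn the facet results into a lookup table than to print them. The sketch below does this for a response shaped like the one iterated over above (a list of facets, each with a `name` and a list of `buckets` holding `value` and `count`). The example data is made up and `summarize_facets` is a hypothetical helper.

```python
def summarize_facets(facet_results):
    """Map each facet name to a {value: count} dictionary."""
    return {
        facet['name']: {b['value']: b['count'] for b in facet['buckets']}
        for facet in facet_results
    }


# made-up response fragment, shaped like search_results['facet_results']
example_facet_results = [{
    'name': 'Publication Keywords',
    'buckets': [
        {'value': 'Panic', 'count': 3},
        {'value': 'Blanket', 'count': 1},
    ],
}]

summary = summarize_facets(example_facet_results)
print(summary)
```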

# Advanced indexing

One of the benefits of the Globus Search model is that you can associate visibility policies with records and metadata. Here we demonstrate how you can add a new metadata entry to a record and make it visible only to a particular group of users. 

Update the metadata added below and confirm that the queries now show the updated metadata. Note: when querying over these entities the results will collapse metadata for the same root subject. 

In [None]:
ingest_data = {
    "ingest_type": "GMetaEntry",
    "ingest_data": {
        "subject": "globus://%s%s%s" % (publication_endpoint, share_path, share_file),
        "id": "rating",
        "visible_to": ['urn:globus:groups:id:%s' % access_group],
        "content": {
            "rating": "good",
        }
    }
}
result = search.ingest(search_index, ingest_data)
print("Documents indexed: %s" % result['num_documents_ingested'])

search_results = search.search(search_index, q='identifier: "%s"' % metadata['identifier'], advanced=True)

print("Count: %s" % search_results['count'])
for i in search_results['gmeta']:
    print("Subject: %s" % i['subject'])
    print("Content: %s" % i['content'])                                   
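The `visible_to` list accepts the literal string `public` as well as principal URNs such as the group URN used above. The sketch below collects these formats in one place. The group format is taken from the cell above; the identity format `urn:globus:auth:identity:<id>` is stated here as an assumption. The helper `visibility_principal` and the identity ID are hypothetical.

```python
def visibility_principal(kind, identifier=None):
    """Return a value suitable for a visible_to list."""
    if kind == 'public':
        return 'public'
    if kind == 'group':
        return 'urn:globus:groups:id:%s' % identifier
    if kind == 'identity':
        return 'urn:globus:auth:identity:%s' % identifier
    raise ValueError('unknown principal kind: %s' % kind)


example_visible_to = [
    visibility_principal('group', '50b6a29c-63ac-11e4-8062-22000ab68755'),
    visibility_principal('identity', '00000000-0000-0000-0000-000000000000'),
]
print(example_visible_to)
```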

# 6) Browse published datasets

Finally, we can build powerful user interfaces on top of Globus publication platform services. As one example we have developed a simple Django portal for browsing and searching published datasets. 

The portal is available here: https://portal.demo.globus.org

Try the same queries as above to find your indexed data. The datasets you ingested above will be immediately available in the search portal.