# KC2 GUID Interoperability Demo

### Date
Oct 12, 2018

### Authors
Max Levinson & Tim Clark, University of Virginia (Team Sodium)

Kyle Chard, University of Chicago (Team Argon)

Martin Fenner, Datacite (Team Sodium)

Isma Gilani & Ray Idaszak, UNC Chapel Hill (Team Helium)

Gareth Harvey & Gabriel Oscares, Elsevier (Team Xenon)

Zachary Flamig, Garret Rupp & Pauline Ribeyre, University of Chicago (Team Calcium) 

### Abstract
This notebook outlines how the KC2 working group's common metadata model, identifier services, and content negotiation enable interoperability on data objects between commons participants. In phase 2 we plan to extend this capability to all digital objects in the commons.

Specifically, we show how two identifier types (DOIs and Minids/ARKs) minted by three commons teams (Helium, Sodium, and Argon) can be used interchangeably. We show that, given either identifier type, we can resolve the identifier, interpret the metadata, and then download the data, irrespective of the identifier type or the stack on which the identifier was minted.

We then show how new identifiers can be minted and registered using the Argon Identifiers service and the Sodium ORS service.

In [1]:
import requests
import json
import hashlib

## TOPMed DOI Identifiers

DataCite DOIs have been registered for all TOPMed public and private files by Martin Fenner of DataCite.

Only public TOPMed metadata specifies the object locations at present.

We begin the demo with a selected public TOPMed DOI, which we will resolve to obtain the object location and other metadata.

In [2]:
identifier = "10.23725/ttff-7p47"       # TOPMed DOI for CRAM

In [3]:
cram_metadata = requests.get(
    'https://doi.org/'+identifier,
    headers = {'Accept': 'application/json'}
)

In [4]:
cram_metadata.json()

{'@context': 'http://schema.org',
 '@type': 'Dataset',
 '@id': 'https://doi.org/10.23725/ttff-7p47',
 'identifier': [{'@type': 'PropertyValue',
   'propertyID': 'doi',
   'value': 'https://doi.org/10.23725/ttff-7p47'},
  {'@type': 'PropertyValue',
   'propertyID': 'minid',
   'value': 'ark:/99999/fk41KzQJxPHEjuMd'},
  {'@type': 'PropertyValue',
   'propertyID': 'dataguid',
   'value': 'dg.4503/81d07e56-9521-407a-b840-b80dd0c291f0'},
  {'@type': 'PropertyValue',
   'propertyID': 'md5',
   'value': '9c02f9b2d91059675e59c9485aac7191'}],
 'url': 'https://ors.datacite.org/doi:/10.23725/ttff-7p47',
 'additionalType': 'CRAM file',
 'name': 'NWD580039.recab.cram',
 'author': {'name': 'TOPMed'},
 'description': 'TOPMed: NWD580039 <br>HapMap_1000G: NA12878 <br>Seq Ctr: Broad <br>File:  CRAM file',
 'keywords': 'topmed, whole genome sequencing',
 'datePublished': '2017-11-30',
 'contentUrl': ['s3://cgp-commons-public/topmed_open_access/96c790a1-ebdb-5eff-9b71-b1c114c7f01a/NWD580039.recab.cram',
 

### Helium-Created VCF File

Computational biologists at Team Helium resolved this DOI to the object contents and computed a VCF file from those contents.

We then minted an ARK/Minid using the Argon Identifier service and a DOI using the Sodium ORS service (not shown here).

Below we demonstrate interoperability for the JSON-LD Schema.org metadata registered with these identifiers.
The metadata returned share a common schema across identifier types, which is a core component of interoperability.
Additionally, both identifiers can be used to programmatically access the file and verify the integrity of its contents via the checksum.

In [13]:
# identifier to resolve 
#identifier = "doi:/10.25489/DC4YZZ"          # Helium DOI for VCF
identifier = "ark:/57799/b91CbVyWo5PEQ9j"   # Argon Minid/ARK for VCF

### 1) Resolve identifier

All commons identifiers are registered in global resolvers (e.g., n2t.net, doi.org, and identifiers.org). This allows any user, without knowing where an identifier was minted, to resolve the identifier and obtain a reference to its landing service/page.

The resolver replies with an HTTP 302 message and a redirect URL that can be followed to locate the landing service/page.

In [14]:
r = requests.get('http://n2t.net/%s' % identifier, allow_redirects=False)
print(r.status_code, r.headers['Location'])

302 https://identifiers.globus.org/ark:/57799/b91CbVyWo5PEQ9j
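As a rough illustration of how a client might choose a global resolver for the identifier forms that appear in this notebook, the sketch below maps each form to a resolver URL. The function name `resolver_url` and the prefix-to-resolver mapping are assumptions made for illustration; they are not part of any KC2 service.

```python
def resolver_url(identifier):
    """Return a resolver URL for an ARK, a DOI, or a compact identifier.

    The mapping below is an illustrative assumption, not a KC2 specification.
    """
    if identifier.startswith("ark:"):
        # ARKs (including Minids) are resolved through n2t.net in this notebook.
        return "https://n2t.net/" + identifier
    if identifier.startswith("doi:"):
        # Strip the "doi:" scheme and any leading slash before handing to doi.org.
        return "https://doi.org/" + identifier[len("doi:"):].lstrip("/")
    if identifier.startswith("10."):
        # A bare DOI such as 10.23725/ttff-7p47.
        return "https://doi.org/" + identifier
    # Anything else is treated as a compact identifier (prefix:accession).
    return "https://identifiers.org/" + identifier


for example in ("ark:/57799/b91CbVyWo5PEQ9j", "doi:/10.25489/DC4YZZ",
                "10.23725/ttff-7p47", "minid:b91CbVyWo5PEQ9j"):
    print(resolver_url(example))
```

The returned URL can then be requested with `allow_redirects=False`, as in the cell above, to inspect the redirect target.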


### 2) Obtain Core Metadata

KC2 has agreed on content negotiation as the means to choose between a human-readable (HTML) landing page and a machine-readable (JSON-LD) landing page. A landing page should contain embedded Schema.org JSON-LD for indexing by crawlers. To access the machine-readable page we set the `Accept` header to the JSON-LD media type for the Schema.org metadata schema.

In [7]:
headers= {'Accept': 'application/vnd.schemaorg.ld+json'}
#headers = {'Accept': 'application/json'}
 
r = requests.get('http://n2t.net/%s' % identifier, headers=headers)
 
if r.status_code == 200:
    metadata = r.json()
    print(json.dumps(metadata))
else:
    print ("Error getting identifier metadata (HTTP %s)" % r.status_code)

{"@context": "http://schema.org", "@type": "Dataset", "@id": "https://doi.org/10.25489/dc4yzz", "identifier": [{"@type": "PropertyValue", "propertyID": "doi", "value": "https://doi.org/10.25489/dc4yzz"}, {"@type": "PropertyValue", "propertyID": "sha256", "value": "0deb9c69ce87af37937d87df3ec3d0c0b2e7501a4600f34d214685d9dbaf0207"}], "additionalType": "XML", "name": "Helium VCF Bag for KC2 Demo", "author": {"name": "Isma Gilani"}, "datePublished": "2018", "schemaVersion": "http://datacite.org/schema/kernel-4", "publisher": {"@type": "Organization", "name": "KC2"}, "contentUrl": ["https://helium.commonsshare.org/django_irods/download/bags/d8088f48bef4408cbc7acb8063ba7a72.zip"], "fileFormat": ["application/gzip"]}
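For the human-readable case, a crawler would read the JSON-LD embedded in the HTML landing page rather than negotiate for it. The sketch below shows one way to pull such blocks out of a page using only the standard library. It assumes the metadata is embedded in `<script type="application/ld+json">` elements, which is the common Schema.org convention; the class and function names, and the sample page, are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._chunks)))
            self._in_jsonld = False


def extract_jsonld(html):
    """Return a list of JSON-LD documents embedded in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.documents


# Hypothetical landing page used only to exercise the parser.
sample_page = (
    '<html><head><script type="application/ld+json">'
    '{"@context": "http://schema.org", "@type": "Dataset", "name": "demo"}'
    '</script></head><body><script>var x = 1;</script></body></html>'
)
print(extract_jsonld(sample_page))
```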


### 3) Validate Core Metadata

All commons identifiers provide metadata following the schema.org schema elements. Before using the metadata we can first validate that it conforms to the schema.org Dataset schema. 

Validation is performed here using the [Google Structured Data Testing Tool](https://search.google.com/structured-data/testing-tool), by copying the metadata listed above and pasting it into the graphical interface.

Warnings for contentUrl are not significant but will be addressed in phase 2, by coordinating with the Schema.org maintainers and Google Dataset Search team.
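Before pasting metadata into an external validator, a quick local sanity check can catch obvious problems. The sketch below is not a Schema.org validator and does not replace the tool above. The set of keys it checks (`@context`, `@type`, `identifier`) is an assumption chosen for illustration, not an official KC2 list of required fields.

```python
def check_core_metadata(metadata):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    # Keys assumed (for this sketch only) to be present in every record.
    for key in ("@context", "@type", "identifier"):
        if key not in metadata:
            problems.append("missing " + key)
    context = str(metadata.get("@context", ""))
    if "@context" in metadata and not context.rstrip("/").endswith("schema.org"):
        problems.append("unexpected @context: " + context)
    if "@type" in metadata and metadata["@type"] != "Dataset":
        problems.append("unexpected @type: " + str(metadata["@type"]))
    return problems


print(check_core_metadata({"@context": "http://schema.org",
                           "@type": "Dataset",
                           "identifier": []}))
print(check_core_metadata({"@type": "Thing"}))
```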



### 4) Download and validate content

Finally, we can inspect the metadata to discover information about the identified object. This metadata includes one or more locations at which the data is available, as well as a checksum that can be used to validate the integrity of the data.
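Because `contentUrl` may list several locations, and in the outputs of this notebook appears both as a list and as a single string, a client may want to normalize it before choosing where to download from. The following is a minimal sketch of that idea; the function name `content_locations` and the bucket path in the example are hypothetical.

```python
from urllib.parse import urlparse


def content_locations(metadata):
    """Group the contentUrl entries of a metadata record by URL scheme."""
    urls = metadata.get("contentUrl", [])
    if isinstance(urls, str):
        # Some services return a single string instead of a list.
        urls = [urls]
    by_scheme = {}
    for url in urls:
        by_scheme.setdefault(urlparse(url).scheme, []).append(url)
    return by_scheme


# Hypothetical record with one cloud-bucket location and one HTTPS location.
example = {"contentUrl": ["s3://example-bucket/path/file.cram",
                          "https://example.org/path/file.cram"]}
print(content_locations(example))
```

A client without cloud credentials could then prefer the `https` entries, while a workflow running inside the cloud could prefer `s3`.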

In [71]:
import urllib.request
download_file = metadata['contentUrl'][0].split('/')[-1]
urllib.request.urlretrieve(metadata['contentUrl'][0], download_file)

('d8088f48bef4408cbc7acb8063ba7a72.zip',
 <http.client.HTTPMessage at 0x10fd84470>)

In [72]:
checksum = {}
for i in metadata['identifier']:
    if isinstance(i, dict): 
        if i.get('propertyID') == 'sha256':
            checksum['sha256'] = i['value']

In [73]:
# Read the downloaded file and close the handle when done
with open(download_file, 'rb') as f:
    contents = f.read()

print(hashlib.sha256(contents).hexdigest())
print(checksum['sha256'])

0deb9c69ce87af37937d87df3ec3d0c0b2e7501a4600f34d214685d9dbaf0207
0deb9c69ce87af37937d87df3ec3d0c0b2e7501a4600f34d214685d9dbaf0207
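The cells above read the whole file into memory and handle only `sha256`. For large files, or for records that carry an `md5` instead, the check can be streamed and generalized. The sketch below is one possible way to do this under those assumptions; the function name `verify_checksum` is hypothetical.

```python
import hashlib


def verify_checksum(path, metadata, chunk_size=65536):
    """Check a file against the first md5 or sha256 entry in the metadata.

    Returns (algorithm, matched), or (None, False) if no checksum is listed.
    """
    for entry in metadata.get("identifier", []):
        if not isinstance(entry, dict):
            continue  # identifier lists may also contain plain strings
        name = entry.get("propertyID")
        if name not in ("md5", "sha256"):
            continue
        digest = hashlib.new(name)
        with open(path, "rb") as handle:
            # Read in fixed-size chunks so large files never sit fully in memory.
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        return name, digest.hexdigest() == entry["value"]
    return None, False
```

With the variables from the cells above, `verify_checksum(download_file, metadata)` would be expected to report a `sha256` match.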


### 5) Resolving a Dataguid

Using the Team Calcium gen3 package, we can resolve a Dataguid to obtain the object metadata.

The indexd service returns Dataguid metadata in its native form. This is then passed to a translation interface at dcp.bionimbus.org, which produces valid JSON-LD Schema.org.

In phase 2 we plan to handle this translation via content negotiation.

In [16]:
import gen3
import shutil
from gen3.indexclient.client import IndexClient

In [17]:
identifier = "dg.4503/33ceb094-68f0-4aed-8fd2-c1f3ff169254"

In [18]:
indexHost = 'https://dataguids.org/index/'
indexVersion = 'v0'  # plain string; a trailing comma here would create a tuple
ic = IndexClient(indexHost, indexVersion)

In [19]:
doc = ic.global_get(identifier)
response = doc.to_json()
host = response.get('from_index_service').get('host').replace('index/', 'coremetadata/')

In [20]:
from gen3.auth import Gen3Auth
auth = Gen3Auth('https://dcp.bionimbus.org/' , 
                refresh_file = 'credentials.json')
headers = {'Accept': 'application/vnd.schemaorg.ld+json'}
r = requests.get(host + identifier, headers=headers, auth=auth)

if r.status_code == 200:
    metadata = r.json()
    print(json.dumps(metadata, sort_keys=True, indent=4))
else:
    print ("Error getting identifier metadata (HTTP %s)" % r.status_code)    

{
    "@context": "http://schema.org",
    "@id": "https://dataguids.org/index/dg.4503/33ceb094-68f0-4aed-8fd2-c1f3ff169254",
    "@type": "Dataset",
    "additionalType": "submitted_aligned_reads",
    "author": {
        "name": "Francisco Ortuno"
    },
    "datePublished": "2018-06-22T17:00:20.893899+00:00",
    "description": "TopMED Open Access Aligned Reads",
    "identifier": [
        {
            "@type": "PropertyValue",
            "propertyID": "dataguid",
            "value": "dg.4503/33ceb094-68f0-4aed-8fd2-c1f3ff169254"
        },
        {
            "@type": "PropertyValue",
            "propertyID": "md5",
            "value": "785c3fc4497e7bb3b7273ddd17071d68"
        }
    ],
    "name": "NWD231092.0005.recab.cram",
    "publisher": {
        "@type": "Organization",
        "name": "DCP Data Commons"
    }
}
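To make the translation step more concrete, the sketch below shows what a mapping from a native index record to the Schema.org shape printed above might look like. This is not the dcp.bionimbus.org implementation. The native field names used here (`did`, `hashes`, `urls`, `file_name`) are assumptions about the record layout, and the function name and default publisher are chosen only to mirror the output above.

```python
def to_schema_org(record, publisher="DCP Data Commons"):
    """Translate an assumed native index record into Schema.org JSON-LD."""
    identifiers = [{"@type": "PropertyValue",
                    "propertyID": "dataguid",
                    "value": record["did"]}]
    # One PropertyValue per checksum, e.g. {"md5": "..."} in the native record.
    for algorithm, value in sorted(record.get("hashes", {}).items()):
        identifiers.append({"@type": "PropertyValue",
                            "propertyID": algorithm,
                            "value": value})
    document = {
        "@context": "http://schema.org",
        "@type": "Dataset",
        "@id": "https://dataguids.org/index/" + record["did"],
        "identifier": identifiers,
        "publisher": {"@type": "Organization", "name": publisher},
    }
    if record.get("file_name"):
        document["name"] = record["file_name"]
    if record.get("urls"):
        document["contentUrl"] = list(record["urls"])
    return document


# Native record assembled by hand from values shown in the output above.
native = {
    "did": "dg.4503/33ceb094-68f0-4aed-8fd2-c1f3ff169254",
    "hashes": {"md5": "785c3fc4497e7bb3b7273ddd17071d68"},
    "file_name": "NWD231092.0005.recab.cram",
}
print(to_schema_org(native)["identifier"])
```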


### 6) Compact Identifier Resolution

Because an appropriate namespace has been defined in Identifiers.org, we may represent the minid as a Compact Identifier and resolve it at identifiers.org or n2t.net.

In [21]:
compact_identifier = "minid:b91CbVyWo5PEQ9j" 
r = requests.get('http://identifiers.org/%s' % compact_identifier, allow_redirects=True,
                headers = {'Accept': 'application/vnd.schemaorg.ld+json'})
r.json()

{'@context': 'http://schema.org',
 '@id': 'https://identifiers.globus.org/ark:/57799/b91CbVyWo5PEQ9j',
 '@type': 'Dataset',
 'contentUrl': ['https://helium.commonsshare.org/django_irods/download/bags/d8088f48bef4408cbc7acb8063ba7a72.zip'],
 'dateCreated': '2018-10-05 02:02:09.887625',
 'identifier': [{'@type': 'PropertyValue',
   'propertyID': 'sha256',
   'value': '0deb9c69ce87af37937d87df3ec3d0c0b2e7501a4600f34d214685d9dbaf0207'}],
 'url': 'https://identifiers.globus.org/ark:/57799/b91CbVyWo5PEQ9j'}
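The relationship between the ARK form and the compact form can be sketched as a pair of small conversions. The prefix `ark:/57799/` is inferred from the two identifiers used in this notebook (`ark:/57799/b91CbVyWo5PEQ9j` and `minid:b91CbVyWo5PEQ9j`); treating it as the only prefix behind the `minid` namespace is an assumption, and the function names are hypothetical.

```python
# ARK prefix observed for the Minid in this notebook (assumed, not authoritative).
MINID_ARK_PREFIX = "ark:/57799/"


def to_compact(ark):
    """Convert an ARK-form Minid into a minid: compact identifier."""
    if not ark.startswith(MINID_ARK_PREFIX):
        raise ValueError("not an ARK with the expected Minid prefix: " + ark)
    return "minid:" + ark[len(MINID_ARK_PREFIX):]


def from_compact(compact):
    """Convert a minid: compact identifier back into its ARK form."""
    namespace, _, local_id = compact.partition(":")
    if namespace != "minid" or not local_id:
        raise ValueError("not a minid compact identifier: " + compact)
    return MINID_ARK_PREFIX + local_id


print(to_compact("ark:/57799/b91CbVyWo5PEQ9j"))
print(from_compact("minid:b91CbVyWo5PEQ9j"))
```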

# Create a new GUID

In this section we will demonstrate how to create and register a new identifier using three of the Commons GUID services for two different identifier types. 

First we'll create a file and then we will compute a checksum and create some basic metadata for our new identifier.

#### 1) Create and Upload File

Recall that we downloaded a compressed BDBag containing our VCF file, "d8088f48bef4408cbc7acb8063ba7a72.zip". We unzip it and then, using the bdbag utilities, fetch the VCF file we need and verify its checksum.

Then, using the package [SnpEff](http://snpeff.sourceforge.net/index.html), we annotate the VCF file and upload the annotated VCF to the cloud.

In [1]:
! unzip d8088f48bef4408cbc7acb8063ba7a72.zip

Archive:  d8088f48bef4408cbc7acb8063ba7a72.zip
  inflating: bag/manifest-sha256.txt  
  inflating: bag/fetch.txt           
  inflating: bag/bagit.txt           
  inflating: bag/bag-info.txt        
  inflating: bag/tagmanifest-md5.txt  
  inflating: bag/tagmanifest-sha256.txt  


In [2]:
! bdbag --resolve-fetch all bag


2018-10-12 14:08:38,329 - INFO - Attempting to resolve remote file references from fetch.txt...
2018-10-12 14:08:38,335 - INFO - Attempting GET from URL: https://helium.commonsshare.org/django_irods/download/d8088f48bef4408cbc7acb8063ba7a72/commonssharetestZone/topmed/public/vcfs/NWD580039.recab.vcf
2018-10-12 14:09:31,100 - INFO - File [/Users/tim/Jupyter/bag/data/NWD580039.recab.vcf] transfer successful. 742.841 MB transferred at 14.491 MB/second. Elapsed time: 0:00:51.263885. 
2018-10-12 14:09:31,101 - INFO - Fetch complete. Elapsed time: 0:00:52.766775



In [3]:
! bdbag --validate full bag


2018-10-12 14:09:36,840 - INFO - Validating bag: /Users/tim/Jupyter/bag
2018-10-12 14:09:36,843 - INFO - Verifying checksum for file /Users/tim/Jupyter/bag/data/NWD580039.recab.vcf
2018-10-12 14:09:40,072 - INFO - Verifying checksum for file /Users/tim/Jupyter/bag/manifest-sha256.txt
2018-10-12 14:09:40,073 - INFO - Verifying checksum for file /Users/tim/Jupyter/bag/fetch.txt
2018-10-12 14:09:40,073 - INFO - Verifying checksum for file /Users/tim/Jupyter/bag/bagit.txt
2018-10-12 14:09:40,073 - INFO - Verifying checksum for file /Users/tim/Jupyter/bag/bag-info.txt
2018-10-12 14:09:40,074 - INFO - Bag /Users/tim/Jupyter/bag is valid



The following command is shown for reference. It was used to generate the VCF annotation, which takes quite a while, so we will not run it in real time here. The cloud upload step has also already been performed.

In [4]:
#! java -Xmx4g -jar snpEFF/snpeff_latest_core/snpEff/snpEff.jar GRCh37.75 bag/data/NWD580039.recab.vcf > NWD580039.ann.vcf

#### 2) Compute the Checksum

In [5]:
import hashlib
import os
import datetime

filename = 'NWD580039.ann.vcf'

algorithm = hashlib.sha256()
with open(os.path.abspath(filename), 'rb') as open_file:
    buf = open_file.read(65536)
    while len(buf) > 0:
        algorithm.update(buf)
        buf = open_file.read(65536)
    checksum = algorithm.hexdigest()

In [6]:
checksum

'eba8426f9662f3b4ad7e16d481c640d02cd341125198297fe56c7ead571a57db'

#### 3) Fill out Metadata

In [36]:
data = {
    "@context": "https://schema.org", 
    "@type": "Dataset", 
    "identifier": [
        { 
            "@type": "PropertyValue", "propertyID": "sha256", "value": checksum
        }],
    "name": 'NWD580039.ann.vcf', 
    "author": [{"@type": "Person", "name": 'Tim Clark'}], 
    "publisher": [{"@type": "Organization", "name": "KC2"}], 
    "datePublished":  '2018', 
    "fileFormat": "text/plain",
    "additionalType": "vcf",
    "contentSize": os.path.getsize(filename),
    "contentUrl": ["https://drive.google.com/file/d/1y4ws6lHRzNhI-KlaS4qfaXfQefWPEgnX/"],
}

#### 4) Authenticate Via Globus Auth

Using the Globus Auth service, we obtain scoped authentication tokens from an OAuth2 flow. This allows service-to-service authentication by granting tokens on behalf of another service. In this example, a client application obtains tokens for the Argon Identifier Service and the Sodium ORS service.

In [10]:
import datetime
import globus_sdk
from identifiers_client.identifiers_api import IdentifierClient
from identifiers_client.config import config

identifiers_namespace = "HHxPIZaVDh9u"

CLIENT_ID = '5db98a49-26d0-4991-9faf-676c2b5b4231'

# python2/3 safe simple input reading
get_input = getattr(__builtins__, 'raw_input', input)

# Perform OAuth flow to get access tokens
native_auth_client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
transfer_scope = 'urn:globus:auth:scope:transfer.api.globus.org:all'

# required scopes
identifiers_scope = 'https://auth.globus.org/scopes/identifiers.globus.org/create_update'
ors_scope =  'https://auth.globus.org/scopes/e94d4c43-ff3e-4032-8e1e-b29c99ef614a/ors'

native_auth_client.oauth2_start_flow(
    requested_scopes=[identifiers_scope, ors_scope]
)
print("Login Here:\n\n{0}".format(native_auth_client.oauth2_get_authorize_url()))
print(("\n\nNote that this link can only be used once! "
       "If login or a later step in the flow fails, you must restart it."))

auth_code = get_input("Enter resulting code:")

tokens = native_auth_client.oauth2_exchange_code_for_tokens(auth_code)
identifiers_token = tokens.by_scopes[identifiers_scope]['access_token']
identifiers = IdentifierClient('identifiers', base_url='https://identifiers.globus.org/',
    authorizer=globus_sdk.AccessTokenAuthorizer(identifiers_token))

Login Here:

https://auth.globus.org/v2/oauth2/authorize?client_id=5db98a49-26d0-4991-9faf-676c2b5b4231&redirect_uri=https%3A%2F%2Fauth.globus.org%2Fv2%2Fweb%2Fauth-code&scope=https%3A%2F%2Fauth.globus.org%2Fscopes%2Fidentifiers.globus.org%2Fcreate_update+https%3A%2F%2Fauth.globus.org%2Fscopes%2Fe94d4c43-ff3e-4032-8e1e-b29c99ef614a%2Fors&state=_default&response_type=code&code_challenge=0p2NsN3TR03Fx-ai_eXa7oNI_LtRKY_eV1qMfHnwQIQ&code_challenge_method=S256&access_type=online


Note that this link can only be used once! If login or a later step in the flow fails, you must restart it.
Enter resulting code:AR1tgPcmdjyeHfF5ZnAdK5WcEW0ddc


### Create and Register a GUID with three Services

### 1) Argon Minids

This example shows how to use the Argon Identifier Service to mint and register an ARK/Minid by posting the metadata to the Identifier Service REST API.

To run this part of the notebook you will need to install the Globus Identifiers client, which can be downloaded and installed with pip from https://github.com/globus/globus-identifiers-client

In [43]:
visible_to = 'public'
dataset_identifier = identifiers.create_identifier(
    namespace=identifiers_namespace,
    location=json.dumps([data['contentUrl'][0]]),
    checksums=json.dumps([{'function' : 'sha256', 'value': data['identifier'][0]['value']}]),
    metadata=json.dumps({
        'title': data['name'],
        'date': data['datePublished'],
        'contentSize': data['contentSize'],
        'author': data['author'][0]['name'],
    }),
    visible_to=json.dumps([visible_to]))

print('https://n2t.net/' + dataset_identifier.data['identifier'])

https://n2t.net/ark:/99999/fk4Sp8J2cwDiV3G
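The mapping from the Schema.org record to the arguments of `create_identifier` can be factored into a small helper, which makes it easier to inspect before anything is minted. The sketch below only repackages the argument names and JSON encoding used in the cell above. The helper name `minid_arguments` is hypothetical, and the record in the example is a stand-in with a made-up URL and checksum.

```python
import json


def minid_arguments(data, namespace, visible_to="public"):
    """Build the keyword arguments used above for identifiers.create_identifier."""
    return {
        "namespace": namespace,
        "location": json.dumps([data["contentUrl"][0]]),
        "checksums": json.dumps([{"function": "sha256",
                                  "value": data["identifier"][0]["value"]}]),
        "metadata": json.dumps({
            "title": data["name"],
            "date": data["datePublished"],
            "contentSize": data["contentSize"],
            "author": data["author"][0]["name"],
        }),
        "visible_to": json.dumps([visible_to]),
    }


# Stand-in record; the URL and checksum value are placeholders.
example = {
    "identifier": [{"@type": "PropertyValue", "propertyID": "sha256", "value": "abc123"}],
    "name": "NWD580039.ann.vcf",
    "author": [{"@type": "Person", "name": "Tim Clark"}],
    "datePublished": "2018",
    "contentSize": 42,
    "contentUrl": ["https://example.org/NWD580039.ann.vcf"],
}
print(minid_arguments(example, "example-namespace"))
```

With an authenticated client, the call would then read `identifiers.create_identifier(**minid_arguments(data, identifiers_namespace))`.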


### 2) Sodium ORS

This example shows how to use the Sodium Object Registration Service (ORS) to mint and register a DOI. We map our metadata to the Schema.org metadata specification and PUT it to the ORS REST API. This API is registered with the KC3 OpenAPI registry.

In [50]:
ors_token = tokens.by_scopes[ors_scope]['access_token']

In [52]:
headers= {
    'Accept': 'application/json', 
    'Content-Type': 'application/json', 
    'Authorization': 'Bearer {}'.format(ors_token)
}

r = requests.put('https://ors.test.datacite.org/doi/put', 
                 headers=headers, 
                 data=json.dumps(data)
                )

print( 'https://doi.org/' + r.json().get('@id').replace('doi:/', ''))

https://doi.org/10.25489/XPA7G1


In [53]:
ors_get = requests.get('https://ors.test.datacite.org/' + r.json().get("@id"),
            headers={'Accept': 'application/json'})

In [54]:
json.loads(ors_get.content.decode('utf-8'))

{'@id': 'https://doi.org/10.25489/XPA7G1',
 '@context': 'https://schema.org',
 'identifier': ['https://doi.org/10.25489/XPA7G1',
  {'@type': 'PropertyValue',
   'propertyID': 'sha256',
   'value': 'eba8426f9662f3b4ad7e16d481c640d02cd341125198297fe56c7ead571a57db'}],
 '@type': 'Dataset',
 'name': 'NWD580039.ann.vcf',
 'datePublished': '2018',
 'author': [{'name': 'Tim Clark'}],
 'fileFormat': ['text/plain'],
 'contentUrl': ['https://drive.google.com/file/d/1y4ws6lHRzNhI-KlaS4qfaXfQefWPEgnX/']}
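Note that the registered record's `identifier` list mixes a plain string (the DOI URL) with a `PropertyValue` dictionary. A round-trip check that the checksum survived registration therefore needs to skip non-dictionary entries. The sketch below shows one way to do that; the function names are hypothetical, and the sample records are trimmed copies of the values shown in this notebook.

```python
def property_values(metadata):
    """Map propertyID to value for the PropertyValue entries of a record."""
    values = {}
    for entry in metadata.get("identifier", []):
        # Skip plain strings such as a DOI URL.
        if isinstance(entry, dict) and "propertyID" in entry:
            values[entry["propertyID"]] = entry.get("value")
    return values


def checksums_match(submitted, registered, algorithm="sha256"):
    """Return True if both records carry the same checksum for the algorithm."""
    expected = property_values(submitted).get(algorithm)
    return expected is not None and expected == property_values(registered).get(algorithm)


sha = "eba8426f9662f3b4ad7e16d481c640d02cd341125198297fe56c7ead571a57db"
submitted = {"identifier": [{"@type": "PropertyValue", "propertyID": "sha256", "value": sha}]}
registered = {"identifier": ["https://doi.org/10.25489/XPA7G1",
                             {"@type": "PropertyValue", "propertyID": "sha256", "value": sha}]}
print(checksums_match(submitted, registered))
```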

### 3) Xenon DOI Registration

In [55]:
data['url'] = "https://drive.google.com/file/d/1QCjoYimsI-zW76Bhjw18dt975ZKAaAUP/"
data.pop('publisher')
data['identifier'] = ['https://example.org', data['identifier'][0] ]
data['contentUrl'] = data['contentUrl'][0]

In [56]:
headers= {'Accept': 'application/json', 'Content-Type': 'application/json'}
r = requests.post(
    'http://nih-guid-broker-elb-1813020435.eu-west-1.elb.amazonaws.com/broker/doi',
    headers=headers, 
    data=json.dumps(data)
)

In [57]:
print('https://doi.org/' + r.json().get('@id').replace('doi:/', ''))

https://doi.org/10.4124/test59202


In [58]:
get_request = requests.get(
    'http://nih-guid-broker-elb-1813020435.eu-west-1.elb.amazonaws.com/broker/doi/'+ r.json().get('@id').replace('doi:/', '')
)
get_request.json()

{'@context': 'https://schema.org',
 '@id': '10.4124/test59202',
 '@type': 'Dataset',
 'identifier': ['https://example.org',
  {'@type': 'PropertyValue',
   'propertyID': 'sha256',
   'value': 'eba8426f9662f3b4ad7e16d481c640d02cd341125198297fe56c7ead571a57db'}],
 'url': 'https://drive.google.com/file/d/1QCjoYimsI-zW76Bhjw18dt975ZKAaAUP/',
 'contentUrl': 'https://drive.google.com/file/d/1y4ws6lHRzNhI-KlaS4qfaXfQefWPEgnX/',
 'name': 'NWD580039.ann.vcf',
 'author': [{'@type': 'Person', 'name': 'Tim Clark'}],
 'datePublished': '2018',
 'additionalType': 'vcf',
 'contentSize': '2923139598',
 'fileFormat': 'text/plain'}