# Using DOS to download protected data

This example shows how DOS (the GA4GH Data Object Service) can provide an interoperability layer over data indexed in `indexd`. As we will see, `indexd` works with `fence` to supply the credentials needed for URL signing.

## Accessing metadata from indexd

A lambda, reachable through the API endpoint used below, has been set up to point at `dev.bionimbus.org`. Let's fetch some Data Objects from it.

In [18]:
from ga4gh.dos.client import Client
client = Client("https://mkc9oddwq0.execute-api.us-west-2.amazonaws.com/api")
local_client = client.client
models = client.models

Now that the client is set up, we can list the available Data Objects using `ListDataObjects`.

In [25]:
ListDataObjectsRequest = models.get_model('ga4ghListDataObjectsRequest')
data_objects = local_client.ListDataObjects(body=ListDataObjectsRequest(page_size=100)).result().data_objects
print("Returned {} data objects.".format(len(data_objects)))

Returned 14 data objects.
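
The call above asks for a single page of up to 100 objects. If a service holds more than one page, the results have to be collected page by page. The following is a minimal sketch of that loop, assuming the response exposes a `next_page_token` that is passed back as `page_token` on the next request. Those field names are an assumption about the DOS schema, and `fetch_page` and `fake_fetch_page` are hypothetical stand-ins so the example runs without the service.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# Hypothetical fetcher type: takes a page token (None for the first page)
# and returns (items, next_page_token).
FetchPage = Callable[[Optional[str]], Tuple[List[dict], Optional[str]]]


def iter_all_objects(fetch_page: FetchPage) -> Iterator[dict]:
    """Yield every item across pages until no next-page token is returned."""
    token = None
    while True:
        items, token = fetch_page(token)
        for item in items:
            yield item
        if not token:
            break


# Stand-in for the service: three fake pages keyed by page token.
_FAKE_PAGES = {
    None: ([{"id": "a"}, {"id": "b"}], "p2"),
    "p2": ([{"id": "c"}], "p3"),
    "p3": ([{"id": "d"}], None),
}


def fake_fetch_page(token):
    """Return the fake page stored under this token."""
    return _FAKE_PAGES[token]


all_ids = [obj["id"] for obj in iter_all_objects(fake_fetch_page)]
print(all_ids)
```

With the real client, `fetch_page` could wrap `local_client.ListDataObjects(...)` and return the page's `data_objects` together with its next-page token, provided the service actually returns one.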


## Downloading data

These Data Objects point to S3 addresses.

In [34]:
data_object = local_client.GetDataObject(data_object_id=data_objects[11].id).result().data_object
print(data_object)

ga4ghDataObject(aliases=None, checksums=[ga4ghChecksum(checksum=u'73d643ec3f4beb9020eef0beed440ad0', type=u'md5')], created=datetime.datetime(2018, 1, 22, 18, 34, 12, tzinfo=tzutc()), description=None, id=u'c8215adc-d77a-4cb1-b1e4-8dd96d7e8821', mime_type=None, name=u'testdata.txt', size=9L, updated=datetime.datetime(2018, 1, 22, 18, 34, 12, tzinfo=tzutc()), urls=[ga4ghURL(system_metadata=protobufStruct(fields=None), url=u's3://cdistest-gen3data/testdata.txt', user_metadata=protobufStruct(fields=None))], version=u'5582aee1')
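
The `url` field in the output above is an `s3://` address, which combines a bucket name and an object key. As a small illustration, the standard library can split the two apart. The helper name `split_s3_url` is hypothetical and not part of the DOS client.

```python
from urllib.parse import urlparse


def split_s3_url(url):
    """Return (bucket, key) for an s3://bucket/key URL."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an s3:// URL: {!r}".format(url))
    # The bucket is the network location; the key is the path without
    # its leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_url("s3://cdistest-gen3data/testdata.txt")
print(bucket, key)
```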


Ordinarily, these data are accessible only with a third-party client. If the data are in public buckets with requester pays enabled, specially formatted URLs may be available.

## Logging in to sign a URL

In `fence`, my email `davidcs@ucsc.edu` has been granted access to one of the files for demonstration. To get the signed URL, we first need a `fence_session` token. Please consider this a preliminary demonstration of crossing auth domains.

First, we must go through the Google login for `bionimbus.org`. Clicking this URL takes us to the bionimbus login process:

https://dev.bionimbus.org/user/login/google?redirect=https://dev.bionimbus.org/

On successful authentication, we are redirected to bionimbus, and the resulting session token can be used to authorize requests.

In [119]:
data_object = local_client.GetDataObject(data_object_id="838a5d53-a02b-452b-9ba1-e7dd0cf01ae3").result().data_object
print(data_object.urls[0].url)
print(data_object.id)

s3://cdis-presigned-url-test/testdata
838a5d53-a02b-452b-9ba1-e7dd0cf01ae3


In [120]:
# FIXME: to be replaced with an API key and an improved DOS client
import json
fence_session = "50b4b8b6-3047-4ad9-b59f-e6ce0011a605"
data_object_url = "https://mkc9oddwq0.execute-api.us-west-2.amazonaws.com/api/ga4gh/dos/v1/dataobjects/{}".format(data_object.id)
res = !http get $data_object_url "fence_session:$fence_session"
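
The cell above shells out to the `http` command and passes the session token as a `fence_session` request header. As a hedged alternative, the same request can be described with the standard library. This sketch only builds the request and does not send it. The base URL and token are placeholders, and `build_data_object_request` is a hypothetical helper.

```python
from urllib.request import Request

# Path taken from the cell above; the base URL is supplied by the caller.
DATA_OBJECT_PATH = "/ga4gh/dos/v1/dataobjects/{}"


def build_data_object_request(base_url, data_object_id, fence_session):
    """Build (but do not send) a GET request carrying the fence_session header."""
    url = base_url.rstrip("/") + DATA_OBJECT_PATH.format(data_object_id)
    return Request(url, headers={"fence_session": fence_session}, method="GET")


request = build_data_object_request(
    "https://example.invalid/api",           # placeholder base URL
    "838a5d53-a02b-452b-9ba1-e7dd0cf01ae3",  # ID from the cell above
    "not-a-real-session-token",              # placeholder token
)
print(request.full_url)
# urllib stores the header name as "Fence_session"; HTTP header names
# are case-insensitive, so this is the same header.
print(request.header_items())
```

Passing the request to `urllib.request.urlopen` would send it, and the JSON body could then be decoded with `json.load`.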

In [121]:
data_object = json.loads(res[0])["data_object"]
signed_url = data_object['urls'][1]['url']
print(signed_url)

https://cdis-presigned-url-test.s3.amazonaws.com/testdata?AWSAccessKeyId=AKIAJO3MS2GL7DOHQ55A&Expires=1516774569&Signature=J4e%2FVdwenVOyHHSMhGkkW65UdJU%3D
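
The cell above takes the second entry of `urls` by position, and the signed URL carries an `Expires` query parameter. A slightly more defensive sketch picks the entry by its scheme and reads the expiry time, assuming `Expires` is a Unix timestamp in seconds, as it is in this style of S3 signed URL. The helper names are hypothetical, and the key and signature below are placeholders.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse, parse_qs


def pick_https_url(urls):
    """Return the first URL whose scheme is https, or None if there is none."""
    for entry in urls:
        if urlparse(entry["url"]).scheme == "https":
            return entry["url"]
    return None


def signed_url_expiry(url):
    """Read the Expires query parameter (Unix seconds) as a UTC datetime."""
    expires = parse_qs(urlparse(url).query)["Expires"][0]
    return datetime.fromtimestamp(int(expires), tz=timezone.utc)


# Illustrative list shaped like data_object['urls'].
urls = [
    {"url": "s3://cdis-presigned-url-test/testdata"},
    {"url": "https://cdis-presigned-url-test.s3.amazonaws.com/testdata"
            "?AWSAccessKeyId=EXAMPLEKEY&Expires=1516774569&Signature=EXAMPLE"},
]
signed = pick_https_url(urls)
expiry = signed_url_expiry(signed)
print(expiry.isoformat())
print("expired:", expiry < datetime.now(timezone.utc))
```

Checking the expiry before downloading makes it easier to tell an expired signature apart from a permissions problem.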


## Downloading from a signed URL

Because the signed URL can be fetched with a plain HTTP GET, we can download it with `wget`, `curl`, or any similar tool.

In [126]:
!!wget "$signed_url" -O "output"

['--2018-01-23 21:47:10--  https://cdis-presigned-url-test.s3.amazonaws.com/testdata?AWSAccessKeyId=AKIAJO3MS2GL7DOHQ55A&Expires=1516774569&Signature=J4e%2FVdwenVOyHHSMhGkkW65UdJU%3D',
 'Resolving cdis-presigned-url-test.s3.amazonaws.com (cdis-presigned-url-test.s3.amazonaws.com)... 52.216.230.35',
 'Connecting to cdis-presigned-url-test.s3.amazonaws.com (cdis-presigned-url-test.s3.amazonaws.com)|52.216.230.35|:443... connected.',
 'HTTP request sent, awaiting response... 200 OK',
 'Length: 40 [binary/octet-stream]',
 'Saving to: \xe2\x80\x98output\xe2\x80\x99',
 '',
 '     0K                                                       100% 1.23M=0s',
 '',
 '2018-01-23 21:47:10 (1.23 MB/s) - \xe2\x80\x98output\xe2\x80\x99 saved [40/40]',
 '']
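
The same download can be done without leaving Python. The sketch below streams a URL to a local file using only the standard library. To keep it runnable offline, a `file://` URL stands in for the signed HTTPS URL, and `download` is a hypothetical helper name.

```python
import os
import shutil
import tempfile
from pathlib import Path
from urllib.request import urlopen


def download(url, destination):
    """Stream the response body at `url` into the file `destination`."""
    with urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)
    return destination


# Offline demonstration: a file:// URL stands in for the signed HTTPS URL.
workdir = tempfile.mkdtemp()
source = Path(workdir) / "source.bin"
source.write_bytes(b"example payload")
target = download(source.as_uri(), os.path.join(workdir, "output"))
downloaded = Path(target).read_bytes()
print(len(downloaded))
shutil.rmtree(workdir)
```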

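We can then compare the MD5 digest of the downloaded file with the checksum recorded in the Data Object. In the run shown here, the two printed values do not match.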

In [128]:
import hashlib
# https://stackoverflow.com/questions/3431825/generating-an-md5-checksum-of-a-file
def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()
print(md5('output'))
print(data_object['checksums'][0]['checksum'])
!!cat output

a17a26fd6323d6079b31480947a3389e
73d643ec3f4beb9020eef0beed440ad0


['Hi Zac!', 'cdis-data-client uploaded this!']
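
Rather than comparing the two digests by eye, the check can be automated. This is a minimal sketch, assuming each checksum entry is a dictionary with `type` and `checksum` keys as in the JSON response above. The helper names are hypothetical, and the demonstration uses a throwaway file rather than the downloaded data.

```python
import hashlib
import os
import tempfile


def file_digest(path, algorithm="md5", chunk_size=4096):
    """Hex digest of a file, read in chunks so large files stay out of memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_checksum(path, checksums):
    """True if any recorded checksum of a supported type matches the file."""
    for entry in checksums:
        algorithm = entry["type"].lower()
        if algorithm not in hashlib.algorithms_available:
            continue  # skip checksum types hashlib cannot compute
        if file_digest(path, algorithm) == entry["checksum"].lower():
            return True
    return False


# Demonstration on a throwaway file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
try:
    good = [{"type": "md5", "checksum": hashlib.md5(b"hello").hexdigest()}]
    bad = [{"type": "md5", "checksum": "0" * 32}]
    good_result = matches_recorded_checksum(tmp.name, good)
    bad_result = matches_recorded_checksum(tmp.name, bad)
    print(good_result, bad_result)
finally:
    os.remove(tmp.name)
```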