# Introduction
This notebook walks through the basics of using the Data Hub API to work on, validate, and submit your data.  These APIs are designed to let users perform all the actions available in the [Data Submission Portal](https://hub.datacommons.cancer.gov/) from a notebook or script.  The intent is to allow submitters to operate directly from their own environments, if they so choose, rather than work through the graphical submission interface.

There are a few prerequisites that you must meet before you can use this API:

## Prerequisites

### GraphQL
The Data Hub API uses [GraphQL](https://graphql.org/), and a good understanding of GraphQL is required.  Because GraphQL can be complex, a tutorial is beyond the scope of this document; the [GraphQL Documentation](https://graphql.org/learn/) is a useful starting point.

### Login.gov account
Use of Data Hub in general requires an account registered with [Login.gov](https://www.login.gov/) (NIH users can use their NIH account and PIV card).  Note that a Login.gov account is distinct from the eRA Commons identity that is frequently used at NIH.

### Approved Submission
You must receive approval to submit data to CRDC prior to using the Data Hub APIs.  If you need approval, please read and follow the [Submissions Request Instructions](https://datacommons.cancer.gov/submit).  Instructions for using the graphical data submission process are on the same page.

### An API Token
If you are an approved submitter with a Login.gov or NIH account, you can generate an API token from the graphical interface.  Log into the system, then click on your user name and select the **API Token** menu option.  This will bring up a dialog box that allows you to create an API token and copy it to your clipboard.  There are two things to note about API tokens:
- The token is tied to your user identity and can be used on any submission that you're approved to work on.
- You can have only one token at a time.  Generating a new token will revoke the previous token.


In [2]:
import requests
import os

The imports below are used only for display purposes in this notebook.  They're not required to interact with the Data Hub API.

In [30]:
import pandas as pd
from IPython.display import display, Markdown, Latex

API Endpoints

In [3]:
prod = 'https://hub.datacommons.cancer.gov/api/graphql'
#Note that use of Dev2 requires a VPN connection through the NIH firewall
dev2 = 'https://hub-dev2.datacommons.cancer.gov/api/graphql'

# Security Note
It is ***highly*** recommended that you keep your API token secure and not include it in any code.  While there are many ways to do this, for the purposes of this notebook it's been set in an environment variable named "DEV2API".

In [29]:
dev2APIKey = os.environ['DEV2API']
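
Reading the variable with `os.environ[...]` raises a bare `KeyError` if it hasn't been set. As a minimal sketch, the hypothetical helper below (`load_api_token` is not part of the Data Hub API) reads the same "DEV2API" variable and fails with a clearer message when the token is missing or empty.

```python
import os


def load_api_token(var_name="DEV2API"):
    """Return the API token stored in the given environment variable.

    The default name "DEV2API" is the one used in this notebook; the helper
    itself is an illustrative assumption.
    """
    # Treat a missing variable and a blank one the same way
    token = os.environ.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"Environment variable {var_name} is not set or is empty."
        )
    return token
```

Calling `load_api_token()` in place of `os.environ['DEV2API']` keeps the token out of the notebook while making a misconfigured environment easier to diagnose.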

In [78]:
def apiQuery(url, query, variables, headers):
    # Read the token from the environment so it never appears in the notebook
    token = os.environ['DEV2API']
    # Copy any caller-supplied headers so the caller's dictionary isn't modified
    headers = dict(headers or {})
    headers["Authorization"] = f"Bearer {token}"
    # Only include "variables" in the payload when some were provided
    payload = {"query": query}
    if variables is not None:
        payload["variables"] = variables
    try:
        result = requests.post(url=url, headers=headers, json=payload)
        if result.status_code == 200:
            return result.json()
        else:
            print(f"Error: {result.status_code}")
            return result.content
    except requests.exceptions.RequestException as e:
        # RequestException also covers connection errors and timeouts
        return f"HTTP Error: {e}"
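
GraphQL servers generally report query problems in a top-level "errors" list, often alongside an HTTP 200 status, so a 200 response is not by itself proof that a query worked. The sketch below is a hypothetical helper (`unwrap_graphql` is not part of the Data Hub API) that checks the value returned by `apiQuery` before any code indexes into it. It assumes only the return shapes shown above: a dictionary on success, or bytes or a string on failure.

```python
def unwrap_graphql(response, field):
    """Return response['data'][field], raising if the call did not succeed."""
    # apiQuery returns bytes or a string (not a dict) when the HTTP call fails
    if not isinstance(response, dict):
        raise RuntimeError(f"Request failed: {response!r}")
    # Query-level problems are reported in a top-level "errors" list
    errors = response.get("errors")
    if errors:
        messages = "; ".join(
            str(err.get("message", err)) if isinstance(err, dict) else str(err)
            for err in errors
        )
        raise RuntimeError(f"GraphQL errors: {messages}")
    # "data" may be missing or null when the query failed
    data = response.get("data") or {}
    if data.get(field) is None:
        raise RuntimeError(f"No '{field}' in response data")
    return data[field]
```

For example, `unwrap_graphql(result, "listApprovedStudiesOfMyOrganization")` would return the list of studies or raise a `RuntimeError` that describes what went wrong.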

## Step 1: Understanding the landscape

Let's assume that this is our first submission using the API.  The first thing we need to do is list the studies that our organization is approved for so that we can submit to the correct study.  That's done with the *listApprovedStudiesOfMyOrganization* query.

In [6]:
org_query = """
{
  listApprovedStudiesOfMyOrganization{
    originalOrg
    dbGaPID
    studyAbbreviation
    studyName
    _id
  }
}
"""

Note that the actual results returned by this query will vary for each organization.  These are examples only and shouldn't be used.

In [79]:
org_res = apiQuery(dev2, org_query,None, None)

In [80]:
org_df = pd.DataFrame(org_res['data']['listApprovedStudiesOfMyOrganization'])
display(Markdown(org_df.to_markdown()))

|    | originalOrg                                    | dbGaPID   | studyAbbreviation   | studyName                                                                                       | _id                                  |
|---:|:-----------------------------------------------|:----------|:--------------------|:------------------------------------------------------------------------------------------------|:-------------------------------------|
|  0 | Purdue Center for Cancer Research              |           | UBC01               | Antitumor Activity and Molecular Effects of Vemurafenib in Dogs with BRAF-mutant Bladder Cancer | b9e9ab79-d90b-4ec1-83b7-f83a5a75f5b5 |
|  1 | Comparative Molecular Characterization Program |           | OSA01               | A Multi-Platform Sequencing Analysis of Canine Appendicular Osteosarcoma                        | e3feefe9-cc70-4ae0-be06-9df7f29d84e8 |
|  2 | Comparative Molecular Characterization Program |           | TCL01               | Whole exome sequencing analysis of canine cancer cell lines                                     | 6c7fa436-efa3-42c6-af4c-7f5b70a1d35d |
|  3 | NCI BBRB                                       |           | CMB                 | Cancer Moonshot Biobank                                                                         | 4c2b6522-20b8-4841-8c7a-318b325c99b4 |
|  4 | CCDI                                           | phs003432 | TALLsc              | T-cell Acute Lymphoblastic Leukemia Single Cell RNA Sequencing and ATAC Sequencing              | 49a69fef-71f8-44e6-ad3b-f7a62d91e348 |

## Step 2: Creating a new submission

For the purposes of this demonstration, we'll use the CCDI TALLsc study as the example.  In order to submit data, you first create a new submission within the study.  **Do not do this if you're continuing with an existing submission**.

From the data we obtained in the first query, we'll have to parse out the information that's relevant to the CCDI TALLsc study.  We'll need these values to construct the query that creates the new submission.

In [8]:
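# Pull out the fields for the CCDI study (TALLsc in the example table above)
for study in org_res['data']['listApprovedStudiesOfMyOrganization']:
    if study['originalOrg'] == 'CCDI':
        org = study['originalOrg']
        dbgap = study['dbGaPID']
        abbrev = study['studyAbbreviation']
        # Stored separately from "name" below, which holds the submission name
        studyname = study['studyName']
        studyid = study['_id']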

dc = "CDS"
name = "Demo create submission Jupyter"
intention = "New/Update"
datatype = "Metadata and Data Files"
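
The loop above matches on `originalOrg`, so it would silently pick the last match if an organization had more than one approved study. As a sketch of a stricter alternative, the hypothetical helper below (`find_study` is an illustrative name, not part of the API) selects a study by its `studyAbbreviation` and raises if the match is missing or ambiguous.

```python
def find_study(studies, abbreviation):
    """Return the single study whose studyAbbreviation equals abbreviation."""
    matches = [s for s in studies if s.get("studyAbbreviation") == abbreviation]
    # Refuse to guess when nothing matches or more than one study matches
    if len(matches) != 1:
        raise ValueError(
            f"Expected exactly one study with abbreviation {abbreviation!r}, "
            f"found {len(matches)}"
        )
    return matches[0]
```

With the example results shown earlier, `find_study(org_res['data']['listApprovedStudiesOfMyOrganization'], "TALLsc")` would return the CCDI TALLsc entry, from which `_id` and `dbGaPID` can be read.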


### createSubmission mutation

Creating a submission requires the use of a mutation that calls createSubmission.  There are multiple required variables that have to be provided in a GraphQL-compatible way:
- studyID: This is the Study ID, which can be obtained from the graphical interface or from the _id field returned by the *listApprovedStudiesOfMyOrganization* query
- dbGaPID: Obtained when registering the study at dbGaP.  This is required for all controlled access studies
- dataCommons: This is the CRDC Data Commons the submission will be deposited in
- name: This can be anything that allows you to identify this specific submission
- intention: Can be "New/Update" if you are adding information to the submission or "Delete" if you are removing information from the submission
- dataType: Can be either "Metadata and Data Files" or "Metadata Only".  Which one is selected depends on whether or not data files will be included in the submission

This query will return the _id field, which will be the newly created submission ID.  It will also return a number of other fields that can be checked to make sure the submission was created properly.
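
Because *intention* and *dataType* accept only the values listed above, it can help to check them before sending the mutation. The following is a minimal sketch: `build_submission_variables` and the two constant sets are hypothetical names, and the allowed values are taken only from the list above, not from the API schema.

```python
# Allowed values as described in this notebook (not read from the API schema)
ALLOWED_INTENTIONS = {"New/Update", "Delete"}
ALLOWED_DATA_TYPES = {"Metadata and Data Files", "Metadata Only"}


def build_submission_variables(study_id, dbgap_id, data_commons, name,
                               intention, data_type):
    """Build the variables dictionary for the createSubmission mutation."""
    if intention not in ALLOWED_INTENTIONS:
        raise ValueError(f"intention must be one of {sorted(ALLOWED_INTENTIONS)}")
    if data_type not in ALLOWED_DATA_TYPES:
        raise ValueError(f"dataType must be one of {sorted(ALLOWED_DATA_TYPES)}")
    # Keys match the variable names declared in the mutation
    return {
        "studyID": study_id,
        "dbGaPID": dbgap_id,
        "dataCommons": data_commons,
        "name": name,
        "intention": intention,
        "dataType": data_type,
    }
```

A misspelled value such as "New" then fails locally with a `ValueError` instead of producing a server-side error.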

In [55]:
create_submission_query = """
mutation CreateNewSubmission(
  $studyID: String!,
  $dbGaPID: String!,
  $dataCommons: String!,
  $name: String!,
  $intention:String!,
  $dataType: String!,
){
  createSubmission(
    studyID: $studyID,
    dbGaPID: $dbGaPID,
    dataCommons: $dataCommons,
    name: $name,
    intention: $intention,
    dataType: $dataType
  ){
    _id
    studyID
    dbGaPID
    dataCommons
    name
    intention
    dataType
    status
  }
}"""

In [56]:
variables = {"studyID":studyid, "dbGaPID":dbgap, "dataCommons":dc, "name":name, "intention":intention,"dataType":datatype}

In [57]:
create_res = apiQuery(dev2,create_submission_query, variables, None)

In [75]:
print(create_res)

{'data': {'createSubmission': {'_id': 'eda77bf5-37cd-4f3b-822e-cbceb31fb05c', 'studyID': '49a69fef-71f8-44e6-ad3b-f7a62d91e348', 'dbGaPID': 'phs003432', 'dataCommons': 'CDS', 'name': 'Demo create submission Jupyter', 'intention': 'New/Update', 'dataType': 'Metadata and Data Files', 'status': 'New'}}}


Parse out the submission ID since we'll need it later

In [76]:
submissionid = create_res['data']['createSubmission']["_id"]
subname = create_res['data']['createSubmission']['name']

### Side trip

At this point, if you go to the graphical interface, you should see that a new submission has been created using the name provided in the query.

## Step 3: Uploading Submission templates

Once the submission is created, the next step is to start uploading metadata submission templates.  There are two ways of accomplishing this upload:
1) Using the Upload CLI Tool: This is generally the easiest method and can be used to upload both the metadata templates and the data files.  The use of the Uploader CLI Tool [is documented elsewhere](https://github.com/CBIIT/crdc-datahub-cli-uploader/tree/master).
2) Using the API: If you wish to provide metadata only via a program, the API can be used, as will be demonstrated in this notebook.

**Note that while the API can be used to upload metadata, the actual data files MUST be uploaded with the Upload CLI Tool**

### Getting temporary credentials
The first step of submitting metadata via the API is to use the createTempCredentials mutation to get credentials that allow the submission.

In [81]:
get_temp_cred_query = """
 mutation CreateTempCredentials(
        $submissionID: ID!
    ){
        createTempCredentials(submissionID: $submissionID){
          accessKeyId
          secretAccessKey
          sessionToken          
        }
    }
"""

In [82]:
cred_variables = {"submissionID" : submissionid}

In [83]:
cred_res = apiQuery(dev2, get_temp_cred_query, cred_variables, None)

In [84]:
accessKeyID = cred_res['data']['createTempCredentials']['accessKeyId']
secretKey = cred_res['data']['createTempCredentials']['secretAccessKey']
sessionToken = cred_res['data']['createTempCredentials']['sessionToken']

### Collecting information about the metadata files to upload
Let's set up the list of metadata files we want to upload.  This will be a list of **FileInput** objects.  A FileInput object consists of a dictionary with *fileName* and *size* as the keys.

- fileName: The full path file name.  Note that this will vary depending on the operating system being used.
- size: The size of the file in bytes

The last value required for the query is *type*, which is either "metadata" or "data file".  Because "data file" isn't allowed outside of the Upload CLI Tool, we'll set it to "metadata".

In [85]:
metadatafiles = [{"fileName":"/home/pihl/testdata/PDXNet_participant.tsv", "size": 2106 }, {"fileName":"/home/pihl/testdata/PDXNet_sample.tsv", "size":12416}]
type = "metadata"
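
Typing file sizes by hand is error-prone. As a minimal sketch, the hypothetical helper below (`build_file_inputs` is an illustrative name) builds the same list of *fileName*/*size* dictionaries by reading each size from disk with `os.path.getsize`.

```python
import os


def build_file_inputs(paths):
    """Return a list of FileInput-style dictionaries for the given files."""
    file_inputs = []
    for path in paths:
        # Fail early with a clear message if a file is missing
        if not os.path.isfile(path):
            raise FileNotFoundError(f"Metadata file not found: {path}")
        # os.path.getsize reports the size in bytes, as FileInput expects
        file_inputs.append({"fileName": path, "size": os.path.getsize(path)})
    return file_inputs
```

Passing the two paths used above to `build_file_inputs` would produce a list with the same structure as `metadatafiles`.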

### The createBatch mutation
Now that we've got credentials and the list of files, we create a "batch", which is the term for one or more files uploaded at the same time.  We do this by using the createBatch mutation as shown below.

One of the critical pieces of information returned is the signed URL that is used to actually transfer the files to Data Hub.

In [86]:
create_batch_query = """
mutation CreateBatch($submissionID: ID!, $type: String!, $file: [FileInput]) {
  createBatch(submissionID: $submissionID, type: $type, files: $file) {
    _id
    files {
      fileName
      signedURL
    }
  }
}
"""

In [87]:
create_batch_variables = {"submissionID":submissionid, "type":type, "file":metadatafiles}
print(create_batch_variables)

{'submissionID': 'eda77bf5-37cd-4f3b-822e-cbceb31fb05c', 'type': 'metadata', 'file': [{'fileName': '/home/pihl/testdata/PDXNet_participant.tsv', 'size': 2106}, {'fileName': '/home/pihl/testdata/PDXNet_sample.tsv', 'size': 12416}]}


In [89]:
create_batch_res = apiQuery(dev2, create_batch_query, create_batch_variables,None)


The results from this mutation will have the signed URLs (again, for security reasons it's a good idea to not print them out).  We'll use these to upload the files.  Make sure that you're using the correct signed URL for each file.

In [90]:
batchid = create_batch_res['data']['createBatch']['_id']
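
One way to make sure each file is paired with its own signed URL is to index the createBatch results by file name. This is a sketch only: `signed_urls_by_file` is a hypothetical helper, and it assumes the response has the `data.createBatch.files` structure requested in the mutation above.

```python
def signed_urls_by_file(batch_response):
    """Map each fileName in a createBatch response to its signedURL."""
    files = batch_response["data"]["createBatch"]["files"]
    # Later entries would overwrite earlier ones if a fileName were repeated
    return {entry["fileName"]: entry["signedURL"] for entry in files}
```

A lookup such as `signed_urls_by_file(create_batch_res)[entry['fileName']]` then raises a `KeyError` if a file has no signed URL, instead of silently skipping it.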


In [91]:
def fileUpload(file, signedurl):
    try:
        # Open the file in a context manager so the handle is always closed
        with open(file, 'rb') as f:
            res = requests.post(signedurl, files={"file": f})
        if res.status_code == 200:
            return res.json()
        else:
            print(f"Error: {res.status_code}")
            return res.content
    except requests.exceptions.RequestException as e:
        # RequestException also covers connection errors and timeouts
        return f"HTTP Error: {e}"


In [95]:
def awsFileUpload(file, signedurl):
    #https://docs.aws.amazon.com/AmazonS3/latest/userguide/example_s3_Scenario_PresignedUrl_section.html
    try:
        # Read the whole file and send its contents as the request body
        with open(file, 'r') as f:
            filetext = f.read()
        res = requests.post(signedurl, data=filetext)
        if res.status_code == 200:
            return res.json()
        else:
            print(f"Error: {res.status_code}")
            return res.content
    except requests.exceptions.RequestException as e:
        # RequestException also covers connection errors and timeouts
        return f"HTTP error: {e}"

In [97]:
for entry in metadatafiles:
    for metadatafile in create_batch_res['data']['createBatch']['files']:
        if entry['fileName'] == metadatafile['fileName']:
            #metares = fileUpload(metadatafile['fileName'], metadatafile['signedURL'])
            metares = awsFileUpload(metadatafile['fileName'], metadatafile['signedURL'])
            print(metares)

Error: 403
b'<?xml version="1.0" encoding="UTF-8"?>\n<Error><Code>SignatureDoesNotMatch</Code><Message>The request signature we calculated does not match the signature you provided. Check your key and signing method.</Message><AWSAccessKeyId>ASIA3MJN7XTZQ3KX6GEA</AWSAccessKeyId><StringToSign>POST\n\n\n1724795560\nx-amz-acl:private\nx-amz-security-token:IQoJb3JpZ2luX2VjEAQaCXVzLWVhc3QtMSJGMEQCIDf/79DWRcoVjMYQRDXvfLQsKId2GGYA1Plicgz45IwNAiBJ+21D5gbDfgO3289B3QO4humSGZ9NakwY1e5L7PfEGyr5AwgdEAAaDDc4MjMxNzM3MDYxMSIMpYTbbC+zwfCgR+U+KtYDR4I3UUe9lMpHN8vYu8QyOsEuROqZmqmQ/iqp9UHKiYSuk42XfvcRWDf/ilq4Ms6atUDV2kWQg+Ow5BB2I7Qw0kLIlckMv3J31gmHFPNBjzso/0dpZ93oZUlQeOFgHBmn1GmCboO2fQv6SKOz/iYed5llPNdGffIHkJnZboD0xxcNSvCc5XVcB9nPdIBFwzUl/y67aRwyKUvvcG+MfgBpcVYsn0Fb66KCiFfJX9ZX8yRQzo4CyxUN/2nKnXmBP6/OQqFn/zo6P8mmyKXBFKrp6zEJjTMoUBFn0lw2Fq//E787OYG6ln5v8kuZqgX97KYqPWM5bu7MQpwnyUDII+zAkemSG9LHLOKnw1Ar+9FD2iBkEuAneqPVXbF8sbwBjKB0r3qc5xXhmZgpI5tZzWC7PNhr7Xkxu0BKNogq3JrSlsZZV1IXblSuakYwTHU++Ey0Su/Cvt1M3F+K2pIuE
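
The `StringToSign` in the error above begins with `POST`, the HTTP method the request used. An S3 presigned URL is valid only for the method it was signed for, so one possible cause is that the signed URLs are generated for `PUT` rather than `POST`. That is an assumption, not something confirmed here. The sketch below shows what a `PUT` upload could look like. `put_to_signed_url` is a hypothetical helper, and the optional `session` parameter exists only so the function can be exercised with a stand-in object.

```python
def put_to_signed_url(path, signed_url, session=None, timeout=60):
    """Send a file's bytes to a presigned URL with HTTP PUT.

    Assumes the URL was signed for PUT; returns (status_code, response_text).
    """
    if session is None:
        # Imported here so a stand-in session can be passed without requests
        import requests
        session = requests
    # Read as bytes so the body is sent exactly as stored on disk
    with open(path, "rb") as f:
        body = f.read()
    res = session.put(signed_url, data=body, timeout=timeout)
    # S3 usually returns an empty body on success, so don't parse it as JSON
    return res.status_code, res.text
```

If the signature also covers headers such as a content type, those headers would need to be sent with the same values that were signed.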

After files have been uploaded, the next step is to update the batch by calling the updateBatch mutation.