<a id='top'></a>
# Quick Links
[Introduction](#introduction)\
[Prerequisites](#prerequisites)\
[Security Note](#securityNote)\
[Step 1: Understanding the Landscape](#step1)\
[Step 2: Creating a new submission or using an existing submission](#step2)\
- [Creating a new submission](#newSubmission)
- [Working with existing submissions](#existingSubmission)

[Step 3: Uploading Submission templates](#step3)\
[Step 4: Running the Validations](#step4)\
[Step 5: Submitting, Canceling, or Withdrawing](#step5)


<a id='introduction'></a>
# Introduction
This notebook walks through the basics of using the Data Hub API to work on, validate, and submit your data.  These APIs are designed to allow users to perform all the actions that can be done via the [Data Submission Portal](https://hub.datacommons.cancer.gov/) from a notebook or script.  The intent is to allow submitters to operate directly from their own environments, if they so choose, rather than work through the graphical submission interface.

There are a few prerequisites that you have to meet before you can use this API:

<a id='prerequisites'></a>
# Prerequisites


## GraphQL
The Data Hub API uses [GraphQL](https://graphql.org/), and a good understanding of how to use GraphQL is required.  Since GraphQL can be complex, a tutorial is beyond the scope of this document; however, the [GraphQL Documentation](https://graphql.org/learn/) is a useful starting point.

## Login.gov account
Use of Data Hub in general requires that a user have an account registered with [Login.gov](https://www.login.gov/) (NIH users can use their NIH account and PIV card).  Note that a Login.gov account is distinct from an eRA Commons identity that is frequently used at NIH.  They are not the same thing.

## Approved Submission
You must receive approval to submit data to CRDC prior to using the Data Hub APIs.  If you need approval, please read and follow the [Submissions Request Instructions](https://datacommons.cancer.gov/submit).  Instructions for using the graphical data submission process are on the same page.

## An API Token
If you are an approved submitter with a Login.gov or NIH account, you can generate an API token from the graphical interface.  Log into the system, then click on your user name and select the **API Token** menu option.  This will bring up a dialog box that allows you to create an API token and copy it to your clipboard.  There are two things to note about API tokens:
- The token is tied to your user identity and can be used on any submission that you're approved to work on.
- You can have only one token at a time.  Generating a new token will revoke the previous token.


In [53]:
import requests
import os

The imports below are used only for display purposes in this notebook.  They're not required to interact with the Data Hub API.

In [54]:
import pandas as pd
from IPython.display import display, Markdown, Latex

API Endpoints

In [55]:
prod = 'https://hub.datacommons.cancer.gov/api/graphql'
#Note that use of Dev2 is for example purposes only.  Unless you are an approved tester, you should be using the production URL
dev2 = 'https://hub-dev2.datacommons.cancer.gov/api/graphql'

<a id='securityNote'></a>
# Security Note
It is ***highly*** recommended that you keep your API token secure and not include it in any code.  While there are many ways to do this, for the purposes of this notebook it's been set in an environment variable named "DEV2API".

In [56]:
dev2APIKey = os.environ['DEV2API']
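
If the environment variable is missing, the cell above fails with a bare `KeyError`.  A minimal sketch of a friendlier lookup is shown here; the helper name `load_api_token` is hypothetical and not part of the Data Hub API.

```python
import os

def load_api_token(var_name="DEV2API"):
    """Return the API token stored in the named environment variable."""
    # Treat an unset variable and an empty/whitespace-only value the same way
    token = os.environ.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"Environment variable {var_name} is not set. "
            "Generate a token in the graphical interface and export it first."
        )
    return token
```

Calling `load_api_token()` at the top of a script surfaces a missing token before any request is sent.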

In [61]:
def apiQuery(url, query, variables):
    # Send a GraphQL query or mutation to Data Hub and return the parsed JSON response
    token = os.environ['DEV2API']
    headers = {"Authorization": f"Bearer {token}"}
    # Only include "variables" in the payload when some were supplied
    payload = {"query": query}
    if variables is not None:
        payload["variables"] = variables
    try:
        result = requests.post(url=url, headers=headers, json=payload)
        if result.status_code == 200:
            return result.json()
        else:
            print(f"Error: {result.status_code}")
            return result.content
    except requests.exceptions.RequestException as e:
        # Covers connection failures, timeouts, and other transport-level errors
        return f"Request error: {e}"
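
GraphQL servers commonly report query problems in a top-level `errors` list even when the HTTP status is 200, so a successful call to `apiQuery` does not guarantee usable data.  The sketch below shows one way to check for that before indexing into the result; `unwrap_graphql` is a hypothetical helper name, not something Data Hub provides.

```python
def unwrap_graphql(response, field):
    """Return response['data'][field], raising if the server reported errors."""
    # apiQuery returns bytes or a string when the request itself failed
    if not isinstance(response, dict):
        raise RuntimeError(f"Unexpected non-JSON response: {response!r}")
    errors = response.get("errors")
    if errors:
        messages = "; ".join(
            e.get("message", str(e)) if isinstance(e, dict) else str(e)
            for e in errors
        )
        raise RuntimeError(f"GraphQL errors: {messages}")
    data = response.get("data") or {}
    if data.get(field) is None:
        raise RuntimeError(f"No '{field}' in response data")
    return data[field]
```

For example, `unwrap_graphql(apiQuery(dev2, some_query, None), "listSubmissions")` would either return the payload or stop with a readable message.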

<a id='step1'></a>
# Step 1: Understanding the landscape
Let's assume that this is our first submission using the API.  The first thing we need to do is list the studies that our organization is approved for so we can submit to the correct study.  That's done with the *listApprovedStudiesOfMyOrganization* query.

In [62]:
org_query = """
{
  listApprovedStudiesOfMyOrganization{
    originalOrg
    dbGaPID
    studyAbbreviation
    studyName
    _id
  }
}
"""

Note that the actual results returned by this query will vary for each organization.  These are example results only and shouldn't be used.

In [63]:
org_res = apiQuery(dev2, org_query, None)

In [64]:
org_df = pd.DataFrame(org_res['data']['listApprovedStudiesOfMyOrganization'])
display(Markdown(org_df.to_markdown()))

|    | originalOrg                                    | dbGaPID   | studyAbbreviation   | studyName                                                                                       | _id                                  |
|---:|:-----------------------------------------------|:----------|:--------------------|:------------------------------------------------------------------------------------------------|:-------------------------------------|
|  0 | Purdue Center for Cancer Research              |           | UBC01               | Antitumor Activity and Molecular Effects of Vemurafenib in Dogs with BRAF-mutant Bladder Cancer | b9e9ab79-d90b-4ec1-83b7-f83a5a75f5b5 |
|  1 | Comparative Molecular Characterization Program |           | OSA01               | A Multi-Platform Sequencing Analysis of Canine Appendicular Osteosarcoma                        | e3feefe9-cc70-4ae0-be06-9df7f29d84e8 |
|  2 | Comparative Molecular Characterization Program |           | TCL01               | Whole exome sequencing analysis of canine cancer cell lines                                     | 6c7fa436-efa3-42c6-af4c-7f5b70a1d35d |
|  3 | NCI BBRB                                       |           | CMB                 | Cancer Moonshot Biobank                                                                         | 4c2b6522-20b8-4841-8c7a-318b325c99b4 |
|  4 | CCDI                                           | phs003432 | TALLsc              | T-cell Acute Lymphoblastic Leukemia Single Cell RNA Sequencing and ATAC Sequencing              | 49a69fef-71f8-44e6-ad3b-f7a62d91e348 |

<a id='step2'></a>
# Step 2:  Creating a new submission or using an existing submission
The next step in the process is to either create a new submission or to use one of your existing submissions.  It is not necessary to create a new submission every time.  If you have an existing submission that you need to continue working on, simply start using that submission.

<a id='newSubmission'></a>
## Step 2, Alternate 1: Creating a new submission
For the purposes of this demonstration, we'll use the CCDI TALLsc study as the example.  In order to submit data, your first step is to create a new submission within the study.  **Do not do this if you're continuing with an existing submission**.

From the data we obtained in the first query, we'll have to parse out the information that's relevant to the CCDI TALLsc study.  We'll need these values to construct the query that creates the new submission.

In [65]:
# Pull out the fields for the CCDI TALLsc study
for study in org_res['data']['listApprovedStudiesOfMyOrganization']:
    if study['originalOrg'] == 'CCDI':
        org = study['originalOrg']
        dbgap = study['dbGaPID']
        abbrev = study['studyAbbreviation']
        study_name = study['studyName']  # kept separate from the submission name set below
        studyid = study['_id']

# Values for the new submission
dc = "CDS"
name = "Jupyter Demo 4"
intention = "New/Update"
datatype = "Metadata and Data Files"
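
Matching on the organization works here because CCDI has a single approved study in the example results.  As a sketch of a stricter lookup, the hypothetical helper below selects a study by its abbreviation and refuses to guess when there is no match or more than one.

```python
def find_study(studies, abbreviation):
    """Return the single study whose studyAbbreviation matches, or raise."""
    matches = [s for s in studies if s.get("studyAbbreviation") == abbreviation]
    if len(matches) != 1:
        raise ValueError(
            f"Expected exactly one study with abbreviation {abbreviation!r}, "
            f"found {len(matches)}"
        )
    return matches[0]

# Two rows copied from the example results table above
example_studies = [
    {"originalOrg": "NCI BBRB", "dbGaPID": "", "studyAbbreviation": "CMB",
     "_id": "4c2b6522-20b8-4841-8c7a-318b325c99b4"},
    {"originalOrg": "CCDI", "dbGaPID": "phs003432", "studyAbbreviation": "TALLsc",
     "_id": "49a69fef-71f8-44e6-ad3b-f7a62d91e348"},
]
tallsc = find_study(example_studies, "TALLsc")
```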


### createSubmission mutation

Creating submissions requires the use of a mutation that calls createSubmission.  There are multiple required variables that have to be provided in a GraphQL compatible way:
- studyID:  This is the Study ID that can be obtained from the graphical interface
- dbGaPID: Obtained when registering the study at dbGaP.  This is required for all controlled access studies
- dataCommons: This is the CRDC Data Commons the submissions will be deposited in
- name: This can be anything that allows you to identify this specific submission
- intention: Can be “New/Update” if you are adding information to the submission or  “Delete” if you are removing information from the submission
- dataType: Can be either "Metadata and Data Files" or “Metadata Only”.  Which one is selected depends on whether or not data files will be included in the submission

This query will return the _id field which will be the newly created submission ID. It will also return a number of other fields that can be checked to make sure the submission was created properly.
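
Because *intention* and *dataType* only accept the values listed above, it can help to check them before sending the mutation.  This is a minimal sketch; `build_submission_variables` is a hypothetical helper, and the allowed values are taken only from the list in this notebook.

```python
ALLOWED_INTENTIONS = ("New/Update", "Delete")
ALLOWED_DATA_TYPES = ("Metadata and Data Files", "Metadata Only")

def build_submission_variables(study_id, dbgap_id, data_commons, name,
                               intention, data_type):
    """Assemble the variables dictionary for the createSubmission mutation."""
    if intention not in ALLOWED_INTENTIONS:
        raise ValueError(f"intention must be one of {ALLOWED_INTENTIONS}")
    if data_type not in ALLOWED_DATA_TYPES:
        raise ValueError(f"dataType must be one of {ALLOWED_DATA_TYPES}")
    return {
        "studyID": study_id,
        "dbGaPID": dbgap_id,
        "dataCommons": data_commons,
        "name": name,
        "intention": intention,
        "dataType": data_type,
    }
```

The returned dictionary has the same keys as the variables defined in the mutation, so it can be passed directly as the `variables` argument of `apiQuery`.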

In [66]:
create_submission_query = """
mutation CreateNewSubmission(
  $studyID: String!,
  $dbGaPID: String!,
  $dataCommons: String!,
  $name: String!,
  $intention:String!,
  $dataType: String!,
){
  createSubmission(
    studyID: $studyID,
    dbGaPID: $dbGaPID,
    dataCommons: $dataCommons,
    name: $name,
    intention: $intention,
    dataType: $dataType
  ){
    _id
    studyID
    dbGaPID
    dataCommons
    name
    intention
    dataType
    status
  }
}"""

In [67]:
variables = {"studyID":studyid, "dbGaPID":dbgap, "dataCommons":dc, "name":name, "intention":intention,"dataType":datatype}

In [68]:
create_res = apiQuery(dev2,create_submission_query, variables)

Parse out the submission ID since we'll need it later

In [69]:
submissionid = create_res['data']['createSubmission']["_id"]
subname = create_res['data']['createSubmission']['name']

#### Side trip

At this point if you go to the graphical interface you should see that a new submission has been created using the name provided in the query

<a id='existingSubmission'></a>
## Step 2, Alternate 2: Working with existing submissions
If you already have submissions in Data Hub that you've been working with, you can continue to work with them instead of creating a new submission.  To continue work on a submission, you will first have to identify the submissions using the *listSubmissions* query.

The listSubmissions query requires that **status** be provided as a parameter.  The status can be one of:
- All
- New
- In Progress
- Submitted
- Released
- Completed
- Archived
- Canceled
- Rejected
- Withdrawn
- Deleted

Details about what each of these states means can be found in the Submission Documentation.  For most submitters, the important states are **New**, **In Progress**, and **Submitted** as those will be the states that allow work to be done on the submission.

This allows for queries to bring back information about a specific state, but for the purposes of the demonstration, we'll use "All" to bring back everything.  We'll also return some additional information about each submission so we can identify the ones we want to work with.

For long lists, the *listSubmissions* query also allows the results to be ordered by a chosen field with the **orderBy** field, and sorted in ascending or descending order with the **sortDirection** field.  Additional fields for this query can be found in the API documentation.

In [70]:
list_sub_query = """
    query ListSubmissions($status:String!){
          listSubmissions(status: $status){
            submissions{
              _id
              name
              status
              studyAbbreviation
              studyID
            }
          }
    }
"""

In [71]:
statusvariables = {"status":"All"}

In [72]:
list_sub_res = apiQuery(dev2, list_sub_query, statusvariables)

In [73]:
submissions_df = pd.DataFrame(list_sub_res['data']['listSubmissions']['submissions'])
display(Markdown(submissions_df.to_markdown()))

|    | _id                                  | name                           | status      | studyAbbreviation   | studyID                              |
|---:|:-------------------------------------|:-------------------------------|:------------|:--------------------|:-------------------------------------|
|  0 | 8a71b104-43d2-44d7-9693-d61f93e190a0 | Jupyter Demo 4                 | New         | TALLsc              | 49a69fef-71f8-44e6-ad3b-f7a62d91e348 |
|  1 | 162fb91f-75a8-4994-86e7-8df189ebc476 | Jupyter Demo 3                 | In Progress | TALLsc              | 49a69fef-71f8-44e6-ad3b-f7a62d91e348 |
|  2 | f3eb4e0d-872c-4cbe-a758-0a2df9a1200d | Jupyter Demo 2                 | In Progress | TALLsc              | 49a69fef-71f8-44e6-ad3b-f7a62d91e348 |
|  3 | 02862615-84b7-4815-becf-97a8593bf629 | Jupyter Demo 1                 | In Progress | TALLsc              | 49a69fef-71f8-44e6-ad3b-f7a62d91e348 |
|  4 | eda77bf5-37cd-4f3b-822e-cbceb31fb05c | Demo create submission Jupyter | Canceled    | TALLsc              | 49a69fef-71f8-44e6-ad3b-f7a62d91e348 |
|  5 | 04bd7dad-0859-49aa-8df1-5e6560e5482a | Demo create submission 1       | New         | TALLsc              | 49a69fef-71f8-44e6-ad3b-f7a62d91e348 |
|  6 | d77df872-384f-493f-b18f-449ed6fa7fdb | Demo create submission Jupyter | In Progress | TALLsc              | 49a69fef-71f8-44e6-ad3b-f7a62d91e348 |
|  7 | f41aea9c-bb76-4b48-8b53-27028317b434 | Demo create submission Jupyter | In Progress | TALLsc              | 49a69fef-71f8-44e6-ad3b-f7a62d91e348 |
|  8 | 107ba083-f107-4a2f-a848-824bb8746a01 | Demo create submission 1       | New         | TALLsc              | 49a69fef-71f8-44e6-ad3b-f7a62d91e348 |
|  9 | 181432cd-e915-46ff-b62e-1f167abb7e2f | API Demonstration              | New         | CMB                 | 4c2b6522-20b8-4841-8c7a-318b325c99b4 |

Since we're working with the TALLsc study, we need to work on one of the submissions for that study.

In [74]:
for submission in list_sub_res['data']['listSubmissions']['submissions']:
    if submission['name'] == 'Jupyter Demo 4':
        submissionid = submission['_id']
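
Submission names are not guaranteed to be unique; the example table lists "Demo create submission Jupyter" three times.  The sketch below, using a hypothetical `find_submission_id` helper, raises an error when a name is ambiguous instead of silently keeping the last match.

```python
def find_submission_id(submissions, name, status=None):
    """Return the _id of the one submission matching name (and status, if given)."""
    matches = [
        s for s in submissions
        if s.get("name") == name and (status is None or s.get("status") == status)
    ]
    if len(matches) != 1:
        raise ValueError(f"Expected one submission named {name!r}, found {len(matches)}")
    return matches[0]["_id"]

# Rows copied from the example table above
example_submissions = [
    {"_id": "8a71b104-43d2-44d7-9693-d61f93e190a0", "name": "Jupyter Demo 4", "status": "New"},
    {"_id": "eda77bf5-37cd-4f3b-822e-cbceb31fb05c", "name": "Demo create submission Jupyter", "status": "Canceled"},
    {"_id": "d77df872-384f-493f-b18f-449ed6fa7fdb", "name": "Demo create submission Jupyter", "status": "In Progress"},
]
demo4_id = find_submission_id(example_submissions, "Jupyter Demo 4")
```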

<a id='step3'></a>
# Step 3: Uploading Submission templates
Once the submission is created, the next step is to start uploading metadata submission templates.  There are two ways of accomplishing this upload:
1) Using the Upload CLI Tool : This is generally the easiest method and can be used to upload both the metadata templates and the data files.  The use of the Uploader CLI Tool [is documented elsewhere](https://github.com/CBIIT/crdc-datahub-cli-uploader/tree/master).
2) Using the API : If you wish to provide metadata only via a program, the API can be used as will be demonstrated in this notebook.

**Note that while the API can be used to upload metadata, the actual data files MUST be uploaded with the Upload CLI Tool**

## Collecting information about the metadata files to upload
Let's set up the list of metadata files we want to upload.  This will be a list of **FileInput** objects.  A FileInput object consists of a dictionary with *fileName* and *size* as the keys.

- fileName: The full path file name.  Note that this will vary depending on the operating system being used.
- size: The size of the file in bytes

The last field required for the query is the *type* field, which is either "metadata" or "data file".  Since "data file" isn't allowed outside of the Upload CLI Tool, we'll set it to "metadata".

In [75]:
metadatafiles = [{"fileName":"PDXNet_participant.tsv", "size": 2106 }, {"fileName":"PDXNet_sample.tsv", "size":12416},
                 {"fileName":"PDXNet_diagnosis.tsv", "size":6439},{"fileName":"PDXNet_file.tsv", "size":76940},
                 {"fileName":"PDXNet_genomic_info.tsv", "size":283886},{"fileName":"PDXNet_image.tsv", "size":3671},
                {"fileName":"PDXNet_program.tsv", "size":307},{"fileName":"PDXNet_study.tsv", "size":2171},
                 {"fileName":"PDXNet_treatment.tsv", "size":112}]
type = "metadata"  # note: this name shadows Python's built-in type() for the rest of the notebook
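
Typing file sizes by hand is error prone.  As a sketch, the hypothetical helper below builds the same list of FileInput dictionaries by reading each size from disk with `os.path.getsize`; it assumes all files live in a single directory.

```python
import os

def build_file_inputs(datadir, filenames):
    """Return a list of {"fileName", "size"} dictionaries for files in datadir."""
    file_inputs = []
    for filename in filenames:
        path = os.path.join(datadir, filename)
        # getsize reports the size in bytes, which is what the size key expects
        file_inputs.append({"fileName": filename, "size": os.path.getsize(path)})
    return file_inputs
```

`os.path.getsize` raises `OSError` for a missing file, so a typo in a file name is caught before the batch is created.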

## The createBatch mutation
Now that we've got credentials and the list of files, we create a "batch", which is the term for one or more files uploaded at the same time.  We do this by using the *createBatch* mutation as shown below.

One of the critical pieces of information returned is the signed URL that is used to actually transfer the files to Data Hub.

In [76]:
create_batch_query = """
mutation CreateBatch(
    $submissionID: ID!, 
    $type: String!, 
    $file: [FileInput]) {
  createBatch(submissionID: $submissionID, type: $type, files: $file) {
    _id
    files {
      fileName
      signedURL
    }
  }
}
"""

In [77]:
create_batch_variables = {"submissionID":submissionid, "type":type, "file":metadatafiles}

In [78]:
create_batch_res = apiQuery(dev2, create_batch_query, create_batch_variables)

The results from this mutation will have the signed URLs (again, for security reasons it's a good idea to not print them out).  We'll use these to upload the files.  Make sure that you're using the correct signed URL for each file.  We'll also need the batch ID, so that should be parsed out.

In [79]:
batchid = create_batch_res['data']['createBatch']['_id']

In [80]:
def awsFileUpload(file, signedurl, datadir):
    # Upload a single file to its pre-signed S3 URL with an HTTP PUT
    #https://docs.aws.amazon.com/AmazonS3/latest/userguide/example_s3_Scenario_PresignedUrl_section.html
    headers = {'Content-Type': 'text/tab-separated-values'}
    fullFileName = os.path.join(datadir, file)
    with open(fullFileName, 'rb') as f:
        filetext = f.read()
    res = requests.put(signedurl, data=filetext, headers=headers)
    if res.status_code != 200:
        print(f"Error: {res.status_code}")
    # Always return the Response object so the caller can check res.status_code
    return res

In [86]:
def processFilesForUpload(metadatafiles, datadir, batch_creation_results):
    # Loops through the list of metadata files and uploads each one with awsFileUpload,
    # using the signed URL returned by createBatch.  Returns a list of UploadResult objects.
    file_upload_result = []
    for entry in metadatafiles:
        # Use the results passed in as a parameter rather than a notebook-level variable
        for metadatafile in batch_creation_results['data']['createBatch']['files']:
            if entry['fileName'] == metadatafile['fileName']:
                metares = awsFileUpload(metadatafile['fileName'], metadatafile['signedURL'], datadir)
                # Anything other than an HTTP 200 response counts as a failed upload
                succeeded = getattr(metares, 'status_code', None) == 200
                file_upload_result.append({'fileName':entry['fileName'], 'succeeded': succeeded, 'errors':[], 'skipped':False})
    return file_upload_result

As each file is uploaded, an *UploadResult* object has to be constructed.  This will get used in the batch update step.

In [87]:
datadir = "/home/pihl/testdata/"
file_upload_result = processFilesForUpload(metadatafiles, datadir, create_batch_res)
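
Since each file must go to its own signed URL, it can be worth confirming that every requested file actually received one before uploading.  The sketch below pairs them through a dictionary lookup; `pair_files_with_urls` is a hypothetical helper, and its second argument is assumed to be the `files` list returned by createBatch.

```python
def pair_files_with_urls(metadatafiles, batch_files):
    """Return (fileName, signedURL) pairs in the order of metadatafiles."""
    url_by_name = {f["fileName"]: f["signedURL"] for f in batch_files}
    missing = [m["fileName"] for m in metadatafiles if m["fileName"] not in url_by_name]
    if missing:
        raise KeyError(f"No signed URL returned for: {', '.join(missing)}")
    return [(m["fileName"], url_by_name[m["fileName"]]) for m in metadatafiles]
```

A call such as `pair_files_with_urls(metadatafiles, create_batch_res['data']['createBatch']['files'])` would fail early if the batch response omitted a file.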

After files have been uploaded, the next step is to update the batch by calling the *updateBatch* mutation

In [90]:
update_batch_query = """
    mutation UpdateBatch(
        $batchID: ID!
        $files: [UploadResult]
        ){
        updateBatch(batchID:$batchID, files:$files){
            _id
            displayID
        }
        }
"""

In [91]:
update_variables = {'batchID':batchid, 'files':file_upload_result}

In [92]:
update_res = apiQuery(dev2, update_batch_query, update_variables)

#### Side Trip
If you log into the Data Hub interface, at this point you should see the files that have been uploaded along with any errors that were detected.

### Checking the upload
Before going any further, it's a good idea to make sure that the upload went as expected.  The best way to check for upload errors is with the *listBatches* query.  Since this returns all of the batches in a submission, you'll have to do a little parsing to see if there are any issues with the batch you just sent.

In [93]:
list_batches_query = """
query ListBatches($submissionID: ID!) {
  listBatches(submissionID: $submissionID) {
    batches {
      _id
      submissionID
      displayID
      type
      fileCount
      files {
        fileName
      }
      status
      errors
    }
  }
}
"""

In [94]:
batches_variables = {'submissionID':submissionid}

In [95]:
batch_error_res = apiQuery(dev2, list_batches_query, batches_variables)

In [96]:
batch_df = pd.DataFrame(batch_error_res['data']['listBatches']['batches'])
display(Markdown(batch_df.to_markdown()))

|    | _id                                  | submissionID                         |   displayID | type     |   fileCount | files                                                                                                                                                                                                                                                                                                                                     | status   | errors                                                                                                                                                                                                                                                                                                      |
|---:|:-------------------------------------|:-------------------------------------|------------:|:---------|------------:|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:---------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|  0 | d3405c53-ff40-4ece-b1dd-a36f9f9e7a29 | 8a71b104-43d2-44d7-9693-d61f93e190a0 |           1 | metadata |           9 | [{'fileName': 'PDXNet_participant.tsv'}, {'fileName': 'PDXNet_sample.tsv'}, {'fileName': 'PDXNet_diagnosis.tsv'}, {'fileName': 'PDXNet_file.tsv'}, {'fileName': 'PDXNet_genomic_info.tsv'}, {'fileName': 'PDXNet_image.tsv'}, {'fileName': 'PDXNet_program.tsv'}, {'fileName': 'PDXNet_study.tsv'}, {'fileName': 'PDXNet_treatment.tsv'}] | Failed   | ['“PDXNet_sample.tsv: 38”: conflict data detected: “sample_type”: "RNA".', '“PDXNet_sample.tsv: 74”: conflict data detected: “sample_type”: "DNA".', '“PDXNet_image.tsv:2”:  Key property “study_link_id” value is required.', '“PDXNet_treatment.tsv:2”:  Key property “treatment_id” value is required.'] |

Clearly there were some issues with the sample file that have to be corrected.  From the error message, it looks like the same sample has been given different sample_types.  While this almost certainly reflects how the project views its samples, it creates a conflict that Data Hub cannot resolve, so sample identifiers have to be unique.  Additionally, there is a missing study_link_id and a missing treatment_id that need to be supplied.  Additional information about what is required in those fields can be found in the repository's data dictionary, available at the Data Model Viewer in Data Hub.
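
Rather than reading the table by eye, the batch that was just sent can be picked out of the *listBatches* results by its ID.  This is a minimal sketch with a hypothetical `summarize_batch` helper; it relies only on the `_id`, `status`, and `errors` fields requested in the query above.

```python
def summarize_batch(batches, batch_id):
    """Return (status, errors) for the batch with the given _id."""
    for batch in batches:
        if batch.get("_id") == batch_id:
            # errors may be missing or None when nothing went wrong
            return batch.get("status"), list(batch.get("errors") or [])
    raise LookupError(f"Batch {batch_id} not found in this submission")
```

With the notebook's variables this would be called as `summarize_batch(batch_error_res['data']['listBatches']['batches'], batchid)`.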

So to correct the previous errors, we'll make new sample IDs that are unique, and we'll drop the image and treatment files for this demonstration.

In [97]:
new_metadatafiles = [{"fileName":"PDXNet_participant.tsv", "size": 2106 }, {"fileName":"PDXNet_sampleFIXED.tsv", "size":12881},
                 {"fileName":"PDXNet_diagnosis.tsv", "size":6439},{"fileName":"PDXNet_file.tsv", "size":76940},
                 {"fileName":"PDXNet_genomic_info.tsv", "size":283886},{"fileName":"PDXNet_program.tsv", "size":307},
                 {"fileName":"PDXNet_study.tsv", "size":2171}]

With that in place, we'll go through the same steps to add the files:

1. Create a new batch and grab the batch ID

In [98]:
create_batch_variables = {"submissionID":submissionid, "type":type, "file":new_metadatafiles}
create_batch_res = apiQuery(dev2, create_batch_query, create_batch_variables)
batchid = create_batch_res['data']['createBatch']['_id']

2. Upload the files using the pre-signed URLs

In [99]:
file_upload_result = processFilesForUpload(new_metadatafiles, datadir, create_batch_res)

3. Update the batch

In [101]:
update_variables = {'batchID':batchid, 'files':file_upload_result}
update_res = apiQuery(dev2, update_batch_query, update_variables)

And lastly, check the batch for errors

In [102]:
batch_error_res = apiQuery(dev2, list_batches_query, batches_variables)
batch_df = pd.DataFrame(batch_error_res['data']['listBatches']['batches'])
display(Markdown(batch_df.to_markdown()))

|    | _id                                  | submissionID                         |   displayID | type     |   fileCount | files                                                                                                                                                                                                                                                                                                                                     | status    | errors                                                                                                                                                                                                                                                                                                      |
|---:|:-------------------------------------|:-------------------------------------|------------:|:---------|------------:|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:----------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|  0 | a0722e54-cf25-4627-ac34-1a05a7149b88 | 8a71b104-43d2-44d7-9693-d61f93e190a0 |           2 | metadata |           7 | [{'fileName': 'PDXNet_participant.tsv'}, {'fileName': 'PDXNet_sampleFIXED.tsv'}, {'fileName': 'PDXNet_diagnosis.tsv'}, {'fileName': 'PDXNet_file.tsv'}, {'fileName': 'PDXNet_genomic_info.tsv'}, {'fileName': 'PDXNet_program.tsv'}, {'fileName': 'PDXNet_study.tsv'}]                                                                    | Uploading |                                                                                                                                                                                                                                                                                                             |
|  1 | d3405c53-ff40-4ece-b1dd-a36f9f9e7a29 | 8a71b104-43d2-44d7-9693-d61f93e190a0 |           1 | metadata |           9 | [{'fileName': 'PDXNet_participant.tsv'}, {'fileName': 'PDXNet_sample.tsv'}, {'fileName': 'PDXNet_diagnosis.tsv'}, {'fileName': 'PDXNet_file.tsv'}, {'fileName': 'PDXNet_genomic_info.tsv'}, {'fileName': 'PDXNet_image.tsv'}, {'fileName': 'PDXNet_program.tsv'}, {'fileName': 'PDXNet_study.tsv'}, {'fileName': 'PDXNet_treatment.tsv'}] | Failed    | ['“PDXNet_sample.tsv: 38”: conflict data detected: “sample_type”: "RNA".', '“PDXNet_sample.tsv: 74”: conflict data detected: “sample_type”: "DNA".', '“PDXNet_image.tsv:2”:  Key property “study_link_id” value is required.', '“PDXNet_treatment.tsv:2”:  Key property “treatment_id” value is required.'] |
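
In the table above the new batch is still in the "Uploading" state, so its final result is not known yet.  One option is to re-run the check until the status changes.  The sketch below assumes that any status other than "Uploading" is final, which this notebook does not confirm; `wait_for_batch` and its `fetch_status` callable are hypothetical names.

```python
import time

def wait_for_batch(fetch_status, batch_id, interval=5.0, max_checks=12, sleep=time.sleep):
    """Poll fetch_status(batch_id) until it returns something other than "Uploading"."""
    for attempt in range(max_checks):
        status = fetch_status(batch_id)
        if status != "Uploading":
            return status
        # Do not sleep after the final check
        if attempt < max_checks - 1:
            sleep(interval)
    raise TimeoutError(f"Batch {batch_id} still uploading after {max_checks} checks")
```

Here `fetch_status` would be a small function that runs the *listBatches* query and returns the status string for the given batch ID.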

#### Side Trip

If you log into the Submission Portal, you should see that all files have uploaded and passed.

<a id='step4'></a>
# Step 4: Running the Validations
Once you have either metadata templates or data files successfully uploaded to the Submission Portal, you can start running validations.  Validations can be run at any time; you don't have to complete all uploads before running validations.  However, if you do run validations on incomplete submissions, you will see errors relating to the missing information.

It's important to remember that validations are run against everything in the submission, not just against a specific file, or subset of files.

Validations are triggered by running the *validateSubmission* mutation, which requires the submission ID, the types of validation to run, and the scope of the validation.
#### Types
- "metadata" - run the validations for the uploaded metadata files
- "data file" - run the validations for the uploaded data files
- Note that both values can be used in a single validation run

#### Scope
- "New" - Run validations only against newly uploaded files.  Any files that have previously been validated will be ignored.
- "All" - Run validations against all the files, both new and previously uploaded.
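
The type and scope values above can be checked before the mutation is sent.  This sketch uses a hypothetical `build_validation_variables` helper and only the values listed in this section; it always sends *types* as a list, matching the `[String]` declaration in the mutation.

```python
VALIDATION_TYPES = ("metadata", "data file")
VALIDATION_SCOPES = ("New", "All")

def build_validation_variables(submission_id, types, scope):
    """Assemble the variables dictionary for the validateSubmission mutation."""
    # Accept a single type as a plain string for convenience
    if isinstance(types, str):
        types = [types]
    types = list(types)
    unknown = [t for t in types if t not in VALIDATION_TYPES]
    if not types or unknown:
        raise ValueError(f"types must be drawn from {VALIDATION_TYPES}")
    if scope not in VALIDATION_SCOPES:
        raise ValueError(f"scope must be one of {VALIDATION_SCOPES}")
    return {"id": submission_id, "types": types, "scope": scope}
```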

In [103]:
run_validation_query = """
    mutation ValidateSubmission(
  $id: ID!
  $types: [String]
  $scope: String
){
  validateSubmission(_id: $id, types: $types, scope: $scope){
    success
    message
  }
}
"""

In [104]:
validation_variables = {"id":submissionid, "types":["metadata"], "scope":"All"}  # types is declared as a list in the mutation

In [105]:
validation_res = apiQuery(dev2, run_validation_query, validation_variables)
print(validation_res['data']['validateSubmission']['success'])

True


While this validation came back as successful, not all validations will.  For unsuccessful validations, a listing of the errors encountered is retrieved by running the *submissionQCResults* query, which will return all of the validation errors found in a submission.  There are a number of different options that can be provided as part of this query to fine-tune the results.  Be sure to consult the API Documentation for a listing of all the parameters.

For the purposes of this example, we'll just use two:  **_id** which is the submission ID and will pull back all results for the entire submission, and **severities** which can be set to one of the following:

- **All** - Return all errors regardless of severity
- **Error** - Return only Error level errors.  These will block submission of the study.
- **Warnings** - Return only Warning level errors.  Warnings will not block submission, however they should be corrected if possible.

In [106]:
qc_check_query = """
query GetQCResults(
  $id: ID!
  $severities: String
){
  submissionQCResults(_id:$id, severities:$severities){
    results{
      submissionID
      severity
      type
      errors{
        title
        description
      }
      warnings{
        title
        description
      }
    }
  }
}
"""

In [107]:
qc_variables = {"id":submissionid, "severities":"All"}

In [108]:
qc_res = apiQuery(dev2, qc_check_query, qc_variables)

In [109]:
qc_df = pd.DataFrame(qc_res['data']['submissionQCResults']['results'])
display(Markdown(qc_df.to_markdown()))

|    | submissionID                         | severity   | type   | errors                                                                                                                                                                                                                                                                                                                                                                                                     | warnings                                                                                                                                                       |
|---:|:-------------------------------------|:-----------|:-------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------|
|  0 | 8a71b104-43d2-44d7-9693-d61f93e190a0 | Error      | study  | [{'title': 'Missing required property', 'description': '[PDXNet_study.tsv: line 2] Required property "file_types_and_format" is empty.'}, {'title': 'Missing required property', 'description': '[PDXNet_study.tsv: line 2] Required property "study_access" is empty.'}, {'title': 'Missing required property', 'description': '[PDXNet_study.tsv: line 2] Required property "study_version" is empty.'}] | [{'title': 'Updating existing data', 'description': '[PDXNet_study.tsv: line 2] “study”: {“phs_accession": “phs002479"} already exists and will be updated.'}] |
|  1 | 8a71b104-43d2-44d7-9693-d61f93e190a0 | Error      | sample | [{'title': 'Value not permitted', 'description': '[PDXNet_sampleFIXED.tsv: line 93] "DNA" is not a permissible value for property “sample_type”.'}]                                                                                                                                                                                                                                                        | []                                                                                                                                                             |
|  2 | 8a71b104-43d2-44d7-9693-d61f93e190a0 | Error      | sample | [{'title': 'Value not permitted', 'description': '[PDXNet_sampleFIXED.tsv: line 133] "DNA" is not a permissible value for property “sample_type”.'}]                                                                                                                                                                                                                                                       | []                                                                                                                                                             |
|  3 | 8a71b104-43d2-44d7-9693-d61f93e190a0 | Error      | sample | [{'title': 'Value not permitted', 'description': '[PDXNet_sampleFIXED.tsv: line 142] "DNA" is not a permissible value for property “sample_type”.'}]                                                                                                                                                                                                                                                       | []                                                                                                                                                             |
|  4 | 8a71b104-43d2-44d7-9693-d61f93e190a0 | Error      | sample | [{'title': 'Value not permitted', 'description': '[PDXNet_sampleFIXED.tsv: line 139] "RNA" is not a permissible value for property “sample_type”.'}]                                                                                                                                                                                                                                                       | []                                                                                                                                                             |
|  5 | 8a71b104-43d2-44d7-9693-d61f93e190a0 | Error      | sample | [{'title': 'Value not permitted', 'description': '[PDXNet_sampleFIXED.tsv: line 141] "DNA" is not a permissible value for property “sample_type”.'}]                                                                                                                                                                                                                                                       | []                                                                                                                                                             |
|  6 | 8a71b104-43d2-44d7-9693-d61f93e190a0 | Error      | sample | [{'title': 'Value not permitted', 'description': '[PDXNet_sampleFIXED.tsv: line 138] "RNA" is not a permissible value for property “sample_type”.'}]                                                                                                                                                                                                                                                       | []                                                                                                                                                             |
|  7 | 8a71b104-43d2-44d7-9693-d61f93e190a0 | Error      | sample | [{'title': 'Value not permitted', 'description': '[PDXNet_sampleFIXED.tsv: line 92] "DNA" is not a permissible value for property “sample_type”.'}]                                                                                                                                                                                                                                                        | []                                                                                                                                                             |
|  8 | 8a71b104-43d2-44d7-9693-d61f93e190a0 | Error      | sample | [{'title': 'Value not permitted', 'description': '[PDXNet_sampleFIXED.tsv: line 56] "RNA" is not a permissible value for property “sample_type”.'}]                                                                                                                                                                                                                                                        | []                                                                                                                                                             |
|  9 | 8a71b104-43d2-44d7-9693-d61f93e190a0 | Error      | sample | [{'title': 'Value not permitted', 'description': '[PDXNet_sampleFIXED.tsv: line 9] "DNA" is not a permissible value for property “sample_type”.'}]                                                                                                                                                                                                                                                         | []                                                                                                                                                             |
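
Because the `errors` and `warnings` columns each hold a list of messages, the table above can be hard to scan.  The following is a minimal sketch, not part of the API, that flattens the results into one row per message and counts the blocking errors.  The helper name `flatten_qc_results` and the sample records are illustrative assumptions; only the field names are taken from the query above.

```python
def flatten_qc_results(results):
    """Return one dict per individual error or warning message."""
    rows = []
    for record in results:
        # Each record carries two lists of {"title", "description"} messages.
        for level, key in (("Error", "errors"), ("Warning", "warnings")):
            for message in record.get(key) or []:
                rows.append({
                    "submissionID": record.get("submissionID"),
                    "type": record.get("type"),
                    "level": level,
                    "title": message.get("title"),
                    "description": message.get("description"),
                })
    return rows


# Illustrative records shaped like the results above (not real API output).
sample_results = [
    {"submissionID": "example-id", "severity": "Error", "type": "study",
     "errors": [{"title": "Missing required property",
                 "description": "Required property is empty."}],
     "warnings": [{"title": "Updating existing data",
                   "description": "Record already exists and will be updated."}]},
    {"submissionID": "example-id", "severity": "Error", "type": "sample",
     "errors": [{"title": "Value not permitted",
                 "description": "Value is not permissible."}],
     "warnings": []},
]

flat_rows = flatten_qc_results(sample_results)
blocking_count = sum(1 for row in flat_rows if row["level"] == "Error")
print(f"{len(flat_rows)} messages, {blocking_count} blocking")
```

In the notebook, the same helper could be called on `qc_res['data']['submissionQCResults']['results']` and the returned list passed to `pd.DataFrame` for display.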

#### Side Trip
As with other results, these can also be viewed in the Data Submission portal graphical interface.

<a id='step5'></a>
# Step 5:  Submitting, Canceling, or Withdrawing
Technically, the last step of this process is the submission to CRDC; however, the same mutation is also used to cancel or withdraw a submission.  Let's quickly go over what each of those means:

- **Submit** : Once all of the validation errors have been corrected and the validation results are either completely clean or only have warnings, the study is ready to be submitted.  Sending a submit request will hand over control of the files and data to the CRDC Data Team for final checks.  Note that once you submit a submission, no further edits are allowed.
  
- **Cancel** : If you want to abandon a submission *that has not been submitted to CRDC yet*, sending a cancellation request will lock the submission and withdraw it from the system.  Work is not allowed on canceled submissions, so be sure that you want to cancel before you issue this request.
  
- **Withdraw** : Withdraw is similar to cancel, except that it is used on submissions that have already been submitted to CRDC.  If you find that a study was submitted before everything was complete, or if other errors are found that necessitate stopping the submission process, sending a **Withdraw** request will prevent the release of the submitted data to the data commons and return the submission to its previous, unsubmitted state.
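
Since one request type covers all three actions, only the `action` value in the variables changes.  The sketch below builds that variables dictionary and rejects unexpected action names.  Note the assumptions: only `"Submit"` appears in this notebook's example, so the `"Cancel"` and `"Withdraw"` spellings are inferred from the list above and should be checked against the API documentation; `build_action_variables` is a hypothetical helper, not part of the API.

```python
# "Submit" is used in this notebook; "Cancel" and "Withdraw" are assumed
# spellings based on the action names described above.
ASSUMED_ACTIONS = ("Submit", "Cancel", "Withdraw")


def build_action_variables(submission_id, action, comment=None):
    """Build the variables dictionary for a submission action request."""
    if action not in ASSUMED_ACTIONS:
        raise ValueError(
            f"Unknown action {action!r}; expected one of {ASSUMED_ACTIONS}"
        )
    variables = {"id": submission_id, "action": action}
    # The comment is optional, so only include it when one was given.
    if comment is not None:
        variables["comment"] = comment
    return variables


print(build_action_variables("example-id", "Submit", "Example submission"))
```

Validating the action name locally catches typos before a request that cannot be undone, such as a cancellation, is sent.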



In [110]:
submission_query = """
mutation Submit(
    $id: ID!
    $action: String!
    $comment: String
){
    submissionAction(submissionID: $id, action: $action, comment: $comment){
        name
        submitterID
        submitterName
        dataCommons
        modelVersion
        studyAbbreviation
        dbGaPID
        status
    }
}
"""

In [111]:
submission_variables = {"id":submissionid, "action": "Submit", "comment":"Example submission"}

In [112]:
submission_res = apiQuery(dev2, submission_query, submission_variables)
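
It is worth confirming that the action succeeded before moving on.  The sketch below assumes that `apiQuery` returns the parsed JSON response as a dictionary (as the earlier `qc_res['data']` access suggests) and that failures are reported under the standard GraphQL `errors` key.  The helper name and the sample responses, including the `"Submitted"` status value, are illustrative placeholders rather than documented API output.

```python
def summarize_action_response(response):
    """Return the submission status, or raise if the request failed."""
    errors = response.get("errors")
    if errors:
        messages = "; ".join(
            str(err.get("message", err)) if isinstance(err, dict) else str(err)
            for err in errors
        )
        raise RuntimeError(f"submissionAction failed: {messages}")
    action_result = (response.get("data") or {}).get("submissionAction")
    if action_result is None:
        raise RuntimeError("Response contained no submissionAction data")
    return action_result.get("status")


# Illustrative responses only; real status values may differ.
ok_response = {"data": {"submissionAction": {"name": "Example", "status": "Submitted"}}}
failed_response = {"errors": [{"message": "Example failure"}], "data": None}

print(summarize_action_response(ok_response))
```

In the notebook, this would be called as `summarize_action_response(submission_res)`.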

<a id='conclusions'></a>
# Conclusions
At this point, we've walked through the basics of creating a submission and then uploading, validating, and submitting (or not) data using the API system.  Additional queries and mutations are available that provide further information and capabilities for integrating with your systems, and we suggest reading the API documentation for details.  While this example is in Python, any language that can send GraphQL queries is suitable for interacting with this API.

If you have any questions about using this API, please contact the CRDC Helpdesk.