This notebook shows how to use the CDS API to run common queries and turn the results into a form that is easy to work with and can be exported to a Cloud Resource.
A few useful links:
- The CDS Data Model: https://dataservice.datacommons.cancer.gov/#/resources
- The CDS GraphQL endpoint: https://dataservice.datacommons.cancer.gov/v1/graphql/
- The GraphiQL interface in CDS (a good place to build queries): https://dataservice.datacommons.cancer.gov/#/graphql

Import a few useful libraries: Requests for communicating with the API and Pandas for manipulating the returned data. The json and pprint modules help with formatting and displaying the results.

In [1]:
import requests
import pandas as pd
import json
import pprint

This is a routine that sends the query and returns the results as a Python dictionary.

In [2]:
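def runGrapQLQuery(url,query):
    # The content type belongs in the HTTP headers, not in the JSON body
    headers = {"content-type":"application/json"}
    results = None
    try:
        #response = requests.post(url = url, json={"query":query}, headers = headers, verify = False)
        response = requests.post(url = url, json={"query":query}, headers = headers)
        # raise_for_status() turns 4xx/5xx responses into HTTPError exceptions
        response.raise_for_status()
        results = response.json()
    except requests.exceptions.RequestException as exception:
        # Covers connection problems as well as HTTP errors; results stays None
        print(exception)
    return results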

Query 1: Getting HTAN file names, participant IDs and sample IDs.

In [3]:
query = """
{
  study(study_acronym: "HTAN"){
    files{
      file_name
      file_id
      samples{
        sample_id
        participant{
          participant_id
        }
      }
    }
  }
}
"""

For those not familiar with GraphQL, here's a brief explanation of the query: HTAN is considered a study, so defining the study acronym limits the returned data to just HTAN information. The assumption is that a user will be most interested in getting files, so we've requested the file name and the file ID. The file ID is also the DRS ID and can be used to retrieve the actual file. The samples and participant sections are nested inside files so that the sample and participant associated with each file are returned as part of the record for that file.
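
If the same query is needed for more than one study, the query text can be generated rather than edited by hand. The following is a minimal sketch, not part of the CDS API: `build_file_query` is a hypothetical helper name, and it assumes that other studies expose the same fields as HTAN.

```python
import json


def build_file_query(study_acronym):
    """Return the file/sample/participant query text for one study acronym.

    Hypothetical helper for illustration only.
    """
    # json.dumps adds the surrounding double quotes and escapes embedded quotes
    acronym_literal = json.dumps(study_acronym)
    # Doubled braces are literal braces inside an f-string
    return f"""
{{
  study(study_acronym: {acronym_literal}){{
    files{{
      file_name
      file_id
      samples{{
        sample_id
        participant{{
          participant_id
        }}
      }}
    }}
  }}
}}
"""


example_query = build_file_query("HTAN")
print(example_query)
```

The printed text has the same structure as the query above, so it can be passed to the same routine that sends queries.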

The next step is to run the query, so first let's define the URL the query needs to be sent to.

In [4]:
cds_api_url = "https://dataservice.datacommons.cancer.gov/v1/graphql/"
#cds_api_url = "https://dataservice-dev.datacommons.cancer.gov/v1/graphql/"

Now send the query using the function defined above.

In [5]:
results = runGrapQLQuery(cds_api_url, query)

If no errors are returned, we've got our results and can start parsing them. It's useful to run the query first in the GraphiQL interface on the CDS site. That interface displays the results, which makes it easier to decide how to parse them. Here's an example of how the returned data start:
> {
>  "data": {
>    "study": [
>      {
>        "files": [
>          {
>            "file_name": "6109-AS-21_S1_L005_R1_001.fastq.gz",
>            "file_id": "dg.4DFC/9bfcd7b8-81e4-467b-98c7-0ce3c352c5c6",
>            "samples": [
>              {
>                "sample_id": "HTA11_8920_2000001011_DNA",
>                "participant": {
>                  "participant_id": "HTA11_8920"
>                }
>              }
>            ]
>          }
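
Checking for errors can also be done in code. Under the GraphQL specification, problems with a query are reported in a top-level "errors" list whose entries carry a "message" field. The sketch below relies on that convention; `extract_data` is a hypothetical helper name, and whether CDS reports every failure this way is an assumption. The two example responses are made up for illustration.

```python
def extract_data(response):
    """Return the "data" part of a GraphQL response, or raise if errors are reported."""
    if response is None:
        raise ValueError("No response was returned")
    errors = response.get("errors")
    if errors:
        # Each error entry is expected to be a dictionary with a "message" field
        messages = [str(error.get("message", error)) for error in errors]
        raise ValueError("GraphQL errors: " + "; ".join(messages))
    return response["data"]


# Made-up responses, shaped like a successful and a failed query
ok_response = {"data": {"study": [{"files": []}]}}
bad_response = {"errors": [{"message": "example error message"}]}

print(extract_data(ok_response))
try:
    extract_data(bad_response)
except ValueError as err:
    print(err)
```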

Before parsing out the information, the first step is to set up a dataframe that holds it in a useful way.

In [6]:
df_column_names = ["file_name", "file_id", "sample_id", "participant_id"]

In [7]:
htan_df = pd.DataFrame(columns=df_column_names)

The relevant information is the list indicated in the "files" section.  Since we've only asked for one study, the following should get that list.

In [8]:
file_list = results['data']['study'][0]['files']

Let's see how many files we have as a result

In [9]:
pprint.pprint(len(file_list))

18989


As of this writing, that's almost 19,000 files, which is a lot to parse. We'll do it just as an example, but in practice narrowing down the query first would be a good idea.

Both the sample ID and the participant ID sit in their own nested structures for each file, so they take a little extra parsing.

In [10]:
for file in file_list:
    file_name = file['file_name']
    file_id = file['file_id']
    for sample in file['samples']:
        sample_id = sample['sample_id']
        participant_id = sample['participant']['participant_id']
        # Add the row inside the sample loop: one row per file/sample pair,
        # and no stale sample values are reused for a file without samples
        htan_df.loc[len(htan_df.index)] = [file_name, file_id, sample_id, participant_id]
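
Adding rows one at a time with `.loc` can get slow for a list of this size. A common alternative is to collect plain dictionaries first and build the dataframe in a single step. The sketch below shows that pattern; `files_to_rows` is a hypothetical helper name, and the second record in the sample data is made up to show a file with two samples.

```python
import pandas as pd


def files_to_rows(file_list):
    """Flatten file records into one dictionary per file/sample pair."""
    rows = []
    for file_record in file_list:
        # Treat a missing or empty sample list as "no rows for this file"
        for sample in file_record.get("samples") or []:
            participant = sample.get("participant") or {}
            rows.append({
                "file_name": file_record["file_name"],
                "file_id": file_record["file_id"],
                "sample_id": sample["sample_id"],
                "participant_id": participant.get("participant_id"),
            })
    return rows


sample_files = [
    {   # First record from the example response shown earlier
        "file_name": "6109-AS-21_S1_L005_R1_001.fastq.gz",
        "file_id": "dg.4DFC/9bfcd7b8-81e4-467b-98c7-0ce3c352c5c6",
        "samples": [{"sample_id": "HTA11_8920_2000001011_DNA",
                     "participant": {"participant_id": "HTA11_8920"}}],
    },
    {   # Made-up record with two samples
        "file_name": "example.fastq.gz",
        "file_id": "dg.4DFC/example-id",
        "samples": [{"sample_id": "S1", "participant": {"participant_id": "P1"}},
                    {"sample_id": "S2", "participant": {"participant_id": "P1"}}],
    },
]

columns = ["file_name", "file_id", "sample_id", "participant_id"]
example_df = pd.DataFrame(files_to_rows(sample_files), columns=columns)
print(example_df)
```

Building the dataframe once from a list avoids growing it row by row, and the result has the same columns as `htan_df`.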

In [11]:
htan_df.head()

Unnamed: 0,file_name,file_id,sample_id,participant_id
0,6109-AS-21_S1_L005_R1_001.fastq.gz,dg.4DFC/9bfcd7b8-81e4-467b-98c7-0ce3c352c5c6,HTA11_8920_2000001011_DNA,HTA11_8920
1,TGTACCTT-CAGGCATT_S44_L004_R1_001.fastq.gz,dg.4DFC/2c44edca-8403-11ee-89ba-97f1d8e813d2,HTA6_116_1_Tissue Biospecimen Type,HTA6_116
2,TGTACCTT-CAGGCATT_S44_L004_R2_001.fastq.gz,dg.4DFC/2c436c52-8403-11ee-ac1d-d3d8ca49ce61,HTA6_116_1_Tissue Biospecimen Type,HTA6_116
3,Other_2263_CD33_scATAC_S5_L002_R1_001.fastq.gz,dg.4DFC/c59c1494-4cb3-11ee-b827-bba8e7f9474d,HTA4_20_1090_Blood Biospecimen Type,HTA4_20
4,Other_2456_scATAC_S1_L002_R3_001.fastq.gz,dg.4DFC/c59d346e-4cb3-11ee-90e2-7b3083d9b31e,HTA4_25_1093_Bone Marrow Biospecimen Type,HTA4_25


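If, instead of all the files, we were only interested in the RNA-Seq ones, we could filter on the experimental strategy. The filter value has to match the string stored in CDS, which for these files is "WXS; RNA-Seq". The following query returns those files, along with some useful additional information.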

In [46]:
query = """
{
  study(study_acronym:"HTAN"){
    participants{
      participant_id
      samples{
        sample_id
        sample_type
        files(experimental_strategy_and_data_subtypes:"WXS; RNA-Seq"){
          file_id
          file_name
          experimental_strategy_and_data_subtypes
          md5sum
        }
      }
    }
  }
}
"""

In [47]:
rnaseqfiles = runGrapQLQuery(cds_api_url, query)

As before, we need to parse the JSON and put the data into a useful form. Running the query in the CDS GraphiQL interface, or in any other graphical GraphQL editor, will show the structure of the returned data.

In [48]:
rna_headers = ["participant_id", "sample_id", "sample_type", "file_id", "file_name", "experimental_strategy_and_data_subtypes", "md5sum"]

In [50]:
rnadf = pd.DataFrame(columns=rna_headers)

In [52]:
participant_list = rnaseqfiles['data']['study'][0]['participants']

In [53]:
for participant in participant_list:
    participant_id = participant['participant_id']
    for sample in participant['samples']:
        #pprint.pprint(sample)
        sample_id = sample['sample_id']
        sample_type = sample['sample_type']
        filelist = sample['files']
        # One row per file; samples without matching files add no rows
        for file in filelist:
            file_id = file['file_id']
            name = file['file_name']
            # Named "strategy" so the built-in type() is not shadowed
            strategy = file['experimental_strategy_and_data_subtypes']
            md5sum = file['md5sum']
            rnadf.loc[len(rnadf.index)] = [participant_id, sample_id, sample_type, file_id, name, strategy, md5sum]
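
For nested results like these, `pandas.json_normalize` can do the same flattening without explicit loops. The sketch below is an alternative to the loop above, not a replacement the notebook depends on. The sample data are made up and only mimic the participants, samples, files nesting of the query result.

```python
import pandas as pd

# Made-up data with the same nesting as the query result
sample_participants = [
    {
        "participant_id": "P1",
        "samples": [
            {
                "sample_id": "P1_S1",
                "sample_type": "Blood",
                "files": [
                    {"file_id": "dg.4DFC/example-1", "file_name": "a_R1.fastq.gz",
                     "experimental_strategy_and_data_subtypes": "WXS; RNA-Seq",
                     "md5sum": "0" * 32},
                    {"file_id": "dg.4DFC/example-2", "file_name": "a_R2.fastq.gz",
                     "experimental_strategy_and_data_subtypes": "WXS; RNA-Seq",
                     "md5sum": "1" * 32},
                ],
            },
            # A sample without matching files contributes no rows
            {"sample_id": "P1_S2", "sample_type": "Tissue", "files": []},
        ],
    },
]

flat = pd.json_normalize(
    sample_participants,
    record_path=["samples", "files"],   # one row per file
    meta=["participant_id",             # carried down from the participant
          ["samples", "sample_id"],     # carried down from the sample
          ["samples", "sample_type"]],
)

# Nested meta fields come back with dotted names, so rename and reorder them
flat = flat.rename(columns={"samples.sample_id": "sample_id",
                            "samples.sample_type": "sample_type"})
flat = flat[["participant_id", "sample_id", "sample_type", "file_id", "file_name",
             "experimental_strategy_and_data_subtypes", "md5sum"]]
print(flat)
```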

In [54]:
rnadf.head()

Unnamed: 0,participant_id,sample_id,sample_type,file_id,file_name,experimental_strategy_and_data_subtypes,md5sum
0,HTA4_11,HTA4_11_1021_Blood Biospecimen Type,Blood Biospecimen Type,dg.4DFC/c589eca6-4cb3-11ee-aea0-4f447f37c744,MLL_PAYLNH_scATAC_S2_L001_R2_001.fastq.gz,WXS; RNA-Seq,01d2e00968ed4e8c31eb45dcafd843d4
1,HTA4_11,HTA4_11_1021_Blood Biospecimen Type,Blood Biospecimen Type,dg.4DFC/c5a47e4a-4cb3-11ee-a6e1-035743e55c45,MLL_PAYLNH_scRNA_S2_L001_R1_001.fastq.gz,WXS; RNA-Seq,388b29b8d1a37ec7d4e6a985ad4d37b8
2,HTA4_11,HTA4_11_1021_Blood Biospecimen Type,Blood Biospecimen Type,dg.4DFC/c58a1492-4cb3-11ee-a97b-335097ba8daa,MLL_PAYLNH_scATAC_S2_L001_R3_001.fastq.gz,WXS; RNA-Seq,206eb57e5954676151947528f40ae4e4
3,HTA4_11,HTA4_11_1021_Blood Biospecimen Type,Blood Biospecimen Type,dg.4DFC/c58a58da-4cb3-11ee-8981-bba2159aab70,MLL_PAYLNH_scATAC_S2_L002_R2_001.fastq.gz,WXS; RNA-Seq,cec35fad6fcedc89ae1e0786514de5c3
4,HTA4_11,HTA4_11_1021_Blood Biospecimen Type,Blood Biospecimen Type,dg.4DFC/c5a4a0a0-4cb3-11ee-bc6f-7304a31a72c9,MLL_PAYLNH_scRNA_S2_L001_R2_001.fastq.gz,WXS; RNA-Seq,e6b573ad4368550d12f1aeaa1eb9a73f


Lastly, it could be useful to bring this list of files to one of the Cloud Resources, and there's a manifest format for that. It's a comma-separated file (.csv) with the following headers: drs_uri, name, Participant ID, md5sum, User Comment.
User Comment can be any string you like; the others need to follow these conventions:

- drs_uri: This is the file ID preceded by 'drs://nci-crdc.datacommons.io/'
- name: The file_name string
- Participant ID: the participant ID string
- md5sum: The file md5sum value

Since the file_id column doesn't have the drs:// portion, that needs to be added.

In [55]:
def drsID(file_id):
    # Prefix the file ID with the DRS server to form a full DRS URI
    return 'drs://nci-crdc.datacommons.io/'+file_id

In [56]:
drs_column = rnadf['file_id']
new_drs = drs_column.apply(drsID)
rnadf['file_id'] = new_drs
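
One thing to watch in a notebook is that running the cell above twice adds the prefix twice. A minimal sketch of a guard against that is shown here; `add_drs_prefix` and `DRS_PREFIX` are hypothetical names introduced for this example.

```python
import pandas as pd

DRS_PREFIX = "drs://nci-crdc.datacommons.io/"


def add_drs_prefix(ids):
    """Return the IDs with the DRS prefix, leaving already-prefixed IDs unchanged."""
    ids = ids.astype(str)
    already_prefixed = ids.str.startswith(DRS_PREFIX)
    # where() keeps values where the condition is True and replaces the rest
    return ids.where(already_prefixed, DRS_PREFIX + ids)


example_ids = pd.Series(["dg.4DFC/9bfcd7b8-81e4-467b-98c7-0ce3c352c5c6"])
once = add_drs_prefix(example_ids)
twice = add_drs_prefix(once)
print(once[0])
print(once.equals(twice))
```

Applied to the notebook's data, the call would be `rnadf['file_id'] = add_drs_prefix(rnadf['file_id'])`.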

In [57]:
rnadf.head()

Unnamed: 0,participant_id,sample_id,sample_type,file_id,file_name,experimental_strategy_and_data_subtypes,md5sum
0,HTA4_11,HTA4_11_1021_Blood Biospecimen Type,Blood Biospecimen Type,drs://nci-crdc.datacommons.io/dg.4DFC/c589eca6...,MLL_PAYLNH_scATAC_S2_L001_R2_001.fastq.gz,WXS; RNA-Seq,01d2e00968ed4e8c31eb45dcafd843d4
1,HTA4_11,HTA4_11_1021_Blood Biospecimen Type,Blood Biospecimen Type,drs://nci-crdc.datacommons.io/dg.4DFC/c5a47e4a...,MLL_PAYLNH_scRNA_S2_L001_R1_001.fastq.gz,WXS; RNA-Seq,388b29b8d1a37ec7d4e6a985ad4d37b8
2,HTA4_11,HTA4_11_1021_Blood Biospecimen Type,Blood Biospecimen Type,drs://nci-crdc.datacommons.io/dg.4DFC/c58a1492...,MLL_PAYLNH_scATAC_S2_L001_R3_001.fastq.gz,WXS; RNA-Seq,206eb57e5954676151947528f40ae4e4
3,HTA4_11,HTA4_11_1021_Blood Biospecimen Type,Blood Biospecimen Type,drs://nci-crdc.datacommons.io/dg.4DFC/c58a58da...,MLL_PAYLNH_scATAC_S2_L002_R2_001.fastq.gz,WXS; RNA-Seq,cec35fad6fcedc89ae1e0786514de5c3
4,HTA4_11,HTA4_11_1021_Blood Biospecimen Type,Blood Biospecimen Type,drs://nci-crdc.datacommons.io/dg.4DFC/c5a4a0a0...,MLL_PAYLNH_scRNA_S2_L001_R2_001.fastq.gz,WXS; RNA-Seq,e6b573ad4368550d12f1aeaa1eb9a73f


Now rename a few of the columns, since the column names for the manifest file are fixed.

In [58]:
renamed = {"participant_id":"Participant ID", "file_name":"name", "file_id":"drs_uri"}

In [59]:
rnadf.rename(columns = renamed, inplace = True)

In [60]:
rnadf.head()

Unnamed: 0,Participant ID,sample_id,sample_type,drs_uri,name,experimental_strategy_and_data_subtypes,md5sum
0,HTA4_11,HTA4_11_1021_Blood Biospecimen Type,Blood Biospecimen Type,drs://nci-crdc.datacommons.io/dg.4DFC/c589eca6...,MLL_PAYLNH_scATAC_S2_L001_R2_001.fastq.gz,WXS; RNA-Seq,01d2e00968ed4e8c31eb45dcafd843d4
1,HTA4_11,HTA4_11_1021_Blood Biospecimen Type,Blood Biospecimen Type,drs://nci-crdc.datacommons.io/dg.4DFC/c5a47e4a...,MLL_PAYLNH_scRNA_S2_L001_R1_001.fastq.gz,WXS; RNA-Seq,388b29b8d1a37ec7d4e6a985ad4d37b8
2,HTA4_11,HTA4_11_1021_Blood Biospecimen Type,Blood Biospecimen Type,drs://nci-crdc.datacommons.io/dg.4DFC/c58a1492...,MLL_PAYLNH_scATAC_S2_L001_R3_001.fastq.gz,WXS; RNA-Seq,206eb57e5954676151947528f40ae4e4
3,HTA4_11,HTA4_11_1021_Blood Biospecimen Type,Blood Biospecimen Type,drs://nci-crdc.datacommons.io/dg.4DFC/c58a58da...,MLL_PAYLNH_scATAC_S2_L002_R2_001.fastq.gz,WXS; RNA-Seq,cec35fad6fcedc89ae1e0786514de5c3
4,HTA4_11,HTA4_11_1021_Blood Biospecimen Type,Blood Biospecimen Type,drs://nci-crdc.datacommons.io/dg.4DFC/c5a4a0a0...,MLL_PAYLNH_scRNA_S2_L001_R2_001.fastq.gz,WXS; RNA-Seq,e6b573ad4368550d12f1aeaa1eb9a73f


In [61]:
csv_filename = "ExampleManifest.csv"
output_columns = ["drs_uri","name","Participant ID","md5sum"]
rnadf.to_csv(csv_filename, columns=output_columns, sep=',', index=False)
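
The export above writes four of the five manifest headers. If the User Comment column is wanted as well, it can be added before writing. The sketch below does that and reads the CSV back as a quick check; `to_manifest` is a hypothetical helper name, the comment text is arbitrary, and whether a Cloud Resource requires the User Comment column to be present is not something this notebook establishes. The single example row uses values from the output shown earlier.

```python
import io

import pandas as pd

MANIFEST_COLUMNS = ["drs_uri", "name", "Participant ID", "md5sum", "User Comment"]


def to_manifest(df, comment=""):
    """Return a copy of df limited to the manifest columns, with a User Comment column."""
    manifest = df.copy()
    manifest["User Comment"] = comment
    missing = [col for col in MANIFEST_COLUMNS if col not in manifest.columns]
    if missing:
        raise KeyError(f"Missing manifest columns: {missing}")
    return manifest[MANIFEST_COLUMNS]


example = pd.DataFrame([{
    "Participant ID": "HTA4_11",
    "sample_id": "HTA4_11_1021_Blood Biospecimen Type",
    "drs_uri": "drs://nci-crdc.datacommons.io/dg.4DFC/c589eca6-4cb3-11ee-aea0-4f447f37c744",
    "name": "MLL_PAYLNH_scATAC_S2_L001_R2_001.fastq.gz",
    "md5sum": "01d2e00968ed4e8c31eb45dcafd843d4",
}])

manifest = to_manifest(example, comment="HTAN example")

# Write to an in-memory buffer and read it back to confirm the header row
buffer = io.StringIO()
manifest.to_csv(buffer, index=False)
check = pd.read_csv(io.StringIO(buffer.getvalue()), dtype=str)
print(list(check.columns))
```

To write a real file instead of the in-memory buffer, pass a filename such as `csv_filename` to `to_csv`.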