In this notebook, we'll use what we learned via GraphQL introspection to query the CDS database for cases and associated files that we can export for analysis.
A few useful links:

The CDS Data Model: https://dataservice.datacommons.cancer.gov/#/resources
The CDS GraphQL endpoint: https://dataservice.datacommons.cancer.gov/v1/graphql/
The GraphiQL interface in CDS (a good place to build and practice queries): https://dataservice.datacommons.cancer.gov/#/graphql


As before, we'll import a few useful libraries and set up a routine to run queries and return JSON results.

In [1]:
import pandas as pd
import requests
from IPython.display import display, Markdown, Latex

In [2]:
cds_graphql_url = "https://dataservice.datacommons.cancer.gov/v1/graphql/"

In [3]:
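def runGraphQLQuery(url, query, variables=None):
    """POST a GraphQL query and return the decoded JSON (None if the request fails)."""
    # HTTP headers belong in the headers argument, not inside the JSON body
    headers = {"content-type": "application/json"}
    payload = {"query": query}
    if variables is not None:
        payload["variables"] = variables
    try:
        response = requests.post(url=url, json=payload, headers=headers)
        # requests only raises HTTPError when asked to, so check the status explicitly
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as exception:
        print(exception)
        return None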
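
A GraphQL server can answer with an "errors" list instead of (or alongside) "data", so indexing straight into the result can fail in confusing ways. The following is a minimal sketch of a guard, not part of the original notebook; the helper name extract_data and the sample responses are hypothetical.

```python
def extract_data(response, key):
    """Return response['data'][key], or raise if the server reported errors."""
    if not isinstance(response, dict):
        raise ValueError("No JSON object was returned")
    if response.get("errors"):
        # Each GraphQL error object normally carries a human-readable "message"
        messages = "; ".join(str(err.get("message", err)) for err in response["errors"])
        raise ValueError(f"GraphQL errors: {messages}")
    return response["data"][key]


# Hypothetical responses, shaped like GraphQL replies, used only for illustration
ok = {"data": {"diagnosis": [{"primary_diagnosis": "Glioma"}]}}
bad = {"errors": [{"message": "Unknown field"}], "data": None}

print(extract_data(ok, "diagnosis"))
try:
    extract_data(bad, "diagnosis")
except ValueError as exc:
    print(exc)
```

With a helper like this, a line such as diseases['data']['diagnosis'] could be written as extract_data(diseases, "diagnosis"), which reports the server's message rather than a KeyError or TypeError.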

One potential starting point is to find cases associated with a specific disease. CDS covers many different diseases, so a first step is to query the system for a listing. CDS stores this information in the "primary_diagnosis" field in the diagnosis section of the CDS data model.

In [4]:
disease_query = """
{
  diagnosis{
    primary_diagnosis
  }
}
"""

In [5]:
diseases = runGraphQLQuery(cds_graphql_url, disease_query, None)

In [6]:
diseases = diseases['data']['diagnosis']

In [7]:
disease_list = []
for disease in diseases:
    disease_list.append(disease['primary_diagnosis'])

#remove any duplicates
disease_list = list(set(disease_list))
print(disease_list)

['Alveolar Rhabdomyosarcoma', 'Myelodysplastic Syndrome With Single Lineage Dysplasia', 'Malignant Rhabdoid Tumor', 'Central Neuroblastoma', 'Serous Surface Papillary Carcinoma', 'High-grade CNS Neoplasm', 'Choroid Plexus Carcinoma', 'Choroid Plexus Papillary Tumor', 'Large Cell Medulloblastoma', 'Extrarenal Malignant Rhabdoid Tumor', 'Residual Dermatofibrosarcoma Protuberans', 'Low-grade Glial/Glioneuronal tumor', 'Histiocytic Malignancy', 'Cerebellar Tumor', 'Undifferentiated, high-grade neoplasm/sarcoma', 'Paratesticular Rhabdomyosarcoma', 'Low Grade Glial Neoplasm', 'Infantile Fibrosarcoma', 'Low-grade glial/glioneuronal neoplasm', 'Non-WNT/non-SHH medulloblastoma', 'Myelodysplastic Syndrome With Multilineage Dysplasia', 'Suprasellar Mass', 'Left Temporal Tumor', 'Pineoblastoma', 'Low Grade Glioneuronal Neoplasm', 'Glial Neoplasm', 'Adrenal Cortical Carcinoma', 'Cerebellar Mass', 'Central neurocytoma', 'Cellular Ependymoma', 'Myxoid Liposarcoma', 'Squamous Cell Carcinoma NOS', 'Ast
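
Converting to a set removes duplicates but also discards how often each diagnosis occurs and leaves the list in arbitrary order. As a possible alternative, the sketch below counts occurrences with collections.Counter and sorts the unique names. The sample rows are hypothetical, and the presence of null diagnoses in the real data is an assumption.

```python
from collections import Counter

# Hypothetical rows shaped like the 'diagnosis' results; not real CDS data
sample_rows = [
    {"primary_diagnosis": "Glioma"},
    {"primary_diagnosis": "Pineoblastoma"},
    {"primary_diagnosis": "Glioma"},
    {"primary_diagnosis": None},  # assumption: some records may lack a diagnosis
]

# Skip missing values, then count how often each diagnosis appears
counts = Counter(
    row["primary_diagnosis"] for row in sample_rows if row["primary_diagnosis"]
)
unique_sorted = sorted(counts)

print(unique_sorted)
print(counts.most_common(1))
```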

That's a fair number of diseases. If we're interested in something like glioma, it may be recorded under several different names, so let's filter the list down to the entries that contain "Glioma".

In [13]:
glioma_list = [s for s in disease_list if "Glioma" in s]
print(glioma_list)

['Astrocytic Glioma', 'Infiltrating Glioma', 'Infant Hemispheric Glioma', 'Angiocentric Glioma', 'Low Grade Glioma', 'Diffuse Glioma', 'Optic Pathway Glioma', 'Neuroepithelial neoplasm, Glioma', 'Low Cellularity Glioma', 'Low-grade Glioma', 'Glioma']
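
Note that the "Glioma" in s test is case-sensitive, so a name written in lowercase would be skipped. A minimal sketch of a case-insensitive filter follows; the function name match_terms is hypothetical, and the lowercase entry in the sample list is invented to show the difference.

```python
def match_terms(names, term):
    """Return the names containing term, ignoring case and skipping missing values."""
    needle = term.casefold()
    return [name for name in names if name and needle in name.casefold()]


sample = [
    "Low Grade Glioma",
    "Low-grade glial/glioneuronal neoplasm",
    "Glioma",
    "Pineoblastoma",
    "optic glioma",  # hypothetical lowercase entry, for illustration only
    None,            # assumption: a diagnosis may be missing
]

print([s for s in sample if s and "Glioma" in s])  # case-sensitive
print(match_terms(sample, "glioma"))               # case-insensitive
```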


Finding out which studies these diagnoses are associated with helps us understand what kinds of data are available. In the CDS data model, a diagnosis is linked to a study through the Participant node, so the query goes through the participant.
We'll also get the phs_accession number, since it is a less clumsy way to query the database than the study name.

In [18]:
diagnosisStudyQuery = """
query diagnosisStudies($diagnosis: String!){
  diagnosis(primary_diagnosis:$diagnosis){
    participant{
      study{
        study_name
        phs_accession
      }
    }
  }
}
"""
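
For reference, a query that declares a variable is sent as a JSON body with two keys: "query" holds the query text and "variables" maps each declared name (without the leading $) to its value. The sketch below only builds and inspects such a body and makes no network call; the shortened query text is illustrative.

```python
import json

# Illustrative query; the key in "variables" must match the $diagnosis declaration
query = """
query diagnosisStudies($diagnosis: String!){
  diagnosis(primary_diagnosis:$diagnosis){
    primary_diagnosis
  }
}
"""

payload = {"query": query, "variables": {"diagnosis": "Glioma"}}
body = json.dumps(payload)

# Decoding the body again shows the structure the server would receive
decoded = json.loads(body)
print(sorted(decoded))
print(decoded["variables"])
```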

In [23]:
columns = ["Diagnosis", "Studies","PHS Accession"]
study_df = pd.DataFrame(columns = columns)
for glioma in glioma_list:
    studies= {}
    variables = {"diagnosis": glioma}
    results = runGraphQLQuery(cds_graphql_url, diagnosisStudyQuery, variables)
    results = results['data']['diagnosis']
    #Putting everything in a dictionary removes duplicates
    for result in results:
        studies[result['participant']['study']['study_name']] = result['participant']['study']['phs_accession']
    for studyname, phs in studies.items():
        study_df.loc[len(study_df.index)] = [glioma, studyname, phs]
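
The loop above assumes every record has a participant and a study. If some records returned null for either one (an assumption, not something confirmed here), the nested lookup would raise a TypeError. One way to guard against that is sketched below; the function name unique_studies and the sample records are hypothetical.

```python
def unique_studies(diagnosis, records):
    """Collect unique (diagnosis, study_name, phs_accession) rows from query records."""
    seen = set()
    rows = []
    for record in records:
        # Assumption: participant or study may be null for some records
        study = ((record or {}).get("participant") or {}).get("study") or {}
        key = (study.get("study_name"), study.get("phs_accession"))
        if key[0] is None or key in seen:
            continue
        seen.add(key)
        rows.append((diagnosis, *key))
    return rows


# Hypothetical records shaped like the query results
ctsmc = {
    "study_name": "Clinical Trial Specimen Molecular Characterization (CTSMC)",
    "phs_accession": "phs002790",
}
records = [
    {"participant": {"study": ctsmc}},
    {"participant": {"study": ctsmc}},  # duplicate study
    {"participant": None},              # missing link
]

print(unique_studies("Glioma", records))
```

Rows collected this way can be handed to pd.DataFrame(rows, columns=columns) in a single call, which avoids growing the DataFrame one row at a time.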
        
    

In [24]:
display(Markdown(study_df.to_markdown()))

|    | Diagnosis                        | Studies                                                                                                                                       | PHS Accession   |
|---:|:---------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------|:----------------|
|  0 | Astrocytic Glioma                | Clinical Trial Specimen Molecular Characterization (CTSMC)                                                                                    | phs002790       |
|  1 | Infiltrating Glioma              | Clinical Trial Specimen Molecular Characterization (CTSMC)                                                                                    | phs002790       |
|  2 | Infant Hemispheric Glioma        | Clinical Trial Specimen Molecular Characterization (CTSMC)                                                                                    | phs002790       |
|  3 | Angiocentric Glioma              | Clinical Trial Specimen Molecular Characterization (CTSMC)                                                                                    | phs002790       |
|  4 | Low Grade Glioma                 | Clinical Trial Specimen Molecular Characterization (CTSMC)                                                                                    | phs002790       |
|  5 | Diffuse Glioma                   | Clinical Trial Specimen Molecular Characterization (CTSMC)                                                                                    | phs002790       |
|  6 | Optic Pathway Glioma             | Clinical Trial Specimen Molecular Characterization (CTSMC)                                                                                    | phs002790       |
|  7 | Neuroepithelial neoplasm, Glioma | Clinical Trial Specimen Molecular Characterization (CTSMC)                                                                                    | phs002790       |
|  8 | Low Cellularity Glioma           | Clinical Trial Specimen Molecular Characterization (CTSMC)                                                                                    | phs002790       |
|  9 | Low-grade Glioma                 | Clinical Trial Specimen Molecular Characterization (CTSMC)                                                                                    | phs002790       |
| 10 | Glioma                           | Human Tumor Atlas Network (HTAN) primary sequencing data                                                                                      | phs002371       |
| 11 | Glioma                           | CIDR: Discovery, Biology, and Risk of Inherited Variants in Glioma sample                                                                     | phs002250       |
| 12 | Glioma                           | Clinical Trial Specimen Molecular Characterization (CTSMC)                                                                                    | phs002790       |
| 13 | Glioma                           | Childhood Cancer Data Initiative (CCDI): Integration of genomic and clinical data from unique rare cancer datasets to facilitate data sharing | phs002517       |
| 14 | Glioma                           | TCGA WGS Variants Across 18 Cancer Types                                                                                                      | phs003155       |
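
To make the comparison between studies explicit, the rows can be grouped by accession number and ranked by how many distinct glioma diagnoses each study covers. This is a small sketch that uses the (diagnosis, accession) pairs copied from the table above rather than a live query.

```python
from collections import defaultdict

# (diagnosis, phs_accession) pairs copied from the table above
rows = [
    ("Astrocytic Glioma", "phs002790"),
    ("Infiltrating Glioma", "phs002790"),
    ("Infant Hemispheric Glioma", "phs002790"),
    ("Angiocentric Glioma", "phs002790"),
    ("Low Grade Glioma", "phs002790"),
    ("Diffuse Glioma", "phs002790"),
    ("Optic Pathway Glioma", "phs002790"),
    ("Neuroepithelial neoplasm, Glioma", "phs002790"),
    ("Low Cellularity Glioma", "phs002790"),
    ("Low-grade Glioma", "phs002790"),
    ("Glioma", "phs002371"),
    ("Glioma", "phs002250"),
    ("Glioma", "phs002790"),
    ("Glioma", "phs002517"),
    ("Glioma", "phs003155"),
]

# Group the distinct diagnoses seen under each study accession
diagnoses_per_study = defaultdict(set)
for diagnosis, phs in rows:
    diagnoses_per_study[phs].add(diagnosis)

# Rank studies by the number of distinct diagnoses they cover
ranked = sorted(diagnoses_per_study.items(), key=lambda item: len(item[1]), reverse=True)
for phs, names in ranked:
    print(phs, len(names))
```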

From these results, it's clear that the CTSMC study is the one to look at more closely: it is the only study that appears for every glioma diagnosis in the table. We know from the introspection queries that there's a studyDetail query that provides an overview, so let's see what that looks like. As before, we'll use a variable so we can more easily reuse the query.

In [25]:
study_detail_query = """
query getStudyDetails ($phs:String!){
  studyDetail(phs_accession:$phs){
    study_acronym
    study_description
    study_external_url
    study_name
    phs_accession
    numberOfDiseaseSites
    numberOfFiles
    numberOfSamples
    numberOfSubjects
    data_types
  }
}
"""

In [33]:
variables = {"phs": "phs002790"}
study_detail = runGraphQLQuery(cds_graphql_url, study_detail_query, variables)
study_detail = study_detail['data']['studyDetail']

In [34]:
columns = ['Disease Sites', 'Files', 'Samples', 'Subject', 'Data Types', 'Acronym', 'Description','External URL']
data = [study_detail['numberOfDiseaseSites'],study_detail['numberOfFiles'],study_detail['numberOfSamples'],study_detail['numberOfSubjects'],study_detail['data_types'],study_detail['study_acronym'],study_detail['study_description'],study_detail['study_external_url']]
detail_df = pd.DataFrame([data], columns=columns)


In [35]:
display(Markdown(detail_df.to_markdown()))

|    |   Disease Sites |   Files |   Samples |   Subject | Data Types   | Acronym   | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              | External URL                                                                         |
|---:|----------------:|--------:|----------:|----------:|:-------------|:----------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|
|  0 |             135 |   49747 |      5647 |      2446 | Genomic      | CTSMC     | The National Cancer Institute's (NCI) Childhood Cancer Data Initiative (CCDI) focuses on the critical need to collect, analyze, and share data to address the burden of cancer in children, adolescents, and young adults (AYAs). The Molecular Characterization Initiative (MCI) will further the CCDI's goals by providing access to better diagnostic tests for pediatric and AYA patients. The molecular characterizations of solid tumors, soft tissue sarcomas, and rare diseases are performed in a CLIA-certified setting as results may be used to screen for and/or confirm clinical trial eligibility, direct treatment, or otherwise contribute to the conduct of the trial. The following molecular characterization assays were performed: | https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs002790.v5.p1 |
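
Building the row by listing every field by hand is easy to get out of order. As one possible alternative, the sketch below keeps a single mapping from studyDetail field names to column labels and derives the row from it. The names FIELD_LABELS and detail_row are hypothetical; the sample values are copied from the table above, with the description and URL left out for brevity.

```python
# Mapping from studyDetail field names to the column labels used in this notebook
FIELD_LABELS = {
    "numberOfDiseaseSites": "Disease Sites",
    "numberOfFiles": "Files",
    "numberOfSamples": "Samples",
    "numberOfSubjects": "Subject",
    "data_types": "Data Types",
    "study_acronym": "Acronym",
    "study_description": "Description",
    "study_external_url": "External URL",
}


def detail_row(detail):
    """Return a {column label: value} dict; fields absent from detail become None."""
    return {label: detail.get(field) for field, label in FIELD_LABELS.items()}


# Values copied from the table above (description and URL omitted)
sample_detail = {
    "numberOfDiseaseSites": 135,
    "numberOfFiles": 49747,
    "numberOfSamples": 5647,
    "numberOfSubjects": 2446,
    "data_types": "Genomic",
    "study_acronym": "CTSMC",
}

row = detail_row(sample_detail)
print(row["Files"], row["Acronym"], row["Description"])
```

A dictionary like this can be passed to pd.DataFrame([row]), which takes the column names from the keys.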