This notebook shows how to use the CDS API to run common queries and return results in a form that can be used directly and exported to a cloud resource. A few useful links:

    The CDS Data Model: https://dataservice.datacommons.cancer.gov/#/resources
    The CDS GraphQL endpoint: https://dataservice.datacommons.cancer.gov/v1/graphql/
    The GraphiQL interface in CDS (a good place to build queries): https://dataservice.datacommons.cancer.gov/#/graphql

Import a few useful libraries: Requests for communicating with the API and Pandas for manipulating the results. Requests decodes the JSON responses itself, so no separate json import is needed.


In [12]:
import pandas as pd
import requests

In [4]:
cds_graphql_url = "https://dataservice.datacommons.cancer.gov/v1/graphql/"

Since there will be multiple queries, a simple routine that runs a query and returns the answer as a JSON object will streamline the notebook.

In [9]:
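def runGraphQLQuery(url, query):
    # Headers belong to the HTTP request itself, not inside the JSON body
    headers = {"content-type": "application/json"}
    results = None
    try:
        response = requests.post(url=url, json={"query": query}, headers=headers)
        # Raise HTTPError on 4xx/5xx responses so the except clause can catch it
        response.raise_for_status()
        results = response.json()
    except requests.exceptions.HTTPError as exception:
        print(exception)
    # Returns None if the request failed
    return results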
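
A GraphQL server can answer with HTTP 200 and still report a problem in an "errors" list in the response body, so it can be worth checking for that key before using the data. The helper below is a minimal sketch of such a check; the name extract_data is hypothetical and not part of the CDS API, and the two sample responses are invented for illustration.

```python
def extract_data(response):
    """Return the 'data' part of a GraphQL response, or raise if errors are reported."""
    if response is None:
        raise ValueError("No response received from the endpoint")
    errors = response.get("errors")
    if errors:
        # Each error is normally a dict with a 'message' key; fall back to str() otherwise
        messages = "; ".join(
            str(e.get("message", e)) if isinstance(e, dict) else str(e) for e in errors
        )
        raise RuntimeError("GraphQL errors: " + messages)
    return response.get("data", {})

# Invented sample responses (not real CDS output)
ok_response = {"data": {"study": [{"study_acronym": "HTAN"}]}}
bad_response = {"errors": [{"message": "Cannot query field 'stud'"}]}

print(extract_data(ok_response))
try:
    extract_data(bad_response)
except RuntimeError as err:
    print(err)
```

In the notebook this could wrap the return value of runGraphQLQuery before the results are indexed.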

CDS has multiple studies, so as a first step it's worth looking at what content is available. This first query simply lists the studies that have data. (This notebook assumes that you've done some basic introspection queries to get familiar with the content of the CDS database.)

In [10]:
study_query = """
{
  study{
    study_name
    study_description
    study_acronym
    study_access
    phs_accession
  }
}
"""

This query will bring back several useful pieces of information that should help you decide which studies are worth further examination. You can also run this query in the GraphiQL interface in the CDS portal (https://dataservice.datacommons.cancer.gov/#/graphql) to see what the JSON object looks like.

In [13]:
study_result = runGraphQLQuery(cds_graphql_url, study_query)

Let's put the information in a dataframe, which is easier to read than a JSON object.

In [19]:
df_columns = ['Study Name', 'Acronym', 'Access', 'Accession', 'Description']
study_df = pd.DataFrame(columns = df_columns)
for study in study_result['data']['study']:
    study_df.loc[len(study_df.index)] = [study['study_name'],study['study_acronym'],study['study_access'], study['phs_accession'],study['study_description']]
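
The same table can also be built in one step by handing the list of records to pandas and renaming the columns. This is a sketch of an alternative, not a change to the cell above; sample_result is a made-up response with the same shape as study_result, so the block runs without a network call.

```python
import pandas as pd

# Made-up records shaped like the 'study' query response (not real CDS data)
sample_result = {"data": {"study": [
    {"study_name": "Example Study A", "study_description": "First example.",
     "study_acronym": "ESA", "study_access": "Controlled", "phs_accession": "phs000001"},
    {"study_name": "Example Study B", "study_description": "Second example.",
     "study_acronym": None, "study_access": "Controlled", "phs_accession": "phs000002"},
]}}

# Map API field names to the display names used in this notebook
column_names = {
    "study_name": "Study Name",
    "study_acronym": "Acronym",
    "study_access": "Access",
    "phs_accession": "Accession",
    "study_description": "Description",
}

# Build the frame from the list of dicts, rename, then fix the column order
example_df = (
    pd.DataFrame(sample_result["data"]["study"])
    .rename(columns=column_names)[list(column_names.values())]
)
print(example_df)
```

Building the frame from the whole list avoids appending one row at a time, which tends to matter more as the number of studies grows.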

In [22]:
pd.set_option('display.max_colwidth', None)
study_df

Unnamed: 0,Study Name,Acronym,Access,Accession,Description
0,Clinical Trial Specimen Molecular Characterization (CTSMC),CTSMC,Controlled,phs002790,"The National Cancer Institute's (NCI) Childhood Cancer Data Initiative (CCDI) focuses on the critical need to collect, analyze, and share data to address the burden of cancer in children, adolescents, and young adults (AYAs). The Molecular Characterization Initiative (MCI) will further the CCDI's goals by providing access to better diagnostic tests for pediatric and AYA patients. The molecular characterizations of solid tumors, soft tissue sarcomas, and rare diseases are performed in a CLIA-certified setting as results may be used to screen for and/or confirm clinical trial eligibility, direct treatment, or otherwise contribute to the conduct of the trial. The following molecular characterization assays were performed:\n\nTumor/Normal Whole Exome sequencing\nMethylation arrays\nArcher fusion panel\nDeidentified clinical reports\nNote: New data will be added periodically, please check the SBG site for updates.\n\nStudy Weblinks:\nCCDI Molecular Characterization Initiative\nStudy Design:\nTumor vs. Matched-Normal\nStudy Type:\nAggregate Genomic Data\nClinical Diagnostic Testing\nExome Sequencing\nIndividual-Level Genomic Data\nTranscriptome Analysis\nTumor vs. Matched-Normal\nTotal number of consented subjects: 1996"
1,"UCSF Database for the Advancement of JMML - Integration of Metadata with """"Omic"""" Data",JMML,Controlled,phs002504,"Juvenile myelomonocytic leukemia (JMML) is a rare and frequently fatal myeloproliferative/myelodysplastic disorder of early childhood with an estimated incidence of 1.2 cases per million. It is associated with a spectrum of diverse outcomes ranging from spontaneous resolution in rare patients to transformation to acute myeloid leukemia in others. The overwhelming majority of JMML patients (~95%) will harbor mutations in canonical Ras pathway genes, including NF1, NRAS, KRAS, PTPN11, and CBL. As Ras proteins are mutated in more than 30% of human cancers, the information gleaned from the study of JMML has provided insights into Ras signaling in cancer as well as a group of congenital diseases with tumor predispositions known as the “Rasopathies”. While oncogenic Ras is one of the most common mutations in human cancer, it remains one of the most vexing targets for efficacious therapy."
2,Human Tumor Atlas Network (HTAN) primary sequencing data,HTAN,Controlled,phs002371,"An NCI-funded Cancer Moonshot initiative to construct 3-dimensional atlases of the dynamic cellular, morphological, and molecular features of human cancers as they evolve from precancerous lesions to advanced disease."
3,"CIDR: Discovery, Biology, and Risk of Inherited Variants in Glioma sample",,Controlled,phs002250,"This is a gliogene brain tumor family study. This study includes glioma cases with a family history of glioma in a first, second, or third degree relative. \nStudy Weblinks:\nGliogene\nStudy Design:\nFamily/Twin/Trios\nStudy Type:\nFamily\nWhole Genome Sequencing\ndbGaP estimated ancestry using GRAF-pop\nTotal number of consented subjects: 151"
4,Childhood Cancer Data Initiative (CCDI): Free the Data: Open Sharing of Comprehensive Genomic Childhood Cancer Datasets (Kansas),CCDI-KUMC,Controlled,phs002529,"This study provides paired tumor normal genomic sequencing data from approximately 200 children with cancer, including both solid tumors and leukemias, done by the Children’s Mercy Research Institute (CMRI) and University of Kansas Cancer Center (KUCC).\n\nThese data include whole genome sequencing (generally, ~20x), whole exome sequencing (generally, ~300x), bulk RNA sequencing (generally, ~80 million reads), and single-cell RNA and ATAC sequencing (>50,000 reads/cell). Additional phenotypic, pathologic, and genetic data, gathered clinically for these samples, are also provided.\nStudy Design:\n Tumor vs. Matched-Normal\nStudy Type:\nCase Set\nClinical Cohort\nClinical Genetic Testing\nExome Sequencing\nFull Transcriptome Sequencing\nIndividual-Level Genomic Data\nMixed\nProbands\nRepository\nRNA Sequencing\nSequencing\nSingle Cell Analysis\nTranscriptome Sequencing\nTumorTumor vs. Matched-Normal\nWhole Genome Sequencing\nTotal number of consented subjects: 193"
5,University of Texas PDX Development and Trial Center Grant,,Controlled,phs001980,"The goal for the University Texas PDX Development and Trial Center (UTPDTC) is to optimize personalized biomarker-based cancer therapy and identify effective targeted drugs based on the molecular characteristics of each tumor. Our short-term goals are to establish a biobank of clinically, and molecularly-annotated Patient-Derived Xenografts (PDXs) and to use PDXs as a platform for preclinical drug development and biomarker discovery. The primary goal for UTPDTC investigators will be to develop PDX trial strategies for preclinical testing of single agents and drug combinations. These models will allow the determination of the optimal treatments (single drugs or combinations) that should be tested in clinical trials in increasingly individualized, molecularly defined subsets of tumors. The goal of the Patient-Derived Xenograft Core is to provide high-quality clinically relevant and molecularly annotated PDX models for the research projects proposed in the University of Texas PDX Development and Trial Center (UTPDTC) grant application and to the research activity of the NCI PDX Development and Trial Centers Research Network (PDXNet) by leveraging PDX resources at our institutions and developing new PDX models from human cancer specimens using rigorous quality standards so that the models can be used to guide clinical trial development. The PDX models developed and/or characterized by the PDX Core will be available to other cancer researchers through the PDXNet and NCI Patient-Derived Models Repository (PDMR).\n\n Study Design:\n Prospective Longitudinal Cohort\n Study Type:\n Cohort\n Total number of consented subjects: 36"
6,Clonal evolution during metastatic spread in high-rish neuroblastoma,CEMSHRN,Controlled,phs003111,"The goal of this study is to deliver a detailed characterization of the patterns of disease dissemination at diagnosis, during progression and in response to therapy in high-risk neuroblastoma. Longitudinal and spatially distinct tumors were collected from patients. Clinical data includes diagnosis and treatment information. Biospecimen data includes whole genome sequencing (WGS) and whole transcriptome sequencing (WTS).\n\nStudy Design:\nTumor vs. Matched-Normal\nStudy Type:\nLongitudinal Cohort\nTranscriptome Sequencing\nWhole Genome Sequencing\nTotal number of consented subjects: 129"
7,LCCC 1108: Development of a Tumor Molecular Analyses Program and Its Use to Support Treatment Decisions (UNCseqTM),LCCC 1108 (UNCseqTM),Controlled,phs001713,"The primary objective of this specimen correlative study was two-fold: to provide a mechanism for the association of known molecular alterations with clinical outcomes, and to provide rapid genetic profiling of alterations with known clinical utility using tumor and germline specimens to support treatment decisions."
8,Washington University PDX Development and Trial Center,,Controlled,phs002305,"This data set is a collection of patient specimens from common and rare cancers that have been developed into patient-derived xenograft (PDX) models at the Washington University PDX Development and Trial Center (WU-PDTC). The goal of the PDTC is to develop and characterize PDX models, gaining insight into tumor biology, and validating biomarkers across all major tumor types. Pre-clinical experiments in these PDX models will be used to advance our ability to predict clinical responses to new molecularly targeted agents under development.\n\nStudy Design:\nTumor vs. Matched-Normal\nStudy Type:\nTumor\nTumor vs. Matched-Normal\nTotal number of consented subjects: 127"
9,CPTAC Proteogenomic Study,PanCanSnATAC,Controlled,phs001287,"Recently, significant progress has been made in characterizing and sequencing the genomic alterations in statistically robust numbers of samples from several types of cancer. For example, The Cancer Genome Atlas (TCGA), International Cancer Genome Consortium (ICGC) and other similar efforts are identifying genomic alterations associated with specific cancers (e.g., copy number aberrations, rearrangements, point mutations, epigenomic changes, etc.) The availability of these multi-dimensional data to the scientific community sets the stage for the development of new molecularly targeted cancer interventions. Understanding the comprehensive functional changes in cancer proteomes arising from genomic alterations and other factors is the next logical step in the development of high-value candidate protein biomarkers. Hence, proteomics can greatly advance the understanding of molecular mechanisms of disease pathology via the analysis of changes in protein expression, their modifications and variations, as well as protein=protein interaction, signaling pathways and networks responsible for cellular functions such as apoptosis and oncogenesis. Realizing this great potential, the NCI launched the third phase of the CPTC initiative in September 2016. As the Clinical Proteomic Tumor Analysis Consortium, CPTAC continues to define cancer proteomes on genomically-characterized biospecimens. The purpose of this integrative approach was to provide the broad scientific community with knowledge that links genotype to proteotype and ultimately phenotype. 
In this third phase of CPTAC, the program aims to expand on CPTAC II and genomically and proteomically characterize over 2000 samples from 10 cancer types (Lung Adenocarcinoma, Pancreatic Ductal Adenocarcinoma, Glioblastoma Multiforme, Acute Myeloid Leukemia, Clear cell renal Carcinoma, Head and Neck Squamous Cell Carcinoma, Cutaneous Melanoma, Sarcoma, Lung Squamous Cell Carcinoma, Uterine Corpus Endometrial Carcinoma) .Germline DNA is obtained from blood and Normal control samples for proteomics varied by organ site. All cancer samples were derived from primary and untreated tumor."


From this list, the CPTAC Proteogenomic Study (phs001287) looks interesting. Let's take a look at some numbers to see if there's enough data there to support further analysis.

In [27]:
info_query= """
{
  studyDetail(phs_accession:"phs001287"){
    data_types
    numberOfFiles
    numberOfSubjects
    numberOfSamples
  }
}
"""

In [28]:
info_results = runGraphQLQuery(cds_graphql_url, info_query)

In [29]:
print(info_results)

{'data': {'studyDetail': {'data_types': 'Genomic', 'numberOfFiles': 1243, 'numberOfSubjects': 1069, 'numberOfSamples': 1114}}}
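
To compare several studies, the accession can be substituted into the query text and the counts collected into one table. The sketch below assumes the studyDetail query shape shown above; build_info_query and summarize_details are hypothetical helper names, and the sample response reuses the output printed above so that nothing here touches the network.

```python
import json
import pandas as pd

def build_info_query(phs_accession):
    # json.dumps wraps the accession in double quotes for the query text
    return """
{
  studyDetail(phs_accession:%s){
    data_types
    numberOfFiles
    numberOfSubjects
    numberOfSamples
  }
}
""" % json.dumps(phs_accession)

def summarize_details(responses):
    # responses maps an accession to its decoded studyDetail response
    rows = []
    for accession, response in responses.items():
        row = {"Accession": accession}
        row.update(response["data"]["studyDetail"])
        rows.append(row)
    return pd.DataFrame(rows)

# Sample response copied from the output shown above
sample_responses = {
    "phs001287": {"data": {"studyDetail": {
        "data_types": "Genomic", "numberOfFiles": 1243,
        "numberOfSubjects": 1069, "numberOfSamples": 1114}}},
}

print(build_info_query("phs002371"))
summary_df = summarize_details(sample_responses)
print(summary_df)
```

In the notebook, the responses dictionary could be filled by calling runGraphQLQuery(cds_graphql_url, build_info_query(phs)) for each accession of interest.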
