## File metadata via Cancer Data Aggregator

This notebook explores the subject and specimen attributes that can be obtained from the Cancer Data Aggregator (CDA) for a given DRS id.

#### Function to call the CDA API
Set up a function that posts a SQL query directly to the CDA API.

In [1]:
import requests
def runAPIQuery(querystring, limit=None):
    """POST a SQL query to the CDA API and return the parsed JSON response."""
    cdaURL = 'https://cda.cda-dev.broadinstitute.org/api/v1/sql-query/v3'
    # Optionally cap the number of rows returned
    if limit is not None:
        cdaURL = "{}?limit={}".format(cdaURL, str(limit))

    headers = {'accept' : 'application/json', 'Content-Type' : 'text/plain'}

    request = requests.post(cdaURL, headers = headers, data = querystring)

    if request.status_code == 200:
        return request.json()
    else:
        # Include the query text in the error to make failures easier to trace
        raise Exception("Query failed code {}. {}".format(request.status_code, querystring))
                


#### Query for the file information specifically
Query for the file by its DRS URI.

In [2]:
querystring = '''SELECT f.* FROM gdc-bq-sample.cda_mvp.v3 p, 
UNNEST(ResearchSubject) AS su, 
UNNEST(su.Specimen) AS sp,
UNNEST(sp.File) AS f
WHERE f.drs_uri = 'drs://dg.4DFC:030e5e74-6461-4f05-a399-de8e470bc056'

'''
query = runAPIQuery(querystring)
query['result']

[{'label': '46db33a7f2003837e88d0a81b8ebec2c_gdc_realn.bam',
  'associated_project': ['TCGA-BRCA'],
  'drs_uri': 'drs://dg.4DFC:030e5e74-6461-4f05-a399-de8e470bc056',
  'identifier': [{'system': 'GDC',
    'value': '030e5e74-6461-4f05-a399-de8e470bc056'}],
  'data_category': 'Sequencing Reads',
  'byte_size': '23894757370',
  'type': None,
  'file_format': None,
  'checksum': 'f0cf564932ed0418c08b828a556d6478',
  'id': '030e5e74-6461-4f05-a399-de8e470bc056',
  'data_type': 'Aligned Reads'}]

This adds little metadata beyond what DRS already provides; note that 'type' and 'file_format' are empty.

#### Query 2 - specimen attributes
The specimen data can be obtained with the query below.

Note: the 'except (File)' clause keeps the other files derived from this specimen out of the result. Here we are only interested in the attributes of the specimen, because they tell us something about the specific file in question. (In some circumstances those other files, or data about them, might be useful; that depends on the user and the use case.)

In [3]:
specimen_querystring = '''SELECT sp.*  except (File)
FROM gdc-bq-sample.cda_mvp.v3 p, 
UNNEST(ResearchSubject) AS su, 
UNNEST(su.Specimen) AS sp,
UNNEST(sp.File) AS f
WHERE f.drs_uri = 'drs://dg.4DFC:030e5e74-6461-4f05-a399-de8e470bc056'

'''
specimen_query = runAPIQuery(specimen_querystring)
specimen_query['result']

[{'derived_from_specimen': 'Initial sample',
  'associated_project': 'TCGA-BRCA',
  'age_at_collection': None,
  'anatomical_site': None,
  'source_material_type': 'Primary Tumor',
  'derived_from_subject': 'TCGA-AR-A2LK',
  'specimen_type': 'sample',
  'id': 'd6f5c34a-0f5c-4aed-977a-74a1e5d50915',
  'primary_disease_type': 'Ductal and Lobular Neoplasms',
  'identifier': [{'system': 'GDC',
    'value': 'd6f5c34a-0f5c-4aed-977a-74a1e5d50915'}]}]
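
The two queries above differ only in their SELECT list; the FROM, UNNEST and WHERE clauses are identical. As a minimal sketch, the query text could be assembled by a small helper. The name `build_drs_query` and its parameters are hypothetical (they are not part of the CDA API), and the table name is simply the one used in this notebook.

```python
# Hypothetical helper: plain string assembly, not part of the CDA API.
CDA_TABLE = 'gdc-bq-sample.cda_mvp.v3'  # table used throughout this notebook


def build_drs_query(select_list, drs_uri, table=CDA_TABLE):
    """Return SQL that unnests down to File level and filters on one DRS URI."""
    # Naive guard: a single quote in the URI would break the SQL string literal
    if "'" in drs_uri:
        raise ValueError("DRS URI must not contain a single quote")
    return (
        "SELECT {select}\n"
        "FROM {table} p,\n"
        "UNNEST(ResearchSubject) AS su,\n"
        "UNNEST(su.Specimen) AS sp,\n"
        "UNNEST(sp.File) AS f\n"
        "WHERE f.drs_uri = '{uri}'"
    ).format(select=select_list, table=table, uri=drs_uri)


drs_uri = 'drs://dg.4DFC:030e5e74-6461-4f05-a399-de8e470bc056'
file_sql = build_drs_query('f.*', drs_uri)
specimen_sql = build_drs_query('sp.* except (File)', drs_uri)
print(specimen_sql)
```

The helper only builds a string; the result would still be passed to runAPIQuery to execute it.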

#### Data from multiple objects
A single query can return values from multiple levels of the model, as follows.

In [4]:
specimen_querystring3 = '''SELECT 
p.days_to_birth,
p.race,
p.sex,
p.ethnicity,
su.Diagnosis,
sp.derived_from_specimen,
sp.age_at_collection,
sp.anatomical_site,
sp.source_material_type,
sp.primary_disease_type,
f.drs_uri,
f.data_category
FROM gdc-bq-sample.cda_mvp.v3 p, 
UNNEST(ResearchSubject) AS su, 
UNNEST(su.Specimen) AS sp,
UNNEST(sp.File) AS f
WHERE f.drs_uri = 'drs://dg.4DFC:030e5e74-6461-4f05-a399-de8e470bc056'

'''
specimen_query3 = runAPIQuery(specimen_querystring3)
specimen_query3['result']

[{'days_to_birth': '-22800',
  'race': 'white',
  'sex': 'female',
  'ethnicity': 'not hispanic or latino',
  'Diagnosis': [{'morphology': '8520/3',
    'tumor_stage': 'stage iii',
    'tumor_grade': 'not reported',
    'Treatment': [{'type': 'Radiation Therapy, NOS', 'outcome': None},
     {'type': 'Pharmaceutical Therapy, NOS', 'outcome': None}],
    'id': '1f41548f-081a-5bc8-937e-58e6205f3794',
    'primary_diagnosis': 'Lobular carcinoma, NOS',
    'age_at_diagnosis': '22800'}],
  'derived_from_specimen': 'Initial sample',
  'age_at_collection': None,
  'anatomical_site': None,
  'source_material_type': 'Primary Tumor',
  'primary_disease_type': 'Ductal and Lobular Neoplasms',
  'drs_uri': 'drs://dg.4DFC:030e5e74-6461-4f05-a399-de8e470bc056',
  'data_category': 'Sequencing Reads'}]
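
In this result the numeric fields arrive as strings and Diagnosis is a nested list, so some post-processing is usually needed before analysis. The sketch below flattens each diagnosis into a simple dict. The helper name `summarize_diagnoses` is hypothetical, and the conversion assumes that 'age_at_diagnosis' is a count of days (it mirrors 'days_to_birth' in this record); the 365.25 days per year divisor is an approximation.

```python
# Trimmed copy of the record returned by the query above
record = {
    'days_to_birth': '-22800',
    'Diagnosis': [{
        'tumor_stage': 'stage iii',
        'Treatment': [{'type': 'Radiation Therapy, NOS', 'outcome': None},
                      {'type': 'Pharmaceutical Therapy, NOS', 'outcome': None}],
        'primary_diagnosis': 'Lobular carcinoma, NOS',
        'age_at_diagnosis': '22800',
    }],
}


def summarize_diagnoses(rec):
    """Yield one flat dict per diagnosis (hypothetical helper)."""
    for dx in rec.get('Diagnosis') or []:
        age_days = dx.get('age_at_diagnosis')
        yield {
            'primary_diagnosis': dx.get('primary_diagnosis'),
            'tumor_stage': dx.get('tumor_stage'),
            # Assumption: the age is given in days; None is passed through unchanged
            'age_at_diagnosis_years': (
                round(int(age_days) / 365.25, 1) if age_days is not None else None
            ),
            # Keep only the treatment type; outcome is None in this record
            'treatments': [t['type'] for t in dx.get('Treatment') or []],
        }


summaries = list(summarize_diagnoses(record))
print(summaries)
```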

The 'except' clause can be used again to write a more compact version of the previous query. This saves enumerating every required column, but columns with the same name at different levels of nesting become ambiguous: in the result below they come back with numeric suffixes ('id', 'id_1', 'id_2').

In [5]:
specimen_querystring4 = '''SELECT 
p.* except (ResearchSubject),
su.* except (Specimen, identifier),
sp.* except (File, identifier),
f.drs_uri,
f.data_category
FROM gdc-bq-sample.cda_mvp.v3 p, 
UNNEST(ResearchSubject) AS su, 
UNNEST(su.Specimen) AS sp,
UNNEST(sp.File) AS f
WHERE f.drs_uri = 'drs://dg.4DFC:030e5e74-6461-4f05-a399-de8e470bc056'

'''
specimen_query4 = runAPIQuery(specimen_querystring4)
specimen_query4['result']

[{'days_to_birth': '-22800',
  'race': 'white',
  'sex': 'female',
  'ethnicity': 'not hispanic or latino',
  'id': 'TCGA-AR-A2LK',
  'Diagnosis': [{'morphology': '8520/3',
    'tumor_stage': 'stage iii',
    'tumor_grade': 'not reported',
    'Treatment': [{'type': 'Radiation Therapy, NOS', 'outcome': None},
     {'type': 'Pharmaceutical Therapy, NOS', 'outcome': None}],
    'id': '1f41548f-081a-5bc8-937e-58e6205f3794',
    'primary_diagnosis': 'Lobular carcinoma, NOS',
    'age_at_diagnosis': '22800'}],
  'associated_project': 'TCGA-BRCA',
  'id_1': '1b703058-e596-45bc-80fe-8b98d545c2e2',
  'primary_disease_type': 'Ductal and Lobular Neoplasms',
  'primary_disease_site': 'Breast',
  'derived_from_specimen': 'Initial sample',
  'associated_project_1': 'TCGA-BRCA',
  'age_at_collection': None,
  'anatomical_site': None,
  'source_material_type': 'Primary Tumor',
  'derived_from_subject': 'TCGA-AR-A2LK',
  'specimen_type': 'sample',
  'id_2': 'd6f5c34a-0f5c-4aed-977a-74a1e5d50915',
  'p
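
The suffixed keys in this output ('id', 'id_1', 'id_2', 'associated_project_1') are the same-named columns from different nesting levels. For instance, 'id_2' here matches the specimen id returned by Query 2. A rough sketch for spotting such collisions in a result row follows; `find_suffixed_keys` is a hypothetical helper, and it assumes that a duplicate always appears as the base name plus '_N'.

```python
import re
from collections import defaultdict


def find_suffixed_keys(row):
    """Group keys like 'id', 'id_1', 'id_2' under their base name (hypothetical helper)."""
    groups = defaultdict(list)
    for key in row:
        match = re.fullmatch(r'(.+)_(\d+)', key)
        # Only treat 'name_N' as a duplicate if plain 'name' is also present
        if match and match.group(1) in row:
            groups[match.group(1)].append(key)
    # Order the duplicates by their numeric suffix, base name first
    return {
        base: [base] + sorted(keys, key=lambda k: int(k.rsplit('_', 1)[1]))
        for base, keys in groups.items()
    }


# A few keys copied from the output above
row = {
    'id': 'TCGA-AR-A2LK',
    'associated_project': 'TCGA-BRCA',
    'id_1': '1b703058-e596-45bc-80fe-8b98d545c2e2',
    'associated_project_1': 'TCGA-BRCA',
    'age_at_collection': None,
    'id_2': 'd6f5c34a-0f5c-4aed-977a-74a1e5d50915',
}
collisions = find_suffixed_keys(row)
print(collisions)
```

Enumerating the columns explicitly, with aliases, avoids the ambiguity at the cost of a longer query.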

#### Using the CDA Python API
Rather than writing SQL, it is also possible to use the CDA Python API.

A query can be formulated on the DRS id as follows.

In [8]:
from cdapython import Q
import json

q1 = Q('ResearchSubject.Specimen.File.drs_uri = "drs://dg.4DFC:030e5e74-6461-4f05-a399-de8e470bc056"')
r = q1.run() 
r.sql
print(r)


Query: SELECT p.* FROM gdc-bq-sample.cda_mvp.v3 AS p, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.Specimen) AS _Specimen, UNNEST(_Specimen.File) AS _File WHERE (_File.drs_uri = 'drs://dg.4DFC:030e5e74-6461-4f05-a399-de8e470bc056')
Offset: 0
Limit: 1000
Count: 1
More pages: No



The result in this case is the whole data structure for the patient from which this file came. The file therefore appears in context, alongside the other data for the same patient.
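
Because the whole patient record is returned, the file of interest has to be located inside the nested structure. A minimal sketch of that walk is given below. `find_file` is a hypothetical helper, the `patient` dict is a tiny stand-in for one result row, and the second file's DRS URI is a placeholder rather than a real value.

```python
def find_file(patient, drs_uri):
    """Return (research_subject, specimen, file) for the matching DRS URI, or None."""
    for subject in patient.get('ResearchSubject') or []:
        for specimen in subject.get('Specimen') or []:
            for file_record in specimen.get('File') or []:
                if file_record.get('drs_uri') == drs_uri:
                    return subject, specimen, file_record
    return None


# Tiny stand-in for one result row; real rows carry many more fields
patient = {
    'id': 'TCGA-AR-A2LK',
    'ResearchSubject': [{
        'Specimen': [{
            'id': 'd6f5c34a-0f5c-4aed-977a-74a1e5d50915',
            'File': [
                # Placeholder URI, not a real identifier
                {'label': 'other-file', 'drs_uri': 'drs://placeholder/other-file'},
                {'label': '46db33a7f2003837e88d0a81b8ebec2c_gdc_realn.bam',
                 'drs_uri': 'drs://dg.4DFC:030e5e74-6461-4f05-a399-de8e470bc056'},
            ],
        }],
    }],
}

hit = find_file(patient, 'drs://dg.4DFC:030e5e74-6461-4f05-a399-de8e470bc056')
if hit is not None:
    subject, specimen, file_record = hit
    print(specimen['id'], file_record['label'])
```

Assuming each item yielded by iterating over r is a plain dict like this one, the same function could be applied to every result in turn.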

The listing of results is long; it is included here at the end for completeness.

In [9]:
for res in r:
    print(json.dumps(res, indent= 2))

{
  "days_to_birth": "-22800",
  "race": "white",
  "sex": "female",
  "ethnicity": "not hispanic or latino",
  "id": "TCGA-AR-A2LK",
  "ResearchSubject": [
    {
      "Diagnosis": [
        {
          "morphology": "8520/3",
          "tumor_stage": "stage iii",
          "tumor_grade": "not reported",
          "Treatment": [
            {
              "type": "Radiation Therapy, NOS",
              "outcome": null
            },
            {
              "type": "Pharmaceutical Therapy, NOS",
              "outcome": null
            }
          ],
          "id": "1f41548f-081a-5bc8-937e-58e6205f3794",
          "primary_diagnosis": "Lobular carcinoma, NOS",
          "age_at_diagnosis": "22800"
        }
      ],
      "Specimen": [
        {
          "File": [
            {
              "label": "TCGA.BRCA.mutect.053f01ed-3154-4aea-9e7f-932c435034b3.DR-10.0.protected.maf.gz",
              "associated_project": [
                "TCGA-BRCA"
              ],
              "