## Query 1
'Find data from TCGA-BRCA project, with donors over age 50 with Stage IIIC cancer.'

General note about this example: it is GDC only, with no aggregation component, so it is not really a test of a Data Aggregator.

The list of available fields, obtained as follows, gives us some clues about how to formulate the question above.

In [9]:
from cdapython import Q, columns, unique_terms
columns()

SELECT field_path FROM `gdc-bq-sample.cda_mvp.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS` WHERE table_name = 'v3'


['days_to_birth',
 'race',
 'sex',
 'ethnicity',
 'id',
 'ResearchSubject',
 'ResearchSubject.Diagnosis',
 'ResearchSubject.Diagnosis.morphology',
 'ResearchSubject.Diagnosis.tumor_stage',
 'ResearchSubject.Diagnosis.tumor_grade',
 'ResearchSubject.Diagnosis.Treatment',
 'ResearchSubject.Diagnosis.Treatment.type',
 'ResearchSubject.Diagnosis.Treatment.outcome',
 'ResearchSubject.Diagnosis.id',
 'ResearchSubject.Diagnosis.primary_diagnosis',
 'ResearchSubject.Diagnosis.age_at_diagnosis',
 'ResearchSubject.Specimen',
 'ResearchSubject.Specimen.File',
 'ResearchSubject.Specimen.File.label',
 'ResearchSubject.Specimen.File.associated_project',
 'ResearchSubject.Specimen.File.drs_uri',
 'ResearchSubject.Specimen.File.identifier',
 'ResearchSubject.Specimen.File.identifier.system',
 'ResearchSubject.Specimen.File.identifier.value',
 'ResearchSubject.Specimen.File.data_category',
 'ResearchSubject.Specimen.File.byte_size',
 'ResearchSubject.Specimen.File.type',
 'ResearchSubject.Specimen.File



### TCGA-BRCA project
There are three immediately obvious project columns, at increasing levels of nesting:
```
ResearchSubject.associated_project
ResearchSubject.Specimen.associated_project
ResearchSubject.Specimen.File.associated_project
```
From knowledge of TCGA my guess is that ResearchSubject.associated_project is the most appropriate to use.

(Various caveats about that)

#### Issue
The lower-level project fields are likely redundant. In the interest of a minimal model, I suggest they be removed.

We can check whether the field contains a value corresponding to the TCGA-BRCA project.

(Note that while this may seem obvious, we should not take it for granted. In this case we found an exact match. That won't always be the case.)

In [10]:
ut = unique_terms("ResearchSubject.associated_project")

# Print the values of a list compactly, `cols` items per row
def formatList(lst, cols=4):
    for i in range(0, len(lst), cols):
        print(lst[i:i + cols])

formatList(ut)

SELECT DISTINCT(_ResearchSubject.associated_project) FROM `gdc-bq-sample.cda_mvp.v3`, UNNEST(ResearchSubject) AS _ResearchSubject ORDER BY _ResearchSubject.associated_project


['Academia Sinica LUAD-100', 'BEATAML1.0-COHORT', 'BEATAML1.0-CRENOLANIB', 'CGCI-BLGSP']
['CGCI-HTMCP-CC', 'CMI-ASC', 'CMI-MBC', 'CMI-MPC']
['CPTAC-2', 'CPTAC-3', 'CPTAC-TCGA', 'CPTAC2 Retrospective']
['CPTAC3-Discovery', 'CTSP-DLBCL1', 'FM-AD', 'GENIE-DFCI']
['GENIE-GRCC', 'GENIE-JHU', 'GENIE-MDA', 'GENIE-MSK']
['GENIE-NKI', 'GENIE-UHN', 'GENIE-VICC', 'Georgetown Lung Cancer Proteomics Study']
['HCMI-CMDC', 'Human Early-Onset Gastric Cancer - Korea University', 'Integrated Proteogenomic Characterization of HBV-related Hepatocellular carcinoma', 'MMRF-COMMPASS']
['NCICCR-DLBCL', 'OHSU-CNL', 'ORGANOID-PANCREATIC', 'Oral Squamous Cell Carcinoma - Chang Gung University']
['PJ25730263', 'Proteogenomic Analysis of Pediatric Brain Cancer Tumors Pilot Study', 'TARGET-ALL-P1', 'TARGET-ALL-P2']
['TARGET-ALL-P3', 'TARGET-AML', 'TARGET-CCSK', 'TARGET-NBL']
['TARGET-OS', 'TARGET-RT', 'TARGET-WT', 'TCGA-ACC']
['TCGA-BLCA', 'TCGA-BRCA', 'TCGA-CESC', 'TCGA-CHOL']
['TCGA-COAD', 'TCGA-DLBC', 'TCGA-ESCA
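
An exact match was found here, but as noted above that cannot be assumed. The following is a minimal sketch of a looser lookup, assuming the value returned by `unique_terms` is a plain Python list of strings that may contain `None`. The helper `find_terms` and the sample list are hypothetical and only for illustration.

```python
def find_terms(terms, needle):
    """Return the terms containing `needle`, ignoring case and surrounding whitespace."""
    needle = needle.strip().lower()
    # Skip None entries, which can appear in lists of distinct values
    return [t for t in terms if t is not None and needle in t.strip().lower()]

# A few of the project values listed above, used here as sample input
sample_projects = ['CPTAC-TCGA', 'TARGET-WT', 'TCGA-ACC', 'TCGA-BLCA', 'TCGA-BRCA', None]

print(find_terms(sample_projects, 'brca'))
```

In the notebook this would be called as `find_terms(ut, 'brca')`, which makes it easier to spot near matches when the project name is spelled differently from what was expected.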

### 'Over age 50'
There are three columns shown above which one might choose to formulate the query:

- days_to_birth
- ResearchSubject.Diagnosis.age_at_diagnosis
- ResearchSubject.Specimen.age_at_collection

In [11]:
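def runQuery1(p, a, s=None):
    # Combine the first two queries, plus an optional third, with AND
    bigquery = p.And(a)
    if s is not None:
        bigquery = bigquery.And(s)
    r4 = bigquery.run(limit=1000)
    # Printing the result shows the generated SQL, offset, limit and count
    print(r4)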

In [12]:
ageq = Q('days_to_birth > 365.25*50')
projq = Q('ResearchSubject.associated_project = "TCGA-BRCA" ')

runQuery1(projq, ageq)



Query: SELECT * FROM gdc-bq-sample.cda_mvp.v3, UNNEST(ResearchSubject) AS _ResearchSubject WHERE ((_ResearchSubject.associated_project = "TCGA-BRCA" ) AND (v3.days_to_birth > 365.25*50))
Offset: 0
Limit: 1000
Count: 0
More pages: No



In [13]:
# days_to_birth matched nothing, so try again with age at diagnosis
ageq = Q('ResearchSubject.Diagnosis.age_at_diagnosis > 365.25*50')
runQuery1(projq, ageq)


Query: SELECT * FROM gdc-bq-sample.cda_mvp.v3, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.Diagnosis) AS _Diagnosis WHERE ((_ResearchSubject.associated_project = "TCGA-BRCA" ) AND (_Diagnosis.age_at_diagnosis > 365.25*50))
Offset: 0
Limit: 1000
Count: 794
More pages: No
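
The threshold in both queries is 50 years expressed in days. The `days_to_birth` query returned no rows. One possible explanation, stated here as an assumption rather than something checked against this table, is that GDC records `days_to_birth` as a negative number of days, in which case the comparison would need to be reversed. The sketch below only builds the query strings. `years_to_days` and `older_than` are hypothetical helpers, and whether the Q parser accepts a negative literal has not been tested.

```python
DAYS_PER_YEAR = 365.25  # average year length, as used in the queries above

def years_to_days(years):
    """Convert an age in years to days."""
    return years * DAYS_PER_YEAR

def older_than(field, years, negative=False):
    """Build a query string selecting ages above `years` for a field held in days.

    negative=True is for fields stored as negative day counts
    (assumed here for days_to_birth; not verified against this table).
    """
    days = years_to_days(years)
    if negative:
        return '{} < {}'.format(field, -days)
    return '{} > {}'.format(field, days)

print(older_than('ResearchSubject.Diagnosis.age_at_diagnosis', 50))
print(older_than('days_to_birth', 50, negative=True))
```

The resulting strings could then be passed to `Q(...)` in place of the hand-written ones.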



### Stage IIIC cancer
As far as fields go, this seems less ambiguous. ResearchSubject.Diagnosis.tumor_stage seems to be the field to use. We can go straight to looking at its list of values.

In [14]:
stage_terms = unique_terms("ResearchSubject.Diagnosis.tumor_stage")
formatList(stage_terms, 6)

SELECT DISTINCT(_Diagnosis.tumor_stage) FROM `gdc-bq-sample.cda_mvp.v3`, UNNEST(ResearchSubject) AS _ResearchSubject,UNNEST(_ResearchSubject.Diagnosis) AS _Diagnosis ORDER BY _Diagnosis.tumor_stage


[None, '', '1B', '2A', '2B', '3']
['4', 'Adverse', 'Favorable', 'FavorableOrIntermediate', 'I', 'I b']
['IA', 'IB', 'IC', 'II', 'II b', 'IIA']
['IIB', 'III', 'IIIA', 'IIIB', 'IIIC', 'IPI:0']
['IPI:1', 'IPI:12', 'IPI:13', 'IPI:14', 'IPI:15', 'IPI:2']
['IPI:23', 'IPI:24', 'IPI:25', 'IPI:3', 'IPI:34', 'IPI:35']
['IPI:4', 'IPI:45', 'IPI:5', 'IV', 'IVA', 'IVB']
['IVa', 'IVb', 'Intermediate', 'IntermediateOrAdverse', 'N/A', 'Normal']
['Not Performed', 'Not Reported', 'Not Reported/ Unknown', 'Not Reported/Unknown', 'PT4apN0', 'Stage 1B']
['Stage I', 'Stage IA', 'Stage IA3', 'Stage IB', 'Stage IC', 'Stage II']
['Stage IIA', 'Stage IIB', 'Stage III', 'Stage IIIA', 'Stage IIIB', 'Stage IIIC']
['Stage IV', 'Stage IVA', 'Stage IVB', 'Stage1', 'T1N0Mx', 'T1aN0M0']
['T2', 'TxNxM1', 'Unknown', 'i', 'i/ii nos', 'ii']
['ii/v', 'iii', 'iii/v', 'iiib', 'iiib/v', 'is']
['iv', 'iv/v', 'na', 'no resection', 'not reported', 'pT1']
['pT1a', 'pT1b', 'pT1pN0', 'pT1pNx', 'pT2 N0', 'pT2, pN2,  pM not applicable'

However, we have a number of choices for the value that corresponds to Stage IIIC.

```
'IIIC'
'Stage IIIC'
'stage iiic'
```

Looking at the values listed (e.g. 'iiib' and 'iiib/v'), it's only by luck that 'iiic' wasn't also used.
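
Rather than relying on luck, the candidate values could be picked out of the `unique_terms` list programmatically. This is a minimal sketch that assumes the variants differ only in case, whitespace and an optional "Stage" prefix, which is not guaranteed for unharmonized data. `normalize_stage` and `stage_variants` are hypothetical helpers, and the sample list is a small subset of the values discussed above.

```python
import re

def normalize_stage(value):
    """Lower-case a stage value and drop an optional leading 'stage'."""
    text = value.strip().lower()
    return re.sub(r'^stage\s*', '', text)

def stage_variants(terms, target):
    """Return every term that normalizes to the same stage as `target`."""
    wanted = normalize_stage(target)
    # `t and ...` skips None and empty strings
    return [t for t in terms if t and normalize_stage(t) == wanted]

sample_terms = [None, '', 'IIIB', 'IIIC', 'Stage IIIB', 'Stage IIIC',
                'iiib', 'iiib/v', 'stage iiic', 'Stage1']

print(stage_variants(sample_terms, 'Stage IIIC'))
```

Values such as 'iiib/v' are deliberately not matched here. Deciding how to treat combined or free-text stages is part of the harmonization question rather than something a string rule can settle.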

Note: I hadn't fully taken on board that what is passed to Q is a query written as a string literal. Strictly speaking, this is not a Python binding for the query language; it is effectively a pass-through to the specifics of the underlying query language. That raises the question of what purpose the layer serves, since it does not really provide the separation from the implementation that was sought.

We might guess at the following as an approach to searching for all the variants of Stage IIIC.


In [15]:
qc2 = Q('ResearchSubject.Diagnosis.tumor_stage in ("IIIC","Stage IIIC","stage iiic") ')
r3 = qc2.run(limit=1000) 
r3.sql
print(r3)

ValueError: Invalid value for `node_type` (in), must be one of ['column', 'quoted', 'unquoted', '>=', '<=', '<', '>', '=', '~', 'AND', 'OR', 'NOT', 'SUBQUERY']

That syntax is not allowed ('in' is not among the node types listed in the error), but the following approach, which chains Or, seems to work.

In [16]:
stageq1 = Q('ResearchSubject.Diagnosis.tumor_stage = "Stage IIIC" ')
stageq2 = Q('ResearchSubject.Diagnosis.tumor_stage = "IIIC" ')
stageq3 = Q('ResearchSubject.Diagnosis.tumor_stage = "stage iiic" ')

stageq = stageq1.Or(stageq2).Or(stageq3)
r3 = stageq.run(limit=1000) 
r3.sql
print(r3)


Query: SELECT * FROM gdc-bq-sample.cda_mvp.v3, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.Diagnosis) AS _Diagnosis WHERE (((_Diagnosis.tumor_stage = "Stage IIIC" ) OR (_Diagnosis.tumor_stage = "IIIC" )) OR (_Diagnosis.tumor_stage = "stage iiic" ))
Offset: 0
Limit: 1000
Count: 420
More pages: No
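
The chain of `.Or` calls can be generated from a list of values instead of being written out by hand. The sketch below uses `functools.reduce` and assumes only that query objects expose an `.Or(other)` method returning a combined query, as used above. `FakeQ` is a stand-in so the example runs without cdapython, and `any_of` is a hypothetical helper.

```python
from functools import reduce

class FakeQ:
    """Stand-in for cdapython's Q that records only the query text (illustration only)."""
    def __init__(self, text):
        self.text = text

    def Or(self, other):
        return FakeQ('({}) OR ({})'.format(self.text, other.text))

def any_of(field, values, make_q=FakeQ):
    """Build `field = v1 OR field = v2 ...` by chaining .Or over a non-empty list."""
    clauses = [make_q('{} = "{}"'.format(field, v)) for v in values]
    return reduce(lambda left, right: left.Or(right), clauses)

combined = any_of('ResearchSubject.Diagnosis.tumor_stage',
                  ['Stage IIIC', 'IIIC', 'stage iiic'])
print(combined.text)
```

In the notebook the real class would be passed in, for example `any_of('ResearchSubject.Diagnosis.tumor_stage', stages, make_q=Q)`, which gives the same left-nested OR structure as the hand-written version.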



### Combining all three parts of Query 1

In [17]:
runQuery1(projq, ageq, stageq)


Query: SELECT * FROM gdc-bq-sample.cda_mvp.v3, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.Diagnosis) AS _Diagnosis WHERE (((_ResearchSubject.associated_project = "TCGA-BRCA" ) AND (_Diagnosis.age_at_diagnosis > 365.25*50)) AND (((_Diagnosis.tumor_stage = "Stage IIIC" ) OR (_Diagnosis.tumor_stage = "IIIC" )) OR (_Diagnosis.tumor_stage = "stage iiic" )))
Offset: 0
Limit: 1000
Count: 45
More pages: No



As a check, compare the results for each stage value individually.

In [18]:
stages = ['IIIC', 'Stage IIIC', 'stage iiic']

for stage in stages:
    qText = 'ResearchSubject.Diagnosis.tumor_stage = "{}" '.format(stage)
    stageq = Q(qText)
    runQuery1(projq, ageq, stageq)
    print('_'*80)


Query: SELECT * FROM gdc-bq-sample.cda_mvp.v3, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.Diagnosis) AS _Diagnosis WHERE (((_ResearchSubject.associated_project = "TCGA-BRCA" ) AND (_Diagnosis.age_at_diagnosis > 365.25*50)) AND (_Diagnosis.tumor_stage = "IIIC" ))
Offset: 0
Limit: 1000
Count: 0
More pages: No

________________________________________________________________________________

Query: SELECT * FROM gdc-bq-sample.cda_mvp.v3, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.Diagnosis) AS _Diagnosis WHERE (((_ResearchSubject.associated_project = "TCGA-BRCA" ) AND (_Diagnosis.age_at_diagnosis > 365.25*50)) AND (_Diagnosis.tumor_stage = "Stage IIIC" ))
Offset: 0
Limit: 1000
Count: 0
More pages: No

________________________________________________________________________________

Query: SELECT * FROM gdc-bq-sample.cda_mvp.v3, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.Diagnosis) AS _Diagnosis WHERE (((_Resear

So it seems all the results come from records where the stage value is "stage iiic". As noted in Slack, the data have not been harmonized. It seems likely that stage was reported in this form in the TCGA-BRCA project, at least.

This is informative about the scope of the harmonization problem, and about what approaches might be taken to deal with it.

Can we do unique_terms() specific to a project? Or would we have to code this ourselves?
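
If we did have to code it ourselves, one option would be to tally the values found in the rows of a project-restricted query. The sketch below assumes each row is a dict mirroring the nested schema listed by `columns()`. That layout, and how rows are extracted from the result object, are assumptions that have not been checked against cdapython. `stage_counts` is a hypothetical helper and the sample rows are made up.

```python
from collections import Counter

def stage_counts(rows, project):
    """Count tumor_stage values for the ResearchSubjects belonging to one project."""
    counts = Counter()
    for row in rows:
        for subject in row.get('ResearchSubject', []):
            if subject.get('associated_project') != project:
                continue
            for diagnosis in subject.get('Diagnosis', []):
                counts[diagnosis.get('tumor_stage')] += 1
    return counts

# Made-up rows in the assumed layout, not real query output
sample_rows = [
    {'id': 'a', 'ResearchSubject': [
        {'associated_project': 'TCGA-BRCA',
         'Diagnosis': [{'tumor_stage': 'stage iiic'}]}]},
    {'id': 'b', 'ResearchSubject': [
        {'associated_project': 'TCGA-BRCA',
         'Diagnosis': [{'tumor_stage': 'stage iia'}]},
        {'associated_project': 'CPTAC-2',
         'Diagnosis': [{'tumor_stage': 'Stage IIIC'}]}]},
]

print(stage_counts(sample_rows, 'TCGA-BRCA'))
```

A tally like this would also show how many records use each spelling, which `unique_terms` alone does not.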

### Back to Age
What about ResearchSubject.Specimen.age_at_collection?

In [27]:
stageq = stageq1.Or(stageq2).Or(stageq3)
collection_ageq = Q('ResearchSubject.Specimen.age_at_collection > 365.25*50')
runQuery1(projq, collection_ageq, stageq )


Query: SELECT * FROM gdc-bq-sample.cda_mvp.v3, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.Specimen) AS _Specimen, UNNEST(_ResearchSubject.Diagnosis) AS _Diagnosis WHERE (((_ResearchSubject.associated_project = "TCGA-BRCA" ) AND (_Specimen.age_at_collection > 365.25*50)) AND (((_Diagnosis.tumor_stage = "Stage IIIC" ) OR (_Diagnosis.tumor_stage = "IIIC" )) OR (_Diagnosis.tumor_stage = "stage iiic" )))
Offset: 0
Limit: 1000
Count: 0
More pages: No



In [26]:
collection_ageq = Q('ResearchSubject.Specimen.age_at_collection > 50')
runQuery1(projq, collection_ageq, stageq )


Query: SELECT * FROM gdc-bq-sample.cda_mvp.v3, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.Specimen) AS _Specimen, UNNEST(_ResearchSubject.Diagnosis) AS _Diagnosis WHERE (((_ResearchSubject.associated_project = "TCGA-BRCA" ) AND (_Specimen.age_at_collection > 50)) AND (((_Diagnosis.tumor_stage = "Stage IIIC" ) OR (_Diagnosis.tumor_stage = "IIIC" )) OR (_Diagnosis.tumor_stage = "stage iiic" )))
Offset: 0
Limit: 1000
Count: 0
More pages: No



In [28]:
collection_ageq = Q('ResearchSubject.Specimen.age_at_collection > 0')
runQuery1(projq, collection_ageq, stageq )


Query: SELECT * FROM gdc-bq-sample.cda_mvp.v3, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.Specimen) AS _Specimen, UNNEST(_ResearchSubject.Diagnosis) AS _Diagnosis WHERE (((_ResearchSubject.associated_project = "TCGA-BRCA" ) AND (_Specimen.age_at_collection > 0)) AND (((_Diagnosis.tumor_stage = "Stage IIIC" ) OR (_Diagnosis.tumor_stage = "IIIC" )) OR (_Diagnosis.tumor_stage = "stage iiic" )))
Offset: 0
Limit: 1000
Count: 0
More pages: No



In [31]:
collection_ageq = Q('ResearchSubject.Specimen.age_at_collection NOT null')
runQuery1(projq, collection_ageq, stageq )

ApiException: (400)
Reason: 
HTTP response headers: HTTPHeaderDict({'Content-Type': 'application/json', 'Transfer-Encoding': 'chunked', 'Date': 'Fri, 09 Apr 2021 21:16:53 GMT', 'Via': '1.1 google', 'Alt-Svc': 'clear'})
HTTP response body: {"message":"Error calling BigQuery: 'SELECT * FROM gdc-bq-sample.cda_mvp.v3, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.Specimen) AS _Specimen, UNNEST(_ResearchSubject.Diagnosis) AS _Diagnosis WHERE (((_ResearchSubject.associated_project = \"TCGA-BRCA\" ) AND (NOT _Specimen.age_at_collection)) AND (((_Diagnosis.tumor_stage = \"Stage IIIC\" ) OR (_Diagnosis.tumor_stage = \"IIIC\" )) OR (_Diagnosis.tumor_stage = \"stage iiic\" ))) LIMIT 1000 OFFSET 0'","statusCode":400,"causes":["Error calling BigQuery: 'SELECT * FROM gdc-bq-sample.cda_mvp.v3, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.Specimen) AS _Specimen, UNNEST(_ResearchSubject.Diagnosis) AS _Diagnosis WHERE (((_ResearchSubject.associated_project = \"TCGA-BRCA\" ) AND (NOT _Specimen.age_at_collection)) AND (((_Diagnosis.tumor_stage = \"Stage IIIC\" ) OR (_Diagnosis.tumor_stage = \"IIIC\" )) OR (_Diagnosis.tumor_stage = \"stage iiic\" ))) LIMIT 1000 OFFSET 0'","No matching signature for operator NOT for argument types: INT64. Supported signature: NOT (BOOL) at [1:285]","400 Bad Request\nGET https://www.googleapis.com/bigquery/v2/projects/broad-cda-dev/queries/a755be75-7bdb-4f2f-8071-466ef758a209?location=US&maxResults=0&prettyPrint=false\n{\n  \"code\" : 400,\n  \"errors\" : [ {\n    \"domain\" : \"global\",\n    \"location\" : \"q\",\n    \"locationType\" : \"parameter\",\n    \"message\" : \"No matching signature for operator NOT for argument types: INT64. Supported signature: NOT (BOOL) at [1:285]\",\n    \"reason\" : \"invalidQuery\"\n  } ],\n  \"message\" : \"No matching signature for operator NOT for argument types: INT64. Supported signature: NOT (BOOL) at [1:285]\",\n  \"status\" : \"INVALID_ARGUMENT\"\n}"]}

The request fails because the condition was translated to (NOT _Specimen.age_at_collection), and BigQuery's NOT operator accepts only a BOOL argument, not INT64. In BigQuery SQL the null test would be written IS NOT NULL; whether the Q syntax can express that is not clear from this error.


### Looking at precedence
All of the following behave as one would wish. Are things like this covered in unit testing by the CDA team?

In [23]:
stageq = stageq1.Or(stageq2).Or(stageq3)
runQuery1(stageq, projq, ageq )


Query: SELECT * FROM gdc-bq-sample.cda_mvp.v3, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.Diagnosis) AS _Diagnosis WHERE (((((_Diagnosis.tumor_stage = "Stage IIIC" ) OR (_Diagnosis.tumor_stage = "IIIC" )) OR (_Diagnosis.tumor_stage = "stage iiic" )) AND (_ResearchSubject.associated_project = "TCGA-BRCA" )) AND (_Diagnosis.age_at_diagnosis > 365.25*50))
Offset: 0
Limit: 1000
Count: 45
More pages: No



In [24]:
runQuery1(ageq, stageq, projq)


Query: SELECT * FROM gdc-bq-sample.cda_mvp.v3, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.Diagnosis) AS _Diagnosis WHERE (((_Diagnosis.age_at_diagnosis > 365.25*50) AND (((_Diagnosis.tumor_stage = "Stage IIIC" ) OR (_Diagnosis.tumor_stage = "IIIC" )) OR (_Diagnosis.tumor_stage = "stage iiic" ))) AND (_ResearchSubject.associated_project = "TCGA-BRCA" ))
Offset: 0
Limit: 1000
Count: 45
More pages: No
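
As an illustration of the kind of unit test meant here, the three conditions combined in every order should select the same records. The sketch below uses plain Python predicates and made-up records as stand-ins for Q objects and query rows, so it shows the shape of the check but does not exercise cdapython itself.

```python
from itertools import permutations

# Made-up records standing in for query rows
records = [
    {'project': 'TCGA-BRCA', 'age_days': 20000, 'stage': 'stage iiic'},
    {'project': 'TCGA-BRCA', 'age_days': 15000, 'stage': 'stage iiic'},
    {'project': 'TCGA-ACC', 'age_days': 20000, 'stage': 'Stage IIIC'},
    {'project': 'TCGA-BRCA', 'age_days': 21000, 'stage': 'stage iia'},
]

def in_project(r):
    return r['project'] == 'TCGA-BRCA'

def over_50(r):
    return r['age_days'] > 365.25 * 50

def is_stage_iiic(r):
    # The OR of the three spellings, kept together as a single condition
    return r['stage'] in ('Stage IIIC', 'IIIC', 'stage iiic')

def select(conditions):
    """Keep the records that satisfy every condition, evaluated in the given order."""
    return [r for r in records if all(c(r) for c in conditions)]

results = [select(order) for order in permutations([in_project, over_50, is_stage_iiic])]
assert all(r == results[0] for r in results)
print(len(results), 'orderings,', len(results[0]), 'matching record(s)')
```

A test against the real service would replace `select` with calls equivalent to `runQuery1` and compare the counts returned for each ordering.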

