## CDA Example for subject 09CO022 - tested on release 0.1.12

This is an updated version of the example, following the fix in https://github.com/CancerDataAggregator/cda-python/issues/16

This notebook explores whether the genomic and proteomic data from a single CPTAC2 subject (person) are aggregated by the Cancer Data Aggregator (CDA). The following diagram shows, independent of CDA, the hierarchy of specimens and sub-specimens that were derived from that subject. It serves as a 'ground truth' against which we can check whether CDA represents the subject accurately. The subject was part of the TCGA Colon Cancer project and has the identifier 09CO022<sup>1</sup>.

![09CO022](images/09CO022.jpg)

In the Genomic Data Commons (GDC) portal we can see the three specimens that were used for genomic analysis. The ids for the specimens match those shown in blue in the diagram above. Note that the specimen ids shown are those used by the TCGA/CPTAC2 projects, not the UUIDs used within the GDC.

![09CO022](images/09CO022_GDC.jpg)

We'll leave aside for now how to query using the '09CO022' identifier. From the GDC portal page shown above, we can see that the ResearchSubject.id the CDA will use is c5421e34-e5c7-4ba5-aed9-146a5575fd8d.

On to a query using that value.

<sup>1</sup> Other identifiers were created for the subject in different systems, but this identifier is used so that we can refer to the subject independently of any of those specific systems. It is also a more convenient id than a UUID when writing or speaking.

In [1]:
from cdapython import Q, columns, unique_terms
import json

The following query should return the subject shown above. We'll save the JSON result to a file.

In [3]:
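# Query by the GDC ResearchSubject id identified above.
q1 = Q('ResearchSubject.id = "c5421e34-e5c7-4ba5-aed9-146a5575fd8d"')
r = q1.run(limit=2)
print(r)  # the printed summary includes the generated SQL

# Save the first (and only) result as JSON.
j = json.dumps(r[0], indent=3)
with open('query_results/09CO022_fix.json', 'w') as f:
    f.write(j)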


Query: SELECT p.* FROM gdc-bq-sample.cda_mvp.v3 AS p, UNNEST(ResearchSubject) AS _ResearchSubject WHERE (_ResearchSubject.id = 'c5421e34-e5c7-4ba5-aed9-146a5575fd8d')
Offset: 0
Limit: 2
Count: 1
More pages: No



Note that this gave one result, as expected for a query that specifies a single subject id.

The fix in https://github.com/CancerDataAggregator/cda-python/issues/16 addresses the problem of ResearchSubject details being returned at the top level.

The following confirms that we now get only patient attributes at the top level.

In [5]:
for k in r[0].keys():
    print(k)

days_to_birth
race
sex
ethnicity
id
ResearchSubject
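
The same check can be made programmatically rather than by eye. The following is a minimal sketch that splits a record's top-level keys into scalar attributes and nested collections. Here `record` is a hypothetical stand-in for `r[0]`: it uses the key names printed above with placeholder values, and `split_keys` is a helper introduced only for this illustration.

```python
# Hypothetical stand-in for r[0]: key names as printed above, placeholder values.
record = {
    "days_to_birth": None,
    "race": None,
    "sex": None,
    "ethnicity": None,
    "id": "09CO022",
    "ResearchSubject": [{}, {}],  # nested list; contents omitted in this sketch
}


def split_keys(rec):
    """Return (scalar_keys, nested_keys) for the top level of a result record."""
    nested = [k for k, v in rec.items() if isinstance(v, (list, dict))]
    scalar = [k for k in rec if k not in nested]
    return scalar, nested


scalar_keys, nested_keys = split_keys(record)
print("patient attributes:", scalar_keys)
print("nested collections:", nested_keys)
```

With a real result, passing `r[0]` instead of `record` should list `ResearchSubject` as the only nested collection if the fix behaves as described.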


From the above we note that the ResearchSubject records are nested within the top-level patient record. The specimens of each nested ResearchSubject are examined in the next section.

## The same subject as part of a query on subject attributes
Querying by the id of a specific subject, as above, is useful for seeing how the CDA behaves for a single subject, because it isolates the query from other situations.

Using the attributes of subject 09CO022, we can run the following query, which should return 09CO022 alongside other subjects with the same attributes.

In [15]:
qc1 = Q('ResearchSubject.associated_project = "CPTAC-2"')
qc2 = Q('ResearchSubject.Diagnosis.tumor_stage = "Stage IIB"')
qc3 = Q('ResearchSubject.primary_disease_site = "Colon"')

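# Combine the three conditions with AND and run the query.
q2 = qc1.And(qc2).And(qc3)
r2 = q2.run(limit=100)
print(r2)  # the printed summary includes the generated SQL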


Query: SELECT p.* FROM gdc-bq-sample.cda_mvp.v3 AS p, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.Diagnosis) AS _Diagnosis WHERE (((_ResearchSubject.associated_project = 'CPTAC-2') AND (_Diagnosis.tumor_stage = 'Stage IIB')) AND (_ResearchSubject.primary_disease_site = 'Colon'))
Offset: 0
Limit: 100
Count: 8
More pages: No



This has the modified SQL, so the root is still only the patient.

That result shows that we appear to have eight matches. The following explores what they are.

In [20]:
# Print each result's patient id and nested subjects, then save the record to its own file.
for rn, resItem in enumerate(r2, start=1):
    print('Patient {}'.format(resItem['id']))
    for rs in resItem['ResearchSubject']:
        subj_name = rs['Specimen'][0]['derived_from_subject']
        print('nested subject: {}'.format(subj_name))
        ident = rs['identifier'][0]  # renamed to avoid shadowing the built-in id()
        print('\t system:{} id:{}'.format(ident['system'], ident['value']))
    fpath = 'query_results/subj_' + str(rn) + '.json'
    print(fpath)
    with open(fpath, 'w') as f:
        f.write(json.dumps(resItem, indent=3))
    print('_' * 80)

Patient 15CO002
nested subject: 15CO002
	 system:GDC id:44ecd34b-aa2b-4ce1-ab23-c64aee162f69
nested subject: None
	 system:PDC id:d2b0df58-63d6-11e8-bcf1-0a2705229b82
query_results/subj_1.json
________________________________________________________________________________
Patient 15CO002
nested subject: 15CO002
	 system:GDC id:44ecd34b-aa2b-4ce1-ab23-c64aee162f69
nested subject: None
	 system:PDC id:d2b0df58-63d6-11e8-bcf1-0a2705229b82
query_results/subj_2.json
________________________________________________________________________________
Patient 09CO022
nested subject: 09CO022
	 system:GDC id:c5421e34-e5c7-4ba5-aed9-146a5575fd8d
nested subject: None
	 system:PDC id:459e3b69-63d6-11e8-bcf1-0a2705229b82
query_results/subj_3.json
________________________________________________________________________________
Patient 09CO022
nested subject: 09CO022
	 system:GDC id:c5421e34-e5c7-4ba5-aed9-146a5575fd8d
nested subject: None
	 system:PDC id:459e3b69-63d6-11e8-bcf1-0a2705229b82
query_resul

In fact there are only four true patients here. For each of those four, there are two records in the results. 
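
The duplication can be confirmed by counting how often each patient id occurs. The following is a minimal sketch. The `patient_ids` list is a stand-in for `[item['id'] for item in r2]`, filled in with the ids that appear in this notebook's outputs. Their order, and the assumption that these are exactly the eight records returned, are illustrative.

```python
from collections import Counter

# Stand-in for [item['id'] for item in r2]; ids taken from the notebook outputs,
# order assumed for illustration.
patient_ids = [
    "15CO002", "15CO002",
    "09CO022", "09CO022",
    "05CO039", "05CO039",
    "05CO044", "05CO044",
]

counts = Counter(patient_ids)
for pid, n in counts.items():
    print(f"{pid}: {n} record(s)")
print(f"{len(patient_ids)} records, {len(counts)} distinct patients")
```

Any id with a count above one was returned more than once for the same patient.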

### Issue 7
Same subject is returned three times when queried on subject attributes.

In [21]:
import cda_client

# Connect directly to the CDA API so that we can submit our own SQL.
host = 'https://cda.cda-dev.broadinstitute.org'
api_client = cda_client.ApiClient(configuration=cda_client.Configuration(host=host))
api_instance = cda_client.QueryApi(api_client)

In [23]:
import json
query1 = ''' select * from gdc-bq-sample.cda_mvp.v3
where id in
(SELECT distinct p.id FROM gdc-bq-sample.cda_mvp.v3 AS p, 
UNNEST(ResearchSubject) AS _ResearchSubject, 
UNNEST(_ResearchSubject.Diagnosis) AS _Diagnosis 
WHERE (((_ResearchSubject.associated_project = 'CPTAC-2') 
AND (_Diagnosis.tumor_stage = 'Stage IIB')) 
AND (_ResearchSubject.primary_disease_site = 'Colon')) )'''
api_response1 = api_instance.sql_query('v3',query1)

# The results are bulky to list in a notebook - so we'll write them to a file
#with open('query_results/09CO022_query1.json', 'w') as f:
#    f.write(json.dumps(api_response1.result, indent=3))
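
One plausible reading of why the rewritten query removes the duplicates (an assumption, not confirmed against the CDA source) is as follows. The comma join with `UNNEST` yields one row per matching nested ResearchSubject, so a patient whose GDC and PDC entries both match is returned twice. Selecting from the table `WHERE id IN (SELECT DISTINCT ...)` instead yields each patient at most once. The sketch below imitates both query shapes in plain Python over a toy table. The records, field values and the `matches` predicate are hypothetical and heavily simplified.

```python
# Toy table: one row per patient, each with a nested list of research subjects.
# Field values are hypothetical.
patients = [
    {"id": "09CO022", "ResearchSubject": [
        {"system": "GDC", "associated_project": "CPTAC-2"},
        {"system": "PDC", "associated_project": "CPTAC-2"},
    ]},
    {"id": "OTHER", "ResearchSubject": [
        {"system": "GDC", "associated_project": "SOME-OTHER-PROJECT"},
    ]},
]


def matches(rs):
    """Stand-in for the WHERE clause applied to one unnested ResearchSubject."""
    return rs["associated_project"] == "CPTAC-2"


# Shape 1: SELECT p.* FROM t AS p, UNNEST(ResearchSubject) AS rs WHERE ...
# One output row per (patient, matching nested subject) pair.
joined = [p for p in patients for rs in p["ResearchSubject"] if matches(rs)]

# Shape 2: SELECT * FROM t WHERE id IN (SELECT DISTINCT p.id ...)
# Each patient row is emitted at most once.
matching_ids = {p["id"] for p in joined}
deduplicated = [p for p in patients if p["id"] in matching_ids]

print([p["id"] for p in joined])        # ['09CO022', '09CO022']
print([p["id"] for p in deduplicated])  # ['09CO022']
```

If this reading is right, the number of duplicates per patient depends on how many nested entries satisfy the conditions, which would be consistent with each patient appearing twice in the earlier output.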

In [24]:
# Print each result's patient id and nested subjects, then save the record to its own file.
for rn, resItem in enumerate(api_response1.result, start=1):
    print('Patient {}'.format(resItem['id']))
    for rs in resItem['ResearchSubject']:
        subj_name = rs['Specimen'][0]['derived_from_subject']
        print('nested subject: {}'.format(subj_name))
        ident = rs['identifier'][0]  # renamed to avoid shadowing the built-in id()
        print('\t system:{} id:{}'.format(ident['system'], ident['value']))
    fpath = 'query_results/subj_' + str(rn) + '.json'
    print(fpath)
    with open(fpath, 'w') as f:
        f.write(json.dumps(resItem, indent=3))
    print('_' * 80)

Patient 15CO002
nested subject: 15CO002
	 system:GDC id:44ecd34b-aa2b-4ce1-ab23-c64aee162f69
nested subject: None
	 system:PDC id:d2b0df58-63d6-11e8-bcf1-0a2705229b82
query_results/subj_1.json
________________________________________________________________________________
Patient 09CO022
nested subject: 09CO022
	 system:GDC id:c5421e34-e5c7-4ba5-aed9-146a5575fd8d
nested subject: None
	 system:PDC id:459e3b69-63d6-11e8-bcf1-0a2705229b82
query_results/subj_2.json
________________________________________________________________________________
Patient 05CO039
nested subject: 05CO039
	 system:GDC id:997475b1-6648-494a-9322-79aa17be272e
nested subject: None
	 system:PDC id:2254625e-63d6-11e8-bcf1-0a2705229b82
query_results/subj_3.json
________________________________________________________________________________
Patient 05CO044
nested subject: 05CO044
	 system:GDC id:5e55cf3e-9f95-4b8c-8212-b540da3047cb
nested subject: None
	 system:PDC id:24cb0fcb-63d6-11e8-bcf1-0a2705229b82
query_resul