## CDA v2 - Example for subject 09CO022

This notebook revisits a case examined in version 1. Subject 09CO022 is a good example for CDA because they are known to have genomic, imaging and proteomic data. Several aspects of this case were examined in v1; see [v1tests/09CO022 Example](../v1tests/09CO022%20Example.ipynb). This new notebook, using CDA v2, revisits only the issues related to specimens and files. The issues in the original notebook related to Diagnosis and other phenotypic data will be covered elsewhere.

A diagram in the [v1 notebook](../v1tests/09CO022%20Example.ipynb) shows, independent of CDA, the hierarchy of specimens and sub-specimens, and the known images, derived from subject 09CO022 (a person).

In the Genomic Data Commons portal we can see the three specimens that were used for genomic analysis. The ids for the specimens match those in blue in the diagram. Note that the specimen ids shown are those used by the TCGA/CPTAC2 projects, as opposed to the UUIDs within the GDC.

A number of issues were identified in version 1. This notebook explores what has happened with those issues in version 2.

In [1]:
from cdapython import Q, columns, unique_terms
import json

The following query is the one we would expect to return the subject shown above. We'll save the JSON result to a file. Browsing [that file](./query_results/09CO022.json) may be useful for following along with this notebook. It is too large to include in full here, but the key parts are shown.

In [2]:
q1 = Q('ResearchSubject.id = "c5421e34-e5c7-4ba5-aed9-146a5575fd8d"')
r = q1.run(limit=2) 
r.sql
print(r)
j = json.dumps(r[0], indent=3)
with open('query_results/09CO022.json', 'w') as f:
    f.write(j)

Getting results from database

Total execution time: 8788 ms

QueryID: 25d5d1ed-a18a-4974-ba1f-d144703550cd
Query: SELECT all_v2.* FROM gdc-bq-sample.integration.all_v2 AS all_v2, UNNEST(ResearchSubject) AS _ResearchSubject WHERE (_ResearchSubject.id = 'c5421e34-e5c7-4ba5-aed9-146a5575fd8d')
Offset: 0
Count: 1
Total Row Count: 1
More pages: False



Note that this gave one result, as expected for a query specifying a single subject id.

Does this return the expected specimens from the diagram? To find out, we need to look within each ResearchSubject.

### Specimens within ResearchSubject
Looking at the specimens within Research Subject

In [3]:
for subj in r[0]['ResearchSubject']:
    for k, v in subj.items():
        if k == 'identifier':
            # Show the source system and its id for this ResearchSubject
            print('identifier: {}:{}'.format(v[0]['system'], v[0]['value']))
        elif isinstance(v, list):
            # Summarize nested lists (Diagnosis, File, Specimen) by their length
            print('{} has {} items'.format(k, len(v)))
        else:
            print('{} : {}'.format(k, v))

    print('_'*80)

id : c5421e34-e5c7-4ba5-aed9-146a5575fd8d
identifier: GDC:c5421e34-e5c7-4ba5-aed9-146a5575fd8d
associated_project : CPTAC-2
primary_disease_type : Adenomas and Adenocarcinomas
primary_disease_site : Colon
Diagnosis has 1 items
File has 30 items
Specimen has 11 items
________________________________________________________________________________
id : 459e3b69-63d6-11e8-bcf1-0a2705229b82
identifier: PDC:459e3b69-63d6-11e8-bcf1-0a2705229b82
associated_project : CPTAC-2
primary_disease_type : Colon Adenocarcinoma
primary_disease_site : Colon
Diagnosis has 1 items
File has 163 items
Specimen has 5 items
________________________________________________________________________________


That's more than we might expect. However, the following exploration of the specimens shows that the 11 GDC specimens form exactly the same hierarchy as in the GDC screenshot in the original notebook: two parent samples, each with a sample->portion->analyte->aliquot hierarchy, leading to three aliquots on which genomic sequencing was performed.

#### First for the GDC "Research Subject"

In [4]:
import pandas as pd
rowsFromRoot = []
for s in r[0]['ResearchSubject'][0]['Specimen']:
    rowsFromRoot.append([s['identifier'][0]['value'], s['derived_from_specimen'], s['specimen_type'], s['source_material_type']])
df = pd.DataFrame(rowsFromRoot)
df.columns = ['identifier','derived_from_specimen','specimen_type','source_material']
df



| | identifier | derived_from_specimen | specimen_type | source_material |
|---|---|---|---|---|
| 0 | 4591a53d-5668-4a70-b44b-e08a3d59267e | initial specimen | sample | Primary Tumor |
| 1 | c53c4d60-2ddb-5da8-932e-00a86fa2347f | 4591a53d-5668-4a70-b44b-e08a3d59267e | portion | Primary Tumor |
| 2 | 31075cfa-7aef-59f1-bf54-a1cddb5ee0fd | c53c4d60-2ddb-5da8-932e-00a86fa2347f | analyte | Primary Tumor |
| 3 | 0d8adcbf-13f0-48c3-83df-3fa205b79ae8 | 31075cfa-7aef-59f1-bf54-a1cddb5ee0fd | aliquot | Primary Tumor |
| 4 | d085ebd9-7605-54a0-abb9-10867f5fa1b1 | 4591a53d-5668-4a70-b44b-e08a3d59267e | portion | Primary Tumor |
| 5 | a31724b6-e550-552b-bd61-41341c534e28 | d085ebd9-7605-54a0-abb9-10867f5fa1b1 | analyte | Primary Tumor |
| 6 | 9250d96e-1cdc-4d68-8a56-f7b186a6fab5 | a31724b6-e550-552b-bd61-41341c534e28 | aliquot | Primary Tumor |
| 7 | b12c257d-7409-4858-9384-c430929a075a | initial specimen | sample | Blood Derived Normal |
| 8 | 702d7ba0-9558-5b2d-af4d-cd797485b8c1 | b12c257d-7409-4858-9384-c430929a075a | portion | Blood Derived Normal |
| 9 | f0003f0a-07ea-548e-b1f7-7e6d1b27d47a | 702d7ba0-9558-5b2d-af4d-cd797485b8c1 | analyte | Blood Derived Normal |
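
As a rough cross-check of the hierarchy, the rows visible in the output above can be arranged into a tree by following `derived_from_specimen`. This is only a sketch: the helper names `build_children` and `render_tree` are made up for illustration, and the rows are copied by hand from the table rather than read from `r`, so only the specimens shown there are included.

```python
from collections import Counter, defaultdict

# (identifier, derived_from_specimen, specimen_type), copied from the table above
rows = [
    ("4591a53d-5668-4a70-b44b-e08a3d59267e", "initial specimen", "sample"),
    ("c53c4d60-2ddb-5da8-932e-00a86fa2347f", "4591a53d-5668-4a70-b44b-e08a3d59267e", "portion"),
    ("31075cfa-7aef-59f1-bf54-a1cddb5ee0fd", "c53c4d60-2ddb-5da8-932e-00a86fa2347f", "analyte"),
    ("0d8adcbf-13f0-48c3-83df-3fa205b79ae8", "31075cfa-7aef-59f1-bf54-a1cddb5ee0fd", "aliquot"),
    ("d085ebd9-7605-54a0-abb9-10867f5fa1b1", "4591a53d-5668-4a70-b44b-e08a3d59267e", "portion"),
    ("a31724b6-e550-552b-bd61-41341c534e28", "d085ebd9-7605-54a0-abb9-10867f5fa1b1", "analyte"),
    ("9250d96e-1cdc-4d68-8a56-f7b186a6fab5", "a31724b6-e550-552b-bd61-41341c534e28", "aliquot"),
    ("b12c257d-7409-4858-9384-c430929a075a", "initial specimen", "sample"),
    ("702d7ba0-9558-5b2d-af4d-cd797485b8c1", "b12c257d-7409-4858-9384-c430929a075a", "portion"),
    ("f0003f0a-07ea-548e-b1f7-7e6d1b27d47a", "702d7ba0-9558-5b2d-af4d-cd797485b8c1", "analyte"),
]


def build_children(rows):
    """Map each parent identifier to the (identifier, type) pairs derived from it."""
    children = defaultdict(list)
    for ident, parent, spec_type in rows:
        children[parent].append((ident, spec_type))
    return children


def render_tree(children, parent="initial specimen", depth=0):
    """Return indented lines, one per specimen, walking down from the root samples."""
    lines = []
    for ident, spec_type in children.get(parent, []):
        lines.append("{}{} {}".format("  " * depth, spec_type, ident))
        lines.extend(render_tree(children, ident, depth + 1))
    return lines


children = build_children(rows)
tree_lines = render_tree(children)
print("\n".join(tree_lines))

# Tally how many specimens of each type appear in the rows above
type_counts = Counter(spec_type for _, _, spec_type in rows)
print(dict(type_counts))
```

In the notebook itself the same rows could be taken from `df` instead of being typed in, which would pick up every specimen returned by the query.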


#### Then the PDC "Research Subject"

In [5]:
rowsFromRoot = []
for s in r[0]['ResearchSubject'][1]['Specimen']:
    rowsFromRoot.append([s['identifier'][0]['value'], s['derived_from_specimen'], s['specimen_type'], s['source_material_type']])
df = pd.DataFrame(rowsFromRoot)
df.columns = ['identifier','derived_from_specimen','specimen_type','source_material']
df

| | identifier | derived_from_specimen | specimen_type | source_material |
|---|---|---|---|---|
| 0 | f4af3e4d-641b-11e8-bcf1-0a2705229b82 | initial specimen | sample | Solid Tissue Normal |
| 1 | 208ebc64-6425-11e8-bcf1-0a2705229b82 | f4af3e4d-641b-11e8-bcf1-0a2705229b82 | aliquot | Solid Tissue Normal |
| 2 | f6cce507-641b-11e8-bcf1-0a2705229b82 | initial specimen | sample | Primary Tumor |
| 3 | 08b2a2bf-6427-11e8-bcf1-0a2705229b82 | f6cce507-641b-11e8-bcf1-0a2705229b82 | aliquot | Primary Tumor |
| 4 | 44f5b956-642a-11e8-bcf1-0a2705229b82 | f6cce507-641b-11e8-bcf1-0a2705229b82 | aliquot | Primary Tumor |


For the PDC, v1 returned just two specimens; now we have five. This is good. The additional specimens for the PDC ResearchSubject are the distinct aliquots that went to PNNL and VU. That fixes an issue from v1, described as issue 4 in the original notebook and recently reported as [cda-python/issues/97](https://github.com/CancerDataAggregator/cda-python/issues/97).

If things are going well, we would expect the VU and PNNL files (see the original notebook) to be associated only with the specific aliquot. From what I've seen elsewhere, though, I'm not confident that will be the case. We have seen files included not only within specimens at higher levels in the tree, but also within lower- or peer-level specimens. That is documented in [cda-service/issues/79](https://github.com/CancerDataAggregator/cda-service/issues/79).

Pessimism aside, let's try it! Given issue 79, let's restrict ourselves to the aliquots. Given the ground truth diagram we expect:

- one aliquot from the normal to have PNNL files only
- one aliquot from the tumor to have PNNL files only
- one aliquot from the tumor to have VU files only
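
The expectation above could also be checked mechanically rather than by eye. The sketch below assumes the originating site appears as an underscore-delimited token (`PNNL` or `VU`) in each file label. That holds for the PNNL labels seen in this notebook, but it is an assumption for VU files. The helper names `sites_in_labels` and `check_aliquots`, the second aliquot id and the VU label are hypothetical, for illustration only.

```python
import re


def sites_in_labels(labels, known_sites=("PNNL", "VU")):
    """Return the set of site tokens found in the given file labels."""
    found = set()
    for label in labels:
        # Split on '_' and '.' so that a site only matches as a whole token
        tokens = set(re.split(r"[_.]", label))
        found.update(site for site in known_sites if site in tokens)
    return found


def check_aliquots(specimens):
    """Map each aliquot id to the sorted list of sites seen in its file labels."""
    report = {}
    for s in specimens:
        if s.get("specimen_type") != "aliquot":
            continue
        labels = [f["label"] for f in s.get("File", [])]
        report[s["id"]] = sorted(sites_in_labels(labels))
    return report


# Small stand-in for subject2['Specimen']; the second aliquot is hypothetical
demo_specimens = [
    {
        "id": "208ebc64-6425-11e8-bcf1-0a2705229b82",
        "specimen_type": "aliquot",
        "File": [
            {"label": "13CPTAC_COprospective_W_PNNL_20170123_B4S1_f03.raw.cap.psm"},
            {"label": "13CPTAC_COprospective_P_PNNL_20170215_B4S1_f06.raw"},
        ],
    },
    {
        "id": "hypothetical-aliquot",
        "specimen_type": "aliquot",
        "File": [{"label": "HYPOTHETICAL_VU_example.raw"}],
    },
]

report = check_aliquots(demo_specimens)
for aliquot_id, sites in report.items():
    print(aliquot_id, sites)
```

An aliquot whose list contains more than one site, or none, would be worth a closer look. In the notebook, `check_aliquots(subject2['Specimen'])` would run the same check on the real data.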

In [6]:
subject2 = r[0]['ResearchSubject'][1]
for s2 in subject2['Specimen']:
    if s2['specimen_type'] == 'aliquot':
        for k, v in s2.items():
            if isinstance(v, list):
                if len(v) < 2:
                    print(k, v)
                elif k == 'File':
                    for f in v:
                        print(f['label'])
                else:
                    print('{} has {} items'.format(k, len(v)))
            else:
                print('{} : {}'.format(k, v))

        print('_'*80)

id : 208ebc64-6425-11e8-bcf1-0a2705229b82
identifier [{'system': 'PDC', 'value': '208ebc64-6425-11e8-bcf1-0a2705229b82'}]
associated_project : CPTAC-2
age_at_collection : None
primary_disease_type : Colon Adenocarcinoma
anatomical_site : Not Reported
source_material_type : Solid Tissue Normal
specimen_type : aliquot
derived_from_specimen : f4af3e4d-641b-11e8-bcf1-0a2705229b82
derived_from_subject : 09CO022
13CPTAC_COprospective_W_PNNL_20170123_B4S1_f03.raw.cap.psm
13CPTAC_COprospective_W_PNNL_20170123_B4S1_f02.mzML.gz
13CPTAC_COprospective_P_PNNL_20170215_B4S1_f06.mzid.gz
13CPTAC_COprospective_W_PNNL_20170123_B4S1_f02.mzid.gz
13CPTAC_COprospective_W_PNNL_20170123_B4S1_f04.mzid.gz
13CPTAC_COprospective_P_PNNL_20170215_B4S1_f06.raw
13CPTAC_COprospective_W_PNNL_20170123_B4S1_f06.mzML.gz
13CPTAC_COprospective_W_PNNL_20170123_B4S1_f06.raw.cap.psm
13CPTAC_COprospective_W_PNNL_20170123_B4S1_f07.mzid.gz
13CPTAC_COprospective_W_PNNL_20170123_B4S1_f12.raw
13CPTAC_COprospective_P_PNNL_20170215_B4

Bingo! That appears to be the answer expected! It's certainly a distinct set of files for each of the three specimens. The addition of the aliquots addresses [cda-python/issues/97](https://github.com/CancerDataAggregator/cda-python/issues/97).
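
The claim that each specimen has a distinct set of files can be tested directly by checking that the file sets are pairwise disjoint. The following is a minimal sketch; `overlapping_files` is a hypothetical helper, and the ids and labels in the demo dictionaries are placeholders rather than real CDA data.

```python
from itertools import combinations


def overlapping_files(files_by_specimen):
    """Return {(id_a, id_b): shared_labels} for every pair of specimens sharing files."""
    overlaps = {}
    for (a, files_a), (b, files_b) in combinations(files_by_specimen.items(), 2):
        shared = set(files_a) & set(files_b)
        if shared:
            overlaps[(a, b)] = shared
    return overlaps


# Placeholder data: three aliquots with no files in common
distinct = {
    "aliquot-1": ["a1.raw", "a2.raw"],
    "aliquot-2": ["b1.raw"],
    "aliquot-3": ["c1.raw", "c2.raw"],
}
print(overlapping_files(distinct))

# Placeholder data: one file listed under two aliquots
shared = {
    "aliquot-1": ["a1.raw", "x.raw"],
    "aliquot-2": ["x.raw"],
}
print(overlapping_files(shared))
```

In the notebook, the input could be built as `{s['id']: [f['label'] for f in s['File']] for s in subject2['Specimen'] if s['specimen_type'] == 'aliquot'}`; an empty result would confirm the three aliquots share no files.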

What happens with files at the higher level? Looking at Sample.

In [7]:
for s2 in subject2['Specimen']:
    if s2['specimen_type'] == 'sample':
        for k, v in s2.items():
            if isinstance(v, list):
                if len(v) < 2:
                    print(k, v)
                elif k == 'File':
                    for f in v:
                        print(f['label'])
                else:
                    print('{} has {} items'.format(k, len(v)))
            else:
                print('{} : {}'.format(k, v))

        print('_'*80)

id : f4af3e4d-641b-11e8-bcf1-0a2705229b82
identifier [{'system': 'PDC', 'value': 'f4af3e4d-641b-11e8-bcf1-0a2705229b82'}]
associated_project : CPTAC-2
age_at_collection : None
primary_disease_type : Colon Adenocarcinoma
anatomical_site : Not Reported
source_material_type : Solid Tissue Normal
specimen_type : sample
derived_from_specimen : initial specimen
derived_from_subject : 09CO022
13CPTAC_COprospective_W_PNNL_20170123_B4S1_f03.raw.cap.psm
13CPTAC_COprospective_W_PNNL_20170123_B4S1_f02.mzML.gz
13CPTAC_COprospective_P_PNNL_20170215_B4S1_f06.mzid.gz
13CPTAC_COprospective_W_PNNL_20170123_B4S1_f02.mzid.gz
13CPTAC_COprospective_W_PNNL_20170123_B4S1_f04.mzid.gz
13CPTAC_COprospective_P_PNNL_20170215_B4S1_f06.raw
13CPTAC_COprospective_W_PNNL_20170123_B4S1_f06.mzML.gz
13CPTAC_COprospective_W_PNNL_20170123_B4S1_f06.raw.cap.psm
13CPTAC_COprospective_W_PNNL_20170123_B4S1_f07.mzid.gz
13CPTAC_COprospective_W_PNNL_20170123_B4S1_f12.raw
13CPTAC_COprospective_P_PNNL_20170215_B4S1_f05.raw.cap.psm
13

That there are two specimens at this level is as expected: the tumor and the normal.

However, the file lists for each specimen are dangerously inaccurate. Under the normal specimen are listed files that came from the tumor. The reverse is also true: under the tumor specimen are listed files that are derived from the normal.

A further independent check can be made by searching in the PDC. That indicates no work was done at VU on the normal sample.

This has been submitted as [cda-python/issues/99](https://github.com/CancerDataAggregator/cda-python/issues/99).
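
The misattribution described above could be detected programmatically, which may help when re-testing the issue. The sketch below takes the aliquot-level file lists as the reference (an assumption, based on those lists looking correct earlier in this notebook) and flags any file listed on a sample that the aliquots attribute to a different sample. It also assumes `derived_from_specimen` holds the parent's `id`, as in the output shown. `misattributed_files` is a hypothetical helper and the demo records are placeholders.

```python
from collections import defaultdict


def misattributed_files(specimens):
    """Map each sample id to the file labels that aliquots attribute to another sample."""
    by_id = {s["id"]: s for s in specimens}

    def root_of(specimen):
        # Follow derived_from_specimen until the parent is no longer a known specimen
        while specimen["derived_from_specimen"] in by_id:
            specimen = by_id[specimen["derived_from_specimen"]]
        return specimen["id"]

    # label -> set of root sample ids, according to the aliquot-level file lists
    owners = defaultdict(set)
    for s in specimens:
        if s["specimen_type"] == "aliquot":
            for f in s.get("File", []):
                owners[f["label"]].add(root_of(s))

    problems = {}
    for s in specimens:
        if s["specimen_type"] != "sample":
            continue
        wrong = sorted(
            f["label"]
            for f in s.get("File", [])
            if f["label"] in owners and s["id"] not in owners[f["label"]]
        )
        if wrong:
            problems[s["id"]] = wrong
    return problems


# Placeholder records mimicking the shape of subject2['Specimen']
demo = [
    {"id": "N", "specimen_type": "sample", "derived_from_specimen": "initial specimen",
     "File": [{"label": "n.raw"}, {"label": "t.raw"}]},
    {"id": "n1", "specimen_type": "aliquot", "derived_from_specimen": "N",
     "File": [{"label": "n.raw"}]},
    {"id": "T", "specimen_type": "sample", "derived_from_specimen": "initial specimen",
     "File": [{"label": "n.raw"}, {"label": "t.raw"}]},
    {"id": "t1", "specimen_type": "aliquot", "derived_from_specimen": "T",
     "File": [{"label": "t.raw"}]},
]

problems = misattributed_files(demo)
print(problems)
```

Running `misattributed_files(subject2['Specimen'])` in the notebook would be expected to return an empty dictionary once the issue is resolved.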