## CPTAC TCGA Colon samples

This notebook explores whether the genomic and proteomic data from a single CPTAC2 subject (person) are aggregated by the Cancer Data Aggregator (CDA). The following diagram shows, independent of CDA, a hierarchy of specimens and sub-specimens that were derived from that subject. In a sense it is a 'ground truth', and it is useful to see whether CDA represents it accurately. The subject was part of the TCGA Colon Cancer project and has the identifier 09CO022<sup>1</sup>.

We'll leave aside for now how to do a query using the '09CO022' identifier. From the GDC portal page shown above we identify that the ResearchSubject.id that the CDA will use is c5421e34-e5c7-4ba5-aed9-146a5575fd8d.

On to a query using that value.

<sup>1</sup>Other identifiers were created for the subject in different systems but this identifier is used so we can refer to the subject independently of any of those specific systems. It is also a more convenient id to use than a UUID when writing or speaking.

In [1]:
from cdapython import Q, columns, unique_terms
import json

The following query is what one would expect to return the subject shown above. We'll save the json result to a file.

In [18]:
#q1 = Q('ResearchSubject.Specimen.id = "670a67f3-641f-11e8-bcf1-0a2705229b82"')
q1 = Q('ResearchSubject.id = "c5421e34-e5c7-4ba5-aed9-146a5575fd8d"')
r = q1.run(limit=2) 
print(r)
#print(r.pretty_print)
#r.next_page
#r.count
#j = json.dumps(r[0], indent=3)
#with open('query_results/09CO022.json', 'w') as f:
#    f.write(j)

TypeError: object of type 'NoneType' has no len()

Worth noting that this gave one result, as would be expected for a query specifying a single subject id.

Did this return the expected specimens from the diagram? First, a count of those specimens.

In [4]:
specimensFromRoot = r[0]['Specimen']
len(specimensFromRoot)

11

That's more than we might expect. However, the following exploration of the specimens shows that the 11 specimens form exactly the same hierarchy as in the GDC screenshot above: two parent samples, leading to three aliquots on which genomic sequencing was performed.

In [5]:
import pandas as pd
rowsFromRoot = []
for s in specimensFromRoot:
    rowsFromRoot.append([s['identifier'][0]['value'], s['derived_from_specimen'], s['specimen_type'], s['source_material_type']])
df = pd.DataFrame(rowsFromRoot)
df.columns = ['identifier','derived_from_specimen','specimen_type','source_material']
df

| | identifier | derived_from_specimen | specimen_type | source_material |
|---|---|---|---|---|
| 0 | 4591a53d-5668-4a70-b44b-e08a3d59267e | Initial sample | sample | Primary Tumor |
| 1 | c53c4d60-2ddb-5da8-932e-00a86fa2347f | 4591a53d-5668-4a70-b44b-e08a3d59267e | portion | Primary Tumor |
| 2 | 31075cfa-7aef-59f1-bf54-a1cddb5ee0fd | c53c4d60-2ddb-5da8-932e-00a86fa2347f | analyte | Primary Tumor |
| 3 | 0d8adcbf-13f0-48c3-83df-3fa205b79ae8 | 31075cfa-7aef-59f1-bf54-a1cddb5ee0fd | aliquot | Primary Tumor |
| 4 | d085ebd9-7605-54a0-abb9-10867f5fa1b1 | 4591a53d-5668-4a70-b44b-e08a3d59267e | portion | Primary Tumor |
| 5 | a31724b6-e550-552b-bd61-41341c534e28 | d085ebd9-7605-54a0-abb9-10867f5fa1b1 | analyte | Primary Tumor |
| 6 | 9250d96e-1cdc-4d68-8a56-f7b186a6fab5 | a31724b6-e550-552b-bd61-41341c534e28 | aliquot | Primary Tumor |
| 7 | b12c257d-7409-4858-9384-c430929a075a | Initial sample | sample | Blood Derived Normal |
| 8 | 702d7ba0-9558-5b2d-af4d-cd797485b8c1 | b12c257d-7409-4858-9384-c430929a075a | portion | Blood Derived Normal |
| 9 | f0003f0a-07ea-548e-b1f7-7e6d1b27d47a | 702d7ba0-9558-5b2d-af4d-cd797485b8c1 | analyte | Blood Derived Normal |


What else is returned by CDA in the single result? 

### Issue 1
The designation of the parent specimens as derived from 'Initial sample' breaks the tree. Both are derived from the Subject (actually from a particular event in the Subject's history). Derived_from_specimen is misleading: it could be read that rows 0 and 7 above derive from the same sample. Derived_from_specimen also has mixed meaning (semantics); in some cases it is a link to the parent object, in others it is a human-readable description. A preferable solution would be derived_from_biomaterial, where the biomaterial is a specimen or subject as appropriate. This is a common pattern, present in MAGE, FuGE, the Human Cell Atlas, and other standards and systems.
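The mixed semantics can be made concrete with a minimal sketch that rebuilds the tree from the identifier and derived_from_specimen columns of the table above. The names `ROOT_LABEL` and `build_tree` are hypothetical helpers, not part of cdapython, and treating the literal string 'Initial sample' as a root marker is an assumption about the data rather than documented behaviour.

```python
from collections import defaultdict

# (identifier, derived_from_specimen) pairs copied from the table above.
rows = [
    ("4591a53d-5668-4a70-b44b-e08a3d59267e", "Initial sample"),
    ("c53c4d60-2ddb-5da8-932e-00a86fa2347f", "4591a53d-5668-4a70-b44b-e08a3d59267e"),
    ("31075cfa-7aef-59f1-bf54-a1cddb5ee0fd", "c53c4d60-2ddb-5da8-932e-00a86fa2347f"),
    ("0d8adcbf-13f0-48c3-83df-3fa205b79ae8", "31075cfa-7aef-59f1-bf54-a1cddb5ee0fd"),
    ("d085ebd9-7605-54a0-abb9-10867f5fa1b1", "4591a53d-5668-4a70-b44b-e08a3d59267e"),
    ("a31724b6-e550-552b-bd61-41341c534e28", "d085ebd9-7605-54a0-abb9-10867f5fa1b1"),
    ("9250d96e-1cdc-4d68-8a56-f7b186a6fab5", "a31724b6-e550-552b-bd61-41341c534e28"),
    ("b12c257d-7409-4858-9384-c430929a075a", "Initial sample"),
    ("702d7ba0-9558-5b2d-af4d-cd797485b8c1", "b12c257d-7409-4858-9384-c430929a075a"),
    ("f0003f0a-07ea-548e-b1f7-7e6d1b27d47a", "702d7ba0-9558-5b2d-af4d-cd797485b8c1"),
]

# Assumption: this literal marks a specimen taken directly from the subject.
ROOT_LABEL = "Initial sample"


def build_tree(rows):
    """Return (roots, children); children maps a parent id to its child ids."""
    children = defaultdict(list)
    roots = []
    for identifier, parent in rows:
        if parent == ROOT_LABEL:
            # The value is a description, not a link, so special-case it.
            roots.append(identifier)
        else:
            children[parent].append(identifier)
    return roots, dict(children)


roots, children = build_tree(rows)
print("roots:", roots)
for parent, kids in children.items():
    print(parent, "->", kids)
```

The special case in `build_tree` is exactly the cost of the mixed semantics: any consumer has to know the magic string before it can walk the tree.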

### Note 
The fidelity with which the tree is represented will become more important as the imaging data is added. If the 'ground truth' example is as shown in the diagram at top, there is a particular relationship between the image file and one of the genomic samples which does not exist for the other genomic sample. Let's say that an examination of the image let us determine '% neoplastic cellularity'; that might help us interpret the data from the closely related sample. We would need to be careful in assuming that any data derived from that image had any relevance to the second genomic sample.

In [6]:
for k in r[0].keys():
    print(k)

days_to_birth
race
sex
ethnicity
id
ResearchSubject
Diagnosis
Specimen
associated_project
primary_disease_type
identifier
primary_disease_site


So far we just looked at Specimen. 

Diagnosis is as follows

In [7]:
r[0]['Diagnosis']

[{'morphology': '8140/3',
  'tumor_stage': 'Stage IIB',
  'tumor_grade': 'Not Reported',
  'Treatment': [],
  'id': '7b8d36ba-ab84-48ad-ac2c-11ac40d3d0eb',
  'primary_diagnosis': 'Adenocarcinoma, NOS',
  'age_at_diagnosis': None}]

That's useful information. Morphology is cryptic in that nothing in the response says what the value shown means. Some, but by no means all, users would be informed enough to interpret it.

Perhaps there's a schema somewhere that tells us the 'value set' used for morphology, but there's nothing here to tell us what that schema is.

Some would say that human readability is not the point; this is an API off which UIs will be built. Fair enough, but there's not enough for a machine to work off either.

### Issue 2
* Code system for morphology is not clear.
* Link between 'morphology' and 'primary_diagnosis' is not clear.
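
One way to remove the ambiguity would be to carry the code system alongside the value. The following is a sketch only. It assumes that morphology holds an ICD-O-3 morphology code (four-digit histology, a slash, and a one-digit behaviour code) and that primary_diagnosis is the display term for that code; the response itself states neither. `CodedValue` and `parse_morphology` are hypothetical names, not part of CDA or the CCDH model.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CodedValue:
    """A code together with the system that defines it (hypothetical type)."""
    system: str
    code: str
    display: Optional[str] = None


# Assumed pattern: four-digit histology, '/', one-digit behaviour code.
_MORPHOLOGY = re.compile(r"^\d{4}/\d$")


def parse_morphology(value, display=None):
    """Wrap a bare morphology string in a CodedValue, or raise ValueError."""
    if _MORPHOLOGY.match(value) is None:
        raise ValueError("not a recognised morphology code: {!r}".format(value))
    return CodedValue(system="ICD-O-3", code=value, display=display)


# Values taken from the Diagnosis output above.
diagnosis = {"morphology": "8140/3", "primary_diagnosis": "Adenocarcinoma, NOS"}
coded = parse_morphology(diagnosis["morphology"], diagnosis["primary_diagnosis"])
print(coded)
```

With the system stated explicitly, both a person and a program can tell which value set to look the code up in.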

From the above we note that there are ResearchSubjects within ResearchSubjects. We now look at the specimens of the first nested ResearchSubject.

In [8]:
specimensFromRS = r[0]['ResearchSubject'][0]['Specimen']

In [9]:
len(specimensFromRS)

11

That's the same count as before. Looking at the details.

In [10]:
rowsFromRS = []
for s in specimensFromRS:
    rowsFromRS.append([s['identifier'][0]['value'], s['derived_from_specimen'], s['specimen_type'], s['source_material_type']])
df = pd.DataFrame(rowsFromRS)
df

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 4591a53d-5668-4a70-b44b-e08a3d59267e | Initial sample | sample | Primary Tumor |
| 1 | c53c4d60-2ddb-5da8-932e-00a86fa2347f | 4591a53d-5668-4a70-b44b-e08a3d59267e | portion | Primary Tumor |
| 2 | 31075cfa-7aef-59f1-bf54-a1cddb5ee0fd | c53c4d60-2ddb-5da8-932e-00a86fa2347f | analyte | Primary Tumor |
| 3 | 0d8adcbf-13f0-48c3-83df-3fa205b79ae8 | 31075cfa-7aef-59f1-bf54-a1cddb5ee0fd | aliquot | Primary Tumor |
| 4 | d085ebd9-7605-54a0-abb9-10867f5fa1b1 | 4591a53d-5668-4a70-b44b-e08a3d59267e | portion | Primary Tumor |
| 5 | a31724b6-e550-552b-bd61-41341c534e28 | d085ebd9-7605-54a0-abb9-10867f5fa1b1 | analyte | Primary Tumor |
| 6 | 9250d96e-1cdc-4d68-8a56-f7b186a6fab5 | a31724b6-e550-552b-bd61-41341c534e28 | aliquot | Primary Tumor |
| 7 | b12c257d-7409-4858-9384-c430929a075a | Initial sample | sample | Blood Derived Normal |
| 8 | 702d7ba0-9558-5b2d-af4d-cd797485b8c1 | b12c257d-7409-4858-9384-c430929a075a | portion | Blood Derived Normal |
| 9 | f0003f0a-07ea-548e-b1f7-7e6d1b27d47a | 702d7ba0-9558-5b2d-af4d-cd797485b8c1 | analyte | Blood Derived Normal |


Compare what we got for both "Research Subjects". They are the same.

In [11]:
rowsFromRS == rowsFromRoot

True

Compare the specimens. They are also the same.

In [12]:
specimensFromRS == specimensFromRoot

True

In fact we can compare all of the content of the nested ResearchSubject with the content of the top level as follows

In [13]:
rootKeys = r[0].keys()
print(rootKeys)
print('_'*80)
researchSubjectKeys = r[0]['ResearchSubject'][0].keys()
print(researchSubjectKeys)
# get the keys that are in both the top level and within ResearchSubject
print('_'*80)
commonKeys = list(set(researchSubjectKeys) & set(rootKeys))
for k in commonKeys:
    print(k, r[0][k] == r[0]['ResearchSubject'][0][k])


dict_keys(['days_to_birth', 'race', 'sex', 'ethnicity', 'id', 'ResearchSubject', 'Diagnosis', 'Specimen', 'associated_project', 'primary_disease_type', 'identifier', 'primary_disease_site'])
________________________________________________________________________________
dict_keys(['Diagnosis', 'Specimen', 'associated_project', 'id', 'primary_disease_type', 'identifier', 'primary_disease_site'])
________________________________________________________________________________
Diagnosis True
primary_disease_type True
identifier True
Specimen True
id True
associated_project True
primary_disease_site True


This shows that the entire content of the nested ResearchSubject repeats content that is also available at the top level.

### Issue 3
Results contain repetition which is unnecessary and complicates understanding the response. In terms of the number of lines, 20.2% of the file is repeated, but that is a poor measure of the additional complication caused.

This was further explored in the SQL Exploration notebook in this repository and an underlying cause identified. This issue was created.
https://github.com/CancerDataAggregator/cda-python/issues/16
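
The key-by-key comparison above can be generalised to every nested ResearchSubject. The sketch below does that on a small stand-in dictionary; the ids and attribute values are copied from the outputs in this notebook, but the dictionary is deliberately cut down and is not the full response. `repeated_keys` is a hypothetical helper, not part of cdapython.

```python
def repeated_keys(result, nested_key="ResearchSubject"):
    """For each nested item, list the keys whose values duplicate the top level."""
    report = []
    for nested in result.get(nested_key, []):
        shared = sorted(
            k for k in nested if k in result and nested[k] == result[k]
        )
        report.append(shared)
    return report


# Cut-down stand-in for r[0]; values copied from the outputs in this notebook.
example = {
    "id": "c5421e34-e5c7-4ba5-aed9-146a5575fd8d",
    "associated_project": "CPTAC-2",
    "primary_disease_site": "Colon",
    "ResearchSubject": [
        {
            "id": "c5421e34-e5c7-4ba5-aed9-146a5575fd8d",
            "associated_project": "CPTAC-2",
            "primary_disease_site": "Colon",
        },
        {
            "id": "459e3b69-63d6-11e8-bcf1-0a2705229b82",
            "associated_project": "CPTAC-2",
            "primary_disease_site": "Colon",
        },
    ],
}

for index, shared in enumerate(repeated_keys(example)):
    print("nested subject {}: repeats {}".format(index, shared))
```

A nested item that repeats every one of its keys, including id, is a pure copy of the top level.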

## Second Research Subject
Looking at the file, there is a second ResearchSubject present. What is it? It has two specimens.

In [14]:
subject2 = r[0]['ResearchSubject'][1]
len(subject2['Specimen'])

2

In [15]:
for subj in r[0]['ResearchSubject']:
    for k, v in subj.items():
        if isinstance(v, list):
            print('{} has {} items'.format(k, len(v)))
        else:
            print('{} : {}'.format(k, v))

    print('_'*80)

Diagnosis has 1 items
Specimen has 11 items
associated_project : CPTAC-2
id : c5421e34-e5c7-4ba5-aed9-146a5575fd8d
primary_disease_type : Adenomas and Adenocarcinomas
identifier has 1 items
primary_disease_site : Colon
________________________________________________________________________________
Diagnosis has 1 items
Specimen has 2 items
associated_project : CPTAC-2
id : 459e3b69-63d6-11e8-bcf1-0a2705229b82
primary_disease_type : Colon Adenocarcinoma
identifier has 1 items
primary_disease_site : Colon
________________________________________________________________________________


Looking back at the original query, it asked for c5421e34-e5c7-4ba5-aed9-146a5575fd8d. That is the first of the two subjects returned. It's not clear why the second is returned.

In [16]:
subject1 = r[0]['ResearchSubject'][0]
subject1['Diagnosis']

[{'morphology': '8140/3',
  'tumor_stage': 'Stage IIB',
  'tumor_grade': 'Not Reported',
  'Treatment': [],
  'id': '7b8d36ba-ab84-48ad-ac2c-11ac40d3d0eb',
  'primary_diagnosis': 'Adenocarcinoma, NOS',
  'age_at_diagnosis': None}]

In [17]:
subject2['Diagnosis']

[{'morphology': '8140/3',
  'tumor_stage': 'Stage IIB',
  'tumor_grade': 'Not Reported',
  'Treatment': [],
  'id': 'ff301535-70ca-11e8-bcf1-0a2705229b82',
  'primary_diagnosis': 'Adenocarcinoma, NOS',
  'age_at_diagnosis': None}]

Apart from the id, the diagnosis for these two subjects is identical. That's insufficient to say they are the same subject, but this seems worth pursuing.
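
The comparison can be checked mechanically rather than by eye. This minimal sketch compares the two Diagnosis records shown above while ignoring the id field; `equal_ignoring` is a hypothetical helper introduced here for illustration.

```python
def equal_ignoring(a, b, ignore=("id",)):
    """Compare two dicts after dropping the keys named in `ignore`."""
    def strip(d):
        return {k: v for k, v in d.items() if k not in ignore}
    return strip(a) == strip(b)


# The two Diagnosis records, copied from the outputs above.
diagnosis1 = {
    "morphology": "8140/3",
    "tumor_stage": "Stage IIB",
    "tumor_grade": "Not Reported",
    "Treatment": [],
    "id": "7b8d36ba-ab84-48ad-ac2c-11ac40d3d0eb",
    "primary_diagnosis": "Adenocarcinoma, NOS",
    "age_at_diagnosis": None,
}
diagnosis2 = dict(diagnosis1, id="ff301535-70ca-11e8-bcf1-0a2705229b82")

print(diagnosis1 == diagnosis2)                # the ids differ
print(equal_ignoring(diagnosis1, diagnosis2))  # everything else matches
```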

Let's look at the Specimens for this subject.

In [18]:
for s2 in subject2['Specimen']:
    for k, v in s2.items():
        if isinstance(v, list):
            if len(v) < 2:
                print(k, v)
            elif k == 'File':
                # print the label of every file attached to the specimen
                for f in v:
                    print(f['label'])
            else:
                print('{} has {} items'.format(k, len(v)))
        else:
            print('{} : {}'.format(k, v))

    print('_'*80)

13CPTAC_COprospective_W_PNNL_20170123_B4S1_f03.raw.cap.psm
13CPTAC_COprospective_W_PNNL_20170123_B4S1_f02.mzML.gz
13CPTAC_COprospective_W_PNNL_20170123_B4S1_f02.mzid.gz
13CPTAC_COprospective_P_PNNL_20170215_B4S1_f06.mzid.gz
13CPTAC_COprospective_W_PNNL_20170123_B4S1_f04.mzid.gz
13CPTAC_COprospective_P_PNNL_20170215_B4S1_f06.raw
13CPTAC_COprospective_W_PNNL_20170123_B4S1_f06.mzML.gz
13CPTAC_COprospective_W_PNNL_20170123_B4S1_f06.raw.cap.psm
13CPTAC_COprospective_W_PNNL_20170123_B4S1_f07.mzid.gz
13CPTAC_COprospective_P_PNNL_20170215_B4S1_f05.raw.cap.psm
13CPTAC_COprospective_W_PNNL_20170123_B4S1_f12.raw
13CPTAC_COprospective_W_PNNL_20170123_B4S1_f02.raw.cap.psm
13CPTAC_COprospective_W_PNNL_20170123_B4S1_f10.raw.cap.psm
13CPTAC_COprospective_W_PNNL_20170123_B4S1_f09.mzML.gz
13CPTAC_COprospective_W_PNNL_20170123_B4S1_f08.mzML.gz
13CPTAC_COprospective_W_PNNL_20170123_B4S1_f09.raw
13CPTAC_COprospective_W_PNNL_20170123_B4S1_f01.raw
13CPTAC_COprospective_P_PNNL_20170215_B4S1_f02.mzid.gz
13CPTA

Detective work lets us determine that these two specimens are the three :-) specimens in the 'ground truth'. We have to look at the file names to give us this clue, notably PNNL and VU in the file names for the tumor sample and PNNL only for the Normal sample. The ground truth tells us that the tumor went to Pacific Northwest National Laboratory (PNNL) and Vanderbilt University (VU), and the Normal blood to PNNL.
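
The detective work on file names can be sketched as a small tally. The labels below are a handful copied from the output above. Reading the fourth underscore-separated token as the site is an assumption based on the visible naming pattern, not a documented convention, and `site_token` is a hypothetical helper. Labels containing VU would be counted in the same way if they follow the same pattern.

```python
from collections import Counter

# A few file labels copied from the output above.
labels = [
    "13CPTAC_COprospective_W_PNNL_20170123_B4S1_f03.raw.cap.psm",
    "13CPTAC_COprospective_W_PNNL_20170123_B4S1_f02.mzML.gz",
    "13CPTAC_COprospective_P_PNNL_20170215_B4S1_f06.mzid.gz",
    "13CPTAC_COprospective_P_PNNL_20170215_B4S1_f06.raw",
]


def site_token(label):
    """Return the fourth underscore-separated token (assumed to be the site)."""
    parts = label.split("_")
    return parts[3] if len(parts) > 3 else None


site_counts = Counter(site_token(label) for label in labels)
print(site_counts)
```

That the site has to be recovered from a file name at all is the point of the next issue.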

### Issue 4
Given that these specimens went to locations more than 2000 miles apart they must have been physically separate materials. The ground truth model shows that - with the two specimens having visibly different ids. That has been lost in the PDC/CDA representation. The second specimen represents the parent not the physically distinct samples at each site.

### Issue 5
There should not be two subjects in the results returned from this query. This is not the same use case as 'connect the EHRs from the same patient when they visit'. These are all data from the same study, and there was only one encounter of the subject with the study. The problem here is breaking the chain/tree which shows how the samples are derived from one another. 

This is a known issue, referred to in the [CDA Release 1 Testing Guide](https://docs.google.com/document/d/1jzvSJu3xWv-UtoPWpZTLbxPq_wqI1vRyfNlJf1V22cU/edit). It is explored here in the context of the other issues raised so that the CDA extensions for release 1 can be evaluated, and in order to identify how the CCDH model can best represent the data.

The CDA extension has allowed association between subjects to be made. The CCDH model should allow association at the specimen level as opposed to subject. See next issue.

### Issue 6
This also misses that the proteomic and genomic data derive from the same specimen. The information as presented shows that the genomic and proteomic specimens derive from the same subject, but this is insufficient. The PDC sample should indicate as its parent the same parent as the genomic sample.

## The same subject as part of a query on subject attributes
Querying via the id of a specific subject as above was useful in looking at how the CDA behaves when looking at a single subject. It isolates the query from other situations.

Looking at the attributes of subject 09CO022 we can run the following query which should return 09CO022 alongside subjects with the same attributes.

In [19]:
qc1 = Q('ResearchSubject.associated_project = "CPTAC-2"')
qc2 = Q('ResearchSubject.Diagnosis.tumor_stage = "Stage IIB"')
qc3 = Q('ResearchSubject.primary_disease_site = "Colon"')

q2 = qc1.And(qc2).And(qc3)
r2 = q2.run(limit=100) 
r2.sql
print(r2)


Query: SELECT * FROM gdc-bq-sample.cda_mvp.v3, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.Diagnosis) AS _Diagnosis WHERE (((_ResearchSubject.associated_project = 'CPTAC-2') AND (_Diagnosis.tumor_stage = 'Stage IIB')) AND (_ResearchSubject.primary_disease_site = 'Colon'))
Offset: 0
Limit: 100
Count: 8
More pages: No



This shows that we appear to have eight matches. The following explores what they are.

In [20]:
# write each result to its own JSON file and summarise its nested subjects
for rn, resItem in enumerate(r2, start=1):
    j = json.dumps(resItem, indent=3)
    print('Subject {}'.format(resItem['identifier']))
    for rs in resItem['ResearchSubject']:
        subj_name = rs['Specimen'][0]['derived_from_subject']
        print('nested subject: {}'.format(subj_name))
        ident = rs['identifier'][0]  # renamed to avoid shadowing the built-in id()
        print('\t system:{} id:{}'.format(ident['system'], ident['value']))
    fpath = 'query_results/subj_' + str(rn) + '.json'
    print(fpath)
    with open(fpath, 'w') as f:
        f.write(j)
    print('_'*80)

Subject [{'system': 'GDC', 'value': '44ecd34b-aa2b-4ce1-ab23-c64aee162f69'}]
nested subject: 15CO002
	 system:GDC id:44ecd34b-aa2b-4ce1-ab23-c64aee162f69
nested subject: None
	 system:PDC id:d2b0df58-63d6-11e8-bcf1-0a2705229b82
query_results/subj_1.json
________________________________________________________________________________
Subject [{'system': 'PDC', 'value': 'd2b0df58-63d6-11e8-bcf1-0a2705229b82'}]
nested subject: 15CO002
	 system:GDC id:44ecd34b-aa2b-4ce1-ab23-c64aee162f69
nested subject: None
	 system:PDC id:d2b0df58-63d6-11e8-bcf1-0a2705229b82
query_results/subj_2.json
________________________________________________________________________________
Subject [{'system': 'GDC', 'value': 'c5421e34-e5c7-4ba5-aed9-146a5575fd8d'}]
nested subject: 09CO022
	 system:GDC id:c5421e34-e5c7-4ba5-aed9-146a5575fd8d
nested subject: None
	 system:PDC id:459e3b69-63d6-11e8-bcf1-0a2705229b82
query_results/subj_3.json
____________________________________________________________________________

In fact there are only four true subjects here. For each of those four, there are two subjects in the results: one with a GDC id and one with a PDC id. In each case, there are two subjects nested within the parent. One is a repetition of the parent, and the second is the corresponding subject from the other system. Overall within the search the data is repeated three times.

On reflection this is unsurprising; it all follows as a natural consequence of what we saw above in the isolated query. The basic issues are those already listed. Beyond that, this example illustrates that the duplication and complication become a bigger issue when a realistic query on subject attributes is performed.

### Issue 7
The same subject is returned three times when queried on subject attributes.
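
Until the duplication is resolved upstream, a client could collapse the results itself. The sketch below groups results by the set of nested (system, value) identifiers, so that the GDC-rooted and PDC-rooted copies of a subject fall into the same group. The sample data is cut down to the identifiers of the three results printed above. `subject_key` is a hypothetical helper, and the approach assumes that every copy of a subject nests the same pair of ResearchSubjects, as seen in this output.

```python
from collections import defaultdict


def nested(system, value):
    """Build a minimal nested ResearchSubject holding only its identifier."""
    return {"identifier": [{"system": system, "value": value}]}


GDC_1 = ("GDC", "44ecd34b-aa2b-4ce1-ab23-c64aee162f69")
PDC_1 = ("PDC", "d2b0df58-63d6-11e8-bcf1-0a2705229b82")
GDC_2 = ("GDC", "c5421e34-e5c7-4ba5-aed9-146a5575fd8d")
PDC_2 = ("PDC", "459e3b69-63d6-11e8-bcf1-0a2705229b82")

# Identifiers of the three results printed above, other fields omitted.
results = [
    {"identifier": nested(*GDC_1)["identifier"],
     "ResearchSubject": [nested(*GDC_1), nested(*PDC_1)]},
    {"identifier": nested(*PDC_1)["identifier"],
     "ResearchSubject": [nested(*GDC_1), nested(*PDC_1)]},
    {"identifier": nested(*GDC_2)["identifier"],
     "ResearchSubject": [nested(*GDC_2), nested(*PDC_2)]},
]


def subject_key(result):
    """Key a result by the set of identifiers of its nested ResearchSubjects."""
    return frozenset(
        (ident["system"], ident["value"])
        for rs in result["ResearchSubject"]
        for ident in rs["identifier"]
    )


groups = defaultdict(list)
for result in results:
    # Record which system each copy of the subject was rooted in.
    groups[subject_key(result)].append(result["identifier"][0]["system"])

for key, systems in groups.items():
    print(sorted(value for _, value in key), "returned via", systems)
```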