## Query by PDC filename
This notebook set out to discover whether proteomic files that appeared to be missing were in fact present but had become orphaned from their parent specimen.

In doing so another issue was encountered.

In [1]:
from cdapython import Q, columns, unique_terms

These were the names of the missing files (see the thread in [cda-python/issues/99](https://github.com/CancerDataAggregator/cda-python/issues/99)):

In [2]:
missing_files = ['16CPTAC_COprospective_W_PNNL_20170123_B4S4_f03.raw.cap.psm',
'16CPTAC_COprospective_W_PNNL_20170123_B4S4_f10.raw',
'13CPTAC_COprospective_W_PNNL_20170123_B4S1_f01.mzML.gz',
'13CPTAC_COprospective_W_PNNL_20170123_B4S1_f02.raw']


In [3]:
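def find_file(file_label):
    """Query CDA for subjects with a specimen linked to the file `file_label`."""
    print("Searching for file {}".format(file_label))
    # Match on the file label nested under ResearchSubject -> Specimen -> File.
    q1 = Q('ResearchSubject.Specimen.File.label = "{}"'.format(file_label))
    r = q1.run(limit=1000)
    print("Found that file in {} 'Subjects'".format(r.count))
    print('_'*80)
    return r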
    
for f in missing_files:
    find_file(f)

Searching for file 16CPTAC_COprospective_W_PNNL_20170123_B4S4_f03.raw.cap.psm
Getting results from database

Total execution time: 1238 ms
Found that file in 0 'Subjects'
________________________________________________________________________________
Searching for file 16CPTAC_COprospective_W_PNNL_20170123_B4S4_f10.raw
Getting results from database

Total execution time: 824 ms
Found that file in 0 'Subjects'
________________________________________________________________________________
Searching for file 13CPTAC_COprospective_W_PNNL_20170123_B4S1_f01.mzML.gz
Getting results from database

Total execution time: 841 ms
Found that file in 0 'Subjects'
________________________________________________________________________________
Searching for file 13CPTAC_COprospective_W_PNNL_20170123_B4S1_f02.raw
Getting results from database

Total execution time: 881 ms
Found that file in 0 'Subjects'
________________________________________________________________________________


None of the files are found to be associated with any specimen.

Before moving on, it's worth checking our function works by asking it to look at a file we know exists.

In [4]:
cases = find_file('16CPTAC_COprospective_W_PNNL_20170123_B4S4_f04.raw.cap.psm')

Searching for file 16CPTAC_COprospective_W_PNNL_20170123_B4S4_f04.raw.cap.psm
Getting results from database

Total execution time: 9170 ms
Found that file in 30 'Subjects'
________________________________________________________________________________


Yes, the file exists and is linked to specimens, so our function works.

The count of subjects likely reflects the multiplexing in proteomics, but 30 seems large.

What have we got?

In [5]:
for c in cases:
    print(c['id'])

11CO039
11CO039
11CO039
11CO039
09CO022
09CO022
09CO022
05CO039
05CO039
05CO039
05CO039
11CO010
11CO010
11CO010
21CO006
21CO006
21CO006
11CO045
11CO045
11CO045
11CO005
11CO005
11CO005
11CO005
ColonRef
ColonRef
11CO072
11CO072
11CO072
11CO072


There are really only 9 distinct subjects there, each repeated between two and four times.
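
The repetition can be tallied directly from the IDs printed above. This is a minimal sketch that uses those printed IDs as a literal list rather than a live query result:

```python
from collections import Counter

# Subject IDs copied from the output above, in the order they were printed.
ids = (
    ["11CO039"] * 4 + ["09CO022"] * 3 + ["05CO039"] * 4 + ["11CO010"] * 3
    + ["21CO006"] * 3 + ["11CO045"] * 3 + ["11CO005"] * 4 + ["ColonRef"] * 2
    + ["11CO072"] * 4
)

per_subject = Counter(ids)  # subject id -> number of times it was returned
print(len(ids), "rows,", len(per_subject), "distinct subjects")
for subject_id, n in per_subject.items():
    print(subject_id, n)
```

Against a live result, the same tally should be obtainable with `Counter(c['id'] for c in cases)`, since the result is iterated in that way in the cell above.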

There's likely something going on with the SQL that CDA is generating.

What is the SQL that was generated?

In [7]:
cases.sql

"SELECT all_v2.* FROM gdc-bq-sample.integration.all_v2 AS all_v2, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.Specimen) AS _Specimen, UNNEST(_Specimen.File) AS _File WHERE (_File.label = '16CPTAC_COprospective_W_PNNL_20170123_B4S4_f04.raw.cap.psm')"

When that query is executed against the current CDA content, the duplication is most likely a product of the denormalization of File that we've already encountered. The chained `UNNEST`s produce one row per subject, specimen, and file combination, so after filtering on the file label there is one row per specimen that carries the file. Because more than one specimen in the same subject has the file we are looking for, the subject is returned multiple times.
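
The effect of the chained `UNNEST` calls can be imitated in plain Python. The sketch below uses a made-up subject record (the IDs and file names are hypothetical, not CDA data) and flattens it the way the comma joins do: one output row per matching research subject, specimen, and file combination.

```python
# Hypothetical nested record shaped like a row of all_v2; all values are invented.
subject = {
    "id": "SUBJ-1",
    "ResearchSubject": [
        {
            "Specimen": [
                {"id": "sample-A", "File": [{"label": "run1.raw"}]},
                {"id": "aliquot-A1", "File": [{"label": "run1.raw"}]},
                {"id": "sample-B", "File": [{"label": "other.raw"}]},
            ]
        }
    ],
}


def flatten_matches(row, file_label):
    """Mimic FROM t, UNNEST(...), UNNEST(...), UNNEST(...) WHERE _File.label = ..."""
    matches = []
    for research_subject in row["ResearchSubject"]:
        for specimen in research_subject["Specimen"]:
            for file in specimen["File"]:
                if file["label"] == file_label:
                    # SELECT all_v2.* would repeat the whole subject here;
                    # this sketch records only its id and the matching specimen.
                    matches.append((row["id"], specimen["id"]))
    return matches


rows = flatten_matches(subject, "run1.raw")
print(rows)
```

The single subject comes back once for each specimen that carries the file, which is the pattern seen in the output of `find_file`.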

We can test this out by running an edited version of the SQL that lists some of the specimen attributes.

In [14]:
testsql = '''
SELECT all_v2.id case_id,  _Specimen.specimen_type, _Specimen.id specimen_id,
_Specimen.derived_from_specimen, _Specimen.source_material_type
FROM gdc-bq-sample.integration.all_v2 AS all_v2, 
UNNEST(ResearchSubject) AS _ResearchSubject, 
UNNEST(_ResearchSubject.Specimen) AS _Specimen, 
UNNEST(_Specimen.File) AS _File 
WHERE (_File.label = '16CPTAC_COprospective_W_PNNL_20170123_B4S4_f04.raw.cap.psm')
'''

from cda_funcs import qResultsToDF
t1 = Q.sql(testsql)
qResultsToDF(t1)

| | case_id | specimen_type | specimen_id | derived_from_specimen | source_material_type |
|---|---|---|---|---|---|
| 0 | 11CO045 | sample | c7f28a6b-641c-11e8-bcf1-0a2705229b82 | initial specimen | Solid Tissue Normal |
| 1 | 11CO045 | aliquot | 10430efa-642a-11e8-bcf1-0a2705229b82 | c7f28a6b-641c-11e8-bcf1-0a2705229b82 | Solid Tissue Normal |
| 2 | 11CO045 | sample | c9b7d7b8-641c-11e8-bcf1-0a2705229b82 | initial specimen | Primary Tumor |
| 3 | 05CO039 | sample | 8d514dac-641b-11e8-bcf1-0a2705229b82 | initial specimen | Solid Tissue Normal |
| 4 | 05CO039 | aliquot | de1a20d8-6429-11e8-bcf1-0a2705229b82 | 8d514dac-641b-11e8-bcf1-0a2705229b82 | Solid Tissue Normal |
| 5 | 05CO039 | sample | 8f51c47c-641b-11e8-bcf1-0a2705229b82 | initial specimen | Primary Tumor |
| 6 | 05CO039 | sample | da08db4a-6420-11e8-bcf1-0a2705229b82 | initial specimen | Primary Tumor |
| 7 | 11CO010 | sample | 7dfa9110-641c-11e8-bcf1-0a2705229b82 | initial specimen | Primary Tumor |
| 8 | 11CO010 | aliquot | 9039ae28-6429-11e8-bcf1-0a2705229b82 | 7dfa9110-641c-11e8-bcf1-0a2705229b82 | Primary Tumor |
| 9 | 11CO010 | sample | 7f8b1e31-641c-11e8-bcf1-0a2705229b82 | initial specimen | Solid Tissue Normal |


This shows that the files are attached to many specimens besides the specific specimen (aliquot) from which they were derived.

It seems this can likely be explained by underlying causes encountered in two other issues:
- Cause: denormalization of File to the specimens from which the lowest level specimen was derived
  - appears in [cda-python/issues/97](https://github.com/CancerDataAggregator/cda-python/issues/97)
- Cause: copying across of File to siblings of parents (aunts and uncles?) 
  - appears in [cda-python/issues/99](https://github.com/CancerDataAggregator/cda-python/issues/99)
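
The rows returned by the test query can be sorted into these two causes by following `derived_from_specimen`. The sketch below does this for the three 11CO045 rows, copied from that output. The labels `direct`, `ancestor`, and `other` are names chosen here for illustration, and the sketch assumes the aliquot is the specimen the file was really derived from.

```python
# Rows for subject 11CO045, copied from the test query output:
# (specimen_type, specimen_id, derived_from_specimen, source_material_type)
rows = [
    ("sample", "c7f28a6b-641c-11e8-bcf1-0a2705229b82",
     "initial specimen", "Solid Tissue Normal"),
    ("aliquot", "10430efa-642a-11e8-bcf1-0a2705229b82",
     "c7f28a6b-641c-11e8-bcf1-0a2705229b82", "Solid Tissue Normal"),
    ("sample", "c9b7d7b8-641c-11e8-bcf1-0a2705229b82",
     "initial specimen", "Primary Tumor"),
]

# Specimens that some aliquot in the result set was derived from.
aliquot_parents = {parent for kind, _, parent, _ in rows if kind == "aliquot"}


def classify(kind, specimen_id):
    if kind == "aliquot":
        return "direct"    # assumed to be the specimen the file came from
    if specimen_id in aliquot_parents:
        return "ancestor"  # file copied up to the parent (the issue 97 pattern)
    return "other"         # neither; consistent with the issue 99 pattern


# Key by the first eight characters of the specimen id for readability.
labels = {specimen_id[:8]: classify(kind, specimen_id)
          for kind, specimen_id, _, _ in rows}
print(labels)
```

This is a heuristic reading of the rows, not a definitive diagnosis: it only shows that one non-aliquot row is the aliquot's parent and the other is not.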

There are a couple of reasons why the issue identified in this notebook warrants its own issue [cda-python/issues/106](https://github.com/CancerDataAggregator/cda-python/issues/106). 

- Even if the two underlying causes above are addressed there are cases where there will genuinely be two specimens within the same subject. Many proteomics study designs put tumor and normal from the same subject in the same plex (file).
- How the underlying issue shows up here is sufficiently different that fixes to those issues should be tested via this notebook.

## The genuine multiple sample-file test case
The following file from the Georgetown Lung Cancer Proteomics Study is a suitable test case for the scenario where a Subject genuinely has a file with a direct association to two of the Subject's specimens.

In [9]:
gt_cases = find_file('Ctrl_10-set_3-label_116-frac_1-F4.wiff')

Searching for file Ctrl_10-set_3-label_116-frac_1-F4.wiff
Getting results from database

Total execution time: 1953 ms
Found that file in 14 'Subjects'
________________________________________________________________________________


In [10]:
for c in gt_cases:
    print(c['id'])

ICBI-000009
ICBI-000009
ICBI-000009
ICBI-000009
ReferenceMix
ReferenceMix
ICBI-000008
ICBI-000008
ICBI-000008
ICBI-000008
ICBI-000007
ICBI-000007
ICBI-000007
ICBI-000007


Of course, there are really only four subjects, for the reasons described above. (Strictly, there are only three: ReferenceMix isn't a subject as such, but we'll leave that aside for now.)

Next we look at the specimen details to confirm the desirable differences (for test purposes) in this case.

In [11]:
testsql2 = '''
SELECT all_v2.id case_id,  _Specimen.specimen_type, _Specimen.id specimen_id,
_Specimen.derived_from_specimen, _Specimen.source_material_type
FROM gdc-bq-sample.integration.all_v2 AS all_v2, 
UNNEST(ResearchSubject) AS _ResearchSubject, 
UNNEST(_ResearchSubject.Specimen) AS _Specimen, 
UNNEST(_Specimen.File) AS _File 
WHERE (_File.label = 'Ctrl_10-set_3-label_116-frac_1-F4.wiff')
'''

t2 = Q.sql(testsql2)
qResultsToDF(t2)

| | case_id | specimen_type | specimen_id | derived_from_specimen | source_material_type |
|---|---|---|---|---|---|
| 0 | ICBI-000009 | sample | 9e8eade1-d732-11ea-b1fd-0aad30af8a83 | initial specimen | Primary Tumor |
| 1 | ICBI-000009 | aliquot | 9e8ec9ad-d732-11ea-b1fd-0aad30af8a83 | 9e8eade1-d732-11ea-b1fd-0aad30af8a83 | Primary Tumor |
| 2 | ICBI-000009 | sample | 9e8eafdd-d732-11ea-b1fd-0aad30af8a83 | initial specimen | Solid Tissue Normal |
| 3 | ICBI-000009 | aliquot | 9e8ecb2f-d732-11ea-b1fd-0aad30af8a83 | 9e8eafdd-d732-11ea-b1fd-0aad30af8a83 | Solid Tissue Normal |
| 4 | ICBI-000008 | sample | 9e8eaab4-d732-11ea-b1fd-0aad30af8a83 | initial specimen | Primary Tumor |
| 5 | ICBI-000008 | aliquot | 9e8ec6bb-d732-11ea-b1fd-0aad30af8a83 | 9e8eaab4-d732-11ea-b1fd-0aad30af8a83 | Primary Tumor |
| 6 | ICBI-000008 | sample | 9e8eacb2-d732-11ea-b1fd-0aad30af8a83 | initial specimen | Solid Tissue Normal |
| 7 | ICBI-000008 | aliquot | 9e8ec7b8-d732-11ea-b1fd-0aad30af8a83 | 9e8eacb2-d732-11ea-b1fd-0aad30af8a83 | Solid Tissue Normal |
| 8 | ReferenceMix | sample | 9e8eb490-d732-11ea-b1fd-0aad30af8a83 | initial specimen | Not Reported |
| 9 | ReferenceMix | aliquot | 9e8ecf0b-d732-11ea-b1fd-0aad30af8a83 | 9e8eb490-d732-11ea-b1fd-0aad30af8a83 | Not Reported |


The file contains data for a plex which contains both tumor and normal material from the same subject. So the associations shown above, of the file with both the tumor aliquot and the normal aliquot, are genuine.

Until the CDA team removes the denormalization, we can simulate its removal by restricting `specimen_type` to `aliquot`.

In [12]:
testsql3 = '''
SELECT all_v2.*
FROM gdc-bq-sample.integration.all_v2 AS all_v2, 
UNNEST(ResearchSubject) AS _ResearchSubject, 
UNNEST(_ResearchSubject.Specimen) AS _Specimen, 
UNNEST(_Specimen.File) AS _File 
WHERE (_File.label = 'Ctrl_10-set_3-label_116-frac_1-F4.wiff')
AND _Specimen.specimen_type = 'aliquot'
'''

t3 = Q.sql(testsql3)
for c in t3:
    print(c['id'])

ICBI-000009
ICBI-000009
ReferenceMix
ICBI-000008
ICBI-000008
ICBI-000007
ICBI-000007


That query demonstrates that even if the denormalized File data is removed, the CDA-generated query would still return duplicates of these Subjects.
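
The same conclusion can be read off the specimen rows listed earlier for this file. This small sketch uses the (case_id, specimen_type) pairs copied from that listing and counts the aliquot rows that would remain per subject:

```python
from collections import Counter

# (case_id, specimen_type) pairs copied from the specimen listing for this file.
rows = [
    ("ICBI-000009", "sample"), ("ICBI-000009", "aliquot"),
    ("ICBI-000009", "sample"), ("ICBI-000009", "aliquot"),
    ("ICBI-000008", "sample"), ("ICBI-000008", "aliquot"),
    ("ICBI-000008", "sample"), ("ICBI-000008", "aliquot"),
    ("ReferenceMix", "sample"), ("ReferenceMix", "aliquot"),
]

# Keep only the aliquot rows, as the extra WHERE condition does.
aliquot_rows = [case_id for case_id, kind in rows if kind == "aliquot"]
remaining = Counter(aliquot_rows)
print(remaining)
```

Subjects with both a tumor and a normal aliquot in the plex still appear twice, so removing the denormalization alone does not give one row per subject.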

The desired result could be produced with generated SQL as follows.

In [13]:
testsql4 = ''' SELECT a_v2.* FROM gdc-bq-sample.integration.all_v2 AS a_v2
where a_v2.id in 
( SELECT all_v2.id case_id
FROM gdc-bq-sample.integration.all_v2 AS all_v2, 
UNNEST(ResearchSubject) AS _ResearchSubject, 
UNNEST(_ResearchSubject.Specimen) AS _Specimen, 
UNNEST(_Specimen.File) AS _File 
WHERE (_File.label = 'Ctrl_10-set_3-label_116-frac_1-F4.wiff')
)
'''

t4 = Q.sql(testsql4)
for c in t4:
    print(c['id'])

ICBI-000007
ICBI-000009
ICBI-000008
ReferenceMix
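
The difference between the two query shapes can be reproduced without BigQuery access. The sketch below uses SQLite, with small normalized tables standing in for the nested `all_v2` arrays. The table names, columns, and rows are invented for illustration. An ordinary join repeats a subject once per matching specimen, while filtering on `id IN (subquery)` returns each subject once.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE subject (id TEXT PRIMARY KEY);
    CREATE TABLE specimen (id TEXT PRIMARY KEY, subject_id TEXT, specimen_type TEXT);
    CREATE TABLE specimen_file (specimen_id TEXT, label TEXT);

    INSERT INTO subject VALUES ('S1'), ('S2');
    -- S1 has a tumor and a normal aliquot in the same plex; S2 has one aliquot.
    INSERT INTO specimen VALUES
        ('S1-tumor', 'S1', 'aliquot'),
        ('S1-normal', 'S1', 'aliquot'),
        ('S2-tumor', 'S2', 'aliquot');
    INSERT INTO specimen_file VALUES
        ('S1-tumor', 'plex.wiff'),
        ('S1-normal', 'plex.wiff'),
        ('S2-tumor', 'plex.wiff');
""")

# Join shape: one row per matching specimen, so S1 comes back twice.
joined = [row[0] for row in con.execute("""
    SELECT subject.id
    FROM subject
    JOIN specimen ON specimen.subject_id = subject.id
    JOIN specimen_file ON specimen_file.specimen_id = specimen.id
    WHERE specimen_file.label = 'plex.wiff'
    ORDER BY subject.id
""")]

# Semi-join shape: the subquery only decides membership, so each subject appears once.
semi_joined = [row[0] for row in con.execute("""
    SELECT subject.id
    FROM subject
    WHERE subject.id IN (
        SELECT specimen.subject_id
        FROM specimen
        JOIN specimen_file ON specimen_file.specimen_id = specimen.id
        WHERE specimen_file.label = 'plex.wiff'
    )
    ORDER BY subject.id
""")]

print(joined)
print(semi_joined)
```

The subquery form keeps the genuine tumor and normal associations in the data while still returning one row per subject, which is the behaviour `testsql4` shows against the real table.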
