## File duplication

This notebook demonstrates how duplicate details for files are returned in Cancer Data Aggregator (CDA) query results.

#### Function to call the CDA API
Set up a function that calls the CDA API directly with an SQL query.

In [91]:
import requests
def runAPIQuery(querystring, limit=None):
    cdaURL = 'https://cda.cda-dev.broadinstitute.org/api/v1/sql-query/v3'
    #Using a limit:
    if limit is not None:
        cdaURL = "{}?limit={}".format(cdaURL, str(limit))
        
    headers = {'accept' : 'application/json', 'Content-Type' : 'text/plain'}

    request = requests.post(cdaURL, headers = headers, data = querystring)

    if request.status_code == 200:
        return request.json()
    else:
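        # Report the query text that was actually sent (the original referenced an undefined local name)
        raise Exception("Query failed with status code {}. Query: {}".format(request.status_code, querystring))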
                


#### Set up a query for patients with a stage II tumor in the TCGA-BRCA project
The query returns six results.

In [92]:
querystring = ''' select * from gdc-bq-sample.cda_mvp.v3
where id in
(SELECT distinct p.id FROM gdc-bq-sample.cda_mvp.v3 AS p, 
UNNEST(ResearchSubject) AS _ResearchSubject, 
UNNEST(_ResearchSubject.Diagnosis) AS _Diagnosis 
WHERE (((_ResearchSubject.associated_project = 'TCGA-BRCA') 
AND (_Diagnosis.tumor_stage = 'stage ii')) 
) )'''

query = runAPIQuery(querystring)
print('Count of results: {}'.format(len(query['result'])))

Count of results: 6


Define a function to list details of files that occur more than once in the results.

In [93]:
def findFileDuplicates(result, minoccurs=2, maxoccurs=1000000):
    """Print details of files whose occurrence count lies between minoccurs and maxoccurs."""
    fileOccurrences = {}
    # Walk result -> ResearchSubject -> Specimen -> File, tallying each file id
    for res in result:
        for su in res['ResearchSubject']:
            for sp in su['Specimen']:
                for f in sp['File']:
                    fileID = f['id']
                    # Record which specimen this copy of the file was listed under
                    specimen_details = {'id': sp['id'],
                                        'subject': sp['derived_from_subject'],
                                        'material': sp['source_material_type']}
                    if fileID in fileOccurrences:
                        fileOccurrences[fileID]['occurrences'] += 1
                        fileOccurrences[fileID]['specimens'].append(specimen_details)
                    else:
                        fileOccurrences[fileID] = {'occurrences': 1,
                                                   'file': f,
                                                   'specimens': [specimen_details]}

    # Report only the files whose count falls within the requested range
    for fileId, details in fileOccurrences.items():
        if minoccurs <= details['occurrences'] <= maxoccurs:
            print('file id: {}'.format(fileId))
            print('Occurrences in results: {}'.format(details['occurrences']))
            print('data category: {}'.format(details['file']['data_category']))
            print('file label: {}'.format(details['file']['label']))
            print('Specimens in which this file occurs within results')
            for s in details['specimens']:
                print(s)
            print('_'*80)

In [94]:
findFileDuplicates(query['result'],minoccurs=2, maxoccurs=2)

file id: 11622ce5-c58e-4999-a5c4-7a04dc01c0d2
Occurrences in results: 2
data category: Simple Nucleotide Variation
file label: 11622ce5-c58e-4999-a5c4-7a04dc01c0d2.vep.vcf.gz
Specimens in which this file occurs within results
{'id': '569dc2bf-8078-4330-afac-f3ff65903249', 'subject': 'TCGA-EW-A6SB', 'material': 'Primary Tumor'}
{'id': 'a83c0976-6b61-41be-ba96-c37986e90b6d', 'subject': 'TCGA-EW-A6SB', 'material': 'Blood Derived Normal'}
________________________________________________________________________________
file id: 15c363fd-7a70-4fe8-b8c4-3a383f169488
Occurrences in results: 2
data category: Copy Number Variation
file label: TCGA-BRCA.d9b9b065-fec4-4abf-b818-6f832d64ead4.ascat2.allelic_specific.seg.txt
Specimens in which this file occurs within results
{'id': '569dc2bf-8078-4330-afac-f3ff65903249', 'subject': 'TCGA-EW-A6SB', 'material': 'Primary Tumor'}
{'id': 'a83c0976-6b61-41be-ba96-c37986e90b6d', 'subject': 'TCGA-EW-A6SB', 'material': 'Blood Derived Normal'}
________________

These files are derived from two specimens. Per the GDC documentation, [the VCF describes a tumor-normal pair](https://docs.gdc.cancer.gov/Data/File_Formats/VCF_Format/).

To maintain proper cardinality and avoid duplication, the file would be more properly associated with the ResearchSubject rather than the Specimen. That applies to TCGA, but may not hold for other study designs. A more generally applicable model would associate the VCF with a Specimen Collection Group: a group of specimens collected together at some defined point (e.g. tumor surgery). In a study with repeated biopsies, for example, a subject would have more than one Specimen Collection Group. Perhaps most accurately, the VCF belongs primarily with the tumor sample.

It is correct that the provenance of each VCF file is that it derives from two specimens. However, that differs from the primary meaning of the file, which provides data about the subject at that point in time.
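As a rough illustration of the subject-level view described above, the sketch below collapses the per-specimen file listings into one set of file ids per subject. It assumes the same nested result structure used by `findFileDuplicates`; the function name `filesPerSubject` and the hand-made sample record are hypothetical and are not part of the CDA API.

```python
def filesPerSubject(result):
    """Map each subject to the set of unique file ids listed under any of its specimens."""
    subjectFiles = {}
    for res in result:
        for su in res['ResearchSubject']:
            for sp in su['Specimen']:
                subject = sp['derived_from_subject']
                for f in sp['File']:
                    # A set keeps each file id once, however many specimens list it
                    subjectFiles.setdefault(subject, set()).add(f['id'])
    return subjectFiles


# Hypothetical record shaped like a CDA result: one VCF listed under both
# the tumor and the normal specimen of the same subject.
subjectSample = [{'ResearchSubject': [{'Specimen': [
    {'id': 'spec-tumor', 'derived_from_subject': 'subj-1',
     'source_material_type': 'Primary Tumor', 'File': [{'id': 'file-vcf'}]},
    {'id': 'spec-normal', 'derived_from_subject': 'subj-1',
     'source_material_type': 'Blood Derived Normal', 'File': [{'id': 'file-vcf'}]},
]}]}]

print(filesPerSubject(subjectSample))
```

With the real results, a call such as `filesPerSubject(query['result'])` would be expected to list each tumor-normal VCF once per subject rather than once per specimen.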

### Files which repeat more than twice
We can use the same function as before to list details of the files that are repeated more than twice.

In [95]:
findFileDuplicates(query['result'],minoccurs=3, maxoccurs=100000)

file id: 053f01ed-3154-4aea-9e7f-932c435034b3
Occurrences in results: 12
data category: Simple Nucleotide Variation
file label: TCGA.BRCA.mutect.053f01ed-3154-4aea-9e7f-932c435034b3.DR-10.0.protected.maf.gz
Specimens in which this file occurs within results
{'id': '569dc2bf-8078-4330-afac-f3ff65903249', 'subject': 'TCGA-EW-A6SB', 'material': 'Primary Tumor'}
{'id': 'a83c0976-6b61-41be-ba96-c37986e90b6d', 'subject': 'TCGA-EW-A6SB', 'material': 'Blood Derived Normal'}
{'id': '680396b1-a19e-4acd-bef8-e1671b80aa31', 'subject': 'TCGA-EW-A6SA', 'material': 'Blood Derived Normal'}
{'id': '5a1bb598-9a38-452a-9599-365501038a88', 'subject': 'TCGA-EW-A6SA', 'material': 'Primary Tumor'}
{'id': '8fa798fe-f032-4d77-b362-06261db78e6d', 'subject': 'TCGA-BH-A202', 'material': 'Primary Tumor'}
{'id': '6df53651-4c7a-4b9f-b85f-9cbd5de65a3f', 'subject': 'TCGA-BH-A202', 'material': 'Blood Derived Normal'}
{'id': '5b94a116-7e70-41d4-8c6f-2bae44e591aa', 'subject': 'TCGA-B6-A0WZ', 'material': 'Primary Tumor'}


Most of these files are repeated 12 times, i.e. once for each of the tumor and normal specimens of each of the 6 Research Subjects within the results. The exception is the focal_score_by_genes file, which is repeated 6 times. That file derives only from the tumor sample from each subject.
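One way to check this pattern without reading the full listing is to tally how many files share each occurrence count. The following is a minimal sketch under the same assumptions about the result structure; `countFileOccurrences`, `occurrenceHistogram`, `makeSpecimen` and the sample data are hypothetical names introduced here for illustration.

```python
from collections import Counter


def countFileOccurrences(result):
    """Count how many times each file id appears across all specimens."""
    counts = Counter()
    for res in result:
        for su in res['ResearchSubject']:
            for sp in su['Specimen']:
                for f in sp['File']:
                    counts[f['id']] += 1
    return counts


def occurrenceHistogram(result):
    """Map an occurrence count to the number of files having that count."""
    return Counter(countFileOccurrences(result).values())


def makeSpecimen(specId, subject, fileIds):
    """Build a minimal, made-up specimen record for the example."""
    return {'id': specId, 'derived_from_subject': subject,
            'File': [{'id': i} for i in fileIds]}


# Two made-up subjects: a project-level MAF under every specimen, a
# tumor-only file, and one VCF per subject under both of its specimens.
histogramSample = [
    {'ResearchSubject': [{'Specimen': [
        makeSpecimen('s1-tumor', 'subj-1', ['project-maf', 'tumor-only', 'vcf-1']),
        makeSpecimen('s1-normal', 'subj-1', ['project-maf', 'vcf-1']),
    ]}]},
    {'ResearchSubject': [{'Specimen': [
        makeSpecimen('s2-tumor', 'subj-2', ['project-maf', 'tumor-only', 'vcf-2']),
        makeSpecimen('s2-normal', 'subj-2', ['project-maf', 'vcf-2']),
    ]}]},
]

print(occurrenceHistogram(histogramSample))
```

In this toy data the project-level file appears four times (twice per subject) and the other three files twice each, mirroring on a small scale the 12 and 6 counts observed above.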

Again, the GDC documentation is helpful, describing [the MAF format](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/) as follows: "One MAF file is produced per variant calling pipeline per GDC project". In other words, there is one MAF file per pipeline for 'TCGA-BRCA'.

It is again correct that the existing relationship describes provenance, and that may be useful for some purposes. However, a user of the CDA would likely be better served by eliminating duplication of the files within the results. Associating the MAF file with a project (and pipeline) gives a user a simpler view of the meaning of the file.

Provenance for a file would be more simply provided from the other end of the relationship, i.e. the MAF file would record which specimens (or VCFs) it derived from.
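The inverted relationship can be sketched by building a file-centric index in which each file appears once and carries the list of specimens it derives from. This is only an illustration over the existing result structure, not a proposed CDA schema; `fileProvenance` and the `derived_from` key are hypothetical names.

```python
def fileProvenance(result):
    """Index files by id, each with the list of specimen ids it is listed under."""
    files = {}
    for res in result:
        for su in res['ResearchSubject']:
            for sp in su['Specimen']:
                for f in sp['File']:
                    # Each file is stored once; provenance accumulates on the file
                    entry = files.setdefault(f['id'], {'file': f, 'derived_from': []})
                    if sp['id'] not in entry['derived_from']:
                        entry['derived_from'].append(sp['id'])
    return files


# Made-up record: one MAF listed under both specimens of a subject.
provenanceSample = [{'ResearchSubject': [{'Specimen': [
    {'id': 'spec-tumor', 'derived_from_subject': 'subj-1',
     'File': [{'id': 'file-maf', 'label': 'example.maf.gz'}]},
    {'id': 'spec-normal', 'derived_from_subject': 'subj-1',
     'File': [{'id': 'file-maf', 'label': 'example.maf.gz'}]},
]}]}]

for fileId, entry in fileProvenance(provenanceSample).items():
    print(fileId, entry['derived_from'])
```

A structure like this keeps the provenance information while giving a user a single entry per file.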

### CDA Python API
The same issue occurs when cdapython is used instead of calling the CDA API directly as above.

In [96]:
from cdapython import Q
import json

q1 = Q('ResearchSubject.Specimen.File.drs_uri = "drs://dg.4DFC:030e5e74-6461-4f05-a399-de8e470bc056"')
r = q1.run() 
r.sql
print(r.count)

1


In [97]:
findFileDuplicates(r)

file id: 053f01ed-3154-4aea-9e7f-932c435034b3
Occurrences in results: 2
data category: Simple Nucleotide Variation
file label: TCGA.BRCA.mutect.053f01ed-3154-4aea-9e7f-932c435034b3.DR-10.0.protected.maf.gz
Specimens in which this file occurs within results
{'id': '6ec8ccf0-4292-4add-9190-7339ffed7ffa', 'subject': 'TCGA-AR-A2LK', 'material': 'Blood Derived Normal'}
{'id': 'd6f5c34a-0f5c-4aed-977a-74a1e5d50915', 'subject': 'TCGA-AR-A2LK', 'material': 'Primary Tumor'}
________________________________________________________________________________
file id: 097d9d39-5794-4ccf-abb4-915ed60ff8a2
Occurrences in results: 2
data category: Simple Nucleotide Variation
file label: 097d9d39-5794-4ccf-abb4-915ed60ff8a2.vep.vcf.gz
Specimens in which this file occurs within results
{'id': '6ec8ccf0-4292-4add-9190-7339ffed7ffa', 'subject': 'TCGA-AR-A2LK', 'material': 'Blood Derived Normal'}
{'id': 'd6f5c34a-0f5c-4aed-977a-74a1e5d50915', 'subject': 'TCGA-AR-A2LK', 'material': 'Primary Tumor'}
__________