### Look at all PDC projects
Beyond the 09CO022 example, this notebook looks more broadly at all the PDC studies.


First, query for all PDC projects and list their names and subject counts.


In [2]:
from cdapython import Q

query1 = """SELECT associated_project,count(*) subject_count
from gdc-bq-sample.integration.all_v2 AS su,
unnest(ResearchSubject) AS rs,
unnest(rs.identifier) as id
where (id.system = 'PDC')
group by associated_project """

r1 = Q.sql(query1)
r1


QueryID: 70e7a9d7-9bde-448c-84c8-f0129678ca4a
Query: SELECT associated_project,count(*) subject_count
from gdc-bq-sample.integration.all_v2 AS su,
unnest(ResearchSubject) AS rs,
unnest(rs.identifier) as id
where (id.system = 'PDC')
group by associated_project 
Offset: 0
Count: 14
Total Row Count: 14
More pages: False
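
For readers less familiar with BigQuery's `unnest`, the query above can be read as a nested loop: one row per (subject, ResearchSubject, identifier) combination, filtered to PDC identifiers and counted per project. The sketch below mimics that logic in plain Python. The records and the helper name `count_pdc_subjects` are made up for illustration and are not part of cdapython or the CDA data.

```python
from collections import Counter

# Hypothetical records shaped like rows of the all_v2 table (made-up values).
subjects = [
    {"ResearchSubject": [
        {"associated_project": "Project A",
         "identifier": [{"system": "PDC", "value": "s1"}]},
        {"associated_project": "Project X",
         "identifier": [{"system": "GDC", "value": "g1"}]},
    ]},
    {"ResearchSubject": [
        {"associated_project": "Project A",
         "identifier": [{"system": "PDC", "value": "s2"}]},
    ]},
    {"ResearchSubject": [
        {"associated_project": "Project B",
         "identifier": [{"system": "PDC", "value": "s3"}]},
    ]},
]


def count_pdc_subjects(rows):
    """Count PDC-identified ResearchSubject entries per associated_project."""
    counts = Counter()
    for su in rows:                            # FROM all_v2 AS su
        for rs in su["ResearchSubject"]:       # unnest(ResearchSubject) AS rs
            for ident in rs["identifier"]:     # unnest(rs.identifier) AS id
                if ident["system"] == "PDC":   # WHERE id.system = 'PDC'
                    # GROUP BY associated_project, count(*)
                    counts[rs["associated_project"]] += 1
    return dict(counts)


project_counts = count_pdc_subjects(subjects)
print(project_counts)
```

Note that, as in the SQL, a ResearchSubject with more than one PDC identifier would be counted once per identifier.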

In [3]:
from cda_funcs import qResultsToDF 
qResultsToDF(r1)



Unnamed: 0,associated_project,subject_count
0,CPTAC3-Discovery,886
1,Integrated Proteogenomic Characterization of H...,171
2,Human Early-Onset Gastric Cancer - Korea Unive...,80
3,CPTAC3-Other,27
4,PJ25730263,9
5,CPTAC-2,347
6,CPTAC-TCGA,404
7,Georgetown Lung Cancer Proteomics Study,11
8,Proteogenomic Analysis of Pediatric Brain Canc...,207
9,Oral Squamous Cell Carcinoma - Chang Gung Univ...,39


### Look at the file content in each study
The following queries each of the above studies for two subjects and uses the result as an example. It then looks at each specimen for the selected subject (the code reads `pr[1]`, the second row returned) and compares the file content across them. More precisely, it checks each specimen's file content against the file content of the previous specimen from the same subject.


In [4]:
for project in r1:
    print('_'*100)
    projName = project['associated_project']
    print("Project:{}".format(projName))
    # Fetch two subjects from this project; only the second row (pr[1]) is examined below
    pq = Q('ResearchSubject.associated_project = "{}"'.format(projName))
    pr = pq.run(limit=2)
    for subject2 in pr[1]['ResearchSubject']:
        # Only the first identifier of each ResearchSubject is checked
        subid = subject2['identifier'][0]
        if subid['system'] == 'PDC':
            print('_'*50)
            print("Subject: {}:{}".format(subid['system'],subid['value']))
            print("Specimen count: {}".format(len(subject2['Specimen'])))
            # Sorted file ids of the previous specimen, reset for each subject
            lastFileList = []
            for s in subject2['Specimen']:
                print('_'*10)
                print('Specimen {}'.format(s['id']))
                # Label derived specimens (e.g. aliquots) with their parent specimen id
                if s['derived_from_specimen'] == 'initial specimen':
                    qual = ''
                else:
                    qual = 'derived from:'
                print("Source material {}".format(s['source_material_type']))
                print("{}: {}{}".format(s['specimen_type'], qual, s['derived_from_specimen']))
                print("files {}".format(len(s['File'])))
                # Collect and sort this specimen's file ids so the lists can be compared
                specimenFiles = []
                for f in s['File']:
                    specimenFiles.append(f['id'])
                specimenFiles.sort()
                if specimenFiles == lastFileList:
                    print("Same file content as previous specimen")
                lastFileList = specimenFiles


____________________________________________________________________________________________________
Project:CPTAC3-Discovery
Getting results from database

Total execution time: 25349 ms
__________________________________________________
Subject: PDC:f1edd1df-cf1e-11e9-9a07-0a80fada099c
Specimen count: 4
__________
Specimen a01705f8-d0a6-11e9-9a07-0a80fada099c
Source material Solid Tissue Normal
sample: initial specimen
files 165
__________
Specimen 5a910ac5-d0b0-11e9-9a07-0a80fada099c
Source material Solid Tissue Normal
aliquot: derived from:a01705f8-d0a6-11e9-9a07-0a80fada099c
files 165
Same file content as previous specimen
__________
Specimen a02ea9f3-d0a6-11e9-9a07-0a80fada099c
Source material Primary Tumor
sample: initial specimen
files 165
Same file content as previous specimen
__________
Specimen 5a700807-d0b0-11e9-9a07-0a80fada099c
Source material Primary Tumor
aliquot: derived from:a02ea9f3-d0a6-11e9-9a07-0a80fada099c
files 165
Same file content as previous specimen
________

Total execution time: 8410 ms
__________________________________________________
Subject: PDC:d08daaa1-ff5e-11e9-9a07-0a80fada099c
Specimen count: 2
__________
Specimen d08ec96b-ff5e-11e9-9a07-0a80fada099c
Source material Primary Tumor
sample: initial specimen
files 56
__________
Specimen d09048f6-ff5e-11e9-9a07-0a80fada099c
Source material Primary Tumor
aliquot: derived from:d08ec96b-ff5e-11e9-9a07-0a80fada099c
files 56
Same file content as previous specimen
____________________________________________________________________________________________________
Project:Oral Squamous Cell Carcinoma - Chang Gung University
Getting results from database

Total execution time: 8934 ms
__________________________________________________
Subject: PDC:8369a540-474e-4b9b-8f2b-f96153ac8bfe
Specimen count: 4
__________
Specimen a14d3203-d84e-4b53-8a69-9c779f82e82f
Source material Primary Tumor
sample: initial specimen
files 86
__________
Specimen dd9c3983-f923-4b0f-a84f-c815d048c62c
Source material 

This shows that, for most of the projects, the tumor and normal specimens are associated with the same set of files.
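
The loop above compares each specimen only with the one immediately before it, so two specimens with identical files that are separated by a different one would not be flagged. A possible alternative, sketched here under the assumption that specimens are dictionaries with `id` and `File` keys as in the loop, is to group specimens by their set of file ids. The specimen and file ids below are invented, and `group_by_file_set` is a hypothetical helper, not a cdapython function.

```python
from collections import defaultdict

# Hypothetical specimens for one subject (ids and file ids are made up).
specimens = [
    {"id": "sample-normal", "File": [{"id": "f1"}, {"id": "f2"}]},
    {"id": "aliquot-normal", "File": [{"id": "f2"}, {"id": "f1"}]},
    {"id": "sample-tumor", "File": [{"id": "f3"}]},
    {"id": "aliquot-tumor", "File": [{"id": "f1"}, {"id": "f2"}]},
]


def group_by_file_set(specimen_list):
    """Group specimen ids that share exactly the same set of file ids."""
    groups = defaultdict(list)
    for s in specimen_list:
        # frozenset ignores order (and duplicates), so it can serve as a dict key
        key = frozenset(f["id"] for f in s["File"])
        groups[key].append(s["id"])
    # Groups are returned in order of first appearance
    return list(groups.values())


groups = group_by_file_set(specimens)
for members in groups:
    print(len(members), members)
```

In this made-up example `aliquot-tumor` lands in the same group as the two normal specimens even though it does not directly follow them, which a previous-specimen comparison would miss.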

Take Subject 7a5931dd-1168-11ea-9bfa-0a42f3c845fe from the "Integrated Proteogenomic Characterization of HBV-related Hepatocellular carcinoma" study.

There are four specimens for that subject: an initial sample of normal tissue and one of tumor, plus an 'aliquot' of each.
The 287 file ids for each of those four specimens are identical. We know that the aliquot-level files are repeated at the sample level. However, the fact that the tumor and normal specimens both have the same set of files looks worth exploring.
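
One way to quantify the tumor/normal overlap for a subject is to collect file ids per `source_material_type` and compare the resulting sets. This is a minimal sketch over invented data; the labels `Primary Tumor` and `Solid Tissue Normal` are the ones seen in the output above, while the file ids and the helper `files_by_source_material` are assumptions made for illustration.

```python
def files_by_source_material(specimen_list):
    """Map each source_material_type to the union of its specimens' file ids."""
    by_type = {}
    for s in specimen_list:
        ids = {f["id"] for f in s["File"]}
        by_type.setdefault(s["source_material_type"], set()).update(ids)
    return by_type


# Hypothetical specimens for one subject (file ids are made up).
specimens = [
    {"source_material_type": "Solid Tissue Normal",
     "File": [{"id": "f1"}, {"id": "f2"}]},
    {"source_material_type": "Primary Tumor",
     "File": [{"id": "f1"}, {"id": "f2"}, {"id": "f3"}]},
]

by_type = files_by_source_material(specimens)
tumor = by_type["Primary Tumor"]
normal = by_type["Solid Tissue Normal"]

shared = tumor & normal        # files attached to both tumor and normal
tumor_only = tumor - normal    # files attached only to tumor specimens
normal_only = normal - tumor   # files attached only to normal specimens

print(len(shared), len(tumor_only), len(normal_only))
```

For the subject described above, one would expect both "only" sets to be empty if all four specimens really carry the same 287 files.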

This is covered in the notebook "Compare files via PDC API" which uses the PDC API to obtain additional information about files and relationships not necessarily visible through CDA.
