## Parse the proteomic files

In [1]:
from cdapython import Q, columns, unique_terms

Query for the same subject as previously (09CO022).

In [2]:
import json

q1 = Q('ResearchSubject.id = "c5421e34-e5c7-4ba5-aed9-146a5575fd8d"')
r = q1.run(limit=2)
print(r)

# Save the first result as JSON for later inspection
# (assumes the ./query_results directory already exists).
with open('./query_results/09CO022.json', 'w') as f:
    json.dump(r[0], f, indent=3)



Getting results from database

Total execution time: 10102 ms

QueryID: 10729a4f-a541-4e9b-adad-63c2e64552e1
Query: SELECT all_v2.* FROM gdc-bq-sample.integration.all_v2 AS all_v2, UNNEST(ResearchSubject) AS _ResearchSubject WHERE (_ResearchSubject.id = 'c5421e34-e5c7-4ba5-aed9-146a5575fd8d')
Offset: 0
Count: 1
Total Row Count: 1
More pages: False



Next, look at the PDC Research Subject.

Compared with genomics, there are many more files per specimen. The file names follow a format like the following example:
```13CPTAC_COprospective_W_PNNL_20170123_B4S1_f04.mzid.gz```

We'll break that down to work out what is going on.
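As a minimal sketch of that breakdown, the example name above can be split on underscores, with the extension(s) separated from the last part. The helper name `split_label` is hypothetical and only used for illustration here.

```python
def split_label(label):
    """Split a file label into underscore-separated parts and a trailing extension."""
    parts = label.split("_")
    # The last part still carries the extension(s), e.g. "f04.mzid.gz".
    stem, _, exts = parts[-1].partition(".")
    parts[-1] = stem
    return parts, exts


parts, exts = split_label("13CPTAC_COprospective_W_PNNL_20170123_B4S1_f04.mzid.gz")
print(parts)  # ['13CPTAC', 'COprospective', 'W', 'PNNL', '20170123', 'B4S1', 'f04']
print(exts)   # mzid.gz
```

Using `str.partition` rather than `split(".", 1)` means a label without any extension yields an empty `exts` string instead of raising an error.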

The second Research Subject in the result is the PDC Research Subject.

In [3]:
subject2 = r[0]['ResearchSubject'][1]
#Count how many specimens it has
len(subject2['Specimen'])

5

Examine the files attached to this subject's specimens. Keep only specimens of type `aliquot`: the same files are also denormalized onto higher-level objects, and including those would produce duplicates.

In [4]:
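files = []
for s2 in subject2['Specimen']:
    if s2['specimen_type'] == 'aliquot':
        for f in s2['File']:
            # Split the label on underscores into columns p1, p2, ...
            parts = f['label'].split('_')
            for n, p in enumerate(parts, start=1):
                f['p' + str(n)] = p
            # The last part still carries the extension(s), e.g. "f04.mzid.gz":
            # keep the stem in the last p column and the rest in 'exts'.
            last = 'p' + str(len(parts))
            stem, _, exts = f[last].partition('.')
            f[last] = stem
            f['exts'] = exts
            # Record the source material type of the specimen (e.g. Primary Tumor)
            f['sample'] = s2['source_material_type']
            files.append(f)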


In [11]:
import pandas as pd
df = pd.DataFrame(files)

In [6]:
df

,id,identifier,label,data_category,data_type,file_format,associated_project,drs_uri,byte_size,checksum,p1,p2,p3,p4,p5,p6,p7,exts,sample
0,0235942c-7bd3-46c1-8a3d-87e42a0b1e86,"[{'system': 'PDC', 'value': '0235942c-7bd3-46c...",13CPTAC_COprospective_W_PNNL_20170123_B4S1_f03...,Peptide Spectral Matches,Text,tsv,CPTAC-2,drs://dg.4DFC:0235942c-7bd3-46c1-8a3d-87e42a0b...,5195961,c1e585b85d70314ddfb9fe930faf6a0f,13CPTAC,COprospective,W,PNNL,20170123,B4S1,f03,raw.cap.psm,Solid Tissue Normal
1,0a63771a-14b2-4d6b-9c08-c7177c372789,"[{'system': 'PDC', 'value': '0a63771a-14b2-4d6...",13CPTAC_COprospective_W_PNNL_20170123_B4S1_f02...,Processed Mass Spectra,Open Standard,mzML,CPTAC-2,drs://dg.4DFC:0a63771a-14b2-4d6b-9c08-c7177c37...,213302411,8e16ac2e87bffdea86d75e03a8cf0ffa,13CPTAC,COprospective,W,PNNL,20170123,B4S1,f02,mzML.gz,Solid Tissue Normal
2,0d846fdf-169e-4081-910d-a792fab3df43,"[{'system': 'PDC', 'value': '0d846fdf-169e-408...",13CPTAC_COprospective_P_PNNL_20170215_B4S1_f06...,Peptide Spectral Matches,Open Standard,mzIdentML,CPTAC-2,drs://dg.4DFC:0d846fdf-169e-4081-910d-a792fab3...,4247403,5f8d183651df64850a85a64586aa2d9b,13CPTAC,COprospective,P,PNNL,20170215,B4S1,f06,mzid.gz,Solid Tissue Normal
3,0daea74c-6886-415b-9395-cd4e0857acb4,"[{'system': 'PDC', 'value': '0daea74c-6886-415...",13CPTAC_COprospective_W_PNNL_20170123_B4S1_f02...,Peptide Spectral Matches,Open Standard,mzIdentML,CPTAC-2,drs://dg.4DFC:0daea74c-6886-415b-9395-cd4e0857...,7044458,fc0008314a37ba989ed29b1c7f9d67e1,13CPTAC,COprospective,W,PNNL,20170123,B4S1,f02,mzid.gz,Solid Tissue Normal
4,1279b00d-cb50-4021-a595-08e7c3dd35c0,"[{'system': 'PDC', 'value': '1279b00d-cb50-402...",13CPTAC_COprospective_W_PNNL_20170123_B4S1_f04...,Peptide Spectral Matches,Open Standard,mzIdentML,CPTAC-2,drs://dg.4DFC:1279b00d-cb50-4021-a595-08e7c3dd...,7311308,48c15a24e109b3f783b8a23766f25a94,13CPTAC,COprospective,W,PNNL,20170123,B4S1,f04,mzid.gz,Solid Tissue Normal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
158,e678893f-fadc-4a92-842e-e89534886b9d,"[{'system': 'PDC', 'value': 'e678893f-fadc-4a9...",16CPTAC_COprospective_W_PNNL_20170123_B4S4_f07...,Peptide Spectral Matches,Open Standard,mzIdentML,CPTAC-2,drs://dg.4DFC:e678893f-fadc-4a92-842e-e8953488...,7779691,bf1889149a45b609e94b4c70e60326c0,16CPTAC,COprospective,W,PNNL,20170123,B4S4,f07,mzid.gz,Primary Tumor
159,e68b9b17-100e-48c1-84d7-d1a5f9485c1e,"[{'system': 'PDC', 'value': 'e68b9b17-100e-48c...",16CPTAC_COprospective_W_PNNL_20170123_B4S4_f06...,Peptide Spectral Matches,Open Standard,mzIdentML,CPTAC-2,drs://dg.4DFC:e68b9b17-100e-48c1-84d7-d1a5f948...,7633586,b0a122f5fcf894bc5eab79bfc1b9cd26,16CPTAC,COprospective,W,PNNL,20170123,B4S4,f06,mzid.gz,Primary Tumor
160,ec2b6628-0081-4609-9d84-8fe7e8fe5ee2,"[{'system': 'PDC', 'value': 'ec2b6628-0081-460...",16CPTAC_COprospective_W_PNNL_20170123_B4S4_f08...,Processed Mass Spectra,Open Standard,mzML,CPTAC-2,drs://dg.4DFC:ec2b6628-0081-4609-9d84-8fe7e8fe...,207488107,1626736ac46b5d91480fe6160b24f2a8,16CPTAC,COprospective,W,PNNL,20170123,B4S4,f08,mzML.gz,Primary Tumor
161,eec97b50-58e7-41e9-9b53-e192632655c4,"[{'system': 'PDC', 'value': 'eec97b50-58e7-41e...",16CPTAC_COprospective_W_PNNL_20170123_B4S4_f12...,Raw Mass Spectra,Proprietary,vendor-specific,CPTAC-2,drs://dg.4DFC:eec97b50-58e7-41e9-9b53-e1926326...,1773260093,6422e8c855bacfdf80154503ce276fe1,16CPTAC,COprospective,W,PNNL,20170123,B4S4,f12,raw,Primary Tumor


Check whether our parsing of the file names came up with a value for all elements of our file name pattern.

In [7]:
df.count()

id                    163
identifier            163
label                 163
data_category         163
data_type             163
file_format           163
associated_project    163
drs_uri               163
byte_size             163
checksum              163
p1                    163
p2                    163
p3                    163
p4                    163
p5                    163
p6                    163
p7                    163
exts                  163
sample                163
dtype: int64

It did.
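The check works because `DataFrame.count()` counts only non-null values per column, so a label with fewer parts than the others would show up as a lower count. A small sketch on made-up rows (the `toy` frame and its labels are illustrative, not taken from the query result):

```python
import pandas as pd

# Two invented labels; the second has one part fewer, so its p3 is missing.
toy = pd.DataFrame([
    {"label": "a_b_f01.raw", "p1": "a", "p2": "b", "p3": "f01"},
    {"label": "a_f02.raw", "p1": "a", "p2": "f02"},
])

counts = toy.count()
# Columns with fewer non-null values than there are rows.
incomplete = counts[counts < len(toy)]
print(incomplete)
```

An empty `incomplete` Series would correspond to the "It did." result above.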

Look at the distinct values of each component of the file name (p1 through p6).

In [8]:
for p in range(1,7):
    pcol = 'p'+str(p)
    print("column:{} values:{}".format(pcol,df[pcol].unique()))

column:p1 values:['13CPTAC' '87CPTAC' '16CPTAC']
column:p2 values:['COprospective']
column:p3 values:['W' 'P']
column:p4 values:['PNNL' 'VU']
column:p5 values:['20170123' '20170215' '20160701']
column:p6 values:['B4S1' '09CO022' 'B4S4']


* p2 is constant for this study and is an abbreviated name of the dataset.
* p3 indicates phosphoproteome (P) or whole proteome (W).
* p4 holds abbreviations for the two sites at which the work was done.
* p5 looks like a date.
* p6 is likely some representation of the sample id.

A note on p6: in the VU files it is the subject id (09CO022). That works as a unique sample id because VU dealt with only one sample (the tumor). In the PNNL files, B4S4 appears to be a local identifier for the tumor specimen and B4S1 for the normal specimen.
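One way to check that reading of p6 is to count how many distinct `sample` values occur for each (p4, p6) pair. The sketch below uses a small hand-built frame that mirrors the values seen above; on the real data the same `groupby` would be run on `df`.

```python
import pandas as pd

# Hand-built rows mirroring the p4 / p6 / sample combinations shown above.
toy = pd.DataFrame({
    "p4": ["PNNL", "PNNL", "PNNL", "VU"],
    "p6": ["B4S1", "B4S4", "B4S4", "09CO022"],
    "sample": ["Solid Tissue Normal", "Primary Tumor", "Primary Tumor", "Primary Tumor"],
})

# Number of distinct sample types seen for each (site, p6) pair.
types_per_p6 = toy.groupby(["p4", "p6"])["sample"].nunique()
print(types_per_p6)

# True when every p6 value maps to exactly one sample type.
print(bool((types_per_p6 == 1).all()))
```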

Next we group the DataFrame in a way that helps show how the files are organized. It took some experimentation to find the 'group by' columns that form the minimum set needed to distinguish any file from its companions.

p1 and p5 proved unnecessary for that purpose, though they may still contain useful information.
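The grouping in this notebook was found by experimentation. As an alternative, a brute-force search could look for the smallest combination of columns whose values are unique per row. The sketch below is an assumption about how that might be done, not the method used here; `smallest_key` is a hypothetical helper and the `toy` frame is invented.

```python
from itertools import combinations

import pandas as pd


def smallest_key(frame, candidates):
    """Return the first smallest tuple of columns whose values are unique per row."""
    for size in range(1, len(candidates) + 1):
        for cols in combinations(candidates, size):
            # duplicated() flags rows that repeat an earlier row on these columns.
            if not frame.duplicated(subset=list(cols)).any():
                return cols
    return None  # no combination of the candidates distinguishes every row


toy = pd.DataFrame({
    "p3": ["W", "W", "W", "P"],
    "p7": ["f01", "f01", "f02", "f01"],
    "exts": ["raw", "mzML.gz", "raw", "raw"],
})
print(smallest_key(toy, ["p3", "p7", "exts"]))  # ('p3', 'p7', 'exts')
```

The search tries every combination, so its cost grows exponentially with the number of candidate columns. That is acceptable for a handful of file name parts but not for wide tables.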

The following sets formatting of cell borders for the display of dataframes.

In [9]:
%%HTML
<style type="text/css">
table.dataframe td, table.dataframe th {
    border: 1px black solid !important;
    color: black !important;
}
</style>

In [10]:
proteomics_df = df.groupby(['sample','p6','p4','p3','p7','data_category','file_format','exts']).count()[['id']]

# set some other display options
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
pd.set_option('display.colheader_justify', 'center')
pd.set_option('display.precision', 3)


display(proteomics_df)

sample,p6,p4,p3,p7,data_category,file_format,exts,id
Primary Tumor,09CO022,VU,W,f01,Peptide Spectral Matches,mzIdentML,mzid.gz,1
Primary Tumor,09CO022,VU,W,f01,Peptide Spectral Matches,tsv,raw.cap.psm,1
Primary Tumor,09CO022,VU,W,f01,Processed Mass Spectra,mzML,mzML.gz,1
Primary Tumor,09CO022,VU,W,f01,Raw Mass Spectra,vendor-specific,raw,1
Primary Tumor,09CO022,VU,W,f02,Peptide Spectral Matches,mzIdentML,mzid.gz,1
Primary Tumor,09CO022,VU,W,f02,Peptide Spectral Matches,tsv,raw.cap.psm,1
Primary Tumor,09CO022,VU,W,f02,Processed Mass Spectra,mzML,mzML.gz,1
Primary Tumor,09CO022,VU,W,f02,Raw Mass Spectra,vendor-specific,raw,1
Primary Tumor,09CO022,VU,W,f03,Peptide Spectral Matches,mzIdentML,mzid.gz,1
Primary Tumor,09CO022,VU,W,f03,Peptide Spectral Matches,tsv,raw.cap.psm,1


### Conclusions

The above has some bearing on the CDA and the CRDC-H model.

In particular, this is relevant to picking the minimum attributes for files/data. From a user's point of view, the minimum should be the smallest set of fields that allows them to distinguish one file from another. The example above illustrates that this is not a static set of fields, which should be taken into consideration in any discussion of minimum fields. Tagging this as [cda-python/issues/104](https://github.com/CancerDataAggregator/cda-python/issues/104).
