### Retrieving files at a specific level within a Subject

Issue [cda-python/issues/93](https://github.com/CancerDataAggregator/cda-python/issues/93) raised the need to retrieve files at a specific level within a Subject's hierarchy of data.

This is an illustration of how Q.sql() can be used to do so.

A side effect of this exploration is that it turned up another consequence of the repetition (more than simple duplication) of files within the specimen hierarchy.

In [1]:
from cdapython import Q
import json

### A query to get file details from the Research Subject level
To keep the output compact, we're limiting our query to a specific Subject and to BAM files.

In [2]:
query1 = """SELECT su.id sid , rs.identifier rsid, fi.drs_uri
from gdc-bq-sample.integration.all_v2 AS su,
unnest(ResearchSubject) AS rs,
unnest(rs.File) as fi
where (su.id = 'TCGA-13-1504')
and file_format = 'BAM' """

r1 = Q.sql(query1)
r1


QueryID: ab188714-7e95-41cd-95c1-62871a694315
Query: SELECT su.id sid , rs.identifier rsid, fi.drs_uri
from gdc-bq-sample.integration.all_v2 AS su,
unnest(ResearchSubject) AS rs,
unnest(rs.File) as fi
where (su.id = 'TCGA-13-1504')
and file_format = 'BAM' 
Offset: 0
Count: 3
Total Row Count: 3
More pages: False

In [3]:
for row in r1:
    print(json.dumps(row, indent=3))

{
   "sid": "TCGA-13-1504",
   "rsid": [
      {
         "system": "GDC",
         "value": "cd49126a-ec15-43fa-9e43-3f7460d43f2b"
      }
   ],
   "drs_uri": "drs://dg.4DFC:8b7859ed-395a-449d-8e4b-7d385d0ffa43"
}
{
   "sid": "TCGA-13-1504",
   "rsid": [
      {
         "system": "GDC",
         "value": "cd49126a-ec15-43fa-9e43-3f7460d43f2b"
      }
   ],
   "drs_uri": "drs://dg.4DFC:698994fe-b22a-4723-b9f0-23551fd3bb0e"
}
{
   "sid": "TCGA-13-1504",
   "rsid": [
      {
         "system": "GDC",
         "value": "cd49126a-ec15-43fa-9e43-3f7460d43f2b"
      }
   ],
   "drs_uri": "drs://dg.4DFC:07a9e595-023d-47d5-97fa-3fcc62f5d0c6"
}


### Or to provide more complete details of the file
We select all fields of the File by using fi.* in the select clause.

In [4]:
query2 = """SELECT su.id sid , rs.identifier rsid, fi.*
from gdc-bq-sample.integration.all_v2 AS su,
unnest(ResearchSubject) AS rs,
unnest(rs.File) as fi
where (su.id = 'TCGA-13-1504')
and file_format = 'BAM' """

r2 = Q.sql(query2)
for row in r2:
    print(json.dumps(row, indent=3))

{
   "sid": "TCGA-13-1504",
   "rsid": [
      {
         "system": "GDC",
         "value": "cd49126a-ec15-43fa-9e43-3f7460d43f2b"
      }
   ],
   "id": "8b7859ed-395a-449d-8e4b-7d385d0ffa43",
   "identifier": [
      {
         "system": "GDC",
         "value": "8b7859ed-395a-449d-8e4b-7d385d0ffa43"
      }
   ],
   "label": "TCGA-13-1504-01A-01R-1565-13_mirna_gdc_realn.bam",
   "data_category": "Sequencing Reads",
   "data_type": "Aligned Reads",
   "file_format": "BAM",
   "associated_project": "TCGA-OV",
   "drs_uri": "drs://dg.4DFC:8b7859ed-395a-449d-8e4b-7d385d0ffa43",
   "byte_size": "617903851",
   "checksum": "42b51bf61122b716cd02a628da9b7d89"
}
{
   "sid": "TCGA-13-1504",
   "rsid": [
      {
         "system": "GDC",
         "value": "cd49126a-ec15-43fa-9e43-3f7460d43f2b"
      }
   ],
   "id": "698994fe-b22a-4723-b9f0-23551fd3bb0e",
   "identifier": [
      {
         "system": "GDC",
         "value": "698994fe-b22a-4723-b9f0-23551fd3bb0e"
      }
   ],
   "label": "C2

Note that some care is needed in reading the results. Because fi.* flattens the File fields into the result row, the id and identifier attributes belong to the File (not to the Subject or ResearchSubject), as do all the attributes that follow them in each row. There may be a way of writing the select clause that retains this nesting.
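One way to make the ownership of each attribute explicit is to regroup the row after the query has run. The following is a minimal sketch, not part of cdapython: the helper name `nest_file_fields` and the `SUBJECT_KEYS` tuple are assumptions for illustration, and the sample row is an abridged copy of the first result printed above.

```python
import json

# Keys that come from the Subject / ResearchSubject part of the select clause.
# Assumption: these match the aliases used in query2 (sid, rsid).
SUBJECT_KEYS = ("sid", "rsid")


def nest_file_fields(row, subject_keys=SUBJECT_KEYS):
    """Return a new dict with every non-subject field grouped under 'file'."""
    nested = {key: row[key] for key in subject_keys if key in row}
    nested["file"] = {
        key: value for key, value in row.items() if key not in subject_keys
    }
    return nested


# Abridged copy of the first row returned by query2 above.
flat_row = {
    "sid": "TCGA-13-1504",
    "rsid": [{"system": "GDC", "value": "cd49126a-ec15-43fa-9e43-3f7460d43f2b"}],
    "id": "8b7859ed-395a-449d-8e4b-7d385d0ffa43",
    "identifier": [
        {"system": "GDC", "value": "8b7859ed-395a-449d-8e4b-7d385d0ffa43"}
    ],
    "label": "TCGA-13-1504-01A-01R-1565-13_mirna_gdc_realn.bam",
    "file_format": "BAM",
    "drs_uri": "drs://dg.4DFC:8b7859ed-395a-449d-8e4b-7d385d0ffa43",
}

print(json.dumps(nest_file_fields(flat_row), indent=3))
```

On the SQL side, BigQuery also allows selecting the unnested alias itself (for example `fi AS file`) so that the File comes back as a single STRUCT column. Whether Q.sql() renders that as a nested object has not been checked here.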

### Digging deeper - File details from the specimen

In [5]:
specimenFileQuery = """SELECT rs.identifier rsid, fi.drs_uri
from gdc-bq-sample.integration.all_v2 AS su,
unnest(ResearchSubject) AS rs,
unnest(rs.Specimen) AS sp,
unnest(sp.File) as fi
where (su.id = 'TCGA-13-1504')
and fi.file_format = 'BAM' """
specimenFiles = Q.sql(specimenFileQuery)
specimenFiles


QueryID: aabeb160-0f87-4c89-9751-bec00178bc06
Query: SELECT rs.identifier rsid, fi.drs_uri
from gdc-bq-sample.integration.all_v2 AS su,
unnest(ResearchSubject) AS rs,
unnest(rs.Specimen) AS sp,
unnest(sp.File) as fi
where (su.id = 'TCGA-13-1504')
and fi.file_format = 'BAM' 
Offset: 0
Count: 16
Total Row Count: 16
More pages: False

### That's odd
We had three BAM files for this subject before. Now we have sixteen.

In [6]:
for row in specimenFiles:
    print(json.dumps(row, indent=3))

{
   "rsid": [
      {
         "system": "GDC",
         "value": "cd49126a-ec15-43fa-9e43-3f7460d43f2b"
      }
   ],
   "drs_uri": "drs://dg.4DFC:07a9e595-023d-47d5-97fa-3fcc62f5d0c6"
}
{
   "rsid": [
      {
         "system": "GDC",
         "value": "cd49126a-ec15-43fa-9e43-3f7460d43f2b"
      }
   ],
   "drs_uri": "drs://dg.4DFC:07a9e595-023d-47d5-97fa-3fcc62f5d0c6"
}
{
   "rsid": [
      {
         "system": "GDC",
         "value": "cd49126a-ec15-43fa-9e43-3f7460d43f2b"
      }
   ],
   "drs_uri": "drs://dg.4DFC:07a9e595-023d-47d5-97fa-3fcc62f5d0c6"
}
{
   "rsid": [
      {
         "system": "GDC",
         "value": "cd49126a-ec15-43fa-9e43-3f7460d43f2b"
      }
   ],
   "drs_uri": "drs://dg.4DFC:07a9e595-023d-47d5-97fa-3fcc62f5d0c6"
}
{
   "rsid": [
      {
         "system": "GDC",
         "value": "cd49126a-ec15-43fa-9e43-3f7460d43f2b"
      }
   ],
   "drs_uri": "drs://dg.4DFC:698994fe-b22a-4723-b9f0-23551fd3bb0e"
}
{
   "rsid": [
      {
         "system": "GDC",
      

Close inspection of the above shows that only three distinct files are present; each drs_uri simply appears several times.
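Rather than reading every row, the repetition can be confirmed by tallying rows per drs_uri. This is a small sketch using only the standard library; `count_by_drs_uri` is a hypothetical helper name, and `sample_rows` holds just the first five rows printed above, reduced to the drs_uri field.

```python
from collections import Counter


def count_by_drs_uri(rows):
    """Count how many result rows share each drs_uri."""
    return Counter(row["drs_uri"] for row in rows)


# The first five rows of the output above, reduced to the drs_uri field.
sample_rows = [
    {"drs_uri": "drs://dg.4DFC:07a9e595-023d-47d5-97fa-3fcc62f5d0c6"},
    {"drs_uri": "drs://dg.4DFC:07a9e595-023d-47d5-97fa-3fcc62f5d0c6"},
    {"drs_uri": "drs://dg.4DFC:07a9e595-023d-47d5-97fa-3fcc62f5d0c6"},
    {"drs_uri": "drs://dg.4DFC:07a9e595-023d-47d5-97fa-3fcc62f5d0c6"},
    {"drs_uri": "drs://dg.4DFC:698994fe-b22a-4723-b9f0-23551fd3bb0e"},
]

counts = count_by_drs_uri(sample_rows)
for uri, n in counts.items():
    print(n, uri)
```

The same helper should work on the rows of the full result, assuming each row is a dict with a drs_uri key as in the printed output.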

### What's going on?
Some inspection showed where the problem was. The following query illustrates it by adding the specimen type to the select clause.

In [7]:
specimenFileQuery2 = """SELECT specimen_type, fi.drs_uri
from gdc-bq-sample.integration.all_v2 AS su,
unnest(ResearchSubject) AS rs,
unnest(rs.Specimen) AS sp,
unnest(sp.File) as fi
where (su.id = 'TCGA-13-1504')
and fi.file_format = 'BAM' """
specimenFiles2 = Q.sql(specimenFileQuery2)
for row in specimenFiles2:
    print(json.dumps(row, indent=3))

{
   "specimen_type": "sample",
   "drs_uri": "drs://dg.4DFC:07a9e595-023d-47d5-97fa-3fcc62f5d0c6"
}
{
   "specimen_type": "portion",
   "drs_uri": "drs://dg.4DFC:07a9e595-023d-47d5-97fa-3fcc62f5d0c6"
}
{
   "specimen_type": "analyte",
   "drs_uri": "drs://dg.4DFC:07a9e595-023d-47d5-97fa-3fcc62f5d0c6"
}
{
   "specimen_type": "aliquot",
   "drs_uri": "drs://dg.4DFC:07a9e595-023d-47d5-97fa-3fcc62f5d0c6"
}
{
   "specimen_type": "sample",
   "drs_uri": "drs://dg.4DFC:698994fe-b22a-4723-b9f0-23551fd3bb0e"
}
{
   "specimen_type": "sample",
   "drs_uri": "drs://dg.4DFC:8b7859ed-395a-449d-8e4b-7d385d0ffa43"
}
{
   "specimen_type": "portion",
   "drs_uri": "drs://dg.4DFC:698994fe-b22a-4723-b9f0-23551fd3bb0e"
}
{
   "specimen_type": "portion",
   "drs_uri": "drs://dg.4DFC:8b7859ed-395a-449d-8e4b-7d385d0ffa43"
}
{
   "specimen_type": "slide",
   "drs_uri": "drs://dg.4DFC:698994fe-b22a-4723-b9f0-23551fd3bb0e"
}
{
   "specimen_type": "slide",
   "drs_uri": "drs://dg.4DFC:8b7859ed-395a-449d-8e4b-7d3

#### Diagnosis
The problem arises because the same File is repeated within the content of multiple specimens (sample, portion, analyte, aliquot, slide), so unnesting the specimens returns the File once for each specimen that contains it. This is the duplication referred to in the v2 updates recently added to [cda-service/issues/79](https://github.com/CancerDataAggregator/cda-service/issues/79).
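To see the pattern at a glance, the rows can be regrouped so that each file lists the specimen types it appears under. The sketch below is illustrative only: `specimen_types_by_file` is a hypothetical helper, and `sample_rows` contains the nine complete rows from the output above (the truncated tenth row is left out).

```python
from collections import defaultdict

FILE_07A9 = "drs://dg.4DFC:07a9e595-023d-47d5-97fa-3fcc62f5d0c6"
FILE_6989 = "drs://dg.4DFC:698994fe-b22a-4723-b9f0-23551fd3bb0e"
FILE_8B78 = "drs://dg.4DFC:8b7859ed-395a-449d-8e4b-7d385d0ffa43"


def specimen_types_by_file(rows):
    """Map each drs_uri to the specimen types it was returned under, in order."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["drs_uri"]].append(row["specimen_type"])
    return dict(grouped)


# The nine complete rows printed above, in their original order.
sample_rows = [
    {"specimen_type": "sample", "drs_uri": FILE_07A9},
    {"specimen_type": "portion", "drs_uri": FILE_07A9},
    {"specimen_type": "analyte", "drs_uri": FILE_07A9},
    {"specimen_type": "aliquot", "drs_uri": FILE_07A9},
    {"specimen_type": "sample", "drs_uri": FILE_6989},
    {"specimen_type": "sample", "drs_uri": FILE_8B78},
    {"specimen_type": "portion", "drs_uri": FILE_6989},
    {"specimen_type": "portion", "drs_uri": FILE_8B78},
    {"specimen_type": "slide", "drs_uri": FILE_6989},
]

by_file = specimen_types_by_file(sample_rows)
for uri, specimen_types in by_file.items():
    print(uri, specimen_types)
```

Grouping this way makes it clear that each repeated row corresponds to a different level of the specimen hierarchy rather than to a different file.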

### Fixing a hole
We can work around the problem by restricting the query to the specimen type (aliquot) whose files we are interested in.

In [8]:
specimenFileQuery3 = """SELECT rs.identifier rsid, fi.*
from gdc-bq-sample.integration.all_v2 AS su,
unnest(ResearchSubject) AS rs,
unnest(rs.Specimen) AS sp,
unnest(sp.File) as fi
where (su.id = 'TCGA-13-1504')
and specimen_type = 'aliquot'
and fi.file_format = 'BAM' """
specimenFiles3 = Q.sql(specimenFileQuery3)
specimenFiles3


QueryID: eb7da425-5f0f-4859-a6d2-bdf091dc2f96
Query: SELECT rs.identifier rsid, fi.*
from gdc-bq-sample.integration.all_v2 AS su,
unnest(ResearchSubject) AS rs,
unnest(rs.Specimen) AS sp,
unnest(sp.File) as fi
where (su.id = 'TCGA-13-1504')
and specimen_type = 'aliquot'
and fi.file_format = 'BAM' 
Offset: 0
Count: 3
Total Row Count: 3
More pages: False

That gets us back to the expected number of files, i.e. three, with no repetition.

And we can list the results as follows.

In [9]:
for row in specimenFiles3:
    print(json.dumps(row, indent=3))

{
   "rsid": [
      {
         "system": "GDC",
         "value": "cd49126a-ec15-43fa-9e43-3f7460d43f2b"
      }
   ],
   "id": "07a9e595-023d-47d5-97fa-3fcc62f5d0c6",
   "identifier": [
      {
         "system": "GDC",
         "value": "07a9e595-023d-47d5-97fa-3fcc62f5d0c6"
      }
   ],
   "label": "C239.TCGA-13-1504-10A-01W.6_gdc_realn.bam",
   "data_category": "Sequencing Reads",
   "data_type": "Aligned Reads",
   "file_format": "BAM",
   "associated_project": "TCGA-OV",
   "drs_uri": "drs://dg.4DFC:07a9e595-023d-47d5-97fa-3fcc62f5d0c6",
   "byte_size": "35466456040",
   "checksum": "74451615312647e65cc6e6e69c0a9e0b"
}
{
   "rsid": [
      {
         "system": "GDC",
         "value": "cd49126a-ec15-43fa-9e43-3f7460d43f2b"
      }
   ],
   "id": "698994fe-b22a-4723-b9f0-23551fd3bb0e",
   "identifier": [
      {
         "system": "GDC",
         "value": "698994fe-b22a-4723-b9f0-23551fd3bb0e"
      }
   ],
   "label": "C239.TCGA-13-1504-01A-01W.5_gdc_realn.bam",
   "data_catego