### Comparison of DICOM approaches in two Gen3 implementations
Other notebooks with CDA tests have shown that the flat DICOM structure provides limited information to effectively navigate and manage the large numbers of files in DICOM *studies* and *series*.

Some of the work of the Medical Imaging and Data Resource Center (MIDRC) https://www.midrc.org was recently shared.

Like much of CRDC, the MIDRC data portal (https://data.midrc.org) is based on Gen3, and MIDRC handles DICOM. Some studies are present in both CDA and MIDRC.

The following examines how each of these Gen3 instances handles the data from the same study and the accessibility each provides to the files. This can help inform CDA functionality.

### Using CDA to look at MIDRC studies


In [1]:
from cdapython import Q
import json

First run a query against the CDA to count files per IDC project

In [2]:
query1 = """SELECT f.associated_project,count(*) file_count
from gdc-bq-sample.integration.all_v2 AS su,
unnest(File) AS f,
unnest(f.identifier) as id
where (id.system = 'IDC')
group by f.associated_project """

r1 = Q.sql(query1, limit=200)
r1


QueryID: 814c750a-3013-473e-a1fd-703de8428ec9
Query: SELECT f.associated_project,count(*) file_count
from gdc-bq-sample.integration.all_v2 AS su,
unnest(File) AS f,
unnest(f.identifier) as id
where (id.system = 'IDC')
group by f.associated_project 
Offset: 0
Count: 107
Total Row Count: 107
More pages: False

### List MIDRC project image counts that are in IDC
The query above includes a lot of projects so we'll confine the listing of the results to those that have some relationship to MIDRC.

In [3]:
import pandas as pd
file_counts = []
for proj in r1:
    if proj['associated_project'].startswith('covid') or proj['associated_project'].startswith('midrc'):
        file_counts.append({'project':proj['associated_project'], 'file count':proj['file_count']})
display(pd.DataFrame(file_counts))



| | project | file count |
|---|---|---|
| 0 | covid_19_ar | 31935 |
| 1 | midrc_ricord_1b | 21220 |
| 2 | midrc_ricord_1c | 1257 |
| 3 | midrc_ricord_1a | 31856 |


### Files per subject
There are at least four studies that are in both MIDRC and the Imaging Data Commons.

Working with the covid_19_ar data, we can run a query to list the file counts per subject.

In [4]:
import json

query2 = """SELECT su.id subject_id, count(*) file_count
from gdc-bq-sample.integration.all_v2 AS su,
unnest(File) AS f,
unnest(f.identifier) as id
where (id.system = 'IDC')
and f.associated_project = 'covid_19_ar' 
group by su.id"""

r2 = Q.sql(query2, limit=200)
r2


QueryID: 88b72300-7bb8-46f8-a09e-0b22748dd336
Query: SELECT su.id subject_id, count(*) file_count
from gdc-bq-sample.integration.all_v2 AS su,
unnest(File) AS f,
unnest(f.identifier) as id
where (id.system = 'IDC')
and f.associated_project = 'covid_19_ar' 
group by su.id
Offset: 0
Count: 105
Total Row Count: 105
More pages: False

And tabulate the results for display

In [5]:
su_file_counts = []
for su in r2:
    su_file_counts.append({'subject':su['subject_id'], 'file count':su['file_count']})
display(pd.DataFrame(su_file_counts))

| | subject | file count |
|---|---|---|
| 0 | COVID-19-AR-16424103 | 5 |
| 1 | COVID-19-AR-16434399 | 1 |
| 2 | COVID-19-AR-16406496 | 3 |
| 3 | COVID-19-AR-16406504 | 2 |
| 4 | COVID-19-AR-16406522 | 2 |
| ... | ... | ... |
| 100 | COVID-19-AR-16406559 | 3 |
| 101 | COVID-19-AR-16424071 | 1020 |
| 102 | COVID-19-AR-16434453 | 1465 |
| 103 | COVID-19-AR-16434395 | 7 |


The table above only tops and tails the full list of subjects, but we can see that there is a wide range in the number of image files for each subject. This likely reflects that different imaging procedures were performed on different subjects, with some procedures generating many more files than others.
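To get a rough feel for that spread, a quick summary of the counts is enough. The sketch below uses only the nine rows visible in the truncated table above, not the full 105 subjects, so the numbers are illustrative; the cut-off of 100 files is an arbitrary choice made here.

```python
from statistics import median

# Only the rows visible in the truncated display above (the full result has 105 subjects).
visible_counts = {
    "COVID-19-AR-16424103": 5,
    "COVID-19-AR-16434399": 1,
    "COVID-19-AR-16406496": 3,
    "COVID-19-AR-16406504": 2,
    "COVID-19-AR-16406522": 2,
    "COVID-19-AR-16406559": 3,
    "COVID-19-AR-16424071": 1020,
    "COVID-19-AR-16434453": 1465,
    "COVID-19-AR-16434395": 7,
}

values = list(visible_counts.values())
print("min:", min(values), "median:", median(values), "max:", max(values))

# Subjects above an arbitrary cut-off are the ones where a flat file list is hard to navigate.
threshold = 100
large = sorted(
    ((subject, n) for subject, n in visible_counts.items() if n > threshold),
    key=lambda item: item[1],
    reverse=True,
)
for subject, n in large:
    print(subject, n)
```

With the real query result, the same summary could be computed from `su_file_counts` instead of the hand-copied dictionary.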

This, however, brings us back to the issue raised above. To illustrate it, we can drill down into the 5 files for the first subject, "COVID-19-AR-16424103".

In [6]:
q1 = Q('id = "COVID-19-AR-16424103"')
r = q1.run()

Getting results from database

Total execution time: 958 ms


A simple dump, while not good for general user consumption, shows us everything CDA can tell us about these files in a way that is good enough for our exploratory purposes.

In [7]:
print(json.dumps(r[0]['File'], indent=3))

[
   {
      "id": "6eea85b8-924e-4c9f-abe3-a6c1b9d65670",
      "identifier": [
         {
            "system": "IDC",
            "value": "6eea85b8-924e-4c9f-abe3-a6c1b9d65670"
         }
      ],
      "label": "6eea85b8-924e-4c9f-abe3-a6c1b9d65670.dcm",
      "data_category": "Imaging",
      "data_type": "DX",
      "file_format": "DICOM",
      "associated_project": "covid_19_ar",
      "drs_uri": "gs://idc-open/6eea85b8-924e-4c9f-abe3-a6c1b9d65670.dcm",
      "byte_size": null,
      "checksum": null
   },
   {
      "id": "aefa0c17-aae9-48fa-8f67-16688817a22e",
      "identifier": [
         {
            "system": "IDC",
            "value": "aefa0c17-aae9-48fa-8f67-16688817a22e"
         }
      ],
      "label": "aefa0c17-aae9-48fa-8f67-16688817a22e.dcm",
      "data_category": "Imaging",
      "data_type": "DX",
      "file_format": "DICOM",
      "associated_project": "covid_19_ar",
      "drs_uri": "gs://idc-open/aefa0c17-aae9-48fa-8f67-16688817a22e.dcm",
      "byte_si

There is no indication of the DICOM structure in the above. This example, with only 5 files, might be easy to drill into manually. Other cases, such as those with nearly 1500 files for a single subject, would be harder to deal with. We know that the DICOM model provides the underlying structure via which these images can be managed in a way that would allow meaningful compute on them.
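As a rough illustration of what that structure would buy us, suppose each file record also carried its DICOM study and series identifiers. The CDA records shown above do not; the field names `study_uid` and `series_uid`, and all the values below, are hypothetical placeholders. A flat list could then be nested as study, then series, then files:

```python
from collections import defaultdict

def group_by_dicom_hierarchy(files):
    """Nest flat file records as study -> series -> list of file ids."""
    hierarchy = defaultdict(lambda: defaultdict(list))
    for f in files:
        hierarchy[f["study_uid"]][f["series_uid"]].append(f["id"])
    # Convert to plain dicts so the result prints and compares cleanly.
    return {study: dict(series) for study, series in hierarchy.items()}

# Hypothetical records: ids and UIDs are placeholders, not real data.
files = [
    {"id": "file-1", "study_uid": "study-A", "series_uid": "series-A1"},
    {"id": "file-2", "study_uid": "study-A", "series_uid": "series-A2"},
    {"id": "file-3", "study_uid": "study-B", "series_uid": "series-B1"},
    {"id": "file-4", "study_uid": "study-B", "series_uid": "series-B1"},
]

hierarchy = group_by_dicom_hierarchy(files)
for study, series in hierarchy.items():
    for series_uid, ids in series.items():
        print(study, series_uid, len(ids), "file(s)")
```

With such a grouping, selecting "all files in one series" becomes a dictionary lookup rather than a manual inspection of individual files.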

The Imaging Data Commons reported recently that their BigQuery tables are the recommended route via which files would be selected.

Looking at MIDRC shows that it can manage the DICOM structure in a similar way.

### How is the DICOM structure represented in MIDRC itself?

The first encouraging sign is that the MIDRC data portal user interface shows 105 subjects for this study, just as we saw in CDA/IDC above.

Screenshots aren't ideal for this and may include protected health information (checking). We can show things more compactly with extracts from a download of the "Case Metadata" for these 105 cases. 

#### The first case (5 files)
Here is the MIDRC record for the first case we looked at in CDA/IDC.

```
  {
    "submitter_id": "COVID-19-AR-16424103",
    "sex": "Female",
    "age_at_index": 48,
    "index_event": null,
    "race": "Masked for this notebook",
    "zip": 999,
    "covid19_positive": "Yes",
    "_imaging_studies_count": 4,
    "_ct_scans_count": 0,
    "_radiography_exams_count": 4,
    "_ct_series_count": 0,
    "_ct_instances_count": 0,
    "_dx_series_count": 5,
    "_dx_instances_count": 5,
    "_cr_series_count": 0,
    "_cr_instances_count": 0,
    "project_id": "TCIA-COVID-19-AR"
  }
  ```
  
We can tell that the MIDRC system has retained information about the DICOM structure we are dealing with. There are 4 *studies* (where *study* has its specific DICOM meaning) and 5 *series*, which suggests 3 *studies* with 1 *series* each and 1 *study* with 2 *series*. With 5 *instances* in 5 *series*, each *series* must contain exactly one *instance* (image). As we expected, this was a pretty simple example.
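The split of *series* across *studies* can be checked by enumeration. The sketch below lists every way to divide a number of series among a number of studies, under the assumption (ours) that every study contains at least one series; the helper name `series_splits` is also ours.

```python
def series_splits(n_series, n_studies, largest=None):
    """Yield each way to split n_series across n_studies, with every study
    holding at least one series. Splits are non-increasing tuples, so the
    order of the studies is ignored."""
    if largest is None:
        largest = n_series
    if n_studies == 0:
        if n_series == 0:
            yield ()
        return
    # The next study takes `first` series, leaving at least one for each remaining study.
    for first in range(min(largest, n_series - (n_studies - 1)), 0, -1):
        for rest in series_splits(n_series - first, n_studies - 1, first):
            yield (first,) + rest

# 5 DX series across 4 studies, as in the case metadata above.
first_case = list(series_splits(5, 4))
print(first_case)  # [(2, 1, 1, 1)]
```

Only one split is possible: one study with 2 series and three studies with 1 series each. When the counts allow more than one split, the counts alone cannot tell us which one applies.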
  

#### The second case (1465 files)
Here's the breakdown that the MIDRC Data Portal gives us for COVID-19-AR-16434453:

```
  {
    "submitter_id": "COVID-19-AR-16434453",
    "sex": "Male",
    "age_at_index": 62,
    "index_event": null,
    "race": "Masked for this notebook",
    "zip": 901,
    "covid19_positive": "Yes",
    "_imaging_studies_count": 4,
    "_ct_scans_count": 1,
    "_radiography_exams_count": 3,
    "_ct_series_count": 8,
    "_ct_instances_count": 1460,
    "_dx_series_count": 5,
    "_dx_instances_count": 5,
    "_cr_series_count": 0,
    "_cr_instances_count": 0,
    "project_id": "TCIA-COVID-19-AR"
  }
 ```
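Here we have:
* 4 *studies*, made up of 1 CT scan and 3 radiography exams
* The CT scan has 8 *series*, holding 1460 *instances* between them
* The 5 DX (digital radiography) *series* are spread across the 3 radiography *studies*
* There is only one *instance* per DX *series* (5 *instances* in 5 *series*)
* In total that is 1465 *instances*, matching the file count CDA reported for this subject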

We can't infer any more than that from the above which is only a list of counts. But can the MIDRC Gen3 model do more? Does it give us the specifics we could not infer from the counts?

Again, we could take either a UI or an API approach. We have begun to investigate this but have put it on hold for now. The model shown in MIDRC indicates that the full DICOM model is present; see https://data.midrc.org/DD, particularly the graph view. Every indication is that we could fully navigate the images via the DICOM model.



### CDA shows us part of this
CDA and the CRDC-H model can tell us part of this.

In [8]:
subject_ids = ["COVID-19-AR-16424103", "COVID-19-AR-16434453"]

for sid in subject_ids:
    dicom_type_query = """SELECT data_category, data_type, file_format, count(*) file_count
from gdc-bq-sample.integration.all_v2 AS su,
unnest(File) AS f
where (su.id = '{}')
group by data_category, data_type, file_format""".format(sid)

    r3 = Q.sql(dicom_type_query, limit=200)
    print("Subject:{}".format(sid))
    file_counts = []
    for row in r3:
        file_counts.append(row)
    display(pd.DataFrame(file_counts))

Subject:COVID-19-AR-16424103


| | data_category | data_type | file_format | file_count |
|---|---|---|---|---|
| 0 | Imaging | DX | DICOM | 5 |


Subject:COVID-19-AR-16434453


| | data_category | data_type | file_format | file_count |
|---|---|---|---|---|
| 0 | Imaging | DX | DICOM | 5 |
| 1 | Imaging | CT | DICOM | 1460 |
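
These CDA counts can be cross-checked against the per-modality instance counts in the MIDRC case metadata shown earlier. The following is a minimal sketch with values copied by hand from the outputs above. It assumes (our assumption, not something either system documents here) that a CDA `data_type` corresponds to the modality prefix in MIDRC field names such as `_dx_instances_count`.

```python
# Instance counts copied from the MIDRC case metadata extracts above.
midrc_cases = {
    "COVID-19-AR-16424103": {
        "_ct_instances_count": 0, "_dx_instances_count": 5, "_cr_instances_count": 0,
    },
    "COVID-19-AR-16434453": {
        "_ct_instances_count": 1460, "_dx_instances_count": 5, "_cr_instances_count": 0,
    },
}

# File counts by data_type copied from the CDA results above.
cda_counts = {
    "COVID-19-AR-16424103": {"DX": 5},
    "COVID-19-AR-16434453": {"DX": 5, "CT": 1460},
}

def midrc_by_modality(case):
    """Turn '_dx_instances_count'-style keys into {'DX': n}, dropping zero counts."""
    result = {}
    for key, value in case.items():
        if key.endswith("_instances_count") and value:
            modality = key.strip("_").split("_")[0].upper()
            result[modality] = value
    return result

matches = {
    subject: midrc_by_modality(case) == cda_counts[subject]
    for subject, case in midrc_cases.items()
}
for subject, ok in matches.items():
    print(subject, "match" if ok else "MISMATCH")
```

For these two subjects the per-modality counts agree, which suggests that both systems hold the same set of images even though only MIDRC exposes the study and series breakdown.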


### Conclusions for CDA and CRDC-H purposes
1. Both the Imaging Data Commons and MIDRC provide ways of querying and retrieving data stored in the DICOM model.
2. Making that model available via CDA/CRDC-H would benefit CDA users who wish to aggregate particular files.
3. This could be a good pilot for federated query in place of ETL into the CDA BigQuery tables.
4. GA4GH Data Connect may have some relevance to these concerns.

### Also of relevance: Does MIDRC Gen3 support the GA4GH DRS protocol?

Having seen `object_id` values such as the one in this file manifest entry, it seems like a possibility.

```
  {
    "md5sum": "65c8709ab293e89a625d989c0061b13b",
    "file_name": "COVID-19-AR/COVID-19-AR-16434368/02-17-2012-CT CHEST ABDOMEN PELVIS W-73530/80485.000000-MPR SAG Sagittal-39582/1-007.dcm",
    "_case_id": [
      "eda2a2ca-f524-4466-9fb5-56bb407821f0"
    ],
    "object_id": "dg.MD1R/f55b8fed-a938-4cd7-8f39-5ee3cb75c218",
    "file_size": 527322
  }
 ```
 
 Taking that id and what one would expect for the DRS endpoint...

In [9]:
drs_id = "f55b8fed-a938-4cd7-8f39-5ee3cb75c218"
url = "https://data.midrc.org/ga4gh/drs/v1/objects/{}".format(drs_id)
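
The manifest's `object_id` carries a `dg.MD1R/` prefix, while the cell above uses only the trailing UUID. A small helper makes that step explicit. This is a sketch: the function name `drs_object_url` is ours, and whether this server would also accept the prefixed form has not been checked here.

```python
from urllib.parse import quote

def drs_object_url(host, object_id):
    """Build a DRS v1 object URL from a manifest object_id,
    keeping only the part after the last '/'."""
    bare_id = object_id.rsplit("/", 1)[-1]
    return "https://{}/ga4gh/drs/v1/objects/{}".format(host, quote(bare_id, safe=""))

built_url = drs_object_url("data.midrc.org", "dg.MD1R/f55b8fed-a938-4cd7-8f39-5ee3cb75c218")
print(built_url)
```

The result is the same URL as the one assembled by hand above.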

We can try submitting a DRS request

In [10]:
import requests

response = requests.get(url)

# Raise an HTTPError for 4xx/5xx responses; otherwise pretty-print the DRS object.
response.raise_for_status()
print(json.dumps(response.json(), indent=3))


{
   "access_methods": [
      {
         "access_id": "s3",
         "access_url": {
            "url": "s3://midrcprod-default-813684607867-upload/dg.XXXX/f55b8fed-a938-4cd7-8f39-5ee3cb75c218/1-007.dcm"
         },
         "region": "",
         "type": "s3"
      }
   ],
   "aliases": [],
   "checksums": [
      {
         "checksum": "d139aa1e",
         "type": "crc"
      },
      {
         "checksum": "c645ce0823b0014b9511d104a4bb0dbd363cbd2eaaae8223490488af99f4f8f03c8ab1f47fad7e5d7232a4b36f21c2e00f4373ecdf628f594fd5a0e43c58f074",
         "type": "sha512"
      },
      {
         "checksum": "5457ef7bf73998457ec660481d33c4ef64ed8d792154cd2dd65f6f20ead909c2",
         "type": "sha256"
      },
      {
         "checksum": "8bbba3d324e29efe580ff7f3681baacb8dc4f716",
         "type": "sha1"
      },
      {
         "checksum": "65c8709ab293e89a625d989c0061b13b",
         "type": "md5"
      }
   ],
   "contents": [],
   "created_time": "2020-10-01T05:53:16.260332",
   "descrip

That is a valid DRS response. It has also been validated that a URL for downloading the file can be obtained via DRS.
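
One further check that the DRS object really is the file listed in the manifest is to compare checksums. The sketch below uses only values visible in the two extracts above; `drs_response` is a hand-copied subset of the truncated response rather than a fresh call, and the helper name `checksum_of` is ours.

```python
# Fields copied from the file manifest entry above.
manifest_entry = {
    "md5sum": "65c8709ab293e89a625d989c0061b13b",
    "object_id": "dg.MD1R/f55b8fed-a938-4cd7-8f39-5ee3cb75c218",
    "file_size": 527322,
}

# Hand-copied subset of the (truncated) DRS response above.
drs_response = {
    "checksums": [
        {"checksum": "d139aa1e", "type": "crc"},
        {"checksum": "65c8709ab293e89a625d989c0061b13b", "type": "md5"},
    ],
}

def checksum_of(drs_object, checksum_type):
    """Return the checksum of the given type from a DRS object, or None if absent."""
    for entry in drs_object.get("checksums", []):
        if entry["type"] == checksum_type:
            return entry["checksum"]
    return None

md5_matches = checksum_of(drs_response, "md5") == manifest_entry["md5sum"]
print("md5 matches manifest:", md5_matches)
```

The md5 value in the DRS response is the same as the `md5sum` in the manifest, so both records describe the same file content.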

So, yes, the MIDRC Gen3 instance does support the GA4GH DRS protocol.


Further questions for exploration:
Does the IDC point to the same cloud copy of the file?