## Using SQL to explore the CDA 09CO022 example

### Passing Direct SQL Queries
It would be useful to pass some direct SQL queries. Swagger is possible, but not really scriptable. If I want to call the REST API directly from Python I'll probably end up writing a client. How does cdapython call the REST API? A client already exists, in the cda_client module, so there is no need to write our own. Even better, it's already installed as a cdapython dependency.

First some set up.

In [1]:
import cda_client
host='https://cda.cda-dev.broadinstitute.org'
api_client = cda_client.ApiClient(configuration=cda_client.Configuration(host=host))
api_instance = cda_client.QueryApi(api_client)

One oddity is that version is a mandatory parameter to sql_query, but the version is actually specified as part of the table name. Currently you can pass anything to version. That may not always be the case, so we should respect the intent of the interface and pass a proper version number.
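
Since the version also appears as the last component of the table name, one way to keep the two in step is to derive the argument from the table name. This is a minimal sketch: `version_from_table` is a hypothetical helper, not part of cda_client, and it assumes table names always take the form `project.dataset.version`.

```python
def version_from_table(table_name):
    """Return the last dot-separated component of a table name.

    Hypothetical helper: assumes names of the form project.dataset.version.
    """
    prefix, sep, version = table_name.rpartition('.')
    if not sep or not version:
        raise ValueError('no version component in {!r}'.format(table_name))
    return version


table = 'gdc-bq-sample.cda_mvp.v3'
version = version_from_table(table)
print(version)
```

The value could then be passed as the first argument, for example `api_instance.sql_query(version, query)`.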

### Exploring the Subject 09CO022 example via SQL
Start with the same SQL that was generated by cdapython.

In [2]:
import json
query1 = '''SELECT * FROM gdc-bq-sample.cda_mvp.v3, 
UNNEST(ResearchSubject) AS _ResearchSubject 
WHERE (_ResearchSubject.id = 'c5421e34-e5c7-4ba5-aed9-146a5575fd8d')'''
api_response1 = api_instance.sql_query('v3',query1)

# The results are bulky to list in a notebook - so we'll write them to a file
with open('query_results/09CO022_query1.json', 'w') as f:
    f.write(json.dumps(api_response1.result, indent=3))
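
The `open` call above assumes the `query_results` directory already exists. A small helper along these lines would create it on demand; `save_json` is a hypothetical name, not something provided by cdapython or cda_client.

```python
import json
import os


def save_json(result, path):
    """Write result to path as indented JSON, creating the directory if needed."""
    directory = os.path.dirname(path)
    if directory:
        # exist_ok avoids an error when the directory is already there
        os.makedirs(directory, exist_ok=True)
    with open(path, 'w') as f:
        json.dump(result, f, indent=3)
```

It could be called as `save_json(api_response1.result, 'query_results/09CO022_query1.json')`.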

See the file for full results. The following function helps us see the nesting of patients and subjects in the result.

In [3]:
def summarizeResults(results):
    # Print each patient id, then the source system and id of each nested ResearchSubject
    for resItem in results:
        print('Patient {}'.format(resItem['id']))
        for rs in resItem['ResearchSubject']:
            # The first identifier entry names the source system (e.g. GDC or PDC)
            identifier = rs['identifier'][0]
            print('\t system:{} ResearchSubjectid:{}'.format(identifier['system'], rs['id']))
        print('_'*80)

In [4]:
summarizeResults(api_response1.result)

Patient 09CO022
	 system:GDC ResearchSubjectid:c5421e34-e5c7-4ba5-aed9-146a5575fd8d
	 system:PDC ResearchSubjectid:459e3b69-63d6-11e8-bcf1-0a2705229b82
________________________________________________________________________________


One question is whether the SQL generated by the Python API, notably the unnest of ResearchSubject, is what causes the repetition of ResearchSubject attributes at the top level.

Given that the same id was available at the top level of the query we should be able to run it without the unnest.

In [5]:
query2 = '''SELECT * FROM gdc-bq-sample.cda_mvp.v3 WHERE id = 'c5421e34-e5c7-4ba5-aed9-146a5575fd8d' '''
api_response2 = api_instance.sql_query('v3',query2)
api_response2

{'next_url': None,
 'previous_url': None,
 'query_sql': 'SELECT * FROM gdc-bq-sample.cda_mvp.v3 WHERE id = '
              "'c5421e34-e5c7-4ba5-aed9-146a5575fd8d' ",
 'result': []}

That was unexpected. That id is what we were getting back at the top level. Why doesn't it match anything?

As a check, let's try something that avoids a where clause altogether.

In [6]:
query3 = '''SELECT * FROM gdc-bq-sample.cda_mvp.v3  '''
api_response3 = api_instance.sql_query('v3',query3,limit=1)
print(json.dumps(api_response3.result,indent=3))

[
   {
      "days_to_birth": null,
      "race": null,
      "sex": null,
      "ethnicity": null,
      "id": "HTMCP-03-06-02177",
      "ResearchSubject": [
         {
            "Diagnosis": [],
            "Specimen": [],
            "associated_project": "CGCI-HTMCP-CC",
            "id": "4d54f72c-e8ac-44a7-8ab9-9f20001750b3",
            "primary_disease_type": "Adenomas and Adenocarcinomas",
            "identifier": [
               {
                  "system": "GDC",
                  "value": "4d54f72c-e8ac-44a7-8ab9-9f20001750b3"
               }
            ],
            "primary_disease_site": "Cervix uteri"
         }
      ]
   }
]


In [7]:
summarizeResults(api_response3.result)

Patient HTMCP-03-06-02177
	 system:GDC ResearchSubjectid:4d54f72c-e8ac-44a7-8ab9-9f20001750b3
________________________________________________________________________________


We get a different kind of patient id there. Let's try using that id in a where clause for our query on patient id.

In [8]:
query4 = '''SELECT * FROM gdc-bq-sample.cda_mvp.v3 WHERE id = 'HTMCP-03-06-02177' '''
api_response4 = api_instance.sql_query('v3',query4)
print(json.dumps(api_response4.result,indent=3))

[
   {
      "days_to_birth": null,
      "race": null,
      "sex": null,
      "ethnicity": null,
      "id": "HTMCP-03-06-02177",
      "ResearchSubject": [
         {
            "Diagnosis": [],
            "Specimen": [],
            "associated_project": "CGCI-HTMCP-CC",
            "id": "4d54f72c-e8ac-44a7-8ab9-9f20001750b3",
            "primary_disease_type": "Adenomas and Adenocarcinomas",
            "identifier": [
               {
                  "system": "GDC",
                  "value": "4d54f72c-e8ac-44a7-8ab9-9f20001750b3"
               }
            ],
            "primary_disease_site": "Cervix uteri"
         }
      ]
   }
]


That is reassuring, demonstrating that a query on the top level id does work.

Sticking with that record, we'll query it on the Research Subject id using the form of query generated by cdapython.

In [9]:
query5 = '''SELECT * FROM gdc-bq-sample.cda_mvp.v3, 
UNNEST(ResearchSubject) AS _ResearchSubject 
WHERE (_ResearchSubject.id = '4d54f72c-e8ac-44a7-8ab9-9f20001750b3')'''
api_response5 = api_instance.sql_query('v3',query5)
print(json.dumps(api_response5.result,indent=3))

[
   {
      "days_to_birth": null,
      "race": null,
      "sex": null,
      "ethnicity": null,
      "id": "HTMCP-03-06-02177",
      "ResearchSubject": [
         {
            "Diagnosis": [],
            "Specimen": [],
            "associated_project": "CGCI-HTMCP-CC",
            "id": "4d54f72c-e8ac-44a7-8ab9-9f20001750b3",
            "primary_disease_type": "Adenomas and Adenocarcinomas",
            "identifier": [
               {
                  "system": "GDC",
                  "value": "4d54f72c-e8ac-44a7-8ab9-9f20001750b3"
               }
            ],
            "primary_disease_site": "Cervix uteri"
         }
      ],
      "Diagnosis": [],
      "Specimen": [],
      "associated_project": "CGCI-HTMCP-CC",
      "id_1": "4d54f72c-e8ac-44a7-8ab9-9f20001750b3",
      "primary_disease_type": "Adenomas and Adenocarcinomas",
      "identifier": [
         {
            "system": "GDC",
            "value": "4d54f72c-e8ac-44a7-8ab9-9f20001750b3"
         }
      ]

Now we see the problem reveal itself again. Quite a different result from before even though it's the same patient. It masks the patient id with the subject id and repeats the subject attributes at the top level.
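
A quick way to spot the repetition in a result row is to list the top-level keys that also occur inside the nested ResearchSubject entries. This is a minimal sketch: `leaked_subject_keys` is a hypothetical helper, and the sample row is abbreviated from the output above.

```python
def leaked_subject_keys(row):
    """Return top-level keys of row that also appear in a nested ResearchSubject.

    'id' is skipped because both the patient and the subject legitimately have one.
    """
    nested_keys = set()
    for rs in row.get('ResearchSubject', []):
        nested_keys.update(rs.keys())
    nested_keys.discard('id')
    return sorted(key for key in row if key in nested_keys)


# Abbreviated from the query5 output above
row = {
    'id': 'HTMCP-03-06-02177',
    'ResearchSubject': [
        {'id': '4d54f72c-e8ac-44a7-8ab9-9f20001750b3',
         'associated_project': 'CGCI-HTMCP-CC',
         'Diagnosis': [],
         'Specimen': []}
    ],
    'Diagnosis': [],
    'Specimen': [],
    'associated_project': 'CGCI-HTMCP-CC',
    'id_1': '4d54f72c-e8ac-44a7-8ab9-9f20001750b3',
}
print(leaked_subject_keys(row))
```

For a row that carries only patient-level columns, the list would be empty.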

Does being specific about which table, or nested object, we are referring to make a difference? This is the query with the original id, but note we are specific about which table we want the select * from.

In [10]:
query6 = '''SELECT p.* FROM gdc-bq-sample.cda_mvp.v3 p , 
UNNEST(ResearchSubject) AS _ResearchSubject 
WHERE (_ResearchSubject.id = 'c5421e34-e5c7-4ba5-aed9-146a5575fd8d')'''
api_response6 = api_instance.sql_query('v3',query6)

with open('query_results/09CO022_sql_fix.json', 'w') as f:
    f.write(json.dumps(api_response6.result, indent=3))

Again see the file for the full results, but the following summarizes the structure.

In [11]:
summarizeResults(api_response6.result)

Patient 09CO022
	 system:GDC ResearchSubjectid:c5421e34-e5c7-4ba5-aed9-146a5575fd8d
	 system:PDC ResearchSubjectid:459e3b69-63d6-11e8-bcf1-0a2705229b82
________________________________________________________________________________


That seems to fix the problem. It is necessary to specify in the SQL that we want only the columns for the root table to be returned.
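
The effect can be imitated in plain Python. The sketch below is only a model of the behaviour seen above, not a description of how BigQuery or the CDA service is implemented. It pairs a patient row with each of its ResearchSubject entries, then either merges both sets of columns (like `SELECT *`) or keeps only the patient columns (like `SELECT p.*`). The function name and the `_1` suffix rule for clashing names are assumptions, chosen to mimic the `id_1` key in the earlier result.

```python
def cross_join_unnest(patient, root_only=False):
    """Model `FROM table p, UNNEST(ResearchSubject) AS rs` for one patient row."""
    rows = []
    for rs in patient['ResearchSubject']:
        row = dict(patient)      # columns of the root table
        if not root_only:        # SELECT * also pulls in the unnested columns
            for key, value in rs.items():
                # Assumed collision rule: a clashing name gets a numeric suffix
                name = key if key not in row else key + '_1'
                row[name] = value
        rows.append(row)
    return rows


patient = {
    'id': 'HTMCP-03-06-02177',
    'ResearchSubject': [
        {'id': '4d54f72c-e8ac-44a7-8ab9-9f20001750b3',
         'associated_project': 'CGCI-HTMCP-CC'}
    ],
}
star = cross_join_unnest(patient)
root = cross_join_unnest(patient, root_only=True)
print(sorted(star[0]))   # patient columns plus the subject columns
print(sorted(root[0]))   # patient columns only
```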

Created issue https://github.com/CancerDataAggregator/cda-python/issues/16

### The query on attributes
Similarly, looking at a modified version of the query based on attributes that will match 09CO022.

This was the query generated when cdapython was used.

`SELECT * FROM gdc-bq-sample.cda_mvp.v3, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.Diagnosis) AS _Diagnosis WHERE (((_ResearchSubject.associated_project = 'CPTAC-2') AND (_Diagnosis.tumor_stage = 'Stage IIB')) AND (_ResearchSubject.primary_disease_site = 'Colon'))'`

We modify that in the same way, to select from the root table only.

In [12]:
query7 = '''SELECT p.* FROM gdc-bq-sample.cda_mvp.v3 p, 
UNNEST(ResearchSubject) AS _ResearchSubject, 
UNNEST(_ResearchSubject.Diagnosis) AS _Diagnosis 
WHERE (((_ResearchSubject.associated_project = 'CPTAC-2') 
AND (_Diagnosis.tumor_stage = 'Stage IIB')) 
AND (_ResearchSubject.primary_disease_site = 'Colon'))'''
api_response7 = api_instance.sql_query('v3',query7)
# Write the results of query7 (not query6) to the file
with open('query_results/09CO022_like_sql_fix.json', 'w') as f:
    f.write(json.dumps(api_response7.result, indent=3))
summarizeResults(api_response7.result)

Patient 15CO002
	 system:GDC ResearchSubjectid:44ecd34b-aa2b-4ce1-ab23-c64aee162f69
	 system:PDC ResearchSubjectid:d2b0df58-63d6-11e8-bcf1-0a2705229b82
________________________________________________________________________________
Patient 15CO002
	 system:GDC ResearchSubjectid:44ecd34b-aa2b-4ce1-ab23-c64aee162f69
	 system:PDC ResearchSubjectid:d2b0df58-63d6-11e8-bcf1-0a2705229b82
________________________________________________________________________________
Patient 09CO022
	 system:GDC ResearchSubjectid:c5421e34-e5c7-4ba5-aed9-146a5575fd8d
	 system:PDC ResearchSubjectid:459e3b69-63d6-11e8-bcf1-0a2705229b82
________________________________________________________________________________
Patient 09CO022
	 system:GDC ResearchSubjectid:c5421e34-e5c7-4ba5-aed9-146a5575fd8d
	 system:PDC ResearchSubjectid:459e3b69-63d6-11e8-bcf1-0a2705229b82
________________________________________________________________________________
Patient 05CO039
	 system:GDC ResearchSubjectid:997475b1-6648-494a-93

That at least removes one of the extra copies of the research subject from the top level of the result hierarchy.

It also removes the masking of the patient level attributes in the results.

However, that unmasking also makes it clear that each patient is still reported twice in the results. This is to be expected because of the duplication of attributes in both ResearchSubjects. 
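
Until the query itself returns each patient once, the repeats can be dropped on the client side. This is a minimal sketch, assuming every row carries the patient id under `id` (which holds for the `p.*` queries above); `unique_patients` is a hypothetical helper and the sample rows are reduced to their ids.

```python
def unique_patients(results):
    """Keep the first row seen for each patient id, preserving order."""
    seen = set()
    unique = []
    for row in results:
        if row['id'] not in seen:
            seen.add(row['id'])
            unique.append(row)
    return unique


rows = [{'id': '15CO002'}, {'id': '15CO002'}, {'id': '09CO022'}, {'id': '09CO022'}]
print([row['id'] for row in unique_patients(rows)])
```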

There are likely some formulations in SQL that could force the Patient to be returned only once, and it would be good to understand these.
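
One candidate, sketched here and not run against the CDA service, moves the unnesting into an `EXISTS` subquery so that the outer query yields at most one row per row of the root table. Correlated `UNNEST` inside a subquery is standard BigQuery syntax; the variable name `query_exists` and the aliases `rs` and `d` are illustrative.

```python
# Untested against the CDA service: the filter conditions are the same as in
# query7, but they are evaluated inside EXISTS rather than through a join.
query_exists = '''SELECT p.* FROM gdc-bq-sample.cda_mvp.v3 p
WHERE EXISTS (
    SELECT 1
    FROM UNNEST(p.ResearchSubject) AS rs,
         UNNEST(rs.Diagnosis) AS d
    WHERE rs.associated_project = 'CPTAC-2'
      AND d.tumor_stage = 'Stage IIB'
      AND rs.primary_disease_site = 'Colon'
)'''
print(query_exists)
```

If it behaves as intended, it could be submitted in the same way as the other queries, with `api_instance.sql_query('v3', query_exists)`.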

The most helpful solution for CPTAC-2 and PDC/GDC would be to avoid, within the schema, the duplication of ResearchSubject attributes.

As a workaround for now, we can do the following, which limits the searched ResearchSubject to the one from the GDC.

In [13]:
query8 = '''SELECT p.* FROM gdc-bq-sample.cda_mvp.v3 p, 
UNNEST(ResearchSubject) AS _ResearchSubject, 
UNNEST(_ResearchSubject.Diagnosis) AS _Diagnosis,
UNNEST(_ResearchSubject.Identifier) AS _Identifier
WHERE _ResearchSubject.associated_project = 'CPTAC-2'
        AND _Diagnosis.tumor_stage = 'Stage IIB'
        AND _ResearchSubject.primary_disease_site = 'Colon'
        AND _Identifier.system = 'GDC'
'''
api_response8 = api_instance.sql_query('v3',query8)
summarizeResults(api_response8.result)

Patient 05CO039
	 system:GDC ResearchSubjectid:997475b1-6648-494a-9322-79aa17be272e
	 system:PDC ResearchSubjectid:2254625e-63d6-11e8-bcf1-0a2705229b82
________________________________________________________________________________
Patient 05CO044
	 system:GDC ResearchSubjectid:5e55cf3e-9f95-4b8c-8212-b540da3047cb
	 system:PDC ResearchSubjectid:24cb0fcb-63d6-11e8-bcf1-0a2705229b82
________________________________________________________________________________
Patient 15CO002
	 system:GDC ResearchSubjectid:44ecd34b-aa2b-4ce1-ab23-c64aee162f69
	 system:PDC ResearchSubjectid:d2b0df58-63d6-11e8-bcf1-0a2705229b82
________________________________________________________________________________
Patient 09CO022
	 system:GDC ResearchSubjectid:c5421e34-e5c7-4ba5-aed9-146a5575fd8d
	 system:PDC ResearchSubjectid:459e3b69-63d6-11e8-bcf1-0a2705229b82
________________________________________________________________________________


Perhaps we could force this through cdapython?

In [14]:
from cdapython import Q
qc1 = Q('ResearchSubject.associated_project = "CPTAC-2"')
qc2 = Q('ResearchSubject.Diagnosis.tumor_stage = "Stage IIB"')
qc3 = Q('ResearchSubject.primary_disease_site = "Colon"')
qc4 = Q('ResearchSubject.identifier.system = "GDC"')


q2 = qc1.And(qc2).And(qc3).And(qc4)
r2 = q2.run(limit=100) 
print(r2)


Query: SELECT p.* FROM gdc-bq-sample.cda_mvp.v3 AS p, UNNEST(ResearchSubject) AS _ResearchSubject, UNNEST(_ResearchSubject.Diagnosis) AS _Diagnosis, UNNEST(_ResearchSubject.identifier) AS _identifier WHERE ((((_ResearchSubject.associated_project = 'CPTAC-2') AND (_Diagnosis.tumor_stage = 'Stage IIB')) AND (_ResearchSubject.primary_disease_site = 'Colon')) AND (_identifier.system = 'GDC'))
Offset: 0
Limit: 100
Count: 4
More pages: No



That gives the correct number of results, but we still suffer from the masking of the patient level information by the lower level subject information.

In [15]:
summarizeResults(r2)

Patient 15CO002
	 system:GDC ResearchSubjectid:44ecd34b-aa2b-4ce1-ab23-c64aee162f69
	 system:PDC ResearchSubjectid:d2b0df58-63d6-11e8-bcf1-0a2705229b82
________________________________________________________________________________
Patient 09CO022
	 system:GDC ResearchSubjectid:c5421e34-e5c7-4ba5-aed9-146a5575fd8d
	 system:PDC ResearchSubjectid:459e3b69-63d6-11e8-bcf1-0a2705229b82
________________________________________________________________________________
Patient 05CO039
	 system:GDC ResearchSubjectid:997475b1-6648-494a-9322-79aa17be272e
	 system:PDC ResearchSubjectid:2254625e-63d6-11e8-bcf1-0a2705229b82
________________________________________________________________________________
Patient 05CO044
	 system:GDC ResearchSubjectid:5e55cf3e-9f95-4b8c-8212-b540da3047cb
	 system:PDC ResearchSubjectid:24cb0fcb-63d6-11e8-bcf1-0a2705229b82
________________________________________________________________________________


That shows additional odd behavior. The id showing up for patient is not the id of either of the nested ResearchSubjects.

The following shows what's happening in more detail. See the query_results/09CO022_like_api_fix.json file for the full result listing. In this case it appears that the diagnosis id is what ends up in patient id.

In [16]:
allRes = []
for resItem in r2:
    #add to a list of dicts so we can serialize to json
    allRes.append(resItem)

    print('Patient {}'.format(resItem['id']))
    print('Top level Subject {}'.format(resItem['identifier']))
    for rs in resItem['ResearchSubject']:
        subj_name = rs['Specimen'][0]['derived_from_subject']
        print('nested subject: {}'.format(subj_name))
        id = rs['identifier'][0]
        print ('\t system:{} id:{}'.format(id['system'],id['value']))     
    print('_'*80)
with open('query_results/09CO022_like_api_fix.json', 'w') as f:
    f.write(json.dumps(allRes, indent=3))

Patient 15CO002


KeyError: 'identifier'
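
The `KeyError` stops the loop at the first record, before anything is written to the file. A more forgiving variant, sketched below, uses `dict.get` so that a missing key is reported in place instead of raising. The helper name `describeRecord` and the sample record are made up for illustration.

```python
def describeRecord(resItem):
    """Return printable lines for one result, tolerating missing keys."""
    lines = ['Patient {}'.format(resItem.get('id', '<missing>'))]
    lines.append('Top level Subject {}'.format(resItem.get('identifier', '<missing>')))
    for rs in resItem.get('ResearchSubject', []):
        for ident in rs.get('identifier', []):
            lines.append('\t system:{} id:{}'.format(
                ident.get('system'), ident.get('value')))
    return lines


# Illustrative record with no top-level 'identifier' key
sample = {
    'id': '15CO002',
    'ResearchSubject': [
        {'identifier': [{'system': 'GDC',
                         'value': '44ecd34b-aa2b-4ce1-ab23-c64aee162f69'}]}
    ],
}
print('\n'.join(describeRecord(sample)))
```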

### Where does the masking come from?
Does the masking problem occur in cda-client? Or within the REST API?
Borrowing a function from Todd Pihl, with slight modification, to call the API directly.


In [None]:
import requests
def runAPIQuery(querystring, limit=None):
    cdaURL = 'https://cda.cda-dev.broadinstitute.org/api/v1/sql-query/v3'
    #Using a limit:
    if limit is not None:
        cdaURL = "{}?limit={}".format(cdaURL, str(limit))
        
    headers = {'accept' : 'application/json', 'Content-Type' : 'text/plain'}

    response = requests.post(cdaURL, headers = headers, data = querystring)

    if response.status_code == 200:
        return response.json()
    else:
        # Report the query that was sent (the original referenced an undefined name, query)
        raise Exception ("Query failed code {}. {}".format(response.status_code,querystring))

And running the original query directly demonstrates the issue occurs within the REST API. The patient id at the top level has been masked/overwritten with the diagnosis id. The ResearchSubject attributes are also present at the top level.  
Issue added https://github.com/CancerDataAggregator/cda-service/issues/65

In [None]:
querystring = '''SELECT * FROM gdc-bq-sample.cda_mvp.v3 p, 
UNNEST(ResearchSubject) AS _ResearchSubject, 
UNNEST(_ResearchSubject.Diagnosis) AS _Diagnosis,
UNNEST(_ResearchSubject.Identifier) AS _Identifier
WHERE _ResearchSubject.associated_project = 'CPTAC-2'
        AND _Diagnosis.tumor_stage = 'Stage IIB'
        AND _ResearchSubject.primary_disease_site = 'Colon'
'''
runAPIQuery(querystring)