The purpose of this notebook is to try out vmMatch with CDA examples and see where it leads.  The basic idea is to provide a list of terms and see if I can tease out the CDEs they belong to.

In [140]:
import pandas as pd
import requests
import pprint
import json

caDSR Swagger interface:  https://cadsrapi.cancer.gov/NCIAPI/1.0/index.html

In [141]:
vmatchprodurl = "https://cadsrapi.cancer.gov/rad/vmMatch/v1/vmMatch"

In [142]:
cdaSpeciesList = ["canis familiaris", "home sapeins", "homo sapiens; mus musculus", "internal reference-pooled sample", "jhu-qc", "mus musculus", "normal only ir", "not reported", "pnnl-jhu ref", "ref", "taiwanese ir", "tumor only ir"]
testlist = ["canis familiaris","jhu-qc","homo sapiens; mus musculus"]
#testlist = ["homo sapiens; mus musculus"]
bumlist = ["jhu-qc"]

In [143]:
headers = { 'Content-Type': 'application/json',
            'matchType': 'Restricted',
            'function': 'Concepts Only'}

In [144]:
def runPostQuery(url, query, headers):
    # url is the vmMatch URL
    # query is a list of dictionaries
    # headers is a dictionary of HTTP headers
    try:
        results = requests.post(url, data=json.dumps(query), headers=headers)
        # requests only raises HTTPError when asked to, so check the status explicitly
        results.raise_for_status()
    except requests.exceptions.RequestException as e:
        pprint.pprint(e)
        return []
    return results.json()['matchResults']

vmMatch takes a list of dictionaries, each with "name" and "userTip" defined.  List lengths in the 10-30 range are probably OK; getting into the hundreds may cause system errors.

In [145]:
for item in testlist:
    querydict = [{"name":item, "userTip":item}]
    #pprint.pprint(querydict)
    queryres = runPostQuery(vmatchprodurl, querydict, headers)
    #pprint.pprint(queryres)
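
Since long lists may cause system errors, one option is to split the terms into smaller batches before posting. This is only a sketch: `chunked` is a helper made up for illustration, and the batch size of 25 is an assumption picked from the 10-30 range mentioned above, not a documented vmMatch limit.

```python
def chunked(items, size=25):
    """Yield consecutive slices of items, each at most `size` long."""
    # size=25 is an assumed value, not a documented vmMatch limit
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Hypothetical stand-in for a long list of search terms
terms = ["term{}".format(i) for i in range(60)]

# Build one vmMatch-style query list per batch
batches = [[{"name": t, "userTip": t} for t in batch] for batch in chunked(terms)]
print([len(b) for b in batches])  # [25, 25, 10]
```

Each batch could then be passed to `runPostQuery` in turn and the returned lists concatenated.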

This is where things can get funky.  The concept ID can be used in a caDSR Concept query (/DataElement/query/Concept,  curl -X GET "https://cadsrapi.cancer.gov/rad/NCIAPI/1.0/api/DataElement/query/Concept?conceptCode=C14201" -H "accept: application/json").  The records returned from that contain a publicId that can then be used in a Data Element query (curl -X GET "https://cadsrapi.cancer.gov/rad/NCIAPI/1.0/api/DataElement/5729594" -H "accept: application/json").
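
The two curl calls differ only in the tail of the URL, so the chain can be sketched as a pair of URL builders. The function names and `API_BASE` are made up for illustration; the base path is copied from the curl examples, and nothing here actually contacts the API.

```python
from urllib.parse import urlencode, quote

# Base path copied from the curl examples above
API_BASE = "https://cadsrapi.cancer.gov/rad/NCIAPI/1.0/api/DataElement"

def concept_query_url(concept_code):
    """URL for the Concept query, given a concept code such as C14201."""
    return "{}/query/Concept?{}".format(API_BASE, urlencode({"conceptCode": concept_code}))

def data_element_url(public_id):
    """URL for the Data Element query, given a publicId such as 5729594."""
    return "{}/{}".format(API_BASE, quote(str(public_id), safe=""))

print(concept_query_url("C14201"))
print(data_element_url(5729594))
```

Encoding the values with `urlencode` and `quote` keeps the URL well formed if a code ever contains a space or other special character.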

So step one will be to collect the concept IDs.

In [146]:
querylist = []
for item in testlist:
#for item in bumlist:
#for item in cdaSpeciesList:
    querylist.append({"name":item, "userTip":item})
bigres = runPostQuery(vmatchprodurl, querylist, headers)

For each entry we need to look at the 'name' field to find out which query it relates to.  Though this JSON looks a little bit duplicated?

In [147]:
conceptiddict = {}
nohitlist = []
for entry in bigres:
    testname = entry['name']
    if int(entry['numberOfMatches']) > 0:
        for match in entry['matches']:
            conceptid = match['concept']
            #pprint.pprint(testname)
            #pprint.pprint(conceptid)
            #pprint.pprint(match)
            # setdefault creates the list the first time testname is seen
            conceptiddict.setdefault(testname, []).append(conceptid)
    else:
        nohitlist.append(testname)

In [148]:
#pprint.pprint(conceptiddict)
#pprint.pprint(nohitlist)

{'canis familiaris': ['C14201'],
 'homo sapiens; mus musculus': ['C45247',
                                'C19862',
                                'C192862',
                                '10039481',
                                '10011614',
                                'C79665']}
['jhu-qc']
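
If the match results do turn out to repeat concept IDs, the lists can be de-duplicated before hitting the next endpoint. A minimal sketch follows; `dedupe_preserving_order` is a made-up helper, and the sample input is hypothetical (shaped like `conceptiddict`, with repeats added on purpose).

```python
def dedupe_preserving_order(values):
    """Drop repeated values while keeping first-seen order."""
    # dict keys are unique and keep insertion order
    return list(dict.fromkeys(values))

# Hypothetical input with repeats added for illustration
sample = {
    "canis familiaris": ["C14201", "C14201"],
    "homo sapiens; mus musculus": ["C45247", "C79665", "C45247"],
}

deduped = {term: dedupe_preserving_order(ids) for term, ids in sample.items()}
pprint_ready = sorted(deduped.items())
print(pprint_ready)
```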


For each of the concept codes, hit the Concept endpoint and see what comes back

In [149]:
def conceptCodeQuery(conceptcode):
    url = "https://cadsrapi.cancer.gov/rad/NCIAPI/1.0/api/DataElement/query/Concept?conceptCode={}".format(conceptcode)
    headers = {"accept" : "application/json"}
    results = requests.get(url, headers = headers)
    results = json.loads(results.content.decode())
    return results['DataElementQueryResults']

In [150]:
publiciddict = {}
for testname, conceptids in conceptiddict.items():
    for conceptid in conceptids:
        conceptres = conceptCodeQuery(conceptid)
        for entry in conceptres:
            # collect every publicId returned for this search term
            publiciddict.setdefault(testname, []).append(entry['publicId'])

In [151]:
#pprint.pprint(publiciddict)

{'canis familiaris': ['2452737',
                      '2452741',
                      '2453180',
                      '2453343',
                      '2453345',
                      '2453351',
                      '2453403',
                      '2453731',
                      '2613129',
                      '2614959',
                      '2756032',
                      '2827061',
                      '3130966',
                      '3744672',
                      '3770708',
                      '3770719',
                      '4862813',
                      '5729594',
                      '6118266'],
 'homo sapiens; mus musculus': ['6951303', '3014701', '12662612']}


In [152]:
def dataElementQuery(publicid):
    url = "https://cadsrapi.cancer.gov/rad/NCIAPI/1.0/api/DataElement/{}".format(publicid)
    headers = {"accept" : "application/json"}
    results = requests.get(url, headers = headers)
    if results.status_code == 200:
        results = json.loads(results.content.decode())
    else:
        results = None
    return results

In [153]:
tempiddict = {'canis familiaris': ['2452737','2452741']}

Lastly, use the publicID in a CDE Query.  From this we'll want the context and preferredName to start

In [154]:
cdemapping = {}
unknownids = {}
#for key, publicids in tempiddict.items():
for key, publicids in publiciddict.items():
    for publicid in publicids:
        cderes = dataElementQuery(publicid)
        if cderes is not None:
            # these fields are read from the ConceptualDomain block of the record
            cd = cderes['DataElement']['DataElementConcept']['ConceptualDomain']
            holding = {"publicId":cd['publicId'], "context":cd['context'], "preferredName":cd['preferredName'], 'workflowStatus':cd['workflowStatus']}
            cdemapping.setdefault(key, []).append(holding)
        else: #Store the IDs that did not return a 200 (likely a 500 error)
            unknownids.setdefault(key, []).append(publicid)

In [155]:
#pprint.pprint(cdemapping)

{'canis familiaris': [{'context': 'CCR',
                       'preferredName': 'Veterinary Study',
                       'publicId': '2452699',
                       'workflowStatus': 'RELEASED'},
                      {'context': 'CCR',
                       'preferredName': 'Veterinary Study',
                       'publicId': '2452699',
                       'workflowStatus': 'RELEASED'},
                      {'context': 'CCR',
                       'preferredName': 'Veterinary Study',
                       'publicId': '2452699',
                       'workflowStatus': 'RELEASED'},
                      {'context': 'CCR',
                       'preferredName': 'Veterinary Study',
                       'publicId': '2452699',
                       'workflowStatus': 'RELEASED'},
                      {'context': 'CCR',
                       'preferredName': 'Veterinary Study',
                       'publicId': '2452699',
                       'workflowStatus': 'RELEASE

In [156]:
#pprint.pprint(unknownids)

{'canis familiaris': ['2452737',
                      '2452741',
                      '2453731',
                      '2614959',
                      '3130966',
                      '3770708']}
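
To see how many Data Element lookups failed per search term, the two dictionaries can be compared directly. This is a sketch: `failure_summary` is a made-up helper, and the inputs are copied from the `publiciddict` and `unknownids` outputs shown above.

```python
def failure_summary(publiciddict, unknownids):
    """Per search term: how many publicIds were queried, failed, and succeeded."""
    summary = {}
    for term, ids in publiciddict.items():
        failed = len(unknownids.get(term, []))
        summary[term] = {"queried": len(ids), "failed": failed, "ok": len(ids) - failed}
    return summary

# Values copied from the outputs above
publiciddict = {
    "canis familiaris": ["2452737", "2452741", "2453180", "2453343", "2453345",
                         "2453351", "2453403", "2453731", "2613129", "2614959",
                         "2756032", "2827061", "3130966", "3744672", "3770708",
                         "3770719", "4862813", "5729594", "6118266"],
    "homo sapiens; mus musculus": ["6951303", "3014701", "12662612"],
}
unknownids = {"canis familiaris": ["2452737", "2452741", "2453731",
                                   "2614959", "3130966", "3770708"]}

summary = failure_summary(publiciddict, unknownids)
for term, counts in summary.items():
    print(term, counts)
```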


In [157]:
columns = ['OriginalSearchTerm', 'Context', 'PreferredName', 'PublicID','Status']
rows = []
for key, entries in cdemapping.items():
    for entry in entries:
        # one row per CDE record
        rows.append({'OriginalSearchTerm':key, 'Context':entry['context'], 'PreferredName': entry['preferredName'], 'PublicID':entry['publicId'], 'Status':entry['workflowStatus']})
cdedf = pd.DataFrame(rows, columns=columns)

In [158]:
cdedf.head()

Unnamed: 0,OriginalSearchTerm,Context,PreferredName,PublicID,Status
0,canis familiaris,CCR,Veterinary Study,2452699,RELEASED
1,canis familiaris,CCR,Veterinary Study,2452699,RELEASED
2,canis familiaris,CCR,Veterinary Study,2452699,RELEASED
3,canis familiaris,CCR,Veterinary Study,2452699,RELEASED
4,canis familiaris,CCR,Veterinary Study,2452699,RELEASED


In [160]:
cdedf.groupby(cdedf.columns.tolist(), as_index=False).size()

Unnamed: 0,OriginalSearchTerm,Context,PreferredName,PublicID,Status,size
0,canis familiaris,CCR,Person Measure/Instrument Testing,2524082,RELEASED,2
1,canis familiaris,CCR,Veterinary Study,2452699,RELEASED,5
2,canis familiaris,CTEP,Assessment Results,2008556,RELEASED,1
3,canis familiaris,CTEP,Individuals,2008532,RELEASED,2
4,canis familiaris,CTEP,Specimen Characteristics,2008547,RELEASED,1
5,canis familiaris,caCORE,UML DEFAULT CD,2222502,RELEASED,2
6,homo sapiens; mus musculus,CTEP,Assessments,2008551,RELEASED,1
7,homo sapiens; mus musculus,CTEP,Data Source,2008576,RELEASED,1
8,homo sapiens; mus musculus,SPOREs,Behavior,2008566,RELEASED,1
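
The groupby tally can be cross-checked in plain Python by counting identical records straight from the nested mapping. This is a sketch: `count_records` is a made-up helper, and `sample_mapping` is a small hand-built sample shaped like `cdemapping`, not the full result set.

```python
from collections import Counter

def count_records(mapping):
    """Count identical (term, context, name, publicId, status) records."""
    counts = Counter()
    for term, entries in mapping.items():
        for e in entries:
            key = (term, e["context"], e["preferredName"], e["publicId"], e["workflowStatus"])
            counts[key] += 1
    return counts

# Hand-built sample for illustration, using values that appear in the tables above
vet = {"context": "CCR", "preferredName": "Veterinary Study",
       "publicId": "2452699", "workflowStatus": "RELEASED"}
ind = {"context": "CTEP", "preferredName": "Individuals",
       "publicId": "2008532", "workflowStatus": "RELEASED"}
sample_mapping = {"canis familiaris": [vet, vet, ind]}

counts = count_records(sample_mapping)
for record, n in counts.items():
    print(record, n)
```

Counting from the mapping rather than the DataFrame makes it easy to spot whether repeats come from the API responses or from how the table was built.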
