**Question: What are symptoms of Asthma subtypes?**
* Find patients diagnosed with Asthma
* Find symptoms for Asthma
* Find occurrences of symptoms in Asthma patients
* Find symptom clusters among Asthma patients and also patient clusters among symptoms
* Compare symptom clusters of patients diagnosed with Asthma without COPD vs Asthma with COPD 
* For future consideration: if feasible, identify another disease, compute its symptom clusters, and align them to the clusters generated for Asthma. Depending on how related the new disease is, the clusters may or may not align closely. It will be interesting to see whether a disease that appears unrelated to Asthma nonetheless aligns with the Asthma symptom clustering, which would suggest underlying similarities. Can this be generalized to map diseases to one another via symptoms?

**Data Sources**
* HUSH+ synthetic data resource
* [FHIR synthetic data resource](http://ictrweb.johnshopkins.edu/ictr/synthetic/)
* [DE-SynPUF synthetic data resource](https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs/DE_Syn_PUF.html)
* [UMLS](https://www.nlm.nih.gov/research/umls/), NOTE: there is a Web API [here](https://documentation.uts.nlm.nih.gov/rest/home.html)
* [OHDSI Web API](http://www.ohdsi.org/web/wiki/doku.php?id=documentation:software:webapi)
* [Biolink API](https://api.monarchinitiative.org/api/#!/bioentity/get_disease_phenotype_associations), for Disease-Phenotype Associations

### Function and dataset definitions

In [12]:
import urllib, urllib2
import pprint, json, requests
from greentranslator.api import GreenTranslator
import mysql.connector
from mysql.connector import errorcode  # needed by the except block below

try:
    cnx = mysql.connector.connect(user='tadmin',
                                password='ncats_translator!',
                                database='umls',
                                host='translator.ceyknq0yekb3.us-east-1.rds.amazonaws.com')
except mysql.connector.Error as err:
    if err.errno == errorcode.ER_ACCESS_DENIED_ERROR:
        print("Something is wrong with your user name or password")
    elif err.errno == errorcode.ER_BAD_DB_ERROR:
        print("Database does not exist")
    else:
        print(err)

In [13]:
## Pull in disease-to-symptom mappings taken from the SI of
## https://www.nature.com/articles/ncomms5212. Takes a bit of time to pull down
SYMPTOM_URL = "https://www.nature.com/article-assets/npg/ncomms/2014/140626/ncomms5212/extref/ncomms5212-s4.txt"
DISEASE2SYMPTOMS = [x.split("\t") for x in urllib2.urlopen(SYMPTOM_URL).read().split("\n")]
## Keep only well-formed rows (4 tab-separated fields)
DISEASE2SYMPTOMS = [x for x in DISEASE2SYMPTOMS if len(x) == 4]

In [4]:
## Given a disease/condition term, get back ICD codes from OHDSI
def findICD_ohdsi(txt, icd_version = 9):
    if icd_version == 9:
        icd_type = 'ICD9CM'
    elif icd_version == 10:
        icd_type = 'ICD10'
    else:
        raise ValueError("Invalid ICD version specified")
    url_con = "http://api.ohdsi.org/WebAPI/vocabulary/search"
    headers = {'content-type': 'application/json'}
    params = {"QUERY": txt,
              "VOCABULARY_ID": [icd_type]}
    response = requests.post(url_con, data=json.dumps(params), headers=headers)
    data = response.json()  # requests decodes the body itself; no manual .decode() needed
    return [d["CONCEPT_CODE"] for d in data]
print(findICD_ohdsi('asthma'))

# Get the ICD10/ICD9 code for a given string from UMLS. By default we get back ICD10.
# Returns "Undef" when no matching concept or code is found.
def findICD_umls(name, icd_version = 10):
    if icd_version == 9:
        icd_type = 'ICD9CM'
    elif icd_version == 10:
        icd_type = 'ICD10'
    else:
        raise ValueError("Invalid ICD version specified")

    cursor = cnx.cursor()
    # Parameterized queries avoid quoting problems (and SQL injection) in `name`
    cursor.execute("SELECT CUI FROM umls.MRCONSO WHERE STR=%s", (name,))
    rows = cursor.fetchall()
    if not rows:
        return "Undef"
    cui = rows[0][0]  # use the first matching concept
    cursor.execute("SELECT CODE FROM umls.MRCONSO WHERE SAB=%s AND CUI=%s", (icd_type, cui))
    rows = cursor.fetchall()
    if not rows:
        return "Undef"
    return rows[-1][0]  # as before, keep the last code returned

print(findICD_umls('Asthma'))
print(findICD_umls('Asthma', 9))

[u'E945.7', u'493', u'493.9', u'493.90', u'493.92', u'493.91', u'493.2', u'493.20', u'493.22', u'493.21', u'493.82', u'493.0', u'493.00', u'493.02', u'493.01', u'V17.5', u'493.1', u'493.10', u'493.12', u'493.11', u'493.8', u'975.7']


NameError: global name 'cnx' is not defined

In [5]:
## Given a disease name, get back symptoms (defined using MeSH terms) along with TFIDF scores
## Taken from https://www.nature.com/articles/ncomms5212
def disease2symptoms(txt):
    s = [x for x in DISEASE2SYMPTOMS if txt.lower() in x[1].lower()]
    return [(x[0], x[3]) for x in s]
symps = disease2symptoms("Asthma")
print('Found %s symptom MeSH terms for %s' % (len(symps), "Asthma"))

In [26]:
## Get all phenotypes for Asthma from Monarch

def getPhenoTypes(doid):
    url = "https://api.monarchinitiative.org/api/bioentity/disease/DOID%3A"+doid+"/phenotypes/?rows=20&fetch_objects=false&unselect_evidence=true"
    res = requests.get(url).json()
    cursor = cnx.cursor()  # one cursor is enough for all lookups
    phenotypes = []
    for o in res['objects']:
        # Map the HPO identifier to a UMLS concept (CUI) and its label
        cursor.execute("SELECT CUI,STR FROM umls.MRCONSO WHERE CODE=%s AND SAB='HPO'", (o,))
        cui_str = ("Undef","Undef")
        for code in cursor:
            cui_str = code
        # Map the CUI to an ICD10-family code, if one exists
        cursor.execute("SELECT CODE FROM umls.MRCONSO WHERE CUI=%s AND (SAB='ICD10' OR SAB='ICD10CM' OR SAB='ICPC2ICD10ENG')", (cui_str[0],))
        res_code = ("Undef","Undef")
        for code in cursor:
            res_code = code
        phenotypes.append((cui_str[1],res_code[0]))
    return phenotypes

asthmaPhenotypes = getPhenoTypes("7148")
pprint.pprint(asthmaPhenotypes)


[(u'Urinary tract infections', 'Undef'),
 (u'Renal failure', u'MTHU064189'),
 (u'Tubular atrophy', 'Undef'),
 (u'High urine protein levels', u'MTHU062143'),
 (u'Tall stature', u'MTHU032935'),
 (u'Glomerular nephritis', u'MTHU032239'),
 (u'Nephrotic syndrome', u'MTHU072871'),
 (u'Kidney damage', u'MTHU052954'),
 (u'Increased calcium level in kidney', 'Undef'),
 (u'Kidney inflammation', u'MTHU051999'),
 (u'Abnormal mouth', u'MTHU006799'),
 (u'Mouth ulcer', u'MTHU077661'),
 (u'Gingival enlargement', u'MTHU090361'),
 (u'Dry mouth syndrome', u'MTHU068603'),
 (u'Gingival hemorrhage', u'MTHU011836'),
 (u'Inflamed gums', u'MTHU032001'),
 (u'Sinus disease', 'Undef'),
 (u'Weak chin', 'Undef'),
 (u'Robin mandible', u'M26.04'),
 (u'Hypoacusis', 'Undef'),
 (u'Chronic middle ear infection', u'MTHU056496'),
 (u'Corneal inflammation', u'MTHU054838'),
 (u'Recurrent corneal ulcerations', u'H18.83'),
 (u'Glaucoma', u'MTHU032085'),
 (u'Poor vision', u'MTHU031902'),
 (u'Cloudy lens', u'MTHU075490'),
 (u'Uv

In [None]:
## Functions to retrieve patients from different sources - currently FHIR & UNC
def findPatients_fhir(code, count=1000):
    # Errors raised by urlopen propagate to the caller unchanged
    url = "http://ictrweb.johnshopkins.edu/rest/synthetic/Condition?icd_10=%s&_count=%d" % (code, count)
    response = urllib2.urlopen(url)
    return json.loads(response.read())

def findPatients_unc(age='8', sex='male', race='white', location='OUTPATIENT'):
    query = GreenTranslator().get_query()
    return query.clinical_get_patients(age, sex, race, location)

### Workflow for "_What are symptoms of Asthma subtypes?_"

#### Find patients diagnosed with Asthma

In [None]:
asthmaCodes = [findICD_umls("asthma")] # We go with ICD10 codes; findICD_umls returns a single code
## get patients with asthma. First from FHIR, then with UNC
tmp = [findPatients_fhir(icd) for icd in asthmaCodes if icd not in ('U', 'Undef')] # not useful right now
p_unc = findPatients_unc() # TODO needs to be updated to latest code

#### Find symptoms for Asthma

* Next we identify symptoms for asthma. Our starting point is a list of diseases and symptoms from [Zhou et al.](https://www.nature.com/articles/ncomms5212), derived from co-occurrence. The symptoms obtained this way are MeSH terms, which we then translate to ICD10 codes by querying both UMLS and OHDSI.
* We will transition to using [BioLink API](https://api.monarchinitiative.org/api/#!/bioentity/get_disease_phenotype_associations), for Disease-Phenotype Associations.
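
Because each symptom comes with a TFIDF score, one option is to keep only the strongest associations before mapping to ICD10. The following is a minimal sketch, not part of the original workflow: the helper `top_symptoms` is hypothetical, the rows are made-up toy values rather than data from the Zhou et al. file, and the meaning of the third field (never used in this notebook) is an assumption.

```python
def top_symptoms(rows, disease, k=3):
    """Return the k highest-scoring (symptom, score) pairs for a disease.

    Each row is [symptom, disease, third_field, tfidf_score], all strings,
    mirroring the 4-field rows kept in DISEASE2SYMPTOMS.
    """
    hits = [(r[0], float(r[3])) for r in rows if disease.lower() in r[1].lower()]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)[:k]

# Toy rows (assumption: invented numbers, third field treated as a count)
rows = [
    ["Respiratory Sounds", "Asthma", "120", "85.2"],
    ["Cough", "Asthma", "300", "60.1"],
    ["Dyspnea", "Asthma", "250", "72.4"],
    ["Fever", "Influenza, Human", "500", "90.0"],
]

top = top_symptoms(rows, "asthma", k=2)
print(top)
```

The scores are converted to `float` before sorting because the parsed file yields strings, and string comparison would order them incorrectly.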

In [None]:
asthmaSymptoms = disease2symptoms("asthma")
print('Found %s symptom MeSH terms for %s' % (len(asthmaSymptoms), "asthma"))
## Drop unmapped symptoms (findICD_umls marks them as 'Undef'; 'U' is also filtered for safety)
asthmaSymptomCodes = [c for c in [findICD_umls(x[0], 10) for x in asthmaSymptoms] if c not in ('U', 'Undef')]

flatten = lambda l: [item for sublist in l for item in sublist]
tmp2 = flatten([findICD_ohdsi(x[0], 10) for x in asthmaSymptoms])
asthmaSymptomCodes.extend(tmp2)

asthmaSymptomCodes = list(set(asthmaSymptomCodes))
print('Mapped to %d unique ICD10 codes' % (len(asthmaSymptomCodes)))

#### Find occurrences of symptoms in Asthma patients

Given the set of symptoms for the disease, we now identify the patients whose records contain them. Note that the boundaries between symptoms, conditions, and diagnoses are not always well defined.
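
One way to tabulate occurrences, as a minimal sketch: assume each patient record has already been reduced to a list of ICD10 codes, then count how many patients carry each symptom code. The `patient_codes` dictionary is made-up toy data, not output from FHIR or UNC, and the helper name `symptom_occurrences` is hypothetical.

```python
from collections import Counter

def symptom_occurrences(patient_codes, symptom_codes):
    """Return (matrix, counts).

    matrix: patient id -> set of symptom codes found in that patient
    counts: Counter of how many patients carry each symptom code
    """
    symptom_codes = set(symptom_codes)
    matrix = {}
    counts = Counter()
    for pid, codes in patient_codes.items():
        present = symptom_codes & set(codes)
        matrix[pid] = present
        counts.update(present)
    return matrix, counts

# Toy data (assumption): ICD10 codes recorded per patient
patient_codes = {
    "p1": ["J45", "R05", "R06.2"],
    "p2": ["J45", "R06.2"],
    "p3": ["J45", "R05", "R07.4"],
}
symptom_codes = ["R05", "R06.2", "R06.0"]  # symptom codes of interest

matrix, counts = symptom_occurrences(patient_codes, symptom_codes)
for code in sorted(symptom_codes):
    print("%s: %d of %d patients" % (code, counts[code], len(patient_codes)))
```

The per-patient sets in `matrix` are the natural input for the clustering steps that follow.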

#### Find symptom clusters among Asthma patients and also patient clusters among symptoms
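
No clustering method has been chosen for this step yet. As a placeholder, the sketch below groups items whose Jaccard similarity exceeds a threshold, which amounts to single-linkage clustering cut at that threshold. Clustering patients by their symptom sets gives patient clusters; transposing the table and clustering symptoms by the patients who show them gives symptom clusters. All names and the toy data are assumptions made for illustration.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / float(len(union)) if union else 0.0

def threshold_clusters(item_sets, threshold=0.5):
    """Group items whose sets have Jaccard similarity >= threshold.

    Items linked directly or through a chain of similar items end up in
    the same cluster (single linkage), tracked with a small union-find.
    """
    parent = {k: k for k in item_sets}

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path halving
            k = parent[k]
        return k

    for a, b in combinations(sorted(item_sets), 2):
        if jaccard(item_sets[a], item_sets[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for k in item_sets:
        clusters.setdefault(find(k), set()).add(k)
    return sorted(sorted(c) for c in clusters.values())

def transpose(item_sets):
    """Turn {row: set(cols)} into {col: set(rows)}."""
    out = {}
    for row, cols in item_sets.items():
        for col in cols:
            out.setdefault(col, set()).add(row)
    return out

# Toy data (assumption): symptoms observed per patient
patient_symptoms = {
    "p1": {"cough", "wheezing"},
    "p2": {"cough", "wheezing"},
    "p3": {"chest pain", "dyspnea"},
    "p4": {"chest pain", "dyspnea", "cough"},
}

patient_clusters = threshold_clusters(patient_symptoms)
symptom_clusters = threshold_clusters(transpose(patient_symptoms))
print(patient_clusters)
print(symptom_clusters)
```

On real data a library implementation (hierarchical clustering or biclustering) would likely be a better fit; the point here is only that the same patient-by-symptom table supports both views.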

#### Compare symptom clusters of patients diagnosed with Asthma without COPD vs Asthma with COPD
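
A simple way to compare the two cohorts, sketched under the assumption that symptom clusters have already been computed for each: for every cluster in one cohort, find the most similar cluster in the other by Jaccard overlap. The function name `best_matches` and both cluster lists are hypothetical toy inputs, not results.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / float(len(union)) if union else 0.0

def best_matches(clusters_a, clusters_b):
    """For each cluster in A, return (cluster, best-matching B cluster, score).

    The best match is None when no cluster in B shares any member.
    """
    results = []
    for ca in clusters_a:
        best, best_score = None, 0.0
        for cb in clusters_b:
            score = jaccard(set(ca), set(cb))
            if score > best_score:
                best, best_score = cb, score
        results.append((ca, best, best_score))
    return results

# Toy symptom clusters (assumption) for the two cohorts
asthma_only = [["cough", "wheezing"], ["chest pain", "dyspnea"]]
asthma_copd = [["cough", "sputum", "wheezing"], ["dyspnea", "fatigue"]]

matches = best_matches(asthma_only, asthma_copd)
for ca, cb, score in matches:
    print("%s -> %s (Jaccard %.2f)" % (ca, cb, score))
```

Low best-match scores flag clusters that are specific to one cohort, which is where differences between Asthma with and without COPD would show up.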