# Diagnosis System Demo
This demo shows how our diagnosis system works.

## Problem
While the US has abundant medical resources, many African countries carry a high disease burden and have relatively limited information technology infrastructure. We aim to design a diagnosis system that helps ease that burden.
Many diagnosis systems built in the US work well, but they target a different set of diseases from the infectious diseases that are the main cause of death in Africa. In addition, because internet access is limited in much of Africa, systems that depend on a connection cannot be used there.

## Solution
We are developing a diagnosis system for people in Africa. The system is designed to work offline while maintaining high accuracy and covering a broad spectrum of diseases.
We have access to IBM MarketScan medical records and the DeepDive text corpus, and we collected a list of common diseases and their corresponding symptoms for both the US and Nigeria, the most populous country in Africa.
The system is designed as follows:
### Data preparation
IBM MarketScan medical records are used to summarize the frequency of each disease, which serves as the prior probability of a diagnosis.
The DeepDive corpus serves as training data for the association between diseases and symptoms. We train Word2Vec from the gensim library on it. Combined with the disease information we collected, this yields a disease-symptom association matrix, which drives the adaptive diagnosis described next.
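
One way to picture how word embeddings could become a disease-symptom association matrix is sketched below. It assumes the association score is the cosine similarity between a disease vector and a symptom vector, with small scores set to zero. The exact scoring rule is not stated here, and the three-dimensional vectors, the `threshold` value and the `association_matrix` helper are made up for illustration. In the real pipeline the vectors would come from the trained Word2Vec model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def association_matrix(disease_vecs, symptom_vecs, threshold=0.5):
    """Return {disease: {symptom: score}}, zeroing scores below the threshold."""
    matrix = {}
    for dis, dv in disease_vecs.items():
        row = {}
        for sym, sv in symptom_vecs.items():
            score = cosine(dv, sv)
            row[sym] = round(score, 3) if score >= threshold else 0.0
        matrix[dis] = row
    return matrix

# Toy embeddings (hypothetical values, not taken from the trained model).
disease_vecs = {"pneumonia": [1.0, 0.2, 0.0], "colitis": [0.0, 0.1, 1.0]}
symptom_vecs = {"cough": [0.9, 0.3, 0.1], "diarrhea": [0.1, 0.0, 0.9]}

assoc = association_matrix(disease_vecs, symptom_vecs)
for dis, row in assoc.items():
    print(dis, row)
```

Zeroing weak scores is what makes the later filtering step possible: a disease is dropped from the candidate list when its weight for a confirmed symptom is exactly zero.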
### Adaptive diagnosis
Given the disease-symptom associations and the symptoms reported by the patient, we filter the list down to the diagnoses that remain possible. From this list we form a reduced association matrix and ask about the most informative remaining symptom. Once the patient answers, the matrix is reduced further.
Eventually the matrix is reduced to a single row and the diagnosis is complete. The system shows not only the final diagnosis but also other possible diseases: multiplying the vector of collected symptoms by the association matrix gives a score for every disease in our list, which is then normalized into probabilities.
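
The adaptive loop described above can be sketched on a toy example. This is a simplified illustration, not the notebook's implementation: it uses a hypothetical binary disease-symptom table (`TABLE`), replaces the interactive prompt with a scripted `answers` dictionary, and asks about the symptom that splits the remaining candidates most evenly.

```python
# Hypothetical binary associations: disease -> set of symptoms it can cause.
TABLE = {
    "pneumonia": {"cough", "fever", "yellow sputum"},
    "asthma": {"cough", "wheezing"},
    "influenza": {"cough", "fever", "malaise"},
    "colitis": {"diarrhea", "fever"},
}

def narrow_down(first_symptom, answers):
    """Filter candidates question by question; return (candidates, asked)."""
    candidates = {d for d, syms in TABLE.items() if first_symptom in syms}
    known = {first_symptom}
    asked = []
    while len(candidates) > 1:
        # Symptoms that still distinguish between the remaining candidates.
        open_syms = sorted(
            s for s in set().union(*(TABLE[d] for d in candidates)) - known
            if 0 < sum(s in TABLE[d] for d in candidates) < len(candidates)
        )
        if not open_syms:
            break  # nothing left to ask; several candidates remain
        # Ask about the symptom closest to a 50/50 split of the candidates.
        nxt = min(open_syms, key=lambda s: abs(
            sum(s in TABLE[d] for d in candidates) - len(candidates) / 2))
        asked.append(nxt)
        known.add(nxt)
        has_it = answers.get(nxt, False)
        candidates = {d for d in candidates if (nxt in TABLE[d]) == has_it}
    return candidates, asked

# Scripted patient: coughs, has a fever, no malaise.
result, questions = narrow_down("cough", {"fever": True, "malaise": False})
print(questions, result)
```

Each answer removes every disease that disagrees with it, so an evenly splitting question roughly halves the candidate list whichever way the patient answers.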

## Step 1: Load data and transform it into DataFrames or dictionaries

In [45]:
import pandas as pd
import numpy as np
from gensim.models import Word2Vec

# Medical dictionary (built with NLP) that maps natural-language words to UMLS codes
meddic = pd.read_csv('../WordEmbedding/trimmed_wv/Dictionary.csv', index_col=0, header=None)

# The disease-symptom association matrix
WM = pd.read_csv('../WeightMatrix/Dis_Sym_30.csv', index_col=0)

# Mapping between medical terms and UMLS codes; the association matrix is indexed by these codes.
dis2sym = pd.read_csv('../UMLS/dis_symptom.csv', header=None)


# Build the lookups between medical codes and terms, plus the per-disease population counts.
# Empty cells are first filled from the row above (fillna(method='ffill') is deprecated in recent pandas).
dis2sym.ffill(inplace=True)

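umls_dis = {}   # UMLS disease code -> disease name
umls_sym = {}   # UMLS symptom code -> symptom name
dis_num = {}    # UMLS disease code -> population count

def parse_umls(cell):
    # Take the first '^'-separated entry and drop the literal 'UMLS:' prefix.
    # str.strip('UMLS:') removes those characters from both ends, not the prefix,
    # so a name ending in 'S' or 'L' would be truncated.
    first = cell.split('^')[0]
    if first.startswith('UMLS:'):
        first = first[len('UMLS:'):]
    parts = first.split('_')
    return parts if len(parts) == 2 else None

for i in dis2sym.index:
    dis = parse_umls(dis2sym.loc[i, 0])
    if dis is not None:
        umls_dis[dis[0]] = dis[1]
        dis_num[dis[0]] = int(dis2sym.loc[i, 1])
    symptom = parse_umls(dis2sym.loc[i, 2])
    if symptom is not None:
        umls_sym[symptom[0]] = symptom[1]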
    
rev_sym = {v: k for k, v in umls_sym.items()}
rev_dis = {v: k for k, v in umls_dis.items()}
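
The resulting lookup tables are plain dictionaries keyed by UMLS code, with reversed copies keyed by name. The sketch below uses illustrative entries (not read from `dis_symptom.csv`; `C9999999` is a made-up code) and a hypothetical `invert` helper to flag one caveat of inverting with a dict comprehension: if two codes share a name, only the last one survives.

```python
def invert(mapping):
    """Reverse a code -> name dict and report names that occur more than once."""
    reverse, clashes = {}, set()
    for code, name in mapping.items():
        if name in reverse:
            clashes.add(name)
        reverse[name] = code  # later codes overwrite earlier ones, as in {v: k ...}
    return reverse, clashes

# Illustrative entries only.
umls_sym_toy = {"C0010200": "cough", "C0015967": "fever", "C9999999": "cough"}
rev, dup = invert(umls_sym_toy)
print(rev["fever"], rev["cough"], dup)
```

Checking that `dup` is empty after loading the real file would confirm that the reverse lookups are unambiguous.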



## Function definitions

In [46]:
# inputs for basic information of the patient
def initial_input():
    # initial input part
    gendermap = {'F':'Female', 'M': 'Male'}
    print('Please type in the gender for the patient. F for female and M for male')
    g = input()
    gender = gendermap[g]
    print('Please type in the age for the patient in years.')
    age = int(input())
    print('What symptom do you have?')
    sym = input()
    
    return gender, age, sym
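
As written, `initial_input` raises a `KeyError` for any gender other than `F` or `M` and a `ValueError` for a non-numeric age. A possible variant that asks again instead is sketched below. The `ask`, `parse_age` and `initial_input_checked` names, the 0 to 130 age range, and the lowercasing of the symptom are assumptions for illustration; the `read` and `show` parameters exist only so the function can be driven by scripted replies.

```python
def ask(prompt, parse, read=input, show=print):
    """Prompt until parse() accepts the reply (it raises KeyError/ValueError on bad input)."""
    while True:
        show(prompt)
        try:
            return parse(read().strip())
        except (KeyError, ValueError):
            show("Sorry, that was not understood. Please try again.")

GENDER = {"F": "Female", "M": "Male"}

def parse_age(text):
    age = int(text)
    if not 0 <= age <= 130:  # assumed plausible range
        raise ValueError("age out of range")
    return age

def initial_input_checked(read=input, show=print):
    gender = ask("Gender of the patient (F or M):", lambda t: GENDER[t.upper()], read, show)
    age = ask("Age of the patient in years:", parse_age, read, show)
    show("What symptom do you have?")
    sym = read().strip().lower()
    return gender, age, sym

# Scripted replies stand in for keyboard input.
replies = iter(["x", "m", "abc", "24", " Cough "])
out = initial_input_checked(read=lambda: next(replies), show=lambda *_: None)
print(out)  # ('Male', 24, 'cough')
```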

In [49]:
meddic.loc['abdominal'][1]

'C0000737'

### The findsynonym function finds the closest medical term for a natural-language input
For example, if you enter 'coughing', the term 'cough' is found and its UMLS code is used for the rest of the diagnosis.

In [50]:
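def findsynonym(sym):
    # Exact match on a known symptom name
    if sym in rev_sym:
        return rev_sym[sym]

    # Otherwise look the word up in the medical dictionary
    if sym in meddic.index:
        return meddic.loc[sym][1]

    # No match: the caller has to handle None
    return None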
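
The lookup above expects a single term typed exactly as it appears in the tables. A possible extension that also handles capitalization and short free-text phrases is sketched below. `REV_SYM` and `MEDDIC` are toy stand-ins for `rev_sym` and `meddic`, `find_code` is a hypothetical name, and the `coughing` entry is illustrative; only the `abdominal` code is taken from the cell output above.

```python
import re

# Toy stand-ins for rev_sym and meddic (illustrative entries only).
REV_SYM = {"cough": "C0010200"}
MEDDIC = {"coughing": "C0010200", "abdominal": "C0000737"}

def find_code(text):
    """Return the first UMLS code matched by the phrase or by any of its words."""
    phrase = text.strip().lower()
    if phrase in REV_SYM:
        return REV_SYM[phrase]
    # Fall back to matching individual words, in the order they were typed.
    for word in re.findall(r"[a-z]+", phrase):
        if word in REV_SYM:
            return REV_SYM[word]
        if word in MEDDIC:
            return MEDDIC[word]
    return None

print(find_code("I keep coughing at night"))
print(find_code("Abdominal pain"))
print(find_code("xyz"))
```

Returning the first matching word is a deliberate simplification; a phrase naming several symptoms would need each match recorded separately.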

### The SelectedMatrix function reduces the association matrix to a smaller one
It keeps only the diseases that match the current symptom, and only the symptoms associated with those diseases.

In [74]:
def SelectedMatrix(sym):
    # Keep the diseases associated with this symptom, then drop the symptom itself
    selected = WM[WM[sym] != 0].drop(columns=[sym])
    # Drop symptoms whose weights sum to zero over the remaining diseases
    return selected.loc[:, selected.sum() != 0]
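
To see what the reduction does, the same logic can be applied to a toy matrix. The weights and the `select` helper below are made up for illustration; the helper takes the matrix as a parameter instead of reading the global `WM`.

```python
import pandas as pd

# Toy association matrix: rows are diseases, columns are symptoms (made-up weights).
WM_toy = pd.DataFrame(
    {"cough": [0.8, 0.6, 0.0], "fever": [0.5, 0.0, 0.4], "diarrhea": [0.0, 0.0, 0.9]},
    index=["pneumonia", "asthma", "colitis"],
)

def select(matrix, sym):
    """Keep diseases associated with `sym`, then drop `sym` and zero-sum symptom columns."""
    kept = matrix[matrix[sym] != 0].drop(columns=[sym])
    return kept.loc[:, kept.sum() != 0]

reduced = select(WM_toy, "cough")
print(reduced)
```

After a patient reports a cough, colitis is ruled out because its cough weight is zero, and diarrhea disappears as a question because neither remaining disease is associated with it.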
    

### The renorm function renormalizes the diagnosis vector via softmax
This sharpens the differences among the many candidate diseases while keeping the probabilities summing to one.

In [53]:
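T = 0.03  # softmax temperature: smaller values sharpen the distribution
def renorm(dia):
    # Sort a copy so the caller's Series is untouched; the disease codes stay as the index
    dia = dia.sort_values(ascending=False)
    # Subtracting the maximum avoids overflow and leaves the softmax result unchanged
    z = np.exp((dia - dia.max()) / T)
    return z / z.sum()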
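
The effect of the temperature can be seen on a few made-up scores. The sketch below is a plain-Python softmax (a hypothetical `softmax` helper, not the notebook's `renorm`) that compares the notebook's `T = 0.03` with `T = 1.0`.

```python
import math

def softmax(scores, T):
    """Temperature softmax; subtracting the max keeps exp() from overflowing."""
    m = max(scores)
    exps = [math.exp((s - m) / T) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.30, 0.26, 0.25]          # made-up diagnosis scores
sharp = softmax(scores, T=0.03)      # the notebook's temperature
flat = softmax(scores, T=1.0)
print([round(p, 2) for p in sharp])
print([round(p, 2) for p in flat])
```

With the low temperature, small gaps between raw scores turn into clearly separated probabilities, whereas `T = 1.0` leaves the three candidates almost indistinguishable.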

### The diagnosis function performs the diagnosis
* step 1: Get the initial input: gender, age and one symptom
* step 2: Choose the most informative question to ask and get the patient's answer
* step 3: Reduce the association matrix
* step 4: Check whether only one disease is left in the reduced matrix; if not, go back to step 2
* step 5: Calculate the diagnosis vector by multiplying the association matrix by the symptom vector
* step 6: Renormalize the diagnosis vector and print the top five possible diseases
* step 7: Use the Bing Web Search SDK to find more information about the diagnosis result
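
Step 5 is a single matrix-vector product. The sketch below shows it on a toy association matrix with made-up weights; `WM_toy` and the confirmed symptoms are assumptions for illustration.

```python
import pandas as pd

# Toy association matrix (made-up weights): rows are diseases, columns are symptoms.
WM_toy = pd.DataFrame(
    {"cough": [0.8, 0.6, 0.0], "fever": [0.5, 0.0, 0.4], "wheezing": [0.0, 0.7, 0.0]},
    index=["pneumonia", "asthma", "colitis"],
)

# Response vector: 1 for each symptom the patient confirmed, 0 otherwise.
res = pd.Series(0, index=WM_toy.columns)
res[["cough", "fever"]] = 1

# Each disease's score is the sum of its weights over the confirmed symptoms.
scores = WM_toy.dot(res).sort_values(ascending=False)
print(scores.head(5))
```

Because the product uses the full matrix, every disease receives a score, including those already filtered out of the question-selection matrix. That is why the final printout can list several alternatives.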

In [83]:
def diagnosis():
    
    gender, age, sym = initial_input()
    
    sym = findsynonym(sym)
    
    selected = SelectedMatrix(sym)
    
    #The response vector
    res = pd.Series(index=WM.columns, data=[0]*len(WM.columns))
    res[sym] = 1
    
    #Diagnosis process
    while True:
        # compute the probabilities
        dia = WM.dot(res)
        
        # Symptoms shared by every remaining candidate carry no information:
        # mark them as present and stop asking about them
        for j in selected.columns:
            if (selected[j] != 0).all():
                res[j] = 1
                selected = selected.drop(columns=[j])
        
        # the diagnosis criterion
        if len(selected) == 1:
            dia = renorm(dia)
            print('-----------------------------------------------------------')
            print('Diagnosis results:')
            # Print the five most probable diseases (position-based access via iloc)
            for i in range(min(5, len(dia))):
                print(umls_dis[dia.index[i]], ':%2d' % (dia.iloc[i] * 100), '%')
            print('-----------------------------------------------------------')        
            return umls_dis[dia.keys()[0]]
            
        # Choose the most informative symptom to ask about: the one whose
        # presence splits the remaining candidate diseases closest to half
        next_i = selected.columns[0]
        s = float('inf')
        for i in selected.columns:   
            if 0 in selected[i].value_counts():
                pri = abs(selected[i].value_counts()[0] - len(selected)/2)
                if pri < s:
                    s = pri
                    next_i = i      
            else:
                res[next_i] = 1
                selected = selected[selected[next_i]!=0]
         
        print('-----------------------------------------------------------')
        print('Do you have the following symptom: (Y for Yes and N for No)')
        print(umls_sym[next_i])
        
        answer = input()
        while answer != 'Y' and answer != 'N':
            answer = input()
              
        if answer == 'Y':
            res[next_i] = 1
            selected = selected[selected[next_i]!=0]
        else:
            res[next_i] = 0
            selected = selected[selected[next_i]==0]
            
        # Rebind instead of dropping in place to avoid modifying a filtered view
        selected = selected.drop(columns=[next_i])
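
The question-selection criterion inside the loop can be isolated into a small function, which also makes the "nothing left to ask" case explicit. This is a sketch under assumptions: `pick_question` is a hypothetical helper, the toy weights are made up, and returning `None` when no symptom distinguishes the candidates is a design choice not present in the code above.

```python
import pandas as pd

def pick_question(selected):
    """Return the symptom whose absent/present split is closest to half, or None."""
    best, best_gap = None, float("inf")
    for col in selected.columns:
        absent = int((selected[col] == 0).sum())
        if absent == 0 or absent == len(selected):
            continue  # every candidate agrees on this symptom: no information
        gap = abs(absent - len(selected) / 2)
        if gap < best_gap:
            best, best_gap = col, gap
    return best

toy = pd.DataFrame(
    {
        "fever": [0.5, 0.4, 0.3, 0.0],
        "chills": [0.0, 0.4, 0.6, 0.0],
        "malaise": [0.2, 0.2, 0.2, 0.2],
    },
    index=["pneumonia", "influenza", "malaria", "asthma"],
)
print(pick_question(toy))
print(pick_question(toy[["malaise"]]))
```

A caller could treat `None` as a signal to stop asking and report the remaining candidates, which avoids looping when several diseases share identical symptom profiles.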

### Azure Bing web search for diagnosis result

In [1]:
# Import required modules.
from azure.cognitiveservices.search.websearch import WebSearchAPI
from azure.cognitiveservices.search.websearch.models import SafeSearch
from msrest.authentication import CognitiveServicesCredentials

# Replace with your subscription key. Do not commit a real key to a shared notebook.
subscription_key = "<your-subscription-key>"

# Instantiate the client and replace with your endpoint.
client = WebSearchAPI(CognitiveServicesCredentials(subscription_key), base_url = "https://bingtril.cognitiveservices.azure.com/bing/v7.0")

def bing(diagnosis):
    # Search the web for the diagnosis name.
    web_data = client.web.search(query=diagnosis)

    # If the search response contains web pages, print the first result's name and URL.
    if hasattr(web_data.web_pages, 'value'):
        print("\r\nWebpage Results#{}".format(len(web_data.web_pages.value)))

        first_web_page = web_data.web_pages.value[0]
        print("First web page name: {} ".format(first_web_page.name))
        print("First web page URL: {} ".format(first_web_page.url))
    else:
        print("Didn't find any web pages...")

ModuleNotFoundError: No module named 'azure'

## An example of the diagnosis process

In [13]:
diagnosis()

Please type in the gender for the patient. F for female and M for male
M
Please type in the age for the patient in years.
24
What symptom do you have?
cough
-----------------------------------------------------------
Do you have the following symptom: (Y for Yes and N for No)
yellow sputum
Y
-----------------------------------------------------------
Do you have the following symptom: (Y for Yes and N for No)
green sputum
Y
-----------------------------------------------------------
Do you have the following symptom: (Y for Yes and N for No)
malaise
Y
-----------------------------------------------------------
Diagnosis results:
pneumonia :53 %
asthma :13 %
hepatitis B :11 %
influenza :11 %
colitis :10 %
-----------------------------------------------------------


'Diagnosis done'
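
An interactive session like the one above can also be replayed without typing, which is handy for testing. The sketch below uses `unittest.mock.patch` from the standard library to feed scripted replies to `input()`. `ask_yes_no` is a tiny hypothetical stand-in for the notebook's prompt loop, so that the example runs without the data files.

```python
from unittest.mock import patch

def ask_yes_no(question):
    """Stand-in for the notebook's prompt loop: repeat until Y or N is typed."""
    print(question)
    answer = input()
    while answer not in ("Y", "N"):
        answer = input()
    return answer == "Y"

# Feed scripted keystrokes to input(); 'maybe' is rejected and the next reply is used.
with patch("builtins.input", side_effect=["maybe", "Y", "N"]):
    first = ask_yes_no("yellow sputum")
    second = ask_yes_no("green sputum")

print(first, second)  # True False
```

With the data loaded, the same `patch` wrapper could in principle surround a call to `diagnosis()` with the replies from the session above (`M`, `24`, `cough`, `Y`, `Y`, `Y`), assuming the questions come up in the same order.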

In [None]:
bing(diagnosis())