# **Disease Detection using Symptoms and Treatment recommendation**

This notebook contains code that detects diseases from the symptoms entered and selected by the user and recommends appropriate treatments.


In [3]:
%pip install google google-search

Note: you may need to restart the kernel to use updated packages.





In [1]:
# Predicts diseases based on the symptoms entered and selected by the user.
# importing all necessary libraries
import warnings
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split, cross_val_score
from statistics import mean
from nltk.corpus import wordnet 
import requests
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from itertools import combinations
from time import time
from collections import Counter
import operator
from xgboost import XGBClassifier
import math
from Treatment import diseaseDetail
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
warnings.simplefilter("ignore")

In [3]:
import warnings
import numpy as np
import pandas as pd

Download resources required for NLTK pre-processing

In [4]:
import nltk
nltk.download('all')


True

The **synonyms function** finds synonyms of a symptom entered by the user.

This is necessary because the user may describe a symptom with a term that differs from the one used in the dataset.
Expanding the input with synonyms reduces wrong predictions when symptoms are worded slightly differently from the terms on which the model was trained.

*Synonyms are looked up on Thesaurus.com and in NLTK WordNet.*

In [5]:
# Returns the set of synonyms of the input term from thesaurus.com (https://www.thesaurus.com/) and WordNet (https://www.nltk.org/howto/wordnet.html)
def synonyms(term):
    found = []  # named differently from the function to avoid shadowing it
    try:
        response = requests.get('https://www.thesaurus.com/browse/{}'.format(term), timeout=10)
        soup = BeautifulSoup(response.content, "html.parser")
        container = soup.find('section', {'class': 'MainContentContainer'})
        row = container.find('div', {'class': 'css-191l5o0-ClassicContentCard'})
        for x in row.find_all('li'):
            found.append(x.get_text())
    except (requests.RequestException, AttributeError):
        # Network failure or changed page layout: fall back to WordNet only
        pass
    for syn in wordnet.synsets(term):
        found += syn.lemma_names()
    return set(found)

In [6]:
# utilities for pre-processing
stop_words = stopwords.words('english')
lemmatizer = WordNetLemmatizer()
splitter = RegexpTokenizer(r'\w+')

The **Disease Symptom dataset** was created in a separate Python program.

**Dataset scraping** was done using the **NHP website** and **Wikipedia data**.

The Disease Combination dataset contains combinations of the symptoms of each disease in the dataset, because in practice a patient with a disease often shows only some of its symptoms rather than all of them.

*To tackle this problem, combinations are made with the symptoms for each disease.*

**This increases the size of the data exponentially and helps the model predict the disease with much better accuracy.**

*df_comb -> DataFrame built from the dataset generated by combining the symptoms of each disease.*

*df_norm -> DataFrame with a single row per disease, containing all the symptoms of that disease.*

**The dataset contains 261 diseases and their symptoms.**
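
As a rough illustration of how such a combination dataset could be generated, the sketch below turns every non-empty subset of a disease's symptoms into one row. The generation script is not part of this notebook, so the helper name `expand_symptoms` and the sample symptoms are assumptions made for the example.

```python
from itertools import combinations

def expand_symptoms(disease, symptoms):
    """Yield (disease, subset) pairs for every non-empty subset of symptoms.

    Hypothetical helper: the real dataset was built in a separate program.
    """
    for size in range(1, len(symptoms) + 1):
        for subset in combinations(symptoms, size):
            yield disease, subset

rows = list(expand_symptoms("Hepatitis A", ["nausea", "jaundice", "dark urine"]))
for disease, subset in rows:
    print(disease, "->", ", ".join(subset))
print(len(rows), "rows")  # 2**3 - 1 = 7 rows for 3 symptoms
```

A disease with n symptoms yields 2^n - 1 rows under this scheme, which is why the combined dataset grows exponentially.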

In [1]:
# Load dataset scraped from NHP (https://www.nhp.gov.in/disease-a-z) & Wikipedia
# Scraping and creation of the dataset CSV is done in a separate program
df_comb = pd.read_csv("Dataset/dis_sym_dataset_comb.csv") # Disease combination
df_norm = pd.read_csv("Dataset/dis_sym_dataset_norm.csv") # Individual Disease

X = df_comb.iloc[:, 1:]
Y = df_comb.iloc[:, 0:1]


A **Logistic Regression (LR) classifier** is used because it gave better accuracy than the other classification models compared in Model_latest.py.

Cross-validation is done on the dataset with cv = 2.

In [8]:
import pickle

In [9]:
from sklearn.model_selection import cross_val_score
lr = LogisticRegression()
lr = lr.fit(X, Y)
scores = cross_val_score(lr, X, Y, cv=2)

In [10]:
# Save the trained model to disk
with open('lr_model.pkl', 'wb') as file:
    pickle.dump(lr, file)

In [11]:
# Reload the saved model to check that it can be read back
with open('lr_model.pkl', 'rb') as file:
    lr_model = pickle.load(file)

In [12]:
# List of symptoms
dataset_symptoms = list(X.columns)
dataset_symptoms

['abdominal cramp',
 'abdominal distention',
 'abnormal behavior',
 'abnormal bleeding',
 'abnormal sensation',
 'abnormally frequent',
 'abscess',
 'aching',
 'acne',
 'acquiring drinking alcohol taking lot time',
 'affected part turning white',
 'anemia',
 'anxiety',
 'arm',
 'attack pain',
 'back',
 'bacterial infection',
 'bad breath',
 'bad smelling thin vaginal discharge',
 'bad smelling vaginal discharge',
 'barky cough',
 'belching',
 'better sitting worse lying',
 'birth baby younger week gestational age',
 'bleeding gum',
 'bleeding skin',
 'blindness',
 'blindness one eye',
 'blister sunlight',
 'bloating',
 'blood stool',
 'blood urine',
 'bloody diarrhea',
 'blue',
 'bluish skin coloration',
 'blurred vision',
 'blurry vision',
 'body tremor',
 'bone pain',
 'bowed leg',
 'breakdown skeletal muscle',
 'breathing problem',
 'bruising',
 'burning',
 'burning redness eye',
 'burning stabbing pain',
 'burning urination',
 'certain thought repeatedly',
 'change bowel movement',

# Symptoms initially taken from user.

In [13]:
# Taking symptoms from user as input 
user_symptoms = str(input("Please enter symptoms separated by comma(,):\n")).lower().split(',')
# Preprocessing the input symptoms
processed_user_symptoms=[]
for sym in user_symptoms:
    sym=sym.strip()
    sym=sym.replace('-',' ')
    sym=sym.replace("'",'')
    sym = ' '.join([lemmatizer.lemmatize(word) for word in splitter.tokenize(sym)])
    processed_user_symptoms.append(sym)

Please enter symptoms separated by comma(,):
 Joint pain, Stiffness or reduced range of motion (how far you can move a joint), Swelling (inflammation),Skin discoloration,Tenderness or sensitivity to touch around a joint, A feeling of heat or warmth near your joints


Pre-processing on symptoms entered by user is done.

In [16]:
# Taking each user symptom and finding all its synonyms and appending it to the pre-processed symptom string
user_symptoms = []
for user_sym in processed_user_symptoms:
    user_sym = user_sym.split()
    str_sym = set()
    for comb in range(1, len(user_sym)+1):
        for subset in combinations(user_sym, comb):
            subset=' '.join(subset)
            subset = synonyms(subset) 
            str_sym.update(subset)
    str_sym.add(' '.join(user_sym))
    user_symptoms.append(' '.join(str_sym).replace('_',' '))
# Query expansion: the synonyms found for each symptom initially entered are joined into one string
print("After query expansion done by using the symptoms entered")
print(user_symptoms)


The procedure below shows the user the symptom synonyms found for the symptoms they entered.

The symptom synonyms and the user's symptoms are matched against the symptoms present in the dataset. Only the symptoms that match symptoms in the dataset are shown back to the user.
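
The matching rule can be sketched on toy data as follows. A dataset symptom is kept when more than half of its words appear in one expanded user string. The function name `match_symptoms` and the sample strings are assumptions made for this illustration.

```python
def match_symptoms(dataset_symptoms, expanded_user_symptoms, threshold=0.5):
    """Return dataset symptoms whose words mostly appear in one expanded user string."""
    matched = []
    for data_sym in dataset_symptoms:
        words = data_sym.split()
        for user_sym in expanded_user_symptoms:
            user_words = set(user_sym.split())
            overlap = sum(1 for w in words if w in user_words)
            if overlap / len(words) > threshold:
                matched.append(data_sym)
                break  # one matching user string is enough
    return matched

dataset = ["muscle joint pain", "stiffness", "dark urine", "neck"]
expanded = ["joint pain articulation hurting", "stiffness rigidity"]
print(match_symptoms(dataset, expanded))  # ['muscle joint pain', 'stiffness']
```

Because the comparison is strict, a two-word dataset symptom with only one matching word scores exactly 0.5 and is not shown.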

In [None]:
# Loop over all the symptoms in dataset and check its similarity score to the synonym string of the user-input 
# symptoms. If similarity>0.5, add the symptom to the final list
found_symptoms = set()
for idx, data_sym in enumerate(dataset_symptoms):
    data_sym_split=data_sym.split()
    for user_sym in user_symptoms:
        count=0
        for symp in data_sym_split:
            if symp in user_sym.split():
                count+=1
        if count/len(data_sym_split)>0.5:
            found_symptoms.add(data_sym)
found_symptoms = list(found_symptoms)

In [None]:
found_symptoms

## **Prompt the user to select the relevant symptoms by entering the corresponding indices.**

In [17]:
# Print all found symptoms
print("Top matching symptoms from your search!")
for idx, symp in enumerate(found_symptoms):
    print(idx,":",symp)
    
# Show the related symptoms found in the dataset and ask user to select among them
select_list = input("\nPlease select the relevant symptoms. Enter indices (separated-space):\n").split()

# Find other relevant symptoms from the dataset, based on the highest co-occurrence with the
# ones selected by the user
dis_list = set()
final_symp = [] 
counter_list = []
for idx in select_list:
    symp=found_symptoms[int(idx)]
    final_symp.append(symp)
    dis_list.update(set(df_norm[df_norm[symp]==1]['label_dis']))
   
for dis in dis_list:
    row = df_norm.loc[df_norm['label_dis'] == dis].values.tolist()
    row[0].pop(0)
    for idx,val in enumerate(row[0]):
        if val!=0 and dataset_symptoms[idx] not in final_symp:
            counter_list.append(dataset_symptoms[idx])

Top matching symptoms from your search!
0 : yellow skin
1 : muscle joint pain
2 : lower abdominal pain
3 : vomiting
4 : yellowish coloration skin white eye
5 : nausea vomiting
6 : jaundice
7 : yellowish skin crust
8 : diarrhea
9 : neck
10 : blue
11 : dark urine
12 : fatigue
13 : yellowish skin
14 : upper abdominal pain
15 : diarrhoea
16 : right lower abdominal pain
17 : painful
18 : nausea
19 : trouble sensation
20 : tiredness



Please select the relevant symptoms. Enter indices (separated-space):
 0 1 2 3 5 6 8 10 13 15 20


## Find symptoms that generally co-occur. For example, a headache often accompanies a cough, so the two symptoms co-occur.
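
A minimal sketch of the counting idea, using a small hypothetical disease-to-symptom table in place of `df_norm`: every disease that shares a selected symptom contributes its remaining symptoms to a `Counter`.

```python
from collections import Counter

# Hypothetical disease -> symptoms table standing in for df_norm
disease_symptoms = {
    "Influenza": {"fever", "cough", "headache"},
    "Common cold": {"cough", "runny nose", "headache"},
    "Malaria": {"fever", "chill", "headache"},
}
selected = {"cough"}

co_occurring = Counter()
for symptoms in disease_symptoms.values():
    if symptoms & selected:  # the disease shares at least one selected symptom
        co_occurring.update(symptoms - selected)

# Symptoms ranked by how many candidate diseases include them
print(co_occurring.most_common())
```

Here `headache` is counted twice because both diseases that include `cough` also list it, while `chill` is never counted because Malaria does not include `cough`.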

In [18]:
# Symptoms that co-occur with the ones selected by user              
dict_symp = dict(Counter(counter_list))
dict_symp_tup = sorted(dict_symp.items(), key=operator.itemgetter(1),reverse=True)   
print(dict_symp_tup)

[('fever', 21), ('headache', 13), ('testicular pain', 11), ('nausea', 5), ('confusion', 4), ('seizure', 3), ('feeling tired', 3), ('sore throat', 3), ('dark urine', 3), ('constipation', 3), ('unintended weight loss', 3), ('chest pain', 3), ('poor appetite', 2), ('muscle weakness', 2), ('chill', 1), ('stiff neck', 1), ('bleeding skin', 1), ('abdominal cramp', 1), ('coughing', 1), ('runny nose', 1), ('abdominal distention', 1), ('dermatitis herpetiformis', 1), ('malabsorption', 1), ('none non specific', 1), ('frequent urination', 1), ('increased hunger', 1), ('missed period', 1), ('tender breast', 1), ('blister sunlight', 1), ('depending subtype abdominal pain', 1), ('itchy blister', 1), ('loss appetite', 1), ('small', 1), ('burning urination', 1), ('irregular menstruation', 1), ('pain sex', 1), ('vaginal discharge', 1), ('enlarged lymph node neck', 1), ('belching', 1), ('upper abdominal pain', 1), ('bad smelling vaginal discharge', 1), ('muscular pain', 1), ('vaginal bleeding', 1), ('af

## The user is shown a list of co-occurring symptoms to select from. This is repeated iteratively to suggest further possible symptoms based on their co-occurrence with the previously selected symptoms.

As the co-occurring symptoms can be overwhelming in number, only the top 5 are suggested at a time, and the user selects from those.

If the user has none of those 5 symptoms and wants to see the next 5, they can enter -1.

To stop the suggestions, the user enters "no".
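
The paging in groups of five can be sketched with a small generator. This is an illustration of the idea rather than the notebook's own loop, and the name `batches` is an assumption.

```python
def batches(items, size=5):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

ranked = ["fever", "headache", "testicular pain", "nausea", "confusion",
          "seizure", "feeling tired"]
pages = list(batches(ranked))
for number, page in enumerate(pages):
    print(number, page)
```

The last page simply holds whatever remains, which corresponds to the `count==len(dict_symp_tup)` check in the notebook's loop.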

In [19]:
# Iteratively suggest the top co-occurring symptoms to the user and ask them to select the ones that apply
found_symptoms=[]
count=0
for tup in dict_symp_tup:
    count+=1
    found_symptoms.append(tup[0])
    if count%5==0 or count==len(dict_symp_tup):
        print("\nCommon co-occurring symptoms:")
        for idx,ele in enumerate(found_symptoms):
            print(idx,":",ele)
        select_list = input("Do you have any of these symptoms? If yes, enter the indices (space-separated), 'no' to stop, '-1' to skip:\n").lower().split()
        if select_list and select_list[0]=='no':
            break
        # '-1' or an empty answer skips to the next batch
        if not select_list or select_list[0]=='-1':
            found_symptoms = []
            continue
        for idx in select_list:
            final_symp.append(found_symptoms[int(idx)])
        found_symptoms = []


Common co-occurring symptoms:
0 : fever
1 : headache
2 : testicular pain
3 : nausea
4 : confusion


Do you have any of these symptoms? If yes, enter the indices (space-separated), 'no' to stop, '-1' to skip:
 no


Final Symptom list

In [20]:
final_symp

['yellow skin',
 'muscle joint pain',
 'lower abdominal pain',
 'vomiting',
 'nausea vomiting',
 'jaundice',
 'diarrhea',
 'blue',
 'yellowish skin',
 'diarrhoea',
 'tiredness']

In [21]:
# Create query vector based on symptoms selected by the user
print("\nFinal list of Symptoms that will be used for prediction:")
sample_x = [0 for x in range(0,len(dataset_symptoms))]
for val in final_symp:
    print(val)
    sample_x[dataset_symptoms.index(val)]=1


Final list of Symptoms that will be used for prediction:
yellow skin
muscle joint pain
lower abdominal pain
vomiting
nausea vomiting
jaundice
diarrhea
blue
yellowish skin
diarrhoea
tiredness


Prediction of disease is done

In [22]:
prediction = lr.predict_proba([sample_x])

Show the top k diseases and their probabilities to the user.

Here k = 10.
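
For readers unfamiliar with the `argsort()[-k:][::-1]` idiom used in the next cell, the plain-Python sketch below selects the same thing: the indices of the k largest probabilities, highest first. The function name and the sample probabilities are assumptions, and the order of exactly tied values may differ from NumPy's.

```python
def top_k_indices(probabilities, k):
    """Indices of the k largest values, highest first."""
    order = sorted(range(len(probabilities)),
                   key=lambda i: probabilities[i], reverse=True)
    return order[:k]

probs = [0.05, 0.40, 0.10, 0.30, 0.15]
print(top_k_indices(probs, 3))  # [1, 3, 4]
```

Each index refers to a position in the sorted list of disease labels. scikit-learn classifiers keep their labels in sorted order in `classes_`, and the columns of `predict_proba` follow that order, which is why the notebook sorts `diseases` before indexing into it.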

In [23]:
k = 10
diseases = list(set(Y['label_dis']))
diseases.sort()
topk = prediction[0].argsort()[-k:][::-1]
topk

array([105, 106,  69,  15,  95,  90, 244,  46, 257, 114], dtype=int64)

# **Showing the list of top k diseases to the user with their scores (symptom overlap scaled by the mean cross-validation accuracy).**

# **To see details and suggested treatments for a disease, the user can enter its index.**
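
The value printed as "Probability" in the next cell is not taken from `predict_proba`. It is computed from the overlap between the selected symptoms and the disease's symptoms, with one added to numerator and denominator, and then multiplied by the mean cross-validation accuracy. A small sketch of that arithmetic, where the function name, the sample symptoms, and the accuracy of 0.88 are assumptions for illustration:

```python
def display_score(disease_symptoms, selected_symptoms, mean_cv_accuracy):
    """Smoothed share of selected symptoms found in the disease, scaled by accuracy."""
    selected = set(selected_symptoms)
    overlap = len(set(disease_symptoms) & selected)
    return (overlap + 1) / (len(selected) + 1) * mean_cv_accuracy

# Illustrative values only: 0.88 is an assumed mean cross-validation accuracy
selected = ["jaundice", "vomiting", "diarrhea", "tiredness", "yellow skin"]
disease = ["jaundice", "vomiting", "fever", "dark urine"]
score = display_score(disease, selected, mean_cv_accuracy=0.88)
print(f"{score * 100:.2f}%")  # (2 + 1) / (5 + 1) * 0.88 -> 44.00%
```

Every disease is multiplied by the same accuracy, so that factor does not change the ranking, and the score can never exceed the mean accuracy itself.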

In [24]:
print(f"\nTop {k} diseases predicted based on symptoms")
topk_dict = {}
# Show top 10 highly probable disease to the user.
for idx,t in  enumerate(topk):
    match_sym=set()
    row = df_norm.loc[df_norm['label_dis'] == diseases[t]].values.tolist()
    row[0].pop(0)

    for idx,val in enumerate(row[0]):
        if val!=0:
            match_sym.add(dataset_symptoms[idx])
    prob = (len(match_sym.intersection(set(final_symp)))+1)/(len(set(final_symp))+1)
    prob *= mean(scores)
    topk_dict[t] = prob
j = 0
topk_index_mapping = {}
topk_sorted = dict(sorted(topk_dict.items(), key=lambda kv: kv[1], reverse=True))
for key in topk_sorted:
  prob = topk_sorted[key]*100
  print(str(j) + " Disease name:",diseases[key], "\tProbability:",str(round(prob, 2))+"%")
  topk_index_mapping[j] = key
  j += 1

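select = input("\nMore details about the disease? Enter index of disease or '-1' to discontinue and close the system:\n").strip()
# Accept only a listed index; '-1' or any other input ends the session without raising an error
if select.isdecimal() and int(select) in topk_index_mapping:
    dis=diseases[topk_index_mapping[int(select)]]
    print()
    print(diseaseDetail(dis))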


Top 10 diseases predicted based on symptoms
0 Disease name: Hepatitis A 	Probability: 29.35%
1 Disease name: Hepatitis B 	Probability: 22.01%
2 Disease name: Crimean Congo haemorrhagic fever (CCHF) 	Probability: 22.01%
3 Disease name: Anthrax 	Probability: 22.01%
4 Disease name: Gastroenteritis 	Probability: 22.01%
5 Disease name: Food Poisoning 	Probability: 22.01%
6 Disease name: Trichinosis 	Probability: 22.01%
7 Disease name: Celiacs disease 	Probability: 14.68%
8 Disease name: Yellow Fever 	Probability: 14.68%
9 Disease name: Hyperthyroidism 	Probability: 14.68%



More details about the disease? Enter index of disease or '-1' to discontinue and close the system:
 0



Hepatitis A
Other names -  Infectious hepatitis 
Specialty -  Infectious disease, gastroenterology 
Symptoms -  Nausea, vomiting, diarrhea, dark urine, jaundice, fever, abdominal pain     
Complications -  Acute liver failure     
Usual onset -  2–6 weeks after infection     
Duration -  8 weeks     
Causes -  Fecal–oral route,     
Diagnostic method -  Blood tests     
Prevention -  Hepatitis A vaccine, hand washing, properly cooking food     
Treatment -  Supportive care, liver transplantation     
Frequency -  114 million symptomatic and nonsymptomatic (2015)     
Deaths -  11,200     



In [28]:
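import requests
from bs4 import BeautifulSoup

# Example function to scrape medicines for a disease.
# The URL and the 'medicine-name' class are placeholders, not a real site:
# replace them with an actual source before relying on the result.
def get_medicine(disease):
    url = 'https://example-medical-site.com/search'
    try:
        # Pass the disease name via params so that spaces and symbols are URL-encoded
        response = requests.get(url, params={'query': disease}, timeout=10)
    except requests.RequestException:
        return []
    soup = BeautifulSoup(response.text, 'html.parser')

    # Assuming medicines are listed under a specific HTML tag/class
    medicines = soup.find_all('div', class_='medicine-name')

    # Extract and return medicine names
    return [medicine.text for medicine in medicines]

# Rank and display the top k diseases (same logic as the previous cell)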
print(f"\nTop {k} diseases predicted based on symptoms")
topk_dict = {}
# Show top 10 highly probable disease to the user.
for idx, t in enumerate(topk):
    match_sym = set()
    row = df_norm.loc[df_norm['label_dis'] == diseases[t]].values.tolist()
    row[0].pop(0)

    for idx, val in enumerate(row[0]):
        if val != 0:
            match_sym.add(dataset_symptoms[idx])
    prob = (len(match_sym.intersection(set(final_symp))) + 1) / (len(set(final_symp)) + 1)
    prob *= mean(scores)
    topk_dict[t] = prob

j = 0
topk_index_mapping = {}
topk_sorted = dict(sorted(topk_dict.items(), key=lambda kv: kv[1], reverse=True))
for key in topk_sorted:
    prob = topk_sorted[key] * 100
    print(str(j) + " Disease name:", diseases[key], "\tProbability:", str(round(prob, 2)) + "%")
    topk_index_mapping[j] = key
    j += 1

select = input("\nMore details about the disease? Enter index of disease or '-1' to discontinue and close the system:\n").strip()
# Accept only a listed index; '-1' or any other input (such as 'no') ends the session without raising an error
if select.isdecimal() and int(select) in topk_index_mapping:
    dis = diseases[topk_index_mapping[int(select)]]
    print()
    print(diseaseDetail(dis))

    # Fetch and display medicines for the selected disease
    medicines = get_medicine(dis)
    if medicines:
        print(f"\nMedicines for {dis}:")
        for med in medicines:
            print(f"- {med}")
    else:
        print(f"No medicines found for {dis}.")



Top 10 diseases predicted based on symptoms
0 Disease name: Ebola 	Probability: 44.03%
1 Disease name: Bleeding Gums 	Probability: 29.35%
2 Disease name: Pelvic inflammatory disease 	Probability: 29.35%
3 Disease name: Multiple sclerosis 	Probability: 29.35%
4 Disease name: Lead poisoning 	Probability: 29.35%
5 Disease name: Heat-Related Illnesses and Heat waves 	Probability: 29.35%
6 Disease name: Brain Tumour 	Probability: 29.35%
7 Disease name: Influenza 	Probability: 29.35%
8 Disease name: Iritis 	Probability: 29.35%
9 Disease name: Puerperal sepsis 	Probability: 29.35%



More details about the disease? Enter index of disease or '-1' to discontinue and close the system:
 no

In [1]:
import warnings
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split, cross_val_score
from statistics import mean
from nltk.corpus import wordnet 
import nltk
import requests
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from itertools import combinations
from time import time
from collections import Counter
import operator
from xgboost import XGBClassifier
import math
from Treatment import diseaseDetail
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
nltk.download('all')
warnings.simplefilter("ignore")


In [None]:
# Returns the set of synonyms of the input term from thesaurus.com (https://www.thesaurus.com/) and WordNet (https://www.nltk.org/howto/wordnet.html)
def synonyms(term):
    found = []  # named differently from the function to avoid shadowing it
    try:
        response = requests.get('https://www.thesaurus.com/browse/{}'.format(term), timeout=10)
        soup = BeautifulSoup(response.content, "html.parser")
        container = soup.find('section', {'class': 'MainContentContainer'})
        row = container.find('div', {'class': 'css-191l5o0-ClassicContentCard'})
        for x in row.find_all('li'):
            found.append(x.get_text())
    except (requests.RequestException, AttributeError):
        # Network failure or changed page layout: fall back to WordNet only
        pass
    for syn in wordnet.synsets(term):
        found += syn.lemma_names()
    return set(found)

# utilities for pre-processing
stop_words = stopwords.words('english')
lemmatizer = WordNetLemmatizer()
splitter = RegexpTokenizer(r'\w+')



df_comb = pd.read_csv("Dataset/dis_sym_dataset_comb.csv") # Disease combination
df_norm = pd.read_csv("Dataset/dis_sym_dataset_norm.csv") # Individual Disease

X = df_comb.iloc[:, 1:]
Y = df_comb.iloc[:, 0:1]

import pickle
with open('lr_model.pkl', 'rb') as file:
    lr = pickle.load(file)
dataset_symptoms = list(X.columns)

# Taking symptoms from user as input 
user_symptoms = str(input("Please enter symptoms separated by comma(,):\n")).lower().split(',')
# Preprocessing the input symptoms
processed_user_symptoms=[]
for sym in user_symptoms:
    sym=sym.strip()
    sym=sym.replace('-',' ')
    sym=sym.replace("'",'')
    sym = ' '.join([lemmatizer.lemmatize(word) for word in splitter.tokenize(sym)])
    processed_user_symptoms.append(sym)

# Taking each user symptom and finding all its synonyms and appending it to the pre-processed symptom string
print(".........Processing all of that symtomps...........")
user_symptoms = []  # reset so that only the expanded symptom strings are matched below
for user_sym in processed_user_symptoms:
    user_sym = user_sym.split()
    str_sym = set()
    for comb in range(1, len(user_sym)+1):
        for subset in combinations(user_sym, comb):
            subset=' '.join(subset)
            subset = synonyms(subset) 
            str_sym.update(subset)
    print("Processing Wait!!!")
    str_sym.add(' '.join(user_sym))
    user_symptoms.append(' '.join(str_sym).replace('_',' '))
# query expansion performed by joining synonyms found for each symptoms initially entered
# print("After query expansion done by using the symptoms entered")
# print(user_symptoms)
print("Processed")
found_symptoms = set()
for idx, data_sym in enumerate(dataset_symptoms):
    data_sym_split=data_sym.split()
    for user_sym in user_symptoms:
        count=0
        for symp in data_sym_split:
            if symp in user_sym.split():
                count+=1
        if count/len(data_sym_split)>0.5:
            found_symptoms.add(data_sym)
found_symptoms = list(found_symptoms)
# Print all found symptoms
print("Top matching symptoms from your search!")
for idx, symp in enumerate(found_symptoms):
    print(idx,":",symp)
    
# Show the related symptoms found in the dataset and ask user to select among them
select_list = input("\nPlease select the relevant symptoms. Enter indices (separated-space):\n").split()

# Find other relevant symptoms from the dataset, based on the highest co-occurrence with the
# ones selected by the user
dis_list = set()
final_symp = [] 
counter_list = []
for idx in select_list:
    symp=found_symptoms[int(idx)]
    final_symp.append(symp)
    dis_list.update(set(df_norm[df_norm[symp]==1]['label_dis']))
   
for dis in dis_list:
    row = df_norm.loc[df_norm['label_dis'] == dis].values.tolist()
    row[0].pop(0)
    for idx,val in enumerate(row[0]):
        if val!=0 and dataset_symptoms[idx] not in final_symp:
            counter_list.append(dataset_symptoms[idx])
dict_symp = dict(Counter(counter_list))
dict_symp_tup = sorted(dict_symp.items(), key=operator.itemgetter(1),reverse=True)   
print(dict_symp_tup)
# Iteratively suggest the top co-occurring symptoms to the user and ask them to select the ones that apply
found_symptoms=[]
count=0
for tup in dict_symp_tup:
    count+=1
    found_symptoms.append(tup[0])
    if count%5==0 or count==len(dict_symp_tup):
        print("\nCommon co-occuring symptoms:")
        for idx,ele in enumerate(found_symptoms):
            print(idx,":",ele)
        select_list = input("Do you have have of these symptoms? If Yes, enter the indices (space-separated), 'no' to stop, '-1' to skip:\n").lower().split();
        if select_list[0]=='no':
            break
        if select_list[0]=='-1':
            found_symptoms = [] 
            continue
        for idx in select_list:
            final_symp.append(found_symptoms[int(idx)])
        found_symptoms = [] 
# Create query vector based on symptoms selected by the user
print("\nFinal list of Symptoms that will be used for prediction:")
sample_x = [0 for x in range(0,len(dataset_symptoms))]
for val in final_symp:
    print(val)
    sample_x[dataset_symptoms.index(val)]=1
prediction = lr.predict_proba([sample_x])
k = 10
diseases = list(set(Y['label_dis']))
diseases.sort()
topk = prediction[0].argsort()[-k:][::-1]
print(f"\nTop {k} diseases predicted based on symptoms")
topk_dict = {}
# Cross-validation accuracy is computed once, outside the loop, and reused for every disease
scores = cross_val_score(lr, X, Y, cv=2)
# Show the top 10 most probable diseases to the user.
for t in topk:
    match_sym=set()
    row = df_norm.loc[df_norm['label_dis'] == diseases[t]].values.tolist()
    row[0].pop(0)

    for idx,val in enumerate(row[0]):
        if val!=0:
            match_sym.add(dataset_symptoms[idx])
    prob = (len(match_sym.intersection(set(final_symp)))+1)/(len(set(final_symp))+1)
    prob *= mean(scores)
    topk_dict[t] = prob
j = 0
topk_index_mapping = {}
topk_sorted = dict(sorted(topk_dict.items(), key=lambda kv: kv[1], reverse=True))
for key in topk_sorted:
  prob = topk_sorted[key]*100
  print(str(j) + " Disease name:",diseases[key], "\tProbability:",str(round(prob, 2))+"%")
  topk_index_mapping[j] = key
  j += 1

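select = input("\nMore details about the disease? Enter index of disease or '-1' to discontinue and close the system:\n").strip()
# Accept only a listed index; '-1' or any other input ends the session without raising an error
if select.isdecimal() and int(select) in topk_index_mapping:
    dis=diseases[topk_index_mapping[int(select)]]
    print()
    print(diseaseDetail(dis))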

Please enter symptoms separated by comma(,):
 joint pain, stiffness


.........Processing all of that symtomps...........
Processing Wait!!!
Processing Wait!!!
Processed
Top matching symptoms from your search!
0 : multiple painful joint
1 : muscle joint pain
2 : joint bone pain
3 : painful swollen joint
4 : neck
5 : trouble sensation
6 : stiffness
7 : painful



Please select the relevant symptoms. Enter indices (separated-space):
 0 1 3 6


[('fever', 4), ('decreased range motion', 2), ('joint bone pain', 2), ('feeling tired', 2), ('headache', 2), ('redness', 1), ('swelling', 1), ('swollen', 1), ('warm', 1), ('chest pain', 1), ('hair loss', 1), ('mouth ulcer', 1), ('red rash', 1), ('swollen lymph node', 1), ('coughing', 1), ('runny nose', 1), ('sore throat', 1), ('maculopapular rash', 1), ('erythema marginatum', 1), ('involuntary muscle movement', 1), ('joint swelling', 1)]

Common co-occuring symptoms:
0 : fever
1 : decreased range motion
2 : joint bone pain
3 : feeling tired
4 : headache


Do you have have of these symptoms? If Yes, enter the indices (space-separated), 'no' to stop, '-1' to skip:
 no



Final list of Symptoms that will be used for prediction:
multiple painful joint
muscle joint pain
painful swollen joint
stiffness

Top 10 diseases predicted based on symptoms
