Our application lets patients enter their symptoms into a symptom tracker.  The application then matches those symptoms to possible diagnoses.

Some technologies we used that were not covered in our boot camp:
1. **SentencePiece** is a subword tokenization library that supplements NLTK.  We need it to help translate medical terms and other complex words.
2. **%%capture** is an IPython cell magic, available in Google Colab and other Jupyter environments.  It lets the !pip installs run without printing all of their output, which would otherwise clutter the notebook.
3. **sqlite3** is Python's built-in interface to SQLite, a lightweight database engine.  Because we are dealing with large dataset(s) for our model, SQLite lets our application store and retrieve data using SQL (Structured Query Language).  We chose it for efficiency and speed.
4. **Flagging** is a feature we added to our Gradio interface.  It collects feedback from users about how well the application is working, which helps us improve the model over time.
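
As a rough illustration of the sqlite3 item above, the sketch below stores a few disease/symptom pairs in an in-memory database and queries them back with SQL. The table name `disease_symptoms`, its columns, the helper `diseases_for_symptom`, and the sample rows are all hypothetical; they are not taken from the notebook.

```python
import sqlite3

# Hypothetical sample rows; the notebook itself loads its data from a CSV file.
rows = [
    ("Fungal Infection", "itching"),
    ("Fungal Infection", "skin_rash"),
    ("Allergy", "sneezing"),
    ("Allergy", "itching"),
]

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
conn.execute("CREATE TABLE disease_symptoms (disease TEXT, symptom TEXT)")
conn.executemany("INSERT INTO disease_symptoms VALUES (?, ?)", rows)
conn.commit()

def diseases_for_symptom(connection, symptom):
    """Return the sorted, de-duplicated diseases recorded for one symptom."""
    cursor = connection.execute(
        "SELECT DISTINCT disease FROM disease_symptoms "
        "WHERE symptom = ? ORDER BY disease",
        (symptom,),  # parameter binding avoids building SQL from user text
    )
    return [disease for (disease,) in cursor.fetchall()]

itching_matches = diseases_for_symptom(conn, "itching")
print(itching_matches)
```

Passing the symptom as a bound parameter (the `?` placeholder) rather than formatting it into the query string matters here because the symptom text ultimately comes from patient input.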

In [1]:
%%capture
# Our pip installs needed to run our application.
# Note: %%capture is an IPython cell magic and must be the first line of the cell.
# Remove it if you are running this outside a notebook.

!pip install gradio
!pip install scikit-learn
!pip install nltk
!pip install transformers
!pip install torch
!pip install sentencepiece
!pip install tensorflow --upgrade
!pip install tensorflow_hub
!pip install tensorflow_text
!pip install pandas
!pip install numpy
# sqlite3 ships with the Python standard library, so no pip install is needed.


In [2]:
# Imports needed for this application
import torch
import sentencepiece
import tensorflow
import tensorflow_hub
import tensorflow_text
import numpy as np
import sqlite3


In [3]:
# List versions of the imports
print("PyTorch Version:", torch.__version__)
print("SentencePiece Version:", sentencepiece.__version__)
print("TensorFlow Version:", tensorflow.__version__)
print("TensorFlow Hub Version:", tensorflow_hub.__version__)
print("TensorFlow Text Version:", tensorflow_text.__version__)
print("Numpy Version:", np.__version__)
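# sqlite3.version is the Python module's version; sqlite3.sqlite_version reports the SQLite library itself.
print("SQLite3 Version:", sqlite3.version)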


PyTorch Version: 2.5.1+cu124
SentencePiece Version: 0.2.0
TensorFlow Version: 2.18.0
TensorFlow Hub Version: 0.16.1
TensorFlow Text Version: 2.18.1
Numpy Version: 1.26.4
SQLite3 Version: 2.6.0


In [4]:
from google.colab import files
uploaded = files.upload()

Saving symbipredict.csv to symbipredict.csv


In [5]:
#  Read the .csv using pandas
import pandas as pd

# Load the data, skipping malformed rows and using a comma delimiter
disease_data = pd.read_csv('symbipredict.csv', on_bad_lines='skip', delimiter=',')

# Display the first few rows of the dataset
print(disease_data.head())

Disease          Symptom_1 Symptom_2            Symptom_3            Symptom_4           Symptom_5 Symptom_6 Symptom_7 Symptom_8 Symptom_9 Symptom_10 Symptom_11 Symptom_12 Symptom_13 Symptom_14 Symptom_15 Symptom_16   Symptom_17
Fungal Infection itching   skin_rash            nodal_skin_eruptions dischromic _patches NaN       NaN       NaN       NaN       NaN       NaN       NaN        NaN        NaN        NaN        NaN        NaN                 NaN
                 skin_rash nodal_skin_eruptions dischromic _patches  NaN                 NaN       NaN       NaN       NaN       NaN       NaN       NaN        NaN        NaN        NaN        NaN        NaN                 NaN
                 itching   nodal_skin_eruptions dischromic _patches  NaN                

In [6]:
# Version of pandas
print(pd.__version__)

2.2.2


In [7]:
# After loading the data, it is necessary to combine the symptom columns into a single column.
# dropna() keeps empty cells from being joined in as the literal string 'nan'.
symptom_columns = [col for col in disease_data.columns if col != 'Disease']
disease_data['Processed_Symptoms'] = disease_data[symptom_columns].apply(lambda x: ' '.join(x.dropna().astype(str)), axis=1)
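
To make the intended result of this step concrete, here is a plain-Python sketch of the same idea using only the standard library. Empty cells load as NaN in pandas, so a join over the raw columns should skip them. The two sample rows and the helper name `combine_symptoms` are hypothetical.

```python
import csv
import io

# Hypothetical two-row sample in the same wide layout as the dataset.
sample_csv = """Disease,Symptom_1,Symptom_2,Symptom_3
Fungal Infection,itching,skin_rash,
Allergy,sneezing,,
"""

def combine_symptoms(row, skip_column="Disease"):
    """Join the non-empty symptom cells of one row into a single string."""
    values = [(value or "").strip() for key, value in row.items() if key != skip_column]
    return " ".join(value for value in values if value)

reader = csv.DictReader(io.StringIO(sample_csv))
processed = {row["Disease"]: combine_symptoms(row) for row in reader}
print(processed)
```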

In the section below, we import the NLTK modules and download the data packages (tokenizers, stop words, and taggers) needed to build our NLTK pipeline.

In [8]:
# Import necessary libraries starting with nltk
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('maxent_ne_chunker')
nltk.download('words')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


In [9]:
# Version of nltk
print(nltk.__version__)

3.9.1


In the section below, we define how we want to use our dataset(s).  Patients input their symptoms, and we then match them to keywords from our dataset(s).  This is the preprocessing step for our model.


In [10]:
# Define the preprocessing of the data
def preprocess_symptoms(symptom_text):
    tokens = word_tokenize(symptom_text.lower())
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [w for w in tokens if w not in stop_words and w.isalnum()]
    return ' '.join(filtered_tokens)

# Example synonym dictionary (this can be expanded)
synonym_dict = {
    'fever': ['fever', 'pyrexia'],
    'headache': ['headache', 'migraine', 'cephalalgia'],
    'nausea': ['nausea', 'queasiness', 'sickness'],
    'vomiting': ['vomiting', 'throwing up', 'emesis'],
    'sore throat': ['sore throat', 'pharyngitis', 'throat pain']
}

def expand_keywords(keywords):
    expanded_keywords = set()
    for keyword in keywords:
        if keyword in synonym_dict:
            expanded_keywords.update(synonym_dict[keyword])
        else:
            expanded_keywords.add(keyword)
    return list(expanded_keywords)

def extract_keywords(patient_feedback):
    tokens = word_tokenize(patient_feedback.lower())
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [w for w in tokens if w not in stop_words and w.isalnum()]
    pos_tags = nltk.pos_tag(filtered_tokens)
    keywords = [word for word, pos in pos_tags if pos.startswith('NN') or pos.startswith('JJ') or pos.startswith('VB')]
    return keywords
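
One limitation of the synonym dictionary above is that it only expands a keyword when the patient types the dictionary key itself; a synonym such as "pyrexia" or a multi-word phrase such as "throwing up" is never mapped back to its key. The sketch below shows one possible way to handle that with a reverse index. The helper `normalize_keywords` is hypothetical, and the substring matching is a deliberate simplification.

```python
# Two entries copied from the synonym dictionary above, kept small for the example.
synonym_dict = {
    "fever": ["fever", "pyrexia"],
    "vomiting": ["vomiting", "throwing up", "emesis"],
}

# Reverse index: every synonym (including the key itself) points to its key.
canonical_for = {
    synonym: canonical
    for canonical, synonyms in synonym_dict.items()
    for synonym in synonyms
}

def normalize_keywords(text):
    """Map known synonyms found in free text to their canonical terms."""
    text = text.lower()
    found = set()
    for synonym, canonical in canonical_for.items():
        # Substring matching lets multi-word phrases match as a unit.
        if synonym in text:
            found.add(canonical)
    return sorted(found)

print(normalize_keywords("I have pyrexia and keep throwing up"))
```

Substring matching can over-match (for example, a short synonym appearing inside a longer unrelated word), so a production version would likely match on token boundaries instead.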


In the section(s) below, the application reads the patient input and suggests diagnoses.  This is our vectorizing process.  Once we vectorize, we run the Gradio app, which generates an input cell for patient data and an output cell for possible diagnoses.  We chose TF-IDF vectorization (scikit-learn's TfidfVectorizer) with cosine similarity because it performed better with Gradio.  SpaCy and Gradio had constant version conflicts, which caused our application to break, so we switched to TF-IDF.
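
To show what the vectorizing step computes, here is a minimal plain-Python sketch of TF-IDF weighting and cosine similarity. It follows the smoothed IDF formula and L2 normalization that scikit-learn documents as the `TfidfVectorizer` defaults, but tokenization is simplified to whitespace splitting, so it is an illustration rather than a drop-in replacement. The mini-corpus and the helper names `tfidf_vectors`, `weigh`, and `cosine` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical mini-corpus standing in for the Processed_Symptoms column.
documents = {
    "Fungal Infection": "itching skin_rash nodal_skin_eruptions",
    "Allergy": "sneezing itching watering_eyes",
    "Migraine": "headache nausea blurred_vision",
}

def weigh(tokens, idf):
    """Weight term counts by IDF and scale the vector to unit length."""
    counts = Counter(token for token in tokens if token in idf)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}

def tfidf_vectors(texts):
    """Return unit-length TF-IDF vectors (as dicts) and the IDF table."""
    tokenized = [text.split() for text in texts]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    return [weigh(tokens, idf) for tokens in tokenized], idf

def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())

doc_vectors, idf = tfidf_vectors(documents.values())
query = weigh("itching sneezing".split(), idf)
scores = {name: cosine(query, vec) for name, vec in zip(documents, doc_vectors)}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Note that a symptom written with underscores, such as `skin_rash`, is a single term here, so a patient typing "skin rash" would not match it unless the underscores are normalized first.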

NOTE:  We return up to 5 possible diagnoses.  Many symptoms overlap across numerous diagnoses, so for now our app merely suggests some possibilities.  A future model will be more precise, but more data is needed to reach that kind of precision.  Given these challenges and the short time we had to develop this application, we added a patient feedback loop to our Gradio application called flagging.  This allows the patient to tell us whether the proposed diagnoses are "Correct", "Incorrect", or "Needs Improvement".

In [11]:
# Import the tools needed to build and run the gradio interface
import os
import gradio as gr
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Reuses disease_data, extract_keywords(), and expand_keywords() defined in the cells above.
# (Redefining extract_keywords here would silently override the NLTK-based version.)


In [12]:
# Version of gradio
print(gr.__version__)

5.16.0


In [13]:
# Breaking this down into smaller segments.  This piece suggests diagnoses.
def suggest_diagnosis_tfidf(patient_feedback):
    print("Patient Feedback:", patient_feedback)
    keywords = extract_keywords(patient_feedback)
    print("Keywords:", keywords)
    expanded_keywords = expand_keywords(keywords)
    print("Expanded Keywords:", expanded_keywords)

    processed_feedback = ' '.join(expanded_keywords)
    # The vectorizer is refit on every call; for a large dataset, fit it once outside the function.
    vectorizer = TfidfVectorizer()
    symptom_matrix = vectorizer.fit_transform(disease_data['Processed_Symptoms'])
    feedback_vector = vectorizer.transform([processed_feedback])
    similarities = cosine_similarity(feedback_vector, symptom_matrix)[0]
    sorted_indices = similarities.argsort()[::-1]  # most similar rows first
    print("Sorted Indices:", sorted_indices)

    possible_diagnoses = []
    added_diseases = set()  # To track added diagnoses and avoid duplicates
    for idx in sorted_indices:
        if similarities[idx] <= 0:
            break  # the remaining rows share no terms with the patient input
        disease_name = disease_data.iloc[idx]['Disease']  # positional lookup matches the matrix row order
        if disease_name not in added_diseases:
            possible_diagnoses.append(disease_name)
            added_diseases.add(disease_name)

    if not possible_diagnoses:
        possible_diagnoses = ["Unable to determine a diagnosis based on the provided information."]

    print("Possible Diagnoses:", possible_diagnoses[:5])
    return possible_diagnoses[:5]


In [14]:
# Check that the flagging directory exists and is accessible
custom_flagged_dir = 'custom_flagged_data'
os.makedirs(custom_flagged_dir, exist_ok=True)
print(f"Flagged data should be saved in: {os.path.abspath(custom_flagged_dir)}")


Flagged data should be saved in: /content/custom_flagged_data
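
Once patients start flagging results, the saved feedback can be summarized to see how often the suggestions were judged correct. The sketch below tallies flag labels from a CSV log. It assumes the log has a column named `flag`; the actual file name and column layout depend on the Gradio version, so check the file in the flagging directory first. The sample log contents and the helper `tally_flags` are hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical log contents; the real file and its columns depend on the Gradio version.
sample_log = """input,output,flag,timestamp
"fever and headache","['Malaria', 'Typhoid']",Correct,2025-02-14 10:00:00
"itching","['Allergy']",Incorrect,2025-02-14 10:05:00
"skin rash","['Fungal Infection']",Correct,2025-02-14 10:09:00
"""

def tally_flags(log_file, flag_column="flag"):
    """Count how often each flag label appears in a CSV log."""
    reader = csv.DictReader(log_file)
    return Counter(row[flag_column] for row in reader if row.get(flag_column))

flag_counts = tally_flags(io.StringIO(sample_log))
print(dict(flag_counts))
```

To run this against a real log, open the CSV file from the flagging directory and pass the file object to `tally_flags` in place of the `io.StringIO` sample.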


In [15]:
# Run gradio interface
iface = gr.Interface(
    fn=suggest_diagnosis_tfidf,
    inputs=gr.Textbox(lines=5, placeholder="Describe your symptoms..."),
    outputs="text",
    title="Symptom Checker",
    description="Enter your symptoms, and we'll suggest possible diagnoses.",
    flagging_options=["Correct", "Incorrect", "Needs Improvement"],
    flagging_dir='custom_flagged_data'
)

iface.launch(share=True)


Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://0ac621b911f5e13063.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




**For Future Development**
Future development will expand the application to include treatment recommendations for the suggested diagnoses.

