**1. Data Preprocessing:**

This section focuses on preparing the ENT symptoms dataset for machine learning analysis. It involves the following steps:

1. **Loading the Dataset:**
   - The code reads the ENT symptoms dataset from a CSV file named "ENT symptoms dataset - no_dups_ent.csv" using the pandas library.
   - The dataset is loaded into a pandas DataFrame called `df`.

2. **Handling Missing Values and Merging Symptoms:**
   - The code identifies the columns containing symptom information: 'Common Symptoms', 'Patient Reported Symptoms', and 'Additional Symptoms'.
   - Any missing values (NaNs) in these columns are replaced with empty strings to avoid errors during text processing.
   - All symptom columns are merged into a single column named "All Symptoms" to combine all symptom information for each record into a single text string.

3. **Filtering Diseases with Limited Samples:**
   - The code filters out diseases that have fewer than 10 samples in the dataset. This is done to ensure sufficient data for reliable model training and to prevent overfitting.
   - The filtered DataFrame is stored in the variable `df_filtered`.

4. **Checking Remaining Diseases:**
   - The code prints the frequency of each remaining disease in the filtered dataset using `value_counts()`.
   - This provides an overview of the distribution of diseases after filtering, helping to understand the dataset's composition.

This preprocessing prepares the data for subsequent machine learning tasks, such as disease prediction, by cleaning, organizing, and filtering the symptom information.

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from sklearn.metrics import classification_report, f1_score

# Load dataset
df = pd.read_csv("ENT symptoms dataset - no_dups_ent.csv")

# Define symptom columns explicitly
symptom_cols = ['Common Symptoms', 'Patient Reported Symptoms', 'Additional Symptoms']

# Handle NaNs and merge symptoms into single text column
df[symptom_cols] = df[symptom_cols].fillna('')
df['All Symptoms'] = df[symptom_cols].agg(' '.join, axis=1)

# Filter out diseases with fewer than 10 samples
df_filtered = df.groupby('Disease Name').filter(lambda x: len(x) >= 10)

# Check the remaining diseases
print(df_filtered['Disease Name'].value_counts())

Disease Name
Acoustic Neuroma                               20
Adenoiditis                                    20
Allergic Rhinitis                              20
Otitis Externa (Swimmer's Ear)                 20
Acute Otitis Media                             20
Ménière’s Disease                              19
Presbycusis (Age-related Hearing Loss)         18
Cholesteatoma                                  18
Benign Paroxysmal Positional Vertigo (BPPV)    17
Eustachian Tube Dysfunction                    17
Acoustic Neuroma (Vestibular Schwannoma)       17
Otitis Media with Effusion                     17
Sudden Sensorineural Hearing Loss              17
Labyrinthitis                                  16
Ramsay Hunt Syndrome                           16
Hyperacusis                                    15
Otosclerosis                                   15
Tinnitus                                       15
Chronic Otitis Media                           15
Ear Barotrauma                       

**2. Experimenting Model Accuracy with ClinicalBERT model**

This section of code prepares the data for training a machine learning model to predict diseases based on patient symptoms. It imports the necessary libraries, transforms disease names into numerical labels, and divides the data into training and testing sets. The use of stratification ensures that the model is trained and evaluated on a balanced representation of different diseases, potentially improving its performance and generalizability.

In [None]:
import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import BertTokenizer, BertModel, BertForSequenceClassification
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, f1_score
import numpy as np

# Encode labels
le = LabelEncoder()
df_filtered['label'] = le.fit_transform(df_filtered['Disease Name'])

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    df_filtered['All Symptoms'],
    df_filtered['label'],
    test_size=0.2,
    random_state=42,
    stratify=df_filtered['label']
)

**Loading ClinicalBERT Tokenizer and Model**

These two lines are essential for setting up the ClinicalBERT model for use. They load both the tokenizer, which prepares the text data, and the pre-trained model itself, which is used for generating insights from the text. By using a pre-trained model, the code leverages existing knowledge and avoids the need to train a new model from scratch.

In [None]:
# Load ClinicalBERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('emilyalsentzer/Bio_ClinicalBERT')
bert_model = BertModel.from_pretrained('emilyalsentzer/Bio_ClinicalBERT')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/385 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/436M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

This function prepares the text data for ClinicalBERT by breaking it down, adding padding, and converting it into a suitable format for the model. This preprocessed data is crucial for the model to understand and analyze the text for disease prediction.

In [10]:
# Tokenize function
def tokenize(texts, tokenizer, max_length=128):
    return tokenizer(
        texts.tolist(),
        padding='max_length',
        truncation=True,
        max_length=max_length,
        return_tensors='pt'
    )

This code snippet takes the training data (X_train), processes it using the tokenize function, feeds it to the ClinicalBERT model (bert_model), and finally extracts the generated text embeddings (train_embeddings). These embeddings will be used for training a machine learning model, for example, a classifier to identify disease based on symptoms description.

In [8]:
# Generate embeddings for training data
with torch.no_grad():
    train_tokens = tokenize(X_train, tokenizer)
    train_outputs = bert_model(**train_tokens)
    train_embeddings = train_outputs.pooler_output.detach().numpy()

This code snippet uses SMOTE to balance the dataset by creating synthetic samples for the minority classes. This helps in training a more robust and unbiased machine learning model that can better predict diseases even if they have limited representation in the original dataset.

In [9]:
# Apply SMOTE to balance the embeddings
smote = SMOTE(random_state=42)
train_embeddings_resampled, y_train_resampled = smote.fit_resample(train_embeddings, y_train)

# Ensure these variables are explicitly named and exist:
print(f"train_embeddings_resampled shape: {train_embeddings_resampled.shape}")
print(f"y_train_resampled shape: {y_train_resampled.shape}")

train_embeddings_resampled shape: (368, 768)
y_train_resampled shape: (368,)


This class structures your symptom embeddings and their associated disease labels into a format that PyTorch can efficiently work with during model training. By implementing these three methods (__init__, __len__, __getitem__), you've created a custom dataset that seamlessly integrates with PyTorch's data loading mechanisms.

In [11]:
# Define PyTorch dataset for embeddings
class SymptomEmbeddingDataset(Dataset):
    def __init__(self, embeddings, labels):
        self.embeddings = torch.tensor(embeddings, dtype=torch.float32)
        self.labels = torch.tensor(labels, dtype=torch.long)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            'embeddings': self.embeddings[idx],
            'labels': self.labels[idx]
        }

These two lines prepare your training data for use with a PyTorch model. The SymptomEmbeddingDataset structures the data, while the DataLoader manages how the data is fed to the model during training, including batching and shuffling. This setup is crucial for efficiently training your machine learning model to predict diseases based on symptom embeddings.

In [12]:
# Create Dataset and DataLoader for training
train_dataset = SymptomEmbeddingDataset(train_embeddings_resampled, y_train_resampled)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

The EmbeddingClassifier class defines a neural network model that takes symptom embeddings as input and uses a series of linear layers, activation functions, and dropout to classify the input into different disease categories. It's designed to work with pre-trained embeddings like those generated by ClinicalBERT.

In [13]:
# Define PyTorch model using embeddings
class EmbeddingClassifier(torch.nn.Module):
    def __init__(self, input_dim=768, num_classes=len(y_train.unique())):
        super().__init__()
        self.classifier = torch.nn.Sequential(
            torch.nn.Linear(input_dim, 256),
            torch.nn.ReLU(),
            torch.nn.Dropout(0.2),
            torch.nn.Linear(256, num_classes)
        )

    def forward(self, embeddings):
        return self.classifier(embeddings)

**Optimizer and Training Mode**

1. **optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)**: This line creates an optimizer called AdamW, which is an algorithm used to update the model's parameters during training to minimize the loss function.

2. **model.parameters()**: Provides the model's learnable parameters to the optimizer.
lr=1e-4: Sets the learning rate, a hyperparameter that controls how much the model's parameters are adjusted during each update step.
model.train():

This line sets the model to training mode. In PyTorch, models have different behaviors during training and evaluation. This ensures that operations like dropout are activated during training.

In [17]:
model = EmbeddingClassifier(input_dim=768, num_classes=len(df['label'].unique()))
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Train the model (short training example)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# train model
model.train()

EmbeddingClassifier(
  (classifier): Sequential(
    (0): Linear(in_features=768, out_features=256, bias=True)
    (1): ReLU()
    (2): Dropout(p=0.2, inplace=False)
    (3): Linear(in_features=256, out_features=106, bias=True)
  )
)

**Training Loop Explanation**

This training loop iteratively feeds data to the model, calculates the error in its predictions, and adjusts the model's parameters to minimize the error over time. This process is repeated for the specified number of epochs, aiming to create a model that accurately predicts diseases based on patient symptoms.

In [19]:
epochs = 3
for epoch in range(epochs):
    total_loss = 0
    for batch in train_loader:
        optimizer.zero_grad()
        embeddings = batch['embeddings'].to(device)
        labels = batch['labels'].to(device)
        logits = model(embeddings)
        loss = torch.nn.CrossEntropyLoss()(logits, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    avg_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch+1}/{epochs}, Loss: {avg_loss:.4f}")

Epoch 1/3, Loss: 4.1050
Epoch 2/3, Loss: 3.6808
Epoch 3/3, Loss: 3.3554


**Preparing Test Data for Evaluation with ClinicalBERT**

This section prepares the test data for evaluation by:

1. **Generating embeddings**: Using ClinicalBERT to transform the test symptom text into numerical representations.
2. **Creating dataset and dataloader**: Organizing the embeddings and labels into a format that PyTorch can easily handle during evaluation.
This organized test data is then used to assess the performance of the trained model by comparing its predictions against the true disease labels.

In [20]:
# For evaluation, you must embed test data similarly
with torch.no_grad():
    test_tokens = tokenize(X_test, tokenizer)
    test_outputs = bert_model(**test_tokens)
    test_embeddings = test_outputs.pooler_output.detach().numpy()

test_dataset = SymptomEmbeddingDataset(test_embeddings, y_test.values)
test_loader = DataLoader(test_dataset, batch_size=32)

**Evaluation of the Model**

This code snippet efficiently uses the trained model to predict diseases for the test data, storing the actual labels and the model's predictions for later analysis and performance evaluation using metrics like classification reports and F1-score.


In [21]:
# Evaluate
model.eval()
y_true, y_pred = [], []

with torch.no_grad():
    for batch in test_loader:
        inputs = batch['embeddings'].to(device)
        labels = batch['labels'].cpu().numpy()
        outputs = model(inputs)
        preds = torch.argmax(outputs, dim=1).cpu().numpy()
        y_true.extend(labels)
        y_pred.extend(preds)

**Evaluating the Model's Performance**

This section assesses the effectiveness of the disease prediction model by providing detailed performance metrics for each disease and a summarized score for overall performance.

In [22]:
# Metrics
print(classification_report(y_true, y_pred, target_names=le.classes_))
print("Macro F1-score:", f1_score(y_true, y_pred, average='macro'))

                                             precision    recall  f1-score   support

                           Acoustic Neuroma       0.00      0.00      0.00         4
   Acoustic Neuroma (Vestibular Schwannoma)       0.00      0.00      0.00         3
                         Acute Otitis Media       0.00      0.00      0.00         4
                                Adenoiditis       0.00      0.00      0.00         4
                          Allergic Rhinitis       0.00      0.00      0.00         4
               Autoimmune Inner Ear Disease       0.13      1.00      0.23         3
Benign Paroxysmal Positional Vertigo (BPPV)       0.00      0.00      0.00         3
                              Cholesteatoma       0.00      0.00      0.00         4
                       Chronic Otitis Media       0.00      0.00      0.00         3
                             Ear Barotrauma       0.00      0.00      0.00         3
                Eustachian Tube Dysfunction       0.00      0.00

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Evaluation metrics reveal this model's limitations for symptom prediction and classification. It achieved a precision of 2%, meaning only 2% of predicted symptoms were correct. The recall of 13% indicates it only identified 13% of actual symptoms. The overall macro F1-score, a balanced measure of precision and recall, was only 3%. These results indicate the model is not suitable for this task.

**3. Experimenting Model Accuracy with Random Classifier model**

**Encoding Disease Labels**

This code takes a column of `Disease names` (text) and converts it into a column of corresponding numerical labels, making it suitable for use in machine learning models. For example, if you had diseases like 'Flu', 'Cold', and 'Allergy', they might be encoded as 0, 1, and 2, respectively.


In [23]:
# Encode labels
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df_filtered['Disease Label'] = le.fit_transform(df_filtered['Disease Name'])

**TF-IDF Vectorization**

This code takes the textual symptom descriptions, converts them into numerical representations using TF-IDF, and prepares them to be used as input (X_vectors) for a machine learning model, along with the corresponding disease labels (y) as the target variable.

In [24]:
# TF-IDF Vectorization
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=1000)
X_vectors = vectorizer.fit_transform(df_filtered['All Symptoms']).toarray()
y = df_filtered['Disease Label']


**Train-Test Split**

This code snippet prepares your data for model training and evaluation by dividing it into training and testing sets. The use of stratify ensures a balanced representation of diseases across these sets, crucial for reliable model assessment. By separating the data, we can train the model on one portion and then assess its performance on unseen data, giving us an indication of its real-world effectiveness.

In [25]:
# Train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X_vectors, y, test_size=0.2, random_state=42, stratify=y
)


**SMOTE for Data Balancing**

These lines use SMOTE to balance your training data by creating synthetic samples for the minority classes, making your model less biased and potentially improving its performance on the under-represented diseases.

In [26]:
# Apply SMOTE
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

**Training a Random Forest Classifier**

This code section creates a Random Forest model, configures it with 100 decision trees and a fixed random state, and then trains the model on the preprocessed and balanced training data to learn the relationships between symptoms and diseases. After this step, the model (clf) is ready to make predictions on new, unseen symptom data. I hope this helps to explain the process!

In [27]:
# Train RandomForest Classifier
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train_resampled, y_train_resampled)

**Evaluating the Random Forest Model**

This code uses the trained Random Forest model (clf) to predict diseases for the test data (X_test) and stores these predictions in y_pred. It then imports functions that will be used to evaluate the quality of these predictions by comparing them to the actual disease labels (y_test, which was not used during training). The evaluation itself is likely done in subsequent code using the imported classification_report and f1_score functions.

In [28]:
# Evaluate model
from sklearn.metrics import classification_report, f1_score
y_pred = clf.predict(X_test)

**Evaluating the Random Forest Model's Performance**

This code snippet evaluates the Random Forest model by:

1. Generating a detailed classification report showing precision, recall, F1-score, and support for each disease.
2. Calculating the macro F1-score, providing an overall performance metric that treats all diseases equally.

These metrics are essential for understanding the model's strengths and weaknesses and for making informed decisions about its deployment and further improvement.

In [29]:
# Metrics
print(classification_report(y_test, y_pred, target_names=le.classes_))
print("Macro F1-score:", f1_score(y_test, y_pred, average='macro'))


                                             precision    recall  f1-score   support

                           Acoustic Neuroma       0.80      1.00      0.89         4
   Acoustic Neuroma (Vestibular Schwannoma)       1.00      1.00      1.00         3
                         Acute Otitis Media       1.00      1.00      1.00         4
                                Adenoiditis       0.57      1.00      0.73         4
                          Allergic Rhinitis       1.00      0.25      0.40         4
               Autoimmune Inner Ear Disease       1.00      1.00      1.00         3
Benign Paroxysmal Positional Vertigo (BPPV)       1.00      1.00      1.00         3
                              Cholesteatoma       1.00      0.50      0.67         4
                       Chronic Otitis Media       0.75      1.00      0.86         3
                             Ear Barotrauma       1.00      1.00      1.00         3
                Eustachian Tube Dysfunction       1.00      1.00

While the model demonstrates strong performance for several ENT conditions, overall performance is mixed.

**Strengths**:

High Precision and Recall for Many Conditions: The model achieved perfect or near-perfect precision and recall (and therefore F1-score) for a significant number of ENT conditions, including Acoustic Neuroma (Vestibular Schwannoma), Acute Otitis Media, Autoimmune Inner Ear Disease, Benign Paroxysmal Positional Vertigo (BPPV), Ear Barotrauma, Eustachian Tube Dysfunction, Hyperacusis, Ménière’s Disease, Otosclerosis, Presbycusis, Ramsay Hunt Syndrome, Sudden Sensorineural Hearing Loss, and Tinnitus. This indicates strong and reliable prediction for these specific conditions.

**Weaknesses:**

Lower Performance on Certain Conditions: For conditions like Allergic Rhinitis,

Cholesteatoma, Labyrinthitis, Mastoiditis, and Otitis Externa, the model showed lower precision, recall, or both. Notably, Allergic Rhinitis had a precision of 1.00 but a recall of only 0.25, indicating that while all predicted cases were correct, the model only identified 25% of actual cases. Cholesteatoma also had relatively lower performance with a precision of 1.00 and recall of 0.50.

Limited Support for Some Conditions: The support values (number of actual cases in the test set) are relatively low for some conditions, which may affect the reliability of the metrics. For example, Mastoiditis only had a support of 2.
Overall:

The model exhibits promising performance for many ENT conditions but requires improvement for others, particularly Allergic Rhinitis and Cholesteatoma. Further investigation and potential model adjustments are needed to enhance overall performance and ensure reliable predictions across all target conditions."

**Important Considerations:**

Macro F1-score: While not explicitly provided, it would be helpful to calculate the macro F1-score to get a balanced measure of the model's performance across all conditions, especially considering the varying support values.

Data Imbalance: The differing support values indicate potential class imbalance in the dataset. Addressing this imbalance through techniques like oversampling or using weighted loss functions during training could improve performance on under-represented conditions.

Error Analysis: Examining specific cases where the model made incorrect predictions would provide insights for further refinement and improvement.

Let's persist the model and start testing it out

In [30]:
import joblib

# Assuming you already have clf, vectorizer, and le from your training code:
joblib.dump(clf, 'ent_symptom_model.pkl')
joblib.dump(vectorizer, 'vectorizer.pkl')
joblib.dump(le, 'label_encoder.pkl')

['label_encoder.pkl']

In [31]:
def predict_diseases(symptom_texts):
    """
    Accepts either a single symptom string or a list of symptom strings.
    Returns the predicted disease(s).
    """
    # If a single string is passed, convert it into a list
    if isinstance(symptom_texts, str):
        symptom_texts = [symptom_texts]

    # Transform the list of symptom strings using the vectorizer
    text_vectorized = vectorizer.transform(symptom_texts)
    # Get predictions for each text entry
    predictions = clf.predict(text_vectorized)
    # Convert numerical labels back to disease names
    predicted_diseases = le.inverse_transform(predictions)
    return predicted_diseases

In [32]:
# Test multiple symptom cases:
symptom_cases = [
    "I have severe ear pain and dizziness",
    "I experience a high fever and persistent cough",
    "I have headache and nausea"
]

predicted_diseases = predict_diseases(symptom_cases)
for symptoms, disease in zip(symptom_cases, predicted_diseases):
    print(f"Symptoms: {symptoms} -> Predicted Disease: {disease}")

Symptoms: I have severe ear pain and dizziness -> Predicted Disease: Mastoiditis
Symptoms: I experience a high fever and persistent cough -> Predicted Disease: Adenoiditis
Symptoms: I have headache and nausea -> Predicted Disease: Adenoiditis


**Conclusion:**

The model's predictions demonstrate its ability to associate symptom descriptions with potential ENT conditions. However, the accuracy of these predictions needs further validation.

**Case 1**: "I have severe ear pain and dizziness" -> Mastoiditis: This prediction aligns with clinical knowledge, as severe ear pain and dizziness are common symptoms of Mastoiditis.

**Case 2:** "I experience a high fever and persistent cough" -> Adenoiditis: This prediction might require further scrutiny. While a fever can be present in Adenoiditis, a persistent cough is more characteristic of other respiratory infections.

**Case 3:** "I have headache and nausea" -> Adenoiditis: This prediction could also benefit from further evaluation. While headaches can occur with Adenoiditis, nausea is not a typical symptom.