<a href="https://colab.research.google.com/github/jahnavi-2116/AI-ML-projects/blob/main/LSTM_text_assessment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
! pip install torch
! pip install gradio

Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch)
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-curand-cu12==10.3.5.147 (from torch)
  Downloading nvidia_curand_cu12-10.3.5

In [None]:
! unzip /content/archive.zip -d /content

Archive:  /content/archive.zip
  inflating: /content/Skin-Disease-Text-Data/Acne/problem_description_100.txt  
  inflating: /content/Skin-Disease-Text-Data/Acne/problem_description_108.txt  
  inflating: /content/Skin-Disease-Text-Data/Acne/problem_description_113.txt  
  inflating: /content/Skin-Disease-Text-Data/Acne/problem_description_142.txt  
  inflating: /content/Skin-Disease-Text-Data/Acne/problem_description_25.txt  
  inflating: /content/Skin-Disease-Text-Data/Acne/problem_description_33.txt  
  inflating: /content/Skin-Disease-Text-Data/Acne/problem_description_46.txt  
  inflating: /content/Skin-Disease-Text-Data/Acne/problem_description_68.txt  
  inflating: /content/Skin-Disease-Text-Data/Acne/problem_description_72.txt  
  inflating: /content/Skin-Disease-Text-Data/Acne/problem_description_79.txt  
  inflating: /content/Skin-Disease-Text-Data/Acne/problem_description_80.txt  
  inflating: /content/Skin-Disease-Text-Data/Athlete Foot (Tinea Pedis)/problem_description_10.t

In [None]:
# Importing necessary libraries
import torch
from torch import nn as nn
import gradio as gr
import pandas as pd
import numpy as np
import os # Import the os module

# Setting the path to the actual data folder containing subfolders for each class
data_folder = "/content/Skin-Disease-Text-Data"

# Initializing containers for text data and labels
texts = []
labels = []

# Walk through the class folders and collect the content of each .txt file
for class_folder in os.listdir(data_folder):
    class_path = os.path.join(data_folder, class_folder)
    if os.path.isdir(class_path):
        for file_name in os.listdir(class_path):
            file_path = os.path.join(class_path, file_name)
            if file_path.endswith(".txt"):
                with open(file_path, "r", encoding="utf-8") as file:
                    text = file.read().strip()
                    texts.append(text)
                    labels.append(class_folder)

In [None]:
# Creating a dataframe
df = pd.DataFrame({"Text": texts, "Label": labels})
df.head()


Unnamed: 0,Text,Label
0,"Psoriasis\t""My psoriasis is all over my body, ...",Psoriasis
1,"Psoriasis\t""My psoriasis is on my genitals, an...",Psoriasis
2,"Psoriasis\t""I have these red, itchy patches on...",Psoriasis
3,"Psoriasis\t""I have these thick, red patches on...",Psoriasis
4,"Psoriasis\t""My skin is itchy and sometimes hur...",Psoriasis


I prepared a skin disease text dataset for use in an LSTM/GRU NLP project. After uploading and unzipping the dataset in Colab, I found that it contained subfolders named after different skin diseases, each with `.txt` files holding text samples. I then traversed these folders, read each file, and stored its content along with the folder name as the label. These were compiled into a pandas DataFrame with `Text` and `Label` columns. Finally, I displayed the first few rows to confirm that the data was loaded correctly and structured for further preprocessing and model training.
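
The first rows shown above suggest that each file stores the class name, a tab, and then the quoted description, so the label itself is part of the input text. The sketch below shows one way to keep only the description. It assumes that every file follows this `Label<TAB>"description"` layout, which is only visible for the first five rows here; `strip_label_prefix` is a hypothetical helper, not part of the original notebook.

```python
def strip_label_prefix(raw: str) -> str:
    """Return the description part of a 'Label<TAB>"description"' record.

    Assumption: the class name and the description are separated by the
    first tab. Records without a tab are returned with only whitespace
    and surrounding quotes removed.
    """
    _, sep, rest = raw.partition("\t")
    description = rest if sep else raw
    return description.strip().strip('"').strip()


# Illustrative records, not taken from the dataset files
with_prefix = 'Psoriasis\t"My skin has red, itchy patches."'
without_prefix = "My skin has red, itchy patches."

cleaned_with_prefix = strip_label_prefix(with_prefix)
cleaned_without_prefix = strip_label_prefix(without_prefix)
print(cleaned_with_prefix)
print(cleaned_without_prefix)
```

If the assumption holds, the helper could be applied to the `Text` column before tokenization, for example with `df["Text"].apply(strip_label_prefix)`.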


In [None]:
# Preprocessing the data
! pip install keras
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import LabelEncoder
import numpy as np





The content of each file (a skin disease-related description) was stored in a list named `texts`, while the corresponding folder name was stored in another list named `labels`. These two lists were combined into a pandas DataFrame with columns `Text` and `Label`. The raw text is not yet suitable for a neural network, so it was preprocessed. A Keras `Tokenizer` was used to map each word in the training set to a unique integer, with the special token `<OOV>` standing in for words that appear only at test time. Each text was then transformed into a sequence of integers. Since the model expects inputs of a consistent length, the sequences were padded with `pad_sequences` so that all examples have the same size.
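
To make the tokenization and padding steps concrete, here is a small pure-Python sketch of the same idea. It is a simplification and not the Keras implementation: it only lowercases and splits on whitespace, whereas the Keras `Tokenizer` also filters punctuation. The function names `build_word_index`, `texts_to_sequences` and `pad_post` are hypothetical.

```python
def build_word_index(texts, oov_token="<OOV>"):
    """Map words to integers: 1 is the OOV token, 2.. are words by frequency."""
    counts = {}
    for text in texts:
        for word in text.lower().split():
            counts[word] = counts.get(word, 0) + 1
    # Most frequent words first; ties keep their order of first appearance
    ordered = sorted(counts, key=lambda w: -counts[w])
    word_index = {oov_token: 1}
    for i, word in enumerate(ordered, start=2):
        word_index[word] = i
    return word_index


def texts_to_sequences(texts, word_index, oov_token="<OOV>"):
    """Replace each word by its index, unknown words by the OOV index."""
    oov = word_index[oov_token]
    return [[word_index.get(w, oov) for w in t.lower().split()] for t in texts]


def pad_post(sequences, maxlen):
    """Truncate or right-pad every sequence with 0 so all have length maxlen."""
    return [seq[:maxlen] + [0] * (maxlen - len(seq)) for seq in sequences]


train = ["red itchy patches", "itchy skin"]
test = ["itchy blisters"]  # "blisters" never appears in the training texts

word_index = build_word_index(train)
train_padded_demo = pad_post(texts_to_sequences(train, word_index), maxlen=3)
test_padded_demo = pad_post(texts_to_sequences(test, word_index), maxlen=3)

print(word_index)         # {'<OOV>': 1, 'itchy': 2, 'red': 3, 'patches': 4, 'skin': 5}
print(train_padded_demo)  # [[3, 2, 4], [2, 5, 0]]
print(test_padded_demo)   # [[2, 1, 0]]
```

Index 0 is never assigned to a word, which is why it can be used safely as the padding value.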

In [None]:
# Splitting the data
train_texts, test_texts, train_labels, test_labels = train_test_split(df['Text'], df['Label'], test_size=0.2, random_state=42)

# Tokenize text
tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(train_texts)
train_sequences = tokenizer.texts_to_sequences(train_texts)
test_sequences = tokenizer.texts_to_sequences(test_texts)

In [None]:
# Pad sequences
max_len = max(len(seq) for seq in train_sequences)
train_padded = pad_sequences(train_sequences, maxlen=max_len, padding='post', truncating='post')
test_padded = pad_sequences(test_sequences, maxlen=max_len, padding='post', truncating='post')


The LabelEncoder from sklearn was used to convert categorical string labels (like "Eczema" or "Acne") into integer labels. This numerical encoding is essential because the model's output layer needs to match the number of unique class labels, and it can only work with numerical targets.
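
As a small illustration of what the encoder does, the sketch below fits a `LabelEncoder` on a handful of made-up labels. The label list is illustrative only; the class names are assigned integers in sorted order.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels, not the full dataset
demo_labels = ["Eczema", "Acne", "Eczema", "Psoriasis"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(demo_labels)

print(list(encoder.classes_))             # ['Acne', 'Eczema', 'Psoriasis']
print(encoded.tolist())                   # [1, 0, 1, 2]
print(encoder.inverse_transform([2])[0])  # Psoriasis
```

`inverse_transform` is what allows a predicted class ID to be turned back into a disease name later on.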

In [None]:
# Encode labels
label_encoder = LabelEncoder()
train_labels_encoded = label_encoder.fit_transform(train_labels)
test_labels_encoded = label_encoder.transform(test_labels)

train_labels_encoded = np.array(train_labels_encoded)
test_labels_encoded = np.array(test_labels_encoded)


Next, I split the dataset into training and testing sets using an 80-20 ratio so that the model can be evaluated on data it has not seen. To make the text compatible with neural networks, I applied Keras's `Tokenizer`, which converts words into integer sequences and maps out-of-vocabulary words to the special token `<OOV>`. Since neural networks require uniform input lengths, I padded all sequences to the length of the longest sequence in the training data; longer test sequences are truncated to that length. I then encoded the categorical labels (disease names) as integers using `LabelEncoder`, so that the model can use them as targets. After these steps the dataset is numerically formatted and ready for input into an LSTM or GRU model.
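
The split used here is not stratified, so with only a few samples per class the test set may contain some classes much more often than others. A possible variant, sketched below on synthetic placeholder data with the same shape as this dataset (13 classes with 11 samples each), passes `stratify=` to `train_test_split`. The texts and class names in the sketch are made up.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 13 classes x 11 samples
class_names = [f"class_{i}" for i in range(13)]
demo_labels = [name for name in class_names for _ in range(11)]
demo_texts = [f"sample {i}" for i in range(len(demo_labels))]

s_train_texts, s_test_texts, s_train_labels, s_test_labels = train_test_split(
    demo_texts,
    demo_labels,
    test_size=0.2,
    random_state=42,
    stratify=demo_labels,  # keep class proportions equal in both splits
)

test_counts = Counter(s_test_labels)
print("train size:", len(s_train_texts), "test size:", len(s_test_texts))
print("test samples per class:", sorted(test_counts.values()))
```

With stratification every class is represented in both splits, which makes the test accuracy easier to interpret.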

In [None]:
# Importing necessary libraries
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Define vocabulary size (entries in word_index, which already includes <OOV>, + 1 for the padding index 0)
vocab_size = len(tokenizer.word_index) + 1
embedding_dim = 64
hidden_size = 128
num_layers = 2
output_size = len(label_encoder.classes_)


In [None]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# Define the LSTM model
class LSTMTextClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_size, num_layers, output_size):
        super(LSTMTextClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(input_size=embedding_dim,
                            hidden_size=hidden_size,
                            num_layers=num_layers,
                            batch_first=True)
        self.dropout = nn.Dropout(0.2) # Add dropout layer
        self.fc = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, x):
        embedded = self.embedding(x)
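        # Zero initial states, sized from the layer itself rather than from global variables
        h0 = torch.zeros(self.lstm.num_layers, x.size(0), self.lstm.hidden_size, device=x.device)
        c0 = torch.zeros(self.lstm.num_layers, x.size(0), self.lstm.hidden_size, device=x.device)

        out, _ = self.lstm(embedded, (h0, c0))
        # Note: with padding='post', the last time step of a shorter sequence is a padding token
        out = self.dropout(out[:, -1, :])  # Apply dropout to the last time-step output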
        out = self.fc(out)
        return self.softmax(out)

model = LSTMTextClassifier(vocab_size, embedding_dim, hidden_size, num_layers, output_size)

The LSTM model was defined using PyTorch's nn.Module. It includes an embedding layer to convert token IDs into dense vectors, an LSTM layer to capture sequential patterns in the text, a dropout layer (with probability 0.2) to reduce overfitting, and a fully connected layer followed by LogSoftmax, which produces log-probabilities over the possible disease classes. This structure allows the model to learn contextual features from the text and make multi-class predictions.
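
Because the sequences are padded with `padding='post'`, taking `out[:, -1, :]` reads the LSTM output at a padding position for every text shorter than `max_len`. One possible alternative, sketched below, packs the sequences so that the final hidden state corresponds to the last real token. This is a hypothetical variant and not the model trained in this notebook; it assumes that index 0 is used only for padding, and `PackedLSTMClassifier` is a made-up name.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence


class PackedLSTMClassifier(nn.Module):
    """Hypothetical variant that classifies from the last non-padding token."""

    def __init__(self, vocab_size, embedding_dim, hidden_size, num_layers, output_size):
        super().__init__()
        # padding_idx=0 keeps the padding embedding fixed at zero
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_size, num_layers, batch_first=True)
        self.dropout = nn.Dropout(0.2)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # Number of real tokens per row (assumes 0 is used only for padding)
        lengths = (x != 0).sum(dim=1).clamp(min=1).cpu()
        packed = pack_padded_sequence(
            self.embedding(x), lengths, batch_first=True, enforce_sorted=False
        )
        _, (h_n, _) = self.lstm(packed)
        # h_n[-1] is the top layer's hidden state at each sequence's last real token
        out = self.dropout(h_n[-1])
        return torch.log_softmax(self.fc(out), dim=1)


torch.manual_seed(0)
demo_model = PackedLSTMClassifier(
    vocab_size=20, embedding_dim=8, hidden_size=16, num_layers=2, output_size=13
)
demo_model.eval()  # disable dropout for a deterministic comparison

short_pad = torch.tensor([[5, 3, 2, 0, 0]])
long_pad = torch.tensor([[5, 3, 2, 0, 0, 0, 0, 0]])
with torch.no_grad():
    out_short = demo_model(short_pad)
    out_long = demo_model(long_pad)

# The amount of padding no longer influences the prediction
print(torch.allclose(out_short, out_long, atol=1e-6))
```

The comparison at the end feeds the same three tokens with different amounts of padding and checks that the output stays the same.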

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

# Define loss function and optimizer
criterion = nn.NLLLoss()  # Since we're using LogSoftmax
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Define device (CPU or GPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Training loop setup
num_epochs = 40  # You can increase this for better performance

# Define a Dataset that wraps the padded sequences and encoded labels
class TextDataset(Dataset):
    def __init__(self, sequences, labels):
        self.sequences = torch.tensor(sequences, dtype=torch.long)
        self.labels = torch.tensor(labels, dtype=torch.long)

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        return self.sequences[idx], self.labels[idx]

train_dataset = TextDataset(train_padded, train_labels_encoded)
test_dataset = TextDataset(test_padded, test_labels_encoded)

# Create DataLoader instances
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32)

# Training the model
for epoch in range(num_epochs):
    model.train()  # Set model to training mode
    running_loss = 0.0
    correct = 0
    total = 0

    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)

        # Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, labels)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Track performance
        running_loss += loss.item()
        _, predicted = torch.max(outputs, 1)
        correct += (predicted == labels).sum().item()
        total += labels.size(0)

    epoch_loss = running_loss / len(train_loader)
    epoch_acc = correct / total * 100

    print(f"Epoch {epoch+1}/{num_epochs} - Loss: {epoch_loss:.4f}, Accuracy: {epoch_acc:.2f}%")

Epoch 1/40 - Loss: 2.5839, Accuracy: 7.89%
Epoch 2/40 - Loss: 2.5661, Accuracy: 7.02%
Epoch 3/40 - Loss: 2.5579, Accuracy: 7.02%
Epoch 4/40 - Loss: 2.5650, Accuracy: 10.53%
Epoch 5/40 - Loss: 2.5616, Accuracy: 8.77%
Epoch 6/40 - Loss: 2.5524, Accuracy: 8.77%
Epoch 7/40 - Loss: 2.5625, Accuracy: 10.53%
Epoch 8/40 - Loss: 2.5580, Accuracy: 14.91%
Epoch 9/40 - Loss: 2.5494, Accuracy: 12.28%
Epoch 10/40 - Loss: 2.5520, Accuracy: 13.16%
Epoch 11/40 - Loss: 2.5572, Accuracy: 12.28%
Epoch 12/40 - Loss: 2.5577, Accuracy: 12.28%
Epoch 13/40 - Loss: 2.5395, Accuracy: 13.16%
Epoch 14/40 - Loss: 2.5385, Accuracy: 12.28%
Epoch 15/40 - Loss: 2.5451, Accuracy: 13.16%
Epoch 16/40 - Loss: 2.5334, Accuracy: 11.40%
Epoch 17/40 - Loss: 2.5026, Accuracy: 12.28%
Epoch 18/40 - Loss: 2.5135, Accuracy: 9.65%
Epoch 19/40 - Loss: 2.5068, Accuracy: 10.53%
Epoch 20/40 - Loss: 2.4986, Accuracy: 12.28%
Epoch 21/40 - Loss: 2.4785, Accuracy: 13.16%
Epoch 22/40 - Loss: 2.4850, Accuracy: 12.28%
Epoch 23/40 - Loss: 2.479

The model was instantiated using the previously defined parameters and moved to the appropriate device (GPU if available, otherwise CPU). NLLLoss was chosen as the loss function because the model applies LogSoftmax at the output layer, and NLLLoss expects log-probabilities as input. The Adam optimizer was selected for training due to its adaptive learning rate capabilities.
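
The short check below illustrates two points about this loss setup. First, LogSoftmax followed by NLLLoss gives the same value as CrossEntropyLoss applied to the raw scores. Second, a model that spreads its probability uniformly over 13 classes has a loss of ln(13), about 2.565, which is useful as a reference when reading the training log above, where the loss stays between roughly 2.48 and 2.58. The logits and targets in the sketch are random placeholders.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)
num_classes = 13
logits = torch.randn(4, num_classes)    # placeholder scores for 4 samples
targets = torch.tensor([0, 3, 7, 12])   # placeholder class IDs

# LogSoftmax + NLLLoss is equivalent to CrossEntropyLoss on raw logits
nll = nn.NLLLoss()(torch.log_softmax(logits, dim=1), targets)
ce = nn.CrossEntropyLoss()(logits, targets)

# Loss of a model that assigns probability 1/13 to every class
uniform_log_probs = torch.log_softmax(torch.zeros(4, num_classes), dim=1)
uniform_loss = nn.NLLLoss()(uniform_log_probs, targets)
chance_loss = math.log(num_classes)

print(f"NLL: {nll.item():.4f}  CE: {ce.item():.4f}")
print(f"uniform loss: {uniform_loss.item():.4f}  ln(13): {chance_loss:.4f}")
```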

In [None]:
import torch
from torch.utils.data import Dataset, DataLoader

# Set model to evaluation mode
model.eval()
test_loss = 0.0
correct = 0
total = 0

with torch.no_grad():  # No gradient calculation needed
    for inputs, labels in test_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = model(inputs)
        loss = criterion(outputs, labels)

        test_loss += loss.item()
        _, predicted = torch.max(outputs, 1)
        correct += (predicted == labels).sum().item()
        total += labels.size(0)

avg_test_loss = test_loss / len(test_loader)
accuracy = correct / total * 100

print(f"Test Loss: {avg_test_loss:.4f}, Test Accuracy: {accuracy:.2f}%")

Test Loss: 2.6293, Test Accuracy: 0.00%


A custom PyTorch Dataset class was created to hold the padded sequences and their corresponding labels. Each Dataset was wrapped in a DataLoader, which handles batching during training and evaluation. The training loader shuffles the examples every epoch so that the model does not learn from their order.

The model was trained for 40 epochs. For each batch in the training data:
The model made predictions (forward pass).
The loss was calculated and its gradients were computed (backward pass).
The optimizer updated the model's weights.
Accuracy and loss were tracked for each epoch.
This process is meant to let the model iteratively learn patterns in the input text and associate them with specific skin disease categories. After training, the model was switched to evaluation mode. Using the test set, predictions were made and the loss was calculated without updating the weights. Accuracy was computed to assess the model's performance on unseen data: the run above reports a test loss of 2.6293 and a test accuracy of 0.00%.
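
The training and evaluation steps described above share most of their logic, so they can be sketched as a single helper. This is a possible refactoring and not the code used for the run in this notebook: `run_epoch` is a hypothetical function, and the toy linear model and random data only serve to show how it is called.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def run_epoch(model, loader, criterion, optimizer=None, device="cpu"):
    """Run one pass over loader; train if an optimizer is given, else evaluate.

    Returns the mean loss per sample and the accuracy in percent.
    """
    training = optimizer is not None
    model.train(training)
    total_loss, correct, total = 0.0, 0, 0
    with torch.set_grad_enabled(training):
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            if training:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total_loss += loss.item() * labels.size(0)
            correct += (outputs.argmax(dim=1) == labels).sum().item()
            total += labels.size(0)
    return total_loss / total, 100.0 * correct / total


# Toy usage on random, linearly separable data (placeholder for the real loaders)
torch.manual_seed(0)
features = torch.randn(64, 8)
toy_targets = (features[:, 0] > 0).long()
toy_loader = DataLoader(TensorDataset(features, toy_targets), batch_size=16, shuffle=True)
toy_model = nn.Sequential(nn.Linear(8, 2), nn.LogSoftmax(dim=1))
toy_criterion = nn.NLLLoss()
toy_optimizer = torch.optim.Adam(toy_model.parameters(), lr=0.05)

first_loss, _ = run_epoch(toy_model, toy_loader, toy_criterion, toy_optimizer)
for _ in range(20):
    train_loss, train_acc = run_epoch(toy_model, toy_loader, toy_criterion, toy_optimizer)
eval_loss, eval_acc = run_epoch(toy_model, toy_loader, toy_criterion)  # no optimizer: evaluation
print(f"first loss {first_loss:.4f}, last loss {train_loss:.4f}, eval accuracy {eval_acc:.2f}%")
```

Calling the helper on the test loader after every epoch would show whether the test accuracy ever rises above chance during training, rather than only at the end.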

In [None]:
print(df['Label'].value_counts())


Label
Psoriasis                     11
Scabies                       11
Impetigo                      11
Eczema                        11
Folliculitis                  11
Contact Dermatitis            11
Acne                          11
Rosacea                       11
Shingles (Herpes Zoster)      11
Athlete Foot (Tinea Pedis)    11
Vitiligo                      11
Hives (Urticaria)             11
Ringworm (Tinea Corporis)     11
Name: count, dtype: int64


To verify whether the dataset was balanced, the value counts of the `Label` column were inspected. This step shows whether some classes have significantly more examples than others, which could bias the model toward over-predicting those classes. Here all 13 classes have 11 samples each, so the dataset is balanced.
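
The class counts also give a reference point for the accuracy numbers. The sketch below rebuilds a label column with the same shape as the output above (13 classes, 11 samples each, with placeholder class names) and computes the accuracy of always predicting a single class. That baseline is about 7.69%, close to the 7.89% training accuracy logged in the first epoch.

```python
import pandas as pd

# Placeholder label column with the same shape as the real one
class_names = [f"class_{i}" for i in range(13)]
label_column = pd.Series(
    [name for name in class_names for _ in range(11)], name="Label"
)

counts = label_column.value_counts()
is_balanced = bool(counts.min() == counts.max())
# Accuracy obtained by always predicting the most frequent class
majority_baseline = counts.max() / counts.sum() * 100

print("balanced:", is_balanced)
print(f"single-class baseline accuracy: {majority_baseline:.2f}%")
```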



In [None]:
all_preds = []
with torch.no_grad():
    for inputs, _ in test_loader:
        inputs = inputs.to(device)
        outputs = model(inputs)
        _, predicted = torch.max(outputs, 1)
        all_preds.extend(predicted.cpu().numpy())

import numpy as np
unique_preds = np.unique(all_preds)
print("Unique predicted classes:", unique_preds)

Unique predicted classes: [1 3]


The final step checked which classes the model was actually predicting. Here only class IDs 1 and 3 appear among the test predictions, which indicates that the model has not generalized and has collapsed onto a couple of classes. Since every class has the same number of samples, this cannot be explained by class imbalance.
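
A slightly more detailed version of this check counts how often each class is predicted and lists the classes that are never predicted. The sketch below uses a short made-up prediction list in place of `all_preds`; `prediction_histogram` is a hypothetical helper.

```python
import numpy as np


def prediction_histogram(preds, num_classes):
    """Count how often each class ID in range(num_classes) was predicted."""
    return np.bincount(np.asarray(preds, dtype=np.int64), minlength=num_classes)


# Made-up predictions standing in for all_preds
example_preds = [1, 3, 3, 1, 3, 3, 1]
hist = prediction_histogram(example_preds, num_classes=13)
never_predicted = np.flatnonzero(hist == 0)

print("predictions per class:", hist.tolist())
print("classes never predicted:", never_predicted.tolist())
```

With the fitted encoder from this notebook, `label_encoder.inverse_transform` could be used to turn the class IDs back into disease names.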