<a href="https://colab.research.google.com/github/kirmanioussema12/Medical-Disease-Classification/blob/main/Assignment5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### ***Data loading***
This section processes all JSON files in the sample_data/ directory and extracts the relevant information from each one. It first lists all files with a .json extension, then iterates through them, loading each file's content as a dictionary. For each file, it extracts the title and abstract, concatenates the body_text entries into a single string, and retrieves the keywords from the nested pdf_parse structure. If any field is missing, a fallback value is used (an empty string or an empty list). The extracted records are stored as dictionaries in a list called data, which is then converted into a pandas DataFrame.


In [None]:
import json
import os
import pandas as pd

In [None]:
# Directory containing JSON files
data_dir = "sample_data/"

# List all JSON files
json_files = [f for f in os.listdir(data_dir) if f.endswith('.json')]

# Initialize an empty list to store data
data = []

In [None]:
for file in json_files:
    with open(os.path.join(data_dir, file), 'r', encoding='utf-8') as f:
        content = json.load(f)

        # Ensure content is a dictionary
        if isinstance(content, dict):
            # Safely extract fields with fallback values
            title = content.get("title", "")
            abstract = content.get("abstract", "")
            body_text = " ".join(bt['text'] for bt in content.get("body_text", []) if 'text' in bt)
            keywords = content.get("pdf_parse", {}).get("keywords", [])

            # Append extracted data as a dictionary
            data.append({
                "title": title,
                "abstract": abstract,
                "body_text": body_text,
                "keywords": keywords
            })

In [None]:
# Convert to DataFrame
df = pd.DataFrame(data)

In [None]:
df.head(5)

Unnamed: 0,title,abstract,body_text,keywords
0,Peri-implantation glucocorticoid administratio...,of findings 1. Glucocorticoids compared to no ...,,"[Trusted evidence, Informed decisions, Better ..."
1,Interventions for uterine fibroids: an overvie...,This is a protocol for a Cochrane Review (Over...,,[]
2,Uterine distension media for outpatient hyster...,Hysteroscopy done in an outpatient setting is ...,,"[Informed decisions, Better health Informed de..."
3,"Awareness, knowledge, and misconceptions of ad...","Objective: To assess the awareness, knowledge,...",,"[Intrauterine device, adolescent, LARC, miscon..."
4,Maternal postures for fetal malposition in lab...,,,[]
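
To make the extraction logic above concrete, here is a minimal sketch that applies the same field lookups to an in-memory record instead of a file. The `sample` dictionary and the `extract_record` helper are hypothetical and only mimic the structure the loader expects; they are not part of the notebook.

```python
# Hypothetical record mimicking the JSON structure the loader expects.
sample = {
    "title": "Example study",
    "abstract": "A short abstract.",
    "body_text": [
        {"text": "First paragraph."},
        {"section": "Methods"},  # no 'text' key, so this entry is skipped
        {"text": "Second paragraph."},
    ],
    "pdf_parse": {"keywords": ["ivf", "pcos"]},
}

def extract_record(content):
    """Pull the four fields of interest out of one parsed JSON document."""
    return {
        "title": content.get("title", ""),
        "abstract": content.get("abstract", ""),
        "body_text": " ".join(
            bt["text"] for bt in content.get("body_text", []) if "text" in bt
        ),
        "keywords": content.get("pdf_parse", {}).get("keywords", []),
    }

record = extract_record(sample)
empty = extract_record({})  # every field falls back to its default

print(record["body_text"])  # First paragraph. Second paragraph.
print(empty)                # {'title': '', 'abstract': '', 'body_text': '', 'keywords': []}
```

The second call shows why the fallbacks matter: a document with none of the expected keys still yields a well-formed row, which is how empty abstract and body_text cells can end up in the DataFrame.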


***Adding a label feature***

To achieve classification within the dataset, a new label column is created by applying a function that assigns categories based on specific keywords and phrases. The assign_label function evaluates two fields, keywords and abstract, both converted to lowercase for uniformity. It uses conditional checks to classify rows into distinct categories, such as "Male Factor Infertility," "Ovarian Dysfunction," and "Pregnancy Complications," based on scientific terminology and contextual relevance.

For example, the presence of terms like "sperm" or "semen analysis" assigns the label "Male Factor Infertility," while references to "ivf" or "art" result in the "Assisted Reproductive Technologies (ART)" label. Because the conditions are checked in order, the first one that matches determines the label. If no predefined condition is met, the row is categorized as "Other." The function is applied row-wise to the DataFrame, generating a new label column that can then be used as the classification target.

In [None]:
def assign_label(row):
    keywords = " ".join(row['keywords']).lower()
    abstract = row['abstract'].lower()

    # Infertility Subcategories with Scientific Terminology
    if "sperm" in keywords or "spermatozoa" in abstract or "semen analysis" in abstract:
        return "Male Factor Infertility"
    elif "hormone" in keywords or "hormonal" in abstract or "endocrine" in keywords or "hormone therapy" in abstract:
        return "Endocrine Disorders"
    elif "ovary" in keywords or "ovarian" in abstract or "pcos" in keywords or "follicular" in abstract:
        return "Ovarian Dysfunction"
    elif "hpg axis" in keywords or "hypothalamic" in abstract or "gonadotropin" in abstract:
        return "Hypothalamic-Pituitary-Gonadal (HPG) Axis Disorders"
    elif "uterus" in keywords or "tubal" in abstract or "fallopian" in abstract or "uterine" in abstract:
        return "Uterine and Tubal Factors"
    elif "unexplained infertility" in keywords or "idiopathic" in abstract:
        return "Unexplained Infertility"

    # Other Main Labels
    elif "pregnancy" in keywords or "pregnancy" in abstract:
        return "Pregnancy Complications"
    elif "treatment" in keywords or "ivf" in keywords or "icsi" in keywords or "art" in abstract:
        return "Assisted Reproductive Technologies (ART)"
    elif "endometriosis" in keywords or "endometriosis" in abstract:
        return "Endometriosis"
    elif "adenomyosis" in keywords or "adenomyosis" in abstract:
        return "Adenomyosis"
    elif "neonatal" in keywords or "birth weight" in abstract:
        return "Neonatal Outcomes"
    elif "guidelines" in keywords or "protocol" in abstract:
        return "Clinical Guidelines"
    else:
        return "Other"

# Apply the function to create a new column
df['label'] = df.apply(assign_label, axis=1)

# Preview the DataFrame with the new label column
df.head(5)


Unnamed: 0,title,abstract,body_text,keywords,label
0,Peri-implantation glucocorticoid administratio...,of findings 1. Glucocorticoids compared to no ...,,"[Trusted evidence, Informed decisions, Better ...",Other
1,Interventions for uterine fibroids: an overvie...,This is a protocol for a Cochrane Review (Over...,,[],Uterine and Tubal Factors
2,Uterine distension media for outpatient hyster...,Hysteroscopy done in an outpatient setting is ...,,"[Informed decisions, Better health Informed de...",Uterine and Tubal Factors
3,"Awareness, knowledge, and misconceptions of ad...","Objective: To assess the awareness, knowledge,...",,"[Intrauterine device, adolescent, LARC, miscon...",Uterine and Tubal Factors
4,Maternal postures for fetal malposition in lab...,,,[],Other
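
One caveat with the `in` checks used by assign_label is that they test for substrings, so a short term such as "art" also matches inside longer words. The sketch below contrasts the substring test with a word-boundary test. The `has_term` helper and the example sentences are hypothetical and shown only for illustration; the notebook itself uses plain substring checks.

```python
import re

abstract = "this article is part of a larger review"

# Plain substring test, as in assign_label: also matches inside "article" and "part".
substring_hit = "art" in abstract

def has_term(term, text):
    """Return True only if `term` appears as a standalone word or phrase."""
    return re.search(r"\b" + re.escape(term) + r"\b", text) is not None

word_hit = has_term("art", abstract)
true_hit = has_term("art", "outcomes after art cycles")
phrase_hit = has_term("semen analysis", "a semen analysis was performed")

print(substring_hit, word_hit, true_hit, phrase_hit)  # True False True True
```

Whether the stricter match is preferable depends on the vocabulary: it avoids accidental hits on short abbreviations, but it would no longer match inflected forms unless they are listed explicitly.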


***Data cleaning***

To prepare the dataset for analysis, text cleaning is performed on the abstract and body_text fields. The clean_text function is designed to standardize and sanitize the text by removing unnecessary elements.

First, it collapses every run of whitespace into a single space. Then, it removes special characters, keeping only word characters (letters, digits, and underscores) and whitespace. Finally, it trims leading and trailing spaces to ensure a clean output.

This cleaning process is applied to each row of the abstract and body_text columns using the apply method, ensuring the text is uniform and free from noise, which is essential for accurate analysis and classification.

In [None]:
import re

def clean_text(text):
    # Collapse every run of whitespace (spaces, tabs, newlines) into a single space
    text = re.sub(r'\s+', ' ', text)

    # Remove special characters: everything except word characters (letters, digits, underscore) and whitespace
    text = re.sub(r'[^\w\s]', '', text)
    return text.strip()

# Apply the cleaning function to 'abstract' and 'body_text'
df['abstract'] = df['abstract'].apply(clean_text)
df['body_text'] = df['body_text'].apply(clean_text)

In [None]:
df.head()

Unnamed: 0,title,abstract,body_text,keywords,label
0,Peri-implantation glucocorticoid administratio...,of findings 1 Glucocorticoids compared to no g...,,"[Trusted evidence, Informed decisions, Better ...",Other
1,Interventions for uterine fibroids: an overvie...,This is a protocol for a Cochrane Review Overv...,,[],Uterine and Tubal Factors
2,Uterine distension media for outpatient hyster...,Hysteroscopy done in an outpatient setting is ...,,"[Informed decisions, Better health Informed de...",Uterine and Tubal Factors
3,"Awareness, knowledge, and misconceptions of ad...",Objective To assess the awareness knowledge an...,,"[Intrauterine device, adolescent, LARC, miscon...",Uterine and Tubal Factors
4,Maternal postures for fetal malposition in lab...,,,[],Other
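
The effect of the two substitutions is easiest to see on a small input. The sketch below repeats clean_text as defined above and runs it on a hypothetical string; `clean_text_spaced` is an assumed variant, not part of the notebook, that replaces punctuation with a space before collapsing whitespace.

```python
import re

def clean_text(text):
    text = re.sub(r'\s+', ' ', text)     # collapse whitespace runs
    text = re.sub(r'[^\w\s]', '', text)  # drop punctuation, keep word characters
    return text.strip()

def clean_text_spaced(text):
    """Hypothetical variant: punctuation becomes a space, then whitespace is collapsed."""
    text = re.sub(r'[^\w\s]', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

raw = "  Objective:   To assess (n = 12)\nIVF-related  outcomes_v2.  "

cleaned = clean_text(raw)
spaced = clean_text_spaced(raw)

print(repr(cleaned))  # 'Objective To assess n  12 IVFrelated outcomes_v2'
print(repr(spaced))   # 'Objective To assess n 12 IVF related outcomes_v2'
```

Two details are worth noting in the first result: removing "=" after the whitespace has already been collapsed leaves a double space, and removing the hyphen fuses "IVF-related" into one token. The variant avoids both, at the cost of splitting hyphenated terms into separate words.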


***Task***

To perform the machine learning task, it is essential to install the required libraries. The following commands ensure that the necessary tools are available:

!pip install transformers: Installs the Hugging Face Transformers library, which provides pre-trained models and tools for natural language processing (NLP) tasks.

!pip install torch: Installs PyTorch, a widely used deep learning framework for building and training models.

!pip install datasets: Installs the Hugging Face Datasets library, which provides access to a wide variety of datasets for machine learning tasks.

In [None]:
!pip install transformers
!pip install torch
!pip install datasets


Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl 

***Preparation for training***

This code prepares text data for training a machine learning model using a pre-trained tokenizer and a custom dataset class. It begins by loading the BERT tokenizer from the Hugging Face Transformers library, which is used to tokenize text into subword tokens. A function is defined to tokenize a list of texts, applying truncation and padding while limiting the sequences to a maximum length of 512 tokens.

The dataset is then split into training and validation sets, with 20% of the data allocated to validation. The texts from both sets are tokenized using the tokenizer function, converting the raw text into tokenized encodings. A custom dataset class is defined to handle these encodings and labels in a format that can be used with PyTorch. This class includes methods to retrieve individual samples and convert them into tensors, which are compatible with PyTorch’s DataLoader. Finally, instances of the custom dataset class are created for both the training and validation sets, preparing the data for model training in a format that supports batching and shuffling.

In [None]:
from transformers import BertTokenizer
from sklearn.model_selection import train_test_split
import torch

# Load pre-trained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenization function
def tokenize_data(texts, tokenizer):
    encodings = tokenizer(list(texts), truncation=True, padding=True, max_length=512)
    return encodings

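# Encode the string labels as integer ids.
# NOTE: the cell that defines `label_map` and `df['label_id']` is not shown in this notebook;
# the mapping below is an assumed reconstruction.
label_map = {label: idx for idx, label in enumerate(df['label'].unique())}
df['label_id'] = df['label'].map(label_map)

# Split data into train and validation sets
train_texts, val_texts, train_labels, val_labels = train_test_split(df['abstract'], df['label_id'], test_size=0.2)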

# Tokenize the data
train_encodings = tokenize_data(train_texts, tokenizer)
val_encodings = tokenize_data(val_texts, tokenizer)

# Convert to Torch dataset format
class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels.iloc[idx])  # Use iloc for accessing the label by index
        return item

    def __len__(self):
        return len(self.labels)

# Create datasets
train_dataset = CustomDataset(train_encodings, train_labels)
val_dataset = CustomDataset(val_encodings, val_labels)
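
To illustrate what truncation=True, padding=True, and max_length do to a set of sequences, here is a simplified pure-Python sketch. The `pad_and_truncate` helper, the toy token ids, and `pad_id=0` are assumptions made for illustration; the real tokenizer additionally handles special tokens and subword splitting, which this sketch ignores.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Simplified imitation of truncation plus padding to the longest sequence."""
    # Truncation: cut every sequence down to at most max_length ids.
    truncated = [ids[:max_length] for ids in batch]
    # Padding: extend shorter sequences to the length of the longest one.
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

toy_batch = [[11, 12, 13], [21, 22, 23, 24, 25, 26]]  # hypothetical token ids
enc = pad_and_truncate(toy_batch, max_length=5)

print(enc["input_ids"])       # [[11, 12, 13, 0, 0], [21, 22, 23, 24, 25]]
print(enc["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Because tokenize_data is called once per split, every abstract in a split is padded to the length of the longest abstract in that split (capped at 512), not to the longest one in each mini-batch.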


***Class-imbalance treatment***

This code calculates class weights to address any potential class imbalance in the dataset. It uses the compute_class_weight function from sklearn.utils.class_weight, specifying the 'balanced' strategy to automatically adjust weights inversely proportional to the class frequencies. The unique class labels are obtained from df['label_id'], and the computed class weights are stored in an array.

Next, a dictionary is created to map each class label to its corresponding weight by iterating over the class weights array. The resulting dictionary, class_weight_dict, contains the class labels as keys and their respective weights as values. This dictionary can then be used in model training to apply these weights, helping to handle imbalanced classes by giving more importance to underrepresented classes.

In [None]:
from sklearn.utils.class_weight import compute_class_weight
import numpy as np

# Balanced weights: n_samples / (n_classes * number of samples in each class)
classes = np.unique(df['label_id'])
class_weights = compute_class_weight('balanced', classes=classes, y=df['label_id'])

# Create a dictionary mapping each class label to its weight
class_weight_dict = {int(c): float(w) for c, w in zip(classes, class_weights)}
print(class_weight_dict)


{0: 0.28205128205128205, 1: 1.2222222222222223, 2: 1.8333333333333333, 3: 3.6666666666666665, 4: 1.8333333333333333, 5: 3.6666666666666665}
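
The 'balanced' strategy assigns each class the weight n_samples / (n_classes * count_of_class). The sketch below reproduces that formula in plain Python. The class sizes are hypothetical: they were chosen because they are consistent with the weights printed above, but the notebook does not show the actual label counts.

```python
from collections import Counter

# Hypothetical label ids with class sizes 13, 3, 2, 1, 2, 1 (22 samples in total).
labels = [0] * 13 + [1] * 3 + [2] * 2 + [3] * 1 + [4] * 2 + [5] * 1

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# Balanced heuristic: rarer classes receive proportionally larger weights.
balanced_weights = {
    c: n_samples / (n_classes * counts[c]) for c in sorted(counts)
}

for c, w in balanced_weights.items():
    print(c, counts[c], round(w, 4))
```

With these counts, a class that holds a single sample is weighted thirteen times more heavily than the majority class, which shows how strongly the heuristic reacts on very small datasets.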


***Setting Up BERT for Sequence Classification with Class Weights***

This code prepares a BERT model for sequence classification with the consideration of class imbalances by using weighted loss. First, it loads the pre-trained BERT model (bert-base-uncased) for sequence classification, specifying the number of labels as the length of label_map, which maps the class labels to integer values.

The class weights, which were previously computed, are converted into a PyTorch tensor with dtype=torch.float32. These weights will be used to adjust the loss function during model training.

The weighted_loss_fn function is defined to apply the class weights during training. It uses CrossEntropyLoss, passing class_weights_tensor so that each sample's loss is scaled by the weight of its true class; with 'balanced' weights, rarer classes receive larger weights.

In [None]:
from transformers import BertForSequenceClassification
import torch.nn as nn
import torch

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(label_map))

# Convert class weights to a tensor
class_weights_tensor = torch.tensor(class_weights, dtype=torch.float32)

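# Loss function that applies the class weights
def weighted_loss_fn(output, target):
    # Keep the weights on the same device as the logits (CPU or GPU)
    loss_fct = nn.CrossEntropyLoss(weight=class_weights_tensor.to(output.device))
    return loss_fct(output, target)

# Each abstract receives exactly one label, so this is single-label (multi-class) classification
model.config.problem_type = "single_label_classification"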


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
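
To show what the class weights do inside the loss, the following sketch computes a weighted cross-entropy by hand in plain Python. It follows the convention of CrossEntropyLoss with a weight tensor and the default mean reduction: each sample's loss is multiplied by the weight of its true class, and the total is divided by the sum of those weights. The logits, targets, and weights are hypothetical values.

```python
import math

def weighted_cross_entropy(logits, targets, weights):
    """Weighted mean of per-sample cross-entropy losses."""
    total, norm = 0.0, 0.0
    for row, y in zip(logits, targets):
        # Numerically stable log-sum-exp of the logits.
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        nll = log_sum_exp - row[y]  # negative log-probability of the true class
        total += weights[y] * nll
        norm += weights[y]
    return total / norm

logits = [[2.0, 0.0], [0.0, 0.0]]  # sample 0 is confident, sample 1 is undecided
targets = [0, 1]

uniform = weighted_cross_entropy(logits, targets, [1.0, 1.0])
skewed = weighted_cross_entropy(logits, targets, [1.0, 3.0])

print(round(uniform, 4), round(skewed, 4))
```

Raising the weight of class 1 increases the overall loss here, because the poorly predicted sample of that class now dominates the average. This is the mechanism by which the weights push the model to pay more attention to rare classes.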


***Implementing a Custom Trainer with Class Weights and Evaluation Metrics***

This code defines a custom loss function and a custom trainer for sequence classification with class weights and evaluation metrics. The weighted_loss_fn function uses CrossEntropyLoss with the previously computed class weights, so that errors on underrepresented classes contribute more to the loss. The CustomTrainer class, which inherits from the Hugging Face Trainer class, overrides the compute_loss method to use this weighted loss. The compute_metrics routine calculates precision, recall, F1 score, and accuracy using precision_recall_fscore_support and accuracy_score from sklearn, comparing the predicted labels (the argmax of the logits) against the true labels. Precision, recall, and F1 use average='weighted', meaning each class's score is weighted by its number of true samples.

In [None]:
import numpy as np
import torch.nn as nn
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
from transformers import Trainer

# Define a custom loss function that uses class weights
def weighted_loss_fn(logits, labels):
    # Keep the weights on the same device as the logits (CPU or GPU)
    loss_fct = nn.CrossEntropyLoss(weight=class_weights_tensor.to(logits.device))
    return loss_fct(logits, labels)

# Custom Trainer that replaces the default loss with the weighted loss
class CustomTrainer(Trainer):
    # **kwargs absorbs extra arguments that some transformers versions pass to compute_loss
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.get("labels")
        outputs = model(**inputs)
        logits = outputs.get("logits")

        # Use weighted loss function
        loss = weighted_loss_fn(logits, labels)

        if return_outputs:
            return (loss, outputs)
        return loss

# Metrics function; Trainer expects it through its `compute_metrics` constructor argument
def compute_metrics(eval_pred):
    # Extract predictions and labels
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    # Calculate precision, recall, f1, and accuracy
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='weighted')
    accuracy = accuracy_score(labels, predictions)

    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }
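
As a worked illustration of the 'weighted' averaging used for precision, recall, and F1, the sketch below computes the per-class scores by hand and averages them with each class's number of true samples as the weight. The `weighted_prf` helper and the toy labels are hypothetical; undefined ratios (for example, the precision of a class that is never predicted) are counted as 0, which is also what scikit-learn reports, together with a warning.

```python
from collections import Counter

def weighted_prf(y_true, y_pred):
    """Support-weighted precision, recall, and F1 (undefined ratios count as 0)."""
    support = Counter(y_true)
    n = len(y_true)
    precision = recall = f1 = 0.0
    for c, n_c in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        predicted = sum(1 for p in y_pred if p == c)
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / n_c
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        # Each class contributes in proportion to its share of the true labels.
        precision += p_c * n_c / n
        recall += r_c * n_c / n
        f1 += f_c * n_c / n
    return precision, recall, f1

y_true = [0, 0, 0, 1, 2]  # hypothetical validation labels
y_pred = [0, 0, 1, 1, 1]  # class 2 is never predicted

p, r, f = weighted_prf(y_true, y_pred)
print(round(p, 4), round(r, 4), round(f, 4))  # 0.6667 0.6 0.58
```

Note that the support-weighted recall equals plain accuracy (3 of 5 correct here), so on a small validation set these four numbers carry less independent information than their count suggests.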



***Training a BERT Model with Custom Metrics and Class Weights***

This code sets up the training of a BERT model for sequence classification, incorporating custom metrics and class weights to handle class imbalances. First, the class weights are computed using compute_class_weight to ensure that underrepresented classes are given more importance during training. These weights are converted into a PyTorch tensor to be used in the loss function.

The CustomTrainer class, which extends Hugging Face's Trainer, is paired with a compute_metrics routine that calculates accuracy, precision, recall, and F1 score using accuracy_score and precision_recall_fscore_support from sklearn. With evaluation_strategy="epoch", these metrics are computed on the validation set at the end of each epoch to monitor the model's performance.

The TrainingArguments class is configured with essential hyperparameters, including the number of epochs, batch sizes, learning rate warmup, and evaluation frequency. The pre-trained BERT model (bert-base-uncased) is loaded and configured for the classification task, with the loss function incorporating the class weights to adjust the loss for imbalanced data.

Finally, the CustomTrainer is initialized with the model, datasets, and training arguments, and training is started with trainer.train(). After training, the model is evaluated with trainer.evaluate(), and the evaluation metrics are printed. This setup takes class imbalance into account during training and reports detailed performance metrics.

In [None]:
from transformers import BertForSequenceClassification, Trainer, TrainingArguments
import torch
import numpy as np
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
from sklearn.utils.class_weight import compute_class_weight

# Calculate class weights
class_weights = compute_class_weight('balanced', classes=np.unique(df['label_id']), y=df['label_id'])
class_weights_tensor = torch.tensor(class_weights, dtype=torch.float32)

# Metrics function, passed to the Trainer through its `compute_metrics` argument
def compute_metrics(eval_pred):
    # Extract predictions and labels
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    # Calculate precision, recall, f1, and accuracy
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='weighted')
    accuracy = accuracy_score(labels, predictions)

    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }

# Custom Trainer that applies the class-weighted loss
class CustomTrainer(Trainer):
    # **kwargs absorbs extra arguments that some transformers versions pass to compute_loss
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.get("labels")
        outputs = model(**inputs)
        logits = outputs.get("logits")

        # Cross-entropy with class weights, kept on the same device as the logits
        loss_fct = torch.nn.CrossEntropyLoss(weight=class_weights_tensor.to(logits.device))
        loss = loss_fct(logits, labels)

        return (loss, outputs) if return_outputs else loss

# Set the training arguments
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # number of training epochs
    per_device_train_batch_size=16,  # batch size for training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,                # log the training loss every 10 optimizer steps
    evaluation_strategy="epoch",     # Evaluate after each epoch
    save_strategy="epoch",           # Save after each epoch
)

# Load the pre-trained BERT model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(label_map))

# Initialize the custom trainer with the model, datasets, and training arguments
trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics  # Standalone metrics function defined above
)

# Start training
trainer.train()

# After training, evaluate the model to get metrics
eval_results = trainer.evaluate()

# Display the evaluation metrics
print(f"Evaluation Results: {eval_results}")


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,No log,1.724511,0.0,0.0,0.0,0.0
2,No log,1.723556,0.0,0.0,0.0,0.0
3,No log,1.721952,0.0,0.0,0.0,0.0
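
When reading a table like this one, it helps to compare the total number of optimizer steps with warmup_steps and logging_steps. The sketch below does that arithmetic under stated assumptions: `n_train = 17` is a hypothetical training-set size, a single device without gradient accumulation is assumed, and `peak_lr = 5e-5` is an assumed peak learning rate with linear warmup. Neither helper function is part of the notebook.

```python
import math

def total_optimizer_steps(n_train, batch_size, epochs):
    """Optimizer steps for one device without gradient accumulation."""
    return math.ceil(n_train / batch_size) * epochs

def linear_warmup_lr(step, peak_lr, warmup_steps):
    """Learning rate reached after `step` steps of linear warmup."""
    return peak_lr * min(step, warmup_steps) / warmup_steps

n_train = 17  # hypothetical number of training samples
steps = total_optimizer_steps(n_train, batch_size=16, epochs=3)
final_lr = linear_warmup_lr(steps, peak_lr=5e-5, warmup_steps=500)

print(steps)                # 6
print(steps < 10)           # True: fewer steps than logging_steps=10
print(f"{final_lr:.1e}")    # 6.0e-07
```

Under these assumptions, training ends after 6 steps, before the first logging step at 10 (which would be consistent with the "No log" entries) and long before the 500 warmup steps complete, so the learning rate never gets close to its peak. On a dataset this small, a much lower warmup_steps and logging_steps would be the first settings to revisit.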




Evaluation Results: {'eval_loss': 1.7219524383544922, 'eval_accuracy': 0.0, 'eval_precision': 0.0, 'eval_recall': 0.0, 'eval_f1': 0.0, 'eval_runtime': 0.0442, 'eval_samples_per_second': 113.169, 'eval_steps_per_second': 22.634, 'epoch': 3.0}




In [None]:
# Evaluate the model on the validation set
eval_results = trainer.evaluate()

# Print the evaluation results
print(f"Evaluation Results: {eval_results}")


Evaluation Results: {'eval_loss': 1.7219524383544922, 'eval_accuracy': 0.0, 'eval_precision': 0.0, 'eval_recall': 0.0, 'eval_f1': 0.0, 'eval_runtime': 0.0441, 'eval_samples_per_second': 113.45, 'eval_steps_per_second': 22.69, 'epoch': 3.0}


