# HealthScribe (a Clinical Note Generator using Generative AI)

A web application that generates clinical notes from ASR (Automatic Speech Recognition) transcripts of doctor-patient conversations, using a fine-tuned BART model.\
This `.ipynb` notebook was used to train the model.

# Importing the necessary packages.

This project utilizes the following technologies and libraries:

### Programming Language
- **Python**

### Deep Learning Libraries
- **Transformers** (Hugging Face): A popular library for building and fine-tuning transformer models like BERT, GPT, and BART.
- **PyTorch**: The deep learning framework that the Transformers library runs on in this project.

### Data Processing Libraries
- **Datasets** (Hugging Face): A library for loading and processing datasets in various formats.
- **Pandas**: A library for reading and processing CSV files containing the dataset.
- **NumPy**: A library for numerical operations.

### Other Libraries
- **Accelerate**: A library for distributed training and mixed precision in PyTorch.
- **BERTViz**: A library for visualizing attention in transformer models like BERT.
- **UMAP-learn**: A library for dimensionality reduction and visualization.
- **SentencePiece**: A library for tokenization and text processing.
- **urllib3**: A library for handling HTTP requests in Python.
- **py7zr**: A library for handling 7-zip compressed files.
- **rouge_score**: A library for computing ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores, used for evaluating summarization tasks.
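Because the install cell below upgrades most packages to their latest release, it can help to record which versions were actually installed. The following is a minimal sketch using only the standard library; the `installed_versions` helper and the `PACKAGES` list are illustrative names, not part of the original notebook, and the distribution names are assumed to match the pip names listed above.

```python
from importlib import metadata

# Distribution names as used by pip, taken from the list above (assumed spelling).
PACKAGES = [
    "transformers", "torch", "datasets", "pandas", "numpy", "accelerate",
    "bertviz", "umap-learn", "sentencepiece", "urllib3", "py7zr", "rouge_score",
]

def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

# Print one line per package so the environment can be reproduced later.
for name, version in installed_versions(PACKAGES).items():
    print(f"{name:15s} {version or 'not installed'}")
```

Pinning the printed versions in the install cell is one way to keep the notebook reproducible as the libraries change.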



In [None]:
!pip install -U transformers
!pip install -U accelerate
!pip install -U datasets
!pip install -U bertviz
!pip install -U umap-learn
!pip install -U sentencepiece
!pip install -U urllib3
!pip install py7zr
!pip install rouge_score


In [None]:
from datasets import load_dataset,load_metric, load_from_disk
from transformers import pipeline
import transformers
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch


### Models
**BART (Bidirectional and Auto-Regressive Transformers)**: The project uses the Facebook BART-large-cnn model, which is a pre-trained sequence-to-sequence model.\
BART is available

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'                         #runs on a GPU if one is available, otherwise on the CPU ('gpu' is not a valid PyTorch device name).
model_ckpt = 'facebook/bart-large-cnn'                                          #sets the pre-trained BART model to be used.
metric = load_metric('rouge')                                                   #loads the ROUGE metric for evaluating text summaries (load_metric is deprecated in newer datasets releases in favor of the evaluate library).
max_target=512                                                                  #sets the maximum length for output sequences.
max_input=1024                                                                  #sets the maximum length for input sequences.
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)                           #loads the tokenizer associated with the pre-trained model.
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt)                       #loads the pre-trained BART model for sequence-to-sequence tasks such as text summarization.

## Loading Datasets
We use a modified version of the MTS-Dialog dataset. Each clinical note in the modified version is organized into the following sections:
- Symptoms:
- Diagnosis:
- History of Patient:
- Plan of Action:
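To make the note layout concrete, here is a small sketch that splits a note into the four sections listed above. It assumes each section starts with its label followed by a colon, exactly as written in the list; the `split_note` helper and the example note are hypothetical and are not taken from the dataset.

```python
import re

# Section labels as listed above; the exact label text in the CSV is an assumption.
SECTION_LABELS = ["Symptoms", "Diagnosis", "History of Patient", "Plan of Action"]

def split_note(note):
    """Split a clinical note into {label: text} using the 'Label:' markers."""
    pattern = "(" + "|".join(re.escape(label) for label in SECTION_LABELS) + "):"
    # With one capture group, re.split returns [prefix, label, text, label, text, ...].
    parts = re.split(pattern, note)
    sections = {}
    for label, text in zip(parts[1::2], parts[2::2]):
        sections[label] = text.strip()
    return sections

# Hypothetical note written for illustration only.
example_note = (
    "Symptoms: cough, mild fever\n"
    "Diagnosis: upper respiratory infection\n"
    "History of Patient: no chronic conditions\n"
    "Plan of Action: rest, fluids, follow up in one week"
)

for label, text in split_note(example_note).items():
    print(f"{label} -> {text}")
```

A parser like this can also be used to check that a generated note contains all four sections.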

In [None]:
#from google.colab import drive
#drive.mount('/content/drive')

In [None]:
train_dataset = load_dataset("csv", data_files="https://huggingface.co/datasets/har1/MTS_Dialogue-Clinical_Note/raw/main/MTS-Dialog-TrainingSet%20(SDHP).csv")
val_dataset = load_dataset("csv",data_files="https://huggingface.co/datasets/har1/MTS_Dialogue-Clinical_Note/raw/main/MTS-Dialog-Validation%20Set%20(SDHP).csv")
#test_dataset = load_dataset("csv",data_files="/content/drive/MyDrive/MTS Dataset/MTS-Dialog-TestSet-1-MEDIQA-Chat-2023.csv")

In [None]:
def preprocess_data(data_to_process):
  #get the dialogue text
  inputs = [dialogue for dialogue in data_to_process['dialogue']]
  #tokenize the dialogues, padding or truncating each one to max_input tokens
  model_inputs = tokenizer(inputs, max_length=max_input, padding='max_length', truncation=True)

  #tokenize the reference notes as labels; text_target replaces the deprecated as_target_tokenizer() context manager
  targets = tokenizer(text_target=data_to_process['section_text'], max_length=max_target, padding='max_length', truncation=True)

  model_inputs['labels'] = targets['input_ids']
  #returns input_ids, attention_mask, labels
  return model_inputs
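The effect of `padding='max_length'` together with `truncation=True` can be illustrated without the tokenizer. The sketch below is a simplified, library-free stand-in: `pad_or_truncate` is a hypothetical helper, the token ids and the `pad_id` value are made up, and a real tokenizer additionally keeps its special tokens (such as the end-of-sequence token) when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (ids, attention_mask), both exactly max_length long.

    Sequences longer than max_length are cut; shorter ones are padded with
    pad_id, and the attention mask marks padding positions with 0.
    """
    ids = list(token_ids)[:max_length]
    mask = [1] * len(ids)
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding

# Illustrative token ids only; real ids come from the tokenizer.
short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5, pad_id=1)
long_ids, long_mask = pad_or_truncate([5, 6, 7, 8, 9, 10], max_length=5, pad_id=1)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every example therefore has the same length, which is why each tokenized dialogue occupies `max_input` positions regardless of how long the conversation is.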


Tokenizing the dataset

In [None]:
tokenize_train_data = train_dataset.map(preprocess_data, batched = True)
tokenize_val_data = val_dataset.map(preprocess_data, batched = True)


Batch Declaration

In [None]:
batch_size = 1

#collator that assembles batches; it pads inputs and labels using the given tokenizer
collator = transformers.DataCollatorForSeq2Seq(tokenizer, model=model)

Evaluation metric function (token-overlap F1, used here in place of the ROUGE score)

In [None]:
import numpy as np

from collections import Counter

def compute_f1(pred):
    predictions, labels = pred
    #replace -100 (the index the Trainer uses for ignored positions) with the pad token id so the ids can be decoded
    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decode_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decode_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    true_positives = 0
    false_positives = 0
    false_negatives = 0

    #count overlapping whitespace-separated tokens for each prediction/reference pair
    for pred_tokens, label_tokens in zip(decode_predictions, decode_labels):
        pred_counter = Counter(pred_tokens.split())
        label_counter = Counter(label_tokens.split())

        tp = sum((pred_counter & label_counter).values())
        fp = sum((pred_counter - label_counter).values())
        fn = sum((label_counter - pred_counter).values())

        true_positives += tp
        false_positives += fp
        false_negatives += fn

    precision = true_positives / (true_positives + false_positives) if true_positives + false_positives > 0 else 0
    recall = true_positives / (true_positives + false_negatives) if true_positives + false_negatives > 0 else 0
    f1_score = 2 * (precision * recall) / (precision + recall) if precision + recall > 0 else 0

    #the Trainer expects compute_metrics to return a dictionary of metric names to values
    return {'f1': f1_score}
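The counting inside `compute_f1` is easier to follow on a single pair of strings. The sketch below repeats the same multiset arithmetic for one prediction and one reference; `token_f1` and both sentences are hypothetical and chosen only so the numbers are easy to check by hand.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Return (precision, recall, f1) over whitespace-separated tokens."""
    pred_counter = Counter(prediction.split())
    ref_counter = Counter(reference.split())

    tp = sum((pred_counter & ref_counter).values())  # tokens present in both
    fp = sum((pred_counter - ref_counter).values())  # predicted but not in the reference
    fn = sum((ref_counter - pred_counter).values())  # in the reference but not predicted

    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1

# Made-up sentences: 3 shared tokens, 2 extra in the prediction, 1 missing.
precision, recall, f1 = token_f1(
    "patient reports cough and fever",
    "patient reports dry cough",
)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision = 3/5, recall = 3/4, f1 = 2/3
```

Note that this measure ignores word order, so it is closer in spirit to ROUGE-1 than to ROUGE-2 or ROUGE-L.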

## GPU Instance and Training Parameters Declaration for Fine-Tuning


`!/usr/local/cuda/bin/nvcc --version` prints the version of the NVIDIA CUDA Compiler (nvcc) installed on the system. The leading `!` tells a notebook environment such as Jupyter or Google Colab to run the line as a system command. Knowing the CUDA version helps confirm compatibility with GPU-accelerated deep learning libraries.

In [None]:
!/usr/local/cuda/bin/nvcc --version

`!nvidia-smi` runs the NVIDIA System Management Interface utility, which reports the GPU model, driver version, utilization, and memory usage for each NVIDIA GPU on the system. Running it in a GPU-enabled environment such as Google Colab confirms that a GPU is available before training starts, and its output is useful for debugging and for monitoring memory use during training or inference.

In [None]:
!nvidia-smi
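The same check can be made from Python, which is convenient when the notebook should fall back gracefully on a machine without a GPU. The sketch below uses only the standard library and assumes `nvidia-smi` is on the `PATH` when a GPU driver is installed; `gpu_summary` is an illustrative helper that is not part of the original notebook.

```python
import shutil
import subprocess

def gpu_summary():
    """Return the GPU list reported by `nvidia-smi -L`, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None  # the utility is not installed, so no NVIDIA driver is visible
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
        )
    except OSError:
        return None  # the executable exists but could not be started
    if result.returncode != 0:
        return None  # the driver is present but reported an error
    return result.stdout.strip()

summary = gpu_summary()
print(summary if summary else "No NVIDIA GPU detected; training would run on the CPU.")
```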

## Training Configuration

This cell creates an instance of `Seq2SeqTrainingArguments` from the Transformers library, which configures the training process for the sequence-to-sequence model. Each argument is described after the cell.

In [None]:
import time
output_dir = f'./dialogue-summary-training-{str(int(time.time()))}'
args = transformers.Seq2SeqTrainingArguments(
    'HealthScribe-Clinical_Note_Generator',
    evaluation_strategy='epoch',
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size= 1,
    gradient_accumulation_steps=2,
    weight_decay=0.01,
    save_total_limit=2,
    num_train_epochs=3,
    predict_with_generate=True,
    eval_accumulation_steps=1,

    fp16=True

    )


- `'HealthScribe-Clinical_Note_Generator'`: The output directory where checkpoints and logs are written (the first positional argument, `output_dir`).
- `evaluation_strategy='epoch'`: Evaluate the model at the end of each epoch. Recent Transformers releases name this argument `eval_strategy`.
- `learning_rate=2e-5`: The initial learning rate for the optimizer.
- `per_device_train_batch_size=1` and `per_device_eval_batch_size=1`: The per-device batch size for training and evaluation, respectively.
- `gradient_accumulation_steps=2`: Accumulate gradients over 2 batches before each weight update, giving an effective batch size of 2 per device.
- `weight_decay=0.01`: The weight decay regularization value.
- `save_total_limit=2`: Keep at most 2 checkpoints in the output directory; older checkpoints are deleted.
- `num_train_epochs=3`: Train for 3 epochs.
- `predict_with_generate=True`: Use the `generate` method to produce predictions during evaluation, so metrics are computed on generated text.
- `eval_accumulation_steps=1`: Move accumulated predictions from the GPU to the CPU after every evaluation step, which lowers GPU memory use at the cost of speed.
- `fp16=True`: Use mixed precision (FP16) training to reduce memory use and speed up training on supported GPUs.

These arguments configure various aspects of the training process, such as the learning rate, batch size, regularization, and mixed precision training. They are passed to the `Seq2SeqTrainer` class later in the code to control the training behavior.
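The interaction between batch size and gradient accumulation comes down to simple arithmetic. The sketch below estimates the effective batch size and the number of optimizer updates per epoch; `steps_per_epoch` is a hypothetical helper, the dataset size of 1,000 examples is made up, and the `Trainer` may round a final partial accumulation step differently.

```python
import math

def steps_per_epoch(num_examples, per_device_batch, grad_accum, num_devices=1):
    """Return (effective batch size, approximate optimizer updates per epoch)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    return effective_batch, math.ceil(num_examples / effective_batch)

# Values from the configuration above, with a made-up dataset size.
effective_batch, updates = steps_per_epoch(
    num_examples=1000, per_device_batch=1, grad_accum=2
)
print(f"effective batch size: {effective_batch}")
print(f"approximate updates per epoch: {updates}")
print(f"approximate updates over 3 epochs: {updates * 3}")
```

With a per-device batch size of 1, gradient accumulation is what keeps the weight updates from being driven by a single example at a time.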

In [None]:
trainer = transformers.Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenize_train_data['train'],
    eval_dataset=tokenize_val_data['train'],
    data_collator=collator,
    tokenizer=tokenizer,
    compute_metrics=compute_f1
)

In [None]:
trainer.train()


## Push to hub from the Trainer directly

The `Trainer` has a new method to directly upload the model, tokenizer and model configuration in a repo on the [Hub](https://huggingface.co/). It will even auto-generate a model card draft using the hyperparameters and evaluation results!

In [None]:
#trainer.push_to_hub()

### Testing the model


In [None]:
import pandas as pd
test = pd.read_csv("/content/drive/MyDrive/MTS-Dialog-main/Main-Dataset/MTS-Dialog-TestSet-2-MEDIQA-Sum-2023.csv")

In [None]:
test

In [None]:
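#tokenize one test dialogue, truncating it to the model's maximum input length
model_inputs = tokenizer(test['dialogue'][7], max_length=max_input, truncation=True)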

In [None]:
raw_pred, _, _ = trainer.predict([model_inputs])

In [None]:
tokenizer.decode(raw_pred[0],skip_special_tokens=True)