# Bonus Assignment: Question Answering based on Medical Transcription using BioBERT Model

Submitted by: Merishna Singh Suwal



## Retrieve the data from GitHub

The data set used for this project has been derived from [Kaggle](https://www.kaggle.com/datasets/tboyle10/medicaltranscriptions).

In [1]:
!git clone https://github.com/merishnaSuwal/Medical_transcript_QA.git

Cloning into 'Medical_transcript_QA'...
remote: Enumerating objects: 23, done.
remote: Counting objects: 100% (23/23), done.
remote: Compressing objects: 100% (20/20), done.
remote: Total 23 (delta 6), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (23/23), 4.78 MiB | 18.00 MiB/s, done.
Resolving deltas: 100% (6/6), done.


In [3]:
import sys
sys.path.append('/content/Medical_transcript_QA')

## Imports and settings

In [4]:
# Wrap the text in ipython notebook
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

In [5]:
%%capture
# Install Hugging Face related libraries
# !source /content/.venv/bin/activate;
!python3 -m pip install -U pip
!python3 -m pip install -U huggingface_hub
!python3 -m pip install -U accelerate
!python3 -m pip install -U transformers
!python3 -m pip install -U datasets evaluate
!python3 -m pip install -U tensorflow


# Install Spacy related libraries
# !source /content/.venv/bin/activate;
!python3 -m pip install -U pip setuptools wheel spacy;
!python3 -m pip install -U pip spacy_cleaner;
!python3 -m spacy download en_core_web_sm;

In [6]:
import accelerate
import transformers

transformers.__version__, accelerate.__version__

('4.35.2', '0.24.1')

Before running the next cell, generate an access token at https://huggingface.co/settings/tokens ; `huggingface_hub` requires it to log in.

In [7]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [8]:
import os
import pandas as pd
import spacy
import en_core_web_sm
import spacy_cleaner
from spacy_cleaner.processing import removers, replacers, mutators

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

from datasets import load_dataset
import json
from pathlib import Path

## Read and preprocess the data

In [9]:
main_dir = 'Medical_transcript_QA/'
data_dir = os.path.join(main_dir, 'bio_qa_data')
df = pd.read_csv(os.path.join(data_dir, "mtsamples.csv"), index_col=0)

In [10]:
df.head(1) # view the dataframe

Unnamed: 0,description,medical_specialty,sample_name,transcription,keywords
0,A 23-year-old white female presents with complaint of allergies.,Allergy / Immunology,Allergic Rhinitis,"SUBJECTIVE:, This 23-year-old white female presents with complaint of allergies. She used to have allergies when she lived in Seattle but she thinks they are worse here. In the past, she has tried Claritin, and Zyrtec. Both worked for short time but then seemed to lose effectiveness. She has used Allegra also. She used that last summer and she began using it again two weeks ago. It does not appear to be working very well. She has used over-the-counter sprays but no prescription nasal sprays. She does have asthma but doest not require daily medication for this and does not think it is flaring up.,MEDICATIONS: , Her only medication currently is Ortho Tri-Cyclen and the Allegra.,ALLERGIES: , She has no known medicine allergies.,OBJECTIVE:,Vitals: Weight was 130 pounds and blood pressure 124/78.,HEENT: Her throat was mildly erythematous without exudate. Nasal mucosa was erythematous and swollen. Only clear drainage was seen. TMs were clear.,Neck: Supple without adenopathy.,Lungs: Clear.,ASSESSMENT:, Allergic rhinitis.,PLAN:,1. She will try Zyrtec instead of Allegra again. Another option will be to use loratadine. She does not think she has prescription coverage so that might be cheaper.,2. Samples of Nasonex two sprays in each nostril given for three weeks. A prescription was written as well.","allergy / immunology, allergic rhinitis, allergies, asthma, nasal sprays, rhinitis, nasal, erythematous, allegra, sprays, allergic,"


In [11]:
# Import spacy model
spacy_nlp = spacy.load("en_core_web_sm")
stops = spacy_nlp.Defaults.stop_words

In [12]:
# Initialize a pipeline for cleaning text
spacy_pipeline = spacy_cleaner.Pipeline(
    spacy_nlp,
    removers.remove_stopword_token,
    removers.remove_punctuation_token,
    mutators.mutate_lemma_token,
)

# BioBERT Inference w/o Fine-Tuning

## Prepare the pre-trained model

We will first run inference with BioBERT on the given medical transcripts without any fine-tuning.



In [13]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline, TrainingArguments, Trainer

# import pretrained model
model_name = "dmis-lab/biobert-large-cased-v1.1-squad"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# Define a pipeline for inference
nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)

## Prepare the helper functions for inference

In [14]:
def get_model_answer(nlp_pipeline, q, passage, tokenizer):
  """Return answer from the QA model for the provided question.
  """
  QA_input = {
      'question': q,
      'context': passage
  }
  res = nlp_pipeline(QA_input)
  return res['answer']

def clean_up_text(texts):
  """Clean up transcripts using spacy.
  """
  clean_text = spacy_pipeline.clean([texts])
  return ' '.join(clean_text)

## **INFERENCE**: Ask necessary questions to the model based on a given passage

In [15]:
# clean up passage to remove stop words
passage = clean_up_text(df.iloc[2]['transcription'])
print('\n')
print(passage)

Cleaning Progress: 100%|██████████| 1/1 [00:00<00:00,  3.83it/s]



history present ILLNESS see ABC today pleasant gentleman 42 year old 344 pound 5'9 BMI 51 overweight year age 33 high 358 pound low 260 pursue surgical attempt weight loss feel good healthy begin exercise want able exercise play volleyball physically sluggish get tired quickly lose weight regain gain lose big weight loss 25 pound month gain month drink alcohol take calorie multiple commercial weight loss program include Slim Fast month year ago Atkin Diet month year ago ,PAST MEDICAL history difficulty climb stair difficulty airline seat tie shoe public seating difficulty walk high cholesterol high blood pressure asthma difficulty walk block go step sleep apnea snore diabetic medication joint pain knee pain pain foot ankle pain leg foot swell hemorrhoid ,PAST surgical history include orthopedic knee surgery ,SOCIAL history currently single drink alcohol drink week drink day week binge drink smoke half pack day 15 year recently stop smoke past week ,FAMILY history obesity heart diseas




In [16]:
q1 = "How old is the patient?"
q2 = "Does the patient have any complaints?"
q3 = "What is the reason for this consultation?"
q4 = "What does the echodiagram show?"
q5 = "What other symptoms does the patient have?"

questions = [q1, q2, q3, q4, q5]

for i, q in enumerate(questions):
  print(f"Question {i+1}: {q}\n")
  print(f"Answer: {get_model_answer(nlp, q, passage, tokenizer)}\n")

Question 1: How old is the patient?

Answer: 42

Question 2: Does the patient have any complaints?

Answer: joint pain knee pain pain foot ankle pain leg foot swell hemorrhoid

Question 3: What is the reason for this consultation?

Answer: pursue surgical attempt weight loss

Question 4: What does the echodiagram show?

Answer: review systems

Question 5: What other symptoms does the patient have?

Answer: hemorrhoid



The pretrained model performs reasonably well, but it has not been trained on medical transcripts, so it lacks their context. We will therefore attempt to fine-tune the model on a custom annotated dataset prepared from the provided data.

---

# **Fine-Tuning (training) the model**

## **Preparing custom SQuAD dataset**

The medical transcripts were manually annotated in SQuAD dataset format using the [cdQA-annotator](https://github.com/cdqa-suite/cdQA-annotator).

- **Training data (train.json)**

Contains 4 medical transcripts with 4 questions (information types) for each, i.e. 16 total questions.

- **Validation data (val.json)**

Contains 2 medical transcripts with 4 questions (information types) for each, i.e. 8 total questions.
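
For reference, a SQuAD v1.1-style file nests `data -> paragraphs -> qas -> answers`, and each answer stores its text together with the character offset at which it starts in the context. The sketch below builds one hypothetical record (the id, the shortened context and the `version` value are illustrative, not copied from `train.json`) and checks that every `answer_start` really points at the answer text, which is a common source of silent annotation errors.

```python
# Hypothetical record in SQuAD v1.1 layout; values are illustrative only.
example = {
    "version": "1.1",
    "data": [{
        "title": "Question answering",
        "paragraphs": [{
            "context": "This 23-year-old white female presents with complaint of allergies.",
            "qas": [{
                "id": "example-0001",
                "question": "How old is the patient?",
                "answers": [{"text": "23-year-old", "answer_start": 5}],
            }],
        }],
    }],
}


def find_misaligned_answers(squad_dict):
    """Return ids of questions whose answer_start does not match the context."""
    bad_ids = []
    for article in squad_dict["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    start = answer["answer_start"]
                    end = start + len(answer["text"])
                    if context[start:end] != answer["text"]:
                        bad_ids.append(qa["id"])
    return bad_ids


print(find_misaligned_answers(example))  # an empty list means all offsets line up
```

The same function could be run on the loaded contents of `train.json` and `val.json` before training.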



## Load our manually annotated SQuAD dataset 'train.json' and 'val.json' using custom dataset loader 'custom_squad.py'

 - [Custom data loading script](https://huggingface.co/datasets/lhoestq/custom_squad/raw/main/custom_squad.py) -> `custom_squad.py`

In [17]:
# Bare file names were resolved next to the loading script in this run
squad = load_dataset(
    os.path.join(data_dir, "custom_squad.py"),
    data_files={"train": "train.json", "validation": "val.json"},
)

## Validate our sample dataset after loading

In [18]:
squad['train'][0]

{'id': 'e5576669-48d1-4d5b-bf3a-4048bd70624b',
 'title': 'Question answering',
 'context': 'SUBJECTIVE:, This 23-year-old white female presents with complaint of allergies. She used to have allergies when she lived in Seattle but she thinks they are worse here. In the past, she has tried Claritin, and Zyrtec. Both worked for short time but then seemed to lose effectiveness. She has used Allegra also. She used that last summer and she began using it again two weeks ago. It does not appear to be working very well. She has used over-the-counter sprays but no prescription nasal sprays. She does have asthma but doest not require daily medication for this and does not think it is flaring up.,MEDICATIONS: , Her only medication currently is Ortho Tri-Cyclen and the Allegra.,ALLERGIES: , She has no known medicine allergies.,OBJECTIVE:,Vitals: Weight was 130 pounds and blood pressure 124/78.,HEENT: Her throat was mildly erythematous without exudate. Nasal mucosa was erythematous and swollen. Onl

## Prepare the preprocess function

Preprocessing handles several cases that arise in question answering datasets:

 - Some contexts may be longer than the model's maximum input length. Such sequences are truncated, using `truncation="only_second"` so that only the context is cut and never the question.

 - We need to map the start and end positions of the answer to the original context by setting `return_offsets_mapping=True`.

 - With the mapping in hand, we can find the start and end tokens of the answer. The `sequence_ids` method tells which part of the offsets corresponds to the question and which corresponds to the context.
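
The mapping from character offsets to token positions can be illustrated without a tokenizer. The sketch below uses a hand-written, hypothetical list of `(start_char, end_char)` offsets for the context tokens only, and performs the same two scans as the notebook's `preprocess_function`: forward to the last token that starts at or before the answer, and backward to the first token that ends at or after it. As a deliberate simplification it returns `None` whenever the answer is not fully covered by the tokens, which is a stricter check than the one used in the notebook.

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to (start_token, end_token).

    Returns None if the span is not fully covered by the tokens,
    e.g. because the answer was truncated away.
    """
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return None

    # Last token whose start is at or before the answer start
    idx = 0
    while idx < len(offsets) and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1

    # First token whose end is at or after the answer end
    idx = len(offsets) - 1
    while idx >= 0 and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1

    return start_token, end_token


# Hypothetical whitespace tokenization of a short context
context = "He is 42 years old"
offsets = [(0, 2), (3, 5), (6, 8), (9, 14), (15, 18)]

answer = "42 years"
start_char = context.index(answer)
end_char = start_char + len(answer)
print(char_span_to_token_span(offsets, start_char, end_char))  # (2, 3)
```

In the real pipeline the offsets come from the tokenizer and also cover the question and special tokens, which is why the notebook first locates the context with `sequence_ids`.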

In [19]:
def preprocess_function(examples):
    """Preprocess the dataset before training."""
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer lies entirely outside the (truncated) context, label it (0, 0)
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

## Apply the preprocessing over our custom dataset

In [20]:
tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)

Map:   0%|          | 0/16 [00:00<?, ? examples/s]

Map:   0%|          | 0/8 [00:00<?, ? examples/s]

## Check if the tokenized dataset contains required features

In [21]:
tokenized_squad

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'],
        num_rows: 16
    })
    validation: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'],
        num_rows: 8
    })
})
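
Because contexts longer than 384 tokens are truncated, an answer that falls entirely past the cut is labelled `(0, 0)` by `preprocess_function`. One quick way to see how many examples were affected is to count those labels. The helper below is a hypothetical addition, not part of the original notebook, and it only needs the two label columns.

```python
def count_truncated_answers(start_positions, end_positions):
    """Count examples labelled (0, 0), i.e. examples whose answer was not
    found inside the (possibly truncated) context window."""
    return sum(
        1
        for start, end in zip(start_positions, end_positions)
        if start == 0 and end == 0
    )


# Toy label columns; with the real data one would pass, for example,
# tokenized_squad["train"]["start_positions"] and
# tokenized_squad["train"]["end_positions"].
starts = [12, 0, 57, 0]
ends = [14, 0, 60, 0]
print(count_truncated_answers(starts, ends))  # 2
```

A high count would suggest that a sliding window over the context (the tokenizer's `stride` and `return_overflowing_tokens=True` options) is worth considering, at the cost of a more involved preprocessing function.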

## Start the training process

In [22]:
from transformers import DefaultDataCollator, TrainingArguments, Trainer

data_collator = DefaultDataCollator()

model_path = os.path.join(main_dir, "saved/")

# Set arguments
training_args = TrainingArguments(
    output_dir=model_path,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=False,
)

# Define the model
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_squad["train"],
    eval_dataset=tokenized_squad["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)


In [23]:
trainer.train() # run

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | No log        | 3.739172        |
| 2     | No log        | 3.706528        |
| 3     | No log        | 3.744939        |


TrainOutput(global_step=6, training_loss=1.5146601994832356, metrics={'train_runtime': 14.4834, 'train_samples_per_second': 3.314, 'train_steps_per_second': 0.414, 'total_flos': 33433451716608.0, 'train_loss': 1.5146601994832356, 'epoch': 3.0})
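
The `global_step=6` in this output follows from the dataset size: 16 training examples at a batch size of 8 give 2 optimizer steps per epoch, and 3 epochs give 6 steps. This also likely explains the `No log` entries, since the `Trainer` reports training loss every `logging_steps` steps (500 by default) and this run never gets that far. The arithmetic, assuming a single device and no gradient accumulation:

```python
import math


def total_optimizer_steps(num_examples, batch_size, epochs):
    """Optimizer steps for a run on one device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


steps = total_optimizer_steps(num_examples=16, batch_size=8, epochs=3)
print(steps)  # 6, matching global_step in the TrainOutput

default_logging_steps = 500  # default value of TrainingArguments.logging_steps
print(steps < default_logging_steps)  # True: no training loss is reported
```

Passing `logging_steps=1` to `TrainingArguments` would make the training loss visible for a run this short.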

## Save our fine-tuned model

In [35]:
trainer.save_model(model_path) # save

# BioBERT Inference w/ Fine-Tuned Model

We will now try to predict the answers to the questions based on the passage using our custom QA model.

In [36]:
from transformers import pipeline

context = df.iloc[21]['transcription']
print(context)
print('\n')

custom_question_answerer = pipeline("question-answering", model=model_path)

for i, q in enumerate(questions):
  print(f"Question {i+1}: {q}\n")
  print(f"Answer: {custom_question_answerer(question=q, context=context)['answer']}\n")

FINAL DIAGNOSES,1.  Morbid obesity, status post laparoscopic Roux-en-Y gastric bypass. ,2.  Hypertension. ,3.  Obstructive sleep apnea, on CPAP.,OPERATION AND PROCEDURE: , Laparoscopic Roux-en-Y gastric bypass.,BRIEF HOSPITAL COURSE SUMMARY:  ,This is a 30-year-old male, who presented recently to the Bariatric Center for evaluation and treatment of longstanding morbid obesity and associated comorbidities.  Underwent standard bariatric evaluation, consults, diagnostics, and preop Medifast induced weight loss in anticipation of elective bariatric surgery. ,Taken to the OR via same day surgery process for elective gastric bypass, tolerated well, recovered in the PACU, and sent to the floor for routine postoperative care.  There, DVT prophylaxis was continued with subcu heparin, early and frequent mobilization, and SCDs.  PCA was utilized for pain control, efficaciously, he utilized the CPAP, was monitored, and had no new cardiopulmonary complaints.  Postop day #1, labs within normal limit

In [37]:
# Try with another transcription
context = df.iloc[2]['transcription']
print(context)
print('\n')

for i, q in enumerate(questions):
  print(f"Question {i+1}: {q}\n")
  print(f"Answer: {custom_question_answerer(question=q, context=context)['answer']}\n")

HISTORY OF PRESENT ILLNESS: , I have seen ABC today.  He is a very pleasant gentleman who is 42 years old, 344 pounds.  He is 5'9".  He has a BMI of 51.  He has been overweight for ten years since the age of 33, at his highest he was 358 pounds, at his lowest 260.  He is pursuing surgical attempts of weight loss to feel good, get healthy, and begin to exercise again.  He wants to be able to exercise and play volleyball.  Physically, he is sluggish.  He gets tired quickly.  He does not go out often.  When he loses weight he always regains it and he gains back more than he lost.  His biggest weight loss is 25 pounds and it was three months before he gained it back.  He did six months of not drinking alcohol and not taking in many calories.  He has been on multiple commercial weight loss programs including Slim Fast for one month one year ago and Atkin's Diet for one month two years ago.,PAST MEDICAL HISTORY: , He has difficulty climbing stairs, difficulty with airline seats, tying shoes,

On these transcripts, the fine-tuned model's answers appear better than those of the pretrained model. Let us now compute some metrics to benchmark the performance of our model.

# Model Evaluation

Evaluate model metrics

In [25]:
# Generate predictions
from functools import lru_cache
from tqdm import tqdm

@lru_cache(maxsize=None)
def load_question_answerer(model_path):
    """Load the QA pipeline once per model path and reuse it afterwards."""
    return pipeline("question-answering", model=model_path)

def predict_answer(model_path, question, context):
    """Predict the answer for a single question-context pair."""
    question_answerer = load_question_answerer(model_path)
    return question_answerer(question=question, context=context)

# Load the SQuAD-format validation data
with open(os.path.join(data_dir, 'val.json'), 'r') as file:
    squad_data = json.load(file)


Generate Predictions in SQuAD evaluation format i.e., `{<question_id>: <answer>}`

In [26]:
predictions = {}

# Run inference and generate predictions
for example in tqdm(squad_data['data']):
    for paragraph in example['paragraphs']:
        context = paragraph['context']
        for qas in paragraph['qas']:
            question = qas['question']
            # Return the predicted answer
            answer = predict_answer(model_path, question, context)
            print(f"\nQ: {question} \n A: {answer['answer']}")
            # Save as a key-value pair
            predictions[qas['id']] = answer['answer']

print("Predictions:", predictions)

  0%|          | 0/1 [00:00<?, ?it/s]


Q: How old is the patient? 
 A: 56

Q: Does the patient have any complaints? 
 A: peripheral ankle swelling and heartburn

Q: What is the reason for this consultation? 
 A: hypertension

Q: What other symptoms does the patient have? 
 A: cardiac disease and prior history of cancer

Q: How old is the patient? 
 A: 50

Q: Does the patient have any complaints? 
 A: comorbidities related to the obesity

Q: What is the reason for this consultation? 
 A: Morbid obesity


100%|██████████| 1/1 [01:37<00:00, 97.51s/it]


Q: What other symptoms does the patient have? 
 A: comorbidities
Predictions: {'6d008f70-d3c3-447d-8a61-c2fb19135b5e': '56', 'bf5d0d61-77d4-4d88-8d12-84613f82817f': 'peripheral ankle swelling and heartburn', '2e581290-4df8-4e8c-8ef1-ee65961916ea': 'hypertension', '259db804-291a-49d9-905d-b3f583a357f7': 'cardiac disease and prior history of cancer', '7d633826-f00b-42d5-8cd9-0c833736b85d': '50', '5f4b6ce0-2480-4e77-8d27-b694e22ae314': 'comorbidities related to the obesity', 'd2e340f8-a786-49b6-9bf3-68b181d2a0b8': 'Morbid obesity', '25097dea-6522-4fa1-935f-8161a31a8c0d': 'comorbidities'}





In [27]:
# Save predictions in JSON format
with open(os.path.join(model_path, 'predictions.json'), 'w') as output_file:
    json.dump(predictions, output_file)
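
Before running the evaluation script, it can help to confirm that every question id in `val.json` has a prediction, since the script scores a missing id as wrong. A minimal sketch, using a hypothetical helper name and toy data in the same nested layout:

```python
def missing_prediction_ids(squad_dict, predictions):
    """Return question ids present in the dataset but absent from predictions."""
    expected = [
        qa["id"]
        for article in squad_dict["data"]
        for paragraph in article["paragraphs"]
        for qa in paragraph["qas"]
    ]
    return [qid for qid in expected if qid not in predictions]


# Toy data; with the notebook's variables the call would be
# missing_prediction_ids(squad_data, predictions)
toy_data = {"data": [{"paragraphs": [{"context": "...", "qas": [
    {"id": "q1", "question": "How old is the patient?"},
    {"id": "q2", "question": "What is the reason for this consultation?"},
]}]}]}
toy_predictions = {"q1": "42"}
print(missing_prediction_ids(toy_data, toy_predictions))  # ['q2']
```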

## Evaluate using SQuAD evaluation script

We compute the Exact Match and F1 scores by passing `val.json` and `predictions.json` to the [evaluation script](https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py).

In [32]:
!python Medical_transcript_QA/eval/evaluate-v1.1.py Medical_transcript_QA/bio_qa_data/val.json Medical_transcript_QA/saved/predictions.json

{"exact_match": 62.5, "f1": 71.66666666666667}


Our fine-tuned model achieves:
- **Exact match of 62.5%**
- **F1 score of 71.67%**
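
For context on how these numbers arise: with 8 validation questions, an exact match of 62.5% corresponds to 5 of 8 predictions matching a reference answer after normalization. The sketch below re-implements the two per-answer metrics in the spirit of the SQuAD v1.1 script (lowercase, strip punctuation and articles, then compare tokens). It is an illustration rather than a copy of the evaluation script, and the reference strings are hypothetical.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """True if both strings are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(reference)


def f1_score(prediction, reference):
    """Token-level F1 between a prediction and one reference answer."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Morbid obesity", "morbid obesity."))
print(f1_score("comorbidities related to the obesity", "comorbidities"))
```

The full script additionally takes the best score over all reference answers for a question and averages over questions, which is how partially overlapping answers lift F1 above exact match.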



THANK YOU!