## Finetuning a clinical LLM

We will finetune Google's T5-small, a checkpoint with 60 million parameters, for the task of clinical note summarization. The code is based on [this tutorial](https://huggingface.co/docs/transformers/en/tasks/summarization) from Hugging Face.

The finetuning dataset is [augmented-clinical-notes](https://huggingface.co/datasets/AGBonnet/augmented-clinical-notes) by AGBonnet, available on the Hugging Face Hub. The final model, [clinical-t5](https://huggingface.co/hossboll/clinical-t5), is available on my Hugging Face account.

The model was created for learning purposes. Although it is briefly evaluated at the end of this notebook, it needs further refinement.

### Tips for running in a Colab notebook without errors

* Activate the GPU runtime
* Run `pip install accelerate -U` together with the other `pip install` commands
* In the top menu, click `Runtime → Restart Runtime`
* Do not rerun any cells containing `!pip install`
* Rerun all the other code cells
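After restarting the runtime, it can help to confirm that a GPU is actually attached before starting a long training run. The snippet below is a minimal sketch, not part of the original notebook: the helper name `gpu_visible` is hypothetical, and it assumes either that `torch` is importable (as it normally is on Colab) or that the `nvidia-smi` command-line tool is on the path.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if an NVIDIA GPU appears to be attached to this runtime."""
    try:
        import torch  # optional: used when available

        return bool(torch.cuda.is_available())
    except ImportError:
        pass
    # Fallback: nvidia-smi is only present when the NVIDIA driver is installed.
    if shutil.which("nvidia-smi") is None:
        return False
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU visible:", gpu_visible())
```

If this prints `False`, `fp16=True` in the training arguments further down would likely fail, so it is worth checking first.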


#### Pip install

In [None]:
!pip install accelerate -U --q
!pip install ipywidgets==7.7.1 --q
!pip install huggingface_hub --q
!pip install datasets --q
!pip install transformers==4.30 --q
!pip install evaluate --q
!pip install rouge_score --q

### Logging into Hugging Face

In [None]:
from huggingface_hub import notebook_login
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding
from transformers import TrainingArguments, Trainer
notebook_login()

### Data preprocessing

First, we load the dataset and set the checkpoint (the model we will finetune) along with its tokenizer.

In [None]:
dataset = load_dataset("AGBonnet/augmented-clinical-notes", split="train")
checkpoint = "google-t5/t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
# observing a sample from the dataset => summary = label (target output)
dataset["summary"][0]

'{\n"visit motivation": "Discomfort in the neck and lower back, restriction of body movements, inability to maintain an erect posture, and requiring assistance in standing and walking.",\n"admission": [\n{\n"reason": "None",\n"date": "None",\n"duration": "None",\n"care center details": "None"\n}\n],\n"patient information": {\n"age": "Sixteen years old",\n"sex": "Female",\n"ethnicity": "None",\n"weight": "None",\n"height": "None",\n"family medical history": "None",\n"recent travels": "None",\n"socio economic context": "None",\n"occupation": "None"\n},\n"patient medical history": {\n"physiological context": "None",\n"psychological context": "Diagnosed with bipolar affective disorder at the age of eleven, first episode was that of mania.",\n"vaccination history": "None",\n"allergies": "None",\n"exercise frequency": "None",\n"nutrition": "None",\n"sexual history": "None",\n"alcohol consumption": "None",\n"drug usage": "None",\n"smoking status": "None"\n},\n"surgeries": [\n{\n"reason": "Non
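The sample above looks like a JSON document in which missing values are written as the string `"None"`. As a hedged illustration, assuming every row parses as valid JSON, the target can be inspected with the standard `json` module. The string `sample_summary` below is a shortened, hand-written stand-in for the real row, and `drop_none` is a hypothetical helper.

```python
import json

# Shortened stand-in for dataset["summary"][0]; the real string has many more fields.
sample_summary = (
    '{\n"visit motivation": "Discomfort in the neck and lower back",\n'
    '"patient information": {\n"age": "Sixteen years old",\n'
    '"sex": "Female",\n"ethnicity": "None"\n}\n}'
)

record = json.loads(sample_summary)


def drop_none(value):
    """Recursively remove fields whose value is the placeholder string "None"."""
    if isinstance(value, dict):
        cleaned = {k: drop_none(v) for k, v in value.items() if v != "None"}
        return {k: v for k, v in cleaned.items() if v not in ({}, [])}
    if isinstance(value, list):
        items = [drop_none(v) for v in value if v != "None"]
        return [v for v in items if v not in ({}, [])]
    return value


print(drop_none(record))
```

Looking at the targets this way makes it easier to see how much of each label is placeholder text, which matters once the labels are truncated to a fixed token budget.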

In [None]:
# splitting the dataset
train_test_split = dataset.train_test_split(test_size=0.2)
train_dataset = train_test_split['train'] #or simply use dataset
test_dataset = train_test_split['test']

In [None]:
prefix = "summarize: " # prefix the input with a summarization prompt

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["full_note"]] # full_note = model input
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    # reminder: 128 tokens may be too short for these summaries, but we keep it for learning purposes
    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
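To make the effect of `prefix`, `max_length` and `truncation=True` concrete, here is a toy sketch that uses whitespace splitting as a stand-in for the real SentencePiece tokenizer. The function `toy_encode` and the example strings are hypothetical. It only illustrates the idea that text beyond the budget is cut off while the end-of-sequence marker is kept.

```python
PREFIX = "summarize: "


def toy_encode(text, max_length):
    """Whitespace 'tokenizer': truncate to max_length, reserving one slot for </s>."""
    tokens = text.split()
    return tokens[: max_length - 1] + ["</s>"]


note = "Patient reports neck pain and cannot stand without assistance"
inputs = toy_encode(PREFIX + note, max_length=6)
labels = toy_encode('{ "visit motivation": "Neck pain" }', max_length=4)

print(inputs)  # the prefix counts against the input budget
print(labels)  # the JSON target is cut mid-structure
```

With a tight label budget, the model is trained on targets that stop partway through the JSON structure, so it may never see a complete, well-formed summary.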

In [None]:
# tokenize the text
tokenized_train = train_dataset.map(preprocess_function, batched=True)
tokenized_test = test_dataset.map(preprocess_function, batched=True) #or simply dataset.map()

Map:   0%|          | 0/24000 [00:00<?, ? examples/s]

Map:   0%|          | 0/6000 [00:00<?, ? examples/s]

In [None]:
from transformers import DataCollatorForSeq2Seq

# collator with dynamic padding: each batch is padded to the length of its longest sequence
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)
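Dynamic padding can be sketched in a few lines of plain Python. This is an illustration of the idea, not the library's implementation. The function `pad_batch` and the token ids are hypothetical, and the defaults assume a pad token id of 0 for inputs and -100 for labels, so that padded label positions are ignored by the loss.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Right-pad every sequence to the longest one in this batch."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_input - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        n_label_pad = max_label - len(f["labels"])
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_label_pad)
    return batch


features = [
    {"input_ids": [21603, 10, 5, 1], "labels": [7, 8, 1]},
    {"input_ids": [21603, 10, 1], "labels": [9, 1]},
]
batch = pad_batch(features)
print(batch)
```

Padding per batch rather than to a global maximum keeps short batches small, which saves memory and compute when note lengths vary widely.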

### Evaluation

In [None]:
import evaluate

rouge = evaluate.load("rouge") # Recall-Oriented Understudy for Gisting Evaluation
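For intuition about what ROUGE-1 measures, here is a simplified sketch of unigram overlap. It is not the `rouge_score` implementation: it assumes lowercase whitespace tokenization and skips stemming, and the function name `rouge1` is hypothetical.

```python
from collections import Counter


def rouge1(prediction, reference):
    """Unigram precision, recall and F1 between a prediction and a reference."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())  # clipped count of shared unigrams
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1(
    "acute chest pain and dizziness",
    "acute chest pain and shortness of breath",
)
print(scores)
```

Here four of the five predicted words appear in the seven-word reference, giving a precision of 0.8 and a recall of about 0.57. ROUGE-2 applies the same idea to bigrams, and ROUGE-L uses the longest common subsequence.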

In [None]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
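Two steps in `compute_metrics` are easy to miss: label positions set to -100 must be mapped back to the pad token before decoding, and `gen_len` counts only non-pad tokens. The sketch below mirrors both steps in plain Python with made-up token ids. `PAD_ID = 0` is an assumption matching T5's pad token id, and the helper names are hypothetical.

```python
PAD_ID = 0     # assumed pad token id (T5 uses 0)
IGNORE = -100  # label value ignored by the loss


def unmask(labels):
    """Replace -100 with the pad id so the ids can be decoded."""
    return [[PAD_ID if t == IGNORE else t for t in row] for row in labels]


def mean_gen_len(predictions):
    """Average number of non-pad tokens per generated sequence."""
    lengths = [sum(1 for t in row if t != PAD_ID) for row in predictions]
    return sum(lengths) / len(lengths)


labels = [[7, 8, 1, -100, -100], [9, 1, -100, -100, -100]]
predictions = [[0, 7, 8, 1, 0], [0, 9, 1, 0, 0]]

print(unmask(labels))
print(mean_gen_len(predictions))
```

T5 starts decoding from the pad token, so the leading 0 in each prediction is excluded from the count. That may be why sequences capped at 20 generated ids would be reported with a length of 19. This is an inference, not something the notebook states.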

### Loading and finetuning model

In [None]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint) # loading t5 for summarization



In [None]:
training_args = Seq2SeqTrainingArguments( # training hyperparameters
    output_dir="hf://hossboll/clinical-t5", #or local folder
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=4,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

#trainer.train()

For more details, please read https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http.
Cloning https://huggingface.co/hossboll/clinical-t5 into local empty directory.


In [None]:
checkpoint_path="/content/drive/MyDrive/Colab Notebooks/NLP/clinical-t5/checkpoint-5500" #finishing training from checkpoint
trainer.train(resume_from_checkpoint=checkpoint_path)

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 4 | 0.3174 | 0.250425 | 0.26 | 0.2 | 0.2564 | 0.2565 | 19.0 |


Several commits (2) will be pushed upstream.


TrainOutput(global_step=6000, training_loss=0.026449076334635415, metrics={'train_runtime': 888.2679, 'train_samples_per_second': 108.076, 'train_steps_per_second': 6.755, 'total_flos': 2.595766934175744e+16, 'train_loss': 0.026449076334635415, 'epoch': 4.0})
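The numbers in this output can be sanity-checked with a little arithmetic. The sketch below assumes a single GPU and no gradient accumulation, and takes the training-set size of 24,000 from the `Map` progress bar earlier in the notebook. The final step is a hypothesis, not a documented fact: it checks whether the much smaller `training_loss` in `TrainOutput` is consistent with the loss of the resumed steps being averaged over all global steps.

```python
import math

n_train = 24_000      # training examples (from the Map progress bar)
batch_size = 16       # per_device_train_batch_size
epochs = 4            # num_train_epochs
resumed_from = 5_500  # step of the checkpoint used to resume

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
steps_after_resume = total_steps - resumed_from

print(steps_per_epoch, total_steps, steps_after_resume)

# Hypothesis: loss summed over the resumed steps, divided by all global steps.
logged_loss = 0.3174  # training loss shown in the epoch-4 row
diluted_loss = logged_loss * steps_after_resume / total_steps
print(round(diluted_loss, 5))
```

The total of 6,000 steps matches `global_step=6000`, and the diluted value is close to the reported `training_loss` of about 0.0264. If the hypothesis holds, the 0.3174 in the table is the more meaningful figure for this run.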

### Uploading finetuned model to the hub

In [None]:
trainer.push_to_hub()

To https://huggingface.co/hossboll/clinical-t5
   29b9eea..a5809a7  main -> main

To https://huggingface.co/hossboll/clinical-t5
   a5809a7..5a18ca8  main -> main



'https://huggingface.co/hossboll/clinical-t5/commit/a5809a784af499c93d39057fcc794ddb8506247a'

### Inference

Now we can use the `pipeline` function from Hugging Face Transformers to test the model on some clinical notes.

In [None]:
from transformers import pipeline

finetuned = "hossboll/clinical-t5"
summarizer = pipeline("summarization", model=finetuned)



In [None]:
clinicalnote = """
Chief Complaint:
Patient is a 58-year-old male presenting with acute chest pain and shortness of breath that began approximately 2 hours prior to admission.

History of Present Illness:
The patient describes the pain as a pressing sensation behind the sternum, radiating to the left arm.
The pain was preceded by episodes of mild exertional dyspnea over the past week, which the patient did not seek medical attention for.
Denies associated symptoms of palpitations, dizziness, or loss of consciousness. Past medical history is significant for hypertension and
type 2 diabetes mellitus. The patient is a current smoker, with a 30-pack-year history, and reports occasional alcohol use.

Physical Examination:

General: The patient is alert and oriented, in mild distress due to pain.
Vital Signs: Blood pressure 160/90 mmHg, heart rate 110 bpm, respiratory rate 20 breaths/min, temperature 37.1°C, oxygen saturation 94% on room air.
Cardiovascular: Tachycardic regular rhythm, no murmurs, rubs, or gallops noted. Jugular venous pressure is not elevated.
Respiratory: Mild bilateral basilar crackles, no wheezes or stridor.
Abdomen: Soft, non-distended, with no tenderness or guarding.
Extremities: No edema, cyanosis, or clubbing. Pulses are palpable and symmetrical.
Assessment/Plan:
The clinical presentation is suggestive of an acute coronary syndrome, possibly a myocardial infarction.
Immediate steps include administration of aspirin, nitroglycerin, and morphine for pain control.
Additional diagnostic tests ordered include a 12-lead ECG, chest X-ray, and cardiac biomarkers.
A cardiology consult has been requested for further evaluation and management.
The patient has been advised to remain NPO (nothing by mouth) pending further evaluation.

""" # generated with gpt

In [None]:
summary = summarizer(clinicalnote)
summary

[{'summary_text': ' "visit motivation": "Acute chest pain, shortness of breath, and acute chest pain", "admission": [  “reason”: "Paining sensation behind the sternum, radiating to the left arm, and episodes of mild exertional dyspnea over the past week, preceded by episodes of moderate exertional dizziness, or loss of consciousness, and occasional alcohol use".'}]
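The generated text is a JSON-like fragment rather than valid JSON: it has no enclosing braces, mixes straight and curly quotes, and stops mid-structure. One hedged way to still pull out individual fields is a tolerant regular expression. The helper `extract_field` below is hypothetical, and `excerpt` is a shortened copy of the output above.

```python
import re

# Map typographic quotes that appear in the model output to plain double quotes.
QUOTES = str.maketrans({"“": '"', "”": '"', "«": '"', "»": '"'})


def extract_field(summary_text, field):
    """Return the first complete string value for `field`, or None if absent."""
    text = summary_text.translate(QUOTES)
    pattern = r'"%s"\s*:\s*"([^"]*)"' % re.escape(field)
    match = re.search(pattern, text)
    return match.group(1) if match else None


excerpt = (
    ' "visit motivation": "Acute chest pain, shortness of breath, '
    'and acute chest pain", "admission": [  “reason”: "Paining sensation '
    "behind the sternum"
)

print(extract_field(excerpt, "visit motivation"))
print(extract_field(excerpt, "reason"))  # value is cut off, so nothing is returned
```

This only recovers fields whose value is complete. Values that are cut off are skipped rather than guessed.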

In [None]:
clinicalnote_2 = """Chief Complaint:
Patient is a 65-year-old female presenting with sudden onset of severe headache and blurred vision that began early this morning.

History of Present Illness:
The patient describes the headache as the worst she has ever experienced, localized mainly in the back of the head. She reports that the onset was sudden, and the pain has not subsided. She also experiences nausea without vomiting. There is no history of similar headaches. Past medical history is significant for controlled hypertension and hyperlipidemia. The patient denies smoking but reports occasional alcohol use.

Physical Examination:

General: The patient is alert but appears anxious and in moderate distress due to pain.
Vital Signs: Blood pressure 180/100 mmHg, heart rate 85 bpm, respiratory rate 18 breaths/min, temperature 36.8°C, oxygen saturation 97% on room air.
Neurological: Pupils equally round and reactive to light, no focal neurological deficits observed, Glasgow Coma Scale score of 15.
Cardiovascular: Regular rhythm, no murmurs or gallops noted.
Respiratory: Clear to auscultation bilaterally, no wheezes, crackles, or stridor.
Abdomen: Soft, non-tender, non-distended, no guarding or rebound.
Extremities: No edema, no cyanosis, pulses are symmetrical and strong in all extremities.
Assessment/Plan:
The clinical presentation raises concerns for a possible subarachnoid hemorrhage given the sudden onset and severity of the headache. Immediate actions include administration of pain relief medication and antiemetic for nausea control.
Further diagnostic tests ordered include a CT scan of the head and a possible lumbar puncture if no hemorrhage is detected on the CT.
A neurology consult has been requested for further evaluation and management.
The patient has been advised to remain NPO (nothing by mouth) and under close observation pending further diagnostic results.
""" # generated with gpt

In [None]:
summary_2 = summarizer(clinicalnote_2)
summary_2

[{'summary_text': ' "visit motivation": "Sudden onset of severe headache and blurred vision"  ], "admission": [  «reason": The patient is a 65-year-old female presenting with a sudden onset and severity of the headache, a possible subarachnoid hemorrhage, and a potential lumbar puncture".'}]

In [None]:
noisy_clinicalnote = """Chief Complaint:
Patient is a 70-year-old male who came to the emergency department complaining of general fatigue and a sudden onset of dizziness that began this morning while he was gardening. His wife insisted he come to the hospital after he refused lunch, which was unusual for him.

History of Present Illness:
The patient reports feeling unusually tired over the past several days, with today marking the first instance of vertigo. He mentions the dizziness was severe enough to cause him to sit down abruptly, but he did not lose consciousness. He recalls feeling similar, though less severe, episodes last month which he attributed to the hot weather and his rigorous gardening schedule. Past medical history is notable for controlled hypertension and benign prostatic hyperplasia. He is retired from being a school principal, has three children, all in good health, and is a non-smoker. He occasionally drinks wine with dinner.

Physical Examination:

General: The patient is a well-nourished, well-groomed elderly male who appears his stated age and is cooperative but slightly anxious about the hospital visit.
Vital Signs: Blood pressure 150/90 mmHg, heart rate 90 bpm, respiratory rate 18 breaths/min, temperature 36.7°C, oxygen saturation 95% on room air.
Neurological: Alert, oriented to person, place, and time. No focal neurological deficits observed. He complains of mild continuous dizziness during the exam.
Cardiovascular: Regular rhythm, no murmurs or gallops, peripheral pulses are intact.
Respiratory: Effort normal, lung fields clear to auscultation, no wheezes, rhonchi, or crackles.
Abdomen: Soft, non-tender, no organomegaly, normal bowel sounds.
Extremities: No cyanosis, clubbing, or significant peripheral edema. Mild arthritis noted in the fingers.
Skin: Shows signs of chronic sun exposure with multiple benign-looking lesions noted on the forearms and back of the neck.
Assessment/Plan:
Given the patient's age and symptoms, the differential diagnosis includes benign positional vertigo, vestibular neuronitis, or possibly a cardiovascular event. Plan to conduct a more detailed vestibular assessment and review his medications to rule out side effects. Blood tests to check electrolytes and cardiac enzymes, and an ECG are ordered. Considering an outpatient follow-up with his primary care physician and possibly a referral to an ENT specialist if symptoms persist.
Advise to avoid strenuous activities and stay hydrated, especially while outdoors. The patient has been given instructions to monitor his symptoms and return if he experiences any worsening of his condition or new symptoms.
""" # generated with gpt

In [None]:
summary_3 = summarizer(noisy_clinicalnote)
summary_3

[{'summary_text': ' "visit motivation": "General fatigue and a sudden onset of dizziness", "admission": [  “reason”: "Gast fatigue and sudden apparition of dizzyness, severe enough to cause him to sit down abruptly, but he did not lose consciousness, and symptoms related to the hot weather and his rigorous gardening schedule", "date":" Today marks the first instance of vertigo, and is not a school principal, has three children, all in good health, and slightly anxious about the hospital visit", "duration"'}]