# HealTac 2024 Tutorial
## Instruction Tuning for Discharge Notes Summarization

- Yunsoo Kim (yunsoo.kim.23@ucl.ac.uk), Jinge Wu (jinge.wu.20@ucl.ac.uk), Honghan Wu (honghan.wu@ucl.ac.uk)

<a target="_blank" href="https://colab.research.google.com/github/knowlab/healtac_2024_tutorial/blob/main/discharge_notes_summarization.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

Set the Colab runtime type to T4 GPU before running the cells.

We start by installing the required packages and downloading the model.

In [3]:
# Run nvidia-smi to check which GPU is available
!nvidia-smi

Tue Jun  4 20:02:16 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   71C    P8              11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [4]:
# First, install required packages
!pip install -q accelerate==0.25.0 peft==0.6.2 bitsandbytes==0.41.1 transformers==4.36.2 trl==0.7.4 einops gradio

In [5]:
# Import Libraries
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM
import gradio as gr



In [6]:
# Define Quantization Config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
)

# Load Model and Dataset
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map="auto",
    revision="refs/pr/23"
)

tokenizer = AutoTokenizer.from_pretrained('microsoft/phi-2')
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

dataset = load_dataset("bluesky333/synthetic_discharge_summ")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Downloading data:   0%|          | 0.00/635k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/5.68k [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]
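To see why 4-bit loading matters on a T4 with roughly 15 GB of memory, the sketch below estimates the memory taken by the weights alone. It assumes about 2.7 billion parameters for Phi-2 (a round figure, not measured here) and ignores quantization constants, activations, and the KV cache, so real usage will be higher.

```python
# Rough weight-memory estimate. PHI2_PARAMS is an assumed round figure.
PHI2_PARAMS = 2.7e9

def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

fp16_gib = weight_memory_gib(PHI2_PARAMS, 16)  # half-precision weights
nf4_gib = weight_memory_gib(PHI2_PARAMS, 4)    # 4-bit quantized weights
print(f"fp16: {fp16_gib:.2f} GiB, 4-bit: {nf4_gib:.2f} GiB")
```

The 4-bit weights take a quarter of the fp16 footprint, which leaves more room for activations and optimizer state during fine-tuning.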

In [7]:
# Let's have a look at the train dataset
print(dataset['train'])
dataset['train'][0]

Dataset({
    features: ['patient_id', 'note', 'question', 'answer'],
    num_rows: 442
})


{'patient_id': 3781,
 'note': 'Discharge Summary:\n\nPatient Condition: Venous thrombosis of the left gastrocnemius and fibular veins.\n\nHistory of Present Illness:\nThe patient, a 66-year-old female presented with pain and edema of the left lower limb spreading to the top of the thigh. The patient reported immobilization for a few hours as the only risk factor for thrombosis during an interview. Doppler ultrasonography showed venous thrombosis of the left gastrocnemius and fibular veins and a left PVA.\n\nHospital Course:\nThe patient was treated with systemic anticoagulation, with the medication Rivaroxaban administered for 6 months. The patient received instructions to care for the site with localized heat and elevation. Medium pressure elastic stockings were used for compression therapy.\n\nDischarge Instructions:\nThe patient should continue to care for the site with localized heat and elevation. Compression therapy with medium pressure elastic stockings should be continued as we

In [8]:
# the test dataset
print(dataset['test'])
dataset['test'][0]

Dataset({
    features: ['patient_id', 'note', 'question', 'answer'],
    num_rows: 5
})


{'patient_id': 8827,
 'note': "Discharge Summary:\n\nPatient Identification:\nThe patient is a 42-year-old female who underwent a forehead lift using Endotine fixation two years ago.\n\nReason for Admission and Treatment:\nThe patient was admitted to the hospital for skin necrosis on the left frontotemporal scalp as a result of her Endotine fixation. This was addressed with a transplant of 210 FUs harvested from the occipital scalp using the FUE method. The procedure lasted 1.3 hours, and only a single session was required.\n\nProgress During Hospitalization:\nFollowing the graft, the patient's left frontotemporal scalp showed visible improvement at the 12-month follow-up, with 80 % graft survival. The preoperative POSAS of the patient scale was 12 and that of the observer scale was 9; the postoperative POSAS was of the patient scale was 5 and that of the observer scale was 3.",
 'question': 'Write a discharge instructions for the given note.',
 'answer': 'The patient is advised to avo

In [9]:
# Convert the dataset into a Phi-2 compatible format
# Phi-2 instruction-answer format: "Instruct: <prompt>\nOutput:"

# Make your own prompt!
prompt_template="""Instruct: Please write down your own prompt.
For instance, you can insert the note as {{note}}
{note}
The model should answer {{question}} based on the note.
{question}
Keep the phi-2 format:
the last line must look like the one below.
Do not forget to insert a new line between your prompt and 'Output'!
Output: {answer}
"""

prompt_template="""Instruct: Answer the question about the following clinical note. \n{note}.
Output: {answer}
"""


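# Receives a batch as a dict of lists (one list per column)
# and returns a list of formatted prompts
def format_dataset(samples):
    outputs = []
    # Access columns by name so the result does not depend on column order
    for note, question, answer in zip(samples["note"], samples["question"], samples["answer"]):
        out = prompt_template.format(note=note, question=question, answer=answer)
        outputs.append(out)
    return outputs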

sample_input = format_dataset({k: [v] for k, v in dataset['train'][0].items()})[0]
print(sample_input)
print("*"*20)

# Sanity Check
prompt_len = len(tokenizer.encode(prompt_template))
if prompt_len > 180:
    raise ValueError(f"Your prompt is too long! Please reduce the length from {prompt_len} to 180 tokens")
print(f"Prompt Length: {prompt_len} tokens")

Instruct: Answer the question about the following clinical note. 
Discharge Summary:

Patient Condition: Venous thrombosis of the left gastrocnemius and fibular veins.

History of Present Illness:
The patient, a 66-year-old female presented with pain and edema of the left lower limb spreading to the top of the thigh. The patient reported immobilization for a few hours as the only risk factor for thrombosis during an interview. Doppler ultrasonography showed venous thrombosis of the left gastrocnemius and fibular veins and a left PVA.

Hospital Course:
The patient was treated with systemic anticoagulation, with the medication Rivaroxaban administered for 6 months. The patient received instructions to care for the site with localized heat and elevation. Medium pressure elastic stockings were used for compression therapy.

Discharge Instructions:
The patient should continue to care for the site with localized heat and elevation. Compression therapy with medium pressure elastic stockings s
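The template above relies on `str.format`: single braces such as `{note}` are filled in, while doubled braces such as `{{note}}` are emitted as literal text. The following minimal sketch illustrates this with made-up sample values and a hypothetical template that is not part of the tutorial.

```python
# Minimal illustration of str.format with escaped braces.
# The template and the sample values are hypothetical.
template = (
    "Instruct: Use {{note}} and {{question}} as placeholders.\n"
    "{note}\n"
    "{question}\n"
    "Output: {answer}\n"
)

filled = template.format(
    note="Patient admitted with a cough.",  # hypothetical note
    question="Summarize the note.",         # hypothetical question
    answer="",                              # left empty at inference time
)
print(filled)

# Everything up to the response marker is what the model sees at inference time.
prompt_only = filled.split("Output:")[0] + "Output:"
```

Including `{question}` in your own template lets the model see the actual question for each example.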

In [10]:
sample_idx = 0
sample_input = format_dataset({k: [v] for k, v in dataset['train'][sample_idx].items()})[0].split('Output: ')[0]
input_ids = tokenizer.encode(sample_input, return_tensors='pt').to('cuda')
with torch.no_grad():
    output = model.generate(
        input_ids=input_ids,
        max_length=512,
        use_cache=True,
        do_sample=False,  # greedy decoding (temperature only applies when sampling)
        eos_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(output.to('cpu')[0], skip_special_tokens=True))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Instruct: Answer the question about the following clinical note. 
Discharge Summary:

Patient Condition: Venous thrombosis of the left gastrocnemius and fibular veins.

History of Present Illness:
The patient, a 66-year-old female presented with pain and edema of the left lower limb spreading to the top of the thigh. The patient reported immobilization for a few hours as the only risk factor for thrombosis during an interview. Doppler ultrasonography showed venous thrombosis of the left gastrocnemius and fibular veins and a left PVA.

Hospital Course:
The patient was treated with systemic anticoagulation, with the medication Rivaroxaban administered for 6 months. The patient received instructions to care for the site with localized heat and elevation. Medium pressure elastic stockings were used for compression therapy.

Discharge Instructions:
The patient should continue to care for the site with localized heat and elevation. Compression therapy with medium pressure elastic stockings s

In [11]:
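# Next, define the data collator and the training dataset.
# The collator computes the loss only on tokens after the response template.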
response_template = "Output:"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

train_dataset = dataset['train']
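To build intuition for what a completion-only collator does, here is a simplified sketch of the masking idea: every label up to and including the response marker is set to `-100`, the value that PyTorch's cross-entropy loss ignores by default. This is an illustration with made-up token ids and a hypothetical helper `mask_prompt_tokens`, not trl's actual implementation.

```python
# Simplified illustration of completion-only loss masking.
# NOT trl's implementation; the token ids below are made up.
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_prompt_tokens(input_ids, response_ids):
    """Return labels where everything up to and including the marker is ignored."""
    labels = list(input_ids)
    n = len(response_ids)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == response_ids:
            end = start + n
            labels[:end] = [IGNORE_INDEX] * end
            return labels
    # Marker not found: ignore the whole example so it adds no loss.
    return [IGNORE_INDEX] * len(input_ids)

# Hypothetical sequence: prompt tokens, marker tokens (90, 91), answer tokens
input_ids = [11, 12, 13, 90, 91, 21, 22, 23]
labels = mask_prompt_tokens(input_ids, response_ids=[90, 91])
print(labels)  # [-100, -100, -100, -100, -100, 21, 22, 23]
```

Only the answer tokens keep their labels, so the model is trained to produce the answer rather than to reproduce the note.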

In [12]:
# SFTTrainer does everything else for you!

lora_config=LoraConfig(
    r=4,
    task_type="CAUSAL_LM",
    target_modules= ["Wqkv", "fc1", "fc2" ]
)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    fp16=True,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=1e-4,
    optim="paged_adamw_32bit",
    save_strategy="no",
    warmup_ratio=0.03,
    logging_steps=1,
    lr_scheduler_type="cosine",
    report_to="none",  # disable logging integrations; None would fall back to all installed ones
    gradient_checkpointing=True
)

trainer = SFTTrainer(
    model,
    training_args,
    train_dataset=train_dataset,
    formatting_func=format_dataset,
    data_collator=collator,
    peft_config=lora_config,
    max_seq_length=512,
    tokenizer=tokenizer,
)

You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.


Map:   0%|          | 0/442 [00:00<?, ? examples/s]
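With `r=4` on `Wqkv`, `fc1`, and `fc2`, only a small number of parameters are trained. The sketch below gives a back-of-the-envelope count. The layer shapes (hidden size 2560, MLP inner size 10240, 32 layers, a fused `Wqkv` projection to three times the hidden size) are assumptions based on Phi-2's published configuration and were not read from the loaded model.

```python
# Back-of-the-envelope LoRA parameter count.
# The layer shapes are ASSUMPTIONS, not values read from the loaded model.
HIDDEN = 2560
MLP_INNER = 4 * HIDDEN
NUM_LAYERS = 32
RANK = 4

# module name -> (in_features, out_features)
target_shapes = {
    "Wqkv": (HIDDEN, 3 * HIDDEN),
    "fc1": (HIDDEN, MLP_INNER),
    "fc2": (MLP_INNER, HIDDEN),
}

def lora_params(in_features: int, out_features: int, rank: int) -> int:
    # LoRA adds two matrices per module: A (rank x in) and B (out x rank)
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in target_shapes.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable LoRA parameters")
```

Under these assumptions the adapter is a small fraction of a percent of the base model's parameters, which is why training fits on a single T4.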

In [13]:
# Run Training
trainer.train()

You're using a CodeGenTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
1,1.2148
2,1.2299
3,1.133
4,1.3171
5,1.2493
6,1.2606
7,1.1878
8,1.2054
9,1.2285
10,1.12


TrainOutput(global_step=27, training_loss=1.1367349094814725, metrics={'train_runtime': 234.3161, 'train_samples_per_second': 1.886, 'train_steps_per_second': 0.115, 'total_flos': 2045680383344640.0, 'train_loss': 1.1367349094814725, 'epoch': 0.98})
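The `global_step=27` and `epoch: 0.98` values in the output follow from the batch settings. The sketch below reproduces the arithmetic, assuming a single GPU and that the number of optimizer steps per epoch is rounded down, so leftover batches that do not fill a full accumulation cycle are not counted.

```python
import math

# Values taken from the dataset and TrainingArguments above
num_examples = 442
per_device_batch_size = 2
gradient_accumulation_steps = 8

# Number of examples that contribute to one optimizer update
effective_batch_size = per_device_batch_size * gradient_accumulation_steps

# Batches produced by the dataloader in one epoch
batches_per_epoch = math.ceil(num_examples / per_device_batch_size)

# Optimizer updates per epoch (assumed to be rounded down)
update_steps = batches_per_epoch // gradient_accumulation_steps

# Fraction of the epoch actually covered by those updates
epoch_fraction = update_steps * gradient_accumulation_steps / batches_per_epoch

print(effective_batch_size, batches_per_epoch, update_steps, round(epoch_fraction, 2))
```

Each update therefore sees 16 examples, even though only 2 are held in GPU memory at a time.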

In [29]:
# Wrap up training and switch the model to evaluation mode
model = trainer.model
model.eval()

note_samples = dataset['test']['note']

def inference(note, question, model):
    # Build the prompt with an empty answer so the model completes after "Output:"
    prompt = prompt_template.format(note=note, question=question, answer="")
    tokens = tokenizer.encode(prompt, return_tensors="pt").to('cuda')
    outs = model.generate(
        input_ids=tokens,
        max_length=512,
        use_cache=True,
        do_sample=False,  # greedy decoding (temperature only applies when sampling)
        eos_token_id=tokenizer.eos_token_id,
    )
    output_text = tokenizer.decode(outs.to('cpu')[0], skip_special_tokens=True)
    # Drop the echoed prompt and return only the generated answer
    return output_text[len(prompt):]


def compare_models(note, question):
    with torch.no_grad():
        # Answer from the LoRA fine-tuned model
        tuned_answer = inference(note, question, model)
        # Temporarily disable the adapter to get the base Phi-2 answer
        with model.disable_adapter():
            phi_answer = inference(note, question, model)
    return tuned_answer, phi_answer

demo = gr.Interface(fn=compare_models, inputs=[gr.Dropdown(note_samples), "text"], outputs=[gr.Textbox(label="Trained"), gr.Textbox(label="Phi-2")])
demo.launch()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://dda660614404bcb67c.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
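The `inference` function removes the prompt by slicing off `len(prompt)` characters, which works when the decoded text reproduces the prompt character for character. The sketch below shows a slightly more defensive variant that falls back to the text after the last `Output:` marker. The helper name `extract_answer` and the sample strings are hypothetical.

```python
def extract_answer(generated: str, prompt: str, marker: str = "Output:") -> str:
    """Return only the model's completion from the decoded text."""
    if generated.startswith(prompt):
        # Usual case: the decoded text echoes the prompt exactly
        return generated[len(prompt):].strip()
    # Fallback: take whatever follows the last response marker
    _, found, tail = generated.rpartition(marker)
    return tail.strip() if found else generated.strip()

# Hypothetical decoded outputs
prompt = "Instruct: Summarize the note.\nOutput: "
exact = extract_answer(prompt + "Rest and fluids.", prompt)
drifted = extract_answer("Instruct: Summarize the note.\nOutput:Rest and fluids.", prompt)
print(exact, "|", drifted)
```

Both calls return the same answer, even though the second decoded string differs from the prompt in whitespace.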




In [30]:
dataset['test']['answer']

['The patient is advised to avoid excessive sun exposure and follow recommended care procedures as directed by the physician. The physician should be contacted immediately if there is any evidence of renewed skin necrosis on the left frontotemporal scalp or other adverse reactions.',
 'The patient is to continue monitoring hepatic function. Follow-up visits will be scheduled as needed.\n\nFollow-up: \n\nFollow-up appointments will be scheduled as needed to monitor hepatic function.',
 '1) Take oral antibiotics as directed by the physician.\n2) Follow-up with primary care physician as required.\n3) Inform the physician of any new or recurrent symptoms.',
 'The patient is to avoid any hard or crunchy foods for 24 hours. A soft, bland diet is recommended to minimize discomfort. Over-the-counter pain medication may be taken to manage any post-operative discomfort. If any complications arise, please seek medical attention immediately.\n\nClinical Team:\n[redacted]',
 'The patient was discha