<a href="https://colab.research.google.com/github/nyp-sit/iti107-2024s2/blob/main/session-6/finetune_medicalchat.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tune a Causal Language Model for a Medical Chatbot

In this exercise, you will fine-tune Meta's Llama 3.2 LLM to be a medical chatbot. We will use the Hugging Face TRL (Transformer Reinforcement Learning) library to perform Supervised Fine-Tuning (SFT), together with Parameter-Efficient Fine-Tuning (PEFT) to make the fine-tuning fast and memory-efficient.

Before you start the exercise, make sure you have requested access to the Llama 3.2 model. If you have not done so, go to the [model page](https://huggingface.co/meta-llama/Llama-3.2-1B), fill in your personal information and accept the license agreement. You may need to wait a few minutes before access is granted. You can check the status using the [gated repo link](https://huggingface.co/settings/gated-repos).

You also need to create an access token and use it to log in to the Hugging Face Hub so that the code below can access the model. You can create the access token on your profile page, under access tokens, or use this [link](https://huggingface.co/settings/tokens).


In [None]:
%%capture
!pip install -q accelerate peft transformers trl wandb
!pip install -U bitsandbytes

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from datasets import load_dataset
import torch

In [None]:
# Log in to the Hugging Face Hub using your access token
from huggingface_hub import notebook_login
notebook_login()

## Templating Instruction Data

To fine-tune a base LLM to follow instructions, we will need to prepare instruction data that follows a chat template.

<img src="https://github.com/nyp-sit/iti107-2024S2/blob/main/assets/chat_template.png?raw=true" />

This chat template differentiates between what the LLM generates and what the user generates. Many LLM chat models available on Hugging Face come with a built-in chat template that you can use.

You can read more about chat templates [here](https://huggingface.co/docs/transformers/v4.46.3/chat_templating).

In [None]:
# This is the chat model, Llama-3.2-1B-Instruct. We only load its tokenizer because we want to use its chat template to format our data
chat_model = "meta-llama/Llama-3.2-1B-Instruct"
base_model = "meta-llama/Llama-3.2-1B"

In [None]:
template_tokenizer = AutoTokenizer.from_pretrained(chat_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)
chat_template = template_tokenizer.get_chat_template()
print(chat_template)
tokenizer.chat_template = chat_template

The template is written in Jinja (a templating language). You can see that the template contains special tokens such as `<|start_header_id|>` and `<|end_header_id|>`, which delimit the role of each message (`user`, `assistant` or `system`). Another special token, `<|eot_id|>`, marks the end of a turn.
This template also supports tool use.
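
To make the role of these special tokens concrete, here is a minimal sketch that builds the same kind of string by hand. `render_llama3` is a hypothetical helper written for illustration only; it reproduces the role headers and end-of-turn tokens but leaves out the tool handling and the date preamble that the real Jinja template adds.

```python
# Simplified stand-in for the Llama 3 chat template (illustration only).
def render_llama3(messages, add_generation_prompt=False):
    text = "<|begin_of_text|>"
    for message in messages:
        # Each turn: role header, blank line, content, end-of-turn token.
        text += (
            "<|start_header_id|>" + message["role"] + "<|end_header_id|>\n\n"
            + message["content"].strip() + "<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


example = render_llama3(
    [
        {"role": "system", "content": "You are a helpful medical doctor"},
        {"role": "user", "content": "I have stomach pain. What should I do?"},
    ],
    add_generation_prompt=True,
)
print(example)
```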

#### Using ChatML template (optional)

The ChatML template (from OpenAI) is a very common template used in LLM chatbot models. The template looks like this:
```
{%- for message in messages %}
    {{- '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }}
{%- endfor %}
```

We can set the base model tokenizer to use this template instead. Here is how you can do it:

```
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.chat_template = "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
```
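
The effect of `add_generation_prompt` is easier to see on a rendered string. The sketch below mirrors the ChatML template above in plain Python; `render_chatml` is a hypothetical helper and the two messages are made-up illustration data, not samples from the dataset.

```python
# Plain-Python mirror of the ChatML template (illustration only).
def render_chatml(messages, add_generation_prompt=False):
    text = ""
    for message in messages:
        text += "<|im_start|>" + message["role"] + "\n" + message["content"] + "<|im_end|>" + "\n"
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        text += "<|im_start|>assistant\n"
    return text


conversation = [
    {"role": "user", "content": "I have a headache."},
    {"role": "assistant", "content": "How long has it lasted?"},
]

# Training text: the full conversation, with the assistant turn closed.
training_text = render_chatml(conversation)
# Inference prompt: only the user turn, followed by an open assistant turn.
inference_prompt = render_chatml(conversation[:1], add_generation_prompt=True)

print(training_text)
print(inference_prompt)
```

For training data the assistant's reply is already part of the conversation, so no generation prompt is needed; at inference time the open assistant turn tells the model where to start writing.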

In [None]:
# tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
# tokenizer.chat_template = "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
# print(tokenizer.chat_template)

### Format the data according to chat template

Let's download our data and format it according to the template. We shuffle the data and select a subset of 3000 samples to reduce training time, then hold out 20% of them for validation.


In [None]:
dataset_name = "ruslanmv/ai-medical-chatbot"
dataset = load_dataset(dataset_name, split="train")
dataset = dataset.shuffle(seed=128).select(range(3000)).train_test_split(test_size=0.2)

In [None]:
dataset_train = dataset['train']
dataset_val = dataset['test']

Let's define a map function that converts the data fields into the prompt template.
Note that the formatted prompt is stored under the 'text' field of the returned dictionary. This is the default field in which the trainer looks for the text data.

In [None]:
def format_chat_template(row):
    row_json = [
        {"role": "system", "content": "You are a helpful medical doctor"},
        {"role": "user", "content": row["Patient"]},
        {"role": "assistant", "content": row["Doctor"]}]

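    # add_generation_prompt=False: the conversation already ends with the assistant's reply,
    # so no empty assistant header should be appended to the training text
    prompt = tokenizer.apply_chat_template(row_json, tokenize=False, add_generation_prompt=False)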
    # print(prompt)
    return {"text": prompt}

In [None]:
dataset_train = dataset_train.map(format_chat_template, remove_columns=list(dataset_train.features))
dataset_val = dataset_val.map(format_chat_template, remove_columns=list(dataset_val.features))

Using the "text" column, we can explore these formatted prompts:

In [None]:
dataset_train[0]['text']

### Model Quantization

Now that we have our data, we can start loading our model. This is where we apply the Q in QLoRA, namely quantization. We use the bitsandbytes package to compress the pretrained model to a 4-bit representation.

In `BitsAndBytesConfig`, you can define the quantization scheme. We follow the steps used in the original QLoRA paper and load the model in 4-bit (`load_in_4bit`) with the 4-bit NormalFloat representation (`bnb_4bit_quant_type="nf4"`) and double quantization (`bnb_4bit_use_double_quant`).

For an excellent explanation of quantization, read the blog post "[A Visual Guide to Quantization](https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization)" by Maarten Grootendorst.
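
As a rough intuition for what 4-bit storage means, the sketch below applies plain absmax quantization to a handful of made-up weights. This is a deliberate simplification and not what bitsandbytes does: NF4 uses non-uniform levels and block-wise scales, whereas this example assumes uniform integer levels and a single scale. The function names are hypothetical.

```python
# Uniform absmax quantization (illustration only, NOT the NF4 scheme).
def quantize_absmax(values, bits=4):
    levels = 2 ** (bits - 1) - 1                 # 7 for signed 4-bit codes
    scale = max(abs(v) for v in values) or 1.0   # avoid dividing by zero
    codes = [round(v / scale * levels) for v in values]
    return codes, scale


def dequantize_absmax(codes, scale, bits=4):
    levels = 2 ** (bits - 1) - 1
    return [c / levels * scale for c in codes]


weights = [0.12, -0.53, 0.98, -0.07, 0.31]       # made-up example values
codes, scale = quantize_absmax(weights)
restored = dequantize_absmax(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print("max reconstruction error:", max_error)
```

Each weight is stored as a small integer code plus a shared scale, and the reconstruction error is bounded by half a quantization step.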

In [None]:
# base_model ("meta-llama/Llama-3.2-1B") was defined earlier and is reused here

# 4-bit quantization configuration - Q in QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, # Use 4-bit precision model loading
    bnb_4bit_quant_type="nf4", # Quantization type
    bnb_4bit_compute_dtype=torch.bfloat16, # Compute dtype
    bnb_4bit_use_double_quant=True, # Apply nested quantization
)

# Load the model to train on the GPU
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    device_map="auto",
    # Leave this out for regular SFT
    quantization_config=bnb_config,
)



### Test the Model with Zero Shot Inferencing

Let's test the base model (which is not instruction-tuned) with zero-shot inference, i.e. we ask it a question without giving any examples. You can see that the model struggles to answer the user's question and mostly repeats what the user has entered.

In [None]:
messages = [{"role": "user", "content": "I have stomach pain. What should I do?"}]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
model_input = tokenizer(prompt, return_tensors="pt").to("cuda:0")
model.eval()
with torch.no_grad():   # no gradient update
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=200)[0], skip_special_tokens=True))

### LoRA Configuration

We will be using LoRA to train our model. LoRA is supported in Hugging Face's PEFT library.
Here is an explanation of the LoRA parameters used:
- `r` - This is the rank of the compressed matrices. Increasing this value increases the size of the compressed matrices, leading to less compression and therefore more representational power. Values typically range between 4 and 64.
- `lora_alpha` - Controls the amount of change that is added to the original weights. In essence, it balances the knowledge of the original model with that of the new task. A rule of thumb is to choose a value twice the size of r. Note that the configuration below deviates from this and uses `lora_alpha=32` with `r=64`.
- `target_modules` - Controls which layers to target. The LoRA procedure can choose to ignore specific layers, like specific projection layers. This can speed up training but reduce performance and vice versa.
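
To get a feel for how `r` and `lora_alpha` interact, the sketch below counts the parameters LoRA adds to a single weight matrix and computes the standard LoRA scaling factor `lora_alpha / r`. The 2048 x 2048 matrix size is an assumption chosen for illustration, and `lora_stats` is a hypothetical helper.

```python
# Parameter count and scaling for LoRA on one weight matrix (illustration only).
def lora_stats(d_in, d_out, r, lora_alpha):
    full_params = d_in * d_out
    # LoRA adds two small matrices: A with shape (r, d_in) and B with shape (d_out, r).
    lora_params = r * (d_in + d_out)
    return {
        "full_params": full_params,
        "lora_params": lora_params,
        "fraction": lora_params / full_params,
        "scaling": lora_alpha / r,  # factor applied to the low-rank update
    }


# Assumed 2048 x 2048 projection, with the r and lora_alpha used in this notebook.
stats = lora_stats(d_in=2048, d_out=2048, r=64, lora_alpha=32)
print(stats)
```

Under these assumptions the adapter trains about 6% as many parameters as the full matrix, and the update is scaled by 0.5.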

In [None]:
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

# Prepare LoRA Configuration
peft_config = LoraConfig(
    lora_alpha=32,  # LoRA Scaling
    lora_dropout=0.1,  # Dropout for LoRA Layers
    r=64,  # Rank
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=  # Layers to target
     ['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj']
)

# prepare model for training
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

Let's compare the number of trainable parameters of the PEFT model vs the base model.

In [None]:
model.print_trainable_parameters()

### Training Configuration

Next we need to set our training configuration. Since we are going to use SFTTrainer, we can specify the training arguments in SFTConfig.

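Note that we set `bf16` to True for mixed-precision training, which requires an Ampere or newer GPU architecture and generally gives more stable training than `fp16`. If your GPU does not support bfloat16 (for example, an older GPU such as the T4), set `fp16` to True instead.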
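
One way to choose between the two flags is by the GPU's compute capability: Ampere corresponds to major version 8, while a T4 (Turing) reports 7.5. The helper below is a hypothetical sketch of that decision; it takes the major version as an argument so that it runs without a GPU.

```python
# Pick mixed-precision flags from the CUDA compute capability (illustration only).
def precision_flags(compute_capability_major):
    use_bf16 = compute_capability_major >= 8   # Ampere and newer
    return {"bf16": use_bf16, "fp16": not use_bf16}


# On a machine with a GPU, the major version could be read with
# torch.cuda.get_device_capability()[0]; fixed values are used here instead.
print(precision_flags(7))  # e.g. a T4
print(precision_flags(8))  # e.g. an A100
```

The resulting dictionary could be unpacked into the training configuration, e.g. `SFTConfig(..., **precision_flags(major))`.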

Modern LLMs have large context windows, often 100K tokens or more, while many of the text samples we encounter are much shorter than that. To use the context window more efficiently, instead of having one text per sample in the batch and padding to either the longest text or the maximum context length of the model, we concatenate many texts with an EOS token in between and cut chunks of the context size to fill the batch without any padding.

<img src="https://github.com/nyp-sit/iti107-2024S2/blob/main/assets/packing.png?raw=1" width="700"/>

TRL allows us to do this packing very easily, by just specifying `packing=True`.  Internally, a [`ConstantLengthDataset`](https://huggingface.co/docs/trl/en/sft_trainer#trl.trainer.ConstantLengthDataset) is being created so we can iterate over the dataset on fixed-length sequences.
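
The idea behind packing can be sketched in a few lines of plain Python. This is not TRL's implementation: `pack_sequences` is a hypothetical helper, the token IDs are made up, and dropping the incomplete final chunk is an assumption made here for simplicity.

```python
# Minimal sketch of sequence packing (illustration only, not TRL's code).
def pack_sequences(tokenized_samples, eos_token_id, seq_length):
    # 1. Concatenate all samples into one stream, separated by EOS tokens.
    stream = []
    for tokens in tokenized_samples:
        stream.extend(tokens)
        stream.append(eos_token_id)
    # 2. Cut the stream into fixed-length chunks; the leftover tail is dropped.
    n_chunks = len(stream) // seq_length
    return [stream[i * seq_length:(i + 1) * seq_length] for i in range(n_chunks)]


samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]   # made-up token IDs
packed = pack_sequences(samples, eos_token_id=0, seq_length=4)
print(packed)
```

Every chunk is completely filled with real tokens, so no compute is spent on padding, at the cost of some samples being split across two chunks.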

In [None]:
import os
import wandb

# Reduce VRAM usage by reducing fragmentation
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# Set up WANDB project settings
# os.environ["WANDB_PROJECT"] = "llama3.2-summarize"
# os.environ["WANDB_API_KEY"] = "Your secret wandb key"

## convenience method to generate unique run name for WanDB
def get_run_id():
    import time
    run_id = time.strftime("run_%Y%m%d_%H%M%S")
    return run_id

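# You can navigate to https://wandb.ai/authorize to get your key.
# Do not hard-code the key in the notebook: read it from the WANDB_API_KEY
# environment variable (wandb.login prompts for it if the key is None).
wb_token = os.environ.get("WANDB_API_KEY")
wandb.login(key=wb_token)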
run = wandb.init(
    project='Llama3.2_Finetune_doctor_chat',
    job_type="training",
    anonymous="allow"
)

In [None]:
from trl import SFTConfig

model.config.use_cache = False
model.config.pretraining_tp = 1

# Configure the tokenizer
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# where to write the checkpoint to
output_dir = "./results"

sft_config = SFTConfig(
    output_dir=output_dir,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
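    num_train_epochs=2,  # ignored while max_steps is set (max_steps takes precedence)
    logging_steps=5,
    report_to="wandb",
    max_steps=30,  # short demo run; remove to train for num_train_epochs
    bf16=True,  # requires an Ampere or newer GPU
    # fp16=True,  # use this instead of bf16 on older GPUs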
    gradient_checkpointing=True,
    resume_from_checkpoint=True,
    packing=True,
    eval_packing=True,
    dataset_text_field="text",
    max_seq_length=1024,
    save_strategy = "steps",
    save_steps=10,
    eval_strategy='steps',
    eval_steps=10
)
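
To put these settings in perspective, the following back-of-the-envelope calculation estimates how much data the run actually sees. It assumes a single GPU and that packing fills every sequence to `max_seq_length`; because `max_steps` is set, it determines the length of the run rather than `num_train_epochs`.

```python
# Rough training budget implied by the configuration (illustration only).
per_device_train_batch_size = 8
gradient_accumulation_steps = 4
max_seq_length = 1024
max_steps = 30
num_devices = 1  # assumption: a single GPU

# Sequences consumed per optimizer step (the effective batch size).
sequences_per_step = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# With packing, each sequence is assumed to hold max_seq_length tokens.
tokens_per_step = sequences_per_step * max_seq_length
total_tokens = tokens_per_step * max_steps

print("effective batch size:", sequences_per_step)
print("tokens per optimizer step:", tokens_per_step)
print("tokens seen in the whole run:", total_tokens)
```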

In [None]:
from trl import SFTTrainer

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset_train,
    eval_dataset=dataset_val,
    # dataset_text_field="text",
    tokenizer=tokenizer,
    # Leave this out for regular SFT
    peft_config=peft_config,
    args=sft_config
)

# Switch the model to training mode
trainer.model.train()
# Train model
trainer.train()

In [None]:
# Save QLoRA weights
trainer.model.save_pretrained("Llama-3.2-1B-chat-doctor-QLoRA")

In [None]:
model = AutoModelForCausalLM.from_pretrained("Llama-3.2-1B-chat-doctor-QLoRA").to('cuda:0')

### Save the model in HuggingFace hub

Run the following cell to push your model to the Hub. Replace `<HuggingFaceID>` with your Hugging Face user ID, e.g. `khengkok`.

In [None]:
from huggingface_hub import notebook_login

# Log in using your HF access token
notebook_login()

# push the model to hub, change <HuggingFaceID> to your own userid
model.push_to_hub("<HuggingFaceID>/Llama-3.2-1B-chat-doctor-QLoRA")

### Merge Weights

After we have trained our QLoRA weights, we still need to combine them with the original weights to use them. We reload the model in 16 bits, instead of the quantized 4 bits, to merge the weights.
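
Conceptually, merging folds the low-rank update into each targeted weight matrix as `W + (lora_alpha / r) * B @ A`, which is why the base weights are needed in a regular (non-quantized) floating-point format. The sketch below shows this on a tiny made-up example using plain Python lists; `matmul` and `merge_lora` are hypothetical helpers, not PEFT functions.

```python
# Tiny worked example of merging a LoRA update into a weight matrix (illustration only).
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    scaling = lora_alpha / r
    delta = matmul(lora_b, lora_a)  # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


weight = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight, d_out = d_in = 2
lora_a = [[1.0, 2.0]]              # shape (r, d_in) with r = 1
lora_b = [[0.5], [-0.5]]           # shape (d_out, r)

merged = merge_lora(weight, lora_a, lora_b, lora_alpha=2, r=1)
print(merged)
```

After merging, the adapter matrices are no longer needed at inference time because their effect lives in the merged weights.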

In [None]:
from peft import AutoPeftModelForCausalLM


model = AutoPeftModelForCausalLM.from_pretrained(
    "Llama-3.2-1B-chat-doctor-QLoRA",
    torch_dtype=torch.float16,  # load the base weights in 16-bit for merging
    low_cpu_mem_usage=True,
    device_map="auto",
)

# Uncomment the following to load the pretrained model if you did not manage to train your own
# model = AutoPeftModelForCausalLM.from_pretrained(
#     "khengkok/Llama-3.2-1B-chat-doctor-QLoRA",
#     low_cpu_mem_usage=True,
#     device_map="auto",
# )

# Merge LoRA and base model
merged_model = model.merge_and_unload()

After merging the adapter with the base model, we can use it with the prompt template that we defined earlier:

In [None]:
from transformers import TextStreamer
from transformers import pipeline

messages = [{"role": "user", "content": "I have stomach pain. What should I do?"}]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
pipe = pipeline(
    "text-generation",
    model=merged_model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto"
)

outputs = pipe(prompt, max_new_tokens=120, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])

In [None]:
# Streaming support
streamer = TextStreamer(tokenizer)
merged_model.eval()
model_input = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    merged_model.generate(**model_input, streamer=streamer, max_new_tokens=250, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)

### Saving and pushing the merged model

We'll now save the tokenizer and the merged model using the `save_pretrained()` function, and then push both to the Hub.

In [None]:
merged_model.save_pretrained("Llama-3.2-1B-chat-doctor")
tokenizer.save_pretrained("Llama-3.2-1B-chat-doctor")

In [None]:
merged_model.push_to_hub("Llama-3.2-1B-chat-doctor", use_temp_dir=False)
tokenizer.push_to_hub("Llama-3.2-1B-chat-doctor", use_temp_dir=False)