# 🧠 Workshop: Adding Knowledge to LLMs  
### Dataset: lavita/ChatDoctor-HealthCareMagic-100k  
HuggingFace: https://huggingface.co/datasets/lavita/ChatDoctor-HealthCareMagic-100k  

### Base Model: google/gemma-2-2b-it  
HuggingFace: https://huggingface.co/google/gemma-2-2b-it  

---

## 2️⃣ LoRA Fine-Tuning (Parameter-Efficient FT)

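In **LoRA (Low-Rank Adaptation)** fine-tuning, we freeze the base model weights and train small low-rank adapter matrices inserted into selected linear layers of the transformer (here, the attention `q_proj` and `v_proj` projections).

Because only the adapters receive gradients and optimizer state, memory usage during training is much lower than in full fine-tuning, while the model can still adapt to the target domain. In this notebook the adapters account for about 1.6M of the model's 2.6B parameters (0.06%).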
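
As a rough check on how small the adapters are, the sketch below counts the LoRA parameters for rank-`r` adapters on `q_proj` and `v_proj`. Each adapted linear layer mapping `d_in` to `d_out` features gains two matrices, A (`r x d_in`) and B (`d_out x r`), so it adds `r * (d_in + d_out)` trainable parameters. The layer dimensions are assumptions taken from the published gemma-2-2b configuration (hidden size 2304, 8 query heads and 4 key/value heads of dimension 256, 26 layers); they are not stated in this notebook.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Assumed gemma-2-2b dimensions (from the model's config, not from this notebook).
HIDDEN_SIZE = 2304
HEAD_DIM = 256
NUM_QUERY_HEADS = 8
NUM_KV_HEADS = 4
NUM_LAYERS = 26
RANK = 8  # matches r=8 in the LoraConfig used in Step 4

q_proj = lora_param_count(HIDDEN_SIZE, NUM_QUERY_HEADS * HEAD_DIM, RANK)
v_proj = lora_param_count(HIDDEN_SIZE, NUM_KV_HEADS * HEAD_DIM, RANK)
trainable = NUM_LAYERS * (q_proj + v_proj)

print(f"per layer: q_proj={q_proj:,} v_proj={v_proj:,}")
print(f"total trainable LoRA params: {trainable:,}")
```

Under these assumptions the total is 1,597,440, which agrees with the `trainable params` line printed by `print_trainable_parameters()` in Step 4.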

---

In [1]:
# ============================================================
# Workshop: Adding Knowledge to LLMs
# ============================================================
# Dataset: lavita/ChatDoctor-HealthCareMagic-100k
#         HuggingFace Dataset Link: https://huggingface.co/datasets/lavita/ChatDoctor-HealthCareMagic-100k

# Model: google/gemma-2-2b-it
#         HuggingFace Model Link: https://huggingface.co/google/gemma-2-2b-it

# ============================================================
# Goal:
# - Fine-tune a model on Medical ChatDoctor Data using:
# 1) Full Fine-Tuning
# 2) LoRA
# 3) QLoRA (4-bit + LoRA)
# 4) Build a RAG baseline using the SAME data and Evaluate all approaches using the SAME questions
# 5) Create a Medical Agent
# ============================================================


In [2]:
# =====================================================
# LoRA Fine-Tuning
# =====================================================

---

### 📦 Step 0: Environment Setup


In [3]:
# =====================================================
# 0. Setup
# =====================================================
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer,
    DataCollatorForLanguageModeling, BitsAndBytesConfig
)
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig
from sklearn.model_selection import train_test_split
from utils.utils import get_gpu_memory, generate_chat_response
import bitsandbytes as bnb
import torch.nn as nn




In [4]:
# Define Environment Variables
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
os.environ["DATA_PATH"] = "/leonardo_work/tra26_minwinsc/workshop-AddingKnowledgeToLLMs/datasets/ChatDoctor-dataset/data/"
os.environ["MODEL_PATH"] = "/leonardo_work/tra26_minwinsc/workshop-AddingKnowledgeToLLMs/models/gemma-2-2b-it"
os.environ["LORA_FT_MODEL_PATH"] = "/leonardo_work/tra26_minwinsc/workshop-AddingKnowledgeToLLMs/models/FT-models/test/LoRA_model_chatdoctor_gemma-2-2b-it"


In [5]:
#!nvidia-smi

In [6]:
gpu_mem = get_gpu_memory()
print(gpu_mem)

{'total_gb': 63.42, 'used_gb': 0.47, 'free_gb': 62.95, 'source': 'torch'}


---

### 📥 Step 1: Load Dataset


In [7]:
# =====================================================
# 1. Load ChatDoctor Dataset
# =====================================================
# Load the dataset from the local directory
chatdoctor = load_dataset(os.getenv("DATA_PATH", None))


---

### 📂 Step 2: Define Model Path and Load Tokenizer



In [8]:
# =====================================================
# 2. Tokenizer
# =====================================================
# Define the model we want to fine tune.
model_path = os.getenv("MODEL_PATH", None)
model_name = str(model_path.split("/")[-1])

# Get Model Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token


In [9]:
print(f"Model used for LoRA Fine-Tuning: {model_name}")

Model used for LoRA Fine-Tuning: gemma-2-2b-it


---

### 🧹 Step 3: Apply Chat Template to the Data + Tokenization
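
To make the template step concrete, the sketch below hand-formats one conversation in the turn format that the Gemma chat template is expected to produce. It is an approximation written for illustration: `render_gemma_chat` is a hypothetical helper, the example messages are made up, and the authoritative output is whatever `tokenizer.apply_chat_template` returns for the loaded tokenizer.

```python
def render_gemma_chat(messages, add_generation_prompt=False, bos_token="<bos>"):
    """Approximate the Gemma turn format; 'assistant' turns are labelled 'model'."""
    parts = [bos_token]
    for message in messages:
        role = "model" if message["role"] == "assistant" else message["role"]
        parts.append(f"<start_of_turn>{role}\n{message['content'].strip()}<end_of_turn>\n")
    if add_generation_prompt:
        # At inference time the prompt ends with an open model turn.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


# Illustrative messages in the same INSTRUCTION / PATIENT MESSAGE layout used below.
example = [
    {"role": "user", "content": "INSTRUCTION:\nAnswer as a doctor.\n\nPATIENT MESSAGE:\nI feel dizzy."},
    {"role": "assistant", "content": "Dizziness has many possible causes."},
]
text = render_gemma_chat(example)
print(text)
```

For training, the full conversation including the model turn is rendered, as here. For inference, only the user turn is rendered and `add_generation_prompt=True` leaves the model turn open for the model to complete.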


In [10]:
# =====================================================
# 3. Apply Chat Template & Data Collator with Dynamic Padding
# =====================================================
def format_chat_template(row):
    row_json = [
        {"role": "user", "content": f"INSTRUCTION:\n{row['instruction']}\n\nPATIENT MESSAGE:\n{row['input']}"},
        {"role": "assistant", "content": row["output"]}
    ]
    row["text"] = tokenizer.apply_chat_template(row_json, tokenize=False)
    return row

# Apply chat template to all data
chatdoctor = chatdoctor.map(format_chat_template, num_proc=4)

# Split into training and validation sets
split_dataset = chatdoctor['train'].train_test_split(
    test_size=0.20,
    seed=42,
    shuffle=True,
)
train_dataset = split_dataset['train']
val_dataset = split_dataset['test']


# Define the Data Collator for creating batches of the data
def data_collator(batch):
    tokenized = tokenizer(
        [example["text"] for example in batch],
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=2048,
    )
    # For causal LM, labels are just input_ids
    tokenized["labels"] = tokenized["input_ids"].clone()
    return tokenized

# Subsample for workshop (select only X rows)
train_data = train_dataset.select(range(3000)) #.shuffle(seed=42).select(range(2000)) # Shuffle before choosing X rows
val_data = val_dataset.select(range(300))
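
One refinement worth considering: because `pad_token` is set to `eos_token` and the labels are a plain copy of `input_ids`, padded positions also contribute to the loss. A common remedy is to set the labels at padded positions to -100, the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain Python lists; `mask_padding_labels` is a hypothetical helper, not part of the notebook, and the token ids are toy values.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_padding_labels(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Copy input_ids into labels, replacing padded positions (mask == 0) with ignore_index."""
    return [
        [tok if keep == 1 else ignore_index for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


# Toy batch: the second sequence carries two padding tokens (id 1).
input_ids = [[2, 10, 11, 12], [2, 20, 1, 1]]
attention_mask = [[1, 1, 1, 1], [1, 1, 0, 0]]
labels = mask_padding_labels(input_ids, attention_mask)
print(labels)

# With tensors, the equivalent line inside data_collator would be:
#   tokenized["labels"][tokenized["attention_mask"] == 0] = -100
```

Applying this would change the reported loss values, since padded tokens would no longer be averaged in.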


---

### 🤖 Step 4: Load Gemma Model and Run the LoRA Fine Tuning


In [11]:
#help(TrainingArguments)

In [12]:
# =====================================================
# 4. LoRA Fine-Tuning
# =====================================================
# Load the base model that the LoRA adapters will wrap
lora_model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map='auto')

# Linear modules to adapt: the attention query and value projections
modules = ["q_proj", "v_proj"]
print("Modules:", modules)

# Define LoRA Config
lora_config = LoraConfig(
    r=8,                        # Try distinct Rank values
    lora_alpha=32,              # Try distinct lora_alpha values
    target_modules=modules,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Get only the LoRA params to modify during training
lora_model = get_peft_model(lora_model, lora_config)
lora_model.print_trainable_parameters()

# Define Training Arguments
lora_args = TrainingArguments(
    # Throughput critical
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,

    # Training length
    num_train_epochs=3,

    # Optimizer
    learning_rate=5e-5, #2e-4,
    fp16=False,
    bf16=True,

    # Logging
    logging_strategy="epoch",
    warmup_steps=30,

    output_dir=os.environ["LORA_FT_MODEL_PATH"],
    save_total_limit=1,
    save_strategy="epoch",

    # Evaluation
    eval_strategy="epoch",
    #eval_steps=50,
    
    # System
    report_to="none",
    remove_unused_columns=False,
    dataloader_pin_memory=True,
    dataloader_num_workers=2,
)

# Trainer class
lora_trainer = SFTTrainer(
    model=lora_model,
    args=lora_args,
    train_dataset=train_data,
    eval_dataset=val_data,
    data_collator=data_collator,
)

# Before training
torch.cuda.reset_peak_memory_stats()
print("Allocated before training:", torch.cuda.memory_allocated()/1e9, "GB")
print("Reserved before training:", torch.cuda.memory_reserved()/1e9, "GB")

# Train the LoRA adapters on the medical Q&A data
train_output = lora_trainer.train()

# After training, report peak memory usage
print("Peak Allocated during training:", torch.cuda.max_memory_allocated()/1e9, "GB")
print("Peak Reserved during training:", torch.cuda.max_memory_reserved()/1e9, "GB")


`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00,  1.16it/s]


Modules: ['q_proj', 'v_proj']
trainable params: 1,597,440 || all params: 2,615,939,328 || trainable%: 0.0611


The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 1}.


Allocated before training: 1.335625728 GB
Reserved before training: 1.356857344 GB


| Epoch | Training Loss | Validation Loss | Entropy | Num Tokens | Mean Token Accuracy |
|---|---|---|---|---|---|
| 1 | 4.2729 | 4.12457 | 2.468862 | 791315.0 | 0.346524 |
| 2 | 3.802 | 4.061824 | 2.406161 | 1582630.0 | 0.353356 |
| 3 | 3.7496 | 4.050017 | 2.392014 | 2373945.0 | 0.355113 |
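
The reported losses are average per-token cross-entropies in nats (assuming the trainer's default loss), so exponentiating them gives perplexity, which can be easier to compare across epochs. The values below are copied from the training log above.

```python
import math

# (epoch, training loss, validation loss) copied from the training log.
log = [
    (1, 4.2729, 4.12457),
    (2, 3.8020, 4.061824),
    (3, 3.7496, 4.050017),
]

for epoch, train_loss, val_loss in log:
    print(f"epoch {epoch}: train ppl={math.exp(train_loss):.1f}  val ppl={math.exp(val_loss):.1f}")

val_ppl = [math.exp(val_loss) for _, _, val_loss in log]
```

Validation perplexity drops only slightly between epochs 2 and 3, which mirrors the flattening validation loss.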


Peak Allocated during training: 18.706855424 GB
Peak Reserved during training: 65.020100608 GB


In [13]:
train_output

TrainOutput(global_step=282, training_loss=3.9415091656624, metrics={'train_runtime': 510.4373, 'train_samples_per_second': 17.632, 'train_steps_per_second': 0.552, 'total_flos': 3.765156048298598e+16, 'train_loss': 3.9415091656624, 'epoch': 3.0})
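
The `global_step=282` above follows from the batch settings. A short sanity check, assuming a single GPU (the notebook does not state the device count):

```python
import math

train_samples = 3000          # train_dataset.select(range(3000))
per_device_batch_size = 4     # per_device_train_batch_size
grad_accum_steps = 8          # gradient_accumulation_steps
num_epochs = 3                # num_train_epochs
num_devices = 1               # assumption: single GPU

effective_batch = per_device_batch_size * grad_accum_steps * num_devices
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * num_epochs

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps per epoch: {steps_per_epoch}")
print(f"total optimizer steps: {total_steps}")
```

With `warmup_steps=30`, roughly the first tenth of those optimizer steps is spent warming up the learning rate.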

---
#### 💾 Step 4.1: Save LoRA Fine-Tuned Model


In [14]:
# =====================================================
# 4.1. Save the LoRA Fine-Tuned Model
# =====================================================
# Fine-tuned LoRA model (ChatDoctor)
lora_ft_model_chatdoctor = lora_trainer.model

# Save the LoRA adapter weights and the tokenizer (ChatDoctor)
save_path_lora_ft_model = os.getenv("LORA_FT_MODEL_PATH", None)
lora_ft_model_chatdoctor.save_pretrained(save_path_lora_ft_model, safe_serialization=True)
lora_trainer.processing_class.save_pretrained(save_path_lora_ft_model)


('/leonardo_work/tra26_minwinsc/workshop-AddingKnowledgeToLLMs/models/FT-models/test/LoRA_model_chatdoctor_gemma-2-2b-it/tokenizer_config.json',
 '/leonardo_work/tra26_minwinsc/workshop-AddingKnowledgeToLLMs/models/FT-models/test/LoRA_model_chatdoctor_gemma-2-2b-it/special_tokens_map.json',
 '/leonardo_work/tra26_minwinsc/workshop-AddingKnowledgeToLLMs/models/FT-models/test/LoRA_model_chatdoctor_gemma-2-2b-it/chat_template.jinja',
 '/leonardo_work/tra26_minwinsc/workshop-AddingKnowledgeToLLMs/models/FT-models/test/LoRA_model_chatdoctor_gemma-2-2b-it/tokenizer.model',
 '/leonardo_work/tra26_minwinsc/workshop-AddingKnowledgeToLLMs/models/FT-models/test/LoRA_model_chatdoctor_gemma-2-2b-it/added_tokens.json',
 '/leonardo_work/tra26_minwinsc/workshop-AddingKnowledgeToLLMs/models/FT-models/test/LoRA_model_chatdoctor_gemma-2-2b-it/tokenizer.json')

In [15]:
# FT LoRA Model - ChatDoctor

# Save LoRA models - ChatDoctor
save_path_lora_ft_model = os.getenv("LORA_FT_MODEL_PATH", None)
lora_model.save_pretrained(save_path_lora_ft_model, safe_serialization=True)
tokenizer.save_pretrained(save_path_lora_ft_model)


('/leonardo_work/tra26_minwinsc/workshop-AddingKnowledgeToLLMs/models/FT-models/test/LoRA_model_chatdoctor_gemma-2-2b-it/tokenizer_config.json',
 '/leonardo_work/tra26_minwinsc/workshop-AddingKnowledgeToLLMs/models/FT-models/test/LoRA_model_chatdoctor_gemma-2-2b-it/special_tokens_map.json',
 '/leonardo_work/tra26_minwinsc/workshop-AddingKnowledgeToLLMs/models/FT-models/test/LoRA_model_chatdoctor_gemma-2-2b-it/chat_template.jinja',
 '/leonardo_work/tra26_minwinsc/workshop-AddingKnowledgeToLLMs/models/FT-models/test/LoRA_model_chatdoctor_gemma-2-2b-it/tokenizer.model',
 '/leonardo_work/tra26_minwinsc/workshop-AddingKnowledgeToLLMs/models/FT-models/test/LoRA_model_chatdoctor_gemma-2-2b-it/added_tokens.json',
 '/leonardo_work/tra26_minwinsc/workshop-AddingKnowledgeToLLMs/models/FT-models/test/LoRA_model_chatdoctor_gemma-2-2b-it/tokenizer.json')

---
#### 🔄 Step 5: Restart Kernel


In [14]:
# ========================================
# 5. Restart Kernel 
# ========================================
# Restart the kernel to clear cached objects and training artifacts
# and to free GPU memory (VRAM). This ensures a clean state for inference
# and prevents Out-Of-Memory (OOM) errors.
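
If restarting is inconvenient, an in-process cleanup sometimes frees enough memory. This is a sketch of an alternative, not what the notebook does: it only releases memory whose Python references are already gone, so objects such as `lora_trainer` and `lora_model` have to be deleted by the caller first. `release_gpu_cache` is a hypothetical helper name.

```python
import gc
import importlib.util


def release_gpu_cache() -> int:
    """Run garbage collection, then release cached CUDA blocks if torch is usable.

    Returns the number of unreachable objects the garbage collector found.
    """
    collected = gc.collect()
    if importlib.util.find_spec("torch") is not None:
        import torch

        if torch.cuda.is_available():
            # Hands unused cached blocks back to the driver; live tensors are untouched.
            torch.cuda.empty_cache()
    return collected


# Typical use in the notebook (names from the training cell):
#   del lora_trainer, lora_model
#   release_gpu_cache()
freed = release_gpu_cache()
print(f"garbage collector found {freed} unreachable objects")
```

A kernel restart remains the more reliable option, because any lingering reference (for example in the notebook's output history) keeps the tensors alive.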


---

### 🔮 Step 6: Inference with Base Model and LoRA FT Model


In [7]:
# =====================================================
# 6. Inference with Base Model and FT LoRA Model
# =====================================================
# Re-import only what inference needs (the kernel was restarted in Step 5)
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
from utils.utils import generate_chat_response
import os

#os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
#os.environ["TORCH_USE_CUDA_DSA"] = "1"
#os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
os.environ["DATA_PATH"] = "/leonardo_work/tra26_minwinsc/workshop-AddingKnowledgeToLLMs/datasets/ChatDoctor-dataset/data/"
os.environ["MODEL_PATH"] = "/leonardo_work/tra26_minwinsc/workshop-AddingKnowledgeToLLMs/models/gemma-2-2b-it"
os.environ["LORA_FT_MODEL_PATH"] = "/leonardo_work/tra26_minwinsc/workshop-AddingKnowledgeToLLMs/FT-models/LoRA_model_chatdoctor_gemma-2-2b-it"

# Define path of the Base Model
base_model_path = os.getenv("MODEL_PATH", None)
base_model_name = str(base_model_path.split("/")[-1])

# Define the path where LoRA FT Model is saved.
save_path_lora_ft_model = os.getenv("LORA_FT_MODEL_PATH", None)

# Read Base Model and Base Tokenizer
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    torch_dtype=torch.bfloat16,    # Reduce GPU memory
    device_map="auto"             # Automatically put layers on GPU
)
base_tokenizer = AutoTokenizer.from_pretrained(base_model_path)

# Read LoRA FT Model and LoRA FT Tokenizer
#lora_model = PeftModel.from_pretrained(base_model, save_path_lora_ft_model)

#lora_tokenizer = AutoTokenizer.from_pretrained(save_path_lora_ft_model)


Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00,  1.17it/s]


In [2]:
#base_model

In [3]:
#lora_model

In [4]:
# How to do inference?
#help(generate_chat_response)

---

### ✅ Step 6.1: Inference with Base Model


In [5]:
# =====================================================
# 6.1. Inference with Base Model
# =====================================================
instruction = "If you are a doctor, please answer the medical questions based on the patient's description."

user_message = "I woke up this morning feeling the whole room is spinning when i was sitting down. I went to the bathroom walking unsteadily, as i tried to focus i feel nauseous. I try to vomit but it wont come out.. After taking panadol and sleep for few hours, i still feel the same.. By the way, if i lay down or sit down, my head do not spin, only when i want to move around then i feel the whole world is spinning.. And it is normal stomach discomfort at the same time? Earlier after i relieved myself, the spinning lessen so i am not sure whether its connected or coincidences.. Thank you doc!"
user_message2 = "Hello, My husband is taking Oxycodone due to a broken leg/surgery. He has been taking this pain medication for one month. We are trying to conceive our second baby. Will this medication afect the fetus? Or the health of the baby? Or can it bring birth defects? Thank you."

messages = [
    {"role": "user", "content": f"INSTRUCTION:\n{instruction}\n\nPATIENT MESSAGE:\n{user_message}"}
]

response = generate_chat_response(
    messages=messages,
    model=base_model,
    tokenizer=base_tokenizer,
    device="cuda",
    max_new_tokens=512,
    temperature=0.2,
    top_p=0.85,
    top_k=50,
    no_repeat_ngram_size=3,
)

print(response)


I understand you're experiencing dizziness and nausea, and it's concerning. I'm an AI and cannot provide medical advice, but I can offer some information and suggest you seek immediate medical attention. 

**What you've described could be a sign of a serious condition, and you should not rely on self-diagnosis.**

Here's why you need to see a doctor:

* **Dizziness and Nausea:** These symptoms can be caused by various things, from dehydration to more serious issues like inner ear problems, migraines, or even a stroke. 
* **Spinning Sensation:**  The feeling of spinning is called vertigo, and can be a symptom of inner ear issues, migraines or even neurological problems.
*  **Inability to Vomit:**  This could indicate a blockage in your digestive system or a more serious issue.
 
**What to do:**

1. **Seek immediate medical help:**  Call your doctor or go to the nearest emergency room. 


**Possible causes of your symptoms:**

*  Inner ear problems (Benign Paroxysmal Positional Vertigo, 

---

### 🧪 Step 6.2: Inference with LoRA FT Model


In [9]:
# =====================================================
# 6.2. Inference with LoRA FT Model
# =====================================================
# Read LoRA FT Model and LoRA FT Tokenizer
lora_model = PeftModel.from_pretrained(base_model, save_path_lora_ft_model)

lora_tokenizer = AutoTokenizer.from_pretrained(save_path_lora_ft_model)

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
os.environ["TORCH_USE_CUDA_DSA"] = "1"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

response = generate_chat_response(
    messages=messages,
    model=lora_model,
    tokenizer=lora_tokenizer,
    device="cuda",
    max_new_tokens=512,
    temperature=0.5,
    top_p=0.85,
    top_k=50,
    no_repeat_ngram_size=3,
)

print(response)


Hi, Thanks for your query on Chat Doctor. I have gone through your query and understand your concern. You have mentioned that you have been feeling nauseous and the room is feeling spinning. This could be due to a few reasons like inner ear problem, migraine, or even anxiety. I would suggest you to consult an ENT doctor to rule out any inner ear problems. If you are having migraine, then you can take a painkiller like paracetamol. If it is anxiety, then I would recommend you to take a stress management class. Hope this helps. Thanks.


In [10]:
# =====================================================
# 6.2. Inference with LoRA FT Model
# =====================================================
# Read LoRA FT Model and LoRA FT Tokenizer
lora_model = PeftModel.from_pretrained(base_model, save_path_lora_ft_model)

lora_tokenizer = AutoTokenizer.from_pretrained(save_path_lora_ft_model)

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
os.environ["TORCH_USE_CUDA_DSA"] = "1"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

response = generate_chat_response(
    messages=messages,
    model=lora_model,
    tokenizer=lora_tokenizer,
    device="cuda",
    max_new_tokens=512,
    temperature=0.5,
    top_p=0.85,
    top_k=50,
    no_repeat_ngram_size=3,
)

print(response)


Hello, Welcome to Chat Doctor. I have gone through your query and understand your concern. You are having nausea and dizziness. This could be due to various reasons. It could be anxiety, stress, dehydration, low blood sugar, low iron, or some other medical condition. It is important to rule out any serious medical condition first. I would suggest you to visit your doctor and get your blood sugar level checked. Also, get your iron level checked and get a complete blood count done. If you are having any other symptoms, please mention them. Hope I have answered your question. Let me know if I can assist you further. Take care.
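
The two LoRA responses above differ even though the prompt and settings are identical. That is expected if `generate_chat_response` samples from the model (an assumption, since the helper's source is not shown here): with `temperature`, `top_k` and `top_p` set, each token is drawn at random from a filtered distribution. The sketch below shows that filtering on a toy vocabulary; it illustrates the general technique, not the exact implementation in `transformers`, and the logits are made-up values.

```python
import math
import random


def filtered_distribution(logits, temperature=0.5, top_k=50, top_p=0.85):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature: divide logits before the softmax; lower values sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # subtract the max for stability
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-k: keep only the k most probable tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalise over the surviving tokens.
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


def sample_token(logits, rng, **kwargs):
    """Draw one token index from the filtered distribution."""
    dist = filtered_distribution(logits, **kwargs)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


toy_logits = [2.0, 1.5, 0.5, -1.0, -3.0]
dist = filtered_distribution(toy_logits, temperature=0.5, top_k=50, top_p=0.85)
print(dist)
print([sample_token(toy_logits, random.Random(seed)) for seed in range(5)])
```

On this toy input, `temperature=0.5` leaves two candidate tokens after filtering, while `temperature=0.2` (the value used for the base model) leaves only the most likely one, so lower temperatures give more repeatable output.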
