# üß† Workshop: Adding Knowledge to LLMs  
### Dataset: lavita/ChatDoctor-HealthCareMagic-100k  
HuggingFace: https://huggingface.co/datasets/lavita/ChatDoctor-HealthCareMagic-100k  

### Base Model: google/gemma-2-2b-it  
HuggingFace: https://huggingface.co/google/gemma-2-2b-it  

---

## 3Ô∏è‚É£ QLoRA Qantized LoRA (Parameter-Efficient FT)

In **QLoRA (Quantized Low-Rank Adaptation)**, we load the base model in **4-bit quantized format** and train low-rank adapter matrices on top of it, while keeping the original weights frozen.

This dramatically reduces GPU memory usage compared to standard LoRA, enabling fine-tuning of large models on limited hardware while preserving strong domain adaptation performance.
    
---

In [1]:
# ============================================================
# Workshop: Adding Knowledge to LLMs
# ============================================================
# Dataset: lavita/ChatDoctor-HealthCareMagic-100k
#         HuggingFace Dataset Link: https://huggingface.co/datasets/lavita/ChatDoctor-HealthCareMagic-100k

# Model: google/gemma-2-2b-it
#         HuggingFace Model Link: https://huggingface.co/google/gemma-2-2b-it

# ============================================================
# Goal:
# - Fine-tune a model on Medical ChatDoctor Data using:
# 1) Full Fine-Tuning
# 2) LoRA
# 3) QLoRA (4-bit + LoRA)
# 4) Build a RAG baseline using the SAME data and Evaluate all approaches using the SAME questions
# 5) Create a Medical Agent
# ============================================================


In [2]:
# =====================================================
# QLoRA Fine-Tuning
# =====================================================

---

### üì¶ Step 0: Environment Setup


In [3]:
# =====================================================
# 0. Setup
# =====================================================
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer,
    DataCollatorForLanguageModeling, BitsAndBytesConfig
)
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer
from sklearn.model_selection import train_test_split
from utils.utils import get_gpu_memory, generate_chat_response
import bitsandbytes as bnb
import torch.nn as nn
from peft import prepare_model_for_kbit_training
import warnings
warnings.filterwarnings('ignore')


  from .autonotebook import tqdm as notebook_tqdm


In [4]:
# Define Environment Variables
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
os.environ["DATA_PATH"] = "/leonardo_work/tra26_minwinsc/workshop-AddingKnowledgeToLLMs/datasets/ChatDoctor-dataset/data/"
os.environ["MODEL_PATH"] = "/leonardo_work/tra26_minwinsc/workshop-AddingKnowledgeToLLMs/models/gemma-2-2b-it"
os.environ["QLORA_FT_MODEL_PATH"] = os.path.join(os.getenv("CINECA_SCRATCH"), "FT-models/scratch/QLoRA_model_chatdoctor_gemma-2-2b-it")


In [5]:
#!nvidia-smi

In [6]:
gpu_mem = get_gpu_memory()
print(gpu_mem)

{'total_gb': 63.42, 'used_gb': 0.47, 'free_gb': 62.95, 'source': 'torch'}


---

### üì• Step 1: Load Dataset


In [7]:
# =====================================================
# 1. Load ChatDoctor Dataset
# =====================================================
# Load the dataset from the local directory
chatdoctor = load_dataset(os.getenv("DATA_PATH", None))


In [8]:
chatdoctor

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output'],
        num_rows: 112165
    })
})

---

### üìÇ Step 2: Define Model Path and Load Tokenizer



In [9]:
# =====================================================
# 2. Tokenizer
# =====================================================
# Define the model we want to fine tune.
model_path = os.getenv("MODEL_PATH", None)
model_name = str(model_path.split("/")[-1])

# Get Model Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token


In [10]:
print(f"Model used for QLoRA Fine-Tuning: {model_name}")

Model used for QLoRA Fine-Tuning: gemma-2-2b-it


---

### üßπ Step 3: Apply Chat Template to the Data + Tokenization


In [11]:
# =====================================================
# 3. Apply Chat Template & Data Collator with Dynamic Padding
# =====================================================
def format_chat_template(row):
    row_json = [{"role": "user", "content": f"INSTRUCTION:\n{row['instruction']}\n\nPATIENT MESSAGE:\n{row['input']}"},
                {"role": "assistant", "content": row["output"]}]
    row["text"] = tokenizer.apply_chat_template(row_json, tokenize=False)
    return row

# Apply chat template to all data
chatdoctor = chatdoctor.map(format_chat_template, num_proc=1)

# Get train dataset
train_dataset = chatdoctor['train']

# Define the Data Collator for creating batches of the data
def data_collator(batch):
    tokenized = tokenizer(
        [example["text"] for example in batch],
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=2048,
    )
    # For causal LM, labels are just input_ids
    tokenized["labels"] = tokenized["input_ids"].clone()

    # Move everything to GPU 0.
    #tokenized = {k: v.to('cuda:0') for k, v in tokenized.items()}
    
    return tokenized

# Define the Data Collator for creating batches of the data
def data_collator(batch):
    tokenized = tokenizer(
        [example["text"] for example in batch],
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=2048,
    )
    # For causal LM, labels are just input_ids
    tokenized["labels"] = tokenized["input_ids"].clone()

    # Move everything to GPU 0.
    #tokenized = {k: v.to('cuda:0') for k, v in tokenized.items()}
    
    return tokenized


# Subsample for workshop
train_data = train_dataset.select(range(3000)) #.shuffle(seed=42).select(range(2000))
#val_data = val_dataset.select(range(300))


---

### ü§ñ Step 4: Load Gemma Model and Run the QLoRA Fine Tuning


In [12]:
# =====================================================
# 4. QLoRA (Quantized + LoRA)
# =====================================================
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
torch.cuda.set_device(0)

# Get Quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # Load weights in 4-bit precision instead of the usual 16-bit
    #bnb_4bit_compute_dtype=torch.bfloat16, # data type used for computations after quantization.
    #bnb_4bit_use_double_quant=True,        # double quantization, a technique to reduce quantization error (re-quantized with a second small scale factor)
    bnb_4bit_quant_type="nf4"              # nf4 stands for NormalFloat 4-bit (nf4 uses nonlinear mapping)
)

# Base model quantized to 4-bit
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    #load_in_4bit=True,
    quantization_config=bnb_config,
    device_map={'': 0},
    torch_dtype=torch.bfloat16  # optional for LoRA
)

modules = ["q_proj", "v_proj"]
print("Modules:", modules)

peft_config = LoraConfig(
    r=2,
    lora_alpha=8,
    target_modules=modules,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Get LoRA Config in the Quantized Model
model = get_peft_model(model, peft_config)
model

`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 2/2 [00:11<00:00,  5.88s/it]

Modules: ['q_proj', 'v_proj']





PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): Gemma2ForCausalLM(
      (model): Gemma2Model(
        (embed_tokens): Embedding(256000, 2304, padding_idx=0)
        (layers): ModuleList(
          (0-25): 26 x Gemma2DecoderLayer(
            (self_attn): Gemma2Attention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=2304, out_features=2048, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2304, out_features=2, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=2, out_features=2048, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
          

In [13]:
modules = ["q_proj", "v_proj"]
print("Modules:", modules)


Modules: ['q_proj', 'v_proj']


In [14]:
# Define Training Arguments
qlora_args = TrainingArguments(
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    
    optim="paged_adamw_32bit",
    
    output_dir=os.environ["QLORA_FT_MODEL_PATH"],
    save_total_limit=1,
    save_strategy="epoch",
    
    logging_strategy="epoch",
    warmup_steps=30,
    
    learning_rate=5e-5,
    fp16=False,
    bf16=False,

    # System
    report_to="none",
    remove_unused_columns=False,
    dataloader_pin_memory=True,
    dataloader_num_workers=2,
)

# Trainer class
qlora_trainer = SFTTrainer(
    model=model,
    args=qlora_args,
    train_dataset=train_data,
    data_collator=data_collator,
)

# Before training
torch.cuda.reset_peak_memory_stats()
print("Allocated before training:", torch.cuda.memory_allocated()/1e9, "GB")
print("Reserved before training:", torch.cuda.memory_reserved()/1e9, "GB")

# Train qLoRA Model with the Medical Q&A data.
# After training, get peak memory usage
qlora_trainer.train()

print("Peak Allocated during training:", torch.cuda.max_memory_allocated()/1e9, "GB")
print("Peak Reserved during training:", torch.cuda.max_memory_reserved()/1e9, "GB")

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 1}.


Allocated before training: 2.33914368 GB
Reserved before training: 2.64241152 GB


Step,Training Loss
24,4.1105
48,3.9467
72,3.753


Peak Allocated during training: 46.381351936 GB
Peak Reserved during training: 65.781366784 GB


In [16]:
#help(TrainingArguments)

---
#### üíæ Step 4.1: Save QLoRA Fine-Tuned Model


In [17]:
# =====================================================
#    4.1. Save QLoRA Fine-Tuning Model
# =====================================================
# FT QLoRA Model - ChatDoctor
qlora_model_chatdoctor = qlora_trainer.model

# Save QLoRA Model - ChatDoctor
save_path_qlora_ft_model = os.getenv("QLORA_FT_MODEL_PATH", None)
qlora_model_chatdoctor.save_pretrained(save_path_qlora_ft_model)
tokenizer.save_pretrained(save_path_qlora_ft_model)
#qlora_trainer.processing_class.save_pretrained(save_path_qlora_ft_model)


('/leonardo_scratch/large/userexternal/gcortiad/FT-models/scratch/QLoRA_model_chatdoctor_gemma-2-2b-it/tokenizer_config.json',
 '/leonardo_scratch/large/userexternal/gcortiad/FT-models/scratch/QLoRA_model_chatdoctor_gemma-2-2b-it/special_tokens_map.json',
 '/leonardo_scratch/large/userexternal/gcortiad/FT-models/scratch/QLoRA_model_chatdoctor_gemma-2-2b-it/chat_template.jinja',
 '/leonardo_scratch/large/userexternal/gcortiad/FT-models/scratch/QLoRA_model_chatdoctor_gemma-2-2b-it/tokenizer.model',
 '/leonardo_scratch/large/userexternal/gcortiad/FT-models/scratch/QLoRA_model_chatdoctor_gemma-2-2b-it/added_tokens.json',
 '/leonardo_scratch/large/userexternal/gcortiad/FT-models/scratch/QLoRA_model_chatdoctor_gemma-2-2b-it/tokenizer.json')

---
#### üîÑ Step 5: Restart Kernel


In [1]:
# ========================================
# 5. Restart Kernel 
# ========================================
# Restart Kernel to clear cached objects and training artifacts
# and to free GPU Memory (VRAM). This ensures a clean state for inference
# and prevent Out-Of-Memory (OOM) errors.use_gradient_checkpointing=True


---

### üîÆ Step 6: Inference with Base Model and QLoRA FT Model


In [27]:
# =====================================================
# 6. Inference with Base Model and QLoRA FT Model
# =====================================================
# Import models alone
import os
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
from utils.utils import generate_chat_response
from transformers import (
    AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer,
    DataCollatorForLanguageModeling, BitsAndBytesConfig
)
#os.environ["CUDA_VISIBLE_DEVICES"]="0"

# Define Environment Variables
os.environ["DATA_PATH"] = "/leonardo_work/tra26_minwinsc/workshop-AddingKnowledgeToLLMs/datasets/ChatDoctor-dataset/data/"
os.environ["MODEL_PATH"] = "/leonardo_work/tra26_minwinsc/workshop-AddingKnowledgeToLLMs/models/gemma-2-2b-it"
os.environ["QLORA_FT_MODEL_PATH"] = "/leonardo_work/tra26_minwinsc/workshop-AddingKnowledgeToLLMs/FT-models/QLoRA_model_chatdoctor_gemma-2-2b-it"
#"/leonardo_scratch/large/userexternal/gcortiad/FT-models/scratch/QLoRA_model_chatdoctor_gemma-2-2b-it"
#os.path.join("/leonardo_work/tra26_minwinsc/workshop-AddingKnowledgeToLLMs/FT-models/QLoRA_model_chatdoctor_gemma-2-2b-it")

# Define path of the Base Model
base_model_path = os.getenv("MODEL_PATH", None)
base_model_name = str(base_model_path.split("/")[-1])

# Define the path where Full FT Model is saved.
save_path_qlora_ft_model = os.getenv("QLORA_FT_MODEL_PATH", None)

# Read Base Model and Base Tokenizer
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    torch_dtype=torch.bfloat16,                     # Reduce GPU memory
    device_map="auto"                               # Automatically put layers on GPU
)
base_tokenizer = AutoTokenizer.from_pretrained(base_model_path)
#base_tokenizer.pad_token = base_tokenizer.eos_token          # ensure padding token is set


Loading checkpoint shards: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 2/2 [00:01<00:00,  1.16it/s]


In [3]:
# How to do inference?
#help(generate_chat_response)

---

### ‚úÖ Step 6.1: Inference with Base Model


In [28]:
# =====================================================
#    6.1. Inference with Base Model
# =====================================================
instruction = "If you are a doctor, please answer the medical questions based on the patient's description."

user_message = "I woke up this morning feeling the whole room is spinning when i was sitting down. I went to the bathroom walking unsteadily, as i tried to focus i feel nauseous. I try to vomit but it wont come out.. After taking panadol and sleep for few hours, i still feel the same.. By the way, if i lay down or sit down, my head do not spin, only when i want to move around then i feel the whole world is spinning.. And it is normal stomach discomfort at the same time? Earlier after i relieved myself, the spinning lessen so i am not sure whether its connected or coincidences.. Thank you doc!"
user_message2 = "Hello, My husband is taking Oxycodone due to a broken leg/surgery. He has been taking this pain medication for one month. We are trying to conceive our second baby. Will this medication afect the fetus? Or the health of the baby? Or can it bring birth defects? Thank you."

messages = [
    {"role": "user", "content": f"INSTRUCTION:\n{instruction}\n\nPATIENT MESSAGE:\n{user_message}"}
]

response = generate_chat_response(
    messages=messages,
    model=base_model,
    tokenizer=base_tokenizer,
    device="cuda",
    max_new_tokens=512,
    temperature=0.2,
    top_p=0.85,
    top_k=50,
    no_repeat_ngram_size=3,
)

print(response)


I understand you're experiencing dizziness and nausea, and I'm sorry to hear you've been feeling unwell.  

**It's important to understand that I am not a real doctor and cannot provide medical advice.** The information below is for general knowledge and should not be considered a substitute for professional medical advice. 

Based on your description, it sounds like you might be experiencing **vertigo**, which is a feeling of dizziness or spinning sensation.  Here are some possible causes:

* **Benign Paroxysmal Positional Vertigo (BPPV):** This is a common cause of vertigo, where tiny calcium crystals in the inner ear become dislodged and cause the sensation of spinning. It's often triggered by changes in head position.
* **Migraine:**  Some people experience vertigo as a symptom of migraines.
  
**What you should do:**

1. **Seek immediate medical attention:**  It'd be best to see a doctor as soon as possible to rule out any serious underlying conditions. 
2. **Keep track of your sy

---

### üß™ Step 6.2: Inference with QLoRA FT Model


In [29]:
# =====================================================
#    6.2. Inference with QLoRA FT Model
# =====================================================
# Read QLoRA FT Model and QLoRA FT Tokenizer
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

qmodel = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    quantization_config=bnb_config,
    device_map="cuda:0", # device_map="auto",
)

# Read LoRA FT Model and LoRA FT Tokenizer
qlora_model = PeftModel.from_pretrained(qmodel, save_path_qlora_ft_model)

qlora_tokenizer = AutoTokenizer.from_pretrained(save_path_qlora_ft_model)

response = generate_chat_response(
    messages=messages,
    model=qlora_model,
    tokenizer=qlora_tokenizer,
    device="cuda",
    max_new_tokens=512,
    temperature=0.2,
    top_p=0.85,
    top_k=50,
    no_repeat_ngram_size=3,
)

print(response)


Loading checkpoint shards: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 2/2 [00:02<00:00,  1.19s/it]


Hi, thanks for your query. I understand your concern. It seems you are suffering from vertigo. It is a condition where you feel dizzy and the room is rotating. It can be due to various reasons. The most common cause is inner ear problem. It may be due imbalance in inner ear or due to some infection. You need to consult an ENT specialist for proper diagnosis and treatment. He will examine you and may need some tests like MRI of the brain and inner ear. Treatment will be based on diagnosis. It will be mostly anti-vertigo medicines. You will need to take these medicines regularly for some time. You should avoid alcohol and smoking. Hope this answers your query, please feel free to ask further. Wish you good health.
