#  QLoRA Finetuning Sprint - Medical AI Assistant


---

##  Workflow & Dependencies

###  Complete Workflow (All Sections Work!)
1. **Sections 1-10**: Training & Model Saving
2. **Sections 11-12**: LLM Evaluation + Guardrails (uses `google-genai`)
3. **Section 13**: GGUF Export


###  Single Dependency Set:
```python
pip install google-genai "pydantic>=2.9" rouge-score
```

---

##  Important Disclaimers

### Medical Disclaimer
**This is for EDUCATIONAL PURPOSES ONLY.** The models and outputs produced here are NOT intended for clinical use, medical diagnosis, treatment recommendations, or any real-world medical application. Always consult qualified healthcare professionals for medical advice.

### Licensing & Redistribution
- **Model License**: Check the license of the base model you use (e.g., Qwen, Llama, Mistral). Some models have restrictions on commercial use or redistribution.
- **Dataset License**: Verify the license for the medical dataset (e.g., AlpaCare-MedInstruct-52k). Ensure finetuning and redistribution are permitted.
- **Finetuned Weights**: If you merge and share the finetuned model, you must comply with both the base model and dataset licenses. Always include proper attribution and model cards.

---

## Hardware Assumptions (Colab Free Tier)

- **GPU**: NVIDIA T4 (~15 GB VRAM)
- **Compute dtype**: FP16 (T4 does not support BF16)
- **Quantization**: 4-bit NF4 with double quantization
- **Batch settings**: micro_batch_size=1, gradient_accumulation_steps=64
- **Dataset subsample**: 500 examples
- **Max sequence length**: 512 tokens
- **Training steps**: 250

These settings are tuned to avoid OOM on T4 Free tier.
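
To see what the batch settings above imply for the training budget, the arithmetic can be sketched as follows. The 90/10 train/validation split is an assumption taken from the configuration cell later in this notebook, and the variable names are illustrative only.

```python
# Worked arithmetic for the Colab settings listed above.
micro_batch_size = 1
gradient_accumulation_steps = 64
max_steps = 250
subsample = 500
train_fraction = 0.9  # assumption: 90/10 train/validation split

# One optimizer step consumes micro_batch_size * accumulation_steps examples.
effective_batch_size = micro_batch_size * gradient_accumulation_steps
samples_seen = max_steps * effective_batch_size
train_examples = round(subsample * train_fraction)
approx_epochs = samples_seen / train_examples

print(f"Effective batch size: {effective_batch_size}")
print(f"Samples seen in {max_steps} steps: {samples_seen}")
print(f"Approximate passes over {train_examples} training examples: {approx_epochs:.1f}")
```

Under these assumptions the run revisits each training example many times, so `max_steps` rather than the epoch count is what bounds training; lowering `max_steps` may be worth considering if overfitting shows up in the validation loss.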

---

#### 00. Install dependencies

In [None]:
%%capture

# Core training libraries
# Version specifiers are quoted so the shell does not treat ">" as a redirect
!pip install -q --upgrade \
    "transformers>=4.38.0" \
    "accelerate>=0.26.0" \
    peft \
    trl \
    "bitsandbytes>=0.41.0" \
    evaluate \
    "datasets==2.20.0" \
    "tokenizers==0.19.1"

# Utilities
!pip install -q \
    numpy \
    pandas \
    scikit-learn \
    rich \
    pyyaml \
    python-dotenv \
    tqdm

# Evaluation (requires pydantic v2)
!pip install -q --upgrade pydantic
!pip install -q google-genai rouge-score

print(" Installation complete!")
print(" All dependencies compatible (pydantic v2 + google-genai)")

## 1. Setting Up Environment Variables (Secrets)

In [None]:

# Create .env file with API key
import os
from google.colab import userdata
# Write .env file
# with open('.env', 'w') as f:
#     # Add the secrets if needed
#     f.write('GOOGLE_API_KEY=<api_key_here>\n')
#     f.write('HF_TOKEN=<api_key_here>\n')

# print(" .env file created")

with open('.env', 'w') as f:
    # Add the secrets if needed
    # Double quotes inside the f-string keep this valid on Python versions before 3.12
    f.write(f'GOOGLE_API_KEY={userdata.get("GOOGLE_API_KEY")}\n')
    f.write(f'HF_TOKEN={userdata.get("HF_TOKEN")}\n')

print(" .env file created")



 .env file created


In [None]:
# Verify it's loaded
from dotenv import load_dotenv
load_dotenv()
# Show key names plus a short value prefix only (never print full secrets)
try:
    with open('.env', 'r') as f:
        print(" Keys in .env file:")
        print("="*60)
        for line in f:
            line = line.strip()
            if line and not line.startswith('#'):
                # Split on the first '=' only, so values containing '=' stay intact
                key, _, value = line.partition('=')
                value_preview = value[:10] + "..." if value else ""
                print(f"  {key} = {value_preview}")
        print("="*60)
except FileNotFoundError:
    print(" .env file not found")

 Keys in .env file:
  GOOGLE_API_KEY = AIzaSyCkD_...
  HF_TOKEN = hf_SyHkPTh...
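
If even a ten-character prefix feels like too much to show in a shared notebook, a stricter masking helper could look like the sketch below. `mask_env_line` and its `visible` parameter are hypothetical names introduced here for illustration; they are not part of `python-dotenv`.

```python
def mask_env_line(line: str, visible: int = 4) -> str:
    """Return 'KEY = xxxx********' style text for one KEY=VALUE line (hypothetical helper)."""
    key, sep, value = line.strip().partition("=")  # split on the first '=' only
    if not sep:
        return key  # no '=' present, nothing to mask
    if len(value) <= visible:
        # Short values are masked entirely so nothing leaks
        return f"{key} = {'*' * len(value)}"
    # Show a short prefix followed by a fixed-width mask
    return f"{key} = {value[:visible]}{'*' * 8}"

print(mask_env_line("GOOGLE_API_KEY=example-not-a-real-key"))
print(mask_env_line("HF_TOKEN=abc"))
```

The fixed-width mask also hides the length of the secret, which a plain `value[:10] + "..."` preview does as well, but with fewer visible characters.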


## 2. Environment & GPU Check

In [None]:
import sys
import torch

print("="*60)
print("ENVIRONMENT CHECK")
print("="*60)
print(f"Python version: {sys.version}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"Device name: {torch.cuda.get_device_name(0)}")
    print(f"Device capability: {torch.cuda.get_device_capability(0)}")
    print(f"Total VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")
else:
    print(" WARNING: CUDA not available. Training will be VERY slow on CPU.")

print("="*60)

ENVIRONMENT CHECK
Python version: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
PyTorch version: 2.9.0+cu126
CUDA available: True
CUDA version: 12.6
Device name: Tesla T4
Device capability: (7, 5)
Total VRAM: 14.74 GB


## 3. Seeds & Determinism

Set random seeds for Python, NumPy, and PyTorch so that runs are reproducible.

In [None]:
import os
import random
import numpy as np
import torch

SEED = 42

# Set PYTHONHASHSEED (only affects Python processes started after this point;
# the running interpreter fixed its hash seed at startup)
os.environ['PYTHONHASHSEED'] = str(SEED)

# Set seeds
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

if torch.cuda.is_available():
    torch.cuda.manual_seed(SEED)
    torch.cuda.manual_seed_all(SEED)
    # Note: These settings may impact performance
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

print(f" Seeds set to {SEED} for reproducibility")

 Seeds set to 42 for reproducibility
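
A quick way to confirm that seeding behaves as expected is to draw the same numbers twice. The sketch below uses only Python's built-in `random` module; `sample_with_seed` is a hypothetical helper, and the same idea applies to `np.random.seed` and `torch.manual_seed`.

```python
import random

def sample_with_seed(seed: int, n: int = 5) -> list:
    """Draw n pseudo-random integers after seeding Python's RNG."""
    random.seed(seed)
    return [random.randint(0, 999) for _ in range(n)]

first = sample_with_seed(42)
second = sample_with_seed(42)
other = sample_with_seed(43)

print("Same seed gives identical draws:", first == second)
print("Different seed gives identical draws:", first == other)
```

Note that seeding alone does not guarantee bit-identical GPU training runs; some CUDA kernels are non-deterministic even with the cuDNN flags set above.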


## 4. Hugging Face Login

To push your finetuned adapter to the Hugging Face Hub, log in with a token that has write permission. The cell below reads the token from the `HF_TOKEN` environment variable.

Get a token at: https://huggingface.co/settings/tokens

In [None]:

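# HF_TOKEN is loaded from .env by load_dotenv(); fail early with a clear message if it is missing
hf_token = os.getenv("HF_TOKEN")
if not hf_token:
    raise RuntimeError("HF_TOKEN is not set. Add it to .env or to Colab secrets.")
os.environ["HF_TOKEN"] = hf_token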
!hf auth login --token $HF_TOKEN

print("ℹ Hugging Face login skipped. Uncomment login() to push models to Hub.")

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `hf`CLI if you want to set the git credential as well.
Token is valid (permission: write).
The token `Sahas AI` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.
ℹ Hugging Face login skipped. Uncomment login() to push models to Hub.


## 5. Configuration (Single Source of Truth)

All hyperparameters and settings in one place. **Edit here** to customize your training.

In [None]:
import torch
from pprint import pprint

# Auto-detect compute dtype (BF16 requires compute capability >= 8.0)
use_bf16 = torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8
compute_dtype = torch.bfloat16 if use_bf16 else torch.float16

CONFIG = {
    # Model
    "base_model": "Qwen/Qwen2.5-1.5B-Instruct",
    # Alternative for tighter VRAM: "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
    # For GGUF export, prefer: "meta-llama/Llama-3.2-3B-Instruct" or Mistral models

    # Dataset
    "dataset_name": "lavita/AlpaCare-MedInstruct-52k",
    "dataset_split": "train",
    "dataset_subsample": 500,  # Colab-safe: 500 | Local: 1500
    "train_val_split": 0.9,  # 90% train, 10% validation

    # Tokenization
    "max_length": 512,  # Colab: 512 | Local: 1024

    # Training
    "num_train_epochs": 1,
    "max_steps": 250,  # Colab: 250 | Local: 600
    "per_device_train_batch_size": 1,  # Colab: 1 | Local: 2
    "gradient_accumulation_steps": 64,  # Colab: 64 | Local: 32
    "learning_rate": 2e-5,
    "warmup_ratio": 0.03,
    "logging_steps": 10,
    "save_steps": 200,
    "eval_steps": 100,
    "save_total_limit": 2,

    # LoRA
    "lora_r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "lora_target_modules": ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],

    # Quantization
    "load_in_4bit": True,
    "bnb_4bit_compute_dtype": compute_dtype,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,

    # Output
    "output_dir": "outputs/adapter",
    "push_to_hub": False,

    # Generation
    "max_new_tokens": 128,
    "temperature": 0.0,  # Deterministic
    "do_sample": True,

    # HF credentials
    'hf_username': 'p-sahas',
    'hub_model_name': 'sahas-medical-assistant',
}

# Effective batch size
effective_batch_size = CONFIG["per_device_train_batch_size"] * CONFIG["gradient_accumulation_steps"]

print("="*60)
print("CONFIGURATION (COLAB FREE TIER)")
print("="*60)
pprint(CONFIG)
print("="*60)
print(f"Compute dtype: {compute_dtype}")
print(f"Using BF16: {use_bf16}")
print(f"Effective batch size: {effective_batch_size}")
print("="*60)

CONFIGURATION (COLAB FREE TIER)
{'base_model': 'Qwen/Qwen2.5-1.5B-Instruct',
 'bnb_4bit_compute_dtype': torch.float16,
 'bnb_4bit_quant_type': 'nf4',
 'bnb_4bit_use_double_quant': True,
 'dataset_name': 'lavita/AlpaCare-MedInstruct-52k',
 'dataset_split': 'train',
 'dataset_subsample': 500,
 'do_sample': True,
 'eval_steps': 100,
 'gradient_accumulation_steps': 64,
 'hf_username': 'p-sahas',
 'hub_model_name': 'sahas-medical-assistant',
 'learning_rate': 2e-05,
 'load_in_4bit': True,
 'logging_steps': 10,
 'lora_alpha': 32,
 'lora_dropout': 0.05,
 'lora_r': 16,
 'lora_target_modules': ['q_proj',
                         'k_proj',
                         'v_proj',
                         'o_proj',
                         'gate_proj',
                         'up_proj',
                         'down_proj'],
 'max_length': 512,
 'max_new_tokens': 128,
 'max_steps': 250,
 'num_train_epochs': 1,
 'output_dir': 'outputs/adapter',
 'per_device_train_batch_size': 1,
 'push_to_hub': False,


#### FP16 vs BF16

- BF16 stands for "Brain Float 16" (bfloat16).
- FP16 prioritizes precision:
    - 5 exponent bits
    - 10 mantissa bits
- BF16 prioritizes dynamic range:
    - 8 exponent bits
    - 7 mantissa bits
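
The trade-off between the two layouts can be made concrete with a little arithmetic. The sketch below derives the largest finite value and the machine epsilon from the exponent and mantissa bit counts alone; `float_format_stats` is a hypothetical helper written for this note, not a PyTorch function.

```python
def float_format_stats(exponent_bits: int, mantissa_bits: int) -> dict:
    """Largest finite value and machine epsilon for an IEEE-style binary format."""
    bias = 2 ** (exponent_bits - 1) - 1
    max_exponent = bias  # the all-ones exponent field is reserved for inf/NaN
    max_value = (2 - 2 ** -mantissa_bits) * 2.0 ** max_exponent
    epsilon = 2.0 ** -mantissa_bits  # gap between 1.0 and the next representable value
    return {"max": max_value, "epsilon": epsilon}

fp16 = float_format_stats(exponent_bits=5, mantissa_bits=10)
bf16 = float_format_stats(exponent_bits=8, mantissa_bits=7)

print(f"FP16: max = {fp16['max']:.0f}, epsilon = {fp16['epsilon']:.6f}")
print(f"BF16: max = {bf16['max']:.3e}, epsilon = {bf16['epsilon']:.6f}")
```

FP16 overflows just above 65504, which is why FP16 mixed-precision training relies on loss scaling, while BF16 covers roughly the same range as FP32 at a coarser step size.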


## 6. Dataset Loader (+ Fallback)

Load the medical instruction dataset, map fields robustly, and create train/validation splits.

In [None]:
from datasets import load_dataset, Dataset
import json

def load_medical_dataset(dataset_name, split, subsample, seed=42):
    """Load dataset with robust field mapping and fallback."""

    try:
        # Try loading from Hugging Face
        print(f" Loading dataset: {dataset_name}...")
        dataset = load_dataset(dataset_name, split=split)
        dataset = dataset.shuffle(seed=seed).select(range(min(subsample, len(dataset))))
        print(f" Loaded {len(dataset)} examples from Hugging Face")

    except Exception as e:
        print(f" Failed to load from Hugging Face: {e}")
        print(" Creating synthetic fallback dataset...")

        # Create synthetic medical instruction data
        synthetic_data = []
        templates = [
            {
                "instruction": "Explain the following medical term in simple language.",
                "input": "Hypertension",
                "output": "Hypertension, commonly known as high blood pressure, is a condition where the force of blood against artery walls is consistently too high. This can lead to serious health complications if left untreated."
            },
            {
                "instruction": "What are the common symptoms of the following condition?",
                "input": "Type 2 Diabetes",
                "output": "Common symptoms of Type 2 Diabetes include increased thirst, frequent urination, increased hunger, fatigue, blurred vision, slow-healing sores, and frequent infections."
            },
            {
                "instruction": "Provide general advice for managing the following health issue.",
                "input": "Chronic back pain",
                "output": "Managing chronic back pain typically involves: maintaining good posture, regular low-impact exercise like swimming or walking, maintaining a healthy weight, using proper lifting techniques, and consulting with healthcare providers for appropriate treatment options."
            },
        ]

        # Duplicate to reach ~120 examples
        for i in range(40):
            for template in templates:
                synthetic_data.append(template)

        # Save to temporary JSONL
        with open("/tmp/synthetic_medical.jsonl", "w") as f:
            for item in synthetic_data[:subsample]:
                f.write(json.dumps(item) + "\n")

        dataset = load_dataset("json", data_files="/tmp/synthetic_medical.jsonl", split="train")
        print(f" Created synthetic dataset with {len(dataset)} examples")

    return dataset


def map_dataset_fields(example):
    """Robustly map dataset fields to instruction/input/output schema."""

    # Try to find instruction
    instruction = None
    for key in ["instruction", "question", "prompt", "task"]:
        if key in example and example[key]:
            instruction = str(example[key]).strip()
            break

    # Try to find input (optional)
    input_text = ""
    for key in ["input", "context", "passage", "history"]:
        if key in example and example[key]:
            input_text = str(example[key]).strip()
            break

    # Try to find output/target
    output = None
    for key in ["output", "response", "answer", "target", "completion"]:
        if key in example and example[key]:
            output = str(example[key]).strip()
            break

    return {
        "instruction": instruction,
        "input": input_text,
        "output": output
    }


# Load dataset
dataset = load_medical_dataset(
    CONFIG["dataset_name"],
    CONFIG["dataset_split"],
    CONFIG["dataset_subsample"],
    seed=SEED
)

num_before = len(dataset)
print(f"\n Dataset before cleaning: {num_before} examples")

# Map fields
dataset = dataset.map(map_dataset_fields)

# Drop rows with missing instruction or output
dataset = dataset.filter(lambda x: x["instruction"] is not None and x["output"] is not None)

print(f" Dataset after cleaning: {len(dataset)} examples")
# Compare against the loaded size (not the configured subsample) so the count
# stays correct when the smaller synthetic fallback dataset is used
print(f" Dropped {num_before - len(dataset)} examples with missing data\n")

# Split into train/validation
split_dataset = dataset.train_test_split(
    train_size=CONFIG["train_val_split"],
    seed=SEED
)
train_dataset = split_dataset["train"]
val_dataset = split_dataset["test"]

print(f" Train: {len(train_dataset)} | Validation: {len(val_dataset)}")
print("\n Sample example:")
print(train_dataset[0])

 Loading dataset: lavita/AlpaCare-MedInstruct-52k...


Downloading readme:   0%|          | 0.00/944 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/36.7M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/52002 [00:00<?, ? examples/s]

 Loaded 500 examples from Hugging Face

 Dataset before cleaning: 500 examples



 Dataset after cleaning: 500 examples
 Dropped 0 examples with missing data

 Train: 450 | Validation: 50

 Sample example:
{'output': "As a 40-year-old pregnant woman, your age does increase the risk of having a baby with Down syndrome. However, it's important to note that the majority of babies born to women in their 40s are still healthy and do not have Down syndrome. \n\nThe risk of having a baby with Down syndrome at the age of 40 is approximately 1 in 100. This means that out of 100 pregnancies at this age, around 1 will be affected by Down syndrome. \n\nTo get more accurate information about your individual risk, you may consider undergoing prenatal screening or diagnostic tests. These tests can provide more specific information regarding the chance of your baby having Down syndrome. It's advisable to consult with your healthcare provider who can guide you through the appropriate testing options based on your personal medical history and preferences.", 'input': '<noinput>', 'ins

### Preview First 50 Samples

Let's visualize the first 50 samples of the dataset as a dataframe for easy inspection.

In [None]:
import pandas as pd

# Convert first 50 samples to dataframe
df_preview = pd.DataFrame(train_dataset[:50])

# Display with formatting
pd.set_option('display.max_colwidth', 100)  # Limit column width for readability
print(f" Displaying first 50 samples out of {len(dataset)} total examples\n")
df_preview

 Displaying first 50 samples out of 500 total examples



Unnamed: 0,output,input,instruction
0,"As a 40-year-old pregnant woman, your age does increase the risk of having a baby with Down synd...",<noinput>,"Ask about the possible genetic risks your child might face related to Down Syndrome, given that ..."
1,"As a medical expert, I cannot provide a specific treatment recommendation without evaluating the...","Patient information: 55 years old female, with a known family history of essential hypertension ...","Based on the given medical history, which treatment option for essential hypertension would be b..."
2,The heart's electrical system plays a crucial role in making the heart beat and ensuring the con...,"The heart's electrical wiring keeps it beating, which controls the continuous exchange of oxygen...",Simplify the explanation about heart's electrical system and its role in making the heart beat.
3,Chemotherapy is a common treatment for breast cancer and can be effective in destroying cancer c...,I got diagnosed with breast cancer and my doctor said I need chemotherapy. I'm worried about the...,Discuss your concerns about chemotherapy's side effects with an oncologist.
4,Pneumonia is an infection that causes inflammation in the small air sacs called alveoli in one o...,"""Pneumonia is an infection that inflames the alveoli in one or both lungs.",Simplify the following complex medical term into simpler terminologies.
5,"Based on the symptoms and history provided, there are several possible diagnoses to consider. Th...","Patient is 45 female, shortness of breath especially on laying down, fatigue, lower ankle swelli...","Based on the symptoms and history provided, make a probable diagnosis considering multiple disea..."
6,The major type of muscle present at the back region is C) Skeletal muscles.,A) Smooth muscles B) Cardiac muscles C) Skeletal muscles D) Pharyngeal muscle,Choose the major type of muscle present at the back region.
7,Heart failure develops over time as a result of various underlying conditions and factors. Initi...,<noinput>,Write a comprehensive paragraph explaining how heart failure develops over time.
8,How does your muscular system work when you lift a heavy object?,<noinput>,Ask a question related to how your muscular system works when you lift a heavy object.
9,To manage high blood glucose levels in a patient with type-2 diabetes mellitus who is already on...,A 60-year-old man with a history of type-2 diabetes mellitus is using Metformin. Upon routine ch...,"Solve the following USMLE-style question, providing the correct answer supported by reasoning."


### Token Length & Truncation Diagnostics

In [None]:
from transformers import AutoTokenizer
import numpy as np

# Load tokenizer for diagnostics
print(f"Loading tokenizer: {CONFIG['base_model']}")
# Try primary load; fall back if the model's remote code/tokenizer isn't importable
try:
    tokenizer = AutoTokenizer.from_pretrained(CONFIG["base_model"], trust_remote_code=True)
except ModuleNotFoundError as e:
    print("ModuleNotFoundError while loading tokenizer:", e)
    print("Trying again with use_fast=False (fallback to Python tokenizer)...")
    try:
        tokenizer = AutoTokenizer.from_pretrained(CONFIG["base_model"], trust_remote_code=True, use_fast=False)
    except Exception as e2:
        print("Fallback to model tokenizer failed:", e2)
        fallback = "gpt2"
        print(f"Falling back to a generic tokenizer ({fallback}) for diagnostics (lengths will be approximate).")
        tokenizer = AutoTokenizer.from_pretrained(fallback)

# Ensure pad token is set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Sample up to 500 examples for diagnostics
sample_size = min(500, len(train_dataset))
sample_dataset = train_dataset.select(range(sample_size))

# Compute token lengths
token_lengths = []
for example in sample_dataset:
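    # Concatenate instruction + input + output (raw text only; the system prompt and
    # ChatML tags added later contribute extra tokens, so these lengths are a lower bound)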
    text = f"{example['instruction']} {example['input']} {example['output']}"
    tokens = tokenizer(text, add_special_tokens=True)
    token_lengths.append(len(tokens["input_ids"]))

token_lengths = np.array(token_lengths)

print("="*60)
print("TOKEN LENGTH DIAGNOSTICS")
print("="*60)
print(f"Sample size: {sample_size}")
print(f"Average token length: {token_lengths.mean():.1f}")
print(f"Median token length: {np.median(token_lengths):.1f}")
print(f"Min token length: {token_lengths.min()}")
print(f"Max token length: {token_lengths.max()}")
print(f"95th percentile: {np.percentile(token_lengths, 95):.1f}")
print(f"99th percentile: {np.percentile(token_lengths, 99):.1f}")
print()
truncated = (token_lengths > CONFIG["max_length"]).sum()
truncation_rate = truncated / len(token_lengths) * 100
print(f"Truncation at max_length={CONFIG['max_length']}: {truncated}/{len(token_lengths)} ({truncation_rate:.1f}%)")
print("="*60)

Loading tokenizer: Qwen/Qwen2.5-1.5B-Instruct



TOKEN LENGTH DIAGNOSTICS
Sample size: 450
Average token length: 233.0
Median token length: 265.0
Min token length: 28
Max token length: 496
95th percentile: 363.5
99th percentile: 423.4

Truncation at max_length=512: 0/450 (0.0%)
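
The truncation check itself needs nothing beyond a list of token counts. The sketch below is a dependency-free version using a nearest-rank percentile (which can differ slightly from `np.percentile`'s default linear interpolation); `truncation_report` and the example lengths are hypothetical and not taken from the dataset above.

```python
import math

def truncation_report(lengths, max_length):
    """Count sequences longer than max_length and report a nearest-rank 95th percentile."""
    ordered = sorted(lengths)
    truncated = sum(1 for n in ordered if n > max_length)
    rank = math.ceil(0.95 * len(ordered))  # nearest-rank percentile, 1-indexed
    return {
        "truncated": truncated,
        "rate_pct": 100.0 * truncated / len(ordered),
        "p95": ordered[rank - 1],
    }

# Hypothetical token counts for illustration only
example_lengths = [120, 250, 300, 480, 505, 530, 610, 90, 410, 275]
report = truncation_report(example_lengths, max_length=512)
print(report)
```

Because the diagnostic cell tokenizes raw text rather than the final formatted prompt, it may be worth re-running a check like this on the `text` field once the SFT prompts are built, since the longest raw example (496 tokens) sits close to the 512-token limit.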


### Build SFT Prompts

In [None]:
def chatml_format(user_text, system_text="You are a helpful medical assistant.", assistant_text=None):
    """Format text in ChatML style."""
    messages = [
        {"role": "system", "content": system_text},
        {"role": "user", "content": user_text},
    ]
    if assistant_text:
        messages.append({"role": "assistant", "content": assistant_text})

    # ChatML format
    formatted = ""
    for msg in messages:
        formatted += f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>"
    return formatted

def build_sft_prompt(row):
    """Build SFT prompt from dataset row."""
    user_text = row['instruction']
    if row['input']:
        user_text += f"\n\nInput: {row['input']}"

    prompt = chatml_format(user_text=user_text,
                           assistant_text= row['output'])

    return {"text": prompt}

# Map datasets to text format
train_dataset = train_dataset.map(build_sft_prompt)
val_dataset = val_dataset.map(build_sft_prompt)

print(" Prompts built for SFT")
print("\n Sample formatted prompt:")
print(train_dataset[0]["text"][:500] + "...")


 Prompts built for SFT

 Sample formatted prompt:
<|im_start|>system
You are a helpful medical assistant.<|im_end|><|im_start|>user
Ask about the possible genetic risks your child might face related to Down Syndrome, given that you're a 40years old pregnant woman.

Input: <noinput><|im_end|><|im_start|>assistant
As a 40-year-old pregnant woman, your age does increase the risk of having a baby with Down syndrome. However, it's important to note that the majority of babies born to women in their 40s are still healthy and do not have Down syndrome...
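
The sample above shows two details that may be worth tightening: the dataset's `<noinput>` placeholder ends up in the prompt as `Input: <noinput>`, and there is no newline after `<|im_end|>`. The following is a minimal sketch of one way to handle both. It assumes `<noinput>` is the only placeholder used and that the model's chat template puts a newline after each `<|im_end|>`; comparing against `tokenizer.apply_chat_template(messages, tokenize=False)` would confirm the exact format. `build_user_text` and `to_chatml` are hypothetical helper names.

```python
NO_INPUT_MARKERS = {"", "<noinput>"}  # assumption: placeholders that mean "no input"

def build_user_text(instruction: str, input_text: str) -> str:
    """Combine instruction and input, skipping placeholder inputs."""
    cleaned = (input_text or "").strip()
    if cleaned.lower() in NO_INPUT_MARKERS:
        return instruction
    return f"{instruction}\n\nInput: {cleaned}"

def to_chatml(messages) -> str:
    """Render messages as ChatML with a newline after each <|im_end|>."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

user_text = build_user_text("Explain hypertension in simple terms.", "<noinput>")
print(to_chatml([
    {"role": "system", "content": "You are a helpful medical assistant."},
    {"role": "user", "content": user_text},
]))
```

Keeping the training format identical to the format used at inference time generally matters more than which variant is chosen.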


## 7. Baseline Inference (BEFORE Finetuning)

Load the base model and run inference on two medical prompts to establish a baseline. The cell below loads the weights in FP16; the 4-bit quantization config is left commented out.
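
For a rough sense of how much weight memory different precisions need on a T4, the estimate below may help. It is a back-of-the-envelope sketch: the parameter count is an assumed round figure, and the per-weight overhead for quantization constants (about 0.5 bits with 64-weight blocks, about 0.127 bits with double quantization) follows the figures reported for QLoRA. `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB for a given bit width."""
    return num_params * bits_per_param / 8 / 1024**3

NUM_PARAMS = 1.5e9  # assumption: rough parameter count for a 1.5B model

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
# NF4 stores 4 bits per weight plus quantization constants per block
nf4_gib = weight_memory_gib(NUM_PARAMS, 4 + 0.5)
# Double quantization compresses those constants further
nf4_dq_gib = weight_memory_gib(NUM_PARAMS, 4 + 0.127)

print(f"FP16 weights:              {fp16_gib:.2f} GiB")
print(f"NF4 weights:               {nf4_gib:.2f} GiB")
print(f"NF4 + double quantization: {nf4_dq_gib:.2f} GiB")
```

Actual usage will be higher than these numbers: embeddings and some layers typically stay in higher precision, and activations, the KV cache, and the CUDA context all add to the total.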

In [35]:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import time
import torch

# Quantization config
# bnb_config = BitsAndBytesConfig(
#     # load_in_4bit=CONFIG["load_in_4bit"],
#     bnb_4bit_compute_dtype=torch.float16,  # Explicitly use torch.float16
#     # bnb_4bit_compute_dtype=CONFIG["bnb_4bit_compute_dtype"],
#     bnb_4bit_quant_type=CONFIG["bnb_4bit_quant_type"],
#     bnb_4bit_use_double_quant=CONFIG["bnb_4bit_use_double_quant"],
# )

# Safe version - always use float16 for T4
# bnb_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_compute_dtype=torch.float16,  # Hardcode to float16
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_use_double_quant=True,
# )

base_model = AutoModelForCausalLM.from_pretrained(
    CONFIG["base_model"],
    torch_dtype=torch.float16,
    # quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Test prompts
test_prompts = [
    {
        "title": "Patient Instruction Clarification",
        "prompt": "Rewrite the following patient instruction in simpler language: Take 500mg of acetaminophen orally every 6 hours as needed for pain, not to exceed 3000mg in 24 hours."
    },
    {
        "title": "Medical Note Summary",
        "prompt": "Summarize this medical note: Patient presents with acute onset of chest pain radiating to left arm, accompanied by dyspnea and diaphoresis. Vitals: BP 145/92, HR 98, RR 22, SpO2 94% on room air."
    },
]

print("\n" + "="*60)
print("BASELINE OUTPUTS (PRE-FINETUNE)")
print("="*60)

if torch.cuda.is_available():
    vram_before = torch.cuda.memory_allocated() / 1024**3
    print(f"VRAM before generation: {vram_before:.2f} GB\n")

for i, test in enumerate(test_prompts, 1):
    # Format as ChatML and open the assistant turn so the model writes the reply
    formatted_prompt = chatml_format(test['prompt']) + "<|im_start|>assistant\n"
    inputs = tokenizer(formatted_prompt, return_tensors="pt").to(base_model.device)

    # temperature=0.0 means greedy decoding, so only sample when temperature > 0
    use_sampling = CONFIG["do_sample"] and CONFIG["temperature"] > 0

    start_time = time.time()
    with torch.no_grad():
        outputs = base_model.generate(
            **inputs,
            max_new_tokens=CONFIG["max_new_tokens"],
            temperature=CONFIG["temperature"] if use_sampling else None,
            do_sample=use_sampling,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    # start_time is a float, so subtract it directly (calling it raised a TypeError)
    generation_time = time.time() - start_time

    # Decode the single generated sequence and print it with its title and timing
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=False)
    print(f"[{i}] {test['title']} ({generation_time:.2f}s)")
    print(generated_text)
    print("-"*60)


if torch.cuda.is_available():
    vram_after = torch.cuda.memory_allocated() / 1024**3
    print(f"\nVRAM after generation: {vram_after:.2f} GB")
    print(f"VRAM delta: {vram_after - vram_before:.2f} GB")

print("\n" + "="*60)


BASELINE OUTPUTS (PRE-FINETUNE)
VRAM before generation: 4.48 GB

## 8. Model + LoRA Setup and Training

Prepare the model for k-bit training, apply LoRA, and train with SFTTrainer.
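
Before filling in the LoRA setup, it can help to estimate how many parameters the adapters will train. The sketch below counts LoRA parameters for the seven target modules listed in `CONFIG`, using rank 16 and alpha 32. The layer shapes are assumptions for Qwen2.5-1.5B and should be checked against the model's `config.json`; `lora_params` is a hypothetical helper.

```python
# Assumed layer shapes for Qwen2.5-1.5B (verify against the model's config.json).
HIDDEN = 1536
KV_DIM = 256          # assumed: 2 key/value heads x 128 head dimension
INTERMEDIATE = 8960
NUM_LAYERS = 28

# module name -> (in_features, out_features)
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features: int, out_features: int, r: int) -> int:
    """LoRA adds A (r x in_features) and B (out_features x r) per adapted matrix."""
    return r * (in_features + out_features)

r, alpha = 16, 32
per_layer = sum(lora_params(i, o, r) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS

print(f"Trainable LoRA parameters per layer: {per_layer:,}")
print(f"Trainable LoRA parameters in total:  {total:,}")
print(f"LoRA scaling factor (alpha / r):     {alpha / r}")
```

Under these assumptions the adapters amount to roughly 1% of the base model's weights, which is what keeps optimizer state small enough for a T4.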

In [None]:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig

# Prepare model for k-bit training
print("🔧 Preparing model for k-bit training...")
"""YOUR CODE HERE"""

# LoRA config
"""YOUR CODE HERE"""

# Apply LoRA
print("🔧 Applying LoRA adapters...")
"""YOUR CODE HERE"""

# Training arguments
training_args = SFTConfig(
    output_dir=CONFIG["output_dir"],
    num_train_epochs=CONFIG["num_train_epochs"],
    max_steps=CONFIG["max_steps"],
    per_device_train_batch_size=CONFIG["per_device_train_batch_size"],
    per_device_eval_batch_size=CONFIG["per_device_train_batch_size"],
    gradient_accumulation_steps=CONFIG["gradient_accumulation_steps"],
    learning_rate=CONFIG["learning_rate"],
    warmup_ratio=CONFIG["warmup_ratio"],
    logging_steps=CONFIG["logging_steps"],
    save_steps=CONFIG["save_steps"],
    eval_steps=CONFIG["eval_steps"],
    save_total_limit=CONFIG["save_total_limit"],
    fp16=not use_bf16,
    bf16=use_bf16,
    optim="paged_adamw_8bit",
    lr_scheduler_type="cosine",
    report_to="none",
    max_seq_length=CONFIG["max_length"],
    packing=False,
    dataset_text_field="text",
)

# Create trainer
"""YOUR CODE HERE"""

print("\n" + "="*60)
print("TRAINING CONFIGURATION")
print("="*60)
print(f"Effective batch size: {effective_batch_size}")
print(f"Training steps: {CONFIG['max_steps']}")
print(f"Total samples: {CONFIG['max_steps'] * effective_batch_size}")
print(f"Optimizer: paged_adamw_8bit")
print(f"Learning rate: {CONFIG['learning_rate']}")
print(f"LoRA rank: {CONFIG['lora_r']}")
print("="*60)

# Train
print("\n🚀 Starting training...\n")
"""YOUR CODE HERE"""

# To resume from checkpoint, uncomment:
# trainer.train(resume_from_checkpoint=True)

# print("\n" + "="*60)
# print("TRAINING COMPLETE")
# print("="*60)
# print(f"Total training time: {train_result.metrics.get('train_runtime', 0):.2f}s")
# print(f"Samples per second: {train_result.metrics.get('train_samples_per_second', 0):.2f}")
# print(f"Steps per second: {train_result.metrics.get('train_steps_per_second', 0):.4f}")

# # Estimate tokens/sec
# total_tokens = CONFIG['max_steps'] * effective_batch_size * CONFIG['max_length']
# tokens_per_sec = total_tokens / train_result.metrics.get('train_runtime', 1)
# print(f"Approximate tokens/second: {tokens_per_sec:.1f}")
# print("="*60)

In [None]:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Clear any existing models from memory
if torch.cuda.is_available():
    torch.cuda.empty_cache()

# Minimal quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

print("Attempting to load model...")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

try:
    base_model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen2.5-1.5B-Instruct",
        quantization_config=bnb_config,
        trust_remote_code=True,
    )
    print("✓ Model loaded successfully!")
    print(f"Model device: {base_model.device}")

except Exception as e:
    print(f"✗ Error during loading: {type(e).__name__}")
    print(f"Error message: {str(e)}")
    import traceback
    traceback.print_exc()

In [None]:
!pip install --upgrade transformers accelerate bitsandbytes

In [None]:
base_model = AutoModelForCausalLM.from_pretrained(
    CONFIG["base_model"],
    quantization_config=bnb_config,
    # trust_remote_code=True,  # Comment this out temporarily
)