# Medical Assistant - LLM Fine-Tuning Project

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/YOUR_USERNAME/medical-assistant-llm/blob/main/Medical_Assistant_LLM_FineTuning.ipynb)

---

## Project Overview

**Purpose:** Develop an intelligent Medical Assistant chatbot that provides accurate medical information, explains terminology, and answers health-related questions using a fine-tuned LLM.

**Domain Justification:** Many people lack immediate access to medical professionals for basic health questions. An always-available AI assistant can help people in rural communities and developing countries, and it can support patient education.

**Disclaimer:** Educational tool only - NOT a substitute for professional medical advice.

---

## Installation & Setup

Installing all required libraries for model fine-tuning, evaluation, and deployment.

In [1]:
# Colab-optimized installation - minimal packages only
print("Installing packages for Google Colab...\n")

# Install only what Colab doesn't have pre-installed
!pip install -q -U bitsandbytes peft trl
!pip install -q -U evaluate rouge-score

print("\n Installation complete!")

Installing packages for Google Colab...

 Installation complete!


## Import Libraries

In [2]:
import os
import json
import torch
import pandas as pd
import numpy as np
from datasets import load_dataset, Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline
)
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
from trl import SFTTrainer
import gradio as gr
from typing import List, Dict
import warnings
warnings.filterwarnings('ignore')

# For evaluation metrics
import nltk
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt', quiet=True)

# Load evaluation metrics
from evaluate import load
bleu_metric = load("bleu")
rouge_metric = load("rouge")

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")

PyTorch version: 2.9.0+cu128
CUDA available: True
GPU: Tesla T4
GPU Memory: 14.56 GB


## Dataset Collection & Preprocessing

**Dataset:** Medical Meadow Medical Flashcards (`medalpaca/medical_meadow_medical_flashcards`)  
**Size:** 33,955 medical Q&A pairs | Using 5,000 samples for efficient training  
**Quality:** Professionally curated from medical education resources

In [3]:
# Load the medical dataset
print("Loading medical dataset...")
dataset = load_dataset("medalpaca/medical_meadow_medical_flashcards")

print(f"\nDataset loaded successfully.")
print(f"Total samples: {len(dataset['train'])}")
print(f"\nDataset structure:")
print(dataset)

# Display sample data
print("\nSample Medical Q&A:")
for i in range(3):
    sample = dataset['train'][i]
    print(f"\n--- Sample {i+1} ---")
    print(f"Question: {sample['input'][:200]}..." if len(sample['input']) > 200 else f"Question: {sample['input']}")
    print(f"Answer: {sample['output'][:200]}..." if len(sample['output']) > 200 else f"Answer: {sample['output']}")

Loading medical dataset...



Dataset loaded successfully.
Total samples: 33955

Dataset structure:
DatasetDict({
    train: Dataset({
        features: ['input', 'output', 'instruction'],
        num_rows: 33955
    })
})

Sample Medical Q&A:

--- Sample 1 ---
Question: What is the relationship between very low Mg2+ levels, PTH levels, and Ca2+ levels?
Answer: Very low Mg2+ levels correspond to low PTH levels which in turn results in low Ca2+ levels.

--- Sample 2 ---
Question: What leads to genitourinary syndrome of menopause (atrophic vaginitis)?
Answer: Low estradiol production leads to genitourinary syndrome of menopause (atrophic vaginitis).

--- Sample 3 ---
Question: What does low REM sleep latency and experiencing hallucinations/sleep paralysis suggest?
Answer: Low REM sleep latency and experiencing hallucinations/sleep paralysis suggests narcolepsy.


### Data Preprocessing Pipeline

1. **Data Cleaning** - Remove nulls, decode HTML entities, strip extra whitespace
2. **Filtering** - Remove short entries (answers under 10 characters, questions under 5) and duplicate questions
3. **Normalization** - Collapse repeated whitespace into single spaces
4. **Format Conversion** - Instruction-response template with `[INST]` prompts
5. **Dataset Split** - 90% train, 10% validation
6. **Tokenization** - Model tokenizer, max 512 tokens, special tokens added

In [4]:
import re
from html import unescape

def clean_text(text: str) -> str:
    """
    Comprehensive text cleaning function.

    Args:
        text: Raw text string

    Returns:
        Cleaned text string
    """
    if not isinstance(text, str):
        return ""

    # Decode HTML entities
    text = unescape(text)

    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text)

    # Strip leading/trailing whitespace
    text = text.strip()

    return text

def preprocess_dataset(dataset, sample_size: int = 5000):
    """
    Comprehensive preprocessing pipeline for the medical dataset.

    Args:
        dataset: HuggingFace dataset object
        sample_size: Number of samples to use (for efficient training)

    Returns:
        Processed dataset ready for fine-tuning
    """
    print("Starting data preprocessing...")

    # Convert to pandas for easier manipulation
    df = pd.DataFrame(dataset['train'])

    print(f"Initial dataset size: {len(df)}")

    # Step 1: Remove entries with missing data
    df = df.dropna(subset=['input', 'output'])
    print(f"After removing nulls: {len(df)}")

    # Step 2: Clean text
    df['input'] = df['input'].apply(clean_text)
    df['output'] = df['output'].apply(clean_text)

    # Step 3: Filter out very short responses
    df = df[df['output'].str.len() >= 10]
    df = df[df['input'].str.len() >= 5]
    print(f"After filtering short entries: {len(df)}")

    # Step 4: Remove duplicates
    df = df.drop_duplicates(subset=['input'])
    print(f"After removing duplicates: {len(df)}")

    # Step 5: Sample dataset for efficient training (1000-5000 samples as recommended)
    if len(df) > sample_size:
        df = df.sample(n=sample_size, random_state=42)
        print(f"Sampled to: {len(df)} samples")

    # Step 6: Create instruction format
    def format_instruction(row):
        """Format data into instruction-response template."""
        return f"""<s>[INST] You are a knowledgeable medical assistant. Answer the following medical question accurately and clearly.

Question: {row['input']} [/INST]
{row['output']}</s>"""

    df['text'] = df.apply(format_instruction, axis=1)

    # Step 7: Split into train and validation sets (90-10 split)
    train_size = int(0.9 * len(df))
    train_df = df[:train_size]
    val_df = df[train_size:]

    print(f"\nPreprocessing complete.")
    print(f"Training samples: {len(train_df)}")
    print(f"Validation samples: {len(val_df)}")

    # Convert back to HuggingFace datasets
    train_dataset = Dataset.from_pandas(train_df[['text']])
    val_dataset = Dataset.from_pandas(val_df[['text']])

    return train_dataset, val_dataset, df

# Preprocess the dataset
train_dataset, val_dataset, processed_df = preprocess_dataset(dataset, sample_size=5000)

# Display preprocessing statistics
print("\nDataset Statistics:")
print(f"Average input length: {processed_df['input'].str.len().mean():.1f} characters")
print(f"Average output length: {processed_df['output'].str.len().mean():.1f} characters")
print(f"Average total length: {processed_df['text'].str.len().mean():.1f} characters")

Starting data preprocessing...
Initial dataset size: 33955
After removing nulls: 33955
After filtering short entries: 33527
After removing duplicates: 33268
Sampled to: 5000 samples

Preprocessing complete.
Training samples: 4500
Validation samples: 500

Dataset Statistics:
Average input length: 93.9 characters
Average output length: 354.5 characters
Average total length: 587.4 characters


### Display Preprocessed Samples

In [5]:
# Display preprocessed samples
print("\nPreprocessed Training Samples:")
for i in range(2):
    print(f"\n{'='*80}")
    print(f"Sample {i+1}:")
    print(f"{'='*80}")
    print(train_dataset[i]['text'][:500] + "..." if len(train_dataset[i]['text']) > 500 else train_dataset[i]['text'])


Preprocessed Training Samples:

Sample 1:
<s>[INST] You are a knowledgeable medical assistant. Answer the following medical question accurately and clearly.

Question: What are the corresponding dermatomes for the following areas of the body: shoulders, thumb, nipple, umbilicus, kneecaps, great toe, and little toe? [/INST]
The corresponding dermatomes for the following areas of the body are: C4 for shoulders, C6 for thumb, T4 for nipple, T10 for umbilicus, L4 for kneecaps, L5 for great toe, and S1 for little toe.</s>

Sample 2:
<s>[INST] You are a knowledgeable medical assistant. Answer the following medical question accurately and clearly.

Question: Even after tolerance develops, what adverse effect of opiates can persist? [/INST]
Miosis is an adverse effect of opiates that can persist even after tolerance develops. Opiates are medications that are commonly used to treat pain, but they can also cause a range of side effects, including miosis, which is the constriction of the pupils.

## Model Selection & Configuration

**Base Model:** TinyLlama-1.1B-Chat-v1.0
- 1.1B parameters - ideal for Colab's free GPU
- Pre-trained on 3 trillion tokens
- Apache 2.0 license

**LoRA (Low-Rank Adaptation):**
- Reduces trainable parameters: 1.1B → ~4.5M at rank 16 (over 99% reduction)
- Config: Rank=16, Alpha=32, Dropout=0.05
- Targets: All attention layers (q_proj, k_proj, v_proj, o_proj)
- 4-bit quantization for memory efficiency
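
As a rough check on the LoRA numbers above, the adapter size can be computed by hand: each adapted linear layer with `d_in` inputs and `d_out` outputs gains `r × (d_in + d_out)` trainable weights. The sketch below assumes TinyLlama-1.1B's published configuration (hidden size 2048, 22 decoder layers, 32 attention heads, 4 key/value heads); these dimensions are not stated in this notebook, and the helper name is hypothetical.

```python
# Assumed TinyLlama-1.1B dimensions (from the model's config, not from this notebook)
HIDDEN_SIZE = 2048
NUM_LAYERS = 22
NUM_HEADS = 32
NUM_KV_HEADS = 4  # grouped-query attention: k_proj/v_proj are narrower than q_proj
HEAD_DIM = HIDDEN_SIZE // NUM_HEADS  # 64
KV_DIM = NUM_KV_HEADS * HEAD_DIM     # 256

# (input features, output features) of each targeted projection
TARGET_SHAPES = {
    "q_proj": (HIDDEN_SIZE, HIDDEN_SIZE),
    "k_proj": (HIDDEN_SIZE, KV_DIM),
    "v_proj": (HIDDEN_SIZE, KV_DIM),
    "o_proj": (HIDDEN_SIZE, HIDDEN_SIZE),
}


def lora_trainable_params(rank: int) -> int:
    """Count LoRA weights: matrix A (rank x d_in) plus matrix B (d_out x rank) per module."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS


for r in (16, 32):
    print(f"rank={r}: {lora_trainable_params(r):,} trainable parameters")
```

Under these assumptions the counts come out to 4,505,600 for rank 16 and 9,011,200 for rank 32, which agrees with the trainable-parameter counts printed in this notebook's experiment logs.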

In [6]:
# Model configuration
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

print(f"Loading base model: {model_name}")
print("This may take a few minutes...\n")

# Configure 4-bit quantization for memory efficiency
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # Enable 4-bit quantization
    bnb_4bit_quant_type="nf4",              # Normal Float 4-bit quantization
    bnb_4bit_compute_dtype=torch.float16,   # Compute in float16 for speed
    bnb_4bit_use_double_quant=True,         # Double quantization for memory
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token  # Set padding token
tokenizer.padding_side = "right"           # Pad on the right side

# Load base model with quantization
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",                     # Automatic device placement
    trust_remote_code=True,
)

# Prepare model for k-bit training
base_model = prepare_model_for_kbit_training(base_model)

print("Base model loaded successfully.")
print(f"\nModel Information:")
print(f"Model name: {model_name}")
print(f"Tokenizer vocabulary size: {len(tokenizer)}")
print(f"Model device: {base_model.device}")

Loading base model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
This may take a few minutes...



Base model loaded successfully.

Model Information:
Model name: TinyLlama/TinyLlama-1.1B-Chat-v1.0
Tokenizer vocabulary size: 32000
Model device: cuda:0


### Test Base Model Before Fine-Tuning

Let's test the base model's performance on medical questions before fine-tuning to establish a baseline.
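
Because the generation prompt has to match the training template exactly, it can help to build it in one place. The following is a minimal sketch, not part of the original notebook: `build_prompt` and `extract_answer` are hypothetical helper names, and the decoded output is a stand-in string rather than a real model generation.

```python
SYSTEM_PROMPT = (
    "You are a knowledgeable medical assistant. "
    "Answer the following medical question accurately and clearly."
)


def build_prompt(question: str) -> str:
    """Build the inference prompt in the same [INST] layout used for training."""
    return f"<s>[INST] {SYSTEM_PROMPT}\n\nQuestion: {question} [/INST]"


def extract_answer(decoded: str) -> str:
    """Keep only the text that follows the final [/INST] marker."""
    return decoded.split("[/INST]")[-1].strip()


prompt = build_prompt("What are the symptoms of diabetes?")
# Stand-in for a decoded model output: the prompt followed by a generated answer
decoded = prompt + " Common symptoms include increased thirst."
print(extract_answer(decoded))
```

Keeping the template in a single function reduces the risk that the training, baseline-test, and evaluation prompts drift apart.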

In [7]:
def test_base_model(model, tokenizer, questions: List[str], max_new_tokens: int = 150):
    """
    Test the base model on sample questions.

    Args:
        model: The model to test
        tokenizer: Tokenizer for the model
        questions: List of test questions
        max_new_tokens: Maximum tokens to generate
    """
    print("Testing Base Model (Before Fine-Tuning)\n")
    print("="*80)

    for i, question in enumerate(questions, 1):
        prompt = f"<s>[INST] You are a knowledgeable medical assistant. Answer the following medical question accurately and clearly.\n\nQuestion: {question} [/INST]"

        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=0.7,
                do_sample=True,
                top_p=0.9,
                pad_token_id=tokenizer.eos_token_id
            )

        response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        # Extract only the generated part
        response = response.split("[/INST]")[-1].strip()

        print(f"\nQuestion {i}: {question}")
        print(f"Base Model Response: {response}")
        print("="*80)

# Test questions
test_questions = [
    "What is hypertension and how is it treated?",
    "What are the symptoms of diabetes?",
    "Explain what antibiotics are used for."
]

test_base_model(base_model, tokenizer, test_questions)

Testing Base Model (Before Fine-Tuning)


Question 1: What is hypertension and how is it treated?
Base Model Response: You are

Question 2: What are the symptoms of diabetes?
Base Model Response: Answer according to: The most common symptoms of type 2 diabetes include blurred vision

Question 3: Explain what antibiotics are used for.
Base Model Response: Answer: Penicillin is commonly used to treat respiratory infections. It is an antibiotic that has broad-spectrum activity against a wide range


## Experiment Tracking & Hyperparameter Tuning

### Experiments Overview:

| Experiment | Learning Rate | Batch Size | LoRA Rank | LoRA Alpha | Epochs | Description |
|------------|--------------|------------|-----------|------------|--------|-------------|
| **Baseline** | 2e-4 | 4 | 16 | 32 | 1 | Standard configuration |
| **Exp 1** | 1e-4 | 4 | 16 | 32 | 1 | Lower learning rate |
| **Exp 2** | 2e-4 | 4 | 32 | 64 | 1 | Higher LoRA capacity |
| **Exp 3** | 2e-4 | 2 | 16 | 32 | 1 | Smaller batch size |
| **Exp 4** | 1.5e-4 | 4 | 16 | 32 | 2 | More epochs |

**Tracking:** Training loss, time, GPU memory, BLEU/ROUGE scores
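
To make the batch-size column concrete: every run in this notebook uses `gradient_accumulation_steps=4`, so the effective batch size is four times the per-device value. The sketch below estimates optimizer steps and warmup steps for the 4,500-example training split. The helper `plan_schedule` is hypothetical, and the exact step count reported by `Trainer` may differ by one depending on how the installed library version rounds the last partial accumulation.

```python
import math


def plan_schedule(num_samples: int, per_device_batch: int, grad_accum: int,
                  epochs: int, warmup_ratio: float) -> dict:
    """Estimate effective batch size, optimizer steps, and warmup steps (single GPU)."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * warmup_ratio),
    }


# Baseline settings versus the smaller-batch experiment
baseline_plan = plan_schedule(4500, per_device_batch=4, grad_accum=4, epochs=1, warmup_ratio=0.05)
small_batch_plan = plan_schedule(4500, per_device_batch=2, grad_accum=4, epochs=1, warmup_ratio=0.05)
print(baseline_plan)
print(small_batch_plan)
```

The computed `warmup_steps` value is one way to replace the `warmup_ratio` argument, which the training logs in this notebook flag as deprecated.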

In [8]:
import time

# Store experiment results
experiment_results = []

def run_experiment(
    exp_name: str,
    learning_rate: float,
    per_device_batch_size: int,
    lora_r: int,
    lora_alpha: int,
    num_epochs: int,
    train_dataset,
    description: str = ""
):
    """
    Run a single fine-tuning experiment with specified hyperparameters.

    Args:
        exp_name: Name of the experiment
        learning_rate: Learning rate for optimizer
        per_device_batch_size: Batch size per device
        lora_r: LoRA rank
        lora_alpha: LoRA alpha parameter
        num_epochs: Number of training epochs
        train_dataset: Training dataset
        description: Description of experiment

    Returns:
        Trained model and training metrics
    """
    print(f"\n{'='*80}")
    print(f"Running {exp_name}")
    print(f"Description: {description}")
    print(f"{'='*80}\n")

    start_time = time.time()

    # Configure LoRA
    peft_config = LoraConfig(
        r=lora_r,                           # LoRA rank
        lora_alpha=lora_alpha,              # LoRA alpha
        lora_dropout=0.05,                  # Dropout probability
        bias="none",                        # Bias training strategy
        task_type="CAUSAL_LM",              # Task type
        target_modules=[                    # Target attention modules
            "q_proj",
            "k_proj",
            "v_proj",
            "o_proj",
        ],
    )

    # Reload base model for each experiment
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True,
    )
    model = prepare_model_for_kbit_training(model)
    model = get_peft_model(model, peft_config)

    # Print trainable parameters
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total_params = sum(p.numel() for p in model.parameters())
    print(f"Trainable parameters: {trainable_params:,} ({100 * trainable_params / total_params:.2f}%)")
    print(f"Total parameters: {total_params:,}\n")

    # Training arguments
    training_args = TrainingArguments(
        output_dir=f"./results/{exp_name}",
        num_train_epochs=num_epochs,
        per_device_train_batch_size=per_device_batch_size,
        gradient_accumulation_steps=4,                  # Effective batch size = 4 * batch_size
        learning_rate=learning_rate,
        fp16=False,                                      # Disable fp16, model is already quantized
        bf16=False,                                      # Disable bf16 for T4 GPU compatibility
        save_strategy="epoch",
        logging_steps=10,
        warmup_ratio=0.05,                              # 5% warmup steps
        lr_scheduler_type="cosine",                     # Cosine learning rate schedule
        optim="paged_adamw_8bit",                       # Memory-efficient optimizer
        report_to="none",                               # Disable external logging
    )

    # Data formatting function for SFTTrainer
    def formatting_func(examples):
        return examples["text"]

    # Initialize trainer with minimal parameters (works with all TRL versions)
    trainer = SFTTrainer(
        model=model,
        train_dataset=train_dataset,
        args=training_args,
        formatting_func=formatting_func,
    )

    # Record GPU memory before training
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()
        memory_before = torch.cuda.memory_allocated() / 1024**3

    # Train the model
    print("Starting training...\n")
    train_result = trainer.train()

    # Elapsed wall-clock time; the timer starts before model loading, so this includes load and setup
    training_time = time.time() - start_time

    # Record GPU memory after training
    if torch.cuda.is_available():
        peak_memory = torch.cuda.max_memory_allocated() / 1024**3
    else:
        peak_memory = 0

    # Extract metrics ('train_loss' is the loss averaged over the whole run, not the last logged step)
    final_loss = train_result.metrics.get('train_loss', 0)

    # Store experiment results
    result = {
        'experiment': exp_name,
        'description': description,
        'learning_rate': learning_rate,
        'batch_size': per_device_batch_size,
        'lora_rank': lora_r,
        'lora_alpha': lora_alpha,
        'epochs': num_epochs,
        'final_loss': final_loss,
        'training_time': training_time,
        'peak_memory_gb': peak_memory,
        'trainable_params': trainable_params,
    }

    print(f"\n{exp_name} completed.")
    print(f"Final training loss: {final_loss:.4f}")
    print(f"Training time: {training_time/60:.2f} minutes")
    print(f"Peak GPU memory: {peak_memory:.2f} GB")

    return model, result

print("Experiment functions defined.")

Experiment functions defined.


### Run Baseline Experiment

Starting with our baseline configuration.

In [9]:
# Baseline run
baseline_model, baseline_result = run_experiment(
    exp_name="Baseline",
    learning_rate=2e-4,
    per_device_batch_size=4,
    lora_r=16,
    lora_alpha=32,
    num_epochs=1,
    train_dataset=train_dataset,
    description="Standard configuration - baseline for comparison"
)

experiment_results.append(baseline_result)

# Save the baseline model
baseline_model.save_pretrained("./medical_assistant_baseline")
print("\nBaseline model saved.")


Running Baseline
Description: Standard configuration - baseline for comparison



warmup_ratio is deprecated and will be removed in v5.2. Use `warmup_steps` instead.


Trainable parameters: 4,505,600 (0.73%)
Total parameters: 620,111,872



The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 2}.


Starting training...



Step,Training Loss
10,1.74291
20,1.317478
30,1.031986
40,0.957098
50,0.90983
60,0.90405
70,0.882436
80,0.865102
90,0.884217
100,0.857456



Baseline completed.
Final training loss: 0.9228
Training time: 17.15 minutes
Peak GPU memory: 3.02 GB

Baseline model saved.


### Run Additional Experiments

**Note:** Running all five experiments sequentially took about 1 hour 45 minutes on the Tesla T4 session logged here; other Colab sessions may need 2-4 hours. The baseline is followed by four additional experiments to evaluate the impact of each hyperparameter change.
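
Peak GPU memory in this notebook's logs climbs from 3.02 GB to 6.93 GB across runs, even between the baseline and the lower-learning-rate run, which share the same model configuration. One plausible explanation, which the notebook does not confirm, is that models from earlier cells stay allocated on the GPU. Below is a minimal sketch of a cleanup helper that could be called between runs after deleting references that are no longer needed. The helper name is hypothetical, and `torch` is imported defensively so the snippet also runs on a machine without PyTorch.

```python
import gc

try:
    import torch  # optional here so the sketch degrades gracefully without PyTorch
except ImportError:
    torch = None


def release_cuda_cache() -> float:
    """Run garbage collection, empty the CUDA cache, and return allocated GiB."""
    gc.collect()  # drop unreachable Python objects that may still hold tensors
    if torch is None or not torch.cuda.is_available():
        return 0.0
    torch.cuda.empty_cache()              # hand cached blocks back to the driver
    torch.cuda.reset_peak_memory_stats()  # start the next peak measurement fresh
    return torch.cuda.memory_allocated() / 1024**3


# Typical use between experiments: `del` any model that is no longer needed, then:
allocated_gb = release_cuda_cache()
print(f"Allocated after cleanup: {allocated_gb:.2f} GB")
```

Memory is only reclaimed for objects with no remaining references, so any model that later cells still use must be kept.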

In [10]:
# Exp 1: Lower Learning Rate
exp2_model, exp2_result = run_experiment(
    exp_name="Experiment_1_Lower_LR",
    learning_rate=1e-4,
    per_device_batch_size=4,
    lora_r=16,
    lora_alpha=32,
    num_epochs=1,
    train_dataset=train_dataset,
    description="Lower learning rate for more stable training"
)

experiment_results.append(exp2_result)


Running Experiment_1_Lower_LR
Description: Lower learning rate for more stable training



warmup_ratio is deprecated and will be removed in v5.2. Use `warmup_steps` instead.


Trainable parameters: 4,505,600 (0.73%)
Total parameters: 620,111,872



The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 2}.


Starting training...



Step,Training Loss
10,1.777209
20,1.550048
30,1.183974
40,1.029367
50,0.950616
60,0.92851
70,0.905366
80,0.887945
90,0.90482
100,0.878323



Experiment_1_Lower_LR completed.
Final training loss: 0.9559
Training time: 17.18 minutes
Peak GPU memory: 4.00 GB


In [11]:
# Exp 2: Higher LoRA Capacity
exp3_model, exp3_result = run_experiment(
    exp_name="Experiment_2_Higher_LoRA",
    learning_rate=2e-4,
    per_device_batch_size=4,
    lora_r=32,
    lora_alpha=64,
    num_epochs=1,
    train_dataset=train_dataset,
    description="Increased LoRA rank for higher model capacity"
)

experiment_results.append(exp3_result)


Running Experiment_2_Higher_LoRA
Description: Increased LoRA rank for higher model capacity



warmup_ratio is deprecated and will be removed in v5.2. Use `warmup_steps` instead.


Trainable parameters: 9,011,200 (1.44%)
Total parameters: 624,617,472



The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 2}.


Starting training...



Step,Training Loss
10,1.688439
20,1.170861
30,0.984428
40,0.936168
50,0.896764
60,0.891011
70,0.871133
80,0.854192
90,0.876602
100,0.849406



Experiment_2_Higher_LoRA completed.
Final training loss: 0.9056
Training time: 17.28 minutes
Peak GPU memory: 4.99 GB


In [12]:
# Exp 3: Smaller Batch Size
exp4_model, exp4_result = run_experiment(
    exp_name="Experiment_3_Smaller_Batch",
    learning_rate=2e-4,
    per_device_batch_size=2,
    lora_r=16,
    lora_alpha=32,
    num_epochs=1,
    train_dataset=train_dataset,
    description="Smaller batch size for potentially better convergence"
)

experiment_results.append(exp4_result)


Running Experiment_3_Smaller_Batch
Description: Smaller batch size for potentially better convergence



warmup_ratio is deprecated and will be removed in v5.2. Use `warmup_steps` instead.


Trainable parameters: 4,505,600 (0.73%)
Total parameters: 620,111,872



The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 2}.


Starting training...



Step,Training Loss
10,1.858352
20,1.470704
30,1.146035
40,0.966935
50,0.935601
60,0.944697
70,0.928684
80,0.900873
90,0.866455
100,0.903819



Experiment_3_Smaller_Batch completed.
Final training loss: 0.8970
Training time: 18.94 minutes
Peak GPU memory: 5.43 GB


In [13]:
# Exp 4: More Epochs
exp5_model, exp5_result = run_experiment(
    exp_name="Experiment_4_More_Epochs",
    learning_rate=1.5e-4,
    per_device_batch_size=4,
    lora_r=16,
    lora_alpha=32,
    num_epochs=2,
    train_dataset=train_dataset,
    description="More training epochs with moderate learning rate"
)

experiment_results.append(exp5_result)


Running Experiment_4_More_Epochs
Description: More training epochs with moderate learning rate



warmup_ratio is deprecated and will be removed in v5.2. Use `warmup_steps` instead.


Trainable parameters: 4,505,600 (0.73%)
Total parameters: 620,111,872



The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 2}.


Starting training...



Step,Training Loss
10,1.785026
20,1.62351
30,1.232404
40,1.020998
50,0.935942
60,0.919003
70,0.895625
80,0.877327
90,0.893817
100,0.866462



Experiment_4_More_Epochs completed.
Final training loss: 0.8873
Training time: 34.12 minutes
Peak GPU memory: 6.93 GB


### Experiment Results Summary

Comprehensive comparison of all experiments.

### Download Models and Results

To easily download the fine-tuned models and experiment results to your local machine, we will first compress the model directories into zip files and ensure the CSV is accessible. You can then download these files directly from the Colab file browser.

In [20]:
print('Compressing baseline model...')
!zip -r medical_assistant_baseline.zip ./medical_assistant_baseline

print('Compressing final best model...')
!zip -r medical_assistant_final.zip ./medical_assistant_final

# experiment_results.csv is already in the working directory, so no copy step is needed

print('\nFiles prepared for download:')
!ls -lh *.zip *.csv

Compressing baseline model...
  adding: medical_assistant_baseline/ (stored 0%)
  adding: medical_assistant_baseline/adapter_model.safetensors (deflated 22%)
  adding: medical_assistant_baseline/adapter_config.json (deflated 57%)
  adding: medical_assistant_baseline/README.md (deflated 65%)
Compressing final best model...
  adding: medical_assistant_final/ (stored 0%)
  adding: medical_assistant_final/adapter_model.safetensors (deflated 22%)
  adding: medical_assistant_final/chat_template.jinja (deflated 60%)
  adding: medical_assistant_final/adapter_config.json (deflated 57%)
  adding: medical_assistant_final/tokenizer.json (deflated 85%)
  adding: medical_assistant_final/tokenizer_config.json (deflated 46%)
  adding: medical_assistant_final/README.md (deflated 65%)

Files prepared for download:
-rw-r--r-- 1 root root  889 Feb 17 15:12 experiment_results.csv
-rw-r--r-- 1 roo

Once the above cell finishes execution, you will see three files in the left-hand Colab file browser pane (the folder icon):

*   `medical_assistant_baseline.zip`
*   `medical_assistant_final.zip`
*   `experiment_results.csv`

Right-click on each file and select 'Download' to save them to your local machine. You can then open them in your VS Code environment.
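
The same archives can also be built from Python instead of shell commands, which may be convenient outside Colab. This is a minimal sketch using the standard library's `shutil.make_archive`; the helper name `zip_directory` is hypothetical, and the demo uses a temporary stand-in directory rather than the real adapter folders.

```python
import shutil
import tempfile
from pathlib import Path


def zip_directory(directory: str) -> str:
    """Zip `directory` next to itself and return the archive path."""
    path = Path(directory).resolve()
    # root_dir/base_dir keep the top-level folder name inside the archive,
    # mirroring `zip -r name.zip ./name`
    return shutil.make_archive(str(path), "zip", root_dir=str(path.parent), base_dir=path.name)


# Demo on a throwaway directory standing in for ./medical_assistant_baseline
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "medical_assistant_demo"
    demo_dir.mkdir()
    (demo_dir / "adapter_config.json").write_text("{}")
    archive_path = zip_directory(str(demo_dir))
    archive_exists = Path(archive_path).exists()
    print(archive_path, archive_exists)
```

In the notebook itself, the call would be `zip_directory("./medical_assistant_baseline")`, assuming that directory exists in the working directory.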

In [14]:
# Create results dataframe
results_df = pd.DataFrame(experiment_results)

print("\n" + "="*100)
print("EXPERIMENT RESULTS SUMMARY")
print("="*100 + "\n")

# Display full results table
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
print(results_df.to_string(index=False))

# Find best experiment
best_exp = results_df.loc[results_df['final_loss'].idxmin()]
print(f"\nBest Experiment: {best_exp['experiment']}")
print(f"   Final Loss: {best_exp['final_loss']:.4f}")
print(f"   Training Time: {best_exp['training_time']/60:.2f} minutes")

# Calculate improvement from baseline
if len(experiment_results) > 1:
    baseline_loss = experiment_results[0]['final_loss']
    best_loss = best_exp['final_loss']
    improvement = ((baseline_loss - best_loss) / baseline_loss) * 100
    print(f"   Improvement over baseline: {improvement:.2f}%")


EXPERIMENT RESULTS SUMMARY

                experiment                                           description  learning_rate  batch_size  lora_rank  lora_alpha  epochs  final_loss  training_time  peak_memory_gb  trainable_params
                  Baseline      Standard configuration - baseline for comparison        0.00020           4         16          32       1    0.922842    1028.966865        3.020897           4505600
     Experiment_1_Lower_LR          Lower learning rate for more stable training        0.00010           4         16          32       1    0.955932    1030.850192        3.996671           4505600
  Experiment_2_Higher_LoRA         Increased LoRA rank for higher model capacity        0.00020           4         32          64       1    0.905606    1036.703660        4.993597           9011200
Experiment_3_Smaller_Batch Smaller batch size for potentially better convergence        0.00020           2         16          32       1    0.896995    1136.194728      
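
The `trainable_params` column doubles when `lora_rank` goes from 16 to 32 because each LoRA-adapted weight matrix with input width `d_in` and output width `d_out` adds `rank * (d_in + d_out)` parameters. The sketch below reproduces the two table values under an assumed architecture: hidden size 2048, key/value projection width 256, 22 layers, and adapters on the q/k/v/o projections. Those shapes and the helper name `lora_param_count` are assumptions for illustration and are not confirmed by this section.

```python
# Hypothetical helper: counts LoRA parameters for a set of adapted linear layers.
def lora_param_count(rank, layer_shapes, num_layers):
    """Each adapted (d_in, d_out) matrix adds A (rank x d_in) and B (d_out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in layer_shapes)
    return per_layer * num_layers

# Assumed (d_in, d_out) shapes for q_proj, k_proj, v_proj, o_proj.
ASSUMED_SHAPES = [(2048, 2048), (2048, 256), (2048, 256), (2048, 2048)]
ASSUMED_LAYERS = 22

for rank in (16, 32):
    print(rank, lora_param_count(rank, ASSUMED_SHAPES, ASSUMED_LAYERS))
```

Under these assumed shapes the counts come out to 4505600 and 9011200, matching the table, and the formula makes clear that the count grows linearly with the rank.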

## Model Evaluation & Performance Metrics

**Quantitative Metrics:**
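- **BLEU Score** - N-gram overlap between the generated answer and the reference
- **ROUGE-1/2/L** - Unigram, bigram, and longest-common-subsequence overlap
- **Perplexity** - Exponential of the average validation loss (lower is better)
- **Similarity Score** - Share of reference words that also appear in the prediction
- **Length Accuracy** - Ratio of the shorter to the longer response length, in words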

**Qualitative Analysis:** Response relevance, factual accuracy, domain knowledge
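
As a concrete reading of the perplexity metric, the sketch below computes it from per-token negative log-likelihoods in two ways: weighting every token equally, and averaging each sample's mean loss first. The loss values are made up and the function names are illustrative. The `calculate_perplexity` method in this notebook follows the second form, since it averages one mean loss per sample.

```python
import math

def perplexity_token_weighted(sample_losses):
    """sample_losses: list of lists of per-token negative log-likelihoods (nats)."""
    total = sum(sum(losses) for losses in sample_losses)
    n_tokens = sum(len(losses) for losses in sample_losses)
    return math.exp(total / n_tokens)

def perplexity_mean_of_means(sample_losses):
    """Averages each sample's mean loss first, so every sample weighs the same."""
    means = [sum(losses) / len(losses) for losses in sample_losses]
    return math.exp(sum(means) / len(means))

# Made-up losses: one short, easy sample and one longer, harder sample.
toy = [[0.5, 0.5], [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]]
print(perplexity_token_weighted(toy))   # exp(7 / 8)
print(perplexity_mean_of_means(toy))    # exp(0.75)
```

The two variants agree when all samples have the same length; they diverge when short and long samples have different average losses.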

In [15]:
class ModelEvaluator:
    """
    Comprehensive model evaluation class.
    """

    def __init__(self, model, tokenizer, val_dataset):
        self.model = model
        self.tokenizer = tokenizer
        self.val_dataset = val_dataset

    def generate_response(self, question: str, max_new_tokens: int = 150) -> str:
        """
        Generate response for a given question.
        """
        prompt = f"<s>[INST] You are a knowledgeable medical assistant. Answer the following medical question accurately and clearly.\n\nQuestion: {question} [/INST]"

        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)

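        # Run generation in eval mode so dropout is disabled; the `use_cache` warning in the
        # evaluation output suggests the model was otherwise still in training mode.
        self.model.eval()
        with torch.no_grad():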
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=0.7,
                do_sample=True,
                top_p=0.9,
                pad_token_id=self.tokenizer.eos_token_id
            )

        response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        response = response.split("[/INST]")[-1].strip()

        return response

    def calculate_similarity_score(self, predictions: List[str], references: List[str]) -> float:
        """
        Calculate simple word overlap similarity score (0-100).
        """
        scores = []
        for pred, ref in zip(predictions, references):
            pred_words = set(pred.lower().split())
            ref_words = set(ref.lower().split())
            if len(ref_words) > 0:
                overlap = len(pred_words & ref_words) / len(ref_words)
                scores.append(overlap * 100)
        return np.mean(scores) if scores else 0

    def calculate_length_accuracy(self, predictions: List[str], references: List[str]) -> float:
        """
        Calculate how close prediction lengths are to reference lengths.
        """
        ratios = []
        for pred, ref in zip(predictions, references):
            pred_len = len(pred.split())
            ref_len = len(ref.split())
            if ref_len > 0:
                ratio = min(pred_len, ref_len) / max(pred_len, ref_len)
                ratios.append(ratio)
        return np.mean(ratios) if ratios else 0

    def calculate_bleu_score(self, predictions: List[str], references: List[str]) -> float:
        """
        Calculate corpus-level BLEU (n-gram overlap with the reference answers) as a percentage.
        """
        # Format for BLEU metric: predictions as list of strings, references as list of lists
        formatted_refs = [[ref] for ref in references]
        result = bleu_metric.compute(predictions=predictions, references=formatted_refs)
        return result['bleu'] * 100  # Convert to percentage

    def calculate_rouge_scores(self, predictions: List[str], references: List[str]) -> Dict[str, float]:
        """
        Calculate ROUGE-1/2/L overlap with the reference answers as percentages.
        """
        result = rouge_metric.compute(predictions=predictions, references=references)
        return {
            'rouge1': result['rouge1'] * 100,
            'rouge2': result['rouge2'] * 100,
            'rougeL': result['rougeL'] * 100,
        }

    def calculate_perplexity(self, num_samples: int = 100) -> float:
        """
        Calculate perplexity on validation set.
        """
        total_loss = 0
        count = 0

        sample_indices = np.random.choice(len(self.val_dataset), min(num_samples, len(self.val_dataset)), replace=False)

        self.model.eval()
        for idx in sample_indices:
            text = self.val_dataset[int(idx)]['text']
            inputs = self.tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(self.model.device)

            with torch.no_grad():
                outputs = self.model(**inputs, labels=inputs["input_ids"])
                total_loss += outputs.loss.item()
                count += 1

        # Mean of per-sample mean losses: each sample weighs the same regardless of its length
        avg_loss = total_loss / count if count > 0 else 0
        perplexity = float(np.exp(avg_loss))
        return perplexity

    def evaluate_on_validation_set(self, num_samples: int = 50):
        """
        Evaluate model on validation set.
        """
        print(f"Evaluating model on {num_samples} validation samples...\n")

        predictions = []
        references = []

        # Sample from validation dataset
        sample_indices = np.random.choice(len(self.val_dataset), min(num_samples, len(self.val_dataset)), replace=False)

        for idx in sample_indices:
            text = self.val_dataset[int(idx)]['text']

            # Extract question and reference answer; skip samples that do not follow the template
            prompt_part, sep, answer_part = text.partition("[/INST]")
            if not sep or "Question:" not in prompt_part:
                continue
            question = prompt_part.split("Question:", 1)[1].strip()
            reference = answer_part.replace("</s>", "").strip()

            # Generate prediction
            prediction = self.generate_response(question)

            predictions.append(prediction)
            references.append(reference)

        # Calculate metrics
        similarity = self.calculate_similarity_score(predictions, references)
        length_acc = self.calculate_length_accuracy(predictions, references)
        bleu = self.calculate_bleu_score(predictions, references)
        rouge_scores = self.calculate_rouge_scores(predictions, references)
        perplexity = self.calculate_perplexity(num_samples=num_samples)

        results = {
            'similarity_score': similarity,
            'length_accuracy': length_acc * 100,
            'bleu_score': bleu,
            'rouge1': rouge_scores['rouge1'],
            'rouge2': rouge_scores['rouge2'],
            'rougeL': rouge_scores['rougeL'],
            'perplexity': perplexity,
            'avg_response_length': np.mean([len(p.split()) for p in predictions]),
        }

        return results, predictions[:5], references[:5]  # Return sample for display

print("Evaluator class defined.")

Evaluator class defined.
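
To make the ROUGE-L number concrete, here is a simplified sketch of the longest-common-subsequence F-measure behind it. It uses plain whitespace tokenization and lowercasing only, so it is not the scorer that `rouge_metric` calls and its values can differ slightly. The function names are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(prediction, reference):
    """LCS-based F1 between a prediction and a reference, in the range 0..1."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("the cat sat on the mat", "the cat is on the mat"))
```

In the example the common subsequence is "the cat on the mat" (5 of 6 tokens on each side), so precision, recall, and F1 are all 5/6.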


### Evaluate Best Model

In [16]:
# Select the model to evaluate. The baseline is used here; swap in the best run from the experiments if preferred.
best_model = baseline_model

# Create evaluator
evaluator = ModelEvaluator(best_model, tokenizer, val_dataset)

# Run evaluation
eval_results, sample_preds, sample_refs = evaluator.evaluate_on_validation_set(num_samples=50)

print("\n" + "="*80)
print("EVALUATION RESULTS (Fine-Tuned Model)")
print("="*80)
print(f"\nQuantitative Metrics:")
print(f"   BLEU Score:           {eval_results['bleu_score']:.2f}%")
print(f"   ROUGE-1:              {eval_results['rouge1']:.2f}%")
print(f"   ROUGE-2:              {eval_results['rouge2']:.2f}%")
print(f"   ROUGE-L:              {eval_results['rougeL']:.2f}%")
print(f"   Perplexity:           {eval_results['perplexity']:.2f}")
print(f"   Similarity Score:     {eval_results['similarity_score']:.2f}%")
print(f"   Length Accuracy:      {eval_results['length_accuracy']:.2f}%")
print(f"   Avg Response Length:  {eval_results['avg_response_length']:.1f} words")

print("\nSample Predictions vs References:\n")
for i, (pred, ref) in enumerate(zip(sample_preds[:3], sample_refs[:3]), 1):
    print(f"Sample {i}:")
    print(f"  Generated: {pred[:200]}..." if len(pred) > 200 else f"  Generated: {pred}")
    print(f"  Reference: {ref[:200]}..." if len(ref) > 200 else f"  Reference: {ref}")
    print()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Evaluating model on 50 validation samples...


EVALUATION RESULTS (Fine-Tuned Model)

Quantitative Metrics:
   BLEU Score:           0.00%
   ROUGE-1:              0.00%
   ROUGE-2:              0.00%
   ROUGE-L:              0.00%
   Perplexity:           2.42
   Similarity Score:     0.00%
   Length Accuracy:      35.53%
   Avg Response Length:  149.0 words

Sample Predictions vs References:

Sample 1:
  Generated: [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ ...
  Reference: The likely diagnosis for the woman's condition is Sjögren's syndrome.

Sample 2:
  Generated: [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ [ ...
  Reference: Apoptosis results in the formation of apoptotic bod
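
The zero overlap scores follow directly from the generated text: an output made of one repeated token shares no words with the reference. A small check like the one below can flag such outputs before any metric is computed. The helper names and the 0.1 threshold are illustrative assumptions, not part of the notebook.

```python
def distinct_token_ratio(text):
    """Fraction of whitespace-separated tokens that are unique (0.0 for empty text)."""
    tokens = text.split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def looks_degenerate(text, threshold=0.1):
    """Flag empty output or output dominated by a few repeated tokens."""
    return distinct_token_ratio(text) < threshold

print(looks_degenerate("[ " * 149))
print(looks_degenerate("The likely diagnosis is Sjögren's syndrome."))
```

Counting how many of the sampled predictions are flagged gives a quick signal that the generation setup, rather than the metrics, needs attention.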

### Compare Base Model vs Fine-Tuned Model

Direct comparison to demonstrate the value of fine-tuning.

In [17]:
# Comparison questions
comparison_questions = [
    "What is hypertension and how is it treated?",
    "What are the common symptoms of Type 2 diabetes?",
    "Explain the difference between bacteria and viruses.",
    "What should I do if I have a fever?",
]

print("\n" + "="*100)
print("BASE MODEL vs FINE-TUNED MODEL COMPARISON")
print("="*100 + "\n")

for i, question in enumerate(comparison_questions, 1):
    print(f"\n{'─'*100}")
    print(f"Question {i}: {question}")
    print(f"{'─'*100}")

    # Base model response
    base_prompt = f"<s>[INST] You are a knowledgeable medical assistant. Answer the following medical question accurately and clearly.\n\nQuestion: {question} [/INST]"
    base_inputs = tokenizer(base_prompt, return_tensors="pt").to(base_model.device)

    with torch.no_grad():
        base_outputs = base_model.generate(
            **base_inputs,
            max_new_tokens=150,
            temperature=0.7,
            do_sample=True,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id
        )

    base_response = tokenizer.decode(base_outputs[0], skip_special_tokens=True)
    base_response = base_response.split("[/INST]")[-1].strip()

    # Fine-tuned model response
    finetuned_response = evaluator.generate_response(question)

    print(f"\nBASE MODEL (No Fine-Tuning):\n{base_response}\n")
    print(f"FINE-TUNED MODEL (After Training):\n{finetuned_response}\n")

print("\n" + "="*100)
print("ANALYSIS: The fine-tuned model provides more medically accurate, detailed,")
print("and domain-specific responses compared to the base model.")
print("="*100)


BASE MODEL vs FINE-TUNED MODEL COMPARISON


────────────────────────────────────────────────────────────────────────────────────────────────────
Question 1: What is hypertension and how is it treated?
────────────────────────────────────────────────────────────────────────────────────────────────────

BASE MODEL (No Fine-Tuning):
Answer: Hypertension is a condition in which your blood pressure is higher than normal. Treatment for hypertension may include medication, lifestyle changes, or both. Medications include blood pressure pills, calcium channel blockers, and ACE inhibitors. Lifestyle changes include quitting smoking, losing weight, and maintaining a healthy diet. Regular exercise and stress reduction can also help lower blood pressure.

FINE-TUNED MODEL (After Training):
Hypertension is a medical condition that occurs when the blood pressure is higher than normal. Treatment for hypertension typically involves reducing blood pressure through medication or lifestyle changes, such 

## Gradio User Interface

Creating an intuitive web interface for interacting with the Medical Assistant.

In [18]:
def create_medical_assistant_interface(model, tokenizer):
    """
    Create Gradio interface for the Medical Assistant.

    Args:
        model: Fine-tuned model
        tokenizer: Model tokenizer

    Returns:
        Gradio interface object
    """

    def respond(message, max_tokens=200, temperature=0.7, top_p=0.9):
        """
        Generate response to user query.

        Args:
            message: User's medical question
            max_tokens: Maximum tokens to generate
            temperature: Sampling temperature
            top_p: Nucleus sampling parameter

        Returns:
            Generated response
        """
        if not message.strip():
            return "Please enter a medical question."

        prompt = f"<s>[INST] You are a knowledgeable medical assistant. Answer the following medical question accurately and clearly.\n\nQuestion: {message} [/INST]"

        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=int(max_tokens),
                temperature=float(temperature),
                do_sample=True,
                top_p=float(top_p),
                pad_token_id=tokenizer.eos_token_id
            )

        response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        response = response.split("[/INST]")[-1].strip()

        return response

    # Sample questions for quick testing
    examples = [
        ["What is hypertension and how is it treated?"],
        ["What are the symptoms of diabetes?"],
        ["Explain what antibiotics are used for."],
        ["What is the difference between a virus and bacteria?"],
        ["What causes high cholesterol?"],
    ]

    # Create Gradio interface
    interface = gr.Interface(
        fn=respond,
        inputs=[
            gr.Textbox(
                label="Your Medical Question",
                placeholder="Ask me anything about health and medicine...",
                lines=3
            ),
            gr.Slider(
                minimum=50,
                maximum=500,
                value=200,
                step=10,
                label="Max Response Length"
            ),
            gr.Slider(
                minimum=0.1,
                maximum=1.0,
                value=0.7,
                step=0.1,
                label="Temperature (Creativity)"
            ),
            gr.Slider(
                minimum=0.1,
                maximum=1.0,
                value=0.9,
                step=0.1,
                label="Top-p (Nucleus Sampling)"
            ),
        ],
        outputs=gr.Textbox(
            label="Medical Assistant Response",
            lines=10
        ),
        title="Medical Assistant - Fine-Tuned LLM",
        description="""
        This is an AI-powered Medical Assistant fine-tuned on medical knowledge.
        Ask questions about symptoms, treatments, medical terminology, and general health information.

        IMPORTANT DISCLAIMER: This is an educational tool and NOT a substitute for professional medical advice.
        Always consult qualified healthcare professionals for medical decisions.
        """,
        examples=examples,
        theme=gr.themes.Soft(),
        analytics_enabled=False,
    )

    return interface

# Create the interface
demo = create_medical_assistant_interface(best_model, tokenizer)

print("Gradio interface created successfully.")
print("Launching interface...\n")

# Launch the interface
demo.launch(share=True, debug=False)

Gradio interface created successfully.
Launching interface...

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://b2ba53ab13e2d58a88.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
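
The same instruction prompt is built inline in `generate_response`, in the comparison cell, and in `respond`. One way to keep the three in sync is a shared pair of helpers, sketched below. The names `build_prompt` and `extract_answer` are hypothetical; the template text is copied from the cells above.

```python
SYSTEM_TEXT = (
    "You are a knowledgeable medical assistant. Answer the following "
    "medical question accurately and clearly."
)

def build_prompt(question):
    """Wrap a user question in the instruction template used throughout the notebook."""
    return f"<s>[INST] {SYSTEM_TEXT}\n\nQuestion: {question} [/INST]"

def extract_answer(decoded):
    """Keep only the text after the last [/INST] marker of a decoded generation."""
    return decoded.split("[/INST]")[-1].strip()

prompt = build_prompt("What should I do if I have a fever?")
print(extract_answer(prompt + " Rest and drink fluids."))
```

Each call site would then pass `build_prompt(question)` to the tokenizer and `extract_answer(...)` to the decoded output, so a change to the template happens in one place.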




## Save Final Model

Save the best fine-tuned model for future use.

In [19]:
# Save the fine-tuned model
output_dir = "./medical_assistant_final"
best_model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

print(f"Model saved to: {output_dir}")
print("\nModel files:")
!ls -lh {output_dir}

# Save experiment results
results_df.to_csv("experiment_results.csv", index=False)
print("\nExperiment results saved to: experiment_results.csv")

Model saved to: ./medical_assistant_final

Model files:
total 13M
-rw-r--r-- 1 root root 1016 Feb 17 15:12 adapter_config.json
-rw-r--r-- 1 root root 8.7M Feb 17 15:12 adapter_model.safetensors
-rw-r--r-- 1 root root  410 Feb 17 15:12 chat_template.jinja
-rw-r--r-- 1 root root 5.2K Feb 17 15:12 README.md
-rw-r--r-- 1 root root  363 Feb 17 15:12 tokenizer_config.json
-rw-r--r-- 1 root root 3.5M Feb 17 15:12 tokenizer.json

Experiment results saved to: experiment_results.csv
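
The checkpoint is small because `save_pretrained` on a PEFT model writes only the adapter weights, not the base model. As a rough consistency check, the sketch below estimates the adapter size from the 4505600 trainable parameters reported for the baseline in the results table. Storage at 2 bytes per parameter (a 16-bit dtype) is an assumption, and `estimate_size_mib` is an illustrative helper.

```python
def estimate_size_mib(num_params, bytes_per_param):
    """Raw tensor size in MiB, ignoring file headers and metadata."""
    return num_params * bytes_per_param / (1024 ** 2)

# 4505600 comes from the results table; 2 bytes per parameter is an assumption.
print(round(estimate_size_mib(4_505_600, 2), 2))  # 8.59
```

That estimate is in the same range as the 8.7M listed for `adapter_model.safetensors`; at 4 bytes per parameter it would be roughly twice as large.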
