In [1]:
%pip install -q -U transformers

Note: you may need to restart the kernel to use updated packages.


In [2]:
%pip install -q -U accelerate
%pip install -q -U datasets
%pip install -q -U peft
%pip install -q -i https://pypi.org/simple/ bitsandbytes
%pip install -q -U trl

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


### Explanation of Key Libraries  

- **`transformers`**: Provides a framework to handle **pre-trained NLP models** for tasks like **text classification** and **question answering**.  

- **`accelerate`**: A distributed training library by Hugging Face designed for **parallelizing training** across multiple **GPUs or CPUs**.  

- **`peft`**: A library for **parameter-efficient fine-tuning (PEFT)** of pre-trained language models, including support for **LoRA (Low-Rank Adaptation)**.  

- **`trl`**: A Hugging Face library for training **transformer models** with **supervised fine-tuning** or **reinforcement learning techniques**.  


In [1]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0" # Use the first GPU
os.environ["TOKENIZERS_PARALLELISM"] = "false" # Disable tokenization parallelism

## Suppressing Warnings

During training, several warnings may appear that do not impact the fine-tuning process but can be distracting. To suppress them:

In [2]:
import warnings
warnings.filterwarnings("ignore")

## Importing Necessary Libraries

The following Python libraries are required for running the fine-tuning process:

In [None]:
import numpy as np
import pandas as pd
import os
from tqdm import tqdm

import torch
import torch.nn as nn

import transformers
from transformers import (AutoModelForCausalLM,
                          AutoTokenizer,
                          BitsAndBytesConfig,
                          TrainingArguments,
                          pipeline,
                          logging)

from transformers.models.gemma3 import Gemma3ForCausalLM

from datasets import Dataset
from peft import LoraConfig, PeftConfig, PeftModel
from trl import SFTTrainer, SFTConfig
import bitsandbytes as bnb

from sklearn.metrics import (accuracy_score,
                             classification_report,
                             confusion_matrix)

from sklearn.model_selection import train_test_split

To check the installed version of the transformers library:

In [4]:
print(f"transformers=={transformers.__version__}")

transformers==4.51.3


This function determines the best computing device for running the tutorial:

In [5]:
def define_device():
    """Determine and return the optimal PyTorch device based on availability."""

    print(f"PyTorch version: {torch.__version__}", end=" -- ")

    # Check if MPS (Metal Performance Shaders) is available for macOS
    if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        print("using MPS device on macOS")
        return torch.device("mps")

    # Check for CUDA availability
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"using {device}")
    return device

#### If you are using colab

In [None]:
# from google.colab import userdata
# HF_TOKEN = userdata.get('HF_TOKEN')

'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'

### To run please create you hugging face API and store it to HF_TOKEN

In [None]:
from dotenv import load_dotenv
load_dotenv()
HF_TOKEN = os.getenv("HUGGINGFACE_TOKEN")

# Gemma

In [8]:
# Determine optimal computation dtype based on GPU capability
# compute_dtype = torch.bfloat16 if torch.cuda.get_device_capability()[0] >= 8 else torch.float16
compute_dtype = torch.float16
print(f"Using compute dtype {compute_dtype}")

# Select the best available device (CPU, CUDA, or MPS)
device = define_device()
print(f"Operating on {device}")

# Path to the pre-trained model
GEMMA_PATH = "google/gemma-3-1b-it"

# Load the model with optimized settings
model = Gemma3ForCausalLM.from_pretrained(
    GEMMA_PATH,
    torch_dtype=compute_dtype,
    attn_implementation="eager",
    low_cpu_mem_usage=True,
    device_map=device
)

# Define maximum sequence length for the tokenizer
max_seq_length = 8192

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    GEMMA_PATH,
    max_seq_length=max_seq_length,
    device_map=device

)

Using compute dtype torch.float16
PyTorch version: 2.4.1+cu121 -- using cuda
Operating on cuda


Before proceeding, let's ensures that the entire model is correctly moved to the GPU.


In [9]:
is_on_gpu = all(param.device.type == 'cuda' for param in model.parameters())
print("Model is on GPU:", is_on_gpu)

Model is on GPU: True


In [10]:
from datasets import load_dataset

def load_healthcare_dataset():
    # Load the medical cases classification dataset from Hugging Face
    dataset = load_dataset("hpe-ai/medical-cases-classification-tutorial")

    # Access the splits
    train_dataset = dataset["train"]
    eval_dataset = dataset["validation"]
    test_dataset = dataset["test"]

    return train_dataset, eval_dataset, test_dataset

# Load the datasets
train_dataset, eval_dataset, test_dataset = load_healthcare_dataset()

# Print some information about the datasets to verify
print(f"Train dataset size: {len(train_dataset)}")
print(f"Validation dataset size: {len(eval_dataset)}")
print(f"Test dataset size: {len(test_dataset)}")
print(f"Features: {train_dataset.features}")

Using the latest cached version of the dataset since hpe-ai/medical-cases-classification-tutorial couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /home/prince.kumar/.cache/huggingface/datasets/hpe-ai___medical-cases-classification-tutorial/default/0.0.0/a63f962a9f692e2ce048b2f3ce826ea219956870 (last modified on Sat Apr 12 00:32:15 2025).


Train dataset size: 1724
Validation dataset size: 370
Test dataset size: 370
Features: {'description': Value(dtype='string', id=None), 'transcription': Value(dtype='string', id=None), 'sample_name': Value(dtype='string', id=None), 'medical_specialty': Value(dtype='string', id=None), 'keywords': Value(dtype='string', id=None)}


In [11]:
# Check the unique medical specialties
unique_classes = train_dataset.unique("medical_specialty")
print(f"Number of unique specialties (classes): {len(unique_classes)}")
print("Classes:\n", unique_classes)


Number of unique specialties (classes): 13
Classes:
 ['Cardiovascular / Pulmonary', 'Orthopedic', 'Nephrology', 'Obstetrics / Gynecology', 'Ophthalmology', 'Gastroenterology', 'Neurology', 'ENT - Otolaryngology', 'Psychiatry / Psychology', 'Pediatrics - Neonatal', 'Neurosurgery', 'Hematology - Oncology', 'Radiology']


In [12]:
VALID_SPECIALTIES = [
    "Cardiovascular / Pulmonary", "Orthopedic", "Nephrology", "Obstetrics / Gynecology",
    "Ophthalmology", "Gastroenterology", "Neurology", "ENT - Otolaryngology",
    "Psychiatry / Psychology", "Pediatrics - Neonatal", "Neurosurgery",
    "Hematology - Oncology", "Radiology"
]

# Join for the prompt
SPECIALTY_LIST = ", ".join(s for s in VALID_SPECIALTIES)

## Creating prompts and Training data for Zero-shot, few-shot and Finetuning (Gemma)

In [None]:
def generate_test_prompt(data_point):
    instruction = f"""You are a medical expert. Classify the medical case description into one of the following specialties: {SPECIALTY_LIST}

Description:
\"\"\"{data_point["description"]}\"\"\"

Respond with only the most appropriate specialty name from the list above."""

    return f"<start_of_turn>user\n{instruction}\n<end_of_turn>\n<start_of_turn>model\n"


In [14]:
def generate_train_prompt(data_point):
    instruction = f"""You are a medical assistant AI. Classify the following case into its appropriate medical specialties: {SPECIALTY_LIST}

Description:
\"\"\"{data_point['description']}\"\"\"

Respond with only the most appropriate specialty name from the list above."""

    output = f"{data_point['medical_specialty']}"

    return f"<start_of_turn>user\n{instruction}\n<end_of_turn>\n<start_of_turn>model\n{output}<end_of_turn>"


In [15]:
def generate_few_shot_prompt(few_shot_data, test_instance, label_list):
    prompt = f"You are a medical assistant AI. Classify the following medical case descriptions into their correct specialties: {SPECIALTY_LIST}.\n\nHere are some examples:\n\n"

    for example in few_shot_data:
        prompt += f'Description:\n"""{example["description"]}"""\nSpecialty: {example["medical_specialty"]}\n\n'

    prompt += f"Now classify the following description:\n\nDescription:\n\"\"\"{test_instance['description']}\"\"\"\n\n"
    prompt += f"Respond with only the correct specialty from the list:\n{', '.join(label_list)}"

    return f"<start_of_turn>user\n{prompt.strip()}\n<end_of_turn>\n<start_of_turn>model\n"

In [None]:
def create_few_shot_dataset(full_dataset, shots=5, label_column="medical_specialty", seed=42):
    df = full_dataset.to_pandas()
    few_shot_df = (
        df.groupby(label_column)
        .sample(n=shots, random_state=seed)
        .reset_index(drop=True)
    )
    return few_shot_df.to_dict(orient="records")

# Generate prompts for each few-shot setting
def generate_fewshot_test_data(X_test_df, fewshot_df, label_list, shots_name="5shot"):
    prompts = X_test_df.apply(
        lambda row: generate_few_shot_prompt(few_shot_data=fewshot_df, test_instance=row, label_list=label_list),
        axis=1
    )
    dataset = Dataset.from_pandas(pd.DataFrame({"text": prompts}))
    labels = X_test_df["medical_specialty"]
    return dataset.to_pandas(), labels



In [29]:
import pandas as pd
from datasets import Dataset

# Convert Hugging Face dataset to pandas for formatting
X_train = train_dataset.to_pandas()
X_eval = eval_dataset.to_pandas()
X_test = test_dataset.to_pandas()


# Prepare few-shot examples
fewshot_5 = create_few_shot_dataset(train_dataset, shots=5)
fewshot_10 = create_few_shot_dataset(train_dataset, shots=10)
fewshot_15 = create_few_shot_dataset(train_dataset, shots=15)

# Create test datasets with few-shot prompts
X_test_5shot, y_true_5shot = generate_fewshot_test_data(X_test, fewshot_5, VALID_SPECIALTIES, "5shot")
X_test_10shot, y_true_10shot = generate_fewshot_test_data(X_test, fewshot_10, VALID_SPECIALTIES, "10shot")
X_test_15shot, y_true_15shot = generate_fewshot_test_data(X_test, fewshot_15, VALID_SPECIALTIES, "15shot")

# Apply formatting
X_train_fine = pd.DataFrame(X_train.apply(generate_train_prompt, axis=1), columns=["text"])
X_eval_fine = pd.DataFrame(X_eval.apply(generate_train_prompt, axis=1), columns=["text"])

# Store labels
y_true_zero = X_test["medical_specialty"]
X_test_zero = pd.DataFrame(X_test.apply(generate_test_prompt, axis=1), columns=["text"])

# Convert back to Hugging Face Dataset format
train_data = Dataset.from_pandas(X_train_fine)
eval_data = Dataset.from_pandas(X_eval_fine)

In [30]:
# Create dynamic label mapping
all_labels = train_dataset.unique("medical_specialty")
label_mapping = {label: idx for idx, label in enumerate(sorted(all_labels))}
reverse_mapping = {v: k for k, v in label_mapping.items()}


In [31]:
model.config.label2id = label_mapping
model.config.id2label = reverse_mapping

In [32]:
label_mapping

{'Cardiovascular / Pulmonary': 0,
 'ENT - Otolaryngology': 1,
 'Gastroenterology': 2,
 'Hematology - Oncology': 3,
 'Nephrology': 4,
 'Neurology': 5,
 'Neurosurgery': 6,
 'Obstetrics / Gynecology': 7,
 'Ophthalmology': 8,
 'Orthopedic': 9,
 'Pediatrics - Neonatal': 10,
 'Psychiatry / Psychology': 11,
 'Radiology': 12}

In [33]:
from difflib import get_close_matches

def clean_prediction(raw_output, valid_classes):
    # print(raw_output)
    # print(valid_classes)
    match = get_close_matches(raw_output.strip(), valid_classes, n=1, cutoff=0.5)
    return match[0] if match else "Unknown"


In [44]:
import re
def predict(X_test, model, tokenizer, device=device, max_new_tokens=10, temperature=0.1):
    y_pred = []

    for i in tqdm(range(len(X_test)), desc="Predicting Specialties"):
        prompt = X_test.iloc[i]["text"]
        # print(prompt)
        input_ids = tokenizer(prompt, return_tensors="pt").to(device)

        outputs = model.generate(**input_ids,
                                 max_new_tokens=max_new_tokens,
                                 temperature=temperature,
                                  pad_token_id=tokenizer.eos_token_id
                                 )

        result = tokenizer.decode(outputs[0], skip_special_tokens=False).strip()
        # print(result)
        # generated = result[len(prompt):].strip().lower()
        match = re.search(r"<start_of_turn>model\s*(.*?)(<end_of_turn>|$)", result, re.DOTALL | re.IGNORECASE)
        # print(len(result))
        # print(len(prompt))

        generated = match.group(1).strip().lower()
        # print(generated)

        cleaned = clean_prediction(generated, VALID_SPECIALTIES)
        # print(cleaned)

        y_pred.append(cleaned)

    return y_pred


In [45]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred, label_list, model_name, training_type):
    # Map labels to IDs
    label_to_id = {label: idx for idx, label in enumerate(label_list)}

    # Convert labels to numeric IDs
    y_true_ids = [label_to_id[label] for label in y_true]
    y_pred_ids = [label_to_id.get(label, -1) for label in y_pred]

    # Filter out unknown predictions
    valid_indices = [i for i, pred in enumerate(y_pred_ids) if pred != -1]
    y_true_filtered = [y_true_ids[i] for i in valid_indices]
    y_pred_filtered = [y_pred_ids[i] for i in valid_indices]

    print(f"✅ Evaluating {len(valid_indices)} valid predictions out of {len(y_pred)} total.")

    # Overall Accuracy
    accuracy = accuracy_score(y_true_filtered, y_pred_filtered)
    print(f"\n🔹 Accuracy: {accuracy:.3f}")

    # Precision, Recall, F1 (per class & macro avg)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true_filtered, y_pred_filtered, average='macro', zero_division=0
    )
    print(f"🔹 Precision (macro avg): {precision:.3f}")
    print(f"🔹 Recall (macro avg):    {recall:.3f}")
    print(f"🔹 F1 Score (macro avg):  {f1:.3f}")

    # Full classification report
    print("\n🔍 Classification Report:")
    print(classification_report(
        y_true_filtered, y_pred_filtered,
        target_names=label_list,
        zero_division=0
    ))

    # Confusion matrix
    cm = confusion_matrix(y_true_filtered, y_pred_filtered)
    print("\n📊 Confusion Matrix:")
    print(cm)
    return {'model':model_name, 'Training_Type': training_type, 'Accuracy': accuracy, 'Precision (macro avg)': precision, 'Recall (macro avg)': recall, 'F1 Score (macro avg)':f1}


At this stage, we are ready to test the **Gemma-3 1B** model on our dataset **without any fine-tuning**. This initial evaluation provides insights into the model's **inherent performance** and helps establish a **baseline** for comparison with future fine-tuned models.

We use the `predict` function to generate sentiment predictions for the test set:

In [46]:
All_results = []

## Zero-shot prediction and Matrics (Gemma)

In [47]:
y_pred_zero = predict(X_test_zero, model, tokenizer)
res = evaluate(y_true_zero, y_pred_zero, label_list=VALID_SPECIALTIES, model_name="Gemma3-1b", training_type="Zero-shot")
All_results.append(res)

Predicting Specialties: 100%|██████████| 370/370 [01:14<00:00,  4.99it/s]

✅ Evaluating 345 valid predictions out of 370 total.

🔹 Accuracy: 0.606
🔹 Precision (macro avg): 0.589
🔹 Recall (macro avg):    0.479
🔹 F1 Score (macro avg):  0.495

🔍 Classification Report:
                            precision    recall  f1-score   support

Cardiovascular / Pulmonary       0.77      0.76      0.77        89
                Orthopedic       0.78      0.64      0.70        55
                Nephrology       0.25      0.70      0.37        10
   Obstetrics / Gynecology       1.00      0.34      0.51        32
             Ophthalmology       0.83      0.91      0.87        11
          Gastroenterology       1.00      0.73      0.85        41
                 Neurology       0.47      0.63      0.54        46
      ENT - Otolaryngology       0.78      0.50      0.61        14
   Psychiatry / Psychology       0.86      0.67      0.75         9
     Pediatrics - Neonatal       0.00      0.00      0.00         1
              Neurosurgery       0.17      0.15      0.16   




## Few-shot prediction and evaluation (Gemma)

In [48]:
y_pred_few_5 = predict(X_test_5shot, model, tokenizer)
res1 = evaluate(y_true_5shot, y_pred_few_5, label_list=VALID_SPECIALTIES, model_name="Gemma3-1b", training_type="Few-shot5")
All_results.append(res1)

y_pred_few_10 = predict(X_test_10shot, model, tokenizer)
res2 = evaluate(y_true_10shot, y_pred_few_10, label_list=VALID_SPECIALTIES,  model_name="Gemma3-1b", training_type="Few-shot10")
All_results.append(res2)

y_pred_few_15 = predict(X_test_15shot, model, tokenizer)
res3 = evaluate(y_true_15shot, y_pred_few_15, label_list=VALID_SPECIALTIES,  model_name="Gemma3-1b", training_type="Few-shot15")
All_results.append(res3)

Predicting Specialties: 100%|██████████| 370/370 [01:40<00:00,  3.68it/s]


✅ Evaluating 370 valid predictions out of 370 total.

🔹 Accuracy: 0.208
🔹 Precision (macro avg): 0.479
🔹 Recall (macro avg):    0.159
🔹 F1 Score (macro avg):  0.158

🔍 Classification Report:
                            precision    recall  f1-score   support

Cardiovascular / Pulmonary       0.65      0.53      0.59       109
                Orthopedic       0.86      0.11      0.19        57
                Nephrology       0.00      0.00      0.00        10
   Obstetrics / Gynecology       0.00      0.00      0.00        32
             Ophthalmology       1.00      0.09      0.17        11
          Gastroenterology       1.00      0.02      0.05        41
                 Neurology       1.00      0.04      0.08        47
      ENT - Otolaryngology       1.00      0.21      0.35        14
   Psychiatry / Psychology       0.71      0.56      0.62         9
     Pediatrics - Neonatal       0.00      0.00      0.00         3
              Neurosurgery       0.00      0.00      0.00   

Predicting Specialties: 100%|██████████| 370/370 [02:55<00:00,  2.11it/s]


✅ Evaluating 369 valid predictions out of 370 total.

🔹 Accuracy: 0.282
🔹 Precision (macro avg): 0.317
🔹 Recall (macro avg):    0.156
🔹 F1 Score (macro avg):  0.118

🔍 Classification Report:
                            precision    recall  f1-score   support

Cardiovascular / Pulmonary       0.45      0.86      0.59       109
                Orthopedic       1.00      0.05      0.10        56
                Nephrology       0.00      0.00      0.00        10
   Obstetrics / Gynecology       0.00      0.00      0.00        32
             Ophthalmology       1.00      0.09      0.17        11
          Gastroenterology       0.00      0.00      0.00        41
                 Neurology       0.00      0.00      0.00        47
      ENT - Otolaryngology       1.00      0.07      0.13        14
   Psychiatry / Psychology       0.67      0.44      0.53         9
     Pediatrics - Neonatal       0.00      0.00      0.00         3
              Neurosurgery       0.00      0.00      0.00   

Predicting Specialties: 100%|██████████| 370/370 [04:58<00:00,  1.24it/s]

✅ Evaluating 369 valid predictions out of 370 total.

🔹 Accuracy: 0.306
🔹 Precision (macro avg): 0.075
🔹 Recall (macro avg):    0.115
🔹 F1 Score (macro avg):  0.080

🔍 Classification Report:
                            precision    recall  f1-score   support

Cardiovascular / Pulmonary       0.30      1.00      0.46       109
                Orthopedic       0.00      0.00      0.00        57
                Nephrology       0.00      0.00      0.00        10
   Obstetrics / Gynecology       0.00      0.00      0.00        32
             Ophthalmology       0.00      0.00      0.00        11
          Gastroenterology       0.00      0.00      0.00        41
                 Neurology       0.00      0.00      0.00        47
      ENT - Otolaryngology       0.00      0.00      0.00        14
   Psychiatry / Psychology       0.67      0.50      0.57         8
     Pediatrics - Neonatal       0.00      0.00      0.00         3
              Neurosurgery       0.00      0.00      0.00   




## Finetuning (Gemma)

In [None]:
peft_config = LoraConfig(
    lora_alpha=32,
    lora_dropout=0.05,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules="all-linear",
)

training_arguments = SFTConfig(
    output_dir="./gemma-finetuned",
    num_train_epochs=4,
    gradient_checkpointing=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    optim="adamw_torch_fused",
    save_strategy="epoch",
    logging_steps=100,
    learning_rate=5e-5,
    warmup_steps=50,
    fp16=True,
    report_to= "none",
    eval_steps=50,  # Instead of `evaluation_strategy`
)


model.config.use_cache = False
model.config.pretraining_tp = 1

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=eval_data,
    peft_config=peft_config,
    processing_class=tokenizer,
    args=training_arguments,
)

In [50]:
# Train model
trainer.train()

Step,Training Loss
100,1.8394
200,0.7161
300,0.6641
400,0.6121
500,0.5487
600,0.5375
700,0.4904
800,0.4517


TrainOutput(global_step=860, training_loss=0.7139007856679517, metrics={'train_runtime': 2164.277, 'train_samples_per_second': 3.186, 'train_steps_per_second': 0.397, 'total_flos': 4000930941685248.0, 'train_loss': 0.7139007856679517})

## Prediction and Evaluation (Gemma finetuned model)

In [53]:
model.eval()
y_pred_fine = predict(X_test_zero, model, tokenizer)

res = evaluate(y_true_zero, y_pred_fine, label_list=VALID_SPECIALTIES,  model_name="Gemma3-1b", training_type="FineTuning")
All_results.append(res)

Predicting Specialties: 100%|██████████| 370/370 [01:57<00:00,  3.15it/s]

✅ Evaluating 370 valid predictions out of 370 total.

🔹 Accuracy: 0.776
🔹 Precision (macro avg): 0.654
🔹 Recall (macro avg):    0.640
🔹 F1 Score (macro avg):  0.642

🔍 Classification Report:
                            precision    recall  f1-score   support

Cardiovascular / Pulmonary       0.93      0.90      0.92       109
                Orthopedic       0.75      0.75      0.75        57
                Nephrology       0.60      0.60      0.60        10
   Obstetrics / Gynecology       0.97      0.88      0.92        32
             Ophthalmology       0.83      0.91      0.87        11
          Gastroenterology       0.90      0.88      0.89        41
                 Neurology       0.57      0.70      0.63        47
      ENT - Otolaryngology       1.00      0.79      0.88        14
   Psychiatry / Psychology       0.86      0.67      0.75         9
     Pediatrics - Neonatal       0.25      0.33      0.29         3
              Neurosurgery       0.36      0.25      0.29   




### Clearing GPU cache

In [None]:
import torch
import gc

def clear_gpu_cache():
    print("Clearing GPU cache...")

    # Collect Python garbage (clears any lingering refs)
    gc.collect()

    # Empty PyTorch's cache
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.ipc_collect()
        torch.cuda.synchronize()
        print(f"Memory cleared on device: {torch.cuda.get_device_name(0)}")
    else:
        print("CUDA device not available.")

# Usage:
clear_gpu_cache()

# Lamma

In [None]:
# Determine optimal computation dtype based on GPU capability
compute_dtype = torch.bfloat16 if torch.cuda.get_device_capability()[0] >= 8 else torch.float16
print(f"Using compute dtype {compute_dtype}")

# Select the best available device (CPU, CUDA, or MPS)
device = define_device()
print(f"Operating on {device}")

# Path to the pre-trained model
lamma_PATH = "meta-llama/Meta-Llama-3-8B-Instruct"

quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16)

# Load the model with optimized settings
model = AutoModelForCausalLM.from_pretrained(
    lamma_PATH,
    force_download=True,
    attn_implementation="eager",
    low_cpu_mem_usage=True,
    quantization_config = quantization_config,
    device_map="auto",
    token = HF_TOKEN
)

# Define maximum sequence length for the tokenizer
max_seq_length = 8192

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    lamma_PATH,
    force_download=True, 
    trust_remote_code=True,
    max_seq_length=max_seq_length,
    token = HF_TOKEN
)

## Creating prompts and Training data for Zero-shot, few-shot and Finetuning (Lamma)

In [113]:
def generate_test_prompt(data_point):
    instruction = f"""You are a medical expert. Classify the medical case description into one of the following specialties: {SPECIALTY_LIST}

Description:
\"\"\"{data_point["description"]}\"\"\"

Respond with only the most appropriate specialty name from the list above."""

    return f"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n{instruction}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"


def generate_train_prompt(data_point):
    instruction = f"""You are a medical assistant AI. Classify the following case into its appropriate medical specialties: {SPECIALTY_LIST}

Description:
\"\"\"{data_point['description']}\"\"\"

Respond with only the most appropriate specialty name from the list above."""

    output = f"{data_point['medical_specialty']}"

    return f"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n{instruction}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n{output}<|eot_id|>"


def generate_few_shot_prompt(few_shot_data, test_instance, label_list):
    prompt = f"You are a medical assistant AI. Classify the following medical case descriptions into their correct specialties: {SPECIALTY_LIST}.\n\nHere are some examples:\n\n"

    for example in few_shot_data:
        prompt += f'Description:\n"""{example["description"]}"""\nSpecialty: {example["medical_specialty"]}\n\n'

    prompt += f"Now classify the following description:\n\nDescription:\n\"\"\"{test_instance['description']}\"\"\"\n\n"
    prompt += f"Respond with only the correct specialty from the list:\n{', '.join(label_list)}"

    return f"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n{prompt.strip()}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"


In [None]:

# Convert Hugging Face dataset to pandas for formatting
X_train = train_dataset.to_pandas()
X_eval = eval_dataset.to_pandas()
X_test = test_dataset.to_pandas()


# Prepare few-shot examples
fewshot_5 = create_few_shot_dataset(train_dataset, shots=5)
fewshot_10 = create_few_shot_dataset(train_dataset, shots=10)
fewshot_15 = create_few_shot_dataset(train_dataset, shots=15)

# Create test datasets with few-shot prompts
X_test_5shot, y_true_5shot = generate_fewshot_test_data(X_test, fewshot_5, VALID_SPECIALTIES, "5shot")
X_test_10shot, y_true_10shot = generate_fewshot_test_data(X_test, fewshot_10, VALID_SPECIALTIES, "10shot")
X_test_15shot, y_true_15shot = generate_fewshot_test_data(X_test, fewshot_15, VALID_SPECIALTIES, "15shot")

# Apply formatting
X_train_fine = pd.DataFrame(X_train.apply(generate_train_prompt, axis=1), columns=["text"])
X_eval_fine = pd.DataFrame(X_eval.apply(generate_train_prompt, axis=1), columns=["text"])

# Store labels
y_true_zero = X_test["medical_specialty"]
X_test_zero = pd.DataFrame(X_test.apply(generate_test_prompt, axis=1), columns=["text"])

# Convert back to Hugging Face Dataset format
train_data = Dataset.from_pandas(X_train_fine)
eval_data = Dataset.from_pandas(X_eval_fine)

In [None]:
import re
from tqdm import tqdm

def predict(X_test, model, tokenizer, device=device, max_new_tokens=10, temperature=0.1):
    y_pred = []

    for i in tqdm(range(len(X_test)), desc="Predicting Specialties"):
        prompt = X_test.iloc[i]["text"]
        
        # Tokenize input
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        
        # Define terminators for generation
        eos_token_id = tokenizer.eos_token_id
        eot_token_id = tokenizer.convert_tokens_to_ids("<|eot_id|>") if "<|eot_id|>" in tokenizer.get_vocab() else None
        
        # Set up terminators list
        terminators = [eos_token_id]
        if eot_token_id is not None:
            terminators.append(eot_token_id)
        # Generate prediction under autocast
        with torch.autocast(device_type=device.type, dtype=torch.bfloat16):
            outputs = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=temperature,
                pad_token_id=tokenizer.eos_token_id,
                eos_token_id=terminators
            )


        # Decode the result
        result = tokenizer.decode(outputs[0], skip_special_tokens=False).strip()
        # print(result)
        # Extract the assistant's response using regex pattern for Llama 3 format
        match = re.search(r"<\|start_header_id\|>assistant<\|end_header_id\|>\s*(.*?)(<\|eot_id\|>|$)", 
                         result, re.DOTALL | re.IGNORECASE)
        
        if match:
            generated = match.group(1).strip().lower()
        else:
            # Fallback if the pattern isn't found
            # This extracts text after the last occurrence of "assistant"
            assistant_pos = result.lower().rfind("assistant")
            if assistant_pos != -1:
                generated = result[assistant_pos + len("assistant"):].strip().lower()
            else:
                # Last resort: just take the last part of the output
                generated = result[len(prompt):].strip().lower()
        
        # Clean the prediction to match valid specialties

        cleaned = clean_prediction(generated, VALID_SPECIALTIES)
        # print(cleaned)
        y_pred.append(cleaned)

    return y_pred


## Zero-shot prediction and evaluation (Lamma)

In [120]:
model.eval()
y_pred_zero = predict(X_test_zero, model, tokenizer)
res = evaluate(y_true_zero, y_pred_zero, label_list=VALID_SPECIALTIES, model_name="Llama-3-8B-Instruct", training_type="Zero-shot")
All_results.append(res)

Predicting Specialties: 100%|██████████| 370/370 [02:23<00:00,  2.57it/s]

✅ Evaluating 351 valid predictions out of 370 total.

🔹 Accuracy: 0.735
🔹 Precision (macro avg): 0.641
🔹 Recall (macro avg):    0.641
🔹 F1 Score (macro avg):  0.614

🔍 Classification Report:
                            precision    recall  f1-score   support

Cardiovascular / Pulmonary       0.96      0.74      0.83        91
                Orthopedic       0.75      0.91      0.83        57
                Nephrology       1.00      0.40      0.57        10
   Obstetrics / Gynecology       0.94      0.94      0.94        32
             Ophthalmology       0.91      0.91      0.91        11
          Gastroenterology       0.85      0.83      0.84        41
                 Neurology       0.74      0.55      0.63        47
      ENT - Otolaryngology       0.45      1.00      0.62        14
   Psychiatry / Psychology       0.89      0.89      0.89         9
     Pediatrics - Neonatal       0.17      0.50      0.25         2
              Neurosurgery       0.44      0.40      0.42   




## Few-shot prediction and evaluation (Lamma)

In [None]:
y_pred_few_5 = predict(X_test_5shot, model, tokenizer)
res1 = evaluate(y_true_5shot, y_pred_few_5, label_list=VALID_SPECIALTIES, model_name="Llama-3-8B-Instruct", training_type="Few-shot5")
All_results.append(res1)

y_pred_few_10 = predict(X_test_10shot, model, tokenizer)
res2 = evaluate(y_true_10shot, y_pred_few_10, label_list=VALID_SPECIALTIES,  model_name="Llama-3-8B-Instruct", training_type="Few-shot10")
All_results.append(res2)

# Taking too much of time

# y_pred_few_15 = predict(X_test_15shot, model, tokenizer)
# res3 = evaluate(y_true_15shot, y_pred_few_15, label_list=VALID_SPECIALTIES,  model_name="Llama-3-8B-Instruct", training_type="Few-shot15")
# All_results.append(res3)

## Finetuning (Lamma)

In [None]:
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=32,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules="all-linear",
)

fp16 = compute_dtype == torch.float16
bf16 = compute_dtype == torch.bfloat16
 

training_arguments = SFTConfig(
    output_dir=None,
    num_train_epochs=2,
    gradient_checkpointing=True,
    fp16=fp16,
    bf16=bf16,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    optim="adamw_torch_fused",
    save_strategy="no",
    logging_steps=100,
    learning_rate=5e-5,
    report_to= "none",
    eval_steps=50,  # Instead of `evaluation_strategy`
)


model.config.use_cache = False
model.config.pretraining_tp = 1

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=eval_data,
    peft_config=peft_config,
    processing_class=tokenizer,
    args=training_arguments,
)

In [141]:
trainer.train()

Step,Training Loss


TrainOutput(global_step=430, training_loss=0.5608600350313409, metrics={'train_runtime': 569.0482, 'train_samples_per_second': 6.059, 'train_steps_per_second': 0.756, 'total_flos': 2.63954015576064e+16, 'train_loss': 0.5608600350313409})

## Finetune Prediction and Evaluation (Lamma)

In [151]:
model.eval()
y_pred_fine = predict(X_test_zero, model, tokenizer)

res = evaluate(y_true_zero, y_pred_fine, label_list=VALID_SPECIALTIES,  model_name="Llama-3-8B-Instruct", training_type="FineTuning")
All_results.append(res)

Predicting Specialties: 100%|██████████| 370/370 [03:38<00:00,  1.69it/s]

✅ Evaluating 370 valid predictions out of 370 total.

🔹 Accuracy: 0.816
🔹 Precision (macro avg): 0.700
🔹 Recall (macro avg):    0.685
🔹 F1 Score (macro avg):  0.680

🔍 Classification Report:
                            precision    recall  f1-score   support

Cardiovascular / Pulmonary       0.94      0.94      0.94       109
                Orthopedic       0.74      0.95      0.83        57
                Nephrology       0.88      0.70      0.78        10
   Obstetrics / Gynecology       0.91      0.91      0.91        32
             Ophthalmology       0.91      0.91      0.91        11
          Gastroenterology       0.88      0.88      0.88        41
                 Neurology       0.71      0.64      0.67        47
      ENT - Otolaryngology       0.85      0.79      0.81        14
   Psychiatry / Psychology       0.88      0.78      0.82         9
     Pediatrics - Neonatal       0.29      0.67      0.40         3
              Neurosurgery       0.70      0.35      0.47   




In [152]:
pd.DataFrame(All_results)

Unnamed: 0,model,Training_Type,Accuracy,Precision (macro avg),Recall (macro avg),F1 Score (macro avg)
0,Gemma3-1b,Zero-shot,0.605797,0.588705,0.479389,0.494805
1,Gemma3-1b,Few-shot5,0.208108,0.479,0.158851,0.158072
2,Gemma3-1b,Few-shot10,0.281843,0.316844,0.155595,0.11815
3,Gemma3-1b,Few-shot15,0.306233,0.074573,0.115385,0.079711
4,Gemma3-1b,FineTuning,0.775676,0.653539,0.640009,0.641594
5,Llama-3-8B-Instruct,Zero-shot,0.735043,0.641317,0.641012,0.613964
6,Llama-3-8B-Instruct,Few-shot5,0.752187,0.661484,0.691934,0.64911
7,Llama-3-8B-Instruct,Few-shot10,0.742857,0.671615,0.644975,0.596966
8,Llama-3-8B-Instruct,FineTuning,0.816216,0.699554,0.684936,0.679657


### Clearing cache and vram of GPU

In [None]:
import torch
import gc

def clear_gpu_cache():
    print("Clearing GPU cache...")

    # Collect Python garbage (clears any lingering refs)
    gc.collect()

    # Empty PyTorch's cache
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.ipc_collect()
        torch.cuda.synchronize()
        print(f"Memory cleared on device: {torch.cuda.get_device_name(0)}")
    else:
        print("CUDA device not available.")

# Usage:
clear_gpu_cache()

# Deepseek

In [None]:
# Determine optimal computation dtype based on GPU capability
compute_dtype = torch.bfloat16 if torch.cuda.get_device_capability()[0] >= 8 else torch.float16
print(f"Using compute dtype {compute_dtype}")

# Select the best available device (CPU, CUDA, or MPS)
device = define_device()
print(f"Operating on {device}")

# Path to the pre-trained model
deepseek_PATH = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=compute_dtype)

# Load the model with optimized settings
model = AutoModelForCausalLM.from_pretrained(
    deepseek_PATH,
    force_download=True,
    quantization_config = quantization_config,
    device_map="auto",
    token = HF_TOKEN
)

# Define maximum sequence length for the tokenizer
max_seq_length = 8192

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    deepseek_PATH,
    force_download=True, 
    trust_remote_code=True,
    max_seq_length=max_seq_length,
    token = HF_TOKEN
)

In [172]:
print(tokenizer.chat_template)

{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='') %}{%- for message in messages %}{%- if message['role'] == 'system' %}{% set ns.system_prompt = message['content'] %}{%- endif %}{%- endfor %}{{bos_token}}{{ns.system_prompt}}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<｜User｜>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is none %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls']%}{%- if not ns.is_first %}{{'<｜Assistant｜><｜tool▁calls▁begin｜><｜tool▁call▁begin｜>' + tool['type'] + '<｜tool▁sep｜>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<｜tool▁call▁end｜>'}}{%- set ns.is_first = true -%}{%- else %}{{'\n' + '<｜tool▁call▁begin｜>' + tool['type'] + '<｜tool▁sep｜>' + tool['functio

## Creating prompts and Training data for Zero-shot, few-shot and Finetuning (Deepseek)

In [None]:
def generate_test_prompt(data_point):
    """Generate a test prompt for zero-shot inference"""
    messages = [
        {"role": "system", "content": "You are a medical expert specializing in classifying medical cases."},
        {"role": "user", "content": f"""Classify the medical case description into one of the following specialties: {SPECIALTY_LIST}

Description:
\"\"\"{data_point["description"]}\"\"\"

Respond with only the most appropriate specialty name from the list above."""}
    ]
    
    formatted_text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        
    return formatted_text


def generate_train_prompt(data_point):
    """Generate a training prompt with expected output"""
    messages = [
        {"role": "system", "content": "You are a medical expert specializing in classifying medical cases."},
        {"role": "user", "content": f"""Classify the following case into its appropriate medical specialty: {SPECIALTY_LIST}

Description:
\"\"\"{data_point['description']}\"\"\"

Respond with only the most appropriate specialty name from the list above."""},
        {"role": "assistant", "content": f"{data_point['medical_specialty']}"}
    ]

    formatted_text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=False
        )
        
    return formatted_text
    
   


def generate_few_shot_prompt(few_shot_data, test_instance, label_list):
    """Generate a few-shot prompt with examples"""
    # Create the few-shot examples content
    few_shot_content = "Here are some examples:\n\n"
    for example in few_shot_data:
        few_shot_content += f'Description:\n"""{example["description"]}"""\nSpecialty: {example["medical_specialty"]}\n\n'
    
    few_shot_content += f"Now classify the following description:\n\nDescription:\n\"\"\"{test_instance['description']}\"\"\"\n\n"
    few_shot_content += f"Respond with only the correct specialty from the list:\n{', '.join(label_list)}"
    
    messages = [
        {"role": "system", "content": "You are a medical expert specializing in classifying medical cases."},
        {"role": "user", "content": f"Classify the following medical case descriptions into their correct specialties: {SPECIALTY_LIST}.\n\n{few_shot_content}"}
    ]

    return messages

def generate_fewshot_test_data(X_test, few_shot_examples, label_list):
    """Generate test data with few-shot examples"""
    messages_list = []
    labels = []
    
    for i, row in X_test.iterrows():
        messages = generate_few_shot_prompt(few_shot_examples, row, label_list)
        messages_list.append(messages)
        labels.append(row["medical_specialty"])
    
    return pd.DataFrame({"messages": messages_list}), labels



# Convert Hugging Face dataset to pandas for formatting
X_train = train_dataset.to_pandas()
X_eval = eval_dataset.to_pandas()
X_test = test_dataset.to_pandas()


# Prepare few-shot examples
fewshot_5 = create_few_shot_dataset(train_dataset, shots=5)
fewshot_10 = create_few_shot_dataset(train_dataset, shots=10)
fewshot_15 = create_few_shot_dataset(train_dataset, shots=15)
print("Formatting training data...")
X_train_fine = pd.DataFrame(X_train.apply(lambda x: generate_train_prompt(x), axis=1), columns=["text"])

print("Formatting evaluation data...")
X_eval_fine = pd.DataFrame(X_eval.apply(lambda x: generate_train_prompt(x), axis=1), columns=["text"])

# Store labels for zero-shot evaluation
y_true_zero = X_test["medical_specialty"]

print("Formatting zero-shot test data...")
X_test_zero = pd.DataFrame(X_test.apply(lambda x: generate_test_prompt(x), axis=1), columns=["messages"])

# Create test datasets with few-shot prompts
X_test_5shot, y_true_5shot = generate_fewshot_test_data(X_test, fewshot_5, VALID_SPECIALTIES)
X_test_10shot, y_true_10shot = generate_fewshot_test_data(X_test, fewshot_10, VALID_SPECIALTIES)
X_test_15shot, y_true_15shot = generate_fewshot_test_data(X_test, fewshot_15, VALID_SPECIALTIES)


# Convert back to Hugging Face Dataset format
train_data = Dataset.from_pandas(X_train_fine)
eval_data = Dataset.from_pandas(X_eval_fine)

Formatting training data...
Formatting evaluation data...
Formatting zero-shot test data...


In [271]:
X_train_fine.iloc[0,0]

'<｜begin▁of▁sentence｜>You are a medical expert specializing in classifying medical cases.<｜User｜>Classify the following case into its appropriate medical specialty: Cardiovascular / Pulmonary, Orthopedic, Nephrology, Obstetrics / Gynecology, Ophthalmology, Gastroenterology, Neurology, ENT - Otolaryngology, Psychiatry / Psychology, Pediatrics - Neonatal, Neurosurgery, Hematology - Oncology, Radiology\n\nDescription:\n"""Pacemaker ICD interrogation.  Severe nonischemic cardiomyopathy with prior ventricular tachycardia."""\n\nRespond with only the most appropriate specialty name from the list above.<｜Assistant｜>Cardiovascular / Pulmonary<｜end▁of▁sentence｜>'

In [264]:
print(f"Using device: {device}")

Using device: cuda


In [None]:
def predict(X_test, model, tokenizer, batch_size=16, max_new_tokens=500, temperature=0.1, device=torch.device("cuda")):
    """Optimized prediction using batching and mixed precision on GPU"""
    model.eval()
    model.to(device)
    y_pred = []

    for i in tqdm(range(0, len(X_test), batch_size), desc="Predicting Specialties"):
        batch = X_test.iloc[i:i+batch_size]
        prompts = batch["messages"].tolist()   
        inputs = tokenizer(prompts, return_tensors="pt", padding=True,truncation=True).to(device)

        
        with torch.no_grad(), torch.autocast(device_type=device.type, dtype=torch.bfloat16):
            outputs = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=temperature,
                pad_token_id=tokenizer.eos_token_id,
                eos_token_id=tokenizer.eos_token_id
            )

        for j in range(len(outputs)):
            result = tokenizer.decode(outputs[j], skip_special_tokens=False).strip()
            # print(result)
            # Extract assistant response (adapted for DeepSeek format)
            match = re.search(r"<\｜Assistant\｜>(.*?)(<\｜end▁of▁sentence\｜>|$)", result, re.DOTALL)
            if match:
                generated = match.group(1).strip().lower()
            else:
                assistant_pos = result.lower().rfind("<｜assistant｜>")
                if assistant_pos != -1:
                    generated = result[assistant_pos + len("<｜assistant｜>"):].strip().lower()
                else:
                    input_text = tokenizer.decode(inputs["input_ids"][j], skip_special_tokens=True)
                    generated = result[len(input_text):].strip().lower()

            # Remove "thinking" tags
            generated = re.sub(r"<think>.*?</think>", "", generated, flags=re.DOTALL).strip()

            # Clean and append result
            cleaned = clean_prediction(generated, VALID_SPECIALTIES)
            # print(cleaned)
            y_pred.append(cleaned)

    return y_pred


## Zero-shot Prediction and Evaluation (Deepseek)

In [201]:
model.eval()
y_pred_zero = predict(X_test_zero, model, tokenizer)
res = evaluate(y_true_zero, y_pred_zero, label_list=VALID_SPECIALTIES, model_name="DeepSeek-R1-Distill-Qwen-1.5B", training_type="Zero-shot")
All_results.append(res)

Predicting Specialties: 100%|██████████| 370/370 [1:36:21<00:00, 15.63s/it]

✅ Evaluating 225 valid predictions out of 370 total.

🔹 Accuracy: 0.427
🔹 Precision (macro avg): 0.394
🔹 Recall (macro avg):    0.353
🔹 F1 Score (macro avg):  0.318

🔍 Classification Report:
                            precision    recall  f1-score   support

Cardiovascular / Pulmonary       0.73      0.28      0.41        57
                Orthopedic       0.39      0.79      0.52        43
                Nephrology       0.20      0.80      0.32         5
   Obstetrics / Gynecology       0.83      0.33      0.48        15
             Ophthalmology       0.36      0.82      0.50        11
          Gastroenterology       0.92      0.52      0.67        21
                 Neurology       0.41      0.43      0.42        30
      ENT - Otolaryngology       0.00      0.00      0.00        11
   Psychiatry / Psychology       1.00      0.50      0.67         4
     Pediatrics - Neonatal       0.00      0.00      0.00         1
              Neurosurgery       0.29      0.11      0.16   




## Few-shot Prediction and Evaluation (Deepseek)

In [202]:
y_pred_few_5 = predict(X_test_5shot, model, tokenizer)
res1 = evaluate(y_true_5shot, y_pred_few_5, label_list=VALID_SPECIALTIES, model_name="DeepSeek-R1-Distill-Qwen-1.5B", training_type="Few-shot5")
All_results.append(res1)

# Its taking too much of time

# y_pred_few_10 = predict(X_test_10shot, model, tokenizer)
# res2 = evaluate(y_true_10shot, y_pred_few_10, label_list=VALID_SPECIALTIES,  model_name="DeepSeek-R1-Distill-Qwen-1.5B", training_type="Few-shot10")
# All_results.append(res2)

Predicting Specialties: 100%|██████████| 370/370 [1:05:20<00:00, 10.60s/it]

✅ Evaluating 322 valid predictions out of 370 total.

🔹 Accuracy: 0.326
🔹 Precision (macro avg): 0.364
🔹 Recall (macro avg):    0.287
🔹 F1 Score (macro avg):  0.281

🔍 Classification Report:
                            precision    recall  f1-score   support

Cardiovascular / Pulmonary       0.70      0.31      0.43        96
                Orthopedic       0.40      0.35      0.38        48
                Nephrology       0.12      0.40      0.18        10
   Obstetrics / Gynecology       0.75      0.30      0.43        20
             Ophthalmology       0.46      1.00      0.63        11
          Gastroenterology       0.77      0.29      0.43        34
                 Neurology       0.40      0.43      0.41        47
      ENT - Otolaryngology       0.00      0.00      0.00        13
   Psychiatry / Psychology       1.00      0.43      0.60         7
     Pediatrics - Neonatal       0.00      0.00      0.00         2
              Neurosurgery       0.14      0.21      0.17   




## Finetuning (Deepseek)

In [None]:
# Configure LoRA
lora_config = LoraConfig(
    r=16,  # Rank
    lora_alpha=16,  # Alpha parameter for LoRA scaling
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention modules
        "gate_proj", "up_proj", "down_proj"      # MLP modules
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

fp16 = compute_dtype == torch.float16
bf16 = compute_dtype == torch.bfloat16
 

training_arguments = SFTConfig(
    output_dir=None,
    num_train_epochs=3,
    gradient_checkpointing=True,
    fp16=fp16,
    bf16=bf16,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    optim="adamw_torch_fused",
    save_strategy="no",
    logging_steps=100,
    learning_rate=2e-4,
    report_to= "none",
    eval_steps=50,  # Instead of `evaluation_strategy`
)


model.config.use_cache = False
model.config.pretraining_tp = 1

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=eval_data,
    peft_config=peft_config,
    args=training_arguments
)

In [274]:
trainer.train()

Step,Training Loss
100,1.2439
200,0.8399
300,0.7563


TrainOutput(global_step=321, training_loss=0.9321540790926259, metrics={'train_runtime': 474.1346, 'train_samples_per_second': 10.908, 'train_steps_per_second': 0.677, 'total_flos': 8130781332357120.0, 'train_loss': 0.9321540790926259})

## Fintune Prediction and Evaluation (Deepseek)

In [276]:
model.eval()
y_pred_fine = predict(X_test_zero, model, tokenizer)

res = evaluate(y_true_zero, y_pred_fine, label_list=VALID_SPECIALTIES,  model_name="DeepSeek-R1-Distill-Qwen-1.5B", training_type="FineTuning")
All_results.append(res)

Predicting Specialties:   0%|          | 0/24 [00:00<?, ?it/s]

Predicting Specialties: 100%|██████████| 24/24 [17:55<00:00, 44.80s/it]

✅ Evaluating 290 valid predictions out of 370 total.

🔹 Accuracy: 0.445
🔹 Precision (macro avg): 0.418
🔹 Recall (macro avg):    0.412
🔹 F1 Score (macro avg):  0.351

🔍 Classification Report:
                            precision    recall  f1-score   support

Cardiovascular / Pulmonary       0.86      0.39      0.53        83
                Orthopedic       0.36      0.83      0.50        47
                Nephrology       0.12      0.75      0.21         8
   Obstetrics / Gynecology       0.76      0.57      0.65        23
             Ophthalmology       0.67      0.67      0.67         9
          Gastroenterology       0.85      0.33      0.48        33
                 Neurology       0.45      0.36      0.40        36
      ENT - Otolaryngology       0.50      0.09      0.15        11
   Psychiatry / Psychology       0.78      0.88      0.82         8
     Pediatrics - Neonatal       0.00      0.00      0.00         2
              Neurosurgery       0.00      0.00      0.00   




In [277]:
pd.DataFrame(All_results)

Unnamed: 0,model,Training_Type,Accuracy,Precision (macro avg),Recall (macro avg),F1 Score (macro avg)
0,Gemma3-1b,Zero-shot,0.605797,0.588705,0.479389,0.494805
1,Gemma3-1b,Few-shot5,0.208108,0.479,0.158851,0.158072
2,Gemma3-1b,Few-shot10,0.281843,0.316844,0.155595,0.11815
3,Gemma3-1b,Few-shot15,0.306233,0.074573,0.115385,0.079711
4,Gemma3-1b,FineTuning,0.775676,0.653539,0.640009,0.641594
5,Llama-3-8B-Instruct,Zero-shot,0.735043,0.641317,0.641012,0.613964
6,Llama-3-8B-Instruct,Few-shot5,0.752187,0.661484,0.691934,0.64911
7,Llama-3-8B-Instruct,Few-shot10,0.742857,0.671615,0.644975,0.596966
8,Llama-3-8B-Instruct,FineTuning,0.816216,0.699554,0.684936,0.679657
9,DeepSeek-R1-Distill-Qwen-1.5B,Zero-shot,0.426667,0.393508,0.353167,0.317925
