# Fine-Tune DeepSeek R1 for Lab Test Analysis (8GB VRAM)

This notebook fine-tunes the `unsloth/DeepSeek-R1-Distill-Llama-8B` model on a custom medical lab test dataset using Unsloth and LoRA, with settings chosen to fit on a GPU with 8 GB of VRAM.

In [None]:
%%capture
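# Install Unsloth from PyPI (this also pulls in its dependencies)
!pip install unsloth
# Reinstall the latest Unsloth from GitHub without touching the dependencies
!pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
# Dataset loading, experiment tracking, and training libraries
!pip install datasets wandb trl peft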

## 1. Authentication
Log in to Hugging Face, and optionally to Weights & Biases for experiment tracking.

In [None]:
from huggingface_hub import login
import wandb
import os

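# Replace with your actual tokens or set them as environment variables
hf_token = os.environ.get("HUGGINGFACE_TOKEN", "YOUR_HF_TOKEN_HERE")
if hf_token == "YOUR_HF_TOKEN_HERE":
    # No token supplied: skip login instead of sending the placeholder string
    hf_token = None
else:
    login(token=hf_token)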

wb_token = os.environ.get("WANDB_API_KEY", "YOUR_WANDB_TOKEN_HERE")
if wb_token != "YOUR_WANDB_TOKEN_HERE":
    wandb.login(key=wb_token)
    run = wandb.init(
        project='Fine-tune-DeepSeek-R1-Lab-Tests', 
        job_type="training", 
        anonymous="allow"
    )

## 2. Load Model & Tokenizer
Load the model with Unsloth's `FastLanguageModel` using 4-bit quantization so that the 8B model fits in 8 GB of VRAM.
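
To see why 4-bit loading matters here, a rough estimate of weight storage helps. The sketch below assumes about 8.0e9 parameters and counts only the raw weights; quantization metadata, activations, LoRA weights, and optimizer state are deliberately left out, so real usage will be higher.

```python
# Back-of-the-envelope weight memory for an ~8B-parameter model.
# Assumption: ~8.0e9 parameters. Activations, quantization metadata,
# LoRA weights, and optimizer state are NOT included.
N_PARAMS = 8.0e9


def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Return raw weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


fp16_gb = weight_memory_gb(N_PARAMS, 16)
int4_gb = weight_memory_gb(N_PARAMS, 4)

print(f"16-bit weights: ~{fp16_gb:.0f} GB")
print(f"4-bit weights:  ~{int4_gb:.0f} GB")
```

Under these assumptions the 16-bit weights alone would exceed 8 GB, while the 4-bit weights leave headroom for activations and the adapter.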

In [None]:
from unsloth import FastLanguageModel

max_seq_length = 2048 # Lower this if you hit out-of-memory (OOM) errors
dtype = None # Auto-detects bf16/fp16
load_in_4bit = True # CRITICAL for 8GB VRAM

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/DeepSeek-R1-Distill-Llama-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = hf_token, 
)

## 3. Configure LoRA Adapters
LoRA freezes the base weights and trains only small low-rank adapter matrices, which is what keeps memory use low.
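
As a rough illustration of how small the adapters are, the sketch below counts LoRA parameters for rank 16 on the seven projection modules targeted in this notebook. The layer dimensions are assumptions based on the Llama-3-8B architecture (they are not stated in this notebook), and the base parameter count of about 8.03e9 is approximate.

```python
# Assumed Llama-3-8B-style dimensions (not stated in this notebook).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 1024      # 8 key/value heads x 128 head dim (grouped-query attention)
NUM_LAYERS = 32
RANK = 16

# Each target module as (in_features, out_features).
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) for each adapted matrix."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGETS.values())
total = per_layer * NUM_LAYERS

print(f"trainable LoRA parameters: {total:,}")
print(f"share of ~8.03e9 base parameters: {total / 8.03e9:.2%}")
```

Because the count scales linearly with `RANK`, doubling `r` doubles the number of trainable parameters.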

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # LoRA rank
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha = 16,
    lora_dropout = 0, # Dropout = 0 is recommended for Unsloth
    bias = "none",    # Bias = none is recommended
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context memory saving
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

## 4. Prepare the Dataset
Load the local JSONL conversational dataset and format it for the causal language model.
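
The formatting code in this section reads `messages[0]`, `messages[1]`, and `messages[2]` as the system, user, and assistant turns. A small check like the sketch below can catch malformed lines before training. The `role` field names and the sample record are assumptions for illustration, since the notebook only relies on message order and the `content` key.

```python
import json

# Hypothetical record; the "role" values and placeholder text are illustrative.
sample_line = json.dumps({
    "messages": [
        {"role": "system", "content": "You are a medical laboratory assistant."},
        {"role": "user", "content": "Glucose (Fasting). Result: 155 mg/dL. Ref Range: 70-99 mg/dL."},
        {"role": "assistant", "content": "(reference answer goes here)"},
    ]
})

EXPECTED_ROLES = ["system", "user", "assistant"]


def validate_record(line: str) -> bool:
    """Return True if the line holds exactly system/user/assistant turns
    in that order, each with non-empty string content."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    if not isinstance(record, dict):
        return False
    messages = record.get("messages")
    if not isinstance(messages, list) or len(messages) != len(EXPECTED_ROLES):
        return False
    for message, role in zip(messages, EXPECTED_ROLES):
        if not isinstance(message, dict) or message.get("role") != role:
            return False
        content = message.get("content")
        if not isinstance(content, str) or not content.strip():
            return False
    return True


print(validate_record(sample_line))
print(validate_record('{"messages": []}'))
```

Running `validate_record` over every line of `fine_tuning_lab_tests.jsonl` before calling `load_dataset` gives an early failure instead of an `IndexError` inside `dataset.map`.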

In [None]:
from datasets import load_dataset

# Load the dataset you generated
dataset = load_dataset("json", data_files="fine_tuning_lab_tests.jsonl", split="train")

# DeepSeek-R1 specific formatting:
prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context. 
Write a response that appropriately completes the request. 
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
{system}

### Question:
{user}

### Response:
<think>
{thought}</think>
{response}"""

def formatting_prompts_func(examples):
    texts = []
    for messages in examples['messages']:
        system = messages[0]['content']
        user = messages[1]['content']
        assistant = messages[2]['content']
        
        # The dataset has no explicit chain-of-thought, so every example gets the
        # same generic placeholder reasoning inside the <think> block.
        # DeepSeek-R1 models emit their reasoning inside <think> tags, so the tags are kept.
        thought_process = "Analyzing the lab result against the reference range... Determining if the value is high, low, or normal... Formulating clinical recommendations based on standard medical guidelines for this parameter."
        
        # Append the EOS token so the model learns where a response ends
        text = prompt_style.format(system=system, user=user, thought=thought_process, response=assistant) + tokenizer.eos_token
        texts.append(text)
    return { "text" : texts }

dataset = dataset.map(formatting_prompts_func, batched = True)

## 5. Training Setup
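Run the fine-tuning. `per_device_train_batch_size=2` keeps per-step activation memory small enough for 8 GB of VRAM, and `gradient_accumulation_steps=4` compensates by accumulating gradients over four micro-batches before each optimizer update.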
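
The arithmetic behind these settings can be sketched as follows. The values mirror the training arguments used in this section; `num_devices = 1` is an assumption that the run uses a single GPU.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 60
num_devices = 1  # assumption: single-GPU run

# Sequences contributing to each optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Sequences processed over the whole run (with repetition if the dataset is smaller).
sequences_seen = effective_batch_size * max_steps

print(f"effective batch size: {effective_batch_size}")
print(f"sequences processed in {max_steps} steps: {sequences_seen}")
```

If the dataset has more examples than `sequences_seen`, the 60-step run does not complete a full epoch.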

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60, # Short demo run; for full training, remove max_steps and set num_train_epochs (e.g., 3)
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 10,
        optim = "adamw_8bit", # 8-bit Adam optimizer saves VRAM
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

# Start the training
trainer_stats = trainer.train()
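
For intuition about `warmup_steps = 5` combined with `lr_scheduler_type = "linear"`, the sketch below computes the learning rate at a given step using the common linear warmup followed by linear decay formula. It is a standalone approximation for illustration, not the library's own scheduler, and `linear_lr` is a hypothetical helper name.

```python
def linear_lr(step: int, base_lr: float = 2e-4,
              warmup_steps: int = 5, total_steps: int = 60) -> float:
    """Linear warmup from 0 to base_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 1, 5, 30, 60):
    print(f"step {step:>2}: lr = {linear_lr(step):.2e}")
```

With these settings the peak rate of `2e-4` is reached at step 5 and the rate falls to zero by step 60, so a longer run also stretches the decay.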

## 6. Inference / Testing
Test the newly fine-tuned model.
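
The text printed after `### Response:` contains the reasoning followed by the final answer. A helper like the sketch below separates the two, assuming the closing `</think>` tag survives decoding. `split_reasoning` is a hypothetical name, and the sample string is a placeholder rather than real model output.

```python
def split_reasoning(generated: str,
                    open_tag: str = "<think>",
                    close_tag: str = "</think>") -> tuple[str, str]:
    """Split generated text into (reasoning, answer).

    If the closing tag is missing (for example, generation stopped at
    max_new_tokens mid-thought), everything is treated as reasoning.
    """
    reasoning, found, answer = generated.partition(close_tag)
    reasoning = reasoning.replace(open_tag, "", 1).strip()
    if not found:
        return reasoning, ""
    return reasoning, answer.strip()


# Placeholder text standing in for decoded model output.
sample = "\n<think>\nplaceholder reasoning</think>\nplaceholder answer"
reasoning, answer = split_reasoning(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```

An empty `answer` is a hint that `max_new_tokens` was too small for the model to finish its reasoning.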

In [None]:
FastLanguageModel.for_inference(model) # Switch to Unsloth's faster inference mode

test_system = "You are a medical laboratory assistant. Your task is to analyze lab results, identify abnormalities based on reference ranges, and provide brief, informative explanations for healthcare professionals."
test_user = "Glucose (Fasting). Result: 155 mg/dL. Ref Range: 70-99 mg/dL."

# Build the prompt up to the opening <think> tag so the model generates the reasoning and the response
inference_prompt = f"""Below is an instruction that describes a task, paired with an input that provides further context. 
Write a response that appropriately completes the request. 
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
{test_system}

### Question:
{test_user}

### Response:
<think>"""

inputs = tokenizer([inference_prompt], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=512,
    use_cache=True,
)

response = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(response[0].split("### Response:")[1])

## 7. Save and Push to Hub
Save the LoRA adapters locally, or push them to Hugging Face.

In [None]:
new_model_local = "DeepSeek-R1-Lab-Test-Assistant-LoRA"

# Save locally
model.save_pretrained(new_model_local) 
tokenizer.save_pretrained(new_model_local)

# Optional: Save merged 16-bit model 
# model.save_pretrained_merged(new_model_local + "-merged", tokenizer, save_method = "merged_16bit")

# Optional: Push to Hub
# new_model_online = "your_username/DeepSeek-R1-Lab-Test-Assistant-LoRA"
# model.push_to_hub(new_model_online)
# tokenizer.push_to_hub(new_model_online)
