In [1]:
# ================================================
# 1️⃣ Prepare Environment: Install Required Packages
# ================================================

#%%capture
import os, importlib.util
!pip install --upgrade -qqq uv
if importlib.util.find_spec("torch") is None or "COLAB_" in "".join(os.environ.keys()):    
    try: import numpy, PIL; get_numpy = f"numpy=={numpy.__version__}"; get_pil = f"pillow=={PIL.__version__}"
    except: get_numpy = "numpy"; get_pil = "pillow"
    !uv pip install -qqq \
        "torch>=2.8.0" "triton>=3.4.0" {get_numpy} {get_pil} torchvision bitsandbytes "transformers==4.56.2" \
        "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo" \
        "unsloth[base] @ git+https://github.com/unslothai/unsloth" \
        git+https://github.com/triton-lang/triton.git@05b2c186c1b6c9a08375389d5efe9cb4c401c075#subdirectory=python/triton_kernels
elif importlib.util.find_spec("unsloth") is None:
    !uv pip install -qqq unsloth
!uv pip install --upgrade --no-deps transformers==4.56.2 tokenizers trl==0.22.2 unsloth unsloth_zoo huggingface-hub==0.34.4 datasets==4.3.0 numpy==2.3.4 pandas==2.3.3 pyarrow==22.0.0 tqdm==4.67.1

Using Python 3.12.11 environment at: /home/zeus/miniconda3/envs/cloudspace
Resolved 11 packages in 55ms
Audited 11 packages in 0.26ms


In [None]:
#!uv pip install --upgrade --force-reinstall --no-cache-dir transformers==4.56.2 tokenizers trl==0.22.2 unsloth unsloth_zoo

In [None]:
#!uv pip install --upgrade --force-reinstall --no-cache-dir numpy==2.3.4 scipy scikit-learn pandas numba  statsmodels  joblib 

# SFT Config Parameters

The following parameters have the largest effect on fine-tuning speed, mainly through their impact on memory usage:

*Sequence length* is how many tokens the model sees per example, which directly determines the size of the attention computation. Standard attention needs O(N^2) time and memory in the sequence length N. FlashAttention still performs O(N^2) compute, but it avoids materializing the N x N score matrix, so its memory grows linearly with N.

*Adapter rank size*: LoRA/QLoRA adds two small matrices (A and B) to an existing weight matrix (W), where `W' = W + BA`.

`B` (dim: `d x r`) and `A` (dim: `r x k`) have rank `r << d,k`, drastically reducing the number of trainable parameters.

The rank directly determines how many new parameters are introduced and how much compute training needs.
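
To make the rank argument concrete, the sketch below counts the extra parameters LoRA adds to one `d x k` weight matrix, which is `r * (d + k)`. The attention projection shapes in the second calculation (hidden size 2880, query/output width 4096, key/value width 512, 24 layers) are assumptions about gpt-oss-20b, not values stated in this notebook. Under those assumptions, and with adapters landing only on the four attention projections, rank 32 reproduces the 15,925,248 trainable parameters that the PEFT cell reports further down.

```python
def lora_params(d: int, k: int, r: int) -> int:
    """Parameters added by LoRA to one d x k weight: B is d x r, A is r x k."""
    return r * (d + k)

# Generic example: one 4096 x 4096 projection.
full = 4096 * 4096
for r in (8, 32, 128):
    added = lora_params(4096, 4096, r)
    print(f"r={r:>3}: {added:>9,} extra params ({added / full:.2%} of the full matrix)")

# ASSUMED gpt-oss-20b attention shapes (out_features, in_features); not from this notebook.
ASSUMED_SHAPES = {
    "q_proj": (4096, 2880),
    "k_proj": (512, 2880),
    "v_proj": (512, 2880),
    "o_proj": (2880, 4096),
}
ASSUMED_LAYERS = 24
RANK = 32  # LORA_RANK in this notebook

per_layer = sum(lora_params(d, k, RANK) for d, k in ASSUMED_SHAPES.values())
total = per_layer * ASSUMED_LAYERS
print(f"{total:,} trainable LoRA parameters under the assumed shapes")
```

Doubling the rank doubles the adapter size, so the rank is a direct lever on both memory and compute.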

## Sequence length

✅ MAX_SEQ_LEN = 4096

Each sequence is truncated to at most 4096 tokens (the training context length); shorter sequences are padded or packed up to that length.
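
A rough back-of-the-envelope sketch of why sequence length dominates memory: naive attention materializes an `N x N` score matrix per head. The head count (64) and the 2-byte bf16 element size below are illustrative assumptions, and fused kernels such as FlashAttention avoid storing this matrix, so treat the numbers as the cost of the naive case only.

```python
def attention_scores_bytes(seq_len: int, num_heads: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to store the seq_len x seq_len score matrix for every head of one layer."""
    return seq_len * seq_len * num_heads * bytes_per_elem

ASSUMED_HEADS = 64  # illustrative assumption, not read from the model config

for n in (1024, 4096):
    mib = attention_scores_bytes(n, ASSUMED_HEADS) / 2**20
    print(f"N={n}: {mib:,.0f} MiB per sequence per layer")
```

Going from 1024 to 4096 tokens multiplies this term by 16, not by 4.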

## Batch size

Two different kinds of batch exist. When working with an LLM training pipeline, you'll encounter the dataloader batch and the training batch.

### Dataloader batch

Examples:

- PyTorch `DataLoader(batch_size=...)`
- Hugging Face `Dataset.map()` → rows grouped into batches
- Tokenizers producing a "batch of size X"

These batches are usually used for:
- faster preprocessing
- efficient tokenization
- efficient I/O

`dataset.map(preprocess, batched=True, batch_size=1000)  # CPU-side`

This "batch_size=1000" has nothing to do with the training batch!

Preprocessing batches do NOT necessarily equal the batch the model trains on.

### Training batch (micro-batch / per-GPU batch)

This is the batch size that:

- fits in GPU memory
- goes into model(input)
- is used in each forward/backward pass
- is included in the global batch calculation

This is what frameworks mean when they say micro_batch_size or per_device_train_batch_size.

per_device_train_batch_size = 4

✅ Each GPU receives 4 sequences per forward/backward pass (this is the micro-batch size).

gradient_accumulation_steps = 8

✅ The model does NOT update its weights after every micro-batch. It runs 8 forward passes and 8 backward passes, accumulates (adds together) all 8 gradients, and then performs 1 optimizer update. This lets you simulate a larger batch size without needing more GPU memory.
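
The equivalence between accumulation and a larger batch can be checked on a toy problem. The sketch below is plain Python, not the trainer's implementation: it fits `y = w * x` with a mean-squared-error loss and compares the gradient of one batch of 8 against the sum of 4 micro-batch gradients, each scaled by `1 / accumulation_steps`. All names and numbers are illustrative.

```python
def mse_grad(w: float, xs: list[float], ys: list[float]) -> float:
    """d/dw of mean((w*x - y)^2) over the given samples."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]  # the true weight is 2.0
w = 0.5

# One large batch of 8 samples.
full_batch_grad = mse_grad(w, xs, ys)

# The same 8 samples as 4 micro-batches of 2, accumulated before a single update.
micro_batch_size, accumulation_steps = 2, 4
accumulated = 0.0
for step in range(accumulation_steps):
    lo = step * micro_batch_size
    hi = lo + micro_batch_size
    # Scale each micro-batch gradient so the sum equals the full-batch mean.
    accumulated += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps

print(full_batch_grad, accumulated)
```

Equal-sized micro-batches make the two values match; with variable token counts per micro-batch the match is only approximate.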

So the actual global batch size, assuming 8 GPUs, is: 4 (micro) × 8 (accumulation) × 8 (GPUs) = 256
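
The arithmetic above, plus the optimizer step count it implies, can be sketched as follows. The rounding rule (a partial final accumulation window still counts as one step) is an assumption, chosen because it reproduces the two Unsloth banners that appear later in this notebook; other trainer versions may round differently.

```python
import math

def global_batch_size(micro_batch: int, accumulation_steps: int, num_gpus: int) -> int:
    """Sequences consumed per optimizer update."""
    return micro_batch * accumulation_steps * num_gpus

def total_optimizer_steps(num_examples: int, micro_batch: int, accumulation_steps: int,
                          num_gpus: int, epochs: int) -> int:
    """Optimizer updates for a full run, rounding partial batches up (assumed rule)."""
    micro_batches_per_epoch = math.ceil(num_examples / (micro_batch * num_gpus))
    updates_per_epoch = math.ceil(micro_batches_per_epoch / accumulation_steps)
    return updates_per_epoch * epochs

print(global_batch_size(4, 8, 8))                  # the example above
print(total_optimizer_steps(28_207, 4, 2, 1, 3))   # compare with the first banner (10,578)
print(total_optimizer_steps(8_312, 32, 32, 1, 3))  # compare with the second banner (27)
```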

# System prompt as the activation key for DPO

The success of the alignment relies on using an identical system prompt during both the data generation phase and production inference. This prompt is the conditional signal that activates the model's professional persona.

*Inject the system prompt into the custom DPO training data, NOT into the SFT stage.*

Injecting the prompt into SFT breaks things:

The prompt is a persona plus a behavioral rule set (tone, structure, domain focus). SFT's job is to teach the model skills (reasoning, instruction following) across many styles and tasks. If you prepend the persona to every training example you will:

*Overwrite diversity*: the model learns to always speak in that persona, even where it's inappropriate (stories, casual chat, step-by-step math explanations that need verbosity).

*Create optimization conflicts*: some datasets (GSM8K, OpenThoughts) require long, explicit reasoning. A "be concise" rule fights that, reducing reasoning quality.

*Remove the option to "activate" the persona at inference*: you lose modularity and controllability.

*Break DPO effectiveness*: DPO needs contrasts (bad vs. good); if SFT already forces the good persona everywhere, DPO has nothing to teach.

This is why a commonly used sequence is:

- SFT = learn how to answer (neutral instructions)
- DPO/RLHF = learn which answers are preferred (persona & tone)
- Runtime system prompt = activate persona when needed.
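
As an illustration of that split, the sketch below builds one neutral SFT record and one DPO preference record that carries the persona. The `prompt`/`chosen`/`rejected` message layout follows the conversational preference format that TRL's `DPOTrainer` accepts; the helper names, the shortened system prompt, and the sample texts are hypothetical.

```python
SYSTEM_PROMPT = "You are a highly professional, concise technical expert."  # shortened stand-in

def build_sft_record(instruction: str, response: str) -> dict:
    """Neutral SFT example: no system prompt, so the skill is learned persona-free."""
    return {
        "messages": [
            {"role": "user", "content": instruction},
            {"role": "assistant", "content": response},
        ]
    }

def build_dpo_record(instruction: str, chosen: str, rejected: str) -> dict:
    """Preference example: the system prompt conditions both completions."""
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": instruction},
        ],
        "chosen": [{"role": "assistant", "content": chosen}],
        "rejected": [{"role": "assistant", "content": rejected}],
    }

sft = build_sft_record("What is a B-tree?", "A B-tree is a balanced search tree ...")
dpo = build_dpo_record(
    "What is a B-tree?",
    chosen="A B-tree is a self-balancing search tree optimized for block storage.",
    rejected="Oh, great question! So, B-trees are, like, a kind of tree ...",
)
print([m["role"] for m in sft["messages"]])
print([m["role"] for m in dpo["prompt"]])
```

At inference time the same `SYSTEM_PROMPT` string would be sent as the system message, so the model sees exactly the conditioning it was aligned with.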



In [1]:
# ================================================
# 2️⃣ Basic Fine-Tune Config
# ================================================

# Define your custom system prompt
CUSTOM_SYSTEM_PROMPT = """\
You are a highly professional, concise technical expert across modern computing domains,
including software architecture, cloud infrastructure, data systems, machine learning, and applied AI.

Your task is to:
- Answer the user's question using the provided CONTEXT as your primary source.
- If the CONTEXT does not contain enough information, use your own knowledge,
  but clearly distinguish between context-based and general reasoning.

Your responses must be:
- Structured: use clear formatting and logical reasoning.
- Contextual: rely only on the information available.
- Concise: eliminate filler words while preserving precision.
- Aligned with industry best practices: modern, reproducible, and standards-based.
"""

# Retain the VRAM safety length
MAX_SEQ_LEN = 4096    
# MAX_SEQ_LEN = 1024

LORA_RANK = 32

# per_device_train_batch_size
DEVICE_BATCH = 32

# gradient_accumulation_steps
GRADIENT_ACCUMULATION = 32

LEARNING_RATE = 1.5e-4

OUTPUT_DIR = "gpt-oss-20b-sft-qlora-adapter"

#SFT_TEST_SIZE = 100 # Using 100 rows for a quick test run

# Understanding LLM training

## 4-Layer Universal LLM Training Stack: Conceptual

- LAYER 1: Model Definition/Architecture
  Define the model's structure: layers, attention mechanism, feedforward (dense) blocks, embeddings, hyperparameters. This is the abstract
  representation of the neural network.

- LAYER 2: Model Loading & Preparation
  Load the model parameters into memory and prepare them for training. Includes choosing which parts are trainable, adjusting
  precision, and integrating adapters if needed.

- LAYER 3: Training Loop
  Execute optimization: forward pass → compute loss → backward pass → optimizer step → update parameters. Handle batching,
  gradient accumulation, evaluation, and logging.

- LAYER 4: System & Distributed Backend
  Efficiently manage hardware and scaling: memory optimization, multi-GPU/multi-node coordination, data/tensor/pipeline parallelism,
  mixed precision, and offloading if necessary.

## 4-Layer Universal LLM Training Stack: Practical

> Layers 2–4 are highly framework-dependent, but the conceptual responsibilities remain the same.

- LAYER 1: Model Definition/Architecture
  - Choose model type: Transformer, RWKV, Mamba
  - Decide hyperparameters: layers, vocab size, hidden size, attention structure, attention heads, context length, feedforward design
  - Ensure the architecture matches the intended task (e.g., GPT-style for text generation)
  - For research, consider memory efficiency vs. expressivity

- LAYER 2: Model Loading & Preparation
  - Load pretrained weights (HF Transformers, Unsloth, raw PyTorch)
  - Apply LoRA / adapters for parameter-efficient tuning
  - Set dtype / quantization (fp16, bf16, 4-bit, 8-bit)
  - Freeze the base layers if using adapters
  - Ensure the checkpoint matches the architecture
  - Choose precision & device mapping based on GPU memory
  - Decide which parameters are trainable now vs. later

- LAYER 3: Training Loop
  - Use trainer frameworks: HF Trainer, TRL (for SFT / DPO / RLHF), Lightning, or raw PyTorch loops
  - Implement forward → loss → backward → optimizer step
  - Handle gradient accumulation, logging, evaluation
  - Choose a trainer based on flexibility vs. simplicity
  - Use mixed precision and gradient checkpointing if memory-limited
  - For RLHF, specialized trainers like TRL or trlx are recommended

- LAYER 4: System & Distributed Backend
  - Select a framework for scaling: Accelerate, FSDP, DeepSpeed, Colossal-AI
  - Configure memory optimization: ZeRO, offloading, sharding
  - Choose a parallelism strategy: data, tensor, pipeline
  - This layer handles scaling and hardware orchestration and makes training efficient at large scale
  - Start with single-GPU / small scale before multi-GPU
  - Understand how model weights are partitioned for large-scale training
  - Plan for checkpoints and resuming training across devices
  - The choice of framework depends on model size and hardware:
      - Single GPU / small model: HF Accelerate or Lightning
      - Medium multi-GPU model: FSDP / DeepSpeed ZeRO Stage 1-2
      - Huge multi-node model: DeepSpeed ZeRO Stage 3, Colossal-AI 3D parallelism
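
To make the Layer 4 framework choice less abstract, here is a rough per-GPU memory estimate for model states under the ZeRO stages. It uses the commonly quoted accounting for full fine-tuning with mixed-precision Adam (2 bytes of weights, 2 bytes of gradients, and 12 bytes of optimizer state per parameter) and ignores activations, buffers, and fragmentation, so the results are a lower bound rather than a sizing guarantee. The function name and the 20B-parameter, 8-GPU example are illustrative.

```python
def zero_model_state_gib(num_params: float, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU memory (GiB) for weights, gradients, and Adam states."""
    weights, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter (assumed mixed-precision Adam)
    if stage >= 1:
        optim /= num_gpus    # stage 1 shards the optimizer states
    if stage >= 2:
        grads /= num_gpus    # stage 2 also shards the gradients
    if stage >= 3:
        weights /= num_gpus  # stage 3 also shards the weights
    return num_params * (weights + grads + optim) / 2**30

for stage in (0, 1, 2, 3):
    print(f"ZeRO stage {stage}: {zero_model_state_gib(20e9, 8, stage):6.1f} GiB per GPU")
```

With LoRA or QLoRA only the adapter parameters carry gradients and optimizer state, which is a large part of why adapter training can fit on a single GPU where full fine-tuning cannot.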


# Raw PyTorch code snippet
```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# -------------------------------
# Layer 1: Model Definition
# -------------------------------
class SimpleDecoder(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)
    
    def forward(self, x):
        x = self.embedding(x)
        out, _ = self.rnn(x)
        return self.fc(out)

# -------------------------------
# Layer 2: Model Loading & Preparation
# -------------------------------
model = SimpleDecoder()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Dummy dataset
data = torch.randint(0, 1000, (64, 20))
target = torch.randint(0, 1000, (64, 20))
dataset = TensorDataset(data, target)
loader = DataLoader(dataset, batch_size=8)

# -------------------------------
# Layer 3: Training Loop
# -------------------------------
model.train()
for epoch in range(2):
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        logits = model(x)
        loss = criterion(logits.view(-1, logits.size(-1)), y.view(-1))
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch} Loss: {loss.item():.4f}")

# -------------------------------
# Layer 4: System / Distributed Backend
# -------------------------------
# Single GPU here. For multi-GPU, torch.nn.parallel.DistributedDataParallel
# (launched with torchrun) is generally preferred; nn.DataParallel is the
# simpler single-process option:
# model = nn.DataParallel(model)
```

# Hugging Face Transformers code snippet
```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# -------------------------------
# Layer 1: Model Definition
# -------------------------------
model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# -------------------------------
# Layer 2: Model Loading & Preparation
# -------------------------------
# Trainer moves the model to the GPU itself, so no explicit .to("cuda") is needed.
# Optionally, freeze the lower blocks (distilgpt2 has 6 blocks; freeze the first 3)
for param in model.transformer.h[:3].parameters():
    param.requires_grad = False

# Prepare dataset
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
dataset = dataset.filter(lambda row: row["text"].strip() != "")  # drop blank lines
def tokenize(batch):
    # No padding here: the collator pads each batch dynamically
    return tokenizer(batch["text"], truncation=True, max_length=64)
dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])

# mlm=False means causal LM: the collator pads and copies input_ids into labels,
# which Trainer needs in order to compute a loss
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# -------------------------------
# Layer 3: Training Loop (HF Trainer)
# -------------------------------
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    fp16=torch.cuda.is_available(),  # fp16 mixed precision requires a GPU
    logging_steps=10,
    report_to="none",
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=data_collator,
)
trainer.train()

# -------------------------------
# Layer 4: System / Distributed Backend
# -------------------------------
# HF Trainer uses Accelerate under the hood. For multi-GPU data parallelism,
# launch the script with `accelerate launch` or `torchrun`;
# CUDA_VISIBLE_DEVICES=0,1,2 selects which GPUs are visible to the process.

```

# Unsloth-style code snippet (plain PEFT LoRA)
```python
# Note: this snippet uses Hugging Face PEFT directly, not the Unsloth API.
# The actual Unsloth calls (FastLanguageModel) appear in the cells below.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model
from datasets import load_dataset
from torch.utils.data import DataLoader

# -------------------------------
# Layer 1: Model Definition
# -------------------------------
model_name = "meta-llama/Llama-2-7b-hf"  # gated model: requires accepted license and HF login
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama-2 defines no pad token
# bf16 halves the weight memory compared with the fp32 default
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# -------------------------------
# Layer 2: Model Loading & Preparation
# -------------------------------
# Prepare LoRA adapters
lora_config = LoraConfig(
    r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model = model.to("cuda")

# Small dataset slice for a quick run
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
dataset = dataset.filter(lambda row: row["text"].strip() != "")  # drop blank lines
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)
dataset = dataset.map(tokenize, batched=True)
dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])
loader = DataLoader(dataset, batch_size=2)

# -------------------------------
# Layer 3: Training Loop
# -------------------------------
# Only the LoRA adapter weights require gradients
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)
model.train()
for epoch in range(1):
    for batch in loader:
        input_ids = batch["input_ids"].to("cuda")
        attention_mask = batch["attention_mask"].to("cuda")
        # -100 tells the loss to ignore padded positions
        labels = input_ids.masked_fill(attention_mask == 0, -100)
        optimizer.zero_grad()
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch} Loss: {loss.item():.4f}")

# -------------------------------
# Layer 4: System / Distributed Backend
# -------------------------------
# For larger models, integrate with DeepSpeed / FSDP
# Unsloth / PEFT is memory efficient due to LoRA adapters

```

In [1]:
# ================================================
# 3️⃣ Load FastLanguageModel + Tokenizer
# ================================================

from unsloth import FastLanguageModel, is_bfloat16_supported
import torch

print(is_bfloat16_supported())

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/gpt-oss-20b-unsloth-bnb-4bit", # 20B model using bitsandbytes 4bit quantization
    "unsloth/gpt-oss-120b-unsloth-bnb-4bit",
    "unsloth/gpt-oss-20b", # 20B model using MXFP4 format
    "unsloth/gpt-oss-120b",
] # More models at https://huggingface.co/unsloth

dtype=None

# Unsloth recommended: returns both model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gpt-oss-20b",
    
    # None auto-detects the dtype (bfloat16 on GPUs that support it, e.g. Ampere or Hopper)
    dtype=dtype, 
    
    # Maximum sequence length in tokens that the model processes per forward/backward pass
    max_seq_length = MAX_SEQ_LEN,

    # 4 bit quantization to reduce memory
    load_in_4bit = True,
    
    # False means with QLoRA/LoRA
    # [NEW!] unsloth have full finetuning now!
    full_finetuning = False,
    
    # token = "hf_...",              # use one if using gated models
)

print("\n✅ FastLanguageModel + tokenizer loaded successfully")


NotImplementedError: Unsloth cannot find any torch accelerator? You need a GPU.

In [6]:
# ================================================
# 4️⃣ Load Dataset, Split Dataset Train/Validation
# ================================================

from datasets import load_dataset


dataset_path = "./train_sft_final.jsonl"
raw_dataset = load_dataset("json", data_files={"train": dataset_path})

full_dataset = raw_dataset["train"]

# for small dataset smoke test on T4 
# full_dataset = full_dataset.select(range(100))

print(f"\n✅ Total samples: {len(full_dataset)}")
print(f"\n✅ Inspect the first entry of the data:\n\n {full_dataset[0]}")


# 95% train, 5% validation
split_dataset = full_dataset.train_test_split(test_size=0.05, seed=42)
train_dataset = split_dataset["train"]
val_dataset = split_dataset["test"]

print(f"\n✅ Train samples: {len(train_dataset)}")
print(f"\n✅ Validation samples: {len(val_dataset)}")

def inspect_message_with_chat_template(example, tokenizer):
    messages = [
        #{"role": "system", "content": CUSTOM_SYSTEM_PROMPT},
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["response"]},
    ]
    formatted_text = tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
    print("-" * 50)
    print("\n✅ Inspect data after applying the chat template\n")
    print(formatted_text[:500])
    print("-" * 50)
    
inspect_message_with_chat_template(train_dataset[0], tokenizer)
inspect_message_with_chat_template(val_dataset[0], tokenizer)



✅ Total samples: 29692

✅ Inspect the first entry of the data:

 {'instruction': 'Compile a visually appealing list of at least ten distinctive dips and gravies that can be paired with a range of dishes prepared on your electric griddle. Make sure to include both sweet and savory options, as well as dips that cater to different dietary restrictions such as vegan, gluten-free, or low-fat. Additionally, provide brief descriptions of each dip or gravy highlighting its key ingredients, flavor profile, and suggested griddle dishes to accompany it.', 'response': '1. Caramelized Onion and Garlic Dip - This savory dip pairs perfectly with breakfast dishes such as eggs, bacon, and pancakes or can be used as a topping for burgers and sandwiches. It is made with caramelized onions, garlic, cream cheese, and sour cream, and has a sweet and tangy flavor.\n \n2. Spicy Avocado Dip - This vegan option is perfect for those looking for a healthy dip option. Made with ripe avocados, jalapenos, lime 

# How messages are translated into tokens and fed to the model, and why pre-tokenized data can fail

## Principle: keep tokenization responsibilities consistent
If you want the trainer to handle packing, batching, padding-free training, etc., give it raw "text" strings and set dataset_text_field="text" in your SFTConfig. This is the easiest path and avoids the shape errors you saw.

If you pre-tokenize, return Python lists (lists of input_ids and attention masks) and do not return PyTorch tensors from map. Also decide whether you pad or not: if you plan to use packing, do not pad during preprocessing; leave padding to the trainer.

## Option A: Let the trainer tokenize (HIGHLY RECOMMENDED)
Why this is safe: TRL/Unsloth tokenizes the text column itself, in a consistent, batch-wise manner, and manages packing/padding in the way expected by the model/attention implementation. No tensors leak into the dataset; no shape surprises.

## Option B: Pre-tokenize correctly (ADVANCED)
If you must pre-tokenize (e.g., offline processing, caching), do it this way:

Use tokenize=False on apply_chat_template to get strings.

*Tokenize the batch in a vectorized call: tokenizer(texts, truncation=True, padding=False, return_attention_mask=True, return_tensors=None). return_tensors must be None so the HF dataset gets Python lists.*

*Do not use return_tensors="pt" in map.*

If you plan to use packing later, set padding=False and truncation=True (or False if you filtered by length earlier). The trainer can then pack the sequences.
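
For intuition about what packing does with unpadded token lists, here is a simplified sketch that concatenates sequences, each followed by an end-of-sequence id, and cuts the stream into fixed-length blocks. It is not TRL's or Unsloth's actual packing code, and the `eos_id` and token ids are made up; production implementations may additionally track sequence boundaries so that attention does not cross between packed examples.

```python
def pack_sequences(sequences: list[list[int]], block_size: int, eos_id: int) -> list[list[int]]:
    """Concatenate token lists (each followed by eos_id) and split into full blocks."""
    stream: list[int] = []
    for seq in sequences:
        stream.extend(seq)
        stream.append(eos_id)
    # Drop the trailing remainder that does not fill a whole block.
    num_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(num_blocks)]

examples = [[11, 12, 13], [21, 22], [31, 32, 33, 34, 35], [41]]
packed = pack_sequences(examples, block_size=4, eos_id=0)
print(packed)
```

Four variable-length examples become three dense blocks with no padding tokens. The same effect would be consistent with the later training banner reporting 8,312 examples while the training split holds 28,207 rows.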

## Debugging checklist (TBD)

### After map, inspect dataset sample types:

```python
sample = train_dataset[0]
print(type(sample["input_ids"]), isinstance(sample["input_ids"][0], int), len(sample["input_ids"]))
print(type(sample["attention_mask"]), isinstance(sample["attention_mask"][0], int))
```
You should see `list`, and the first element should be an `int` (not a torch.Tensor and not a nested list of lists).

### Inspect the collated batch shape used by the trainer (simulate a collate):
```python
from transformers import DataCollatorWithPadding

# Unpadded samples have different lengths, so a padding collator is needed here;
# default_data_collator cannot stack ragged lists. Requires tokenizer.pad_token to be set.
collator = DataCollatorWithPadding(tokenizer)
keys = ("input_ids", "attention_mask")
batch = [{k: train_dataset[i][k] for k in keys} for i in range(4)]
collated = collator(batch)
print({k: (type(v), getattr(v, "shape", None)) for k, v in collated.items()})
```
This should show input_ids/attention_mask as Torch tensors of shape (batch, seq_len).

### Quick model forward sanity check (very small test):
```python
import torch

batch = collator([{k: train_dataset[i][k] for k in keys} for i in range(2)])
# The collator already returns tensors; just move them to the model's device
inputs = {k: v.to(model.device) for k, v in batch.items() if k in keys}
with torch.no_grad():
    out = model(**inputs)
# A causal LM returns logits of shape (batch, seq_len, vocab_size)
print("logits:", out.logits.shape)
```



In [1]:
# ================================================
# 5️⃣ Tokenize Datasets with chat template applied
# ================================================

def tokenize_fn_old(example, tokenizer):
    
    messages = [
        {"role": "system", "content": CUSTOM_SYSTEM_PROMPT},
        {"role": "user", "content": example.get("instruction", "")},
        {"role": "assistant", "content": example.get("response", "")},
    ]

    tokenized_chat_wrapped = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=False,
        tokenize=True,
    )

    #return tokenized_chat_wrapped
    # Return a dictionary so Hugging Face can build an Arrow table
    return {"input_ids": tokenized_chat_wrapped, 
            "attention_mask": [1] * len(tokenized_chat_wrapped)}


def tokenize_fn_problem(batch, tokenizer):
    # build texts
    texts = [
        tokenizer.apply_chat_template(
            [
                {"role": "system", "content": CUSTOM_SYSTEM_PROMPT},
                {"role": "user", "content": instr},
                # {"role": "assistant", "content": resp},
            ],
            tokenize=False,
            add_generation_prompt=False,
        )
        for instr, resp in zip(batch["instruction"], batch["response"])
    ]

    # vectorized tokenizer call
    tokenized = tokenizer(
        texts,
        #truncation=True,
        #padding="max_length",   # or padding=False to let Trainer handle dynamic padding
        #padding_side = "right",
        truncation=False,  # <--- CHANGED: Set to False
        padding=False,     # <--- CHANGED: Set to False
        #max_length=MAX_SEQ_LEN,
        return_attention_mask=True,
        return_tensors=None,    # keep Python lists, HF Dataset friendly
    )

    return {
        "input_ids": tokenized["input_ids"],
        "attention_mask": tokenized["attention_mask"]
    }


def tokenize_fn(batch):
    # build texts
    texts = [
        tokenizer.apply_chat_template(
            [
                # {"role": "system", "content": CUSTOM_SYSTEM_PROMPT},
                {"role": "user", "content": instr},
                {"role": "assistant", "content": resp},
            ],
            tokenize=False,
            add_generation_prompt=False,
        )
        for instr, resp in zip(batch["instruction"], batch["response"])
    ]

    return { "text" : texts, }


from unsloth.chat_templates import standardize_sharegpt

train_dataset = train_dataset.map(tokenize_fn, batched = True)

val_dataset = val_dataset.map(tokenize_fn, batched = True)

# map() passes only the batch to the function, so extra arguments such as the
# tokenizer must be bound with a lambda (or fn_kwargs). Older variant, kept for reference:
#train_dataset = train_dataset.map(
#    lambda x: tokenize_fn(x, tokenizer),
#    remove_columns=train_dataset.column_names,
#    num_proc=4, # Use multiple cores for fast processing
#    desc="Mapping dataset for SFT train"
#)
#val_dataset = val_dataset.map(
#    lambda x: tokenize_fn(x, tokenizer),
#    remove_columns=val_dataset.column_names,
#    num_proc=4, # Use multiple cores for fast processing
#    desc="Mapping dataset for SFT validation"
#)   

print("\n✅ Chat template applied")

#sample = val_dataset[0]
#print("input_ids (first 1 tokens):", sample["input_ids"][:1])
#print("attention_mask (first 1 tokens):", sample["attention_mask"][:1])
print(val_dataset)

NotImplementedError: Unsloth cannot find any torch accelerator? You need a GPU.

In [8]:
# ================================================
# 6️⃣ PEFT settings
# ================================================
model = FastLanguageModel.get_peft_model(
    model,
    r = LORA_RANK, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 2*LORA_RANK,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

print("--- 1. Model and Adapter Check ---")
# get_nb_trainable_parameters() returns a tuple: (trainable_params, all_params)
print(f"\n✅Base Model Parameters: {model.num_parameters()}\n (Trainable: {model.get_nb_trainable_parameters()})\n")

Unsloth: Making `model.base_model.model.model` require gradients
--- 1. Model and Adapter Check ---

✅Base Model Parameters: 20930682432
 (Trainable: (15925248, 20930682432))
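
One consequence of combining `lora_alpha = 2*LORA_RANK` with `use_rslora = True` is worth spelling out. Classic LoRA scales the adapter update by `lora_alpha / r`, while rank-stabilized LoRA scales it by `lora_alpha / sqrt(r)`. The helper below is an illustrative calculation, not part of the Unsloth or PEFT API.

```python
import math

def lora_scaling(lora_alpha: float, r: int, use_rslora: bool) -> float:
    """Multiplier applied to the low-rank update B @ A."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r

LORA_RANK = 32  # same value as the config cell
alpha = 2 * LORA_RANK

print(f"classic LoRA: {lora_scaling(alpha, LORA_RANK, False):.2f}")
print(f"rsLoRA:       {lora_scaling(alpha, LORA_RANK, True):.2f}")
```

With rsLoRA the `2 * r` heuristic therefore produces a noticeably larger multiplier than under classic LoRA, so the learning rate or alpha may need retuning if training turns out to be unstable.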



The following Unsloth trainer setup runs, **BUT the batch size is not what was configured.**

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 28,207 | Num Epochs = 3 | Total steps = 10,578
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 2 x 1) = 8
 "-____-"     Trainable parameters = 15,925,248 of 20,930,682,432 (0.08% trained)
Unsloth: Will smartly offload gradients to save VRAM!

I set per_device_train_batch_size = 64 and gradient_accumulation_steps = 4 in the training arguments, but Unsloth still reports a batch of 4 per device and 2 accumulation steps:

Batch size per device = 4 | Gradient accumulation steps = 2
Total batch size (4 x 2 x 1) = 8

```python
from unsloth.trainer import SFTTrainer
from unsloth.trainer import SFTTrainingArguments

# set attention implementation **after loading**
model.config.attn_implementation = "flash_attention_2"

# Create SFTTrainingArguments object
training_args = SFTTrainingArguments(
    output_dir=OUTPUT_DIR,
    max_seq_length=MAX_SEQ_LEN,
    per_device_train_batch_size=64,   # micro-batch
    gradient_accumulation_steps=4,    # effective batch = 256
    num_train_epochs=3,
    learning_rate=LEARNING_RATE,
    logging_steps=10,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=3,
    bf16=True,
    fp16=False,
    optim="paged_adamw_32bit",
    dataloader_num_workers=12,
    evaluation_strategy="steps",
    eval_steps=100,
    report_to="none",
)

# Initialize SFTTrainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    peft_config=model.peft_config
)

# Train
trainer.train()
```


# Explanation from ChatGPT (NOT verified)

Short diagnosis: why "Batch size per device = 4" appears even though 64 was requested.

Unsloth may override or adjust the micro-batch in certain situations:

It has internal safety defaults (a common default micro-batch is 1–4) and may clamp the requested micro-batch to something that fits the runtime/packaging/num_generations constraints. Unsloth docs and issues recommend small defaults (e.g. per_device_train_batch_size = 2 is often recommended).

per_device_train_batch_size might not actually reach Unsloth's trainer if the arguments are passed with the wrong calling style. The SFTTrainer wrapper expects an args object (passed as args=), not **trainer_args unpacked into the SFTTrainer constructor (that raises unexpected-keyword errors). If the trainer args were passed incorrectly earlier, the trainer used default values. (Errors such as an unexpected output_dir keyword appeared when using **trainer_args.)

num_generations constraint: Unsloth has code that forces per_device_train_batch_size to be a multiple of num_generations (per the issue/commit wording, it changes the batch to num_generations if it doesn't match). If per_device_train_batch_size is not aligned, it may be changed to some lower multiple. Note that num_generations is a GRPO setting, so this point may not apply to SFT.

Memory / quantization / PEFT interactions: depending on full_finetuning vs. QLoRA vs. LoRA, bf16/fp16 settings, and use_gradient_checkpointing, the trainer may reduce the micro-batch size to avoid OOM. Unsloth prints "Will smartly offload gradients to save VRAM!", meaning it is actively making tradeoffs.

Net: either the args were not passed in the correct API shape, or Unsloth intentionally lowered the micro-batch for safety/compatibility (or both).


Evidence & sources

Unsloth fine-tuning guide / defaults: a small per_device_train_batch_size is recommended, combined with gradient accumulation.

GitHub issues referencing per_device_train_batch_size being changed by Unsloth compatibility constraints and the num_generations behavior.




In [None]:
# ================================================
# 7️⃣ Training Arguments
# ================================================

from trl import SFTConfig, SFTTrainer
# set attention implementation **after loading**
#model.config.attn_implementation = "flash_attention_2"



training_args = SFTConfig(
    # TRL-Specific Args
    max_seq_length=MAX_SEQ_LEN,
    packing=True,                  # 🚀 CRITICAL for Unsloth/Flash Attention efficiency
    dataset_text_field="text",     # The column containing the formatted data

    # Core Training Args (Batching, Learning)
    output_dir=OUTPUT_DIR,
    per_device_train_batch_size=DEVICE_BATCH,           # micro-batch per GPU (32)
    gradient_accumulation_steps=GRADIENT_ACCUMULATION,  # effective batch = 32 x 32 x 1 GPU = 1,024
    num_train_epochs=3,
    learning_rate=LEARNING_RATE,
    optim="paged_adamw_32bit",     # Recommended optimizer for QLoRA

    # Logging and Saving
    logging_steps=5,
    save_strategy="steps",
    save_steps=30,
    save_total_limit=3,
    
    # Precision (auto-detects bfloat16 if hardware supports it)
    bf16=is_bfloat16_supported(), 
    fp16=not is_bfloat16_supported(),
    
    # Evaluation
    eval_strategy="steps",
    eval_steps=30,
    report_to="none",
)

# Initialize SFTTrainer
trainer = SFTTrainer(
    model=model,
    
    args=training_args,
    
    train_dataset=train_dataset,
    
    eval_dataset=val_dataset,
    
    peft_config=None,            # LoRA already applied

    formatting_func=None         # Optional: custom formatting
)

import inspect
print(inspect.signature(SFTTrainer.__init__))

# Train
trainer.train()




(self, model, args=None, data_collator=None, train_dataset=None, eval_dataset=None, processing_class=None, compute_loss_func=None, compute_metrics=None, callbacks=None, optimizer_cls_and_kwargs=None, preprocess_logits_for_metrics=None, peft_config=None, formatting_func=None, **kwargs)


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 8,312 | Num Epochs = 3 | Total steps = 27
O^O/ \_/ \    Batch size per device = 32 | Gradient accumulation steps = 32
\        /    Data Parallel GPUs = 1 | Total batch size (32 x 32 x 1) = 1,024
 "-____-"     Trainable parameters = 15,925,248 of 20,930,682,432 (0.08% trained)


Unsloth: Will smartly offload gradients to save VRAM!


In [None]:
# ================================================
# 8️⃣ Save Fine-Tuned Model
# ================================================
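import os

# OUTPUT_DIR is defined in the config cell (the lowercase output_dir was never defined)
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)

print(f"SFT model with validation saved to {OUTPUT_DIR}")
# Read the Hugging Face token from the environment instead of hard-coding it in the notebook.
# If HF_TOKEN is unset, token=None falls back to the cached `huggingface-cli login` credentials.
model.push_to_hub("ospost/gpt-oss-20b-sft-qlora-adapter", token = os.environ.get("HF_TOKEN")) # Save to HF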

