# Qwen adapters: Soft Prompt → LoRA SFT → KL‑anchored SFT 

This workshop notebook uses the open, commercially usable Databricks Dolly 15k dataset (`databricks/databricks-dolly-15k`).

Run order: Setup → 1) Data → 2) Template → 3) KL toy → 4) Soft Prompt → 5) LoRA SFT → 6) KL‑anchored SFT → 7) Inference



### Workshop Goals
- Show minimal, runnable examples of three adapter techniques: Soft Prompt token tuning, LoRA SFT, and KL-anchored SFT on an open dataset (Dolly 15k).
- Emphasize how to preprocess with chat templates, why/when to use Soft Prompt vs LoRA, and how a small KL penalty anchors behavior to the base model.
- Keep compute small: tiny instruct model, small sample sizes, short training with frequent eval to illustrate loss going down.


In [1]:
# If running on a fresh environment, you may need:
# !pip install -U transformers accelerate datasets trl peft bitsandbytes einops pandas safetensors

import os, math, json, random, logging
from pathlib import Path
import torch

# Global switches
USE_4BIT  = False   # Enable if bitsandbytes is available and you have a GPU
SEED      = 42
random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

# Choose a small instruct model to keep things light.
# You can switch to Qwen3 if you have a larger GPU (e.g., 'Qwen/Qwen3-1.7B').
MODEL_ID = "HuggingFaceTB/SmolLM-135M-Instruct"
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
from datasets import disable_caching
disable_caching()


PyTorch: 2.9.0+cpu
CUDA available: False




In [2]:
# Load Dolly 15k and render to `text` using the tokenizer's chat template.
# We keep only the `text` field to be consumed by the trainer/tokenizer.
from typing import Optional, List, Dict
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from datasets import load_dataset, Dataset

def load_tokenizer():
    return AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

def load_base_model():
    if USE_4BIT:
        bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
        return AutoModelForCausalLM.from_pretrained(
            MODEL_ID, trust_remote_code=True, quantization_config=bnb, device_map="auto"
        )
    return AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True, device_map="auto")

def to_chatml(messages: List[Dict[str,str]], tok):
    return tok.apply_chat_template(messages, tokenize=False)

def build_messages(instruction: str, context: str, response: str) -> List[Dict[str,str]]:
    sys = "You are a helpful assistant."
    if context:
        user = f"Instruction: {instruction}\nContext: {context}"
    else:
        user = f"Instruction: {instruction}"
    return [
        {"role": "system", "content": sys},
        {"role": "user",   "content": user},
        {"role": "assistant", "content": (response or "")}
    ]

def load_dolly_as_sft(tok, max_samples: Optional[int] = None) -> Dataset:
    ds = load_dataset("databricks/databricks-dolly-15k", split="train")
    if max_samples is not None:
        ds = ds.select(range(min(int(max_samples), len(ds))))

    def to_text(ex):
        msgs = build_messages(ex.get("instruction", ""), ex.get("context", ""), ex.get("response", ""))
        return {"text": to_chatml(msgs, tok)}

    ds = ds.map(to_text, remove_columns=ds.column_names)
    return ds

def split_small(ds, test_size=0.1, seed=42):
    """Deterministically shuffle `ds` and split it into (train, test) datasets."""
    n = len(ds)
    idx = list(range(n))
    # A local Random instance keeps the split reproducible without touching global state.
    random.Random(seed).shuffle(idx)
    cut = max(1, int(n * (1 - test_size)))
    train_idx, test_idx = idx[:cut], idx[cut:]
    train_texts = [ds[i]['text'] for i in train_idx]
    test_texts  = [ds[i]['text'] for i in test_idx]
    return Dataset.from_dict({'text': train_texts}), Dataset.from_dict({'text': test_texts})


## 1) Data — Databricks Dolly 15k (open source)


#### Dataset: Databricks Dolly 15k
- Open, instruction-following dataset. We use a small subset to keep runs fast.
- Mapping: we render each (instruction, optional context, response) as chat messages and apply the model's chat template to produce `text`.
- Always preprocess with `apply_chat_template` to keep training/inference consistent.
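To make the mapping concrete, here is a hand-rolled sketch of the string produced for one Dolly row. The `render_chatml` helper is hypothetical: it only mimics the `<|im_start|>`/`<|im_end|>` layout visible in the preview output of this notebook. Real preprocessing should keep calling `tok.apply_chat_template`, since templates differ between models.

```python
from typing import Dict, List

def build_messages(instruction: str, context: str, response: str) -> List[Dict[str, str]]:
    """Same mapping as the notebook: optional context is appended to the user turn."""
    user = f"Instruction: {instruction}"
    if context:
        user += f"\nContext: {context}"
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": user},
        {"role": "assistant", "content": response or ""},
    ]

def render_chatml(messages: List[Dict[str, str]]) -> str:
    """Hypothetical stand-in for the tokenizer's chat template (ChatML-style layout)."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

# One row with an empty context, so the "Context:" line is omitted.
row = {"instruction": "Which is a species of fish? Tope or Rope", "context": "", "response": "Tope"}
text = render_chatml(build_messages(row["instruction"], row["context"], row["response"]))
print(text)
```

Rendering by hand like this is useful only for inspection; training and inference should share the tokenizer's own template so the special tokens match.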


In [3]:
tok = load_tokenizer()
preview = load_dolly_as_sft(tok, max_samples=3)
for i, t in enumerate(preview['text']):
    print(f'Example {i+1}')
    print(t[:200])
    print('---')

Example 1
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Instruction: When did Virgin Australia start operating?
Context: Virgin Australia, the trading name of Virgin Australia Airli
---
Example 2
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Instruction: Which is a species of fish? Tope or Rope<|im_end|>
<|im_start|>assistant
Tope<|im_end|>

---
Example 3
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Instruction: Why can camels survive for long without water?<|im_end|>
<|im_start|>assistant
Camels use the fat in their humps
---





## 2) Template practice — render chat with the tokenizer


#### Chat Template
- The tokenizer's chat template inserts system/user/assistant markers and special tokens.
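- For supervised fine-tuning, include the assistant answer in `text` so the assistant tokens are among the training targets. With the plain `DataCollatorForLanguageModeling(mlm=False)` used in this notebook, the loss covers every non-padding token, including the system and user turns, not only the assistant answer.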
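If the loss should be restricted to the assistant turn, one common approach is to set the labels of all prompt positions to `-100`, the default `ignore_index` of PyTorch's cross-entropy loss. The sketch below is an illustration only and is not what this notebook does; `mask_prompt_labels` is a hypothetical helper, and `prompt_len` is assumed to come from tokenizing the prompt part separately (for example, the messages without the answer, rendered with `add_generation_prompt=True`).

```python
from typing import List

IGNORE_INDEX = -100  # label value that PyTorch cross-entropy skips by default

def mask_prompt_labels(input_ids: List[int], prompt_len: int) -> List[int]:
    """Copy input_ids into labels, hiding the first `prompt_len` positions from the loss."""
    prompt_len = max(0, min(prompt_len, len(input_ids)))
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])

# Toy ids: the first 5 stand for system + user + assistant header, the last 3 for the answer.
input_ids = [11, 12, 13, 14, 15, 21, 22, 23]
labels = mask_prompt_labels(input_ids, prompt_len=5)
print(labels)  # [-100, -100, -100, -100, -100, 21, 22, 23]
```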


In [4]:
from transformers import AutoTokenizer
tok = load_tokenizer()
messages = [
    {"role":"system","content":"You are a helpful assistant."},
    {"role":"user","content":"Give three bullet tips for staying focused when studying."},
    {"role":"assistant","content":"1) Set a clear goal. 2) Use short sprints. 3) Remove distractions."},
]
rendered = tok.apply_chat_template(messages, tokenize=False)
print(rendered[:500])

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Give three bullet tips for staying focused when studying.<|im_end|>
<|im_start|>assistant
1) Set a clear goal. 2) Use short sprints. 3) Remove distractions.<|im_end|>



## 3) KL divergence — tiny 3–5 token example
We estimate token‑wise KL as KL(P‖Q) = ∑ p_i · (log p_i − log q_i).


### KL Divergence - Intuition
- Measures how one token distribution (student) diverges from a reference (base).
- For next-token training, we add a small KL term per position: KL(P||Q) = sum p_i * (log p_i - log q_i).
- Keeping KL small helps retain base model behavior while adapting.


#### KL Toy Example Notes
- We compute entropy `H(P)`, cross-entropy `H(P,Q)`, and KL `KL(P||Q)` and print per-token contributions.
- Positive `KL_i` means the student over-weights a token vs reference; negative means under-weights.
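Two properties are worth checking by hand before adding KL to a loss: the identity H(P,Q) = H(P) + KL(P||Q), and the fact that KL is not symmetric. The following is a minimal pure-Python sketch using the same two distributions as the toy cell; the helper names are illustrative.

```python
import math

def entropy(p):
    """H(P) = -sum p_i log p_i, in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P,Q) = -sum p_i log q_i, in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """KL(P||Q) = sum p_i (log p_i - log q_i), in nats."""
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)

p = [0.70, 0.10, 0.15, 0.05]  # student
q = [0.60, 0.20, 0.15, 0.05]  # reference

kl_pq = kl(p, q)
kl_qp = kl(q, p)
print(f"KL(P||Q)={kl_pq:.6f}  KL(Q||P)={kl_qp:.6f}")
print(f"H(P)+KL(P||Q)={entropy(p) + kl_pq:.6f}  H(P,Q)={cross_entropy(p, q):.6f}")
```

Individual terms `KL_i` can be negative, but the sum is always non-negative, and swapping the arguments gives a different value, which is why the direction (student first, reference second) matters.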


In [5]:
import torch
tokens = ['A','B','C','D']
# Two categorical distributions over the same 4 tokens
p = torch.tensor([0.70, 0.10, 0.15, 0.05], dtype=torch.float64)  # student
q = torch.tensor([0.60, 0.20, 0.15, 0.05], dtype=torch.float64)  # reference

def dkl(p, q):
    return torch.sum(p * (torch.log(p) - torch.log(q)))

# Per-token contributions
logp = torch.log(p)
logq = torch.log(q)
H_p_i  = -p * logp
H_pq_i = -p * logq
KL_i   = p * (logp - logq)

H_p  = H_p_i.sum().item()
H_pq = H_pq_i.sum().item()
KL   = KL_i.sum().item()

print('Token  p        q        -p*log p  -p*log q   KL_i')
for i, t in enumerate(tokens):
    print('{:>5}  {:.6f}  {:.6f}  {:.6f}  {:.6f}  {:.6f}'.format(t, p[i].item(), q[i].item(), H_p_i[i].item(), H_pq_i[i].item(), KL_i[i].item()))

print('H(P)={:.6f}, H(P,Q)={:.6f}, KL(P||Q)={:.6f}'.format(H_p, H_pq, KL))


Token  p        q        -p*log p  -p*log q   KL_i
    A  0.700000  0.600000  0.249672  0.357578  0.107905
    B  0.100000  0.200000  0.230259  0.160944  -0.069315
    C  0.150000  0.150000  0.284568  0.284568  0.000000
    D  0.050000  0.050000  0.149787  0.149787  0.000000
H(P)=0.914286, H(P,Q)=0.952876, KL(P||Q)=0.038591


## 4) Soft prompt tuning (Prompt Tuning) — no base weight updates
We train a small set of virtual tokens. Start with few steps/epochs for speed.


### Soft Prompt Token Tuning — What and Why
- Learns a small set of virtual tokens prepended to every input; base weights stay frozen.
- Good for steering style/format quickly with tiny memory and training cost.
- We initialize with a short instruction text and optimize the prompt embeddings.
- Below we show pre-train loss/perplexity, then train with eval every few steps to show the curve.


### Soft Prompt Token Tuning (Prompt Tuning)
- Learns `num_virtual_tokens` embeddings prepended to every input; base model weights are frozen.
- Great for style/format steering or small task bias with minimal memory/latency overhead.
- Key knobs: `num_virtual_tokens` (capacity), LR (~5e-3 typical), sequence length.
- We evaluate every few steps to show the validation loss curve trending down.
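To get a feel for the "capacity" knob, the trainable parameter count of prompt tuning is simply the number of virtual tokens times the embedding width. The sketch below assumes a hidden size of 576 for the default model; that value is an assumption and should be checked against `model.config.hidden_size`.

```python
def prompt_tuning_params(num_virtual_tokens: int, hidden_size: int) -> int:
    """Trainable parameters: one embedding vector per virtual token."""
    return num_virtual_tokens * hidden_size

HIDDEN_SIZE = 576  # assumption: hidden size of the default model; verify with model.config.hidden_size

for n in (8, 32, 128):
    print(f"num_virtual_tokens={n:>3} -> {prompt_tuning_params(n, HIDDEN_SIZE):,} trainable parameters")
```

Each virtual token also occupies one position in the context window, so a larger `num_virtual_tokens` leaves less room for the actual input at a fixed `max_length`.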


In [6]:
# Configure Soft Prompt: number of virtual tokens, init from a short text,
# and ensure task type matches causal LM.
from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

tok = load_tokenizer()
base = load_base_model()

ds = load_dolly_as_sft(tok, max_samples=60)
train_ds, val_ds = split_small(ds, test_size=0.1, seed=SEED)

peft_cfg = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=32,
    prompt_tuning_init=PromptTuningInit.TEXT,
    prompt_tuning_init_text='You are a helpful assistant.',
    tokenizer_name_or_path=MODEL_ID,
)
soft_model = get_peft_model(base, peft_cfg)

def tokenize_fn(batch):
    enc = tok(batch['text'], truncation=True, max_length=256)
    # labels created by collator
    return enc
train_tok = train_ds.map(tokenize_fn, batched=True, remove_columns=['text'])
val_tok = val_ds.map(tokenize_fn, batched=True, remove_columns=['text'])

args = TrainingArguments(
    output_dir='out_soft_prompt',
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
    num_train_epochs=1,
    learning_rate=5e-3,
    logging_steps=5,
    eval_steps=5,
    eval_strategy='steps',
    save_strategy='no',
    report_to=[],
    max_steps=20,  # when positive, max_steps takes precedence over num_train_epochs
    fp16=False,
)
collator = DataCollatorForLanguageModeling(tokenizer=tok, mlm=False)
trainer = Trainer(model=soft_model, args=args, train_dataset=train_tok, eval_dataset=val_tok, data_collator=collator)
pre = trainer.evaluate()
pre_ppl = math.exp(pre['eval_loss']) if pre['eval_loss'] < 20 else float('inf')
print('Soft-prompt pre-train eval_loss:', pre['eval_loss'], 'ppl:', pre_ppl)
trainer.train()
metrics = trainer.evaluate()
ppl = math.exp(metrics['eval_loss']) if metrics['eval_loss'] < 20 else float('inf')
print('Soft-prompt post-train eval_loss:', metrics['eval_loss'], 'ppl:', ppl)
print('Eval losses over steps:', [(h.get('step'), h.get('eval_loss')) for h in trainer.state.log_history if 'eval_loss' in h])
trainer.save_model('out_soft_prompt')


The model is already on multiple devices. Skipping the move to device specified in `args`.




Soft-prompt pre-train eval_loss: 2.9554014205932617 ppl: 19.20943223403436


| Step | Training Loss | Validation Loss | Model Preparation Time |
|---|---|---|---|
| 5 | 2.5274 | 2.712756 | 0.0016 |
| 10 | 2.6611 | 2.620221 | 0.0016 |
| 15 | 2.3898 | 2.559868 | 0.0016 |
| 20 | 2.4199 | 2.547457 | 0.0016 |


Soft-prompt post-train eval_loss: 2.547456979751587 ppl: 12.774576434801709
Eval losses over steps: [(5, 2.712756395339966), (10, 2.620220899581909), (15, 2.559868097305298), (20, 2.547456979751587), (20, 2.547456979751587)]


## 5) LoRA SFT — parameter‑efficient fine‑tuning


### LoRA SFT
- Inserts low-rank adapters on attention projections (here q_proj/v_proj).
- Key knobs: rank `r`, `lora_alpha` (scale), `lora_dropout`.
- Enables learning beyond prompt tokens while remaining parameter-efficient.
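The knobs translate directly into parameter counts and an update scale: each adapted linear layer gains `r * (d_in + d_out)` trainable parameters, and with the standard LoRA scaling the low-rank update is multiplied by `lora_alpha / r`. The layer shapes and layer count below are assumptions about the default model (576-wide `q_proj`, 576-to-192 `v_proj`, 30 layers) and should be checked against the loaded model.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters of one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

def lora_scale(lora_alpha: float, r: int) -> float:
    """Standard LoRA scaling applied to the low-rank update B @ A."""
    return lora_alpha / r

R, ALPHA = 16, 16
NUM_LAYERS = 30       # assumption: number of transformer blocks
Q_PROJ = (576, 576)   # assumption: (d_in, d_out) of q_proj
V_PROJ = (576, 192)   # assumption: (d_in, d_out) of v_proj with grouped-query attention

per_layer = lora_params(*Q_PROJ, R) + lora_params(*V_PROJ, R)
total = NUM_LAYERS * per_layer
print(f"per layer: {per_layer:,}  total: {total:,}  scale: {lora_scale(ALPHA, R)}")
```

With `r=16` and `lora_alpha=16` the scale is 1.0; raising `lora_alpha` at a fixed rank makes the adapter's contribution larger without adding parameters.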


In [7]:
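# LoRA on attention projections: r (rank), alpha (scale), dropout (regularization).
# Gradient checkpointing trades compute for memory on small VRAM; enable_input_require_grads()
# lets gradients flow through the checkpointed blocks even though the base embeddings are frozen.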
from peft import LoraConfig, get_peft_model
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

tok = load_tokenizer()
base = load_base_model()
lora_cfg = LoraConfig(
    r=16, lora_alpha=16, lora_dropout=0.05, bias='none', task_type='CAUSAL_LM',
    target_modules=['q_proj','v_proj']
)
model = get_peft_model(base, lora_cfg)
model.gradient_checkpointing_enable()
model.enable_input_require_grads()
model.config.use_cache = False

ds = load_dolly_as_sft(tok, max_samples=120)
train_ds, val_ds = split_small(ds, test_size=0.1, seed=SEED)

def tokenize_fn(batch):
    enc = tok(batch['text'], truncation=True, max_length=256)
    # labels created by collator
    return enc
train_tok = train_ds.map(tokenize_fn, batched=True, remove_columns=['text'])
val_tok = val_ds.map(tokenize_fn, batched=True, remove_columns=['text'])

args = TrainingArguments(
    output_dir='out_lora_sft',
    per_device_train_batch_size=2, gradient_accumulation_steps=1, num_train_epochs=1,
    learning_rate=1e-4, fp16=False, logging_steps=5,
    eval_steps=5, eval_strategy='steps', save_strategy='no', report_to=[], max_steps=20,
)
collator = DataCollatorForLanguageModeling(tokenizer=tok, mlm=False)
trainer = Trainer(model=model, args=args, train_dataset=train_tok, eval_dataset=val_tok, data_collator=collator)
pre = trainer.evaluate()
pre_ppl = math.exp(pre['eval_loss']) if pre['eval_loss'] < 20 else float('inf')
print('LoRA pre-train eval_loss:', pre['eval_loss'], 'ppl:', pre_ppl)
trainer.train()
metrics = trainer.evaluate()
ppl = math.exp(metrics['eval_loss']) if metrics['eval_loss'] < 20 else float('inf')
print('LoRA post-train eval_loss:', metrics['eval_loss'], 'ppl:', ppl)
print('Eval losses over steps:', [(h.get('step'), h.get('eval_loss')) for h in trainer.state.log_history if 'eval_loss' in h])
trainer.save_model('out_lora_sft')


The model is already on multiple devices. Skipping the move to device specified in `args`.


LoRA pre-train eval_loss: 2.8462913036346436 ppl: 17.223785451830402


| Step | Training Loss | Validation Loss | Model Preparation Time |
|---|---|---|---|
| 5 | 2.9827 | 2.819744 | 0.0032 |
| 10 | 2.6568 | 2.800092 | 0.0032 |
| 15 | 2.621 | 2.787484 | 0.0032 |
| 20 | 2.6466 | 2.782698 | 0.0032 |


LoRA post-train eval_loss: 2.7826976776123047 ppl: 16.162563575550376
Eval losses over steps: [(5, 2.8197438716888428), (10, 2.8000917434692383), (15, 2.7874844074249268), (20, 2.7826976776123047), (20, 2.7826976776123047)]


## 6) KL‑anchored SFT — keep adapter near the base
Add a small token‑wise KL term between the LoRA student and a frozen base reference.
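As a framework-free illustration of that term, the sketch below computes KL(student || reference) per position from raw logits and averages it over positions whose label is not `-100`. It is a toy under stated assumptions: a three-token vocabulary, hand-picked logits, and illustrative helper names.

```python
import math
from typing import List

def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def token_kl(student_logits: List[float], ref_logits: List[float]) -> float:
    """KL(student || reference) for one position."""
    logp_s = log_softmax(student_logits)
    logp_r = log_softmax(ref_logits)
    return sum(math.exp(ls) * (ls - lr) for ls, lr in zip(logp_s, logp_r))

def masked_mean_kl(student, ref, labels, ignore_index: int = -100) -> float:
    """Average per-position KL over positions that are not masked out."""
    kept = [token_kl(s, r) for s, r, y in zip(student, ref, labels) if y != ignore_index]
    return sum(kept) / max(1, len(kept))

student = [[2.0, 0.0, -1.0], [0.5, 0.5, 0.0], [3.0, -3.0, 0.0]]
ref     = [[2.0, 0.0, -1.0], [1.5, 0.0, 0.0], [0.0, 0.0, 0.0]]
labels  = [7, 4, -100]  # the last position is padding and must not contribute

print("per-position KL:", [round(token_kl(s, r), 6) for s, r in zip(student, ref)])
print("masked mean KL :", round(masked_mean_kl(student, ref, labels), 6))
```

The first position has identical logits and contributes zero; the third differs strongly but is ignored because its label is `-100`.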


### KL-Anchored SFT
- Adds a small token-wise KL penalty between the LoRA student and a frozen base reference.
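- Warm up `kl_coef` linearly (over a few hundred steps in a real run), optionally restrict the penalty to the last `kl_limit_seq` positions, and skip positions whose label is `-100` (padding). In the 20-step demo below, `kl_warmup_steps=300` means the coefficient reaches only about 6% of `kl_coef`, so the penalty stays very weak.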
- Helps retain base capabilities while adapting to new data.
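The linear warmup is easy to inspect in isolation. This sketch mirrors the schedule used by the trainer in this section (`kl_coef * min(1, step / kl_warmup_steps)`); the standalone function name `kl_coef_at` is illustrative.

```python
def kl_coef_at(step: int, kl_coef: float = 0.05, kl_warmup_steps: int = 300) -> float:
    """Effective KL weight at a given global step under linear warmup."""
    warm = max(1, int(kl_warmup_steps))
    return float(kl_coef) * min(1.0, step / float(warm))

for step in (0, 19, 150, 300, 1000):
    print(f"step {step:>4}: coef = {kl_coef_at(step):.6f}")

# With a 20-step run the last training step sees global_step == 19.
print("fraction of kl_coef reached at step 19:", kl_coef_at(19) / 0.05)
```

For short demo runs, choosing `kl_warmup_steps` well below `max_steps` is one way to let the penalty reach full strength before training ends.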


In [8]:
# Wrap HF Trainer to inject a token-wise KL term vs a frozen reference model.
# We compute next-token KL by shifting logits, mask out ignored labels, and average per token.
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling
import torch

class KLTrainer(Trainer):
    def __init__(self, *args, ref_model=None, kl_coef=0.05, kl_warmup_steps=300, kl_limit_seq=0, **kwargs):
        super().__init__(*args, **kwargs)
        self.ref_model = ref_model
        self.kl_coef = float(kl_coef)
        self.kl_warmup_steps = int(max(1, kl_warmup_steps))
        self.kl_limit_seq = int(max(0, kl_limit_seq))

    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        outputs = model(**inputs)
        loss = outputs.loss
        if self.ref_model is not None and self.kl_coef > 0 and model.training:
            step = int(getattr(self.state, 'global_step', 0) or 0)
            coef = self.kl_coef * min(1.0, step / float(self.kl_warmup_steps))
            with torch.no_grad():
                ref = self.ref_model(input_ids=inputs['input_ids'], attention_mask=inputs.get('attention_mask'))
            logits_s = outputs.logits[:, :-1, :]
            logits_r = ref.logits[:, :-1, :]
            labels = inputs.get('labels', inputs['input_ids'])[:, 1:]
            if self.kl_limit_seq > 0:
                t = min(self.kl_limit_seq, logits_s.shape[1])
                logits_s = logits_s[:, -t:, :]
                logits_r = logits_r[:, -t:, :]
                labels   = labels[:, -t:]
            mask = (labels != -100).float()
            logp_s = torch.log_softmax(logits_s, dim=-1)
            logp_r = torch.log_softmax(logits_r, dim=-1)
            p_s    = torch.exp(logp_s)
            kl_tok = (p_s * (logp_s - logp_r)).sum(-1) * mask
            denom  = mask.sum().clamp_min(1.0)
            kl_mean = kl_tok.sum() / denom
            loss = loss + coef * kl_mean
        return (loss, outputs) if return_outputs else loss

tok = load_tokenizer()
base = load_base_model()
from peft import LoraConfig, get_peft_model
lora_cfg = LoraConfig(r=16, lora_alpha=16, lora_dropout=0.05, bias='none', task_type='CAUSAL_LM', target_modules=['q_proj','v_proj'])
student = get_peft_model(base, lora_cfg)
student.gradient_checkpointing_enable()
student.enable_input_require_grads()
student.config.use_cache = False
ref_model = load_base_model().eval()
for p in ref_model.parameters(): p.requires_grad_(False)

ds = load_dolly_as_sft(tok, max_samples=120)
train_ds, val_ds = split_small(ds, test_size=0.1, seed=SEED)

def tokenize_fn(batch):
    enc = tok(batch['text'], truncation=True, max_length=256)
    # labels created by collator
    return enc
train_tok = train_ds.map(tokenize_fn, batched=True, remove_columns=['text'])
val_tok = val_ds.map(tokenize_fn, batched=True, remove_columns=['text'])

args = TrainingArguments(
    output_dir='out_kl_sft', per_device_train_batch_size=2, gradient_accumulation_steps=1,
    num_train_epochs=1, learning_rate=1e-4, fp16=False, logging_steps=5,
    eval_steps=5,
    eval_strategy='steps', save_strategy='no', report_to=[], max_steps=20,
)
collator = DataCollatorForLanguageModeling(tokenizer=tok, mlm=False)
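# Note: with kl_warmup_steps=300 and max_steps=20, the KL weight peaks at
# 0.05 * 19 / 300 ≈ 0.003; lower kl_warmup_steps to see a stronger anchor.
trainer = KLTrainer(
    model=student, args=args, train_dataset=train_tok, eval_dataset=val_tok,
    data_collator=collator, ref_model=ref_model,
    kl_coef=0.05, kl_warmup_steps=300, kl_limit_seq=128,
)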
pre = trainer.evaluate()
pre_ppl = math.exp(pre['eval_loss']) if pre['eval_loss'] < 20 else float('inf')
print('KL-SFT pre-train eval_loss:', pre['eval_loss'], 'ppl:', pre_ppl)
trainer.train()
metrics = trainer.evaluate()
ppl = math.exp(metrics['eval_loss']) if metrics['eval_loss'] < 20 else float('inf')
print('KL-SFT post-train eval_loss:', metrics['eval_loss'], 'ppl:', ppl)
print('Eval losses over steps:', [(h.get('step'), h.get('eval_loss')) for h in trainer.state.log_history if 'eval_loss' in h])
trainer.save_model('out_kl_sft')


The model is already on multiple devices. Skipping the move to device specified in `args`.


KL-SFT pre-train eval_loss: 2.874230146408081 ppl: 17.711783391389677


| Step | Training Loss | Validation Loss | Model Preparation Time |
|---|---|---|---|
| 5 | 3.0165 | 2.847374 | 0.0031 |
| 10 | 2.6862 | 2.827473 | 0.0031 |
| 15 | 2.6471 | 2.8147 | 0.0031 |
| 20 | 2.668 | 2.809862 | 0.0031 |


KL-SFT post-train eval_loss: 2.8098623752593994 ppl: 16.60763244039388
Eval losses over steps: [(5, 2.847374200820923), (10, 2.8274734020233154), (15, 2.8147003650665283), (20, 2.8098623752593994), (20, 2.8098623752593994)]


## 7) Inference — compare adapters
Set which adapter to load: 'base' | 'soft_prompt' | 'lora_sft' | 'kl_sft'.


In [9]:
ADAPTER = 'soft_prompt'
from transformers import GenerationConfig
from peft import PeftModel

def load_for_inference(adapter: str):
    tok = load_tokenizer()
    model = load_base_model()
    if adapter == 'soft_prompt':
        model = PeftModel.from_pretrained(model, 'out_soft_prompt')
    elif adapter == 'lora_sft':
        model = PeftModel.from_pretrained(model, 'out_lora_sft')
    elif adapter == 'kl_sft':
        model = PeftModel.from_pretrained(model, 'out_kl_sft')
    return tok, model

tok, model = load_for_inference(ADAPTER)
chat = [
    {"role":"system","content":"You are a helpful assistant."},
    {"role":"user",  "content":"Write a short friendly welcome message for a workshop."},
]
text = tok.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
inputs = tok(text, return_tensors='pt').to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=120, do_sample=False, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True))



Here is a short friendly welcome message for a workshop:

**Welcome to our workshop!**

**Hi there! I'm excited to be here with you today to learn about the amazing world of plants and how to grow them.

**Welcome to our workshop!**

**Hi there! I'm here to help you learn about the amazing world of plants.

**What is a plant?**

A plant is a living organism that grows and grows in the ground or in water. It's made up of roots, stems, leaves, flowers, and seeds.



## Conclusion
- Soft Prompt: val loss 2.9554 → 2.5475 (Δ -0.4079); steps: [(5, 2.712756395339966), (10, 2.620220899581909), (15, 2.559868097305298), (20, 2.547456979751587), (20, 2.547456979751587)]
- LoRA SFT: val loss 2.8463 → 2.7827 (Δ -0.0636); steps: [(5, 2.8197438716888428), (10, 2.8000917434692383), (15, 2.7874844074249268), (20, 2.7826976776123047), (20, 2.7826976776123047)]
- KL-SFT: val loss 2.8742 → 2.8099 (Δ -0.0644); steps: [(5, 2.847374200820923), (10, 2.8274734020233154), (15, 2.8147003650665283), (20, 2.8098623752593994), (20, 2.8098623752593994)]

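- Soft Prompt showed the largest drop under this tiny budget, which is plausible when style/format dominates. The comparison is not like-for-like, though: it used a 50× higher learning rate (5e-3 vs 1e-4) and a different, smaller validation set (6 vs 12 examples).
- KL-SFT ends with a slightly higher loss than plain LoRA (2.8099 vs 2.7827), but it also started higher by about the same margin (2.8742 vs 2.8463), and the two drops are nearly identical (-0.0644 vs -0.0636). With `kl_warmup_steps=300` and only 20 steps, the KL weight never exceeds roughly 0.003, so this run does not demonstrate a trade-off between task fit and anchoring; a larger effective KL weight would be needed to see one.
- If general abilities or safety regress after SFT, increase `kl_coef` (or lengthen LR warmup) while checking that the eval loss still trends down.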
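The deltas and perplexities quoted in this notebook can be re-derived from the printed eval losses, since perplexity is `exp(loss)`. A small sketch using the values copied from the cell outputs above:

```python
import math

# (pre-train eval_loss, post-train eval_loss) as printed by each training cell
runs = {
    "Soft Prompt": (2.9554014205932617, 2.547456979751587),
    "LoRA SFT":    (2.8462913036346436, 2.7826976776123047),
    "KL-SFT":      (2.874230146408081, 2.8098623752593994),
}

summary = {}
for name, (pre, post) in runs.items():
    summary[name] = {
        "delta": post - pre,
        "ppl_pre": math.exp(pre),
        "ppl_post": math.exp(post),
    }
    s = summary[name]
    print(f"{name:<12} Δloss={s['delta']:+.4f}  ppl {s['ppl_pre']:.2f} -> {s['ppl_post']:.2f}")
```

Because the runs use different validation sets and learning rates, these numbers describe each run on its own rather than a ranking of the three methods.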