# LoRA Fine-Tuning: 3B Instruction-Only Model

This Kaggle-ready workflow fine-tunes `Qwen/Qwen2.5-3B-Instruct` with LoRA adapters on an offline dataset **WITHOUT RAG contexts** for the 3B experiment series.

**Important**: This notebook trains the model WITHOUT RAG contexts (instruction only). For the complete 6-condition experiment, you would need TWO models:
- **Instruction-only model** (this notebook): Used for FT-only and FT+RAG instruction-only conditions
- **RAG-trained model**: Trained WITH RAG contexts, used for RAG-trained FT-only and RAG-trained FT+RAG conditions

**Note**: This notebook is for future 3B experiments. For current 0.5B experiments, use `lora_science_v1_instruction_only.ipynb` (instruction-only) and `lora_science_v1.ipynb` (RAG-trained).

### 0. Runtime checklist
* In the Kaggle notebook settings, enable a GPU accelerator (T4 minimum; L4/A100 improve throughput) and turn on internet access if you plan to download artifacts.
* Kaggle provides 30 GB of RAM and ample disk under `/kaggle/working`. VRAM is the tighter limit (about 14 GB in use on a T4 with these hyperparameters), so reduce the max sequence length if you encounter OOM errors. The per-device batch size is already 1, and changing gradient accumulation alone does not lower peak memory.

In [None]:
!nvidia-smi

### 1. Install Python dependencies
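We set minimum versions compatible with Transformers 4.46+ and TRL's SFTTrainer; the remaining packages (`peft`, `trl`, `bitsandbytes`, and others) are installed at their latest release.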
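After the install cell has run, a quick check along these lines can confirm that the minimum versions were actually resolved. This is a minimal sketch using only the standard library; `version_tuple` and `check_minimum` are hypothetical helpers, and the parser only handles the leading numeric components of a version string.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions taken from the pip command in this section.
MINIMUMS = {"transformers": "4.46.0", "accelerate": "0.34.0", "datasets": "3.0.0"}


def version_tuple(text: str) -> tuple:
    """Parse leading numeric components, e.g. '4.46.0.dev0' -> (4, 46, 0)."""
    parts = []
    for chunk in text.split("."):
        digits = ""
        for char in chunk:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_minimum(package: str, minimum: str) -> str:
    """Return 'ok', 'too old', or 'missing' for an installed package."""
    try:
        installed = version_tuple(version(package))
    except PackageNotFoundError:
        return "missing"
    required = version_tuple(minimum)
    # Pad to equal length so that '4.46' compares equal to '4.46.0'.
    width = max(len(installed), len(required))
    installed += (0,) * (width - len(installed))
    required += (0,) * (width - len(required))
    return "ok" if installed >= required else "too old"


for name, minimum in MINIMUMS.items():
    print(f"{name:>12}: {check_minimum(name, minimum)} (needs >= {minimum})")
```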

In [None]:
%%capture
!pip install --upgrade pip
!pip install --quiet "transformers>=4.46.0" "accelerate>=0.34.0" "datasets>=3.0.0" peft trl bitsandbytes sentencepiece evaluate huggingface_hub pynvml

### 2. Authenticate (optional but recommended)
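Set Hugging Face and GitHub tokens if you plan to pull private assets or push adapters. The tokens are exported as environment variables for this runtime; `huggingface-cli login` additionally writes the Hugging Face token to the local Hugging Face cache and, with `--add-to-git-credential`, to the git credential helper.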

In [None]:
import os
import subprocess
from getpass import getpass


def _get_kaggle_secret(key: str) -> str | None:
    try:
        from kaggle_secrets import UserSecretsClient

        secret = UserSecretsClient().get_secret(key)
        if secret:
            return secret.strip() or None
    except Exception:
        return None
    return None


hf_token = os.environ.get("HF_TOKEN") or _get_kaggle_secret("HF_TOKEN")
if hf_token is None:
    entered = getpass("Enter Hugging Face token (leave blank to skip): ")
    hf_token = entered.strip() or None
if hf_token:
    os.environ["HF_TOKEN"] = hf_token
    subprocess.run(
        ["huggingface-cli", "login", "--token", hf_token, "--add-to-git-credential"],
        check=False,
    )
else:
    print("Skipping Hugging Face login.")

if os.environ.get("GITHUB_TOKEN") is None:
    gh_token = _get_kaggle_secret("GITHUB_TOKEN")
    if gh_token is None:
        gh_token = getpass(
            "Enter GitHub token for private repo access (leave blank to skip): "
        ).strip()
    if gh_token:
        os.environ["GITHUB_TOKEN"] = gh_token
        print("Stored GitHub token in this session.")
    else:
        print("Skipping GitHub token setup. Upload the dataset manually if download fails.")

### 3. Configure working directories (Kaggle-friendly)
Kaggle automatically exposes datasets under `/kaggle/input` and persists notebook outputs under `/kaggle/working`. Use the cell below to point the workflow at those locations, or override the paths if you run this notebook elsewhere.

In [None]:
import os
from pathlib import Path

IN_KAGGLE = bool(os.environ.get("KAGGLE_KERNEL_RUN_TYPE"))
DEFAULT_BASE_DIR = Path("/kaggle/working") / "beyond-the-cutoff" if IN_KAGGLE else Path("/content")
BASE_DIR = Path(os.environ.get("BTC_BASE_DIR", DEFAULT_BASE_DIR))
BASE_DIR.mkdir(parents=True, exist_ok=True)

# Configuration: Update these for your 3B experiment
RUN_ID = "cog-psych-2025-run01"  # Kaggle replaces underscores with hyphens in dataset slugs
EXPERIMENT_NAME = "lora_science_3b_instruction_only"  # Updated to match experiment naming
DATASET_FILENAME = "offline_dataset.jsonl"  # Use standard offline dataset name
SPLIT_SEED = 20251107
SPLIT_PREFIX = RUN_ID

KAGGLE_DATASET_SUBDIR = os.environ.get("BTC_KAGGLE_DATASET_SUBDIR", RUN_ID)
DATASET_SOURCE_CANDIDATE = None
if IN_KAGGLE:
    kaggle_input_root = Path("/kaggle/input")
    kaggle_candidates = [
        kaggle_input_root / DATASET_FILENAME,
        kaggle_input_root / RUN_ID / DATASET_FILENAME,
        kaggle_input_root / KAGGLE_DATASET_SUBDIR / DATASET_FILENAME,
    ]
    for path in kaggle_candidates:
        if path.exists():
            DATASET_SOURCE_CANDIDATE = path
            break

print(f"Working directory base: {BASE_DIR}")
if IN_KAGGLE:
    print("Detected Kaggle runtime.")
    if DATASET_SOURCE_CANDIDATE:
        print(f"Found dataset candidate at {DATASET_SOURCE_CANDIDATE}")
    else:
        print(
            "Attach the dataset under /kaggle/input or set BTC_KAGGLE_DATASET_SUBDIR or BTC_DATASET_SOURCE."
        )
else:
    print("Running outside Kaggle; override BTC_BASE_DIR if needed.")

print(f"Run ID: {RUN_ID} | Experiment: {EXPERIMENT_NAME}")
print(f"Dataset file: {DATASET_FILENAME}")

### 4. Retrieve the offline dataset
Attach `offline_dataset.jsonl` (the value of `DATASET_FILENAME`) as a Kaggle Dataset, for example under `/kaggle/input/cog-psych-2025-run01/`. If internet access is enabled, the cell below can also download the file directly from GitHub; otherwise the Kaggle attachment is required. Note that Kaggle replaces underscores with hyphens in dataset slugs, so a dataset named `cog_psych_2025_run01` is mounted as `cog-psych-2025-run01`.
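Because of the underscore-to-hyphen conversion, it can help to probe both spellings of the dataset folder. The sketch below only builds the candidate paths; `kaggle_input_candidates` is a hypothetical helper, not part of the notebook.

```python
from pathlib import PurePosixPath


def kaggle_input_candidates(run_id: str, filename: str) -> list:
    """List candidate dataset paths under /kaggle/input, hyphenated slug first."""
    root = PurePosixPath("/kaggle/input")
    candidates = []
    for slug in (run_id.replace("_", "-"), run_id):
        path = root / slug / filename
        if path not in candidates:  # skip the duplicate when run_id has no underscores
            candidates.append(path)
    return candidates


for path in kaggle_input_candidates("cog_psych_2025_run01", "offline_dataset.jsonl"):
    print(path)
```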

In [None]:
import json
import os
import shutil
from pathlib import Path

DATA_DIR = Path(BASE_DIR) / "data" / RUN_ID
DATA_DIR.mkdir(parents=True, exist_ok=True)
DATASET_PATH = DATA_DIR / DATASET_FILENAME

if DATASET_PATH.exists():
    print(f"Dataset already available: {DATASET_PATH}")
else:
    dataset_source_override = os.environ.get("BTC_DATASET_SOURCE")
    candidate_paths = []
    if dataset_source_override:
        candidate_paths.append(Path(dataset_source_override))
    if "DATASET_SOURCE_CANDIDATE" in globals() and DATASET_SOURCE_CANDIDATE:
        candidate_paths.append(Path(DATASET_SOURCE_CANDIDATE))

    copied = False
    for candidate in candidate_paths:
        if candidate.exists():
            shutil.copy(candidate, DATASET_PATH)
            print(f"Copied dataset from {candidate} -> {DATASET_PATH}")
            copied = True
            break

    if not copied:
        import requests

        headers = {}
        github_token = os.environ.get("GITHUB_TOKEN")
        if github_token:
            headers["Authorization"] = f"token {github_token}"
        url = (
            "https://raw.githubusercontent.com/ignaciolinari/beyond-the-cutoff/main/"
            f"evaluation/datasets/{DATASET_FILENAME}"
        )
        response = requests.get(url, headers=headers, timeout=60)
        if response.status_code == 200:
            DATASET_PATH.write_text(response.text, encoding="utf-8")
            print(f"Downloaded dataset to {DATASET_PATH}")
        else:
            raise RuntimeError(
                "Failed to retrieve dataset. Attach it as a Kaggle dataset or set BTC_DATASET_SOURCE."
            )

### 5. Create deterministic train/val/test splits
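We group examples by paper and task type, shuffle the groups with seed `20251107`, and assign each group whole to train, val, or test (targeting roughly 70/15/15). This keeps every (paper, task type) group in a single split and stays aligned with the cognitive psychology pipeline plan.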
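The assignment logic can be sketched on toy data as follows. This is a simplified illustration: the toy records carry a `paper_id` field directly (the notebook derives it with `extract_paper_id`), and `grouped_split` is a hypothetical name for the logic in the cell below.

```python
import random
from collections import defaultdict


def grouped_split(examples, seed, train_frac=0.70, val_frac=0.15):
    """Assign whole (paper, task type) groups to the split furthest below its target."""
    groups = defaultdict(list)
    for example in examples:
        groups[(example["paper_id"], example["task_type"])].append(example)

    buckets = list(groups.values())
    random.Random(seed).shuffle(buckets)  # seeded, so the split is reproducible

    total = len(examples)
    targets = {"train": round(train_frac * total), "val": round(val_frac * total)}
    targets["test"] = total - targets["train"] - targets["val"]

    splits = {"train": [], "val": [], "test": []}
    for bucket in buckets:
        remaining = {name: targets[name] - len(rows) for name, rows in splits.items()}
        name = max(remaining, key=remaining.get)
        if remaining[name] <= 0:  # every target met: overflow goes to train
            name = "train"
        splits[name].extend(bucket)
    return splits


# Toy dataset: 10 papers x 2 task types x 2 examples each (all values are made up).
toy = [
    {"paper_id": f"paper{p}", "task_type": task, "instruction": f"q{p}-{task}-{i}"}
    for p in range(10)
    for task in ("qa", "summary")
    for i in range(2)
]
toy_splits = grouped_split(toy, seed=20251107)
for name, rows in toy_splits.items():
    print(f"{name:>5}: {len(rows)} examples")
```

Because groups are never divided, split sizes can drift from the targets when groups are large or uneven; the printed counts are worth checking on the real dataset.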

In [None]:
import random
from collections import defaultdict


def load_examples(path: Path) -> list[dict]:
    raw = path.read_text(encoding="utf-8").strip().splitlines()
    return [json.loads(line) for line in raw if line]


def extract_paper_id(example: dict) -> str:
    meta = example.get("metadata") or {}
    if isinstance(meta, dict):
        candidate = meta.get("source_path") or meta.get("paper_id")
        if candidate:
            return Path(str(candidate)).stem
    sources = example.get("sources") or []
    if sources:
        return Path(str(sources[0])).stem
    rag = example.get("rag") or {}
    retrieved = rag.get("retrieved") or []
    if retrieved:
        first = retrieved[0]
        if isinstance(first, dict) and first.get("source_path"):
            return Path(str(first["source_path"])).stem
    return "unknown"


examples = load_examples(DATASET_PATH)
print(f"Loaded {len(examples)} examples")

groups = defaultdict(list)
for example in examples:
    key = (extract_paper_id(example), example.get("task_type"))
    groups[key].append(example)

buckets = list(groups.values())
rng = random.Random(SPLIT_SEED)
rng.shuffle(buckets)

target_counts = {
    "train": int(round(0.70 * len(examples))),
    "val": int(round(0.15 * len(examples))),
}
target_counts["test"] = len(examples) - target_counts["train"] - target_counts["val"]
splits = {"train": [], "val": [], "test": []}

for bucket in buckets:
    remaining = {split: target_counts[split] - len(splits[split]) for split in splits}
    target_split = max(remaining, key=lambda split: remaining[split])
    if remaining[target_split] <= 0:
        target_split = "train"
    splits[target_split].extend(bucket)

SPLIT_DIR = DATA_DIR / "splits"
SPLIT_DIR.mkdir(parents=True, exist_ok=True)
for split, rows in splits.items():
    out_path = SPLIT_DIR / f"{SPLIT_PREFIX}_{split}.jsonl"
    with out_path.open("w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")
    print(f"{split:>5}: {len(rows):>3} examples → {out_path}")

print(f"Used stratification seed {SPLIT_SEED}")

### 6. Load dataset into Hugging Face `DatasetDict`
We keep the raw fields here; prompt construction happens in `format_example` (section 8), which is mapped over the dataset before it reaches the trainer.
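For orientation, the later cells read the fields `instruction`, `expected_response`, `task_type`, `metadata`, `sources`, `rag`, and `contexts` from each JSONL record. The sketch below shows a record with that shape and a small check for the two fields that training depends on. The record values are invented for illustration, and `validate_record` is a hypothetical helper.

```python
import json

# Fields that format_example needs to build a training example.
REQUIRED_FIELDS = ("instruction", "expected_response")


def validate_record(record: dict) -> list:
    """Return a list of problems found in one dataset record (empty if fine)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty '{field}'")
    return problems


# Hypothetical record: field names follow the notebook, values are made up.
sample_line = json.dumps(
    {
        "instruction": "What effect does the paper report?",
        "expected_response": "It reports a small priming effect.",
        "task_type": "qa",
        "metadata": {"source_path": "papers/example_paper.pdf"},
        "rag": {"retrieved": []},
        "contexts": [],
    }
)

print(validate_record(json.loads(sample_line)))
print(validate_record({"instruction": "   "}))
```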

In [None]:
from datasets import Dataset, DatasetDict


def read_split(split: str) -> Dataset:
    path = SPLIT_DIR / f"{SPLIT_PREFIX}_{split}.jsonl"
    rows = []
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                rows.append(json.loads(line))
    return Dataset.from_list(rows)


dataset = DatasetDict({split: read_split(split) for split in ["train", "val", "test"]})
dataset

### 7. Initialize tokenizer, model, and LoRA config
We keep the base model in float16 unless the GPU supports bfloat16, and target attention + MLP projections for LoRA adapters.
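To get a feel for how small the adapter is, the number of trainable LoRA parameters can be estimated by hand: each adapted linear layer adds `r * (in_features + out_features)` weights. The sketch below assumes the Qwen2.5-3B dimensions listed in the comments (an assumption to verify against `model.config`) together with the `r=16` used in this notebook.

```python
# Assumed Qwen2.5-3B dimensions; verify against model.config before relying on them.
HIDDEN = 2048         # hidden_size
INTERMEDIATE = 11008  # intermediate_size (MLP width)
KV_DIM = 256          # num_key_value_heads (2) * head_dim (128)
NUM_LAYERS = 36       # num_hidden_layers
RANK = 16             # LoRA rank r from this notebook
ALPHA = 32            # lora_alpha from this notebook

# (in_features, out_features) of each targeted projection.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """Matrix A is (rank x in) and matrix B is (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_param_count(i, o, RANK) for i, o in SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
print(f"Scaling factor alpha / r:  {ALPHA / RANK}")
```

Under these assumptions the adapter has roughly 30 million trainable weights. Once the trainer has wrapped the model, `trainer.model.print_trainable_parameters()` reports the actual figure.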


In [None]:
import torch
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model_name = "Qwen/Qwen2.5-3B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

supports_cuda = torch.cuda.is_available()
compute_capability = torch.cuda.get_device_capability(0)[0] if supports_cuda else None
prefer_bf16 = bool(supports_cuda and compute_capability is not None and compute_capability >= 8)
model_dtype = torch.bfloat16 if prefer_bf16 else torch.float16


def build_base_model() -> AutoModelForCausalLM:
    base_model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=model_dtype,
        device_map="auto",
        trust_remote_code=True,
    )
    base_model.config.use_cache = False
    return base_model


model = build_base_model()


lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

print(f"Loaded model on {model.device} with dtype {model_dtype}")

### 8. Define prompt formatting and trainer

**INSTRUCTION-ONLY MODE**: We use only the instruction (no RAG contexts) in the user turn. This trains the model to answer questions based on its knowledge alone, matching the FT-only evaluation setup. The same model will be used for both FT-only and RAG+FT evaluation (mirroring the 0.5B experiment design).

**Important**: The user message format matches `_build_instruction_only_prompt()` from the evaluation runner to ensure training/evaluation consistency. The prompt includes the instruction text wrapped in the same format used during evaluation.
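The message structure is easiest to see on a single example. The sketch below mirrors the prompt text used in the cell that follows; `instruction_only_prompt` and `build_messages` are simplified, hypothetical stand-ins for the notebook's helpers, and the question is made up.

```python
from typing import Optional

SYSTEM_PROMPT = "You are a research paper assistant."


def instruction_only_prompt(instruction: str) -> str:
    """Build the user turn from the instruction alone (no retrieved contexts)."""
    text = instruction.strip()
    if not text:
        return ""
    return (
        "You are a research paper assistant. Answer the following question based on your knowledge.\n\n"
        f"Question: {text}\n\nAnswer:"
    )


def build_messages(instruction: str, answer: Optional[str] = None) -> list:
    """Training passes the reference answer; inference leaves it out."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": instruction_only_prompt(instruction)},
    ]
    if answer is not None:
        messages.append({"role": "assistant", "content": answer.strip()})
    return messages


train_messages = build_messages("  What is semantic priming? ", "A made-up reference answer.")
infer_messages = build_messages("What is semantic priming?")
for message in train_messages:
    print(f"[{message['role']}] {message['content']!r}")
```

The training and inference message lists differ only in the final assistant turn, which is what keeps the two prompts aligned.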


In [None]:
from collections.abc import Mapping, Sequence

from trl import SFTConfig, SFTTrainer


def _get_batch_value(column, index):
    if column is None:
        return None
    if isinstance(column, Mapping):
        return {key: _get_batch_value(value, index) for key, value in column.items()}
    if isinstance(column, Sequence) and not isinstance(column, str | bytes):
        return column[index] if len(column) > index else None
    return column


def build_user_message(
    instruction: str,
    rag_entry: dict | None,
    contexts_fallback: Sequence[str] | None,
) -> str:
    # INSTRUCTION-ONLY MODE: Return only instruction, ignore contexts
    # This trains the model to answer without RAG contexts
    # Format matches evaluation runner's _build_instruction_only_prompt() for consistency
    instruction_text = instruction.strip()
    if not instruction_text:
        return ""
    return (
        "You are a research paper assistant. Answer the following question based on your knowledge.\n\n"
        f"Question: {instruction_text}\n\nAnswer:"
    )


def format_example(example: dict) -> dict[str, str]:
    instruction = (example["instruction"] or "").strip()
    rag_item = example.get("rag")
    contexts_item = example.get("contexts")
    user_content = build_user_message(instruction, rag_item, contexts_item)
    assistant_reply = (example.get("expected_response") or "").strip()
    # Use chat template format (required for HuggingFace models)
    # but user content matches evaluation format for consistency
    # System message is minimal since instructions are in user content
    messages = [
        {
            "role": "system",
            "content": "You are a research paper assistant.",
        },
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": assistant_reply},
    ]
    rendered = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
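    # The Qwen chat template already closes the assistant turn with the EOS token; avoid appending a duplicate
    if tokenizer.eos_token and not rendered.rstrip().endswith(tokenizer.eos_token):
        rendered += tokenizer.eos_token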
    return {"text": rendered}


output_root = Path(BASE_DIR) / "outputs" / EXPERIMENT_NAME
output_root.mkdir(parents=True, exist_ok=True)


model_dtype = globals().get("model_dtype", getattr(model, "dtype", torch.float16))
supports_cuda = torch.cuda.is_available()
prefer_bf16 = globals().get("prefer_bf16", supports_cuda and model_dtype == torch.bfloat16)
fp16_flag = supports_cuda and model_dtype == torch.float16
bf16_flag = supports_cuda and prefer_bf16
max_seq_length = 2048
tokenizer.model_max_length = max_seq_length


if tokenizer.eos_token is None or not isinstance(tokenizer.eos_token, str):
    tokenizer.eos_token = "<|endoftext|>"


if hasattr(model, "peft_config"):
    model = build_base_model()


formatted_dataset = dataset.map(format_example, remove_columns=dataset["train"].column_names)


training_args = SFTConfig(
    output_dir=str(output_root),
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=1e-4,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    logging_strategy="steps",
    logging_steps=10,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    gradient_checkpointing=True,
    fp16=fp16_flag,
    bf16=bf16_flag,
    max_grad_norm=1.0,
    report_to="none",
    packing=False,
    per_device_eval_batch_size=1,
    eval_accumulation_steps=1,
)


trainer = SFTTrainer(
    model=model,
    train_dataset=formatted_dataset["train"],
    eval_dataset=formatted_dataset["val"],
    peft_config=lora_config,
    args=training_args,
)

### 9. Fine-tune the model
Training typically finishes in about 30 minutes on a T4 because the dataset is compact; expect longer runs with larger datasets or longer sequences.
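The number of optimizer steps follows from the dataset size and the hyperparameters in `SFTConfig`. The sketch below works through the arithmetic for a hypothetical training set of 400 examples (the real count is printed by the cell below). It is an approximation: how a partial final accumulation window is counted can vary between `transformers` versions.

```python
import math


def schedule_summary(
    num_examples: int,
    per_device_batch: int = 1,
    grad_accum: int = 8,
    epochs: int = 3,
    warmup_ratio: float = 0.05,
    logging_steps: int = 10,
) -> dict:
    """Approximate optimizer-step counts for a single-GPU run."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = max(1, math.ceil(num_examples / effective_batch))
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * warmup_ratio),
        "logged_points": total_steps // logging_steps,
    }


summary = schedule_summary(num_examples=400)  # 400 is a made-up dataset size
for key, value in summary.items():
    print(f"{key}: {value}")
```

With a small dataset the loss is logged only a handful of times, so lowering `logging_steps` may be worthwhile if the loss curve looks too sparse.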

In [None]:
train_batch_size = training_args.per_device_train_batch_size * max(
    1, training_args.gradient_accumulation_steps
)
print(
    f"Starting LoRA fine-tuning on {len(formatted_dataset['train'])} training examples "
    f"({train_batch_size} effective batch size) for {training_args.num_train_epochs} epochs."
)
print(
    f"Evaluation enabled: {training_args.eval_strategy != 'no'} | "
    f"Logging every {training_args.logging_steps} steps | "
    f"Gradient checkpointing: {training_args.gradient_checkpointing}"
)

train_result = trainer.train()
print("Training finished.")
trainer.log_metrics("train", train_result.metrics)
trainer.save_metrics("train", train_result.metrics)
trainer.save_state()

if train_result.metrics:
    print("Train metrics:")
    for key, value in train_result.metrics.items():
        print(f"  {key}: {value}")

eval_metrics = None
if training_args.eval_strategy != "no":
    print("Running evaluation...")
    eval_metrics = trainer.evaluate()
    trainer.log_metrics("eval", eval_metrics)
    trainer.save_metrics("eval", eval_metrics)
    if eval_metrics:
        print("Eval metrics:")
        for key, value in eval_metrics.items():
            print(f"  {key}: {value}")
else:
    print("Evaluation skipped (eval_strategy='no').")

#### Plot training loss curve
Use this after training to confirm the optimizer is behaving as expected.
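`trainer.state.log_history` is a list of dictionaries in which training-loss entries, evaluation entries, and the final summary are mixed together. The sketch below uses a made-up history to show how the `"loss"` key singles out the training entries, and adds a simple moving average (a hypothetical helper) for reading noisy curves.

```python
# Made-up log history with the three kinds of entries the Trainer records.
toy_log_history = [
    {"loss": 2.0, "learning_rate": 1e-4, "epoch": 0.2, "step": 10},
    {"loss": 1.5, "learning_rate": 9e-5, "epoch": 0.4, "step": 20},
    {"eval_loss": 1.6, "epoch": 0.5, "step": 25},
    {"loss": 1.0, "learning_rate": 8e-5, "epoch": 0.6, "step": 30},
    {"train_runtime": 120.0, "train_loss": 1.5, "epoch": 0.6, "step": 30},
]

# Key membership, not substring matching: "eval_loss" and "train_loss" are skipped.
loss_points = [
    (entry["step"], entry["loss"])
    for entry in toy_log_history
    if "loss" in entry and "step" in entry
]


def moving_average(values: list, window: int) -> list:
    """Trailing moving average; early points use however many values exist."""
    smoothed = []
    for index in range(len(values)):
        chunk = values[max(0, index - window + 1) : index + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


steps = [step for step, _ in loss_points]
losses = [loss for _, loss in loss_points]
print(steps)
print(moving_average(losses, window=2))
```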

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

loss_history = [
    {"step": entry["step"], "loss": entry["loss"]}
    for entry in trainer.state.log_history
    if "loss" in entry and "step" in entry
]

if not loss_history:
    print("No loss values logged yet. Run the training cell first or raise logging verbosity.")
else:
    loss_df = pd.DataFrame(loss_history).drop_duplicates(subset="step").sort_values("step")
    display(loss_df.tail())

    fig, ax = plt.subplots(figsize=(8, 4.5))
    ax.plot(loss_df["step"], loss_df["loss"], marker="o", linewidth=1.5, markersize=3)
    ax.set_xlabel("Global step")
    ax.set_ylabel("Loss")
    ax.set_title("Training Loss vs. Step")
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

### 10. Quick validation sanity check
Generate answers for a few validation samples to verify the adapter behaviour before exporting.

In [None]:
model.eval()


def preview_response(example: dict, max_new_tokens: int = 256) -> str:
    user_text = build_user_message(
        example["instruction"], example.get("rag"), example.get("contexts")
    )
    # Match training format: minimal system message, user content includes instructions
    messages = [
        {
            "role": "system",
            "content": "You are a research paper assistant.",
        },
        {"role": "user", "content": user_text},
    ]
    prompt_text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    encoded = tokenizer(
        prompt_text,
        return_tensors="pt",
        truncation=True,
        max_length=tokenizer.model_max_length,
        return_attention_mask=True,
    )
    encoded = {key: value.to(model.device) for key, value in encoded.items()}
    if encoded["input_ids"].shape[-1] == tokenizer.model_max_length:
        print(
            "Prompt truncated to fit model_max_length; consider shortening the instruction if this occurs often."
        )
    # Greedy decoding: temperature and top_p have no effect when do_sample=False, so they are omitted
    gen_config = GenerationConfig(
        do_sample=False,
        max_new_tokens=max_new_tokens,
    )
    with torch.no_grad():
        generated = model.generate(**encoded, generation_config=gen_config)
    response_ids = generated[0, encoded["input_ids"].shape[-1] :]
    return tokenizer.decode(response_ids, skip_special_tokens=True).strip()


for example in dataset["val"].select(range(min(3, len(dataset["val"])))):
    print("Instruction:", example["instruction"])
    print("Ground truth:", (example.get("expected_response") or "").strip())
    print("Model output:", preview_response(example))
    print("-" * 80)

### 11. Save adapters and tokenizer
Store artifacts under `outputs/lora_science_3b_instruction_only/adapters/lora_science_3b_instruction_only` (relative to `BASE_DIR`). Download them from Kaggle or sync to your preferred remote storage as needed.

In [None]:
adapter_dir = output_root / "adapters"
adapter_dir.mkdir(parents=True, exist_ok=True)
adapter_path = adapter_dir / EXPERIMENT_NAME
trainer.model.save_pretrained(adapter_path)
tokenizer.save_pretrained(adapter_path)
print(f"Saved LoRA adapter to {adapter_path}")

### 12. Package artifacts
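Compress the adapter, training args, and logs so you can download them from Kaggle or sync back to the repo. The cell is commented out by default; uncomment it if you want a pre-merge archive in addition to the one created in section 14.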

In [None]:
# import shutil

# archive_stem = output_root.parent / f"{EXPERIMENT_NAME}_artifacts"
# zip_target = archive_stem.with_suffix(".zip")
# if zip_target.exists():
#     zip_target.unlink()
# zip_path = shutil.make_archive(str(archive_stem), "zip", root_dir=output_root)
# print(f"Packed artifacts at {zip_path}")

### 13. Next steps
1. **Persist training metadata** → run the next cell to capture seed `20251107`, the key hyperparameters, and recent metrics in `outputs/lora_science_3b_instruction_only/adapters/lora_science_3b_instruction_only/EXPERIMENT_METADATA.json`.
2. **Materialize merged weights** → execute the following cell to merge the LoRA adapter into the base model and save a ready-to-quantize checkpoint under `outputs/lora_science_3b_instruction_only/merged_full_model`.
3. **Quantize for Ollama** → use the provided CLI snippet to convert the merged weights to GGUF via `llama.cpp`'s `convert_hf_to_gguf.py` (named `convert-hf-to-gguf.py` in older checkouts), quantize to Q4_K_M, then register the artifact with Ollama.
4. **Re-benchmark** → rerun `scripts/evaluate_models.py` with the new Ollama tag to compare against the `qwen2_5-3b-instruct-q4_K_M` RAG control.
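The artifact locations used by the remaining cells all derive from `BASE_DIR` and `EXPERIMENT_NAME`. The sketch below resolves them for the default Kaggle base directory; the lowercase names are local to this illustration so they do not clash with the notebook's variables.

```python
from pathlib import PurePosixPath

# Default Kaggle values from section 3; override if BTC_BASE_DIR is set.
base_dir = PurePosixPath("/kaggle/working/beyond-the-cutoff")
experiment_name = "lora_science_3b_instruction_only"

output_root = base_dir / "outputs" / experiment_name
adapter_path = output_root / "adapters" / experiment_name

layout = {
    "adapter": adapter_path,
    "metadata": adapter_path / "EXPERIMENT_METADATA.json",
    "merged": output_root / "merged_full_model",
    # The archive is written next to output_root, so it never contains itself.
    "archive": output_root.parent / f"{experiment_name}_artifacts_postmerge.zip",
}

for label, path in layout.items():
    print(f"{label:>8}: {path}")
```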

In [None]:
import json
from datetime import datetime, timezone

import peft
import transformers
import trl
from packaging.version import Version

metadata_target = output_root / "adapters" / EXPERIMENT_NAME / "EXPERIMENT_METADATA.json"
metadata_target.parent.mkdir(parents=True, exist_ok=True)
train_result_obj = locals().get("train_result")

training_summary = {
    "seed": SPLIT_SEED,
    # datetime.utcnow() is deprecated since Python 3.12; use a timezone-aware UTC timestamp
    "timestamp_utc": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    "base_model": model_name,
    "adapter_dir": str(output_root / "adapters" / EXPERIMENT_NAME),
    "output_dir": str(output_root),
    "hyperparameters": {
        "num_train_epochs": training_args.num_train_epochs,
        "per_device_train_batch_size": training_args.per_device_train_batch_size,
        "gradient_accumulation_steps": training_args.gradient_accumulation_steps,
        "learning_rate": training_args.learning_rate,
        "weight_decay": training_args.weight_decay,
        "lr_scheduler_type": training_args.lr_scheduler_type,
        "warmup_ratio": training_args.warmup_ratio,
        "max_seq_length": max_seq_length,
    },
    "optimizer_steps": trainer.state.global_step,
    "train_loss": next(
        (entry.get("loss") for entry in reversed(trainer.state.log_history) if "loss" in entry),
        None,
    ),
    "final_metrics": train_result_obj.metrics if train_result_obj is not None else {},
    "libraries": {
        "transformers": Version(transformers.__version__).base_version,
        "trl": Version(trl.__version__).base_version,
        "peft": Version(peft.__version__).base_version,
    },
}

metadata_target.write_text(
    json.dumps(training_summary, indent=2, sort_keys=True) + "\n", encoding="utf-8"
)
print(f"Wrote metadata to {metadata_target}")

In [None]:
import gc
from pathlib import Path

import torch
from peft import AutoPeftModelForCausalLM

output_root = Path(BASE_DIR) / "outputs" / EXPERIMENT_NAME
adapter_dir = output_root / "adapters"
adapter_path = adapter_dir / EXPERIMENT_NAME

merged_output_dir = output_root / "merged_full_model"
merged_output_dir.mkdir(parents=True, exist_ok=True)

print("Merging adapter into the base model…")

# Free up memory by deleting the original model and clearing CUDA cache
if "model" in globals():
    del model
if torch.cuda.is_available():
    torch.cuda.empty_cache()
gc.collect()

offload_dir = Path(BASE_DIR) / "tmp_merge_offload"
offload_dir.mkdir(parents=True, exist_ok=True)

merge_dtype = torch.float16
merged_model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_path,
    torch_dtype=merge_dtype,
    device_map={"": "cpu"},
    low_cpu_mem_usage=True,
    offload_folder=str(offload_dir),
    trust_remote_code=True,
)

with torch.inference_mode():
    merged_model = merged_model.merge_and_unload()

merged_model.save_pretrained(merged_output_dir)
tokenizer.save_pretrained(merged_output_dir)
print(f"Merged checkpoint written to {merged_output_dir}")

#### Quantize and evaluate
Run the commands below from your local checkout once the merged checkpoint is synced down:
```bash
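# Convert merged HF weights to an f16 GGUF
# (the script is named convert-hf-to-gguf.py in older llama.cpp checkouts)
python /path/to/llama.cpp/convert_hf_to_gguf.py \
  outputs/lora_science_3b_instruction_only/merged_full_model \
  --outfile outputs/lora_science_3b_instruction_only/merged_full_model/Qwen2.5-3B-lora_science_3b_instruction_only.f16.gguf \
  --outtype f16

# Quantize to Q4_K_M (the binary location depends on how llama.cpp was built)
/path/to/llama.cpp/build/bin/llama-quantize \
  outputs/lora_science_3b_instruction_only/merged_full_model/Qwen2.5-3B-lora_science_3b_instruction_only.f16.gguf \
  outputs/lora_science_3b_instruction_only/merged_full_model/Qwen2.5-3B-lora_science_3b_instruction_only.Q4_K_M.gguf \
  Q4_K_M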

# Register with Ollama using instruction-only Modelfile
# This same model will be used for both FT-only and RAG+FT evaluation
ollama create lora_science_3b_instruction_only -f ollama/Modelfile.instruction_only
ollama push lora_science_3b_instruction_only

# Create a 3B comparison plan first (e.g., compare_3b_experiments.yaml)
# based on configs/evaluation/compare_0p5b_experiments.yaml, then run:
python scripts/compare_models.py \
  --plan configs/evaluation/compare_3b_experiments.yaml \
  --output evaluation/results/comparison_report_3b.json
```

### 14. Re-package artifacts
Create a fresh archive after merging and metadata updates so downstream steps stay in sync.
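Before downloading, it can be worth listing what the archive actually contains. The sketch below demonstrates the idea on a throwaway directory tree with the standard library; `archive_and_list` is a hypothetical helper and the file names are placeholders.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def archive_and_list(root: Path, archive_stem: Path) -> list:
    """Zip everything under root and return the archived file names."""
    zip_path = shutil.make_archive(str(archive_stem), "zip", root_dir=root)
    with zipfile.ZipFile(zip_path) as archive:
        # Directory entries end with "/"; keep only files.
        return sorted(name for name in archive.namelist() if not name.endswith("/"))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "outputs" / "demo_experiment"
    (root / "adapters").mkdir(parents=True)
    (root / "adapters" / "adapter_config.json").write_text("{}", encoding="utf-8")
    (root / "merged_full_model").mkdir()
    (root / "merged_full_model" / "config.json").write_text("{}", encoding="utf-8")

    # Write the archive next to root (not inside it) so it cannot include itself.
    members = archive_and_list(root, root.parent / "demo_experiment_artifacts")
    print(members)
```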

In [None]:
import shutil

archive_stem = output_root.parent / f"{EXPERIMENT_NAME}_artifacts_postmerge"
zip_target = archive_stem.with_suffix(".zip")
if zip_target.exists():
    zip_target.unlink()
zip_path = shutil.make_archive(str(archive_stem), "zip", root_dir=output_root)
print(f"Packed updated artifacts at {zip_path}")