# LoRA Fine-Tuning: lora_science_v1_instruction_only

This Colab-ready workflow fine-tunes `Qwen/Qwen2.5-0.5B-Instruct` with LoRA adapters on the offline dataset **WITHOUT RAG contexts** for the 6-condition experiment.

**Important**: This notebook trains the model WITHOUT RAG contexts (instruction only). This model is used for:
- **Condition 3**: FT-only (evaluated without RAG contexts)
- **Condition 4**: FT+RAG instruction-only (evaluated WITH RAG contexts)

For conditions 5-6, use `lora_science_v1.ipynb` which trains WITH RAG contexts.

> This model is designed to answer questions from its own knowledge alone. Evaluating it with RAG contexts (condition 4) tests transfer: whether a model trained without contexts can still benefit from RAG at inference time.

### 0. Runtime checklist
* Select a Colab runtime with GPU (T4 preferred) before running any code.
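* Keep an eye on VRAM usage (a T4 provides about 16 GB). If out-of-memory errors appear, reduce the sequence length, or lower the per-device batch size and raise gradient accumulation to keep the effective batch size unchanged.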

In [None]:
!nvidia-smi
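
The batch-size trade-off in the checklist can be sketched with a little arithmetic. This is a minimal illustration, assuming the values configured later in section 8 (`per_device_train_batch_size=2`, `gradient_accumulation_steps=4`); the helper name `accumulation_steps_for` is hypothetical and not part of the notebook.

```python
def accumulation_steps_for(effective_batch: int, per_device_batch: int) -> int:
    """Return the accumulation steps that preserve the effective batch size."""
    if per_device_batch <= 0 or effective_batch % per_device_batch != 0:
        raise ValueError("per_device_batch must be a positive divisor of effective_batch")
    return effective_batch // per_device_batch


# Section 8 uses a micro-batch of 2 with 4 accumulation steps.
effective = 2 * 4

# Halving the micro-batch to relieve memory pressure doubles the accumulation steps.
print(accumulation_steps_for(effective, per_device_batch=1))  # 8
```

Raising accumulation on its own does not lower peak memory; it only compensates for a smaller micro-batch so the optimizer still sees the same number of examples per update.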

### 1. Install Python dependencies
We require minimum versions compatible with Transformers 4.45+ and TRL's SFTTrainer; the remaining packages are left unpinned.

In [None]:
%%capture
!pip install --upgrade pip
!pip install --quiet "transformers>=4.45.0" "accelerate>=0.33.0" "datasets>=3.0.0" peft trl bitsandbytes sentencepiece evaluate huggingface_hub pynvml

### 2. Authenticate (optional but recommended)
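Set Hugging Face and GitHub tokens if you plan to pull private assets or push adapters. Tokens are held in this runtime's environment variables; the Hugging Face login also caches the token on the runtime's disk and in the git credential store, so they last only as long as the runtime.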

In [None]:
import os
import subprocess
from getpass import getpass

hf_token = os.environ.get("HF_TOKEN")
if hf_token is None:
    entered = getpass("Enter Hugging Face token (leave blank to skip): ")
    hf_token = entered.strip() or None
if hf_token:
    os.environ["HF_TOKEN"] = hf_token
    subprocess.run(
        ["huggingface-cli", "login", "--token", hf_token, "--add-to-git-credential"], check=False
    )
else:
    print("Skipping Hugging Face login.")

if os.environ.get("GITHUB_TOKEN") is None:
    gh_token = getpass("Enter GitHub token for private repo access (leave blank to skip): ")
    if gh_token.strip():
        os.environ["GITHUB_TOKEN"] = gh_token.strip()
        print("Stored GitHub token in this session.")
    else:
        print("Skipping GitHub token setup. Upload the dataset manually if download fails.")

### 3. (Optional) Mount Google Drive
Use Drive if you want automatic persistence of adapters, logs, or dataset snapshots.

In [None]:
USE_DRIVE = True
if USE_DRIVE:
    from google.colab import drive

    drive.mount("/content/drive")
    BASE_DIR = "/content/drive/MyDrive/beyond-the-cutoff"
else:
    BASE_DIR = "/content"
print(f"Working directory base: {BASE_DIR}")

### 4. Retrieve the training dataset
We use `train_dataset.jsonl` which contains only the training portion of the data. The evaluation portion (`eval_dataset.jsonl`) is held out for the final experiment.

**Important**: Use `train_dataset.jsonl`, NOT `offline_dataset.jsonl`. This prevents data leakage — the model won't see the exact questions used in the final evaluation.

In [None]:
import json
from pathlib import Path

DATA_DIR = Path(BASE_DIR) / "data" / "offline_eval"
DATA_DIR.mkdir(parents=True, exist_ok=True)
DATASET_PATH = DATA_DIR / "train_dataset.jsonl"

if DATASET_PATH.exists():
    print(f"Dataset already available: {DATASET_PATH}")
else:
    import requests

    headers = {}
    github_token = os.environ.get("GITHUB_TOKEN")
    if github_token:
        headers["Authorization"] = f"token {github_token}"
    # Use train_dataset.jsonl (not offline_dataset.jsonl) to prevent data leakage
    url = "https://raw.githubusercontent.com/ignaciolinari/beyond-the-cutoff/main/evaluation/datasets/train_dataset.jsonl"
    response = requests.get(url, headers=headers, timeout=60)
    if response.status_code == 200:
        DATASET_PATH.write_text(response.text, encoding="utf-8")
        print(f"Downloaded training dataset to {DATASET_PATH}")
    else:
        raise RuntimeError(
            f"Failed to download dataset (status {response.status_code}). "
            "Upload train_dataset.jsonl manually and rerun this cell."
        )
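
To back up the leakage claim above, a quick overlap check between the training and held-out files can help. The sketch below assumes both files are JSONL with an `instruction` field (as used later in this notebook) and that a local copy of `eval_dataset.jsonl` is available; the helper names are hypothetical.

```python
import json
from pathlib import Path


def load_instructions(path: Path) -> set[str]:
    """Collect normalized (stripped, lowercased) instruction strings from a JSONL file."""
    instructions = set()
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                record = json.loads(line)
                text = (record.get("instruction") or "").strip().lower()
                if text:
                    instructions.add(text)
    return instructions


def find_overlap(train_path: Path, eval_path: Path) -> set[str]:
    """Return instructions that appear in both files."""
    return load_instructions(train_path) & load_instructions(eval_path)


# Example usage (paths are assumptions):
# overlap = find_overlap(DATASET_PATH, DATA_DIR / "eval_dataset.jsonl")
# print(f"{len(overlap)} instruction(s) shared between train and eval")
```

An empty result only rules out exact (case-insensitive) matches; paraphrased questions would need a fuzzier comparison.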

### 5. Create deterministic train/val split for training
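We group examples by paper and task type, shuffle the groups with seed `20251101`, and assign each group as a whole to one split, so no (paper, task type) group straddles train and val. This internal split is for **training validation only**; the final experiment evaluation uses the separate `eval_dataset.jsonl`.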

In [None]:
import random
from collections import defaultdict


def load_examples(path: Path) -> list[dict]:
    raw = path.read_text(encoding="utf-8").strip().splitlines()
    return [json.loads(line) for line in raw if line]


def extract_paper_id(example: dict) -> str:
    meta = example.get("metadata") or {}
    if isinstance(meta, dict):
        candidate = meta.get("source_path") or meta.get("paper_id")
        if candidate:
            return Path(str(candidate)).stem
    sources = example.get("sources") or []
    if sources:
        return Path(str(sources[0])).stem
    rag = example.get("rag") or {}
    retrieved = rag.get("retrieved") or []
    if retrieved:
        first = retrieved[0]
        if isinstance(first, dict) and first.get("source_path"):
            return Path(str(first["source_path"])).stem
    return "unknown"


examples = load_examples(DATASET_PATH)
print(f"Loaded {len(examples)} training examples")

groups = defaultdict(list)
for example in examples:
    key = (extract_paper_id(example), example.get("task_type"))
    groups[key].append(example)

buckets = list(groups.values())
rng = random.Random(20251101)
rng.shuffle(buckets)

# 85/15 split for train/val (internal training validation only)
# The held-out eval_dataset.jsonl is used for final experiment evaluation
target_counts = {
    "train": int(round(0.85 * len(examples))),
    "val": int(round(0.15 * len(examples))),
}
splits = {"train": [], "val": []}

for bucket in buckets:
    remaining = {split: target_counts[split] - len(splits[split]) for split in splits}
    target_split = max(remaining, key=lambda split: remaining[split])
    if remaining[target_split] <= 0:
        target_split = "train"
    splits[target_split].extend(bucket)

SPLIT_DIR = DATA_DIR / "splits"
SPLIT_DIR.mkdir(parents=True, exist_ok=True)
for split, rows in splits.items():
    out_path = SPLIT_DIR / f"lora_science_v1_instruction_only_{split}.jsonl"
    with out_path.open("w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")
    print(f"{split:>5}: {len(rows):>3} examples → {out_path}")

print("\nNote: Final evaluation uses eval_dataset.jsonl (not part of this training data)")
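
The assignment loop above can be tried on toy data to see how whole groups land in a single split. This is a small sketch with made-up records (three hypothetical papers, two task types, five examples per group); only the splitting logic mirrors the cell above.

```python
import random
from collections import defaultdict

# Toy records: hypothetical papers and task types, five examples per group.
records = [
    {"paper": f"paper_{p}", "task_type": t, "idx": i}
    for p in "abc"
    for t in ("qa", "summary")
    for i in range(5)
]

groups = defaultdict(list)
for record in records:
    groups[(record["paper"], record["task_type"])].append(record)

buckets = list(groups.values())
random.Random(20251101).shuffle(buckets)

targets = {"train": round(0.85 * len(records)), "val": round(0.15 * len(records))}
splits = {"train": [], "val": []}
for bucket in buckets:
    # Send the whole bucket to whichever split is furthest below its target.
    remaining = {name: targets[name] - len(splits[name]) for name in splits}
    name = max(remaining, key=remaining.get)
    if remaining[name] <= 0:
        name = "train"
    splits[name].extend(bucket)

train_keys = {(r["paper"], r["task_type"]) for r in splits["train"]}
val_keys = {(r["paper"], r["task_type"]) for r in splits["val"]}
assert train_keys.isdisjoint(val_keys)  # no group appears in both splits
print({name: len(rows) for name, rows in splits.items()})  # {'train': 25, 'val': 5}
```

Because buckets are never divided, the realized ratio (here 25/5) can drift from the 85/15 target when groups are large relative to the dataset.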

### 6. Load dataset into Hugging Face `DatasetDict`
We keep the raw fields here; prompt construction happens in section 8, where the dataset is pre-formatted before it reaches the trainer.

In [None]:
from datasets import Dataset, DatasetDict


def read_split(split: str) -> Dataset:
    path = SPLIT_DIR / f"lora_science_v1_instruction_only_{split}.jsonl"
    rows = []
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                rows.append(json.loads(line))
    return Dataset.from_list(rows)


dataset = DatasetDict({split: read_split(split) for split in ["train", "val"]})
dataset

### 7. Initialize tokenizer, model, and LoRA config
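We load the base model in bfloat16 on GPUs with compute capability 8.0 or higher and in float16 otherwise (a T4 falls in the latter group), and target the attention and MLP projections for LoRA adapters.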

In [None]:
import torch
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model_name = "Qwen/Qwen2.5-0.5B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

supports_cuda = torch.cuda.is_available()

compute_capability = torch.cuda.get_device_capability(0)[0] if supports_cuda else None

prefer_bf16 = bool(supports_cuda and compute_capability is not None and compute_capability >= 8)

model_dtype = torch.bfloat16 if prefer_bf16 else torch.float16


def build_base_model() -> AutoModelForCausalLM:
    base_model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=model_dtype,
        device_map="auto",
        trust_remote_code=True,
    )
    base_model.config.use_cache = False
    return base_model


model = build_base_model()


lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

print(f"Loaded model on {model.device} with dtype {model_dtype}")
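
For a sense of how many weights this LoRA configuration trains, the count can be estimated by hand. The dimensions below are assumptions taken from the published Qwen2.5-0.5B configuration, not values read from this notebook; check `model.config` before relying on them.

```python
# Assumed Qwen2.5-0.5B dimensions; confirm against model.config.
HIDDEN = 896         # hidden_size
INTERMEDIATE = 4864  # intermediate_size
KV_DIM = 128         # num_key_value_heads (2) * head_dim (64)
NUM_LAYERS = 24      # num_hidden_layers
RANK = 16            # r from lora_config
ALPHA = 32           # lora_alpha from lora_config

# (in_features, out_features) of each targeted projection in one decoder layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A is (rank x in) and B is (out x rank), so the pair adds rank * (in + out) weights."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_layer * NUM_LAYERS
print(f"trainable LoRA parameters: {total:,}")  # trainable LoRA parameters: 8,798,208
print(f"scaling factor alpha/r: {ALPHA / RANK}")  # scaling factor alpha/r: 2.0
```

Under these assumptions the adapters add roughly 8.8M trainable weights, a small fraction of the base model. Once the trainer has wrapped the model, the figure PEFT reports for trainable parameters is the authoritative one.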

### 8. Define prompt formatting and trainer
**INSTRUCTION-ONLY MODE**: We use only the instruction (no RAG contexts) in the user turn. This trains the model to answer questions based on its knowledge alone, matching the FT-only evaluation setup.

**Important**: The user message format matches `_build_instruction_only_prompt()` from the evaluation runner to ensure training/evaluation consistency. The prompt includes the instruction text wrapped in the same format used during evaluation.
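
To make the instruction-only layout concrete, here is a sketch of what one rendered training example looks like. It assumes the ChatML-style layout that Qwen2.5 chat templates use; `render_chatml` is a hand-written approximation for illustration, not the tokenizer's actual template, and the sample question and answer are made up.

```python
def sketch_user_message(instruction: str) -> str:
    """Mirror the instruction-only user turn: question text followed by an answer cue."""
    text = instruction.strip()
    return f"Question: {text}\n\nAnswer:" if text else ""


def render_chatml(messages: list[dict[str, str]]) -> str:
    """Approximate ChatML rendering; the real chat template may differ in details."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


messages = [
    {"role": "system", "content": "You are a research paper assistant."},
    {"role": "user", "content": sketch_user_message("  What is LoRA? ")},
    {"role": "assistant", "content": "LoRA adds low-rank adapter matrices to frozen weights."},
]
rendered = render_chatml(messages)
print(rendered)
```

Note that no retrieved passages appear anywhere in the user turn; the only difference from the RAG-trained notebook is the content of that turn.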

In [None]:
from collections.abc import Mapping, Sequence

from trl import SFTConfig, SFTTrainer


def _get_batch_value(column, index):
    if column is None:
        return None
    if isinstance(column, Mapping):
        return {key: _get_batch_value(value, index) for key, value in column.items()}
    if isinstance(column, Sequence) and not isinstance(column, str | bytes):
        return column[index] if len(column) > index else None
    return column


def build_user_message(
    instruction: str,
    rag_entry: dict | None,
    contexts_fallback: Sequence[str] | None,
) -> str:
    # INSTRUCTION-ONLY MODE: Return only instruction, ignore contexts
    # This trains the model to answer without RAG contexts
    # Format matches evaluation runner's _build_instruction_only_prompt() for consistency
    # System message comes from Modelfile, so user content should NOT duplicate it
    instruction_text = instruction.strip()
    if not instruction_text:
        return ""
    return f"Question: {instruction_text}\n\nAnswer:"


def format_example(example: dict) -> dict[str, str]:
    instruction = (example["instruction"] or "").strip()
    rag_item = example.get("rag")
    contexts_item = example.get("contexts")
    user_content = build_user_message(instruction, rag_item, contexts_item)
    assistant_reply = (example.get("expected_response") or "").strip()
    # Use chat template format (required for HuggingFace models)
    # but user content matches evaluation format for consistency
    # System message is minimal since instructions are in user content
    messages = [
        {
            "role": "system",
            "content": "You are a research paper assistant.",
        },
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": assistant_reply},
    ]
    rendered = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
    # Append EOS only if the chat template has not already closed the assistant turn with it
    if tokenizer.eos_token and not rendered.rstrip().endswith(tokenizer.eos_token):
        rendered += tokenizer.eos_token
    return {"text": rendered}


output_root = Path(BASE_DIR) / "outputs" / "lora_science_v1_instruction_only"
output_root.mkdir(parents=True, exist_ok=True)


model_dtype = globals().get("model_dtype", getattr(model, "dtype", torch.float16))
supports_cuda = torch.cuda.is_available()
prefer_bf16 = globals().get("prefer_bf16", supports_cuda and model_dtype == torch.bfloat16)
fp16_flag = supports_cuda and model_dtype == torch.float16
bf16_flag = supports_cuda and prefer_bf16
tokenizer.model_max_length = 1024


# Ensure tokenizer.eos_token is a string
if tokenizer.eos_token is None or not isinstance(tokenizer.eos_token, str):
    tokenizer.eos_token = "<|endoftext|>"  # Set a common EOS token if it's not already a string


# Make sure we do not stack multiple adapters on the same model instance
if hasattr(model, "peft_config"):
    model = build_base_model()


# Apply formatting to the dataset before passing to SFTTrainer
formatted_dataset = dataset.map(format_example, remove_columns=dataset["train"].column_names)


# Deduplicate formatted examples within each split so repeated prompts are not over-weighted.
# Note: this does not detect overlap between splits or with eval_dataset.jsonl.
def deduplicate_dataset(ds, split_name: str):
    """Remove duplicate prompts based on formatted text."""
    seen_prompts = set()
    unique_indices = []
    duplicates_count = 0

    for idx, example in enumerate(ds):
        prompt_text = example.get("text", "").strip()
        if prompt_text and prompt_text not in seen_prompts:
            seen_prompts.add(prompt_text)
            unique_indices.append(idx)
        else:
            duplicates_count += 1

    if duplicates_count > 0:
        print(f"[info] Removed {duplicates_count} duplicate prompt(s) from {split_name} split")

    return ds.select(unique_indices) if unique_indices else ds


formatted_dataset["train"] = deduplicate_dataset(formatted_dataset["train"], "train")
formatted_dataset["val"] = deduplicate_dataset(formatted_dataset["val"], "val")

print(
    f"After deduplication: train={len(formatted_dataset['train'])}, val={len(formatted_dataset['val'])}"
)


training_args = SFTConfig(
    output_dir=str(output_root),
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=1e-4,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    logging_strategy="steps",
    logging_steps=10,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    gradient_checkpointing=True,
    fp16=fp16_flag,
    bf16=bf16_flag,
    max_grad_norm=1.0,
    report_to="none",
    packing=False,  # Keep packing=False as we formatted the text ourselves
)


trainer = SFTTrainer(
    model=model,
    train_dataset=formatted_dataset["train"],
    eval_dataset=formatted_dataset["val"],
    # No formatting_func: the dataset already carries a rendered "text" column
    peft_config=lora_config,
    args=training_args,
    # No tokenizer argument: SFTTrainer loads the tokenizer from the model checkpoint
)
trainer
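
The schedule implied by these arguments can be estimated before training starts. The sketch below assumes a single GPU and a hypothetical dataset of 400 examples; exact step rounding depends on the Transformers version, so treat the numbers as approximate. The learning-rate function follows the usual linear-warmup-then-cosine shape.

```python
import math


def estimate_schedule(
    num_examples: int,
    per_device_batch: int = 2,
    grad_accum: int = 4,
    epochs: int = 3,
    warmup_ratio: float = 0.05,
) -> tuple[int, int]:
    """Return (total optimizer steps, warmup steps) for a single-device run."""
    micro_batches = math.ceil(num_examples / per_device_batch)
    updates_per_epoch = max(1, micro_batches // grad_accum)
    total_steps = updates_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return total_steps, warmup_steps


def cosine_lr(step: int, total_steps: int, warmup_steps: int, peak_lr: float = 1e-4) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps, warmup_steps = estimate_schedule(400)  # 400 is a made-up dataset size
print(total_steps, warmup_steps)  # 150 8
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{cosine_lr(step, total_steps, warmup_steps):.2e}")
```

With `logging_steps=10`, a run of this size would log about 15 loss values, which is worth knowing before reading the loss curve in section 9.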

### 9. Fine-tune the model
Training time will depend on the size of the dataset.

In [None]:
train_batch_size = training_args.per_device_train_batch_size * max(
    1, training_args.gradient_accumulation_steps
)
print(
    f"Starting LoRA fine-tuning on {len(formatted_dataset['train'])} training examples "
    f"({train_batch_size} effective batch size) for {training_args.num_train_epochs} epochs."
)
print(
    f"Evaluation enabled: {training_args.eval_strategy != 'no'} | "
    f"Logging every {training_args.logging_steps} steps | "
    f"Gradient checkpointing: {training_args.gradient_checkpointing}"
)

train_result = trainer.train()
print("Training finished.")
trainer.log_metrics("train", train_result.metrics)
trainer.save_metrics("train", train_result.metrics)
trainer.save_state()

if train_result.metrics:
    print("Train metrics:")
    for key, value in train_result.metrics.items():
        print(f"  {key}: {value}")

eval_metrics = None
if training_args.eval_strategy != "no":
    print("Running evaluation...")
    eval_metrics = trainer.evaluate()
    trainer.log_metrics("eval", eval_metrics)
    trainer.save_metrics("eval", eval_metrics)
    if eval_metrics:
        print("Eval metrics:")
        for key, value in eval_metrics.items():
            print(f"  {key}: {value}")
else:
    print("Evaluation skipped (eval_strategy='no').")
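
The reported evaluation loss is easier to interpret as perplexity. This is a minimal sketch, assuming `eval_loss` is the mean per-token cross-entropy in nats, as is typical for causal language model training; the helper name is hypothetical.

```python
import math


def perplexity_from_loss(mean_token_loss: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_token_loss)


print(round(perplexity_from_loss(1.0), 3))  # 2.718

# Example usage with the metrics from the cell above:
# if eval_metrics and "eval_loss" in eval_metrics:
#     print(f"val perplexity: {perplexity_from_loss(eval_metrics['eval_loss']):.2f}")
```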

#### Plot training loss curve
Use this after training to confirm the optimizer is behaving as expected.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

loss_history = [
    {"step": entry["step"], "loss": entry["loss"]}
    for entry in trainer.state.log_history
    if "loss" in entry and "step" in entry
]

if not loss_history:
    print("No loss values logged yet. Run the training cell first or raise logging verbosity.")
else:
    loss_df = pd.DataFrame(loss_history).drop_duplicates(subset="step").sort_values("step")
    display(loss_df.tail())

    fig, ax = plt.subplots(figsize=(8, 4.5))
    ax.plot(loss_df["step"], loss_df["loss"], marker="o", linewidth=1.5, markersize=3)
    ax.set_xlabel("Global step")
    ax.set_ylabel("Loss")
    ax.set_title("Training Loss vs. Step")
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
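
Alongside the plot, a simple numeric check can flag a run whose loss did not move. The sketch below compares the average of the first and last few logged losses; the helper name and the window size are assumptions, and the sample values are made up.

```python
from statistics import mean


def loss_trend(losses: list[float], window: int = 3) -> float:
    """Return mean(last window) - mean(first window); negative values indicate improvement."""
    if len(losses) < 2 * window:
        raise ValueError("need at least 2 * window logged losses")
    return mean(losses[-window:]) - mean(losses[:window])


sample_losses = [2.0, 1.8, 1.6, 1.2, 1.0, 0.8]  # made-up values for illustration
print(f"{loss_trend(sample_losses):+.2f}")  # -0.80

# Example usage with the DataFrame built above:
# print(loss_trend(loss_df["loss"].tolist()))
```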

### 10. Quick validation sanity check
Generate answers for a few validation samples to verify the adapter behaviour before exporting.

In [None]:
model.eval()


def preview_response(example: dict, max_new_tokens: int = 256) -> str:
    user_text = build_user_message(
        example["instruction"], example.get("rag"), example.get("contexts")
    )
    # Match training format: minimal system message, user content includes instructions
    messages = [
        {
            "role": "system",
            "content": "You are a research paper assistant.",
        },
        {"role": "user", "content": user_text},
    ]
    prompt_text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    encoded = tokenizer(
        prompt_text,
        return_tensors="pt",
        truncation=True,
        max_length=tokenizer.model_max_length,
        return_attention_mask=True,
    )
    encoded = {key: value.to(model.device) for key, value in encoded.items()}
    if encoded["input_ids"].shape[-1] == tokenizer.model_max_length:
        print(
            "Prompt truncated to fit model_max_length; consider shortening the instruction if this occurs often."
        )
    # Greedy decoding: temperature and top_p are omitted because they have no effect when do_sample=False
    gen_config = GenerationConfig(
        do_sample=False,
        max_new_tokens=max_new_tokens,
    )
    with torch.no_grad():
        generated = model.generate(**encoded, generation_config=gen_config)
    response_ids = generated[0, encoded["input_ids"].shape[-1] :]
    return tokenizer.decode(response_ids, skip_special_tokens=True).strip()


for example in dataset["val"].select(range(min(3, len(dataset["val"])))):
    print("Instruction:", example["instruction"])
    print("Ground truth:", (example.get("expected_response") or "").strip())
    print("Model output:", preview_response(example))
    print("-" * 80)

### 11. Save adapters and tokenizer
Store artifacts under `outputs/lora_science_v1_instruction_only/adapters/lora_science_v1_instruction_only` (relative to `BASE_DIR`). Upload to Drive or remote storage as needed.

In [None]:
adapter_dir = output_root / "adapters"
adapter_dir.mkdir(parents=True, exist_ok=True)
adapter_path = adapter_dir / "lora_science_v1_instruction_only"
trainer.model.save_pretrained(adapter_path)
tokenizer.save_pretrained(adapter_path)
print(f"Saved LoRA adapter to {adapter_path}")
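
Before moving on, it can be worth confirming that the adapter directory contains what later steps expect. The file names below assume PEFT's default safetensors serialization (older releases may write `adapter_model.bin` instead); the helper is hypothetical.

```python
from pathlib import Path

# Assumed default PEFT output files; adjust if your PEFT version differs.
EXPECTED_ADAPTER_FILES = ("adapter_config.json", "adapter_model.safetensors")


def missing_adapter_files(adapter_dir: Path) -> list[str]:
    """Return the expected adapter files that are not present in adapter_dir."""
    return [name for name in EXPECTED_ADAPTER_FILES if not (adapter_dir / name).is_file()]


# Example usage with the path from the cell above:
# missing = missing_adapter_files(Path(adapter_path))
# print("Adapter looks complete." if not missing else f"Missing: {missing}")
```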

### 12. Package artifacts
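Compress the adapter, training args, and logs for upload back to the repo or Drive. The cell below is commented out by default; uncomment it to archive before merging, or rely on section 14, which packages everything after the merge.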

In [None]:
# import shutil

# archive_stem = output_root.parent / "lora_science_v1_artifacts"
# zip_target = archive_stem.with_suffix(".zip")
# if zip_target.exists():
#     zip_target.unlink()
# zip_path = shutil.make_archive(str(archive_stem), "zip", root_dir=output_root)
# print(f"Packed artifacts at {zip_path}")

### 13. Next steps
1. **Persist training metadata** → run the next cell to capture seed `20251101`, the key hyperparameters, and recent metrics in `outputs/lora_science_v1_instruction_only/adapters/lora_science_v1_instruction_only/EXPERIMENT_METADATA.json`.
2. **Materialize merged weights** → execute the following cell to merge the LoRA adapter into the base model and save a ready-to-quantize checkpoint under `outputs/lora_science_v1_instruction_only/merged_full_model`.
3. **Quantize for Ollama** → use the provided CLI snippet to convert the merged weights to GGUF via `llama.cpp`'s `convert-hf-to-gguf.py`, then register the artifact with Ollama.
4. **Re-benchmark** → use this model for both FT-only evaluation (condition 3) and FT+RAG instruction-only evaluation (condition 4). For conditions 5-6, use the RAG-trained model from `lora_science_v1.ipynb`.

In [None]:
# Persist experiment metadata for reproducibility
import json
from datetime import datetime, timezone

import peft
import transformers
import trl
from packaging.version import Version

metadata_target = (
    output_root / "adapters" / "lora_science_v1_instruction_only" / "EXPERIMENT_METADATA.json"
)
metadata_target.parent.mkdir(parents=True, exist_ok=True)

training_summary = {
    "seed": 20251101,
    # datetime.utcnow() is deprecated since Python 3.12; build the UTC timestamp explicitly
    "timestamp_utc": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    "base_model": model_name,
    "training_mode": "instruction_only",  # Mark this as instruction-only training
    "adapter_dir": str(output_root / "adapters" / "lora_science_v1_instruction_only"),
    "output_dir": str(output_root),
    "hyperparameters": {
        "num_train_epochs": training_args.num_train_epochs,
        "per_device_train_batch_size": training_args.per_device_train_batch_size,
        "gradient_accumulation_steps": training_args.gradient_accumulation_steps,
        "learning_rate": training_args.learning_rate,
        "weight_decay": training_args.weight_decay,
        "lr_scheduler_type": training_args.lr_scheduler_type,
        "warmup_ratio": training_args.warmup_ratio,
        "max_seq_length": 1024,
    },
    "optimizer_steps": trainer.state.global_step,
    "train_loss": next(
        (entry.get("loss") for entry in reversed(trainer.state.log_history) if "loss" in entry),
        None,
    ),
    "final_metrics": train_result.metrics
    if "train_result" in locals() and train_result is not None
    else {},
    "libraries": {
        "transformers": Version(transformers.__version__).base_version,
        "trl": Version(trl.__version__).base_version,
        "peft": Version(peft.__version__).base_version,
    },
}

metadata_target.write_text(
    json.dumps(training_summary, indent=2, sort_keys=True) + "\n", encoding="utf-8"
)
print(f"Wrote metadata to {metadata_target}")

In [None]:
# Merge the LoRA adapter back into the base model and save full weights
from peft import AutoPeftModelForCausalLM

merged_output_dir = output_root / "merged_full_model"
merged_output_dir.mkdir(parents=True, exist_ok=True)

print("Merging adapter into the base model…")
merge_dtype = globals().get("model_dtype", torch.float16)
merged_model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_path,
    torch_dtype=merge_dtype,
    device_map="auto",
    trust_remote_code=True,
).merge_and_unload()

merged_model.to("cpu")
merged_model.save_pretrained(merged_output_dir)
tokenizer.save_pretrained(merged_output_dir)
print(f"Merged checkpoint written to {merged_output_dir}")

#### Quantize and evaluate
Run the commands below from your local checkout once the merged checkpoint is synced down:
```bash
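# Convert merged HF weights to an f16 GGUF file. The converter is named
# convert_hf_to_gguf.py in current llama.cpp checkouts (convert-hf-to-gguf.py in older ones).
python /path/to/llama.cpp/convert_hf_to_gguf.py \
  outputs/lora_science_v1_instruction_only/merged_full_model \
  --outfile outputs/lora_science_v1_instruction_only/merged_full_model/Qwen2.5-0.5B-lora_science_v1_instruction_only.f16.gguf \
  --outtype f16

# Quantize to Q4_K_M to match Ollama (the binary location depends on how llama.cpp was built)
/path/to/llama.cpp/build/bin/llama-quantize \
  outputs/lora_science_v1_instruction_only/merged_full_model/Qwen2.5-0.5B-lora_science_v1_instruction_only.f16.gguf \
  outputs/lora_science_v1_instruction_only/merged_full_model/Qwen2.5-0.5B-lora_science_v1_instruction_only.Q4_K_M.gguf \
  Q4_K_M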

# Register a new Ollama model tag for instruction-only model
# Use Modelfile.instruction_only which has the correct system prompt (no citations)
ollama create lora_science_0p5_instruction_only -f ollama/Modelfile.instruction_only
ollama push lora_science_0p5_instruction_only

# Run FT-only evaluation (instruction-only prompts)
python scripts/evaluate_models.py \
  --model-config configs/lora_science_v1_instruction_only_ollama.yaml \
  --model-label lora_science_0p5b_ft_only \
  --prompt-mode instruction \
  --output evaluation/results/lora_science_0p5b_ft_only/metrics.json
```

### 14. Re-package artifacts
Create a fresh archive after merging and metadata updates so downstream steps stay in sync.

In [None]:
# Rebuild the archive to include merged weights and metadata
import shutil

archive_stem = output_root.parent / "lora_science_v1_instruction_only_artifacts_postmerge"
zip_target = archive_stem.with_suffix(".zip")
if zip_target.exists():
    zip_target.unlink()
zip_path = shutil.make_archive(str(archive_stem), "zip", root_dir=output_root)
print(f"Packed updated artifacts at {zip_path}")
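
A quick look inside the archive helps confirm that the merged weights and metadata made it in. This is a small sketch using only the standard library; the helper name is hypothetical.

```python
import zipfile
from pathlib import Path


def archive_summary(zip_file: Path) -> tuple[int, int]:
    """Return (file_count, total_uncompressed_bytes) for a zip archive."""
    with zipfile.ZipFile(zip_file) as archive:
        infos = [info for info in archive.infolist() if not info.is_dir()]
        return len(infos), sum(info.file_size for info in infos)


# Example usage with the archive built above:
# count, size = archive_summary(Path(zip_path))
# print(f"{count} files, {size / 1e6:.1f} MB uncompressed")
```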