# Training a LoRA Adapter for llamadart

This notebook provides a step-by-step guide to fine-tuning a Large Language Model (LLM) using **LoRA (Low-Rank Adaptation)** and preparing it for use with the `llamadart` plugin in Flutter or Dart applications.

### Why LoRA?
- **Efficient**: Only a small fraction of parameters are trained.
- **Portable**: The resulting "adapter" files are small (typically 10 MB to 200 MB).
- **Dynamic**: You can swap adapters at runtime in `llamadart` without reloading the base model.
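
To make the idea concrete, here is a minimal pure-Python sketch of a LoRA-adapted linear layer. The frozen weight `W` stays untouched and the adapter adds a scaled low-rank product, `y = W x + (alpha / r) * B (A x)`. The matrix sizes and values are toy assumptions chosen for illustration, and the helper names (`matvec`, `lora_forward`) are hypothetical; they are not part of `peft` or `llamadart`.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Frozen base output plus the scaled low-rank update B @ (A @ x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Toy sizes (assumed for illustration): 8 inputs, 8 outputs, rank 2.
d_in, d_out, r, alpha = 8, 8, 2, 4
W = [[1.0 if i == j else 0.0 for j in range(d_in)] for i in range(d_out)]  # frozen
A = [[0.5] * d_in for _ in range(r)]    # r x d_in, trainable
B = [[0.0] * r for _ in range(d_out)]   # d_out x r, trainable, zero-initialized here
x = [float(i) for i in range(1, d_in + 1)]

full_params = d_in * d_out        # parameters in the frozen matrix
lora_params = r * (d_in + d_out)  # parameters in A and B combined

# While B is all zeros the adapter is a no-op, so this equals W @ x.
print(lora_forward(W, A, B, x, alpha, r))
print(f"full: {full_params} params, LoRA: {lora_params} params")
```

Because only `A` and `B` are trained and stored, the adapter can be shipped separately from the base weights and swapped without touching `W`, which is what makes the runtime swapping described above possible.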

### Resources
- **Model**: [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)
- **Dataset**: [timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco)

## 1. Setup Environment

We will use the Hugging Face `peft` library for LoRA and `bitsandbytes` for 4-bit quantization (QLoRA) where supported.

In [None]:
%pip install -q -U transformers peft bitsandbytes datasets accelerate trl

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
import os

# Configuration
model_id = "Qwen/Qwen2.5-0.5B-Instruct" # Small model for demonstration (Downloaded from Hugging Face)
dataset_id = "timdettmers/openassistant-guanaco" # Instruction following dataset
output_dir = "./lora-adapter-output"

# Note: Models and datasets are cached by default in ~/.cache/huggingface/

# Detect environment
has_cuda = torch.cuda.is_available()
is_mac = torch.backends.mps.is_available()
print(f"CUDA available: {has_cuda}")
print(f"Apple Silicon (MPS) available: {is_mac}")

## 2. Load Model and Tokenizer

On NVIDIA GPUs, we use 4-bit quantization (QLoRA) to save VRAM.
On Mac (Apple Silicon) or CPU, we load the weights in `bfloat16` or `float16` without quantization, because `bitsandbytes` 4-bit support is currently optimized for CUDA.

In [None]:
if has_cuda:
    # QLoRA config for NVIDIA GPUs
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
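        # Fall back to FP16 compute on GPUs without BF16 support
        bnb_4bit_compute_dtype=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16,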
        bnb_4bit_use_double_quant=True,
    )
    model_kwargs = {"quantization_config": bnb_config}
else:
    # FP16/BF16 for Mac or CPU (standard LoRA). CUDA is unavailable in this branch,
    # so we avoid querying torch.cuda and choose the dtype from the MPS check alone.
    model_kwargs = {"torch_dtype": torch.bfloat16 if is_mac else torch.float16}

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
    **model_kwargs
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
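# Fall back to EOS only when the tokenizer defines no pad token; reusing EOS as padding
# causes the language-modeling collator to mask real EOS tokens out of the loss.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token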
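
As a rough illustration of why the precision choice above matters, the sketch below estimates the memory taken by the model weights alone at different precisions. The parameter count is an assumption based on the "0.49B" figure for this model, and the estimate deliberately ignores quantization constants, layers that stay unquantized, activations, gradients, and optimizer state, so real usage will be higher.

```python
# Approximate parameter count for Qwen2.5-0.5B (assumption, not measured here).
n_params = 0.49e9

# Storage cost per parameter at each precision, in bytes.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "bfloat16/float16": 2.0,
    "4-bit (NF4)": 0.5,
}


def weight_memory_gib(n_params, bytes_per_param):
    """Memory needed for the weights only, in GiB."""
    return n_params * bytes_per_param / 2**30


for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name:>18}: ~{weight_memory_gib(n_params, nbytes):.2f} GiB")
```

For a model this small the savings are modest in absolute terms; the same arithmetic becomes more significant for larger base models.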

## 3. Prepare for LoRA

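Define the LoRA parameters. `r` is the rank of the update matrices: a higher rank is more expressive but produces a larger adapter file. `lora_alpha` scales the update by `lora_alpha / r` (here 32 / 16 = 2), and `target_modules` lists the layers that receive adapters.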

In [None]:
if has_cuda:
    model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"], # Depends on architecture
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
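
The relationship between `r` and adapter size can be sketched with simple arithmetic: each adapted linear layer with `d_in` inputs and `d_out` outputs gains `r * (d_in + d_out)` trainable parameters. The layer dimensions below are assumptions for Qwen2.5-0.5B (hidden size 896, key/value projection width 128, 24 layers) and should be checked against the model's `config.json`; `lora_param_count` is a hypothetical helper, not a `peft` function.

```python
def lora_param_count(shapes, r, n_layers):
    """Each adapted Linear(d_in, d_out) gains A (r x d_in) and B (d_out x r)."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * n_layers


# Assumed dimensions for Qwen2.5-0.5B; verify against the model's config.json.
hidden, kv_dim, n_layers = 896, 128, 24
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
}

for r in (8, 16, 64):
    n = lora_param_count(shapes, r, n_layers)
    print(f"r={r:>2}: {n:,} trainable params, ~{n * 4 / 1e6:.1f} MB in float32")
```

The result for `r=16` can be compared with the figure printed by `model.print_trainable_parameters()`. If the two differ, the assumed dimensions are the first thing to check.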

## 4. Load and Tokenize Dataset

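We pre-tokenize the dataset ourselves so that the plain `Trainer` can consume it directly, rather than relying on a specialized trainer such as `SFTTrainer`. Only the first 100 examples are loaded to keep the demonstration fast.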

In [None]:
dataset = load_dataset(dataset_id, split="train[:100]")

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512, padding="max_length")

tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=dataset.column_names)
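
The padded sequences produced here are later turned into training labels by the data collator used in the training step. As far as I understand its behavior with `mlm=False`, it copies `input_ids` into `labels` and replaces every pad position with `-100` so the loss ignores it. The sketch below mimics that step in plain Python; the token IDs are hypothetical and `make_causal_lm_labels` is an illustrative helper, not a `transformers` function.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def make_causal_lm_labels(input_ids, pad_token_id):
    """Copy input IDs to labels, masking pad positions out of the loss."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


# Hypothetical token IDs: 7 stands for EOS, 0 for a dedicated pad token.
EOS, PAD = 7, 0

padded = [11, 12, 13, EOS, PAD, PAD]
print(make_causal_lm_labels(padded, pad_token_id=PAD))
# [11, 12, 13, 7, -100, -100]

# When padding reuses the EOS token, the genuine EOS is masked as well.
padded_with_eos = [11, 12, 13, EOS, EOS, EOS]
print(make_causal_lm_labels(padded_with_eos, pad_token_id=EOS))
# [11, 12, 13, -100, -100, -100]
```

The second case shows why the choice of pad token matters: if the end-of-sequence token never contributes to the loss, the fine-tuned model gets no training signal for when to stop generating.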

## 5. Training

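We use the standard `Trainer` from the `transformers` library rather than a specialized trainer, which avoids version conflicts between libraries. The run is capped at 50 optimizer steps (`max_steps=50`) to keep the demonstration short.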

In [None]:
training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    logging_steps=10,
    max_steps=50,
    # Query CUDA BF16 support only when CUDA is present; plain CPU runs use neither flag.
    fp16=has_cuda and not torch.cuda.is_bf16_supported(),
    bf16=(has_cuda and torch.cuda.is_bf16_supported()) or is_mac,
    save_strategy="no",
    report_to="none",
    remove_unused_columns=False
)

trainer = Trainer(
    model=model,
    train_dataset=tokenized_dataset,
    args=training_args,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()

# Save only the adapter
model.save_pretrained(output_dir)
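
A quick back-of-the-envelope check shows how much data this run consumes. The values are taken from the cells above, except `n_devices`, which is an assumption (a single GPU or process). The result is approximate because the exact handling of partial accumulation windows depends on the `transformers` version.

```python
n_examples = 100      # from split="train[:100]"
per_device_batch = 4  # per_device_train_batch_size
grad_accum = 4        # gradient_accumulation_steps
max_steps = 50        # optimizer steps
n_devices = 1         # assumption: single device

# Sequences contributing to each optimizer step.
effective_batch = per_device_batch * grad_accum * n_devices
# Total sequences processed over the whole run.
sequences_seen = effective_batch * max_steps
# Approximate number of passes over the 100-example subset.
passes_over_data = sequences_seen / n_examples

print(f"effective batch size: {effective_batch}")
print(f"sequences seen:       {sequences_seen}")
print(f"approx. passes:       {passes_over_data:.1f}")
```

Roughly eight passes over 100 examples is enough to demonstrate the workflow but invites overfitting; for a real adapter you would likely use more data and tune `max_steps` accordingly.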

## 6. Convert to GGUF format

After training, you have a Hugging Face LoRA adapter: a weights file (`adapter_model.safetensors` or `adapter_model.bin`, depending on the `peft` version and save options) and an `adapter_config.json`. To use it with `llamadart`, you must convert it to the **GGUF** format.
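
Before converting, it can help to confirm what was actually saved. The following sketch reads `adapter_config.json` and lists the weight files without loading any tensors. `describe_adapter` is a hypothetical helper written for this guide, and it assumes the config uses the usual `peft` keys (`base_model_name_or_path`, `r`, `lora_alpha`, `target_modules`).

```python
import json
from pathlib import Path


def describe_adapter(adapter_dir):
    """Summarize a saved PEFT adapter directory without loading any weights."""
    adapter_dir = Path(adapter_dir)
    config = json.loads((adapter_dir / "adapter_config.json").read_text())
    weight_files = sorted(
        p.name for p in adapter_dir.iterdir() if p.suffix in {".safetensors", ".bin"}
    )
    return {
        "base_model": config.get("base_model_name_or_path"),
        "r": config.get("r"),
        "lora_alpha": config.get("lora_alpha"),
        "target_modules": sorted(config.get("target_modules") or []),
        "weight_files": weight_files,
    }


# Only runs when the adapter from the training step is present.
adapter_path = Path("./lora-adapter-output")
if (adapter_path / "adapter_config.json").is_file():
    for key, value in describe_adapter(adapter_path).items():
        print(f"{key}: {value}")
```

The `base_model` value is worth noting: the GGUF base model you load in `llamadart` should correspond to the same model the adapter was trained on.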

### Step 1: Clone llama.cpp
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt
```

### Step 2: Convert
Run the conversion script from the `llama.cpp` directory, passing the directory where you saved the adapter.

```bash
python convert_lora_to_gguf.py ../lora-adapter-output/ --outfile my_adapter.gguf
```
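
A simple sanity check on the converted file is to read its header: GGUF files begin with the four ASCII bytes `GGUF` followed by a 32-bit version number. The sketch below assumes a little-endian file, which is the common case, and `read_gguf_header` is a hypothetical helper rather than part of `llama.cpp` or `llamadart`.

```python
import struct
from pathlib import Path

GGUF_MAGIC = b"GGUF"


def read_gguf_header(path):
    """Return the GGUF version if the file starts with the GGUF magic, else raise."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # Assumption: little-endian unsigned 32-bit version field.
    (version,) = struct.unpack("<I", header[4:8])
    return version


# Only runs when the converted adapter is present in the working directory.
gguf_path = Path("my_adapter.gguf")
if gguf_path.is_file():
    print(f"GGUF version: {read_gguf_header(gguf_path)}")
```

This only confirms that the file has a GGUF header. It does not verify that the adapter tensors match the base model.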

### Step 3: Use in llamadart

#### Option A: Load in your code
```dart
await service.init(
  'base_model.gguf',
  modelParams: ModelParams(
    loras: [LoraAdapterConfig(path: 'path/to/my_adapter.gguf', scale: 1.0)],
  ),
);
```

#### Option B: Test with basic_app CLI
Navigate to the `example/basic_app` directory and run the following command. The `basic_app` uses the same **Qwen2.5-0.5B** base model by default, so the adapter trained above is compatible with it. A LoRA adapter only works with the base model it was trained on.

```bash
dart run bin/llamadart_basic_example.dart --lora path/to/my_adapter.gguf
```