# FinScribe LLaMA-Factory Micro LoRA Experiment

This notebook runs a minimal LoRA supervised fine-tuning (SFT) job with LLaMA-Factory on 10 synthetic invoice examples. It is meant for development and pipeline testing, not for producing a usable model.

**Requirements:**
- Colab with GPU runtime (recommended)
- Hugging Face token (if using gated models)
- ~20GB disk space
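
Before installing anything, it can help to confirm that the runtime has roughly the disk space listed above. The following is a minimal sketch using only the standard library; the helper name `free_disk_gb` and the `REQUIRED_GB` constant are illustrative and not part of LLaMA-Factory.

```python
import shutil

REQUIRED_GB = 20  # rough figure taken from the requirements list above


def free_disk_gb(path="."):
    """Return the free space on the filesystem containing `path`, in GiB."""
    return shutil.disk_usage(path).free / 1024**3


free = free_disk_gb()
status = "OK" if free >= REQUIRED_GB else "LOW"
print(f"Free disk: {free:.1f} GiB ({status}; about {REQUIRED_GB} GB suggested)")
```

If the check reports `LOW`, mounting Google Drive or choosing a smaller base model are the usual workarounds.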


## Cell 1: Setup & Install


In [None]:
# Colab cell 1: install deps & clone
# If you run into space issues, consider mounting Google Drive
!nvidia-smi || true
!git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
%cd LLaMA-Factory
!pip install -e ".[torch,metrics]"  # may take time, adjust to your CUDA/PyTorch combo


## Cell 2: Create Tiny Dataset


In [None]:
# Colab cell 2: Create synthetic invoice dataset
import json, random, os

os.makedirs("data", exist_ok=True)
train = []

for i in range(10):
    vendor = random.choice(["TechCorp Inc.","Acme LLC","Globex"])
    inv = f"INV-{1000+i}"
    date = f"2024-0{random.randint(1,9)}-{random.randint(10,28)}"
    prompt = f"Validate and correct: OCR_TEXT: Vendor: {vendor} Invoice: {inv} Date: {date} Items: Widget 2x50 Total 100"
    completion = json.dumps({
        "document_type":"invoice",
        "vendor":{"name":vendor},
        "client":{},
        "line_items":[{"desc":"Widget","qty":2,"unit_price":50.0,"line_total":100.0}],
        "financial_summary":{"subtotal":100.0,"tax_rate":0.0,"tax_amount":0.0,"grand_total":100.0}
    })
    train.append({"instruction":"Validate and return JSON only", "input": prompt, "output": completion})

with open("data/finscribe_lf_train.jsonl","w") as f:
    for item in train:
        f.write(json.dumps(item) + "\n")

print(f"Wrote {len(train)} examples to data/finscribe_lf_train.jsonl")
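
A quick read-back check can catch malformed records before training starts. The sketch below assumes each line is a JSON object with the `instruction`, `input`, and `output` keys written by the cell above, and that `output` is itself a JSON string. The function name `validate_alpaca_jsonl` is hypothetical.

```python
import json
import os

REQUIRED_KEYS = {"instruction", "input", "output"}


def validate_alpaca_jsonl(path):
    """Return a list of (line_number, problem) tuples; an empty list means no problems found."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append((lineno, f"invalid JSON: {e.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((lineno, "record is not a JSON object"))
                continue
            missing = REQUIRED_KEYS - set(record)
            if missing:
                problems.append((lineno, f"missing keys: {sorted(missing)}"))
                continue
            # The target completion is stored as a JSON string, so it should parse too.
            try:
                json.loads(record["output"])
            except (TypeError, json.JSONDecodeError):
                problems.append((lineno, "output is not a JSON string"))
    return problems


path = "data/finscribe_lf_train.jsonl"
if os.path.exists(path):
    print(validate_alpaca_jsonl(path) or "All records look well-formed")
```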


## Cell 3: Register Dataset & Create Training Config


In [None]:
# Colab cell 3: Register dataset and create YAML config
import json, os  # os is needed below for os.makedirs

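# Register dataset in dataset_info.json
# Note: this overwrites the dataset_info.json bundled with the repo,
# which is fine for a throwaway clone used only for this experiment.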
dataset_info = {
    "finscribe_lf_train": {
        "file_name": "finscribe_lf_train.jsonl",
        "format": "jsonl",
        "description": "FinScribe micro experiment dataset"
    }
}

with open("data/dataset_info.json", "w") as f:
    json.dump(dataset_info, f, indent=2)

print("Registered dataset in data/dataset_info.json")

# Create training YAML
yaml_config = """
model_name_or_path: <SMALL_MODEL_NAME>  # Replace with small model like 'facebook/opt-125m' or 'microsoft/phi-2'
stage: sft
do_train: true
finetuning_type: lora
dataset: finscribe_lf_train
cutoff_len: 512
output_dir: saves/finscribe_test
per_device_train_batch_size: 1
num_train_epochs: 1
learning_rate: 2e-5
bf16: false
logging_steps: 5
save_steps: 10
"""

os.makedirs("examples/train_lora", exist_ok=True)
with open("examples/train_lora/finscribe_colab.yaml", "w") as f:
    f.write(yaml_config.strip())

print("Created examples/train_lora/finscribe_colab.yaml")
print("\n⚠️  IMPORTANT: Edit the YAML file to replace <SMALL_MODEL_NAME> with your chosen model!")
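
Instead of editing the file by hand, the placeholder can be filled in from a cell. This is a minimal sketch using plain string replacement; `fill_model_name` is a hypothetical helper, and `facebook/opt-125m` is just the example model mentioned in the config comment, not a recommendation.

```python
from pathlib import Path

PLACEHOLDER = "<SMALL_MODEL_NAME>"
MODEL_NAME = "facebook/opt-125m"  # assumption: swap in whichever model you intend to train


def fill_model_name(yaml_text, model_name):
    """Replace the first placeholder occurrence with model_name; raise if it is absent."""
    if PLACEHOLDER not in yaml_text:
        raise ValueError(f"{PLACEHOLDER} not found; the config may already be edited")
    return yaml_text.replace(PLACEHOLDER, model_name, 1)


config_path = Path("examples/train_lora/finscribe_colab.yaml")
if config_path.exists() and PLACEHOLDER in config_path.read_text():
    config_path.write_text(fill_model_name(config_path.read_text(), MODEL_NAME))
    print(f"Set model_name_or_path to {MODEL_NAME}")
```

Raising when the placeholder is missing keeps the helper from silently doing nothing on a config that was already edited.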


## Cell 4: Run Training


In [None]:
# Colab cell 4: Run training
# Make sure you've edited the YAML to set model_name_or_path

!llamafactory-cli train examples/train_lora/finscribe_colab.yaml

# If llamafactory-cli is not on PATH, run the repo's training script directly:
# !python src/train.py examples/train_lora/finscribe_colab.yaml
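
After the run finishes, it is worth confirming that adapter weights were actually written. The sketch below assumes the PEFT convention of saving an `adapter_config.json` next to the adapter weights, and it searches the `output_dir` from the config (`saves/finscribe_test`). The helper name `find_adapter_dirs` is illustrative.

```python
from pathlib import Path


def find_adapter_dirs(output_dir):
    """Return directories at or below output_dir that contain an adapter_config.json."""
    root = Path(output_dir)
    if not root.is_dir():
        return []
    return sorted(p.parent for p in root.rglob("adapter_config.json"))


adapter_dirs = find_adapter_dirs("saves/finscribe_test")
if adapter_dirs:
    for d in adapter_dirs:
        print(f"Found adapter in {d}")
else:
    print("No adapter found; check the training logs for errors.")
```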


## Cell 5: Inference Test (if serving locally)


In [None]:
# Colab cell 5: Example inference stub
# Use your running LLaMA-Factory API or load model directly

import requests
import json

# Example API call (if you started the API server)
API = "http://localhost:8000/v1/chat/completions"
payload = {
    "model": "finscribe-llama",
    "messages": [
        {"role": "user", "content": "Validate JSON: {\"document_type\":\"invoice\",\"vendor\":{\"name\":\"TechCorp Inc.\"}, ... }"}
    ],
    "temperature": 0
}

# Uncomment to call API:
# try:
#     r = requests.post(API, json=payload, timeout=30)
#     print(r.json())
# except Exception as e:
#     print(f"API not available: {e}")

print("If you trained & served the model, call the API with the payload above.")
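
Because the model is trained to return JSON only, a useful follow-up is to check whether the reply actually parses. The sketch below assumes the server returns an OpenAI-style chat completion body, with the text under `choices[0].message.content`. The helper `extract_invoice_json` is hypothetical, and `fake_response` is a hand-written stand-in, not real model output.

```python
import json


def extract_invoice_json(response):
    """Return (parsed, raw_text) from an OpenAI-style chat completion dict.

    parsed is None when the assistant text is not valid JSON.
    """
    text = response["choices"][0]["message"]["content"]
    try:
        return json.loads(text), text
    except (TypeError, json.JSONDecodeError):
        return None, text


# Hand-written stand-in for a server reply, used only to exercise the helper.
fake_response = {
    "choices": [
        {"message": {"role": "assistant", "content": "{\"document_type\": \"invoice\"}"}}
    ]
}
parsed, raw = extract_invoice_json(fake_response)
print("Parsed OK" if parsed is not None else f"Not valid JSON: {raw!r}")
```

With a live server, pass `r.json()` from the request above in place of `fake_response`.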
