# Plaited World Agent Training

Train a UI generation agent using Unsloth on Google Colab.

## Training Pipeline Overview

```mermaid
flowchart TB
    subgraph Training["Training Pipeline"]
        Notebook["training/plaited-world-agent-training.ipynb"]
        SFT["1. SFT<br/>(Developer writes)"]
        DPO["2. DPO<br/>(Developer preferences)"]
    end

    subgraph Local["Local Training (Requires Browser)"]
        GRPO["3. GRPO<br/>(Model generates → Browser validates)"]
    end

    Notebook --> SFT --> DPO
    DPO -->|"Trained Model"| GRPO
```

**Note:** GRPO requires browser execution for reward computation and runs locally with `plaited-world-agent-grpo.ipynb`.

## Training Phases
1. **SFT (Supervised Fine-Tuning)** - Learn from gold examples (this notebook)
2. **DPO (Direct Preference Optimization)** - Learn from preference pairs (optional, this notebook)
3. **GRPO (Group Relative Policy Optimization)** - Model generates, browser validates (separate notebook, requires local setup)

## Trajectory Format

Each trajectory record includes tiered analysis results and structural metadata:

```json
{
  "messages": [...],
  "reward": 0.85,
  "tiers": {
    "static": {"passed": true, "tier": 1, "checks": [...]},
    "browser": {"passed": true, "a11yPassed": true, "totalAssertions": 5, "passedAssertions": 5}
  },
  "structural": {
    "objects": [{"name": "Button", "grouping": "nested"}],
    "channel": "selection",
    "loops": [{"trigger": "click", "handler": "click"}],
    "levers": ["button"],
    "block": "card"
  }
}
```
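
A quick sanity check before uploading can catch malformed records early. The sketch below is a minimal example, assuming every record carries the four top-level keys shown in the sample above; the helper names `validate_trajectory` and `validate_file` are hypothetical and not part of the Plaited tooling.

```python
import json

# Top-level keys taken from the sample record above (an assumption about the full schema).
REQUIRED_KEYS = ("messages", "reward", "tiers", "structural")


def validate_trajectory(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]

    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in record]
    reward = record.get("reward")
    # bool is a subclass of int, so exclude it explicitly.
    if "reward" in record and (isinstance(reward, bool) or not isinstance(reward, (int, float))):
        problems.append("reward is not a number")
    if "messages" in record and not isinstance(record["messages"], list):
        problems.append("messages is not a list")
    return problems


def validate_file(path: str) -> dict[int, list[str]]:
    """Map 1-based line numbers to their problems, skipping blank lines."""
    report = {}
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if line.strip():
                problems = validate_trajectory(line)
                if problems:
                    report[number] = problems
    return report
```

Calling `validate_file("trajectories.jsonl")` returns an empty dict when every non-blank line passes these checks.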

## Prerequisites

### 1. Enable GPU Runtime
Runtime → Change runtime type → **T4 GPU**

### 2. Upload Training Files
Click 📁 in the left sidebar, then upload:
- `trajectories.jsonl` (required) - Generate with: `bun scripts/generate-trajectories.ts training/stories -o training/trajectories.jsonl`
- `dpo-preferences.jsonl` (optional, for DPO phase)

### 3. Add Colab Secrets
Click 🔑 in the left sidebar, then add these secrets:

| Secret Name | Value | Example |
|-------------|-------|---------|
| `HF_TOKEN` | Your HuggingFace token (with write access) | `hf_xxxxx` |
| `HF_USERNAME` | HuggingFace org or username | `plaited` |
| `HF_MODEL_NAME` | Model name to push | `plaited-world-agent-lora` |

Toggle "Notebook access" ON for each secret.

In [None]:
# Cell 1: Install Dependencies (trl and datasets are pinned for Unsloth compatibility)
!pip install unsloth
!pip install trl==0.24.0 datasets==4.3.0 transformers

In [None]:
# Cell 2: Load Secrets and Login to HuggingFace
from huggingface_hub import login
from google.colab import userdata

# Load secrets from Colab (set via the 🔑 sidebar)
hf_token = userdata.get('HF_TOKEN')
hf_username = userdata.get('HF_USERNAME')
hf_model_name = userdata.get('HF_MODEL_NAME')

# Login to HuggingFace
login(token=hf_token)

print("✓ Logged in to HuggingFace")
print(f"✓ Will push to: {hf_username}/{hf_model_name}")

In [None]:
# Cell 3: Load Model with Unsloth
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="google/functiongemma-270m-it",  # Function calling optimized (270M params)
    max_seq_length=2048,
    dtype=None,  # Auto-detect
    load_in_4bit=True,  # 4-bit quantization for memory efficiency
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

print(f"Model loaded with {model.num_parameters():,} parameters")

In [None]:
# Cell 4: Load Training Data
from datasets import load_dataset

# Upload your trajectories.jsonl to Colab or load from HuggingFace
dataset = load_dataset("json", data_files="trajectories.jsonl", split="train")

print(f"Loaded {len(dataset)} trajectories")
print(f"Sample intent: {dataset[0]['messages'][1]['content']}")

def format_for_training(example):
    """Format trajectory for SFT training."""
    messages = example["messages"]
    reward = example["reward"]

    # Format as chat template
    text = tokenizer.apply_chat_template(messages, tokenize=False)

    return {
        "text": text,
        "reward": reward
    }

dataset = dataset.map(format_for_training)
print("Dataset formatted for training")

In [None]:
# Cell 5: Configure SFT Training (Phase 1)
from trl import SFTConfig, SFTTrainer

config = SFTConfig(
    output_dir="./sft-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    logging_steps=10,
    save_steps=100,
    fp16=True,
    max_seq_length=2048,
)

trainer = SFTTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

print("SFT Trainer configured")

In [None]:
# Cell 6: Train
print("Starting training...")
trainer.train()
print("Training complete!")

## Phase 2: DPO Training (Optional)

After SFT, you can further improve the model with Direct Preference Optimization (DPO). DPO requires pairs of (chosen, rejected) responses for the same intent.

**When to use DPO:**
- You have examples of good AND bad generations for the same intent
- You want the model to learn subtle quality differences
- SFT alone isn't producing high-quality outputs

**Data format for DPO:**
```json
{
  "prompt": "Create a primary button with hover state",
  "chosen": "<start_function_call>call:writeTemplate{...good output...}<end_function_call>",
  "rejected": "<start_function_call>call:writeTemplate{...bad output...}<end_function_call>"
}
```
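
One possible way to produce such pairs is to group scored generations by prompt and pair the highest-reward output with the lowest-reward one. The sketch below assumes each generation is a dict with hypothetical `prompt`, `completion`, and `reward` keys, and the `min_gap` threshold is an arbitrary illustration; it is not necessarily how the project builds its preference data.

```python
import json
from collections import defaultdict


def build_preference_pairs(generations, min_gap=0.2):
    """Pair the best and worst completion per prompt when their rewards differ enough."""
    by_prompt = defaultdict(list)
    for item in generations:
        by_prompt[item["prompt"]].append(item)

    pairs = []
    for prompt, items in by_prompt.items():
        if len(items) < 2:
            continue  # need at least two candidates to form a pair
        best = max(items, key=lambda g: g["reward"])
        worst = min(items, key=lambda g: g["reward"])
        # Skip near-ties: they carry little preference signal (min_gap is an assumed cutoff).
        if best["reward"] - worst["reward"] >= min_gap:
            pairs.append({
                "prompt": prompt,
                "chosen": best["completion"],
                "rejected": worst["completion"],
            })
    return pairs


def write_jsonl(pairs, path="dpo-preferences.jsonl"):
    """Write one JSON object per line, the layout that the JSON dataset loader reads."""
    with open(path, "w", encoding="utf-8") as handle:
        for pair in pairs:
            handle.write(json.dumps(pair) + "\n")
```

Prompts with a single generation, or with rewards closer than `min_gap`, are skipped rather than forced into a pair.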

In [None]:
# Cell 7: DPO Training (Phase 2 - Optional)
# Skip this cell if you only want SFT training
# Requires: dpo-preferences.jsonl with prompt/chosen/rejected fields

from trl import DPOConfig, DPOTrainer
from datasets import load_dataset

# Load preference pairs (upload dpo-preferences.jsonl first)
try:
    dpo_dataset = load_dataset("json", data_files="dpo-preferences.jsonl", split="train")
    
    dpo_config = DPOConfig(
        output_dir="./dpo-output",
        num_train_epochs=1,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=5e-6,  # Lower LR for DPO
        warmup_ratio=0.1,
        logging_steps=10,
        save_steps=50,
        fp16=True,
        max_length=2048,
        max_prompt_length=512,
        beta=0.1,  # KL penalty coefficient
    )
    
    dpo_trainer = DPOTrainer(
        model=model,
        ref_model=None,  # Use implicit reference (recommended for LoRA)
        args=dpo_config,
        train_dataset=dpo_dataset,
        tokenizer=tokenizer,
    )
    
    print("Starting DPO training...")
    dpo_trainer.train()
    print("DPO training complete!")
    
except FileNotFoundError:
    print("No dpo-preferences.jsonl found. Skipping DPO phase.")
    print("To use DPO, create preference pairs from model generations.")

In [None]:
# Cell 8: Save and Push to Hub
MODEL_NAME = f"{hf_username}/{hf_model_name}"

# Save locally first
model.save_pretrained(f"./{hf_model_name}")
tokenizer.save_pretrained(f"./{hf_model_name}")
print("Saved locally")

# Push to HuggingFace Hub
model.push_to_hub(MODEL_NAME, token=hf_token)
tokenizer.push_to_hub(MODEL_NAME, token=hf_token)

print(f"\nModel pushed to: https://huggingface.co/{MODEL_NAME}")
print("\nNext steps:")
print("1. Go to https://huggingface.co/endpoints")
print("2. Create new endpoint with your model")
print("3. Select T4 GPU and vLLM container")
print("4. Copy endpoint URL for use in your agent")

## Troubleshooting

### Out of Memory
- Reduce `per_device_train_batch_size` to 2 or 1
- Increase `gradient_accumulation_steps` to compensate
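
The effective batch size is `per_device_train_batch_size` × `gradient_accumulation_steps` (× number of devices), so halving one while doubling the other keeps the optimization behavior roughly the same. A small sketch of that arithmetic follows; the helper name `accumulation_steps_for` is hypothetical.

```python
def accumulation_steps_for(target_effective_batch: int, per_device_batch: int, num_devices: int = 1) -> int:
    """Return the gradient_accumulation_steps that preserves the effective batch size."""
    per_step = per_device_batch * num_devices
    if target_effective_batch % per_step != 0:
        raise ValueError("target batch size must be a multiple of per-device batch x devices")
    return target_effective_batch // per_step


# Cell 5 uses batch size 4 with 4 accumulation steps, an effective batch of 16.
effective = 4 * 4
for batch in (4, 2, 1):
    print(f"batch={batch} -> gradient_accumulation_steps={accumulation_steps_for(effective, batch)}")
```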

### Model Not Generating Tools
- Check tool schema format matches training data
- Try increasing temperature to 0.7
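
To confirm whether a generation contains a tool call at all, you can scan the raw output for function-call blocks. The sketch below infers the delimiter format from the DPO example in this notebook (`<start_function_call>call:NAME{...}<end_function_call>`); the pattern and the helper name `extract_tool_calls` are assumptions, so adjust them if your chat template differs.

```python
import re

# Pattern inferred from the DPO example in this notebook, not from an official grammar.
CALL_PATTERN = re.compile(
    r"<start_function_call>call:(\w+)\{(.*?)\}<end_function_call>",
    re.DOTALL,
)


def extract_tool_calls(text: str) -> list[tuple[str, str]]:
    """Return (tool_name, raw_argument_body) for each function-call block in text."""
    return CALL_PATTERN.findall(text)
```

An empty list means the model answered in plain text, which points to a schema or prompt mismatch rather than a sampling issue.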

### push_to_hub Fails
- Verify HF token has write access
- Check Colab secrets are set correctly

## Connecting the Trained Agent

After deploying to HuggingFace Inference Endpoints, connect from TypeScript:

```typescript
import { InferenceClient } from '@huggingface/inference'
import { useWorldAgent, createCoreTools } from 'plaited/agent'

const client = new InferenceClient(process.env.HF_TOKEN)

const trigger = await useWorldAgent({
  tools: createCoreTools({ outputDir: './generated' }),
  
  // Optional: Discover and register skill scripts as callable tools
  skills: {
    skillsRoot: '.claude/skills',  // Scans for scripts/ directories
    timeout: 30000,                 // Script execution timeout
  },
  
  // Optional: Custom system prompt (skill context is auto-appended)
  systemPrompt: 'You are a Plaited UI generation agent.',
  
  model: {
    chatCompletion: async (args) => {
      const response = await client.chatCompletion({
        model: 'your-username/plaited-world-agent',
        endpointUrl: 'https://xxx.endpoints.huggingface.cloud',
        messages: args.messages,
        tools: args.tools?.map(t => ({ type: 'function', function: t }))
      })
      return {
        tool_calls: response.choices[0]?.message?.tool_calls?.map(tc => ({
          name: tc.function.name,
          arguments: tc.function.arguments
        }))
      }
    }
  }
})

// Generate UI from intent
trigger({
  type: 'generate',
  detail: { intent: 'Create a primary button with hover state' }
})
```