# Fine-tuning GUI-Actor 3B on FiftyOne App Dataset
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/harpreetsahota204/visual_agents_workshop/blob/main/session_3/Fine_tuning_GUI_Actor_3B_on_FiftyOne_App_Dataset_.ipynb)


## 📋 Prerequisites:
- **GPU Runtime**: Select GPU in `Runtime` → `Change runtime type`
- **Hugging Face Account**: For accessing models and datasets

# 📦 Installation & Setup

First, let's install all the required packages:


In [None]:
!pip install -q flash-attn --no-build-isolation

In [None]:
!pip install -e git+https://github.com/harpreetsahota204/GUI-Actor-for-FiftyOne.git#egg=gui_actor

In [None]:
import torch
import transformers
import os
import fiftyone as fo
import fiftyone.utils.huggingface as fouh

## Load an SFT Dataset


In [None]:
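import fiftyone as fo

from fiftyone.utils.huggingface import load_from_hub

# Download the dataset from the Hugging Face Hub, replacing any
# existing local dataset with the same name
dataset = load_from_hub(
    "harpreetsahota/FiftyOne-GUI-Grounding-Train-with-Synthetic",
    overwrite=True,
)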

In [None]:
dataset.first().keypoints.keypoints[0]

### Bridging Two Worlds: FiftyOne Annotations → GUI-Actor Conversations

A fundamental challenge is that FiftyOne organizes data around visual annotations: keypoints mark where to click, and bounding boxes show regions to select.

GUI-Actor, however, expects conversational training data in which each interaction is a dialogue between a user and an assistant.

This means we must create a transformation pipeline that takes a single screenshot with multiple annotations and converts it into multiple training conversations, each with a system prompt, a user instruction, and an assistant response that includes the ground-truth coordinates.

In [None]:
import json

KP_SYSTEM_MESSAGE = """You are a GUI Agent specialized in interacting with the FiftyOne application. Given a screenshot of the current FiftyOne GUI and a human instruction, your task is to locate the screen element that corresponds to the instruction.

You should output a response that:
1. Describes the action you will take in natural language
2. Provides the exact coordinates where you will interact
3. Includes a valid JSON object with the action details, element information, interaction points, and any relevant metadata

The JSON should contain fields for action, element_info, points (as coordinate arrays), and custom_metadata."""

BB_SYSTEM_MESSAGE = """You are a GUI Agent specialized in interacting with the FiftyOne application. Given a screenshot of the current FiftyOne GUI and a human instruction, your task is to locate the screen element that corresponds to the instruction.

You should output a response that:
1. Describes the action you will take in natural language
2. Provides the bounding box coordinates for the interaction region
3. Includes a valid JSON object with the action details, element information, bounding box coordinates, and any relevant metadata

The JSON should contain fields for action, element_info, bounding_box (as [x_min, y_min, x_max, y_max]), and custom_metadata."""

def add_message_payload_to_dataset(dataset):
    """
    Add message payloads to FiftyOne dataset annotations for GUI-Actor training.
    
    This function processes both keypoint (single-point) and detection (bounding box)
    annotations, converting them into conversation format for vision-language model training.
    Each annotation gets a complete conversation with system prompt, user query, and 
    assistant response containing both natural language and structured JSON.
    
    Args:
        dataset: FiftyOne dataset with 'keypoints' and/or 'detections' fields
        
    Modifies:
        Adds 'message_payload' field to each annotation containing the formatted
        conversation for training
    """
    
    for sample in dataset.iter_samples(autosave=True, progress=True):
        filepath = sample["filepath"]
        
        # Process keypoints (point-based interactions like clicks)
        if sample.keypoints:
            for kp in sample.keypoints.keypoints:
                # Extract keypoint attributes with safe defaults
                task_desc = getattr(kp, 'task_description', '')
                element_info = getattr(kp, 'element_info', None)
                action = getattr(kp, 'label', 'click')  # Default action is click
                points = getattr(kp, 'points', [])
                custom_metadata = getattr(kp, 'custom_metadata', None)
                
                # Ensure element_info is a non-empty dict or string
                # (the first check already covers None, empty dicts, and empty strings)
                if element_info is None or element_info == {} or element_info == '':
                    element_info = "ui_element"  # Simple string fallback
                elif not isinstance(element_info, (dict, str)):
                    element_info = str(element_info)  # Convert unexpected types to string
                
                # Ensure custom_metadata is a proper dict
                if custom_metadata is None or custom_metadata == {} or custom_metadata == '':
                    custom_metadata = {"type": "point_interaction"}
                elif not isinstance(custom_metadata, dict):
                    custom_metadata = {"value": str(custom_metadata)}
                
                # Only process if we have at least one point
                if points:
                    x, y = points[0]
                    
                    # Build the JSON response object
                    json_response = {
                        "action": action,
                        "element_info": element_info,
                        "points": [[round(x, 4), round(y, 4)]],  # Round for cleaner output
                        "custom_metadata": custom_metadata
                    }
                    
                    # Create natural language response with embedded JSON
                    # Note: coordinates in text match those used for pointer tokens
                    response_text = f"""I will {action} the {element_info if isinstance(element_info, str) else 'element'}. ```json {json.dumps(json_response)}```"""
                    
                    # Create the full conversation format
                    messages = [
                        {
                            "role": "system",
                            "content": [
                                {"type": "text", "text": KP_SYSTEM_MESSAGE}
                            ]
                        },
                        {
                            "role": "user",
                            "content": [
                                {"type": "image", "image": filepath},
                                {"type": "text", "text": task_desc}
                            ]
                        },
                        {
                            "role": "assistant",
                            "content": [
                                {"type": "text", "text": response_text}
                            ],
                            "recipient": "os",
                            "end_turn": True,
                            "point_gt": points[0] if points else None  # Ground truth for pointer loss
                        }
                    ]
                    
                    kp.message_payload = messages
        
        # Process detections (bounding box interactions like drag, select)
        if sample.detections:
            for det in sample.detections.detections:
                # Extract detection attributes with safe defaults
                task_desc = getattr(det, 'task_description', '')
                element_info = getattr(det, 'element_info', None)
                action = getattr(det, 'label', 'select')  # Default action is select
                bounding_box = getattr(det, 'bounding_box', [])
                custom_metadata = getattr(det, 'custom_metadata', None)
                
                # Ensure element_info is a non-empty dict or string
                # (the first check already covers None, empty dicts, and empty strings)
                if element_info is None or element_info == {} or element_info == '':
                    element_info = "ui_region"  # Simple string fallback
                elif not isinstance(element_info, (dict, str)):
                    element_info = str(element_info)  # Convert unexpected types to string
                
                # Ensure custom_metadata is a proper dict
                if custom_metadata is None or custom_metadata == {} or custom_metadata == '':
                    custom_metadata = {"type": "bbox_interaction"}
                elif not isinstance(custom_metadata, dict):
                    custom_metadata = {"value": str(custom_metadata)}
                
                # Only process if we have valid bounding box
                if bounding_box and len(bounding_box) == 4:
                    # FiftyOne format: [x, y, width, height] in relative coords [0,1]
                    x, y, width, height = bounding_box
                    
                    # Convert to [x_min, y_min, x_max, y_max] for consistency
                    x_min = round(x, 4)
                    y_min = round(y, 4)
                    x_max = round(x + width, 4)
                    y_max = round(y + height, 4)
                    bbox_gt_format = [x_min, y_min, x_max, y_max]
                    
                    # Build the JSON response object
                    json_response = {
                        "action": action,
                        "element_info": element_info,
                        "bounding_box": bbox_gt_format,
                        "custom_metadata": custom_metadata
                    }
                    
                    # Create natural language response with embedded JSON
                    # Note: the bounding box in the text matches the bbox_gt stored below
                    response_text = f"""I will {action} the {element_info if isinstance(element_info, str) else 'element'}. ```json {json.dumps(json_response)}```"""
                    
                    # Create the full conversation format
                    messages = [
                        {
                            "role": "system",
                            "content": [
                                {"type": "text", "text": BB_SYSTEM_MESSAGE}
                            ]
                        },
                        {
                            "role": "user",
                            "content": [
                                {"type": "image", "image": filepath},
                                {"type": "text", "text": task_desc}
                            ]
                        },
                        {
                            "role": "assistant",
                            "content": [
                                {"type": "text", "text": response_text}
                            ],
                            "recipient": "os",
                            "end_turn": True,
                            "bbox_gt": bbox_gt_format  # Ground truth for pointer loss
                        }
                    ]
                    
                    det.message_payload = messages

In [None]:
add_message_payload_to_dataset(dataset)
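
The `[x, y, width, height]` to `[x_min, y_min, x_max, y_max]` conversion used for detections can be checked in isolation. The helper below is a minimal sketch of that step; the name `xywh_to_xyxy` and the example box are introduced here for illustration and are not part of FiftyOne or GUI-Actor.

```python
def xywh_to_xyxy(bounding_box, ndigits=4):
    """Convert a relative [x, y, width, height] box to [x_min, y_min, x_max, y_max]."""
    x, y, width, height = bounding_box
    return [
        round(x, ndigits),           # left edge
        round(y, ndigits),           # top edge
        round(x + width, ndigits),   # right edge
        round(y + height, ndigits),  # bottom edge
    ]


# Made-up box: top-left corner at (0.25, 0.5), 0.125 wide and 0.25 tall
converted_box = xywh_to_xyxy([0.25, 0.5, 0.125, 0.25])
print(converted_box)
```

Since the coordinates are relative, every value in the result should stay within `[0, 1]` for a box that lies fully inside the screenshot.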

In [None]:
dataset.first().keypoints.keypoints[0]
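
Because each assistant response embeds its JSON object inside a fenced `json` block, it can be worth confirming that the payload parses back cleanly. The following is a minimal sketch, assuming the response text follows the format built in `add_message_payload_to_dataset`; `extract_json_payload` is a hypothetical helper and the example values are made up.

```python
import json
import re

# Build the triple-backtick fence programmatically so it can be matched literally
FENCE = "`" * 3
JSON_BLOCK = re.compile(
    re.escape(FENCE) + r"json\s*(\{.*\})\s*" + re.escape(FENCE),
    flags=re.DOTALL,
)


def extract_json_payload(response_text):
    """Return the JSON object embedded in a response, or None if there is none."""
    match = JSON_BLOCK.search(response_text)
    if match is None:
        return None
    return json.loads(match.group(1))


# Made-up response in the same shape the keypoint branch produces
example_payload = {
    "action": "click",
    "element_info": "ui_element",
    "points": [[0.1234, 0.5678]],
    "custom_metadata": {"type": "point_interaction"},
}
example_text = (
    f"I will click the ui_element. {FENCE}json {json.dumps(example_payload)}{FENCE}"
)

parsed = extract_json_payload(example_text)
print(parsed["action"], parsed["points"])
```

In the notebook, the same helper could be applied to the text of the last message in a `message_payload`, for example `kp.message_payload[-1]["content"][0]["text"]`.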

# Split the dataset into training and validation sets

# Create PyTorch Datasets from the FiftyOne Dataset


A FiftyOne sample can carry many annotations, while a PyTorch training loop expects one training example per index, so the two don't naturally align.

The solution is a flattened dataset structure, which involves:

- making each annotation an independent sample
- implementing proper worker initialization for FiftyOne's multiprocessing
- preserving file paths through the transformation pipeline for image loading
- handling keypoint and detection annotations with their different processing logic
- carrying annotation metadata through to the training loop

This lets you keep FiftyOne's dataset management while training GUI-Actor models.

In [None]:
from fiftyone.utils.torch import GetItem

class DataGetter(GetItem):
    @property
    def required_keys(self):
        return ['filepath', 'keypoints', 'detections']

    def __call__(self, d):
        message_payloads = []

        # Extract message_payload from all keypoints in the sample
        keypoints = d.get("keypoints")
        if keypoints is not None and hasattr(keypoints, 'keypoints'):
            for keypoint in keypoints.keypoints:
                if hasattr(keypoint, 'message_payload') and keypoint.message_payload is not None:
                    message_payloads.append(keypoint.message_payload)

        # Extract message_payload from all detections in the sample
        detections = d.get("detections")
        if detections is not None and hasattr(detections, 'detections'):
            for detection in detections.detections:
                if hasattr(detection, 'message_payload') and detection.message_payload is not None:
                    message_payloads.append(detection.message_payload)

        return {
            "filepath": d.get("filepath", ""),
            "message_payload": message_payloads,
        }


class FlattenedDataset:
    """
    Flattens a FiftyOne torch dataset so each item is a single message_payload
    with its associated filepath.
    """
    def __init__(self, fiftyone_torch_dataset):
        self.items = []
        for sample in fiftyone_torch_dataset:
            filepath = sample["filepath"]
            for message_payload in sample["message_payload"]:
                if message_payload:  # Only add non-empty payloads
                    self.items.append({
                        "filepath": filepath,
                        "message_payload": message_payload
                    })
        print(f"FlattenedDataset created with {len(self.items)} items")

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        item = self.items[idx]
        return item


In [None]:
train_view = dataset.match_tags("train")

val_view = dataset.match_tags("val")

# Create torch datasets using DataGetter
train_torch_dataset = train_view.to_torch(DataGetter())
val_torch_dataset = val_view.to_torch(DataGetter())


train_dataset = FlattenedDataset(train_torch_dataset)
val_dataset = FlattenedDataset(val_torch_dataset)
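
Before training, a quick sanity check on the flattened items can confirm that both annotation types survived the transformation. The sketch below assumes each item has the shape produced above (a `message_payload` whose last message carries either `point_gt` or `bbox_gt`); `count_annotation_types` and the mock items are hypothetical and only illustrate the idea.

```python
from collections import Counter


def count_annotation_types(items):
    """Count flattened items by the ground-truth key on their assistant turn."""
    counts = Counter()
    for item in items:
        assistant_turn = item["message_payload"][-1]
        if "point_gt" in assistant_turn:
            counts["keypoint"] += 1
        elif "bbox_gt" in assistant_turn:
            counts["detection"] += 1
        else:
            counts["unknown"] += 1
    return counts


# Mock items standing in for entries of a flattened dataset
mock_items = [
    {"filepath": "a.png", "message_payload": [{"role": "assistant", "point_gt": [0.1, 0.2]}]},
    {"filepath": "a.png", "message_payload": [{"role": "assistant", "bbox_gt": [0.1, 0.2, 0.3, 0.4]}]},
    {"filepath": "b.png", "message_payload": [{"role": "assistant", "point_gt": [0.5, 0.5]}]},
]

type_counts = count_annotation_types(mock_items)
print(dict(type_counts))
```

In the notebook, `count_annotation_types(train_dataset.items)` would give the same breakdown for the real training split.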

## Behind the Scenes: Collation

The collator is the heart of the system: it orchestrates several processing stages for each batch.

- extracts ground-truth coordinates from the message payloads
- processes images through the vision encoder to get patch dimensions
- reformats text to replace coordinates with special pointer tokens
- calculates visual token indices for supervision
- generates patch labels for region-based interactions
- handles padding and batching while preserving all of this structured information

Each stage handles edge cases and validates its inputs to keep training stable.
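
To make the index and patch-label stages more concrete, here is a minimal sketch of how a normalized point or box could be mapped onto a grid of image patches. It is an illustration under stated assumptions, not GUI-Actor's actual collator: the function names, the row-major patch ordering, and the "any overlap counts" rule are all assumptions made for this example.

```python
def point_to_patch_index(point, n_cols, n_rows):
    """Map a normalized (x, y) point to a row-major patch index."""
    x, y = point
    # Clamp so that x == 1.0 or y == 1.0 falls in the last column or row
    col = min(int(x * n_cols), n_cols - 1)
    row = min(int(y * n_rows), n_rows - 1)
    return row * n_cols + col


def bbox_to_patch_labels(bbox, n_cols, n_rows):
    """Return a flat 0/1 list marking patches that overlap [x_min, y_min, x_max, y_max]."""
    x_min, y_min, x_max, y_max = bbox
    labels = []
    for row in range(n_rows):
        for col in range(n_cols):
            # Extent of this patch in normalized coordinates
            left, right = col / n_cols, (col + 1) / n_cols
            top, bottom = row / n_rows, (row + 1) / n_rows
            overlaps = left < x_max and right > x_min and top < y_max and bottom > y_min
            labels.append(1 if overlaps else 0)
    return labels


# Toy 4x4 patch grid
center_index = point_to_patch_index((0.5, 0.5), n_cols=4, n_rows=4)
corner_labels = bbox_to_patch_labels([0.0, 0.0, 0.5, 0.5], n_cols=4, n_rows=4)
print(center_index, corner_labels)
```

In a real collator the grid size would come from the image processor's output for each screenshot rather than being fixed.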

## GUI-Actor Fine-Tuning Recipe

The fine-tuning strategy for GUI-Actor on FiftyOne datasets has two distinct modes:

### 1. **General Fine-tuning** (for multi-application scenarios)

```python
# Optimized for generalization across multiple applications
num_train_epochs=3
learning_rate=2e-5
gradient_accumulation_steps=4
warmup_ratio=0.1
weight_decay=0.01
unfreeze_all_parameters=False
```

### 2. **Single-Application Mode** (for specialized use cases)

The implementation provides flexible parameter unfreezing strategies optimized for different training objectives.

```python
# Aggressive specialization for one target application
single_app_mode=True  # Auto-configures aggressive settings
unfreeze_vision_layers=True  # (auto-enabled)
unfreeze_last_n_layers=8  # (auto-increased from 4)
# Allows intentional overfitting for maximum specialization
```

## 🧠 Parameter Unfreezing Strategy

### Layer-wise Unfreezing Hierarchy:

| Component | Default Mode | Single-App Mode | Rationale |
|-----------|--------------|-----------------|-----------|
| **Vision Layers** | ❌ Frozen | ✅ Last 25% unfrozen | Single-app needs visual specialization |
| **Language Layers** | ✅ Last 4 unfrozen | ✅ Last 8+ unfrozen | Task-specific reasoning adaptation |
| **LM Head** | ✅ Always unfrozen | ✅ Always unfrozen | Output vocabulary adaptation |
| **Pointer Head** | ✅ Always unfrozen | ✅ Always unfrozen | Coordinate prediction adaptation |
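
The table can be read as a rule that decides, per parameter, whether it receives gradients. The sketch below expresses that rule over parameter names only. The prefixes `model.layers.`, `visual.blocks.`, `lm_head.`, and `pointer_head.` are hypothetical naming assumptions, as are the function name and the toy layer counts; the real training code may select parameters differently.

```python
import re


def is_trainable(name, num_lm_layers, unfreeze_last_n=4,
                 num_vision_layers=0, vision_fraction=0.0):
    """Decide whether a parameter should be unfrozen, based on its name."""
    # LM head and pointer head are always trainable
    if name.startswith(("lm_head.", "pointer_head.")):
        return True
    # Language layers: unfreeze only the last `unfreeze_last_n` blocks
    lm_match = re.match(r"model\.layers\.(\d+)\.", name)
    if lm_match:
        return int(lm_match.group(1)) >= num_lm_layers - unfreeze_last_n
    # Vision layers: unfreeze the last `vision_fraction` of blocks (0.0 keeps all frozen)
    vision_match = re.match(r"visual\.blocks\.(\d+)\.", name)
    if vision_match:
        n_unfrozen = int(num_vision_layers * vision_fraction)
        return int(vision_match.group(1)) >= num_vision_layers - n_unfrozen
    # Everything else (for example embeddings) stays frozen
    return False


# Toy model with 12 language layers and 8 vision blocks
default_mode = dict(num_lm_layers=12, unfreeze_last_n=4, num_vision_layers=8)
single_app = dict(num_lm_layers=12, unfreeze_last_n=8,
                  num_vision_layers=8, vision_fraction=0.25)

print(is_trainable("model.layers.8.mlp.weight", **default_mode))
print(is_trainable("visual.blocks.6.attn.weight", **single_app))
```

With a PyTorch model, the same predicate could be applied in a loop over `model.named_parameters()` by setting each parameter's `requires_grad` attribute.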

## 📊 Training Hyperparameters

### Learning Rate Strategy

- **Base LR**: `2e-5` (higher than original `5e-6` for better fine-tuning)
- **Warmup**: `10%` of total steps (vs original `3%`)
- **Scheduler**: Cosine decay for smooth convergence
- **Weight Decay**: `0.01` for regularization (vs original `0.0`)
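
To see how warmup and cosine decay interact, the sketch below computes the learning rate at a given step using one common formulation: linear warmup from zero, followed by cosine decay to zero. The function name and the 1000-step example are assumptions for illustration, and the scheduler used by the training code may differ in detail.

```python
import math


def lr_at_step(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Learning rate with linear warmup followed by cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, in [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Sample the schedule at a few points of a hypothetical 1000-step run
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total_steps=1000))
```

The peak value is reached at the end of warmup (step 100 in this example) and the rate falls to half of the base value midway through the decay phase.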

### Batch Configuration
- **Train Batch Size**: `1` per device (memory optimized)
- **Eval Batch Size**: `4` per device (faster evaluation)
- **Gradient Accumulation**: `4` steps (effective batch size = 4)
- **Workers**: `4` (reduced from `8` for stability)
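
The effective batch size follows from multiplying the per-device batch size by the number of gradient accumulation steps (and by the number of devices). The sketch below works through that arithmetic and derives an approximate optimizer step count; the function name and the 1000-example dataset size are assumptions, and a real trainer may round slightly differently.

```python
import math


def training_steps(num_examples, per_device_batch_size=1, grad_accum_steps=4,
                   num_devices=1, num_epochs=3, warmup_ratio=0.1):
    """Approximate the optimizer step counts implied by the batch configuration."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * num_epochs
    warmup_steps = round(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# Hypothetical dataset of 1000 flattened training examples on a single GPU
plan = training_steps(num_examples=1000)
print(plan)
```

With these assumed numbers, a save frequency of 500 steps would produce one intermediate checkpoint before the run ends at step 750.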

### Training Duration
- **Epochs**: `3` (vs original `1`) for better convergence
- **Save Frequency**: Every `500` steps (vs original `2000`)
- **Evaluation**: Enabled with validation set monitoring

## Usage Recommendations

### For General Fine-tuning (Multi-application)
```bash
python train.py --dataset_name your_dataset
# Uses conservative unfreezing, good generalization
```

### For Single-Application Specialization
```bash
python train.py --dataset_name your_app_data --single_app_mode
# Aggressive unfreezing, maximum specialization
```

### For Conservative Fine-tuning (Small datasets)
```bash
python train.py --dataset_name small_dataset --unfreeze_last_n_layers 2
# Minimal unfreezing, prevents overfitting
```


In [None]:
from gui_actor.train import train_gui_actor_on_fiftyone

# Train the model
model, processor = train_gui_actor_on_fiftyone(
    train_dataset=train_dataset,
    val_dataset=val_dataset,
    num_train_epochs=1,
    single_app_mode=True
)

In [None]:
# Push the model and processor to the HF Hub (replace with your repo name)
model.push_to_hub("gui-actor-fiftyone-finetuned-on-synthetic", private=True)
processor.push_to_hub("gui-actor-fiftyone-finetuned-on-synthetic", private=True)