
## 📋 Prerequisites:
- **GPU Runtime**: Select GPU in `Runtime` → `Change runtime type`
- **Hugging Face Account**: For accessing models and datasets

# 📦 Installation & Setup

First, let's install all the required packages:


In [None]:
!pip install -q flash-attn --no-build-isolation

In [None]:
!pip install -e git+https://github.com/harpreetsahota204/GUI-Actor-for-FiftyOne.git#egg=gui_actor

In [None]:
import torch
import transformers
import os
import fiftyone as fo
import fiftyone.utils.huggingface as fouh

## Load an SFT Dataset


In [1]:
import fiftyone as fo

from fiftyone.utils.huggingface import load_from_hub

dataset = load_from_hub(
    "harpreetsahota/FiftyOne-GUI-Grounding-Train",
    overwrite=True
    )

Downloading config file fiftyone.yml from harpreetsahota/FiftyOne-GUI-Grounding-Train
Loading dataset
Importing samples...
 100% |█████████████████| 739/739 [36.6ms elapsed, 0s remaining, 20.2K samples/s]


In [5]:
dataset.first().keypoints.keypoints[0]

<Keypoint: {
    'id': '68b1ad2fcd37ccc965b1dcf1',
    'attributes': {},
    'tags': [],
    'label': 'click',
    'points': [[0.5, 0.28627170617420067]],
    'confidence': None,
    'index': None,
    'visible': [2],
    'coco_id': 1,
    'supercategory': 'interaction',
    'iscrowd': 0,
    'task_description': 'Open color settings',
    'action_type': 'click',
    'element_info': 'Icon',
    'custom_metadata': {},
}>

In [2]:
KP_SYSTEM_MESSAGE = """You are a GUI Agent specialized in interacting with the FiftyOne application. Given a screenshot of the current FiftyOne GUI and a human instruction, your task is to locate the screen element that corresponds to the instruction.

You should output a response indicating the element type to interact with, action to be taken on that element, correct position of the action, and any additional metadata.

Your response must be valid JSON wrapped in exactly this format:

```json
{"action": "{action}", "element_info": "{element_info}", "points": {points}, "custom_metadata": {custom_metadata}}
```"""

BB_SYSTEM_MESSAGE = """You are a GUI Agent specialized in interacting with the FiftyOne application. Given a screenshot of the current FiftyOne GUI and a human instruction, your task is to locate the screen element that corresponds to the instruction.

You should output a response indicating the element type to interact with, action to be taken on that element, correct position of the action, and any additional metadata.

Your response must be valid JSON wrapped in exactly this format:

```json
{"action": "{action}", "element_info": "{element_info}", "bounding_box": {bounding_box}, "custom_metadata": {custom_metadata}}
```"""


def add_message_payload_to_dataset(dataset):
    """
    Add message payload to all keypoints and detections, where the label
    represents the ACTION to be performed (click, type, select, etc.).

    Converts bounding boxes from [x, y, width, height] to [x_min, y_min, x_max, y_max] for bbox_gt.
    """

    for sample in dataset.iter_samples(autosave=True, progress=True):
        filepath = sample["filepath"]

        # Process keypoints (point-based interactions)
        if sample.keypoints:
            for kp in sample.keypoints.keypoints:
                # Extract keypoint attributes
                task_desc = getattr(kp, 'task_description', '')
                element_info = getattr(kp, 'element_info', '')
                action = getattr(kp, 'label', '')
                points = getattr(kp, 'points', [])
                custom_metadata = getattr(kp, 'custom_metadata', {})

                # Extract coordinates for pointer loss
                if points:
                    x, y = points[0]

                    # Create response with action and location; element_info is
                    # quoted so the embedded payload parses as JSON
                    response_text = f"""I can {action} the {element_info} at x={x}, y={y} to complete this task. Here is the valid JSON response: ```json {{"action": "{action}", "element_info": "{element_info}", "points": {points}, "custom_metadata": {custom_metadata}}}```"""

                    # Create message payload in the correct format
                    messages = [
                        {
                            "role": "system",
                            "content": [
                                {"type": "text", "text": KP_SYSTEM_MESSAGE}
                            ]
                        },
                        {
                            "role": "user",
                            "content": [
                                {"type": "image", "image": filepath},
                                {"type": "text", "text": task_desc}
                            ]
                        },
                        {
                            "role": "assistant",
                            "content": [
                                {"type": "text", "text": response_text}
                            ],
                            "recipient": "os",
                            "end_turn": True,
                            "point_gt": points[0] if points else None
                        }
                    ]

                    kp.message_payload = messages

        # Process detections (region-based interactions)
        if sample.detections:
            for det in sample.detections.detections:
                # Extract detection attributes
                task_desc = getattr(det, 'task_description', '')
                element_info = getattr(det, 'element_info', '')
                action = getattr(det, 'label', '')
                bounding_box = getattr(det, 'bounding_box', [])
                custom_metadata = getattr(det, 'custom_metadata', {})

                if bounding_box and len(bounding_box) == 4:
                    # Extract bounding box coordinates [x, y, width, height]
                    x, y, width, height = bounding_box

                    # Convert to [x_min, y_min, x_max, y_max] format for bbox_gt
                    x_min = x
                    y_min = y
                    x_max = x + width
                    y_max = y + height
                    bbox_gt_format = [x_min, y_min, x_max, y_max]

                    # Create response with action and bounding box information;
                    # element_info is quoted so the embedded payload parses as JSON
                    response_text = f"""I can {action} the {element_info} from_coord=[{x_min}, {y_min}] to_coord=[{x_max}, {y_max}] to complete this task. Here is the valid JSON response: ```json {{"action": "{action}", "element_info": "{element_info}", "bounding_box": {bbox_gt_format}, "custom_metadata": {custom_metadata}}}```"""

                    # Create message payload in the correct format
                    messages = [
                        {
                            "role": "system",
                            "content": [
                                {"type": "text", "text": BB_SYSTEM_MESSAGE}
                            ]
                        },
                        {
                            "role": "user",
                            "content": [
                                {"type": "image", "image": filepath},
                                {"type": "text", "text": task_desc}
                            ]
                        },
                        {
                            "role": "assistant",
                            "content": [
                                {"type": "text", "text": response_text}
                            ],
                            "recipient": "os",
                            "end_turn": True,
                            "bbox_gt": bbox_gt_format
                        }
                    ]

                    det.message_payload = messages
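
The two ideas in this cell, converting `[x, y, width, height]` boxes to corner format and embedding a JSON payload in the assistant reply, can be checked in isolation. The sketch below is illustrative only: `xywh_to_xyxy`, `build_response`, and `extract_json` are hypothetical helpers, not part of this notebook or of `gui_actor`, and `json.dumps` is swapped in for f-string interpolation so that the embedded payload always parses.

```python
import json
import re

FENCE = "`" * 3  # built at runtime so this example can live inside a fenced block


def xywh_to_xyxy(box):
    """Convert [x, y, width, height] to [x_min, y_min, x_max, y_max]."""
    x, y, width, height = box
    return [x, y, x + width, y + height]


def build_response(action, element_info, bbox_xyxy, custom_metadata=None):
    """Build an assistant reply whose fenced payload is valid JSON (hypothetical helper)."""
    payload = json.dumps({
        "action": action,
        "element_info": element_info,
        "bounding_box": bbox_xyxy,
        "custom_metadata": custom_metadata or {},
    })
    return (
        f"I can {action} the {element_info} to complete this task. "
        f"Here is the valid JSON response: {FENCE}json {payload}{FENCE}"
    )


def extract_json(response_text):
    """Pull the first fenced JSON payload back out of a reply."""
    match = re.search(FENCE + r"json\s*(.*?)" + FENCE, response_text, re.DOTALL)
    if match is None:
        raise ValueError("no fenced JSON payload found")
    return json.loads(match.group(1))


bbox = xywh_to_xyxy([0.25, 0.5, 0.125, 0.25])
reply = build_response("click", "Icon", bbox)
parsed = extract_json(reply)
print(parsed["bounding_box"])
```

Round-tripping a reply through `extract_json` is a cheap way to confirm that every training target contains a payload the model could plausibly be asked to reproduce.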

In [3]:
add_message_payload_to_dataset(dataset)

 100% |█████████████████| 739/739 [5.7s elapsed, 0s remaining, 198.9 samples/s]


In [8]:
dataset.first().keypoints.keypoints[0]

<Keypoint: {
    'id': '68b1ad2fcd37ccc965b1dcf1',
    'attributes': {},
    'tags': [],
    'label': 'click',
    'points': [[0.5, 0.28627170617420067]],
    'confidence': None,
    'index': None,
    'visible': [2],
    'coco_id': 1,
    'supercategory': 'interaction',
    'iscrowd': 0,
    'task_description': 'Open color settings',
    'action_type': 'click',
    'element_info': 'Icon',
    'custom_metadata': {},
    'message_payload': [
        {
            'role': 'system',
            'content': [
                {
                    'type': 'text',
                    'text': 'You are a GUI Agent specialized in interacting with the FiftyOne application. Given a screenshot of the current FiftyOne GUI and a human instruction, your task is to locate the screen element that corresponds to the instruction.\n\nYou should output a response indicating the element type to interact with, action to be taken on that element, correct position of the action, and any additional metadata.\n\n

# Split the dataset into training and validation sets

In [4]:
import fiftyone.utils.random as four

four.random_split(dataset, {"train": 0.8, "val": 0.2})

train_view = dataset.match_tags("train")

val_view = dataset.match_tags("val")

# Create PyTorch Datasets from the FiftyOne Dataset

In [5]:
from fiftyone.utils.torch import GetItem

class DataGetter(GetItem):
    @property
    def required_keys(self):
        return ['filepath', 'keypoints', 'detections']

    def __call__(self, d):
        message_payloads = []

        # Extract message_payload from all keypoints in the sample
        keypoints = d.get("keypoints")
        if keypoints is not None and hasattr(keypoints, 'keypoints'):
            for keypoint in keypoints.keypoints:
                if hasattr(keypoint, 'message_payload') and keypoint.message_payload is not None:
                    message_payloads.append(keypoint.message_payload)

        # Extract message_payload from all detections in the sample
        detections = d.get("detections")
        if detections is not None and hasattr(detections, 'detections'):
            for detection in detections.detections:
                if hasattr(detection, 'message_payload') and detection.message_payload is not None:
                    message_payloads.append(detection.message_payload)

        return {
            "filepath": d.get("filepath", ""),
            "message_payload": message_payloads,
        }


class FlattenedDataset:
    """
    Flattens a FiftyOne torch dataset so each item is a single message_payload
    with its associated filepath.
    """
    def __init__(self, fiftyone_torch_dataset):
        self.items = []
        for sample in fiftyone_torch_dataset:
            filepath = sample["filepath"]
            for message_payload in sample["message_payload"]:
                if message_payload:  # Only add non-empty payloads
                    self.items.append({
                        "filepath": filepath,
                        "message_payload": message_payload
                    })
        print(f"FlattenedDataset created with {len(self.items)} items")

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        item = self.items[idx]
        return item


In [6]:
# Create torch datasets using DataGetter
train_torch_dataset = train_view.to_torch(DataGetter())
val_torch_dataset = val_view.to_torch(DataGetter())


train_dataset = FlattenedDataset(train_torch_dataset)
val_dataset = FlattenedDataset(val_torch_dataset)

FlattenedDataset created with 2628 items
FlattenedDataset created with 939 items
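
The item counts exceed the 739 samples in the dataset because a single screenshot can carry several annotated interactions, and each one becomes its own training example. A toy sketch of that flattening step follows, using made-up file names and placeholder payloads in place of real `DataGetter` output; `flatten_payloads` is a hypothetical stand-in for the loop inside `FlattenedDataset.__init__`.

```python
def flatten_payloads(samples):
    """Turn per-sample payload lists into one item per payload."""
    items = []
    for sample in samples:
        for payload in sample["message_payload"]:
            if payload:  # skip empty payloads, as FlattenedDataset does
                items.append({
                    "filepath": sample["filepath"],
                    "message_payload": payload,
                })
    return items


# Toy stand-ins for what DataGetter returns; paths and payloads are made up
toy_samples = [
    {"filepath": "a.png", "message_payload": [["kp-1"], ["kp-2"], ["det-1"]]},
    {"filepath": "b.png", "message_payload": [["kp-3"], []]},
    {"filepath": "c.png", "message_payload": []},
]

items = flatten_payloads(toy_samples)
print(len(items))  # 4
```

Because the split was made on samples before flattening, all interactions from one screenshot land in the same split.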


# 🎯 GUI-Actor Fine-Tuning Recipe Report

## 📋 **Executive Summary**

The fine-tuning strategy for GUI-Actor on FiftyOne datasets has two distinct modes:

1. **General Fine-tuning** (for multi-application scenarios)


```python
# Optimized for generalization across multiple applications
num_train_epochs=3
learning_rate=2e-5
gradient_accumulation_steps=4
warmup_ratio=0.1
weight_decay=0.01
unfreeze_all_parameters=False
```

2. **Single-Application Mode** (for specialized use cases). Setting `single_app_mode=True` switches to a more aggressive unfreezing strategy than the default.

```python
# Aggressive specialization for one target application
single_app_mode=True  # Auto-configures aggressive settings
unfreeze_vision_layers=True  # Auto-enabled by single_app_mode
unfreeze_last_n_layers=8  # Auto-increased to at least 8
# Allows intentional overfitting for maximum specialization
```

## 🧠 **Parameter Unfreezing Strategy**

### **Layer-wise Unfreezing Hierarchy:**

| Component | Default Mode | Single-App Mode | Rationale |
|-----------|--------------|-----------------|-----------|
| **Vision Layers** | ❌ Frozen | ✅ Last 25% unfrozen | Single-app needs visual specialization |
| **Language Layers** | ✅ Last 4 unfrozen | ✅ Last 8+ unfrozen | Task-specific reasoning adaptation |
| **LM Head** | ✅ Always unfrozen | ✅ Always unfrozen | Output vocabulary adaptation |
| **Pointer Head** | ✅ Always unfrozen | ✅ Always unfrozen | Coordinate prediction adaptation |
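
The table can be read as a rule for picking layer indices. The sketch below is a hypothetical illustration rather than the `gui_actor` implementation: `last_n_indices` and `plan_unfreezing` are invented names, and the layer counts (36 language layers, 32 vision blocks) are assumptions inferred from the training log later in this notebook.

```python
def last_n_indices(total, n):
    """Indices of the last n layers out of total, clamped to the valid range."""
    n = max(0, min(n, total))
    return list(range(total - n, total))


def plan_unfreezing(num_language_layers, num_vision_blocks, single_app_mode=False):
    """Return which layer indices would be trainable under the table above."""
    language_n = 8 if single_app_mode else 4
    vision_n = num_vision_blocks // 4 if single_app_mode else 0  # last 25%
    return {
        "language_layers": last_n_indices(num_language_layers, language_n),
        "vision_blocks": last_n_indices(num_vision_blocks, vision_n),
        "lm_head": True,       # always trainable
        "pointer_head": True,  # always trainable
    }


# Layer counts are assumptions inferred from the training log (36 and 32)
plan = plan_unfreezing(36, 32, single_app_mode=True)
print(plan["language_layers"])  # [28, 29, 30, 31, 32, 33, 34, 35]
print(len(plan["vision_blocks"]))  # 8
```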

---

## 📊 **Training Hyperparameters**

### **Learning Rate Strategy**

- **Base LR**: `2e-5` (higher than original `5e-6` for better fine-tuning)
- **Warmup**: `10%` of total steps (vs original `3%`)
- **Scheduler**: Cosine decay for smooth convergence
- **Weight Decay**: `0.01` for regularization (vs original `0.0`)
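
To see how the warmup and cosine decay interact, here is a minimal pure-Python sketch of the learning-rate curve. It assumes a linear warmup and a decay to zero, which is how the common Hugging Face cosine scheduler behaves, but the source does not state either detail; `lr_at_step` is a hypothetical helper.

```python
import math


def lr_at_step(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Linear warmup to base_lr, then cosine decay to zero (assumed shape)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = 1000
for step in (0, 50, 100, 550, 1000):
    print(step, f"{lr_at_step(step, total):.2e}")
```

The rate peaks at the end of warmup and falls to half the base value midway through the decay phase.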

### **Batch Configuration**
- **Train Batch Size**: `1` per device (memory optimized)
- **Eval Batch Size**: `4` per device (faster evaluation)
- **Gradient Accumulation**: `4` steps (effective batch size = 4)
- **Workers**: `4` (reduced from `8` for stability)

### **Training Duration**
- **Epochs**: `3` (vs original `1`) for better convergence
- **Save Frequency**: Every `500` steps (vs original `2000`)
- **Evaluation**: Enabled with validation set monitoring
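
The batch and duration settings together determine how many optimizer steps a run takes. The arithmetic below is a rough sketch that assumes a single GPU and simple ceiling division; a real trainer may round slightly differently, and `training_plan` is a hypothetical helper.

```python
import math


def training_plan(num_examples, per_device_batch=1, grad_accum=4, num_devices=1,
                  epochs=3, warmup_ratio=0.1, save_steps=500):
    """Derive optimizer-step counts from the batch and duration settings."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": int(total_steps * warmup_ratio),
        "checkpoints": total_steps // save_steps,
    }


# 2628 is the flattened training-set size reported earlier in this notebook
plan = training_plan(num_examples=2628)
print(plan)
```

With these defaults the run would take 1,971 optimizer steps, of which the first 197 are warmup, and would write three intermediate checkpoints.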



## **Usage Recommendations**

### **For General Fine-tuning** (Multi-application)
```bash
python train.py --dataset_name your_dataset
# Uses conservative unfreezing, good generalization
```

### **For Single-Application Specialization**
```bash
python train.py --dataset_name your_app_data --single_app_mode
# Aggressive unfreezing, maximum specialization
```

### **For Conservative Fine-tuning** (Small datasets)
```bash
python train.py --dataset_name small_dataset --unfreeze_last_n_layers 2
# Minimal unfreezing, prevents overfitting
```




In [None]:
from gui_actor.train import train_gui_actor_on_fiftyone

# Train the model
model, processor = train_gui_actor_on_fiftyone(
    train_dataset=train_dataset,
    val_dataset=val_dataset,
    num_train_epochs=1,
    single_app_mode=True
)

Optimizing CUDA memory allocation strategy...
Memory optimization settings applied

🚀 Starting GUI-Actor training with memory-optimized settings
💾 Memory optimization techniques applied:
  - Chunked tensor operations in forward pass
  - Gradient checkpointing enabled
  - PyTorch memory allocation optimized with expandable_segments
  - Reduced sequence and image dimensions
  - Increased gradient accumulation steps
  - Using batch size: 1 with 16x accumulation


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
You have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.


🎯 SINGLE-APP MODE: Applying aggressive fine-tuning settings...
   - Vision layers will be unfrozen
   - Unfreezing 8 transformer layers
   - This will maximize adaptation to your specific application
Using selective unfreezing strategy for fine-tuning...
Unfreezing language modeling head...
Unfreezing last 8 transformer layers...
Unfreezing vision layers for single-application adaptation...
  - Unfreezing Qwen2.5-VL vision blocks...
    Unfroze last 8/32 vision blocks
✓ Gradient checkpointing enabled for memory efficiency

Trainable parameters: 1,084,853,696 (28.64%)
Total parameters: 3,787,653,120

GPU Memory Stats:
  GPU 0: NVIDIA A100-SXM4-40GB
    - Allocated: 7.06 GB
    - Reserved: 7.13 GB
    - Max Allocated: 7.06 GB

Trainable components:
- Language modeling head
- Transformer layers: 28, 29, 30, 31, 32, 33, 34, 35
[2025-08-29 14:52:17,185] [INFO] [real_accelerator.py:260:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-29 14:52:19,256] [INFO] [logging.py:





In [None]:
# Push the fine-tuned model and processor to the HF Hub (replace with your repo name)
model.push_to_hub("gui-actor-fiftyone-finetuned", private=True)
processor.push_to_hub("gui-actor-fiftyone-finetuned", private=True)