# ARPO UITARS 1.5 7B - OSWorld Inference

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/YOUR_USERNAME/YOUR_REPO/blob/main/ARPO_UITARS_Inference.ipynb)

This notebook demonstrates how to run inference with the ARPO-trained UITARS model on OSWorld tasks using **4-bit quantization** for memory efficiency.

**Model**: [Fanbin/ARPO_UITARS1.5_7B](https://huggingface.co/Fanbin/ARPO_UITARS1.5_7B)

**Performance**:
- OSWorld (128 Tasks): **83.9%**
- OSWorld Overall: **29.9%**

---

## üöÄ Quick Start

**This notebook works on:**
- ‚úÖ **Google Colab** (Free T4 GPU - recommended!)
- ‚úÖ **Local Jupyter** (8GB+ GPU)
- ‚úÖ **Kaggle Notebooks**
- ‚úÖ **Any Python environment** with GPU

**No OSWorld setup required!** This notebook uses **real desktop screenshots** downloaded from the web, so you can start testing immediately.

---

## üìù What You'll Learn

1. Load ARPO UITARS model with 4-bit quantization
2. Process desktop screenshots
3. Generate GUI actions (click, type, scroll, etc.)
4. Handle multi-turn conversations
5. Parse and execute actions

Let's get started! üëá

## 1. Install Required Dependencies

In [1]:
# Install required packages - using latest versions for Qwen2.5-VL support
%pip install -q --upgrade transformers accelerate
%pip install -q qwen-vl-utils pillow torch

[2K     [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m44.0/44.0 kB[0m [31m1.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m12.0/12.0 MB[0m [31m84.7 MB/s[0m eta [36m0:00:00[0m:00:01[0m:01[0m
[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m380.9/380.9 kB[0m [31m36.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m566.1/566.1 kB[0m [31m47.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m3.3/3.3 MB[0m [31m116.3 MB/s[0m eta 

## 2. Load Model with 4-bit Quantization

In [None]:
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
from qwen_vl_utils import process_vision_info
from PIL import Image
import io
import base64
import math
import warnings
warnings.filterwarnings('ignore')

# ===== HuggingFace Authentication =====
# Login to HuggingFace (required for gated models)
from huggingface_hub import login

# Option 1: Login with token
try:
    # Get your token from: https://huggingface.co/settings/tokens
    login(token="YOUR_HF_TOKEN")  # ‚Üê Replace with your token!
    print("‚úÖ Logged in to HuggingFace")
except Exception as e:
    print(f"‚ö†Ô∏è HuggingFace login failed: {e}")
    print("Proceeding anyway (model might be public)...")

# Model configuration
repo = "Fanbin/ARPO_UITARS1.5_7B"

print("\nü§ñ Loading ARPO UITARS model with 4-bit quantization...")
print("This will take 1-2 minutes...")

# Check if CUDA is available
if not torch.cuda.is_available():
    print("‚ö†Ô∏è WARNING: CUDA not available. This will be very slow!")
    print("On Colab: Runtime ‚Üí Change runtime type ‚Üí T4 GPU")
    use_quantization = False
else:
    print(f"‚úÖ CUDA available: {torch.cuda.get_device_name(0)}")
    use_quantization = True

# Load processor
print("üì¶ Loading processor...")
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
print("‚úÖ Processor loaded")

# Configure 4-bit quantization
if use_quantization:
    print("‚öôÔ∏è Configuring 4-bit quantization...")
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4"
    )
    
    # Load model with quantization
    print("üì• Loading model (this may take 1-2 minutes)...")
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        repo,
        quantization_config=quantization_config,
        device_map="auto",
        trust_remote_code=True,
        torch_dtype=torch.float16,
    )
else:
    # Load model without quantization (CPU fallback)
    print("üì• Loading model in FP16 (no quantization)...")
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        repo,
        device_map="auto",
        trust_remote_code=True,
        torch_dtype=torch.float16,
    )

print(f"\n‚úÖ Model loaded successfully!")
print(f"üìç Device: {model.device}")
print(f"üî¢ Dtype: {model.dtype}")

# Print memory usage
if torch.cuda.is_available():
    print(f"üíæ GPU Memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")

‚úÖ Logged in to HuggingFace

ü§ñ Loading ARPO UITARS model with 4-bit quantization...
This will take 1-2 minutes...
‚úÖ CUDA available: Tesla T4
üì¶ Loading processor...


preprocessor_config.json:   0%|          | 0.00/763 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/605 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/613 [00:00<?, ?B/s]

The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.


‚úÖ Processor loaded
‚öôÔ∏è Configuring 4-bit quantization...
üì• Loading model (this may take 1-2 minutes)...


config.json: 0.00B [00:00, ?B/s]

You are using a model of type qwen2_5_vl to instantiate a model of type qwen2_vl. This is not supported for all configurations of models and can yield errors.


False

The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/lib/python3.12/dist-packages/cv2/../../lib64')}
The following directories listed in your path were found to be non-existent: {PosixPath('/sys/fs/cgroup/memory.events /var/colab/cgroup/jupyter-children/memory.events')}
The following directories listed in your path were found to be non-existent: {PosixPath('https'), PosixPath('//mp.kaggle.net')}
The following directories listed in your path were found to be non-existent: {PosixPath('//172.28.0.1'), PosixPath('http'), PosixPath('8013')}
The following directories listed in your path were found to be non-existent: {PosixPath('--logtostderr --listen_host=172.28.0.12 --target_host=172.28.0.12 --tunnel_background_save_url=https'), PosixPath('//colab.research.google.com/tun/m/cc48301118ce562b961b3c22d803539adc1e0c19/gpu-t4-s-3p9206ekdjht9 --tunnel_background_save_delay=10s --tunnel_periodic_background_save_frequency=30m0s --enable_output_c

RuntimeError: Failed to import transformers.integrations.bitsandbytes because of the following error (look up to see its traceback):

        CUDA Setup failed despite GPU being available. Please run the following command to get more information:

        python -m bitsandbytes

        Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
        to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
        and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues

## 2.1 Troubleshooting Model Loading

If you encounter errors in the previous cell, try these solutions:

In [3]:
# ===== Troubleshooting Solutions =====

# Problem 1: "AttributeError: 'weight' is not an nn.Module"
# Solution: Restart runtime and reinstall packages
"""
# In Colab: Runtime ‚Üí Restart runtime
# Then run these:
%pip uninstall -y bitsandbytes
%pip install bitsandbytes==0.43.0
# Then re-run cells 1-4
"""

# Problem 2: Out of memory
# Solution: Try without quantization or reduce memory usage
"""
# Option A: Load without quantization (needs more memory)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Fanbin/ARPO_UITARS1.5_7B",
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # or torch.float16
)

# Option B: Enable CPU offload
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Fanbin/ARPO_UITARS1.5_7B",
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    offload_folder="offload",
    offload_state_dict=True,
)
"""

# Problem 3: CUDA not available
# Solution: Enable GPU in Colab
"""
# Runtime ‚Üí Change runtime type ‚Üí Hardware accelerator ‚Üí T4 GPU ‚Üí Save
# Then Runtime ‚Üí Restart runtime
# Re-run all cells
"""

# Problem 4: HuggingFace authentication
# Solution: Login manually
"""
from huggingface_hub import login
login()  # This will prompt for your token
# Or use: login(token="YOUR_TOKEN_HERE")
"""

print("üí° If you're still having issues:")
print("1. Restart runtime (Runtime ‚Üí Restart runtime)")
print("2. Clear outputs (Edit ‚Üí Clear all outputs)")
print("3. Run cells 1-4 again")
print("4. If still failing, try without quantization (see commented code above)")

üí° If you're still having issues:
1. Restart runtime (Runtime ‚Üí Restart runtime)
2. Clear outputs (Edit ‚Üí Clear all outputs)
3. Run cells 1-4 again
4. If still failing, try without quantization (see commented code above)


## üö® QUICK FIX: Load Model WITHOUT Quantization

**Run this cell if you got bitsandbytes CUDA errors above!**  
This skips quantization and loads the model in BFloat16 - works perfectly on Colab T4 GPU.

In [None]:
# ===== FIXED: Load Model WITHOUT Quantization =====
import torch
from transformers import AutoModel, AutoProcessor
from qwen_vl_utils import process_vision_info
from PIL import Image
import io, base64, math, warnings
warnings.filterwarnings('ignore')

# Login
from huggingface_hub import login
try:
    # Get your token from: https://huggingface.co/settings/tokens
    login(token="YOUR_HF_TOKEN")  # ‚Üê Replace with your token!
    print("‚úÖ Logged in\n")
except: pass

repo = "Fanbin/ARPO_UITARS1.5_7B"

print("ü§ñ Loading model WITHOUT quantization (BFloat16)...\n")

# Load processor
print("üì¶ Loading processor...")
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
print("‚úÖ Processor loaded\n")

# Load model with AutoModel + trust_remote_code
# This automatically loads the custom Qwen2.5-VL code from the repo
print("üì• Loading model (2-3 min)...")
model = AutoModel.from_pretrained(
    repo,
    device_map="auto",
    trust_remote_code=True,  # Loads custom model code
    torch_dtype=torch.bfloat16,
)

print(f"\n{'='*60}")
print("‚úÖ MODEL LOADED!")
print(f"{'='*60}")
print(f"Device: {model.device}")
print(f"Dtype: {model.dtype}")
if torch.cuda.is_available():
    print(f"GPU Memory: {torch.cuda.memory_allocated()/1024**3:.2f} GB")
print(f"{'='*60}\n")
print("‚úÖ Ready! Continue to next section for inference.")

‚úÖ Logged in

ü§ñ Loading model WITHOUT quantization (BFloat16)...

üì¶ Loading processor...


chat_template.json: 0.00B [00:00, ?B/s]

`torch_dtype` is deprecated! Use `dtype` instead!


‚úÖ Processor loaded

üì• Loading model (2-3 min)...


model.safetensors.index.json: 0.00B [00:00, ?B/s]

Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.93G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.96G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.70G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]




‚úÖ MODEL LOADED!
Device: cuda:0
Dtype: torch.bfloat16
GPU Memory: 12.69 GB

‚úÖ Ready! Continue to next section for inference.


## 3. Setup Image Processing and Action Space

In [3]:
# Image processing constants
IMAGE_FACTOR = 28
MIN_PIXELS = 100 * 28 * 28
MAX_PIXELS = 16384 * 28 * 28  # Max 16K tokens

def linear_resize(height: int, width: int, factor: int = IMAGE_FACTOR, 
                  min_pixels: int = MIN_PIXELS, max_pixels: int = MAX_PIXELS):
    """Resize image maintaining aspect ratio within pixel constraints."""
    if width * height > max_pixels:
        resize_factor = math.sqrt(max_pixels / (width * height))
        width, height = int(width * resize_factor), int(height * resize_factor)
    if width * height < min_pixels:
        resize_factor = math.sqrt(min_pixels / (width * height))
        width, height = math.ceil(width * resize_factor), math.ceil(height * resize_factor)
    return height, width

def preprocess_image(image):
    """Preprocess image for model input."""
    if isinstance(image, bytes):
        image = Image.open(io.BytesIO(image))
    elif isinstance(image, str):
        # Assume it's a base64 encoded string or file path
        if image.startswith('data:image'):
            # Base64 encoded
            image_data = base64.b64decode(image.split(',')[1])
            image = Image.open(io.BytesIO(image_data))
        else:
            # File path
            image = Image.open(image)
    
    # Resize if needed
    if image.width * image.height > MAX_PIXELS:
        resize_factor = math.sqrt(MAX_PIXELS / (image.width * image.height))
        new_width = int(image.width * resize_factor)
        new_height = int(image.height * resize_factor)
        image = image.resize((new_width, new_height))
    
    if image.width * image.height < MIN_PIXELS:
        resize_factor = math.sqrt(MIN_PIXELS / (image.width * image.height))
        new_width = math.ceil(image.width * resize_factor)
        new_height = math.ceil(image.height * resize_factor)
        image = image.resize((new_width, new_height))
    
    # Convert to RGB if needed
    if image.mode != "RGB":
        image = image.convert("RGB")
    
    return image

# UITARS Action Space
UITARS_ACTION_SPACE = """## Action Space
The actions you can perform fall into the following categories:

- **Mouse Click**: Perform click actions with a bounding box to specify the click target.
  - `click(start_box='(x, y)')`: Click at position (x, y)
  - `left_double(start_box='(x, y)')`: Double-click at position (x, y)
  - `right_single(start_box='(x, y)')`: Right-click at position (x, y)

- **Keyboard Input**: Type content or press hotkeys.
  - `type(content='text content here')`: Type the given text
  - `hotkey(key='key combination')`: Press hotkey (e.g., 'ctrl c', 'ctrl v')
  - `press(key='key_name')`: Press a single key

- **Scroll**: Scroll in a direction.
  - `scroll(start_box='(x, y)', direction='up/down')`: Scroll at position

- **Drag**: Drag from one position to another.
  - `drag(start_box='(x1, y1)', end_box='(x2, y2)')`: Drag from start to end

- **Task Control**:
  - `finished()`: Task is complete
  - `wait()`: Need to wait for something
  - `error_env()`: Environment error
"""

print("Image processing and action space configured!")

Image processing and action space configured!


In [4]:
def generate_action(instruction, screenshot, history_images=None, history_responses=None, 
                   max_new_tokens=4096, temperature=0.0, top_p=0.9):
    """
    Generate action prediction from screenshot and instruction.
    
    Args:
        instruction: Task instruction
        screenshot: Current screenshot (PIL Image, bytes, or base64 string)
        history_images: List of previous screenshots for multi-turn
        history_responses: List of previous model responses
        max_new_tokens: Maximum tokens to generate
        temperature: Sampling temperature
        top_p: Top-p sampling parameter
    
    Returns:
        Generated action string
    """
    # Process current screenshot
    current_image = preprocess_image(screenshot)
    
    # Build message history
    messages = [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful GUI agent assistant."}]
        }
    ]
    
    # Create user prompt
    user_prompt = f"""You are a GUI agent. Your task is to complete the following instruction by interacting with the computer screen.

**Instruction**: {instruction}

{UITARS_ACTION_SPACE}

**Format**: 
Thought: [Your reasoning about what to do next]
Action: [Your action following the action space format]

Please provide your response in English."""
    
    # Add history if available
    if history_images and history_responses:
        history_n = min(15, len(history_images))  # Keep last 15 images
        for i in range(len(history_responses)):
            if i >= len(history_responses) - history_n:
                hist_img = preprocess_image(history_images[i])
                messages.append({
                    "role": "user",
                    "content": [{"type": "image", "image": hist_img}]
                })
            messages.append({
                "role": "assistant",
                "content": [{"type": "text", "text": history_responses[i]}]
            })
    
    # Add user prompt and current image
    if not history_images or not history_responses:
        messages.append({
            "role": "user",
            "content": [{"type": "text", "text": user_prompt}]
        })
    
    messages.append({
        "role": "user",
        "content": [{"type": "image", "image": current_image}]
    })
    
    # Process messages
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    
    image_inputs, video_inputs = process_vision_info(messages)
    
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    
    inputs = inputs.to(model.device)
    
    # Generate
    with torch.no_grad():
        generated_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature if temperature > 0 else None,
            top_p=top_p if temperature > 0 else None,
            do_sample=temperature > 0,
        )
    
    # Trim generated tokens to remove input
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
    
    return output_text

print("Inference function ready!")

Inference function ready!


## 5. Example Usage: Simple Inference

In [None]:
# Example: Load a desktop screenshot and generate action

from PIL import Image, ImageDraw
import requests
from io import BytesIO

# Try to download a sample image, or create synthetic one
print("üì• Loading desktop screenshot...")

try:
    # Try a working direct image URL
    url = "https://raw.githubusercontent.com/ultralytics/yolov5/master/data/images/zidane.jpg"
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    screenshot = Image.open(BytesIO(response.content))
    screenshot = screenshot.resize((1920, 1080))
    print(f"‚úÖ Screenshot loaded: {screenshot.size}")
except Exception as e:
    print(f"‚ö†Ô∏è Download failed: {e}")
    print("üìù Creating synthetic desktop screenshot for demo...")
    # Create a desktop-like image
    screenshot = Image.new('RGB', (1920, 1080), color=(45, 45, 48))
    draw = ImageDraw.Draw(screenshot)
    # Taskbar
    draw.rectangle([(0, 1040), (1920, 1080)], fill=(30, 30, 30))
    # Window
    draw.rectangle([(100, 100), (800, 600)], fill=(255, 255, 255), outline=(150, 150, 150), width=2)
    # Dock icons
    for i in range(5):
        x = 50 + i * 80
        draw.rectangle([(x, 1045), (x+60, 1075)], fill=(100, 100, 200))
    print(f"‚úÖ Synthetic screenshot created: {screenshot.size}")

# You can also use your own screenshots:
# Option 1: Load from local file
# screenshot = Image.open("path/to/your/screenshot.png")

# Option 2: Upload file in Colab
# from google.colab import files
# uploaded = files.upload()
# screenshot = Image.open(list(uploaded.keys())[0])

# Define task instruction
instruction = "Click on the Firefox icon to open the browser"

# Generate action
print(f"\nüìã Instruction: {instruction}")
print("üîÆ Generating action...\n")

action = generate_action(
    instruction=instruction,
    screenshot=screenshot,
    temperature=0.0  # Greedy decoding for deterministic output
)

print(f"\n{'='*60}")
print("üéØ GENERATED ACTION:")
print(f"{'='*60}")
print(action)
print(f"{'='*60}")

üì• Downloading sample Ubuntu desktop screenshot...


UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x7faa42619580>

## 6. Alternative: Upload Your Own Screenshots

If you want to use your own desktop screenshots:

In [None]:
# ===== OPTION 1: Upload file (Colab/Jupyter) =====
# Uncomment to upload your own screenshot
"""
from google.colab import files
uploaded = files.upload()  # This will show upload button
screenshot = Image.open(list(uploaded.keys())[0])
instruction = "Your task here"
action = generate_action(instruction, screenshot)
print(action)
"""

# ===== OPTION 2: Load from local file path =====
# Uncomment if running locally with a file
"""
screenshot = Image.open('/path/to/your/screenshot.png')
instruction = "Your task here"
action = generate_action(instruction, screenshot)
print(action)
"""

# ===== OPTION 3: Use with OSWorld (if available) =====
# Only works if you have OSWorld environment set up locally
# NOT compatible with Colab due to Docker/VM requirements
"""
import sys
sys.path.append('ARPO/OSWorld')
from desktop_env.desktop_env import DesktopEnv
from io import BytesIO
import json

# Initialize OSWorld environment (requires Docker/VM)
env = DesktopEnv(
    action_space="pyautogui",
    screen_size=(1920, 1080),
    os_type="Ubuntu",
    provider_name="docker",
)

# Load a test task
with open('ARPO/OSWorld/evaluation_examples/test_subset32.json', 'r') as f:
    tasks = json.load(f)

domain = list(tasks.keys())[0]
example_id = tasks[domain][0]
with open(f'ARPO/OSWorld/evaluation_examples/examples/{domain}/{example_id}.json', 'r') as f:
    task = json.load(f)

# Get screenshot from environment
obs = env.reset(task)
screenshot = Image.open(BytesIO(obs['screenshot']))
instruction = task['instruction']

# Run inference
action = generate_action(instruction, screenshot)
print(action)
"""

print("üìù See commented code above for different ways to load screenshots")
print("‚úÖ For Colab: Use Option 1 (Upload) or use the sample Ubuntu desktop (Section 5)")
print("üñ•Ô∏è  For local with OSWorld: Use Option 3")


    def __init__(self, model, processor, max_history=15):
        self.model = model
        self.processor = processor
        self.max_history = max_history
        self.history_images = []
        self.history_responses = []
    
    def reset(self):
        """Reset history for new task."""
        self.history_images = []
        self.history_responses = []
    
    def predict(self, instruction, screenshot, temperature=0.0, top_p=0.9, max_new_tokens=4096):
        """Predict next action."""
        # Generate action
        action = generate_action(
            instruction=instruction,
            screenshot=screenshot,
            history_images=self.history_images,
            history_responses=self.history_responses,
            temperature=temperature,
            top_p=top_p,
            max_new_tokens=max_new_tokens
        )
        
        # Update history
        self.history_images.append(screenshot)
        self.history_responses.append(action)
        
        # Keep only last max_history items
        if len(self.history_images) > self.max_history:
            self.history_images = self.history_images[-self.max_history:]
            self.history_responses = self.history_responses[-self.max_history:]
        
        return action
    
    def is_finished(self, action):
        """Check if task is finished."""
        return 'finished()' in action.lower()

# Create agent
agent = UITARSInferenceAgent(model, processor)

print("Multi-turn agent ready!")
print("\nExample usage:")
print("agent.reset()  # Start new task")
print("action = agent.predict(instruction, screenshot)")
print("if agent.is_finished(action): # Task complete")

## 7. Multi-Turn Interaction Example

In [None]:
# Multi-turn interaction example
class UITARSInferenceAgent:
    def __init__(self, model, processor, max_history=15):
        self.model = model
        self.processor = processor
        self.max_history = max_history
        self.history_images = []
        self.history_responses = []
    
    def reset(self):
        """Reset history for new task."""
        self.history_images = []
        self.history_responses = []
    
    def predict(self, instruction, screenshot, temperature=0.0, top_p=0.9, max_new_tokens=4096):
        """Predict next action."""
        # Generate action
        action = generate_action(
            instruction=instruction,
            screenshot=screenshot,
            history_images=self.history_images,
            history_responses=self.history_responses,
            temperature=temperature,
            top_p=top_p,
            max_new_tokens=max_new_tokens
        )
        
        # Update history
        self.history_images.append(screenshot)
        self.history_responses.append(action)
        
        # Keep only last max_history items
        if len(self.history_images) > self.max_history:
            self.history_images = self.history_images[-self.max_history:]
            self.history_responses = self.history_responses[-self.max_history:]
        
        return action
    
    def is_finished(self, action):
        """Check if task is finished."""
        return 'finished()' in action.lower()

# Create agent
agent = UITARSInferenceAgent(model, processor)

print("Multi-turn agent ready!")
print("\nExample usage:")
print("agent.reset()  # Start new task")
print("action = agent.predict(instruction, screenshot)")
print("if agent.is_finished(action): # Task complete")

## 8. Action Parsing Utilities

These utilities help parse model outputs into executable actions:

In [6]:
import re
import ast

def parse_action(action_str):
    """Parse action string into structured format."""
    try:
        node = ast.parse(action_str.strip(), mode='eval')
        if not isinstance(node, ast.Expression):
            return None
        
        call = node.body
        if not isinstance(call, ast.Call):
            return None
        
        # Get function name
        if isinstance(call.func, ast.Name):
            func_name = call.func.id
        elif isinstance(call.func, ast.Attribute):
            func_name = call.func.attr
        else:
            return None
        
        # Get keyword arguments
        kwargs = {}
        for kw in call.keywords:
            key = kw.arg
            if isinstance(kw.value, ast.Constant):
                value = kw.value.value
            elif isinstance(kw.value, ast.Str):
                value = kw.value.s
            else:
                value = None
            kwargs[key] = value
        
        return {
            'function': func_name,
            'args': kwargs
        }
    except Exception as e:
        print(f"Failed to parse action '{action_str}': {e}")
        return None

def extract_thought_and_action(response):
    """Extract thought and action from model response."""
    thought = None
    action = None
    
    # Extract thought
    thought_match = re.search(r"Thought:\s*(.+?)(?=\s*Action:|$)", response, re.DOTALL)
    if thought_match:
        thought = thought_match.group(1).strip()
    
    # Extract action
    if "Action:" in response:
        action = response.split("Action:")[-1].strip()
    
    return thought, action

# Test parsing
test_response = """Thought: I need to click on the Firefox icon to open the browser.
Action: click(start_box='(100, 200)')"""

thought, action = extract_thought_and_action(test_response)
print(f"Thought: {thought}")
print(f"Action: {action}")

if action:
    parsed = parse_action(action)
    print(f"Parsed: {parsed}")

Thought: I need to click on the Firefox icon to open the browser.
Action: click(start_box='(100, 200)')
Parsed: {'function': 'click', 'args': {'start_box': '(100, 200)'}}


## 9. Integration with OSWorld Environment (Local Only)

**Note**: This section only works on local machines with Docker/VM, not on Colab.

To use this model with OSWorld environment locally:

In [7]:
# Example integration with OSWorld (requires OSWorld setup)
# Uncomment and modify based on your OSWorld setup

"""
import sys
sys.path.append('OSWorld')

from desktop_env.desktop_env import DesktopEnv
import json

# Initialize environment
env = DesktopEnv(
    path_to_vm=None,  # Set your VM path
    action_space="pyautogui",
    screen_size=(1920, 1080),
    headless=False,
    os_type="Ubuntu",
    provider_name="docker",
    require_a11y_tree=False,
)

# Load a task
with open('OSWorld/evaluation_examples/test_subset32.json', 'r') as f:
    test_tasks = json.load(f)

# Reset agent
agent.reset()

# Get first task
domain = list(test_tasks.keys())[0]
example_id = test_tasks[domain][0]
config_file = f'OSWorld/evaluation_examples/examples/{domain}/{example_id}.json'

with open(config_file, 'r') as f:
    example = json.load(f)

instruction = example['instruction']
print(f"Task: {instruction}")

# Reset environment
obs = env.reset(example)

# Run for max 15 steps
max_steps = 15
for step in range(max_steps):
    print(f"\nStep {step + 1}/{max_steps}")
    
    # Get screenshot
    screenshot = obs['screenshot']
    
    # Predict action
    action_text = agent.predict(instruction, screenshot)
    print(f"Action: {action_text}")
    
    # Check if finished
    if agent.is_finished(action_text):
        print("Task completed!")
        break
    
    # Parse and execute action (requires action parsing from OSWorld/mm_agents/uitars_agent.py)
    # obs, reward, done, info = env.step(parsed_action)

env.close()
"""

print("\nOSWorld integration example (commented out)")
print("Uncomment and modify based on your OSWorld setup.")


OSWorld integration example (commented out)
Uncomment and modify based on your OSWorld setup.


## 10. Memory and Performance Tips

In [None]:
import torch

def print_memory_usage():
    """Print current GPU memory usage."""
    if torch.cuda.is_available():
        print(f"GPU Memory Allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
        print(f"GPU Memory Reserved: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")
    else:
        print("CUDA not available")

print_memory_usage()

print("\n=== Memory Optimization Tips ===")
print("1. Using 4-bit quantization reduces memory by ~4x")
print("2. Reduce max_pixels to 8192*28*28 or 4096*28*28 for lower memory")
print("3. Reduce max_new_tokens to 2048 or 1024")
print("4. Use torch.cuda.empty_cache() between inferences if needed")
print("5. Limit history_n to 5-10 instead of 15 for less memory")

## 11. Save Configuration

In [None]:
# Configuration for reproducibility
config = {
    "model_name": "Fanbin/ARPO_UITARS1.5_7B",
    "quantization": "4-bit (nf4)",
    "compute_dtype": "float16",
    "max_pixels": MAX_PIXELS,
    "min_pixels": MIN_PIXELS,
    "max_history": 15,
    "max_new_tokens": 4096,
    "temperature": 0.0,
    "top_p": 0.9,
}

import json
print("Configuration:")
print(json.dumps(config, indent=2))

## Summary

This notebook demonstrates:
1. ‚úÖ Loading ARPO UITARS model with 4-bit quantization (works on Colab!)
2. ‚úÖ Preprocessing screenshots for optimal performance
3. ‚úÖ Single-turn and multi-turn inference
4. ‚úÖ Using real desktop screenshots (Ubuntu desktop sample)
5. ‚úÖ Multiple ways to load screenshots (upload, file, URL)
6. ‚úÖ Action parsing utilities
7. ‚úÖ OSWorld integration (for local machines)

**This notebook is ready to run on:**
- ‚úÖ Google Colab (Free T4 GPU)
- ‚úÖ Local Jupyter
- ‚úÖ Kaggle Notebooks
- ‚úÖ Any environment with 8GB+ GPU

**Next Steps:**
- Upload your own desktop screenshots to test
- Try different instructions
- Set up OSWorld environment for full evaluation
- Implement action execution pipeline
- Run evaluation on test tasks

**Citation:**
```bibtex
@article{lu2025arpo,
  title={ARPO: End-to-End Policy Optimization for GUI Agents with Experience Replay},
  author={Fanbin Lu and Zhisheng Zhong and Shu Liu and Chi-Wing Fu and Jiaya Jia},
  journal={arxiv},
  year={2025}
}
```

**Model Link:** [Fanbin/ARPO_UITARS1.5_7B](https://huggingface.co/Fanbin/ARPO_UITARS1.5_7B)