Fix Problems for Online-Mind2Web #207

Merged
Parth220 merged 6 commits into hud-evals:main from Genteki:Update--Online-Mind2Web
Nov 24, 2025

@Genteki Genteki commented Nov 22, 2025

This PR fixes the following problems:

  1. The environment display size did not match the remote browser display size. Fixed by adding a viewport setting to the request in anchorbrowser.py.
  2. Overrode the Playwright tools in src/hud_controller/tools/: the default wait_for_load_state is now "load" instead of "networkidle", and click actions use timeout=10000, which reduces the time spent waiting on Playwright. Also added screenshot and action recording for Playwright.
  3. In executor.py, the screenshot function now calls self.playwright_tool.screenshot() instead of self.page.screenshot(), to avoid timeout errors.
  4. Wrote a much more detailed system prompt in test_task.json.

With the revised environment, the Claude agent achieved a 58% success rate on the first 50 tasks in Online-Mind2Web.
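
The load-state and click-timeout change in item 2 can be illustrated with a minimal sketch. This is not the code in this PR: `navigate_and_click` and the constant names are hypothetical, and `page` is assumed to behave like a Playwright sync-API `Page`, whose `goto(url, wait_until=...)` and `click(selector, timeout=...)` take timeouts in milliseconds. Waiting for "load" returns once the page's load event fires, while "networkidle" waits for a quiet period with no network traffic, which can stall on pages that keep making background requests.

```python
from typing import Any

# Values described in this PR; the constant names are illustrative only.
DEFAULT_WAIT_UNTIL = "load"   # the previous default was "networkidle"
CLICK_TIMEOUT_MS = 10_000     # Playwright timeouts are given in milliseconds


def navigate_and_click(
    page: Any,
    url: str,
    selector: str,
    wait_until: str = DEFAULT_WAIT_UNTIL,
    click_timeout_ms: int = CLICK_TIMEOUT_MS,
) -> None:
    """Hypothetical helper: open `url`, then click `selector`.

    `page` is assumed to expose Playwright's sync Page methods
    `goto(url, wait_until=...)` and `click(selector, timeout=...)`.
    """
    # Return as soon as the load event fires instead of waiting for idle network.
    page.goto(url, wait_until=wait_until)
    # Fail after click_timeout_ms if the element never becomes clickable.
    page.click(selector, timeout=click_timeout_ms)
```

Passing `wait_until="networkidle"` to the helper would restore the old behaviour for a page that genuinely needs it.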


Note

Introduces a custom Playwright tool with screenshot/action history capture, configures AnchorBrowser viewport and longer session timeouts, updates defaults (load-state, click timeout), and refreshes deps/test task.

  • Playwright tooling:
    • New OnlineMind2Web_PlaywrightTool with:
      • Screenshot capture (saved to /screenshot) and action recording to /action_history.
      • Actions: navigate, click (timeout=10000), type, select_option, wait_for_element, get_elements, get_page_content.
      • Returns ContentResult and integrates auto-screenshots after key actions.
    • Executor now uses playwright_tool.screenshot() for reliability.
    • Server switches to local OnlineMind2Web_PlaywrightTool and registers named computer tools.
  • Provider (AnchorBrowser):
    • Defaults: max_duration=300, idle_timeout=120.
    • Adds viewport config (from kwargs or DISPLAY_WIDTH/DISPLAY_HEIGHT).
  • Defaults/behavior:
    • Setup navigate_to_url default wait_for_load_state -> "load".
    • Docker image env: DISPLAY_WIDTH=1400, DISPLAY_HEIGHT=850.
    • Minor logging fix in webjudge.
  • Dependencies:
    • Bump hud-python to >=0.4.67; add anthropic>=0.74.0.
  • Test task:
    • Replaces sample task with Trader Joe’s flow; uses playwright setup and webjudge evaluate; adds detailed system prompt.

Written by Cursor Bugbot for commit cf11cbe.
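
The viewport bullet in the note above (taken from kwargs, otherwise from DISPLAY_WIDTH/DISPLAY_HEIGHT) can be sketched roughly as follows. This is an illustration under stated assumptions, not the provider's code: `resolve_viewport`, the `width`/`height` kwarg names, and the shape of the returned dict are hypothetical. Only the environment variable names and the 1400x850 values come from this PR.

```python
import os
from typing import Any, Dict, Mapping, Optional

# Fallbacks match the Docker image env values listed in this PR.
DEFAULT_WIDTH = 1400
DEFAULT_HEIGHT = 850


def resolve_viewport(
    kwargs: Mapping[str, Any],
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, int]:
    """Hypothetical helper: pick a viewport size.

    Precedence: explicit kwargs, then DISPLAY_WIDTH/DISPLAY_HEIGHT from the
    environment, then the defaults above. The kwarg names are assumptions.
    """
    env = os.environ if env is None else env
    width = kwargs.get("width") or env.get("DISPLAY_WIDTH") or DEFAULT_WIDTH
    height = kwargs.get("height") or env.get("DISPLAY_HEIGHT") or DEFAULT_HEIGHT
    # Environment values arrive as strings, so normalise to int.
    return {"width": int(width), "height": int(height)}
```

Resolving the size in one place keeps the environment display and the remote browser session on the same dimensions, which is the mismatch the PR description says it fixed.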

@Genteki Genteki changed the title from "Fix Problem for Online-Mind2Web" to "Fix Problems for Online-Mind2Web" on Nov 22, 2025

@Parth220 Parth220 left a comment

Looks great!

Performance parity with the academic/public versions, and mirroring them closely, is essential for consistent and reliable scores.

HudComputerTool,
)
from hud.tools import PlaywrightTool
from .tools import OlineMind2Web_PlaywrightTool as PlaywrightTool
nit: OlineMind2Web_PlaywrightTool -> OnlineMind2Web_PlaywrightTool

logger = logging.getLogger(__name__)


class OlineMind2Web_PlaywrightTool(PlaywrightTool):
nit: OlineMind2Web_PlaywrightTool -> OnlineMind2Web_PlaywrightTool

Genteki commented Nov 24, 2025

Fixed typos and other small issues. Ready to merge!

@Parth220 Parth220 merged commit 7a43031 into hud-evals:main Nov 24, 2025
5 checks passed