Fix Problems for Online-Mind2Web #207
Merged
Parth220 merged 6 commits into hud-evals:main on Nov 24, 2025
Parth220 (Contributor) approved these changes on Nov 23, 2025 and left a comment:
Looks great!
Performance parity with the academic/public versions, and faithful mirroring of them, is essential for consistent and reliable scores.
```diff
     HudComputerTool,
 )
-from hud.tools import PlaywrightTool
+from .tools import OlineMind2Web_PlaywrightTool as PlaywrightTool
```
Contributor:
nit: `OlineMind2Web_PlaywrightTool` -> `OnlineMind2Web_PlaywrightTool`
```python
logger = logging.getLogger(__name__)


class OlineMind2Web_PlaywrightTool(PlaywrightTool):
```
Contributor:
nit: `OlineMind2Web_PlaywrightTool` -> `OnlineMind2Web_PlaywrightTool`
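Per the PR description, this subclass adds screenshot capture and action recording on top of the base Playwright tool. The wrapping pattern might look roughly like the sketch below. It rests on loud assumptions: `BaseBrowserTool`, `RecordingBrowserTool`, their async `click`/`screenshot` methods, the `step_NNN.png` naming, and the `actions.jsonl` history file are all hypothetical stand-ins, not the hud `PlaywrightTool` API or the author's implementation.

```python
import asyncio
import json
import tempfile
from pathlib import Path
from typing import Any


class BaseBrowserTool:
    """Hypothetical stand-in for a base Playwright tool; not hud's real API."""

    async def click(self, selector: str) -> str:
        # A real tool would drive a Playwright page here.
        return f"clicked {selector}"

    async def screenshot(self) -> bytes:
        # Placeholder bytes instead of a real PNG capture.
        return b"fake-png-bytes"


class RecordingBrowserTool(BaseBrowserTool):
    """Wraps actions (only `click` shown) so each is logged and followed by a screenshot."""

    def __init__(self, screenshot_dir: Path, history_dir: Path) -> None:
        self.screenshot_dir = screenshot_dir
        self.history_dir = history_dir
        self.step = 0
        self.screenshot_dir.mkdir(parents=True, exist_ok=True)
        self.history_dir.mkdir(parents=True, exist_ok=True)

    async def _record(self, action: str, **params: Any) -> None:
        # Save an auto-screenshot for this step, then append one JSON line to the history.
        self.step += 1
        shot = self.screenshot_dir / f"step_{self.step:03d}.png"
        shot.write_bytes(await self.screenshot())
        entry = {"step": self.step, "action": action, "params": params, "screenshot": shot.name}
        with open(self.history_dir / "actions.jsonl", "a", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")

    async def click(self, selector: str) -> str:
        result = await super().click(selector)  # perform the action first
        await self._record("click", selector=selector)
        return result


async def demo(root: Path) -> list:
    """Run two clicks and return the parsed action history."""
    tool = RecordingBrowserTool(root / "screenshot", root / "action_history")
    await tool.click("#submit")
    await tool.click("text=Next")
    history = (root / "action_history" / "actions.jsonl").read_text(encoding="utf-8")
    return [json.loads(line) for line in history.splitlines()]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for entry in asyncio.run(demo(Path(tmp))):
            print(entry)
```

Recording after the base action returns means a failed action raises before anything is logged; a real tool might instead want to log failures too.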
environments/online_mind2web/src/hud_controller/tools/playwright.py (outdated, resolved)
Contributor (Author):
Fixed the typos and other small issues. Ready to merge!
Fixed the following problems:
- Use `wait_for_load_state="load"` instead of `"networkidle"`, and set `timeout=10000` (10 s) for the click action, to reduce Playwright's waiting time.
- Added screenshot and action recording for Playwright.
- Use `self.playwright_tool.screenshot()` instead of `self.page.screenshot()` in the screenshot function, to avoid timeout errors.

The revised environment achieved a 58% success rate for the Claude agent on the first 50 tasks of Online-Mind2Web.
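The load-state and click-timeout change can be sketched against Playwright's synchronous `Page` API (the environment's tool may well use the async API instead). Playwright's documentation discourages `networkidle` because it waits until there have been no network connections for at least 500 ms, which busy live sites may reach late or never; `load` returns once the page's load event fires. The helper names `navigate`/`click` and the constants are illustrative assumptions; only the parameter name `wait_for_load_state` and the values come from the PR. The page is duck-typed so it can be exercised with a stub.

```python
from typing import Any

LOAD_STATE = "load"        # was "networkidle" before this PR
CLICK_TIMEOUT_MS = 10_000  # Playwright timeouts are in milliseconds (10 s)


def navigate(page: Any, url: str, wait_for_load_state: str = LOAD_STATE) -> None:
    """Go to `url` and return once the chosen load state ("load" by default) is reached."""
    page.goto(url, wait_until=wait_for_load_state)


def click(page: Any, selector: str) -> None:
    """Click `selector`, giving up after 10 s rather than Playwright's 30 s default."""
    page.click(selector, timeout=CLICK_TIMEOUT_MS)
```

With Playwright installed, `page` would come from `sync_playwright().start().chromium.launch().new_page()` or the equivalent `with sync_playwright() as p:` block; a timed-out click then raises Playwright's `TimeoutError` after 10 s instead of 30 s.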
Note
Introduces a custom Playwright tool with screenshot/action history capture, configures the AnchorBrowser viewport and longer session timeouts, updates defaults (load state, click timeout), and refreshes dependencies and the test task.
- `OnlineMind2Web_PlaywrightTool` with:
  - screenshot capture (`/screenshot`) and action recording to `/action_history`;
  - actions `navigate`, `click` (`timeout=10000`), `type`, `select_option`, `wait_for_element`, `get_elements`, `get_page_content`;
  - … `ContentResult`, and integrates auto-screenshots after key actions.
- Screenshot function: uses `playwright_tool.screenshot()` for reliability.
- … `OnlineMind2Web_PlaywrightTool` and registers named computer tools.
- AnchorBrowser session: `max_duration=300`, `idle_timeout=120`; `viewport` config (from kwargs or `DISPLAY_WIDTH`/`DISPLAY_HEIGHT`).
- `navigate_to_url`: default `wait_for_load_state` -> `"load"`.
- `DISPLAY_WIDTH=1400`, `DISPLAY_HEIGHT=850`.
- … `webjudge`.
- `hud-python` to `>=0.4.67`; add `anthropic>=0.74.0`.
- Test task: … `playwright` setup and `webjudge` evaluate; adds a detailed system prompt.

Written by Cursor Bugbot for commit cf11cbe.
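The `viewport` fallback in the note (taken from kwargs, otherwise from `DISPLAY_WIDTH`/`DISPLAY_HEIGHT`) might be resolved as in the sketch below. `resolve_viewport`, the shape of the `viewport` kwarg, and using 1400×850 as fallback defaults when the variables are unset are all assumptions for illustration; how the environment passes the result to AnchorBrowser is not shown in this PR.

```python
import os
from typing import Any, Mapping

DEFAULT_WIDTH = 1400  # DISPLAY_WIDTH value from the PR note (assumed fallback)
DEFAULT_HEIGHT = 850  # DISPLAY_HEIGHT value from the PR note (assumed fallback)


def resolve_viewport(kwargs: Mapping[str, Any], env: Mapping[str, str] = os.environ) -> dict:
    """Return {"width", "height"}: an explicit `viewport` kwarg wins, else env vars, else defaults."""
    viewport = kwargs.get("viewport")
    if viewport is not None:
        return {"width": int(viewport["width"]), "height": int(viewport["height"])}
    return {
        "width": int(env.get("DISPLAY_WIDTH", DEFAULT_WIDTH)),
        "height": int(env.get("DISPLAY_HEIGHT", DEFAULT_HEIGHT)),
    }


# Example: with no kwarg and no env vars set, the fallback defaults apply.
# resolve_viewport({}, env={}) -> {"width": 1400, "height": 850}
```

Passing `env` explicitly keeps the function testable without mutating `os.environ`.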