Make OpenHands browser tools optional for non-web datasets by neubig · Pull Request #213 · neulab/agent-data-protocol

neubig · 2026-05-18T01:02:05Z

Summary

Extract the lazy-import refactor that was previously duplicated inside PRs #193 (CodeScout) and #197 (jupyter-agent) into its own change so those PRs can revert to dataset-only diffs.

Motivation

Non-web datasets (CodeScout, jupyter-agent, etc.) currently cannot run agents/openhands/std_to_sft.py on environments that do not have browsergym installed, because:

agents/openhands/system_prompt/tools/__init__.py does an unconditional from .browser import BrowserTool and browser.py imports browsergym at module load.
agents/openhands/std_to_sft.py constructs HTMLToAXTree(dataset) at module load even when the dataset has no WebObservation events.

The fix is to defer browser-related imports until they are actually needed.

Changes

agents/openhands/system_prompt/tools/__init__.py — Wrap the BrowserTool re-export in try/except ModuleNotFoundError. The handler only swallows the error when the missing module is browsergym (or a submodule); any other ImportError still propagates. BrowserTool is bound to None when browsergym is unavailable.
agents/openhands/system_prompt/system.py — Switch the top-level tool imports from the package __init__ to their direct submodules so module load no longer touches browser.py. Defer from agents.openhands.system_prompt.tools.browser import BrowserTool to inside the if codeact_enable_browsing: branch of get_tools.
agents/openhands/std_to_sft.py — Lazy-load scripts.html_to_axtree.HTMLToAXTree behind a get_generate_axtree() helper; it is only constructed when a WebObservation event is actually encountered. Also thread the existing --is_web CLI flag into get_system_message(codeact_enable_browsing=is_web) so non-web datasets actually get a non-web system prompt (today the default True is always used).
tests/test_openhands_sft_role_preservation.py — Loosen the fake get_system_message to *args, **kwargs to accept the new keyword argument.
tests/test_optional_browser.py (new) — Regression test (skipped when litellm is absent) that installs a sys.meta_path finder which raises ModuleNotFoundError for any browsergym* import, then asserts (a) agents.openhands.system_prompt.tools imports cleanly with BrowserTool is None and (b) get_system_message(codeact_enable_browsing=False) returns a prompt that does not advertise BrowserTool.
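
The narrow error handling described for tools/__init__.py can be sketched as below. This is an illustration under assumptions, not the PR's code: the PR uses a plain try/except around `from .browser import BrowserTool`, while `import_optional` here is a hypothetical helper that shows the same rule (swallow only the one optional dependency, re-raise everything else) in a testable form.

```python
"""Sketch: re-export an optional symbol, tolerating exactly one missing dependency."""
import importlib


def import_optional(module_name, attr, optional_root):
    """Return getattr(module, attr), or None if `optional_root` is not installed.

    Any other missing module (or any other ImportError) still propagates.
    """
    try:
        module = importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        missing = exc.name or ""
        # Match the package itself or one of its submodules, but not a
        # different package that merely shares the prefix.
        if missing == optional_root or missing.startswith(optional_root + "."):
            return None
        raise
    return getattr(module, attr)
```

Checking `exc.name` rather than catching every ImportError is the design point: a typo or a genuinely broken dependency elsewhere in the import chain still fails loudly instead of silently binding the symbol to None.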

Validation

python -m pytest tests/ → 183 passed, 12 skipped, 4 warnings.

Evidence — end-to-end conversion of a non-web dataset without browsergym

Driver script (full source below): installs a sys.meta_path finder that raises ModuleNotFoundError for any browsergym* import, sets MY_DATASET=codeactinstruct (a non-web dataset already in the repo), imports the production agents.openhands.std_to_sft module, and calls main_with_args(line, is_web=False, api_env=None) on one record from datasets/codeactinstruct/sample_std.json.

Control run — on `main` (without this PR)

[sanity] browsergym blocked as expected: No module named 'browsergym'
Traceback (most recent call last):
  ...
  File ".../agents/openhands/std_to_sft.py", line 14, in <module>
    from agents.openhands.system_prompt.system import get_system_message
  File ".../agents/openhands/system_prompt/system.py", line 3, in <module>
    from agents.openhands.system_prompt.tools import (
  File ".../agents/openhands/system_prompt/tools/__init__.py", line 2, in <module>
    from .browser import BrowserTool
  File ".../agents/openhands/system_prompt/tools/browser.py", line 1, in <module>
    from browsergym.core.action.highlevel import HighLevelActionSet
ModuleNotFoundError: No module named 'browsergym'

→ The pipeline fails to import; converter is unusable.

Treatment run — on this branch (`refactor-optional-browser`)

[sanity] browsergym blocked as expected: No module named 'browsergym'
OK: produced SFT record with 8 conversation turns
system prompt length: 13399 chars (browser tools omitted)

→ The pipeline runs to completion, returns a well-formed SFT record (verified via json.loads + structural assertions on conversations/system), and the system prompt does not advertise BrowserTool (asserted in the driver).

Driver script

"""Driver: block any browsergym import (PEP 451 finder), then run std_to_sft on one record."""
import importlib.machinery
import sys


class _BlockBrowserGym:
    def find_spec(self, fullname, path=None, target=None):
        if fullname.startswith("browsergym"):
            return importlib.machinery.ModuleSpec(fullname, self)
        return None

    def create_module(self, spec):
        return None

    def exec_module(self, module):
        raise ModuleNotFoundError(
            f"No module named {module.__name__!r}", name=module.__name__
        )


sys.meta_path.insert(0, _BlockBrowserGym())

try:
    import browsergym  # noqa: F401
    print("UNEXPECTED: browsergym imported successfully", file=sys.stderr)
    sys.exit(99)
except ModuleNotFoundError as e:
    print(f"[sanity] browsergym blocked as expected: {e}", file=sys.stderr)

import os
os.environ["MY_DATASET"] = "codeactinstruct"

import importlib
std_to_sft = importlib.import_module("agents.openhands.std_to_sft")

import json
with open("datasets/codeactinstruct/sample_std.json") as f:
    sample = json.load(f)
record_line = json.dumps(sample[0])
out = std_to_sft.main_with_args(record_line, is_web=False, api_env=None)
if not out:
    print("FAIL: std_to_sft.main_with_args returned no output", file=sys.stderr)
    sys.exit(1)
parsed = json.loads(out)
assert "conversations" in parsed and isinstance(parsed["conversations"], list)
assert "system" in parsed
assert "BrowserTool" not in parsed["system"], "system prompt unexpectedly mentions BrowserTool"
print(f"OK: produced SFT record with {len(parsed['conversations'])} conversation turns")
print(f"system prompt length: {len(parsed['system'])} chars (browser tools omitted)")

Follow-up

Once this is merged, PRs #193 and #197 will be rebased onto main to drop their copies of these four files; their diffs should then contain only their respective datasets/ directory plus the README.md/agents/openhands/DATASETS.md catalog entries (already done — see #197 and #193).

This PR was prepared by an AI agent (OpenHands) on behalf of the user. Originating conversation context is available to the requester.

The following changes to the OpenHands agent pipeline and its tests let non-web dataset converters run on machines that do not have browsergym installed:

1. agents/openhands/system_prompt/tools/__init__.py wraps the 'from .browser import BrowserTool' import in try/except ModuleNotFoundError. The except branch only swallows the error when the missing module is browsergym (or a submodule); any unrelated ImportError still propagates. The BrowserTool name is bound to None when browsergym is unavailable.
2. agents/openhands/system_prompt/system.py defers the BrowserTool import to inside the 'if codeact_enable_browsing:' branch of get_tools and switches the remaining tool imports to their direct submodules so the module-level import no longer touches browser.py.
3. agents/openhands/std_to_sft.py lazy-loads scripts.html_to_axtree.HTMLToAXTree behind get_generate_axtree(); it is only constructed when a WebObservation event is actually seen. process_row also threads the existing --is_web CLI flag through to get_system_message(codeact_enable_browsing=is_web) so non-web datasets actually get a non-web system prompt.
4. tests/test_openhands_sft_role_preservation.py loosens its fake get_system_message to '*args, **kwargs' so the new keyword argument used by std_to_sft.py does not break the fake.
5. A new regression test tests/test_optional_browser.py installs a meta_path finder that raises ModuleNotFoundError for any browsergym* import, then asserts that agents.openhands.system_prompt.tools imports cleanly (with BrowserTool is None) and that get_system_message(codeact_enable_browsing=False) returns a prompt that does not advertise BrowserTool.

This change was previously duplicated inside two unrelated dataset PRs (#193 CodeScout and #197 jupyter-agent). Lifting it into its own PR removes the duplication and lets those PRs revert to dataset-only diffs.

This pull request was prepared by an AI agent (OpenHands) on behalf of the user.

Co-authored-by: openhands <openhands@all-hands.dev>

The previous tip of this PR carried a copy of the 'make OpenHands browser tools optional' refactor in agents/openhands/system_prompt/tools/__init__.py, agents/openhands/system_prompt/system.py, agents/openhands/std_to_sft.py, and the tests/test_openhands_sft_role_preservation.py fake. The same diff was duplicated on #193 (CodeScout).

That refactor has been extracted to PR #213 ('Make OpenHands browser tools optional for non-web datasets'). Reset those four files to their main-branch state so this PR contains only jupyter-agent dataset changes (datasets/jupyter-agent-dataset/* + README.md + agents/openhands/DATASETS.md catalog row). Once #213 lands and this branch is rebased onto main, the lazy-import semantics will reappear via that PR.

Co-authored-by: openhands <openhands@all-hands.dev>

The previous tip of this PR carried a copy of the 'make OpenHands browser tools optional' refactor in agents/openhands/system_prompt/tools/__init__.py, agents/openhands/system_prompt/system.py, agents/openhands/std_to_sft.py, and the tests/test_openhands_sft_role_preservation.py fake. The same diff was duplicated on #197 (jupyter-agent).

That refactor has been extracted to PR #213 ('Make OpenHands browser tools optional for non-web datasets'). Reset those four files to their main-branch state so this PR contains only CodeScout dataset changes (datasets/codescout/* + agents/openhands/DATASETS.md catalog row). Once #213 lands and this branch is rebased onto main, the lazy-import semantics will reappear via that PR.

Co-authored-by: openhands <openhands@all-hands.dev>

github-actions

🟡 Acceptable — The lazy-import refactor is clean and correct. One must-fix per the PR evidence policy, plus a minor test annotation issue.

This review was generated by an AI agent (OpenHands) on behalf of the user.

github-actions · 2026-05-18T01:04:52Z

+            return self
+        return None
+
+    def load_module(self, name):  # pragma: no cover - exercised via import machinery


🟡 Suggestion: load_module IS exercised by the import machinery when find_module returns self — that's exactly the path that raises ModuleNotFoundError and exercises the try/except in __init__.py. The # pragma: no cover annotation incorrectly excludes a covered (and critical) line from coverage. Remove it.

Suggested change

-    def load_module(self, name):  # pragma: no cover - exercised via import machinery
+    def load_module(self, name):

github-actions · 2026-05-18T01:04:52Z

+generate_axtree = None
+
+
+def get_generate_axtree():


🟠 Important (PR description): The PR description's Validation section only shows pytest output. Per the project's evidence policy, unit tests alone do not count as proof that the change works. Please add an Evidence section showing an actual end-to-end invocation — e.g. running std_to_sft.py on a non-web dataset in an environment without browsergym installed, with the resulting output pasted. A link to the originating OpenHands conversation (https://app.all-hands.dev/conversations/{id}) would also satisfy this requirement.

The previous version of tests/test_optional_browser.py reloaded agents.openhands.system_prompt.tools by monkeypatching sys.meta_path in-process. That fails in CI because the workflow's requirements.txt does not install litellm, and reloading the tools package triggers litellm imports from each tool module's top level (bash.py, finish.py, etc.).

Two changes:

1. Run the import under a subprocess so the meta_path finder is the only entry on the fresh interpreter's import path. This avoids cross-test contamination with any tools modules that may already be cached in the parent's sys.modules.
2. Add a pytest.importorskip('litellm') guard. The optional-browser path is only reachable when litellm is installed (the tool modules import it unconditionally); in environments without litellm the import chain is broken before the BrowserTool try/except is even reached, so a regression test there would always fail for an unrelated reason.

Co-authored-by: openhands <openhands@all-hands.dev>
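
The subprocess approach described in that commit can be sketched roughly as follows. This is an illustration under assumptions rather than the contents of tests/test_optional_browser.py: `hypothetical_optional_dep`, `CHILD_SOURCE`, and `run_child` are made-up names, and the child script checks a generic try/except fallback instead of importing the real tools package.

```python
"""Sketch: check an import-time fallback in a fresh interpreter via subprocess."""
import subprocess
import sys
import textwrap

BLOCKED = "hypothetical_optional_dep"  # stand-in for the optional package

# Source executed by the child interpreter. Doubled braces survive the outer
# f-string and become a normal f-string placeholder in the child.
CHILD_SOURCE = textwrap.dedent(
    f"""
    import importlib.machinery
    import sys


    class _Block:
        # PEP 451 finder/loader that makes one package look uninstalled.
        def find_spec(self, fullname, path=None, target=None):
            if fullname == {BLOCKED!r} or fullname.startswith({BLOCKED!r} + "."):
                return importlib.machinery.ModuleSpec(fullname, self)
            return None

        def create_module(self, spec):
            return None

        def exec_module(self, module):
            raise ModuleNotFoundError(
                f"No module named {{module.__name__!r}}", name=module.__name__
            )


    sys.meta_path.insert(0, _Block())

    try:
        import {BLOCKED}
    except ModuleNotFoundError:
        Tool = None  # the fallback under test
    else:
        Tool = object

    print("Tool is None:", Tool is None)
    """
)


def run_child():
    """Run CHILD_SOURCE in a new interpreter and return the CompletedProcess."""
    return subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],
        capture_output=True,
        text=True,
        check=False,
    )
```

Because the finder is installed inside the child, nothing cached in the parent's sys.modules can mask the fallback, and the parent's own import state is left untouched. A test would assert on `run_child().returncode` and on the printed line.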

neubig · 2026-05-18T02:06:46Z

Addressed both review comments:

1. # pragma: no cover annotation on _BlockBrowserGym.load_module — the suggestion was based on an intermediate state; the pushed version of tests/test_optional_browser.py does not contain that annotation. The current outer code path actually doesn't have a class-level load_module at all (the meta-path finder lives inside the subprocess preamble string), so coverage is not a concern. Verified with grep -n pragma tests/test_optional_browser.py → no matches.

2. Evidence section per project policy — added a full ## Evidence block to the PR description with both a control run (on main, fails with ModuleNotFoundError: No module named 'browsergym' at the import chain) and a treatment run (on this branch, produces a valid 8-turn SFT record with system prompt length 13,399 chars and no BrowserTool mention). The driver script is reproducible and pasted in full so a reviewer can re-run it.

Please re-review.

This comment was posted by an AI agent (OpenHands) on behalf of the user.

github-actions

🟡 Acceptable — The lazy-import refactor is clean and correct. The unresolved evidence thread on std_to_sft.py:28 still needs to be addressed before merge: the Validation section only shows pytest output, which does not satisfy the project's evidence policy. Please add an Evidence section with an actual end-to-end invocation (e.g. running std_to_sft.py --is_web=no through the pipeline on a non-web dataset) and its output.

One minor new note on the test blocker below.

This review was generated by an AI agent (OpenHands) on behalf of the user.

github-actions · 2026-05-18T02:14:31Z

+        def find_module(self, name, path=None):
+            if name.startswith("browsergym"):
+                return self
+            return None
+
+        def load_module(self, name):
+            raise ModuleNotFoundError(f"No module named {name!r}", name=name)


🟡 Suggestion: _BlockBrowserGym uses the legacy PEP 302 find_module/load_module interface (deprecated since Python 3.4). The import system's support for find_module was removed in Python 3.12, so a finder that only defines find_module is not consulted there; the modern find_spec/exec_module protocol works on current and future Python versions:

def find_spec(self, fullname, path, target=None):
    if fullname.startswith("browsergym"):
        import importlib.machinery
        return importlib.machinery.ModuleSpec(fullname, self)
    return None

def create_module(self, spec):
    return None

def exec_module(self, module):
    raise ModuleNotFoundError(f"No module named {module.__name__!r}", name=module.__name__)

Address inline review on #213: replace the legacy PEP 302 find_module/load_module pair with the modern PEP 451 find_spec/create_module/exec_module trio. The legacy interface has been deprecated since Python 3.4 and may be removed in a future release; the new interface is what the import machinery has used internally since 3.4 and is forward-compatible.

Also moves the sanity check that the finder fires into the test body and updates the module docstring to reference the new protocol. The test still passes locally with the same exit codes and assertion output; behavior is unchanged.

Co-authored-by: openhands <openhands@all-hands.dev>

neubig · 2026-05-18T02:17:40Z

Addressed the new review:

PEP 451 modernization (tests/test_optional_browser.py L53) — replaced the legacy PEP 302 find_module/load_module pair with the modern PEP 451 find_spec/create_module/exec_module trio in both the test file and the driver-script snippet in the PR description. Updated docstring to reference the new protocol. Test still passes locally with 2/2 green.

Evidence (PR description) — note that the ## Evidence section was added in the previous round; the prior re-review bot may have been looking at a cached version. The current PR body now has ## Summary / ## Motivation / ## Changes / ## Validation / ## Evidence — end-to-end conversion of a non-web dataset without browsergym (### Control / ### Treatment / ### Driver script) / ## Follow-up. The control run shows a real ModuleNotFoundError against main; the treatment run shows a real 8-turn SFT record produced from datasets/codeactinstruct/sample_std.json.

Re-running the driver locally with the updated PEP 451 finder produces the same output:

[sanity] browsergym blocked as expected: No module named 'browsergym'
OK: produced SFT record with 8 conversation turns
system prompt length: 13399 chars (browser tools omitted)

Re-applying review-this label to trigger another review.

This comment was posted by an AI agent (OpenHands) on behalf of the user.

neubig · 2026-05-18T02:17:57Z

Addressed both points from the latest review:

PEP 451 modernization (tests/test_optional_browser.py line 53) — replaced legacy find_module/load_module with find_spec/create_module/exec_module in both the test file (commit 5bf4e33) and the driver snippet in the PR description. Test still green locally (2/2).
Evidence section — confirmed it's now in the PR description (see ## Evidence — end-to-end conversion of a non-web dataset without browsergym with Control/Treatment/Driver subsections). Earlier re-review may have hit a cached body.

Re-running the modernized driver locally:

[sanity] browsergym blocked as expected: No module named 'browsergym'
OK: produced SFT record with 8 conversation turns
system prompt length: 13399 chars (browser tools omitted)

This comment was posted by an AI agent (OpenHands) on behalf of the user.

github-actions

🟢 Good taste — All three previously unresolved threads are resolved in the current code:

PEP 451 protocol: test_optional_browser.py now uses find_spec/create_module/exec_module throughout; the deprecated find_module/load_module interface is gone.
# pragma: no cover: No such annotation exists in the current file.
Evidence section: The PR description now includes a full end-to-end Evidence section with a control run (showing the pre-fix import failure) and a treatment run (showing the converter producing a valid SFT record), satisfying the project's evidence policy.

The lazy-import refactor is clean and correct. No new issues found.

This review was generated by an AI agent (OpenHands) on behalf of the user.

This was referenced May 18, 2026

Add jupyter-agent dataset converter (#175) #197 (open)

Integrate CodeScout dataset for issue #184 #193 (open)

github-actions Bot requested changes May 18, 2026

neubig added the review-this label (Trigger the OpenHands PR review workflow) May 18, 2026

github-actions Bot reviewed May 18, 2026

neubig added and removed the review-this label (Trigger the OpenHands PR review workflow) May 18, 2026

github-actions Bot approved these changes May 18, 2026

neubig merged commit 40b0489 into main May 18, 2026
6 of 7 checks passed

neubig deleted the refactor-optional-browser branch May 18, 2026 02:31

neubig merged 3 commits into main from refactor-optional-browser

2 participants
