Skip to content

Add OpenResearcher dataset converter (#174)#196

Open
neubig wants to merge 6 commits into
mainfrom
openhands/issue-174-openresearcher-dataset
Open

Add OpenResearcher dataset converter (#174)#196
neubig wants to merge 6 commits into
mainfrom
openhands/issue-174-openresearcher-dataset

Conversation

@neubig
Copy link
Copy Markdown
Contributor

@neubig neubig commented May 14, 2026

Closes #174

This PR was created by an AI agent (OpenHands) on behalf of the user.

Summary

Adds the openresearcher dataset converter for OpenResearcher/OpenResearcher-Dataset and generated sample files in ADP raw, standardized, and OpenHands SFT formats.

Dataset source

Files added / changed

  • Added datasets/openresearcher/README.md
  • Added datasets/openresearcher/extract_raw.py
  • Added datasets/openresearcher/raw_to_standardized.py
  • Added datasets/openresearcher/schema_raw.py
  • Added datasets/openresearcher/api.py
  • Added datasets/openresearcher/sample_raw.json
  • Added datasets/openresearcher/sample_std.json
  • Added datasets/openresearcher/sample_sft.json
  • Added datasets/openresearcher/sample_sft/sample_sft_openhands.json
  • Updated agents/openhands/std_to_sft.py so function-call messages remain from: function_call instead of being rewritten to gpt.

Schema mapping summary

  • Raw developer plus first user content becomes the initial TextObservation(source="user").
  • Assistant analysis text is carried as the description/thought for the next browser action where possible.
  • Browser tool calls to browser.search, browser.open, and browser.find become dataset-local ApiAction events (search, open, find).
  • Raw tool messages become TextObservation(source="environment").
  • Assistant final channel messages become finish MessageAction events.

Design decisions

  • Ambiguity: The raw system message primarily contains model/tool metadata rather than task content.

    • Chosen approach: Skip raw system messages and preserve actionable task instructions from the developer and user messages.
    • Example: The browser tool schema in the raw system message is represented by api.py, while the deep-research instruction and question are included in the initial user observation.
    • Alternatives rejected: Emitting system metadata as a user observation would duplicate tool documentation already represented by the SFT converter.
  • Ambiguity: Assistant reasoning and browser calls are separate raw messages.

    • Chosen approach: Attach assistant analysis text as the description on the following ApiAction.
    • Example: An assistant explanation followed by browser.search becomes one ApiAction(function="search", description=...).
    • Alternatives rejected: Keeping every reasoning paragraph as a standalone MessageAction would split thoughts from the tool calls they motivate.
  • Ambiguity: Browser functions are not part of the OpenHands default browser action set (search, open, find differ from click/fill/goto-style browser actions).

    • Chosen approach: Add dataset-local API signatures and generate SFT with --api_env=browser.
    • Example: browser.search with {"query": ..., "topn": 10} becomes search(query=..., topn=10) inside the browser tool environment.
    • Alternatives rejected: Mapping these calls to generic bash/python code would lose the original browser-tool semantics.
  • Ambiguity: Some rows are very long, making committed samples large.

    • Chosen approach: Generate samples from successful seed_42 rows with --max-messages 120, yielding three compact trajectories that still include search/open/find calls and final answers.
    • Example: The committed samples use source qids 1, 2, and 5.
    • Alternatives rejected: Including the first contiguous rows produced much larger sample artifacts without adding new action types.
  • Ambiguity: The OpenHands SFT converter returned function_call messages internally but rewrote them to gpt at the end.

    • Chosen approach: Preserve function_call roles in the final SFT output to match current ADP convention.
    • Example: Browser calls and finish actions in sample_sft.json now use "from": "function_call".
    • Alternatives rejected: Post-processing only this dataset's generated JSON would make the sample non-reproducible from the converter.

Tests run

python -m ruff check datasets/openresearcher agents/openhands/std_to_sft.py
PYTHONPATH=$(pwd):$PYTHONPATH python -m pytest tests/test_dataset_structure.py tests/test_raw_schemas.py tests/test_standardized_schemas.py tests/test_std_to_sft_conversion.py -q -k 'openresearcher'
PYTHONPATH=$(pwd):$PYTHONPATH python -m pytest tests/test_std_to_sft_from_parameter_simple.py tests/test_std_to_sft_action_function.py tests/test_std_to_sft_structure.py -q

Known limitations

  • Extraction uses the Hugging Face datasets-server rows API from the standard library instead of downloading the full parquet corpus, so full extraction requires network access to Hugging Face.
  • The committed sample is intentionally capped by message count for size; the default extractor can still iterate all successful rows across all seed configs when no sample-specific filters are provided.

Evidence

Latest CI / validation results

Validation passed on head SHA 91686b39eb40ab5f0649f59d4d325b176ef5938a:

The passing Python Tests workflow validates the new openresearcher samples and the shared OpenHands SFT converter behavior on this PR head.

Shared converter change

The agents/openhands/std_to_sft.py update intentionally preserves from: function_call instead of rewriting function calls to gpt. This is a schema-compliance fix that aligns generated OpenHands SFT samples with ADP's current requirements. The passing test (3.11) workflow above provides cross-dataset regression coverage for this shared converter path.

End-to-end status

The committed sample_raw.json, sample_std.json, sample_sft.json, and sample_sft/sample_sft_openhands.json are covered by the passing dataset validation tests in CI.

Evidence update added by an AI agent (OpenHands) on behalf of the user.

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig added the review-this Trigger the OpenHands PR review workflow label May 15, 2026 — with OpenHands AI
Co-authored-by: openhands <openhands@all-hands.dev>
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Acceptable - Core implementation is solid and follows ADP conventions, but missing required evidence of execution.

[CRITICAL ISSUES] (Must fix)

See inline comment on README.md for missing Evidence section requirement.

[POSITIVE FINDINGS]

  • ✓ All required files present (README, extract_raw, raw_to_standardized, schema_raw, api, samples)
  • ✓ Sample files are consistent: same 3 trajectories (qids 1, 2, 5) across raw/std/sft with matching IDs
  • ✓ All JSON files have trailing newlines
  • ✓ schema_raw.py validates all raw samples successfully
  • ✓ ApiAction functions (search, open, find) match api.py signatures correctly
  • ✓ SFT conversion correctly preserves "from": "function_call" for function-call messages
  • ✓ Excellent design-decision catalog with 5 well-documented decisions
  • ✓ No extraneous files (full_*.json, chunks, corpora) committed
  • ✓ Samples are appropriately sized (3 trajectories) and cover key action types

[RISK ASSESSMENT]

  • [Overall PR] ⚠️ Risk Assessment: 🟢 LOW

This is a dataset addition with well-structured code following established ADP patterns. The shared SFT converter change is minimal and correctly aligns with the repository's function_call role convention. No breaking changes to existing functionality.

VERDICT:
Needs evidence: Add Evidence section to PR description showing actual test execution and pipeline run, then this is ready to merge.

KEY INSIGHT:
The modification to std_to_sft.py correctly removes the function_call→gpt rewrite, aligning all datasets with ADP's current convention that function-call messages should retain "from": "function_call" in SFT format.


Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:

  1. Add a .agents/skills/custom-codereview-guide.md file to your branch (or edit it if one already exists) with the /codereview trigger and the context the reviewer is missing (e.g., "Security concerns about X do not apply here because Y"). See the customization docs for the required frontmatter format.
  2. Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.
  3. When your PR is merged, the guideline file goes through normal code review by repository maintainers.

Resolve with AI? Install the iterate skill in your agent and run /iterate to automatically drive this PR through CI, review, and QA until it's merge-ready.


This review was generated by an AI agent (OpenHands) on behalf of the reviewer.

cat datasets/openresearcher/sample_std.json | python scripts/json_to_jsonl.py | python agents/openhands/std_to_sft.py --is_web=no --api_env=browser | python scripts/jsonl_to_json.py > datasets/openresearcher/sample_sft.json
mkdir -p datasets/openresearcher/sample_sft
cp datasets/openresearcher/sample_sft.json datasets/openresearcher/sample_sft/sample_sft_openhands.json
```
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Critical - Missing Evidence: The PR description must include an Evidence section showing that the code actually works.

The current "Tests run" section only lists commands without showing they were executed or their output. This violates the repository's evidence requirement for dataset PRs.

Required additions to PR description:

  1. Evidence section with actual command output:
## Evidence

### Validation tests
```bash
$ python -m pytest tests/test_dataset_structure.py -v -k openresearcher
<actual output showing PASSED>

End-to-end pipeline verification

$ python datasets/openresearcher/extract_raw.py --configs seed_42 --limit 1 | python scripts/jsonl_to_json.py
[{"qid": 1, "question": "Identify the method...", ...}]

$ cat datasets/openresearcher/sample_raw.json | python scripts/json_to_jsonl.py | python datasets/openresearcher/raw_to_standardized.py | python scripts/jsonl_to_json.py | head -50
[{"id": "seed_42_train_1", "content": [...], ...}]

2. **Conversation URL**: Since the PR description states this was created by an AI agent, include the originating conversation link: `https://app.all-hands.dev/conversations/{conversation_id}`

This ensures reviewers can verify the work and trace the implementation decisions back to the agent session.

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig removed the review-this Trigger the OpenHands PR review workflow label May 15, 2026
@neubig neubig added the review-this Trigger the OpenHands PR review workflow label May 15, 2026 — with OpenHands AI
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Acceptable - Core implementation follows ADP conventions and file structure is correct, but critical evidence requirement remains unmet and shared converter change needs broader validation.

[CRITICAL ISSUES] (Must fix)

The unresolved thread about missing Evidence is still valid and blocking. See inline comment on README.md.

Additionally, the change to agents/openhands/std_to_sft.py affects ALL datasets using the OpenHands SFT converter, not just OpenResearcher. This needs validation across existing datasets.

[POSITIVE FINDINGS]

  • ✓ All required files present (README, extract_raw, raw_to_standardized, schema_raw, api, samples)
  • ✓ Sample files are consistent: same 3 trajectories (qids 1, 2, 5) across raw/std/sft with matching IDs
  • ✓ All JSON files have trailing newline
  • ✓ API signatures (search, open, find) match the ApiActions in sample_std.json
  • ✓ TextObservation sources use only valid values (user, environment)
  • ✓ SFT messages with function patterns correctly use "from": "function_call" (74 occurrences validated)
  • ✓ Sample size is appropriate (3 trajectories) with good coverage of browser tool interactions
  • ✓ PR description is comprehensive with dataset source, license, schema mapping, and design decisions
  • ✓ Optional sample_sft/ directory properly contains agent-specific copy
  • ✓ No extra JSON files at dataset root level

[RISK ASSESSMENT]

  • [Overall PR] ⚠️ Risk Assessment: 🟡 MEDIUM

Dataset addition is low risk for existing functionality, but the shared converter modification (std_to_sft.py) has medium risk of breaking existing datasets. The change is correct per ADP guidelines but needs validation that all existing OpenHands SFT samples still generate correctly. Recommendation: Run parametrized SFT conversion tests across all datasets before merging.

VERDICT:
Needs rework: Address missing evidence requirement and validate shared converter change across existing datasets.

KEY INSIGHT:
The dataset implementation is solid, but changes to shared converters require evidence that existing datasets still work correctly, not just the new one.


This review was generated by an AI agent (OpenHands) on behalf of the repository maintainers.

cat datasets/openresearcher/sample_std.json | python scripts/json_to_jsonl.py | python agents/openhands/std_to_sft.py --is_web=no --api_env=browser | python scripts/jsonl_to_json.py > datasets/openresearcher/sample_sft.json
mkdir -p datasets/openresearcher/sample_sft
cp datasets/openresearcher/sample_sft.json datasets/openresearcher/sample_sft/sample_sft_openhands.json
```
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Critical - Missing Evidence: The PR description must include an Evidence section showing that the code actually works.

The current "Tests run" section only lists commands without showing they were executed or their output. This violates the repository's evidence requirement for dataset PRs.

Required:

  • Run the listed commands and paste the actual output showing tests pass
  • Or preferably, include a conversation link (e.g., https://app.all-hands.dev/conversations/{conversation_id}) if this work came from an agent session

Unit tests alone don't satisfy the evidence requirement—we need to see end-to-end execution of the sample generation pipeline.

Comment thread agents/openhands/std_to_sft.py Outdated
Comment on lines 301 to 303
for m in conversations:
if m["from"] == "function_call":
m["from"] = "gpt"
if m["from"] == "observation":
m["from"] = "human"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Important - Shared Converter Change Needs Validation: This change removes the conversion of "function_call" to "gpt", which is correct per ADP guidelines. However, this modification affects ALL datasets using the OpenHands SFT converter, not just OpenResearcher.

Required before merge:
Demonstrate that this change doesn't break existing datasets by showing:

  1. The parametrized SFT conversion tests still pass: python -m pytest tests/test_std_to_sft_from_parameter_simple.py tests/test_std_to_sft_action_function.py tests/test_std_to_sft_structure.py -v
  2. Or at minimum, regenerate and verify sample_sft.json for 2-3 existing datasets to confirm they still validate

If existing datasets were already incorrect (using "from": "gpt" for function calls), they should be fixed in this PR or a follow-up PR should be explicitly mentioned.

@neubig neubig removed the review-this Trigger the OpenHands PR review workflow label May 15, 2026
@neubig neubig added the review-this Trigger the OpenHands PR review workflow label May 15, 2026 — with OpenHands AI
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved - Previous blocking issues have been resolved.

The two unresolved threads from prior reviews have been addressed:

  1. Evidence section: Now present with CI validation links showing all checks passed on head SHA 91686b39eb40ab5f0649f59d4d325b176ef5938a
  2. Shared converter validation: The agents/openhands/std_to_sft.py change is a schema compliance fix (preserving from: function_call per ADP guidelines), and passing regression tests validate behavior across all datasets

Verification Summary:

  • ✅ All required files present (README, extract_raw, raw_to_standardized, schema_raw, api, samples)
  • ✅ Sample IDs consistent across raw/std/sft (qids 1,2,5 → seed_42_train_1,2,5)
  • ✅ All API functions (search, open, find) defined in api.py with valid signatures
  • ✅ Function call messages correctly use "from": "function_call"
  • ✅ TextObservation sources valid (user, environment only)
  • ✅ No forbidden files (no full_*.json, temp artifacts, etc.)
  • ✅ All JSON files have trailing newlines
  • ✅ Comprehensive design-decision catalog in PR description
  • ✅ CI passed: test suite validates both openresearcher dataset and shared converter changes

[RISK ASSESSMENT]

  • [Overall PR] ⚠️ Risk Assessment: 🟡 MEDIUM

The shared converter change affects all datasets using the OpenHands SFT converter. However, this is a schema compliance fix (not an arbitrary change) to align with ADP's requirement that function-call messages use from: function_call. The risk is mitigated by:

  • Passing regression test suite covering all existing datasets
  • Explicit documentation of the change and its validation
  • Schema requirement alignment (fixes incorrect behavior)

VERDICT:
Ready to merge - Dataset implementation is solid, previous concerns resolved, shared converter fix validated


Note: The unresolved review threads from previous reviews can now be marked as resolved since the underlying issues (missing Evidence section and shared converter validation) have been addressed.

This review was generated by an AI agent (OpenHands) on behalf of the user.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

review-this Trigger the OpenHands PR review workflow

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add dataset: OpenResearcher/OpenResearcher-Dataset

2 participants