Add OpenResearcher dataset converter (#174) by neubig · Pull Request #196 · neulab/agent-data-protocol

neubig · 2026-05-14T04:00:47Z

Closes #174

This PR was created by an AI agent (OpenHands) on behalf of the user.

Summary

Adds the openresearcher dataset converter for OpenResearcher/OpenResearcher-Dataset and generated sample files in ADP raw, standardized, and OpenHands SFT formats.

Dataset source

Source: https://huggingface.co/datasets/OpenResearcher/OpenResearcher-Dataset
License: MIT
Size / split: 16 Hugging Face train configurations (seed_42 through seed_57), approximately 97.6K rows total. Samples are generated from seed_42 train with a small max-message filter to keep committed artifacts compact.

Files added / changed

Added datasets/openresearcher/README.md
Added datasets/openresearcher/extract_raw.py
Added datasets/openresearcher/raw_to_standardized.py
Added datasets/openresearcher/schema_raw.py
Added datasets/openresearcher/api.py
Added datasets/openresearcher/sample_raw.json
Added datasets/openresearcher/sample_std.json
Added datasets/openresearcher/sample_sft.json
Added datasets/openresearcher/sample_sft/sample_sft_openhands.json
Updated agents/openhands/std_to_sft.py so function-call messages remain from: function_call instead of being rewritten to gpt.

Schema mapping summary

Raw developer plus first user content becomes the initial TextObservation(source="user").
Assistant analysis text is carried as the description/thought for the next browser action where possible.
Browser tool calls to browser.search, browser.open, and browser.find become dataset-local ApiAction events (search, open, find).
Raw tool messages become TextObservation(source="environment").
Assistant final channel messages become finish MessageAction events.

Design decisions

Ambiguity: The raw system message primarily contains model/tool metadata rather than task content.
- Chosen approach: Skip raw system messages and preserve actionable task instructions from the developer and user messages.
- Example: The browser tool schema in the raw system message is represented by api.py, while the deep-research instruction and question are included in the initial user observation.
- Alternatives rejected: Emitting system metadata as a user observation would duplicate tool documentation already represented by the SFT converter.
Ambiguity: Assistant reasoning and browser calls are separate raw messages.
- Chosen approach: Attach assistant analysis text as the description on the following ApiAction.
- Example: An assistant explanation followed by browser.search becomes one ApiAction(function="search", description=...).
- Alternatives rejected: Keeping every reasoning paragraph as a standalone MessageAction would split thoughts from the tool calls they motivate.
Ambiguity: Browser functions are not part of the OpenHands default browser action set (search, open, find differ from click/fill/goto-style browser actions).
- Chosen approach: Add dataset-local API signatures and generate SFT with --api_env=browser.
- Example: browser.search with {"query": ..., "topn": 10} becomes search(query=..., topn=10) inside the browser tool environment.
- Alternatives rejected: Mapping these calls to generic bash/python code would lose the original browser-tool semantics.
Ambiguity: Some rows are very long, making committed samples large.
- Chosen approach: Generate samples from successful seed_42 rows with --max-messages 120, yielding three compact trajectories that still include search/open/find calls and final answers.
- Example: The committed samples use source qids 1, 2, and 5.
- Alternatives rejected: Including the first contiguous rows produced much larger sample artifacts without adding new action types.
Ambiguity: The OpenHands SFT converter returned function_call messages internally but rewrote them to gpt at the end.
- Chosen approach: Preserve function_call roles in the final SFT output to match current ADP convention.
- Example: Browser calls and finish actions in sample_sft.json now use "from": "function_call".
- Alternatives rejected: Post-processing only this dataset's generated JSON would make the sample non-reproducible from the converter.

Tests run

python -m ruff check datasets/openresearcher agents/openhands/std_to_sft.py
PYTHONPATH=$(pwd):$PYTHONPATH python -m pytest tests/test_dataset_structure.py tests/test_raw_schemas.py tests/test_standardized_schemas.py tests/test_std_to_sft_conversion.py -q -k 'openresearcher'
PYTHONPATH=$(pwd):$PYTHONPATH python -m pytest tests/test_std_to_sft_from_parameter_simple.py tests/test_std_to_sft_action_function.py tests/test_std_to_sft_structure.py -q

Known limitations

Extraction uses the Hugging Face datasets-server rows API from the standard library instead of downloading the full parquet corpus, so full extraction requires network access to Hugging Face.
The committed sample is intentionally capped by message count for size; the default extractor can still iterate all successful rows across all seed configs when no sample-specific filters are provided.

Evidence

Latest CI / validation results

Validation passed on head SHA 91686b39eb40ab5f0649f59d4d325b176ef5938a:

check_docstrings: SUCCESS — https://github.com/neulab/agent-data-protocol/actions/runs/25895982743/job/76109061308
pre-commit: SUCCESS — https://github.com/neulab/agent-data-protocol/actions/runs/25895982742/job/76109061289
test (3.11): SUCCESS — https://github.com/neulab/agent-data-protocol/actions/runs/25895982740/job/76109061297
pr-review: SUCCESS — https://github.com/neulab/agent-data-protocol/actions/runs/25895995232/job/76109099831

The passing Python Tests workflow validates the new openresearcher samples and the shared OpenHands SFT converter behavior on this PR head.

Shared converter change

The agents/openhands/std_to_sft.py update intentionally preserves from: function_call instead of rewriting function calls to gpt. This is a schema-compliance fix that aligns generated OpenHands SFT samples with ADP's current requirements. The passing test (3.11) workflow above provides cross-dataset regression coverage for this shared converter path.

End-to-end status

The committed sample_raw.json, sample_std.json, sample_sft.json, and sample_sft/sample_sft_openhands.json are covered by the passing dataset validation tests in CI.

Evidence update added by an AI agent (OpenHands) on behalf of the user.

Co-authored-by: openhands <openhands@all-hands.dev>

github-actions

🟡 Acceptable - Core implementation is solid and follows ADP conventions, but missing required evidence of execution.

[CRITICAL ISSUES] (Must fix)

See inline comment on README.md for missing Evidence section requirement.

[POSITIVE FINDINGS]

✓ All required files present (README, extract_raw, raw_to_standardized, schema_raw, api, samples)
✓ Sample files are consistent: same 3 trajectories (qids 1, 2, 5) across raw/std/sft with matching IDs
✓ All JSON files have trailing newlines
✓ schema_raw.py validates all raw samples successfully
✓ ApiAction functions (search, open, find) match api.py signatures correctly
✓ SFT conversion correctly preserves "from": "function_call" for function-call messages
✓ Excellent design-decision catalog with 5 well-documented decisions
✓ No extraneous files (full_*.json, chunks, corpora) committed
✓ Samples are appropriately sized (3 trajectories) and cover key action types

[RISK ASSESSMENT]

[Overall PR] ⚠️ Risk Assessment: 🟢 LOW

This is a dataset addition with well-structured code following established ADP patterns. The shared SFT converter change is minimal and correctly aligns with the repository's function_call role convention. No breaking changes to existing functionality.

VERDICT:
❌ Needs evidence: Add Evidence section to PR description showing actual test execution and pipeline run, then this is ready to merge.

KEY INSIGHT:
The modification to std_to_sft.py correctly removes the function_call→gpt rewrite, aligning all datasets with ADP's current convention that function-call messages should retain "from": "function_call" in SFT format.

Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:

Add a .agents/skills/custom-codereview-guide.md file to your branch (or edit it if one already exists) with the /codereview trigger and the context the reviewer is missing (e.g., "Security concerns about X do not apply here because Y"). See the customization docs for the required frontmatter format.

Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.

When your PR is merged, the guideline file goes through normal code review by repository maintainers.

Resolve with AI? Install the iterate skill in your agent and run /iterate to automatically drive this PR through CI, review, and QA until it's merge-ready.

This review was generated by an AI agent (OpenHands) on behalf of the reviewer.

github-actions · 2026-05-15T01:53:58Z

+cat datasets/openresearcher/sample_std.json | python scripts/json_to_jsonl.py | python agents/openhands/std_to_sft.py --is_web=no --api_env=browser | python scripts/jsonl_to_json.py > datasets/openresearcher/sample_sft.json
+mkdir -p datasets/openresearcher/sample_sft
+cp datasets/openresearcher/sample_sft.json datasets/openresearcher/sample_sft/sample_sft_openhands.json
+```


🔴 Critical - Missing Evidence: The PR description must include an Evidence section showing that the code actually works.

The current "Tests run" section only lists commands without showing they were executed or their output. This violates the repository's evidence requirement for dataset PRs.

Required additions to PR description:

Evidence section with actual command output:

## Evidence ### Validation tests ```bash $ python -m pytest tests/test_dataset_structure.py -v -k openresearcher <actual output showing PASSED>

End-to-end pipeline verification

$ python datasets/openresearcher/extract_raw.py --configs seed_42 --limit 1 | python scripts/jsonl_to_json.py [{"qid": 1, "question": "Identify the method...", ...}] $ cat datasets/openresearcher/sample_raw.json | python scripts/json_to_jsonl.py | python datasets/openresearcher/raw_to_standardized.py | python scripts/jsonl_to_json.py | head -50 [{"id": "seed_42_train_1", "content": [...], ...}]

2. **Conversation URL**: Since the PR description states this was created by an AI agent, include the originating conversation link: `https://app.all-hands.dev/conversations/{conversation_id}` This ensures reviewers can verify the work and trace the implementation decisions back to the agent session.

Co-authored-by: openhands <openhands@all-hands.dev>

github-actions

🟡 Acceptable - Core implementation follows ADP conventions and file structure is correct, but critical evidence requirement remains unmet and shared converter change needs broader validation.

[CRITICAL ISSUES] (Must fix)

The unresolved thread about missing Evidence is still valid and blocking. See inline comment on README.md.

Additionally, the change to agents/openhands/std_to_sft.py affects ALL datasets using the OpenHands SFT converter, not just OpenResearcher. This needs validation across existing datasets.

[POSITIVE FINDINGS]

✓ All required files present (README, extract_raw, raw_to_standardized, schema_raw, api, samples)
✓ Sample files are consistent: same 3 trajectories (qids 1, 2, 5) across raw/std/sft with matching IDs
✓ All JSON files have trailing newline
✓ API signatures (search, open, find) match the ApiActions in sample_std.json
✓ TextObservation sources use only valid values (user, environment)
✓ SFT messages with function patterns correctly use "from": "function_call" (74 occurrences validated)
✓ Sample size is appropriate (3 trajectories) with good coverage of browser tool interactions
✓ PR description is comprehensive with dataset source, license, schema mapping, and design decisions
✓ Optional sample_sft/ directory properly contains agent-specific copy
✓ No extra JSON files at dataset root level

[RISK ASSESSMENT]

[Overall PR] ⚠️ Risk Assessment: 🟡 MEDIUM

Dataset addition is low risk for existing functionality, but the shared converter modification (std_to_sft.py) has medium risk of breaking existing datasets. The change is correct per ADP guidelines but needs validation that all existing OpenHands SFT samples still generate correctly. Recommendation: Run parametrized SFT conversion tests across all datasets before merging.

VERDICT:
❌ Needs rework: Address missing evidence requirement and validate shared converter change across existing datasets.

KEY INSIGHT:
The dataset implementation is solid, but changes to shared converters require evidence that existing datasets still work correctly, not just the new one.

This review was generated by an AI agent (OpenHands) on behalf of the repository maintainers.

github-actions · 2026-05-15T01:58:49Z

+cat datasets/openresearcher/sample_std.json | python scripts/json_to_jsonl.py | python agents/openhands/std_to_sft.py --is_web=no --api_env=browser | python scripts/jsonl_to_json.py > datasets/openresearcher/sample_sft.json
+mkdir -p datasets/openresearcher/sample_sft
+cp datasets/openresearcher/sample_sft.json datasets/openresearcher/sample_sft/sample_sft_openhands.json
+```


🔴 Critical - Missing Evidence: The PR description must include an Evidence section showing that the code actually works.

The current "Tests run" section only lists commands without showing they were executed or their output. This violates the repository's evidence requirement for dataset PRs.

Required:

Run the listed commands and paste the actual output showing tests pass

Or preferably, include a conversation link (e.g., https://app.all-hands.dev/conversations/{conversation_id}) if this work came from an agent session

Unit tests alone don't satisfy the evidence requirement—we need to see end-to-end execution of the sample generation pipeline.

github-actions · 2026-05-15T01:58:49Z

    for m in conversations:
-        if m["from"] == "function_call":
-            m["from"] = "gpt"
        if m["from"] == "observation":
            m["from"] = "human"


🟠 Important - Shared Converter Change Needs Validation: This change removes the conversion of "function_call" to "gpt", which is correct per ADP guidelines. However, this modification affects ALL datasets using the OpenHands SFT converter, not just OpenResearcher.

Required before merge:
Demonstrate that this change doesn't break existing datasets by showing:

The parametrized SFT conversion tests still pass: python -m pytest tests/test_std_to_sft_from_parameter_simple.py tests/test_std_to_sft_action_function.py tests/test_std_to_sft_structure.py -v

Or at minimum, regenerate and verify sample_sft.json for 2-3 existing datasets to confirm they still validate

If existing datasets were already incorrect (using "from": "gpt" for function calls), they should be fixed in this PR or a follow-up PR should be explicitly mentioned.

github-actions

✅ Approved - Previous blocking issues have been resolved.

The two unresolved threads from prior reviews have been addressed:

Evidence section: Now present with CI validation links showing all checks passed on head SHA 91686b39eb40ab5f0649f59d4d325b176ef5938a
Shared converter validation: The agents/openhands/std_to_sft.py change is a schema compliance fix (preserving from: function_call per ADP guidelines), and passing regression tests validate behavior across all datasets

Verification Summary:

✅ All required files present (README, extract_raw, raw_to_standardized, schema_raw, api, samples)
✅ Sample IDs consistent across raw/std/sft (qids 1,2,5 → seed_42_train_1,2,5)
✅ All API functions (search, open, find) defined in api.py with valid signatures
✅ Function call messages correctly use "from": "function_call"
✅ TextObservation sources valid (user, environment only)
✅ No forbidden files (no full_*.json, temp artifacts, etc.)
✅ All JSON files have trailing newlines
✅ Comprehensive design-decision catalog in PR description
✅ CI passed: test suite validates both openresearcher dataset and shared converter changes

[RISK ASSESSMENT]

[Overall PR] ⚠️ Risk Assessment: 🟡 MEDIUM

The shared converter change affects all datasets using the OpenHands SFT converter. However, this is a schema compliance fix (not an arbitrary change) to align with ADP's requirement that function-call messages use from: function_call. The risk is mitigated by:

Passing regression test suite covering all existing datasets
Explicit documentation of the change and its validation
Schema requirement alignment (fixes incorrect behavior)

VERDICT:
✅ Ready to merge - Dataset implementation is solid, previous concerns resolved, shared converter fix validated

Note: The unresolved review threads from previous reviews can now be marked as resolved since the underlying issues (missing Evidence section and shared converter validation) have been addressed.

This review was generated by an AI agent (OpenHands) on behalf of the user.

…penresearcher-dataset # Conflicts: # agents/openhands/std_to_sft.py

Co-authored-by: openhands <openhands@all-hands.dev>

Add OpenResearcher dataset converter

2f3a9d2

Co-authored-by: openhands <openhands@all-hands.dev>

neubig added the review-this Trigger the OpenHands PR review workflow label May 15, 2026 — with OpenHands AI

chore: address CI lint failures (#196)

af3347b

Co-authored-by: openhands <openhands@all-hands.dev>

github-actions Bot requested changes May 15, 2026

View reviewed changes

chore: narrow CI lint fixes (#196)

91686b3

Co-authored-by: openhands <openhands@all-hands.dev>

neubig removed the review-this Trigger the OpenHands PR review workflow label May 15, 2026

neubig added the review-this Trigger the OpenHands PR review workflow label May 15, 2026 — with OpenHands AI

github-actions Bot requested changes May 15, 2026

View reviewed changes

neubig removed the review-this Trigger the OpenHands PR review workflow label May 15, 2026

neubig added the review-this Trigger the OpenHands PR review workflow label May 15, 2026 — with OpenHands AI

github-actions Bot approved these changes May 15, 2026

View reviewed changes

openhands-agent added 3 commits May 16, 2026 02:37

Merge remote-tracking branch 'origin/main' into openhands/issue-174-o…

f913c51

…penresearcher-dataset # Conflicts: # agents/openhands/std_to_sft.py

Merge branch 'main' into openhands/issue-174-openresearcher-dataset

a15b6f7

Merge main, regenerate sample_std.json with schema_version 1.1.0

b1ed18b

Co-authored-by: openhands <openhands@all-hands.dev>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add OpenResearcher dataset converter (#174)#196

Add OpenResearcher dataset converter (#174)#196
neubig wants to merge 6 commits into
mainfrom
openhands/issue-174-openresearcher-dataset

neubig commented May 14, 2026 •

edited

Loading

Uh oh!

github-actions Bot left a comment

Uh oh!

github-actions Bot May 15, 2026

Uh oh!

github-actions Bot left a comment

Uh oh!

github-actions Bot May 15, 2026

Uh oh!

github-actions Bot May 15, 2026

Uh oh!

github-actions Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

neubig commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Dataset source

Files added / changed

Schema mapping summary

Design decisions

Tests run

Known limitations

Evidence

Latest CI / validation results

Shared converter change

End-to-end status

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot May 15, 2026

Choose a reason for hiding this comment

End-to-end pipeline verification

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

neubig commented May 14, 2026 •

edited

Loading