Add OpenResearcher dataset converter (#174)#196
Conversation
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
There was a problem hiding this comment.
🟡 Acceptable - Core implementation is solid and follows ADP conventions, but missing required evidence of execution.
[CRITICAL ISSUES] (Must fix)
See inline comment on README.md for missing Evidence section requirement.
[POSITIVE FINDINGS]
- ✓ All required files present (README, extract_raw, raw_to_standardized, schema_raw, api, samples)
- ✓ Sample files are consistent: same 3 trajectories (qids 1, 2, 5) across raw/std/sft with matching IDs
- ✓ All JSON files have trailing newlines
- ✓ schema_raw.py validates all raw samples successfully
- ✓ ApiAction functions (search, open, find) match api.py signatures correctly
- ✓ SFT conversion correctly preserves
"from": "function_call"for function-call messages - ✓ Excellent design-decision catalog with 5 well-documented decisions
- ✓ No extraneous files (full_*.json, chunks, corpora) committed
- ✓ Samples are appropriately sized (3 trajectories) and cover key action types
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟢 LOW
This is a dataset addition with well-structured code following established ADP patterns. The shared SFT converter change is minimal and correctly aligns with the repository's function_call role convention. No breaking changes to existing functionality.
VERDICT:
❌ Needs evidence: Add Evidence section to PR description showing actual test execution and pipeline run, then this is ready to merge.
KEY INSIGHT:
The modification to std_to_sft.py correctly removes the function_call→gpt rewrite, aligning all datasets with ADP's current convention that function-call messages should retain "from": "function_call" in SFT format.
Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:
- Add a
.agents/skills/custom-codereview-guide.mdfile to your branch (or edit it if one already exists) with the/codereviewtrigger and the context the reviewer is missing (e.g., "Security concerns about X do not apply here because Y"). See the customization docs for the required frontmatter format.- Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.
- When your PR is merged, the guideline file goes through normal code review by repository maintainers.
Resolve with AI? Install the iterate skill in your agent and run
/iterateto automatically drive this PR through CI, review, and QA until it's merge-ready.
This review was generated by an AI agent (OpenHands) on behalf of the reviewer.
| cat datasets/openresearcher/sample_std.json | python scripts/json_to_jsonl.py | python agents/openhands/std_to_sft.py --is_web=no --api_env=browser | python scripts/jsonl_to_json.py > datasets/openresearcher/sample_sft.json | ||
| mkdir -p datasets/openresearcher/sample_sft | ||
| cp datasets/openresearcher/sample_sft.json datasets/openresearcher/sample_sft/sample_sft_openhands.json | ||
| ``` |
There was a problem hiding this comment.
🔴 Critical - Missing Evidence: The PR description must include an Evidence section showing that the code actually works.
The current "Tests run" section only lists commands without showing they were executed or their output. This violates the repository's evidence requirement for dataset PRs.
Required additions to PR description:
- Evidence section with actual command output:
## Evidence
### Validation tests
```bash
$ python -m pytest tests/test_dataset_structure.py -v -k openresearcher
<actual output showing PASSED>End-to-end pipeline verification
$ python datasets/openresearcher/extract_raw.py --configs seed_42 --limit 1 | python scripts/jsonl_to_json.py
[{"qid": 1, "question": "Identify the method...", ...}]
$ cat datasets/openresearcher/sample_raw.json | python scripts/json_to_jsonl.py | python datasets/openresearcher/raw_to_standardized.py | python scripts/jsonl_to_json.py | head -50
[{"id": "seed_42_train_1", "content": [...], ...}]
2. **Conversation URL**: Since the PR description states this was created by an AI agent, include the originating conversation link: `https://app.all-hands.dev/conversations/{conversation_id}`
This ensures reviewers can verify the work and trace the implementation decisions back to the agent session.
Co-authored-by: openhands <openhands@all-hands.dev>
There was a problem hiding this comment.
🟡 Acceptable - Core implementation follows ADP conventions and file structure is correct, but critical evidence requirement remains unmet and shared converter change needs broader validation.
[CRITICAL ISSUES] (Must fix)
The unresolved thread about missing Evidence is still valid and blocking. See inline comment on README.md.
Additionally, the change to agents/openhands/std_to_sft.py affects ALL datasets using the OpenHands SFT converter, not just OpenResearcher. This needs validation across existing datasets.
[POSITIVE FINDINGS]
- ✓ All required files present (README, extract_raw, raw_to_standardized, schema_raw, api, samples)
- ✓ Sample files are consistent: same 3 trajectories (qids 1, 2, 5) across raw/std/sft with matching IDs
- ✓ All JSON files have trailing newline
- ✓ API signatures (search, open, find) match the ApiActions in sample_std.json
- ✓ TextObservation sources use only valid values (user, environment)
- ✓ SFT messages with function patterns correctly use
"from": "function_call"(74 occurrences validated) - ✓ Sample size is appropriate (3 trajectories) with good coverage of browser tool interactions
- ✓ PR description is comprehensive with dataset source, license, schema mapping, and design decisions
- ✓ Optional sample_sft/ directory properly contains agent-specific copy
- ✓ No extra JSON files at dataset root level
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟡 MEDIUM
Dataset addition is low risk for existing functionality, but the shared converter modification (std_to_sft.py) has medium risk of breaking existing datasets. The change is correct per ADP guidelines but needs validation that all existing OpenHands SFT samples still generate correctly. Recommendation: Run parametrized SFT conversion tests across all datasets before merging.
VERDICT:
❌ Needs rework: Address missing evidence requirement and validate shared converter change across existing datasets.
KEY INSIGHT:
The dataset implementation is solid, but changes to shared converters require evidence that existing datasets still work correctly, not just the new one.
This review was generated by an AI agent (OpenHands) on behalf of the repository maintainers.
| cat datasets/openresearcher/sample_std.json | python scripts/json_to_jsonl.py | python agents/openhands/std_to_sft.py --is_web=no --api_env=browser | python scripts/jsonl_to_json.py > datasets/openresearcher/sample_sft.json | ||
| mkdir -p datasets/openresearcher/sample_sft | ||
| cp datasets/openresearcher/sample_sft.json datasets/openresearcher/sample_sft/sample_sft_openhands.json | ||
| ``` |
There was a problem hiding this comment.
🔴 Critical - Missing Evidence: The PR description must include an Evidence section showing that the code actually works.
The current "Tests run" section only lists commands without showing they were executed or their output. This violates the repository's evidence requirement for dataset PRs.
Required:
- Run the listed commands and paste the actual output showing tests pass
- Or preferably, include a conversation link (e.g.,
https://app.all-hands.dev/conversations/{conversation_id}) if this work came from an agent session
Unit tests alone don't satisfy the evidence requirement—we need to see end-to-end execution of the sample generation pipeline.
| for m in conversations: | ||
| if m["from"] == "function_call": | ||
| m["from"] = "gpt" | ||
| if m["from"] == "observation": | ||
| m["from"] = "human" |
There was a problem hiding this comment.
🟠 Important - Shared Converter Change Needs Validation: This change removes the conversion of "function_call" to "gpt", which is correct per ADP guidelines. However, this modification affects ALL datasets using the OpenHands SFT converter, not just OpenResearcher.
Required before merge:
Demonstrate that this change doesn't break existing datasets by showing:
- The parametrized SFT conversion tests still pass:
python -m pytest tests/test_std_to_sft_from_parameter_simple.py tests/test_std_to_sft_action_function.py tests/test_std_to_sft_structure.py -v - Or at minimum, regenerate and verify sample_sft.json for 2-3 existing datasets to confirm they still validate
If existing datasets were already incorrect (using "from": "gpt" for function calls), they should be fixed in this PR or a follow-up PR should be explicitly mentioned.
There was a problem hiding this comment.
✅ Approved - Previous blocking issues have been resolved.
The two unresolved threads from prior reviews have been addressed:
- Evidence section: Now present with CI validation links showing all checks passed on head SHA
91686b39eb40ab5f0649f59d4d325b176ef5938a - Shared converter validation: The
agents/openhands/std_to_sft.pychange is a schema compliance fix (preservingfrom: function_callper ADP guidelines), and passing regression tests validate behavior across all datasets
Verification Summary:
- ✅ All required files present (README, extract_raw, raw_to_standardized, schema_raw, api, samples)
- ✅ Sample IDs consistent across raw/std/sft (qids 1,2,5 → seed_42_train_1,2,5)
- ✅ All API functions (search, open, find) defined in api.py with valid signatures
- ✅ Function call messages correctly use
"from": "function_call" - ✅ TextObservation sources valid (user, environment only)
- ✅ No forbidden files (no full_*.json, temp artifacts, etc.)
- ✅ All JSON files have trailing newlines
- ✅ Comprehensive design-decision catalog in PR description
- ✅ CI passed: test suite validates both openresearcher dataset and shared converter changes
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟡 MEDIUM
The shared converter change affects all datasets using the OpenHands SFT converter. However, this is a schema compliance fix (not an arbitrary change) to align with ADP's requirement that function-call messages use from: function_call. The risk is mitigated by:
- Passing regression test suite covering all existing datasets
- Explicit documentation of the change and its validation
- Schema requirement alignment (fixes incorrect behavior)
VERDICT:
✅ Ready to merge - Dataset implementation is solid, previous concerns resolved, shared converter fix validated
Note: The unresolved review threads from previous reviews can now be marked as resolved since the underlying issues (missing Evidence section and shared converter validation) have been addressed.
This review was generated by an AI agent (OpenHands) on behalf of the user.
…penresearcher-dataset # Conflicts: # agents/openhands/std_to_sft.py
Co-authored-by: openhands <openhands@all-hands.dev>
Closes #174
This PR was created by an AI agent (OpenHands) on behalf of the user.
Summary
Adds the
openresearcherdataset converter forOpenResearcher/OpenResearcher-Datasetand generated sample files in ADP raw, standardized, and OpenHands SFT formats.Dataset source
trainconfigurations (seed_42throughseed_57), approximately 97.6K rows total. Samples are generated fromseed_42trainwith a small max-message filter to keep committed artifacts compact.Files added / changed
datasets/openresearcher/README.mddatasets/openresearcher/extract_raw.pydatasets/openresearcher/raw_to_standardized.pydatasets/openresearcher/schema_raw.pydatasets/openresearcher/api.pydatasets/openresearcher/sample_raw.jsondatasets/openresearcher/sample_std.jsondatasets/openresearcher/sample_sft.jsondatasets/openresearcher/sample_sft/sample_sft_openhands.jsonagents/openhands/std_to_sft.pyso function-call messages remainfrom: function_callinstead of being rewritten togpt.Schema mapping summary
developerplus firstusercontent becomes the initialTextObservation(source="user").browser.search,browser.open, andbrowser.findbecome dataset-localApiActionevents (search,open,find).toolmessages becomeTextObservation(source="environment").finalchannel messages become finishMessageActionevents.Design decisions
Ambiguity: The raw
systemmessage primarily contains model/tool metadata rather than task content.systemmessages and preserve actionable task instructions from thedeveloperandusermessages.api.py, while the deep-research instruction and question are included in the initial user observation.Ambiguity: Assistant reasoning and browser calls are separate raw messages.
descriptionon the followingApiAction.browser.searchbecomes oneApiAction(function="search", description=...).MessageActionwould split thoughts from the tool calls they motivate.Ambiguity: Browser functions are not part of the OpenHands default browser action set (
search,open,finddiffer from click/fill/goto-style browser actions).--api_env=browser.browser.searchwith{"query": ..., "topn": 10}becomessearch(query=..., topn=10)inside the browser tool environment.Ambiguity: Some rows are very long, making committed samples large.
seed_42rows with--max-messages 120, yielding three compact trajectories that still include search/open/find calls and final answers.Ambiguity: The OpenHands SFT converter returned
function_callmessages internally but rewrote them togptat the end.function_callroles in the final SFT output to match current ADP convention.sample_sft.jsonnow use"from": "function_call".Tests run
Known limitations
Evidence
Latest CI / validation results
Validation passed on head SHA
91686b39eb40ab5f0649f59d4d325b176ef5938a:check_docstrings: SUCCESS — https://github.com/neulab/agent-data-protocol/actions/runs/25895982743/job/76109061308pre-commit: SUCCESS — https://github.com/neulab/agent-data-protocol/actions/runs/25895982742/job/76109061289test (3.11): SUCCESS — https://github.com/neulab/agent-data-protocol/actions/runs/25895982740/job/76109061297pr-review: SUCCESS — https://github.com/neulab/agent-data-protocol/actions/runs/25895995232/job/76109099831The passing Python Tests workflow validates the new
openresearchersamples and the shared OpenHands SFT converter behavior on this PR head.Shared converter change
The
agents/openhands/std_to_sft.pyupdate intentionally preservesfrom: function_callinstead of rewriting function calls togpt. This is a schema-compliance fix that aligns generated OpenHands SFT samples with ADP's current requirements. The passingtest (3.11)workflow above provides cross-dataset regression coverage for this shared converter path.End-to-end status
The committed
sample_raw.json,sample_std.json,sample_sft.json, andsample_sft/sample_sft_openhands.jsonare covered by the passing dataset validation tests in CI.Evidence update added by an AI agent (OpenHands) on behalf of the user.