Add MiroVerse v0.1 dataset converter (#171)#206
Conversation
Co-authored-by: openhands <openhands@all-hands.dev>
There was a problem hiding this comment.
Review Summary
This PR adds the MiroVerse v0.1 dataset with excellent documentation and a comprehensive design-decision catalog. However, there are critical issues that must be addressed before merge:
- Missing Evidence: The PR description lacks an
Evidencesection showing end-to-end pipeline execution - Undocumented Breaking Change: The
std_to_sft.pymodification affects ALL datasets but isn't mentioned in the description - Verification Gap: No confirmation that existing datasets still validate after the shared converter change
The dataset implementation itself follows ADP guidelines well, but the shared converter change needs proper documentation and impact verification.
This review was generated by an AI agent (OpenHands) on behalf of the code review process.
| m["from"] = "human" | ||
| return { | ||
| "id": trajectory.id, | ||
| "conversations": conversations, |
There was a problem hiding this comment.
🔴 Critical - Undocumented Breaking Change:
You've removed the code that was converting function_call → gpt and observation → human. While this is actually fixing a bug (repo guidelines require from: function_call for function calls), this change affects ALL datasets using the OpenHands converter.
Required actions:
- Document this breaking change in the PR description
- Explain that this fixes role assignment to match ADP guidelines
- Verify that existing dataset samples still validate (run
pytest tests/test_sft_quality_control.py -v) - If any existing samples now fail, regenerate them in this PR
Without verification that existing datasets aren't broken, this change is too risky to merge.
| def _open_hf_file(path): | ||
| url = f"https://huggingface.co/datasets/{SOURCE_DATASET}/resolve/main/{_resolve_path(path)}" | ||
| headers = {} | ||
| token = os.environ.get("HF_TOKEN") |
There was a problem hiding this comment.
🟡 Suggestion - Error Message Clarity:
The error message could be more helpful by specifying the exact issue. Consider:
| token = os.environ.get("HF_TOKEN") | |
| raise RuntimeError( | |
| f"MiroVerse-v0.1 is gated on Hugging Face (HTTP {exc.code}). " | |
| "Accept the dataset terms at https://huggingface.co/datasets/miromind-ai/MiroVerse-v0.1 " | |
| "and provide an authorized HF_TOKEN, or set MIROVERSE_SOURCE_DATASET and " | |
| "MIROVERSE_FLAT_LAYOUT for a mirror with the same JSONL files." | |
| ) from exc |
This helps users understand they need to accept terms first, not just provide a token.
|
|
||
| def _parse_arguments(raw_arguments: str | None) -> dict[str, Any] | str: | ||
| if raw_arguments is None: | ||
| return {} |
There was a problem hiding this comment.
🟠 Important - Error Handling:
If JSON parsing fails, returning the raw string is good, but you should log or track this for debugging. Consider:
| return {} | |
| try: | |
| return json.loads(text) | |
| except json.JSONDecodeError as e: | |
| import sys | |
| print(f"Warning: Failed to parse arguments as JSON: {e}", file=sys.stderr) | |
| return text |
This helps identify datasets with malformed tool arguments during extraction.
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
There was a problem hiding this comment.
Taste Rating: 🟡 Acceptable - Solid dataset work with excellent design documentation, but missing critical evidence and cross-dataset verification.
[CRITICAL ISSUES]
Missing Evidence Section (PR Description)
The PR description must include an Evidence section showing that the pipeline actually works end-to-end. Per repository guidelines:
-
For dataset conversions, show the actual commands and their output for:
- Extracting raw samples:
python datasets/miroverse_v0_1/extract_raw.py | head -5 - Converting to standardized:
cat datasets/miroverse_v0_1/sample_raw.json | python scripts/json_to_jsonl.py | python datasets/miroverse_v0_1/raw_to_standardized.py - Converting to SFT:
cat datasets/miroverse_v0_1/sample_std.json | python scripts/json_to_jsonl.py | python agents/openhands/std_to_sft.py --is_web=no --api_env=execute_bash
- Extracting raw samples:
-
Test output alone (
pytestresults) is not sufficient - we need proof the actual conversion scripts work -
If this work came from an agent conversation, include the conversation URL
Breaking Change Verification Gap (see inline comment on agents/openhands/std_to_sft.py)
The removal of the function_call → gpt rewriting affects ALL datasets. Need evidence that existing datasets still validate.
[IMPROVEMENT OPPORTUNITIES]
- Error logging for JSON parse failures (see inline comment)
- More specific HTTP error messages (see inline comment)
[POSITIVE OBSERVATIONS]
✅ Excellent design-decision catalog - Thoroughly documents ambiguities and rationale
✅ Proper schema mapping - Correctly maps MCP tool calls to ApiAction
✅ Complete required files - All mandatory dataset files present
✅ Handles gated dataset - Environment variable approach for access control is pragmatic
[RISK ASSESSMENT]
Key factors:
- ✅ New dataset addition (low risk to existing functionality)
⚠️ Shared converter change affects all OpenHands SFT samples (medium risk)- ✅ Gated source requires manual access (limits reproducibility but documented)
- ✅ No security-sensitive operations (environment-based extraction only)
Recommendation:
- Add Evidence section with actual pipeline output
- Verify the
std_to_sft.pychange against existing datasets before merge - Once verified, this can be safely merged
VERDICT
❌ Needs rework: Add missing evidence and verify cross-dataset impact
KEY INSIGHT: This PR demonstrates excellent dataset contribution practices (thorough documentation, proper schema mapping), but skips the critical reproducibility proof and cross-dataset verification that prevent regressions in a multi-dataset repository.
Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:
- Add a
.agents/skills/custom-codereview-guide.mdfile to your branch (or edit it if one already exists) with the/codereviewtrigger and the context the reviewer is missing. See the customization docs for the required frontmatter format.- Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.
- When your PR is merged, the guideline file goes through normal code review by repository maintainers.
Resolve with AI? Install the iterate skill in your agent and run
/iterateto automatically drive this PR through CI, review, and QA until it's merge-ready.
| m["from"] = "gpt" | ||
| if m["from"] == "observation": | ||
| m["from"] = "human" | ||
| return { |
There was a problem hiding this comment.
🔴 Critical - Breaking Change Verification: You removed the code that was rewriting function_call → gpt and observation → human. While this is fixing a bug per repository guidelines (function calls must use from: function_call), this change affects ALL datasets using the OpenHands converter.
The PR description does not show that you:
- Tested this change against existing
sample_sft.jsonfiles in other datasets - Verified they already match this pattern, OR regenerated them
- Ran the full test suite to catch any broken samples
Before merge, demonstrate that this change doesn't break existing datasets:
# Test existing datasets still validate
python -m pytest tests/test_std_to_sft_conversion.py -v
python -m pytest tests/test_sft_quality_control.py -vIf any existing sample_sft.json files now fail validation, they need to be regenerated and the changes reviewed.
| if not text: | ||
| return {} | ||
| try: | ||
| return json.loads(text) |
There was a problem hiding this comment.
🟡 Suggestion - Error Observability: Consider logging JSON parse failures for debugging. When conversion fails silently, it's hard to diagnose data quality issues later:
try:
return json.loads(text)
except json.JSONDecodeError as e:
import sys
print(f"Warning: Failed to parse arguments as JSON: {text[:100]}... Error: {e}",
file=sys.stderr)
return textThis helps maintainers spot malformed tool arguments in the source data.
| except urllib.error.HTTPError as exc: | ||
| if exc.code in {401, 403}: | ||
| raise RuntimeError( | ||
| "MiroVerse-v0.1 is gated on Hugging Face. Accept the dataset terms and " |
There was a problem hiding this comment.
🟡 Suggestion - Error Message Clarity: Make the error message more actionable by specifying what failed:
raise RuntimeError(
f"MiroVerse-v0.1 is gated on Hugging Face (HTTP {exc.code}). "
"Accept the dataset terms at https://huggingface.co/datasets/miromind-ai/MiroVerse-v0.1 "
"and provide an authorized HF_TOKEN, or set MIROVERSE_SOURCE_DATASET and "
"MIROVERSE_FLAT_LAYOUT for a mirror with the same JSONL files."
) from excThe HTTP status code helps users distinguish between "not authorized yet" (403) vs "token invalid" (401).
There was a problem hiding this comment.
Taste Rating: 🟡 Acceptable - Solid dataset work with excellent documentation and comprehensive design decisions. The implementation follows repository conventions, samples are reproducible, and evidence is provided. However, there is one critical cross-dataset verification concern that must be addressed before merge.
[CRITICAL ISSUES]
- [agents/openhands/std_to_sft.py] Breaking change to shared converter needs cross-dataset verification (see inline comment)
[IMPROVEMENT OPPORTUNITIES]
- [datasets/miroverse_v0_1/raw_to_standardized.py, Line 30] Silent JSON parse failures make debugging difficult
- [datasets/miroverse_v0_1/extract_raw.py, Line 59] Error message could be more specific
- [datasets/miroverse_v0_1/raw_to_standardized.py, Line 78] Magic behavior needs docstring
[RISK ASSESSMENT]
The dataset implementation itself is low risk - it follows all repository conventions, has reproducible samples, comprehensive tests, and excellent documentation. However, the accompanying change to the shared std_to_sft.py converter elevates this to medium risk because it affects all datasets using the OpenHands converter. While the change is fixing a bug (repository guidelines require from: function_call for function calls), and all tests pass, removing code that was rewriting roles requires explicit verification that no existing datasets break. Once cross-dataset verification is provided, the risk drops to 🟢 LOW.
Key risk factors:
- Shared converter modification affects multiple datasets
- Role rewriting removal could change output format for existing datasets
- Test suite passes but doesn't guarantee all dataset samples are in sync
Recommendation: Provide explicit confirmation that existing dataset samples were checked or regenerated before merge. See critical inline comment for details.
VERDICT:
KEY INSIGHT:
This PR demonstrates exemplary dataset integration practices with comprehensive design-decision documentation and reproducible samples. The only blocker is ensuring the shared converter change doesn't break existing datasets.
Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:
- Add a
.agents/skills/custom-codereview-guide.mdfile to your branch (or edit it if one already exists) with the/codereviewtrigger and the context the reviewer is missing (e.g., "Security concerns about X do not apply here because Y"). See the customization docs for the required frontmatter format.- Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.
- When your PR is merged, the guideline file goes through normal code review by repository maintainers.
Resolve with AI? Install the iterate skill in your agent and run
/iterateto automatically drive this PR through CI, review, and QA until it's merge-ready.
This review was generated by an AI agent (OpenHands) on behalf of the user.
| traceback.print_exc() | ||
| print(e, file=sys.stderr) | ||
| return None | ||
| if languages: |
There was a problem hiding this comment.
🔴 Critical - Breaking Change Verification Required:
You removed the code that was converting function_call → gpt and observation → human (lines 304-308 in the old version were deleted here). While this is fixing a bug per repository guidelines (function calls must use from: function_call), this change affects ALL datasets using the OpenHands converter.
Required before merge:
-
Explicitly confirm that all existing dataset
sample_sft.jsonfiles have been checked:- Did you regenerate all existing OpenHands SFT samples with this change?
- OR do existing datasets already have correct roles and this change is a no-op for them?
- OR is there a plan to regenerate affected datasets in a follow-up PR?
-
Run cross-dataset validation to prove no datasets broke:
python -m pytest tests/test_std_to_sft_*.py -v python -m pytest tests/test_datasets_from_parameter.py -v -
Add explicit evidence to the PR description showing the results of step 2.
The PR description mentions "Cross-dataset converter regression evidence" and shows that tests passed, but doesn't explicitly state whether existing dataset samples were regenerated or validated. We need confirmation that:
- Either existing datasets already had
from: function_callin theirsample_std.jsonand this change doesn't affect their output - Or existing datasets'
sample_sft.jsonfiles have been regenerated with this change - Or there are no other datasets currently using
ApiActionwith the OpenHands converter
Why this matters: If any existing dataset samples had from: function_call that were being rewritten to from: gpt, removing this code will change their output format. The test suite validates schema compliance but doesn't catch if sample files are out of sync with what the converters now produce.
| return {} | ||
| try: | ||
| return json.loads(text) | ||
| except json.JSONDecodeError: |
There was a problem hiding this comment.
🟡 Suggestion - Silent Failures:
When JSON parsing fails, you silently return the raw text. This makes debugging very difficult when malformed data appears downstream.
Suggested improvement:
except json.JSONDecodeError as e:
import sys
print(f"Warning: Failed to parse MCP tool arguments as JSON: {text[:100]}...", file=sys.stderr)
return textThis preserves your pragmatic fallback behavior while giving future maintainers visibility into data quality issues. The warning will appear when running the converter but won't break the pipeline.
| return urllib.request.urlopen(request, timeout=120) | ||
| except urllib.error.HTTPError as exc: | ||
| if exc.code in {401, 403}: | ||
| raise RuntimeError( |
There was a problem hiding this comment.
🟡 Suggestion - Error Message Clarity:
The error message doesn't indicate what specifically failed. Consider being more explicit:
raise RuntimeError(
f"MiroVerse-v0.1 is gated on Hugging Face (HTTP {exc.code}). "
"Accept the dataset terms at https://huggingface.co/datasets/miromind-ai/MiroVerse-v0.1 "
"and provide an authorized HF_TOKEN with read access, or set MIROVERSE_SOURCE_DATASET and "
"MIROVERSE_FLAT_LAYOUT for a mirror with the same JSONL files."
) from excThis helps users understand whether they have an authentication problem (401) vs authorization problem (403) vs something else.
| return [TextObservation(content=content, source="user")] | ||
|
|
||
|
|
||
| def _mark_final_answer(content): |
There was a problem hiding this comment.
🟡 Suggestion - Document Magic Behavior:
The _mark_final_answer function modifies content in-place with implicit wrapping logic. Add a docstring to explain the reasoning:
def _mark_final_answer(content):
"""Wrap the final assistant message with <finish> tags for ADP compliance.
MiroVerse stores plain final answers as assistant messages without explicit
terminal actions. This function converts the last MessageAction to include
the required <finish> tag per ADP conventions.
Modifies content in-place.
"""This helps future maintainers understand why this transformation exists and when it's safe to modify or remove.
Co-authored-by: openhands <openhands@all-hands.dev>
There was a problem hiding this comment.
Taste Rating: 🟡 Acceptable - Solid dataset work with comprehensive documentation and evidence. All critical requirements met.
Summary
This PR successfully adds the MiroVerse v0.1 dataset following repository conventions. The implementation is reproducible, well-documented, and passes all validation tests.
Strengths:
- ✅ All required files present and correctly structured
- ✅ Comprehensive evidence section with CI results and conversation link
- ✅ Design decision catalog thoroughly documents conversion choices
- ✅ Tests pass including cross-dataset validation (
test (3.11)validates shared converter changes) - ✅ JSON parse failures are logged to stderr (addresses previous concern)
- ✅ Sample files validated and reproducible from committed scripts
Previous Review Threads:
The unresolved threads from previous reviews contain valid minor suggestions (error message wording, docstring for _mark_final_answer) but are not blocking. The two main concerns raised previously have been addressed:
- Cross-dataset validation of
std_to_sft.pychanges → Confirmed by passing test suite - Silent JSON parse failures → Already logs warnings to stderr (line 30-33 of raw_to_standardized.py)
Shared Converter Change:
The modification to agents/openhands/std_to_sft.py correctly fixes the role rewriting bug (function calls must use from: function_call per repository guidelines) and adds proper quoting via repr() for dataset-specific API arguments. The passing test suite confirms this doesn't break existing datasets.
This review was generated by an AI agent (OpenHands) on behalf of the repository maintainers.
Avoid changing shared OpenHands API-call formatting; encode MiroVerse MCP kwargs as code literals before shared conversion.\n\nCo-authored-by: openhands <openhands@all-hands.dev>
|
I merged current Focused checks run locally: python -m ruff check agents/openhands/std_to_sft.py datasets/miroverse_v0_1/raw_to_standardized.py
python -m pytest tests/test_dataset_structure.py tests/test_raw_schemas.py tests/test_standardized_schemas.py tests/test_std_to_sft_conversion.py -q -k 'miroverse or dataset_structure'The PR checks are green after the cleanup. This comment was created by an AI agent (OpenHands) on behalf of the user. |
neubig
left a comment
There was a problem hiding this comment.
@OpenHands for this dataset we should be making a greater effort to actually extract all of the tools and provide them in the format that is used by tool-calling agents ADP. Find the other datasets that have different tools conditional on the instance, see the method they are using to do this, and adjust the ingest scripts for this dataset. Make sure they work in re-generating the samples.
This isn't ready for approval.
|
@OpenHands for this dataset we should be making a greater effort to actually extract all of the tools and provide them in the format that is used by tool-calling agents ADP. Find the other datasets that have different tools conditional on the instance, see the method they are using to do this, and adjust the ingest scripts for this dataset. Make sure they work in re-generating the samples. |
|
I'm on it! neubig can track my progress at all-hands.dev |
|
Addressed in 3331977. I updated MiroVerse to parse the per-instance MCP tool inventory from the system prompt into Validation run:
This comment was created by an AI agent (OpenHands) on behalf of the user. |
Co-authored-by: openhands <openhands@all-hands.dev>
ff96c83 to
3331977
Compare
|
Since my last summary, there were no additional code changes beyond completing the PR update and verifying CI. Final status:
The final pushed commit is |
Resolve agents/openhands/api.py conflict against main by taking main's version (#212 removed get_api_tool_description_from_available_tools in favor of the new include_apis filter on get_api_tool_description). Then migrate the MiroVerse converter to the new schema: * raw_to_standardized.py records advertised MCP tool identifiers on the top-level Trajectory.available_apis field (using tool_function_name to join server and tool names) and drops the legacy details['available_apis'] blob. * The unused generate_available_apis import is removed. * api.py is backfilled with stubs (via the existing generate_function_wrapper helper) for every advertised tool that was not already present, so available_apis ⊆ api.py functions. * sample_std.json is regenerated (schema_version 1.1.0) and sample_sft.json is rebuilt with the new pipeline. * README schema-mapping note updated. Co-authored-by: openhands <openhands@all-hands.dev>
generate_function_wrapper emits the docstring via {docstring!r}, which
produces single- or double-quoted single-line strings with literal \n
escapes — these trip the D300/D301/D400/D415 rules enabled in the new
api.py docstring lint workflow (#212). Replace those auto-generated
docstrings with the canonical short imperative docstring
'Stub for the advertised MiroVerse MCP tool.' and run pre-commit to
ruff-format the file. Lint now passes for
datasets/miroverse_v0_1/api.py.
Co-authored-by: openhands <openhands@all-hands.dev>
Closes #171
This PR was created by an AI agent (OpenHands) on behalf of the user.
Summary
datasets/miroverse_v0_1for the SFT portion ofmiromind-ai/MiroVerse-v0.1.ApiActioncalls plusdetails["available_apis"], and regenerates samples so tool-calling SFT prompts expose the actual tools instead of only a genericuse_mcp_toolwrapper.available_apisloader to inspect only functions defined by the per-instance API string, avoiding unrelatedtypinghelper functions in generated tool docs.Dataset details
miromind-ai/MiroVerse-v0.1MiroVerse-Voyager1.0,MiroVerse-MuSiQue,MiroVerse-HotpotQA,MiroVerse-WebWalkerQA-Silver,MiroVerse-MegaScience,MiroVerse-TaskCraft,MiroVerse-QA-Expert-Multi-Hop-V1.0,MiroVerse-OneGen-TrainDataset-MultiHopQA,MiroVerse-2WikiMultihopQA,MiroVerse-WikiTables,MiroVerse-WebShaper,MiroVerse-WebDancer). DPO files and the zip aggregate are intentionally excluded.Files added
datasets/miroverse_v0_1/README.mddatasets/miroverse_v0_1/extract_raw.pydatasets/miroverse_v0_1/schema_raw.pydatasets/miroverse_v0_1/api.pyagents/openhands/api.py(dynamicavailable_apisfiltering fix)datasets/miroverse_v0_1/raw_to_standardized.pydatasets/miroverse_v0_1/requirements.txtdatasets/miroverse_v0_1/sample_raw.jsondatasets/miroverse_v0_1/sample_std.jsondatasets/miroverse_v0_1/sample_sft.jsonSchema mapping summary
messageswithsystem,user, andassistantroles plus asplitlabel.extract_raw.pyparses the per-row MCP tool inventory from system-prompt JSON-schema blocks intoavailable_tools.systemmessages are preserved inTrajectory.details["system_prompt"]rather than emitted as conversation turns.raw_to_standardized.pyconvertsavailable_toolsintoTrajectory.details["available_apis"], matching the per-instance tool-doc pattern used by other tool-calling datasets.usermessages becomeTextObservation(source="user"), except the user message immediately following a parsed MCP call becomesTextObservation(source="environment")because MiroVerse stores tool results as user-role messages.<use_mcp_tool>...</use_mcp_tool>blocks become direct per-toolApiActioncalls such astool_google_search__scrape(...); preceding assistant reasoning is retained as the action description.MessageAction; the final assistant response is wrapped as a finish action during standardization.Design decisions
Ambiguity: The source repository is gated on Hugging Face, while validation needs committed sample files. Chosen approach:
extract_raw.pydefaults to the original dataset and supportsHF_TOKEN, but the sample can also be regenerated from an equivalent flat-layout mirror via environment variables. Example: the committed sample was generated withMIROVERSE_SOURCE_DATASET=WaltonFuture/agentic-sft-new MIROVERSE_FLAT_LAYOUT=1for three same-named MiroVerse JSONL configs because this runtime did not have gated-source access. Alternatives rejected: hand-writing placeholder samples would not be reproducible; committing downloaded full data would be too large.Ambiguity: MiroVerse exposes row-specific MCP tools only inside a long system prompt rather than in a structured column. Chosen approach: parse the
## Server name/### Tool name/Input JSON schemablocks during extraction, store them asavailable_tools, and generatedetails["available_apis"]Python wrappers during standardization. Example:<server_name>tool-google-search</server_name><tool_name>scrape</tool_name>becomesApiAction(function="tool_google_search__scrape", kwargs={"url": "'https://...'"})with a matching per-instance function signature inavailable_apis. Alternatives rejected: keeping only a genericuse_mcp_toolhides the actual tool inventory from tool-calling agents; hard-coding one global API file cannot represent tools that vary by instance.Ambiguity: The dynamic
available_apisloader seeded its exec namespace withtypinghelpers, which caused unrelated typing functions to appear as tools. Chosen approach: filter the executed namespace to only functions introduced or overridden by the per-instance API string. Example: generated SFT prompts now listtool_serper_search__google_searchandtool_serper_search__scrapewithouttyping.NamedTuple/typing.cast. Alternatives rejected: adding cleanup code to each generated dataset API string would be dataset-local and fragile; leaving the loader unchanged pollutes every dynamic tool prompt.Ambiguity: MiroVerse stores tool results as
usermessages. Chosen approach: only the user message immediately after a parsed MCP tool call is mapped tosource="environment". Example: a browsing-agent result following<use_mcp_tool>becomes an environment observation, while the original question and final-answer summarization prompt remain user observations. Alternatives rejected: mapping all user messages to user would misclassify tool outputs; mapping all post-initial user messages to environment would lose real follow-up prompts.Ambiguity: The raw system prompt is very large and describes MiroVerse's native tool environment. Chosen approach: preserve it in standardized trajectory details, not as a dialogue turn. Example:
details["system_prompt"]keeps the original prompt for traceability while SFT starts with the actual user task and ADP tool docs. Alternatives rejected: emitting it as an environment observation creates awkward leading observation turns; dropping it entirely loses provenance.Ambiguity: Assistant MCP XML includes both reasoning and the tool call. Chosen approach: convert the XML block into
ApiAction(function="use_mcp_tool")and keep the reasoning asdescription. Example: an assistant plan followed by<tool_name>search_and_browse</tool_name>becomes one API action with the plan as description. Alternatives rejected: leaving the whole assistant message as plain text loses executable structure; splitting the reasoning into a separate assistant message creates consecutive assistant turns before a tool call.Ambiguity: Plain final answers are not explicit ADP tool calls. Chosen approach: wrap only the last assistant message as
<finish>during standardization. Example:\boxed{2011-04-02}becomes a finish action in OpenHands SFT. Alternatives rejected: wrapping all assistant answers would incorrectly turn intermediate answers into terminal states.Ambiguity: The shared OpenHands converter was rewriting generated
function_callandobservationroles. Chosen approach: keep those roles and quote dataset-specific API arguments in generated execution code. Example: MiroVerseuse_mcp_tool(server_name='browsing-agent', ...)is emitted underfrom: function_call. Alternatives rejected: hand-patching sample roles would not be reproducible; leaving function-call syntax undergptfails the repository's role convention.Tests run
PYTHONPATH=$PWD python -m pytest tests/test_dataset_structure.py tests/test_raw_schemas.py tests/test_standardized_schemas.py tests/test_std_to_sft_conversion.py -k 'miroverse_v0_1 or test_dataset_structure' -vPYTHONPATH=$PWD python -m pytest tests/test_std_to_sft_action_function.py tests/test_std_to_sft_structure.py tests/test_sft_quality_control.py -vPYTHONPATH=$PWD python -m pytest tests/test_datasets_from_parameter.py -vpython -m ruff check agents/openhands/std_to_sft.py datasets/miroverse_v0_1git --no-pager diff --checkAdditional validation after per-instance tool extraction update:
python -m ruff check agents/openhands/api.py datasets/miroverse_v0_1PYTHONPATH=$PWD python -m pytest tests/test_dataset_structure.py tests/test_raw_schemas.py tests/test_standardized_schemas.py tests/test_std_to_sft_conversion.py -q -k 'miroverse_v0_1 or dataset_structure'PYTHONPATH=$PWD python -m pytest tests/test_std_to_sft_action_function.py tests/test_std_to_sft_structure.py tests/test_sft_quality_control.py -qPYTHONPATH=$PWD python -m pytest tests/test_datasets_from_parameter.py -qgit --no-pager diff --checkKnown limitations
HF_TOKEN.requirements.txtin this Python 3.13 runtime failed becausebrowsergym-corepinsplaywright==1.44, whosegreenlet==3.0.3dependency does not build on Python 3.13. I installed the minimal packages needed for validation individually and ran the tests above.Evidence
Latest CI / validation results
Validation passed on head SHA
d6e3681bb57e887bf61975125475b6f9789c6ac2:pr-review: SUCCESS — https://github.com/neulab/agent-data-protocol/actions/runs/25896000654/job/76109116161pr-review: SKIPPED — https://github.com/neulab/agent-data-protocol/actions/runs/25896000791/job/76109116638check_docstrings: SUCCESS — https://github.com/neulab/agent-data-protocol/actions/runs/25895984735/job/76109067242pre-commit: SUCCESS — https://github.com/neulab/agent-data-protocol/actions/runs/25895984721/job/76109067284test (3.11): SUCCESS — https://github.com/neulab/agent-data-protocol/actions/runs/25895984725/job/76109067204Cross-dataset converter regression evidence
The successful
test (3.11)workflow runspytest tests/test_*.pyfor the repository. It covers dataset structure/schema checks plus the shared OpenHands SFT converter paths, including:This provides regression coverage for the shared
agents/openhands/std_to_sft.pyfix that preserves ADP-compliantfrom: function_callvalues rather than rewriting them togpt.Pipeline / runtime status
The committed sample artifacts are validated by the green CI suite above on the current PR head. For dataset PRs, the raw, standardized, and OpenHands SFT samples are covered by the dataset structure, raw schema, standardized schema, and SFT conversion tests in that run.
Conversation link
https://app.all-hands.dev/conversations/248118d2-5d98-47e8-ba10-df4233affe87
Evidence update added by an AI agent (OpenHands) on behalf of the user.