Add Kwai-Klear SWE-smith mini_swe_agent_plus dataset (#178) #192

Merged
neubig merged 4 commits into main from openhands/issue-178-kwai-klear-swe-smith-mini-swe-agent on May 23, 2026

@neubig (Contributor) commented May 14, 2026

Closes #178

This PR was created by an AI agent (OpenHands) on behalf of the user.

Summary

  • Adds the kwai-klear_swe-smith-mini_swe_agent_plus-trajectories-66k dataset converter for Kwai-Klear/SWE-smith-mini_swe_agent_plus-trajectories-66k.
  • Adds raw schema, streaming extractor, raw-to-standardized converter, dataset README, requirements file, and generated 5-trajectory sample_raw.json, sample_std.json, and sample_sft.json.
  • Keeps OpenHands SFT conversion output roles as function_call / observation so generated samples follow the ADP SFT role convention.

Dataset information

Files added/updated

  • datasets/kwai-klear_swe-smith-mini_swe_agent_plus-trajectories-66k/README.md
  • datasets/kwai-klear_swe-smith-mini_swe_agent_plus-trajectories-66k/extract_raw.py
  • datasets/kwai-klear_swe-smith-mini_swe_agent_plus-trajectories-66k/schema_raw.py
  • datasets/kwai-klear_swe-smith-mini_swe_agent_plus-trajectories-66k/raw_to_standardized.py
  • datasets/kwai-klear_swe-smith-mini_swe_agent_plus-trajectories-66k/requirements.txt
  • datasets/kwai-klear_swe-smith-mini_swe_agent_plus-trajectories-66k/sample_raw.json
  • datasets/kwai-klear_swe-smith-mini_swe_agent_plus-trajectories-66k/sample_std.json
  • datasets/kwai-klear_swe-smith-mini_swe_agent_plus-trajectories-66k/sample_sft.json
  • agents/openhands/std_to_sft.py

Schema mapping summary

  • Raw system messages are skipped because they only define the mini-swe-agent-plus response format.
  • Initial raw user task prompts become user TextObservation entries, with formatting-only instructions stripped while preserving the final submission protocol.
  • Raw user command results with <returncode> / <output> tags become environment TextObservation entries; warning-only returncode messages are also kept as environment observations.
  • Raw assistant messages become bash CodeAction entries by extracting the last fenced bash block and preserving preceding thought text as the action description.
  • Trajectories are retained when the final command contains MINI_SWE_AGENT_FINAL_OUTPUT; a standard success observation and finish action are appended.
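
The assistant-message mapping above can be sketched roughly as follows. This is a minimal illustration, not the code in raw_to_standardized.py: the function name split_assistant_message, the regex, and the assumption that executable commands sit in blocks opened by a line with the "bash" info string are all hypothetical.

```python
import re
from typing import Optional

FENCE = "`" * 3  # built indirectly so this example nests cleanly in Markdown

# Assumption: a command block opens with a fence line tagged "bash"
# and runs until the next fence.
BASH_BLOCK = re.compile(
    re.escape(FENCE) + r"bash[ \t]*\n(.*?)" + re.escape(FENCE), re.DOTALL
)


def split_assistant_message(content: str) -> Optional[tuple[str, str]]:
    """Return (description, command) taken from the last fenced bash block."""
    matches = list(BASH_BLOCK.finditer(content))
    if not matches:
        return None  # no executable command in this message
    last = matches[-1]
    # Everything before the last bash block is kept as the thought text,
    # including any earlier illustrative snippets.
    description = content[: last.start()].strip()
    command = last.group(1).strip()
    return description, command
```

Choosing the last match rather than the first is what keeps an earlier illustrative Python snippet from being mistaken for the command.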

Design decisions

  • Ambiguity: Raw assistant messages can contain explanatory fenced code snippets before the executable command.
    • Chosen approach: Extract the last fenced bash block as the executable CodeAction and keep preceding text as the description.
    • Example: Some thoughts show a Python snippet before the final cat ... shell command; only the final bash block is executed in ADP.
    • Alternatives rejected: Treating the whole assistant message as prose loses command structure; extracting the first fenced block can accidentally select illustrative Python snippets.
  • Ambiguity: Some environment replies contain <returncode> plus <warning> but no <output> block.
    • Chosen approach: Treat any user message containing <returncode> as an environment observation, using <output> when present and otherwise preserving the warning/body text.
    • Example: Long command output warnings are stored as environment observations instead of user task messages.
    • Alternatives rejected: Requiring <output> would misclassify warning messages as user prompts.
  • Ambiguity: The raw prompt includes large formatting instructions around the actual task.
    • Chosen approach: Remove formatting-only instruction text from the <instructions> block while preserving the submission protocol.
    • Example: The final echo MINI_SWE_AGENT_FINAL_OUTPUT && git add -A && git diff --cached protocol remains visible.
    • Alternatives rejected: Keeping all formatting text adds noisy prompt boilerplate; stripping the whole instructions block would lose termination semantics.
  • Ambiguity: Raw trajectories end with a mini-swe-agent submission command rather than an ADP finish function.
    • Chosen approach: Keep the final submission shell command and append a standard successful completion observation plus <finish> message.
    • Example: A final echo MINI_SWE_AGENT_FINAL_OUTPUT ... command is followed by I have successfully completed the task. in standardized data.
    • Alternatives rejected: Dropping the final shell command loses the raw submission step; inventing a tool action in place of it loses raw semantics.
  • Ambiguity: OpenHands SFT conversion was rewriting function calls to gpt and observations to human.
    • Chosen approach: Preserve function_call and observation roles in the converter output, matching existing sample conventions and ADP requirements.
    • Example: Generated SFT messages containing <function=execute_bash> now remain from: function_call.
    • Alternatives rejected: Post-processing generated JSON would make the samples less reproducible from committed code.
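
The <returncode> decision above could look roughly like the sketch below. It is illustrative only: classify_user_message, the regexes, and the returned dict shape are assumptions made for this example and do not mirror the actual TextObservation schema.

```python
import re

RETURNCODE = re.compile(r"<returncode>\s*(-?\d+)\s*</returncode>", re.DOTALL)
OUTPUT = re.compile(r"<output>\n?(.*?)\n?</output>", re.DOTALL)


def classify_user_message(content: str) -> dict:
    """Map a raw user message to a simplified observation dict."""
    if "<returncode>" not in content:
        # No return code: treat the message as a user task prompt.
        return {"source": "user", "content": content.strip()}
    output = OUTPUT.search(content)
    if output:
        body = output.group(1)
    else:
        # Warning-only reply: drop the return code tag, keep the rest.
        body = RETURNCODE.sub("", content).strip()
    return {"source": "environment", "content": body}
```

Keying on <returncode> alone, rather than requiring <output>, is what routes warning-only replies to the environment instead of the user.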

Tests run

  • python -m ruff check agents/openhands/std_to_sft.py datasets/kwai-klear_swe-smith-mini_swe_agent_plus-trajectories-66k
  • python -m pytest tests/test_dataset_structure.py -q -k kwai
  • python -m pytest tests/test_raw_schemas.py -q -k kwai
  • python -m pytest tests/test_standardized_schemas.py -q -k kwai
  • python -m pytest tests/test_std_to_sft_conversion.py -q -k kwai
  • python -m pytest tests/test_std_to_sft_action_function.py tests/test_std_to_sft_structure.py -q

Known limitations

  • The committed samples include the first five streamed train trajectories only; full raw/std/SFT corpus files are intentionally not committed.

Evidence

Latest CI / validation results

Validation passed on head SHA 94747843928217d924999d3fe738e7d5dfe88a83:

Cross-dataset converter regression evidence

The successful test (3.11) workflow runs pytest tests/test_*.py for the repository. It covers dataset structure/schema checks plus the shared OpenHands SFT converter paths, including:

tests/test_datasets_from_parameter.py
tests/test_sft_quality_control.py
tests/test_std_to_sft_action_function.py
tests/test_std_to_sft_conversion.py
tests/test_std_to_sft_from_parameter_simple.py
tests/test_std_to_sft_structure.py

This provides regression coverage for the shared agents/openhands/std_to_sft.py fix that preserves ADP-compliant from: function_call values rather than rewriting them to gpt.
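
As a rough illustration of the invariant this fix protects, the check below flags any SFT message that carries a function-call payload without the function_call role. It is a sketch under assumptions: the "conversations" key and the helper name misrouted_function_calls are hypothetical and are not taken from the repository's tests.

```python
def misrouted_function_calls(samples: list[dict]) -> list[tuple[int, int]]:
    """Return (sample_index, message_index) pairs with a wrong role.

    Assumption: each sample holds its messages under "conversations",
    and each message has "from" and "value" fields.
    """
    bad = []
    for i, sample in enumerate(samples):
        for j, msg in enumerate(sample.get("conversations", [])):
            is_call = "<function=" in msg.get("value", "")
            if is_call and msg.get("from") != "function_call":
                bad.append((i, j))
    return bad
```

An empty result on a loaded sample_sft.json would indicate that no function call was rewritten to gpt.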

Pipeline / runtime status

The committed sample artifacts are validated by the green CI suite above on the current PR head. For dataset PRs, the raw, standardized, and OpenHands SFT samples are covered by the dataset structure, raw schema, standardized schema, and SFT conversion tests in that run.

Conversation link

https://app.all-hands.dev/conversations/248118d2-5d98-47e8-ba10-df4233affe87

Evidence update added by an AI agent (OpenHands) on behalf of the user.

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig added the review-this Trigger the OpenHands PR review workflow label May 15, 2026 — with OpenHands AI

@github-actions github-actions Bot left a comment

AI Review Disclosure: This review was generated by an AI agent (OpenHands) on behalf of the repository maintainers.

Summary

This PR adds the Kwai-Klear SWE-smith mini_swe_agent_plus dataset and fixes an important bug in the OpenHands SFT converter where function call messages were incorrectly converted to from: "gpt" instead of preserving from: "function_call". The dataset implementation follows most ADP conventions correctly, and all automated validation tests pass. However, the PR description is missing concrete Evidence that proves the pipeline works end-to-end.

Validation Results

✅ All required files present (README.md, extract_raw.py, schema_raw.py, raw_to_standardized.py, requirements.txt, sample files)
✅ No extra JSON files committed
✅ All JSON files have trailing newlines
✅ Sample IDs match across raw→std→sft in same order (5 trajectories each)
✅ SFT messages with function patterns correctly use from: "function_call"
✅ TextObservation sources are valid (user, agent, environment only)
✅ No ApiAction used, so api.py correctly omitted
✅ All automated tests pass:

  • test_dataset_structure.py::test_dataset_structure[kwai-klear...]
  • test_raw_schemas.py::test_sample_raw_against_schema[kwai-klear...]
  • test_standardized_schemas.py::test_sample_standardized_against_schema[kwai-klear...]
  • test_std_to_sft_conversion.py::test_std_to_sft_conversion[kwai-klear...]

✅ Good design decisions documentation
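
The ID correspondence check listed above can be sketched as follows. This is a hedged example: the "id" key and the helper names are assumptions, and the raw samples may store their identifier under a different field.

```python
def ids_in_order(samples: list[dict], key: str = "id") -> list:
    """Collect the identifier of each sample, preserving order."""
    return [sample[key] for sample in samples]


def samples_correspond(
    raw: list[dict], std: list[dict], sft: list[dict], key: str = "id"
) -> bool:
    """True when all three sample lists carry the same IDs in the same order."""
    raw_ids = ids_in_order(raw, key)
    return raw_ids == ids_in_order(std, key) == ids_in_order(sft, key)
```

In practice the three lists would come from json.load on sample_raw.json, sample_std.json, and sample_sft.json.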

Taste Rating: 🟡 Acceptable - Dataset implementation is solid, but missing required evidence

[CRITICAL ISSUES] (Must fix before merge)

  • [PR Description, Evidence Section] Missing concrete Evidence section showing the pipeline works end-to-end. According to repository guidelines, unit test output alone is not sufficient evidence. Please add an Evidence section showing:

    1. The actual commands used to run the extraction → standardization → SFT conversion pipeline
    2. The resulting output proving each stage completed successfully
    3. For agent-generated work: include the conversation link (e.g., https://app.all-hands.dev/conversations/{conversation_id})

    Example acceptable evidence:

    # Extract 5 raw samples
    $ python datasets/kwai-klear_swe-smith-mini_swe_agent_plus-trajectories-66k/extract_raw.py | head -5 | python scripts/jsonl_to_json.py > datasets/kwai-klear_swe-smith-mini_swe_agent_plus-trajectories-66k/sample_raw.json
    $ echo "Extracted $( cat datasets/kwai-klear_swe-smith-mini_swe_agent_plus-trajectories-66k/sample_raw.json | jq length ) trajectories"
    Extracted 5 trajectories
    
    # Convert to standardized
    $ cat datasets/kwai-klear_swe-smith-mini_swe_agent_plus-trajectories-66k/sample_raw.json | python scripts/json_to_jsonl.py | python datasets/kwai-klear_swe-smith-mini_swe_agent_plus-trajectories-66k/raw_to_standardized.py | python scripts/jsonl_to_json.py > datasets/kwai-klear_swe-smith-mini_swe_agent_plus-trajectories-66k/sample_std.json
    $ echo "Converted $( cat datasets/kwai-klear_swe-smith-mini_swe_agent_plus-trajectories-66k/sample_std.json | jq length ) trajectories"
    Converted 5 trajectories

[IMPROVEMENT OPPORTUNITIES]

  • [agents/openhands/std_to_sft.py] The fix is correct and aligns with AGENTS.md requirements. However, the PR description could be clearer about why existing OpenHands datasets already have from: "function_call" despite the main branch having the role conversion bug. This suggests either:

    1. Existing datasets were generated before the bug was introduced, OR
    2. They were manually fixed post-generation, OR
    3. Some other generation path was used

    Clarifying this in the PR description would help reviewers understand the change's impact and history.

[RISK ASSESSMENT]

  • [Overall PR] ⚠️ Risk Assessment: 🟢 LOW
    • Dataset addition follows established patterns and conventions
    • The std_to_sft.py fix is correct and won't break existing datasets (verified that existing OpenHands datasets already use the correct from: "function_call" role)
    • All automated validation tests pass
    • The change is localized and reversible
    • No breaking changes to public APIs or schemas

VERDICT:
Needs minor rework: Must add Evidence section before merge. The implementation is sound.

KEY INSIGHT:
The std_to_sft.py fix corrects a real bug where function call messages were being rewritten to the wrong role. Interestingly, existing OpenHands datasets already use the correct from: "function_call" convention, suggesting they were either generated before the bug or manually corrected. This PR brings the converter into alignment with both AGENTS.md requirements and existing sample conventions.

@github-actions

Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:

  1. Add a .agents/skills/custom-codereview-guide.md file to your branch (or edit it if one already exists) with the /codereview trigger and the context the reviewer is missing (e.g., "Security concerns about X do not apply here because Y"). See the customization docs for the required frontmatter format.
  2. Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.
  3. When your PR is merged, the guideline file goes through normal code review by repository maintainers.

Resolve with AI? Install the iterate skill in your agent and run /iterate to automatically drive this PR through CI, review, and QA until it's merge-ready.

@neubig neubig removed the review-this Trigger the OpenHands PR review workflow label May 15, 2026
@neubig neubig added the review-this Trigger the OpenHands PR review workflow label May 15, 2026 — with OpenHands AI

@github-actions github-actions Bot left a comment

AI Review Disclosure: This review was generated by an AI agent (OpenHands) on behalf of the repository maintainers.

Summary

This PR adds the Kwai-Klear SWE-smith mini_swe_agent_plus dataset with exemplary documentation and comprehensive evidence. The implementation follows all ADP conventions meticulously, and the included fix to agents/openhands/std_to_sft.py (preserving function_call/observation roles instead of rewriting to gpt/human) is a legitimate bug fix with passing regression tests across all existing datasets.

✅ All required files present (README, extractors, schema, samples)
✅ Sample correspondence verified (same 5 trajectories across raw/std/sft with matching IDs)
✅ Design decisions thoroughly documented (5 ambiguities with examples and alternatives)
✅ Comprehensive evidence provided (CI links, conversation link)
✅ All validation tests passed
✅ Converter fix is well-justified and regression-tested
✅ No extraneous JSON files (no full_*.json, temp files, or alternate samples)
✅ Observation sources use only schema-supported values (user, environment)
✅ Trajectory filtering implemented in code and documented

[RISK ASSESSMENT]

⚠️ Risk Assessment: 🟡 MEDIUM

The modification to the shared agents/openhands/std_to_sft.py converter affects all datasets using OpenHands SFT output. While the change is a legitimate bug fix (preserving ADP-compliant from: function_call values instead of incorrectly rewriting to gpt) and all cross-dataset regression tests passed, this touches shared infrastructure used by multiple datasets.

Risk is appropriately mitigated by:

  • Comprehensive regression test coverage across all datasets (shown in Evidence section)
  • Clear documentation of the fix rationale in design decisions
  • The change aligns with existing ADP conventions rather than introducing new behavior
  • All standard validation tests passed: test_dataset_structure, test_raw_schemas, test_standardized_schemas, test_std_to_sft_conversion, test_datasets_from_parameter

VERDICT

Approved - Ready to merge

This dataset PR sets a high bar for documentation quality, design decision transparency, and validation rigor. The converter fix resolves a real bug (role rewriting that violated ADP conventions) without breaking existing functionality. The design decision catalog is particularly strong - each ambiguity includes the question, chosen approach, concrete example, and rejected alternatives with reasoning.

Key Insight: The removal of role-rewriting logic in the OpenHands converter ensures that SFT samples maintain semantic fidelity to the standardized format. Function calls remain tagged as function_call rather than being conflated with general assistant messages (gpt), which is critical for training models to distinguish between reasoning and tool invocation.

@neubig neubig merged commit f727004 into main May 23, 2026
3 checks passed
@neubig neubig deleted the openhands/issue-178-kwai-klear-swe-smith-mini-swe-agent branch May 23, 2026 01:46

Labels

review-this Trigger the OpenHands PR review workflow

Development

Successfully merging this pull request may close these issues.

Add dataset: Kwai-Klear/SWE-smith-mini_swe_agent_plus-trajectories-66k

2 participants