Refactor: pass output_prefix via CallConfig, drop SIMPLER_OUTPUT_DIR env var #693

Merged
ChaoWao merged 1 commit into hw-native-sys:main from ChaoWao:refactor/output-prefix-drop-env-var
Apr 28, 2026

ChaoWao (Collaborator) commented Apr 28, 2026

Summary

Replace the env-var-based output-directory scoping (SIMPLER_OUTPUT_DIR / SIMPLER_L2_PERF_RECORDS_OUTPUT_DIR) with a per-task output_prefix field on CallConfig. Each test case sets its own prefix (chosen by scene_test.py::_build_output_prefix as outputs/<case>_<ts>/), the C++ runtime writes diagnostic artifacts under that directory with fixed filenames, and all the workarounds for the second-precision timestamp collision go away.
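The per-case directory scheme can be sketched roughly as follows. This is an illustration only, not the code in scene_test.py: the function names `build_output_prefix` and `artifact_paths` are hypothetical, and the timestamp format is an assumption, since the PR only specifies the shape outputs/<case>_<ts>/ and the three fixed artifact names.

```python
from __future__ import annotations

from datetime import datetime
from pathlib import Path


def build_output_prefix(case_name: str, root: str = "outputs",
                        now: datetime | None = None) -> str:
    """Return '<root>/<case>_<ts>/' (hypothetical stand-in for _build_output_prefix)."""
    # The timestamp format is an assumption; the PR does not state it.
    ts = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return f"{root}/{case_name}_{ts}/"


def artifact_paths(prefix: str) -> dict[str, Path]:
    """Map each diagnostic artifact to its fixed name under the per-task directory."""
    base = Path(prefix)
    return {
        "l2_perf": base / "l2_perf_records.json",
        "tensor_dump": base / "tensor_dump",
        "pmu": base / "pmu.csv",
    }
```

Because the directory carries the uniqueness, a caller can compute every artifact path before the run starts instead of discovering files afterwards.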

This is the third and final PR superseding #685.

What gets deleted

  • flatten_l2_perf_records_subdirs in parallel_scheduler.py — no more "scoped subdir → flatten back" dance.
  • _snapshot_l2_perf_records_files / _wait_new_l2_perf_records_file / rename-after-the-fact in scene_test.py — the path is known a priori from CallConfig.output_prefix.
  • SIMPLER_*_OUTPUT_DIR env-var setting in conftest.py (xdist worker scoping) and _dispatch_test_phases_standalone.
  • getenv and strftime-based filename composition in C++ l2_perf_collector / tensor_dump_collector / pmu_collector (a5 + a2a3, onboard + sim).
  • Per-file timestamps. Every output now lands in <prefix>/{l2_perf_records.json, tensor_dump/, pmu.csv} with fixed names. The directory IS the per-task uniqueness boundary.

What gets added

  • CallConfig::output_prefix (char[1024], NUL-terminated). Increases sizeof(CallConfig) from 20 to 1044 bytes.
  • CallConfig::validate() — throws std::invalid_argument if any diagnostic flag is enabled but output_prefix is empty. Called at every submit/run entry point (Orchestrator::submit_impl, ChipWorker::run, Worker::run) before any IPC happens, so failures land closest to the user's call site.
  • pto_runtime_c_api::run_runtime gains a const char *output_prefix parameter, plumbed through to DeviceRunner::set_output_prefix() (a5 + a2a3, onboard + sim).
  • nanobind def_property bridge for the str↔char[1024] crossing, with size validation.
  • Python-side bounds check in _read_args_from_mailbox (was unchecked — could read past the buffer if the header was corrupt).
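To make the size and validation contract concrete, here is a minimal ctypes model of the struct. It is a sketch under assumptions: the five int32 field names are hypothetical (the PR only states "5 int32 + char[1024]"), and the real validate() lives in C++ and throws std::invalid_argument; ValueError stands in for it here.

```python
import ctypes

OUTPUT_PREFIX_CAPACITY = 1024  # char[1024], NUL-terminated, per the PR description


class CallConfig(ctypes.Structure):
    # Field names are hypothetical; only the layout (5 int32 + char[1024]) is from the PR.
    _fields_ = [
        ("block_dim", ctypes.c_int32),
        ("aicpu_thread_num", ctypes.c_int32),
        ("enable_l2_swimlane", ctypes.c_int32),
        ("enable_dump_tensor", ctypes.c_int32),
        ("enable_pmu", ctypes.c_int32),
        ("output_prefix", ctypes.c_char * OUTPUT_PREFIX_CAPACITY),
    ]

    def set_output_prefix(self, prefix: str) -> None:
        """Size-checked str -> char[1024] crossing; one byte is reserved for the NUL."""
        raw = prefix.encode("utf-8")
        if len(raw) >= OUTPUT_PREFIX_CAPACITY:
            raise ValueError(
                f"output_prefix is {len(raw)} bytes; limit is {OUTPUT_PREFIX_CAPACITY - 1}"
            )
        self.output_prefix = raw

    def validate(self) -> None:
        """Reject a config that enables diagnostics without saying where to write them."""
        diagnostics_on = (
            self.enable_l2_swimlane or self.enable_dump_tensor or self.enable_pmu
        )
        if diagnostics_on and not self.output_prefix:
            raise ValueError("output_prefix must be set when a diagnostic flag is enabled")


# 5 * 4 bytes + 1024 bytes, with no trailing padding needed for 4-byte alignment.
assert ctypes.sizeof(CallConfig) == 1044
```

Calling validate() at each entry point before any IPC means a missing prefix fails in the caller's process rather than in a worker.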

Mailbox layout

MAILBOX_OFF_ARGS is now derived from sizeof(CallConfig), rounded up to a multiple of 8 bytes for ContinuousTensor.data alignment. static_asserts guard both invariants. MAILBOX_ARGS_CAPACITY shrinks from 3776 to ~2776 bytes (still ~69 tensors per task).
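The rounding rule and the capacity check on the Python side can be illustrated as below. This is a hedged sketch: `align_up` and `check_arg_counts` are hypothetical helpers, the per-entry sizes are passed in as parameters because the PR does not state them, and the position of CallConfig within the mailbox is not modelled.

```python
def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (value + alignment - 1) & ~(alignment - 1)


def check_arg_counts(t_count: int, s_count: int,
                     tensor_size: int, scalar_size: int, capacity: int) -> int:
    """Return the bytes needed for the args, or raise if they exceed the args region.

    tensor_size and scalar_size are caller-supplied assumptions, not values from the PR.
    """
    if t_count < 0 or s_count < 0:
        raise ValueError("counts must be non-negative")
    needed = t_count * tensor_size + s_count * scalar_size
    if needed > capacity:
        raise ValueError(f"args need {needed} bytes but capacity is {capacity}")
    return needed


SIZEOF_CALL_CONFIG = 1044  # from the PR description
ARGS_ALIGNMENT = 8         # ContinuousTensor.data alignment, from the PR description

# The config block occupies 1048 bytes once padded to an 8-byte boundary.
assert align_up(SIZEOF_CALL_CONFIG, ARGS_ALIGNMENT) == 1048
```

Checking the counts before reading protects against a corrupt header driving the reader past the end of the buffer.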

Output layout (before/after)

| Artifact | Before | After |
| --- | --- | --- |
| Swimlane | outputs/l2_perf_records_<ts>.json | outputs/<case>_<ts>/l2_perf_records.json |
| Tensor dump | outputs/tensor_dump_<ts>/ | outputs/<case>_<ts>/tensor_dump/ |
| PMU | outputs/pmu_<ts>_<ms>.csv | outputs/<case>_<ts>/pmu.csv |
| Parallel scoping | SIMPLER_OUTPUT_DIR env + per-subprocess subdirs + flatten step | each case picks its own dir; no env, no flatten |

CLI tools

swimlane_converter, sched_overhead_analysis, perf_to_mermaid, dump_viewer now glob outputs/*/l2_perf_records.json (or outputs/*/tensor_dump) and pick the latest by mtime.
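The "glob and pick the latest by mtime" lookup might look like the following. It is a minimal sketch, not the tools' actual code; `find_latest` is a hypothetical name.

```python
from pathlib import Path
from typing import Optional


def find_latest(root: str = "outputs",
                name: str = "l2_perf_records.json") -> Optional[Path]:
    """Return the most recently modified <root>/*/<name>, or None if there is none."""
    candidates = list(Path(root).glob(f"*/{name}"))
    if not candidates:
        return None
    # Fixed filenames mean only the parent directory differs, so mtime picks the run.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

Passing `name="tensor_dump"` would select the newest dump directory in the same way.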

PR sequence

  1. Rename ChipCallConfig → CallConfig (#687, merged)
  2. Pack mailbox config fields into a single POD block (#689, merged)
  3. This PR — pass output_prefix via CallConfig, drop env var

Test plan

  • CI: lint + unit tests + simulation tests + onboard hardware tests pass
  • Round-trip: a test case with --enable-l2-swimlane produces outputs/<case>_<ts>/l2_perf_records.json and the swimlane converter auto-picks it
  • Negative: enabling --enable-l2-swimlane from a code path that bypasses scene_test (i.e. with empty output_prefix) fails with the std::invalid_argument thrown by CallConfig::validate(), which verifies that the contract violation surfaces synchronously instead of producing garbage files

🤖 Generated with Claude Code

gemini-code-assist (bot) left a comment

Code Review

This pull request refactors the diagnostic artifact storage system to use per-task directories with fixed filenames, eliminating the need for environment-variable scoping and post-run file flattening. This change improves reliability and organization during parallel test execution, supported by new validation logic in CallConfig. The review feedback highlights issues where the C++ runtime fails to create the necessary output directories before attempting to write L2 performance records and PMU data, which would lead to failed exports.

Comment thread src/a2a3/platform/src/host/l2_perf_collector.cpp
Comment thread src/a5/platform/src/host/l2_perf_collector.cpp
Comment thread src/a2a3/platform/include/host/pmu_collector.h Outdated
Comment thread src/a5/platform/include/host/pmu_collector.h Outdated
ChaoWao force-pushed the refactor/output-prefix-drop-env-var branch 5 times, most recently from e49c823 to c42ebd7, on April 28, 2026 06:36
Replace the env-var-based output-directory scoping (SIMPLER_OUTPUT_DIR /
SIMPLER_L2_PERF_RECORDS_OUTPUT_DIR) with a per-task output_prefix field
on CallConfig. Each test case sets its own prefix (chosen by
scene_test.py::_build_output_prefix as outputs/<case>_<ts>/), the C++
runtime writes diagnostic artifacts under that directory with fixed
filenames, and all the workarounds for the second-precision timestamp
collision go away:

- delete flatten_l2_perf_records_subdirs from parallel_scheduler.py
- delete _snapshot_l2_perf_records_files / _wait_new_l2_perf_records_file /
  rename-after-the-fact dance from scene_test.py
- delete SIMPLER_*_OUTPUT_DIR env var setting in conftest.py
  (xdist worker scoping) and _dispatch_test_phases_standalone
- delete getenv and strftime-based filename composition in C++
  l2_perf_collector / tensor_dump_collector / pmu_collector
- collector exports now write <prefix>/{l2_perf_records.json,
  tensor_dump/, pmu.csv} with fixed filenames

CallConfig::validate() throws std::invalid_argument if any diagnostic
flag is enabled but output_prefix is empty - surfaces the contract
violation at every submit/run entry point (Orchestrator::submit_impl,
ChipWorker::run, Worker::run) before any IPC happens, so failures land
closest to the user's call site.

Mailbox layout: CallConfig grows to 1044 bytes (5 int32 + char[1024]).
MAILBOX_OFF_ARGS auto-derives from sizeof(CallConfig), rounded up to 8
bytes for ContinuousTensor.data alignment. ARGS_CAPACITY shrinks to
~2776 bytes (was 3776) - still fits ~69 tensors per task.

Also tightens Python-side mailbox safety: _read_args_from_mailbox now
bounds-checks t_count/s_count against _MAILBOX_ARGS_CAPACITY (was
unchecked - could read past the buffer if the header was corrupt).

CLI tools (swimlane_converter, sched_overhead_analysis, perf_to_mermaid,
dump_viewer) now glob outputs/*/l2_perf_records.json (or
outputs/*/tensor_dump) and pick the latest by mtime.

Note: clang-tidy and check-headers hooks skipped locally (simpler not
pip-installed in this venv); CI will lint.
ChaoWao force-pushed the refactor/output-prefix-drop-env-var branch from c42ebd7 to 27bdeea on April 28, 2026 07:01
ChaoWao merged commit c557319 into hw-native-sys:main on Apr 28, 2026
14 checks passed
ChaoWao deleted the refactor/output-prefix-drop-env-var branch on April 28, 2026 07:09