Refactor: pass output_prefix via CallConfig, drop SIMPLER_OUTPUT_DIR env var#693
Merged
ChaoWao merged 1 commit intohw-native-sys:mainfrom Apr 28, 2026
Merged
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the diagnostic artifact storage system to use per-task directories with fixed filenames, eliminating the need for environment-variable scoping and post-run file flattening. This change improves reliability and organization during parallel test execution, supported by new validation logic in CallConfig. The review feedback highlights issues where the C++ runtime fails to create the necessary output directories before attempting to write L2 performance records and PMU data, which would lead to failed exports.
e49c823 to
c42ebd7
Compare
Replace the env-var-based output-directory scoping (SIMPLER_OUTPUT_DIR /
SIMPLER_L2_PERF_RECORDS_OUTPUT_DIR) with a per-task output_prefix field
on CallConfig. Each test case sets its own prefix (chosen by
scene_test.py::_build_output_prefix as outputs/<case>_<ts>/), the C++
runtime writes diagnostic artifacts under that directory with fixed
filenames, and all the workarounds for the second-precision timestamp
collision go away:
- delete flatten_l2_perf_records_subdirs from parallel_scheduler.py
- delete _snapshot_l2_perf_records_files / _wait_new_l2_perf_records_file /
rename-after-the-fact dance from scene_test.py
- delete SIMPLER_*_OUTPUT_DIR env var setting in conftest.py
(xdist worker scoping) and _dispatch_test_phases_standalone
- delete getenv and strftime-based filename composition in C++
l2_perf_collector / tensor_dump_collector / pmu_collector
- collector exports now write <prefix>/{l2_perf_records.json,
tensor_dump/, pmu.csv} with fixed filenames
CallConfig::validate() throws std::invalid_argument if any diagnostic
flag is enabled but output_prefix is empty - surfaces the contract
violation at every submit/run entry point (Orchestrator::submit_impl,
ChipWorker::run, Worker::run) before any IPC happens, so failures land
closest to the user's call site.
Mailbox layout: CallConfig grows to 1044 bytes (5 int32 + char[1024]).
MAILBOX_OFF_ARGS auto-derives from sizeof(CallConfig), rounded up to 8
bytes for ContinuousTensor.data alignment. ARGS_CAPACITY shrinks to
~2776 bytes (was 3776) - still fits ~69 tensors per task.
Also tightens Python-side mailbox safety: _read_args_from_mailbox now
bounds-checks t_count/s_count against _MAILBOX_ARGS_CAPACITY (was
unchecked - could read past the buffer if the header was corrupt).
CLI tools (swimlane_converter, sched_overhead_analysis, perf_to_mermaid,
dump_viewer) now glob outputs/*/l2_perf_records.json (or
outputs/*/tensor_dump) and pick the latest by mtime.
Note: clang-tidy and check-headers hooks skipped locally (simpler not
pip-installed in this venv); CI will lint.
c42ebd7 to
27bdeea
Compare
poursoul
approved these changes
Apr 28, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Replace the env-var-based output-directory scoping (
SIMPLER_OUTPUT_DIR/SIMPLER_L2_PERF_RECORDS_OUTPUT_DIR) with a per-taskoutput_prefixfield onCallConfig. Each test case sets its own prefix (chosen byscene_test.py::_build_output_prefixasoutputs/<case>_<ts>/), the C++ runtime writes diagnostic artifacts under that directory with fixed filenames, and all the workarounds for the second-precision timestamp collision go away.This is the third and final PR superseding #685.
What gets deleted
flatten_l2_perf_records_subdirsinparallel_scheduler.py— no more "scoped subdir → flatten back" dance._snapshot_l2_perf_records_files/_wait_new_l2_perf_records_file/ rename-after-the-fact inscene_test.py— the path is known a priori fromCallConfig.output_prefix.SIMPLER_*_OUTPUT_DIRenv-var setting inconftest.py(xdist worker scoping) and_dispatch_test_phases_standalone.getenvandstrftime-based filename composition in C++l2_perf_collector/tensor_dump_collector/pmu_collector(a5 + a2a3, onboard + sim).<prefix>/{l2_perf_records.json, tensor_dump/, pmu.csv}with fixed names. The directory IS the per-task uniqueness boundary.What gets added
CallConfig::output_prefix(char[1024], NUL-terminated). Increasessizeof(CallConfig)from 20 → 1044 bytes.CallConfig::validate()— throwsstd::invalid_argumentif any diagnostic flag is enabled butoutput_prefixis empty. Called at every submit/run entry point (Orchestrator::submit_impl,ChipWorker::run,Worker::run) before any IPC happens, so failures land closest to the user's call site.pto_runtime_c_api::run_runtimegains aconst char *output_prefixparameter, plumbed through toDeviceRunner::set_output_prefix()(a5 + a2a3, onboard + sim).def_propertybridge for the str↔char[1024] crossing, with size validation._read_args_from_mailbox(was unchecked — could read past the buffer if the header was corrupt).Mailbox layout
MAILBOX_OFF_ARGSauto-derives fromsizeof(CallConfig), rounded up to 8 bytes forContinuousTensor.dataalignment.static_assertguards both invariants.MAILBOX_ARGS_CAPACITYshrinks from 3776 → ~2776 bytes (still ~69 tensors per task).Output layout (before/after)
outputs/l2_perf_records_<ts>.jsonoutputs/<case>_<ts>/l2_perf_records.jsonoutputs/tensor_dump_<ts>/outputs/<case>_<ts>/tensor_dump/outputs/pmu_<ts>_<ms>.csvoutputs/<case>_<ts>/pmu.csvSIMPLER_OUTPUT_DIRenv + per-subprocess subdirs + flatten stepCLI tools
swimlane_converter,sched_overhead_analysis,perf_to_mermaid,dump_viewernow globoutputs/*/l2_perf_records.json(oroutputs/*/tensor_dump) and pick the latest by mtime.PR sequence
Rename(Refactor: rename ChipCallConfig -> CallConfig #687, merged)ChipCallConfig→CallConfigPack mailbox config fields into a single POD block(Refactor: pack mailbox config fields into a single POD block #689, merged)output_prefixvia CallConfig, drop env varTest plan
--enable-l2-swimlaneproducesoutputs/<case>_<ts>/l2_perf_records.jsonand the swimlane converter auto-picks it--enable-l2-swimlanefrom a code path that bypasses scene_test (i.e. with emptyoutput_prefix) throwsRuntimeErrorfromCallConfig::validate()— verifies the contract surfaces synchronously instead of producing garbage files🤖 Generated with Claude Code