
Support: upgrade profiling pipeline with TensorMap instrumentation #167

Merged
ChaoWao merged 1 commit into main from profiling/scheduler-phase-upgrade on Mar 3, 2026

Conversation

@ChaoWao (Collaborator) commented Mar 3, 2026

Summary

  • Restore perf_aicpu_record_phase calls for all scheduler phases (complete, dispatch, scan, idle) lost during scheduler API migration
  • Replace old yield/orch_drain counters with notify/pop/idle metrics
  • Upgrade DEV_ALWAYS output with per-phase breakdown and detailed stats (fanout edges, fanin edges, pop hit/miss rates)
  • Track both fanout and fanin edges in on_task_complete profiling stats to explain complete-phase overhead
  • Add Thread 3: prefix to orchestrator/TensorMap log output for consistent device log format
  • Add TensorMap lookup/insert profiling counters behind new PTO2_TENSORMAP_PROFILING macro (default off)
  • Rename PTO2_ORCH_PROFILING to PTO2_PROFILING (controls both orchestrator and scheduler)
  • Gate all profiling DEV_ALWAYS output behind runtime->enable_profiling
  • Guard all per-phase perf recording with #if PTO2_PROFILING to ensure PTO2_PROFILING=0 compiles
  • Guard on_task_complete stats behind #if PTO2_PROFILING to avoid hot-path overhead when profiling is off
  • Unify release_fanin_and_check_ready / release_fanin_and_check_ready_counted into single method returning bool
  • Add div-by-zero guards for orchestrator profiling output and Python task count validation
  • Deduplicate TensorMap lookup chain stats via single exit point
  • Add task count validation in swimlane_converter to suppress misleading Sched CPU metric on device log mismatch
  • Rewrite sched_overhead_analysis.py parser for new output format with fanout/fanin separation
  • Update device_log_profiling.md examples
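Several of the bullets above describe a two-level gate: per-phase recording is compiled out entirely when PTO2_PROFILING=0, and even when compiled in, output is suppressed unless profiling is enabled at runtime. A minimal sketch of that pattern follows; the Runtime and PhaseCounters types and the function names here are illustrative stand-ins, not the real tree's definitions.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Assumed macro: in the real tree PTO2_PROFILING is set by the build system.
#ifndef PTO2_PROFILING
#define PTO2_PROFILING 1
#endif

// Hypothetical stand-ins for the runtime handle and the phase counters.
struct Runtime { bool enable_profiling; };

struct PhaseCounters {
    uint64_t phase_complete_count = 0;
    uint64_t phase_dispatch_count = 0;
};

// Compile-time guard: with PTO2_PROFILING=0 this function body compiles
// to nothing, keeping the hot path free of profiling overhead.
inline void record_complete_phase(PhaseCounters& c) {
#if PTO2_PROFILING
    ++c.phase_complete_count;
#endif
}

// Runtime gate: even when profiling is compiled in, output stays silent
// unless requested (mirrors the runtime->enable_profiling check).
inline void dump_phase_stats(const Runtime& rt, const PhaseCounters& c) {
    if (!rt.enable_profiling) return;
    std::printf("complete=%llu dispatch=%llu\n",
                (unsigned long long)c.phase_complete_count,
                (unsigned long long)c.phase_dispatch_count);
}
```

With this split, `PTO2_PROFILING=0` builds pay zero cost, and profiled builds still produce no log noise unless `--enable-profiling` is passed.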

Closes #159

Testing

  • Hardware test on device-12 with --enable-profiling: profiling output appears
  • Hardware test on device-12 without --enable-profiling: profiling output suppressed
  • pytest tests -v (21 passed)

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the scheduler's performance profiling capabilities, which were partially lost during a previous migration. It reintroduces fine-grained phase profiling, updates the metrics collected, and refines the output format for better analysis. These changes provide deeper insights into scheduler behavior, allowing for more effective identification of bottlenecks and optimization opportunities.

Highlights

  • Profiling Phase Renaming: The SCHED_EARLY_READY phase was renamed to SCHED_IDLE_WAIT across the profiling pipeline, including the enum definition, swimlane export, and Chrome trace generation, to more accurately reflect its purpose as an idle/spinning state.
  • Enhanced Task Completion Statistics: A new PTO2CompletionStats struct was introduced to track detailed metrics during task completion, specifically the number of fanout edges traversed and the count of consumer tasks that became ready and were enqueued.
  • Restored and Upgraded Phase Profiling: The perf_aicpu_record_phase calls were reinstated for all four scheduler phases (complete, dispatch, scan, idle) to restore fine-grained profiling. Old yield/orch_drain counters were replaced with new notify/pop/idle metrics in the executor.
  • Improved Scheduler Output Format: The DEV_ALWAYS output was upgraded to provide a comprehensive per-phase breakdown, including detailed statistics such as notify edges, maximum fanout degree, average fanout, and pop hit/miss rates.
  • Updated Analysis Script: The sched_overhead_analysis.py parser script was rewritten to align with the new DEV_ALWAYS output format, removing parsing for outdated lock contention and steal metrics and incorporating the new detailed phase statistics.
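The completion-stats flow in the highlights can be sketched as below. The field names fanout_edges and tasks_enqueued come from this PR's changelog; fanin_edges is inferred from the summary, and the Task type and queue logic are simplified assumptions rather than the real scheduler's data structures.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch of the PTO2CompletionStats idea; the real struct lives in
// pto_scheduler.h and its exact field set may differ.
struct PTO2CompletionStats {
    uint32_t fanout_edges = 0;   // consumer edges traversed
    uint32_t fanin_edges = 0;    // dependency counts decremented
    uint32_t tasks_enqueued = 0; // consumers that became ready
};

// Hypothetical task: pointers to each consumer's remaining fan-in count.
struct Task { std::vector<int*> consumers_remaining_fanin; };

// Unified release-and-check: decrement a consumer's fan-in and report
// whether it became ready. The bool return is what lets the former
// *_counted variant collapse into a single method.
inline bool release_fanin_and_check_ready(int& remaining) {
    return --remaining == 0;
}

inline PTO2CompletionStats on_task_complete(Task& t) {
    PTO2CompletionStats stats;
    for (int* fanin : t.consumers_remaining_fanin) {
        ++stats.fanout_edges;
        ++stats.fanin_edges;
        if (release_fanin_and_check_ready(*fanin))
            ++stats.tasks_enqueued; // real scheduler: push to ready queue
    }
    return stats;
}
```

Returning the stats struct by value keeps the caller (the executor loop) in charge of aggregating per-phase totals, which is what the new DEV_ALWAYS breakdown reports.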


Changelog
  • src/platform/include/common/perf_profiling.h
    • Renamed SCHED_EARLY_READY enum member to SCHED_IDLE_WAIT.
  • src/platform/src/host/performance_collector.cpp
    • Updated the phase name mapping for SCHED_IDLE_WAIT in the swimlane JSON export.
  • src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
    • Replaced old profiling counters (sched_orch_drain_cycle, sched_yield_cycle, sched_yield_count) with new ones (sched_idle_cycle, notify_edges_total, notify_max_degree, notify_tasks_enqueued, pop_hit, pop_miss, phase_complete_count, phase_dispatch_count).
    • Added perf_aicpu_record_phase calls for SCHED_COMPLETE, SCHED_DISPATCH, SCHED_SCAN, and SCHED_IDLE_WAIT phases.
    • Modified pto2_scheduler_on_task_complete call to use the new PTO2CompletionStats return value.
    • Updated DEV_ALWAYS output to display detailed per-phase breakdown including notify and pop statistics.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.cpp
    • Changed the return type of pto2_scheduler_on_task_complete to PTO2CompletionStats.
    • Populated PTO2CompletionStats with fanout_edges and tasks_enqueued during task completion.
    • Used release_fanin_and_check_ready_counted to track enqueued tasks.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.h
    • Defined the PTO2CompletionStats struct.
    • Added release_fanin_and_check_ready_counted method to PTO2SchedulerState to return whether a task was enqueued.
    • Updated the function signature for pto2_scheduler_on_task_complete.
  • tools/sched_overhead_analysis.py
    • Updated the expected log format to reflect the new DEV_ALWAYS output.
    • Modified regular expressions to parse new notify and pop metrics.
    • Removed parsing logic for early_ready, steal, and lock contention metrics.
    • Updated phase labels and analysis insights to match the new profiling data.
  • tools/swimlane_converter.py
    • Renamed the early_ready phase to idle in the Chrome trace JSON generation.
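Several of the new metrics surfaced by this changelog are ratios (average fanout, pop hit rate), which is where the div-by-zero guards from the summary come in. A minimal sketch of the guard pattern, with illustrative metric names:

```cpp
#include <cassert>
#include <cstdint>

// Guarded ratio helpers: return 0 rather than dividing by zero when no
// events were recorded (e.g. a run with zero completions). Names mirror
// the PR's counters but the signatures are illustrative.
inline double avg_fanout(uint64_t notify_edges_total, uint64_t completions) {
    return completions == 0
               ? 0.0
               : static_cast<double>(notify_edges_total) / completions;
}

inline double pop_hit_rate(uint64_t pop_hit, uint64_t pop_miss) {
    const uint64_t total = pop_hit + pop_miss;
    return total == 0 ? 0.0 : static_cast<double>(pop_hit) / total;
}
```

The same guard belongs in both the C++ DEV_ALWAYS output and the Python analysis side, so a log from an idle run never crashes the parser or prints NaN.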
Activity
  • The pull request includes comprehensive testing information, with pytest tests -v passing all 21 tests and runtime compilation succeeding for all platforms (host + aicpu + aicore).
  • Hardware device tests are noted as pending, requiring an Ascend NPU.
  • A follow-up task for a lock-free orch_pending queue is planned for a separate PR.

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request effectively restores and enhances the scheduler's profiling capabilities, aiming to provide more detailed performance insights. A security review, however, identified a potential out-of-bounds access in the newly added release_fanin_and_check_ready_counted method due to missing validation of the worker_type field, which is read from shared memory and used as an array index. Additionally, there are a few areas for improvement: a minor correctness issue in an average calculation, an opportunity to reduce code duplication for better maintainability, and a correction to a mathematical formula in the Python analysis script.
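The out-of-bounds concern raised here boils down to validating an index read from shared memory before using it. A sketch of the defensive check, assuming hypothetical names (kNumWorkerTypes, ready_queues) since the review comment does not show the real code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// worker_type is read from shared memory, so a corrupted or hostile value
// must not be trusted as an array index. Sizes and names are illustrative.
constexpr std::size_t kNumWorkerTypes = 3;
int ready_queues[kNumWorkerTypes] = {0, 0, 0};

// Range-check before indexing: returns false and skips the enqueue on an
// out-of-range worker_type instead of writing past the end of the array.
inline bool enqueue_checked(uint32_t worker_type) {
    if (worker_type >= kNumWorkerTypes) return false;
    ++ready_queues[worker_type];
    return true;
}
```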

Comment thread src/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.h Outdated
Comment thread src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp Outdated
Comment thread tools/sched_overhead_analysis.py Outdated
@ChaoWao force-pushed the profiling/scheduler-phase-upgrade branch from 9204e19 to 6cdea13 on March 3, 2026 10:50
@ChaoWao changed the title from "Support: restore phase profiling and upgrade scheduler output format" to "Support: upgrade profiling pipeline with TensorMap instrumentation" on Mar 3, 2026
@ChaoWao force-pushed the profiling/scheduler-phase-upgrade branch 2 times, most recently from 3f507f8 to 2c6fe5d on March 3, 2026 11:30
@ChaoWao changed the title from "Support: upgrade profiling pipeline with TensorMap instrumentation" back to "Support: restore phase profiling and upgrade scheduler output format" on Mar 3, 2026
@ChaoWao force-pushed the profiling/scheduler-phase-upgrade branch 2 times, most recently from cc33cc5 to 0510109 on March 3, 2026 12:36
@ChaoWao (Collaborator, Author) commented Mar 3, 2026

/gemini review


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request refactors and enhances the profiling infrastructure for the PTO2 scheduler and orchestrator. Key changes include renaming the profiling macro from PTO2_ORCH_PROFILING to PTO2_PROFILING for broader applicability, and introducing a new PTO2_TENSORMAP_PROFILING macro for more granular TensorMap statistics.

The SCHED_EARLY_READY phase has been renamed to SCHED_IDLE_WAIT to better reflect its purpose of tracking idle/spinning time, with corresponding updates in the performance collector and documentation. Significant additions were made to scheduler profiling metrics, including tracking pop_hit, pop_miss, notify_edges_total, notify_max_degree, and notify_tasks_enqueued, which are now captured via a new PTO2CompletionStats struct returned by on_task_complete. The scheduler's log output has been updated to reflect these new metrics and provide a more detailed phase breakdown, removing the previous lock contention statistics.

The Python analysis scripts (sched_overhead_analysis.py, swimlane_converter.py) and the device_log_profiling.md documentation have been updated to parse and interpret these new profiling outputs, removing references to old metrics like early_ready and lock contention. Review comments suggest adding comments for clarity on profiling increments, using static for global profiling variables to limit their scope, and refining the warning message for mismatched task counts in the swimlane_converter.py script.

Comment thread src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
Comment thread src/runtime/tensormap_and_ringbuffer/runtime/pto_tensormap.cpp
Comment thread tools/swimlane_converter.py
@ChaoWao changed the title from "Support: restore phase profiling and upgrade scheduler output format" to "Support: upgrade profiling pipeline with TensorMap instrumentation" on Mar 3, 2026
@ChaoWao force-pushed the profiling/scheduler-phase-upgrade branch from 0510109 to d6f26b9 on March 3, 2026 13:17
- Restore perf_aicpu_record_phase calls for all scheduler phases
- Replace old yield/orch_drain counters with notify/pop/idle metrics
- Upgrade DEV_ALWAYS output with per-phase breakdown and detailed stats
- Track both fanout and fanin edges in on_task_complete profiling stats
- Add Thread 3 prefix to orchestrator/TensorMap log output
- Add TensorMap lookup/insert profiling counters (PTO2_TENSORMAP_PROFILING)
- Rename PTO2_ORCH_PROFILING to PTO2_PROFILING
- Gate all profiling DEV_ALWAYS output behind runtime->enable_profiling
- Guard all per-phase perf recording with #if PTO2_PROFILING
- Unify release_fanin_and_check_ready into single method returning bool
- Add div-by-zero guards and task count validation
- Rewrite sched_overhead_analysis.py parser for new output format
- Update device_log_profiling.md examples
@ChaoWao force-pushed the profiling/scheduler-phase-upgrade branch from d6f26b9 to 4e2f1db on March 3, 2026 13:21
@ChaoWao merged commit 7a16d1c into main on Mar 3, 2026 (3 checks passed)
@ChaoWao deleted the profiling/scheduler-phase-upgrade branch on March 5, 2026 13:41
PKUZHOU pushed a commit to PKUZHOU/simpler that referenced this pull request Mar 31, 2026
…w-native-sys#167)
