
Support: upgrade profiling pipeline with TensorMap instrumentation #167

Merged
ChaoWao merged 1 commit into main from profiling/scheduler-phase-upgrade on Mar 3, 2026

Conversation

@ChaoWao (Collaborator) commented Mar 3, 2026

Summary

  • Restore perf_aicpu_record_phase calls for all scheduler phases (complete, dispatch, scan, idle) lost during scheduler API migration
  • Replace old yield/orch_drain counters with notify/pop/idle metrics
  • Upgrade DEV_ALWAYS output with per-phase breakdown and detailed stats (fanout edges, fanin edges, pop hit/miss rates)
  • Track both fanout and fanin edges in on_task_complete profiling stats to explain complete-phase overhead
  • Add Thread 3: prefix to orchestrator/TensorMap log output for consistent device log format
  • Add TensorMap lookup/insert profiling counters behind new PTO2_TENSORMAP_PROFILING macro (default off)
  • Rename PTO2_ORCH_PROFILING to PTO2_PROFILING (controls both orchestrator and scheduler)
  • Gate all profiling DEV_ALWAYS output behind runtime->enable_profiling
  • Guard all per-phase perf recording with #if PTO2_PROFILING to ensure PTO2_PROFILING=0 compiles
  • Guard on_task_complete stats behind #if PTO2_PROFILING to avoid hot-path overhead when profiling is off
  • Unify release_fanin_and_check_ready / release_fanin_and_check_ready_counted into single method returning bool
  • Add div-by-zero guards for orchestrator profiling output and Python task count validation
  • Deduplicate TensorMap lookup chain stats via single exit point
  • Add task count validation in swimlane_converter to suppress misleading Sched CPU metric on device log mismatch
  • Rewrite sched_overhead_analysis.py parser for new output format with fanout/fanin separation
  • Update device_log_profiling.md examples
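Several of the bullets above describe a two-level gate: per-phase recording is compiled out entirely when PTO2_PROFILING=0, and even when compiled in, output is suppressed unless profiling is enabled at runtime. A minimal sketch of that pattern follows; the Runtime and PhaseCounters types and the function names here are illustrative stand-ins, not the real tree's definitions.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Assumed macro: in the real tree PTO2_PROFILING is set by the build system.
#ifndef PTO2_PROFILING
#define PTO2_PROFILING 1
#endif

// Hypothetical stand-ins for the runtime handle and the phase counters.
struct Runtime { bool enable_profiling; };

struct PhaseCounters {
    uint64_t phase_complete_count = 0;
    uint64_t phase_dispatch_count = 0;
};

// Compile-time guard: with PTO2_PROFILING=0 this function body compiles
// to nothing, keeping the hot path free of profiling overhead.
inline void record_complete_phase(PhaseCounters& c) {
#if PTO2_PROFILING
    ++c.phase_complete_count;
#endif
}

// Runtime gate: even when profiling is compiled in, output stays silent
// unless requested (mirrors the runtime->enable_profiling check).
inline void dump_phase_stats(const Runtime& rt, const PhaseCounters& c) {
    if (!rt.enable_profiling) return;
    std::printf("complete=%llu dispatch=%llu\n",
                (unsigned long long)c.phase_complete_count,
                (unsigned long long)c.phase_dispatch_count);
}
```

With this split, `PTO2_PROFILING=0` builds pay zero cost, and profiled builds still produce no log noise unless `--enable-profiling` is passed.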

Closes #159

Testing

  • Hardware test on device-12 with --enable-profiling: profiling output appears
  • Hardware test on device-12 without --enable-profiling: profiling output suppressed
  • pytest tests -v (21 passed)

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the scheduler's performance profiling capabilities, which were partially lost during a previous migration. It reintroduces fine-grained phase profiling, updates the metrics collected, and refines the output format for better analysis. These changes provide deeper insights into scheduler behavior, allowing for more effective identification of bottlenecks and optimization opportunities.

Highlights

  • Profiling Phase Renaming: The SCHED_EARLY_READY phase was renamed to SCHED_IDLE_WAIT across the profiling pipeline, including the enum definition, swimlane export, and Chrome trace generation, to more accurately reflect its purpose as an idle/spinning state.
  • Enhanced Task Completion Statistics: A new PTO2CompletionStats struct was introduced to track detailed metrics during task completion, specifically the number of fanout edges traversed and the count of consumer tasks that became ready and were enqueued.
  • Restored and Upgraded Phase Profiling: The perf_aicpu_record_phase calls were reinstated for all four scheduler phases (complete, dispatch, scan, idle) to restore fine-grained profiling. Old yield/orch_drain counters were replaced with new notify/pop/idle metrics in the executor.
  • Improved Scheduler Output Format: The DEV_ALWAYS output was upgraded to provide a comprehensive per-phase breakdown, including detailed statistics such as notify edges, maximum fanout degree, average fanout, and pop hit/miss rates.
  • Updated Analysis Script: The sched_overhead_analysis.py parser script was rewritten to align with the new DEV_ALWAYS output format, removing parsing for outdated lock contention and steal metrics and incorporating the new detailed phase statistics.
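The completion-stats flow in the highlights can be sketched as below. The field names fanout_edges and tasks_enqueued come from this PR's changelog; fanin_edges is inferred from the summary, and the Task type and queue logic are simplified assumptions rather than the real scheduler's data structures.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch of the PTO2CompletionStats idea; the real struct lives in
// pto_scheduler.h and its exact field set may differ.
struct PTO2CompletionStats {
    uint32_t fanout_edges = 0;   // consumer edges traversed
    uint32_t fanin_edges = 0;    // dependency counts decremented
    uint32_t tasks_enqueued = 0; // consumers that became ready
};

// Hypothetical task: pointers to each consumer's remaining fan-in count.
struct Task { std::vector<int*> consumers_remaining_fanin; };

// Unified release-and-check: decrement a consumer's fan-in and report
// whether it became ready. The bool return is what lets the former
// *_counted variant collapse into a single method.
inline bool release_fanin_and_check_ready(int& remaining) {
    return --remaining == 0;
}

inline PTO2CompletionStats on_task_complete(Task& t) {
    PTO2CompletionStats stats;
    for (int* fanin : t.consumers_remaining_fanin) {
        ++stats.fanout_edges;
        ++stats.fanin_edges;
        if (release_fanin_and_check_ready(*fanin))
            ++stats.tasks_enqueued; // real scheduler: push to ready queue
    }
    return stats;
}
```

Returning the stats struct by value keeps the caller (the executor loop) in charge of aggregating per-phase totals, which is what the new DEV_ALWAYS breakdown reports.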


Changelog
  • src/platform/include/common/perf_profiling.h
    • Renamed SCHED_EARLY_READY enum member to SCHED_IDLE_WAIT.
  • src/platform/src/host/performance_collector.cpp
    • Updated the phase name mapping for SCHED_IDLE_WAIT in the swimlane JSON export.
  • src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
    • Replaced old profiling counters (sched_orch_drain_cycle, sched_yield_cycle, sched_yield_count) with new ones (sched_idle_cycle, notify_edges_total, notify_max_degree, notify_tasks_enqueued, pop_hit, pop_miss, phase_complete_count, phase_dispatch_count).
    • Added perf_aicpu_record_phase calls for SCHED_COMPLETE, SCHED_DISPATCH, SCHED_SCAN, and SCHED_IDLE_WAIT phases.
    • Modified pto2_scheduler_on_task_complete call to use the new PTO2CompletionStats return value.
    • Updated DEV_ALWAYS output to display detailed per-phase breakdown including notify and pop statistics.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.cpp
    • Changed the return type of pto2_scheduler_on_task_complete to PTO2CompletionStats.
    • Populated PTO2CompletionStats with fanout_edges and tasks_enqueued during task completion.
    • Used release_fanin_and_check_ready_counted to track enqueued tasks.
  • src/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.h
    • Defined the PTO2CompletionStats struct.
    • Added release_fanin_and_check_ready_counted method to PTO2SchedulerState to return whether a task was enqueued.
    • Updated the function signature for pto2_scheduler_on_task_complete.
  • tools/sched_overhead_analysis.py
    • Updated the expected log format to reflect the new DEV_ALWAYS output.
    • Modified regular expressions to parse new notify and pop metrics.
    • Removed parsing logic for early_ready, steal, and lock contention metrics.
    • Updated phase labels and analysis insights to match the new profiling data.
  • tools/swimlane_converter.py
    • Renamed the early_ready phase to idle in the Chrome trace JSON generation.
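Several of the new metrics surfaced by this changelog are ratios (average fanout, pop hit rate), which is where the div-by-zero guards from the summary come in. A minimal sketch of the guard pattern, with illustrative metric names:

```cpp
#include <cassert>
#include <cstdint>

// Guarded ratio helpers: return 0 rather than dividing by zero when no
// events were recorded (e.g. a run with zero completions). Names mirror
// the PR's counters but the signatures are illustrative.
inline double avg_fanout(uint64_t notify_edges_total, uint64_t completions) {
    return completions == 0
               ? 0.0
               : static_cast<double>(notify_edges_total) / completions;
}

inline double pop_hit_rate(uint64_t pop_hit, uint64_t pop_miss) {
    const uint64_t total = pop_hit + pop_miss;
    return total == 0 ? 0.0 : static_cast<double>(pop_hit) / total;
}
```

The same guard belongs in both the C++ DEV_ALWAYS output and the Python analysis side, so a log from an idle run never crashes the parser or prints NaN.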
Activity
  • The pull request includes comprehensive testing information, with pytest tests -v passing all 21 tests and runtime compilation succeeding for all platforms (host + aicpu + aicore).
  • Hardware device tests are noted as pending, requiring an Ascend NPU.
  • A follow-up task for a lock-free orch_pending queue is planned for a separate PR.

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request effectively restores and enhances the scheduler's profiling capabilities, aiming to provide more detailed performance insights. A security review, however, identified a potential out-of-bounds access in the newly added release_fanin_and_check_ready_counted method due to missing validation of the worker_type field, which is read from shared memory and used as an array index. Additionally, there are a few areas for improvement: a minor correctness issue in an average calculation, an opportunity to reduce code duplication for better maintainability, and a correction to a mathematical formula in the Python analysis script.
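The out-of-bounds concern raised here boils down to validating an index read from shared memory before using it. A sketch of the defensive check, assuming hypothetical names (kNumWorkerTypes, ready_queues) since the review comment does not show the real code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// worker_type is read from shared memory, so a corrupted or hostile value
// must not be trusted as an array index. Sizes and names are illustrative.
constexpr std::size_t kNumWorkerTypes = 3;
int ready_queues[kNumWorkerTypes] = {0, 0, 0};

// Range-check before indexing: returns false and skips the enqueue on an
// out-of-range worker_type instead of writing past the end of the array.
inline bool enqueue_checked(uint32_t worker_type) {
    if (worker_type >= kNumWorkerTypes) return false;
    ++ready_queues[worker_type];
    return true;
}
```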

Comment thread src/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.h Outdated
Comment thread src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp Outdated
Comment thread tools/sched_overhead_analysis.py Outdated
@ChaoWao force-pushed the profiling/scheduler-phase-upgrade branch from 9204e19 to 6cdea13 on March 3, 2026 10:50
@ChaoWao changed the title from "Support: restore phase profiling and upgrade scheduler output format" to "Support: upgrade profiling pipeline with TensorMap instrumentation" on Mar 3, 2026
@ChaoWao force-pushed the profiling/scheduler-phase-upgrade branch 2 times, most recently from 3f507f8 to 2c6fe5d on March 3, 2026 11:30
@ChaoWao changed the title from "Support: upgrade profiling pipeline with TensorMap instrumentation" back to "Support: restore phase profiling and upgrade scheduler output format" on Mar 3, 2026
@ChaoWao force-pushed the profiling/scheduler-phase-upgrade branch 2 times, most recently from cc33cc5 to 0510109 on March 3, 2026 12:36
@ChaoWao (Collaborator, Author) commented Mar 3, 2026

/gemini review


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request refactors and enhances the profiling infrastructure for the PTO2 scheduler and orchestrator. Key changes include renaming the profiling macro from PTO2_ORCH_PROFILING to PTO2_PROFILING for broader applicability, and introducing a new PTO2_TENSORMAP_PROFILING macro for more granular TensorMap statistics.

The SCHED_EARLY_READY phase has been renamed to SCHED_IDLE_WAIT to better reflect its purpose of tracking idle/spinning time, with corresponding updates in the performance collector and documentation. Significant additions were made to scheduler profiling metrics, including tracking pop_hit, pop_miss, notify_edges_total, notify_max_degree, and notify_tasks_enqueued, which are now captured via a new PTO2CompletionStats struct returned by on_task_complete. The scheduler's log output has been updated to reflect these new metrics and provide a more detailed phase breakdown, removing the previous lock contention statistics.

The Python analysis scripts (sched_overhead_analysis.py, swimlane_converter.py) and the device_log_profiling.md documentation have been updated to parse and interpret these new profiling outputs, removing references to old metrics like early_ready and lock contention. Review comments suggest adding comments for clarity on profiling increments, using static for global profiling variables to limit their scope, and refining the warning message for mismatched task counts in the swimlane_converter.py script.

Comment thread src/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
Comment thread src/runtime/tensormap_and_ringbuffer/runtime/pto_tensormap.cpp
Comment thread tools/swimlane_converter.py
@ChaoWao changed the title from "Support: restore phase profiling and upgrade scheduler output format" to "Support: upgrade profiling pipeline with TensorMap instrumentation" on Mar 3, 2026
@ChaoWao force-pushed the profiling/scheduler-phase-upgrade branch from 0510109 to d6f26b9 on March 3, 2026 13:17
- Restore perf_aicpu_record_phase calls for all scheduler phases
- Replace old yield/orch_drain counters with notify/pop/idle metrics
- Upgrade DEV_ALWAYS output with per-phase breakdown and detailed stats
- Track both fanout and fanin edges in on_task_complete profiling stats
- Add Thread 3 prefix to orchestrator/TensorMap log output
- Add TensorMap lookup/insert profiling counters (PTO2_TENSORMAP_PROFILING)
- Rename PTO2_ORCH_PROFILING to PTO2_PROFILING
- Gate all profiling DEV_ALWAYS output behind runtime->enable_profiling
- Guard all per-phase perf recording with #if PTO2_PROFILING
- Unify release_fanin_and_check_ready into single method returning bool
- Add div-by-zero guards and task count validation
- Rewrite sched_overhead_analysis.py parser for new output format
- Update device_log_profiling.md examples
@ChaoWao force-pushed the profiling/scheduler-phase-upgrade branch from d6f26b9 to 4e2f1db on March 3, 2026 13:21
@ChaoWao merged commit 7a16d1c into main on Mar 3, 2026 (3 checks passed)
@ChaoWao deleted the profiling/scheduler-phase-upgrade branch on March 5, 2026 13:41
PKUZHOU pushed a commit to PKUZHOU/simpler that referenced this pull request Mar 31, 2026
…w-native-sys#167)
