Support: upgrade profiling pipeline with TensorMap instrumentation #167
Conversation
Summary of Changes (Gemini Code Assist)

This pull request significantly enhances the scheduler's performance profiling capabilities, which were partially lost during a previous migration. It reintroduces fine-grained phase profiling, updates the collected metrics, and refines the output format for easier analysis. Together, these changes provide deeper insight into scheduler behavior, making it easier to identify bottlenecks and optimization opportunities.
Activity
Code Review
This pull request effectively restores and enhances the scheduler's profiling capabilities, aiming to provide more detailed performance insights. A security review, however, identified a potential out-of-bounds access in the newly added release_fanin_and_check_ready_counted method due to missing validation of the worker_type field, which is read from shared memory and used as an array index. Additionally, there are a few areas for improvement: a minor correctness issue in an average calculation, an opportunity to reduce code duplication for better maintainability, and a correction to a mathematical formula in the Python analysis script.
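The out-of-bounds concern above comes from using a value read from shared memory as an array index without validation. A minimal sketch of the kind of bounds check the review asks for is below; `PTO2_MAX_WORKER_TYPES` and the helper's name are illustrative assumptions, not the project's actual identifiers.

```cpp
#include <cstdint>

// Assumed bound on valid worker types; the real project constant may differ.
constexpr uint32_t PTO2_MAX_WORKER_TYPES = 4;

// Validate a worker_type field read from shared memory before it is ever
// used as an array index: a corrupted (or adversarial) value must be
// rejected rather than dereferenced out of bounds.
inline bool is_valid_worker_type(uint32_t worker_type) {
    return worker_type < PTO2_MAX_WORKER_TYPES;
}
```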
Force-pushed from 9204e19 to 6cdea13
Force-pushed from 3f507f8 to 2c6fe5d
Force-pushed from cc33cc5 to 0510109
/gemini review
Code Review
This pull request refactors and enhances the profiling infrastructure for the PTO2 scheduler and orchestrator. Key changes include renaming the profiling macro from PTO2_ORCH_PROFILING to PTO2_PROFILING for broader applicability, and introducing a new PTO2_TENSORMAP_PROFILING macro for more granular TensorMap statistics. The SCHED_EARLY_READY phase has been renamed to SCHED_IDLE_WAIT to better reflect its purpose of tracking idle/spinning time, with corresponding updates in the performance collector and documentation.

Significant additions were made to the scheduler profiling metrics, including tracking pop_hit, pop_miss, notify_edges_total, notify_max_degree, and notify_tasks_enqueued, which are now captured via a new PTO2CompletionStats struct returned by on_task_complete. The scheduler's log output has been updated to reflect these new metrics and provide a more detailed phase breakdown, removing the previous lock contention statistics.

The Python analysis scripts (sched_overhead_analysis.py, swimlane_converter.py) and the device_log_profiling.md documentation have been updated to parse and interpret the new profiling output, removing references to old metrics such as 'early_ready' and 'lock contention'. Review comments suggest adding comments to clarify profiling increments, using static for global profiling variables to limit their scope, and refining the warning message for mismatched task counts in swimlane_converter.py.
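The review above names the counters carried by the new PTO2CompletionStats struct. A hedged sketch of what such a struct and its aggregation might look like follows; only the counter names quoted in the review come from the PR, while the field types, initializers, and the `accumulate` helper are assumptions for illustration.

```cpp
#include <cstdint>

// Sketch of a per-completion stats struct like the PTO2CompletionStats
// described in the review (layout assumed, names from the review).
struct PTO2CompletionStats {
    uint64_t notify_edges_total = 0;    // fanout + fanin edges walked
    uint32_t notify_max_degree = 0;     // largest notify degree observed
    uint32_t notify_tasks_enqueued = 0; // tasks made ready by this completion
};

// Fold one completion's stats into a running aggregate, as a scheduler
// loop might do after each on_task_complete call: sums accumulate,
// the max degree takes the maximum.
inline void accumulate(PTO2CompletionStats& agg, const PTO2CompletionStats& s) {
    agg.notify_edges_total += s.notify_edges_total;
    if (s.notify_max_degree > agg.notify_max_degree)
        agg.notify_max_degree = s.notify_max_degree;
    agg.notify_tasks_enqueued += s.notify_tasks_enqueued;
}
```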
Force-pushed from 0510109 to d6f26b9
- Restore perf_aicpu_record_phase calls for all scheduler phases
- Replace old yield/orch_drain counters with notify/pop/idle metrics
- Upgrade DEV_ALWAYS output with per-phase breakdown and detailed stats
- Track both fanout and fanin edges in on_task_complete profiling stats
- Add Thread 3 prefix to orchestrator/TensorMap log output
- Add TensorMap lookup/insert profiling counters (PTO2_TENSORMAP_PROFILING)
- Rename PTO2_ORCH_PROFILING to PTO2_PROFILING
- Gate all profiling DEV_ALWAYS output behind runtime->enable_profiling
- Guard all per-phase perf recording with #if PTO2_PROFILING
- Unify release_fanin_and_check_ready into single method returning bool
- Add div-by-zero guards and task count validation
- Rewrite sched_overhead_analysis.py parser for new output format
- Update device_log_profiling.md examples
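One of the commit's fixes is adding div-by-zero guards to averaged metrics. A minimal sketch of that guard pattern is below; the function name and signature are illustrative assumptions, not the PR's actual code.

```cpp
#include <cstdint>

// Guarded average: an empty sample set yields 0.0 instead of dividing
// by zero, so per-phase averages stay safe when a phase was never entered.
inline double safe_avg(uint64_t total, uint64_t count) {
    return count == 0 ? 0.0
                      : static_cast<double>(total) / static_cast<double>(count);
}
```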
Force-pushed from d6f26b9 to 4e2f1db
Summary
- Restore `perf_aicpu_record_phase` calls for all scheduler phases (complete, dispatch, scan, idle) lost during scheduler API migration
- Add `on_task_complete` profiling stats to explain complete-phase overhead
- Add `Thread 3:` prefix to orchestrator/TensorMap log output for consistent device log format
- Add `PTO2_TENSORMAP_PROFILING` macro (default off)
- Rename `PTO2_ORCH_PROFILING` → `PTO2_PROFILING` (controls both orchestrator and scheduler)
- Gate all profiling DEV_ALWAYS output behind `runtime->enable_profiling`
- Guard per-phase perf recording with `#if PTO2_PROFILING` to ensure `PTO2_PROFILING=0` compiles
- Gate `on_task_complete` stats behind `#if PTO2_PROFILING` to avoid hot-path overhead when profiling is off
- Unify `release_fanin_and_check_ready` / `release_fanin_and_check_ready_counted` into single method returning bool
- Rewrite `sched_overhead_analysis.py` parser for new output format with fanout/fanin separation
- Update `device_log_profiling.md` examples

Closes #159
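The compile-time gating described above means that with `PTO2_PROFILING=0`, stat updates disappear entirely and the build still compiles. A hedged sketch of that pattern follows; the `PTO2_PROF_INC` helper is an assumption for illustration, only the `PTO2_PROFILING` flag name comes from the PR.

```cpp
// Default the flag on for this sketch; the real build system would set it.
#ifndef PTO2_PROFILING
#define PTO2_PROFILING 1
#endif

// When profiling is compiled in, the macro increments a counter; when it
// is compiled out, it expands to a no-op, so the hot path pays nothing.
#if PTO2_PROFILING
#define PTO2_PROF_INC(counter) ((counter)++)
#else
#define PTO2_PROF_INC(counter) ((void)0)
#endif
```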
Testing
- With `--enable-profiling`: profiling output appears
- Without `--enable-profiling`: profiling output suppressed
- `pytest tests -v` (21 passed)