Update: progress-aware ring buffer spin detection with env-configurable sizes #275
Summary of Changes (Gemini Code Assist)
This pull request significantly enhances the robustness and diagnostic capabilities of the dependency pool management system. By introducing explicit reclamation and deadlock detection mechanisms, it addresses potential silent hangs and provides clearer guidance for capacity issues. Additionally, the expanded test suite for paged attention ensures that the system is thoroughly validated under diverse and demanding conditions, improving overall stability and developer experience.
Code Review
This pull request introduces important improvements for dependency pool management, including deadlock detection and more robust error reporting. The core change is the pto2_dep_pool_ensure_space function, which proactively checks for available space in the dependency pool and spin-waits if necessary, with a timeout to detect and report deadlocks. The reclamation logic has also been nicely refactored into a reusable helper. The changes are applied to both a2a3 and a5 runtime variants, and I've noted the significant code duplication between them as a point for future improvement. Additionally, the new paged attention test cases are a valuable addition for stress-testing the system under various loads. My review includes suggestions to prevent potential division-by-zero issues in the new error logging paths.
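The space check, bounded spin-wait, and division-by-zero guard described in this review can be sketched roughly as follows. This is only an illustration: `DepPoolSketch`, `usage_percent`, and this `ensure_space` are hypothetical names, not the actual `pto2_dep_pool_ensure_space` implementation, and the reclaim callback stands in for whatever the runtime does to free entries.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical stand-in for the dep pool; field names are illustrative only.
struct DepPoolSketch {
    int32_t capacity;  // total number of entries
    int32_t used;      // entries currently held
    int32_t available() const { return capacity - used; }
};

// Usage percentage for diagnostics. Returns 0 for a zero (or negative)
// capacity so the error-logging path never divides by zero.
inline int32_t usage_percent(const DepPoolSketch& pool) {
    if (pool.capacity <= 0) return 0;
    return static_cast<int32_t>(static_cast<int64_t>(pool.used) * 100 / pool.capacity);
}

// Spin until `needed` entries are free, calling `reclaim` on every iteration.
// Returns false once `spin_limit` iterations have passed without enough space,
// at which point the caller would report a suspected deadlock.
inline bool ensure_space(DepPoolSketch& pool, int32_t needed, int64_t spin_limit,
                         const std::function<void(DepPoolSketch&)>& reclaim) {
    assert(needed >= 0);
    for (int64_t spins = 0; pool.available() < needed; ++spins) {
        if (spins >= spin_limit) return false;
        reclaim(pool);
    }
    return true;
}
```

A caller would pass a reclaim function that retires completed entries; when `ensure_space` returns false, the diagnostic path can log `usage_percent(pool)` safely even if the pool was created with zero capacity.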
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
81378cb to b9b2c45
579855e to 5edaa65
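The new deadlock/overflow "Solution" text does not reflect the runtime knobs you added. The fatal logs for dep pool / task window / heap still mainly tell the user to change macros (PTO2_*_SIZE), but you have already introduced runtime overrides (at least PTO2_RING_DEP_POOL / PTO2_RING_TASK_WINDOW / PTO2_RING_HEAP on a2a3). That is unfriendly to people troubleshooting on site.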
…le sizes

- Add progress-aware spin detection to all three ring buffers (TaskRing, HeapRing, DepPool): reset spin counter when last_task_alive or heap_tail advances, only report deadlock when truly stuck
- Revert PTO2_TASK_WINDOW_SIZE from 131072 to 65536 (progress-aware detection eliminates the doubled window workaround from PR hw-native-sys#273)
- Add PTO2_RING_DEP_POOL env var for runtime dep pool size control, completing the set with PTO2_RING_TASK_WINDOW and PTO2_RING_HEAP
- Thread dep_pool_capacity through runtime creation APIs
- Extract dep pool reclamation into reusable pto2_dep_pool_reclaim()
- Guard against division by zero in dep pool diagnostic logging
- Update all deadlock/overflow Solution messages to show both compile-time macro and runtime env var (PTO2_RING_HEAP/TASK_WINDOW/DEP_POOL) overrides
- Add paged_attention_ringbuffer test with small ring sizes (TW=1024, HP=1MB, DP=1024) to guard rotation/reclamation logic
- Add 3 new paged attention test cases (batch=512, context=16384, batch=32)
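The progress-aware idea in the first bullet (reset the spin counter whenever a progress indicator advances, and report deadlock only after a full run of spins with no movement) might look roughly like the sketch below. `ProgressAwareSpin` and its members are hypothetical names for illustration; the real rings watch `last_task_alive` or `heap_tail`, and the actual spin limits are not shown in this PR page.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: tracks one progress indicator (for example a tail
// index) and counts consecutive spins during which it did not change.
class ProgressAwareSpin {
public:
    ProgressAwareSpin(int64_t initial_progress, int64_t spin_limit)
        : last_progress_(initial_progress), spin_limit_(spin_limit) {
        assert(spin_limit > 0);
    }

    // Call once per failed allocation attempt with the current indicator.
    // Returns true only when `spin_limit` consecutive calls saw no progress.
    bool tick(int64_t current_progress) {
        if (current_progress != last_progress_) {
            // The consumer moved: the system is slow, not stuck.
            last_progress_ = current_progress;
            spins_ = 0;
            return false;
        }
        return ++spins_ >= spin_limit_;
    }

    int64_t spins() const { return spins_; }

private:
    int64_t last_progress_;
    int64_t spin_limit_;
    int64_t spins_ = 0;
};
```

Compared with a plain spin counter, a long but advancing workload never trips the limit, which is the property that lets the task window return to 65536 according to the commit message.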
5edaa65 to c5adb6e
LOG_ERROR(" task_window_size parameter.");
LOG_ERROR(" Increase task window size (current: %d, recommended: %d)", window_size, active_count * 2);
LOG_ERROR(" Compile-time: PTO2_TASK_WINDOW_SIZE in pto_runtime2_types.h");
LOG_ERROR(" Runtime env: PTO2_RING_TASK_WINDOW=<power-of-2> (e.g. %d)", active_count * 2);
active_count * 2 -> total_count * 2, to be fixed later
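Since the log line asks for a power-of-2 value, one way the suggested number could be made to satisfy that is to round the doubled count up, as in the sketch below. Both helper names are hypothetical, and whether the runtime rounds the recommendation at all is an assumption, not something this PR states.

```cpp
#include <cassert>
#include <cstdint>

// Smallest power of two that is >= value. Valid for value in [1, 2^30].
inline int32_t next_power_of_two(int32_t value) {
    assert(value >= 1 && value <= (1 << 30));
    int32_t result = 1;
    while (result < value) result <<= 1;
    return result;
}

// Hypothetical helper: doubles the task count (the review suggests this
// should be total_count rather than active_count) and rounds the result up
// so it is a valid power-of-2 window size.
inline int32_t recommended_window(int32_t task_count) {
    assert(task_count >= 1 && task_count <= (1 << 29));
    return next_power_of_two(task_count * 2);
}
```

For example, a count of 700 doubles to 1400, which rounds up to 2048.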
…HEAD)

Synchronize A5 tensormap_and_ringbuffer runtime and platform with a2a3 improvements introduced after 56a2c61. Follows the sync pattern established in hw-native-sys#250 and hw-native-sys#300.

Platform (src/a5/platform/):
- 2f58a2f (hw-native-sys#267): add AICPU thread affinity (platform_aicpu_affinity.h/cpp), PLATFORM_MAX_AICPU_THREADS_JUST_FOR_LAUNCH, device_runner, kernel.cpp, CMakeLists.txt
- b903e7b: sync perf_profiling.h for multi-ring support
- 334d355 (hw-native-sys#254): sync performance_collector_aicore.h for slim dispatch

Runtime host_build_graph (src/a5/runtime/host_build_graph/):
- 334d355 (hw-native-sys#254): slim dispatch payload in aicore_executor.cpp
- dd7ada4: standardize register init and exit handshake in aicore_executor.cpp
- 2f58a2f (hw-native-sys#267): AICPU affinity gate in aicpu_executor.cpp

Runtime tensormap_and_ringbuffer (src/a5/runtime/tensormap_and_ringbuffer/):
- e2e38b9 (hw-native-sys#249): cluster-based mixed-task dispatch; add pto_submit_types.h and SUBMIT_BY_CLUSTER.md
- a842263 (hw-native-sys#255): separate local ready queue by CoreType in pto_scheduler.h
- cf6462c (hw-native-sys#268): consolidate per-task state into PTO2TaskSlotState (pto_runtime2_types.h, pto_scheduler.cpp, pto_orchestrator.cpp)
- b903e7b: multi-ring buffer architecture (pto_shared_memory, MULTI_RING.md, aicpu_executor.cpp, perf_profiling.h)
- 5d92137 (hw-native-sys#264): DepListPool ring buffer reclamation (pto_ring_buffer.h/cpp)
- 54d082c (hw-native-sys#281): replace task_id with slot-state pointer across scheduler, orchestrator, ring buffer, executor, RUNTIME_LOGIC.md
- d305376 (hw-native-sys#277): add scope deadlock detection in pto_orchestrator
- 1e41a3a (hw-native-sys#274): per-thread orchestrator phase profiling
- f5da078 (hw-native-sys#275): progress-aware ring buffer spin detection (pto_ring_buffer.h, pto_orchestrator.cpp, runtime_maker.cpp)
- 10f6415 (hw-native-sys#284): tighten PTO2_PROFILING macro guards; sync profiling_levels.md
- 9c158e0 (hw-native-sys#291): emergency shutdown on fatal error (aicpu_executor, pto_orchestration_api.h, pto_orchestrator, pto_shared_memory)
- 94f39ff (hw-native-sys#301): refactor PTOParam to aggregated container with parallel arrays (pto_types.h, pto_runtime2_types.h, pto_scheduler, pto_shared_memory, pto_tensormap, pto_orchestrator, runtime2)
- 15e6034 (hw-native-sys#308): refactor Tensor fields and pto_tensormap for cache locality
- 77a81aa (hw-native-sys#306): replace PTOParam assert with orchestration error handling

Examples & tests (examples/a5/, tests/device_tests/a5/):
- 8cf8981 (hw-native-sys#293): replace PipeSyncFunc with FULL_MEMORY_BARRIER in kernels
- b88eed3 (hw-native-sys#302): optimize paged attention pipeline, eliminate GM round-trips
- 94f39ff (hw-native-sys#301) + 15e6034 (hw-native-sys#308): update orchestration to new PTOParam API
Summary
- Add progress-aware spin detection to all three ring buffers (TaskRing, HeapRing, DepPool): reset spin counter when last_task_alive or heap_tail advances, only report deadlock when truly stuck
- Revert PTO2_TASK_WINDOW_SIZE from 131072 to 65536 (progress-aware detection eliminates the doubled window workaround from PR #273, "Fix: support batch=256 by doubling task window and fixing deferred release")
- Add PTO2_RING_DEP_POOL env var for runtime dep pool size control, completing the set with PTO2_RING_TASK_WINDOW and PTO2_RING_HEAP
- Thread dep_pool_capacity through runtime creation APIs
- Extract dep pool reclamation into reusable pto2_dep_pool_reclaim()
- Update all deadlock/overflow Solution messages to show both compile-time macro and runtime env var (PTO2_RING_HEAP / PTO2_RING_TASK_WINDOW / PTO2_RING_DEP_POOL) overrides for easier on-site troubleshooting
- Add paged_attention_ringbuffer test with small ring sizes (TW=1024, HP=1MB, DP=1024) to guard rotation/reclamation logic

Testing
- paged_attention_ringbuffer Case1 (batch=32) passes with 1MB heap (empirically verified minimum)
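For readers unfamiliar with env-var overrides of compile-time sizes, the pattern behind PTO2_RING_TASK_WINDOW, PTO2_RING_HEAP, and PTO2_RING_DEP_POOL can be sketched as below. The function names are hypothetical, and the validation shown (decimal digits only, power of two, silent fallback to the compile-time default) is an assumption for illustration; the PR page does not show how the runtime actually parses or validates these variables.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// True when v is a nonzero power of two.
inline bool is_power_of_two(uint64_t v) { return v != 0 && (v & (v - 1)) == 0; }

// Parse `text` as a decimal size. Returns `fallback` when the text is null,
// empty, negative, has trailing characters, or is not a power of two.
inline uint64_t parse_ring_size(const char* text, uint64_t fallback) {
    if (text == nullptr || *text == '\0' || *text == '-') return fallback;
    char* end = nullptr;
    unsigned long long value = std::strtoull(text, &end, 10);
    if (end == text || *end != '\0' || !is_power_of_two(value)) return fallback;
    return value;
}

// Read the override from the environment variable `name`; `fallback` would be
// the compile-time macro value (for example PTO2_TASK_WINDOW_SIZE).
inline uint64_t ring_size_from_env(const char* name, uint64_t fallback) {
    assert(name != nullptr);
    return parse_ring_size(std::getenv(name), fallback);
}
```

Under these assumptions, launching with `PTO2_RING_TASK_WINDOW=1024` in the environment would make `ring_size_from_env("PTO2_RING_TASK_WINDOW", 65536)` return 1024, while an unset or malformed value keeps the compiled default.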