Skip to content

Fix: harden ring heap boundary checks and add allocator diagnostics#544

Merged
poursoul merged 1 commit into
hw-native-sys:mainfrom
yanghaoran29:fix-heap
Apr 14, 2026
Merged

Fix: harden ring heap boundary checks and add allocator diagnostics#544
poursoul merged 1 commit into
hw-native-sys:mainfrom
yanghaoran29:fix-heap

Conversation

@yanghaoran29

Copy link
Copy Markdown
Contributor

Disallow full-gap allocation when tail - top == alloc_size to preserve unambiguous top/tail semantics, and add targeted allocator logs for wrap-around and allocation-failure paths to simplify field debugging.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces logging for allocation events in the PTO2TaskAllocator and modifies the allocation boundary condition in both a2a3 and a5 runtimes. It also removes a heap configuration from a test file. The review feedback identifies that the added LOG_INFO statements in the allocation failure paths will cause significant performance degradation and log spam because they are executed within a spin loop. It is recommended to remove these logs as diagnostic information is already captured by existing deadlock reporting mechanisms.

Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.h
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.h
Comment thread src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.h
Comment thread src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.h
Disallow full-gap allocation when `tail - top == alloc_size` to preserve unambiguous top/tail semantics, and add targeted allocator logs for wrap-around and allocation-failure paths to simplify field debugging.

Made-with: Cursor
@yanghaoran29

Copy link
Copy Markdown
Contributor Author

Ring Buffer Heap Fix: Resolve Ambiguity of top == tail

Background

In the sample simpler/tests/st/a2a3/tensormap_and_ringbuffer/batch_paged_attention, the test fails when heap_size=256 MiB and passes when heap_size=1 GiB.

Theoretical Analysis

Theoretical Maximum Memory (Extreme Case: Allocation Only, No Reclamation)

Assume heap_ring only performs allocation without reclamation during runtime. The peak memory can be estimated by the following formulas:

  • IN_CORE_BATCH = 16
  • q_tile = min(num_heads, 128)
  • N = chunk_bc * q_tile
  • max_bn = ceil(context_len / block_size)
  • Per-chunk persistent accumulation area (oi_batch/li_batch/mi_batch):
    • M_acc = N * (4 * head_dim + 8) bytes
  • Total temporary allocation for a single bn (sij/pij/mij/lij/oi_new):
    • M_bn = N * (6 * block_size + 4 * head_dim + 8) bytes
  • Theoretical maximum memory:
    • M_max = q_loop * num_chunks * (M_acc + max_bn * M_bn)

Substitute the 3 cases from golden.py:

  • Case1 (batch=256,num_heads=16,head_dim=128,block_size=128,context_len=8192)
    M_max = 339,771,392 B = 324.03 MiB
  • Case2 (batch=64,num_heads=64,head_dim=128,block_size=64,context_len=8192)
    M_max = 476,086,272 B = 454.03 MiB
  • Case3 (batch=64,num_heads=64,head_dim=256,block_size=64,context_len=8192)
    M_max = 746,618,880 B = 712.03 MiB

Conclusion: The theoretical upper bounds of all three cases fall between 256 MiB and 1 GiB. If the memory reclamation mechanism is abnormal, the phenomenon of failure at 256 MiB and success at 1 GiB is reasonable.

Theoretical Minimum Memory (Ideal Case: Timely Reclamation)

Assume heap_ring reclaims memory fully and timely. The peak memory is determined by the persistent accumulation area + single-step pipeline peak:

  • M_min = M_acc + max(sij, sij+pij+mij+lij, pij+mij+lij+oi_new)
  • Where:
    • sij = N * block_size * 4
    • pij = N * block_size * 2
    • mij+lij = N * 8
    • oi_new = N * head_dim * 4

Calculations for the 3 cases:

  • Case1: 331,776 B = 0.316 MiB
  • Case2: 1,196,032 B = 1.141 MiB
  • Case3: 2,244,608 B = 2.141 MiB

Conclusion: The theoretical lower bound is only at the MiB level, far less than 256 MiB. Therefore, the sample failure is not directly caused by insufficient theoretical minimum memory.

Error Reproduction

Add logging for a specific heap_top_ value (e.g., 69111808) in PTO2TaskAllocResult alloc(int32_t output_size):

if (heap_top_ == 69111808) {
    LOG_ERROR(
        "alloc request: size=%d bytes, aligned=%" PRIu64 " bytes, heap_top=%" PRIu64 ", heap_tail=%" PRIu64,
        output_size, aligned_size, heap_top_, heap_tail_
    );
}

Sample log:

[pto_ring_buffer.h:134] alloc request: size=131072 bytes, aligned=131072 bytes, heap_top=69111808, heap_tail=329728
[pto_ring_buffer.h:134] alloc request: size=131072 bytes, aligned=131072 bytes, heap_top=69111808, heap_tail=61659136

The first log corresponds to the normal state; the second log shows that allocation can still proceed after heap_tail changes while heap_top remains unchanged. This indicates an ambiguity issue after ring wrapping in the ring buffer state, leading to incorrect behavior.

Note: It is not recommended to print logs for every alloc call. High-frequency logs significantly alter the timing, which may grant heap_tail more reclamation time and cause false positives.

Root Cause Analysis

In the heap ring allocator of tensormap_and_ringbuffer, heap_top_ denotes the allocation cursor, and heap_tail_ denotes the reclamation boundary. The core issue is that the original condition for the top < tail branch is too permissive:

  • Original logic: if (tail - top >= alloc_size) { ... }

Allocation is allowed even when tail - top == alloc_size, resulting in top == tail after allocation. In ring buffer semantics, this state can represent either exact alignment or value equality after one full lap, which is indistinguishable by numerical value alone and causes state ambiguity.

Illustrated with a simplified example (heap_size=16, allocation size = 4 per time):

  • After task 1: top=4, tail=0
  • After task 2: top=8, tail=0
  • After task 3: top=12, tail=0
  • Reclaim task1: top=12, tail=4
  • After task 4: top=16, tail=4
  • Reclaim task2: top=16, tail=8
  • After task 5: top=4, tail=8
  • After task 6: top=8, tail=8 (numerical overlap of top and tail)
  • After task 7: top=12, tail=8 (incorrect allocation may continue and overwrite old data in use)

Fix Details

Fix Principle: When top < tail, at least 1 byte of gap must be reserved after allocation; exact full utilization is prohibited.

Code Changes

The same fix is applied to the following two files:

  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.h
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.h

Core modification:

  • tail - top >= alloc_size
    changed to
    tail - top > alloc_size

Logging Addition

To facilitate locating boundary behaviors of the ring allocator, two types of logs are added to the same function:

  • Printed before return nullptr in try_bump_heap() for allocation failures, including top, tail, alloc_size and available space fields.
  • Printed when the wrap-around allocation branch else if (tail > alloc_size) is hit, recording key states during wrap-around allocation.

Affected files:

  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.h
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.h

Post-Fix Semantics

  • When top < tail, allocation is allowed only if the strict greater-than (>) condition is satisfied.
  • The wrap-around path for top >= tail inherently uses tail > alloc_size, consistent with the semantics after this fix.
  • By reserving a minimal gap, unambiguous judgment of the ring buffer is restored, avoiding conflicting top == tail states.

@poursoul poursoul merged commit c83754f into hw-native-sys:main Apr 14, 2026
13 checks passed
@yanghaoran29 yanghaoran29 deleted the fix-heap branch April 14, 2026 07:27
luohuan19 pushed a commit that referenced this pull request Apr 14, 2026
…544)

Disallow full-gap allocation when `tail - top == alloc_size` to preserve unambiguous top/tail semantics, and add targeted allocator logs for wrap-around and allocation-failure paths to simplify field debugging.

Made-with: Cursor
@yanghaoran29 yanghaoran29 moved this to Done in pto project Apr 17, 2026
@yanghaoran29 yanghaoran29 self-assigned this Apr 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

2 participants