Skip to content

Missed optimization: Clang keeps large fixed local array alive across recursive calls, wasting stack space #171606

@SkyWaveSun

Description

@SkyWaveSun

Summary

Clang keeps a large fixed-size local array in the stack frame of a recursive function alive across recursive calls, even though the array is only used after both recursive calls return and its address does not escape into them.

This leads to stack usage of roughly depth * sizeof(array) instead of something close to sizeof(array) + O(depth) for a classic mergesort implementation, and seems to be a missed optimization in Clang/LLVM's lifetime analysis / stack allocation.

Environment

  • Clang version: 21.1.0
  • Platform: x86_64 (as used by Compiler Explorer / godbolt.org)
  • Command line used on Compiler Explorer:
  clang++ -std=c++23 -O3 -S mgsort.cpp -o mgsort.s

(On Compiler Explorer this corresponds to choosing Clang and passing -std=c++23 -O3.)

Test case

#include <algorithm>
#include <cstring>

constexpr int N = 100000 + 1;

int a[N];

void mgsort(int l, int r) {
    if (l == r) {
        return ;
    }
    int mid = (l + r) >> 1;
    mgsort(l, mid);
    mgsort(mid + 1, r);

    int b[N];
    std::merge(a + l, a + mid + 1, a + mid + 1, a + r + 1, b + l);
    std::memcpy(a + l, b + l, sizeof(int) * (r - l + 1));
}

This is intended to be standard C++ (no VLA).

Actual behavior

For non‑leaf calls, Clang allocates the full int b[N] array in the prologue of mgsort before making any recursive calls and keeps this stack space reserved until the function returns.

On x86_64 with -O3, the generated assembly looks like (simplified):

mgsort(int, int):
    mov     eax, esi
    sub     eax, edi
    je      .LBB0_14               # base case: l == r

    push    rbp
    push    r15
    push    r14
    push    r13
    push    r12
    push    rbx
    sub     rsp, 400024            # reserve ~400k of stack for b[N]

    ... compute mid, save l / r ...

    call    mgsort                 # recursive call on left half
    ...
    call    mgsort                 # recursive call on right half

    ... later, compute the address of b + l on the stack ...
    lea     r14, [rsp + 4*rax]
    add     r14, 16                # r14 = b + l (b is based at rsp+16)

    ... merge + memcpy ...

    add     rsp, 400024
    pop     rbx
    pop     r12
    pop     r13
    pop     r14
    pop     r15
    pop     rbp
.LBB0_14:
    ret

So every recursive frame reserves a full b[N] array on its stack frame, and this space is live across the recursive calls. With N = 100000 + 1, sizeof(b) is about 400 kB; the recursion depth is O(log N) (around 17 here), so the total stack consumed by this single variable is roughly 17 * 400kB ≈ 6–7 MB.

However, b is only accessed after both recursive calls have returned, and there is no way for a recursive call to access its caller's b.

Expected behavior

Since the first use of b is strictly after both recursive calls, and no pointer to b is passed into those calls (nor stored anywhere that
they could access), it should be legal for the compiler to shrink the allocation lifetime of b so that its storage does not overlap the recursive calls.

Conceptually, the compiler could:

  • move the actual stack allocation for b from the function prologue to just before the std::merge / memcpy code, and
  • free it immediately after memcpy returns.

With such an optimization, at run time there would only be a single active instance of b on the stack at any time, reducing stack usage from approximately depth * sizeof(b) to about sizeof(b) + O(depth).

This appears to be a missed optimization in Clang/LLVM's handling of large fixed-size local arrays in recursive functions.

Additional notes

  • The issue reproduces for me on Compiler Explorer (see the link above) with -std=c++23 -O3. It also appears with other high optimization levels (e.g. -O3 vs -Ofast) where stack allocation is similar.

  • A related GCC bug report with the same test case is: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123082

In that report, GCC developers classify this as a middle-end missed-optimization and suggest a workaround using a GNU C++ VLA extension. The intention here is similar: to see whether LLVM/Clang could perform a comparable lifetime-shrinking optimization for the standard C++ fixed-array case, without relying on extensions.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions