[Clang] stack frame is way too large in coroutine at low optimization levels #57638
Description
Our internal build system just enabled opaque pointers by removing -Xclang=-no-opaque-pointers from our build arguments. When this happened I noticed a regression: at the default optimization level, clang gives coroutine resume functions much larger stack frames than necessary.
Here is a simple program with a function ArrayOnCoroutineFrame that creates a large local array that must go on the coroutine frame because it may need to survive a suspension:
#include <array>
#include <coroutine>
#include <optional>
struct MyTask {
  struct promise_type {
    MyTask get_return_object() { return {std::coroutine_handle<promise_type>::from_promise(*this)}; }
    std::suspend_always initial_suspend() { return {}; }
    void unhandled_exception();
    void return_void() {}
    auto await_transform(MyTask task) {
      struct Awaiter {
        bool await_ready() { return false; }
        std::coroutine_handle<promise_type> await_suspend(std::coroutine_handle<promise_type> h) {
          caller.resume_when_done = h;
          return std::coroutine_handle<promise_type>::from_promise(callee);
        }
        void await_resume() {
          std::coroutine_handle<promise_type>::from_promise(callee).destroy();
        }
        promise_type& caller;
        promise_type& callee;
      };
      return Awaiter{*this, task.handle.promise()};
    }
    auto final_suspend() noexcept {
      struct Awaiter {
        bool await_ready() noexcept { return false; }
        std::coroutine_handle<promise_type> await_suspend(std::coroutine_handle<promise_type> h) noexcept {
          return to_resume;
        }
        void await_resume() noexcept;
        std::coroutine_handle<promise_type> to_resume;
      };
      return Awaiter{resume_when_done};
    }
    // The coroutine to resume when we're done.
    std::coroutine_handle<promise_type> resume_when_done;
  };
  // A handle for the coroutine that returned this task.
  std::coroutine_handle<promise_type> handle;
};
MyTask DoSomething();
MyTask ArrayOnCoroutineFrame() {
  std::array<std::optional<int>, 10'000> vals;
  for (auto& val : vals) {
    (void)val;
    co_await DoSomething();
  }
}
When compiled with -std=c++20 -Xclang=-no-opaque-pointers, clang correctly observes that ArrayOnCoroutineFrame.resume needs only a small stack size, since the array is on the coroutine frame:
ArrayOnCoroutineFrame() [clone .resume]: # @ArrayOnCoroutineFrame() [clone .resume]
push rbp
mov rbp, rsp
sub rsp, 304
mov qword ptr [rbp - 168], rdi # 8-byte Spill
mov qword ptr [rbp - 8], rdi
[...]
But when you compile with just -std=c++20 it fails to do this, giving it a huge stack frame despite the fact that it does seem to build the array on the coroutine frame:
ArrayOnCoroutineFrame() [clone .resume]: # @ArrayOnCoroutineFrame() [clone .resume]
push rbp
mov rbp, rsp
sub rsp, 80368
mov qword ptr [rbp - 80240], rdi # 8-byte Spill
mov qword ptr [rbp - 8], rdi
mov rax, rdi
add rax, 80081
mov qword ptr [rbp - 80232], rax # 8-byte Spill
mov rax, rdi
add rax, 80
mov qword ptr [rbp - 80224], rax # 8-byte Spill
[...]
mov rdi, qword ptr [rbp - 80224] # 8-byte Reload
call std::array<std::optional<int>, 10000ul>::array() [base object constructor]
[...]
This is not a bug so much as a missed optimization. I don't know if it's reasonable to expect this optimization to be done at the default optimization level, but it was done before opaque pointers shipped so it's sort of a regression. Is it easy to make it work again?