Sanitizer test failures in symbolize_stack.cpp on AArch64 #55460

ilovepi · 2022-05-14T17:53:11Z

Fuchsia's clang CI has found a test failure in symbolize_stack.cpp for AArch64.

Due to another failure, we're not exactly sure when this first appeared, but the failing bot can be found here: https://luci-milo.appspot.com/ui/p/fuchsia/builders/toolchain.ci/clang-linux-arm64/b8814202726650721633/overview

For both UBSAN and LSAN, the test fails due to an unexpected warning causing matching to fail

==180122==WARNING: Symbolizer buffer too small

We saw the failure disappear briefly between 1ecc3d8 and c74753f, but I don't see anything in that blame list, or following blame list (7ff7001 to ac7a9ef) that suggests this behavior should be different.

The text was updated successfully, but these errors were encountered:

llvmbot · 2022-05-14T17:53:27Z

@llvm/issue-subscribers-bug

llvmbot · 2022-05-14T21:10:43Z

@llvm/issue-subscribers-backend-aarch64

ilovepi · 2022-05-16T17:03:42Z

This appears to be a more fundamental issue, since we're now occasionally seeing this on x64 linux bots as well.

https://luci-milo.appspot.com/ui/p/fuchsia/builders/toolchain.ci/clang-linux-x64/b8814193995900305713/overview

Some of the failing bots only have unrelated changes, such as a change to clangd (71cb8c8)

I'm looking at reproducing the issue since it also shows up on x64 linux. Since it seems to come and go, my worry is that there is some subtle race in the runtime code, or that a precondition may no longer hold because of a change elsewhere.

ilovepi · 2022-05-20T00:00:10Z

So after a lot of digging we determined the root cause of the issue. The test in symbolize_stack.cpp(

llvm-project/compiler-rt/test/sanitizer_common/TestCases/symbolize_stack.cpp

Line 30 in 1dfd8e9

A<10>().RecursiveTemplateFunction(0);

) calls a recursive template, designed to create a function with a long function name. The purpose of the test is to ensure that the symbolizer does not crash when the name is very long.

The SymbolizerProcess class defines a buffer of fixed length(16K)(

llvm-project/compiler-rt/lib/sanitizer_common/sanitizer_symbolizer_internal.h

Line 116 in 1dfd8e9

static const uptr kBufferSize = 16 * 1024;

), and uses that buffer during symbolization.

However, in ReadFromSymbolizer() (

llvm-project/compiler-rt/lib/sanitizer_common/sanitizer_symbolizer_libcdep.cpp

Line 520 in 1dfd8e9

while (true) {

) when reading into the buffer the symbolizer code checks if the buffer is full, and if so, basically aborts reading(

llvm-project/compiler-rt/lib/sanitizer_common/sanitizer_symbolizer_libcdep.cpp

Line 533 in 1dfd8e9

if (read_len + 1 == max_length) {

), and discards all data read so far. The buffer has the null terminator written to its first character, and a pointer to it is returned back up the call chain. Thus, when the symbolizer data is eventually flushed , it is truncated. The long function name in question can be over 45K long, far exceeding the size of the underlying buffer.

The test in symbolize_stack.cpp is trying to match a set of nested c++ vectors in FileCheck, essentially matching vector<.*vector.*vector<.*vector.*vector<. Now typically the test seems to have been lucky in how the diagnostic output makes it into FileCheck, and in general there are usually at least 5 nested vectors available in the output, and the test passes.

In our CI, though, we have some fairly long path names, and further there have been some changes to how demangling is done recently. I think the confluence of these factors started causing us to hit the test failure with more regularity in our CI.

I do want to point out that this doesn't appear to be a bug in the demangler, but to be a subtle design flaw in how online symbolization works. Typically when reading into a fixed buffer, you keep working until the buffer is full (or you reach the end of the input), and then process whatever you've already received and then after you can discard the old contents, you start filing the buffer again until you've reached the end of the input. Now this is code that is doing online symbolization, and it seems unlikely to me that this code would generally be allowed to block. Given that I believe blocking would be required to operate in the way I've described, I'm unsure of the correct course of action here.

Sure, we can increase the buffer size. Setting it to the size of a huge page seems like a reasonable compromise. But that only kind of punts the problem down the line. Maybe the fix here is to stop dropping the read contents, and instead allow the symbolizer to flush what it's already ready -- even though it may be incomplete. Afterall, it's already printed a warning that the buffer was to small.

All of the quick solutions I can see probably render the test useless.

increasing the buffer size means we'd need to also increase the number of recursions
reducing the number of recursions means we are no longer testing that long function names don't trigger a crash
matching on earlier function names may silently ignore crashes

Addresses tests flakes described in #55460 The test being updated can fail in FileCheck to match when given long enough stack traces. This can be problematic when file system paths become long enough to cause the majority of the long function name to become truncated. We found in our CI that the truncated output would often fail to match, thereby causing the test to fail when it should not. Here we change the test to match on sybolizer output that should be more reliable than matching inside the long function name. Reviewed By: vitalybuka Differential Revision: https://reviews.llvm.org/D126102

ilovepi · 2022-05-25T20:09:12Z

7f54399 lands a suppression to remove the flake. I'm going to leave this open until we land a more thorough fix of the underlying issue.

Addresses tests flakes described in llvm/llvm-project#55460 The test being updated can fail in FileCheck to match when given long enough stack traces. This can be problematic when file system paths become long enough to cause the majority of the long function name to become truncated. We found in our CI that the truncated output would often fail to match, thereby causing the test to fail when it should not. Here we change the test to match on sybolizer output that should be more reliable than matching inside the long function name. Reviewed By: vitalybuka Differential Revision: https://reviews.llvm.org/D126102

ilovepi added bug Indicates an unexpected problem or unintended behavior compiler-rt labels May 14, 2022

EugeneZelenko added the backend:AArch64 label May 14, 2022

ilovepi closed this as completed in acfeb1a Jun 7, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sanitizer test failures in symbolize_stack.cpp on AArch64 #55460

Sanitizer test failures in symbolize_stack.cpp on AArch64 #55460

ilovepi commented May 14, 2022

llvmbot commented May 14, 2022

llvmbot commented May 14, 2022

ilovepi commented May 16, 2022

ilovepi commented May 20, 2022

ilovepi commented May 25, 2022

Sanitizer test failures in symbolize_stack.cpp on AArch64 #55460

Sanitizer test failures in symbolize_stack.cpp on AArch64 #55460

Comments

ilovepi commented May 14, 2022

llvmbot commented May 14, 2022

llvmbot commented May 14, 2022

ilovepi commented May 16, 2022

ilovepi commented May 20, 2022

ilovepi commented May 25, 2022