Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(Unix?) Registering signals is not MT-safe #5271

Closed
MarijnS95 opened this issue Jun 6, 2023 · 1 comment · Fixed by #5272
Closed

(Unix?) Registering signals is not MT-safe #5271

MarijnS95 opened this issue Jun 6, 2023 · 1 comment · Fixed by #5272

Comments

@MarijnS95
Copy link
Contributor

MarijnS95 commented Jun 6, 2023

We recently upgraded our DXC build from 0392e60 to ea3623f, and are now observing spurious SEGFAULTs. Backtraces lead to the -ftime-trace profiler introduced in #4873, even though we never set this command-line parameter and never initialize the function, and the non-null pointer value it reads from TimeTraceProfilerInstance is similarly garbage, hence the segfault.

We dlopen("libdxcompiler.so", RTLD_LAZY | RTLD_LOCAL) multiple times in our process (resulting in the same mapped library!), and access the library from many threads at once, and that has already been causing concurrency problems with statics such as #4818.

Backtrace
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fc54ad0a0a8 in llvm::Entry::Entry (this=0x4864fe8941000000) at DirectXShaderCompiler/lib/Support/TimeProfiler.cpp:56
56	struct Entry {
[Current thread is 1 (Thread 0x7fc65cfff6c0 (LWP 178339))]
(gdb)
(gdb) bt
#0  0x00007fc54ad0a0a8 in llvm::Entry::Entry (this=0x4864fe8941000000) at DirectXShaderCompiler/lib/Support/TimeProfiler.cpp:56
#1  std::__new_allocator<llvm::Entry>::construct<llvm::Entry, llvm::Entry> (this=0x7fc54ad08ba0 <SignalHandler(int)>, __p=0x4864fe8941000000, __args=...) at /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/13.1.1/../../../../include/c++/13.1.1/bits/new_allocator.h:187
#2  std::allocator_traits<std::allocator<llvm::Entry> >::construct<llvm::Entry, llvm::Entry> (__a=..., __p=0x4864fe8941000000, __args=...) at /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/13.1.1/../../../../include/c++/13.1.1/bits/alloc_traits.h:537
#3  std::vector<llvm::Entry, std::allocator<llvm::Entry> >::emplace_back<llvm::Entry> (this=0x7fc54ad08ba0 <SignalHandler(int)>, __args=...) at /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/13.1.1/../../../../include/c++/13.1.1/bits/vector.tcc:117
#4  std::vector<llvm::Entry, std::allocator<llvm::Entry> >::push_back (this=0x7fc54ad08ba0 <SignalHandler(int)>, __x=...) at /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/13.1.1/../../../../include/c++/13.1.1/bits/stl_vector.h:1296
#5  llvm::TimeTraceProfiler::begin(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, llvm::function_ref<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>) (
    this=this@entry=0x7fc54ad08ba0 <SignalHandler(int)>, Name="Frontend", Detail=...) at DirectXShaderCompiler/lib/Support/TimeProfiler.cpp:72
#6  0x00007fc54ad09f85 in llvm::timeTraceProfilerBegin (Name=..., Detail=...) at DirectXShaderCompiler/lib/Support/TimeProfiler.cpp:172
#7  0x00007fc54b5d78eb in llvm::TimeTraceScope::TimeTraceScope (this=0x7fc65cff6b18, Name=..., Detail=...) at DirectXShaderCompiler/include/llvm/Support/TimeProfiler.h:58
#8  clang::ParseAST (S=..., PrintStats=false, SkipFunctionBodies=false) at DirectXShaderCompiler/tools/clang/lib/Parse/ParseAST.cpp:104
#9  0x00007fc54ad6aeb8 in clang::CodeGenAction::ExecuteAction (this=0x7fc65cff7148) at DirectXShaderCompiler/tools/clang/lib/CodeGen/CodeGenAction.cpp:807
#10 0x00007fc54aed10ae in clang::FrontendAction::Execute (this=0x7fc65cff7148) at DirectXShaderCompiler/tools/clang/lib/Frontend/FrontendAction.cpp:455
#11 0x00007fc54a929a7a in DxcCompiler::Compile (this=this@entry=0x7fc648973920, pSource=pSource@entry=0x7fc65cff7db0, pArguments=pArguments@entry=0x564f5b60b900, argCount=20, pIncludeHandler=0x7fc648838b20, riid=..., ppResult=0x7fc65cff7d60)
    at DirectXShaderCompiler/tools/clang/tools/dxcompiler/dxcompilerobj.cpp:1061
#12 0x00007fc54a924afb in hlsl::DxcCompilerAdapter::WrapCompile (this=<optimized out>, bPreprocess=<optimized out>, pSource=<optimized out>, pSourceName=<optimized out>, pEntryPoint=<optimized out>, pTargetProfile=0x7fc648985610 L"cs_6_6", pArguments=<optimized out>,
    argCount=<optimized out>, pDefines=<optimized out>, defineCount=<optimized out>, pIncludeHandler=<optimized out>, ppResult=<optimized out>, ppDebugBlobName=<optimized out>, ppDebugBlob=<optimized out>)
    at DirectXShaderCompiler/tools/clang/tools/dxcompiler/dxcompilerobj.cpp:1869
#13 0x00007fc54a925b2f in hlsl::DxcCompilerAdapter::CompileWithDebug (this=0x7fc65cff6a20, pSource=0x8, pSourceName=0x7fc65cff6a30 L"\x48979600翆\x48978b80翆\x222b4300\xdc25a39c\b", pEntryPoint=0x0, pTargetProfile=0x1 <error: Cannot access memory at address 0x1>,
    pArguments=0x7fc65cff6a20, argCount=<optimized out>, pDefines=<optimized out>, defineCount=<optimized out>, pIncludeHandler=<optimized out>, ppResult=<optimized out>, ppDebugBlobName=<optimized out>, ppDebugBlob=<optimized out>)
    at DirectXShaderCompiler/tools/clang/tools/dxcompiler/dxcompilerobj.cpp:1773
#14 0x00007fc54a9264b8 in hlsl::DxcCompilerAdapter::Compile (this=0x7fc65cff6a20, pSource=0x7fc65cff6a90, pSourceName=0x8 <error: Cannot access memory at address 0x8>, pEntryPoint=0x7fc65cff6a30 L"\x48979600翆\x48978b80翆\x222b4300\xdc25a39c\b", pTargetProfile=0x0,
    pArguments=0x1, argCount=11, pDefines=0x7fc55ce19830, defineCount=2, pIncludeHandler=0x7fc648838b20, ppResult=0x7fc65cff8970) at DirectXShaderCompiler/tools/clang/tools/dxcompiler/dxcompileradapter.h:69
#15 0x0000564f58ae170e in hassle_rs::wrapper::DxcCompiler::compile (self=<optimized out>, blob=<optimized out>, source_name=..., entry_point=..., target_profile=..., args=..., include_handler=..., defines=...) at src/wrapper.rs:271

Notice how the above stacktrace says the this pointer for TimeTraceProfiler (acquired from TimeTraceProfilerInstance) is suddenly a SignalHandler<int> object. Printing out and comparing the pointers shows that TimeTraceProfilerInstance sits directly after RegisteredSignalInfo in memory:

(gdb) p &RegisteredSignalInfo + 1
$21 = (struct {...} (*)[16]) 0x7fc54beb78e0 <llvm::TimeTraceProfilerInstance>
(gdb) p &TimeTraceProfilerInstance
$22 = (llvm::TimeTraceProfiler **) 0x7fc54beb78e0 <llvm::TimeTraceProfilerInstance>
(gdb) p &RegisteredSignalInfo + 1 == &TimeTraceProfilerInstance
$23 = true

RegisteredSignalInfo is an array of 16 elements, yet when planting some printfs in RegisterHandler() NumRegisteredSignals runs up to 32 or 48 instead of 16, concurrently from two/three threads. This is not caught by the assert() because those are disabled in release builds (and causing other issues I'll file shortly). It works most (±60%) of the time, likely because one of the threads gets to updating NumRegisteredSignals quick enough before any other checks for 0:

// If the handlers are already registered, we're done.
if (NumRegisteredSignals != 0) return;

Presumably this needs to be guarded with a lock or call_once? Additionally the new time profiler pointer might have to be guarded by a lock or call_once as well.


I'm curious why this wasn't surfacing for us earlier: are the handlers registered differently/more often now? Signals.inc hasn't changed for a very long time, and perhaps RegisteredSignalInfo was previously followed by more-or-less irrelevant static data members?

(We do have a longstanding issue where ctrl+c'ing our application after it compiled lots of shaders runs into a jmp to an invalid (or null) memory address from a signal handler, but have never figured out where it was coming from: that could be related to this.)


And how about enabling asserts in release builds? That would have preempted lots of digging as I was originally dismissing this assert as irrelevant "debug" noise (🤦) while debugging on a release build to match/repro our CI environment.

@MarijnS95
Copy link
Contributor Author

We already dereference SignalMutex and might as well just lock it there: no other function locks it concurrently before calling RegisterHandlers().

// We need to dereference the signals mutex during handler registration so
// that we force its construction. This is to prevent the first use being
// during handling an actual signal because you can't safely call new in a
// signal handler.
*SignalsMutex;

This is in fact exactly what upstream LLVM ended up doing: llvm-mirror/llvm@fa92ec8

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant