-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Port stack tracing to Windows #168
Conversation
Also note: I updated the |
13d0440
to
c2b2dde
Compare
This relies on an updated `glog-da816ea70.tar.gz` tarball in the Mesos 3rdparty repository, which is itself generated from the `da816ea70` commit on github.com/google/glog. Additionally, this applies a patch that includes stacktrace support to glog on Windows. The code in this patch is included in an upstream pull request: google/glog#168 If and when that pull request gets merged, we will update the glog tarball and remove the need for the extra patch file. Review: https://reviews.apache.org/r/57490/
This relies on an updated `glog-da816ea70.tar.gz` tarball in the Mesos 3rdparty repository, which is itself generated from the `da816ea70` commit on github.com/google/glog. Additionally, this applies a patch that includes stacktrace support to glog on Windows. The code in this patch is included in an upstream pull request: google/glog#168 If and when that pull request gets merged, we will update the glog tarball and remove the need for the extra patch file. Review: https://reviews.apache.org/r/57490/
Note: The contents of this PR have been applied to the Windows build of Apache Mesos (apache/mesos@ef3ebf3). |
Ping @ukai 😄 |
@ukai can a maintainer please take a look? |
Adding @sergiud for a possible review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, could you update AUTHORS and CONTRIBUTORS as well (see CONTRIBUTING.md)?
src/signalhandler.cc
Outdated
@@ -145,7 +145,7 @@ void DumpTimeInfo() { | |||
void DumpStackFrameInfo(const char* prefix, void* pc) { | |||
// Get the symbol name. | |||
const char *symbol = "(unknown)"; | |||
char symbolized[1024]; // Big enough for a sane symbol. | |||
char symbolized[1024] = {}; // Big enough for a sane symbol. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't understand why this change is necessary. When Symbolize returns true, SymbolizeAndDemangle fills something to this buffer on both Windows and non-Windows?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps this was not the case on Linux, but on Windows, the Symbolize()
only filled the buffer as far as the symbol took. So you'd run into this kind of problem:
[uninitializedbufferofrandomdata]
[main()ializedbufferofrandomdata]
[someLongSymbol()ferofrandomdata]
[short()gSymbol()ferofrandomdata]
Since the buffer was not being initialized, but the same data location was being reused and only partially filled out, previous longer symbols left gunk in buffer with new shorter symbols.
I solved this by ensuring all elements were initialized to zero each time DumpStackFrameInfo
was called. In retrospect, if I ensure a the terminating null is added manually, this becomes unnecessary.
src/symbolize.cc
Outdated
symbol->MaxNameLen = MAX_SYM_NAME; | ||
bool ret = SymFromAddr(symInitializer.process, reinterpret_cast<DWORD64>(pc), 0, symbol); | ||
if (ret && static_cast<int>(symbol->NameLen) < out_size) { | ||
strncpy(out, symbol->Name, symbol->NameLen + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two space indentation please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also thanks. Not sure how I messed that up.
src/symbolize.cc
Outdated
SYMBOL_INFO *symbol = reinterpret_cast<SYMBOL_INFO*>(buf); | ||
symbol->SizeOfStruct = sizeof(SYMBOL_INFO); | ||
symbol->MaxNameLen = MAX_SYM_NAME; | ||
bool ret = SymFromAddr(symInitializer.process, reinterpret_cast<DWORD64>(pc), 0, symbol); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you break after the first comma? We usually don't use more than 80 columns per a line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heh, thanks.
src/symbolize_unittest.cc
Outdated
} | ||
#endif // HAVE_STACKTRACE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add // HAVE_SYMBOLIZE? (apparently the previous comment was wrong..)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This took me a little bit to figure out haha, fixing.
src/stacktrace_windows-inl.h
Outdated
if (max_depth > 64) { | ||
max_depth = 64; | ||
} | ||
std::vector<void*> stack(max_depth); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: On ELF systems, Symbolize and Demangle are meant to be async-signal-safe. As signal handler can be called even while malloc's internal lock is held, I think we use no malloc in Symbolize and Demangle. I don't know Windows signal handler. I assume it's OK to call any functions?
Hmm it seems to be no? https://msdn.microsoft.com/en-us/library/aa272921(v=vs.60).aspx
I'm also worried by Windows API calls to obtain stacktrace/symbol. I guess some of them are calling malloc internally.
src/signalhandler.cc
Outdated
#elif defined(OS_WINDOWS) | ||
typedef void (*SignalHandlerPointer)(int); | ||
SignalHandlerPointer previousHandler; | ||
previousHandler = signal(SIGABRT, &FailureSignalHandler); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure you only care SIGABRT? I think handling SIGFPE, SIGILL, and SIGSEGV would be useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is where Windows gets weird. While SIGNAL.H
defines all of these signals as valid parameters to signal()
, it does so only for compatibility. Explicitly, the signals SIGILL
and SIGTERM
are not generated on Windows (though you can explicitly raise them).
However, I think that since you can explicitly raise these signals (and it appears that SIGFPE
and SIGSEGV
might be raised on Windows), I'll change this to handle them all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also note:
I ran into a major problem on Windows with signal handling. Let me reproduce the commit message here, as the commit is going to change when I update the PR:
On Windows, the signal handler is invoked on the thread on which it was installed, which is not necessarily the thread where the test failed. So instead of using the signal handler to dump the stack trace, which would dump it from the wrong thread, we dump the information immediately.
Doing this causes us to use a few more frames, so
DumpStackInfo
needed to configurable skip number.
The nicer approach would be to refactor the
MinimalFormatter
and info dumping functions out of the signal handler and into the logging library, which is a more relevant home for them, but it is less invasive to simply move the two functions Windows needed into the internal namespace.
So the usual code path that a test failure takes now skips signal handler entirely, and goes straight to dumping stack and symbols.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is because the Windows signal handler is not like what is found on Linux like systems. Instead, it uses structured exception handling (SEH). I don't know much more than what I've researched on this (ironically I am not a Windows dev, I am a Linux dev), but I know this was quite an annoying problem to solve. Signaling on Windows switches threads (to the SEH thread, I guess?) and CaptureStackBackTrace
explicitly gets the backtrace for the current thread. I'm looking for a way to do this right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh here it is.
The signal handler was being on the thread on which it was installed, not a dedicated SEH thread (I was mis-remembering), and importantly, not the thread on which the signal was raised.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fixed this. I did a bit more investigating, and now this works as expected. You install the failure signal handlers, and thing work like they do on Linux.
src/demangle.cc
Outdated
// only the mangled symbol, `?func@Foo@@SAXH@Z`. Fortunately, the | ||
// mangled symbol is guaranteed not to have parentheses, | ||
// so we search for `(` and extract up to `)`. | ||
std::string symbol(mangled); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, it'd be better to avoid malloc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I'm confused. I think I've left a comment in stacktrace_windows-inl.h or somewhere, but the comment has disappeared :(
Anyway, I think calling malloc in signal handler isn't safe, so it'd be nicer to avoid using std::string. https://msdn.microsoft.com/en-us/library/aa272921(v=vs.60).aspx
I'm also worried by Windows APIs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, I need to fix this up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I found your other comment by the way, it was getting hidden under "show outdated" even though I don't think the code was actually outdated. Not sure what GitHub is up to.)
src/signalhandler.cc
Outdated
void DumpStackFrameInfo(const char* prefix, void* pc) { | ||
// Get the symbol name. | ||
const char *symbol = "(unknown)"; | ||
char symbolized[1024] = {}; // Big enough for a sane symbol. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't understand why this change is necessary. When Symbolize returns true, SymbolizeAndDemangle fills something to this buffer on both Windows and non-Windows?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this comment got duplicated.
Thanks for the patch! Sorry for the huge latency... Could you update AUTHORS and CONTRIBUTORS as well (see CONTRIBUTING.md)? I'm a bit worried by the use of malloc and Windows APIs in signal handler, but I think it's definitely much nicer than nothing. So I'm planning to merge this PR if others don't objects. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some nitpicking, but otherwise looks good to me.
CMakeLists.txt
Outdated
if (WIN32) | ||
set (HAVE_STACKTRACE 1) | ||
set (HAVE_SYMBOLIZE 1) | ||
target_link_libraries (glog PUBLIC Dbghelp.lib) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would remove the .lib
suffix here and replace Dbghelp
by proper capitalization of library's name, i.e., DbgHelp
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, and even more so, I think I should instead use:
#pragma comment(lib, "DbgHelp")
In an appropriate header, eliminating the need for us to target_link_libraries
or for consumers of Glog to do so either.
src/demangle.cc
Outdated
#include "demangle.h" | ||
|
||
#if defined(OS_WINDOWS) | ||
#include <Dbghelp.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For consistency reasons, same as before, i.e., DbgHelp.h
instead of Dbghelp.h
. Windows Dev Center also suggests this header name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. This is where I'm going to add the pragma
for automatic linking as well.
src/symbolize.cc
Outdated
@@ -837,6 +837,57 @@ static ATTRIBUTE_NOINLINE bool SymbolizeAndDemangle(void *pc, char *out, | |||
|
|||
_END_GOOGLE_NAMESPACE_ | |||
|
|||
#elif defined(OS_WINDOWS) | |||
|
|||
#include "Dbghelp.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
src/symbolize.cc
Outdated
|
||
_START_GOOGLE_NAMESPACE_ | ||
|
||
class SymInitializer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class should not be copied, correct? To ensure no one does this in the future, I'd delete the copy constructor and copy assignment operator by adding the declarations
SymInitializer(const SymInitializer&);
SymInitializer& operator=(const SymInitializer&);
to the private section of the class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it.
@serguid and @shinh Thank you so much for getting to this and reviewing it! I'll commit to updating it per reviews (plus feedback on some confusing Windows-y parts) by the end of the week. Talk to you soon! |
With a fresh set of eyes (and more Windows knowledge) I actually managed to get the signal handler working properly today (just tested an integration with Mesos). I'll get it cleaned up and the PR posted. |
2a173b0
to
1553f33
Compare
@shinh and @sergiud this is ready for a final review I believe. I fixed the mentioned issues, and fixed the deeper issue of the signal handler not being all the way hooked up. From my previous comments, I mentioned that I'd plumbed around the signal handler "not working as expected" on Windows; but I now have it working as expected, and undid the unnecessary changes. Users should use |
1553f33
to
0691a92
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I'm fine with merging this PR, but I'd like to ask a couple of questions.
src/symbolize.cc
Outdated
|
||
__declspec(noinline) static bool SymbolizeAndDemangle(void *pc, char *out, | ||
int out_size) { | ||
const SymInitializer symInitializer; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd guess SymInitialize is a fairly complex function. Does it have no malloc?
Also, is it a good idea to construct SymInitializer every time? Perhaps the results of SymInitialize will be cached internally and the second SymInitialize call is fast?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I meant this to be a const static
variable so it was initialized only once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd guess SymInitialize is a fairly complex function. Does it have no malloc?
I'm unsure. I suppose we could construct SymInitialize
at startup (i.e. move symInitializer
out of the function here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm unsure. I suppose we could construct SymInitialize at startup (i.e. move symInitializer out of the function here).
Yeah, but MSDN told me SymInitialize could be slow. Perhaps it might be a good idea to add a function which initializes SymInitialize if a library user wants to be safe. Anyway, it's up to you. I'm OK with the current patch. Probably I'll merge this PR tomorrow unless you want me to wait a bit more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shinh that's where I'm at with it; I initialized it in the function since it can be slow. While I think adding a function to pre-initialize it for safety is a good idea, it's also not necessary for most use cases (judging by how I'm using this in Mesos anyways), so I'd rather not add it now.
I am good with the current contents of this PR for you to merge them. Thank you for your help!
Necessary when building with BUILD_SHARED_LIBS=1.
This method ensure that all users of glog get automatically linked to the DbgHelp library without needing to set compiler flags.
0691a92
to
d1f49ba
Compare
Just to add some test results: CentOS 7 (ensuring I didn't break anything): $ ./autogen.sh
$ ./configure
$ make -j8
$ make check
============================================================================
Testsuite summary for glog 0.3.5
============================================================================
# TOTAL: 6
# PASS: 6
# SKIP: 0
# XFAIL: 0
# FAIL: 0
# XPASS: 0
# ERROR: 0
============================================================================
$ mkdir build
$ cd build
$ cmake3 ..
$ make -j8
$ ctest
Test project /home/andschwa/src/glog/build
Start 1: demangle
1/6 Test #1: demangle ......................... Passed 0.00 sec
Start 2: logging
2/6 Test #2: logging .......................... Passed 0.53 sec
Start 3: signalhandler
3/6 Test #3: signalhandler .................... Passed 0.01 sec
Start 4: stacktrace
4/6 Test #4: stacktrace ....................... Passed 0.00 sec
Start 5: stl_logging
5/6 Test #5: stl_logging ...................... Passed 0.00 sec
Start 6: symbolize
6/6 Test #6: symbolize ........................ Passed 0.00 sec
100% tests passed, 0 tests failed out of 6
Total Test time (real) = 0.55 sec Windows 10: $ mkdir build
$ cd build
$ cmake ..
$ cmake --build . --config Debug
$ ctest -C Debug
Test project C:/Users/andschwa/src/glog/build
Start 1: demangle
1/6 Test #1: demangle ......................... Passed 0.05 sec
Start 2: logging
2/6 Test #2: logging ..........................***Failed 0.47 sec
Start 3: signalhandler
3/6 Test #3: signalhandler .................... Passed 0.02 sec
Start 4: stacktrace
4/6 Test #4: stacktrace .......................***Failed 0.97 sec
Start 5: stl_logging
5/6 Test #5: stl_logging ...................... Passed 0.01 sec
Start 6: symbolize
6/6 Test #6: symbolize ........................ Passed 0.03 sec
67% tests passed, 2 tests failed out of 6
Total Test time (real) = 1.58 sec
The following tests FAILED:
2 - logging (Failed)
4 - stacktrace (Failed)
Errors while running CTest The stacktrace test tends to fail in $ cmake --build . --config Release
$ ctest -C Release
Test project C:/Users/andschwa/src/glog/build
Start 1: demangle
1/6 Test #1: demangle ......................... Passed 0.00 sec
Start 2: logging
2/6 Test #2: logging ..........................***Failed 0.45 sec
Start 3: signalhandler
3/6 Test #3: signalhandler .................... Passed 0.02 sec
Start 4: stacktrace
4/6 Test #4: stacktrace ....................... Passed 0.02 sec
Start 5: stl_logging
5/6 Test #5: stl_logging ...................... Passed 0.01 sec
Start 6: symbolize
6/6 Test #6: symbolize ........................***Exception: Other 3.15 sec
67% tests passed, 2 tests failed out of 6
Total Test time (real) = 3.69 sec
The following tests FAILED:
2 - logging (Failed)
6 - symbolize (OTHER_FAULT)
Errors while running CTest And finally, the logging test fails just due to a difference in padding:
The golden file has Finally, I've also linked this library into Mesos and tested some expected failures with an |
Yeah, further changes can be done in a separate PR if necessary. Thanks! |
Woo! Thank you :) |
struct sigaction sig_action; | ||
memset(&sig_action, 0, sizeof(sig_action)); | ||
sigemptyset(&sig_action.sa_mask); | ||
sig_action.sa_handler = SIG_DFL; | ||
sigaction(SIGABRT, &sig_action, NULL); | ||
} | ||
#elif defined(OS_WINDOWS) | ||
signal(SIGABRT, SIG_DFL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, this should set kFailureSignalHandlerInstalled
back to false
.
Port stack tracing to Windows
This set of patches implements the stack trace, symbolizer, and demangler for Windows, and their unit tests. It utilizes the Windows debugging APIs in
"Dbghelp.h"
.There is one
TODO
forIsFailureSignalHandlerInstalled
as I'm unsure yet how to implement it for Windows. That said, in order to make stack trace reporting work "as expected" for test failures, the data dump needs to be done immediately, instead of in the signal handler (see commit message for details).I've integration tested this with Mesos, and so far it is working well. I've also updated the unit tests where appropriate.