Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port stack tracing to Windows #168

Merged
merged 7 commits into from
Jun 28, 2017
Merged

Conversation

andyleejordan
Copy link
Contributor

This set of patches implements the stack trace, symbolizer, and demangler for Windows, and their unit tests. It utilizes the Windows debugging APIs in "Dbghelp.h".

There is one TODO for IsFailureSignalHandlerInstalled as I'm unsure yet how to implement it for Windows. That said, in order to make stack trace reporting work "as expected" for test failures, the data dump needs to be done immediately, instead of in the signal handler (see commit message for details).

I've integration tested this with Mesos, and so far it is working well. I've also updated the unit tests where appropriate.

@andyleejordan
Copy link
Contributor Author

Also note: I updated the CMakeLists.txt build system, as that's what I'm familiar with, but there appear to be several Windows build systems maintained in parallel.

@andyleejordan andyleejordan force-pushed the windows-stacktrace branch 4 times, most recently from 13d0440 to c2b2dde Compare March 9, 2017 22:31
kaysoky pushed a commit to kaysoky/mesos that referenced this pull request Mar 14, 2017
This relies on an updated `glog-da816ea70.tar.gz` tarball
in the Mesos 3rdparty repository, which is itself generated from
the `da816ea70` commit on github.com/google/glog.

Additionally, this applies a patch that includes stacktrace support
to glog on Windows.  The code in this patch is included in an upstream
pull request:
google/glog#168

If and when that pull request gets merged, we will update the glog
tarball and remove the need for the extra patch file.

Review: https://reviews.apache.org/r/57490/
kaysoky pushed a commit to kaysoky/mesos that referenced this pull request Mar 14, 2017
This relies on an updated `glog-da816ea70.tar.gz` tarball
in the Mesos 3rdparty repository, which is itself generated from
the `da816ea70` commit on github.com/google/glog.

Additionally, this applies a patch that includes stacktrace support
to glog on Windows.  The code in this patch is included in an upstream
pull request:
google/glog#168

If and when that pull request gets merged, we will update the glog
tarball and remove the need for the extra patch file.

Review: https://reviews.apache.org/r/57490/
@kaysoky
Copy link

kaysoky commented Mar 14, 2017

Note: The contents of this PR have been applied to the Windows build of Apache Mesos (apache/mesos@ef3ebf3).

@andyleejordan
Copy link
Contributor Author

Ping @ukai 😄

@andyleejordan
Copy link
Contributor Author

@ukai can a maintainer please take a look?

@andyleejordan
Copy link
Contributor Author

@ukai and @shinh anything blocking this, please, can it at least be reviewed?

@andyleejordan
Copy link
Contributor Author

Adding @sergiud for a possible review.

@andyleejordan
Copy link
Contributor Author

Ping @sergiud @ukai @shinh

Copy link
Collaborator

@shinh shinh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, could you update AUTHORS and CONTRIBUTORS as well (see CONTRIBUTING.md)?

@@ -145,7 +145,7 @@ void DumpTimeInfo() {
void DumpStackFrameInfo(const char* prefix, void* pc) {
// Get the symbol name.
const char *symbol = "(unknown)";
char symbolized[1024]; // Big enough for a sane symbol.
char symbolized[1024] = {}; // Big enough for a sane symbol.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't understand why this change is necessary. When Symbolize returns true, SymbolizeAndDemangle fills something to this buffer on both Windows and non-Windows?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this was not the case on Linux, but on Windows, the Symbolize() only filled the buffer as far as the symbol took. So you'd run into this kind of problem:

[uninitializedbufferofrandomdata]
[main()ializedbufferofrandomdata]
[someLongSymbol()ferofrandomdata]
[short()gSymbol()ferofrandomdata]

Since the buffer was not being initialized, but the same data location was being reused and only partially filled out, previous longer symbols left gunk in buffer with new shorter symbols.

I solved this by ensuring all elements were initialized to zero each time DumpStackFrameInfo was called. In retrospect, if I ensure a the terminating null is added manually, this becomes unnecessary.

src/symbolize.cc Outdated
symbol->MaxNameLen = MAX_SYM_NAME;
bool ret = SymFromAddr(symInitializer.process, reinterpret_cast<DWORD64>(pc), 0, symbol);
if (ret && static_cast<int>(symbol->NameLen) < out_size) {
strncpy(out, symbol->Name, symbol->NameLen + 1);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two space indentation please.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also thanks. Not sure how I messed that up.

src/symbolize.cc Outdated
SYMBOL_INFO *symbol = reinterpret_cast<SYMBOL_INFO*>(buf);
symbol->SizeOfStruct = sizeof(SYMBOL_INFO);
symbol->MaxNameLen = MAX_SYM_NAME;
bool ret = SymFromAddr(symInitializer.process, reinterpret_cast<DWORD64>(pc), 0, symbol);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you break after the first comma? We usually don't use more than 80 columns per a line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh, thanks.

}
#endif // HAVE_STACKTRACE
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add // HAVE_SYMBOLIZE? (apparently the previous comment was wrong..)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This took me a little bit to figure out haha, fixing.

if (max_depth > 64) {
max_depth = 64;
}
std::vector<void*> stack(max_depth);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: On ELF systems, Symbolize and Demangle are meant to be async-signal-safe. As signal handler can be called even while malloc's internal lock is held, I think we use no malloc in Symbolize and Demangle. I don't know Windows signal handler. I assume it's OK to call any functions?

Hmm it seems to be no? https://msdn.microsoft.com/en-us/library/aa272921(v=vs.60).aspx

I'm also worried by Windows API calls to obtain stacktrace/symbol. I guess some of them are calling malloc internally.

#elif defined(OS_WINDOWS)
typedef void (*SignalHandlerPointer)(int);
SignalHandlerPointer previousHandler;
previousHandler = signal(SIGABRT, &FailureSignalHandler);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure you only care SIGABRT? I think handling SIGFPE, SIGILL, and SIGSEGV would be useful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where Windows gets weird. While SIGNAL.H defines all of these signals as valid parameters to signal(), it does so only for compatibility. Explicitly, the signals SIGILL and SIGTERM are not generated on Windows (though you can explicitly raise them).

However, I think that since you can explicitly raise these signals (and it appears that SIGFPE and SIGSEGV might be raised on Windows), I'll change this to handle them all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also note:

I ran into a major problem on Windows with signal handling. Let me reproduce the commit message here, as the commit is going to change when I update the PR:

On Windows, the signal handler is invoked on the thread on which it was installed, which is not necessarily the thread where the test failed. So instead of using the signal handler to dump the stack trace, which would dump it from the wrong thread, we dump the information immediately.

Doing this causes us to use a few more frames, so DumpStackInfo needed to configurable skip number.

The nicer approach would be to refactor the MinimalFormatter and info dumping functions out of the signal handler and into the logging library, which is a more relevant home for them, but it is less invasive to simply move the two functions Windows needed into the internal namespace.

So the usual code path that a test failure takes now skips signal handler entirely, and goes straight to dumping stack and symbols.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is because the Windows signal handler is not like what is found on Linux like systems. Instead, it uses structured exception handling (SEH). I don't know much more than what I've researched on this (ironically I am not a Windows dev, I am a Linux dev), but I know this was quite an annoying problem to solve. Signaling on Windows switches threads (to the SEH thread, I guess?) and CaptureStackBackTrace explicitly gets the backtrace for the current thread. I'm looking for a way to do this right.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh here it is.

The signal handler was being on the thread on which it was installed, not a dedicated SEH thread (I was mis-remembering), and importantly, not the thread on which the signal was raised.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed this. I did a bit more investigating, and now this works as expected. You install the failure signal handlers, and thing work like they do on Linux.

src/demangle.cc Outdated
// only the mangled symbol, `?func@Foo@@SAXH@Z`. Fortunately, the
// mangled symbol is guaranteed not to have parentheses,
// so we search for `(` and extract up to `)`.
std::string symbol(mangled);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, it'd be better to avoid malloc.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I'm confused. I think I've left a comment in stacktrace_windows-inl.h or somewhere, but the comment has disappeared :(

Anyway, I think calling malloc in signal handler isn't safe, so it'd be nicer to avoid using std::string. https://msdn.microsoft.com/en-us/library/aa272921(v=vs.60).aspx

I'm also worried by Windows APIs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I need to fix this up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I found your other comment by the way, it was getting hidden under "show outdated" even though I don't think the code was actually outdated. Not sure what GitHub is up to.)

void DumpStackFrameInfo(const char* prefix, void* pc) {
// Get the symbol name.
const char *symbol = "(unknown)";
char symbolized[1024] = {}; // Big enough for a sane symbol.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't understand why this change is necessary. When Symbolize returns true, SymbolizeAndDemangle fills something to this buffer on both Windows and non-Windows?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment got duplicated.

@shinh
Copy link
Collaborator

shinh commented Jun 10, 2017

Thanks for the patch! Sorry for the huge latency... Could you update AUTHORS and CONTRIBUTORS as well (see CONTRIBUTING.md)? I'm a bit worried by the use of malloc and Windows APIs in signal handler, but I think it's definitely much nicer than nothing. So I'm planning to merge this PR if others don't objects.

Copy link
Collaborator

@sergiud sergiud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nitpicking, but otherwise looks good to me.

CMakeLists.txt Outdated
if (WIN32)
set (HAVE_STACKTRACE 1)
set (HAVE_SYMBOLIZE 1)
target_link_libraries (glog PUBLIC Dbghelp.lib)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove the .lib suffix here and replace Dbghelp by proper capitalization of library's name, i.e., DbgHelp .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, and even more so, I think I should instead use:

#pragma comment(lib, "DbgHelp")

In an appropriate header, eliminating the need for us to target_link_libraries or for consumers of Glog to do so either.

src/demangle.cc Outdated
#include "demangle.h"

#if defined(OS_WINDOWS)
#include <Dbghelp.h>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency reasons, same as before, i.e., DbgHelp.h instead of Dbghelp.h. Windows Dev Center also suggests this header name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. This is where I'm going to add the pragma for automatic linking as well.

src/symbolize.cc Outdated
@@ -837,6 +837,57 @@ static ATTRIBUTE_NOINLINE bool SymbolizeAndDemangle(void *pc, char *out,

_END_GOOGLE_NAMESPACE_

#elif defined(OS_WINDOWS)

#include "Dbghelp.h"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as before.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

src/symbolize.cc Outdated

_START_GOOGLE_NAMESPACE_

class SymInitializer {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class should not be copied, correct? To ensure no one does this in the future, I'd delete the copy constructor and copy assignment operator by adding the declarations

SymInitializer(const SymInitializer&);
SymInitializer& operator=(const SymInitializer&);

to the private section of the class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it.

@andyleejordan
Copy link
Contributor Author

@serguid and @shinh Thank you so much for getting to this and reviewing it! I'll commit to updating it per reviews (plus feedback on some confusing Windows-y parts) by the end of the week. Talk to you soon!

@andyleejordan
Copy link
Contributor Author

With a fresh set of eyes (and more Windows knowledge) I actually managed to get the signal handler working properly today (just tested an integration with Mesos). I'll get it cleaned up and the PR posted.

@andyleejordan
Copy link
Contributor Author

@shinh and @sergiud this is ready for a final review I believe. I fixed the mentioned issues, and fixed the deeper issue of the signal handler not being all the way hooked up. From my previous comments, I mentioned that I'd plumbed around the signal handler "not working as expected" on Windows; but I now have it working as expected, and undid the unnecessary changes. Users should use InstallFailureSignalHandler() and expect LOG(FATAL), CHECK(...) and abort() etc. to all work.

Copy link
Collaborator

@shinh shinh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I'm fine with merging this PR, but I'd like to ask a couple of questions.

src/symbolize.cc Outdated

__declspec(noinline) static bool SymbolizeAndDemangle(void *pc, char *out,
int out_size) {
const SymInitializer symInitializer;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd guess SymInitialize is a fairly complex function. Does it have no malloc?

Also, is it a good idea to construct SymInitializer every time? Perhaps the results of SymInitialize will be cached internally and the second SymInitialize call is fast?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I meant this to be a const static variable so it was initialized only once.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd guess SymInitialize is a fairly complex function. Does it have no malloc?

I'm unsure. I suppose we could construct SymInitialize at startup (i.e. move symInitializer out of the function here).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unsure. I suppose we could construct SymInitialize at startup (i.e. move symInitializer out of the function here).

Yeah, but MSDN told me SymInitialize could be slow. Perhaps it might be a good idea to add a function which initializes SymInitialize if a library user wants to be safe. Anyway, it's up to you. I'm OK with the current patch. Probably I'll merge this PR tomorrow unless you want me to wait a bit more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shinh that's where I'm at with it; I initialized it in the function since it can be slow. While I think adding a function to pre-initialize it for safety is a good idea, it's also not necessary for most use cases (judging by how I'm using this in Mesos anyways), so I'd rather not add it now.

I am good with the current contents of this PR for you to merge them. Thank you for your help!

Necessary when building with BUILD_SHARED_LIBS=1.
This method ensure that all users of glog get automatically linked to
the DbgHelp library without needing to set compiler flags.
@andyleejordan
Copy link
Contributor Author

Just to add some test results:

CentOS 7 (ensuring I didn't break anything):

$ ./autogen.sh
$ ./configure
$ make -j8
$ make check
============================================================================
Testsuite summary for glog 0.3.5
============================================================================
# TOTAL: 6
# PASS:  6
# SKIP:  0
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0
============================================================================

$ mkdir build
$ cd build
$ cmake3 ..
$ make -j8
$ ctest
Test project /home/andschwa/src/glog/build
    Start 1: demangle
1/6 Test #1: demangle .........................   Passed    0.00 sec
    Start 2: logging
2/6 Test #2: logging ..........................   Passed    0.53 sec
    Start 3: signalhandler
3/6 Test #3: signalhandler ....................   Passed    0.01 sec
    Start 4: stacktrace
4/6 Test #4: stacktrace .......................   Passed    0.00 sec
    Start 5: stl_logging
5/6 Test #5: stl_logging ......................   Passed    0.00 sec
    Start 6: symbolize
6/6 Test #6: symbolize ........................   Passed    0.00 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) =   0.55 sec

Windows 10:

$ mkdir build
$ cd build
$ cmake ..
$ cmake --build . --config Debug
$ ctest -C Debug
Test project C:/Users/andschwa/src/glog/build
    Start 1: demangle
1/6 Test #1: demangle .........................   Passed    0.05 sec
    Start 2: logging
2/6 Test #2: logging ..........................***Failed    0.47 sec
    Start 3: signalhandler
3/6 Test #3: signalhandler ....................   Passed    0.02 sec
    Start 4: stacktrace
4/6 Test #4: stacktrace .......................***Failed    0.97 sec
    Start 5: stl_logging
5/6 Test #5: stl_logging ......................   Passed    0.01 sec
    Start 6: symbolize
6/6 Test #6: symbolize ........................   Passed    0.03 sec

67% tests passed, 2 tests failed out of 6

Total Test time (real) =   1.58 sec

The following tests FAILED:
          2 - logging (Failed)
          4 - stacktrace (Failed)
Errors while running CTest

The stacktrace test tends to fail in Debug because of the incremental linking table making function and return addresses unreliable. The symbolizer should now be expected to fail because the Release configuration doesn't have symbol information.

$ cmake --build . --config Release
$ ctest -C Release
Test project C:/Users/andschwa/src/glog/build
    Start 1: demangle
1/6 Test #1: demangle .........................   Passed    0.00 sec
    Start 2: logging
2/6 Test #2: logging ..........................***Failed    0.45 sec
    Start 3: signalhandler
3/6 Test #3: signalhandler ....................   Passed    0.02 sec
    Start 4: stacktrace
4/6 Test #4: stacktrace .......................   Passed    0.02 sec
    Start 5: stl_logging
5/6 Test #5: stl_logging ......................   Passed    0.01 sec
    Start 6: symbolize
6/6 Test #6: symbolize ........................***Exception: Other  3.15 sec

67% tests passed, 2 tests failed out of 6

Total Test time (real) =   3.69 sec

The following tests FAILED:
          2 - logging (Failed)
          6 - symbolize (OTHER_FAULT)
Errors while running CTest

And finally, the logging test fails just due to a difference in padding:

$ ctest -V -C Debug
2: Test with golden file failed. We'll try to show the diff:
2: --- C:/Users/andschwa/src/glog/src/logging_unittest.err.munged       Tue Jun 27 11:36:39 2017
2: +++ C:\Users\andschwa\AppData\Local\Temp\/captured.err.munged        Tue Jun 27 11:36:39 2017
2: @@ -78,7 +78,7 @@
2:  IDATE TIME__ THREADID logging_unittest.cc:LINE] RAW: foo bar 10 3.400000
2:  WDATE TIME__ THREADID logging_unittest.cc:LINE] RAW: array
2:  IDATE TIME__ THREADID logging_unittest.cc:LINE] RAW: const array
2: -IDATE TIME__ THREADID logging_unittest.cc:LINE] RAW: ptr 12345678
2: +IDATE TIME__ THREADID logging_unittest.cc:LINE] RAW: ptr 0000000012345678
2:  IDATE TIME__ THREADID logging_unittest.cc:LINE] RAW: ptr 0000000000000000
2:  EDATE TIME__ THREADID logging_unittest.cc:LINE] RAW: foo 1000 0000001000 3e8
2:  IDATE TIME__ THREADID logging_unittest.cc:LINE] RAW: foo 1000
2: diff command was failed.
2: Check failed: MungeAndDiffTestStderr(FLAGS_test_srcdir + "/src/logging_unittest.err")
2/6 Test #2: logging ..........................***Failed    0.45 sec

The golden file has 12345678 and we print 0000000012345678.

Finally, I've also linked this library into Mesos and tested some expected failures with an abort(), a CHECK_EQ(true, false) and a LOG(FATAL) << Failing, and stack traces are being generated properly.

@shinh shinh merged commit 5cfa9a2 into google:master Jun 28, 2017
@shinh
Copy link
Collaborator

shinh commented Jun 28, 2017

Yeah, further changes can be done in a separate PR if necessary. Thanks!

@andyleejordan
Copy link
Contributor Author

Woo! Thank you :)

@andyleejordan andyleejordan deleted the windows-stacktrace branch June 28, 2017 03:34
struct sigaction sig_action;
memset(&sig_action, 0, sizeof(sig_action));
sigemptyset(&sig_action.sa_mask);
sig_action.sa_handler = SIG_DFL;
sigaction(SIGABRT, &sig_action, NULL);
}
#elif defined(OS_WINDOWS)
signal(SIGABRT, SIG_DFL);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, this should set kFailureSignalHandlerInstalled back to false.

durswd pushed a commit to durswd/glog that referenced this pull request Sep 2, 2019
@bmahler
Copy link

bmahler commented Mar 13, 2020

Filed to issues to improve this: issue 534 and issue 535.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants