Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

node does not abort at the right time when using --abort-on-uncaught-exception #3035

Closed
misterdjules opened this issue Sep 24, 2015 · 10 comments
Labels
post-mortem Issues and PRs related to the post-mortem diagnostics of Node.js.

Comments

@misterdjules
Copy link

nodejs/master's head does not abort at the right time when using --abort-on-uncaught-exception. Here's a reproduction of the problem on SmartOS:

$ git rev-parse --short HEAD
02448c6
[root@dev ~/node-1]# make
make -C out BUILDTYPE=Release V=1
make[1]: Entering directory '/root/node-1/out'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/root/node-1/out'
ln -fs out/Release/node node
$ ln -sf `pwd`/out/Release/node /opt/local/bin/node
[root@dev ~/node-1]# node --version
v5.0.0-pre
$ node --abort-on-uncaught-exception -e 'setTimeout(function () { function boom() { throw new Error("foo") } boom(); }, 10);'
[eval]:1
setTimeout(function () { function boom() { throw new Error("foo") } boom(); }, 10);
                                           ^

Error: foo
    at boom ([eval]:1:50)
    at null._onTimeout ([eval]:1:69)
    at Timer.listOnTimeout (timers.js:89:15)
Abort (core dumped)
$ mdb /var/cores/core.node.8212 
Loading modules: [ libumem.so.1 libc.so.1 ld.so.1 ]
> ::load /root/mdb_v8/build/amd64/mdb_v8.so
mdb_v8 version: 1.0.0 (dev)
V8 version: 4.5.103.33
Autoconfigured V8 support from target
C++ symbol demangling enabled
> ::jsstack
native: libc.so.1`_lwp_kill+0xa
native: libc.so.1`raise+0x20
native: libc.so.1`abort+0x98
native: node::FatalException+0xd7
native: v8::internal::MessageHandler::ReportMessage+0x1ee
native: v8::internal::Isolate::ReportPendingMessages+0x234
native: v8::internal::Execution::Call+0x4a3
native: v8::Function::Call+0xff
native: v8::Function::Call+0x41
native: node::AsyncWrap::MakeCallback+0x23a
native: node::TimerWrap::OnTimeout+0x96
native: uv__run_timers+0x7d
native: uv_run+0x35a
native: node::Start+0x538
native: _start+0x6c
>

Note that the call stack doesn't contain any JavaScript frame, and indicates that node aborted in node::FatalException, not when the error was thrown.

This is a problem because users of post-mortem debuggers and --abort-on-uncaught-exception need to have core dumps that are generated when the exception is thrown, and the core dumps need to have the frame that throws the error in the call stack. Otherwise, it becomes much more difficult, if not impossible to determine the root cause of the problem.

In the example above, there's no way to know that the error was thrown by the function named foo, we just know that one timer's callback threw.

This regression was introduced by #922, which made V8 ignore --abort-on-uncaught-exception. An attempt at fixing that regression was made with #2776, but instead of letting V8 abort in Isolate::Throw (when it actually throws the error and when all the relevant frames are active on the stack), it throws in node::FatalException as shown above.

I will submit two different PRs that fix this issue in two different ways so that we can discuss the pros and cons of the two different approaches I came up with.

/cc @nodejs/post-mortem

@misterdjules misterdjules added the post-mortem Issues and PRs related to the post-mortem diagnostics of Node.js. label Sep 24, 2015
misterdjules pushed a commit to misterdjules/node-1 that referenced this issue Sep 24, 2015
This PR fixes 0af4c9e so that node
aborts at the right time when throwing an error and using
--abort-on-uncaught-exception.

Basically, it wraps most node internal callbacks with:

if (!domain || domain.emittingTopLevelError)
  runCallback();
else {
  try {
    runCallback();
  } catch (err) {
    process._fatalException(err);
  }
}

so that V8 can abort properly in Isolate::Throw if
--abort-on-uncaught-exception was passed on the command line, and domain
can handle the error if one is active and not already in the top level
domain's error handler.

It also reverts 921f2de partially:
node::FatalException does not abort anymore because at that time, it's
already too late.

It adds process._forceTickDone, which is really a hack to allow
test-next-tick-error-spin.js to pass. It's here to basically avoid an
infinite recursion when throwing in a domain from a nextTick callback,
and queuing the same callback on the next tick from the domain's error
handler.

This change is an alternative approach to nodejs#3036 for fixing nodejs#3035.

Fixes nodejs#3035.
@misterdjules
Copy link
Author

I implemented two different approaches with #3036 and #3038 to solve this problem. They're both drafts of PRs, and I didn't pay too much attention to details. I'm mainly looking for feedback on what would be the preferred approach.

I personally have a preference for #3036, because I find it less intrusive and it's been tested with node v0.10. However I would welcome any other opinion/feedback, and there may be other valid approaches.

@yunong
Copy link
Member

yunong commented Sep 24, 2015

Just to weigh in here -- this is critical for us at Netflix and our production stack -- and will potentially prevent us from moving to 4.x until this is fixed. Thanks for all the hardwork @misterdjules and we'd love to see this integrated soon!

@evanlucas
Copy link
Contributor

Thanks @misterdjules! I feel like #3036 is a much cleaner way of doing it and it doesn't require messing with timers and repl.

@Raynos
Copy link
Contributor

Raynos commented Sep 25, 2015

👍

It's important that we have the right stack. --abort-on-uncaught-exception is critical for debugging "Maximum call stack size exceeded".

We need the stack in the core file to have the entire call stack that we blew up.

@misterdjules
Copy link
Author

If the two approaches described in #3036 and #3038 are not viable, a third option is to revert 0af4c9e and 921f2de, and to display a warning when domains and --abort-on-uncaught-exception are used at the same time.

This would bring us back to the point where domains and --abort-on-uncaught-exception work independently, but not when used together in some use cases.

@Raynos
Copy link
Contributor

Raynos commented Sep 29, 2015

If a domain catches an exception and then determines there is no active domain and rethrows it into the global uncaught you now have a new stack trace that is not the exception you want to debug with a corefile.

Is it possible to use domains and --abort-on-uncaught-exception together in a useful fashion for stack debugging?

From what I understand whenever you have domains enabled the stacktrace will always be a rethrow instead of the stack trace you actually want.

That being said; if all you care about is heap analysis instead of stack analysis then domains + --abort-on-uncaught-exception can work together.

@misterdjules
Copy link
Author

If a domain catches an exception and then determines there is no active domain and rethrows it into the global uncaught you now have a new stack trace that is not the exception you want to debug with a corefile.

Just to make sure we're on the same page, and because wording can be tricky when discussing these topics, if you mean that given the following code:

var domain = require('domain');
var d = domain.create();

d.on('error', function onError(err) {
  throw new Error('boom');
});

d.run(function someFunction() {
  throw new Error('original error');
});

running it with --abort-on-uncaught-exception and examining the stack from the resulting core file will not give you someFunction as an active stack frame, that is correct.

Is it possible to use domains and --abort-on-uncaught-exception together in a useful fashion for stack debugging?

Not with the above -mentioned use case as far as I know.

From what I understand whenever you have domains enabled the stacktrace will always be a rethrow instead of the stack trace you actually want.

Yes, if an error is thrown from the top-level domain's error handler, the stack trace won't contain the frame where the original error was thrown.

That being said; if all you care about is heap analysis instead of stack analysis then domains + --abort-on-uncaught-exception can work together.

Exactly. In general, using --abort-on-uncaught-exception should not break domains. However it's acceptable that using domains will change the execution flow in such a way that makes post-mortem debugging less practical when investigating core files generated from processes that crashed due to uncaught exceptions.That's what domains do and it's documented. Outputting a warning to the console might be a good idea when they're used together regardless of the approach taken to fix this issue.

This issue is about two things:

  1. Fixing --abort-on-uncaught-exception which is completely broken right now, whether or not domains are used.
  2. In doing so, fix the non-acceptable issues of using domains with --abort-on-uncaught-exception that previous changes who broke 1) tried to fix (these issues were originally filed as Domain error handler not preventing exception from bubbling up when --abort-on-uncaught-exception used node-v0.x-archive#8631 and Process exits with incorrect message when throwing error in top-level domain's error handler node-v0.x-archive#8630).

@Raynos
Copy link
Contributor

Raynos commented Sep 30, 2015

Sounds great. We are on the same page.

@misterdjules
Copy link
Author

#3036 was updated now that its V8-related changes landed upstream (see https://codereview.chromium.org/1375933003/).

misterdjules pushed a commit to misterdjules/node-1 that referenced this issue Oct 5, 2015
Revert 0af4c9e, parts of
921f2de and port
nodejs/node-v0.x-archive#25835 from v0.12 to
master so that node aborts at the right time when an error is thrown
and --abort-on-uncaught-exception is used.

Fixes nodejs#3035.
@misterdjules
Copy link
Author

Fixed by 49dec1a and 77a10ed.

jasnell pushed a commit that referenced this issue Oct 8, 2015
Revert 0af4c9e, parts of
921f2de and port
nodejs/node-v0.x-archive#25835 from v0.12 to
master so that node aborts at the right time when an error is thrown
and --abort-on-uncaught-exception is used.

Fixes #3035.

PR: #3036
PR-URL: #3036
Reviewed-By: Ben Noordhuis <ben@strongloop.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
post-mortem Issues and PRs related to the post-mortem diagnostics of Node.js.
Projects
None yet
4 participants