Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

worker: fix stream & termination race bug #42874

Conversation

airtable-keyhanvakil
Copy link
Contributor

@airtable-keyhanvakil airtable-keyhanvakil commented Apr 26, 2022

OnStreamAfterReqFinished uses v8::Object::Has to check if it needs
to call oncomplete. v8::Object::Has needs to execute Javascript.
However when worker threads are involved, OnStreamAfterReqFinished may
be called after the worker thread termination has begun via
worker.terminate(). This makes v8::Object::Has return Nothing,
which triggers an assert.

This diff fixes the issue by checking env->is_stopping() first.

Fixes: #38418

`OnStreamAfterReqFinished` uses `v8::Object::Has` to check if it needs
to call `oncomplete`. `v8::Object::Has` needs to execute Javascript.
However when worker threads are involved, `OnStreamAfterReqFinished` may
be called after the worker thread termination has begun via
`worker.terminate()`. This makes `v8::Object::Has` return `Nothing`,
which triggers an assert.

This diff fixes the issue by simply defaulting us to `false` in the case
where `Nothing` is returned. This is sound because we can't execute
`oncomplete` anyway as the isolate is terminating.

Fixes: nodejs#38418
@nodejs-github-bot nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. needs-ci PRs that need a full CI run. labels Apr 26, 2022
@airtable-keyhanvakil
Copy link
Contributor Author

I used the repro in the referenced issue: https://github.com/Prinzhorn/electron-sigabrt

Here's the backtrace on a debug build & also some GDB output:

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f2bab11a859 in __GI_abort () at abort.c:79
#2  0x0000564469142bfd in node::Abort () at ../src/node_errors.cc:260
#3  0x0000564469143b32 in node::OnFatalError (location=0x56446b25644a "v8::FromJust",
    message=0x56446b256432 "Maybe value is Nothing.") at ../src/node_errors.cc:450
#4  0x000056446941deff in v8::Utils::ReportApiFailure (
    location=0x56446b25644a "v8::FromJust", message=0x56446b256432 "Maybe value is Nothing.")
    at ../deps/v8/src/api/api.cc:296
#5  0x0000564468fe35b9 in v8::Maybe<bool>::FromJust (this=0x7f2ba88ebeb6)
    at ../deps/v8/include/v8.h:10469
#6  0x00005644692b5bea in node::ReportWritesToJSStreamListener::OnStreamAfterReqFinished (
    this=0x7f2b94414730, req_wrap=0x7f2b94320590, status=0) at ../src/stream_base.cc:603
#7  0x00005644692b5c87 in node::ReportWritesToJSStreamListener::OnStreamAfterWrite (
    this=0x7f2b94414730, req_wrap=0x7f2b94320590, status=0) at ../src/stream_base.cc:609
#8  0x00005644692b5e81 in node::StreamListener::OnStreamAfterWrite (this=0x7f2b94414828,
    w=0x7f2b94320590, status=0) at ../src/stream_base.cc:639
#9  0x00005644692b641f in node::StreamResource::EmitAfterWrite (this=0x7f2b94414708,
    w=0x7f2b94320590, status=0) at ../src/stream_base-inl.h:108
#10 0x00005644692b5d32 in node::WriteWrap::OnDone (this=0x7f2b94320590, status=0)
    at ../src/stream_base.cc:623
#11 0x00005644690c7786 in node::StreamReq::Done (this=0x7f2b94320590, status=0, error_str=0x0)
    at ../src/stream_base-inl.h:288
#12 0x0000564469175df0 in node::http2::Http2Session::ClearOutgoing (this=0x7f2b942f1ef0,
    status=-16) at ../src/node_http2.cc:1616
#13 0x000056446917667d in node::http2::Http2Session::SendPendingData (this=0x7f2b942f1ef0)
    at ../src/node_http2.cc:1736
#14 0x0000564469175907 in node::http2::Http2Session::<lambda(node::Environment*)>::operator()(node::Environment *) const (__closure=0x7f2b94310268, env=0x7f2b9414af60)
    at ../src/node_http2.cc:1582
#15 0x000056446919529b in node::CallbackQueue<void, node::Environment*>::CallbackImpl<node::http2::Http2Session::MaybeScheduleWrite()::<lambda(node::Environment*)> >::Call(node::Environment *) (this=0x7f2b94310250, args#0=0x7f2b9414af60) at ../src/callback_queue-inl.h:90
#16 0x000056446906c170 in node::Environment::<lambda(node::Environment::NativeImmediateQueue*)>::operator()(node::Environment::NativeImmediateQueue *) const (__closure=0x7f2ba88ec5d0,
    queue=0x7f2b9414b8f0) at ../src/env.cc:764
#17 0x000056446906c384 in node::Environment::RunAndClearNativeImmediates (
    this=0x7f2b9414af60, only_refed=false) at ../src/env.cc:777
#18 0x000056446906cbfa in node::Environment::CheckImmediate (handle=0x7f2b9414b1b0)
    at ../src/env.cc:917
#19 0x000056446a23ec2c in uv__run_check (loop=0x7f2ba88eca28)
    at ../deps/uv/src/unix/loop-watcher.c:67
#20 0x000056446a2341ed in uv_run (loop=0x7f2ba88eca28, mode=UV_RUN_DEFAULT)
    at ../deps/uv/src/unix/core.c:398
#21 0x0000564468fd87fe in node::SpinEventLoop (env=0x7f2b9414af60)
    at ../src/api/embed_helpers.cc:36
#22 0x000056446928c3c8 in node::worker::Worker::Run (this=0x56446f1184a0)
    at ../src/node_worker.cc:344
#23 0x000056446928e38b in node::worker::Worker::<lambda(void*)>::operator()(void *) const (
    __closure=0x0, arg=0x56446f1184a0) at ../src/node_worker.cc:612
(gdb) p env->isolate()->IsExecutionTerminating()
$27 = true

Copy link
Contributor

@RaisinTen RaisinTen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to add a test?

However when worker threads are involved, OnStreamAfterReqFinished may
be called after the worker thread termination has begun via
worker.terminate(). This makes v8::Object::Has return Nothing,
which triggers an assert.

Why would calling this function after worker.terminate() return Nothing?

src/stream_base.cc Outdated Show resolved Hide resolved
@airtable-keyhanvakil
Copy link
Contributor Author

Would it be possible to add a test?

Possibly. The repro only crashes rarely (something like 1% of the time
on my system). The exact conditions are somewhat rare: the
worker.terminate in the main thread needs to interleave in a specific
way with the worker thread's execution. Are you aware of any similar
tests that have similar needs so I could copy from them?

However when worker threads are involved, OnStreamAfterReqFinished may
be called after the worker thread termination has begun via
worker.terminate(). This makes v8::Object::Has return Nothing,
which triggers an assert.

Why would calling this function after worker.terminate() return Nothing?

Because v8::Object::Has uses this macro:

ENTER_V8(isolate, context, Object, Has, Nothing<bool>(), i::HandleScope);

which in turn uses this macro:

#define ENTER_V8_HELPER_DO_NOT_USE(isolate, context, class_name,  \
                                   function_name, bailout_value,  \
                                   HandleScopeClass, do_callback) \
  if (IsExecutionTerminatingCheck(isolate)) {                     \
    return bailout_value;                                         \
  }                                                               \
...

and worker.terminate() makes IsExecutionTerminatingCheck(isolate)
true.

@airtable-keyhanvakil
Copy link
Contributor Author

OK. I was able to add a test. On my machine, it fails 100/100 times
without this patch and 0/100 times with the patch.

I styled this after some of the other "terminate" tests. cc @addaleax as
it looks like you wrote a lot of these tests and may have opinions.

@RaisinTen
Copy link
Contributor

Because v8::Object::Has uses this macro:

ENTER_V8(isolate, context, Object, Has, Nothing<bool>(), i::HandleScope);

which in turn uses this macro:

#define ENTER_V8_HELPER_DO_NOT_USE(isolate, context, class_name,  \
                                   function_name, bailout_value,  \
                                   HandleScopeClass, do_callback) \
  if (IsExecutionTerminatingCheck(isolate)) {                     \
    return bailout_value;                                         \
  }                                                               \
...

and worker.terminate() makes IsExecutionTerminatingCheck(isolate) true.

Right - I mean what exception is causing this? IsExecutionTerminatingCheck() would return true only if there is an exception scheduled to be thrown.

inline bool IsExecutionTerminatingCheck(i::Isolate* isolate) {
if (isolate->has_scheduled_exception()) {
return isolate->scheduled_exception() ==
i::ReadOnlyRoots(isolate).termination_exception();
}
return false;
}

OK. I was able to add a test. On my machine, it fails 100/100 times without this patch and 0/100 times with the patch.

I'm not able to reproduce the crash locally with this test. 🤔

@airtable-keyhanvakil
Copy link
Contributor Author

Because v8::Object::Has uses this macro:

ENTER_V8(isolate, context, Object, Has, Nothing<bool>(), i::HandleScope);

which in turn uses this macro:

#define ENTER_V8_HELPER_DO_NOT_USE(isolate, context, class_name,  \
                                   function_name, bailout_value,  \
                                   HandleScopeClass, do_callback) \
  if (IsExecutionTerminatingCheck(isolate)) {                     \
    return bailout_value;                                         \
  }                                                               \
...

and worker.terminate() makes IsExecutionTerminatingCheck(isolate) true.

Right - I mean what exception is causing this? IsExecutionTerminatingCheck() would return true only if there is an exception scheduled to be thrown.

My thoughts were that the current flow is something like this:

  1. Main thread calls worker.terminate, which invokes stopThread -> Worker::StopThread -> Worker::Exit -> Stop(env) -> Environment::ExitEnv -> Isolate::TerminateExecution.
    this[kHandle].stopThread();

    isolate->stack_guard()->RequestTerminateExecution();
  2. Worker thread eventually hits a stack guard which invokes internal::Isolate::TerminateExecution
    return isolate_->TerminateExecution();
  3. That throws terminate_exception, which is the thing that IsExecutionTerminatingCheck is checking:
    Object Isolate::TerminateExecution() {

So I think the answer to your question is "the terminate_exception".

I suppose another option here would be to check env->can_call_into_js as suggested in the README: https://github.com/nodejs/node/blob/master/src/README.md#calling-into-javascript

OK. I was able to add a test. On my machine, it fails 100/100 times without this patch and 0/100 times with the patch.

I'm not able to reproduce the crash locally with this test. 🤔

Can you share what system you are on? For me, that test case reproduces 100% of the time on both a 2019 Macbook Pro (MacBookPro16,1 running Big Sur 11.6) and a cloud VM (AWS m5a.2xlarge, Ubuntu 20.04.3 LTS, Linux version 5.11.0-1027-aws). (It does seem to be pretty timing-dependent so looping could help with flakiness -- if I do it while ninja is compiling, it drops to only reproducing ~70% of the time.)

Copy link
Contributor

@RaisinTen RaisinTen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for elaborating! This seems like the same case as #42910.

Can you share what system you are on?

I had initially tested this on my Intel MBP running Monterey 12.3.1 and later on Ubuntu 20.04 running on Parallels Linux because I saw your GDB logs and assumed that you were on Linux.

I have suggested a test that crashes on all of these platforms consistently.

src/stream_base.cc Outdated Show resolved Hide resolved
@airtable-keyhanvakil
Copy link
Contributor Author

Thank you for the suggestions! I pasted the test code in & changed to use env->is_stopping() instead.

Copy link
Contributor

@RaisinTen RaisinTen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@nodejs-github-bot

This comment was marked as outdated.

@RaisinTen RaisinTen added the author ready PRs that have at least one approval, no pending requests for changes, and a CI started. label May 4, 2022
@RaisinTen
Copy link
Contributor

@addaleax PTAL.

@nodejs-github-bot
Copy link
Collaborator

@airtable-keyhanvakil
Copy link
Contributor Author

So it looks like the two CI builds both failed on the new test with a timeout:

  1. https://ci.nodejs.org/job/node-test-commit-custom-suites-freestyle/25122/testReport/junit/(root)/test/parallel_test_worker_http2_stream_terminate/
  2. https://ci.nodejs.org/job/node-test-commit-custom-suites-freestyle/25114/testReport/junit/(root)/test/parallel_test_worker_http2_stream_terminate/

Any thoughts here? it takes much less than 2 minutes on my system, but I suppose there's a lot of contention in CI:

$ time ./node test/parallel/test-worker-http2-stream-terminate.js 

real    0m10.303s
user    0m25.510s
sys     0m3.807s

@nodejs-github-bot
Copy link
Collaborator

@RaisinTen
Copy link
Contributor

I think you could try to quit the process in a setTimeout() based on the timeout value returned by common.platformTimeout() but maybe there's a better way. @richardlau could you please confirm?

@richardlau
Copy link
Member

So it looks like the two CI builds both failed on the new test with a timeout:

  1. https://ci.nodejs.org/job/node-test-commit-custom-suites-freestyle/25122/testReport/junit/(root)/test/parallel_test_worker_http2_stream_terminate/
  2. https://ci.nodejs.org/job/node-test-commit-custom-suites-freestyle/25114/testReport/junit/(root)/test/parallel_test_worker_http2_stream_terminate/

Any thoughts here? it takes much less than 2 minutes on my system, but I suppose there's a lot of contention in CI:

$ time ./node test/parallel/test-worker-http2-stream-terminate.js 

real    0m10.303s
user    0m25.510s
sys     0m3.807s

https://ci.nodejs.org/job/node-test-commit-custom-suites-freestyle/ URLs are going to be the tests that run in Worker threads, so you'll want to try with

 ./tools/test.py --worker parallel/test-worker-http2-stream-terminate

@airtable-keyhanvakil
Copy link
Contributor Author

airtable-keyhanvakil commented May 5, 2022

So it looks like the two CI builds both failed on the new test with a timeout:

  1. https://ci.nodejs.org/job/node-test-commit-custom-suites-freestyle/25122/testReport/junit/(root)/test/parallel_test_worker_http2_stream_terminate/
  2. https://ci.nodejs.org/job/node-test-commit-custom-suites-freestyle/25114/testReport/junit/(root)/test/parallel_test_worker_http2_stream_terminate/

Any thoughts here? it takes much less than 2 minutes on my system, but I suppose there's a lot of contention in CI:

$ time ./node test/parallel/test-worker-http2-stream-terminate.js 

real    0m10.303s
user    0m25.510s
sys     0m3.807s

https://ci.nodejs.org/job/node-test-commit-custom-suites-freestyle/ URLs are going to be the tests that run in Worker threads, so you'll want to try with

 ./tools/test.py --worker parallel/test-worker-http2-stream-terminate

Thanks, that did it. I cribbed the HAS_STARTED_WORKER pattern from the other tests in this [e: directory]. Presumably the same sort of pattern needs to be applied to #42910.

@nodejs-github-bot
Copy link
Collaborator

@nodejs-github-bot
Copy link
Collaborator

@aduh95 aduh95 added commit-queue Add this label to land a pull request using GitHub Actions. commit-queue-squash Add this label to instruct the Commit Queue to squash all the PR commits into the first one. labels May 6, 2022
@nodejs-github-bot nodejs-github-bot removed the commit-queue Add this label to land a pull request using GitHub Actions. label May 6, 2022
@nodejs-github-bot nodejs-github-bot merged commit 0fc1cf4 into nodejs:master May 6, 2022
@nodejs-github-bot
Copy link
Collaborator

Landed in 0fc1cf4

RafaelGSS pushed a commit that referenced this pull request May 10, 2022
`OnStreamAfterReqFinished` uses `v8::Object::Has` to check if it needs
to call `oncomplete`. `v8::Object::Has` needs to execute Javascript.
However when worker threads are involved, `OnStreamAfterReqFinished` may
be called after the worker thread termination has begun via
`worker.terminate()`. This makes `v8::Object::Has` return `Nothing`,
which triggers an assert.

This diff fixes the issue by simply defaulting us to `false` in the case
where `Nothing` is returned. This is sound because we can't execute
`oncomplete` anyway as the isolate is terminating.

Fixes: #38418

PR-URL: #42874
Reviewed-By: Darshan Sen <raisinten@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Antoine du Hamel <duhamelantoine1995@gmail.com>
@RafaelGSS RafaelGSS mentioned this pull request May 10, 2022
juanarbol pushed a commit that referenced this pull request May 31, 2022
`OnStreamAfterReqFinished` uses `v8::Object::Has` to check if it needs
to call `oncomplete`. `v8::Object::Has` needs to execute Javascript.
However when worker threads are involved, `OnStreamAfterReqFinished` may
be called after the worker thread termination has begun via
`worker.terminate()`. This makes `v8::Object::Has` return `Nothing`,
which triggers an assert.

This diff fixes the issue by simply defaulting us to `false` in the case
where `Nothing` is returned. This is sound because we can't execute
`oncomplete` anyway as the isolate is terminating.

Fixes: #38418

PR-URL: #42874
Reviewed-By: Darshan Sen <raisinten@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Antoine du Hamel <duhamelantoine1995@gmail.com>
danielleadams pushed a commit that referenced this pull request Jun 27, 2022
`OnStreamAfterReqFinished` uses `v8::Object::Has` to check if it needs
to call `oncomplete`. `v8::Object::Has` needs to execute Javascript.
However when worker threads are involved, `OnStreamAfterReqFinished` may
be called after the worker thread termination has begun via
`worker.terminate()`. This makes `v8::Object::Has` return `Nothing`,
which triggers an assert.

This diff fixes the issue by simply defaulting us to `false` in the case
where `Nothing` is returned. This is sound because we can't execute
`oncomplete` anyway as the isolate is terminating.

Fixes: #38418

PR-URL: #42874
Reviewed-By: Darshan Sen <raisinten@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Antoine du Hamel <duhamelantoine1995@gmail.com>
@airtable-keyhanvakil
Copy link
Contributor Author

Somebody in #43304 asked if this could be backported to earlier releases (v14 / v16). I think that is a good idea. I am not sure how that process works, but it seems like this bug fix would cherry-pick cleanly onto those branches. How can I get these changes backported?

@aduh95
Copy link
Contributor

aduh95 commented Jul 6, 2022

## What can be backported?
The "Current" release line is much more lenient than the LTS release lines in
what can be landed. Our LTS release lines (see the [Release Plan][])
require that commits mature in the Current release for at least 2 weeks before
they can be landed in an LTS staging branch. Only after "maturation" will those
commits be cherry-picked or backported.

It should be included in the next version of Node.js 16.x without any action from you; regarding v14.x, it's in maintenance mode so it won't land there unless someone opens a backport PR to v14.x-staging. The process for doing that is explained at https://github.com/nodejs/node/blob/main/doc/contributing/backporting-to-release-lines.md#how-to-submit-a-backport-pull-request.

targos pushed a commit that referenced this pull request Jul 12, 2022
`OnStreamAfterReqFinished` uses `v8::Object::Has` to check if it needs
to call `oncomplete`. `v8::Object::Has` needs to execute Javascript.
However when worker threads are involved, `OnStreamAfterReqFinished` may
be called after the worker thread termination has begun via
`worker.terminate()`. This makes `v8::Object::Has` return `Nothing`,
which triggers an assert.

This diff fixes the issue by simply defaulting us to `false` in the case
where `Nothing` is returned. This is sound because we can't execute
`oncomplete` anyway as the isolate is terminating.

Fixes: #38418

PR-URL: #42874
Reviewed-By: Darshan Sen <raisinten@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Antoine du Hamel <duhamelantoine1995@gmail.com>
targos pushed a commit that referenced this pull request Jul 31, 2022
`OnStreamAfterReqFinished` uses `v8::Object::Has` to check if it needs
to call `oncomplete`. `v8::Object::Has` needs to execute Javascript.
However when worker threads are involved, `OnStreamAfterReqFinished` may
be called after the worker thread termination has begun via
`worker.terminate()`. This makes `v8::Object::Has` return `Nothing`,
which triggers an assert.

This diff fixes the issue by simply defaulting us to `false` in the case
where `Nothing` is returned. This is sound because we can't execute
`oncomplete` anyway as the isolate is terminating.

Fixes: #38418

PR-URL: #42874
Reviewed-By: Darshan Sen <raisinten@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Antoine du Hamel <duhamelantoine1995@gmail.com>
@tony19 tony19 mentioned this pull request Aug 17, 2022
9 tasks
guangwong pushed a commit to noslate-project/node that referenced this pull request Oct 10, 2022
`OnStreamAfterReqFinished` uses `v8::Object::Has` to check if it needs
to call `oncomplete`. `v8::Object::Has` needs to execute Javascript.
However when worker threads are involved, `OnStreamAfterReqFinished` may
be called after the worker thread termination has begun via
`worker.terminate()`. This makes `v8::Object::Has` return `Nothing`,
which triggers an assert.

This diff fixes the issue by simply defaulting us to `false` in the case
where `Nothing` is returned. This is sound because we can't execute
`oncomplete` anyway as the isolate is terminating.

Fixes: nodejs/node#38418

PR-URL: nodejs/node#42874
Reviewed-By: Darshan Sen <raisinten@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Antoine du Hamel <duhamelantoine1995@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
author ready PRs that have at least one approval, no pending requests for changes, and a CI started. c++ Issues and PRs that require attention from people who are familiar with C++. commit-queue-squash Add this label to instruct the Commit Queue to squash all the PR commits into the first one. needs-ci PRs that need a full CI run.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Calling worker.terminate() while http/2 reads on a stream crashes the whole process with FATAL ERROR / SIGABRT
6 participants