Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

node-api: fix shutdown crashes #38492

Closed
wants to merge 6 commits into from
Closed

Conversation

@mhdawson
Copy link
Member

@mhdawson mhdawson commented Apr 30, 2021

Refs: nodejs/node-addon-api#906

Ensure that finalization is not defered during shutdown.
The env for the addon is deleted immediately after
iterating the list of finalizers to be run. Defering
causes crashes as the finalization uses the already
deleted env.

Signed-off-by: Michael Dawson mdawson@devrus.com

Refs: nodejs/node-addon-api#906

Ensure that finalization is not defered during shutdown.
The env for the addon is deleted immediately after
iterating the list of finalizers to be run. Defering
causes crashes as the finalization uses the already
deleted env.

Signed-off-by: Michael Dawson <mdawson@devrus.com>
@mhdawson mhdawson marked this pull request as draft Apr 30, 2021
@mhdawson
Copy link
Member Author

@mhdawson mhdawson commented Apr 30, 2021

Converted to draft as I'm still validating that it fixes the crashes we saw with version 14.x

On master it looked like it resolved the issue, but I only ran for 3-4 hours (544 runs). Last time I got about 16 failures in 1311 runs on 14.x and 2 failures in 386 runs on main.

Want to run on 14.x and to run for longer to confirm. If it still looks good will move out of WIP.

For the run on main, I saw no crashes but valgrind still report some different issues, those will be addressed in follow up PRs. We already discussed one additional fix in the node-api team meeting today that might address the remaining issue. It will be good to get this one in first, as I'm hoping it will resolve the intermittent failures we are seeing in the node-addon-api tests so that we can do some releases.

Copy link
Member

@addaleax addaleax left a comment

If the PR description is accurate, the proper fix would be calling Ref() when scheduling the SetImmediate call, and calling Unref() after it's contents have executed.

Either way, this is going to need tests as well

@mhdawson
Copy link
Member Author

@mhdawson mhdawson commented Apr 30, 2021

@gabrielschulhof you know the shutdown better than me. During env shutdown does it use the ref/unref or call the destructor directly ? I'd tried moving the iteration of the lists into the section with the ref counting but that seemed to cause it not to be run.

@mhdawson
Copy link
Member Author

@mhdawson mhdawson commented Apr 30, 2021

Already failing on 14.x after not too long, going to try to pull in #38469 to see if that makes a difference.

@mhdawson
Copy link
Member Author

@mhdawson mhdawson commented May 1, 2021

Seems like still failing on 14.x 40 crashes in 2125 runs. On quick inspection the stack traces still look like nodejs/node-addon-api#906 (comment).

@addaleax
Copy link
Member

@addaleax addaleax commented May 1, 2021

During env shutdown does it use the ref/unref or call the destructor directly ?

I guess my point is, more specifically, if the lifetime management of napi_env is the issue here, than that’s also what the fix should account for.

@mhdawson
Copy link
Member Author

@mhdawson mhdawson commented May 1, 2021

@addaleax

The original code looks like:

 virtual ~napi_env__() {
    // First we must finalize those references that have `napi_finalizer`
    // callbacks. The reason is that addons might store other references which
    // they delete during their `napi_finalizer` callbacks. If we deleted such
    // references here first, they would be doubly deleted when the
    // `napi_finalizer` deleted them subsequently.
    v8impl::RefTracker::FinalizeAll(&finalizing_reflist);
    v8impl::RefTracker::FinalizeAll(&reflist);
  }

  inline void Unref() { if ( --refs == 0) delete this; }

Once ~napi_env__() completes there should be nothing that was deferred and needs the env.

The original suggestion we'd discussed in the Node-api meeting was moving the FinalizeAll into the Unref before the delete in addition to the changes I made. I had tried that but it broke the test that validated that the finalizer actually ran. My question to Gabriel was if he knew why that might be.

I could be the right answer is that we add some Ref/Unrefs, but the code above would also have to be able to handle refs going to 0, running the finalizers, then only deleting the env if refs was still 0, and then possibly running the FinalizeAll again when refs reached 0 again.

Gabriel suggested avoiding the SetImmediate during shutdown since we are not actually in GC so it's not necessarily needed. That would still leave issues if though if there were already Finalizers that had been deferred.

I think the suggestion of removing the SetImmediate during env teardown was to make sure the finalizers being run all completed before the life of napi_env was over.

@addaleax
Copy link
Member

@addaleax addaleax commented May 1, 2021

@mhdawson Yeah, I think that's fair. Could you leave a comment in the source explaining why in this specific situation it's okay to skip the SetImmediate() here then?

(I'll just leave the "request changes" here, because, as I mentioned, this should come with tests.)

@mhdawson
Copy link
Member Author

@mhdawson mhdawson commented May 1, 2021

@addaleax will do. I'm still seeing failures on 14.x so want to figure out what the else we might need from main before pulling this out of WIP.

Signed-off-by: Michael Dawson <mdawson@devrus.com>
@mhdawson
Copy link
Member Author

@mhdawson mhdawson commented May 1, 2021

Pushed another commit to cover a second case where Finalization was being deferred.

src/node_api.cc Outdated
});
});
});
}
Copy link
Member

@addaleax addaleax May 1, 2021

Can you store the lambda above in a variable and either pass it to SetImmediate or call it directly, instead of duplicating the code?

Copy link
Member Author

@mhdawson mhdawson May 2, 2021

sure, just pushed a an update which looks like it resolved the crashes on master. Will look at refactoring once I confirm that it fixes the crashses we were seeing on master.

Copy link
Member Author

@mhdawson mhdawson May 2, 2021

The code is not quite the same anymore, but there is still some that should be able to be shared.

@mhdawson
Copy link
Member Author

@mhdawson mhdawson commented May 3, 2021

Ok, more details in nodejs/node-addon-api#906 but looks like the current version of this patch resolves the crash in main and 14.x (for 14.x I had also pulled back other changes to finalization as well).

Now I'll look at refactoring and how we might test.

@mhdawson
Copy link
Member Author

@mhdawson mhdawson commented May 6, 2021

@legendecas, @gabrielschulhof I've spent a number of hours trying to create a test and I'm not making any progress. I can trigger the paths where the changes have been made but not the combination where they are triggered and the env has already been deleted. I'm wondering if it takes the gc running at just the wrong time.

It would be great of you two could take a look and confirm that the changes make sense to you and if you can give me any suggestions on how a test might trigger the problematic case.

@legendecas
Copy link
Member

@legendecas legendecas commented May 6, 2021

@mhdawson when I was working on #38000 I encountered the exact same issue. We are unable to produce a stable output with mere node-api c interfaces. What I am thinking about is that maybe we can create a different set of tests, in which we can facilitate both node-api c interface and node c++ interface, so that we can interleave both environments (node-api env and node Environment) to ensure the code running into the path we are expecting, i.e. directly calling into v8 for requesting gc without calling into js or picking any timing on teardown to run arbitrary native code with node Environment. As the new test set is testing on node's own specific implementation, for sure those tests should not be considered as part of the node-api behavior tests, which we can extract from node-core repo to test other node-api implementations as well. If you are thinking this is a good approach, I'd like to start working on it since we can still not test #38000 in a stable sense.

@mhdawson
Copy link
Member Author

@mhdawson mhdawson commented May 6, 2021

a different set of tests, in which we can facilitate both node-api c interface and node c++ interface,

@legendecas I had been working on a new test in test/js-native-api/test_finalization, copying code from the other tests to try to trigger the case. I think from an addon we can run both node-api code as well as node internals. In the extreme case the test.js file might just call one native method on the addon.

Is that what you had in mind or something else ?

@mhdawson
Copy link
Member Author

@mhdawson mhdawson commented May 6, 2021

Since I built up the PR to something that resolve the problem, I'm now validating what parts or combinations of parts are needed to resolve the problem. From my attempts to recreate, I'm not sure that the check for is terminating is needed, although it may still make sense. Similarly I'd added a Ref/Unref for both setImmediates not sure if they are both needed to resolve the issue, although may also still make sense.

@mhdawson
Copy link
Member Author

@mhdawson mhdawson commented May 6, 2021

Will take a bit as each check after a change requires in the order of 6 hours running on 100 cpus to validate.

@mhdawson
Copy link
Member Author

@mhdawson mhdawson commented May 7, 2021

Removed the check for the env teardown, no failures in about 450 runs.

Next removed the Ref/UnRef in CallFinalizer in node-api.cc, 1 crash within the first 300 runs without it. So it is needed.

Next adding back in the Ref/UnRef in CallFinalizer and commenting it out in the Buffer Finalizer section. 600 runs so far without a failure. Will let run overnight. If no failures that would mean that the minimum change to address the crashes we were seeing in master is the Ref/Unref in CallFinalizer. We'll see.

Will discuss in the node-api meeting tomorrow as the other changes may still make sense even if they don' address the specific failure we were encountering.

@legendecas
Copy link
Member

@legendecas legendecas commented May 7, 2021

a different set of tests, in which we can facilitate both node-api c interface and node c++ interface,

@legendecas I had been working on a new test in test/js-native-api/test_finalization, copying code from the other tests to try to trigger the case. I think from an addon we can run both node-api code as well as node internals. In the extreme case the test.js file might just call one native method on the addon.

Is that what you had in mind or something else ?

Yes.

I believe there is no such constraints that we have to use single addon in one test right? In that case we can build two target, one in node-api and one with node internals.

@mhdawson
Copy link
Member Author

@mhdawson mhdawson commented May 7, 2021

@legendecas the tests do already have addon tests that are with/without Node-api, so yes, but were you thinking they would be used together in a single test or test independently?

I think we could also figure out testing that would use both node-api and node internal in a single addon. It's not something you would do in a real add-on but for a test we could make it work.

@mhdawson
Copy link
Member Author

@mhdawson mhdawson commented May 7, 2021

Next adding back in the Ref/UnRef in CallFinalizer and commenting it out in the Buffer Finalizer section. 600 runs so far without a failure. Will let run overnight. If no failures that would mean that the minimum change to address the crashes we were seeing in master is the Ref/Unref in CallFinalizer. We'll see.

At least for main it looks like only the change in CallFinalizer is required to address the crashes, 2529 runs overnight and no crashes seen.

@legendecas
Copy link
Member

@legendecas legendecas commented May 7, 2021

the tests do already have addon tests that are with/without Node-api, so yes, but were you thinking they would be used together in a single test or test independently?

Yes, we can use them together in a single test.

I think we could also figure out testing that would use both node-api and node internal in a single addon. It's not something you would do in a real add-on but for a test we could make it work.

IIRC, one single addon dynamic library can only have single add-on entry been valid and recognizable by Node.js. See https://github.com/nodejs/node/blob/master/src/node_binding.cc#L475-478 . So it doesn't seem to be quite feasible to put them two in a single binary?

@@ -122,6 +122,37 @@ struct napi_env__ {
void* instance_data = nullptr;
};

// This class is used to keep a napi_env live in a way that
// is exception safe versus calling Ref/Unref directly
class LiveEnv {
Copy link
Member

@legendecas legendecas May 11, 2021

Maybe we can call this EnvRefHolder or similar in a more explicit way?

Copy link
Member Author

@mhdawson mhdawson May 12, 2021

@gabrielschulhof, @addaleax, @KevinEady, thoughts on the name? Does EnvRefHolder sound good to you?

Copy link
Member Author

@mhdawson mhdawson May 13, 2021

I'm good with EnvRefHolder and see Anna was +1 as well. Will update.

Copy link
Member Author

@mhdawson mhdawson May 13, 2021

@legendecas, @addaleax updated if you want to take another look.

@nodejs-github-bot
Copy link
Contributor

@nodejs-github-bot nodejs-github-bot commented May 11, 2021

Signed-off-by: Michael Dawson <mdawson@devrus.com>
@gabrielschulhof
Copy link
Contributor

@gabrielschulhof gabrielschulhof commented May 13, 2021

@addaleax I'm not sure if we can create a situation that deterministically creates this scenario. After all, in @mhdawson's testing it took hours of running the test case to produce a crash.

@gabrielschulhof
Copy link
Contributor

@gabrielschulhof gabrielschulhof commented May 14, 2021

@addaleax ... and with the fix, he ran it for hours more, without crashing or leaking.

@mhdawson
Copy link
Member Author

@mhdawson mhdawson commented May 19, 2021

@addaleax From the comments above we don't have any ideas of how to add a test at this point, and it has a few approvals. Just wondering if your "request changes" is still mean to block this from landing.

Copy link
Member

@addaleax addaleax left a comment

@mhdawson I think it’s okay, but it feels like it should be possible to add a cctest that just calls the method in the specfic order that leads to the crash if nothing else works

@mhdawson
Copy link
Member Author

@mhdawson mhdawson commented May 19, 2021

@addaleax

I'd really like to have a test:

  • I'd tried a number of different combinations of trying to cut down the node-addon-test suite, the only way I could recreate was by running the full test suite. Running just the tests which seemed to trigger the code path alone never recreated and even trying to cut out tests that might not have been related ended up with non-recreate (of course due to how long I had to run to recreate I did not try all combinations)
  • I also spent a bunch of time trying to build a cctest, pulling code from existing tests that I validated would that cause the code path being changed to be executed, but was not able to find a combination that would also trigger the crash. I think the GC has to kick in at just the wrong time and I don't know of a way to do that.

Thanks for being willing to let it move forward.

@mhdawson
Copy link
Member Author

@mhdawson mhdawson commented May 19, 2021

CI test since the last one had flakes and is a bit old: https://ci.nodejs.org/job/node-test-pull-request/38218/

@nodejs-github-bot
Copy link
Contributor

@nodejs-github-bot nodejs-github-bot commented May 19, 2021

@nodejs-github-bot
Copy link
Contributor

@nodejs-github-bot nodejs-github-bot commented May 20, 2021

mhdawson added a commit that referenced this issue May 20, 2021
Refs: nodejs/node-addon-api#906

Ensure that finalization is not defered during shutdown.
The env for the addon is deleted immediately after
iterating the list of finalizers to be run. Defering
causes crashes as the finalization uses the already
deleted env.

Signed-off-by: Michael Dawson <mdawson@devrus.com>

PR-URL: #38492
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Chengzhong Wu <legendecas@gmail.com>
Reviewed-By: Gabriel Schulhof <gabrielschulhof@gmail.com>
@mhdawson
Copy link
Member Author

@mhdawson mhdawson commented May 20, 2021

Landed in c32c5f0

@mhdawson
Copy link
Member Author

@mhdawson mhdawson commented May 20, 2021

Cherry-picks cleanly on 16.x so we should not take extra work to get it there.

For 14.x we'll need to backport a number of changes in the same area. I'll take a look at what we have PRs for/not and if this will land cleanly after those have.

@targos
Copy link
Member

@targos targos commented May 30, 2021

Just landed cleanly on v14.x-staging.

targos added a commit that referenced this issue May 30, 2021
Refs: nodejs/node-addon-api#906

Ensure that finalization is not defered during shutdown.
The env for the addon is deleted immediately after
iterating the list of finalizers to be run. Defering
causes crashes as the finalization uses the already
deleted env.

Signed-off-by: Michael Dawson <mdawson@devrus.com>

PR-URL: #38492
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Chengzhong Wu <legendecas@gmail.com>
Reviewed-By: Gabriel Schulhof <gabrielschulhof@gmail.com>
@mhdawson
Copy link
Member Author

@mhdawson mhdawson commented May 31, 2021

@targos thanks for the update!

danielleadams added a commit that referenced this issue May 31, 2021
Refs: nodejs/node-addon-api#906

Ensure that finalization is not defered during shutdown.
The env for the addon is deleted immediately after
iterating the list of finalizers to be run. Defering
causes crashes as the finalization uses the already
deleted env.

Signed-off-by: Michael Dawson <mdawson@devrus.com>

PR-URL: #38492
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Chengzhong Wu <legendecas@gmail.com>
Reviewed-By: Gabriel Schulhof <gabrielschulhof@gmail.com>
@danielleadams danielleadams mentioned this pull request May 31, 2021
targos added a commit that referenced this issue Jun 5, 2021
Refs: nodejs/node-addon-api#906

Ensure that finalization is not defered during shutdown.
The env for the addon is deleted immediately after
iterating the list of finalizers to be run. Defering
causes crashes as the finalization uses the already
deleted env.

Signed-off-by: Michael Dawson <mdawson@devrus.com>

PR-URL: #38492
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Chengzhong Wu <legendecas@gmail.com>
Reviewed-By: Gabriel Schulhof <gabrielschulhof@gmail.com>
targos added a commit that referenced this issue Jun 5, 2021
Refs: nodejs/node-addon-api#906

Ensure that finalization is not defered during shutdown.
The env for the addon is deleted immediately after
iterating the list of finalizers to be run. Defering
causes crashes as the finalization uses the already
deleted env.

Signed-off-by: Michael Dawson <mdawson@devrus.com>

PR-URL: #38492
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Chengzhong Wu <legendecas@gmail.com>
Reviewed-By: Gabriel Schulhof <gabrielschulhof@gmail.com>
targos added a commit that referenced this issue Jun 11, 2021
Refs: nodejs/node-addon-api#906

Ensure that finalization is not defered during shutdown.
The env for the addon is deleted immediately after
iterating the list of finalizers to be run. Defering
causes crashes as the finalization uses the already
deleted env.

Signed-off-by: Michael Dawson <mdawson@devrus.com>

PR-URL: #38492
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Chengzhong Wu <legendecas@gmail.com>
Reviewed-By: Gabriel Schulhof <gabrielschulhof@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

6 participants