Graceful shutdown #1320
Conversation
LGTM, although @pgavlin should also look, since this is pretty subtle stuff!
sdk/nodejs/runtime/resource.ts
Outdated
 *
 * We can accomplish both by just doing nothing until the engine kills us. It's ugly, but it works.
 */
function waitForDeath(): Promise<void> {
Maybe have this return Promise<never>? This should help with TS control flow inference.
will do. I'm not sure this even needs to be a promise. Originally this returned a promise that resolved after a 30 second sleep but I think what we actually want to do here is starve the event loop so nothing else happens.
pkg/resource/deploy/source_eval.go
Outdated
case rm.regChan <- step:
case <-rm.cancel:
    glog.V(5).Infof("ResourceMonitor.RegisterResource operation canceled, name=%s", name)
    return nil, rpcerror.New(codes.Unavailable, "resource monitor is shutting down")
Should we use slightly different error messages for each of these cancellation conditions, just in case it comes in handy when debugging based on logs and/or CLI error messages in the field? I'm just thinking something simple like "resource monitor shut down while waiting on step", "...while waiting on step's done channel", etc.
sounds useful, will do.
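For illustration only, here is a rough sketch of what distinct messages per blocking point could look like, using plain gRPC status errors rather than the rpcerror helper shown in the diff; sendStep and registerResourceEvent are hypothetical stand-ins for the real code in source_eval.go.

```go
package deploy

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// registerResourceEvent is a stand-in for the event the monitor hands to the
// engine; the real type lives in source_eval.go.
type registerResourceEvent struct {
	done chan struct{}
}

// sendStep blocks until the engine accepts the step or the monitor is
// cancelled. Each cancellation point carries its own message so logs and CLI
// errors show exactly where shutdown interrupted the RPC.
func sendStep(regChan chan<- *registerResourceEvent, cancel <-chan bool, step *registerResourceEvent) error {
	select {
	case regChan <- step:
		// The engine accepted the step; now wait for it to complete.
	case <-cancel:
		return status.Error(codes.Unavailable, "resource monitor shut down while waiting on step")
	}

	select {
	case <-step.done:
		return nil
	case <-cancel:
		return status.Error(codes.Unavailable, "resource monitor shut down while waiting on step's done channel")
	}
}
```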
log.debug(`RegisterResource RPC finished: ${label}; err: ${err}, resp: ${innerResponse}`);
if (err) {
    // If the monitor is unavailable, it is in the process of shutting down or has already
    // shut down. Don't emit an error and don't do any more RPCs.
    if (err.code === grpc.status.UNAVAILABLE) {
Do we need code like this inside of the invoke.ts file?
Yeah, I think so - I forget about that endpoint...
@pgavlin do you mind taking a look when you get a chance?
The latest commit fixes a goroutine shutdown race that caused the Linux test leg to fail in the previous commit.
Did you also take a look at the interactions between the
Yes, we'll kill all plugins before shutting down the
pulumi/pkg/resource/plugin/host.go Lines 378 to 406 in 0090962
LGTM
Fixes #701.
The context here: we are currently using gRPC 1.7, where an RPC server has a GracefulStop method that "gracefully" shuts down the RPC server. For whatever reason (probably a bug fix), gRPC 1.8 and above changed the semantics of this method to drain all incoming RPCs before shutting down the server. This causes us to deadlock on failed deployments, because we are relying on the old behavior of GracefulStop to tear down the server without regard for in-flight RPCs. (Again, arguably a gRPC bug.)

This PR shuts down more properly. The shutdown procedure on the main goroutine when a deployment fails now looks like this (a minimal sketch of the Close/Cancel chain follows the list):
1. planResult.Walk() closes a PlanIterator that it is iterating over.
2. PlanIterator.Close() closes the SourceIterator that it is using to generate resource registration events.
3. evalSourceIterator.Close(), an implementation of SourceIterator, calls Cancel on resmon, the RPC server.
4. resmon.Cancel closes the rm.cancel channel and then reads from the rm.done channel.
5. PlanIterator.Close() returns, planResult.Walk() returns an error, and that error eventually bubbles up to main, which causes the CLI to exit.
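A minimal sketch of steps 2-4, assuming a Cancel method on the monitor (shown in more detail after the next paragraph); the types here are heavily simplified placeholders, not the actual implementation.

```go
package deploy

// Hypothetical shape of the Close chain: closing the source iterator cancels
// the resource monitor, and Cancel blocks until the RPC server has stopped,
// which is what lets PlanIterator.Close and planResult.Walk return safely.
type evalSourceIterator struct {
	mon interface{ Cancel() error } // the resmon RPC server
}

func (iter *evalSourceIterator) Close() error {
	return iter.mon.Cancel()
}
```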
When resmon is created, the rm.cancel channel is given to a goroutine that listens to it and, when it is closed or yields true, calls GracefulStop on the RPC server and then sends the error return value of GracefulStop to rm.done. Therefore, when step 4 completes, the RPC server is no longer running.
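A rough sketch of that goroutine and the Cancel call it pairs with, assuming a standard grpc.Server and the channel names used above; listenForCancellation is a hypothetical name, and the real code in source_eval.go differs in detail.

```go
package deploy

import "google.golang.org/grpc"

// resmon sketches the cancellation plumbing described above; the field names
// follow the PR description, everything else is illustrative.
type resmon struct {
	srv    *grpc.Server
	cancel chan bool
	done   chan error
}

// listenForCancellation is the goroutine spawned when resmon is created: once
// rm.cancel is closed or yields a value, it stops the RPC server and reports
// the result on rm.done.
func (rm *resmon) listenForCancellation() {
	go func() {
		<-rm.cancel
		// In gRPC 1.8+, GracefulStop waits for in-flight RPCs to finish, which
		// is why those RPCs must also select on rm.cancel and bail out.
		rm.srv.GracefulStop()
		rm.done <- nil // the real code reports any error from stopping the server
	}()
}

// Cancel is what evalSourceIterator.Close invokes: close the cancel channel,
// then wait for the server to finish shutting down.
func (rm *resmon) Cancel() error {
	close(rm.cancel)
	return <-rm.done
}
```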
In order to not deadlock, we need to cancel all in-flight RPCs when cancellation reaches step 4. To do this, RegisterResource and RegisterResourceOutputs now select on rm.cancel when performing blocking operations. If a read from rm.cancel occurs (i.e. the main goroutine called Cancel and closed it), the RPC immediately returns an error with code UNAVAILABLE.
Back in the language host, each language will notice that registerResource or registerResourceOutputs failed. If a language notices that either RPC call failed with an UNAVAILABLE error code, it'll go into an infinite loop. This is a bit of a hack, but the objective is to 1) be sure that we don't do any additional RPCs, because the RPC server is shutting down, and 2) not advance the program at all, since the last RPC did not complete successfully. The language host will get killed shortly when the CLI exits.

All of this combines to produce a graceful exit when errors occur. This PR removes the gRPC version constraint in our Gopkg.toml, which ultimately gives us gRPC v1.11.