
Graceful shutdown #1320

Merged: 9 commits from swgillespie/graceful-shutdown into master on May 16, 2018

Conversation

swgillespie (Contributor)

Fixes #701.

The context here: we are currently on gRPC 1.7, where an RPC server has a GracefulStop method that "gracefully" shuts down the RPC server. gRPC 1.8 and above changed the semantics of this method (probably as a bug fix) so that it drains all in-flight RPCs before shutting down the server. This causes us to deadlock on failed deployments, because we rely on the old behavior of GracefulStop tearing down the server without regard for in-flight RPCs. (Again, the old behavior was arguably a gRPC bug.)

This PR implements a more orderly shutdown. When a deployment fails, the shutdown procedure on the main goroutine now looks like this:

  1. An error occurs, and planResult.Walk() closes the PlanIterator it is iterating over.
  2. PlanIterator.Close() closes the SourceIterator it has been using to generate resource registration events.
  3. evalSourceIterator.Close() (an implementation of SourceIterator) calls Cancel on resmon, the RPC server.
  4. resmon.Cancel closes the rm.cancel channel and then reads from the rm.done channel.
  5. Once PlanIterator.Close() returns, planResult.Walk() returns an error, which eventually bubbles up to main and causes the CLI to exit.

When resmon is created, the rm.cancel channel is given to a goroutine that listens to it and, when it is closed or yields true, calls GracefulStop on the RPC server and then sends the error return value of GracefulStop to rm.done. Therefore, when step 4 completes, the RPC server is no longer running.
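
To make steps 3 and 4 concrete, here is a minimal Go sketch of the cancellation plumbing described above. The channel names rm.cancel and rm.done come straight from this description; everything else is simplified for illustration: the trimmed-down resmon struct, the watchForCancellation name, and the nil sent on rm.done are made up for the sketch, and note that grpc.Server.GracefulStop itself returns nothing, whereas the real code forwards the server's shutdown error.

package resmon

import "google.golang.org/grpc"

// Trimmed-down view of the resource monitor's cancellation state.
type resmon struct {
    cancel chan bool  // closed by Cancel to request shutdown
    done   chan error // reports that the RPC server has stopped
}

// Started when the resource monitor is created: wait for cancellation,
// then stop the gRPC server and signal completion on rm.done.
func (rm *resmon) watchForCancellation(srv *grpc.Server) {
    go func() {
        <-rm.cancel        // fires when the channel is closed or yields a value
        srv.GracefulStop() // stop accepting new RPCs and drain the server
        rm.done <- nil     // the real code forwards the server's shutdown error here
    }()
}

// Cancel is step 4: close rm.cancel to wake the watcher, then block on rm.done
// so the RPC server is guaranteed to be stopped by the time Cancel returns.
func (rm *resmon) Cancel() error {
    close(rm.cancel)
    return <-rm.done
}

Reading from rm.done (rather than just closing rm.cancel and returning) is what guarantees the server is fully stopped before the CLI proceeds with the rest of its teardown.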

To avoid the deadlock, we need to cancel all in-flight RPCs once cancellation reaches step 4. To do this, RegisterResource and RegisterResourceOutputs now select on rm.cancel whenever they perform blocking operations. If a read from rm.cancel succeeds (i.e. the main goroutine called Cancel and closed the channel), the RPC immediately returns an error with code UNAVAILABLE.
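
Sketched in Go, the RPC-side pattern looks roughly like this. The regChan and cancel channel names, the step's done channel, and the Unavailable message mirror the diff and review comments quoted below; the request, response, and registration types are stand-ins, the resmon struct is trimmed again to just what this sketch needs, and the plain gRPC status package is used here instead of Pulumi's rpcerror helper.

package resmon

import (
    "context"

    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
)

// Stand-ins for the real pulumirpc request/response messages and engine step type.
type RegisterResourceRequest struct{ Name string }
type RegisterResourceResponse struct{}

type registration struct {
    request *RegisterResourceRequest
    done    chan *RegisterResourceResponse // completed by the engine when the step finishes
}

type resmon struct {
    regChan chan *registration // hands registrations to the engine
    cancel  chan bool          // closed once Cancel has been called
}

// Every blocking point also selects on rm.cancel, so an in-flight RPC fails fast
// with codes.Unavailable instead of holding GracefulStop open forever.
func (rm *resmon) RegisterResource(ctx context.Context,
    req *RegisterResourceRequest) (*RegisterResourceResponse, error) {

    step := &registration{request: req, done: make(chan *RegisterResourceResponse)}

    // Hand the registration to the engine, unless shutdown wins the race.
    select {
    case rm.regChan <- step:
    case <-rm.cancel:
        return nil, status.Error(codes.Unavailable, "resource monitor is shutting down")
    }

    // Wait for the engine to complete the step, again bailing out on shutdown.
    select {
    case resp := <-step.done:
        return resp, nil
    case <-rm.cancel:
        return nil, status.Error(codes.Unavailable, "resource monitor is shutting down")
    }
}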

Back in the language host, each language SDK will notice that registerResource or registerResourceOutputs failed. If either RPC call fails with an UNAVAILABLE error code, the language host deliberately goes into an infinite wait, doing nothing until it is killed. This is a bit of a hack, but the objective is to 1) ensure that we issue no additional RPCs, because the RPC server is shutting down, and 2) not advance the program at all, since the last RPC did not complete successfully. The language host is killed shortly afterward, when the CLI exits.

All of this combines to produce a graceful exit when errors occur. This PR removes the gRPC version constraint in our Gopkg.toml, which ultimately gives us gRPC v1.11.

@joeduffy (Member) left a comment

LGTM, although @pgavlin should also look, since this is pretty subtle stuff!

 *
 * We can accomplish both by just doing nothing until the engine kills us. It's ugly, but it works.
 */
function waitForDeath(): Promise<void> {

Member

Maybe have this return Promise<never>? This should help with TS control flow inference.

Contributor Author

Will do. I'm not sure this even needs to be a promise. Originally this returned a promise that resolved after a 30-second sleep, but I think what we actually want to do here is starve the event loop so that nothing else happens.

case rm.regChan <- step:
case <-rm.cancel:
    glog.V(5).Infof("ResourceMonitor.RegisterResource operation canceled, name=%s", name)
    return nil, rpcerror.New(codes.Unavailable, "resource monitor is shutting down")

Member

Should we use slightly different error messages for each of these cancellation conditions, just in case it comes in handy debugging based on logs and/or CLI error messages in the field? I'm just thinking something simple like resource monitor shut down while waiting on step, ...while waiting on step's done channel, etc.

Contributor Author

sounds useful, will do.
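
For example, the two cancellation branches could carry phase-specific messages along these lines; the wording and the shuttingDown helper are hypothetical, not taken from the final commit.

package resmon

import (
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
)

// shuttingDown builds an Unavailable error that records which phase of the RPC
// was interrupted, so field logs and CLI errors show where cancellation hit.
func shuttingDown(phase string) error {
    return status.Errorf(codes.Unavailable, "resource monitor shut down while %s", phase)
}

// In the two select statements above, the cancel branches would then read:
//
//   case <-rm.cancel:
//       return nil, shuttingDown("waiting to send step")
//   ...
//   case <-rm.cancel:
//       return nil, shuttingDown("waiting on step's done channel")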

log.debug(`RegisterResource RPC finished: ${label}; err: ${err}, resp: ${innerResponse}`);
if (err) {
    // If the monitor is unavailable, it is in the process of shutting down or has already
    // shut down. Don't emit an error and don't do any more RPCs.
    if (err.code === grpc.status.UNAVAILABLE) {

Member

Do we need code like this inside of the invoke.ts file?

Contributor Author

Yeah, I think so; I had forgotten about that endpoint...

@swgillespie (Contributor Author)

@pgavlin do you mind taking a look when you get a chance?

@swgillespie (Contributor Author)

The latest commit fixes a goroutine shutdown race that caused the Linux test leg to fail in the previous commit.

@swgillespie force-pushed the swgillespie/graceful-shutdown branch from 25667e2 to c30f9d3 on May 16, 2018 19:42

@pgavlin (Member) commented May 16, 2018

Did you also take a look at the interactions between the engine endpoint and the language plugins? Will we kill the language plugins/providers/etc. before shutting down that endpoint?

@swgillespie (Contributor Author)

Will we kill the language plugins/providers/etc. before shutting down that endpoint?

Yes, we'll kill all plugins before shutting down the engine endpoint:

func (host *defaultHost) Close() error {
    // Close all plugins.
    for _, plug := range host.analyzerPlugins {
        if err := plug.Plugin.Close(); err != nil {
            logging.Infof("Error closing '%s' analyzer plugin during shutdown; ignoring: %v", plug.Info.Name, err)
        }
    }
    for _, plug := range host.resourcePlugins {
        if err := plug.Plugin.Close(); err != nil {
            logging.Infof("Error closing '%s' resource plugin during shutdown; ignoring: %v", plug.Info.Name, err)
        }
    }
    for _, plug := range host.languagePlugins {
        if err := plug.Plugin.Close(); err != nil {
            logging.Infof("Error closing '%s' language plugin during shutdown; ignoring: %v", plug.Info.Name, err)
        }
    }
    // Empty out all maps.
    host.analyzerPlugins = make(map[tokens.QName]*analyzerPlugin)
    host.languagePlugins = make(map[string]*languagePlugin)
    host.resourcePlugins = make(map[tokens.Package]*resourcePlugin)
    // Shut down the plugin loader.
    close(host.loadRequests)
    // Finally, shut down the host's gRPC server.
    return host.server.Cancel()
}

@pgavlin (Member) left a comment

LGTM

@swgillespie merged commit 6891190 into master on May 16, 2018

@swgillespie (Contributor Author)

thanks @pgavlin @joeduffy !

@swgillespie deleted the swgillespie/graceful-shutdown branch on May 16, 2018 22:37

Successfully merging this pull request may close these issues.

Fix RPC teardown (gRPC 1.7)