
Graceful shutdown #1320

Merged: 9 commits from swgillespie/graceful-shutdown into master on May 16, 2018

Conversation

swgillespie (Contributor)

Fixes #701.

The context here: we are currently on gRPC 1.7, where an RPC server has a GracefulStop method that "gracefully" shuts down the RPC server. gRPC 1.8 and above changed the semantics of this method (probably as a bug fix) so that it drains all in-flight RPCs before shutting down the server. This causes us to deadlock on failed deployments, because we rely on the old behavior of GracefulStop tearing down the server without regard for in-flight RPCs. (Again, the old behavior was arguably a gRPC bug.)

This PR implements a more orderly shutdown. When a deployment fails, the shutdown procedure on the main goroutine now looks like this:

  1. An error occurs, and planResult.Walk() closes the PlanIterator it is iterating over.
  2. PlanIterator.Close() closes the SourceIterator it has been using to generate resource registration events.
  3. evalSourceIterator.Close() (an implementation of SourceIterator) calls Cancel on resmon, the RPC server.
  4. resmon.Cancel closes the rm.cancel channel and then reads from the rm.done channel.
  5. Once PlanIterator.Close() returns, planResult.Walk() returns an error, which eventually bubbles up to main and causes the CLI to exit.

When resmon is created, the rm.cancel channel is given to a goroutine that listens to it and, when it is closed or yields true, calls GracefulStop on the RPC server and then sends the error return value of GracefulStop to rm.done. Therefore, when step 4 completes, the RPC server is no longer running.
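
To make steps 3 and 4 concrete, here is a minimal Go sketch of the cancellation plumbing described above. The channel names rm.cancel and rm.done come straight from this description; everything else is simplified for illustration: the trimmed-down resmon struct, the watchForCancellation name, and the nil sent on rm.done are made up for the sketch, and note that grpc.Server.GracefulStop itself returns nothing, whereas the real code forwards the server's shutdown error.

package resmon

import "google.golang.org/grpc"

// Trimmed-down view of the resource monitor's cancellation state.
type resmon struct {
    cancel chan bool  // closed by Cancel to request shutdown
    done   chan error // reports that the RPC server has stopped
}

// Started when the resource monitor is created: wait for cancellation,
// then stop the gRPC server and signal completion on rm.done.
func (rm *resmon) watchForCancellation(srv *grpc.Server) {
    go func() {
        <-rm.cancel        // fires when the channel is closed or yields a value
        srv.GracefulStop() // stop accepting new RPCs and drain the server
        rm.done <- nil     // the real code forwards the server's shutdown error here
    }()
}

// Cancel is step 4: close rm.cancel to wake the watcher, then block on rm.done
// so the RPC server is guaranteed to be stopped by the time Cancel returns.
func (rm *resmon) Cancel() error {
    close(rm.cancel)
    return <-rm.done
}

Reading from rm.done (rather than just closing rm.cancel and returning) is what guarantees the server is fully stopped before the CLI proceeds with the rest of its teardown.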

To avoid the deadlock, we need to cancel all in-flight RPCs once cancellation reaches step 4. To do this, RegisterResource and RegisterResourceOutputs now select on rm.cancel whenever they perform blocking operations. If a read from rm.cancel succeeds (i.e. the main goroutine called Cancel and closed the channel), the RPC immediately returns an error with code UNAVAILABLE.
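
Sketched in Go, the RPC-side pattern looks roughly like this. The regChan and cancel channel names, the step's done channel, and the Unavailable message mirror the diff and review comments quoted below; the request, response, and registration types are stand-ins, the resmon struct is trimmed again to just what this sketch needs, and the plain gRPC status package is used here instead of Pulumi's rpcerror helper.

package resmon

import (
    "context"

    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
)

// Stand-ins for the real pulumirpc request/response messages and engine step type.
type RegisterResourceRequest struct{ Name string }
type RegisterResourceResponse struct{}

type registration struct {
    request *RegisterResourceRequest
    done    chan *RegisterResourceResponse // completed by the engine when the step finishes
}

type resmon struct {
    regChan chan *registration // hands registrations to the engine
    cancel  chan bool          // closed once Cancel has been called
}

// Every blocking point also selects on rm.cancel, so an in-flight RPC fails fast
// with codes.Unavailable instead of holding GracefulStop open forever.
func (rm *resmon) RegisterResource(ctx context.Context,
    req *RegisterResourceRequest) (*RegisterResourceResponse, error) {

    step := &registration{request: req, done: make(chan *RegisterResourceResponse)}

    // Hand the registration to the engine, unless shutdown wins the race.
    select {
    case rm.regChan <- step:
    case <-rm.cancel:
        return nil, status.Error(codes.Unavailable, "resource monitor is shutting down")
    }

    // Wait for the engine to complete the step, again bailing out on shutdown.
    select {
    case resp := <-step.done:
        return resp, nil
    case <-rm.cancel:
        return nil, status.Error(codes.Unavailable, "resource monitor is shutting down")
    }
}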

Back in the language host, each language SDK will notice that registerResource or registerResourceOutputs failed. If either RPC call fails with an UNAVAILABLE error code, the language host deliberately goes into an infinite wait, doing nothing until it is killed. This is a bit of a hack, but the objective is to 1) ensure that we issue no additional RPCs, because the RPC server is shutting down, and 2) not advance the program at all, since the last RPC did not complete successfully. The language host is killed shortly afterward, when the CLI exits.

All of this combines to produce a graceful exit when errors occur. This PR removes the gRPC version constraint in our Gopkg.toml, which ultimately gives us gRPC v1.11.

@joeduffy (Member) left a comment

LGTM, although @pgavlin should also look, since this is pretty subtle stuff!

 *
 * We can accomplish both by just doing nothing until the engine kills us. It's ugly, but it works.
 */
function waitForDeath(): Promise<void> {

Member

Maybe have this return Promise<never>? This should help with TS control flow inference.

Contributor Author

Will do. I'm not sure this even needs to be a promise. Originally this returned a promise that resolved after a 30-second sleep, but I think what we actually want to do here is starve the event loop so that nothing else happens.

case rm.regChan <- step:
case <-rm.cancel:
    glog.V(5).Infof("ResourceMonitor.RegisterResource operation canceled, name=%s", name)
    return nil, rpcerror.New(codes.Unavailable, "resource monitor is shutting down")

Member

Should we use slightly different error messages for each of these cancellation conditions, just in case it comes in handy debugging based on logs and/or CLI error messages in the field? I'm just thinking something simple like resource monitor shut down while waiting on step, ...while waiting on step's done channel, etc.

Contributor Author

sounds useful, will do.
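
For example, the two cancellation branches could carry phase-specific messages along these lines; the wording and the shuttingDown helper are hypothetical, not taken from the final commit.

package resmon

import (
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
)

// shuttingDown builds an Unavailable error that records which phase of the RPC
// was interrupted, so field logs and CLI errors show where cancellation hit.
func shuttingDown(phase string) error {
    return status.Errorf(codes.Unavailable, "resource monitor shut down while %s", phase)
}

// In the two select statements above, the cancel branches would then read:
//
//   case <-rm.cancel:
//       return nil, shuttingDown("waiting to send step")
//   ...
//   case <-rm.cancel:
//       return nil, shuttingDown("waiting on step's done channel")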

log.debug(`RegisterResource RPC finished: ${label}; err: ${err}, resp: ${innerResponse}`);
if (err) {
    // If the monitor is unavailable, it is in the process of shutting down or has already
    // shut down. Don't emit an error and don't do any more RPCs.
    if (err.code === grpc.status.UNAVAILABLE) {

Member

Do we need code like this inside of the invoke.ts file?

Contributor Author

Yeah, I think so; I had forgotten about that endpoint...

@swgillespie (Contributor Author)

@pgavlin do you mind taking a look when you get a chance?

@swgillespie (Contributor Author)

The latest commit fixes a goroutine shutdown race that caused the Linux test leg to fail in the previous commit.

@swgillespie force-pushed the swgillespie/graceful-shutdown branch from 25667e2 to c30f9d3 on May 16, 2018 19:42

@pgavlin (Member) commented May 16, 2018

Did you also take a look at the interactions between the engine endpoint and the language plugins? Will we kill the language plugins/providers/etc. before shutting down that endpoint?

@swgillespie (Contributor Author)

Will we kill the language plugins/providers/etc. before shutting down that endpoint?

Yes, we'll kill all plugins before shutting down the engine endpoint:

func (host *defaultHost) Close() error {
    // Close all plugins.
    for _, plug := range host.analyzerPlugins {
        if err := plug.Plugin.Close(); err != nil {
            logging.Infof("Error closing '%s' analyzer plugin during shutdown; ignoring: %v", plug.Info.Name, err)
        }
    }
    for _, plug := range host.resourcePlugins {
        if err := plug.Plugin.Close(); err != nil {
            logging.Infof("Error closing '%s' resource plugin during shutdown; ignoring: %v", plug.Info.Name, err)
        }
    }
    for _, plug := range host.languagePlugins {
        if err := plug.Plugin.Close(); err != nil {
            logging.Infof("Error closing '%s' language plugin during shutdown; ignoring: %v", plug.Info.Name, err)
        }
    }
    // Empty out all maps.
    host.analyzerPlugins = make(map[tokens.QName]*analyzerPlugin)
    host.languagePlugins = make(map[string]*languagePlugin)
    host.resourcePlugins = make(map[tokens.Package]*resourcePlugin)
    // Shut down the plugin loader.
    close(host.loadRequests)
    // Finally, shut down the host's gRPC server.
    return host.server.Cancel()
}

@pgavlin (Member) left a comment

LGTM

@swgillespie merged commit 6891190 into master on May 16, 2018

@swgillespie (Contributor Author)

thanks @pgavlin @joeduffy !

@swgillespie deleted the swgillespie/graceful-shutdown branch on May 16, 2018 22:37

Successfully merging this pull request may close these issues.

Fix RPC teardown (gRPC 1.7)