
Collect driver stats when driver plugins are restarted #5948

Merged: 3 commits into master on Jul 17, 2019

Conversation

@notnoop (Contributor, PR author) commented on Jul 11, 2019:

Fixes #5936.

This fixes two issues in stats handling when a driver plugin is restarted:

  • a bug where we don't properly detect that the driver plugin was shut down during a TaskStats call, so we don't retry properly;
  • a bug in StatsHook where the retry logic never retries the Stats call, so it loops in backoff until the context is done.

Interestingly, there are two retry layers for driver handle calls: one in LazyHandle, which special-cases plugin shutdown, and one in the hook itself. Due to the bugs above, both layers fail to recover. The LazyHandle logic is:

TRY:
	out, err := h.Stats(ctx, interval)
	if err == bstructs.ErrPluginShutdown && first {
		first = false

		h, err = l.refreshHandle()
		if err == nil {
			goto TRY
		}
	}

I considered, but decided against, a more substantial refactoring of LazyHandle. LazyHandle tracks its own handle with its own locking. We could probably simplify the logic to always call tr.getDriverHandle, which is a simple synchronized field lookup; the refresh logic is probably unnecessary. I punted on that for another PR as I want to keep this PR less risky.
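For context, here is a minimal sketch of the shape such a wrapper could take, assuming the handle is guarded by a mutex and re-fetched from the task runner. Only the names LazyHandle, refreshHandle, and getDriverHandle come from this discussion; the types, fields, and everything else are illustrative, not Nomad's actual implementation.

package lazyhandle

import (
	"fmt"
	"sync"
)

// DriverHandle and taskRunner are stand-in types for illustration only.
type DriverHandle interface{}

type taskRunner interface {
	getDriverHandle() DriverHandle
}

// LazyHandle caches the current driver handle and knows how to re-fetch it
// from the task runner after a plugin restart.
type LazyHandle struct {
	mu     sync.Mutex
	tr     taskRunner
	handle DriverHandle
}

// refreshHandle re-reads the live handle under the wrapper's own lock so that
// a call which failed with ErrPluginShutdown can be retried against the
// restarted plugin.
func (l *LazyHandle) refreshHandle() (DriverHandle, error) {
	l.mu.Lock()
	defer l.mu.Unlock()

	h := l.tr.getDriverHandle()
	if h == nil {
		return nil, fmt.Errorf("no driver handle available")
	}
	l.handle = h
	return h, nil
}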

Mahmood Ali added 2 commits on July 11, 2019 at 13:57:

  • The driver plugin stub client must call `grpcutils.HandleGrpcErr` to handle plugin shutdown, as the other functions do. This ensures that TaskStats returns `ErrPluginShutdown` when the plugin shuts down.
  • Previously, if a channel was closed, we retried the Stats call, but if that call failed we went into a backoff loop without ever calling Stats again. Here, we use a utility function for the driverHandle.Stats call that retries as one would expect.

I aimed to preserve the logging formats but made small improvements as I saw fit.
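To make the first commit's point concrete, here is a minimal sketch of the pattern it describes: translating transport-level gRPC failures from a stopped plugin into ErrPluginShutdown, so that callers such as TaskStats see a recognizable sentinel and the retry layers above can act on it. The helper name handleGrpcErr and its exact checks are assumptions for illustration; this is not Nomad's actual grpcutils.HandleGrpcErr.

package grpcerr

import (
	"context"
	"errors"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// ErrPluginShutdown stands in for bstructs.ErrPluginShutdown.
var ErrPluginShutdown = errors.New("plugin is shut down")

// handleGrpcErr normalizes RPC errors: if the plugin's done context has been
// canceled, or the transport reports the server as unavailable, the error is
// mapped to ErrPluginShutdown so callers can detect a plugin restart.
func handleGrpcErr(err error, pluginDoneCtx context.Context) error {
	if err == nil {
		return nil
	}

	// The plugin process exited; its done context is canceled.
	if pluginDoneCtx.Err() != nil {
		return ErrPluginShutdown
	}

	// An "unavailable" status usually means the connection to the plugin
	// died mid-call.
	if s, ok := status.FromError(err); ok && s.Code() == codes.Unavailable {
		return ErrPluginShutdown
	}

	return err
}

With every RPC result wrapped this way, the hook and LazyHandle only need to compare against ErrPluginShutdown instead of inspecting raw transport errors.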
// on the stop channel is the correct behavior
if err != bstructs.ErrPluginShutdown {
	h.logger.Debug("error fetching stats of task", "error", err)
	goto RETRY
@notnoop (Contributor, PR author) commented on the diff above:
In my rewrite, I flipped this conditional logic. The comment doesn't match the conditional: I assume the comment intends to say that if the plugin is shut down, we want to log at debug level rather than at warn level (done in L130). If so, the check here should be err == bstructs.ErrPluginShutdown.

A Member replied:

I think the comment is correct: it's saying that if the err is ErrPluginShutdown we should not be logging, because that logging is unnecessary. I am not sure about retrying here versus one layer up, calling TaskStats again once the plugin restarts, though.

I'm not an expert in this code though, so @nickethier should chime in.

@notnoop (Contributor, PR author) replied:
In the old code, if err != bstructs.ErrPluginShutdown we log at debug level in L121, but if it is ErrPluginShutdown, we log at warn level in L130. I'm having a hard time reconciling that with the comment.

The old logic for logging is:

if err != bstructs.ErrPluginShutdown {
	h.logger.Debug("error fetching stats of task", "error", err)
	goto RETRY
}

// check if the error is terminal otherwise it's likely a
// transport error and we should retry
re, ok = err.(*structs.RecoverableError)
if ok && re.IsUnrecoverable() {
	return
}

h.logger.Warn("stats collection for task failed", "error", err)
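Based on the comments above, the flipped shape would look roughly like the following; this is an illustration of the intent, not the literal new code in the PR:

if err == bstructs.ErrPluginShutdown {
	// A shutdown is expected while the plugin restarts, so log quietly
	// and keep retrying until the stop channel or context ends the loop.
	h.logger.Debug("driver plugin is shut down, retrying stats collection", "error", err)
	goto RETRY
}

// check if the error is terminal otherwise it's likely a
// transport error and we should retry
if re, ok := err.(*structs.RecoverableError); ok && re.IsUnrecoverable() {
	return
}

h.logger.Warn("stats collection for task failed", "error", err)
// fall through to the backoff and retry below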


limit := time.Second * 5
backoff := 1 << (2 * uint64(retry)) * time.Second
if backoff > limit || retry > 5 {
@notnoop (Contributor, PR author) commented on the diff above:
I added the retry check mostly to avoid dealing with integer overflow. If the retry count is large enough, the shift wraps around and backoff becomes a negative value. You can see the effect in https://play.golang.org/p/JYvHeAsSTUk.

The backoff computation produces the following values for increasing retry counts:

retry   backoff
 0      1s
 1      4s
 2      16s
 3      1m4s
 4      4m16s
 5      17m4s
 6      1h8m16s
 7      4h33m4s
 8      18h12m16s
 9      72h49m4s
10      291h16m16s
11      1165h5m4s
12      4660h20m16s
13      18641h21m4s
14      74565h24m16s
15      298261h37m4s
16      1193046h28m16s
17      -351909h41m29.709551616s
18      -1407638h45m58.838206464s
19      -506459h29m21.64327424s

Here, I avoided capping the total number of retries, to minimize changes compared to the old code.
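A small standalone program (written for this discussion, not code from the PR) reproduces the table above and shows why the retry > 5 guard matters once the shift overflows:

package main

import (
	"fmt"
	"time"
)

func main() {
	limit := 5 * time.Second
	for retry := uint64(0); retry < 20; retry++ {
		// Same expression as the hook: 1 << (2*retry) seconds. For
		// retry >= 17 the result overflows int64 and goes negative.
		backoff := 1 << (2 * retry) * time.Second

		// A negative backoff is never > limit, so the overflow would slip
		// past a pure duration comparison; the retry > 5 check is what
		// keeps the backoff clamped to the 5s limit in that region.
		if backoff > limit || retry > 5 {
			fmt.Printf("retry=%2d raw=%v -> clamped to %v\n", retry, backoff, limit)
		} else {
			fmt.Printf("retry=%2d backoff=%v\n", retry, backoff)
		}
	}
}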

@nickethier (Member) left a review:
LGTM, one small comment about error handling.

// check if the error is terminal otherwise it's likely a
// transport error and we should retry
if re, ok := err.(*structs.RecoverableError); ok && re.IsUnrecoverable() {
	return nil, err
A Member commented on the diff above:

This error gets swallowed. Is it useful enough to log, even at debug level? When callStatsWithRetry is called and returns with an error, the caller just returns.

@notnoop (Contributor, PR author) replied:

Yes, we should! I logged it as an error, just like the other case where we retry.
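Concretely, the resolution amounts to something like the following in the collection loop; callStatsWithRetry and its two return values come from the diff above, while the variable names and surrounding code are simplified assumptions:

ru, err := h.callStatsWithRetry(ctx, handle)
if err != nil {
	// Previously this error was silently dropped; now it is logged before
	// the hook gives up, matching the other retry path.
	h.logger.Error("failed to collect task stats", "error", err)
	return
}
// hand ru off to the stats updater as before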

@notnoop merged commit b352af9 into master on Jul 17, 2019.
@notnoop deleted the b-stats-recover-plugin-shutdown branch on Jul 18, 2019.
notnoop pushed a commit that referenced this pull request on Jul 18, 2019.
@github-actions (bot) commented on Feb 7, 2023:

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions bot locked as resolved and limited conversation to collaborators on Feb 7, 2023.
Successfully merging this pull request may close these issues.

Plugin API, TaskStats not called after RecoverTask after plugin restart