Fix timeouts from very large raft messages #3122

dperny · 2023-02-15T18:02:43Z

Rationale is in a comment in the diff explaining the two removed lines.

fixes #3113

neersighted · 2023-02-15T18:03:36Z

Fixes #3113

dperny · 2023-02-17T15:30:01Z

Fixed a spelling issue.

neersighted · 2023-03-15T19:21:05Z

PTAL @thaJeztah

thaJeztah

SGTM

perhaps our "context aficionado" @corhere wants to have a look as well?

thaJeztah · 2023-03-22T08:52:30Z

Let me close/reopen to kick CI; GHA UI for some reason doesn't show "re-run" options 🤔

thaJeztah · 2023-03-22T09:53:59Z

CI is failing again, but I think it's a different test now. Let me post that failure in case it is not (yet) known to be flaky:

2023/03/22 08:58:55 http: TLS handshake error from 127.0.0.1:47198: remote error: tls: bad certificate
2023/03/22 08:58:55 http: TLS handshake error from 127.0.0.1:40038: remote error: tls: bad certificate
2023/03/22 08:58:55 http: TLS handshake error from 127.0.0.1:40042: remote error: tls: bad certificate
--- FAIL: TestListManagerNodes (23.68s)
    node_test.go:543: 
        	Error Trace:	/go/src/github.com/docker/swarmkit/manager/controlapi/node_test.go:543
        	Error:      	Received unexpected error:
        	            	expected node 5 to be unreachable
        	            	polling failed
        	            	github.com/moby/swarmkit/v2/testutils.PollFuncWithTimeout
        	            		/go/src/github.com/docker/swarmkit/testutils/poll.go:28
        	            	github.com/moby/swarmkit/v2/testutils.PollFunc
        	            		/go/src/github.com/docker/swarmkit/testutils/poll.go:36
        	            	github.com/moby/swarmkit/v2/manager/controlapi.TestListManagerNodes
        	            		/go/src/github.com/docker/swarmkit/manager/controlapi/node_test.go:543
        	            	testing.tRunner
        	            		/usr/local/go/src/testing/testing.go:1439
        	            	runtime.goexit
        	            		/usr/local/go/src/runtime/asm_amd64.s:1571
        	Test:       	TestListManagerNodes
    node_test.go:612: 
        	Error Trace:	/go/src/github.com/docker/swarmkit/manager/controlapi/node_test.go:612
        	Error:      	Received unexpected error:
        	            	expected node 1 to be unreachable
        	            	polling failed
        	            	github.com/moby/swarmkit/v2/testutils.PollFuncWithTimeout
        	            		/go/src/github.com/docker/swarmkit/testutils/poll.go:28
        	            	github.com/moby/swarmkit/v2/testutils.PollFunc
        	            		/go/src/github.com/docker/swarmkit/testutils/poll.go:36
        	            	github.com/moby/swarmkit/v2/manager/controlapi.TestListManagerNodes
        	            		/go/src/github.com/docker/swarmkit/manager/controlapi/node_test.go:612
        	            	testing.tRunner
        	            		/usr/local/go/src/testing/testing.go:1439
        	            	runtime.goexit
        	            		/usr/local/go/src/runtime/asm_amd64.s:1571
        	Test:       	TestListManagerNodes
FAIL

thaJeztah · 2023-03-22T10:52:35Z

TestListManagerNodes failing again; looks like it failed before on #3111 (comment)

2023/03/22 09:59:20 http: TLS handshake error from 127.0.0.1:35364: remote error: tls: bad certificate
2023/03/22 09:59:20 http: TLS handshake error from 127.0.0.1:38590: remote error: tls: bad certificate
2023/03/22 09:59:20 http: TLS handshake error from 127.0.0.1:38604: remote error: tls: bad certificate
--- FAIL: TestListManagerNodes (13.52s)
    node_test.go:543: 
        	Error Trace:	/go/src/github.com/docker/swarmkit/manager/controlapi/node_test.go:543
        	Error:      	Received unexpected error:
        	            	expected node 4 to be unreachable
        	            	polling failed
        	            	github.com/moby/swarmkit/v2/testutils.PollFuncWithTimeout
        	            		/go/src/github.com/docker/swarmkit/testutils/poll.go:28
        	            	github.com/moby/swarmkit/v2/testutils.PollFunc
        	            		/go/src/github.com/docker/swarmkit/testutils/poll.go:36
        	            	github.com/moby/swarmkit/v2/manager/controlapi.TestListManagerNodes
        	            		/go/src/github.com/docker/swarmkit/manager/controlapi/node_test.go:543
        	            	testing.tRunner
        	            		/usr/local/go/src/testing/testing.go:1439
        	            	runtime.goexit
        	            		/usr/local/go/src/runtime/asm_amd64.s:1571
        	Test:       	TestListManagerNodes
FAIL

neersighted · 2023-03-29T17:18:40Z

Can we kick CI again on this one?

thaJeztah · 2023-03-29T17:24:06Z

Yeah, I can try it again; it kept failing 😔

Let me give it another go

dperny · 2023-04-06T18:14:18Z

This is still failing CI, but I believe the issue with this specific PR has been fixed by adding this watchdog timer thing. I need to spend some more time on the other issue, which I believe applies to all PRs right now.

thaJeztah · 2023-04-06T21:46:02Z

@dperny looks like CI is happy now, except for a typo;

#21 67.98 manager/state/raft/transport/peer.go:269:25: `reciever` is a misspelling of `receiver` (misspell)
#21 67.98 			// there would be no reciever. We'd block here forever.
#21 67.98 			                     ^

thaJeztah · 2023-04-06T21:46:34Z

manager/state/raft/transport/peer.go

+			// We cannot just do a naked send to the bump channel. If we try to
+			// send, for example, and the timer has elapsed, then the context
+			// will have been canceled, the watchdog loop will have exited, and
+			// there would be no reciever. We'd block here forever.


s/reciever/receiver/ here

codecov-commenter · 2023-04-10T16:59:49Z

Codecov Report

Merging #3122 (c963e16) into master (e28e8ba) will increase coverage by 61.71%.
The diff coverage is 94.00%.

@@             Coverage Diff             @@
##           master    #3122       +/-   ##
===========================================
+ Coverage        0   61.71%   +61.71%     
===========================================
  Files           0      154      +154     
  Lines           0    31120    +31120     
===========================================
+ Hits            0    19207    +19207     
- Misses          0    10369    +10369     
- Partials        0     1544     +1544

thaJeztah

LGTM

corhere · 2023-04-10T17:09:08Z

manager/state/raft/transport/peer.go

+		for {
+			select {
+			case <-bump:
+			case <-time.After(p.tr.config.SendTimeout):


A new timer will be created on each turn of the loop, which won't be cleaned up until after the timer fires. Memory consumption would increase proportionally to the number of messages the snapshot data is split into, and won't be fully garbage-collectable until a nontrivial amount of time after the snapshot is fully transmitted. The memory usage could potentially become significant when there's a thundering herd of a hundred nodes joining a cluster.

Active timers created with time.AfterFunc are allowed to be reset, and context.CancelFunc closures are idempotent, so a resettable watchdog timer which cancels a context on expiry and consumes O(1) memory can be implemented quite simply:

t := time.AfterFunc(p.tr.config.SendTimeout, cancel)
defer t.Stop()
bump := func() { t.Reset(p.tr.config.SendTimeout) }

🙈 good catch (I always forget the caveats with these); @dperny can have a look?

Address moby#3122 (comment) by taking the recommendation to reduce resource churn. Signed-off-by: Bjorn Neergaard <bneergaard@mirantis.com>

dperny force-pushed the fix-large-snapshot-timeout branch from 0c75844 to 0f2c605 Compare February 17, 2023 15:29

neersighted approved these changes Feb 17, 2023

thaJeztah approved these changes Mar 15, 2023

corhere approved these changes Mar 15, 2023

thaJeztah closed this Mar 22, 2023

thaJeztah reopened this Mar 22, 2023

dperny force-pushed the fix-large-snapshot-timeout branch from 0f2c605 to ba3004f Compare April 6, 2023 17:59

thaJeztah reviewed Apr 6, 2023

Fix timeouts from very long raft messages

c963e16

Rationale is in a comment explaining the two removed lines. Signed-off-by: Drew Erny <derny@mirantis.com>

dperny force-pushed the fix-large-snapshot-timeout branch from ba3004f to c963e16 Compare April 10, 2023 16:47

thaJeztah approved these changes Apr 10, 2023

thaJeztah merged commit c6f9c0d into moby:master Apr 10, 2023
8 checks passed

corhere suggested changes Apr 10, 2023

neersighted mentioned this pull request Apr 13, 2023

man/state/raft/trans/peer: use AfterFunc for context watchdog #3128

Merged

This was referenced May 31, 2023

vendor: github.com/moby/swarmkit/v2 v2.0.0-20230531205928-01bb7a41396b moby/moby#45664

Merged

[24.0 backport] vendor: github.com/moby/swarmkit/v2 v2.0.0-20230531205928-01bb7a41396b moby/moby#45703

Merged
