Fix timeouts from very large raft messages #3122

Merged
merged 1 commit into from Apr 10, 2023

Conversation

dperny
Collaborator

@dperny dperny commented Feb 15, 2023

Rationale is in a comment in the diff explaining the two removed lines.

fixes #3113

@neersighted
Member

Fixes #3113

@dperny
Collaborator Author

dperny commented Feb 17, 2023

Fixed a spelling issue.

@neersighted
Member

PTAL @thaJeztah

Member

@thaJeztah thaJeztah left a comment

SGTM

perhaps our "context aficionado" @corhere wants to have a look as well?

@thaJeztah
Member

Let me close/reopen to kick CI; GHA UI for some reason doesn't show "re-run" options 🤔
Screenshot 2023-03-22 at 09 51 25
Screenshot 2023-03-22 at 09 51 31

@thaJeztah thaJeztah closed this Mar 22, 2023
@thaJeztah thaJeztah reopened this Mar 22, 2023
@thaJeztah
Member

CI failing again, but I think it's a different test now. Let me post that failure in case it is not (yet) known to be flaky.

2023/03/22 08:58:55 http: TLS handshake error from 127.0.0.1:47198: remote error: tls: bad certificate
2023/03/22 08:58:55 http: TLS handshake error from 127.0.0.1:40038: remote error: tls: bad certificate
2023/03/22 08:58:55 http: TLS handshake error from 127.0.0.1:40042: remote error: tls: bad certificate
--- FAIL: TestListManagerNodes (23.68s)
    node_test.go:543: 
        	Error Trace:	/go/src/github.com/docker/swarmkit/manager/controlapi/node_test.go:543
        	Error:      	Received unexpected error:
        	            	expected node 5 to be unreachable
        	            	polling failed
        	            	github.com/moby/swarmkit/v2/testutils.PollFuncWithTimeout
        	            		/go/src/github.com/docker/swarmkit/testutils/poll.go:28
        	            	github.com/moby/swarmkit/v2/testutils.PollFunc
        	            		/go/src/github.com/docker/swarmkit/testutils/poll.go:36
        	            	github.com/moby/swarmkit/v2/manager/controlapi.TestListManagerNodes
        	            		/go/src/github.com/docker/swarmkit/manager/controlapi/node_test.go:543
        	            	testing.tRunner
        	            		/usr/local/go/src/testing/testing.go:1439
        	            	runtime.goexit
        	            		/usr/local/go/src/runtime/asm_amd64.s:1571
        	Test:       	TestListManagerNodes
    node_test.go:612: 
        	Error Trace:	/go/src/github.com/docker/swarmkit/manager/controlapi/node_test.go:612
        	Error:      	Received unexpected error:
        	            	expected node 1 to be unreachable
        	            	polling failed
        	            	github.com/moby/swarmkit/v2/testutils.PollFuncWithTimeout
        	            		/go/src/github.com/docker/swarmkit/testutils/poll.go:28
        	            	github.com/moby/swarmkit/v2/testutils.PollFunc
        	            		/go/src/github.com/docker/swarmkit/testutils/poll.go:36
        	            	github.com/moby/swarmkit/v2/manager/controlapi.TestListManagerNodes
        	            		/go/src/github.com/docker/swarmkit/manager/controlapi/node_test.go:612
        	            	testing.tRunner
        	            		/usr/local/go/src/testing/testing.go:1439
        	            	runtime.goexit
        	            		/usr/local/go/src/runtime/asm_amd64.s:1571
        	Test:       	TestListManagerNodes
FAIL

@thaJeztah
Member

TestListManagerNodes failing again; looks like it failed before on #3111 (comment)

2023/03/22 09:59:20 http: TLS handshake error from 127.0.0.1:35364: remote error: tls: bad certificate
2023/03/22 09:59:20 http: TLS handshake error from 127.0.0.1:38590: remote error: tls: bad certificate
2023/03/22 09:59:20 http: TLS handshake error from 127.0.0.1:38604: remote error: tls: bad certificate
--- FAIL: TestListManagerNodes (13.52s)
    node_test.go:543: 
        	Error Trace:	/go/src/github.com/docker/swarmkit/manager/controlapi/node_test.go:543
        	Error:      	Received unexpected error:
        	            	expected node 4 to be unreachable
        	            	polling failed
        	            	github.com/moby/swarmkit/v2/testutils.PollFuncWithTimeout
        	            		/go/src/github.com/docker/swarmkit/testutils/poll.go:28
        	            	github.com/moby/swarmkit/v2/testutils.PollFunc
        	            		/go/src/github.com/docker/swarmkit/testutils/poll.go:36
        	            	github.com/moby/swarmkit/v2/manager/controlapi.TestListManagerNodes
        	            		/go/src/github.com/docker/swarmkit/manager/controlapi/node_test.go:543
        	            	testing.tRunner
        	            		/usr/local/go/src/testing/testing.go:1439
        	            	runtime.goexit
        	            		/usr/local/go/src/runtime/asm_amd64.s:1571
        	Test:       	TestListManagerNodes
FAIL

@neersighted
Member

Can we kick CI again on this one?

@thaJeztah
Member

Yeah, I can try it again; it kept failing 😔

Let me give it another go

@dperny dperny force-pushed the fix-large-snapshot-timeout branch from 0f2c605 to ba3004f Compare April 6, 2023 17:59
@dperny
Collaborator Author

dperny commented Apr 6, 2023

This is still failing CI, but I believe the issue with this specific PR has been fixed by adding this watchdog timer thing. I need to spend some more time on the other issue, which I believe applies to all PRs right now.

@thaJeztah
Member

@dperny looks like CI is happy now, except for a typo;

#21 67.98 manager/state/raft/transport/peer.go:269:25: `reciever` is a misspelling of `receiver` (misspell)
#21 67.98 			// there would be no reciever. We'd block here forever.
#21 67.98 			                     ^

// We cannot just do a naked send to the bump channel. If we try to
// send, for example, and the timer has elapsed, then the context
// will have been canceled, the watchdog loop will have exited, and
// there would be no reciever. We'd block here forever.
Member

s/reciever/receiver/ here

Rationale is in a comment explaining the two removed lines.

Signed-off-by: Drew Erny <derny@mirantis.com>
@dperny dperny force-pushed the fix-large-snapshot-timeout branch from ba3004f to c963e16 Compare April 10, 2023 16:47
@codecov-commenter

Codecov Report

Merging #3122 (c963e16) into master (e28e8ba) will increase coverage by 61.71%.
The diff coverage is 94.00%.

@@             Coverage Diff             @@
##           master    #3122       +/-   ##
===========================================
+ Coverage        0   61.71%   +61.71%     
===========================================
  Files           0      154      +154     
  Lines           0    31120    +31120     
===========================================
+ Hits            0    19207    +19207     
- Misses          0    10369    +10369     
- Partials        0     1544     +1544     

Member

@thaJeztah thaJeztah left a comment

LGTM

@thaJeztah thaJeztah merged commit c6f9c0d into moby:master Apr 10, 2023
8 checks passed
for {
	select {
	case <-bump:
	case <-time.After(p.tr.config.SendTimeout):
Contributor

A new timer will be created on each turn of the loop, which won't be cleaned up until after the timer fires. Memory consumption would increase proportionally to the number of messages the snapshot data is split into, and won't be fully garbage-collectable until a nontrivial amount of time after the snapshot is fully transmitted. The memory usage could potentially become significant when there's a thundering herd of a hundred nodes joining a cluster.

Active timers created with time.AfterFunc are allowed to be reset, and context.CancelFunc closures are idempotent, so a resettable watchdog timer which cancels a context on expiry and consumes O(1) memory can be implemented quite simply:

t := time.AfterFunc(p.tr.config.SendTimeout, cancel)
defer t.Stop()

bump := func() { t.Reset(p.tr.config.SendTimeout) }

Member

🙈 good catch (I always forget the caveats with these); @dperny can have a look?

neersighted added a commit to neersighted/swarmkit that referenced this pull request Apr 13, 2023
Address moby#3122 (comment) by taking the recommendation to reduce resource churn.

Signed-off-by: Bjorn Neergaard <bneergaard@mirantis.com>
Development

Successfully merging this pull request may close these issues.

Large snapshot causes adding a new manager to fail
5 participants