[17.09] Fix reapTime logic in NetworkDB + handle cleanup DNS for attachable container #2017
Conversation
@thaJeztah thank you for backporting this; would it be possible to make a release with it?
@eduardolundgren this cherry-pick is in preparation of a possible 17.09.1 release; it's still being decided which changes need to be backported and which do not.
Looks like I may have to cherry-pick #1947 into this one to fix the lint-errors |
Hopefully this backport gets into 17.09.1; the current stable release corrupts the managers' quorum when the number of overlay networks gets close to 2000, and in addition some workers go down randomly. It starts to log lots of
Then,
Eventually, it logs a failure
@thaJeztah lint commit is in, can you try to rebase?
- Added a remainingReapTime field in the table event. Without it, a node that did not have state for the element was marking the element for deletion with the maximum reapTime. This made it possible for the entry to keep being resynced between nodes forever, defeating the purpose of the reap time itself. - On broadcast of the table event, the node owner was being rewritten with the local node name; this was not correct because the owner should remain the original sender of the message. Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com> (cherry picked from commit 10cd98c) Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
- Changed the loop to be per network. The previous implementation took a ReadLock to update the reapTime, but now, with the residual reap time, the bulkSync also uses that same ReadLock, creating possible issues with concurrent reads and updates of the value. The new logic fetches the list of networks and cleans up network by network, locking the database and releasing it after each network. This should ensure fair locking and avoid keeping the database blocked for too long. Note: the ticker does not guarantee that the reap logic runs precisely every reapTimePeriod; the documentation says that if the routine takes too long, ticks will be skipped. If the process itself slows down, it is possible that the lifetime of the deleted entries increases. This should still not be a big problem because the residual reap time is now propagated among all the nodes: a slower node will let the deleted entry be repropagated multiple times, but the state will remain consistent. Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com> (cherry picked from commit 3feb3aa) Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Make sure that the network is garbage collected after its entries. Deleting entries requires that the network is still present. Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com> (cherry picked from commit fbba555) Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The comparison was against the wrong constant value. As described in the comment, the check is there to guarantee that events related to stale deleted elements are not propagated. Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com> (cherry picked from commit 6f11d29) Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Attachable containers are tasks with no associated service; their cleanup was not done properly, so it was possible to leak their name resolution if that was the last container on the network. cleanupServiceBindings was not able to do the cleanup because there is no service, and the notification of the delete arrives after the network is already being cleaned. Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com> (cherry picked from commit 1c04e19) Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Force-pushed from ab3a74b to 42f9e55
rebased 👍
LGTM
Do you know when
No, I don't have a date for 17.09.1. 17.10 and 17.11 are edge releases, so they would only get critical ("P0") patches; with 17.11 being released, Docker 17.10 reached EOL. For your specific issue, the error message indicates it may not be an issue in the networking stack, but due to the raft state becoming too big to sync between managers. This pull request in SwarmKit raises the maximum size to 128 MB: moby/swarmkit#2375, which is being backported to 17.09.1 through docker-archive/docker-ce#323, and should address the direct problem. A bigger change is being worked on in moby/swarmkit#2458, which will use a streaming mechanism to send raft snapshots.
Actually, the size fix was already included in 17.09.0 through docker-archive/docker-ce#242, so I'm not sure if this will address your situation |
@thaJeztah the raft state becoming too big was only happening when approximately 2000 overlay networks were created in a 30-minute window. By contrast, creating 2000 services was all fine; presumably the raft messages for service creation are bigger. Is there a way to know if I just tested on
Yes; the size change was in 17.09.0-ce-rc3: https://github.com/docker/docker-ce/blob/v17.09.0-ce-rc3/components/engine/vendor/github.com/docker/swarmkit/manager/manager.go#L59. Note that rc3 is a release candidate, not the final release (17.09.0-ce was released after that).
Cherry-pick of #1944 and #1960 for 17.09
Also included #1985 and #1991, which look to depend on this