Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

memberlist: reresolve members with every 100 nodes on Join #411

Merged
merged 15 commits into from
Nov 14, 2023

Conversation

dimitarvdimitrov
Copy link
Contributor

@dimitarvdimitrov dimitarvdimitrov commented Oct 18, 2023

Description

This change solves three problems:

  1. Context cancelations aren't respected in Join. Sometimes join might take as long as 25 minutes and there is no way to cancel it. Now we check the context on every 100 joined nodes
  2. Join will attempt to join member IPs that become obsolete between the time discoverMembers runs and memberlist initiates a push-pull with the node. Reresolve JoinMembers addresses after joining every 100 nodes. This helps to clean up obsolete IPs that will be tried at the end of the Join procedure
  3. (minor) fast joining on startup doesn't respect context cancelations

The side effect of this change is that there will be more DNS resolutions for the JoinMembers.

Note to reviewers: TestJoinMembersWithRetryBackoff seems to sometimes fail locally, but it should be an unrelated failure. I believe it should be fixed by this #412.

Checklist

  • Tests updated
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@pracucci pracucci self-requested a review October 23, 2023 08:37
Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm good with the changes, but I think the implementation is a bit more complicated (see it as "harder to follow") than required. A slightly better design may simplify it.

kv/memberlist/memberlist_client.go Show resolved Hide resolved
kv/memberlist/memberlist_client.go Outdated Show resolved Hide resolved
kv/memberlist/memberlist_client.go Outdated Show resolved Hide resolved
This change solves three problems:
1. Context cancelations aren't respected in Join. Sometimes join might take as long as 25 minutes and there is no way to cancel it. Now we check the context on every 100 joined nodes
2. `Join` will attempt to join member IPs that become obsolete between the time discoverMembers runs and memberlist initiates a push-pull with the node. Reresolve `JoinMembers` addresses after joining every 100 nodes. This helps to clean up obsolete IPs that will be tried at the end of the Join procedure
3. (minor) fast joining on startup doesn't respect context cancelations

The side effect of this change is that there will be more DNS resolutions for the JoinMembers.

This is in draft beacuse
@dimitarvdimitrov dimitarvdimitrov force-pushed the dimitar/memberlist-reresolve-members-on-join branch from a16c50b to 29aae25 Compare November 2, 2023 14:46
@dimitarvdimitrov
Copy link
Contributor Author

thank you for the review @pracucci. I think I addressed your comments. I decided to go with 3 functions: 1) manage retries, 2) manage batches, 3) join a single batch. The code become a bit longer, but I think it's more clear

Copy link
Member

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm overall, comments are not blocking.

if attemptedNodes[n] {
continue
}
if len(batch) >= batchSize {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should we move this if after batch = append(batch, n)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then we don't know whether there are moreAvailableNodes. If we have filled the batch size and there is another node which wasn't attempted already, then we know. This is covering the edge case where the batch fills with the last item in nodes

kv/memberlist/memberlist_client.go Outdated Show resolved Hide resolved
if ctx.Err() != nil {
return successfullyJoined, fmt.Errorf("joining batch: %w", context.Cause(ctx))
}
// Attempt to join a single node. Complexity shouldn't be different from passing all the node IPs to Join.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Complexity shouldn't be different from passing all the node IPs to Join.

What is this comment trying to say?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that it doesn't matter whether we call Join with all nodes once or call it with one node N times. I reworded the comment

kv/memberlist/memberlist_client.go Outdated Show resolved Hide resolved
kv/memberlist/memberlist_client.go Outdated Show resolved Hide resolved
kv/memberlist/memberlist_client.go Outdated Show resolved Hide resolved
level.Error(m.logger).Log("msg", "joining memberlist cluster failed", "last_error", lastErr, "elapsed_time", time.Since(startTime))
return false
// joinMembersBatch returns an error only if it couldn't successfully join any nodes or if ctx is cancelled.
func (m *KV) joinMembersBatch(ctx context.Context, nodes []string) (successfullyJoined int, lastErr error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: joinMembersBatch -> joinAllMembers

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i prefer joinMembersBatch because it complements joinMembersInBatches and it's more clear that InBatches uses Batch

@dimitarvdimitrov
Copy link
Contributor Author

thanks for the reviews @pracucci, @pstibrany. I will merge this after CI passes

@dimitarvdimitrov dimitarvdimitrov enabled auto-merge (squash) November 7, 2023 18:39
@dimitarvdimitrov
Copy link
Contributor Author

there is a race condition with some printing of atomics in the test. I don't expect this was introduced by this PR. I will investigate it and push a fix.

This prevents a race in test cases when a service is failed. The DescribeService method prints the service struct value which is detected as a data race when printing the mutex struct
@dimitarvdimitrov
Copy link
Contributor Author

i pushed a commit which gets rid of the data race. The race was there when a service was being described. The race detector treats printing a mutex field and using that mutex as a race condition. This can be circumvented by giving the memberlist kv a name, so it doesn't have to be printed. The change is in ab6e2d1

@dimitarvdimitrov dimitarvdimitrov force-pushed the dimitar/memberlist-reresolve-members-on-join branch from 14436a5 to dbc716e Compare November 14, 2023 11:43
@dimitarvdimitrov dimitarvdimitrov merged commit 90c5233 into main Nov 14, 2023
2 of 3 checks passed
@dimitarvdimitrov dimitarvdimitrov deleted the dimitar/memberlist-reresolve-members-on-join branch November 14, 2023 11:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants