memberlist: reresolve members with every 100 nodes on Join #411
Conversation
I'm good with the changes, but I think the implementation is a bit more complicated (see it as "harder to follow") than required. A slightly better design may simplify it.
This change solves three problems:

1. Context cancelations aren't respected in `Join`. Sometimes a join might take as long as 25 minutes and there is no way to cancel it. Now we check the context after every 100 joined nodes.
2. `Join` will attempt to join member IPs that become obsolete between the time discoverMembers runs and memberlist initiates a push-pull with the node. Reresolve `JoinMembers` addresses after joining every 100 nodes. This helps to clean up obsolete IPs that would otherwise be tried at the end of the Join procedure.
3. (minor) Fast joining on startup doesn't respect context cancelations.

A side effect of this change is that there will be more DNS resolutions for the JoinMembers. This is in draft because
Force-pushed from a16c50b to 29aae25.
Thank you for the review @pracucci. I think I addressed your comments. I decided to go with 3 functions: 1) manage retries, 2) manage batches, 3) join a single batch. The code became a bit longer, but I think it's clearer.
lgtm overall, comments are not blocking.
```go
if attemptedNodes[n] {
	continue
}
if len(batch) >= batchSize {
```
nit: should we move this `if` after `batch = append(batch, n)`?
Then we wouldn't know whether there are `moreAvailableNodes`. If we have filled the batch and there is another node which wasn't attempted already, then we know. This covers the edge case where the batch fills with the last item in `nodes`.
kv/memberlist/memberlist_client.go (outdated)
```go
if ctx.Err() != nil {
	return successfullyJoined, fmt.Errorf("joining batch: %w", context.Cause(ctx))
}
// Attempt to join a single node. Complexity shouldn't be different from passing all the node IPs to Join.
```
> Complexity shouldn't be different from passing all the node IPs to Join.

What is this comment trying to say?
That it doesn't matter whether we call Join with all nodes once or call it with one node N times. I reworded the comment.
```go
level.Error(m.logger).Log("msg", "joining memberlist cluster failed", "last_error", lastErr, "elapsed_time", time.Since(startTime))
return false

// joinMembersBatch returns an error only if it couldn't successfully join any nodes or if ctx is cancelled.
func (m *KV) joinMembersBatch(ctx context.Context, nodes []string) (successfullyJoined int, lastErr error) {
```
suggestion: `joinMembersBatch` -> `joinAllMembers`
I prefer `joinMembersBatch` because it complements `joinMembersInBatches`, and it's clearer that `InBatches` uses `Batch`.
Thanks for the reviews @pracucci, @pstibrany. I will merge this after CI passes.
There is a race condition with some printing of atomics in the test. I don't expect this was introduced by this PR. I will investigate it and push a fix.
This prevents a race in test cases when a service is failed. The DescribeService method prints the service struct value, which is detected as a data race when printing the mutex struct.
I pushed a commit which gets rid of the data race. The race occurred when a service was being described: the race detector treats printing a mutex field while that mutex is in use as a race condition. This can be circumvented by giving the memberlist KV a name, so the struct doesn't have to be printed. The change is in ab6e2d1.
Force-pushed from 14436a5 to dbc716e.
Description

This change solves three problems:

1. Context cancelations aren't respected in `Join`. Sometimes a join might take as long as 25 minutes and there is no way to cancel it. Now we check the context after every 100 joined nodes.
2. `Join` will attempt to join member IPs that become obsolete between the time discoverMembers runs and memberlist initiates a push-pull with the node. Reresolve `JoinMembers` addresses after joining every 100 nodes. This helps to clean up obsolete IPs that would otherwise be tried at the end of the Join procedure.
3. (minor) Fast joining on startup doesn't respect context cancelations.

A side effect of this change is that there will be more DNS resolutions for the JoinMembers.

Note to reviewers: `TestJoinMembersWithRetryBackoff` seems to sometimes fail locally, but it should be an unrelated failure. I believe it should be fixed by #412.

Checklist

- `CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`