Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

finish the leave process if broadcasting leave timeout #640

Merged
merged 3 commits into from Jan 12, 2022

Conversation

dhiaayachi
Copy link
Contributor

This fix #638

@@ -727,7 +726,7 @@ func (s *Serf) Leave() error {
select {
case <-notifyCh:
case <-time.After(s.config.BroadcastTimeout):
return errors.New("timeout while waiting for graceful leave")
s.logger.Printf("[WARN] serf: timeout while waiting for graceful leave")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth adding a comment to point out why we don't return on this "error" but continue to attempt to leave?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also do we need to do the same for the Leave call below since that will also error and return after the same timeout (i.e. with high probability in a large cluster).

It might not be depending on what the calling code does. The benefit would be that we could be sure that we remain online for LeavePropagationDelay to increase chances of a clean leave. If the calling application just logs the error and exits anyway for example then it would be better to not return an error in the timeout case there too to make a best effort of allowing propagation despite not making it through the broadcast queues fully in the time allotted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking more about this I think:

  • it's important to finish the leave sequence independently from any error that could happen, so the Leave call should make a best effort to execute each of the leave steps within the given timeout
  • notifying the calling code about any error that happened during the leave is useful I think. The caller could have a mean to do further steps in case of error (or avoid some further steps). Also the caller could implement a retry mechanism.

Based on this, I think the best would be to wrap all the timeout errors that we could encounter during the sequence (both for Leave broadcast and the memberList leave) and if any return it at the end of the sequence.

Copy link
Member

@banks banks Jan 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The caller could have a mean to do further steps in case of error

🤔 If the error is a broadcast timeout though and we still went through all the steps, is there anything meaningful the caller can do other than log it? It doesn't make sense to call Leave again since it's already left in its own serf state and is just as likely to timeout on the broadcasts a second time unless it waits for a really long time for its broadcast queues to empty which is likely not a reasonable option when calling Leave as 99% of the time this is during a process shutdown.

Overall I think you're right, I just wonder what the value of reporting an error the user they can't do anything about is? At very least we should probably update the doc comment to point out that even if a timeout error is returned, the node will still have "left" the cluster, it'd just not possible to determine in the bounded time whether all other peers will have seen that. Technically it's not possible to know that for sure even if it doesn't time out actually, but it could be even more confusing to get a timeout error for an operation that is likely to have been successful 🤷 .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking more about if the serf leave fail the calling app have another mean to reach other servers and notify them about it leaving, but that's a far fetch scenario I admit.
Yes it's a bit tricky because the timeout error in this case is more like, I can't confirm that I was able to notify the other nodes about leaving but I did my best to leave, so I'm leaning toward not confusing the user so:

  • not return an error
  • document the Leave as a best effort
  • always finish the sequence
    a Question about this, is there any concerns about changing the signature of the Leave func?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not return an error

Seems reasonable at least for the timeout cases if they are distinguishable from other possible leave errors.

document the Leave as a best effort

We could certainly make it clearer. In practice everything in Serf is "best effort" even if you don't get any timeouts at all there is still no guarantee that every node in the cluster has acknowledged the leave before this method returns because eventual consistency! I think it would be useful to call that out in leave especially though as its often confusing for folks when leaves don't happen cleanly.

always finish the sequence

👍

a Question about this, is there any concerns about changing the signature of the Leave func?

My feeling is we shouldn't change the func signature. There are other cases where an error is reasonable - for example broadcast could theoretically fail because of some other reason like the node is already in a left state or the network is misconfigured and errors as soon as you try to send a packet on the transport (remember transport is pluggable not always UDP necessarily). Ideally we'd still return those. And even if every possibly code path right now in practice returns only errors we are choosing to swallow, I personally don't think it makes sense to break API compatibility just to "clean up" the error return.

Copy link
Member

@banks banks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this behaviour makes more sense as discussed here. I think a comment or two about why we don't return the error may be useful as a pointer, can just link to this PR if that's easiest, but up to you - that doesn't need to block this PR.

There is a CI failure right now but it seems to be unrelated - might be worth double checking though as it's possible that we called Leave() in some tear down methods and the timing assumptions have now changed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants