Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix handling of "leader changed" errors #11426

Merged
merged 2 commits into from
Nov 16, 2022
Merged

Conversation

cenkalti
Copy link
Contributor

What this PR does / why we need it:
This PR aims to fix temporary "etcdserver: leader changed" errors from kube-apiserver that is previously attempted by #11401 by adding a single retry when this kind of error is detected.

/cc @dims

Special notes for your reviewer:
I also reverted the previous fix that didn't solve the issue. One commit is the revert, the other one is the new fix. It's better if you review commits separately.

If applicable:

  • this PR contains documentation
  • this PR contains unit tests
  • this PR has been tested for backwards compatibility

@pull-request-size pull-request-size bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 11, 2022
@dims
Copy link
Contributor

dims commented Oct 11, 2022

@hickeyma @technosophos @mattfarina Folks, @cenkalti tested a bunch of scenarios recreating the problem(s) with etcd with this patch.

This reverts commit ebc79fa.

Signed-off-by: Cenk Alti <cenkalti@gmail.com>
Signed-off-by: Cenk Alti <cenkalti@gmail.com>
@cenkalti
Copy link
Contributor Author

Hey @technosophos @hickeyma. Bumping up this for visibility. It's a small fix. Can you take a look please?

Copy link
Contributor

@hickeyma hickeyma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cenkalti thanks for pushing the PR. The code looks good in principle. However, we don't have any easy way of testing this and this is the 2nd fix following #11401. Would it be possible to provide some tests for this?

@cenkalti
Copy link
Contributor Author

cenkalti commented Nov 8, 2022

@hickeyma I tested this fix manually by setting up a custom k8s cluster with 3 etcd nodes and running etcdctl move-leader with random member ID continuously to trigger the error case. Even with this special setup, the error does not occur every time when I run helm install command as the timing of API calls are very critical that they must overlap with the leader change event.

I checked https://github.com/helm/acceptance-testing project to see If I can write this scenario as an acceptance test but I couldn't find a solution as the clusters are created with kind and there is no easy way to trigger a leader change event.

Copy link
Contributor

@hickeyma hickeyma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @cenkalti

@hickeyma
Copy link
Contributor

hickeyma commented Nov 8, 2022

Needs another maintainer approval for merge.

@hickeyma hickeyma requested a review from dims November 8, 2022 16:22
Copy link
Contributor

@dims dims left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@dims
Copy link
Contributor

dims commented Nov 8, 2022

thanks @cenkalti @hickeyma

@cenkalti
Copy link
Contributor Author

@technosophos @mattfarina Can you take a look please?

Copy link
Member

@technosophos technosophos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@technosophos
Copy link
Member

@hickeyma feel free to merge whenever you feel good about it.

@hickeyma hickeyma merged commit 3974136 into helm:main Nov 16, 2022
@hickeyma hickeyma added this to the 3.10.3 milestone Nov 16, 2022
@hickeyma hickeyma added the needs-pick Indicates that a PR needs to be cherry-picked into the next release candidate. label Nov 16, 2022
@hickeyma hickeyma modified the milestones: 3.10.3, 3.11.0 Dec 14, 2022
@hickeyma hickeyma removed the needs-pick Indicates that a PR needs to be cherry-picked into the next release candidate. label Dec 14, 2022
@sruthiwander
Copy link

Hi, Any chance this could be patched to the 3.2.x release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants