leader election bugfix: Delete evicted leader pods #2210
Conversation
Before this patch, when the leader pod is hard evicted but not deleted, the leader lock configmap is not garbage collected and subsequent operator pods can never become leader. With this patch, an operator attempting to become the leader is able to delete evicted leader pods, triggering garbage collection and allowing leader election to continue.

Sometimes, evicted operator pods will remain, even with this patch. This occurs when the leader operator pod is evicted and a new operator pod is created on the same node; in this case, the new pod will also be evicted. When an operator pod is created on a non-failing node, leader election will delete only the evicted leader pod, leaving any evicted operator pods that were not the leader.

To replicate the evicted state, I used a `kind` cluster with 2 worker nodes with an altered kubelet configuration and a memory-hog version of the memcached operator. See the [altered operator docs](https://github.com/asmacdo/go-memcahced-operator/blob/explosive-operator/README.md).
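The core of the fix is recognizing a hard-evicted pod from its status: the kubelet leaves such pods in phase `Failed` with reason `Evicted` instead of deleting them, so the owner reference on the leader lock configmap never triggers garbage collection. A minimal sketch of that check (the helper `isPodEvicted` is hypothetical, written here for illustration; the field values mirror the Kubernetes `PodStatus` API):

```go
package main

import "fmt"

// isPodEvicted reports whether a pod's status indicates a hard eviction.
// The kubelet leaves evicted pods in phase "Failed" with reason "Evicted"
// rather than deleting them, so they must be deleted explicitly before
// garbage collection can remove the leader lock configmap they own.
func isPodEvicted(phase, reason string) bool {
	return phase == "Failed" && reason == "Evicted"
}

func main() {
	fmt.Println(isPodEvicted("Failed", "Evicted")) // evicted leader pod
	fmt.Println(isPodEvicted("Running", ""))       // healthy leader pod
}
```

In the real operator, a candidate that finds the lock held by a pod matching this condition would issue a delete for that pod, after which the owner-reference garbage collection removes the stale lock and leader election proceeds.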
Looks great overall! A couple of comments and suggestions.
Looks great! Just a couple more minor improvements.
LGTM
/lgtm
When is this going to make it into a release? I see this was merged 20 days ago, but 5 days ago 0.12 was released and this change was not included: https://github.com/operator-framework/operator-sdk/blob/v0.12.0/pkg/leader/leader.go#L112
Hello! Did this one get into 0.13? I can't see it in the CHANGELOG. We are hitting this bug quite frequently at work lately, so it would be a great help to us if this could get released.
@jmazzitelli @poros this got into v0.13.0. We missed a changelog entry; we will follow up with one.
Thanks! :)
Description
Before this patch, when the leader pod is hard evicted but not deleted,
the leader lock configmap is not garbage collected and subsequent
operator pods can never become leader. With this patch, an operator
attempting to become the leader is able to delete evicted leader pods,
triggering garbage collection and allowing leader election to continue.
Replication
To replicate the evicted state, I used a `kind` cluster with 2 worker
nodes with an altered kubelet configuration and a memory-hog version of
the memcached operator. See the replication readme.
FYI
Sometimes, evicted operator pods will remain, even with this patch.
This occurs when the leader operator pod is evicted and a new operator
pod is created on the same node. In this case, the new pod will also be
evicted. When an operator pod is created on a non-failing node, leader
election will delete only the evicted leader pod, leaving any evicted
operator pods that were not the leader.
Closes #1305
Closes #1874
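The FYI above can be made concrete with a small sketch: cleanup targets only the evicted pod that owns the leader lock, so evicted non-leader pods are intentionally left alone. The types and the `podsToDelete` helper below are hypothetical stand-ins, not the operator's actual code:

```go
package main

import "fmt"

// pod is a pared-down stand-in for a Kubernetes Pod (illustrative only).
type pod struct {
	name   string
	phase  string
	reason string
}

// podsToDelete returns only the evicted pod that owns the leader lock.
// Evicted pods that were never the leader are not touched, which is why
// they can linger on a failing node even with this patch applied.
func podsToDelete(pods []pod, lockOwner string) []string {
	var out []string
	for _, p := range pods {
		if p.name == lockOwner && p.phase == "Failed" && p.reason == "Evicted" {
			out = append(out, p.name)
		}
	}
	return out
}

func main() {
	pods := []pod{
		{"operator-leader", "Failed", "Evicted"},
		{"operator-extra", "Failed", "Evicted"}, // evicted, but not the leader
	}
	fmt.Println(podsToDelete(pods, "operator-leader"))
}
```

Deleting only the lock owner is the minimal change needed to unblock leader election; a broader sweep of all evicted pods would be a separate cleanup concern.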