leader election bugfix: Delete evicted leader pods #2210

Merged

asmacdo merged 5 commits into operator-framework:master from asmacdo:1305-eviction-deadlock

Nov 20, 2019

Member

asmacdo commented Nov 18, 2019

Description

Before this patch, when the leader pod is hard evicted but not deleted,
the leader lock ConfigMap is not garbage collected, so subsequent
operator pods can never become leader. With this patch, an operator
attempting to become the leader can delete an evicted leader pod, which
triggers garbage collection of the lock and lets leader election continue.
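
As a rough illustration of the check this relies on: a pod evicted by the kubelet is reported by the Kubernetes API with `status.phase` set to `Failed` and `status.reason` set to `Evicted`. The sketch below uses simplified stand-in types (`Pod`, `PodStatus`) rather than the real `k8s.io/api/core/v1` types so that it stays self-contained, and the helper name `isPodEvicted` is an assumption for illustration, not necessarily the name used in `pkg/leader/leader.go`.

```go
package main

import "fmt"

// PodStatus is a simplified stand-in for the pod status fields that matter here.
// (Hypothetical type for illustration; real code would use corev1.PodStatus.)
type PodStatus struct {
	Phase  string // e.g. "Running", "Failed"
	Reason string // e.g. "Evicted"
}

// Pod is a simplified stand-in for a Kubernetes pod.
type Pod struct {
	Name   string
	Status PodStatus
}

// isPodEvicted reports whether the pod failed because it was evicted.
// Both conditions must hold: a failed pod with another reason is not an eviction.
func isPodEvicted(pod Pod) bool {
	return pod.Status.Phase == "Failed" && pod.Status.Reason == "Evicted"
}

func main() {
	pods := []Pod{
		{Name: "leader-evicted", Status: PodStatus{Phase: "Failed", Reason: "Evicted"}},
		{Name: "leader-crashed", Status: PodStatus{Phase: "Failed", Reason: "Error"}},
		{Name: "leader-healthy", Status: PodStatus{Phase: "Running"}},
	}
	for _, p := range pods {
		fmt.Printf("%s evicted=%t\n", p.Name, isPodEvicted(p))
	}
}
```

Only the first pod in this example would be treated as evicted; a candidate leader would leave the other two alone and keep waiting for the lock.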

Replication

To replicate the evicted state, I used a kind cluster with two worker
nodes, an altered kubelet configuration, and a memory-hog version of the
memcached operator.
See the replication README.

FYI

Sometimes, evicted operator pods will remain, even with this patch.
This occurs when the leader operator pod is evicted and a new operator
pod is created on the same node. In this case, the new pod will also be
evicted. When an operator pod is created on a non-failing node, leader
election will delete only the evicted leader pod, leaving any evicted
operator pods that were not the leader.
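
The scenario above can be sketched as a small simulation. It assumes, for illustration only, that each pod is reduced to a name and an `Evicted` flag and that the lock owner is identified by pod name; `partitionEvicted` is a hypothetical helper, not part of the patch.

```go
package main

import "fmt"

// Pod is a simplified stand-in: just a name and whether the pod was evicted.
// (Hypothetical type for illustration only.)
type Pod struct {
	Name    string
	Evicted bool
}

// partitionEvicted splits the evicted pods into the one leader election would
// delete (the evicted pod that owns the lock) and the ones it leaves behind.
// Pods that are not evicted appear in neither list.
func partitionEvicted(pods []Pod, lockOwner string) (deleted, remaining []string) {
	for _, p := range pods {
		if !p.Evicted {
			continue
		}
		if p.Name == lockOwner {
			deleted = append(deleted, p.Name)
		} else {
			remaining = append(remaining, p.Name)
		}
	}
	return deleted, remaining
}

func main() {
	pods := []Pod{
		{Name: "operator-a", Evicted: true},  // original leader, evicted
		{Name: "operator-b", Evicted: true},  // rescheduled on the failing node, also evicted
		{Name: "operator-c", Evicted: false}, // scheduled on a healthy node
	}
	deleted, remaining := partitionEvicted(pods, "operator-a")
	fmt.Println("deleted:", deleted)
	fmt.Println("remaining evicted:", remaining)
}
```

Here `operator-a` is removed so that `operator-c` can take the lock, while `operator-b` stays behind as an evicted pod that never held the lock.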

Closes #1305
Closes #1874


          leader election bugfix: Delete evicted leader pods

c4c9281

To replicate the evicted state, I used a `kind` cluster with 2 worker
nodes with altered kubelet configuration and a memory-hog version of the
memcached operator.
See the [altered operator docs](https://github.com/asmacdo/go-memcahced-operator/blob/explosive-operator/README.md)

openshift-ci-robot requested review from estroz and joelanford

November 18, 2019 17:33

openshift-ci-robot added the size/M label

camilamacedo86 reviewed

Makefile (outdated)

camilamacedo86 reviewed

go.mod (outdated)

asmacdo commented

pkg/leader/leader.go (outdated)


          Remove dev artifacts

7b37093

openshift-ci-robot added size/S and removed size/M labels

camilamacedo86 reviewed

pkg/leader/leader.go (outdated)


          PR comments

5a98d79

camilamacedo86 reviewed

pkg/leader/leader.go (outdated)

camilamacedo86 reviewed

pkg/leader/leader.go (outdated)

camilamacedo86 reviewed

pkg/leader/leader.go (outdated)

camilamacedo86 reviewed

pkg/leader/leader.go (outdated)

camilamacedo86 reviewed

pkg/leader/leader.go (outdated)

camilamacedo86 reviewed

pkg/leader/leader.go (outdated)

joelanford reviewed

Member

joelanford left a comment

Looks great overall! A couple of comments and suggestions.

pkg/leader/leader.go (outdated)

pkg/leader/leader.go (outdated)

pkg/leader/leader.go (outdated)

pkg/leader/leader.go


          Rewrite for robustness and PR Comments

0549c23

openshift-ci-robot added size/M and removed size/S labels

joelanford reviewed

Member

joelanford left a comment

Looks great! Just a couple more minor improvements.

pkg/leader/leader.go (outdated)

pkg/leader/leader.go (outdated)


          check if already deleted and log err

8b821ae

joelanford approved these changes

Member

joelanford left a comment

LGTM

fabianvf approved these changes

Member

fabianvf left a comment

/lgtm

openshift-ci-robot assigned fabianvf

openshift-ci-robot added the lgtm label

asmacdo merged commit 418b603 into operator-framework:master

asmacdo deleted the 1305-eviction-deadlock branch

November 20, 2019 16:22

jmazzitelli mentioned this pull request

Evicted Kiali Operator will not deploy Kiali kiali/kiali#1584

Closed

jmazzitelli commented Dec 10, 2019

When is this going to make it into a release?

I see this was merged 20 days ago, but 5 days ago 0.12 was released and this change was not included: https://github.com/operator-framework/operator-sdk/blob/v0.12.0/pkg/leader/leader.go#L112

poros commented Dec 11, 2019

Hello! Did this one get into 0.13? I can't see it in the CHANGELOG.

We have been hitting this bug quite frequently at work lately, so it would be a great help to us if this could get released.

Member

estroz commented Dec 11, 2019

@jmazzitelli @poros this got into v0.13.0. We missed a changelog entry, will follow up with one.

poros commented Dec 11, 2019

Thanks! :)

estroz added a commit to estroz/operator-sdk that referenced this pull request


          CHANGELOG.md: add leader election bugfix (operator-framework#2210)

60dd2cb

estroz mentioned this pull request

CHANGELOG.md: add leader election bugfix (#2210) #2323

Merged

estroz pushed a commit that referenced this pull request


          CHANGELOG.md: add leader election bugfix (#2210) (#2323)

fcdbd12

estroz added a commit to estroz/operator-sdk that referenced this pull request


          CHANGELOG.md: add leader election bugfix (operator-framework#2210)

e05418a

estroz pushed a commit that referenced this pull request


          CHANGELOG.md: add leader election bugfix (#2210) (#2330)

38c9195

joel-bluedata mentioned this pull request

bad interaction between pod eviction and leader lock bluek8s/kubedirector#265

Closed

slopezz mentioned this pull request

New operator pod never becomes the leader after an eviction grafana/grafana-operator#158

Closed

hmoravec mentioned this pull request

Outdated Operator SDK version causing bugs kedacore/keda#870

Closed

periklis mentioned this pull request

Migrate operator-sdk and deps to v0.18.1 openshift/cluster-logging-operator#576

Merged

baracoder mentioned this pull request

Deadlock when onepassword-connect-operator pod enters state "completed" 1Password/onepassword-operator#116

Closed
