
fix: redeploy statefulset when stuck at failing init container #78182

Open

Conversation

@draveness
Member

commented May 21, 2019

/kind bug

What this PR does / why we need it:

If a StatefulSet is stuck at a failing init container, it's impossible to fix it via a redeploy.

  • #78007 contains the context to reproduce this issue.
  • Reproduced in an e2e test with prow.

Which issue(s) this PR fixes:

Fixes #78007 #67250

Does this PR introduce a user-facing change?:

NONE

@draveness changed the title from "test(statefulset): add redeploy testing when stuck at init container" to "fix: redeploy statefulset when stuck at init container" on May 21, 2019

@draveness changed the title from "fix: redeploy statefulset when stuck at init container" to "fix: redeploy statefulset when stuck at failing init container" on May 21, 2019

@k8s-ci-robot requested review from kow3ns and smarterclayton on May 21, 2019

@k8s-ci-robot

Contributor

commented May 21, 2019

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: draveness
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: janetkuo

If they are not already assigned, you can assign the PR to them by writing /assign @janetkuo in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Contributor

commented May 21, 2019

@draveness: The following test failed, say /retest to rerun them all:

Test name: pull-kubernetes-e2e-gce
Commit: 1e0d359
Details: link
Rerun command: /test pull-kubernetes-e2e-gce

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@draveness

Member Author

commented May 22, 2019

I dug around the StatefulSet controller and found that it only deletes pods whose phase is exactly equal to PodFailed:

func isFailed(pod *v1.Pod) bool {
	return pod.Status.Phase == v1.PodFailed
}

func (ssc *defaultStatefulSetControl) updateStatefulSet(
	set *apps.StatefulSet,
	currentRevision *apps.ControllerRevision,
	updateRevision *apps.ControllerRevision,
	collisionCount int32,
	pods []*v1.Pod) (*apps.StatefulSetStatus, error) {
...
	for i := range replicas {
		// delete and recreate failed pods
		if isFailed(replicas[i]) {
			ssc.recorder.Eventf(set, v1.EventTypeWarning, "RecreatingFailedPod",
				"StatefulSet %s/%s is recreating failed Pod %s",
				set.Namespace,
				set.Name,
				replicas[i].Name)
			if err := ssc.podControl.DeleteStatefulPod(set, replicas[i]); err != nil {
				return &status, err
			}
			if getPodRevision(replicas[i]) == currentRevision.Name {
				status.CurrentReplicas--
			}
			if getPodRevision(replicas[i]) == updateRevision.Name {
				status.UpdatedReplicas--
			}
			status.Replicas--
			replicas[i] = newVersionedStatefulSetPod(
				currentSet,
				updateSet,
				currentRevision.Name,
				updateRevision.Name,
				i)
		}
...

However, when a pod is stuck at a failing init container, its phase is never updated to Failed; it stays in the Pending phase even if the init container has failed thousands of times:

$ kubectl get pod
NAME                                     READY     STATUS                  RESTARTS   AGE
statefulset-bug-0                        0/1       Init:CrashLoopBackOff   1215       4d

To solve the issue, we could update the if condition in the StatefulSet controller, but I'm not sure whether that would change the original behaviour of StatefulSets. Do you have any thoughts on this?
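
For the sake of discussion, here is a minimal sketch of what a relaxed condition might look like, assuming we also recreate pods that are still Pending but have an init container waiting in CrashLoopBackOff. The helper name isStuckAtFailingInitContainer is hypothetical and not part of this PR:

import v1 "k8s.io/api/core/v1"

// Hypothetical helper: treat a pod as a candidate for recreation when it is
// still Pending but one of its init containers is waiting in CrashLoopBackOff.
func isStuckAtFailingInitContainer(pod *v1.Pod) bool {
	if pod.Status.Phase != v1.PodPending {
		return false
	}
	for _, status := range pod.Status.InitContainerStatuses {
		if status.State.Waiting != nil && status.State.Waiting.Reason == "CrashLoopBackOff" {
			return true
		}
	}
	return false
}

// The check in updateStatefulSet would then become something like:
//	if isFailed(replicas[i]) || isStuckAtFailingInitContainer(replicas[i]) {
//		...
//	}

Whether recreating Pending pods like this preserves the guarantees StatefulSets currently make is exactly the behaviour question above.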

cc/ @smarterclayton @kow3ns @janetkuo

@draveness

Member Author

commented May 22, 2019

/assign @janetkuo

@draveness

Member Author

commented Jun 5, 2019

ref: #67250

cc/ @enisoc

Do you have any suggestions on how to fix this, such as updating the condition for deleting a StatefulSet pod?

@draveness marked this pull request as ready for review on Jun 5, 2019

@fejta-bot

commented Sep 3, 2019

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@draveness

Member Author

commented Sep 3, 2019

/remove-lifecycle stale
