
StatefulSet - can't rollback from a broken state #67250

Open
MrTrustor opened this issue Aug 10, 2018 · 27 comments

@MrTrustor commented Aug 10, 2018

/kind bug

What happened:

I updated a StatefulSet with a non-existent Docker image. As expected, a pod of the StatefulSet is destroyed and can't be recreated (ErrImagePull). However, when I change the StatefulSet back to an existing image, the StatefulSet doesn't try to remove the broken pod and replace it with a good one. It keeps trying to pull the non-existent image.
You have to delete the broken pod manually to unblock the situation.

Related Stackoverflow question

What you expected to happen:

When rolling back the bad config, I expected the StatefulSet to remove the broken pod and replace it with a good one.

How to reproduce it (as minimally and precisely as possible):

  1. Deploy the following StatefulSet:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: nginx # has to match .spec.template.metadata.labels
  serviceName: "nginx"
  replicas: 3 # by default is 1
  template:
    metadata:
      labels:
        app: nginx # has to match .spec.selector.matchLabels
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "standard"
      resources:
        requests:
          storage: 10Gi
  2. Once the 3 pods are running, update the StatefulSet spec and change the image to k8s.gcr.io/nginx-slim:foobar.
  3. Observe the new pod failing to pull the image.
  4. Roll back the change (a client-go sketch of steps 2 and 4 is shown after this list).
  5. Observe the broken pod not being deleted.
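
If you drive the repro from code rather than kubectl, a minimal client-go sketch of steps 2 and 4 might look like the following; the default namespace, kubeconfig location, and unguarded Update call are simplifying assumptions, not details from the original report:

// Sketch: flip the "web" StatefulSet to a bad image tag, then revert it.
// Assumes the manifest above was applied to the "default" namespace and
// that a kubeconfig exists at the default location.
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func setImage(cs kubernetes.Interface, image string) error {
	sts, err := cs.AppsV1().StatefulSets("default").Get(context.TODO(), "web", metav1.GetOptions{})
	if err != nil {
		return err
	}
	sts.Spec.Template.Spec.Containers[0].Image = image
	_, err = cs.AppsV1().StatefulSets("default").Update(context.TODO(), sts, metav1.UpdateOptions{})
	return err
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// Step 2: break the StatefulSet with a tag that does not exist.
	if err := setImage(cs, "k8s.gcr.io/nginx-slim:foobar"); err != nil {
		panic(err)
	}
	// ...observe the replacement pod stuck in ErrImagePull, then:

	// Step 4: roll the image back; the stuck pod is not replaced (the bug).
	if err := setImage(cs, "k8s.gcr.io/nginx-slim:0.8"); err != nil {
		panic(err)
	}
}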

Anything else we need to know?:

  • I observed this behaviour both on 1.8 and 1.10.
  • This seems related to the discussion in #18568

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.7", GitCommit:"dd5e1a2978fd0b97d9b78e1564398aeea7e7fe92", GitTreeState:"clean", BuildDate:"2018-04-19T00:05:56Z", GoVersion:"go1.9
.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10+", GitVersion:"v1.10.5-gke.3", GitCommit:"6265b9797fc8680c8395abeab12c1e3bad14069a", GitTreeState:"clean", BuildDate:"2018-07-19T23:02:51Z", GoVersi
on:"go1.9.3b4", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: Google Kubernetes Engine
  • OS (e.g. from /etc/os-release): COS

cc @joe-boyce

@MrTrustor (Author) commented Aug 10, 2018

/sig apps
/sig scheduling

k8s-ci-robot added sig/apps and removed needs-sig labels Aug 10, 2018

@joe-boyce commented Aug 20, 2018

Anybody have any ideas on this one?

@enisoc (Member) commented Aug 21, 2018

As far as I can tell, StatefulSet doesn't make any attempt to support this use case, namely using a rolling update to fix a StatefulSet that's in a broken state. If any of the existing Pods are broken, it appears that StatefulSet bails out before even reaching the rolling update code:

if !isRunningAndReady(replicas[i]) && monotonic {
    glog.V(4).Infof(
        "StatefulSet %s/%s is waiting for Pod %s to be Running and Ready",
        set.Namespace,
        set.Name,
        replicas[i].Name)
    return &status, nil
}

I haven't found any mention of this limitation in the docs, but it's possible that it was a choice made intentionally to err on the side of caution (stop and make the human decide) since stateful data is at stake and stateful Pods often have dependencies on each other (e.g. they may form a cluster/quorum).

With that said, I agree it would be ideal if StatefulSet supported this, at least for clear cases like this one where deleting a Pod that's stuck Pending is unlikely to cause any additional damage.

cc @kow3ns

@mattmb commented Sep 7, 2018

+1, I just discovered this and had assumed that it would work more like the Deployment controller.

In https://github.com/yelp/paasta we are programmatically creating/updating Deployments and StatefulSets. For that to make sense I really want them to always attempt to converge to the definition.

bonifaido added a commit to banzaicloud/bank-vaults that referenced this issue Sep 13, 2018

Support StatefuleSet upgrade with workaround
This option gives us the option to workaround current StatefulSet limitations around updates
See: kubernetes/kubernetes#67250
By default it is false.

matyix added a commit to banzaicloud/bank-vaults that referenced this issue Sep 13, 2018

Support StatefuleSet upgrade with workaround
This option gives us the option to workaround current StatefulSet limitations around updates
See: kubernetes/kubernetes#67250
By default it is false.

@fejta-bot commented Dec 6, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@fejta-bot commented Jan 5, 2019

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@fejta-bot commented Feb 4, 2019

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot (Contributor) commented Feb 4, 2019

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mattmb commented Feb 7, 2019

/reopen

@k8s-ci-robot (Contributor) commented Feb 7, 2019

@mattmb: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mattmb commented Feb 7, 2019

Heh, well it was worth a go I suppose...

@MrTrustor (Author) commented Feb 8, 2019

/reopen

@k8s-ci-robot (Contributor) commented Feb 8, 2019

@MrTrustor: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot reopened this Feb 8, 2019

@huzhengchuan (Contributor) commented Mar 6, 2019

+1, I'm hitting the same problem.
I think the rollback needs to succeed, but right now it stays blocked.

@dave-tock commented Mar 11, 2019

+1. This is a pretty big landmine in using StatefulSet, if you ever make any mistake you're stuck with just destroying your StatefulSet and starting over. IOW, if you ever make a mistake with StatefulSet, you need to cause an outage to recover :(

@krmayankk (Contributor) commented Mar 11, 2019

/remove-lifecycle rotten

@enisoc (Member) commented Mar 13, 2019

I apologize for letting this bug slip through the cracks, and for implying in my earlier comment that the current behavior is expected and acceptable. My intention was to report that I can confirm the bug exists and that it appears to be a design flaw rather than a regression.

As @MrTrustor noted in the original post, the workaround for this particular incarnation of the problem is to revert to a good Pod template, and then delete the stuck Pod. As a first step, I'll send a PR to the StatefulSet docs to warn about this pitfall and describe the workaround.
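
For anyone who has to apply that workaround programmatically (as in the operator use cases mentioned in this thread), a minimal client-go sketch could look like the following; the function name, namespace, and pod name are illustrative assumptions, not part of the workaround as documented:

// Sketch of the documented workaround, to be run only after the StatefulSet
// has been reverted to a good Pod template: delete the stuck Pod and let the
// controller recreate it from the corrected template.
package main

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func deleteStuckPod(ctx context.Context, cs kubernetes.Interface, namespace, podName string) error {
	err := cs.CoreV1().Pods(namespace).Delete(ctx, podName, metav1.DeleteOptions{})
	if apierrors.IsNotFound(err) {
		return nil // already gone; the controller will recreate it anyway
	}
	return err
}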

The larger, unsolved problem is what to do when StatefulSet encounters a situation in which it can't be confident that any action it takes is likely to make things better instead of worse. StatefulSet was designed to provide automation for a "most-sensitive common denominator" of stateful apps, so it very often chooses to stop and do nothing if anything isn't going perfectly according to plan.

A thread on Twitter discussing this bug resulted in a suggestion by @danderson for a feature to help mitigate these situations: provide an "escape hatch" to tell StatefulSet to proceed anyway, even if it detects something wrong. I think that's a good starting point for making progress on the general problem. Perhaps it could take the form of a field in the RollingUpdate strategy that applies all the time whenever it's set, or alternatively a one-time "acknowledge and proceed" signal that would tell StatefulSet to attempt to force-fix just one Pod but then revert to normal behavior after that.
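
Purely as an illustration of the first option (no such field exists in apps/v1 today, and the field name below is made up), the escape hatch might be sketched like this:

// Hypothetical API sketch only; ForceReplacePending does not exist.
package sketch

// RollingUpdateStatefulSetStrategy mirrors the shape of the real apps/v1
// type (which has Partition) plus a made-up escape-hatch field.
type RollingUpdateStatefulSetStrategy struct {
	// Partition exists today: Pods with an ordinal >= Partition are updated.
	Partition *int32 `json:"partition,omitempty"`

	// ForceReplacePending is hypothetical: if true, the controller would
	// delete and replace a Pod that is stuck (e.g. Pending at an old
	// revision) instead of waiting for it to become Running and Ready.
	ForceReplacePending *bool `json:"forceReplacePending,omitempty"`
}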

For this specific incarnation of the general problem (a Pod that has never made it to Running state), we might be able to do something more automatic. Perhaps we can argue that it should always be safe to delete and replace such a Pod, as long as none of the containers (even init containers) ever ran. We'd also need to agree on whether this automatic fix-up can be enabled by default without being considered a breaking change, or whether it needs to be gated by a new field in StatefulSet that's off by default (until apps/v2).

I'll start investigating some of these options and report back any findings. If anyone else has ideas or input, please leave them here.

@andpol commented Mar 14, 2019

To add a bit more information to this discussion, it's worth noting that the issue described in the original post is not present if you specify podManagementPolicy: Parallel in the StatefulSet spec. From what I can tell, this is because monotonic is false for the Parallel policy in the code quoted above, so the early return is skipped. I guess this is somewhat implied by the policy name (rather than calling it ParallelReady or something), and the StatefulSet docs do hint at this, but it was still a surprise to me when performing a rolling update.
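
For reference, monotonic is derived from the pod management policy; the following is a paraphrase of the controller logic, not an exact quote of the source:

// Paraphrased from the StatefulSet controller: with podManagementPolicy:
// Parallel, allowsBurst returns true, so monotonic is false and the
// "waiting for Pod ... to be Running and Ready" early return quoted earlier
// is skipped.
package sketch

import appsv1 "k8s.io/api/apps/v1"

func allowsBurst(set *appsv1.StatefulSet) bool {
	return set.Spec.PodManagementPolicy == appsv1.ParallelPodManagement
}

// In the update loop: monotonic := !allowsBurst(set)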

@enisoc (Member) commented Mar 14, 2019

@andpol Thanks, that's a good point. I made sure to mention it in my PR documenting this known issue: kubernetes/website#13190

@janetkuo (Member) commented Mar 15, 2019

This works as intended and is documented here:

https://kubernetes.io/docs/tutorials/stateful-application/basic-stateful-set/#updating-statefulsets

The StatefulSet controller terminates each Pod, and waits for it to transition to Running and Ready prior to updating the next Pod. Note that, even though the StatefulSet controller will not proceed to update the next Pod until its ordinal successor is Running and Ready, it will restore any Pod that fails during the update to its current version. Pods that have already received the update will be restored to the updated version, and Pods that have not yet received the update will be restored to the previous version. In this way, the controller attempts to continue to keep the application healthy and the update consistent in the presence of intermittent failures.

When designing StatefulSet (with the OrderedReady policy), we expected that any StatefulSet pod failure would require user intervention. But as discussed above, we should provide users an escape hatch.

@danderson commented Mar 16, 2019

I think we're talking about different things. The case I hit seemed to be a straightforward logic bug: if you create a StatefulSet with a broken template (in my case, resource requests such that all pods remain Pending forever), the sts controller will ignore any changes to the StatefulSet from there on. I applied a corrected template that could definitely schedule and deleted all pods... and the StatefulSet recreated them using the broken template, where they went Pending again. I had to delete the entire StatefulSet and recreate it to get my template changes to apply.

That's what I meant when I spoke of overrides to @enisoc. While it's accepted that manual intervention will sometimes be needed (either by a human or a software operator), what I encountered was a situation where the StatefulSet ignored my interventions and kept deploying broken pods. That was my complaint, not "StatefulSet cannot guess what is safe or not for my arbitrary stateful workload"; that part definitely makes sense.

@enisoc (Member) commented Mar 18, 2019

@danderson Ah that's different. Can you give some more details to help me understand and reproduce the form of the bug you hit?

I tried the following steps with k8s v1.12.1:

  1. I created a StatefulSet apps/v1 with a memory request of 100Gi.
  2. StatefulSet created pod-0 which was stuck Pending due to resource request.
  3. I edited the StatefulSet to remove the memory request from the Pod template.
  4. StatefulSet did nothing. This is the form of the bug we've discussed above.
  5. I deleted pod-0. StatefulSet recreated pod-0 without the memory request (using updated template), and then proceeded to create the rest of the Pods. This is the workaround discussed above.

Since you said in your case "all pods" (implying more than one) were created but Pending, I suspected you may be using podManagementPolicy: Parallel so I tried the above steps with that setting:

  1. I created a StatefulSet apps/v1 with a memory request of 100Gi and podManagementPolicy: Parallel.
  2. StatefulSet created pod-0, pod-1, pod-2 but all were stuck Pending due to resource requests.
  3. I edited the StatefulSet to remove the memory request from the Pod template.
  4. StatefulSet deleted the Pending pod-2 and recreated it with the new Pod template, which became Running and Ready. Then it did the same for pod-1, followed by pod-0. This is consistent with the comment above that this bug should not manifest when the podManagementPolicy is Parallel.

Do you get different results? Or are there steps that I did differently?

@antonlisovenko commented Mar 22, 2019

This is a real blocker for programmatic usage of StatefulSets (from an operator, for example). The real use case: the operator creates the StatefulSet with memory/CPU limits that cannot be fulfilled, so 2 pods are running and the 3rd stays Pending because it cannot be scheduled to a node (no available resources). Trying to fix this by changing the specification to smaller limits doesn't help: the StatefulSet specification is updated, but all the pods stay unchanged forever because the 3rd pod is Pending. The only way out is to delete the pod manually, which totally contradicts the nature of operators.

@enisoc (Member) commented Mar 22, 2019

After some initial investigation, I believe the mitigation described above should be feasible:

For this specific incarnation of the general problem (a Pod that has never made it to Running state), we might be able to do something more automatic. Perhaps we can argue that it should always be safe to delete and replace such a Pod, as long as none of the containers (even init containers) ever ran. We'd also need to agree on whether this automatic fix-up can be enabled by default without being considered a breaking change, or whether it needs to be gated by a new field in StatefulSet that's off by default (until apps/v2).

The specific incarnation we're talking about is when all of the following are true:

  1. There's a Pod stuck Pending because it was created from a bad StatefulSet revision.
  2. Deleting that stuck Pod (i.e. the workaround discussed above) would result in the Pod being replaced at a different revision (meaning we have reason to expect a different result; we won't hot-loop).

In this situation, I think we can argue that it's safe for StatefulSet to delete the Pending Pod for you, as long as we can ensure the Pod has not started running before we get around to deleting it. The argument would be, if the Pod never ran, then the application should not be affected one way or another if we delete it. We could potentially use the new ResourceVersion precondition on Delete to give a high confidence level that the Pod never started running.
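
A sketch of what such a precondition-guarded delete could look like is below; the helper names are illustrative and this is not the actual controller code:

// Sketch: delete a stuck Pod only if (a) nothing in its last observed status
// indicates any container or init container ever started, and (b) the Pod's
// resourceVersion is unchanged at delete time, so a late kubelet status
// update turns the delete into a conflict instead of a risky removal.
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func neverRan(pod *corev1.Pod) bool {
	if pod.Status.Phase != corev1.PodPending {
		return false
	}
	statuses := append([]corev1.ContainerStatus{}, pod.Status.InitContainerStatuses...)
	statuses = append(statuses, pod.Status.ContainerStatuses...)
	for _, s := range statuses {
		if s.State.Running != nil || s.State.Terminated != nil || s.RestartCount > 0 {
			return false
		}
	}
	return true
}

func deleteIfNeverRan(ctx context.Context, cs kubernetes.Interface, pod *corev1.Pod) error {
	if !neverRan(pod) {
		return nil
	}
	rv := pod.ResourceVersion
	return cs.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{
		Preconditions: &metav1.Preconditions{ResourceVersion: &rv},
	})
}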

There is still a very slight chance that a container started running and had some effect on the application already but the kubelet has not updated the Pod status yet. However, I would argue that the chance of that is small enough that we should take the risk in order to prevent StatefulSet from getting stuck in this common situation.

I probably won't have time to work on the code for this any time soon since I'm about to change jobs. However, I'm willing to commit to being a reviewer if anyone is available to work on this.

@TopherGopher commented Jun 4, 2019

@enisoc I'm going to try taking a whack at this

@draveness (Member) commented Jun 5, 2019

This issue is similar to #78007

@dims (Member) commented Jul 24, 2019

/sig architecture
