
Statefulset pods disappearing after initial correct start #41012

Closed
JorritSalverda opened this issue Feb 6, 2017 · 7 comments
@JorritSalverda

BUG REPORT

I'm running Elasticsearch in a StatefulSet, and within roughly 36 hours all 3 pods except the first one (-0) have disappeared. The first one is in a CrashLoopBackOff state.

I would expect that once a StatefulSet has started correctly, the individual pods are no longer dependent on the first one running correctly.

Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.2", GitCommit:"08e099554f3c31f6e6f07b448ab3ed78d0520507", GitTreeState:"clean", BuildDate:"2017-01-12T04:57:25Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"windows/amd64"}
Server Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.1", GitCommit:"82450d03cb057bab0950214ef122b67c83fb11df", GitTreeState:"clean", BuildDate:"2016-12-14T00:52:01Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • GKE, GCI node image, version 1.5.2

What happened:

All StatefulSet replicas disappeared, except for the first one, which was in a CrashLoopBackOff state.

What you expected to happen:

The second and third pods to stay operational even if the first one fails.

How to reproduce it (as minimally and precisely as possible):

I haven't been able to reproduce it, but it has happened a couple of times within a week.

Anything else we need to know:

The StatefulSet runs in a GKE cluster on top of preemptible nodes. To prevent the preemptibles from all expiring at once, I stop and delete them at random moments before the 24-hour limit is up. This spreads deletions out and makes it less likely that more than one host is removed at a time.

There's also

  • a PodDisruptionBudget for the StatefulSet with minAvailable set to n-1
  • a headless service
  • a regular ClusterIP service for access to the nodes
  • a service spanning multiple StatefulSets to prevent Elasticsearch master, data, and client nodes from running on the same hosts
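For reference, the PodDisruptionBudget from the first bullet could look roughly like the sketch below. The names and labels are assumptions, since the actual manifests weren't posted; with 3 replicas, minAvailable at n-1 means 2:

```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: es-data-pdb          # hypothetical name
spec:
  minAvailable: 2            # n-1 for a 3-replica StatefulSet
  selector:
    matchLabels:
      app: elasticsearch     # assumed pod label on the StatefulSet's pods
```

Note that a PDB only limits voluntary disruptions such as `kubectl drain`; it cannot protect pods on preemptible VMs that are deleted outright.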
@0xmichalis
Contributor

@kubernetes/sig-apps-bugs

@foxish
Contributor

foxish commented Feb 7, 2017

I would expect that once a StatefulSet has started correctly, the individual pods are no longer dependent on the first one running correctly.

This is correct. If pod-0 restarts, the StatefulSet controller will bring it back up, but will not affect pod-1 and pod-2, which are already running. More likely, one of your node deletions took down pod-1 and pod-2 after pod-0 went unhealthy. In that case, we do not attempt to recreate pod-1 or pod-2 until pod-0 becomes healthy again. The rationale is that users rely on the deterministic initialization order and write logic around that guarantee. Bringing up the pods in arbitrary order would violate it.

I would recommend studying the particular application you're running to ensure that it does not enter an unhealthy state and can indeed tolerate failures and come back up successfully.

@smarterclayton
Contributor

smarterclayton commented Feb 7, 2017 via email

@foxish
Contributor

foxish commented Feb 8, 2017

I imagine such non-determinism would apply at initialization time as well? Can you expand on the potential use cases of what you mention? If we want to do this, we should open an issue and start collecting concrete use cases that point to this need.

@smarterclayton
Contributor

smarterclayton commented Feb 8, 2017 via email

@JorritSalverda
Author

I'm running a modified version of Elasticsearch (https://github.com/pires/docker-elasticsearch-kubernetes) that uses the SVC DNS records for discovery and doesn't depend on the startup order. It could of course use a Deployment instead of a StatefulSet, but I need the persistent disks to be provisioned, hence the use of the StatefulSet.

The dependency on the first node lowers total availability a lot, especially since we're running on preemptibles that do not last longer than 24 hours.

For a lot of other applications I would expect the order to matter only the very first time the cluster starts up; after that it would be better to join the cluster via a service.

Is there an annotation that can modify this behaviour for scenarios like these where it isn't needed?
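For context: later Kubernetes releases (1.7 and up) added a StatefulSet spec field, rather than an annotation, that relaxes exactly this guarantee. Setting podManagementPolicy to Parallel makes the controller launch and replace pods without waiting on ordinal order. A sketch showing only the field placement (names are hypothetical; the rest of the spec is omitted):

```yaml
apiVersion: apps/v1beta1          # API version that introduced the field
kind: StatefulSet
metadata:
  name: elasticsearch             # hypothetical name
spec:
  podManagementPolicy: Parallel   # default is OrderedReady
  serviceName: elasticsearch      # assumed headless service name
  replicas: 3
  # ... template and volumeClaimTemplates omitted ...
```

This was not available in the 1.5.x versions discussed in this thread.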

@JorritSalverda
Author

When I opened this ticket I wasn't using pod anti-affinity in the StatefulSet. Running on top of preemptible VMs, this led to the failure described above. Since adding anti-affinity, however, we haven't run into this issue. Closing the ticket.
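A minimal sketch of the pod anti-affinity that resolved this, assuming an `app: elasticsearch` pod label (the actual manifest wasn't posted). It forces the scheduler to place each replica on a different host, so a single preempted node can take down at most one pod:

```yaml
# under the StatefulSet's pod template spec
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: elasticsearch              # assumed pod label
        topologyKey: kubernetes.io/hostname # at most one matching pod per node
```

On Kubernetes 1.5, affinity was still expressed via the scheduler.alpha.kubernetes.io/affinity annotation; the spec.affinity field shown here arrived in 1.6.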

5 participants