prometheus-operator repeatedly deletes prometheus StatefulSet once pods reach ContainerCreating #2950
Comments
That’s indeed odd and shouldn’t happen. It seems like some illegal update is continuously attempted. It would be good if we added what the illegal action was, as I believe the StatefulSet update API does return this information.
cc @pgier @s-urbaniak I think this could potentially have to do with the controller generation tooling changes?
I should also note that this occurred after I updated.
It would be great if you could start a
I have tested using the image. It appears that this behaviour may be related to the changes in Kubernetes 1.17.
Checked, no issue with OCP 4.4.
I don't think this is related to the CRD and build changes, since those weren't released until v0.35.0 and this is reported against v0.34.0. I haven't been able to reproduce this yet, unfortunately. @chaosaffe, what do your `storageclass/standard` and the associated PVs look like? It could also be helpful to turn on debug logging in prometheus-operator, e.g. as sketched below.
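As a hedged sketch, assuming the operator exposes a `--log-level` flag (present in recent versions) and runs with a container named `prometheus-operator` (both assumptions, since the exact flag was elided above):

```yaml
# Hypothetical excerpt of the prometheus-operator Deployment (names assumed)
containers:
  - name: prometheus-operator
    args:
      - --log-level=debug   # assumed flag; turns on verbose operator logging
```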
I was able to reproduce this and found this error from the API server:
It appears that this validation doesn't occur in kube v1.16. The issue only occurs when the `apiVersion` and `kind` fields are set on the `volumeClaimTemplate`. As a workaround for now, you can remove these two fields from the `volumeClaimTemplate`.
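For illustration, a sketch of the relevant part of a Prometheus manifest showing the two fields to drop; the storage class and size are placeholders, not taken from this report:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
spec:
  storage:
    volumeClaimTemplate:
      # apiVersion: v1               <- remove; the API server defaults this to ""
      # kind: PersistentVolumeClaim  <- remove; the API server defaults this to ""
      spec:
        storageClassName: standard   # placeholder
        resources:
          requests:
            storage: 10Gi            # placeholder
```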
Great finding, @pgier! I believe this can then also bite us in OpenShift 4.4, as we are on k8s 1.17, right?
We have some logic that attempts to "resolve" this by deleting the StatefulSet completely and re-creating it. But it seems to me that in this case we end up in some sort of a loop of this behavior.
@s-urbaniak yes, I believe this will affect us in OpenShift as well if either of those fields is set. @brancz right, the prometheus operator keeps trying to resolve the difference between the StatefulSet generated from the Prometheus CRD config and the StatefulSet running in kube. The update fails because of the validation error, so prometheus-operator deletes the running sts and tries to create it again, which starts the process over. It's partly due to #2801, which included the running StatefulSet when generating the hash, but I'm not sure yet whether reverting that would completely fix the issue. I'm kind of waiting to see what the kube API team has to say about the issue I filed, because it seems like a significant bug that you can't apply the same config twice.
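A minimal, self-contained sketch of the loop described above; all names are illustrative stand-ins, not the operator's actual code:

```go
package main

import "fmt"

// statefulSet is a toy stand-in for appsv1.StatefulSet.
type statefulSet struct {
	name string
	hash string // the operator's hash annotation, derived from its inputs
}

// stored stands in for the StatefulSet persisted by the API server.
var stored *statefulSet

// update simulates the k8s v1.17 validation: the embedded PVC template's
// apiVersion/kind were defaulted to "" on create, so any update that sets
// them again is rejected and the call always fails.
func update(desired statefulSet) error {
	return fmt.Errorf("StatefulSet.apps %q is invalid", desired.name)
}

func sync(desired statefulSet) {
	if stored == nil {
		stored = &desired
		fmt.Println("created", desired.name)
		return
	}
	if stored.hash == desired.hash {
		return // specs match; nothing to do
	}
	if err := update(desired); err != nil {
		// Fallback: delete and recreate. Because the hashes never
		// converge, every sync round trips through this branch.
		fmt.Println("update rejected, deleting and recreating:", err)
		stored = &desired
	}
}

func main() {
	for i := 0; i < 3; i++ {
		// The hash differs every round (cf. #2801), so the loop never settles.
		sync(statefulSet{name: "prometheus-k8s", hash: fmt.Sprintf("h%d", i)})
	}
}
```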
Add a new hash annotation that tracks the state of the StatefulSet spec separately from the inputs (Prometheus, Config, ConfigMaps). This hash annotation is added immediately after the StatefulSet is created, and is checked for changes during updates to detect whether there were manual updates to the StatefulSet spec. This prevents an issue (prometheus-operator#2950) where the StatefulSet is continuously deleted and then recreated in Kubernetes v1.17 due to a mismatch between the hash annotation and the state of the StatefulSet spec. The issue only occurs in Kubernetes v1.17 because the API is stricter about which parts of a StatefulSet can be updated.
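A sketch of what that approach could look like; the annotation key and helper names are assumptions for illustration, not the operator's actual identifiers:

```go
package operator

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"

	appsv1 "k8s.io/api/apps/v1"
)

// specHashAnnotation is an assumed key; the real annotation name may differ.
const specHashAnnotation = "prometheus-operator-spec-hash"

// specHash hashes only the generated StatefulSet spec, independent of the
// operator's inputs (Prometheus, Config, ConfigMaps).
func specHash(spec appsv1.StatefulSetSpec) (string, error) {
	b, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

// stampSpecHash records the spec hash right after the StatefulSet is
// created; later syncs compare against it to detect manual spec edits.
func stampSpecHash(sset *appsv1.StatefulSet) error {
	h, err := specHash(sset.Spec)
	if err != nil {
		return err
	}
	if sset.Annotations == nil {
		sset.Annotations = map[string]string{}
	}
	sset.Annotations[specHashAnnotation] = h
	return nil
}
```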
Changes in Kubernetes v1.17 cause an endless update loop due to validation errors caused by setting the 'apiVersion' and 'kind' fields in the StatefulSet spec. These two fields are set to empty strings by the Kube API server, and v1.17 added validation that does not allow these fields to be modified. See prometheus-operator#2950
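A sketch of the kind of fix this describes, under the assumption that the offending fields come from the user-supplied PVC template embedded in the StatefulSet spec (the helper name is illustrative):

```go
package operator

import (
	appsv1 "k8s.io/api/apps/v1"
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// embedVolumeClaimTemplate clears the TypeMeta (apiVersion/kind) of a
// user-supplied PVC template before embedding it in the StatefulSet spec.
// The API server persists these as empty strings, and Kubernetes v1.17
// rejects any later update that would change them back.
func embedVolumeClaimTemplate(sset *appsv1.StatefulSet, tmpl v1.PersistentVolumeClaim) {
	tmpl.TypeMeta = metav1.TypeMeta{} // drop apiVersion and kind
	sset.Spec.VolumeClaimTemplates = append(sset.Spec.VolumeClaimTemplates, tmpl)
}
```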
I'm still having this issue with kube 1.17.1; prometheus-k8s-0 and prometheus-k8s-1 keep terminating. I just tested with kube-prometheus release-0.35. Any info on how to fix this?
@jrcjoro Can you try using prometheus-operator v0.35.1? The other workaround is to remove the `apiVersion` and `kind` fields from the `volumeClaimTemplate`:

```yaml
volumeClaimTemplate:
  apiVersion: v1
  kind: PersistentVolumeClaim
```
Unfortunately, neither 0.35.1 nor 0.36.0 resolves this bug in our setup (K8s v1.17.2). We are using the Helm chart to set up the Prometheus instance, i.e. we don't have the apiVersion/kind fields in our generated Prometheus manifest.
@pgier had this issue; removing the `apiVersion` and `kind` fields resolved it.
I'm on AWS EKS v1.19, Rancher v2.3.4, and prometheus-operator 0.47.1. I had to disable Rancher Monitoring for this issue to stop. It would be great if you could add it as a note to the docs, something like:
That seems like a good doc entry for Rancher. @unfor19, maybe consider pinging the Rancher team about it? As maintainers of prometheus-operator, we cannot list all the quirks of every platform.
@paulfantom I guess you're right, I'll do that |
What happened?
On upgrading to `v0.34.0`, the `prometheus-operator` started deleting the `prometheus-k8s` StatefulSet once the pods reached the `ContainerCreating` status. When the operator is scaled to 0 (terminated) after the StatefulSet is created but before the pods enter the `ContainerCreating` status, the pods are able to successfully start and run the `prometheus-k8s` pods.
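For reference, a sketch of the scale-down step described above, assuming the operator runs as a Deployment named `prometheus-operator` in the `monitoring` namespace (both names are assumptions, not taken from this report):

```sh
# Scale the operator to 0 so it stops reconciling (and deleting) the StatefulSet
kubectl -n monitoring scale deployment prometheus-operator --replicas=0
```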
Did you expect to see something different?
Yes, I expected that the StatefulSet would be created and not be repeatedly deleted by the operator.
How to reproduce it (as minimally and precisely as possible):
Unknown. The issue recurs in this environment but has not been seen in other environments.
Environment
Prometheus Operator version:
v0.34.0
Kubernetes version information:
Kubeadm
This recurs repeatedly within the sync loop.
Anything else we need to know?: