kube-prometheus & kubernetes 1.5.2 - prometheus-k8s-0 node - docker stops responding #239
Comments
Interesting. If a docker restart solves this, do you think it makes sense for us to track this here? It's certainly good to know for future reference, so thanks for reporting! 🙂
It seems like Prometheus can really accelerate the issue. We have one cluster that hit the problem twice in four days, and it was always the node running Prometheus.
Hmm, odd. I can't imagine how Prometheus could play into this, except maybe by causing high memory pressure on the host. Maybe you can try setting the kubelet flags to reserve a bit more system memory so the kubelet doesn't overcommit the node.
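For reference, a minimal sketch of the kind of kubelet reservation flags meant here; the values are purely illustrative and how you pass them depends on your provisioning tool (with kops they would go into the cluster spec's kubelet section):

```sh
# Reserve memory for system daemons (docker, sshd, ...) and for Kubernetes
# components, and evict pods before the node runs out of memory entirely.
# Example values only; size them to your instance type.
kubelet \
  --system-reserved=memory=1Gi \
  --kube-reserved=memory=512Mi \
  --eviction-hard='memory.available<500Mi'
```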
I see it on nodes without Prometheus too; it just seems to happen more often when Prometheus is on the node...
I just hit the same issue after deploying the operator and ...
Definitely @klausenbusk! However, I don't think this is an issue with Prometheus itself; it looks more like a Docker and/or Kubernetes issue. Are either of you mounting storage? It seems some people have trouble with that. Maybe you can set up an extra Prometheus that monitors only your node resources; that way we could see whether node resource usage correlates with this issue. My suspicion, as stated above, is that this happens due to memory or disk pressure.
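As a quick first check, independent of a dedicated Prometheus, the kubelet already reports memory and disk pressure as node conditions. Something like this (node name taken from the listing in this issue; the grep window is just a convenience) would show them:

```sh
# Inspect the conditions the kubelet reports for the affected node
# (MemoryPressure, DiskPressure, OutOfDisk, ...).
kubectl describe node ip-10-101-118-222.ec2.internal | grep -A 10 'Conditions:'
```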
You are probably correct.
Kernel log: https://gist.github.com/klausenbusk/21e05251e3b170c48bc66e1c6a081b64. Hmm, I need to leave now...
I'm guessing that would be related to Prometheus time series churn: pods keep appearing and disappearing, and every time series in Prometheus 1.x is one file on disk. This is going to change in Prometheus 2.0. Until then, one thing to do is to increase the number of inodes on the hosts that Prometheus runs on, or choose a low retention, so that stale time series get garbage collected faster and therefore fewer files end up on disk. Let us know if that helps.
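To make that concrete, a hedged sketch: check whether the filesystem backing the Prometheus data directory is out of inodes, and the Prometheus 1.x retention flag being referred to. The path and the 72h value are only examples, and with the Prometheus Operator the retention is set via the Prometheus resource rather than by passing the flag by hand:

```sh
# IUse% close to 100% means the disk is out of inodes even if df -h shows free space.
df -i /var/lib

# Prometheus 1.x retention flag; shorter retention means stale series files
# are garbage collected sooner, keeping the file count down.
prometheus -storage.local.retention=72h
```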
Hmm, that sounds weird. The cluster isn't very big (3 masters + 4 workers) and the pod count is also "low" (guessing 20-30, not at my work computer), and I don't add/remove pods very often.
That would require reformatting the filesystem, which isn't easily done on a cloud provider (DigitalOcean).
Prometheus had only been running for a few hours in a 7-node cluster (3 masters + 4 workers). If I set the retention that low, Prometheus would be more or less useless.
I'm wondering if this theory by @guoshimin could explain this / is the cause of this issue: kubernetes/kubernetes#39028 (comment)
Also see moby/moby#32007. I just experienced this again today but haven't had time yet to look at the logs. I'll probably end up disabling SELinux for Docker (it sounds like a plausible explanation).
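If I do, it would presumably be something like the following sketch on a systemd host. This assumes the daemon reads /etc/docker/daemon.json (and overwrites any existing file there), and it only disables Docker's SELinux label management, which is a security trade-off, not a fix for the underlying relabeling cost:

```sh
# Disable SELinux label handling in the Docker daemon (overwrites daemon.json!).
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "selinux-enabled": false
}
EOF
sudo systemctl restart docker
```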
That does sound plausible, but unfortunately I don't see this going away until we fully support Prometheus 2.0, which is in its first alpha release right now. Our plan is to start supporting it once it goes into beta.
Have you been able to test the Prometheus 2.0 alpha releases in this regard? The issue of SELinux recursively relabeling all files, mentioned in moby/moby#32892, should no longer be a problem with Prometheus 2.0, as the number of files created by Prometheus is drastically smaller. I realize this is not a valid solution until Prometheus 2.0 hits a stable release, but it would be great to know whether the problem is solved. The Prometheus 2.0 pre-releases have experimental support in the Prometheus Operator, so it's possible to try them out. We would highly appreciate feedback!
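A sketch of how to try a 2.0 pre-release through the operator, assuming the default kube-prometheus object name and namespace (prometheus "k8s" in "monitoring") and that your operator version accepts a 2.x tag in spec.version; the exact tag is only an example:

```sh
# Point the existing Prometheus resource at a 2.0 pre-release and let the
# operator roll the StatefulSet to the new image.
kubectl -n monitoring patch prometheus k8s --type merge \
  -p '{"spec":{"version":"v2.0.0-beta.0"}}'
```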
Has anyone been able to try out Prometheus 2.0-beta.0 regarding this issue? Prometheus 2.0 is nearing GA, but we'd appreciate as much testing as possible, and if it solves this issue, then even better.
I also seem to be having this issue (CoreOS 1465.7.0, Kubernetes 1.6.1, Prometheus 1.0.1). However, I'm mounting the Prometheus data directory to the host disk. @klausenbusk Did disabling SELinux fix the issue for you?
I haven't experienced this for some time, and the cluster has changed a bit since (k8s 1.5 -> self-hosted 1.7).
Prometheus 2.0 stable has been released, and the Prometheus Operator fully supports Prometheus 2.0, so I will close this issue here. Feel free to open new issues regarding Prometheus 2.0. The issue described in this post is fundamentally not solvable with Prometheus 1.x, therefore we recommend switching to 2.0.
I don't believe that there has been a release of prometheus-operator that supports prometheus-2.0 stable since #735 |
Prometheus 2.0 support has been in the Prometheus Operator for multiple releases now 🙂
What did you do?
hack/cluster-monitoring/deploy
What did you expect to see?
Stable cluster
What did you see instead? Under which circumstances?
docker stops responding on the node; I need to do a docker restart to recover. The node just went down again after 3 days...
Environment
AWS, kops 1.5.1, Kubernetes 1.5.2
Kubernetes version information:
--- kubernetes/kops ‹master› » ku version
Client Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.2", GitCommit:"08e099554f3c31f6e6f07b448ab3ed78d0520507", GitTreeState:"clean", BuildDate:"2017-01-12T04:57:25Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.2", GitCommit:"08e099554f3c31f6e6f07b448ab3ed78d0520507", GitTreeState:"clean", BuildDate:"2017-01-12T04:52:34Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}
Kubernetes cluster kind:
kops 1.5.2
Manifests:
https://github.com/coreos/kube-prometheus.git : 333bd23
I plan to test release 0.7.0 soon...
Node looks like this before a /etc/init.d/docker restart to fix:

--- kubernetes/kops ‹master› » ku get po --all-namespaces -o wide | grep ip-10-101-118-222.ec2.internal
athena-graphql          athena-graphql-cmd-290124063-x2c19           1/1   Unknown    1   9d   100.96.33.50     ip-10-101-118-222.ec2.internal
deis                    deis-controller-2434209242-ztkcs             1/1   Unknown    3   9d   100.96.33.47     ip-10-101-118-222.ec2.internal
deis                    deis-logger-fluentd-19csf                    1/1   NodeLost   1   9d   100.96.33.44     ip-10-101-118-222.ec2.internal
deis                    deis-logger-redis-304849759-9z5g4            1/1   Unknown    1   5d   100.96.33.42     ip-10-101-118-222.ec2.internal
deis                    deis-monitor-telegraf-xf6mm                  1/1   NodeLost   1   9d   100.96.33.36     ip-10-101-118-222.ec2.internal
deis                    deis-router-3101872284-nmwgf                 1/1   Unknown    1   9d   100.96.33.43     ip-10-101-118-222.ec2.internal
deis                    deis-workflow-manager-2528409207-7pttp       1/1   Unknown    1   5d   100.96.33.34     ip-10-101-118-222.ec2.internal
hades-graphql           hades-graphql-cmd-459006866-r3pbl            1/1   Unknown    1   3d   100.96.33.48     ip-10-101-118-222.ec2.internal
kube-system             kube-proxy-ip-10-101-118-222.ec2.internal    1/1   Unknown    1   9d   10.101.118.222   ip-10-101-118-222.ec2.internal
monitoring              grafana-1046448512-l8cgh                     2/2   Unknown    2   9d   100.96.33.40     ip-10-101-118-222.ec2.internal
monitoring              kube-state-metrics-4090613309-mnbrj          1/1   Unknown    1   9d   100.96.33.32     ip-10-101-118-222.ec2.internal
monitoring              node-exporter-sz8r4                          1/1   NodeLost   1   9d   10.101.118.222   ip-10-101-118-222.ec2.internal
monitoring              prometheus-k8s-0                             2/2   Unknown    2   5d   100.96.33.51     ip-10-101-118-222.ec2.internal
monitoring              prometheus-operator-3658205960-2zpfp         1/1   Unknown    1   9d   100.96.33.49     ip-10-101-118-222.ec2.internal
programs-service        programs-service-cmd-1240201140-mjd53        1/1   Unknown    0   2d   100.96.33.52     ip-10-101-118-222.ec2.internal
speech-to-text-nodejs   speech-to-text-nodejs-cmd-2508035524-zk217   1/1   Unknown    1   9d   100.96.33.45     ip-10-101-118-222.ec2.internal
splunkspout             k8ssplunkspout-nonprod-c0d1w                 1/1   NodeLost   1   9d   100.96.33.35     ip-10-101-118-222.ec2.internal
styleguide              styleguide-cmd-3772371803-2tdlw              1/1   Unknown    1   9d   100.96.33.38     ip-10-101-118-222.ec2.internal
styleguide              styleguide-cmd-3772371803-hsg0w              1/1   Unknown    1   9d   100.96.33.37     ip-10-101-118-222.ec2.internal
styleguide-staging      styleguide-staging-cmd-83554885-cb2pq        1/1   Unknown    1   9d   100.96.33.39     ip-10-101-118-222.ec2.internal
wellbot                 wellbot-web-2518992024-1bvml                 1/1   Unknown    1   9d   100.96.33.46     ip-10-101-118-222.ec2.internal
I think these tickets are related: kubernetes/kubernetes#42164
kubernetes/kubernetes#39028