Missing container metrics in kubelet (cAdvisor) in v1.5.1 #39812
Seems to be fixed in 1.5.2.
It's not fixed in 1.5.2. In my experience it works after restarting the kubelet, but then gets into the state where it doesn't report container metrics again somehow. Exactly the same setup as the original poster:
@kubernetes/sig-node-bugs
@ichekrygin @jakexks does the kubelet summary endpoint have that information? To verify this you can use:
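The exact command was elided above; a typical way to check (assuming the kubelet's default read-only port 10255 and the /stats/summary path — adjust if your setup differs) is:

```shell
#!/bin/sh
# Query the kubelet summary API and count the pod entries in the JSON.
# Port 10255 is the default read-only port; change it if yours differs.

summary_pod_count() {
    # Reads summary JSON from stdin; each pod entry carries a "podRef" key.
    grep -o '"podRef"' | wc -l
}

curl -s http://127.0.0.1:10255/stats/summary | summary_pod_count
```

A healthy node reports one entry per running pod; in the broken state described here the pods section is empty.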
We have had some issues in the past with stats disappearing on certain CoreOS distributions:
@piosz There's nothing in the kubelet summary either:
(There are many pods running on the node)
Same issue with latest CoreOS stable. Metrics seem to disappear after a few hours.
I think #33192 might be related?
It absolutely affects HPA (it stops HPAs with pods on broken nodes from scaling in either direction). That's how we initially discovered this issue on our systems.
@philk just curious, how are you guys coping with it? I am at the point of setting up a
@ichekrygin just a simple script that runs every 5 minutes, |
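The script itself was elided above; a minimal sketch of such a watchdog (the metric prefix, port, and restart action are assumptions on my part, not philk's actual script) might look like:

```shell
#!/bin/sh
# Watchdog sketch: restart the kubelet when its /metrics endpoint stops
# exporting cAdvisor container_* series. Run from cron or a systemd timer.
METRICS_URL="${METRICS_URL:-http://127.0.0.1:10255/metrics}"

missing_container_metrics() {
    # All cAdvisor-derived series begin with "container_"; zero matches
    # means the kubelet has stopped reporting container metrics.
    [ "$(curl -s "$METRICS_URL" | grep -c '^container_')" -eq 0 ]
}

if missing_container_metrics; then
    echo "no container_* metrics found, restarting kubelet"
    systemctl restart kubelet || echo "restart failed; investigate manually"
fi
```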
@dashpole I believe it is fixed. I closed that issue (I'm sure someone will comment if not), but I have been unable to reproduce (so far) on our 1.5 AWS images.
This is what I see in
I have the same problem on CoreOS 1235.9.0 with Kubernetes 1.5.3. @philk workaround works like a charm, but the problem is that I don't want to restart kubelet on a (temporarily) cordoned host (where pods section is empty) all the time. So I tried with something like this: kubelet-periodic-check.service
kubelet-periodic-check.timer
I (ab)use the fact that when the problem happens http://127.0.0.1:4194/containers/system.slice/etcd2.service returns:
and etcd2 should usually run all the time on my nodes. Perhaps this will help somebody. Hopefully this issue will get resolved properly soon, though.
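A one-shot version of that periodic check (the URL and unit name come from the comment above; the restart action is only echoed here, as a sketch) could be:

```shell
#!/bin/sh
# Probe cAdvisor's page for a systemd unit that should always exist
# (etcd2 on these nodes); an error response indicates the stuck state.
CADVISOR_URL="http://127.0.0.1:4194/containers/system.slice/etcd2.service"

cadvisor_sees_unit() {
    # -f makes curl exit non-zero on an HTTP error response.
    curl -sf "$CADVISOR_URL" >/dev/null
}

if cadvisor_sees_unit; then
    echo "cAdvisor is tracking system units"
else
    echo "cAdvisor lost track of system units; a kubelet restart may be needed"
fi
```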
This issue seems to affect older versions as well; see #33192
I'm seeing this issue as well and it's blocking HPA from doing any scaling in my production clusters. I would like to avoid adding hacks such as periodically restarting the kubelet if possible. Is anyone currently working on a fix for this? If not, I don't mind digging into it a bit; if someone can point me in the right direction for where to start that would be great. Seems like the cAdvisor code in kubelet would be a good start?
cc @dashpole Can you investigate this issue?
I'll check it out.
This may already be fixed, although I am not sure it has made it into Kubernetes yet.
@dashpole any chance we can get a vendor update for cAdvisor in for the next release?
What are the chances of fixing this for 1.4.x / 1.5.x? Thanks!
@dashpole we should cherry-pick the cAdvisor fix into the release-v0.24 branch so it can then be picked into the k8s 1.4/1.5 branches.
cherrypick to 1.5: #43113
There was a fix for this in 1.5.6; I upgraded my worker nodes to run 1.5.6 and I still see this bug.
@dashpole @timstclair can you please take a look?
@andrewsykim are you running on a systemd-based system?
@dashpole yes I am (CoreOS)
@andrewsykim would you mind opening an issue against cAdvisor and giving us some of the error messages you see in your logs? I have no experience with systemd, but I can find someone who does to check out your specific case, since it wasn't fixed by #1573
We ran into this issue when running kubelet in a rkt container. We were seeing error messages of the form: It appears cAdvisor tries to read those files, but inside the rkt container /storage/docker was not mounted. The solution was to bind mount that directory when starting kubelet in rkt with:
Interestingly, we only saw errors on our master nodes (i.e. those running kube-apiserver). Worker nodes don't have the issue even though the docker dir isn't mounted, and can still export pod metrics. No idea why that is the case ¯\_(ツ)_/¯
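The exact flags were elided above; on CoreOS with the kubelet-wrapper this kind of bind mount is usually expressed as rkt volume/mount flags in the kubelet unit, roughly like the following sketch (the volume name and the host Docker root are assumptions; adjust them to your layout):

```ini
# Fragment of the kubelet systemd unit: bind-mount the host Docker
# directory into the rkt pod so cAdvisor can read container state.
[Service]
Environment="RKT_RUN_ARGS=--volume var-lib-docker,kind=host,source=/var/lib/docker \
  --mount volume=var-lib-docker,target=/var/lib/docker"
ExecStart=/usr/lib/coreos/kubelet-wrapper --v=2
```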
Is there an update on this? I'm running k8s 1.7.2 and am also noticing that port 10255/metrics does not show pod metrics. However, port 10255/stats/summary does. Restarting the kubelet does not change anything.
Actually, never mind: apparently the behavior of cAdvisor has changed in 1.7. I was able to get Prometheus to grab pod metrics by following the setup mentioned here: https://raw.githubusercontent.com/prometheus/prometheus/master/documentation/examples/prometheus-kubernetes.yml
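For reference, in 1.7 the kubelet no longer merges cAdvisor series into /metrics; the linked Prometheus example scrapes them from a separate path instead. A quick manual check (port and path as in that example; this assumes the default read-only port on your node):

```shell
#!/bin/sh
# Fetch cAdvisor container series from the kubelet's dedicated path
# (split out of /metrics in 1.7) and count them.

cadvisor_series() {
    curl -s http://127.0.0.1:10255/metrics/cadvisor
}

count="$(cadvisor_series | grep -c '^container_')"
echo "container series: $count"
```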
Issues go stale after 90d of inactivity. Prevent issues from auto-closing with an /lifecycle frozen comment. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/reopen We have seen this on two clusters in recent weeks, running k8s v1.9.3. It appears that kubelet's view of the universe has diverged significantly from Docker's, hence it does not have the metadata to tag container metrics. Lots of these in the kubelet logs:
kubelet is running as a systemd service, not in a container. Restarting kubelet fixed the issue for today's instance. I'm told that on another occasion it was necessary to drain the node, delete all Docker files, and restart.
@bboreham: you can't re-open an issue/PR unless you authored it or you are assigned to it. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@bboreham can you open an issue with cAdvisor? We can debug it there.
Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.): No, this looks like a regression in v1.5.1
What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.): kubelet, metrics, cAdvisor
Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT
Kubernetes version (use kubectl version):
Environment:
uname -a: Linux ip-10-72-161-5 4.7.3-coreos-r2 #1 SMP Sun Jan 8 00:32:25 UTC 2017 x86_64 Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz GenuineIntel GNU/Linux
What happened: after upgrading to v1.5.1, cAdvisor does not show any subcontainers, resulting in ALL container system metrics missing. For example:
What you expected to happen: cAdvisor returns sub-containers and container metrics like so:
How to reproduce it (as minimally and precisely as possible):
upgrade kubelet to
v1.5.1
and check metrics via the metrics endpoint:
curl localhost:10255/metrics
or via the cAdvisor UI:
http://10.72.20.134:4194/containers/
Anything else we need to know:
docker version: