Duplicated metrics when using kube-state-metrics sharding with `--use-apiserver-cache` #1679
Comments
This might be because the mechanism we rely on to serve cached data from the apiserver is vulnerable to stale reads: kubernetes/kubernetes#59848. I think the help text doesn't emphasize enough that this flag should be used with care because of that. That said, do you know if this is reproducible with only one instance of kube-state-metrics? It looks like there is way more stale data than I would have expected.
We're seeing a similar thing in #1569, although the number of metrics is not growing unboundedly like you're seeing.
@jpdstan you were seeing the same problem without sharding, right?
@dgrisonnet this is with sharding as well: 10 pods, approx. 500k unique metrics each.
I wonder if we have a bug in https://github.com/kubernetes/kube-state-metrics/blob/main/pkg/sharding/listwatch.go. Does the issue happen without sharding?
I see that the K8s issue has recently been updated to say the problem is solved for cases where the client component (e.g. kube-state-metrics) stays running. E.g. if it stays running but fails over to a different API server instance, the problem won't happen (as long as it uses the Informer class). But if the component itself restarts, the problem may still happen. Does that recent update change the seriousness of the stale read problem for kube-state-metrics?
@fpetkovski see kube-state-metrics `pkg/watch/watch.go`, lines 97 to 99 at commit 008bdb1.
I feel like this is a bug.
@JohnRusk I am not sure I understand your point. If
@e-ngo this indeed seems like a bug, good catch.
Could any of you perhaps try to see if the bug still exists after the fix made by @e-ngo? This image contains it: https://console.cloud.google.com/gcr/images/k8s-staging-kube-state-metrics/global/kube-state-metrics@sha256:126f7ef47ac7723b19cc9bc6a3d63c71bcd87888cd4c12e0101684a2eb7ca804/details
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules.
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
I've tried the patch from @e-ngo, and using the query
Thanks for the feedback!
What happened:
Running kube-state-metrics in shards with `--use-apiserver-cache` enabled resulted in constantly growing metric counts unrelated to the actual data.

What you expected to happen:
Correct sharding with reduced apiserver latency.
How to reproduce it (as minimally and precisely as possible):
deployment0:
`--shard=0 --total-shards=2 --use-apiserver-cache`
deployment1:
`--shard=1 --total-shards=2 --use-apiserver-cache`
Query `count by (pod) (kube_pod_info)` in Prometheus.

Anything else we need to know?:
![image](https://user-images.githubusercontent.com/2103030/152810546-a8ac1c7c-20a5-48bb-af6b-38b1c5101ebe.png)
When I removed the flag
--use-apiserver-cache
the counters dropped to good values.

Probably related to #1166.
Environment:
- Kubernetes version (use `kubectl version`): v1.19.13