[receiver/kubeletstat] Review cpu.utilization naming #27885
Did some digging: per the Kubernetes docs, the kubelet gets these metrics from the CRI, and if the CRI doesn't have the stats it computes them itself from the cumulative `UsageCoreNanoSeconds` counter, roughly as follows.
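A reconstructed sketch of the kubelet's fallback computation (paraphrased from the CRI stats provider in kubernetes/kubernetes; not the verbatim source):

```
usageNanoCores = (currUsageCoreNanoSeconds - prevUsageCoreNanoSeconds)
                 / (currTimestampNs - prevTimestampNs) * 1e9
```

Where:

- `usageCoreNanoSeconds` is the cumulative CPU time consumed by the workload, in nanoseconds
- the timestamps are the sample times of the two consecutive stats readings, in nanoseconds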
🤔 Playing a bit with the formula: the limit is the total available CPU time. Let's say we collect every 1 second, and the app uses all of the available CPU time, i.e. 1 second.
Based on this example, the result is an actual usage of 1,000,000,000 nanoseconds, i.e. 1 second. So this metric's unit seems to be nanoseconds, not a percentage. If my calculations are correct, I think we should rename it to `*.cpu.usage`.
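A runnable check of the arithmetic in that example (the numbers mirror the comment above):

```go
package main

import "fmt"

func main() {
	// The app consumed all available CPU time in a 1-second window:
	// the cumulative counter grew by 1e9 ns while 1e9 ns of wall time passed.
	deltaUsageNs := 1_000_000_000.0 // growth of usageCoreNanoSeconds
	windowNs := 1_000_000_000.0     // time between the two samples

	usageNanoCores := deltaUsageNs / windowNs * 1e9
	fmt.Printf("usageNanoCores = %.0f\n", usageNanoCores) // 1000000000, i.e. one full core, not a percentage
}
```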
@povilasv thank you!
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping `@open-telemetry/collector-contrib-triagers`. Pinging code owners: […]

See Adding Labels via Comments if you do not have permissions to add labels yourself.
…elemetry#25901)

**Description:** Starts the name change process for `*.cpu.utilization` metrics.

**Link to tracking Issue:** Related to open-telemetry#24905. Related to open-telemetry#27885.
FYI @TylerHelmuth @povilasv: in SemConv we have merged open-telemetry/semantic-conventions#282, which adds the container metrics. For the CPU ones: do we have a summary so far of what is missing from the receiver? Shall we try to adopt the SemConv naming? At the moment the implementation of the receiver provides the following: […]
Are we planning to keep them all? From https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/25901/files#diff-3343de7bfda986546ce7cb166e641ae88c0b0aecadd016cb253cd5a0463ff464R352-R353 I see we are going to remove/deprecate them (see `receiver/kubeletstatsreceiver/internal/kubelet/metadata.go`, line 71 at `80bbf5e`).
@ChrsMark in my opinion, yes to all questions. We want to be aligned with the spec (although I'd love to reduce the number of iterations the receivers go through to gain that alignment; how long till we're stable lol). I don't have a lot of time to dedicate to getting kubeletstatsreceiver up to date with the non-stable spec. At this point I was planning to wait for things to stabilize before making any more changes besides the work we started in this issue.
Thanks @TylerHelmuth, I see the point of not chasing after an unstable schema/spec. Just to clarify regarding the `*.cpu.utilization` metrics: […]
@ChrsMark yes, I'd be fine with keeping the metric if we can calculate it correctly. We'd still need to go through some sort of feature-gate process to make it clear to users that the metric has changed, and that if they want the old value they need to use the new `*.cpu.usage` metric.
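For context, collector feature gates are toggled via the `--feature-gates` flag; a sketch of such an invocation, with a placeholder gate name (not necessarily the identifier this change actually shipped with):

```console
$ otelcol-contrib --config config.yaml \
    --feature-gates=receiver.kubeletstats.useCPUUsageName
```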
@TylerHelmuth @povilasv I have drafted a patch to illustrate the point at #32295. My findings look promising :). If we agree on the idea, I can move the PR forward to fix the details and open it for review. Let me know what you think.
Seems reasonable. @jinja2 please take a look.
This looks reasonable to me as well. I would add the informer too, so we don't call the API to get the Node every time we scrape data; in practice a Node's CPU capacity doesn't change.
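A minimal sketch of that idea using client-go's shared informer, assuming the receiver knows its node name (the node name and kubeconfig path below are illustrative, not the receiver's actual code):

```go
package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (path is illustrative).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	nodeName := "kind-control-plane" // the node this receiver instance scrapes

	// Watch only the single node we care about; the informer's cache then
	// serves every lookup, so scrapes don't hit the API server.
	factory := informers.NewSharedInformerFactoryWithOptions(
		client,
		10*time.Minute, // resync period
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.FieldSelector = "metadata.name=" + nodeName
		}),
	)
	nodeInformer := factory.Core().V1().Nodes()
	informer := nodeInformer.Informer()

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)
	cache.WaitForCacheSync(stopCh, informer.HasSynced)

	// At scrape time, read allocatable CPU from the cache instead of the API.
	node, err := nodeInformer.Lister().Get(nodeName)
	if err != nil {
		panic(err)
	}
	cpu := node.Status.Allocatable.Cpu()
	fmt.Printf("allocatable CPU cores: %v\n", cpu.AsApproximateFloat64())
}
```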
Thanks for the feedback, folks!
Hey folks, sorry, I'm late to the party. I am not convinced about repurposing the existing `*.cpu.utilization` metric: […]

I think something like […] would be more appropriate. Even if we decide to repurpose `*.cpu.utilization`, we would need a proper deprecation plan first.
Thanks @dmitryax. I agree with making the metric more specific here. If others agree, I can change the PR accordingly to: […]

Extra: I also wonder whether it would make sense to change […]. @dmitryax @TylerHelmuth @povilasv let me know what you think.
I also agree with @dmitryax's points. I guess computing against node limits has problems and is unfair if node limits > container limits, etc. So we need to do a proper deprecation. Regarding […]:
I still find this useful in order to be able to compare how much CPU a container/pod uses against the Node's capacity. This helps you understand whether a Pod/container is a genuinely problematic workload or not: not against its own limit, but against the Node's capacity. For example, you can see 96% […]
What does "node_limit" mean here? Is it referring to the host's capacity or the node allocatable (capacity - system/kube reserved)? I find node_limit to be ambiguous, but my assumption would be that it refers to allocatable, since that's the amount of resources actually available for pod scheduling. Do you think a user might want to select whether the utilization is against the capacity or the allocatable?
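For reference, the two quantities can be compared directly on any node; the node name and outputs below are illustrative of a cluster that reserves some CPU for system daemons:

```console
$ kubectl get node <node-name> -o jsonpath='{.status.capacity.cpu}'
8
$ kubectl get node <node-name> -o jsonpath='{.status.allocatable.cpu}'
7910m
```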
I am good with this. I care mainly about the switch from `*.cpu.utilization` to `*.cpu.usage`.
I would go with the […]. I will change the PR for now to use the […].
…ic (#32295)

**Description:**

At the moment we calculate `k8s.container.cpu_limit_utilization` as [a ratio of the container's limits](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/kubeletstatsreceiver/documentation.md#k8scontainercpu_limit_utilization) at https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/867d6700c31446172e6998e602c55fbf7351831f/receiver/kubeletstatsreceiver/internal/kubelet/cpu.go#L30.

Similarly, we can calculate the CPU utilization as a ratio of the whole node's allocatable CPU if we divide by the node's total number of cores. We can retrieve this information from the Node's `Status.Capacity`, for example:

```console
$ k get nodes kind-control-plane -ojsonpath='{.status.capacity}'
{"cpu":"8","ephemeral-storage":"485961008Ki","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"32564732Ki","pods":"110"}
```

## Performance concerns

In order to get the Node's capacity we need a call to the k8s API to fetch the Node object. Something to consider here is the performance impact that this extra API call would bring. We can always choose to have this metric disabled by default and clearly state in the docs that it comes with an extra API call to get the Node of the Pods.

The good thing is that the `kubeletstats` receiver targets only one node, so I believe it's a safe assumption to fetch only the current node, since all the observed Pods will belong to that single local node. Correct me if I'm missing anything here.

In addition, instead of performing the API call explicitly on every single `scrape`, we can use an informer and leverage its cache. I can change this patch in that direction if we agree on it. Would love to hear others' opinions on this.

## Todos

✅ 1) Apply this change behind a feature gate, as was indicated at #27885 (comment)
✅ 2) Use an Informer instead of direct API calls.

**Link to tracking Issue:** ref: #27885

**Testing:**

I experimented with this approach and the results look correct. In order to verify it I deployed a stress Pod on my machine to consume a target CPU of 4 cores:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cpu-stress
spec:
  containers:
  - name: cpu-stress
    image: polinux/stress
    command: ["stress"]
    args: ["-c", "4"]
```

The collected `container.cpu.utilization` for that Pod's container was then `0.5`, as expected, since my machine's node has 8 cores in total:

![cpu-stress](https://github.com/open-telemetry/opentelemetry-collector-contrib/assets/11754898/3abe4a0d-6c99-4b4e-a704-da5789dde01b)

A unit test is also included.

**Documentation:** Added: https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/32295/files#diff-8ad3b506fb1132c961e8da99b677abd31f0108e3f9ed6999dd96ad3297b51e08

---------

Signed-off-by: ChrsMark <chrismarkou92@gmail.com>
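To make the arithmetic concrete, a minimal sketch of the ratio described above (function and variable names are mine, not the receiver's actual code):

```go
package main

import "fmt"

// nodeUtilization converts the kubelet's usageNanoCores reading into a
// ratio of the node's total CPU: nanocores used / (cores * 1e9 nanocores per core).
func nodeUtilization(usageNanoCores uint64, nodeCPUCores float64) float64 {
	return float64(usageNanoCores) / (nodeCPUCores * 1e9)
}

func main() {
	// The stress Pod above burns 4 full cores on an 8-core node.
	fmt.Println(nodeUtilization(4_000_000_000, 8)) // 0.5
}
```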
**Description:** This PR adds the `k8s.pod.cpu.node.utilization` metric. Follow-up from #32295 (comment) (cc @TylerHelmuth).

**Link to tracking Issue:** Related to #27885.

**Testing:** Adjusted the respective unit test to cover this metric as well.

**Documentation:** Added. Tested with a single-container Pod:

![podCpu](https://github.com/open-telemetry/opentelemetry-collector-contrib/assets/11754898/9a0069c2-7077-4944-93b6-2dde00979bf3)

---------

Signed-off-by: ChrsMark <chrismarkou92@gmail.com>
Co-authored-by: Tiffany Hrabusa <30397949+tiffany76@users.noreply.github.com>
Co-authored-by: Tyler Helmuth <12352919+TylerHelmuth@users.noreply.github.com>
Component(s)
receiver/kubeletstats
Is your feature request related to a problem? Please describe.
The Kubeletstats receiver currently uses `*.cpu.utilization` as the name for CPU metrics that report the CPUStats `UsageNanoCores` value. I believe that `UsageNanoCores` reports the actual amount of CPU being used, not the ratio of the amount being used out of a total limit. If this is true, then our use of `utilization` is not meeting semantic convention expectations.

I would like to have a discussion about what exactly `UsageNanoCores` represents and whether our metric naming needs updating.

Related to the discussion that started in #24905.