
[Prometheus]: Remove ComponentExceedsRequestedCPU alert #6977

Conversation

iholder101
Contributor

What this PR does / why we need it:
This PR removes the KubeVirtComponentExceedsRequestedCPU alert.

The runbook for this alert says: "If this alert consistently fires this could mean that the node's CPU resources are not being optimally used and could be overloaded." This statement is a mistake.

In Kubernetes, it's completely normal for a container to use more CPU than is defined in its requests section. Not only that: in order to avoid wasted CPU time, the CPU "leftovers" are distributed among containers in proportion to their CPU requests.

Example: imagine two containers running on a node. Container A requests 0.5 CPUs, container B requests 0.3 CPUs, and the node has 1 CPU (to keep the example simple). In this situation it isn't wise to waste the remaining 20% of CPU time that is left unused. Instead, each container gets a share of that remaining 20% in proportion to its CPU request.
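
To make the arithmetic concrete, here is a minimal sketch of that setup (the pod and image names are invented; only the CPU requests matter):

```yaml
# Two containers with CPU requests but no CPU limits, as in the example above.
# If both are CPU-bound on a 1-CPU node, the CFS scheduler splits CPU time in
# proportion to their requests (requests map to cgroup CPU shares):
#   container-a: 1 * 0.5/0.8 = 0.625 CPUs
#   container-b: 1 * 0.3/0.8 = 0.375 CPUs
# i.e. the spare 0.2 CPUs are shared 5:3, following the requests.
apiVersion: v1
kind: Pod
metadata:
  name: proportional-burst-example
spec:
  containers:
    - name: container-a
      image: registry.example/app-a
      resources:
        requests:
          cpu: 500m   # 0.5 CPUs -> cpu.shares = 512 on cgroup v1
    - name: container-b
      image: registry.example/app-b
      resources:
        requests:
          cpu: 300m   # 0.3 CPUs -> cpu.shares = 307 on cgroup v1
```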

Therefore, an alert should not be triggered in such cases, which are absolutely normal and expected.

Release note:

NONE

@kubevirt-bot added the release-note-none, dco-signoff: yes, size/M, and area/monitoring labels on Dec 20, 2021
@kubevirt-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign rmohr after the PR has been reviewed.
You can assign the PR to them by writing /assign @rmohr in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@iholder101
Contributor Author

/cc @sradco

@iholder101
Contributor Author

/retest
/cc @Barakmor1

@Barakmor1
Member

/lgtm

@kubevirt-bot added the lgtm label on Dec 21, 2021
@xpivarc
Member

xpivarc commented Dec 21, 2021

I may not be aware of the motivation behind this, but I don't think removing this alert is the right way to go.

The part cited in the PR description is the impact. I agree that we can improve here, at least by mentioning the more important issue. The issue I see is the possible suboptimal performance of our components. Furthermore, we should revisit the evaluation time period to filter out sudden spikes.

I think keeping this alert is important, at least for larger-scale clusters. Maybe this is something for sig-scale to look into.

@Barakmor1
Member

I may not be aware of the motivation behind this, but I don't think removing this alert is the right way to go.

The part cited in the PR description is the impact. I agree that we can improve here, at least by mentioning the more important issue. The issue I see is the possible suboptimal performance of our components. Furthermore, we should revisit the evaluation time period to filter out sudden spikes.

I think keeping this alert is important, at least for larger-scale clusters. Maybe this is something for sig-scale to look into.

I think that CPU usage is determined by the system. For instance,
if one container (let's call it container A) requests 50% of the node's CPU, there are two cases to consider:

  • There are other containers that fill the remaining CPU gap (the total CPU requests of all the containers add up to 100% of the node). In that case container A will use 50% of the CPU, and the scheduler will make sure of that (just like processes in an OS).

  • There is free CPU on the node. For instance, if container A runs alone on the node, it will use 100% of the CPU, since it would be a waste of resources to leave the node idle (it doesn't make sense to cap it at 50% of the CPU if no other container will ever use the rest).

Therefore I don't think that a container exceeding its CPU request is something that should trigger an alert.

@iholder101
Contributor Author

I agree with @Barakmor1.
The simplest example is a node with 2 CPUs running one container that requests 1 CPU. Should the second CPU sit idle? That doesn't make sense, so Kubernetes will let the container use both CPUs as long as no other container asks for them. In such a scenario this alert would fire constantly even though everything is completely normal and operating as expected.
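
A minimal sketch of that scenario (the pod name and image are invented):

```yaml
# Hypothetical pod requesting 1 CPU with no CPU limit. On an otherwise idle
# 2-CPU node it may legitimately use both CPUs, because nothing else is
# competing for them; an alert on "usage > request" would then fire constantly
# even though this is exactly the intended behavior.
apiVersion: v1
kind: Pod
metadata:
  name: burst-above-request
spec:
  containers:
    - name: app
      image: registry.example/app
      resources:
        requests:
          cpu: "1"   # guarantees 1 CPU under contention; does not cap usage
```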

@xpivarc Please let me know if you agree. I'd be happy to provide further information.

@iholder101
Contributor Author

After speaking with @xpivarc offline, we both agree that the current state of the alert is completely wrong and needs to be changed. I'm not sure what the intent behind this alert was to begin with, but one possibility is that it aimed to indicate that a CPU "request" amount is set incorrectly. While this alert does not serve that goal in any way, we can perhaps think of other ways to serve it.

(We thought about triggering an alert when two things happen at the same time: a VMI that used to exceed its request stops exceeding it, and performance degrades. This might indicate that the container actually needs more CPU than it requests.)

While I agree that this is an important goal, I don't think we should keep a bad alert simply because we don't have a good one yet. IMHO this alert only makes things worse, not better. It is confusing and fires in perfectly normal situations, the runbook for it is completely wrong, and I can't think of a single situation in which this alert would benefit anyone.

@iholder101
Contributor Author

/cc @vladikr FYI

@kubevirt-bot
Contributor

@iholder-redhat: GitHub didn't allow me to request PR reviews from the following users: FYI.

Note that only kubevirt members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @vladikr FYI

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kubevirt-bot added the needs-rebase label on Dec 22, 2021
@iholder101 force-pushed the prometheus/removeAlertForCpuOverRequested branch from 3ce3628 to d75f123 on December 22, 2021 14:38
@kubevirt-bot removed the lgtm and needs-rebase labels on Dec 22, 2021
@iholder101
Contributor Author

Rebased.
Can you re-lgtm? @Barakmor1

@Barakmor1
Member

/lgtm

@kubevirt-bot added the lgtm label on Dec 22, 2021
@vladikr
Member

vladikr commented Dec 22, 2021

/hold
@iholder-redhat this alert doesn't target VMIs; it is about monitoring the CPU consumption of the KubeVirt components (virt-api, virt-handler, virt-controller, etc.).
I think the intent here is to collect information and be able to adjust our defaults.
I think this is useful to have.

@kubevirt-bot added the do-not-merge/hold label on Dec 22, 2021
@iholder101
Contributor Author

/hold
@iholder-redhat this alert doesn't target VMIs; it is about monitoring the CPU consumption of the KubeVirt components (virt-api, virt-handler, virt-controller, etc.). I think the intent here is to collect information and be able to adjust our defaults. I think this is useful to have.

@vladikr
Can you give me an example of a situation in which this might be helpful?

Again: suppose there's a node without any VMIs, running only virt-controller, virt-api, etc. Also, let the node have a lot of extra CPU (in other words, all of the containers on the node request 50% of the node's total CPU). In this situation all of our components will exceed their requested CPU simply because there is free CPU that the node is willing to spare.

@enp0s3
Contributor

enp0s3 commented Jan 2, 2022

@iholder-redhat Hi. I think we already saw one useful case when we tried to debug creating multiple VMs on a large-scale cluster. In that case the virt-api deployment was highly utilized; as you said, it was given only 5m in its resource requests. If that alert had popped up, we could have spotted the high utilization of the virt-api deployment and increased the number of replicas.

@kubevirt-bot added the needs-rebase label on Jan 2, 2022
@iholder101
Contributor Author

@iholder-redhat Hi. I think we already saw one useful case when we tried to debug creating multiple VMs on a large-scale cluster. In that case the virt-api deployment was highly utilized; as you said, it was given only 5m in its resource requests. If that alert had popped up, we could have spotted the high utilization of the virt-api deployment and increased the number of replicas.

As I've explained in detail in previous comments, there is absolutely no correlation between the pod's workload and the amount of CPU it uses.

The only situation in which a Pod will exceed its CPU request is when the node has spare CPU that would otherwise go unused. Therefore this gives us no indication of a problem and will only confuse the cluster admin, doing more harm than good.

I think we already saw one useful case when we tried to debug creating multiple VMs on a large-scale cluster.

This is confusing, but it's not true. Again, we would have seen this alert whenever the node had enough spare CPU; it doesn't have anything to do with the workload itself.
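
For reference, the comparison such an alert is built on looks roughly like this (illustrative only, not the actual KubeVirt rule; the metric names are the standard cAdvisor / kube-state-metrics ones, and the namespace and container pattern are assumptions):

```yaml
# Sketch of a "CPU usage exceeds CPU request" condition. Whether it is true
# depends mainly on how much spare CPU the node currently has to hand out,
# not on the health or load of the component itself.
expr: |
  sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="kubevirt", container=~"virt-.*"}[5m]))
    >
  sum by (pod) (kube_pod_container_resource_requests{namespace="kubevirt", resource="cpu", container=~"virt-.*"})
```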

Signed-off-by: Itamar Holder <iholder@redhat.com>
Signed-off-by: Itamar Holder <iholder@redhat.com>
@iholder101 force-pushed the prometheus/removeAlertForCpuOverRequested branch from d75f123 to ef1a04d on January 3, 2022 09:19
@kubevirt-bot
Contributor

New changes are detected. LGTM label has been removed.

@kubevirt-bot removed the lgtm and needs-rebase labels on Jan 3, 2022
@iholder101
Contributor Author

/retest

@kubevirt-bot
Contributor

@iholder-redhat: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubevirt-e2e-k8s-1.20-sig-compute-nonroot ef1a04d link true /test pull-kubevirt-e2e-k8s-1.20-sig-compute-nonroot
pull-kubevirt-e2e-k8s-1.21-sig-compute ef1a04d link true /test pull-kubevirt-e2e-k8s-1.21-sig-compute

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@vladikr
Member

vladikr commented Jan 5, 2022

Can you give me an example of a situation in which this might be helpful?

I think this was needed for #6144
For example.

@vladikr
Member

vladikr commented Jan 5, 2022

Again: suppose there's a node without any VMIs, running only virt-controller, virt-api, etc. Also, let the node have a lot of extra CPU (in other words, all of the containers on the node request 50% of the node's total CPU). In this situation all of our components will exceed their requested CPU simply because there is free CPU that the node is willing to spare.

If all components request 50% of the total capacity, then yes, some components will be throttled. The idea is to identify how much CPU time each of the components needs to avoid throttling.
And this is the point of this alert, AFAIU.
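
For example, a query along these lines could show how much CPU each component actually uses over time, to inform the defaults (a sketch only; the metric name is the standard cAdvisor one, and the namespace / container pattern are assumptions):

```yaml
# 95th percentile of each virt component's 5-minute CPU usage rate over a day.
# Comparing this to the current default request is one way to tune the defaults
# without alerting on every normal burst above the request.
expr: |
  quantile_over_time(0.95,
    (sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="kubevirt", container=~"virt-.*"}[5m])))[1d:5m]
  )
```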

@xpivarc
Member

xpivarc commented Jan 5, 2022

After speaking with @xpivarc offline, we both agree that the current state of the alert is completely wrong and needs to be changed. I'm not sure what the intent behind this alert was to begin with, but one possibility is that it aimed to indicate that a CPU "request" amount is set incorrectly. While this alert does not serve that goal in any way, we can perhaps think of other ways to serve it.

(We thought about triggering an alert when two things happen at the same time: a VMI that used to exceed its request stops exceeding it, and performance degrades. This might indicate that the container actually needs more CPU than it requests.)

While I agree that this is an important goal, I don't think we should keep a bad alert simply because we don't have a good one yet. IMHO this alert only makes things worse, not better. It is confusing and fires in perfectly normal situations, the runbook for it is completely wrong, and I can't think of a single situation in which this alert would benefit anyone.

I only remember agreeing that the runbook is wrong. E.g. "Check to see what the cpu resource limit is." should talk about requests.
The impact is also not the most important part; rather, we should mention the performance degradation of components.

I am definitely against removing the alert, at least not without a replacement.

Again: suppose there's a node without any VMIs, running only virt-controller, virt-api, etc. Also, let the node have a lot of extra CPU (in other words, all of the containers on the node request 50% of the node's total CPU). In this situation all of our components will exceed their requested CPU simply because there is free CPU that the node is willing to spare.

Let's say our component needs 1.5 CPUs in a particular cluster and we request only 1. In this case, we are not guaranteed to get the additional 0.5 CPU for our component, and performance can degrade.

It is completely normal for a Pod to use more CPU than requested, but in this case we don't want that to happen to our components for long (mostly indefinite) periods of time. Therefore I suggested raising the evaluation time.
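
A rough sketch of what I mean (the expression below is only a placeholder built from the standard cAdvisor / kube-state-metrics series, not the rule as currently defined in KubeVirt; the point is the longer for: duration):

```yaml
# Sketch only: the condition must hold continuously for a long period before
# the alert fires, which filters out the normal short bursts above the request.
# Namespace and container pattern are assumptions.
- alert: KubeVirtComponentExceedsRequestedCPU
  expr: |
    sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="kubevirt", container=~"virt-.*"}[5m]))
      >
    sum by (pod) (kube_pod_container_resource_requests{namespace="kubevirt", resource="cpu", container=~"virt-.*"})
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: A KubeVirt component has been using more CPU than it requests for a sustained period.
```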

@kubevirt-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@kubevirt-bot added the lifecycle/stale label on Apr 5, 2022
@iholder101
Contributor Author

/close

@kubevirt-bot
Contributor

@iholder-redhat: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@iholder101
Contributor Author

Evidence for why this alert is misleading: #6144.

It's obvious that this alert fires when the host is under pressure, and that is completely normal, but it makes it look as if something is wrong. I still think it either has to be removed, or renamed and explained properly.
