[Prometheus]: Remove ComponentExceedsRequestedCPU alert #6977
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/cc @sradco
/retest
/lgtm
I may not be aware of the motivation behind this, but I don't think removing this alert is the right way to go. The part cited in the PR description concerns the impact, and I agree we can improve there, at least by mentioning the more important issue. The issue I am seeing is the possible suboptimal performance of our components. Furthermore, we should evaluate the time period to eliminate sudden spikes. I think keeping this alert is important, at least for clusters at bigger scales. Maybe this is something for sig-scale to look into.
I think that the CPU usage is determined by the system; therefore I don't think that a container exceeding its CPU request is something that should trigger an alert.
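For context on "determined by the system": under cgroup v1, Kubernetes turns a container's CPU request into a relative weight (cpu.shares), roughly milliCPU * 1024 / 1000. The weight only matters when the node's CPU is contended; it is not a usage cap. The following is a simplified, illustrative sketch of that conversion, not the actual kubelet code:

```python
def milli_cpu_to_shares(milli_cpu, min_shares=2, shares_per_cpu=1024, milli_cpu_to_cpu=1000):
    """Approximate how a CPU request (in millicores) maps to a cgroup v1
    cpu.shares weight. The weight is only a *relative* priority used when
    the node's CPU is contended - it does not cap usage."""
    if milli_cpu == 0:
        return min_shares
    shares = (milli_cpu * shares_per_cpu) // milli_cpu_to_cpu
    return max(shares, min_shares)

# A request is a contention weight, not a ceiling:
print(milli_cpu_to_shares(500))  # 512
print(milli_cpu_to_shares(300))  # 307
```

This is why exceeding a request is unremarkable: when the node has idle CPU, the scheduler simply hands it out, regardless of what was requested.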
I agree with @Barakmor1. @xpivarc Please let me know if you agree. I'd be happy to provide further information.
After speaking with @xpivarc offline, we both agree that the current state of the alert is completely wrong and needs to be changed. I'm not sure what the intent behind this alert was to begin with, but one possibility is that it aimed to give an indication of having a wrong CPU "request" amount. While this alert does not serve that goal in any way, we can maybe think of other ways of serving it. (We thought to trigger an alert when two things happen at the same time: a VMI that exceeds its request stops exceeding it, and performance degrades. This might indicate that the actual CPU needed by the container is higher than requested.) While I agree that this is an important goal, I don't think we should keep a bad alert simply because we don't have a good one yet. IMHO this alert only makes things worse, not better. It is confusing and fires in perfectly normal situations, the runbook for it is completely wrong, and I can't think of a single situation in which this alert would benefit anyone.
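The compound condition floated in the comment above could be sketched as follows (hypothetical helper and signal names, purely illustrative; how "performance degraded" is measured is left open):

```python
def should_alert(was_exceeding, is_exceeding, performance_degraded):
    """Fire only when a component that had been borrowing CPU above its
    request falls back to (or below) the request *and* performance degrades
    at the same time - suggesting the node stopped lending spare CPU that
    the component actually needed, i.e. the request may be too low."""
    return was_exceeding and not is_exceeding and performance_degraded

print(should_alert(True, False, True))   # True: request likely too low
print(should_alert(True, True, False))   # False: just borrowing spare CPU
```

Unlike the current alert, this condition stays silent in the normal case where a component merely uses idle CPU the node is willing to spare.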
/cc @vladikr FYI
@iholder-redhat: GitHub didn't allow me to request PR reviews from the following users: FYI. Note that only kubevirt members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from 3ce3628 to d75f123.
Rebased.
/lgtm
/hold
@vladikr Again: suppose there's a node without any VMIs, running only virt-controller, virt-api, etc., and the node has a lot of spare CPU (in other words, all of the containers on the node together request 50% of the node's total CPU). In this situation all of our components will exceed the CPU they requested, simply because there is free CPU that the node is willing to spare.
@iholder-redhat Hi. I think one useful case is already seen by us when we tried to debug multiple VM creation on a large scale cluster. In that case the virt-api deployment was highly utilized, as you said it was given only |
As I've explained in detail in previous comments, there is absolutely no correlation between the pod's workload and the amount of CPU it uses. The only situation in which the Pod will exceed its CPU request is when the node has spare CPU that it does not otherwise use. Therefore this gives us no indication of a problem and will only confuse the cluster admin, doing more harm than good.
This is confusing, but it's not true. Again, we would have seen this alert if the node had enough spare CPU; it doesn't have anything to do with the workload itself.
Signed-off-by: Itamar Holder <iholder@redhat.com>
Force-pushed from d75f123 to ef1a04d.
New changes are detected. LGTM label has been removed.
/retest |
@iholder-redhat: The following tests failed, say
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
I think this was needed for #6144.
If all components request 50% of the total capacity, then yes, some components will be throttled. The idea is to identify how much CPU time each of the components needs in order to avoid throttling.
I only remember agreeing that the runbook is wrong. E.g. I am definitely against removing the alert, at least without any replacement.
Let's say our component needs 1.5 CPUs in a particular cluster and we request only 1. In this case we are not guaranteed to get the additional 0.5 CPU for our component, and performance can degrade. It is completely normal for a Pod to use more CPU than requested, but in this case we don't want it to happen to our components for a long (mostly indefinite) time. Therefore I suggested raising the evaluation time.
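The longer-evaluation-window idea above can be sketched numerically (hypothetical helper, illustrative only, not the actual Prometheus rule): the alert should fire only when usage stays above the request for the entire trailing window, so isolated spikes never trigger it:

```python
def sustained_exceedance(samples, request, window):
    """Return True only if *every* sample in the trailing `window`
    exceeds `request` - a brief burst alone never triggers."""
    if len(samples) < window:
        return False
    return all(s > request for s in samples[-window:])

request = 1.0  # CPUs requested by the component (assumed value)
spiky     = [0.6, 0.7, 1.8, 0.5, 0.6, 0.7]  # one burst: no alert
sustained = [1.4, 1.5, 1.6, 1.5, 1.6, 1.7]  # constant pressure: alert

print(sustained_exceedance(spiky, request, window=5))      # False
print(sustained_exceedance(sustained, request, window=5))  # True
```

In Prometheus terms this corresponds to lengthening the rule's evaluation duration (the `for:` clause), which is exactly the "raise the evaluation time" suggestion.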
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so with /lifecycle stale
/close
@iholder-redhat: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Evidence for why this alert is misleading: #6144. It's obvious that this alert fires when the host is under pressure, and that's completely normal, but it makes it look like something is wrong. I still think it either has to be removed, or renamed and explained properly.
What this PR does / why we need it:
This PR removes the KubeVirtComponentExceedsRequestedCPU alert.

The runbook for this alert says: "If this alert consistently fires this could mean that the node's CPU resources are not being optimally used and could be overloaded". This is a mistake. In Kubernetes, it's completely normal to use more vCPU than whatever is defined in the request section. Not only that: in order to avoid wasted CPU time, the CPU "leftovers" are distributed between containers proportionally to their CPU request amounts.

Example: imagine two containers running on a node with 1 CPU (for the simplicity of the example). Container A requests 0.5 CPUs and container B requests 0.3 CPUs. In this situation it isn't wise to waste the remaining 20% of CPU time that is left unused; instead, each container gets a share of that remaining 20% proportional to its CPU request.

Therefore, an alert does not need to be triggered in such cases, which are absolutely normal and expected.
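The proportional sharing in the example above can be checked with a little arithmetic (a simplified model of weight-based CPU sharing; the helper below is illustrative, not Kubernetes code, and assumes every container wants all the CPU it can get and has no limit set):

```python
def distribute_cpu(node_cpus, requests):
    """Model how leftover node CPU is shared out in proportion to each
    container's request when all containers are CPU-hungry and unlimited."""
    total_requested = sum(requests.values())
    leftover = node_cpus - total_requested  # idle CPU the node can lend
    return {
        name: req + leftover * (req / total_requested)
        for name, req in requests.items()
    }

usage = distribute_cpu(1.0, {"A": 0.5, "B": 0.3})
# Both containers end up above their requests, yet nothing is wrong:
# A: 0.5 + 0.2 * (0.5 / 0.8) = 0.625 CPUs
# B: 0.3 + 0.2 * (0.3 / 0.8) = 0.375 CPUs
```

Both containers "exceed" their requests in the steady state, so an alert keyed purely on usage > request would fire constantly on a healthy node.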
Release note: