Add an alert for excessive migrations on a VMI #7757
/cc @sradco
Force-pushed from 053130c to 2f11420.
Runbook PR:
Force-pushed from 2f11420 to 4bbd7cf.
/test pull-kubevirt-unit-test
Force-pushed from 4bbd7cf to 66a5bdc.
Hi @Barakmor1,
the expression sums the migrations per VMI, regardless of other labels, and then calculates the increase of that sum over a 24-hour window at a resolution of 1 minute.
Force-pushed from cf8690a to 68b3e68.
If a VMI has been successfully migrated more than 12 times over a period of 24 hours, which is considerably higher than normal operation including an upgrade, an alert with a severity of warning is fired.

Note: A new VMI that hasn't been migrated yet has no datapoints for its `kubevirt_migrate_vmi_succeeded_total` metric. When it is first migrated, the metric changes to 1, and the `increase()` function doesn't count this change as an increase in the total migrations for that VMI. Therefore, for a new VMI the alert fires after 13 migrations in the specified time period (24 hours). If the VMI has been migrated at least once in the past, 12 migrations in 24 hours cause the alert to fire.

Also, fixing some typos.

Signed-off-by: orenc1 <ocohen@redhat.com>
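The 12-versus-13 difference described in the note can be checked with a small sketch. It is a simplified model, not Prometheus itself: `raw_increase` is a hypothetical helper that only sums the differences between consecutive existing samples, and it ignores the window-edge extrapolation that the real `increase()` performs. The `>= THRESHOLD` comparison is an assumption taken from the wording of the note.

```python
from typing import Optional, Sequence

THRESHOLD = 12  # assumed comparison value, taken from the commit message


def raw_increase(samples: Sequence[Optional[float]]) -> float:
    """Sum the counter increases between consecutive existing samples.

    None marks a point in time with no datapoint. The first existing
    sample has no predecessor, so it contributes nothing. Simplification:
    no extrapolation to the window boundaries is applied.
    """
    total = 0.0
    prev = None
    for value in samples:
        if value is None:
            continue
        if prev is not None:
            # A drop is treated as a counter reset: count the new value from zero.
            total += value - prev if value >= prev else value
        prev = value
    return total


# New VMI: no datapoint at first, then the counter appears at 1 and climbs.
new_vmi_12 = [None] + list(range(1, 13))  # 12 migrations
new_vmi_13 = [None] + list(range(1, 14))  # 13 migrations
# VMI migrated before the window: the counter already has a datapoint (5).
old_vmi_12 = [5] + list(range(6, 18))     # 12 further migrations

for name, series in [("new, 12 migrations", new_vmi_12),
                     ("new, 13 migrations", new_vmi_13),
                     ("existing, 12 migrations", old_vmi_12)]:
    inc = raw_increase(series)
    print(f"{name}: increase={inc}, fires={inc >= THRESHOLD}")
```

Under these assumptions the new VMI reaches the threshold only on its 13th migration, while a VMI with an earlier datapoint reaches it on the 12th.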
Force-pushed from 68b3e68 to 345650a.
Thanks!
input_series:
  - series: 'kubevirt_migrate_vmi_succeeded_total{vmi="vmi-example-1", source="node-1", target="node-2"}'
    # time:  0 1 2 3 4 5 6 7 8 9 10 11 12 13
    values: "_ 1 2 2 3 _ 1 2 2 _ _ 1 2 3+0x13" # 7 increases, in samples: 2,4,6,7,11,12,13
@orenc1 It's 8 increases in total, no?
Also, why do you consider the first value as time 0? I think it's the value of the first hour.
That is an important detail and this is why the unit testing can become biased.
No, because the first one doesn't count (the change from _ to 1). But it does count on the second series, since there is already a datapoint for kubevirt_migrate_vmi_succeeded_total{vmi="vmi-example-1"}.
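To make the count in the test comment easy to verify, here is a minimal sketch that walks the quoted `values` string and lists the sample indices that contribute an increase. `counted_samples` is a hypothetical helper, not part of promtool. It assumes every `_` is a truly missing datapoint, treats a drop as a counter reset, and ignores lookback and extrapolation. The trailing `+0x13` is left out because it only repeats the value 3 and adds no further increase.

```python
from typing import List


def counted_samples(values: str) -> List[int]:
    """Return the indices of samples that add to the counter's increase.

    `values` uses the promtool convention that "_" means "no sample".
    The first existing sample is never counted, because there is no
    earlier datapoint to compare it with.
    """
    counted = []
    prev = None
    for index, token in enumerate(values.split()):
        if token == "_":
            continue
        current = float(token)
        if prev is not None:
            # A drop is treated as a reset, so the new value counts from zero.
            step = current - prev if current >= prev else current
            if step > 0:
                counted.append(index)
        prev = current
    return counted


# Samples 0..13 of the first input series from the unit test.
series = "_ 1 2 2 3 _ 1 2 2 _ _ 1 2 3"
indices = counted_samples(series)
print(indices, len(indices))
```

With these assumptions the sketch reports samples 2, 4, 6, 7, 11, 12 and 13, which gives 7 increases. The change from `_` to 1 at sample 1 is the one that is skipped.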
@orenc1 @sradco @assafad Hi. IIUC our policy for adding alerts is that a runbook should first be created for the alert in the monitoring repo, otherwise the PR will fail the monitoring lane. Either I'm not accessing the KubeVirtVMIExcessiveMigrations runbook correctly or it doesn't exist. In the latter case, let's first add the runbook and then continue with this PR.
Waiting for the runbook PR to be merged.
Hi @enp0s3, the runbook PR has been merged.
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: enp0s3
/retest-required
If a VMI has been successfully migrated more than 12 times over a period of 24 hours, which is considerably higher than normal operation including an upgrade, an alert with a severity of warning is fired.
Also, fixing some typos.
Signed-off-by: orenc1 <ocohen@redhat.com>
What this PR does / why we need it:
An excessive number of migrations on a single VMI over a period of time can suggest issues related to cluster configuration or available resources. This PR fires an alert when excessive migrations are found on a VMI.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #
Special notes for your reviewer:
Release note: