Add an alert for excessive migrations on a VMI #7757

orenc1 · 2022-05-17T11:11:30Z

If a VMI has been successfully migrated more than 12 times over a period of 24 hours, which is considerably higher than normal operation including an upgrade, an alert with a severity of warning is being fired.

Also, fixing some typos.

Signed-off-by: orenc1 ocohen@redhat.com

What this PR does / why we need it:
Excessive number of migrations on a single VMI over a period of time can suggest issues related to cluster configuration or available resources. Firing an alert if excessive migrations have been found on a VMI.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Release note:

new alert for excessive number of VMI migrations in a period of time.

orenc1 · 2022-05-17T11:11:51Z

/cc @sradco

hack/prom-rule-ci/prom-rules-tests.yaml

orenc1 · 2022-05-17T12:59:02Z

Runbook PR:
kubevirt/monitoring#85

orenc1 · 2022-05-17T20:52:13Z

/test pull-kubevirt-unit-test

Barakmor1 · 2022-05-19T10:24:53Z

Hey Oren,
Thanks for the PR.
I tried to use the new alert but i think that it doesn't work as expected:
i created a vm and migrated it two times , these are the results that i got from the new query:

Note that the result of sum by(vmi) (increase(kubevirt_migrate_vmi_succeeded_total[1d])) is 0
maybe you get the correct number after 24h .
Please let me know if i missed something.

orenc1 · 2022-05-23T12:10:35Z

Hi @Barakmor1 ,
thanks for noticing.
the problem was that the query treated to migrations of the same VMI, but with different source/target as separate results, thus the migrations on a specific VMI were not counted correctly.
This is the fixed expression I'm using now:

floor(increase(sum by (vmi) (kubevirt_migrate_vmi_succeeded_total)[1d:1m])) >= 12

it is summing up migrations per VMI, regardless of other labels, then calculates their increase over a period of time (24 hours) with a resolution of 1 minute.

If a VMI has been successfully migrated more than 12 times over a period of 24 hours, which is considerably higher than normal operation including an upgrade, an alert with a severity of warning is being fired. Note: A new VMI that hasn't been migrated yet has no datapoints for its `kubevirt_migrate_vmi_succeeded_total` metric. When it is first migrated, the metric changes to 1, and the `increase()` function doesn't count this change as an increase in the total migrations for that VMI. Therefore, for a new VMI the alert is being fired after 13 migrations in the specified time period (24 hours). If the VMI has been migrated at least once in the past, 12 migrations in 24 hours causes the alert to be fired. Also, fixing some typos. Signed-off-by: orenc1 <ocohen@redhat.com>

Barakmor1 · 2022-05-26T07:00:47Z

Thanks!
/lgtm

sradco · 2022-05-26T07:05:20Z

hack/prom-rule-ci/prom-rules-tests.yaml

+    input_series:
+      - series: 'kubevirt_migrate_vmi_succeeded_total{vmi="vmi-example-1", source="node-1", target="node-2"}'
+        # time:  0 1 2 3 4 5 6 7 8 9 10 11 12 13
+        values: "_ 1 2 2 3 _ 1 2 2 _ _  1  2  3+0x13"        # 7 increases, in samples: 2,4,6,7,11,12,13


@orenc1 Its 8 increases in total no?

Also, why do you consider the first value as 0 time? I think its the value of the first hour.

That is an important detail and this is why the unit testing can become biased.

no, because the first one doesn't count (change from _ to 1).
but it does count on the second series, since there is already a datapoint for kubevirt_migrate_vmi_succeeded_total{vmi="vmi-example-1"}

enp0s3 · 2022-05-29T12:40:41Z

@orenc1 @sradco @assafad Hi. IIUC our policy regarding alerts addition is that a runbook should be first created for it in the monitoring repo, otherwise the PR will fail the monitoring lane. Either I don't access the KubeVirtVMIExcessiveMigrations correctly or it doesn't exist. For the latter case lets first add the runbook and then continue with this PR.

enp0s3 · 2022-05-29T12:54:05Z

Waiting for the runbook PR to be merged

orenc1 · 2022-05-30T12:54:01Z

hi @enp0s3 , the runbook PR has been merged.

enp0s3 · 2022-05-31T10:50:56Z

/approve

kubevirt-bot · 2022-05-31T10:51:06Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enp0s3

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [enp0s3]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

orenc1 · 2022-05-31T11:48:02Z

/retest-required

kubevirt-commenter-bot · 2022-05-31T18:21:55Z

/retest-required
This bot automatically retries required jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

kubevirt-commenter-bot · 2022-06-01T01:21:49Z

/retest-required
This bot automatically retries required jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. size/M labels May 17, 2022

kubevirt-bot requested review from acardace and mhenriks May 17, 2022 11:11

kubevirt-bot added the area/monitoring label May 17, 2022

kubevirt-bot requested a review from sradco May 17, 2022 11:11

sradco reviewed May 17, 2022

View reviewed changes

hack/prom-rule-ci/prom-rules-tests.yaml Outdated Show resolved Hide resolved

sradco reviewed May 17, 2022

View reviewed changes

hack/prom-rule-ci/prom-rules-tests.yaml Show resolved Hide resolved

orenc1 force-pushed the alert_excessive_vmi_migrations branch from 053130c to 2f11420 Compare May 17, 2022 12:54

orenc1 force-pushed the alert_excessive_vmi_migrations branch from 2f11420 to 4bbd7cf Compare May 17, 2022 14:11

orenc1 force-pushed the alert_excessive_vmi_migrations branch from 4bbd7cf to 66a5bdc Compare May 23, 2022 12:07

orenc1 force-pushed the alert_excessive_vmi_migrations branch 2 times, most recently from cf8690a to 68b3e68 Compare May 26, 2022 06:41

orenc1 force-pushed the alert_excessive_vmi_migrations branch from 68b3e68 to 345650a Compare May 26, 2022 06:42

kubevirt-bot assigned Barakmor1 May 26, 2022

kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label May 26, 2022

sradco reviewed May 26, 2022

View reviewed changes

kubevirt-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 31, 2022

kubevirt-bot merged commit 912acbe into kubevirt:main Jun 1, 2022

orenc1 deleted the alert_excessive_vmi_migrations branch June 1, 2022 09:48

enp0s3 mentioned this pull request Sep 16, 2022

Use collector to set migration metrics #8368

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add an alert for excessive migrations on a VMI #7757

Add an alert for excessive migrations on a VMI #7757

orenc1 commented May 17, 2022

orenc1 commented May 17, 2022

orenc1 commented May 17, 2022

orenc1 commented May 17, 2022

Barakmor1 commented May 19, 2022 •

edited

orenc1 commented May 23, 2022 •

edited

Barakmor1 commented May 26, 2022

sradco May 26, 2022

sradco May 26, 2022

sradco May 26, 2022 •

edited

orenc1 May 26, 2022

enp0s3 commented May 29, 2022

enp0s3 commented May 29, 2022

orenc1 commented May 30, 2022

enp0s3 commented May 31, 2022

kubevirt-bot commented May 31, 2022

orenc1 commented May 31, 2022

kubevirt-commenter-bot commented May 31, 2022

kubevirt-commenter-bot commented Jun 1, 2022

Add an alert for excessive migrations on a VMI #7757

Add an alert for excessive migrations on a VMI #7757

Conversation

orenc1 commented May 17, 2022

orenc1 commented May 17, 2022

orenc1 commented May 17, 2022

orenc1 commented May 17, 2022

Barakmor1 commented May 19, 2022 • edited

orenc1 commented May 23, 2022 • edited

Barakmor1 commented May 26, 2022

sradco May 26, 2022

Choose a reason for hiding this comment

sradco May 26, 2022

Choose a reason for hiding this comment

sradco May 26, 2022 • edited

Choose a reason for hiding this comment

orenc1 May 26, 2022

Choose a reason for hiding this comment

enp0s3 commented May 29, 2022

enp0s3 commented May 29, 2022

orenc1 commented May 30, 2022

enp0s3 commented May 31, 2022

kubevirt-bot commented May 31, 2022

orenc1 commented May 31, 2022

kubevirt-commenter-bot commented May 31, 2022

kubevirt-commenter-bot commented Jun 1, 2022

Barakmor1 commented May 19, 2022 •

edited

orenc1 commented May 23, 2022 •

edited

sradco May 26, 2022 •

edited