Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an alert for excessive migrations on a VMI #7757

Merged
merged 1 commit into from Jun 1, 2022

Conversation

orenc1
Copy link
Contributor

@orenc1 orenc1 commented May 17, 2022

If a VMI has been successfully migrated more than 12 times over a period of 24 hours, which is considerably higher than normal operation including an upgrade, an alert with a severity of warning is being fired.

Also, fixing some typos.

Signed-off-by: orenc1 ocohen@redhat.com

What this PR does / why we need it:
Excessive number of migrations on a single VMI over a period of time can suggest issues related to cluster configuration or available resources. Firing an alert if excessive migrations have been found on a VMI.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Release note:

new alert for excessive number of VMI migrations in a period of time.

@kubevirt-bot kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. size/M labels May 17, 2022
@orenc1
Copy link
Contributor Author

orenc1 commented May 17, 2022

/cc @sradco

@kubevirt-bot kubevirt-bot requested a review from sradco May 17, 2022 11:11
@orenc1 orenc1 force-pushed the alert_excessive_vmi_migrations branch from 053130c to 2f11420 Compare May 17, 2022 12:54
@orenc1
Copy link
Contributor Author

orenc1 commented May 17, 2022

Runbook PR:
kubevirt/monitoring#85

@orenc1 orenc1 force-pushed the alert_excessive_vmi_migrations branch from 2f11420 to 4bbd7cf Compare May 17, 2022 14:11
@orenc1
Copy link
Contributor Author

orenc1 commented May 17, 2022

/test pull-kubevirt-unit-test

@Barakmor1
Copy link
Member

Barakmor1 commented May 19, 2022

Hey Oren,
Thanks for the PR.
I tried to use the new alert but i think that it doesn't work as expected:
i created a vm and migrated it two times , these are the results that i got from the new query:
image
Note that the result of sum by(vmi) (increase(kubevirt_migrate_vmi_succeeded_total[1d])) is 0
maybe you get the correct number after 24h .
Please let me know if i missed something.

@orenc1 orenc1 force-pushed the alert_excessive_vmi_migrations branch from 4bbd7cf to 66a5bdc Compare May 23, 2022 12:07
@orenc1
Copy link
Contributor Author

orenc1 commented May 23, 2022

Hi @Barakmor1 ,
thanks for noticing.
the problem was that the query treated to migrations of the same VMI, but with different source/target as separate results, thus the migrations on a specific VMI were not counted correctly.
This is the fixed expression I'm using now:

floor(increase(sum by (vmi) (kubevirt_migrate_vmi_succeeded_total)[1d:1m])) >= 12

it is summing up migrations per VMI, regardless of other labels, then calculates their increase over a period of time (24 hours) with a resolution of 1 minute.

@orenc1 orenc1 force-pushed the alert_excessive_vmi_migrations branch 2 times, most recently from cf8690a to 68b3e68 Compare May 26, 2022 06:41
If a VMI has been successfully migrated more than 12 times over a period of 24 hours, which is considerably higher than normal operation including an upgrade, an alert with a severity of warning is being fired.
Note:
A new VMI that hasn't been migrated yet has no datapoints for its `kubevirt_migrate_vmi_succeeded_total` metric.
When it is first migrated, the metric changes to 1, and the `increase()` function doesn't count this change as an increase in the total migrations for that VMI.
Therefore, for a new VMI the alert is being fired after 13 migrations in the specified time period (24 hours).
If the VMI has been migrated at least once in the past, 12 migrations in 24 hours causes the alert to be fired.

Also, fixing some typos.

Signed-off-by: orenc1 <ocohen@redhat.com>
@orenc1 orenc1 force-pushed the alert_excessive_vmi_migrations branch from 68b3e68 to 345650a Compare May 26, 2022 06:42
@Barakmor1
Copy link
Member

Thanks!
/lgtm

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label May 26, 2022
input_series:
- series: 'kubevirt_migrate_vmi_succeeded_total{vmi="vmi-example-1", source="node-1", target="node-2"}'
# time: 0 1 2 3 4 5 6 7 8 9 10 11 12 13
values: "_ 1 2 2 3 _ 1 2 2 _ _ 1 2 3+0x13" # 7 increases, in samples: 2,4,6,7,11,12,13
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@orenc1 Its 8 increases in total no?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, why do you consider the first value as 0 time? I think its the value of the first hour.

Copy link
Contributor

@sradco sradco May 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is an important detail and this is why the unit testing can become biased.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, because the first one doesn't count (change from _ to 1).
but it does count on the second series, since there is already a datapoint for kubevirt_migrate_vmi_succeeded_total{vmi="vmi-example-1"}

@enp0s3
Copy link
Contributor

enp0s3 commented May 29, 2022

@orenc1 @sradco @assafad Hi. IIUC our policy regarding alerts addition is that a runbook should be first created for it in the monitoring repo, otherwise the PR will fail the monitoring lane. Either I don't access the KubeVirtVMIExcessiveMigrations correctly or it doesn't exist. For the latter case lets first add the runbook and then continue with this PR.

@enp0s3
Copy link
Contributor

enp0s3 commented May 29, 2022

Waiting for the runbook PR to be merged

@orenc1
Copy link
Contributor Author

orenc1 commented May 30, 2022

hi @enp0s3 , the runbook PR has been merged.

@enp0s3
Copy link
Contributor

enp0s3 commented May 31, 2022

/approve

@kubevirt-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enp0s3

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot kubevirt-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 31, 2022
@orenc1
Copy link
Contributor Author

orenc1 commented May 31, 2022

/retest-required

@kubevirt-commenter-bot
Copy link

/retest-required
This bot automatically retries required jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

1 similar comment
@kubevirt-commenter-bot
Copy link

/retest-required
This bot automatically retries required jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@kubevirt-bot kubevirt-bot merged commit 912acbe into kubevirt:main Jun 1, 2022
@orenc1 orenc1 deleted the alert_excessive_vmi_migrations branch June 1, 2022 09:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/monitoring dco-signoff: yes Indicates the PR's author has DCO signed all their commits. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants