[release-0.65] Add alert to notify of nmstate removal #1302

Merged

rhrazdil
Contributor

kubernetes-nmstate is removed in the next CNAO version.
This change adds an alert that notifies users who have
knmstate deployed with CNAO and points them to a runbook
explaining that the standalone kubernetes-nmstate operator
should be installed.

Signed-off-by: Radim Hrazdil <rhrazdil@redhat.com>

What this PR does / why we need it:

Special notes for your reviewer:

Release note:


@kubevirt-bot kubevirt-bot added the do-not-merge/release-note-label-needed and dco-signoff: yes labels Mar 24, 2022
@rhrazdil
Contributor Author

/cc @RamLavi

@RamLavi
Collaborator

RamLavi commented Mar 29, 2022

@sradco can you also review this?

@RamLavi RamLavi requested a review from sradco March 29, 2022 08:02
Collaborator

@RamLavi RamLavi left a comment

Can you also add an e2e test to alerts_tests?

@@ -408,12 +408,33 @@ var _ = Describe("NetworkAddonsConfig", func() {
var (
configSpec cnao.NetworkAddonsConfigSpec
)
checkMetricValues := func(expectedMetricValueMap map[string]string) {
Collaborator

We recently moved the endpoints tests here. Can you do the same?

Contributor Author

It also means adding the monitoring lane, right?

Contributor Author

Done
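
For readers unfamiliar with this kind of metrics test, here is a minimal sketch of what a helper like `checkMetricValues` might compare. It is not the code from this PR: `parseMetrics` and `mismatches` are hypothetical names, and the parser assumes label-free `name value` lines in the Prometheus text exposition format.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseMetrics reads "name value" lines from a scraped metrics body and
// returns a name -> value map. Comment lines (# HELP / # TYPE) are skipped.
// Hypothetical helper: labels and timestamps are not handled.
func parseMetrics(body string) map[string]string {
	metrics := make(map[string]string)
	scanner := bufio.NewScanner(strings.NewReader(body))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		metrics[fields[0]] = fields[1]
	}
	return metrics
}

// mismatches returns one message per expected metric that is missing from
// actual or has a different value.
func mismatches(expected, actual map[string]string) []string {
	var problems []string
	for name, want := range expected {
		got, ok := actual[name]
		switch {
		case !ok:
			problems = append(problems, fmt.Sprintf("%s: missing", name))
		case got != want:
			problems = append(problems, fmt.Sprintf("%s: got %s, want %s", name, got, want))
		}
	}
	return problems
}

func main() {
	body := "# TYPE kubevirt_cnao_cr_kubemacpool_deployed gauge\n" +
		"kubevirt_cnao_cr_kubemacpool_deployed 1\n"
	expected := map[string]string{"kubevirt_cnao_cr_kubemacpool_deployed": "1"}
	fmt.Println(mismatches(expected, parseMetrics(body)))
}
```

In an e2e test the body would come from scraping the operator's metrics endpoint, and a non-empty result would fail the assertion.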

labels:
severity: warning
kubernetes_operator_part_of: kubevirt
kubernetes_operator_component: cluster-network-addons-operator
Copy link

We want to highlight this alert on the UI virtualization overview page.

Can we add a label that tells the UI to show this alert at the top of the virtualization overview page for better visibility?

For example:
kubevirt_ui: highlighted
Does that make sense?

Copy link

Do we have a naming convention for alert labels?

Contributor Author

I don't think we have one, but @sradco can confirm that.

Contributor Author

I would suggest

kubevirt_console_scope: [cluster,virtualmachine]

But IIUC, we need to provide more info when an alert is cluster-scoped (that's not the case for the alert in this PR; this is just to make the mechanism general).

Copy link

@yaacov Can you please explain the reasoning for highlighting this specific alert?
I believe alerts should be ordered based on their severity.

Copy link

The documentation for the label should be part of the OCP metrics documentation.
I'm not sure it should be mandatory like severity.
The issue is that the label naming also impacts OCP and all other operators.
We should propose this in the monitoring forum and get their opinion.

Copy link

> I think we should keep it as priority.

Makes sense to me too.

> I'm not sure if we should have it as mandatory as the severity.

IMHO it should not be mandatory (default to medium if missing).
My reasoning is that, unlike severity, which the rule maintainer can define clearly, priority may depend on things outside the rule maintainer's scope; cases where the rule maintainer wishes to raise or lower the priority are the odd ones.

Copy link

I created openshift/enhancements#1077. Please review.

Copy link

From openshift/enhancements#1077, it looks like the label is "priority: high".

@rhrazdil can you add the "priority" label?

Contributor Author

Sorry, I missed your comment.
Added the priority: high label in a new commit.
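
As an illustration of the "optional label with a default" idea discussed in this thread, here is a small sketch of how a consumer such as the UI could resolve an alert's priority. The function name and the `medium` fallback are assumptions taken from the suggestion above, not confirmed behavior of the console or of openshift/enhancements#1077.

```go
package main

import "fmt"

// defaultPriority is the fallback suggested in the review thread for alerts
// without an explicit priority label (an assumption, not confirmed behavior).
const defaultPriority = "medium"

// effectivePriority returns the value of the optional "priority" label,
// falling back to defaultPriority when the label is absent or empty.
func effectivePriority(labels map[string]string) string {
	if p, ok := labels["priority"]; ok && p != "" {
		return p
	}
	return defaultPriority
}

func main() {
	// Hypothetical label sets, used only for illustration.
	withPriority := map[string]string{"severity": "warning", "priority": "high"}
	withoutPriority := map[string]string{"severity": "warning"}

	fmt.Println(effectivePriority(withPriority))    // high
	fmt.Println(effectivePriority(withoutPriority)) // medium
}
```

Keeping the default on the consumer side means existing alert rules need no change, which matches the argument that only the odd cases should set the label explicitly.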

@kubevirt-bot kubevirt-bot added the needs-rebase label Mar 30, 2022
labels:
severity: warning
kubernetes_operator_part_of: kubevirt
kubernetes_operator_component: cluster-network-addons-operator
- alert: CnaoDown
annotations:
summary: CNAO pod is down.
Copy link

We may want to be more verbose if we highlight this alert.
Maybe: "Cluster network addons operator pod is not found; the operator deploys additional networking components required by KubeVirt"?

Contributor Author

We want to highlight the CnaoNmstateMigration alert, but yes, I get your point.

Adding a new e2e test module and lane for monitoring.

Signed-off-by: Ram Lavi <ralavi@redhat.com>
@rhrazdil rhrazdil force-pushed the add_alert_nmstate_migration branch from 1835e5b to 04ceb17 Compare April 4, 2022 11:21
@kubevirt-bot kubevirt-bot removed the needs-rebase label Apr 4, 2022
@rhrazdil rhrazdil changed the title Add alert to notify of nmstate removal [release-0.65] Add alert to notify of nmstate removal Apr 7, 2022
@rhrazdil rhrazdil force-pushed the add_alert_nmstate_migration branch from 0cf3891 to 5567456 Compare April 7, 2022 06:29
@rhrazdil
Contributor Author

rhrazdil commented Apr 7, 2022

/test pull-e2e-cnao-nmstate-functests-release-0.65

@phoracek
Member

Adding explicit hold so we don't merge this preemptively

/hold

@kubevirt-bot kubevirt-bot added the do-not-merge/hold label Apr 22, 2022
@@ -22,6 +22,7 @@ spec:
expr: sum(kubevirt_cnao_nmstate_deployed or vector(0)) > 0 and sum(kubevirt_nmstate_operator_deployments or vector(0)) == 0
for: 5m
labels:
priority: high
Copy link

@rhrazdil Please move the Priority label below the severity.

Contributor Author

@sradco I have nothing against moving it, but could you explain the reasoning?
"p" comes before "s" alphabetically.

Copy link

IMHO, semantically, the required fields should go before the optional ones.
By that logic, severity, which is a required (and more important) label, goes before the optional priority.
I have nothing against alphabetical order if that's the common practice for labels, so that is also a good option; I don't know which is best.

Copy link

@rhrazdil I agree with Kobi.

Contributor Author

Okay, I take it. Updated
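
For reference, the `expr` in the hunk under review fires when CNAO reports nmstate as deployed and no standalone nmstate operator deployment is reported; the `or vector(0)` parts make a missing metric count as 0 instead of dropping the series. A rough Go model of that condition follows. The function names are hypothetical, and the `for: 5m` clause (the condition must hold for five minutes before the alert fires) is not modeled.

```go
package main

import "fmt"

// sumOrZero mirrors `sum(metric or vector(0))`: when the metric has no
// samples at all, the result is 0 rather than an empty result.
func sumOrZero(samples []float64) float64 {
	total := 0.0
	for _, v := range samples {
		total += v
	}
	return total
}

// conditionHolds models the alert expression:
//   sum(kubevirt_cnao_nmstate_deployed or vector(0)) > 0
//   and sum(kubevirt_nmstate_operator_deployments or vector(0)) == 0
func conditionHolds(cnaoNmstateDeployed, nmstateOperatorDeployments []float64) bool {
	return sumOrZero(cnaoNmstateDeployed) > 0 && sumOrZero(nmstateOperatorDeployments) == 0
}

func main() {
	// nmstate deployed via CNAO, operator metric absent: condition holds.
	fmt.Println(conditionHolds([]float64{1}, nil)) // true
	// nmstate deployed via CNAO, standalone operator present: no alert.
	fmt.Println(conditionHolds([]float64{1}, []float64{1})) // false
	// Neither metric reported: no alert.
	fmt.Println(conditionHolds(nil, nil)) // false
}
```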

@rhrazdil rhrazdil force-pushed the add_alert_nmstate_migration branch from 46cadb7 to faf97e0 Compare May 12, 2022 12:55
@rhrazdil
Contributor Author

/hold cancel

@kubevirt-bot kubevirt-bot removed the do-not-merge/hold label May 16, 2022
Collaborator

@RamLavi RamLavi left a comment

Please check whether we really need to define kubevirt_cnao_nmstate_deployed as a gauge. If you can define it purely from metrics Prometheus already provides (as it seems you have already succeeded in doing), then we can remove the gauges, which will let us drop some of the cherry-picks and make things simpler, IMO.

@@ -33,10 +33,15 @@ var (
Name: "kubevirt_cnao_cr_kubemacpool_deployed",
Help: "Kubemacpool is deployed by Cnao CR",
})
nmstateHandlerDeployed = prometheus.NewGauge(
prometheus.GaugeOpts{
Name: "kubevirt_cnao_nmstate_deployed",
Collaborator

Hmm, I don't get something.
If you already defined kubevirt_cnao_nmstate_deployed with kube_daemonset_labels, i.e. without using a gauge variable, why do you need to define one here? Isn't this redundant?

Contributor Author

Removed the variable in this force-push.

@rhrazdil rhrazdil force-pushed the add_alert_nmstate_migration branch from faf97e0 to 7cc8bcd Compare May 16, 2022 14:08
@rhrazdil
Contributor Author

> Please check if we really need to define kubevirt_cnao_nmstate_deployed as gauge, if you can define it with pure prometheus given metrics (like it seems you already have succeeded), then we can remove the gauges, which will allow us to remove some of the cherry picks which will make things simpler IMO

Updated (removed the redundant variable).
As for the cherry-picks: the monitoring lane adds infra for testing alerts. We could add the infra alone without adding the lane, but it seems easier from a maintenance perspective to cherry-pick the whole commit.
The following two commits:
monitoring, e2e, Move endpoints test to monitoring lane
tests, metrics: Fix potential flakes on metrics scraping test

are not really necessary; we could keep the tests in workflow/deployment, if you prefer.

@rhrazdil rhrazdil force-pushed the add_alert_nmstate_migration branch from 7cc8bcd to 79db7a4 Compare May 16, 2022 14:36
Radim Hrazdil added 4 commits May 17, 2022 07:53
kubernetes-nmstate is removed in the next CNAO version.
This change adds an alert that notifies users who have
knmstate deployed with CNAO and points them to a runbook
explaining that the standalone kubernetes-nmstate operator
should be installed.

Signed-off-by: Radim Hrazdil <rhrazdil@redhat.com>
Signed-off-by: Radim Hrazdil <rhrazdil@redhat.com>
Signed-off-by: Radim Hrazdil <rhrazdil@redhat.com>
Signed-off-by: Radim Hrazdil <rhrazdil@redhat.com>
@rhrazdil rhrazdil force-pushed the add_alert_nmstate_migration branch from 79db7a4 to 511b995 Compare May 17, 2022 06:06
@sonarcloud

sonarcloud bot commented May 17, 2022

Kudos, SonarCloud Quality Gate passed!

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
8 Code Smells (rating A)

No Coverage information
0.0% Duplication

@RamLavi
Collaborator

RamLavi commented May 17, 2022

/retest

Collaborator

@RamLavi RamLavi left a comment

/lgtm
/approve

@kubevirt-bot kubevirt-bot added the lgtm label May 17, 2022
@kubevirt-bot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: RamLavi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot kubevirt-bot added the approved label May 17, 2022
@RamLavi
Collaborator

RamLavi commented May 17, 2022

/release-note-none

@kubevirt-bot kubevirt-bot added the release-note-none label and removed the do-not-merge/release-note-label-needed label May 17, 2022
@kubevirt-bot kubevirt-bot merged commit ba15d4a into kubevirt:release-0.65 May 17, 2022
@yaacov

yaacov commented May 23, 2022

@rhrazdil hi,
should we merge the alert into the main branch too?
It makes sense to me to have fixes in main when possible.

@RamLavi WDYT?

cc: @sradco @phoracek

@rhrazdil
Contributor Author

Hello @yaacov,
we decided to add the alert only in the release-0.65 branch, because CNAO in newer releases deletes kubernetes-nmstate if it's installed.
So the alert couldn't fire in a newer branch anyway.

@sradco

sradco commented May 23, 2022

@rhrazdil I think it is important to add this alert to the main branch as well, in case the user upgrades directly to the next major version and not to the next minor z-stream.

@rhrazdil
Contributor Author

We're blocking the upgrade when a user upgrades from CNV 4.10 to CNV 4.11 and has kubernetes-nmstate deployed via CNAO.
Hence, if a user has CNV 4.11 (and thus also OCP 4.11) and knmstate deployed, we can safely assume that it's deployed via the Kubernetes Nmstate Operator. CNAO would delete the knmstate deployment within a minute.

So this alert has no meaning on 4.11.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. lgtm Indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. size/XXL
7 participants