
Bug 1793850: Update UsingDeprecatedAPIExtensionsV1Beta1 alert #730

Closed. cblecker wants to merge 1 commit (from the v1beta1-alert branch).

Conversation

cblecker (Member):

  • Updated message that describes the issue more clearly
  • Exclude velero-server client
  • Change alert to use an increase in the last 24h
  • Change severity to none

/assign @soltysh
/cc @sttts @brancz @tnozicka
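
For orientation, the resulting rule would look roughly like this, reconstructed from the diff fragments quoted in the review threads below; the alert name is taken from the PR title, the `> 0` threshold and surrounding field layout are assumptions, and later review feedback swaps the metric to apiserver_request_total and the severity to info:

```yaml
- alert: UsingDeprecatedAPIExtensionsV1Beta1
  expr: |
    increase(
      apiserver_request_count{
        group="extensions",
        version="v1beta1",
        resource!~"ingresses|",
        client!~"hyperkube/.*|cluster-policy-controller/.*|velero-server/.*"
      }[24h]
    ) > 0  # threshold assumed; the diff fragments below show only the selector
  labels:
    severity: none
```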

@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jan 23, 2020
@openshift-ci-robot:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cblecker
To complete the pull request process, please assign mfojtik
You can assign the PR to them by writing /assign @mfojtik in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@cblecker (Member Author):

This and #705 should be cherry-picked back to release-4.3.

@cblecker cblecker changed the title Update UsingDeprecatedAPIExtensionsV1Beta1 alert Bug 1793850: Update UsingDeprecatedAPIExtensionsV1Beta1 alert Jan 23, 2020
@openshift-ci-robot openshift-ci-robot added the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Jan 23, 2020
@openshift-ci-robot:

@cblecker: This pull request references Bugzilla bug 1793850, which is invalid:

  • expected the bug to target the "4.4.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1793850: Update UsingDeprecatedAPIExtensionsV1Beta1 alert

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cblecker (Member Author):

/retest

  expr: |
-   apiserver_request_count{group="extensions",version="v1beta1",resource!~"ingresses|",client!~"hyperkube/.*|cluster-policy-controller/.*"}
+   increase(
+     apiserver_request_count{
lilic (Contributor):

Please note that apiserver_request_count is a deprecated metric and we are dropping those metrics from openshift as well. We should be using apiserver_request_total instead.

Member:

@lilic drop when? This PR is meant for 4.3, most importantly. If you're dropping this in 4.4, fine, we'll change it; if in the next release, we'll fix it then.

lilic (Contributor):

Yes, we are dropping this metric starting with 4.4. It has been deprecated since Kubernetes 1.14 and has a replacement, apiserver_request_total, so just changing it to that should just work ™️.
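
Concretely, the swap lilic describes is a metric rename inside an otherwise unchanged selector; a minimal sketch, assuming apiserver_request_total carried the same group/version/resource/client labels at the time (cblecker's 4.3 test below suggests it did):

```promql
# same selector as before, only the metric name changes
increase(
  apiserver_request_total{
    group="extensions",
    version="v1beta1",
    resource!~"ingresses|",
    client!~"hyperkube/.*|cluster-policy-controller/.*|velero-server/.*"
  }[24h]
)
```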

cblecker (Member Author):

Made this change, and I verified the alert still fires as expected in a test cluster.

Member:

👍 @lilic will the new metric work fine in 4.3 as well? Just making sure before we switch it in the backport too, or does the backport have to be different?

cblecker (Member Author):

My test cluster is running 4.3.0 GA, so I can confirm that this works.

  labels:
-   severity: warning
+   severity: none
Contributor:

why severity none? 🤔

Member:

I remember that the severity should've stayed; from the discussion, we were only going to change the metric to be rate-based.

cblecker (Member Author):

In the OSD context, Critical and Warning alerts are fed into monitoring for cluster health. This alert isn't particularly actionable and doesn't actually impact the cluster's current health -- it's an "info"-type alert that you should do something prior to the next upgrade, but the current cluster health isn't degraded. None, in this case, I believe is appropriate.

cblecker (Member Author):

Note that with a severity of none, it still surfaces in the on-cluster UI.

Commenter:

> alert isn't particularly actionable and doesn't actually impact the cluster's current health

Precisely why this alert shouldn't exist in the first place.

Member:

The reasoning behind it is that we need to yell loudly at people using this API, because otherwise after the next upgrade this won't work. I can be convinced, and I was originally going after an info level, but no such level exists atm. That's why the warning, and it's an important one to have.

Contributor:

But in 4.4 it's not firing; otherwise we would have had failed jobs, which confirms @tnozicka:

> note that for 4.4 the serving is disabled so we should drop this check entirely there

Do we want our customers to react to this alert? And if yes, how?

Member:

> Do we want our customers to react to this alert? And if yes, how?

If this fires in 4.3 it means you have clients accessing endpoints that will be removed in the next release and they should be updating that code.
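
As a sketch of how an administrator might act on this, the same selector can be broken down by client to see which callers need updating; the label values depend on each client's User-Agent, and the metric name here assumes the post-review apiserver_request_total:

```promql
# which clients are still hitting extensions/v1beta1, and on which resources?
sum by (client, resource) (
  increase(apiserver_request_total{group="extensions", version="v1beta1"}[24h])
)
```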

Commenter:

> I can be convinced, and I was originally going after an info level, but no such level exists atm.

They do exist. OLM uses info-level alerts, and we also (will) have special rules for handling this severity.

> If this fires in 4.3 it means you have clients accessing endpoints that will be removed in the next release and they should be updating that code.

This is not immediately actionable from an operator's point of view.

@tnozicka (Contributor) left a comment:

note that for 4.4 the serving is disabled so we should drop this check entirely there

group="extensions",
version="v1beta1",
resource!~"ingresses|",
client!~"hyperkube/.*|cluster-policy-controller/.*|velero-server/.*"
Contributor:

Why velero-server? The only reason for the others was using discovery to spin up informers.

cblecker (Member Author):

Velero is a piece of backup software, and it scrapes every API type available on the system in order to back them up. This causes the alert to fire in OSD and never resolve. @sttts mentioned that ignoring this client may be a good idea.

Member:

yeah, that's a minor problem I can live with.

cblecker (Member Author):

I'd also note that there is no action that we or velero need to take on this. When the API endpoint goes away in 4.4, it won't be there to scrape anymore.

version="v1beta1",
resource!~"ingresses|",
client!~"hyperkube/.*|cluster-policy-controller/.*|velero-server/.*"
}[24h]
Member:

Didn't you want to have 5m here? Why go with 24h?

cblecker (Member Author):

So my understanding of this alert is that we want it to surface to the administrators of the cluster that some client is making a call that it will no longer be able to make in 4.4+. Right now, this alert fires and never resolves -- which on the one hand accomplishes this goal, but on the other hand doesn't allow the alert to resolve if the condition stops.

I'm okay with 5m, but I think that lessens the usefulness of this alert, as the administrator would need to catch it within that 5-minute period. For clients that aren't calling the API that frequently (cron jobs, deploy pipelines, etc.) it might be harder to catch this alert.
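
Concretely, the two options under discussion differ only in the range selector, which determines how long after the last matching request the alert keeps firing; a sketch with the selector abbreviated and the > 0 threshold assumed:

```promql
# 5m window: resolves within minutes of the last deprecated call, so it is
# easy to miss for infrequent callers (cron jobs, deploy pipelines, etc.)
increase(apiserver_request_count{group="extensions", version="v1beta1"}[5m]) > 0

# 24h window (chosen here): keeps firing for up to a day after the last call
increase(apiserver_request_count{group="extensions", version="v1beta1"}[24h]) > 0
```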

Member:

I don't have any strong preferences either way, I was just asking 😉

version="v1beta1",
resource!~"ingresses|",
client!~"hyperkube/.*|cluster-policy-controller/.*|velero-server/.*"
}[24h]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't you want to have 5m here? Why going with 24h?

@soltysh (Member) left a comment:

My only remaining concern with this is the level: it has to stay warning, or if there's info, that's the lowest we can go. It's quite important feedback for users to get before the next upgrade, at which point it'll be too late for them; they will be blocked from upgrading, or hit even worse problems after the upgrade.

- Updated message that describes the issue more clearly
- Exclude velero-server client
- Change alert to use an increase in the last 24h
- Change severity to none
@cblecker (Member Author):

@soltysh Updated to info. It's important that this not be a "warning" because the cluster's health is not actually degraded.
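
Presumably the labels block then ends up as follows (a sketch; only the severity value differs from the diff quoted earlier):

```yaml
labels:
  severity: info
```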

@cblecker (Member Author):

[Screenshot 2020-01-24 09 19 56: example of how this looks in the on-cluster UI]

@cblecker (Member Author):

/retest

1 similar comment
@cblecker (Member Author):

/retest

@openshift-ci-robot:

@cblecker: The following tests failed, say /retest to rerun all failed tests:

Test name | Commit | Details | Rerun command
ci/prow/e2e-aws-operator | 4599af8 | link | /test e2e-aws-operator
ci/prow/e2e-aws-serial | 4599af8 | link | /test e2e-aws-serial
ci/prow/e2e-aws | 4599af8 | link | /test e2e-aws
ci/prow/e2e-aws-upgrade | 4599af8 | link | /test e2e-aws-upgrade

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@soltysh (Member):

soltysh commented Jan 28, 2020

Superseded with #741.

@soltysh soltysh closed this Jan 28, 2020
@cblecker cblecker deleted the v1beta1-alert branch January 29, 2020 14:46