
Bug 1793850: Update UsingDeprecatedAPIExtensionsV1Beta1 alert #730

Closed. cblecker wants to merge 1 commit (from the v1beta1-alert branch).

Conversation

cblecker (Member):

  • Updated message that describes the issue more clearly
  • Exclude velero-server client
  • Change alert to use an increase in the last 24h
  • Change severity to none

/assign @soltysh
/cc @sttts @brancz @tnozicka
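
For orientation, the resulting rule would look roughly like this, reconstructed from the diff fragments quoted in the review threads below; the alert name is taken from the PR title, the `> 0` threshold and surrounding field layout are assumptions, and later review feedback swaps the metric to apiserver_request_total and the severity to info:

```yaml
- alert: UsingDeprecatedAPIExtensionsV1Beta1
  expr: |
    increase(
      apiserver_request_count{
        group="extensions",
        version="v1beta1",
        resource!~"ingresses|",
        client!~"hyperkube/.*|cluster-policy-controller/.*|velero-server/.*"
      }[24h]
    ) > 0  # threshold assumed; the diff fragments below show only the selector
  labels:
    severity: none
```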

@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jan 23, 2020
@openshift-ci-robot:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cblecker
To complete the pull request process, please assign mfojtik
You can assign the PR to them by writing /assign @mfojtik in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@cblecker (Member Author):

This and #705 should be cherry-picked back to release-4.3.

@cblecker cblecker changed the title Update UsingDeprecatedAPIExtensionsV1Beta1 alert Bug 1793850: Update UsingDeprecatedAPIExtensionsV1Beta1 alert Jan 23, 2020
@openshift-ci-robot openshift-ci-robot added the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Jan 23, 2020
@openshift-ci-robot:

@cblecker: This pull request references Bugzilla bug 1793850, which is invalid:

  • expected the bug to target the "4.4.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1793850: Update UsingDeprecatedAPIExtensionsV1Beta1 alert

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cblecker (Member Author):

/retest

  expr: |
-   apiserver_request_count{group="extensions",version="v1beta1",resource!~"ingresses|",client!~"hyperkube/.*|cluster-policy-controller/.*"}
+   increase(
+     apiserver_request_count{
lilic (Contributor):

Please note that apiserver_request_count is a deprecated metric and we are dropping those metrics from openshift as well. We should be using apiserver_request_total instead.

Member:

@lilic drop when? This PR is meant for 4.3, most importantly. If you're dropping this in 4.4, fine, we'll change it; if in the next release, we'll fix it then.

lilic (Contributor):

Yes, we are dropping this metric starting with 4.4. It has been deprecated since Kubernetes 1.14 and has a replacement, apiserver_request_total, so just changing it to that should just work ™️.
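
Concretely, the swap lilic describes is a metric rename inside an otherwise unchanged selector; a minimal sketch, assuming apiserver_request_total carried the same group/version/resource/client labels at the time (cblecker's 4.3 test below suggests it did):

```promql
# same selector as before, only the metric name changes
increase(
  apiserver_request_total{
    group="extensions",
    version="v1beta1",
    resource!~"ingresses|",
    client!~"hyperkube/.*|cluster-policy-controller/.*|velero-server/.*"
  }[24h]
)
```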

cblecker (Member Author):

Made this change, and I verified the alert still fires as expected in a test cluster.

Member:

👍 @lilic will the new metric work fine in 4.3 as well? Just making sure before we switch it in the backport too, or does the backport have to be different?

cblecker (Member Author):

My test cluster is running 4.3.0 GA, so I can confirm that this works.

  labels:
-   severity: warning
+   severity: none
Contributor:

why severity none? 🤔

Member:

I remember that the severity should've stayed; from the discussion, we were only going to change the metric to be rate-based.

cblecker (Member Author):

In the OSD context, Critical and Warning alerts are fed into monitoring for cluster health. This alert isn't particularly actionable and doesn't actually impact the cluster's current health -- it's an "info"-type alert that you should do something prior to the next upgrade, but the current cluster health isn't degraded. None, in this case, I believe is appropriate.

cblecker (Member Author):

Note that with a severity of none, it still surfaces in the on-cluster UI.

Commenter:

> alert isn't particularly actionable and doesn't actually impact the cluster's current health

Precisely why this alert shouldn't exist in the first place.

Member:

The reasoning behind it is that we need to yell loudly at people using this API, because otherwise after the next upgrade this won't work. I can be convinced, and I was originally going after an info level, but no such level exists atm. That's why the warning, and it's an important one to have.

Contributor:

But in 4.4 it's not firing; otherwise we would have had failed jobs, which confirms @tnozicka:

> note that for 4.4 the serving is disabled so we should drop this check entirely there

Do we want our customers to react to this alert? And if yes, how?

Member:

> Do we want our customers to react to this alert? And if yes, how?

If this fires in 4.3 it means you have clients accessing endpoints that will be removed in the next release and they should be updating that code.
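
As a sketch of how an administrator might act on this, the same selector can be broken down by client to see which callers need updating; the label values depend on each client's User-Agent, and the metric name here assumes the post-review apiserver_request_total:

```promql
# which clients are still hitting extensions/v1beta1, and on which resources?
sum by (client, resource) (
  increase(apiserver_request_total{group="extensions", version="v1beta1"}[24h])
)
```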

Commenter:

> I can be convinced, and I was originally going after an info level, but no such level exists atm.

They do exist. OLM uses info-level alerts, and we also (will) have special rules for handling this severity.

> If this fires in 4.3 it means you have clients accessing endpoints that will be removed in the next release and they should be updating that code.

This is not immediately actionable from an operator's point of view.

@tnozicka (Contributor) left a comment:

note that for 4.4 the serving is disabled so we should drop this check entirely there

group="extensions",
version="v1beta1",
resource!~"ingresses|",
client!~"hyperkube/.*|cluster-policy-controller/.*|velero-server/.*"
Contributor:

Why velero-server? The only reason for the others was using discovery to spin up informers.

cblecker (Member Author):

Velero is a piece of backup software, and it scrapes every API type available on the system in order to back them up. This causes the alert to fire in OSD and never resolve. @sttts mentioned that ignoring this client may be a good idea.

Member:

yeah, that's a minor problem I can live with.

cblecker (Member Author):

I'd also note that there is no action that we or velero need to take on this. When the API endpoint goes away in 4.4, it won't be there to scrape anymore.

version="v1beta1",
resource!~"ingresses|",
client!~"hyperkube/.*|cluster-policy-controller/.*|velero-server/.*"
}[24h]
Member:

Didn't you want to have 5m here? Why go with 24h?

cblecker (Member Author):

So my understanding of this alert is that we want it to surface to the administrators of the cluster that some client is making a call that it will no longer be able to make in 4.4+. Right now, this alert fires and never resolves -- which on the one hand accomplishes this goal, but on the other hand doesn't allow the alert to resolve if the condition stops.

I'm okay with 5m, but I think that lessens the usefulness of this alert, as the administrator would need to catch it within that 5-minute period. For clients that aren't calling the API that frequently (cron jobs, deploy pipelines, etc.) it might be harder to catch this alert.
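
Concretely, the two options under discussion differ only in the range selector, which determines how long after the last matching request the alert keeps firing; a sketch with the selector abbreviated and the > 0 threshold assumed:

```promql
# 5m window: resolves within minutes of the last deprecated call, so it is
# easy to miss for infrequent callers (cron jobs, deploy pipelines, etc.)
increase(apiserver_request_count{group="extensions", version="v1beta1"}[5m]) > 0

# 24h window (chosen here): keeps firing for up to a day after the last call
increase(apiserver_request_count{group="extensions", version="v1beta1"}[24h]) > 0
```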

Member:

I don't have any strong preferences either way, I was just asking 😉

version="v1beta1",
resource!~"ingresses|",
client!~"hyperkube/.*|cluster-policy-controller/.*|velero-server/.*"
}[24h]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't you want to have 5m here? Why going with 24h?

@soltysh (Member) left a comment:

My only remaining concern with this is the level: it has to stay warning, or if there's info, that's the lowest we can go. It's quite important feedback for users to get before the next upgrade, at which point it'll be too late for them; they will be blocked from upgrading, or hit even worse problems after the upgrade.

- Updated message that describes the issue more clearly
- Exclude velero-server client
- Change alert to use an increase in the last 24h
- Change severity to none
@cblecker (Member Author):

@soltysh Updated to info. It's important that this not be a "warning" because the cluster's health is not actually degraded.
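
Presumably the labels block then ends up as follows (a sketch; only the severity value differs from the diff quoted earlier):

```yaml
labels:
  severity: info
```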

@cblecker (Member Author):

[Screenshot 2020-01-24 09 19 56: example of how this looks in the on-cluster UI]

@cblecker (Member Author):

/retest

1 similar comment
@cblecker (Member Author):

/retest

@openshift-ci-robot:

@cblecker: The following tests failed, say /retest to rerun all failed tests:

Test name | Commit | Details | Rerun command
ci/prow/e2e-aws-operator | 4599af8 | link | /test e2e-aws-operator
ci/prow/e2e-aws-serial | 4599af8 | link | /test e2e-aws-serial
ci/prow/e2e-aws | 4599af8 | link | /test e2e-aws
ci/prow/e2e-aws-upgrade | 4599af8 | link | /test e2e-aws-upgrade

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@soltysh (Member):

soltysh commented Jan 28, 2020

Superseded with #741.

@soltysh soltysh closed this Jan 28, 2020
@cblecker cblecker deleted the v1beta1-alert branch January 29, 2020 14:46