create event intervals for alerts #26508

Merged
merged 4 commits into from Oct 8, 2021

Conversation

deads2k
Contributor

@deads2k deads2k commented Oct 6, 2021

trying to add alerts as event intervals to overlay on our CI runs

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress label Oct 6, 2021
@openshift-ci openshift-ci bot requested review from bparees and mfojtik October 6, 2021 17:43
@openshift-ci openshift-ci bot added the vendor-update and approved labels Oct 6, 2021
fmt.Printf("\n\n\n#### alertErr=%v\n", err)
}
events = append(events, alertEventIntervals...)
sort.Sort(events)
Member

Does this new sort.Sort call let us drop some of the earlier sorts in this function (e.g. the one above, after loading from AdditionalEvents_*)?

timeRange := prometheusv1.Range{
Start: startTime,
End: time.Now(),
Step: 1 * time.Second,
Contributor

By default, alerting rules are evaluated every 30 seconds. This can be overridden by the PrometheusRules resources, but AFAIK no operator does that. It should be OK to use a step of 10 seconds, which would reduce the amount of data returned by Prometheus.

@deads2k deads2k force-pushed the gather-alerts branch 4 times, most recently from 6c546d1 to e324013 Compare October 8, 2021 13:00
@deads2k deads2k changed the title [wip] create event intervals for alerts create event intervals for alerts Oct 8, 2021
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress label Oct 8, 2021
matrixAlert := alerts.(prometheustypes.Matrix)
for _, alert := range matrixAlert {
alertName := alert.Metric[prometheustypes.AlertNameLabel]
if alertName == "Watchdog" {
Member

We may want to keep Watchdog here, as we're using its result to evaluate if Prometheus is 100% accessible during upgrade


var alertStartTime *time.Time
var lastTime *time.Time
for _, currValue := range alert.Values {
Member

This can probably be offloaded to Prometheus: iiuc, count_over_time(ALERTS[<test_duration>:1s]) would return the number of seconds for which the alert was present.

Contributor Author

@deads2k deads2k Oct 8, 2021

> This can probably be offloaded to Prometheus: iiuc, count_over_time(ALERTS[<test_duration>:1s]) would return the number of seconds for which the alert was present.

if someone wants to refine later, I won't stop them. This PR needs to merge in the current state though to get some data.

Contributor

@dgoodwin dgoodwin left a comment

Looks good to me, with just one nit, but there are a few good comments here from yesterday that should be resolved somehow.

return [item.locator, "", "AlertCritical"]
}

return [item.locator, "", "AlertCritical"]
Contributor

Intentional to default to critical? AlertUnknown perhaps? Should have a comment either way.

Member

Showing it as Critical here would help us find misconfigured alerts.
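
As a sketch of the design choice in this thread, the Go function below maps a severity label to a display level and deliberately falls back to critical. The reviewed code is JavaScript, so this is an illustration of the idea rather than the PR's implementation. `AlertCritical` appears in the diff; `AlertInfo`, `AlertWarning`, the `levelForSeverity` name, and the severity values are assumptions.

```go
package main

import "fmt"

// levelForSeverity maps an alert's severity label to a display level.
// "AlertCritical" comes from the reviewed diff; "AlertInfo" and
// "AlertWarning" are hypothetical names used only for this sketch.
func levelForSeverity(severity string) string {
	switch severity {
	case "info":
		return "AlertInfo"
	case "warning":
		return "AlertWarning"
	case "critical":
		return "AlertCritical"
	default:
		// Deliberate fallback: an alert with a missing or unexpected
		// severity is shown as critical so the misconfiguration stands
		// out in the chart instead of being hidden.
		return "AlertCritical"
	}
}

func main() {
	for _, s := range []string{"info", "warning", "critical", ""} {
		fmt.Printf("severity=%q level=%s\n", s, levelForSeverity(s))
	}
}
```

The comment on the default branch is the point of the nit: whichever fallback is chosen, the code should say why.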

pendingAlerts, err := createEventIntervalsForAlerts(ctx, alerts, startTime)
if err != nil {
return nil, err
}
Contributor

What does AlertPending mean?

Member

https://pracucci.com/prometheus-understanding-the-delays-on-alerting.html - pending is "metric value matches, but it didn't last long enough yet"

@dgoodwin
Contributor

dgoodwin commented Oct 8, 2021

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Oct 8, 2021
@openshift-ci
Contributor

openshift-ci bot commented Oct 8, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, dgoodwin

@openshift-bot
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@vrutkovs
Member

vrutkovs commented Oct 8, 2021

/skip
/retest

@openshift-ci
Contributor

openshift-ci bot commented Oct 8, 2021

@deads2k: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-single-node | 13def02 | link | false | /test e2e-aws-single-node |
| ci/prow/e2e-agnostic-cmd | 13def02 | link | false | /test e2e-agnostic-cmd |

@openshift-merge-robot openshift-merge-robot merged commit 971ef68 into openshift:master Oct 8, 2021
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
lgtm: Indicates that a PR is ready to be merged.
vendor-update: Touching vendor dir or related files
7 participants