create event intervals for alerts #26508
fmt.Printf("\n\n\n#### alertErr=%v\n", err)
}
events = append(events, alertEventIntervals...)
sort.Sort(events)
Does this new sort.Sort call allow us to drop some of the earlier ones in the function (e.g. the one above, after loading from AdditionalEvents_*)?
timeRange := prometheusv1.Range{
Start: startTime,
End: time.Now(),
Step: 1 * time.Second,
By default, alerting rules are evaluated every 30 seconds. That interval can be overridden by the PrometheusRules resources, but AFAIK no operator does that. It should be OK to use a step of 10 seconds, which would reduce the amount of data returned by Prometheus.
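To put rough numbers on the step suggestion, here is a minimal sketch of how the sample count per series scales with the step. It assumes a series that is present for the whole window and that a range query returns one point at the start plus one per step; the one-hour window and the name samplesPerSeries are illustrative, not from the PR.

```go
package main

import (
	"fmt"
	"time"
)

// samplesPerSeries estimates how many points a range query returns for a
// single series that exists for the whole window: one point at the start,
// then one per step up to and including the end.
// Hypothetical helper, not part of the PR.
func samplesPerSeries(window, step time.Duration) int {
	if window < 0 || step <= 0 {
		return 0
	}
	return int(window/step) + 1
}

func main() {
	window := time.Hour // assumed test duration, for illustration only
	steps := []time.Duration{time.Second, 10 * time.Second, 30 * time.Second}
	for _, step := range steps {
		fmt.Printf("step=%v samples=%d\n", step, samplesPerSeries(window, step))
	}
}
```

With a 30-second evaluation interval, a 10-second step still samples each evaluation about three times, so start and end times lose little precision compared with a 1-second step.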
6c546d1 to e324013
pkg/monitor/alerts.go
Outdated
matrixAlert := alerts.(prometheustypes.Matrix)
for _, alert := range matrixAlert {
alertName := alert.Metric[prometheustypes.AlertNameLabel]
if alertName == "Watchdog" {
We may want to keep Watchdog here, as we're using its result to evaluate if Prometheus is 100% accessible during upgrade
var alertStartTime *time.Time
var lastTime *time.Time
for _, currValue := range alert.Values {
This can probably be offloaded to Prometheus. IIUC, count_over_time(ALERTS[<test_duration>:1s]) would return the number of seconds the alert has been found.
if someone wants to refine later, I won't stop them. This PR needs to merge in the current state though to get some data.
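For readers following the thread, the client-side loop under discussion turns per-step samples into contiguous intervals. The following is a minimal sketch of that idea, not the PR's implementation: the interval type, collapseSamples, timesAt, and the maxGap rule are all hypothetical names and assumptions introduced for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// interval is a hypothetical stand-in for the PR's event interval type.
type interval struct {
	From, To time.Time
}

// collapseSamples merges sorted sample timestamps into intervals. Consecutive
// samples that are at most maxGap apart are treated as one continuous
// occurrence of the alert; a larger gap closes the current interval.
func collapseSamples(samples []time.Time, maxGap time.Duration) []interval {
	var out []interval
	var start, last time.Time
	open := false
	for _, curr := range samples {
		switch {
		case !open:
			start, open = curr, true
		case curr.Sub(last) > maxGap:
			out = append(out, interval{From: start, To: last})
			start = curr
		}
		last = curr
	}
	if open {
		out = append(out, interval{From: start, To: last})
	}
	return out
}

// timesAt builds timestamps from Unix seconds; used only for the example.
func timesAt(secs ...int64) []time.Time {
	out := make([]time.Time, 0, len(secs))
	for _, s := range secs {
		out = append(out, time.Unix(s, 0).UTC())
	}
	return out
}

func main() {
	// Samples every 10s, with a hole between t=20s and t=60s.
	samples := timesAt(0, 10, 20, 60, 70)
	for _, iv := range collapseSamples(samples, 10*time.Second) {
		fmt.Printf("%s -> %s\n", iv.From.Format(time.RFC3339), iv.To.Format(time.RFC3339))
	}
}
```

Setting maxGap to the query step means a single missed sample splits an interval; a caller that wants to tolerate gaps could pass a larger value.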
Looks good to me, just one nit, but there are a few good comments here from yesterday that should be resolved somehow.
return [item.locator, "", "AlertCritical"]
}

return [item.locator, "", "AlertCritical"]
Intentional to default to critical? AlertUnknown perhaps? Should have a comment either way.
Showing it as Critical here would have us find misconfigured alerts
pendingAlerts, err := createEventIntervalsForAlerts(ctx, alerts, startTime)
if err != nil {
return nil, err
}
What does AlertPending mean?
https://pracucci.com/prometheus-understanding-the-delays-on-alerting.html - pending is "metric value matches, but it didn't last long enough yet"
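The pending/firing distinction described above can be sketched as a small state function. This is an illustration under assumptions, not Prometheus code: alertState and its parameters are hypothetical, with holdFor standing in for the rule's configured "for" duration.

```go
package main

import (
	"fmt"
	"time"
)

// alertState classifies an alert from whether its expression currently
// matches and how long it has matched without interruption.
// holdFor models the rule's "for" duration (hypothetical parameter name).
func alertState(matching bool, activeFor, holdFor time.Duration) string {
	switch {
	case !matching:
		return "inactive"
	case activeFor < holdFor:
		// The expression matches, but not yet for long enough.
		return "pending"
	default:
		return "firing"
	}
}

func main() {
	holdFor := 5 * time.Minute // assumed "for" duration, for illustration
	for _, activeFor := range []time.Duration{time.Minute, 5 * time.Minute} {
		fmt.Printf("active for %v: %s\n", activeFor, alertState(true, activeFor, holdFor))
	}
}
```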
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: deads2k, dgoodwin
/retest-required Please review the full test history for this PR and help us cut down flakes.
/skip
@deads2k: The following tests failed, say
trying to add alerts as event intervals to overlay on our CI runs