New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUILD-187: Detect if image trigger ID was cleared #150
BUILD-187: Detect if image trigger ID was cleared #150
Conversation
/test e2e-aws |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@openshift/openshift-team-monitoring I discussed this a bit with @lilic on slack.
We want to inform cluster admins that some action on the cluster is triggering behavior that will be deprecated in the future. Design initially calls for an Alert to be fired, driven by the metric here. This metric started as a simple counter, but per our discussion a Gauge was determined to be a better fit. To appropriately track if this behavior is being triggered or not, however, we need to use a vector with namespace
and name
as labels. This creates an unbound time series (m namespaces * n BuildConfig objects per namespace).
One alternative we discussed is foregoing the alert completely, and rely only on the event.
Are we abusing the alert system in this case?
// MarkImageChangeTriggerCleared sets the openshift_build_image_change_trigger_cleared gague value. | ||
// A value of 1 indicates the LastTriggeredImageID field on a BuildConfig has been reset to empty. | ||
func MarkImageChangeTriggerCleared(namespace, name string, cleared bool) { | ||
value := float64(0) | ||
if cleared { | ||
value = float64(1) | ||
} | ||
triggerClearedCounter.WithLabelValues(namespace, name).Set(value) | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we set the metric value to 0 or 1 - eventually every BuildConfig
in the cluster will have a value set.
for _, currentTrigger := range cur.Spec.Triggers { | ||
if currentTrigger.ImageChange == nil { | ||
continue | ||
} | ||
from := c.getImageChangeTriggerInputReference(cur, currentTrigger) | ||
prev, found := imageLastTriggers[*from] | ||
if found && len(prev) > 0 && len(currentTrigger.ImageChange.LastTriggeredImageID) == 0 { | ||
return true | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of the logic here was lifted from the ImageChange trigger controller code.
"ImageChangeTriggerCleared", | ||
"spec.triggers[*].imagechange.lastTriggeredImageID was cleared, which will trigger a build. This behavior is deprecated and will be removed in a future OpenShift release.") | ||
} else { | ||
prometheus.MarkImageChangeTriggerCleared(bc.Namespace, bc.Name, false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we can be a bit more intelligent and only mark cleared to false
if:
- BuildConfig has a "valid" ImageChange trigger (unchanged, updates from empty -> sha, or sha -> sha)
- BuildConfig previously had an ImageChange trigger, but it was removed.
In a real cluster, this would significantly cut down the time series size.
c.recorder.Event(bc, | ||
corev1.EventTypeWarning, | ||
"ImageChangeTriggerCleared", | ||
"spec.triggers[*].imagechange.lastTriggeredImageID was cleared, which will trigger a build. This behavior is deprecated and will be removed in a future OpenShift release.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Record a warning event. This will bubble up to the web console and reference the BuildConfig.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From a monitoring standpoint, I agree with @lilic that the use case is better suited for events rather than metrics given the unbounded cardinality.
a6998c7
to
6f6d6e2
Compare
/hold Will add an e2e test in origin before merging. |
PR to verify the behavior: openshift/origin#25739 |
/hold cancel e2e test is in origin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
couple of minor comments/suggestions
} | ||
from := c.getImageChangeTriggerInputReference(cur, currentTrigger) | ||
prev, found := imageLastTriggers[*from] | ||
if found && len(prev) > 0 && len(currentTrigger.ImageChange.LastTriggeredImageID) == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we care if the devops user set it to some non-zero "dummy" value vs. clearing it out
i.e. alert that they should not be setting it in any fashion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we care less about this scenario, but is something we can check. If the current is an invalid image ref, we have a pretty good idea that the image trigger controller didn't do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I don't think we can reliably do this other than with the empty string. "foo" is a perfectly valid pull spec - otherwise would would need to assume that the image trigger always adds the full SHA reference to this field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the validity of the image ref was not a much a thought for me as the fact that they really should not be setting it at all IMO
that said, upon further review, it could get kludgy at best figuring out if a user set it vs. the controller set it
that, and don't feel strongly about it really anyway ... so let's move on .... an edge case at most, if a concern at all, so we can wait for it to crop up in a customer case before worrying about it at most
Detect if the lastTriggeredImageID was cleared in a BuildConfig. If an update cleared the last triggered image, record a warning event. A metric and associated alert were considered to better inform users that deprecated behavior is being triggered. However, this was rejected because the metric would require unbound time series cardinality. The behavior that triggers builds off of `spec` will be removed in BUILD-188 [1]. [1] https://issues.redhat.com/browse/BUILD-188
6f6d6e2
to
7d133c1
Compare
/lgtm /retest |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: adambkaplan, gabemontero The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest Please review the full test history for this PR and help us cut down flakes. |
1 similar comment
/retest Please review the full test history for this PR and help us cut down flakes. |
Detect if the lastTriggeredImageID was cleared in a BuildConfig.
If an update cleared the last triggered image, record a warning event.
A metric and associated alert were considered to better inform users
that deprecated behavior is being triggered. However, this was rejected
because the metric would require unbound time series cardinality.
The behavior that triggers builds off of
spec
will be removed in BUILD-188 [1].[1] https://issues.redhat.com/browse/BUILD-188