Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUILD-187: Detect if image trigger ID was cleared #150

Merged

Conversation

adambkaplan
Copy link
Contributor

@adambkaplan adambkaplan commented Nov 18, 2020

Detect if the lastTriggeredImageID was cleared in a BuildConfig.
If an update cleared the last triggered image, record a warning event.

A metric and associated alert were considered to better inform users
that deprecated behavior is being triggered. However, this was rejected
because the metric would require unbound time series cardinality.

The behavior that triggers builds off of spec will be removed in BUILD-188 [1].

[1] https://issues.redhat.com/browse/BUILD-188

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 18, 2020
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 18, 2020
@ingvagabund
Copy link
Member

/test e2e-aws

Copy link
Contributor Author

@adambkaplan adambkaplan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@openshift/openshift-team-monitoring I discussed this a bit with @lilic on slack.

We want to inform cluster admins that some action on the cluster is triggering behavior that will be deprecated in the future. Design initially calls for an Alert to be fired, driven by the metric here. This metric started as a simple counter, but per our discussion a Gauge was determined to be a better fit. To appropriately track if this behavior is being triggered or not, however, we need to use a vector with namespace and name as labels. This creates an unbound time series (m namespaces * n BuildConfig objects per namespace).

One alternative we discussed is foregoing the alert completely, and rely only on the event.

Are we abusing the alert system in this case?

Comment on lines 75 to 84
// MarkImageChangeTriggerCleared sets the openshift_build_image_change_trigger_cleared gague value.
// A value of 1 indicates the LastTriggeredImageID field on a BuildConfig has been reset to empty.
func MarkImageChangeTriggerCleared(namespace, name string, cleared bool) {
value := float64(0)
if cleared {
value = float64(1)
}
triggerClearedCounter.WithLabelValues(namespace, name).Set(value)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we set the metric value to 0 or 1 - eventually every BuildConfig in the cluster will have a value set.

Comment on lines +205 to +210
for _, currentTrigger := range cur.Spec.Triggers {
if currentTrigger.ImageChange == nil {
continue
}
from := c.getImageChangeTriggerInputReference(cur, currentTrigger)
prev, found := imageLastTriggers[*from]
if found && len(prev) > 0 && len(currentTrigger.ImageChange.LastTriggeredImageID) == 0 {
return true
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of the logic here was lifted from the ImageChange trigger controller code.

"ImageChangeTriggerCleared",
"spec.triggers[*].imagechange.lastTriggeredImageID was cleared, which will trigger a build. This behavior is deprecated and will be removed in a future OpenShift release.")
} else {
prometheus.MarkImageChangeTriggerCleared(bc.Namespace, bc.Name, false)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can be a bit more intelligent and only mark cleared to false if:

  1. BuildConfig has a "valid" ImageChange trigger (unchanged, updates from empty -> sha, or sha -> sha)
  2. BuildConfig previously had an ImageChange trigger, but it was removed.

In a real cluster, this would significantly cut down the time series size.

Comment on lines +183 to +185
c.recorder.Event(bc,
corev1.EventTypeWarning,
"ImageChangeTriggerCleared",
"spec.triggers[*].imagechange.lastTriggeredImageID was cleared, which will trigger a build. This behavior is deprecated and will be removed in a future OpenShift release.")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Record a warning event. This will bubble up to the web console and reference the BuildConfig.

Copy link

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a monitoring standpoint, I agree with @lilic that the use case is better suited for events rather than metrics given the unbounded cardinality.

@adambkaplan adambkaplan changed the title WIP - BUILD-187: Detect if image trigger ID was cleared BUILD-187: Detect if image trigger ID was cleared Dec 3, 2020
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 3, 2020
@adambkaplan
Copy link
Contributor Author

/hold

Will add an e2e test in origin before merging.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 3, 2020
@adambkaplan
Copy link
Contributor Author

Event as it appears in the admin console
build-trigger-cleared-event

@adambkaplan
Copy link
Contributor Author

PR to verify the behavior: openshift/origin#25739

@adambkaplan
Copy link
Contributor Author

/hold cancel

e2e test is in origin

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 4, 2020
Copy link
Contributor

@gabemontero gabemontero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple of minor comments/suggestions

}
from := c.getImageChangeTriggerInputReference(cur, currentTrigger)
prev, found := imageLastTriggers[*from]
if found && len(prev) > 0 && len(currentTrigger.ImageChange.LastTriggeredImageID) == 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we care if the devops user set it to some non-zero "dummy" value vs. clearing it out

i.e. alert that they should not be setting it in any fashion

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we care less about this scenario, but is something we can check. If the current is an invalid image ref, we have a pretty good idea that the image trigger controller didn't do this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I don't think we can reliably do this other than with the empty string. "foo" is a perfectly valid pull spec - otherwise would would need to assume that the image trigger always adds the full SHA reference to this field.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the validity of the image ref was not a much a thought for me as the fact that they really should not be setting it at all IMO

that said, upon further review, it could get kludgy at best figuring out if a user set it vs. the controller set it

that, and don't feel strongly about it really anyway ... so let's move on .... an edge case at most, if a concern at all, so we can wait for it to crop up in a customer case before worrying about it at most

Detect if the lastTriggeredImageID was cleared in a BuildConfig.
If an update cleared the last triggered image, record a warning event.

A metric and associated alert were considered to better inform users
that deprecated behavior is being triggered. However, this was rejected
because the metric would require unbound time series cardinality.

The behavior that triggers builds off of `spec` will be removed in BUILD-188 [1].

[1] https://issues.redhat.com/browse/BUILD-188
@gabemontero
Copy link
Contributor

/lgtm

/retest

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 4, 2020
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adambkaplan, gabemontero

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 6290112 into openshift:master Dec 5, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants