Targeted edge blocking #663
Conversation
Force-pushed from b5f1d1d to facfbe2.
This is ART's default for OpenShift 4.10 [1], and we need [1.16's new `io/fs` package][2] to avoid [3]:

    INFO[2021-09-24T19:45:46Z] vendor/k8s.io/client-go/plugin/pkg/client/auth/exec/metrics.go:21:2: cannot find package "." in:
        /go/src/github.com/openshift/cluster-version-operator/vendor/io/fs

for the 1.22 Kube bumps needed for targeted edge blocking [4].

[1]: https://github.com/openshift/ocp-build-data/blob/f14ea97fcb9893c325f12bed9d9afb9cd2f10857/streams.yml#L31
[2]: https://golang.org/doc/go1.16#fs
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/663/pull-ci-openshift-cluster-version-operator-master-unit/1441488679963987968#1:build-log.txt%3A13
[4]: openshift#663
#665 landed.

/retest
Force-pushed from 488d17a to 7ba19cd.
Force-pushed from 7ba19cd to 7497d0f.
Force-pushed from b4e03e0 to d50479d.
Separated from the rest of the CVO stuff, because the insights folks might want to use this too [1].

[1]: openshift/enhancements#837
pkg/clusterconditions/cache: Add a cache wrapper for client-side throttling

Per [1]:

> Additionally, the operator will continually re-evaluate the blocking conditionals in conditionalUpdates and update conditionalUpdates[].risks accordingly. The timing of the evaluation and freshness are largely internal details, but to avoid consuming excessive monitoring resources and because the rules should be based on slowly-changing state, the operator will handle polling with the following restrictions:
>
> * The cluster-version operator will cache polling results for each query, so a single query which is used in evaluating multiple risks over multiple conditional update targets will only be evaluated once per round.
> * After evaluating a PromQL query, the cluster-version operator will wait at least 10 minutes before evaluating any PromQL. This delay will not be persisted between operator restarts, so a crash-looping CVO may result in higher PromQL load. But a crash-looping CVO will also cause the KubePodCrashLooping alert to fire, which will summon the cluster administrator.
> * After evaluating a PromQL query, the cluster-version operator will wait at least an hour before evaluating that PromQL query again.

That's what this commit sets up. The tests are a bit fiddly, since I wanted to exercise "I have so many queries that I'd like to run, and they're expiring before I can get through them all". I'm trying to show that if you give it enough tries, we won't consistently starve out surprisingly many conditions, even though in that overloaded case, someone is always getting starved out. Unlikely to happen in the wild, but the enhancement section is intentionally addressing "what if some malicious/misconfigured graph floods the CVO with PromQL suggestions?".

[1]: https://github.com/openshift/enhancements/blob/2cc2d9b331532c852878a7c793f3a754914c824e/enhancements/update/targeted-update-edge-blocking.md#cluster-version-operator-support-for-the-enhanced-schema
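A minimal sketch of that throttling cache in Go, assuming a simplified `Match(promql, now)` signature and lower-cased constant names; the real wrapper in pkg/clusterconditions/cache differs in its interface and plumbing:

```go
package cache

import (
	"fmt"
	"sync"
	"time"
)

const (
	// minBetweenMatches: minimum gap between evaluating any two queries.
	minBetweenMatches = 10 * time.Minute
	// minForCondition: minimum age before re-evaluating the same query.
	minForCondition = time.Hour
)

type result struct {
	match bool
	when  time.Time
}

// Cache is a simplified sketch of a client-side throttling wrapper around
// a PromQL evaluator (hypothetical shape, not the actual CVO type).
type Cache struct {
	mu       sync.Mutex
	lastEval time.Time
	results  map[string]result
	evaluate func(promql string) (bool, error) // wrapped evaluator (assumed shape)
}

// Match serves cached results while the throttles are in effect, and only
// hits the wrapped evaluator when both throttles allow it.
func (c *Cache) Match(promql string, now time.Time) (bool, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	cached, ok := c.results[promql]
	if ok && now.Sub(cached.when) < minForCondition {
		return cached.match, nil // this query was evaluated within the hour
	}
	if now.Sub(c.lastEval) < minBetweenMatches {
		if ok {
			return cached.match, nil // stale, but globally throttled; reuse
		}
		return false, fmt.Errorf("throttled: no cached result for %q yet", promql)
	}
	match, err := c.evaluate(promql)
	if err != nil {
		return false, err
	}
	c.lastEval = now
	if c.results == nil {
		c.results = map[string]result{}
	}
	c.results[promql] = result{match: match, when: now}
	return match, nil
}
```

The two throttles compose: a warm cache answers from memory, while a cold cache under the global throttle reports that it cannot answer yet, which is the starvation case the tests exercise.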
Force-pushed from 6dfc418 to ca186ed.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: LalatenduMohanty, wking.
e2e-agnostic seems to have hung mid-install for some reason after bootstrap-destroy. By the time the gather steps rolled around, it was …

/test e2e-agnostic
But by gather-time, things look ok:

```console
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/663/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic/1463030884557918208/artifacts/e2e-agnostic/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "network").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")' | sort
2021-11-23T06:52:08Z ManagementStateDegraded=False -: -
2021-11-23T06:52:08Z Upgradeable=True -: -
2021-11-23T06:53:45Z Available=True -: -
2021-11-23T07:22:46Z Degraded=False -: -
2021-11-23T07:22:46Z Progressing=False -: - Those 7:22:46 transitions are just after the 7:21:57 timeout. From the network operator's logs
Perhaps a slow node? I dunno. Trying again:

/test e2e-agnostic
Poking some more in that last failed run, the nodes were all Ready:

```console
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/663/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic/1463030884557918208/artifacts/e2e-agnostic/gather-extra/artifacts/nodes.json | jq -r '.items[] | .metadata.name as $n | .status.conditions[] | select(.type == "Ready") | .lastTransitionTime + " " + .state + " " + $n' | sort
2021-11-23T06:52:40Z ci-op-tnvjkz61-3302f-cx8rj-master-0
2021-11-23T06:52:48Z ci-op-tnvjkz61-3302f-cx8rj-master-1
2021-11-23T06:52:51Z ci-op-tnvjkz61-3302f-cx8rj-master-2
2021-11-23T07:05:24Z ci-op-tnvjkz61-3302f-cx8rj-worker-centralus1-8tnn7
2021-11-23T07:05:49Z ci-op-tnvjkz61-3302f-cx8rj-worker-centralus3-85dqc
2021-11-23T07:07:28Z ci-op-tnvjkz61-3302f-cx8rj-worker-centralus2-zm4wt
```

One of the DaemonSet pods was definitely slow to go Ready:

```console
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/663/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic/1463030884557918208/artifacts/e2e-agnostic/gather-extra/artifacts/pods.json | jq -r '.items[] | . as $p | select(.metadata.name | startswith("multus-additional-cni-plugins-")).status.conditions[] | select(.type == "Ready") | .lastTransitionTime + " " + .status + " " + $p.metadata.name + " " + $p.spec.nodeName' | sort
2021-11-23T06:53:08Z True multus-additional-cni-plugins-4lfnf ci-op-tnvjkz61-3302f-cx8rj-master-1
2021-11-23T06:53:13Z True multus-additional-cni-plugins-l27nd ci-op-tnvjkz61-3302f-cx8rj-master-2
2021-11-23T06:53:45Z True multus-additional-cni-plugins-g2nh7 ci-op-tnvjkz61-3302f-cx8rj-master-0
2021-11-23T07:06:47Z True multus-additional-cni-plugins-8vsc5 ci-op-tnvjkz61-3302f-cx8rj-worker-centralus1-8tnn7
2021-11-23T07:07:52Z True multus-additional-cni-plugins-sjs99 ci-op-tnvjkz61-3302f-cx8rj-worker-centralus2-zm4wt
2021-11-23T07:22:46Z True multus-additional-cni-plugins-zp4tg ci-op-tnvjkz61-3302f-cx8rj-worker-centralus3-85dqc
```

Digging into that slow pod:

```console
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/663/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic/1463030884557918208/artifacts/e2e-agnostic/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.name == "multus-additional-cni-plugins-zp4tg").status.initContainerStatuses[] | (.state.terminated | .startedAt + " " + .finishedAt) + " " + (.restartCount | tostring) + " " + .name'
2021-11-23T07:05:35Z 2021-11-23T07:05:35Z 0 egress-router-binary-copy
2021-11-23T07:05:44Z 2021-11-23T07:05:44Z 0 cni-plugins
2021-11-23T07:05:47Z 2021-11-23T07:05:47Z 0 bond-cni-plugin
2021-11-23T07:05:59Z 2021-11-23T07:05:59Z 0 routeoverride-cni
2021-11-23T07:22:44Z 2021-11-23T07:22:44Z 0 whereabouts-cni-bincopy
2021-11-23T07:22:44Z 2021-11-23T07:22:44Z 0 whereabouts-cni
```

That is a huge delay between routeoverride-cni finishing at 07:05:59 and whereabouts-cni-bincopy starting at 07:22:44. The events for that pod:

```console
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/663/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic/1463030884557918208/artifacts/e2e-agnostic/gather-extra/artifacts/events.json | jq -r '.items[] | select(tostring | contains("multus-additional-cni-plugins-zp4tg")) | .metadata.creationTimestamp + " " + .reason + ": " + .message'
...
2021-11-23T07:06:00Z Pulling: Pulling image "registry.build01.ci.openshift.org/ci-op-tnvjkz61/stable@sha256:b1418e6400569e5a31bd3708198fe2f0e202d2084001d1bcafe79d724fff3483"
2021-11-23T07:22:26Z Failed: Failed to pull image "registry.build01.ci.openshift.org/ci-op-tnvjkz61/stable@sha256:b1418e6400569e5a31bd3708198fe2f0e202d2084001d1bcafe79d724fff3483": rpc error: code = Unknown desc = reading blob sha256:35a67cc5ac632c9d7fa635dcbde51c7ca1d002f042a7e235ff696eed0382ee02: Get "https://build01-9hdwj-image-registry-us-east-1-nucqrmelsxtgndkbvchwdkw.s3.dualstack.us-east-1.amazonaws.com/docker/registry/v2/blobs/sha256/35/35a67cc5ac632c9d7fa635dcbde51c7ca1d002f042a7e235ff696eed0382ee02/data?...": read tcp 10.0.128.4:48550->52.217.198.250:443: read: connection timed out
2021-11-23T07:22:26Z Failed: Error: ErrImagePull
2021-11-23T07:22:27Z BackOff: Back-off pulling image "registry.build01.ci.openshift.org/ci-op-tnvjkz61/stable@sha256:b1418e6400569e5a31bd3708198fe2f0e202d2084001d1bcafe79d724fff3483"
2021-11-23T07:22:27Z Failed: Error: ImagePullBackOff
2021-11-23T07:22:43Z Pulled: Successfully pulled image "registry.build01.ci.openshift.org/ci-op-tnvjkz61/stable@sha256:b1418e6400569e5a31bd3708198fe2f0e202d2084001d1bcafe79d724fff3483" in 5.151277654s
2021-11-23T07:22:44Z Created: Created container whereabouts-cni-bincopy
2021-11-23T07:22:44Z Started: Started container whereabouts-cni-bincopy
2021-11-23T07:22:44Z Pulled: Container image "registry.build01.ci.openshift.org/ci-op-tnvjkz61/stable@sha256:b1418e6400569e5a31bd3708198fe2f0e202d2084001d1bcafe79d724fff3483" already present on machine
2021-11-23T07:22:44Z Created: Created container whereabouts-cni
...
```
```go
if target == nil {
	return current, updates, nil, &Error{
		Reason:  "ResponseInvalid",
		Message: fmt.Sprintf("no node for conditional update %s", edge.To),
	}
}
```
If the target node is null, it seems `edge.To` would be null as well. If that's true, there seems to be no need to log it in the message.
`edge.To` might have a version string that just happens to not be listed in `nodes`. Not something that should exist in well-formed Cincinnati graph JSON, but still worth logging. Having `edge.To` be empty would also be a problem, and we could also complain about that without even looking through `nodes`. As this code stands, I guess we could have an empty `edge.To` and an entry in `nodes` with an empty-string `version` and maybe not get a `ResponseInvalid` complaint? Seems fair to open a bug if we aren't setting `ResponseInvalid` if `nodes[].version == ""` or unset today.
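A hedged sketch of the validation being discussed, with stand-in `node` and `Error` types rather than the actual CVO structs: reject empty versions on both sides, so an empty `edge.To` can never silently match an empty `nodes[].version`.

```go
package cincinnati

import "fmt"

// Minimal stand-ins for illustration; not the actual CVO types.
type node struct{ Version string }

type Error struct{ Reason, Message string }

// validateEdgeTarget sketches the suggestion above: complain with
// ResponseInvalid when edge.To is empty, when any node has an empty
// version, or when edge.To names a version missing from nodes.
func validateEdgeTarget(edgeTo string, nodes []node) (*node, *Error) {
	if edgeTo == "" {
		return nil, &Error{
			Reason:  "ResponseInvalid",
			Message: "conditional update edge has an empty target version",
		}
	}
	for i := range nodes {
		if nodes[i].Version == "" {
			return nil, &Error{
				Reason:  "ResponseInvalid",
				Message: "graph contains a node with an empty version",
			}
		}
		if nodes[i].Version == edgeTo {
			return &nodes[i], nil
		}
	}
	return nil, &Error{
		Reason:  "ResponseInvalid",
		Message: fmt.Sprintf("no node for conditional update %s", edgeTo),
	}
}
```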
pkg/cvo/availableupdates: Prioritize conditional risks for largest target version

When changing channels it's possible that multiple new conditional update risks will need to be evaluated. For instance, a cluster running 4.10.34 in a 4.10 channel might only have to evaluate OpenStackNodeCreationFails. But when the channel is changed to a 4.11 channel, multiple new risks require evaluation, and the evaluation of new risks was throttled at one every 10 minutes. This means if there are three new risks, it may take up to 20 minutes after the channel has changed for the full set of conditional updates to be computed.

With this commit, I'm sorting the conditional updates in version-descending order, which is the order we've used in the ClusterVersion status since c9dd479 (pkg/cvo/availableupdates: Sort (conditional)updates, 2021-09-29, openshift#663). This prioritizes the longest-hop risks. For example, 4.10.34 currently has the following updates:

* 4.10.(z!=38): no risks
* 4.10.38: OpenStackNodeCreationFails
* 4.11.(z<10): ARM64SecCompError524, AWSOldBootImagesLackAfterburn, MachineConfigRenderingChurn, OVNNetworkPolicyLongName
* 4.11.(10<=z<26): ARM64SecCompError524, AWSOldBootImagesLackAfterburn, MachineConfigRenderingChurn
* 4.11.26: ARM64SecCompError524, AWSOldBootImagesLackAfterburn
* 4.11.(27<=z<...): AWSOldBootImagesLackAfterburn

By focusing on the largest target (say 4.11.30), we'd evaluate AWSOldBootImagesLackAfterburn first. If it did not match the current cluster, 4.11.27 and later would be quickly recommended. It would take another 10m before the self-throttling allowed us to evaluate ARM64SecCompError524, and once we had, that would unblock 4.11.26. Ten minutes after that, we'd evaluate MachineConfigRenderingChurn, and unblock 4.11.(10<=z<26). And so on. But folks on 4.10.34 today are much more likely to be interested in 4.11.30 and other tip releases than they are to care about 4.11.10 and other relatively old releases.
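A small sketch of the version-descending sort, assuming the blang/semver/v4 library and a stand-in struct (the real code sorts the ClusterVersion conditional-update slice):

```go
package main

import (
	"fmt"
	"sort"

	"github.com/blang/semver/v4"
)

// conditionalUpdate is a stand-in for the real ClusterVersion type.
type conditionalUpdate struct {
	Version string
	Risks   []string
}

// sortDescending orders conditional-update targets newest-first, so the
// longest-hop risks are the first to clear the evaluation throttle.
func sortDescending(updates []conditionalUpdate) {
	sort.Slice(updates, func(i, j int) bool {
		vi := semver.MustParse(updates[i].Version)
		vj := semver.MustParse(updates[j].Version)
		return vi.GT(vj)
	})
}

func main() {
	updates := []conditionalUpdate{
		{Version: "4.11.26", Risks: []string{"ARM64SecCompError524", "AWSOldBootImagesLackAfterburn"}},
		{Version: "4.10.38", Risks: []string{"OpenStackNodeCreationFails"}},
		{Version: "4.11.30", Risks: []string{"AWSOldBootImagesLackAfterburn"}},
	}
	sortDescending(updates)
	for _, u := range updates {
		fmt.Println(u.Version, u.Risks) // 4.11.30, then 4.11.26, then 4.10.38
	}
}
```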
530a509 (pkg/cvo/availableupdates: Prioritize conditional risks for largest target version, 2023-03-06, openshift#909) prioritized the order in which risks were evaluated. But we were still waiting 10 minutes between different PromQL evaluations while evaluating conditional update risks. The original 10m requirement is from the enhancement [1], and was implemented in ca186ed (pkg/clusterconditions/cache: Add a cache wrapper for client-side throttling, 2021-11-10, openshift#663). But discussing with Lala, Scott, and Ben, we feel like addressing the demonstrated user-experience need for low-latency risk evaluation [2] is worth reducing the throttling to 1s per expression evaluation. We still have MinForCondition set to an hour, so with this commit, a cluster-version operator evaluating three risks will move from a timeline like:

1. 0s, hear about risks that depend on PromQL A, B, and C. Evaluate A for the first time.
2. 10m, evaluate B for the first time (MinBetweenMatches after 1).
3. 20m, evaluate C for the first time (MinBetweenMatches after 2).
4. 1h, evaluate A again (MinForCondition after 1, also well past MinBetweenMatches after 3).
5. 1h10m, evaluate B again (MinForCondition after 2 and MinBetweenMatches after 4).
6. 1h20m, evaluate C again (MinForCondition after 3 and MinBetweenMatches after 5).
7. 2h, evaluate A again (MinForCondition after 4, also well past MinBetweenMatches after 6).
8. 2h10m, evaluate B again (MinForCondition after 5 and MinBetweenMatches after 7).
9. 2h20m, evaluate C again (MinForCondition after 6 and MinBetweenMatches after 8).

to a timeline like:

1. 0s, hear about risks that depend on PromQL A, B, and C. Evaluate A for the first time.
2. 1s, evaluate B for the first time (MinBetweenMatches after 1).
3. 2s, evaluate C for the first time (MinBetweenMatches after 2).
4. 1h, evaluate A again (MinForCondition after 1, also well past MinBetweenMatches after 3).
5. 1h1s, evaluate B again (MinForCondition after 2 and MinBetweenMatches after 4).
6. 1h2s, evaluate C again (MinForCondition after 3 and MinBetweenMatches after 5).
7. 2h, evaluate A again (MinForCondition after 4, also well past MinBetweenMatches after 6).
8. 2h1s, evaluate B again (MinForCondition after 5 and MinBetweenMatches after 7).
9. 2h2s, evaluate C again (MinForCondition after 6 and MinBetweenMatches after 8).

We could deliver faster cache warming while preserving spaced-out refresh evaluation by splitting MinBetweenMatches into a 1s MinBetweenMatchesInitial and a 10m MinBetweenMatchesWhenCached, which would produce timelines like:

1. 0s, hear about risks that depend on PromQL A, B, and C. Evaluate A for the first time.
2. 1s, evaluate B for the first time (MinBetweenMatchesInitial after 1).
3. 2s, evaluate C for the first time (MinBetweenMatchesInitial after 2).
4. 1h, evaluate A again (MinForCondition after 1, also well past MinBetweenMatchesWhenCached after 3).
5. 1h10m, evaluate B again (MinForCondition after 2 and MinBetweenMatchesWhenCached after 4).
6. 1h20m, evaluate C again (MinForCondition after 3 and MinBetweenMatchesWhenCached after 5).
7. 2h, evaluate A again (MinForCondition after 4, also well past MinBetweenMatchesWhenCached after 6).
8. 2h10m, evaluate B again (MinForCondition after 5 and MinBetweenMatchesWhenCached after 7).
9. 2h20m, evaluate C again (MinForCondition after 6 and MinBetweenMatchesWhenCached after 8).

But again discussing with Lala, Scott, and Ben, the code complexity to deliver that distinction does not seem to be worth the protection it delivers to the PromQL engine. And really, PromQL engines concerned about load should harden themselves, including via Retry-After [3] headers that allow clients to back off gracefully when the service needs that, instead of relying on clients to guess about the load the service could handle and back off without insight into actual server capacity.

[1]: https://github.com/openshift/enhancements/blame/158111ce156aac7fa6063a47c00e129c13033aec/enhancements/update/targeted-update-edge-blocking.md#L323-L325
[2]: https://issues.redhat.com/browse/OCPBUGS-19512
[3]: https://www.rfc-editor.org/rfc/rfc9110#name-retry-after
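The resulting change is essentially one constant; a sketch of the after state, using the names from the commit message (the exact declaration site in pkg/clusterconditions/cache is assumed):

```go
package cache

import "time"

const (
	// MinForCondition: minimum age before re-evaluating the same PromQL
	// query; unchanged at one hour.
	MinForCondition = time.Hour
	// MinBetweenMatches: minimum gap between any two PromQL evaluations;
	// reduced by this commit from 10 * time.Minute to one second.
	MinBetweenMatches = time.Second
)
```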
Implementing openshift/enhancements#821 and openshift/enhancements#910.