lib/resourcebuilder: Replace wait-for with single-shot "is it alive now?" #400
Conversation
Can you split the unrelated changes out into smaller, separate PRs? I'd rather review those individually.
Split into #401.
/uncc @abhinavdahiya
cebb958 to 974f6e8
974f6e8 to c6afa88
Rebased onto master around #401 (and dropping the rejected #402) with cebb958 -> c6afa88. #403 is still waiting on review, but it's maybe far enough away to avoid getting marked as a conflict due to overlapping context. And it will be easy enough to rebase this PR or #403 depending on which lands first, if GitHub decides there is a conflict.
/assign @smarterclayton |
c6afa88 to 507d91b
/assign @jottofar |
507d91b to 100231e
lib/resourcebuilder: Replace wait-for with single-shot "is it alive now?"

We've had 'if updated' guards around waitFor*Completion since the library landed in 2d334c2 (lib: add resource builder that allows Do on any lib.Manifest, 2018-08-20, openshift#10). But only waiting when 'updated' is true is a weak block, because if/when we fail to complete, Task.Run will back off and call builder.Apply again. That new Apply will see the already-updated object, set 'updated' false, and not wait. So whether we block or not is orthogonal to 'updated'; nobody cares about whether the most recent update happened in this builder.Apply, this sync cycle, or a previous cycle.

We don't even care all that much about whether the Deployment, DaemonSet, CustomResourceDefinition, or Job succeeded. Most feedback is going to come from the ClusterOperator, so with this commit we continue past the resource wait-for unless the resource is really hurting, in which case we fail immediately (inside builder.Apply, Task.Run will still hit us a few times) to bubble that up. In situations where we don't see anything too terrible going on, we'll continue on past and later block on ClusterOperator not being ready.

There's no object status for CRDs or DaemonSets that marks "we are really hurting". The v1.18.0 Kubernetes CRD and DaemonSet controllers do not set any conditions in their operand status (although the API for those conditions exists [1,2]). With this commit, we have very minimal wait logic for either. Sufficiently unhealthy DaemonSet should be reported on via their associated ClusterOperator, and sufficiently unhealthy CRD should be reported on when we fail to push any custom resources consuming them (Task.Run retries will give the API server time to ready itself after accepting a CRD update before the CVO fails its sync cycle).

We still need the public WaitForJobCompletion, because fetchUpdatePayloadToDir uses it to wait on the release download. Also expand "iff" -> "if and only if" while I'm touching that line, at Jack's suggestion [3].

[1]: https://github.com/kubernetes/api/blob/v0.18.0/apps/v1/types.go#L586-L590
[2]: https://github.com/kubernetes/apiextensions-apiserver/blob/v0.18.0/pkg/apis/apiextensions/types.go#L319-L320
[3]: openshift#400 (comment)
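As a rough illustration of what such a single-shot check looks like in Go (the function name and client wiring below are assumptions for the sketch, not the PR's actual code): fetch the object once, fail only when it is clearly hurting, and otherwise return so the sync can move on and the associated ClusterOperator can do the detailed reporting.

```go
import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	appsclientv1 "k8s.io/client-go/kubernetes/typed/apps/v1"
)

// checkDeploymentHealth is a sketch of a single-shot "is it alive now?" check.
func checkDeploymentHealth(ctx context.Context, client appsclientv1.DeploymentsGetter, required *appsv1.Deployment) error {
	actual, err := client.Deployments(required.Namespace).Get(ctx, required.Name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	for _, condition := range actual.Status.Conditions {
		// ReplicaFailure means the controller cannot create or delete pods; that
		// counts as "really hurting", so fail immediately and let Task.Run's
		// retries and ClusterVersion conditions surface it.
		if condition.Type == appsv1.DeploymentReplicaFailure && condition.Status == corev1.ConditionTrue {
			return fmt.Errorf("deployment %s/%s has a replica failure: %s", actual.Namespace, actual.Name, condition.Message)
		}
	}
	// Anything milder (still progressing, not yet available) is tolerated here;
	// the associated ClusterOperator is expected to report on it.
	return nil
}
```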
100231e to c2af13b
pkg/cvo/internal/operatorstatus: Replace wait-for with single-shot "is it alive now?"

Like cc9292a (lib/resourcebuilder: Replace wait-for with single-shot "is it alive now?", 2020-07-07, openshift#400), but for ClusterOperator. In that commit message, I'd explained that we'd continue on past imperfect (e.g. Progressing=True) but still happy-enough Deployments and such and later block on the ClusterOperator. But there's really no need to poll the ClusterOperator while we're blocked on it; we can instead fail that sync loop, take the short cool-off break, and come in again with a new pass at the sync cycle.

This will help reduce delays like [1], where a good chunk of the ~5.5 minute delay was waiting for the network ClusterOperator to become happy. If instead we give up on that sync cycle and start in again with a fresh sync cycle, we would have been more likely to recreate the ServiceMonitors CRD more quickly.

The downside is that we will now complain about unavailable, degraded, and unleveled operators more quickly, without giving them time to become happy before complaining in ClusterVersion's status. This is mitigated by most of the returned errors being UpdateEffectNone, which we will render as "waiting on..." Progressing messages. The exceptions are mostly unavailable, which is a serious enough condition that I'm fine complaining about it aggressively, and degraded, which has the fail-after-interval guard keeping us from complaining about it too aggressively.

I'm also shifting the "does not declare expected versions" check before the Get call. Its goal is still to ensure that we fail in CI before shipping an operator with such a manifest, and there's no point in firing off an API GET before failing on a guard that only needs local information.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1927168#c2
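A hedged sketch of the ClusterOperator side of that idea (the function name is illustrative, and the real check in the CVO carries more detail, e.g. version comparisons): inspect the conditions once and return an error for the next sync cycle to retry, instead of polling in place.

```go
import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
)

// checkClusterOperatorHealth is a sketch of a single-shot ClusterOperator check.
// Instead of polling until the operator levels, return an error and let the
// next sync cycle come back around.
func checkClusterOperatorHealth(co *configv1.ClusterOperator) error {
	available, degraded := false, false
	for _, condition := range co.Status.Conditions {
		switch {
		case condition.Type == configv1.OperatorAvailable && condition.Status == configv1.ConditionTrue:
			available = true
		case condition.Type == configv1.OperatorDegraded && condition.Status == configv1.ConditionTrue:
			degraded = true
		}
	}
	if !available {
		return fmt.Errorf("cluster operator %s is not available", co.Name)
	}
	if degraded {
		return fmt.Errorf("cluster operator %s is degraded", co.Name)
	}
	return nil
}
```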
Things like API-server connection errors and patch conflicts deserve some retries before we bubble them up into ClusterVersion conditions. But when we are able to retrieve in-cluster objects and determine that they are not happy, we should exit more quickly so we can complain about the resource state and start in on the next sync cycle. For example, see the recent e02d148 (pkg/cvo/internal/operatorstatus: Replace wait-for with single-shot "is it alive now?", 2021-05-12, openshift#560) and the older cc9292a (lib/resourcebuilder: Replace wait-for with single-shot "is it alive now?", 2020-07-07, openshift#400). This commit uses the presence of an UpdateError as a marker for "fail fast; no need to retry".

The install-time backoff is from fee2d06 (sync: Completely parallelize the initial payload, 2019-03-11, openshift#136). I'm not sure if it really wants the same cap as reconcile and update modes, but I've left them the same for now, and future commits to pivot the backoff settings can focus on motivating those pivots.

I'd tried dropping:

    backoff := st.Backoff

and passing st.Backoff directly to ExponentialBackoffWithContext, but it turns out that Step() [1] mutates the provided Backoff to update its Steps and Duration. Luckily, Backoff has no pointer properties, so storing as a local variable is sufficient to give us a fresh copy for the local mutations.

[1]: https://pkg.go.dev/k8s.io/apimachinery/pkg/util/wait#Backoff.Step
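A minimal sketch of those two mechanics under assumed names (updateError and runWithBackoff are stand-ins, not the CVO's actual types or helpers): copy the shared wait.Backoff by value because Step() mutates its receiver, and stop retrying as soon as the error is the fail-fast kind.

```go
import (
	"context"
	"errors"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// updateError stands in for the CVO's UpdateError: its presence means "the
// cluster told us something is wrong; retrying here will not help".
type updateError struct{ msg string }

func (e *updateError) Error() string { return e.msg }

func runWithBackoff(ctx context.Context, shared wait.Backoff, attempt func(context.Context) error) error {
	// Copy by value: Step() mutates its receiver's Steps and Duration, and
	// Backoff has no pointer fields, so a plain copy leaves the shared
	// template intact for the next sync cycle.
	backoff := shared

	for {
		err := attempt(ctx)
		if err == nil {
			return nil
		}
		var ue *updateError
		if errors.As(err, &ue) {
			return err // the cluster is telling us it is unhappy; fail fast
		}
		if backoff.Steps <= 1 {
			return err // out of retries for this sync cycle
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff.Step()):
		}
	}
}
```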
I'd dropped this in cc9292a (lib/resourcebuilder: Replace wait-for with single-shot "is it alive now?", 2020-07-30, openshift#400), claiming:

    There's no object status for CRDs or DaemonSets that marks "we are really
    hurting". The v1.18.0 Kubernetes CRD and DaemonSet controllers do not set
    any conditions in their operand status (although the API for those
    conditions exists [2,3]). With this commit, we have very minimal wait logic
    for either. Sufficiently unhealthy DaemonSet should be reported on via
    their associated ClusterOperator, and sufficiently unhealthy CRD should be
    reported on when we fail to push any custom resources consuming them
    (Task.Run retries will give the API server time to ready itself after
    accepting a CRD update before the CVO fails its sync cycle).

But from [1]:

> It might take a few seconds for the endpoint to be created. You can watch the Established condition of your CustomResourceDefinition to be true or watch the discovery information of the API server for your resource to show up.

So I was correct that we will hear about CRD issues when we fail to push a dependent custom resource. But I was not correct in claiming that the CRD controller set no conditions. I was probably confused by e8ffccb (lib: Add autogeneration for some resource* functionality, 2020-07-29, openshift#420), which broke the health-check inputs as described in 002591d (lib/resourcebuilder: Use actual resource in check*Health calls, 2022-05-03, openshift#771). The code I removed in cc9292a was in fact looking at the Established condition already. This commit restores the Established check, but without the previous PollImmediateUntil wait.

[1]: https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#create-a-customresourcedefinition
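The restored check amounts to something like this sketch (illustrative name, presumably wired into the CRD builder's Apply): a single pass over the CRD's status conditions, erroring when Established is not true, with no polling loop.

```go
import (
	"fmt"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
)

// checkCRDEstablished is a sketch of a single-shot Established check for a
// CustomResourceDefinition.
func checkCRDEstablished(crd *apiextensionsv1.CustomResourceDefinition) error {
	for _, condition := range crd.Status.Conditions {
		if condition.Type == apiextensionsv1.Established {
			if condition.Status == apiextensionsv1.ConditionTrue {
				return nil
			}
			return fmt.Errorf("CustomResourceDefinition %s is not established: %s: %s", crd.Name, condition.Reason, condition.Message)
		}
	}
	return fmt.Errorf("CustomResourceDefinition %s has no Established condition yet", crd.Name)
}
```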
We've had `if updated` guards around `waitFor*Completion` since the library landed in 2d334c2 (#10). But, only waiting when `updated` is true is a weak block, because if/when we fail to complete, `Task.Run` will back off and call `builder.Apply` again. That new `Apply` will see the already-updated object, set `updated` false, and not wait. So whether we block or not is orthogonal to `updated`; nobody cares about whether the most recent update happened in this `builder.Apply`, this sync cycle, or a previous cycle.

We don't even care all that much about whether the Deployment, DaemonSet, CustomResourceDefinition, or Job succeeded. Most feedback is going to come from the ClusterOperator, so with this commit we continue past the resource wait-for unless the resource is really hurting, in which case we fail immediately (inside `builder.Apply`, `Task.Run` will still hit us a few times) to bubble that up. In situations where we don't see anything too terrible going on, we'll continue on past and later block on ClusterOperator not being ready.

Other changes in this commit:

* `b.modifier(crd)` moves out of the switch, because that should happen regardless of the CRD version.
* `Task.Run` retries will give the API server time to ready itself after accepting a CRD update before the CVO fails its sync cycle.
* `deploymentBuilder` and `daemonsetBuilder` grow `mode` properties (sketched below). They had been using `actual.Generation > 1` as a proxy for "post-install" since 14fab0b (add generic 2-way merge handler, #26), but generation 1 is just "we haven't changed the object since it was created", not "we're installing a fresh cluster". For example, a new Deployment or DaemonSet could be added as part of a cluster update, and we don't want the special install-time "we don't care about specific manifest failures" treatment then.
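For that last bullet, here is a small sketch of what an explicit mode buys over the Generation heuristic (the type and constant names are assumptions, not necessarily the repository's): the caller states whether this is a fresh install, so a brand-new object created mid-update no longer gets install-time leniency.

```go
// Sketch with assumed names: an explicit mode replaces actual.Generation > 1
// as the "are we installing a fresh cluster?" signal.
type Mode int

const (
	InitializingMode Mode = iota // fresh-cluster install
	UpdatingMode                 // cluster update
	ReconcilingMode              // steady-state reconcile
)

type deploymentBuilder struct {
	mode Mode
	// ...other fields elided...
}

// tolerateFailures reports whether individual manifest failures get the
// lenient install-time treatment. A brand-new Deployment added during an
// update (whose Generation would still be 1) no longer qualifies.
func (b *deploymentBuilder) tolerateFailures() bool {
	return b.mode == InitializingMode
}
```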