
lib/resourcebuilder: Replace wait-for with single-shot "is it alive now?" #400

Merged

Conversation

@wking (Member) commented Jul 7, 2020

We've had if updated guards around waitFor*Completion since the library landed in 2d334c2 (#10). But only waiting when updated is true is a weak block, because if/when we fail to complete, Task.Run will back off and call builder.Apply again. That new Apply will see the already-updated object, set updated false, and not wait. So whether we block or not is orthogonal to updated; nobody cares about whether the most recent update happened in this builder.Apply, this sync cycle, or a previous cycle.

We don't even care all that much about whether the Deployment, DaemonSet, CustomResourceDefinition, or Job succeeded. Most feedback is going to come from the ClusterOperator, so with this commit we continue past the resource wait-for unless the resource is really hurting, in which case we fail immediately (inside builder.Apply, Task.Run will still hit us a few times) to bubble that up. In situations where we don't see anything too terrible going on, we'll continue on past and later block on ClusterOperator not being ready.
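
To make that shape concrete, here is a minimal sketch of a single-shot Deployment check along the lines described above; the function name and the particular conditions it treats as "really hurting" are illustrative, not this PR's exact code:

  package resourcebuilder

  import (
      "fmt"

      appsv1 "k8s.io/api/apps/v1"
      corev1 "k8s.io/api/core/v1"
  )

  // checkDeploymentHealth inspects the Deployment's status once and errors
  // only when the workload looks seriously broken; anything milder is left
  // for the ClusterOperator check later in the sync cycle.
  func checkDeploymentHealth(deployment *appsv1.Deployment) error {
      for _, condition := range deployment.Status.Conditions {
          if condition.Type == appsv1.DeploymentReplicaFailure && condition.Status == corev1.ConditionTrue {
              return fmt.Errorf("deployment %s/%s has ReplicaFailure=True: %s",
                  deployment.Namespace, deployment.Name, condition.Message)
          }
          if condition.Type == appsv1.DeploymentAvailable && condition.Status == corev1.ConditionFalse {
              return fmt.Errorf("deployment %s/%s is not Available: %s",
                  deployment.Namespace, deployment.Name, condition.Message)
          }
      }
      return nil // not obviously hurting; do not block this sync cycle
  }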

Other changes in this commit:

  • I've pulled b.modifier(crd) out of the switch, because that should happen regardless of the CRD version.
  • I've added an error for unrecognized CRD version, because that will be easier to debug than trying to figure out why a CRD manifest is being silently ignored by the CVO.
  • There's no object status for CRDs or DaemonSets that marks "we are really hurting". The v1.18.0 Kubernetes CRD and DaemonSet controllers do not set any conditions in their operand status (although the API for those conditions exists; see https://github.com/kubernetes/api/blob/v0.18.0/apps/v1/types.go#L586-L590 and https://github.com/kubernetes/apiextensions-apiserver/blob/v0.18.0/pkg/apis/apiextensions/types.go#L319-L320). With this commit, we have very minimal wait logic for either. Sufficiently unhealthy DaemonSets should be reported on via their associated ClusterOperators, and sufficiently unhealthy CRDs should be reported on when we fail to push any custom resources consuming them (Task.Run retries will give the API server time to ready itself after accepting a CRD update before the CVO fails its sync cycle).
  • deploymentBuilder and daemonsetBuilder grow mode properties (see the sketch after this list). They had been using actual.Generation > 1 as a proxy for "post-install" since 14fab0b (add generic 2-way merge handler. #26), but generation 1 just means "we haven't changed the object since it was created", not "we're installing a fresh cluster". For example, a new Deployment or DaemonSet could be added as part of a cluster update, and we don't want the special install-time "we don't care about specific manifest failures" handling in that case.
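
A rough sketch of what such a mode property could look like; the names loosely echo the resourcebuilder package but are illustrative rather than the PR's exact declarations:

  package resourcebuilder

  // Mode records why the CVO is applying a manifest, replacing the old
  // actual.Generation > 1 heuristic for "is this a fresh install?".
  type Mode int

  const (
      InitializingMode Mode = iota // initial payload push during installation
      UpdatingMode                 // moving an existing cluster to a new release
      ReconcilingMode              // steady-state reconciliation of the current release
  )

  // deploymentBuilder carries the mode alongside the manifest bytes it
  // applies, so health-check strictness can key off mode instead of the
  // object's Generation.
  type deploymentBuilder struct {
      raw  []byte
      mode Mode
  }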

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 7, 2020
@smarterclayton (Contributor)

Can you split the unrelated changes out to smaller separate PRs? Rather review those individually

@wking (Member, Author) commented Jul 8, 2020

> Can you split the unrelated changes out to smaller separate PRs?

Split into #401 (Generation > 1 -> mode), #402 (centralize b.modifier(crd)), and #403 (erroring on unrecognized CRD version). Once #401 and #402 land, I'll rebase this one on top. #403 may or may not count as a conflict; depends on how ornery GitHub is feeling ;).

@abhinavdahiya (Contributor)

> Can you split the unrelated changes out to smaller separate PRs? Rather review those individually

/uncc @abhinavdahiya

@openshift-ci-robot openshift-ci-robot removed the request for review from abhinavdahiya July 8, 2020 02:20
@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 12, 2020
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 20, 2020
@wking (Member, Author) commented Jul 20, 2020

> Can you split the unrelated changes out to smaller separate PRs?

Rebased onto master around #401 (and dropping the rejected #402) with cebb958 -> c6afa88. #403 is still waiting on review, but it's maybe far enough away to avoid getting marked as a conflict due to overlapping context. And it will be easy enough to rebase this PR or #403 depending on which lands first if GitHub decides there is a conflict.

@wking (Member, Author) commented Jul 20, 2020

/assign @smarterclayton

@wking (Member, Author) commented Jul 22, 2020

Failures look unrelated to this change.

@jottofar (Contributor)

/assign @jottofar

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 22, 2020
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 23, 2020
@wking (Member, Author) commented Jul 23, 2020

Rebased around #412 with 507d91b -> 100231e. No conflicts; I guess GitHub's conflict detector is just jumpier than Git's diff-context detector.

wking added a commit to wking/cluster-version-operator that referenced this pull request Jul 23, 2020
…ow?"

We've had 'if updated' guards around waitFor*Completion since the
library landed in 2d334c2 (lib: add resource builder that allows Do
on any lib.Manifest, 2018-08-20, openshift#10).  But, only waiting when
'updated' is true is a weak block, because if/when we fail to
complete, Task.Run will back off and call builder.Apply again.  That
new Apply will see the already-updated object, set 'updated' false,
and not wait.  So whether we block or not is orthogonal to 'updated';
nobody cares about whether the most recent update happened in this
builder.Apply, this sync cycle, or a previous cycle.

We don't even care all that much about whether the Deployment,
DaemonSet, CustomResourceDefinition, or Job succeeded.  Most feedback
is going to come from the ClusterOperator, so with this commit we
continue past the resource wait-for unless the resource is really
hurting, in which case we fail immediately (inside builder.Apply,
Task.Run will still hit us a few times) to bubble that up.  In
situations where we don't see anything too terrible going on, we'll
continue on past and later block on ClusterOperator not being ready.

There's no object status for CRDs or DaemonSets that marks "we are
really hurting".  The v1.18.0 Kubernetes CRD and DaemonSet controllers
do not set any conditions in their operand status (although the API
for those conditions exists [1,2]).  With this commit, we have very
minimal wait logic for either.  Sufficiently unhealthy DaemonSet
should be reported on via their associated ClusterOperator, and
sufficiently unhealthy CRD should be reported on when we fail to push
any custom resources consuming them (Task.Run retries will give the
API server time to ready itself after accepting a CRD update before
the CVO fails its sync cycle).

We still need the public WaitForJobCompletion, because
fetchUpdatePayloadToDir uses it to wait on the release download.

Also expand "iff" -> "if and only if" while I'm touching that line, at
Jack's suggestion [3].

[1]: https://github.com/kubernetes/api/blob/v0.18.0/apps/v1/types.go#L586-L590
[2]: https://github.com/kubernetes/apiextensions-apiserver/blob/v0.18.0/pkg/apis/apiextensions/types.go#L319-L320
[3]: openshift#400 (comment)
@openshift-merge-robot openshift-merge-robot merged commit 7141c36 into openshift:master Jul 31, 2020
@wking wking deleted the drop-most-manifest-waits branch July 31, 2020 00:16
wking added a commit to wking/cluster-version-operator that referenced this pull request May 13, 2021
…s it alive now?"

Like cc9292a (lib/resourcebuilder: Replace wait-for with
single-shot "is it alive now?", 2020-07-07, openshift#400), but for
ClusterOperator.  In that commit message, I'd explain that we'd
continue on past imperfect (e.g. Progressing=True) but still
happy-enough Deployments and such and later block on the
ClusterOperator.  But there's really no need to poll the
ClusterOperator while we're blocked on it; we can instead fail that
sync loop, take the short cool-off break, and come in again with a new
pass at the sync cycle.

This will help reduce delays like [1] where a good chunk of the ~5.5
minute delay was waiting for the network ClusterOperator to become
happy.  If instead we give up on that sync cycle and start in again
with a fresh sync cycle, we would have been more likely to recreate
the ServiceMonitors CRD more quickly.

The downside is that we will now complain about unavailable, degraded,
and unleveled operators more quickly, without giving them time to
become happy before complaining in ClusterVersion's status.  This is
mitigated by most of the returned errors being UpdateEffectNone, which
we will render as "waiting on..." Progressing messages.  The
exceptions are mostly unavailable, which is a serious enough condition
that I'm fine complaining about it aggressively, and degraded, which
has the fail-after-interval guard keeping us from complaining about it
too aggressively.

I'm also shifting the "does not declare expected versions" check
before the Get call.  Its goal is still to ensure that we fail in CI
before shipping an operator with such a manifest, and there's no point
in firing off an API GET before failing on a guard that only needs
local information.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1927168#c2
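
(For illustration of "fail the sync loop instead of polling": a minimal single-shot ClusterOperator condition check along those lines. The function name is hypothetical, and the real code also classifies each failure, e.g. as UpdateEffectNone, rather than returning plain errors.)

  package internal

  import (
      "fmt"

      configv1 "github.com/openshift/api/config/v1"
  )

  // checkClusterOperatorOnce looks at the operator's conditions a single
  // time and returns an error if it is unavailable, degraded, or still
  // progressing; the caller fails the sync cycle instead of polling here.
  func checkClusterOperatorOnce(co *configv1.ClusterOperator) error {
      for _, condition := range co.Status.Conditions {
          switch {
          case condition.Type == configv1.OperatorAvailable && condition.Status == configv1.ConditionFalse:
              return fmt.Errorf("clusteroperator %s is not available: %s", co.Name, condition.Message)
          case condition.Type == configv1.OperatorDegraded && condition.Status == configv1.ConditionTrue:
              return fmt.Errorf("clusteroperator %s is degraded: %s", co.Name, condition.Message)
          case condition.Type == configv1.OperatorProgressing && condition.Status == configv1.ConditionTrue:
              return fmt.Errorf("clusteroperator %s is still progressing: %s", co.Name, condition.Message)
          }
      }
      return nil
  }
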
wking added a commit to wking/cluster-version-operator that referenced this pull request May 24, 2021
Things like API-server connection errors and patch conflicts deserve
some retries before we bubble them up into ClusterVersion conditions.
But when we are able to retrieve in-cluster objects and determine that
they are not happy, we should exit more quickly so we can complain
about the resource state and start in on the next sync cycle.  For
example, see the recent e02d148 (pkg/cvo/internal/operatorstatus:
Replace wait-for with single-shot "is it alive now?", 2021-05-12, openshift#560)
and the older cc9292a (lib/resourcebuilder: Replace wait-for with
single-shot "is it alive now?", 2020-07-07, openshift#400).  This commit uses
the presence of an UpdateError as a marker for "fail fast; no need to
retry".

The install-time backoff is from fee2d06 (sync: Completely
parallelize the initial payload, 2019-03-11, openshift#136).  I'm not sure if
it really wants the same cap as reconcile and update modes, but I've
left them the same for now, and future commits to pivot the backoff
settings can focus on motivating those pivots.
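
(For illustration of the fail-fast marker: UpdateError is the structured error type named in the commit message, but the retry helper below, its signature, and its backoff handling are an illustrative stand-in, not the CVO's actual code.)

  package payload

  import (
      "context"
      "errors"
      "fmt"
      "time"

      "k8s.io/apimachinery/pkg/util/wait"
  )

  // UpdateError stands in for the CVO's structured sync error: its presence
  // means the cluster reported a real problem, not a transient hiccup.
  type UpdateError struct {
      Reason  string
      Message string
  }

  func (e *UpdateError) Error() string {
      return fmt.Sprintf("%s: %s", e.Reason, e.Message)
  }

  // runWithRetries retries transient failures (connection errors, conflicts)
  // with exponential backoff, but returns immediately on an UpdateError so
  // the outer sync loop can report it and start a fresh cycle.
  func runWithRetries(ctx context.Context, backoff wait.Backoff, fn func(context.Context) error) error {
      for {
          err := fn(ctx)
          if err == nil {
              return nil
          }
          var updateErr *UpdateError
          if errors.As(err, &updateErr) {
              return err // fail fast; no need to retry
          }
          if backoff.Steps <= 1 {
              return err // out of retries for the transient error
          }
          select {
          case <-time.After(backoff.Step()):
          case <-ctx.Done():
              return err
          }
      }
  }
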
wking added a commit to wking/cluster-version-operator that referenced this pull request May 25, 2021
Things like API-server connection errors and patch conflicts deserve
some retries before we bubble them up into ClusterVersion conditions.
But when we are able to retrieve in-cluster objects and determine that
they are not happy, we should exit more quickly so we can complain
about the resource state and start in on the next sync cycle.  For
example, see the recent e02d148 (pkg/cvo/internal/operatorstatus:
Replace wait-for with single-shot "is it alive now?", 2021-05-12, openshift#560)
and the older cc9292a (lib/resourcebuilder: Replace wait-for with
single-shot "is it alive now?", 2020-07-07, openshift#400).  This commit uses
the presence of an UpdateError as a marker for "fail fast; no need to
retry".

The install-time backoff is from fee2d06 (sync: Completely
parallelize the initial payload, 2019-03-11, openshift#136).  I'm not sure if
it really wants the same cap as reconcile and update modes, but I've
left them the same for now, and future commits to pivot the backoff
settings can focus on motivating those pivots.

I'd tried dropping:

  backoff := st.Backoff

and passing st.Backoff directly to ExponentialBackoffWithContext, but
it turns out that Step() [1]:

  ... mutates the provided Backoff to update its Steps and Duration.

Luckily, Backoff has no pointer properties, so storing as a local
variable is sufficient to give us a fresh copy for the local
mutations.

[1]: https://pkg.go.dev/k8s.io/apimachinery/pkg/util/wait#Backoff.Step
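
(A small standalone demonstration of that point; the st wrapper is just a stand-in for the task state that holds a Backoff template.)

  package main

  import (
      "fmt"
      "time"

      "k8s.io/apimachinery/pkg/util/wait"
  )

  func main() {
      st := struct{ Backoff wait.Backoff }{
          Backoff: wait.Backoff{Duration: time.Second, Factor: 2, Steps: 4},
      }

      // Copying by value is enough: Backoff has no pointer fields, so the
      // Step() calls below mutate only the local copy, not st.Backoff.
      backoff := st.Backoff
      for backoff.Steps > 0 {
          fmt.Println(backoff.Step()) // 1s, 2s, 4s, 8s (no jitter configured)
      }

      fmt.Println(st.Backoff.Steps) // still 4; the stored template is untouched
  }
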
wking added a commit to wking/cluster-version-operator that referenced this pull request May 2, 2022
I'd dropped this in cc9292a (lib/resourcebuilder: Replace wait-for
with single-shot "is it alive now?", 2020-07-30, openshift#400), claiming:

  There's no object status for CRDs or DaemonSets that marks "we are
  really hurting".  The v1.18.0 Kubernetes CRD and DaemonSet
  controllers do not set any conditions in their operand status
  (although the API for those conditions exists [2,3]).  With this
  commit, we have very minimal wait logic for either.  Sufficiently
  unhealthy DaemonSet should be reported on via their associated
  ClusterOperator, and sufficiently unhealthy CRD should be reported
  on when we fail to push any custom resources consuming them
  (Task.Run retries will give the API server time to ready itself
  after accepting a CRD update before the CVO fails its sync cycle).

But from [2]:

  > It might take a few seconds for the endpoint to be created. You
  > can watch the Established condition of your
  > CustomResourceDefinition to be true or watch the discovery
  > information of the API server for your resource to show up.

So I was correct that we will hear about CRD issues when we fail to
push a dependent custom resource.  But I was not correct in claiming
that the CRD controller set no conditions.  And the code I removed in
cc9292a was in fact looking at the Established condition already.

This commit restores the Established check, but without the previous
PollImmediateUntil wait.

[2]: https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#create-a-customresourcedefinition
wking added a commit to wking/cluster-version-operator that referenced this pull request May 4, 2022
I'd dropped this in cc9292a (lib/resourcebuilder: Replace wait-for
with single-shot "is it alive now?", 2020-07-30, openshift#400), claiming:

  There's no object status for CRDs or DaemonSets that marks "we are
  really hurting".  The v1.18.0 Kubernetes CRD and DaemonSet
  controllers do not set any conditions in their operand status
  (although the API for those conditions exists [2,3]).  With this
  commit, we have very minimal wait logic for either.  Sufficiently
  unhealthy DaemonSet should be reported on via their associated
  ClusterOperator, and sufficiently unhealthy CRD should be reported
  on when we fail to push any custom resources consuming them
  (Task.Run retries will give the API server time to ready itself
  after accepting a CRD update before the CVO fails its sync cycle).

But from [1]:

  > It might take a few seconds for the endpoint to be created. You
  > can watch the Established condition of your
  > CustomResourceDefinition to be true or watch the discovery
  > information of the API server for your resource to show up.

So I was correct that we will hear about CRD issues when we fail to
push a dependent custom resource.  But I was not correct in claiming
that the CRD controller set no conditions.  I was probably confused by
e8ffccb (lib: Add autogeneration for some resource* functionality,
2020-07-29, openshift#420), which broke the health-check inputs as described in
002591d (lib/resourcebuilder: Use actual resource in check*Health
calls, 2022-05-03, openshift#771).  The code I removed in cc9292a was in
fact looking at the Established condition already.

This commit restores the Established check, but without the previous
PollImmediateUntil wait.

[1]: https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#create-a-customresourcedefinition
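
(For illustration of "restore the Established check, but without the wait": a minimal single-shot check. The function name is hypothetical and the exact error wording is illustrative.)

  package resourcebuilder

  import (
      "fmt"

      apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
  )

  // checkCRDEstablished is a single-shot check of the Established condition,
  // with no PollImmediateUntil loop; Task.Run's retries provide the waiting.
  func checkCRDEstablished(crd *apiextensionsv1.CustomResourceDefinition) error {
      for _, condition := range crd.Status.Conditions {
          if condition.Type == apiextensionsv1.Established {
              if condition.Status == apiextensionsv1.ConditionTrue {
                  return nil
              }
              return fmt.Errorf("CustomResourceDefinition %s is not Established: %s",
                  crd.Name, condition.Message)
          }
      }
      return fmt.Errorf("CustomResourceDefinition %s has no Established condition yet", crd.Name)
  }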