
Bug 1843505: pkg/start: Release leader lease on graceful shutdown #424

Merged

Conversation


@wking wking commented Aug 4, 2020

So the incoming cluster-version operator doesn't need to wait for the outgoing operator's lease to expire, which can take a while:

I0802 10:06:01.056591       1 leaderelection.go:243] attempting to acquire leader lease  openshift-cluster-version/version...
...
I0802 10:07:42.632719       1 leaderelection.go:253] successfully acquired lease openshift-cluster-version/version

and time out the "Cluster did not acknowledge request to upgrade in a reasonable time" testcase. Using ReleaseOnCancel has been the plan since 2b81f47 (#87). I'm not clear on why it (sometimes?) doesn't work today.

The discrepancy between the "exit after 2s no matter what" comment and the 5s After dates back to dbedb7a (#179), which bumped the After from 2s to 5s, but forgot to bump the comment. I'm removing that code here in favor of the two-minute timeout from b30aa0e (#349). We still exit immediately on a second TERM, for folks who get impatient waiting for the graceful timeout.

I'm also pushing the Background() initialization all the way up to the command-line handler, to make it more obvious that the context is scoped to the whole start invocation.
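
For reference, the client-go wiring in question looks roughly like the following sketch; this is illustrative, not this PR's exact code (the runWithLease name and the timing values are made up), but it shows the ReleaseOnCancel knob that makes the elector blank its record on a graceful cancel so the incoming pod doesn't have to wait out the lease:

package example

import (
    "context"
    "time"

    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/leaderelection"
    "k8s.io/client-go/tools/leaderelection/resourcelock"
    "k8s.io/klog/v2"
)

// runWithLease runs the supplied function while holding the leader lease and
// releases the lease when ctx is canceled (ReleaseOnCancel), instead of
// letting it expire on its own.
func runWithLease(ctx context.Context, cfg *rest.Config, id string, run func(context.Context)) error {
    client, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        return err
    }
    lock, err := resourcelock.New(
        resourcelock.ConfigMapsResourceLock, // the integration test below expects a ConfigMap lock record
        "openshift-cluster-version", "version",
        client.CoreV1(), client.CoordinationV1(),
        resourcelock.ResourceLockConfig{Identity: id},
    )
    if err != nil {
        return err
    }
    leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
        Lock:            lock,
        ReleaseOnCancel: true,             // clear HolderIdentity on graceful shutdown
        LeaseDuration:   90 * time.Second, // illustrative timings, not the CVO's real settings
        RenewDeadline:   45 * time.Second,
        RetryPeriod:     15 * time.Second,
        Callbacks: leaderelection.LeaderCallbacks{
            OnStartedLeading: run,
            OnStoppedLeading: func() { klog.Info("Stopped leading; shutting down.") },
        },
    })
    return nil
}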

@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-low Referenced Bugzilla bug's severity is low for the branch this PR is targeting. label Aug 4, 2020
@openshift-ci-robot

@wking: This pull request references Bugzilla bug 1843505, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1843505: pkg/start: Release leader lease on graceful shutdown

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Aug 4, 2020
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 4, 2020
@wking wking force-pushed the gracefully-release-leader-lease branch 2 times, most recently from 6438ffc to 621d0cd Compare August 4, 2020 04:07
@wking
wking commented Aug 4, 2020

Tailing the CVO logs from the e2e-upgrade job (I grabbed the kubeconfig from the CI cluster console) gives:

I0804 05:00:54.430107       1 start.go:21] ClusterVersionOperator v1.0.0-337-g24893174-dirty
...
I0804 05:22:05.540840       1 start.go:172] Shutting down due to terminated
I0804 05:22:05.541090       1 task_graph.go:483] Canceled worker 3 while waiting for work
...
I0804 05:22:05.545382       1 task_graph.go:483] Canceled worker 0 while waiting for work
E0804 05:22:05.545604       1 metrics.go:160] Failed to gracefully shut down metrics server: accept tcp [::]:9099: use of closed network connection
...
I0804 05:22:05.558671       1 cvo.go:350] Shutting down ClusterVersionOperator
E0804 05:22:05.545078       1 task.go:81] error running apply for clusteroperator "machine-config" (510 of 608): Cluster operator machine-config is still updating
...
I0804 05:22:05.559098       1 task_graph.go:546] Result of work: [Cluster operator machine-config is still updating]
...
E0804 05:22:05.559401       1 sync_worker.go:339] unable to synchronize image (waiting 2m52.525702462s): Cluster operator machine-config is still updating
E0804 05:22:05.559623       1 start.go:221] close tcp [::]:9099: use of closed network connection
I0804 05:22:05.559722       1 start.go:238] Finished collecting operator goroutines.
I0804 05:22:05.559865       1 start.go:26] Graceful shutdown complete.

pkg/start/start.go: three outdated review threads (resolved)
@wking wking force-pushed the gracefully-release-leader-lease branch from 621d0cd to 235116d Compare August 4, 2020 21:27
@abhinavdahiya abhinavdahiya left a comment

LGTM

pkg/start/start.go: outdated review thread (resolved)
@wking wking force-pushed the gracefully-release-leader-lease branch from 235116d to a38f5e5 Compare August 4, 2020 23:17
@wking
wking commented Aug 4, 2020

integration fails with:

I0804 21:30:55.966164    9245 cvo.go:316] Starting ClusterVersionOperator with minimum reconcile period 3.469557764s
I0804 21:30:55.969344    9245 start.go:238] Finished collecting operator goroutines.
I0804 21:30:55.969433    9245 leaderelection.go:243] attempting to acquire leader lease  e2e-cvo-d6v6nv/e2e-cvo-d6v6nv...
I0804 21:30:55.969456    9245 start.go:206] Stopped leading; shutting down.
I0804 21:30:55.969705    9245 cvo.go:316] Starting ClusterVersionOperator with minimum reconcile period 3m37.636795318s
...
--- FAIL: TestIntegrationCVO_gracefulStepDown (60.04s)
    start_integration_test.go:542: the controller should create a lock record on a config map
    start_integration_test.go:563: timed out waiting for the condition
    start_integration_test.go:520: failed to delete cluster version e2e-cvo-d6v6nv: clusterversions.config.openshift.io "e2e-cvo-d6v6nv" not found

Because, when we were not launching a metrics server (which we don't for those integration tests), there was nothing blocking Options.run from an early exit. I've pushed 235116d -> a38f5e5 to get the leaderelection bit into the errorChannelCount collector, so now we will wait until it completes before exiting Options.run.
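
Roughly, the pattern described here looks like the following sketch (illustrative names, not the PR's code): every goroutine that run launches sends exactly one result on a shared error channel, and the collector counts how many sends to wait for, so once the leader-election goroutine is included in that count, run cannot return before it finishes.

package main

import (
    "context"
    "fmt"
)

// run launches each worker in its own goroutine and refuses to return until
// it has reaped one result per goroutine, so nothing (including the
// leader-election loop) is abandoned on exit.
func run(ctx context.Context, workers ...func(context.Context) error) error {
    errCh := make(chan error, len(workers))
    errorChannelCount := 0
    for _, w := range workers {
        errorChannelCount++
        w := w
        go func() { errCh <- w(ctx) }()
    }
    var firstErr error
    for i := 0; i < errorChannelCount; i++ {
        if err := <-errCh; err != nil && firstErr == nil {
            firstErr = err
        }
    }
    return firstErr
}

func main() {
    err := run(context.Background(),
        func(ctx context.Context) error { fmt.Println("metrics server done"); return nil },
        func(ctx context.Context) error { fmt.Println("leader election done"); return nil },
    )
    fmt.Println("run returned:", err)
}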

@wking wking force-pushed the gracefully-release-leader-lease branch from a38f5e5 to 6c4988e Compare August 4, 2020 23:45
@wking
wking commented Aug 5, 2020

Hrm:

I0804 23:49:24.651118    8781 start.go:222] Run context completed; beginning two-minute graceful shutdown period.
I0804 23:49:24.651174    8781 start.go:217] Waiting on 1 outstanding goroutines.
I0804 23:49:24.651184    8781 start.go:222] Run context completed; beginning two-minute graceful shutdown period.
I0804 23:49:24.651190    8781 start.go:217] Waiting on 1 outstanding goroutines.
...

/hold

until I figure out why we aren't breaking over to the shutting down branch.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 5, 2020
@wking wking force-pushed the gracefully-release-leader-lease branch from 6c4988e to d82e2bf Compare August 5, 2020 04:39
@wking
wking commented Aug 5, 2020

/hold cancel

I'd initialized the shutdownTimer variable inside the for loop, so it was getting cleared on each loop iteration 🤦. Moved it outside with 6c4988e -> d82e2bf.
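
For readers following along, the bug looks roughly like this sketch (illustrative names, not the PR's exact collector): the timer has to be declared outside the for loop; otherwise every iteration re-declares it as nil and the graceful-timeout case can never fire.

package example

import "time"

// collect reaps goroutine results until the graceful window after shutdown
// begins has expired (result counting elided here to keep the sketch short).
func collect(shuttingDown <-chan struct{}, results <-chan error) {
    // Declaring this inside the for loop (the bug) would reset it to nil on
    // every iteration, so the <-timeout case below could never fire.
    var shutdownTimer *time.Timer
    for {
        var timeout <-chan time.Time
        if shutdownTimer != nil {
            timeout = shutdownTimer.C
        }
        select {
        case <-shuttingDown:
            shuttingDown = nil // start the graceful window exactly once
            shutdownTimer = time.NewTimer(2 * time.Minute)
        case <-timeout:
            return // graceful window expired; give up on stragglers
        case <-results:
            // reap one goroutine result and keep looping
        }
    }
}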

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 5, 2020
@wking
wking commented Aug 5, 2020

Hrm, still not seeing a formal stopped leading release in there:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/424/pull-ci-openshift-cluster-version-operator-master-integration/1290870079050747904/build-log.txt |grep 'start\.go\|start_integration_test\.go\|leaderelection\|Starting ClusterVersionOperator with minimum reconcile period\|Shutting down ClusterVersionOperator'
I0805 04:41:53.258885    8994 cvo.go:316] Starting ClusterVersionOperator with minimum reconcile period 2m18.782310567s
I0805 04:41:53.263473    8994 cvo.go:316] Starting ClusterVersionOperator with minimum reconcile period 5.440919882s
I0805 04:41:53.263482    8994 start.go:218] Waiting on 1 outstanding goroutines.
I0805 04:41:53.263525    8994 leaderelection.go:243] attempting to acquire leader lease  e2e-cvo-wtq8nd/e2e-cvo-wtq8nd...
I0805 04:41:53.271143    8994 cvo.go:316] Starting ClusterVersionOperator with minimum reconcile period 2m35.17222288s
I0805 04:41:53.274128    8994 leaderelection.go:253] successfully acquired lease e2e-cvo-wtq8nd/e2e-cvo-wtq8nd
I0805 04:41:53.274609    8994 cvo.go:316] Starting ClusterVersionOperator with minimum reconcile period 2m52.525702462s
I0805 04:41:53.471957    8994 start.go:222] Run context completed; beginning two-minute graceful shutdown period.
I0805 04:41:53.471990    8994 start.go:218] Waiting on 1 outstanding goroutines.
I0805 04:41:53.472174    8994 cvo.go:350] Shutting down ClusterVersionOperator
I0805 04:41:53.476123    8994 start.go:208] Stopped leading; shutting down.
I0805 04:41:53.476158    8994 start.go:244] Finished collecting operator goroutines.
I0805 04:41:53.479381    8994 cvo.go:350] Shutting down ClusterVersionOperator
    start_integration_test.go:542: the controller should create a lock record on a config map
    start_integration_test.go:566: verify the controller writes a leadership change event
    start_integration_test.go:575: after the context is closed, the lock record should be deleted quickly
    start_integration_test.go:592: timed out waiting for the condition
    start_integration_test.go:520: failed to delete cluster version e2e-cvo-wtq8nd: clusterversions.config.openshift.io "e2e-cvo-wtq8nd" not found
I0805 04:42:11.511270    8994 cvo.go:350] Shutting down ClusterVersionOperator
I0805 04:42:25.909230    8994 cvo.go:350] Shutting down ClusterVersionOperator
I0805 04:42:25.909230    8994 cvo.go:350] Shutting down ClusterVersionOperator

We see the "Stopped leading; shutting down" from our OnStoppedLeading. For some reason, cvo.go's Operator.Run seems to have four "Starting ClusterVersionOperator with..." starts and five "Shutting down ClusterVersionOperator" stops. Maybe logs from other, parallel tests are getting mixed in there... And it would be nice if LeaderElector.release generated any logging, but it doesn't seem to.

    },
    OnStoppedLeading: func() {
        klog.Info("Stopped leading; shutting down.")
        runCancel()
Contributor

I think you still need to exit, don't you? How confident are you that this truly resets everything?

wking (Member Author)

How confident are you that this truly resets everything?

If it doesn't, CI should turn it up, and we'll fix those bugs ;)

Contributor

This doesn't sound like what I described on slack. If we lost lease, we exit immediately, no graceful step down. When we have lost our lease we should not be running.

wking (Member Author) commented Aug 6, 2020

If we lost lease, we exit immediately...

OnStoppedLeading is not just "lost lease", it is also "gracefully released lease". We could have logic here about checking postMainContext.Err() to guess about lost vs. released. But runCancel() should immediately block all cluster-object-writing activity, so I think this is sufficient as it stands to keep from fighting the new leader. For comparison, master gives a full 5s grace period after the cancel before forcing a hard exit. If reducing my current 2m grace period to 5s would make you happy with this line, I'm happy to make that change.

@smarterclayton

From chat:

Top loop has to know about two different paths and treat them differently:

  1. cancelled by interrupt - process needs to wait for cvo.Run to exit, then release lock, then exit
  2. cancelled by losing lease - process should shutdown as fast as possible (by closing the context and then racing to os.Exit())
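
A rough sketch of those two paths (hypothetical helper names; as discussed above, the PR as written keeps a single graceful cancel rather than racing to os.Exit on lease loss):

package example

import (
    "context"
    "os"
    "os/signal"
    "syscall"

    "k8s.io/klog/v2"
)

// shutdownPaths treats an interrupt and a lost lease differently: the former
// waits for the main goroutine and releases the lock, the latter bails out as
// fast as possible because we no longer own the lease.
func shutdownPaths(runCancel context.CancelFunc, lostLease, mainDone <-chan struct{}, releaseLock func()) {
    sigCh := make(chan os.Signal, 1)
    signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)

    select {
    case <-sigCh:
        // Path 1: interrupted. Stop the main work, wait for it to exit,
        // then voluntarily release the leader lock before returning.
        klog.Info("Interrupt received; beginning graceful shutdown.")
        runCancel()
        <-mainDone
        releaseLock()
    case <-lostLease:
        // Path 2: lost the lease. We must not keep writing to the cluster,
        // so cancel everything and race to exit.
        klog.Info("Lost leader lease; exiting immediately.")
        runCancel()
        os.Exit(1)
    }
}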

@wking wking force-pushed the gracefully-release-leader-lease branch 2 times, most recently from c5f88a1 to 122754f Compare August 5, 2020 19:42
@wking
wking commented Aug 5, 2020

cancelled by losing lease - process should shutdown as fast as possible (by closing the context and then racing to os.Exit())

I've pushed d82e2bf -> 122754f, removing Context.Start and inlining the goroutine launching logic in Options.run so we can count and reap the goroutines Context.Start used to launch, and launch them with more appropriate contexts. Now both TERM and leader-loss will lead to an as-soon-as-possible runCancel (like my previous version, but different from master, which had the immediate cancel() followed by 100ms sleep and all that). The canceled runContext should stop both CVO.Run() and, if enabled, AutoUpdate.Run() immediately, but there is nothing in start.go asserting that those two Run() functions respect the cancel and immediately stop applying cluster changes. Letting the remaining components (metrics server, informers, etc.) continue for a brief period of graceful shutdown after the TERM/leader-loss trigger seems reasonable to me, so I'm not yet sold on the race to os.Exit.

@wking
wking commented Aug 5, 2020

Bunch of "Entrypoint received interrupt: terminated".

/retest

@wking
wking commented Aug 5, 2020

integration still failing. I've pushed a WIP aabb0d5 to skip the other tests so we have clearer logs to analyze for the step-down failure.

wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 27, 2020
So the incoming cluster-version operator doesn't need to wait for the
outgoing operator's lease to expire, which can take a while [1]:

  I0802 10:06:01.056591       1 leaderelection.go:243] attempting to acquire leader lease  openshift-cluster-version/version...
  ...
  I0802 10:07:42.632719       1 leaderelection.go:253] successfully acquired lease openshift-cluster-version/version

and time out the:

  Cluster did not acknowledge request to upgrade in a reasonable time

testcase [2].  Using ReleaseOnCancel has been the plan since
2b81f47 (cvo: Release our leader lease when we are gracefully
terminated, 2019-01-16, openshift#87).  I'm not clear on why it (sometimes?)
doesn't work today.

The discrepancy between the "exit after 2s no matter what" comment and
the 5s After dates back to dbedb7a (cvo: When the CVO restarts,
perform one final sync to write status, 2019-04-27, openshift#179), which
bumped the After from 2s to 5s, but forgot to bump the comment.  I'm
removing that code here in favor of the two-minute timeout from
b30aa0e (pkg/cvo/metrics: Graceful server shutdown, 2020-04-15, openshift#349).
We still exit immediately on a second TERM, for folks who get
impatient waiting for the graceful timeout.

Decouple shutdownContext from the context passed into Options.run, to
allow TestIntegrationCVO_gracefulStepDown to request a graceful
shutdown.  And remove Context.Start(), inlining the logic in
Options.run so we can count and reap the goroutines it used to launch.
This also allows us to be more targeted with the context for each
goroutine:

* Informers are now launched before the lease controller, so they're
  up and running by the time we acquire the lease.  They remain
  running until the main operator CVO.Run() exits, after which we shut
  them down.  Having informers running before we have a lease is
  somewhat expensive in terms of API traffic, but we should rarely
  have two CVO pods competing for leadership since we transitioned to
  the Recreate Deployment strategy in 078686d
  (install/0000_00_cluster-version-operator_03_deployment: Set
  'strategy: Recreate', 2019-03-20, openshift#140) and 5d8a527
  (install/0000_00_cluster-version-operator_03_deployment: Fix
  Recreate strategy, 2019-04-03, openshift#155).  I don't see a way to block on
  their internal goroutine's completion, but maybe informers will grow
  an API for that in the future.

* The metrics server also continues to run until CVO.Run() exits,
  where previously we began gracefully shutting it down at the same
  time we started shutting down CVO.Run().  This ensures we are around
  and publishing any last-minute CVO.Run() changes.

* Leader election also continues to run until CVO.Run() exits.  We
  don't want to release the lease while we're still controlling
  things.

* CVO.Run() and AutoUpdate.Run() both stop immediately when the
  passed-in context is canceled or we call runCancel internally
  (because of a TERM, error from a goroutine, or loss of leadership).
  These are the only two goroutines that are actually writing to the
  API servers, so we want to shut them down as quickly as possible.

Drop an unnecessary runCancel() from the "shutting down" branch of the
error collector.  I'd added it in b30aa0e, but you can only ever
get into the "shutting down" branch if runCancel has already been
called.  And fix the scoping for the shutdownTimer variable so we
don't clear it on each for-loop iteration (oops :p, bug from
b30aa0e).

Add some logging to the error collector, so it's easier to see where
we are in the collection process from the operator logs.  Also start
logging collected goroutines by name, so we can figure out which may
still be outstanding.

Set terminationGracePeriodSeconds 130 to extend the default 30s [3],
to give the container the full two-minute graceful timeout window
before the kubelet steps in with a KILL.

Push the Background() initialization all the way up to the
command-line handler, to make it more obvious that the context is
scoped to the whole 'start' invocation.

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/25365/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1289853267223777280/artifacts/e2e-gcp-upgrade/pods/openshift-cluster-version_cluster-version-operator-5b6ff896c6-57ppb_cluster-version-operator.log
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1843505#c7
[3]: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.18/#podspec-v1-core

Cherry picked from cc1921d (openshift#424), around conflicts due to the lack
of TLS metrics and the Context pivots in 4.5.
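
The context layering that commit message describes can be summarized in a small sketch (illustrative, with made-up helper names): only the API-writing work runs under runContext, while teardown gets a shutdownContext that stays live for a bounded graceful window after runContext is canceled.

package example

import (
    "context"
    "os"
    "time"
)

// run scopes the API-writing main work to runCtx, then gives teardown
// (metrics shutdown, informer stop, lease release) a still-live shutdownCtx
// for a bounded graceful window after the main work is canceled.
func run(ctx context.Context, interrupt <-chan os.Signal, mainWork, teardown func(context.Context)) {
    runCtx, runCancel := context.WithCancel(ctx)
    defer runCancel()
    shutdownCtx, shutdownCancel := context.WithCancel(ctx)
    defer shutdownCancel()

    done := make(chan struct{})
    go func() {
        defer close(done)
        mainWork(runCtx) // only the cluster-writing goroutines get runCtx
    }()

    select {
    case <-interrupt: // TERM: begin graceful shutdown
        runCancel()
    case <-done: // main work exited on its own
    }

    select {
    case <-done: // main work finished inside the graceful window
    case <-time.After(2 * time.Minute): // bounded window; terminationGracePeriodSeconds is 130
    }

    // runCtx is canceled by now, but shutdownCtx is still live for teardown.
    teardown(shutdownCtx)
}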
wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 27, 2020
…ap lock release

From the godocs:

  $ grep -A5 '// HolderIdentity' vendor/k8s.io/client-go/tools/leaderelection/resourcelock/interface.go
    // HolderIdentity is the ID that owns the lease. If empty, no one owns this lease and
    // all callers may acquire. Versions of this library prior to Kubernetes 1.14 will not
    // attempt to acquire leases with empty identities and will wait for the full lease
    // interval to expire before attempting to reacquire. This value is set to empty when
    // a client voluntarily steps down.
    HolderIdentity       string      `json:"holderIdentity"`

The previous assumption that the release would involve ConfigMap
deletion was born with the test in 2b81f47 (cvo: Release our leader
lease when we are gracefully terminated, 2019-01-16, openshift#87).

Cherry picked from dd09c3f (openshift#424), around conflicts due to the lack
of Context pivots in 4.5.
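
So a check for a graceful release looks roughly like this hypothetical helper (not the test's actual code): the lock ConfigMap stays, but the holderIdentity in its leader-election annotation goes empty.

package example

import (
    "context"
    "encoding/json"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/leaderelection/resourcelock"
)

// leaseReleased reports whether the leader-election record on the lock
// ConfigMap has an empty holderIdentity, i.e. the previous leader stepped
// down voluntarily rather than letting the lease expire.
func leaseReleased(ctx context.Context, client kubernetes.Interface, namespace, name string) (bool, error) {
    cm, err := client.CoreV1().ConfigMaps(namespace).Get(ctx, name, metav1.GetOptions{})
    if err != nil {
        return false, err
    }
    raw, ok := cm.Annotations[resourcelock.LeaderElectionRecordAnnotationKey]
    if !ok {
        return false, fmt.Errorf("configmap %s/%s has no leader-election record", namespace, name)
    }
    var record resourcelock.LeaderElectionRecord
    if err := json.Unmarshal([]byte(raw), &record); err != nil {
        return false, err
    }
    return record.HolderIdentity == "", nil
}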
wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 27, 2020
Clayton wants these in each goroutine we launch [1].  Obviously
there's no way to reach inside the informer Start()s and add it there.
I'm also adding this to the FIXME comment for rerolling the
auto-update worker goroutines; we'll get those straightened out in
future work.

Cherry picked from 9c42a92 (openshift#424), around conflicts due to the lack
of Context pivots in 4.5.

[1]: openshift#424
wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 27, 2020
Lala wanted the version included in the outgoing log line [1].  I'm
not sure why you'd be wondering which version of the CVO code was
running for that particular line, and not for other lines in the log,
but including the version there is easy enough.

While we're thinking about logging the CVO version, also remove the
useless %s formatting from the opening log line, because we don't need
to manipulate version.String at all.

[1]: openshift#424 (comment)
wking added a commit to wking/cluster-version-operator that referenced this pull request Sep 9, 2020
Following cc1921d (pkg/start: Release leader lease on graceful
shutdown, 2020-08-03, openshift#424), which logs overall shutdown in
cmd/start.go, this commit will make it extremely clear in the CVO logs
when the metrics goroutine is wrapping up.
wking added a commit to wking/cluster-version-operator that referenced this pull request Oct 23, 2020
Apparently there's something in the HTTPS server goroutine that can
hang up even if we've called Shutdown() on the server [1].  Defend
against that with a safety valve to abandon stuck goroutines if
shutdownContext expires.

Also pivot to resultChannel and asyncResult, so we can get names for
the collected channels (and more easily identify the stuck channels by
elimination), following the pattern set by cc1921d (pkg/start:
Release leader lease on graceful shutdown, 2020-08-03, openshift#424).

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1891143#c1
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Nov 11, 2020
Apparently there's something in the HTTPS server goroutine that can
hang up even if we've called Shutdown() on the server [1].  Defend
against that with a safety valve to abandon stuck goroutines if
shutdownContext expires.

Also pivot to resultChannel and asyncResult, so we can get names for
the collected channels (and more easily identify the stuck channels by
elimination), following the pattern set by cc1921d (pkg/start:
Release leader lease on graceful shutdown, 2020-08-03, openshift#424).

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1891143#c1
wking added a commit to wking/cluster-version-operator that referenced this pull request Jan 20, 2021
Address a bug introduced by cc1921d (pkg/start: Release leader
lease on graceful shutdown, 2020-08-03, openshift#424), where canceling the
Operator.Run context would leave the operator with no time to attempt
the final sync [1]:

  E0119 22:24:15.924216       1 cvo.go:344] unable to perform final sync: context canceled

With this commit, I'm piping through shutdownContext, which gets a
two-minute grace period beyond runContext, to give the operator time
to push out that final status (which may include important information
like the fact that the incoming release image has completed
verification).

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1916384#c10
wking added a commit to wking/cluster-version-operator that referenced this pull request Jan 20, 2021
Address a bug introduced by cc1921d (pkg/start: Release leader
lease on graceful shutdown, 2020-08-03, openshift#424), where canceling the
Operator.Run context would leave the operator with no time to attempt
the final sync [1]:

  E0119 22:24:15.924216       1 cvo.go:344] unable to perform final sync: context canceled

With this commit, I'm piping through shutdownContext, which gets a
two-minute grace period beyond runContext, to give the operator time
to push out that final status (which may include important information
like the fact that the incoming release image has completed
verification).

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1916384#c10
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Feb 10, 2021
Address a bug introduced by cc1921d (pkg/start: Release leader
lease on graceful shutdown, 2020-08-03, openshift#424), where canceling the
Operator.Run context would leave the operator with no time to attempt
the final sync [1]:

  E0119 22:24:15.924216       1 cvo.go:344] unable to perform final sync: context canceled

With this commit, I'm piping through shutdownContext, which gets a
two-minute grace period beyond runContext, to give the operator time
to push out that final status (which may include important information
like the fact that the incoming release image has completed
verification).

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1916384#c10
wking added a commit to wking/cluster-version-operator that referenced this pull request Feb 11, 2021
a9e075a (pkg/cvo/cvo: Guard Operator.Run goroutine handling from
early cancels, 2021-01-28, openshift#508) made us more robust to situations
where we are canceled after acquiring the leader lock but before we
got into Operator.Run's UntilWithContext.  However, there was still a
bug from cc1921d (pkg/start: Release leader lease on graceful
shutdown, 2020-08-03, openshift#424) where we had not acquired the leader lock
[1].  postMainContext is used for metrics, informers, and the leader
election loop.  We used to only call postMainCancel after reaping the
main goroutine, and obviously that will only work if we've launched
the main goroutine.  This commit adds a new launchedMain to track
that.  If launchedMain is true, we get the old handling.  If
launchedMain is still false when runContext.Done, we now call
postMainCancel without waiting to reap a non-existent main goroutine.

There's also a new postMainCancel when the shutdown timer expires.  I
don't expect us to ever need that, but it protects us from future bugs
like this one.

I've added launchedMain without guarding it behind a lock, and it is
touched by both the main Options.run goroutine and the leader-election
callback.  So there's a racy chance of:

1. Options.run goroutine: runContext canceled, so runContext.Done() matches in Options.run
2. Leader-election goroutine: Leader lock acquired
3. Options.run goroutine: !launchedMain, so we call postMainCancel()
4. Leader-election goroutine: launchedMain set true
5. Leader-election goroutine: launches the main goroutine via CVO.Run(runContext, ...)

I'm trusting Operator.Run to respect runContext there and not do
anything significant, so the fact that we are already tearing down all
the post-main stuff won't cause problems.  Previous fixes like
a9e075a will help with that.  But there could still be bugs in
Operator.Run.  A lock around launchedMain that avoided calling
Operator.Run when runContext was already done would protect against
that, but seems like overkill in an already complicated goroutine
tangle.  Without the lock, we just have to field and fix any future
Operator.Run runContext issues as we find them.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1927944
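
In outline, the launchedMain handling that commit describes looks something like this sketch (illustrative names; the real code keeps this inside the collector loop and, as noted above, tolerates the unguarded read of the flag):

package example

import (
    "context"
    "time"
)

// supervise waits for runCtx to end and then decides how to tear down the
// post-main machinery (metrics, informers, leader election).
func supervise(runCtx context.Context, postMainCancel context.CancelFunc, launchedMain *bool, mainDone <-chan struct{}, graceful time.Duration) {
    <-runCtx.Done()
    if !*launchedMain { // never acquired the lease, so there is no main goroutine to reap
        postMainCancel()
        return
    }
    select {
    case <-mainDone: // main goroutine reaped; now stop metrics/informers and release the lease
    case <-time.After(graceful): // safety valve if main never exits
    }
    postMainCancel()
}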
wking added a commit to wking/cluster-version-operator that referenced this pull request Feb 11, 2021
a9e075a (pkg/cvo/cvo: Guard Operator.Run goroutine handling from
early cancels, 2021-01-28, openshift#508) made us more robust to situations
where we are canceled after acquiring the leader lock but before we
got into Operator.Run's UntilWithContext.  However, there was still a
bug from cc1921d (pkg/start: Release leader lease on graceful
shutdown, 2020-08-03, openshift#424) where we had not acquired the leader lock
[1].  postMainContext is used for metrics, informers, and the leader
election loop.  We used to only call postMainCancel after reaping the
main goroutine, and obviously that will only work if we've launched
the main goroutine.  This commit adds a new launchedMain to track
that.  If launchedMain is true, we get the old handling.  If
launchedMain is still false when runContext.Done, we now call
postMainCancel without waiting to reap a non-existent main goroutine.

There's also a new postMainCancel when the shutdown timer expires.  I
don't expect us to ever need that, but it protects us from future bugs
like this one.

I've added launchedMain without guarding it behind a lock, and it is
touched by both the main Options.run goroutine and the leader-election
callback.  So there's a racy chance of:

1. Options.run goroutine: runContext canceled, so runContext.Done() matches
2. Leader-election goroutine: Leader lock acquired
3. Options.run goroutine: !launchedMain, so we call postMainCancel()
4. Leader-election goroutine: launchedMain set true
5. Leader-election goroutine: launches the main goroutine via CVO.Run(runContext, ...)

I'm trusting Operator.Run to respect runContext there and not do
anything significant, so the fact that we are already tearing down all
the post-main stuff won't cause problems.  Previous fixes like
a9e075a will help with that.  But there could still be bugs in
Operator.Run.  A lock around launchedMain that avoided calling
Operator.Run when runContext was already done would protect against
that, but seems like overkill in an already complicated goroutine
tangle.  Without the lock, we just have to field and fix any future
Operator.Run runContext issues as we find them.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1927944
wking added a commit to wking/cluster-version-operator that referenced this pull request Feb 16, 2021
a9e075a (pkg/cvo/cvo: Guard Operator.Run goroutine handling from
early cancels, 2021-01-28, openshift#508) made us more robust to situations
where we are canceled after acquiring the leader lock but before we
got into Operator.Run's UntilWithContext.  However, there was still a
bug from cc1921d (pkg/start: Release leader lease on graceful
shutdown, 2020-08-03, openshift#424) where we had not acquired the leader lock
[1].  postMainContext is used for metrics, informers, and the leader
election loop.  We used to only call postMainCancel after reaping the
main goroutine, and obviously that will only work if we've launched
the main goroutine.  This commit adds a new launchedMain to track
that.  If launchedMain is true, we get the old handling.  If
launchedMain is still false when runContext.Done, we now call
postMainCancel without waiting to reap a non-existent main goroutine.

There's also a new postMainCancel when the shutdown timer expires.  I
don't expect us to ever need that, but it protects us from future bugs
like this one.

I've added launchedMain without guarding it behind a lock, and it is
touched by both the main Options.run goroutine and the leader-election
callback.  So there's a racy chance of:

1. Options.run goroutine: runContext canceled, so runContext.Done() matches
2. Leader-election goroutine: Leader lock acquired
3. Options.run goroutine: !launchedMain, so we call postMainCancel()
4. Leader-election goroutine: launchedMain set true
5. Leader-election goroutine: launches the main goroutine via CVO.Run(runContext, ...)

I'm trusting Operator.Run to respect runContext there and not do
anything significant, so the fact that we are already tearing down all
the post-main stuff won't cause problems.  Previous fixes like
a9e075a will help with that.  But there could still be bugs in
Operator.Run.  A lock around launchedMain that avoided calling
Operator.Run when runContext was already done would protect against
that, but seems like overkill in an already complicated goroutine
tangle.  Without the lock, we just have to field and fix any future
Operator.Run runContext issues as we find them.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1927944
wking added a commit to wking/cluster-version-operator that referenced this pull request Feb 20, 2021
Address a bug introduced by cc1921d (pkg/start: Release leader
lease on graceful shutdown, 2020-08-03, openshift#424), where canceling the
Operator.Run context would leave the operator with no time to attempt
the final sync [1]:

  E0119 22:24:15.924216       1 cvo.go:344] unable to perform final sync: context canceled

With this commit, I'm piping through shutdownContext, which gets a
two-minute grace period beyond runContext, to give the operator time
to push out that final status (which may include important information
like the fact that the incoming release image has completed
verification).

---

This commit picks c4ddf03 (pkg/cvo: Use shutdownContext for final
status synchronization, 2021-01-19, openshift#517) back to 4.5.  It's not a
clean pick, because we're missing changes like:

* b72e843 (Bug 1822844: Block z level upgrades if
  ClusterVersionOverridesSet set, 2020-04-30, openshift#364).
* 1d1de3b (Use context to add timeout to cincinnati HTTP request,
  2019-01-15, openshift#410).

which also touched these lines.  But we've gotten this far without
backporting rhbz#1822844, and openshift#410 was never associated with a bug in
the first place, so instead of pulling back more of 4.6 to get a clean
pick, I've just manually reconciled the pick conflicts.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1916384#c10
wking added a commit to wking/cluster-version-operator that referenced this pull request Feb 23, 2021
Address a bug introduced by cc1921d (pkg/start: Release leader
lease on graceful shutdown, 2020-08-03, openshift#424), where canceling the
Operator.Run context would leave the operator with no time to attempt
the final sync [1]:

  E0119 22:24:15.924216       1 cvo.go:344] unable to perform final sync: context canceled

With this commit, I'm piping through shutdownContext, which gets a
two-minute grace period beyond runContext, to give the operator time
to push out that final status (which may include important information
like the fact that the incoming release image has completed
verification).

---

This commit picks c4ddf03 (pkg/cvo: Use shutdownContext for final
status synchronization, 2021-01-19, openshift#517) back to 4.5.  It's not a
clean pick, because we're missing changes like:

* b72e843 (Bug 1822844: Block z level upgrades if
  ClusterVersionOverridesSet set, 2020-04-30, openshift#364).
* 1d1de3b (Use context to add timeout to cincinnati HTTP request,
  2019-01-15, openshift#410).

which also touched these lines.  But we've gotten this far without
backporting rhbz#1822844, and openshift#410 was never associated with a bug in
the first place, so instead of pulling back more of 4.6 to get a clean
pick, I've just manually reconciled the pick conflicts.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1916384#c10
wking added a commit to wking/cluster-version-operator that referenced this pull request Feb 23, 2021
Address a bug introduced by cc1921d (pkg/start: Release leader
lease on graceful shutdown, 2020-08-03, openshift#424), where canceling the
Operator.Run context would leave the operator with no time to attempt
the final sync [1]:

  E0119 22:24:15.924216       1 cvo.go:344] unable to perform final sync: context canceled

With this commit, I'm piping through shutdownContext, which gets a
two-minute grace period beyond runContext, to give the operator time
to push out that final status (which may include important information
like the fact that the incoming release image has completed
verification).

---

This commit picks c4ddf03 (pkg/cvo: Use shutdownContext for final
status synchronization, 2021-01-19, openshift#517) back to 4.5.  It's not a
clean pick, because we're missing changes like:

* b72e843 (Bug 1822844: Block z level upgrades if
  ClusterVersionOverridesSet set, 2020-04-30, openshift#364).
* 1d1de3b (Use context to add timeout to cincinnati HTTP request,
  2019-01-15, openshift#410).

which also touched these lines.  But we've gotten this far without
backporting rhbz#1822844, and openshift#410 was never associated with a bug in
the first place, so instead of pulling back more of 4.6 to get a clean
pick, I've just manually reconciled the pick conflicts.

Removing Start from pkg/start (again) fixes a buggy re-introduction in
the manually-backported 20421b6 (*: Add lots of Context and options
arguments, 2020-07-24, openshift#470).

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1916384#c10
jkyros added a commit to jkyros/machine-config-operator that referenced this pull request Jun 18, 2022
This:
- adds main/lease contexts to the controller
- sets up a counter and channels to track goroutine completion
- sets up a signal handler to catch when the controller is being
  terminated so we can cancel our contexts
- gracefully shuts down the controller upon receipt of a SIGINT/SIGTERM

The reason this does not use sync.WaitGroup instead is that
sync.WaitGroup has no awareness of 'what' it's waiting for, just 'how
many', so the channels are more useful.

Cribbed off of what the CVO did here:
openshift/cluster-version-operator#424
jkyros added a commit to jkyros/machine-config-operator that referenced this pull request Jun 18, 2022
This:
- adds main/lease contexts to the operator
- sets up a counter and channels to track goroutine completion
- sets up a signal handler to catch when the operator is being
  terminated so we can cancel our contexts
- gracefully shuts down the operator upon receipt of a SIGINT/SIGTERM

The reason this does not use sync.WaitGroup instead is that
sync.WaitGroup has no awareness of 'what' it's waiting for, just 'how
many', so the channels are more useful.

Cribbed off of what the CVO did here:
openshift/cluster-version-operator#424
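
The "channels instead of sync.WaitGroup" point is easy to see in a small sketch that borrows the resultChannel/asyncResult names mentioned earlier in this thread (the worker bodies here are illustrative): because each goroutine reports its name, a shutdown log can say which workers are still outstanding, which a bare counter cannot.

package main

import (
    "context"
    "fmt"
    "time"
)

type asyncResult struct {
    name  string
    error error
}

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), time.Second)
    defer cancel()

    resultChannel := make(chan asyncResult, 2)
    launch := func(name string, work func(context.Context) error) {
        go func() { resultChannel <- asyncResult{name: name, error: work(ctx)} }()
    }

    launch("metrics-server", func(ctx context.Context) error { <-ctx.Done(); return nil })
    launch("lease-controller", func(ctx context.Context) error { <-ctx.Done(); return nil })

    outstanding := map[string]bool{"metrics-server": true, "lease-controller": true}
    for len(outstanding) > 0 {
        result := <-resultChannel
        delete(outstanding, result.name)
        fmt.Printf("collected %q (err=%v); still waiting on %d goroutine(s)\n", result.name, result.error, len(outstanding))
    }
}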
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
bugzilla/severity-low: Referenced Bugzilla bug's severity is low for the branch this PR is targeting.
bugzilla/valid-bug: Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.
lgtm: Indicates that a PR is ready to be merged.