Bug 1843505: pkg/start: Release leader lease on graceful shutdown #424
Conversation
@wking: This pull request references Bugzilla bug 1843505, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug.
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from 6438ffc to 621d0cd
/retest
Tailing the CVO logs from the e2e-upgrade job (I grabbed the kubeconfig from the CI cluster console) gives:
Force-pushed from 621d0cd to 235116d
LGTM
Force-pushed from 235116d to a38f5e5
integration fails with:
Because, when we were not launching a metrics server (which we don't for those integration tests), there was nothing blocking
Force-pushed from a38f5e5 to 6c4988e
Hrm:
/hold until I figure out why we aren't breaking over to the
Force-pushed from 6c4988e to d82e2bf
Hrm, still not seeing a formal

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/424/pull-ci-openshift-cluster-version-operator-master-integration/1290870079050747904/build-log.txt | grep 'start\.go\|start_integration_test\.go\|leaderelection\|Starting ClusterVersionOperator with minimum reconcile period\|Shutting down ClusterVersionOperator'
I0805 04:41:53.258885 8994 cvo.go:316] Starting ClusterVersionOperator with minimum reconcile period 2m18.782310567s
I0805 04:41:53.263473 8994 cvo.go:316] Starting ClusterVersionOperator with minimum reconcile period 5.440919882s
I0805 04:41:53.263482 8994 start.go:218] Waiting on 1 outstanding goroutines.
I0805 04:41:53.263525 8994 leaderelection.go:243] attempting to acquire leader lease e2e-cvo-wtq8nd/e2e-cvo-wtq8nd...
I0805 04:41:53.271143 8994 cvo.go:316] Starting ClusterVersionOperator with minimum reconcile period 2m35.17222288s
I0805 04:41:53.274128 8994 leaderelection.go:253] successfully acquired lease e2e-cvo-wtq8nd/e2e-cvo-wtq8nd
I0805 04:41:53.274609 8994 cvo.go:316] Starting ClusterVersionOperator with minimum reconcile period 2m52.525702462s
I0805 04:41:53.471957 8994 start.go:222] Run context completed; beginning two-minute graceful shutdown period.
I0805 04:41:53.471990 8994 start.go:218] Waiting on 1 outstanding goroutines.
I0805 04:41:53.472174 8994 cvo.go:350] Shutting down ClusterVersionOperator
I0805 04:41:53.476123 8994 start.go:208] Stopped leading; shutting down.
I0805 04:41:53.476158 8994 start.go:244] Finished collecting operator goroutines.
I0805 04:41:53.479381 8994 cvo.go:350] Shutting down ClusterVersionOperator
start_integration_test.go:542: the controller should create a lock record on a config map
start_integration_test.go:566: verify the controller writes a leadership change event
start_integration_test.go:575: after the context is closed, the lock record should be deleted quickly
start_integration_test.go:592: timed out waiting for the condition
start_integration_test.go:520: failed to delete cluster version e2e-cvo-wtq8nd: clusterversions.config.openshift.io "e2e-cvo-wtq8nd" not found
I0805 04:42:11.511270 8994 cvo.go:350] Shutting down ClusterVersionOperator
I0805 04:42:25.909230 8994 cvo.go:350] Shutting down ClusterVersionOperator
We see the
},
OnStoppedLeading: func() {
	klog.Info("Stopped leading; shutting down.")
	runCancel()
I think you still need to exit, don't you? How confident are you that this truly resets everything?
How confident are you that this truly resets everything?
If it doesn't, CI should turn it up, and we'll fix those bugs ;)
This doesn't sound like what I described on Slack. If we lost the lease, we exit immediately, no graceful step down. When we have lost our lease we should not be running.
If we lost the lease, we exit immediately...
OnStoppedLeading is not just "lost lease", it is also "gracefully released lease". We could have logic here about checking postMainContext.Err() to guess about lost vs. released. But runCancel() should immediately block all cluster-object-writing activity, so I think this is sufficient as it stands to keep from fighting the new leader. For comparison, master gives a full 5s grace period after the cancel before forcing a hard exit. If reducing my current 2m grace period to 5s would make you happy with this line, I'm happy to make that change.
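For readers following along, here is a minimal sketch of the step-down pattern being discussed, built on the client-go leader-election API. The durations and function names are placeholders, not the CVO's actual wiring: ReleaseOnCancel empties HolderIdentity when the election context is canceled, and OnStoppedLeading only cancels the run context rather than exiting outright.

```go
package start

import (
	"context"
	"time"

	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

// runLeaderElection blocks until ctx is canceled or leadership is lost.
// onLeader is expected to launch the main operator goroutines.
func runLeaderElection(ctx context.Context, lock resourcelock.Interface, runCancel context.CancelFunc, onLeader func(context.Context)) {
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,             // voluntarily empty HolderIdentity when ctx is canceled
		LeaseDuration:   90 * time.Second, // placeholder durations, not the operator's real settings
		RenewDeadline:   45 * time.Second,
		RetryPeriod:     30 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: onLeader,
			OnStoppedLeading: func() {
				// Fires for both "lost the lease" and "gracefully released the lease";
				// either way, stop writing cluster objects immediately.
				klog.Info("Stopped leading; shutting down.")
				runCancel()
			},
		},
	})
}
```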
From chat: Top loop has to know about two different paths and treat them differently:
Force-pushed from c5f88a1 to 122754f
I've pushed d82e2bf -> 122754f, removing
Bunch of /retest
integration still failing. I've pushed a WIP aabb0d5 to skip the other tests so we have clearer logs to analyze for the step-down failure.
So the incoming cluster-version operator doesn't need to wait for the outgoing operator's lease to expire, which can take a while [1]:

  I0802 10:06:01.056591 1 leaderelection.go:243] attempting to acquire leader lease openshift-cluster-version/version...
  ...
  I0802 10:07:42.632719 1 leaderelection.go:253] successfully acquired lease openshift-cluster-version/version

and time out the:

  Cluster did not acknowledge request to upgrade in a reasonable time

testcase [2].

Using ReleaseOnCancel has been the plan since 2b81f47 (cvo: Release our leader lease when we are gracefully terminated, 2019-01-16, openshift#87). I'm not clear on why it (sometimes?) doesn't work today.

The discrepancy between the "exit after 2s no matter what" comment and the 5s After dates back to dbedb7a (cvo: When the CVO restarts, perform one final sync to write status, 2019-04-27, openshift#179), which bumped the After from 2s to 5s, but forgot to bump the comment. I'm removing that code here in favor of the two-minute timeout from b30aa0e (pkg/cvo/metrics: Graceful server shutdown, 2020-04-15, openshift#349). We still exit immediately on a second TERM, for folks who get impatient waiting for the graceful timeout.

Decouple shutdownContext from the context passed into Options.run, to allow TestIntegrationCVO_gracefulStepDown to request a graceful shutdown. And remove Context.Start(), inlining the logic in Options.run so we can count and reap the goroutines it used to launch. This also allows us to be more targeted with the context for each goroutine:

* Informers are now launched before the lease controller, so they're up and running by the time we acquire the lease. They remain running until the main operator CVO.Run() exits, after which we shut them down. Having informers running before we have a lease is somewhat expensive in terms of API traffic, but we should rarely have two CVO pods competing for leadership since we transitioned to the Recreate Deployment strategy in 078686d (install/0000_00_cluster-version-operator_03_deployment: Set 'strategy: Recreate', 2019-03-20, openshift#140) and 5d8a527 (install/0000_00_cluster-version-operator_03_deployment: Fix Recreate strategy, 2019-04-03, openshift#155). I don't see a way to block on their internal goroutine's completion, but maybe informers will grow an API for that in the future.

* The metrics server also continues to run until CVO.Run() exits, where previously we began gracefully shutting it down at the same time we started shutting down CVO.Run(). This ensures we are around and publishing any last-minute CVO.Run() changes.

* Leader election also continues to run until CVO.Run() exits. We don't want to release the lease while we're still controlling things.

* CVO.Run() and AutoUpdate.Run() both stop immediately when the passed-in context is canceled or we call runCancel internally (because of a TERM, error from a goroutine, or loss of leadership). These are the only two goroutines that are actually writing to the API servers, so we want to shut them down as quickly as possible.

Drop an unnecessary runCancel() from the "shutting down" branch of the error collector. I'd added it in b30aa0e, but you can only ever get into the "shutting down" branch if runCancel has already been called. And fix the scoping for the shutdownTimer variable so we don't clear it on each for-loop iteration (oops :p, bug from b30aa0e).

Add some logging to the error collector, so it's easier to see where we are in the collection process from the operator logs. Also start logging collected goroutines by name, so we can figure out which may still be outstanding.

Set terminationGracePeriodSeconds 130 to extend the default 30s [3], to give the container the full two-minute graceful timeout window before the kubelet steps in with a KILL.

Push the Background() initialization all the way up to the command-line handler, to make it more obvious that the context is scoped to the whole 'start' invocation.

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/25365/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1289853267223777280/artifacts/e2e-gcp-upgrade/pods/openshift-cluster-version_cluster-version-operator-5b6ff896c6-57ppb_cluster-version-operator.log
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1843505#c7
[3]: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.18/#podspec-v1-core

Cherry picked from cc1921d (openshift#424), around conflicts due to the lack of TLS metrics and the Context pivots in 4.5.
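As a rough illustration of the runContext/shutdownContext split described in that commit message (a sketch with invented names, not the actual pkg/start code): the first TERM/INT cancels the run context, a two-minute timer then cancels the shutdown context, and a second signal exits immediately.

```go
package start

import (
	"context"
	"os"
	"os/signal"
	"syscall"
	"time"

	"k8s.io/klog/v2"
)

// setupShutdown derives a run context (canceled on the first TERM/INT) and a
// shutdown context (canceled two minutes later, unless a second signal forces
// an immediate exit first).
func setupShutdown(parent context.Context) (context.Context, context.Context) {
	runCtx, runCancel := context.WithCancel(parent)
	shutdownCtx, shutdownCancel := context.WithCancel(parent)

	ch := make(chan os.Signal, 2)
	signal.Notify(ch, os.Interrupt, syscall.SIGTERM)
	go func() {
		defer signal.Stop(ch)
		<-ch
		klog.Info("Shutdown signal received; beginning two-minute graceful shutdown period.")
		runCancel() // stop the goroutines that write to the API servers
		select {
		case <-ch:
			klog.Fatal("Second shutdown signal received; exiting immediately.")
		case <-time.After(2 * time.Minute):
			shutdownCancel() // grace period expired; abandon anything still stuck
		}
	}()
	return runCtx, shutdownCtx
}
```

In this layout the caller would hand the run context to the cluster-writing goroutines and the shutdown context to the goroutine collector.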
…ap lock release

From the godocs:

  $ grep -A5 '// HolderIdentity' vendor/k8s.io/client-go/tools/leaderelection/resourcelock/interface.go
      // HolderIdentity is the ID that owns the lease. If empty, no one owns this lease and
      // all callers may acquire. Versions of this library prior to Kubernetes 1.14 will not
      // attempt to acquire leases with empty identities and will wait for the full lease
      // interval to expire before attempting to reacquire. This value is set to empty when
      // a client voluntarily steps down.
      HolderIdentity string `json:"holderIdentity"`

The previous assumption that the release would involve ConfigMap deletion was born with the test in 2b81f47 (cvo: Release our leader lease when we are gracefully terminated, 2019-01-16, openshift#87).

Cherry picked from dd09c3f (openshift#424), around conflicts due to the lack of Context pivots in 4.5.
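A hypothetical test helper in that spirit (assuming a ConfigMap lock and a client-go version whose Get takes a context; the names here are invented, not the repository's test code): instead of waiting for the lock ConfigMap to disappear, poll until the recorded holderIdentity goes empty, which is what a voluntary step-down actually does.

```go
package integration

import (
	"context"
	"encoding/json"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// waitForLeaseRelease polls the lock ConfigMap until its leader-election
// record shows an empty HolderIdentity (a voluntary step-down).
func waitForLeaseRelease(ctx context.Context, client kubernetes.Interface, ns, name string) error {
	return wait.PollImmediateUntil(time.Second, func() (bool, error) {
		cm, err := client.CoreV1().ConfigMaps(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		raw, ok := cm.Annotations["control-plane.alpha.kubernetes.io/leader"]
		if !ok {
			return false, nil // no leader-election record written yet
		}
		var record resourcelock.LeaderElectionRecord
		if err := json.Unmarshal([]byte(raw), &record); err != nil {
			return false, err
		}
		return record.HolderIdentity == "", nil
	}, ctx.Done())
}
```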
Clayton wants these in each goroutine we launch [1]. Obviously there's no way to reach inside the informer Start()s and add it there. I'm also adding this to the FIXME comment for rerolling the auto-update worker goroutines; we'll get those straightened out in future work.

Cherry picked from 9c42a92 (openshift#424), around conflicts due to the lack of Context pivots in 4.5.

[1]: openshift#424
Lala wanted the version included in the outgoing log line [1]. I'm not sure why you'd be wondering which version of the CVO code was running for that particular line, and not for other lines in the log, but including the version there is easy enough. While we're thinking about logging the CVO version, also remove the useless %s formatting from the opening log line, because we don't need to manipulate version.String at all.

[1]: openshift#424 (comment)
Following cc1921d (pkg/start: Release leader lease on graceful shutdown, 2020-08-03, openshift#424), which logs overall shutdown in cmd/start.go, this commit will make it extremely clear in the CVO logs when the metrics goroutine is wrapping up.
Apparently there's something in the HTTPS server goroutine that can hang up even if we've called Shutdown() on the server [1]. Defend against that with a safety valve to abandon stuck goroutines if shutdownContext expires. Also pivot to resultChannel and asyncResult, so we can get names for the collected channels (and more easily identify the stuck channels by elimination), following the pattern set by cc1921d (pkg/start: Release leader lease on graceful shutdown, 2020-08-03, openshift#424).

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1891143#c1
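To make the resultChannel/asyncResult pattern concrete, here is a sketch of the collection loop with that safety valve (the names follow the commit message, but the details are a guess, not the repository's exact code): each goroutine reports a named result, and if the shutdown context expires first we abandon whatever is still outstanding.

```go
package start

import (
	"context"
	"fmt"

	"k8s.io/klog/v2"
)

// asyncResult is reported by each launched goroutine when it finishes.
type asyncResult struct {
	name string
	err  error
}

// collectGoroutines reaps `outstanding` goroutines from resultChannel, logging
// each by name, and bails out if shutdownCtx expires before they all finish.
func collectGoroutines(shutdownCtx context.Context, resultChannel <-chan asyncResult, outstanding int) error {
	var firstErr error
	for outstanding > 0 {
		klog.Infof("Waiting on %d outstanding goroutines.", outstanding)
		select {
		case result := <-resultChannel:
			outstanding--
			if result.err != nil && firstErr == nil {
				firstErr = result.err
			}
			klog.Infof("Collected %s goroutine.", result.name)
		case <-shutdownCtx.Done():
			// Something (e.g. the HTTPS server goroutine) is stuck; abandon it
			// rather than hanging the shutdown forever.
			return fmt.Errorf("abandoning %d stuck goroutines: %w", outstanding, shutdownCtx.Err())
		}
	}
	klog.Info("Finished collecting operator goroutines.")
	return firstErr
}
```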
Address a bug introduced by cc1921d (pkg/start: Release leader lease on graceful shutdown, 2020-08-03, openshift#424), where canceling the Operator.Run context would leave the operator with no time to attempt the final sync [1]:

  E0119 22:24:15.924216 1 cvo.go:344] unable to perform final sync: context canceled

With this commit, I'm piping through shutdownContext, which gets a two-minute grace period beyond runContext, to give the operator time to push out that final status (which may include important information like the fact that the incoming release image has completed verification).

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1916384#c10
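A minimal sketch of that split (invented names and stub methods, not the actual pkg/cvo code): the periodic sync loop respects the run context, while the one final status write gets the shutdown context, which outlives the run context by roughly two minutes.

```go
package cvo

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/klog/v2"
)

// Operator stands in for the real operator; sync and finalSync are stubs.
type Operator struct{}

func (o *Operator) sync(ctx context.Context) error      { return nil }
func (o *Operator) finalSync(ctx context.Context) error { return nil }

// Run drives the periodic sync until runCtx is canceled, then attempts one
// final status sync on the longer-lived shutdownCtx.
func (o *Operator) Run(runCtx, shutdownCtx context.Context) {
	wait.UntilWithContext(runCtx, func(ctx context.Context) {
		if err := o.sync(ctx); err != nil {
			klog.Errorf("sync failed: %v", err)
		}
	}, time.Second)

	// runCtx is done; push one last status while shutdownCtx still allows it,
	// so a canceled run context alone no longer aborts the final sync.
	if err := o.finalSync(shutdownCtx); err != nil {
		klog.Errorf("unable to perform final sync: %v", err)
	}
}
```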
a9e075a (pkg/cvo/cvo: Guard Operator.Run goroutine handling from early cancels, 2021-01-28, openshift#508) made us more robust to situations where we are canceled after acquiring the leader lock but before we got into Operator.Run's UntilWithContext. However, there was still a bug from cc1921d (pkg/start: Release leader lease on graceful shutdown, 2020-08-03, openshift#424) where we had not acquired the leader lock [1].

postMainContext is used for metrics, informers, and the leader election loop. We used to only call postMainCancel after reaping the main goroutine, and obviously that will only work if we've launched the main goroutine. This commit adds a new launchedMain to track that. If launchedMain is true, we get the old handling. If launchedMain is still false when runContext.Done, we now call postMainCancel without waiting to reap a non-existent main goroutine. There's also a new postMainCancel when the shutdown timer expires. I don't expect us to ever need that, but it protects us from future bugs like this one.

I've added launchedMain without guarding it behind a lock, and it is touched by both the main Options.run goroutine and the leader-election callback. So there's a racy chance of:

1. Options.run goroutine: runContext canceled, so runContext.Done() matches in Options.run
2. Leader-election goroutine: Leader lock acquired
3. Options.run goroutine: !launchedMain, so we call postMainCancel()
4. Leader-election goroutine: launchedMain set true
5. Leader-election goroutine: launches the main goroutine via CVO.Run(runContext, ...)

I'm trusting Operator.Run to respect runContext there and not do anything significant, so the fact that we are already tearing down all the post-main stuff won't cause problems. Previous fixes like a9e075a will help with that. But there could still be bugs in Operator.Run. A lock around launchedMain that avoided calling Operator.Run when runContext was already done would protect against that, but seems like overkill in an already complicated goroutine tangle. Without the lock, we just have to field and fix any future Operator.Run runContext issues as we find them.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1927944
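In sketch form (assumed names, and deliberately ignoring the race discussed above), the launchedMain guard looks roughly like this:

```go
package start

import "context"

// waitAndCancelPostMain waits for runContext to end, then decides when to
// release the post-main resources (metrics, informers, leader election).
func waitAndCancelPostMain(runContext, shutdownContext context.Context,
	postMainCancel context.CancelFunc, launchedMain *bool, mainDone <-chan struct{}) {
	<-runContext.Done() // TERM, internal error, or lost leadership

	if !*launchedMain {
		// We never acquired the lease, so there is no main goroutine to reap;
		// cancel the post-main context straight away instead of waiting forever.
		postMainCancel()
		return
	}

	select {
	case <-mainDone:
		postMainCancel() // main goroutine reaped; release post-main resources
	case <-shutdownContext.Done():
		postMainCancel() // shutdown timer expired; don't wait on a stuck main goroutine
	}
}
```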
Address a bug introduced by cc1921d (pkg/start: Release leader lease on graceful shutdown, 2020-08-03, openshift#424), where canceling the Operator.Run context would leave the operator with no time to attempt the final sync [1]:

  E0119 22:24:15.924216 1 cvo.go:344] unable to perform final sync: context canceled

With this commit, I'm piping through shutdownContext, which gets a two-minute grace period beyond runContext, to give the operator time to push out that final status (which may include important information like the fact that the incoming release image has completed verification).

---

This commit picks c4ddf03 (pkg/cvo: Use shutdownContext for final status synchronization, 2021-01-19, openshift#517) back to 4.5. It's not a clean pick, because we're missing changes like:

* b72e843 (Bug 1822844: Block z level upgrades if ClusterVersionOverridesSet set, 2020-04-30, openshift#364).
* 1d1de3b (Use context to add timeout to cincinnati HTTP request, 2019-01-15, openshift#410).

which also touched these lines. But we've gotten this far without backporting rhbz#1822844, and openshift#410 was never associated with a bug in the first place, so instead of pulling back more of 4.6 to get a clean pick, I've just manually reconciled the pick conflicts.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1916384#c10
Address a bug introduced by cc1921d (pkg/start: Release leader lease on graceful shutdown, 2020-08-03, openshift#424), where canceling the Operator.Run context would leave the operator with no time to attempt the final sync [1]:

  E0119 22:24:15.924216 1 cvo.go:344] unable to perform final sync: context canceled

With this commit, I'm piping through shutdownContext, which gets a two-minute grace period beyond runContext, to give the operator time to push out that final status (which may include important information like the fact that the incoming release image has completed verification).

---

This commit picks c4ddf03 (pkg/cvo: Use shutdownContext for final status synchronization, 2021-01-19, openshift#517) back to 4.5. It's not a clean pick, because we're missing changes like:

* b72e843 (Bug 1822844: Block z level upgrades if ClusterVersionOverridesSet set, 2020-04-30, openshift#364).
* 1d1de3b (Use context to add timeout to cincinnati HTTP request, 2019-01-15, openshift#410).

which also touched these lines. But we've gotten this far without backporting rhbz#1822844, and openshift#410 was never associated with a bug in the first place, so instead of pulling back more of 4.6 to get a clean pick, I've just manually reconciled the pick conflicts.

Removing Start from pkg/start (again) fixes a buggy re-introduction in the manually-backported 20421b6 (*: Add lots of Context and options arguments, 2020-07-24, openshift#470).

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1916384#c10
This:

- adds main/lease contexts to the controller
- sets up a counter and channels to track goroutine completion
- sets up a signal handler to catch when the controller is being terminated so we can cancel our contexts
- gracefully shuts down the controller upon receipt of a SIGINT/SIGTERM

The reason this does not use sync.WaitGroup instead is that sync.WaitGroup has no awareness of 'what' it's waiting for, just 'how many', so the channels are more useful.

Cribbed off of what the CVO did here: openshift/cluster-version-operator#424
This:

- adds main/lease contexts to the operator
- sets up a counter and channels to track goroutine completion
- sets up a signal handler to catch when the operator is being terminated so we can cancel our contexts
- gracefully shuts down the operator upon receipt of a SIGINT/SIGTERM

The reason this does not use sync.WaitGroup instead is that sync.WaitGroup has no awareness of 'what' it's waiting for, just 'how many', so the channels are more useful.

Cribbed off of what the CVO did here: openshift/cluster-version-operator#424
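The difference those commit messages describe is easy to see in a sketch (illustrative names only): a sync.WaitGroup would only tell you how many workers are left, while sending a named result per goroutine lets a shutdown loop like the one sketched earlier log exactly which worker is still outstanding.

```go
package start

import "context"

// result carries both the worker's name and its error, which a bare
// sync.WaitGroup.Wait() could never tell you.
type result struct {
	name string
	err  error
}

// launchNamed runs fn in its own goroutine and reports its name and error on
// done when it finishes.
func launchNamed(ctx context.Context, done chan<- result, name string, fn func(context.Context) error) {
	go func() {
		done <- result{name: name, err: fn(ctx)}
	}()
}
```

During shutdown the collector then receives one result per launched worker, logging each name as it arrives; any name it never sees identifies the stuck goroutine.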
So the incoming cluster-version operator doesn't need to wait for the outgoing operator's lease to expire, which can take a while:

and time out the:

  Cluster did not acknowledge request to upgrade in a reasonable time

testcase. Using ReleaseOnCancel has been the plan since 2b81f47 (#87). I'm not clear on why it (sometimes?) doesn't work today.

The discrepancy between the "exit after 2s no matter what" comment and the 5s After dates back to dbedb7a (#179), which bumped the After from 2s to 5s, but forgot to bump the comment. I'm removing that code here in favor of the two-minute timeout from b30aa0e (#349). We still exit immediately on a second TERM, for folks who get impatient waiting for the graceful timeout.

I'm also pushing the Background() initialization all the way up to the command-line handler, to make it more obvious that the context is scoped to the whole start invocation.
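For illustration, a sketch of what "pushing Background() up to the command-line handler" looks like, assuming a cobra-style start command (the option and function names here are invented, not the repository's):

```go
package main

import (
	"context"

	"github.com/spf13/cobra"
	"k8s.io/klog/v2"
)

// options stands in for the real start options.
type options struct{}

// run would set up contexts, leader election, and the operator goroutines.
func (o *options) run(ctx context.Context) error { return nil }

func newStartCommand() *cobra.Command {
	o := &options{}
	return &cobra.Command{
		Use:   "start",
		Short: "Starts the operator",
		Run: func(cmd *cobra.Command, args []string) {
			// The root context is created here, so it is obvious at a glance that
			// it is scoped to the whole 'start' invocation.
			if err := o.run(context.Background()); err != nil {
				klog.Fatal(err)
			}
		},
	}
}
```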