Bug 1872906: pkg/start: Release leader lease on graceful shutdown #446

Conversation

@wking wking commented Aug 27, 2020

Picking #349 and #424 back to 4.5 so we release the leader lock when we exit, which in turn allows the incoming CVO pod to pick up the leader lock and update the ClusterVersion status in a timely manner.

This required a number of mostly-minor manual tweaks to adjust to 4.5, which is missing #358's TLS metrics and the Context changes from #406 and #410.

We've had it since 2b81f47 (cvo: Release our leader lease when we
are gracefully terminated, 2019-01-16, openshift#87), but it's redundant
with "ListenAddr is not an empty string".

I'm also switching to:

  o.ListenAddr != ""

instead of:

  len(o.ListenAddr) > 0

because it seems slightly easier to understand, but obviously either
will work.

Cherry-picked from 07e5809 (openshift#349), around conflicts due to the lack
of TLS metrics in 4.5.

Somewhat like the example in [1].  This pushes the server management
down into a new RunMetrics method, which we then run in its own
goroutine.  This is initial groundwork; I expect we will port more of
our child goroutines to this framework in follow-up work.
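
For orientation, a minimal sketch of that shape (the RunMetrics
signature and the bare promhttp handler here are assumptions for the
example, not the exact 4.5 wiring):

  package main

  import (
      "context"
      "net/http"
      "time"

      "github.com/prometheus/client_golang/prometheus/promhttp"
  )

  // RunMetrics serves /metrics until ctx is canceled, then drains
  // in-flight requests before returning, per the Shutdown docs in [1].
  func RunMetrics(ctx context.Context, listenAddr string) error {
      mux := http.NewServeMux()
      mux.Handle("/metrics", promhttp.Handler())
      server := &http.Server{Addr: listenAddr, Handler: mux}

      errCh := make(chan error, 1)
      go func() {
          errCh <- server.ListenAndServe()
      }()

      select {
      case err := <-errCh:
          return err // the server died on its own
      case <-ctx.Done():
          // Stop accepting new connections and give in-flight
          // requests a bounded window to finish.
          shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
          defer cancel()
          return server.Shutdown(shutdownCtx)
      }
  }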

Cherry-picked from b30aa0e (openshift#349), around conflicts due to the lack
of TLS metrics in 4.5.

[1]: https://golang.org/pkg/net/http/#Server.Shutdown

Pulling this up out of cvo.New() while working to decouple metrics
handling from the core CVO goroutine.

These are not just for available updates; they're also for downloading
signatures.  Placing them in a separate file makes it easier to focus
on the code that is specific to available updates.

So the incoming cluster-version operator doesn't need to wait for the
outgoing operator's lease to expire, which can take a while [1]:

  I0802 10:06:01.056591       1 leaderelection.go:243] attempting to acquire leader lease  openshift-cluster-version/version...
  ...
  I0802 10:07:42.632719       1 leaderelection.go:253] successfully acquired lease openshift-cluster-version/version

and time out the:

  Cluster did not acknowledge request to upgrade in a reasonable time

testcase [2].  Using ReleaseOnCancel has been the plan since
2b81f47 (cvo: Release our leader lease when we are gracefully
terminated, 2019-01-16, openshift#87).  I'm not clear on why it (sometimes?)
doesn't work today.

The discrepancy between the "exit after 2s no matter what" comment and
the 5s After dates back to dbedb7a (cvo: When the CVO restarts,
perform one final sync to write status, 2019-04-27, openshift#179), which
bumped the After from 2s to 5s, but forgot to bump the comment.  I'm
removing that code here in favor of the two-minute timeout from
b30aa0e (pkg/cvo/metrics: Graceful server shutdown, 2020-04-15, openshift#349).
We still exit immediately on a second TERM, for folks who get
impatient waiting for the graceful timeout.
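
For reference, ReleaseOnCancel is a client-go leader-election option;
a minimal sketch of how it hooks in (the lock construction is elided,
and the durations are illustrative, not the CVO's actual values):

  package main

  import (
      "context"
      "time"

      "k8s.io/client-go/tools/leaderelection"
      "k8s.io/client-go/tools/leaderelection/resourcelock"
  )

  // runLeaderElection blocks, running 'run' while we hold the lease.
  // With ReleaseOnCancel, canceling ctx empties HolderIdentity on the
  // lock as we step down, so the incoming pod can acquire the lease
  // immediately instead of waiting out the lease duration.
  func runLeaderElection(ctx context.Context, lock resourcelock.Interface, run func(context.Context)) {
      leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
          Lock:            lock,
          ReleaseOnCancel: true,
          LeaseDuration:   90 * time.Second, // illustrative, not the CVO's values
          RenewDeadline:   45 * time.Second,
          RetryPeriod:     30 * time.Second,
          Callbacks: leaderelection.LeaderCallbacks{
              OnStartedLeading: run,
              OnStoppedLeading: func() {
                  // We lost, or voluntarily released, the lease.
              },
          },
      })
  }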

Decouple shutdownContext from the context passed into Options.run, to
allow TestIntegrationCVO_gracefulStepDown to request a graceful
shutdown.  And remove Context.Start(), inlining the logic in
Options.run so we can count and reap the goroutines it used to launch.
This also allows us to be more targeted with the context for each
goroutine (a rough sketch of the wiring follows the list):

* Informers are now launched before the lease controller, so they're
  up and running by the time we acquire the lease.  They remain
  running until the main operator CVO.Run() exits, after which we shut
  them down.  Having informers running before we have a lease is
  somewhat expensive in terms of API traffic, but we should rarely
  have two CVO pods competing for leadership since we transitioned to
  the Recreate Deployment strategy in 078686d
  (install/0000_00_cluster-version-operator_03_deployment: Set
  'strategy: Recreate', 2019-03-20, openshift#140) and 5d8a527
  (install/0000_00_cluster-version-operator_03_deployment: Fix
  Recreate strategy, 2019-04-03, openshift#155).  I don't see a way to block on
  their internal goroutines' completion, but maybe informers will grow
  an API for that in the future.

* The metrics server also continues to run until CVO.Run() exits,
  where previously we began gracefully shutting it down at the same
  time we started shutting down CVO.Run().  This ensures we are around
  and publishing any last-minute CVO.Run() changes.

* Leader election also continues to run until CVO.Run() exits.  We
  don't want to release the lease while we're still controlling
  things.

* CVO.Run() and AutoUpdate.Run() both stop immediately when the
  passed-in context is canceled or we call runCancel internally
  (because of a TERM, error from a goroutine, or loss of leadership).
  These are the only two goroutines that are actually writing to the
  API servers, so we want to shut them down as quickly as possible.
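
A rough sketch of the two-layer context wiring described above, with
hypothetical names; the real Options.run launches and reaps more
goroutines than shown here:

  package main

  import (
      "context"
      "os"
      "os/signal"
      "syscall"
  )

  // run receives ctx from the command-line handler, where it is
  // initialized with context.Background() and scoped to the whole
  // 'start' invocation.
  func run(ctx context.Context) {
      // runCtx governs the API-writing workers (CVO.Run and
      // AutoUpdate.Run); runCancel fires on TERM, a goroutine
      // error, or loss of leadership.
      runCtx, runCancel := context.WithCancel(ctx)
      defer runCancel()

      // shutdownCtx governs the supporting goroutines (informers,
      // metrics server, leader election); it outlives runCtx so
      // they keep running until the main workers have exited.
      shutdownCtx, shutdownCancel := context.WithCancel(ctx)
      defer shutdownCancel()

      sigCh := make(chan os.Signal, 1)
      signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)
      go func() {
          <-sigCh
          runCancel() // first TERM: begin graceful shutdown
          <-sigCh
          os.Exit(1) // second TERM: exit immediately
      }()

      _ = runCtx      // launch CVO.Run and AutoUpdate.Run on runCtx,
      _ = shutdownCtx // supporting goroutines on shutdownCtx, then reap
      // the runCtx workers and call shutdownCancel().
  }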

Drop an unnecessary runCancel() from the "shutting down" branch of the
error collector.  I'd added it in b30aa0e, but you can only ever
get into the "shutting down" branch if runCancel has already been
called.  And fix the scoping for the shutdownTimer variable so we
don't clear it on each for-loop iteration (oops :p, bug from
b30aa0e).

Add some logging to the error collector, so it's easier to see where
we are in the collection process from the operator logs.  Also start
logging collected goroutines by name, so we can figure out which may
still be outstanding.
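
The collector shape, roughly (hypothetical names; the real code logs
through klog and feeds the errors into the shutdown logic):

  package main

  import (
      "errors"
      "log"
  )

  // goroutineResult pairs a goroutine's name with its final error,
  // so the collector can log by name which goroutines it has reaped
  // and which are still outstanding.
  type goroutineResult struct {
      name string
      err  error
  }

  func main() {
      resultCh := make(chan goroutineResult)
      launched := 0
      launch := func(name string, fn func() error) {
          launched++
          go func() {
              resultCh <- goroutineResult{name: name, err: fn()}
          }()
      }

      launch("metrics", func() error { return nil })
      launch("auto-update", func() error { return errors.New("canceled") })

      for i := 0; i < launched; i++ {
          result := <-resultCh
          log.Printf("collected %s goroutine (%d of %d): %v", result.name, i+1, launched, result.err)
      }
  }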

Set terminationGracePeriodSeconds 130 to extend the default 30s [3],
to give the container the full two-minute graceful timeout window
before the kubelet steps in with a KILL.
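
The manifest side is a single field on the pod spec (trimmed sketch
of the deployment):

  # install/0000_00_cluster-version-operator_03_deployment.yaml (trimmed)
  spec:
    template:
      spec:
        # Default is 30s; 130s covers the two-minute graceful window
        # plus a little slack before the kubelet sends KILL.
        terminationGracePeriodSeconds: 130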

Push the Background() initialization all the way up to the
command-line handler, to make it more obvious that the context is
scoped to the whole 'start' invocation.

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/25365/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1289853267223777280/artifacts/e2e-gcp-upgrade/pods/openshift-cluster-version_cluster-version-operator-5b6ff896c6-57ppb_cluster-version-operator.log
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1843505#c7
[3]: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.18/#podspec-v1-core

Cherry-picked from cc1921d (openshift#424), around conflicts due to the lack
of TLS metrics and the Context pivots in 4.5.
…ap lock release

From the godocs:

  $ grep -A5 '// HolderIdentity' vendor/k8s.io/client-go/tools/leaderelection/resourcelock/interface.go
    // HolderIdentity is the ID that owns the lease. If empty, no one owns this lease and
    // all callers may acquire. Versions of this library prior to Kubernetes 1.14 will not
    // attempt to acquire leases with empty identities and will wait for the full lease
    // interval to expire before attempting to reacquire. This value is set to empty when
    // a client voluntarily steps down.
    HolderIdentity       string      `json:"holderIdentity"`

The previous assumption that the release would involve ConfigMap
deletion was born with the test in 2b81f47 (cvo: Release our leader
lease when we are gracefully terminated, 2019-01-16, openshift#87).
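
So the test should assert on an empty holderIdentity rather than on
ConfigMap deletion.  A hypothetical check (written against a recent
client-go; the exact Get signature varies with the vendored version):

  package start_test

  import (
      "context"
      "encoding/json"
      "testing"

      metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
      "k8s.io/client-go/kubernetes"
      "k8s.io/client-go/tools/leaderelection/resourcelock"
  )

  // expectReleasedLease asserts the lock ConfigMap still exists, but
  // that its leader-election record has an empty holderIdentity.
  func expectReleasedLease(ctx context.Context, t *testing.T, client kubernetes.Interface) {
      cm, err := client.CoreV1().ConfigMaps("openshift-cluster-version").Get(ctx, "version", metav1.GetOptions{})
      if err != nil {
          t.Fatal(err)
      }
      var record resourcelock.LeaderElectionRecord
      raw := cm.Annotations[resourcelock.LeaderElectionRecordAnnotationKey]
      if err := json.Unmarshal([]byte(raw), &record); err != nil {
          t.Fatal(err)
      }
      if record.HolderIdentity != "" {
          t.Fatalf("expected a released lease, but %q still holds it", record.HolderIdentity)
      }
  }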

Cherry-picked from dd09c3f (openshift#424), around conflicts due to the lack
of Context pivots in 4.5.
Clayton wants these in each goroutine we launch [1].  Obviously
there's no way to reach inside the informer Start()s and add it there.
I'm also adding this to the FIXME comment for rerolling the
auto-update worker goroutines; we'll get those straightened out in
future work.

Cherry-picked from 9c42a92 (openshift#424), around conflicts due to the lack
of Context pivots in 4.5.

[1]: openshift#424
Lala wanted the version included in the outgoing log line [1].  I'm
not sure why you'd be wondering which version of the CVO code was
running for that particular line, and not for other lines in the log,
but including the version there is easy enough.

While we're thinking about logging the CVO version, also remove the
useless %s formatting from the opening log line, because we don't need
to manipulate version.String at all.

[1]: openshift#424 (comment)
@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-low Referenced Bugzilla bug's severity is low for the branch this PR is targeting. label Aug 27, 2020
@openshift-ci-robot
Contributor

@wking: This pull request references Bugzilla bug 1872906, which is invalid:

  • expected dependent Bugzilla bug 1843505 to be in one of the following states: VERIFIED, RELEASE_PENDING, CLOSED (ERRATA), but it is ON_QA instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1872906: pkg/start: Release leader lease on graceful shutdown

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Aug 27, 2020
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 27, 2020

Now that we have CI that cares about this (hooray!).  Generated with:

  $ go fmt ./...

using:

  $ go version
  go version go1.14.4 linux/amd64

@wking
Member Author

wking commented Sep 22, 2020

4.6 bug is VERIFIED, so we should be unblocked here:

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Sep 22, 2020
@openshift-ci-robot
Contributor

@wking: This pull request references Bugzilla bug 1872906, which is valid.

6 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.z) matches configured target release for branch (4.5.z)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)
  • dependent bug Bugzilla bug 1843505 is in the state VERIFIED, which is one of the valid states (VERIFIED, RELEASE_PENDING, CLOSED (ERRATA))
  • dependent Bugzilla bug 1843505 targets the "4.6.0" release, which is one of the valid target releases: 4.6.0, 4.6.z
  • bug has dependents

In response to this:

4.6 bug is VERIFIED, so we should be unblocked here:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@LalatenduMohanty LalatenduMohanty (Member) left a comment

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 22, 2020
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: LalatenduMohanty, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [LalatenduMohanty,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@markmc markmc added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Sep 25, 2020
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

2 similar comments

@knobunc

knobunc commented Sep 26, 2020

/retest

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment

@wking
Member Author

wking commented Oct 1, 2020

e2e blew up on Kube API access, with lots of ":6443: i/o timeout" errors.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

13 similar comments

@wking
Member Author

wking commented Oct 2, 2020

Maybe openshift/release#12391 helped.

/retest

@wking
Member Author

wking commented Oct 2, 2020

All green again after openshift/release#12391. Ready for relabeling when the patch-manager is ready.

@mfojtik mfojtik added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Oct 7, 2020
@sdodson
Member

sdodson commented Oct 7, 2020

/bugzilla refresh

@openshift-ci-robot
Contributor

@sdodson: This pull request references Bugzilla bug 1872906, which is valid.

6 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.z) matches configured target release for branch (4.5.z)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)
  • dependent bug Bugzilla bug 1843505 is in the state VERIFIED, which is one of the valid states (VERIFIED, RELEASE_PENDING, CLOSED (ERRATA))
  • dependent Bugzilla bug 1843505 targets the "4.6.0" release, which is one of the valid target releases: 4.6.0, 4.6.z
  • bug has dependents

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. and removed bugzilla/severity-low Referenced Bugzilla bug's severity is low for the branch this PR is targeting. labels Oct 7, 2020
@mfojtik
Member

mfojtik commented Oct 7, 2020

patch-manager notes: approved as this should fix a frequent 4.5->4.6 upgrade failure

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 2c849e5 into openshift:release-4.5 Oct 7, 2020
@openshift-ci-robot
Contributor

@wking: All pull requests linked via external trackers have merged:

Bugzilla bug 1872906 has been moved to the MODIFIED state.

In response to this:

Bug 1872906: pkg/start: Release leader lease on graceful shutdown

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking deleted the gracefully-release-leader-lease-4.5 branch October 7, 2020 19:15