Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 1872906: pkg/start: Release leader lease on graceful shutdown #446

Commits on Aug 26, 2020

  1. pkg/start: Drop the internal EnableMetrics

    We've had it since 2b81f47 (cvo: Release our leader lease when we
    are gracefully terminated, 2019-01-16, openshift#87), but it's redundant
    vs. "ListenAddr is not an empty string".
    
    I'm also switching to:
    
      o.ListenAddr != ""
    
    instead of:
    
      len(o.ListenAddr) > 0
    
    because it seems slightly easier to understand, but obviously either
    will work.
    
    Cherry-picked from 07e5809 (openshift#349), around conflicts due to the lack
    of TLS metrics in 4.5.
    wking committed Aug 26, 2020
    Configuration menu
    Copy the full SHA
    55ff603 View commit details
    Browse the repository at this point in the history
  2. pkg/cvo/metrics: Graceful server shutdown

    Somewhat like the example in [1].  This pushes the server management
    down into a new RunMetrics method, which we then run in its own
    goroutine.  This is initial groundwork; I expect we will port more of
    our child goroutines to this framework in follow-up work.
    
    Cherry-picked from b30aa0e (openshift#349), around conflicts due to the lack
    of TLS metrics in 4.5.
    
    [1]: https://golang.org/pkg/net/http/#Server.Shutdown
    wking committed Aug 26, 2020
    Configuration menu
    Copy the full SHA
    d257c32 View commit details
    Browse the repository at this point in the history

Commits on Aug 27, 2020

  1. pkg/start: Register metrics directly

    Pulling this up out of cvo.New() while working to decouple metrics
    handling from the core CVO goroutine.
    wking committed Aug 27, 2020
    Configuration menu
    Copy the full SHA
    f8774c0 View commit details
    Browse the repository at this point in the history
  2. pkg/cvo/egress: Pull HTTPS/Proxy egress into separate file

    These are not just for available updates, they're also for downloading
    signatures.  Placing them in a separate file makes it easier to focus
    on the code that is specific to available updates.
    wking committed Aug 27, 2020
    Configuration menu
    Copy the full SHA
    d8ca134 View commit details
    Browse the repository at this point in the history
  3. pkg/start: Release leader lease on graceful shutdown

    So the incoming cluster-version operator doesn't need to wait for the
    outgoing operator's lease to expire, which can take a while [1]:
    
      I0802 10:06:01.056591       1 leaderelection.go:243] attempting to acquire leader lease  openshift-cluster-version/version...
      ...
      I0802 10:07:42.632719       1 leaderelection.go:253] successfully acquired lease openshift-cluster-version/version
    
    and time out the:
    
      Cluster did not acknowledge request to upgrade in a reasonable time
    
    testcase [2].  Using ReleaseOnCancel has been the plan since
    2b81f47 (cvo: Release our leader lease when we are gracefully
    terminated, 2019-01-16, openshift#87).  I'm not clear on why it (sometimes?)
    doesn't work today.
    
    The discrepancy between the "exit after 2s no matter what" comment and
    the 5s After dates back to dbedb7a (cvo: When the CVO restarts,
    perform one final sync to write status, 2019-04-27, openshift#179), which
    bumped the After from 2s to 5s, but forgot to bump the comment.  I'm
    removing that code here in favor of the two-minute timeout from
    b30aa0e (pkg/cvo/metrics: Graceful server shutdown, 2020-04-15, openshift#349).
    We still exit immediately on a second TERM, for folks who get
    impatient waiting for the graceful timeout.
    
    Decouple shutdownContext from the context passed into Options.run, to
    allow TestIntegrationCVO_gracefulStepDown to request a graceful
    shutdown.  And remove Context.Start(), inlining the logic in
    Options.run so we can count and reap the goroutines it used to launch.
    This also allows us to be more targeted with the context for each
    goroutines:
    
    * Informers are now launched before the lease controller, so they're
      up and running by the time we acquire the lease.  They remain
      running until the main operator CVO.Run() exits, after which we shut
      them down.  Having informers running before we have a lease is
      somewhat expensive in terms of API traffic, but we should rarely
      have two CVO pods competing for leadership since we transitioned to
      the Recreate Deployment strategy in 078686d
      (install/0000_00_cluster-version-operator_03_deployment: Set
      'strategy: Recreate', 2019-03-20, openshift#140) and 5d8a527
      (install/0000_00_cluster-version-operator_03_deployment: Fix
      Recreate strategy, 2019-04-03, openshift#155).  I don't see a way to block on
      their internal goroutine's completion, but maybe informers will grow
      an API for that in the future.
    
    * The metrics server also continues to run until CVO.Run() exits,
      where previously we began gracefully shutting it down at the same
      time we started shutting down CVO.Run().  This ensures we are around
      and publishing any last-minute CVO.Run() changes.
    
    * Leader election also continues to run until CVO.Run() exits.  We
      don't want to release the lease while we're still controlling
      things.
    
    * CVO.Run() and AutoUpdate.Run() both stop immediately when the
      passed-in context is canceled or we call runCancel internally
      (because of a TERM, error from a goroutine, or loss of leadership).
      These are the only two goroutines that are actually writing to the
      API servers, so we want to shut them down as quickly as possible.
    
    Drop an unnecessary runCancel() from the "shutting down" branch of the
    error collector.  I'd added it in b30aa0e, but you can only ever
    get into the "shutting down" branch if runCancel has already been
    called.  And fix the scoping for the shutdownTimer variable so we
    don't clear it on each for-loop iteration (oops :p, bug from
    b30aa0e).
    
    Add some logging to the error collector, so it's easier to see where
    we are in the collection process from the operator logs.  Also start
    logging collected goroutines by name, so we can figure out which may
    still be outstanding.
    
    Set terminationGracePeriodSeconds 130 to extend the default 30s [3],
    to give the container the full two-minute graceful timeout window
    before the kubelet steps in with a KILL.
    
    Push the Background() initialization all the way up to the
    command-line handler, to make it more obvious that the context is
    scoped to the whole 'start' invocation.
    
    [1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/25365/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1289853267223777280/artifacts/e2e-gcp-upgrade/pods/openshift-cluster-version_cluster-version-operator-5b6ff896c6-57ppb_cluster-version-operator.log
    [2]: https://bugzilla.redhat.com/show_bug.cgi?id=1843505#c7
    [3]: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.18/#podspec-v1-core
    
    Cherry picked from cc1921d (openshift#424), around conflicts due to the lack
    of TLS metrics and the Context pivots in 4.5.
    wking committed Aug 27, 2020
    Configuration menu
    Copy the full SHA
    905b305 View commit details
    Browse the repository at this point in the history
  4. pkg/start/start_integration_test: Do not assume "deleted" for ConfigM…

    …ap lock release
    
    From the godocs:
    
      $ grep -A5 '// HolderIdentity' vendor/k8s.io/client-go/tools/leaderelection/resourcelock/interface.go
        // HolderIdentity is the ID that owns the lease. If empty, no one owns this lease and
        // all callers may acquire. Versions of this library prior to Kubernetes 1.14 will not
        // attempt to acquire leases with empty identities and will wait for the full lease
        // interval to expire before attempting to reacquire. This value is set to empty when
        // a client voluntarily steps down.
        HolderIdentity       string      `json:"holderIdentity"`
    
    The previous assumption that the release would involve ConfigMap
    deletion was born with the test in 2b81f47 (cvo: Release our leader
    lease when we are gracefully terminated, 2019-01-16, openshift#87).
    
    Cherry picked from dd09c3f (openshift#424), around conflicts due to the lack
    of Context pivots in 4.5.
    wking committed Aug 27, 2020
    Configuration menu
    Copy the full SHA
    c8af639 View commit details
    Browse the repository at this point in the history
  5. pkg/start: Fill in deferred HandleCrash

    Clayton wants these in each goroutine we launch [1].  Obviously
    there's no way to reach inside the informer Start()s and add it there.
    I'm also adding this to the FIXME comment for rerolling the
    auto-update worker goroutines; we'll get those straigtened out in
    future work.
    
    Cherry picked from 9c42a92 (openshift#424), around conflicts due to the lack
    of Context pivots in 4.5.
    
    [1]: openshift#424
    wking committed Aug 27, 2020
    Configuration menu
    Copy the full SHA
    c8f99b2 View commit details
    Browse the repository at this point in the history
  6. cmd/start: Include the version in the outgoing log line

    Lala wanted the version included in the outgoing log line [1].  I'm
    not sure why you'd be wondering which version of the CVO code was
    running for that particular line, and not for other lines in the log,
    but including the version there is easy enough.
    
    While we're thinking about logging the CVO version, also remove the
    useless %s formatting from the opening log line, because we don't need
    to manipulate version.String at all.
    
    [1]: openshift#424 (comment)
    wking committed Aug 27, 2020
    Configuration menu
    Copy the full SHA
    a42bfb7 View commit details
    Browse the repository at this point in the history
  7. *: Wash through 'go fmt'

    Now that we have CI that cares about this (hooray!).  Generated with:
    
      $ go fmt ./...
    
    using:
    
      $ go version
      go version go1.14.4 linux/amd64
    wking committed Aug 27, 2020
    Configuration menu
    Copy the full SHA
    65bcffd View commit details
    Browse the repository at this point in the history