OCPBUGS-32091: Add Top-level Context for Create Commands #8063

Merged

Contributor

@patrickdillon patrickdillon commented Feb 26, 2024

Adds a top-level context when running create commands. The context is passed through the asset store and then to any PostRun functions. The main motivation is for a clean shutdown of CAPI controllers, but there are a wide variety of potential use cases.

This PR uses the pattern introduced in #6009 to add a GenerateWithContext function for assets, along with an adapter that drops the context and calls the original Generate function. Currently, only the Cluster asset implements GenerateWithContext.

The PR currently achieves the primary goal of shutting down the CAPI controllers on interrupt. There are some remaining issues I would like to clear up:

  • On interrupt, the error message simply says "context canceled". It would be better for the message to indicate a user interrupt.
  • We need to test that the logrus exit handler properly shuts down the controllers on error (not just on interrupt).
  • The interrupt signal handler is short-circuited by the call to logrus.Exit. This means the controllers are not "gracefully shutting down", although the processes are properly killed. It's unclear to me whether we need any special handling now that we have the top-level context.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 26, 2024
@openshift-ci-robot
Contributor

openshift-ci-robot commented Feb 26, 2024

@patrickdillon: This pull request references CORS-3241 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@patrickdillon
Contributor Author

/cc @andfasano @vincepri

pkg/asset/store/store.go (resolved)
@@ -310,6 +301,14 @@ func (i *InfraProvider) Provision(dir string, parents asset.Parents) ([]*asset.F
}

logrus.Infof("Cluster API resources have been created. Waiting for cluster to become ready...")

Contributor

Did this sneak into this PR? This doesn't seem directly related to the work on contexts.

Contributor Author

I could move it to a different commit or PR, but it is somewhat related to the goal of a clean shutdown of the CAPI controllers.

pkg/clusterapi/system.go (outdated, resolved)
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 3, 2024
@patrickdillon
Contributor Author

/assign @vincepri

Even though we use this context when starting the controllers:

ps.Cmd = exec.CommandContext(ctx, ps.Path, ps.Args...) //nolint:gosec

we still need to handle the CTRL+C user interrupt, because the context will not be done/cancelled on an interrupt, right?

How can we ensure that the CAPI system Teardown() function has enough time to complete? In the current implementation, the Teardown function begins, but when the context is cancelled an error is returned which ends up calling logrus.Exit before Teardown reports it is done:

if err != nil {
	if strings.Contains(err.Error(), asset.InstallConfigError) {
		logrus.Error(err)
		logrus.Exit(exitCodeInstallConfigError)
	}
	if strings.Contains(err.Error(), asset.ClusterCreationError) {
		logrus.Error(err)
		logrus.Exit(exitCodeInfrastructureFailed)
	}
	logrus.Fatal(err)
}

This is similar to what you handled in #7693 and #7864 but I still don't see how to make it work here.

@vincepri
Contributor

@patrickdillon In this case we'd have to create a new context, as a parent of the one that we use from signals. We can then wait for ctx.Done() before exiting and calling the Teardown function

pkg/clusterapi/system.go (outdated, resolved)
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 21, 2024
@patrickdillon patrickdillon force-pushed the create-cmd-context branch 3 times, most recently from 0ec14af to 401fccb Compare March 26, 2024 20:35
cmd/openshift-install/create.go (outdated, resolved)
cmd/openshift-install/create.go (outdated, resolved)
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 26, 2024
@patrickdillon patrickdillon force-pushed the create-cmd-context branch 2 times, most recently from 94aecd0 to 1e4ce95 Compare March 29, 2024 23:10
@bfournie
Contributor

bfournie commented Apr 9, 2024

/cc @bfournie

@openshift-bot
Contributor

openshift-bot commented Apr 10, 2024

@patrickdillon: This pull request references CORS-3241 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

@r4f4
Contributor

r4f4 commented Apr 11, 2024

Tried this with a ctrl-c right after capi started:

^CWARNING Received interrupt signal
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to run cluster api system: failed to run controller "aws infrastructure provider": failed to prepare controller "aws infrastructure provider" webhook options: the server was unable to return a response in the time allotted, but may still be processing the request (get mutatingwebhookconfigurations.admissionregistration.k8s.io capa-mutating-webhook-configuration)
INFO Shutting down local Cluster API control plane...
INFO Local Cluster API system has completed operations

@patrickdillon
Contributor Author

> Tried this with a ctrl-c right after capi started:
>
> ^CWARNING Received interrupt signal
> ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to run cluster api system: failed to run controller "aws infrastructure provider": failed to prepare controller "aws infrastructure provider" webhook options: the server was unable to return a response in the time allotted, but may still be processing the request (get mutatingwebhookconfigurations.admissionregistration.k8s.io capa-mutating-webhook-configuration)
> INFO Shutting down local Cluster API control plane...
> INFO Local Cluster API system has completed operations

Yup, that is the expected result. The returned error will vary depending on where and why we exit.

In terms of UX, I wonder if we should move these to debug level but I think that is mostly a nit.

> INFO Shutting down local Cluster API control plane...
> INFO Local Cluster API system has completed operations

Adds a top-level context to be passed down from main, as well
as an interrupt handler with graceful shutdown logic. The signal
handler allows us to trap a user interrupt and run graceful shutdown
logic rather than exiting immediately. This graceful shutdown
allows us to run any cleanup operations; in particular, we can shut down
locally running CAPI controller processes. Otherwise they can potentially
leak and continue running in the background (continuing to perform
reconcile actions such as creating cloud resources).

Introduces the asset generator interface, which allows generating
assets with a passed-in context. Provides an adapter for assets
that do not implement GenerateWithContext. The adapter simply
wraps the asset and calls the original Generate (without context)
function.

Plumbs the top-level context into the agent, create, & gather commands
and into the asset graph.

Removes the original signal handler, which is replaced by
the signal handler in installer main.
pick 1464d8847d main: add top-level context and graceful shutdown

Typically the CAPI system needs to continue to run until
bootstrap destroy (because it is used in the bootstrap destroy
process). Shut it down if we are preserving bootstrap resources.

Updates tests to accept a context after the interface for the
asset store was updated.
@patrickdillon patrickdillon changed the title CORS-3241: Add Top-level Context for Create Commands OCPBUGS-32091: Add Top-level Context for Create Commands Apr 11, 2024
@openshift-ci-robot openshift-ci-robot added jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Apr 11, 2024
@openshift-ci-robot
Contributor

@patrickdillon: This pull request references Jira Issue OCPBUGS-32091, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @gpei

The bug has been updated to refer to the pull request using the external bug tracker.

@openshift-ci openshift-ci bot requested a review from gpei April 11, 2024 20:07
Contributor

@vincepri vincepri left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 12, 2024
@r4f4
Contributor

r4f4 commented Apr 12, 2024

/approve

Contributor

openshift-ci bot commented Apr 12, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: r4f4, vincepri

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 12, 2024
Contributor

openshift-ci bot commented Apr 12, 2024

@patrickdillon: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/okd-e2e-agent-compact-ipv4 | 401fccb | link | false | /test okd-e2e-agent-compact-ipv4 |
| ci/prow/altinfra-e2e-aws-ovn-localzones | 0398c63 | link | false | /test altinfra-e2e-aws-ovn-localzones |
| ci/prow/altinfra-e2e-aws-ovn-shared-vpc-edge-zones | 0398c63 | link | false | /test altinfra-e2e-aws-ovn-shared-vpc-edge-zones |
| ci/prow/altinfra-e2e-aws-ovn-single-node | 0398c63 | link | false | /test altinfra-e2e-aws-ovn-single-node |
| ci/prow/altinfra-e2e-aws-custom-security-groups | 0398c63 | link | false | /test altinfra-e2e-aws-custom-security-groups |
| ci/prow/altinfra-e2e-aws-ovn-fips | 0398c63 | link | false | /test altinfra-e2e-aws-ovn-fips |
| ci/prow/altinfra-e2e-aws-ovn-wavelengthzones | 0398c63 | link | false | /test altinfra-e2e-aws-ovn-wavelengthzones |
| ci/prow/okd-e2e-aws-ovn-upgrade | a132f85 | link | false | /test okd-e2e-aws-ovn-upgrade |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit d8c7872 into openshift:master Apr 12, 2024
31 of 32 checks passed
@openshift-ci-robot
Contributor

@patrickdillon: Jira Issue OCPBUGS-32091: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-32091 has been moved to the MODIFIED state.

@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build ose-installer-altinfra-container-v4.16.0-202404121144.p0.gd8c7872.assembly.stream.el8 for distgit ose-installer-altinfra.
All builds following this will include this PR.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. capi jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

9 participants