
OCPBUGS-32183: catalog-operator: delete catalog pods stuck in Terminating state due to unreachable node #3201

Merged

Conversation

joelanford
Member

Description of the change:

Motivation for the change:

Architectural changes:

Testing remarks:

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Bug fixes are accompanied by regression test(s)
  • e2e tests and flake fixes are accompanied by evidence of flake testing, e.g. executing the test 100(0) times
  • tech debt/todo is accompanied by issue link(s) in comments in the surrounding code
  • Tests are comprehensible, e.g. Ginkgo DSL is being used appropriately
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive
  • Tests marked as [FLAKE] are truly flaky and have an issue
  • Code is properly formatted

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 11, 2024

openshift-ci bot commented Apr 11, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

// currentPods refers to the current pod instances of the catalog source
currentPods := c.currentPods(logger, source)

tmpPods := currentPods[:0]
Collaborator

nifty! didn't know about this one

Member Author

Actually, this reminds me there's a new kid on the block (slices.DeleteFunc). I'll switch to that as the intent is more clear.
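For readers following along, a minimal, self-contained sketch of the two idioms being discussed; the pod names and the isDead predicate are stand-ins, not code from this PR:

package main

import (
	"fmt"
	"slices"
)

// isDead is a stand-in predicate for the PR's isPodDead check.
func isDead(name string) bool { return name == "dead-pod" }

func main() {
	pods := []string{"pod-a", "dead-pod", "pod-b"}

	// The tmpPods := currentPods[:0] idiom: a zero-length slice that shares
	// the original backing array, so appending the elements to keep filters
	// in place without allocating.
	tmp := pods[:0]
	for _, p := range pods {
		if !isDead(p) {
			tmp = append(tmp, p)
		}
	}
	fmt.Println(tmp) // [pod-a pod-b]

	// slices.DeleteFunc (Go 1.21+) states the same intent directly: it
	// removes the elements for which the predicate returns true.
	pods2 := []string{"pod-a", "dead-pod", "pod-b"}
	pods2 = slices.DeleteFunc(pods2, isDead)
	fmt.Println(pods2) // [pod-a pod-b]
}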

@joelanford joelanford force-pushed the force-delete-dead-catalog-pods branch 2 times, most recently from d856987 to 0038478 Compare April 12, 2024 18:41
@joelanford joelanford marked this pull request as ready for review April 12, 2024 18:42
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 12, 2024
…to unreachable node

Signed-off-by: Joe Lanford <joe.lanford@gmail.com>
@joelanford joelanford force-pushed the force-delete-dead-catalog-pods branch from 0038478 to 82f4997 Compare April 12, 2024 19:01
@joelanford joelanford changed the title catalog-operator: delete catalog pods stuck in Terminating state due to unreachable node OCPBUGS-32183: catalog-operator: delete catalog pods stuck in Terminating state due to unreachable node Apr 12, 2024
@grokspawn
Contributor

would love to better understand the need for the errors --> pkgerrors change, to know if that's a pattern we should look for elsewhere.

@joelanford
Copy link
Member Author

errors is the name of a very popular package in the Go standard library.

This package was authored before the Go standard library had fmt.Errorf("blah blah: %w", err). Before Go supported that, pkg/errors.Wrapf(err, "message") was a popular way to wrap errors.

I made this change to give the errors name to stdlib errors over the third-party errors package.
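For illustration, a minimal sketch of the two wrapping styles (the sentinel error and message strings are made up, not from this PR):

package main

import (
	"errors"
	"fmt"

	pkgerrors "github.com/pkg/errors"
)

var errNotFound = errors.New("catalog not found")

func main() {
	// Pre-Go-1.13 style: third-party wrapping, with the import aliased to
	// pkgerrors so the plain "errors" name stays with the standard library.
	err1 := pkgerrors.Wrapf(errNotFound, "syncing catalog source")

	// Modern stdlib style: fmt.Errorf with the %w verb.
	err2 := fmt.Errorf("syncing catalog source: %w", errNotFound)

	// Both unwrap back to the sentinel (pkg/errors v0.9+ implements Unwrap).
	fmt.Println(errors.Is(err1, errNotFound), errors.Is(err2, errNotFound)) // true true
}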

@@ -523,6 +545,29 @@ func imageChanged(logger *logrus.Entry, updatePod *corev1.Pod, servingPods []*co
return false
}

func isPodDead(pod *corev1.Pod) bool {
for _, check := range []func(*corev1.Pod) bool{
Contributor

@grokspawn grokspawn Apr 12, 2024

This is a clever way to provide an extensible approach to a series of checks... but do we need it? It seems we would have the same benefits by doing s/isPodDead/isPodDeletedByTaintManager/g

Member Author

I'm somewhat confident we'll find another way for pods to be dead. We've seen similar issues in operator-lib's leader-for-life election.

So I figured I'd make things super easy for ourselves next time around.
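For context, a self-contained sketch of the check-list pattern under discussion. The body of isPodDeletedByTaintManager below (checking for a true DisruptionTarget pod condition) is an illustrative assumption, not necessarily the PR's exact logic:

package reconciler

import (
	corev1 "k8s.io/api/core/v1"
)

// isPodDeletedByTaintManager reports whether the pod is being deleted and
// carries a DisruptionTarget condition, as happens when pods are evicted
// from an unreachable node. (Assumed check, for illustration.)
func isPodDeletedByTaintManager(pod *corev1.Pod) bool {
	if pod.DeletionTimestamp == nil {
		return false
	}
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.DisruptionTarget && cond.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}

// isPodDead runs each heuristic in turn; supporting a new way for pods to be
// dead means appending one more function to the slice.
func isPodDead(pod *corev1.Pod) bool {
	for _, check := range []func(*corev1.Pod) bool{
		isPodDeletedByTaintManager,
	} {
		if check(pod) {
			return true
		}
	}
	return false
}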

Contributor

Yeah, do we think we'll need to expand this in the future?

Contributor

Well, I ain't agin' it, but I generally like to supply it when it's needed.
/lgtm
from me

Contributor

I'm not objecting... just asking.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 12, 2024
@joelanford joelanford added this pull request to the merge queue Apr 12, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 12, 2024
@tmshort tmshort added this pull request to the merge queue Apr 15, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 15, 2024
@grokspawn grokspawn added this pull request to the merge queue Apr 15, 2024
Merged via the queue into operator-framework:master with commit 68c24cf Apr 15, 2024
14 checks passed
anik120 added a commit to anik120/operator-lifecycle-manager that referenced this pull request Aug 14, 2024
[PR 3201](operator-framework#3201) attempted to solve the
issue by deleting the pods stuck in Terminating due to an unreachable node.
However, the logic to do that was included in `EnsureRegistryServer`, which only
gets executed if polling is requested by the user.

This PR fixes the issue by modifying the `RegistryReconciler` interface with the
introduction of a new component for the interface: `RegistryCleaner`.
This promotes the job of cleaning up the pods that are stuck (and any other resources
that may need to be cleaned) to first-class status. The `RegistryCleaner` is then
called as the first step in the Catalog Operator registry reconciler, so that the
stuck pods are cleaned up before the rest of the reconciler logic is executed.
The PR provides implementations of `RegistryCleaner` for the `GrpcReconciler`,
`ConfigMapReconciler` and the `GrpcAddressRegistryReconciler` implementations of
the `RegistryReconciler` interface.
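A rough sketch of the interface change this commit message describes; the method name CleanRegistryServer and the exact signatures are assumptions for illustration:

package reconciler

import (
	"github.com/operator-framework/api/pkg/operators/v1alpha1"
	"github.com/sirupsen/logrus"
)

// RegistryCleaner promotes cleanup of stuck resources to a first-class step.
type RegistryCleaner interface {
	CleanRegistryServer(logger *logrus.Entry, catalogSource *v1alpha1.CatalogSource) error
}

// Embedding RegistryCleaner means every RegistryReconciler implementation
// (GrpcReconciler, ConfigMapReconciler, GrpcAddressRegistryReconciler) must
// provide it, and the catalog operator can call CleanRegistryServer first.
type RegistryReconciler interface {
	RegistryCleaner
	EnsureRegistryServer(logger *logrus.Entry, catalogSource *v1alpha1.CatalogSource) error
}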
anik120 added a commit to anik120/operator-lifecycle-manager that referenced this pull request Aug 21, 2024
[PR 3201](operator-framework#3201) attempted to solve the issue by deleting the pods stuck in
`Terminating` due to an unreachable node. However, the logic to do that was
included in `EnsureRegistryServer`, which only gets executed if polling is
requested by the user.

This PR moves the logic of checking for dead pods out of `EnsureRegistryServer`
and puts it in `CheckRegistryServer` instead. This way, if any dead pods are
detected during `CheckRegistryServer`, the value of `healthy` is returned as
`false`, which in turn triggers `EnsureRegistryServer`.
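A simplified sketch of the resulting flow, assuming a reconciler with CheckRegistryServer/EnsureRegistryServer methods (the signatures here are assumptions, not OLM's exact API):

package reconciler

import (
	"github.com/operator-framework/api/pkg/operators/v1alpha1"
	"github.com/sirupsen/logrus"
)

// registryReconciler is an assumed, simplified interface for illustration.
type registryReconciler interface {
	CheckRegistryServer(logger *logrus.Entry, cs *v1alpha1.CatalogSource) (healthy bool, err error)
	EnsureRegistryServer(logger *logrus.Entry, cs *v1alpha1.CatalogSource) error
}

// syncRegistryServer shows how the health check drives recreation: dead pods
// make CheckRegistryServer report healthy=false, which in turn routes the
// sync through EnsureRegistryServer; no user-configured polling is required.
func syncRegistryServer(logger *logrus.Entry, r registryReconciler, cs *v1alpha1.CatalogSource) error {
	healthy, err := r.CheckRegistryServer(logger, cs)
	if err != nil {
		return err
	}
	if !healthy {
		return r.EnsureRegistryServer(logger, cs)
	}
	return nil
}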
github-merge-queue bot pushed a commit that referenced this pull request Aug 30, 2024
openshift-bot pushed a commit to openshift-bot/operator-framework-olm that referenced this pull request Aug 31, 2024

Upstream-repository: operator-lifecycle-manager
Upstream-commit: f2431893193e7112f78298ad7682ff3e1b179d8c
anik120 added a commit to anik120/operator-framework-olm that referenced this pull request Sep 16, 2024
…ilure (#3366)

Labels
lgtm Indicates that a PR is ready to be merged.