
Make k0s reset fail if it can't reach containerd #4434

Merged

Conversation

juanluisvaladas
Contributor

Description

Prior to this commit, if the containerd unix socket wasn't listening, grpc.Dial would try to connect forever.

This change makes the dial retry about 3 times, and the whole ListPods operation is also retried 3 times; after that, the step is considered failed.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

How Has This Been Tested?

  • Manual test
  • Auto test added

Checklist:

  • My code follows the style guidelines of this project
  • My commit messages are signed-off
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

@juanluisvaladas juanluisvaladas added the backport/release-1.30 PR that needs to be backported/cherrypicked to the release-1.30 branch label May 16, 2024
@juanluisvaladas juanluisvaladas requested a review from a team as a code owner May 16, 2024 12:15
twz123
twz123 previously approved these changes May 16, 2024
Member

@twz123 twz123 left a comment


> Prior to this commit, if the containerd unix socket wasn't listening, grpc.Dial would try to connect forever.

What led to this situation? Did containerd crash? I mean, the reset step is starting its own containerd instance just in order to be able to gracefully stop containers (and doesn't restart it in case it crashes. Maybe it should use the containerd component to do it "the usual way"? Maybe it should short-circuit the list-and-stop loop when it detects that containerd terminated?). Having containerd not listening on its socket file makes me think that we might have another problem here. Did you consider using non-blocking dials? Not sure a retry even makes sense when there's nothing listening. The only reason for retrying would be that containerd is still starting up and hasn't begun listening yet. On the other hand, a blocking dial would be preferable then.

Anyhow, /approve

pkg/cleanup/containers.go (outdated)
```diff
-conn, err := grpc.Dial(addr, grpc.WithTransportCredentials(insecure.NewCredentials()), grpc.WithBlock())
+ctx, cancel := context.WithTimeout(context.Background(), time.Duration(5*time.Second))
+defer cancel()
+conn, err := grpc.DialContext(ctx, addr, grpc.WithBlock(), grpc.WithTransportCredentials(insecure.NewCredentials()))
```
Member


My IDE is telling me grpc.DialContext is deprecated:

> Deprecated: use NewClient instead. Will be supported throughout 1.x.

Contributor Author


That's odd, because my IDE told me that grpc.Dial with grpc.WithTimeout was deprecated and suggested this instead, but it's indeed deprecated too. I'll revisit this.

```diff
@@ -105,10 +106,12 @@ func getRuntimeClient(addr string) (pb.RuntimeServiceClient, *grpc.ClientConn, e
 }
 
 func getRuntimeClientConnection(addr string) (*grpc.ClientConn, error) {
-	conn, err := grpc.Dial(addr, grpc.WithTransportCredentials(insecure.NewCredentials()), grpc.WithBlock())
+	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(5*time.Second))
```
Member


Did you consider adding proper context propagation here, i.e. giving, say, StopContainer and friends a context parameter and using it here?

```go
		var err error
		pods, err = c.Config.containerRuntime.ListContainers()
		if err != nil {
			return err
		}
		return nil
	})
}, retry.Attempts(3))
```
Member


The default of 10 attempts was too much, I guess? If we had a context here, we could support Ctrl+C from the console.

@juanluisvaladas
Contributor Author

juanluisvaladas commented May 17, 2024

After checking this further, grpc.NewClient ignores grpc.WithBlock(), so even if we pass a grpc.WithContextDialer it won't really have the effect we want.

I think the most sensible way of dealing with this is starting the connection in the background with grpc.NewClient; if it fails to connect, we'll get the error when listing the pod sandbox. As there are practically no steps between instantiating the client and using it, and it's instantiated on every request to the CRI, I don't think this changes the logic at all, even if that bit happens in the background.

Finally, I reverted the attempts to the default value of 10, because the code is much faster now and 10 attempts don't take that long to fail anyway.

The context propagation for Ctrl+C is indeed useful, so I added it.
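The lazy-connect behavior described here can be mimicked in a small stdlib sketch (lazyClient and ListPods are hypothetical stand-ins for grpc.NewClient and the CRI call; the socket path is made up):

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// lazyClient mirrors the grpc.NewClient behavior described above:
// constructing the client never blocks; the first actual call performs the
// dial and surfaces any connection error.
type lazyClient struct{ addr string }

func newLazyClient(addr string) *lazyClient { return &lazyClient{addr: addr} }

func (c *lazyClient) ListPods() ([]string, error) {
	conn, err := net.DialTimeout("unix", c.addr, 2*time.Second)
	if err != nil {
		// The call fails, not the constructor.
		return nil, fmt.Errorf("listing pods: %w", err)
	}
	defer conn.Close()
	return nil, nil
}

func main() {
	c := newLazyClient("/tmp/no-such-containerd.sock") // returns immediately
	_, err := c.ListPods()
	fmt.Println("call failed:", err != nil) // → call failed: true
}
```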


```go
var pods []string
ctx := context.Background()
```
Member


Unfortunately, this is not sufficient to support Ctrl+C. A clean solution probably needs proper context propagation, using a context that's set up somewhere at the beginning of the reset subcommand and hooked up to signal handlers that cancel it, e.g. via os/signal.NotifyContext(...). That's separate from this PR's intent, though, and can be followed up later on. We can probably leave the contexts alone here and add them at a later stage. If we keep them here, I'd suggest indicating that this is not yet complete:

Suggested change

```diff
-ctx := context.Background()
+ctx := context.TODO()
```

pkg/cleanup/containers.go (outdated)
@juanluisvaladas juanluisvaladas force-pushed the fix-reset-failedcontainerd branch 3 times, most recently from bb07173 to 72d7434 on May 22, 2024 10:08
@juanluisvaladas
Contributor Author

juanluisvaladas commented May 22, 2024

I made two changes:
1. Applied both suggestions.
2. Removed grpc.WithBlock() from the grpc.NewClient call. I apparently added it accidentally in the previous commit. This option is ignored by grpc.NewClient, so effectively the change doesn't do anything, but it shouldn't be there.

Member

@twz123 twz123 left a comment


Alright. Note to self: the culprit was that grpc.WithBlock() made the call to grpc.Dial retry forever internally instead of returning, effectively making our own retry loop useless.

This can happen if the internal containerd fails to start or crashes, since there's no monitoring or supervision at all for the spawned process. Startup failures can happen, as k0s reset may very well be executed on nodes that are already broken in one way or another.

The same goes for an external runtime if one is configured but not behaving well.

@juanluisvaladas Would you mind rebasing this?

Prior to this commit, if the containerd unix socket wasn't listening
grpc.Dial would try to connect forever.

This commit establishes the connection in the background and the actual
call will fail if it has to.

Also, we implement a single context for all the operations so that we
can cancel the execution with Ctrl+C.

Co-authored-by: Tom Wieczorek <twz123@users.noreply.github.com>
Signed-off-by: Juan-Luis de Sousa-Valadas Castaño <jvaladas@mirantis.com>
@juanluisvaladas juanluisvaladas merged commit 34d4b5e into k0sproject:main Jun 6, 2024
13 checks passed
@k0s-bot

k0s-bot commented Jun 6, 2024

Successfully created backport PR for release-1.30:

Labels

  • area/cli
  • backport/release-1.30 (PR that needs to be backported/cherrypicked to the release-1.30 branch)
  • bug (Something isn't working)