
Make k0s reset fail if it can't reach containerd #4434

Merged

Conversation

juanluisvaladas
Contributor

Description

Prior to this commit, if the containerd unix socket wasn't listening, grpc.Dial would try to connect forever.

This change makes the dial retry about 3 times, and the whole ListPods operation is also retried 3 times; after that, the step is considered failed.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

How Has This Been Tested?

  • Manual test
  • Auto test added

Checklist:

  • My code follows the style guidelines of this project
  • My commit messages are signed-off
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

@juanluisvaladas juanluisvaladas added the backport/release-1.30 PR that needs to be backported/cherrypicked to the release-1.30 branch label May 16, 2024
@juanluisvaladas juanluisvaladas requested a review from a team as a code owner May 16, 2024 12:15
twz123
twz123 previously approved these changes May 16, 2024
Member

@twz123 twz123 left a comment


> Prior to this commit, if the containerd unix socket wasn't listening, grpc.Dial would try to connect forever.

What led to this situation? Did containerd crash? I mean, the reset step is starting its own containerd instance just in order to be able to gracefully stop containers (and doesn't restart it in case it crashes. Maybe it should use the containerd component to do it "the usual way"? Maybe it should short-circuit the list-and-stop loop when it detects that containerd terminated?). Having containerd not listening on its socket file makes me think that we might have another problem here. Did you consider using non-blocking dials? Not sure a retry even makes sense when there's nothing listening. The only reason for retrying would be that containerd is still starting up and hasn't begun listening yet. On the other hand, a blocking dial would be preferable then.

Anyhow, /approve

pkg/cleanup/containers.go (outdated)
```diff
-conn, err := grpc.Dial(addr, grpc.WithTransportCredentials(insecure.NewCredentials()), grpc.WithBlock())
+ctx, cancel := context.WithTimeout(context.Background(), time.Duration(5*time.Second))
+defer cancel()
+conn, err := grpc.DialContext(ctx, addr, grpc.WithBlock(), grpc.WithTransportCredentials(insecure.NewCredentials()))
```
Member


My IDE is telling me grpc.DialContext is deprecated:

> Deprecated: use NewClient instead. Will be supported throughout 1.x.

Contributor Author


That's odd, because my IDE told me that grpc.Dial with grpc.WithTimeout was deprecated and suggested this instead, but it's indeed deprecated too. I'll revisit this.

```diff
@@ -105,10 +106,12 @@ func getRuntimeClient(addr string) (pb.RuntimeServiceClient, *grpc.ClientConn, e
 }
 
 func getRuntimeClientConnection(addr string) (*grpc.ClientConn, error) {
-	conn, err := grpc.Dial(addr, grpc.WithTransportCredentials(insecure.NewCredentials()), grpc.WithBlock())
+	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(5*time.Second))
```
Member


Did you consider adding proper context propagation here, i.e. giving, say, StopContainer and friends a context parameter and using it here?

```go
		var err error
		pods, err = c.Config.containerRuntime.ListContainers()
		if err != nil {
			return err
		}
		return nil
	})
}, retry.Attempts(3))
```
Member


The default of 10 attempts was too much, I guess? If we had a context here, we could support Ctrl+C from the console.

@juanluisvaladas
Contributor Author

juanluisvaladas commented May 17, 2024

After checking this further, grpc.NewClient ignores grpc.WithBlock(), so even if we pass a grpc.WithContextDialer it won't really have the effect we want.

I think the most sensible way of dealing with this is starting the connection in the background with grpc.NewClient; if it fails to connect, we'll get the error when listing the pod sandbox. As there are practically no steps between instantiating the client and using it, and it's instantiated on every request to the CRI, I don't think this changes the logic at all, even if that bit happens in the background.

Finally, I reverted the attempts to the default value of 10, because the code is much faster now and 10 attempts don't take that long to fail anyway.

The context propagation for Ctrl+C is indeed useful, so I added it.
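The lazy-connect behavior described here can be mimicked in a small stdlib sketch (lazyClient and ListPods are hypothetical stand-ins for grpc.NewClient and the CRI call; the socket path is made up):

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// lazyClient mirrors the grpc.NewClient behavior described above:
// constructing the client never blocks; the first actual call performs the
// dial and surfaces any connection error.
type lazyClient struct{ addr string }

func newLazyClient(addr string) *lazyClient { return &lazyClient{addr: addr} }

func (c *lazyClient) ListPods() ([]string, error) {
	conn, err := net.DialTimeout("unix", c.addr, 2*time.Second)
	if err != nil {
		// The call fails, not the constructor.
		return nil, fmt.Errorf("listing pods: %w", err)
	}
	defer conn.Close()
	return nil, nil
}

func main() {
	c := newLazyClient("/tmp/no-such-containerd.sock") // returns immediately
	_, err := c.ListPods()
	fmt.Println("call failed:", err != nil) // → call failed: true
}
```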


```go
var pods []string
ctx := context.Background()
```
Member


Unfortunately, this is not sufficient to support Ctrl+C. A clean solution probably needs proper context propagation, using a context that's set up somewhere at the beginning of the reset subcommand and hooked up to signal handlers that cancel it, e.g. via os/signal.NotifyContext(...). That's separate from this PR's intent, though, and can be followed up later on. We can probably leave the contexts alone here and add them at a later stage. If we keep them here, I'd suggest indicating that this is not yet complete:

Suggested change

```diff
-ctx := context.Background()
+ctx := context.TODO()
```

pkg/cleanup/containers.go (outdated)
@juanluisvaladas juanluisvaladas force-pushed the fix-reset-failedcontainerd branch 3 times, most recently from bb07173 to 72d7434 on May 22, 2024 10:08
@juanluisvaladas
Contributor Author

juanluisvaladas commented May 22, 2024

I made two changes:
1. Applied both suggestions.
2. Removed grpc.WithBlock() from the grpc.NewClient call. I apparently added it accidentally in the previous commit. This option is ignored by grpc.NewClient, so effectively the change doesn't do anything, but it shouldn't be there.

Member

@twz123 twz123 left a comment


Alright. Note to self: the culprit was that grpc.WithBlock() made the call to grpc.Dial retry forever internally instead of returning, effectively making our own retry loop useless.

This can happen if the internal containerd fails to start or crashes, since there's no monitoring or supervision at all for the spawned process. Startup failures can happen, as k0s reset may very well be executed on nodes that are already broken in one way or another.

The same goes for an external runtime if one is configured but not behaving well.

@juanluisvaladas Would you mind rebasing this?

Prior to this commit, if the containerd unix socket wasn't listening
grpc.Dial would try to connect forever.

This commit establishes the connection in the background and the actual
call will fail if it has to.

Also, we implement a single context for all the operations so that we
can cancel the execution with Ctrl+C.

Co-authored-by: Tom Wieczorek <twz123@users.noreply.github.com>
Signed-off-by: Juan-Luis de Sousa-Valadas Castaño <jvaladas@mirantis.com>
@juanluisvaladas juanluisvaladas merged commit 34d4b5e into k0sproject:main Jun 6, 2024
13 checks passed
@k0s-bot

k0s-bot commented Jun 6, 2024

Successfully created backport PR for release-1.30:

Labels

  • area/cli
  • backport/release-1.30 (PR that needs to be backported/cherrypicked to the release-1.30 branch)
  • bug (Something isn't working)