-
Notifications
You must be signed in to change notification settings - Fork 347
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make k0s reset fail if it can't reach containerd #4434
Make k0s reset fail if it can't reach containerd #4434
Conversation
88e8162
to
bc107f6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prior to this commit, if the containerd unix socket wasn't listening grpc.Dial would try to connect forever.
What lead to this situation? Did containerd crash? I mean, the reset step is starting its own containerd instance just in order to be able to gracefully stop containers (and doesn't restart it in case it crashes. Maybe it should use the containerd component to do it "the usual way"? Maybe it should short-circuit the list-and-stop loop when it detects that containerd terminated?). Having containerd not listening on its socket file makes me think that we might have another problem here. Did you consider to use non-blocking dials? Not sure if a retry even makes sense when there's nothing listening. The only reason for retrying this is that containerd is still starting up didn't start to listen yet. On the other hand, a blocking dial would be preferable then.
Anyhow, /approve
pkg/container/runtime/cri.go
Outdated
conn, err := grpc.Dial(addr, grpc.WithTransportCredentials(insecure.NewCredentials()), grpc.WithBlock()) | ||
ctx, cancel := context.WithTimeout(context.Background(), time.Duration(5*time.Second)) | ||
defer cancel() | ||
conn, err := grpc.DialContext(ctx, addr, grpc.WithBlock(), grpc.WithTransportCredentials(insecure.NewCredentials())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My IDE is telling me grpc.DialContext
is deprecated:
Deprecated: use NewClient instead. Will be supported throughout 1.x.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's odd because my IDE told me grpc.Dial using grpc.WithTimeout was deprecated and to use this instead, but it's indeed deprecated. I'll revisit this.
pkg/container/runtime/cri.go
Outdated
@@ -105,10 +106,12 @@ func getRuntimeClient(addr string) (pb.RuntimeServiceClient, *grpc.ClientConn, e | |||
} | |||
|
|||
func getRuntimeClientConnection(addr string) (*grpc.ClientConn, error) { | |||
conn, err := grpc.Dial(addr, grpc.WithTransportCredentials(insecure.NewCredentials()), grpc.WithBlock()) | |||
ctx, cancel := context.WithTimeout(context.Background(), time.Duration(5*time.Second)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you consider to add proper context propagation here, i.e. adding a context to, say, StopContainer
and friends to use it here?
pkg/cleanup/containers.go
Outdated
var err error | ||
pods, err = c.Config.containerRuntime.ListContainers() | ||
if err != nil { | ||
return err | ||
} | ||
return nil | ||
}) | ||
}, retry.Attempts(3)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default of 10 times was too much, I guess? If we had a context here, we could support Ctrl+C from the console.
After checking this further, I think the most sensible way of dealing with this is starting the connection in the background with Finally, I reverted the attempts to the default value of 10 because the code is much faster and 10 doesn't take that long to fail anyway. The context propagation for control+c is indeed useful, so I added it. |
bc107f6
to
a0ebcc1
Compare
pkg/cleanup/containers.go
Outdated
|
||
var pods []string | ||
ctx := context.Background() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, this is not sufficient to support Ctrl+C. A clean solution probably needs to use proper context propagation using a context that's setup somewhere at the beginning of the reset subcommand, which is hooked up to signal handlers that cancel it, e.g. by using os/signal.NotifyContext(...)
. That's separate from this PR's intent, though, and can be followed up later on. We can probably leave the contexts alone here and add them at a later stage. If we'd keep them here, I'd suggest to indicate that this is not yet complete:
ctx := context.Background() | |
ctx := context.TODO() |
bb07173
to
72d7434
Compare
I made two changes: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright. Note to self: The culprit was that grpc.WithBlock()
made the call to grpc.Dial
not return and retry forever internally, effectively making our own retry loop useless.
This can happen if the internal containerd fails to start or crashes, since there's no monitoring or supervision at all for the spawned process. Startup failures can happen, as k0s reset may very well be executed on nodes that are already broken in one way or another.
The same goes for an external runtime if one is configured but not behaving well.
@juanluisvaladas Would you mind rebasing this?
Prior to this commit, if the containerd unix socket wasn't listening grpc.Dial would try to connect forever. This commit establishes the connection in the background and the actual call will fail if it has to. Also we implement a single context for all the operations so that we can cancel the execution with control c. Co-authored-by: Tom Wieczorek <twz123@users.noreply.github.com> Signed-off-by: Juan-Luis de Sousa-Valadas Castaño <jvaladas@mirantis.com>
72d7434
to
4e7b82a
Compare
Successfully created backport PR for |
Description
Prior to this commit, if the containerd unix socket wasn't listening grpc.Dial would try to connect forever.
This makes the dial retry about 3 times and we retry the whole ListPods operation 3 times and after that we assume this step has failed.
Type of change
How Has This Been Tested?
Checklist: