New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
daemon: add grpc.WithBlock option #40137
daemon: add grpc.WithBlock option #40137
Conversation
WithBlock makes sure that the following containerd request is reliable. In one edge case with high load pressure, kernel kills dockerd, containerd and containerd-shims caused by OOM. When both dockerd and containerd restart, but containerd will take time to recover all the existing containers. Before containerd serving, dockerd will failed with gRPC error. That bad thing is that restore action will still ignore the any non-NotFound errors and returns running state for already stopped container. It is unexpected behavior. And we need to restart dockerd to make sure that anything is OK. It is painful. Add WithBlock can prevent the edge case. And n common case, the containerd will be serving in shortly. It is not harm to add WithBlock for containerd connection. Signed-off-by: Wei Fu <fuweid89@gmail.com>
ping @thaJeztah and @AkihiroSuda |
Isn't this just Docker mishandling the errors? My understanding is edit: my understanding may have been based on a much older version of grpc, I cannot prove this based on the currently vendored version. |
@dmcgowan For now,
Basically, Yes. Checkout the following code and it seems that dockerd need to introduce new state for container if failed to restore the running container. But if we add // daemon/daemon.go
alive, _, process, err = daemon.containerd.Restore(context.Background(), c.ID, c.InitializeStdio)
if err != nil && !errdefs.IsNotFound(err) {
logrus.Errorf("Failed to restore container %s with containerd: %s", c.ID, err)
return
} |
any update? |
@dmcgowan PTAL |
@cpuguy83 PTAL? |
any updates? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks for your patience!
Added cherry-pick since this seems like something we'd want on 19.03 @fuweid Did you want to open a backport? |
WithBlock makes sure that the following containerd request is reliable.
In one edge case with high load pressure, kernel kills dockerd, containerd
and containerd-shims caused by OOM. When both dockerd and containerd
restart, but containerd will take time to recover all the existing
containers. Before containerd serving, dockerd will failed with gRPC
error. That bad thing is that restore action will still ignore the
any non-NotFound errors and returns running state for
already stopped container. It is unexpected behavior. And
we need to restart dockerd to make sure that anything is OK.
It is painful. Add WithBlock can prevent the edge case. And
n common case, the containerd will be serving in shortly.
It is not harm to add WithBlock for containerd connection.
Signed-off-by: Wei Fu fuweid89@gmail.com