
daemon: add grpc.WithBlock option #40137

Merged

Conversation

@fuweid (Contributor) commented Oct 25, 2019

WithBlock makes sure that subsequent containerd requests are reliable.

In one edge case under high load pressure, the kernel kills dockerd, containerd, and the containerd-shims because of OOM. dockerd and containerd both restart, but containerd takes time to recover all of the existing containers. Until containerd is serving again, dockerd's requests fail with gRPC errors. Worse, the restore action ignores any non-NotFound error and reports a running state for containers that have already stopped. That is unexpected behavior, and dockerd has to be restarted to get everything back into a consistent state.

It is painful. Adding WithBlock prevents this edge case, and in the common case containerd will be serving again shortly, so there is no harm in adding WithBlock to the containerd connection.

Signed-off-by: Wei Fu <fuweid89@gmail.com>

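For context, here is a minimal sketch of what a blocking gRPC dial looks like in Go (the address and timeout are illustrative placeholders, not the values dockerd uses): with grpc.WithBlock, DialContext waits until the connection is READY or the context expires, instead of returning immediately and connecting lazily in the background.

// blocking_dial.go (illustrative sketch, not the actual moby patch)
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
)

func main() {
	// Give containerd a bounded window to come up after a restart.
	// The timeout and address below are placeholders.
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()

	// Without WithBlock, DialContext returns at once and the first RPC
	// may fail while containerd is still recovering. With WithBlock,
	// the dial itself waits until the connection becomes READY.
	conn, err := grpc.DialContext(ctx, "127.0.0.1:10010",
		grpc.WithInsecure(),
		grpc.WithBlock(),
	)
	if err != nil {
		log.Fatalf("containerd did not become ready in time: %v", err)
	}
	defer conn.Close()
	log.Println("connection ready")
}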
@fuweid (Contributor, Author) commented Oct 25, 2019

ping @thaJeztah and @AkihiroSuda

@thaJeztah (Member) commented

@dmcgowan @cpuguy83 ptal 🤗

@dmcgowan (Member) commented Oct 28, 2019

Until containerd is serving again, dockerd's requests fail with gRPC errors. Worse, the restore action ignores any non-NotFound error and reports a running state for containers that have already stopped.

Isn't this just Docker mishandling the errors? My understanding is WithBlock was actually changing the underlying connection to not use a pool and wasn't handling reconnects after that.

Edit: my understanding may have been based on a much older version of grpc; I cannot confirm this against the currently vendored version.

@fuweid (Contributor, Author) commented Oct 29, 2019

@dmcgowan For now, WithBlock just blocks the dial call until the connection is ready, and it does not affect subsequent reconnects. I tested locally: dockerd reconnects if containerd goes down temporarily.

Isn't this just Docker mishandling the errors?

Basically, yes. Check out the following code: it seems dockerd needs to introduce a new state for a container when it fails to restore a running one. But adding WithBlock reduces the unexpected behaviors caused by connection errors.

// daemon/daemon.go
// Any restore error other than NotFound is only logged and the container
// is skipped, so a transient gRPC connection error leaves stale state.
alive, _, process, err = daemon.containerd.Restore(context.Background(), c.ID, c.InitializeStdio)
if err != nil && !errdefs.IsNotFound(err) {
	logrus.Errorf("Failed to restore container %s with containerd: %s", c.ID, err)
	return
}
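To make the reconnect point above concrete, here is a small hypothetical probe (not part of this PR; watchConn is an invented helper name): after a blocking dial, the *grpc.ClientConn keeps managing reconnects itself, which can be observed through its connectivity states.

// connectivity_probe.go (hypothetical sketch, not from this PR)
package main

import (
	"context"
	"log"

	"google.golang.org/grpc"
)

// watchConn logs every connectivity state change until ctx is done.
// Seeing READY -> TRANSIENT_FAILURE -> CONNECTING -> READY across a
// containerd restart shows that WithBlock did not disable the
// ClientConn's automatic reconnects.
func watchConn(ctx context.Context, conn *grpc.ClientConn) {
	for {
		state := conn.GetState()
		log.Printf("containerd connection state: %v", state)
		if !conn.WaitForStateChange(ctx, state) {
			return // ctx canceled or timed out
		}
	}
}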

@fuweid (Contributor, Author) commented Nov 4, 2019

any update?

@thaJeztah (Member) commented

@dmcgowan PTAL

@AkihiroSuda (Member) commented

@cpuguy83 PTAL?

@fuweid (Contributor, Author) commented Feb 20, 2020

any updates?

@cpuguy83 (Member) left a review comment

LGTM

Thanks for your patience!

@cpuguy83 merged commit 62bd5a3 into moby:master on Feb 21, 2020
@cpuguy83 (Member) commented

Added cherry-pick since this seems like something we'd want on 19.03

@fuweid Did you want to open a backport?

@fuweid deleted the me-wait-for-remote-containerd-before-reload branch on February 22, 2020 at 06:09
@fuweid (Contributor, Author) commented Feb 22, 2020

Added cherry-pick since this seems like something we'd want on 19.03

@fuweid Did you want to open a backport?

@cpuguy83 Will do that! Thanks for the review.
