ContainerWait on remove: don't get stuck on rm fail #34999

Merged
merged 1 commit into moby:master from kolyshkin:wait-on-rm on Oct 29, 2017

Conversation

@kolyshkin
Contributor

kolyshkin commented Sep 27, 2017

Currently, if a container removal has failed for some reason, any client waiting for removal (e.g. docker run --rm) remains stuck, waiting for a removal that has already failed. For more details and a reproducer, please check #34945

This commit addresses that by allowing ContainerWait() with container.WaitCondition == "removed" argument to return an error in case of removal failure. The ContainerWaitOKBody response returned to a client is amended with an error message, and the Client.ContainerWait() is modified to return the error, if any, to the client.

Note that this feature is only available for API version >= 1.34. The old behavior (i.e. not returning from wait on a removal error) is kept for current and older clients, except for the added empty Error string field in the returned JSON, which, to the best of my knowledge, should not cause any compatibility problems.

Now, docker-cli would need a separate commit to bump the API to 1.34 and to show an error returned, if any. The current version of docker-cli will work as is.

  • [v2: recreate the waitRemove channel after closing]
  • [v3: document new api; keep legacy behavior for older clients]
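
For illustration, a minimal client-side sketch (not part of this PR) of how a Go program using the github.com/docker/docker/client package of this era could consume the new behavior; the check on res.Error is what this change enables, and the container name is hypothetical:

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/docker/docker/api/types/container"
    "github.com/docker/docker/client"
)

func main() {
    // NewEnvClient was the usual constructor at the time of this PR.
    cli, err := client.NewEnvClient()
    if err != nil {
        log.Fatal(err)
    }

    containerID := "my-container" // hypothetical container name/ID

    // Wait until the container is removed; the Error field in the response
    // requires API version >= 1.34 and a client library that includes the
    // changes from this PR.
    resultC, errC := cli.ContainerWait(context.Background(), containerID,
        container.WaitConditionRemoved)

    select {
    case err := <-errC:
        // Transport-level failure, e.g. the daemon closed the connection.
        log.Fatalf("error waiting for container: %v", err)
    case res := <-resultC:
        if res.Error != nil {
            // Removal failed; the daemon reported why.
            log.Fatalf("container removal failed: %s", res.Error.Message)
        }
        fmt.Printf("container removed, exit code %d\n", res.StatusCode)
    }
}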
@thaJeztah

Member

thaJeztah commented Sep 27, 2017

@tiborvass

Collaborator

tiborvass commented Sep 27, 2017

Lol, I just saw this PR, FYI I also opened #35001. Probably unrelated.

container/state.go (outdated)
// It sets an error and closes the internal waitRemove channel to unblock
// callers waiting for the container to be removed.
func (s *State) SetRemovalError(err error) {
    s.SetError(err)
    s.Lock()
    close(s.waitRemove) // Unblock those waiting on remove.

@jlhawn

jlhawn Sep 27, 2017

Contributor

A problem still exists where another API request could come along after this and try to remove the container again. If that succeeds and SetRemoved() is called, it will panic, because it would close this channel again after it has already been closed.

It should be okay to just create a new s.waitRemove channel after closing the original one.


@kolyshkin

kolyshkin Sep 27, 2017

Contributor

Sure. I have updated the patch to reopen the channel
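
A minimal sketch of the v2 approach (assuming, as in container/state.go, that State embeds a mutex and has a waitRemove chan struct{} field); the key point is recreating the channel under the lock so that a later, successful SetRemoved() does not close an already-closed channel:

// SetRemovalError is to be called in case a container remove failed.
// It sets an error and closes the internal waitRemove channel to unblock
// callers waiting for the container to be removed.
func (s *State) SetRemovalError(err error) {
    s.SetError(err)
    s.Lock()
    close(s.waitRemove) // Unblock those waiting on remove.
    // Recreate the channel so a subsequent removal attempt can close it
    // again without panicking.
    s.waitRemove = make(chan struct{})
    s.Unlock()
}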


@dnephin dnephin referenced this pull request Sep 28, 2017

Open

Multiple run of `docker run --rm` hangs in a host #115

2 of 3 tasks complete
@kolyshkin


Contributor

kolyshkin commented Sep 29, 2017

Lol, I just saw this PR, FYI I also opened #35001. Probably unrelated.

@tiborvass yeah, we both were that close to that lucky number, 350 x 💯

@vieux

Collaborator

vieux commented Oct 3, 2017

@jlhawn

jlhawn approved these changes Oct 4, 2017

@kolyshkin


Contributor

kolyshkin commented Oct 5, 2017

Note to self: need to test a combo of old daemon and new client (that has ContainerWaitOKBody.Error field).

Update: it works just fine (except, of course, the docker CLI is stuck, which is what this PR fixes). I probably checked this before but forgot.

@kolyshkin


Contributor

kolyshkin commented Oct 5, 2017

Anything else I can do to move this forward? To me it seems quite a critical issue.

@cpuguy83


Contributor

cpuguy83 commented Oct 5, 2017

I don’t know, I’m not sure this is right. It’s a breaking change for sure. The existing behavior is correct in that the container is not removed.

The specific client hang issue could be fixed by checking on the status after some interval (i.e., is it in a “Dead” state).

@kolyshkin


Contributor

kolyshkin commented Oct 6, 2017

I don’t know, I’m not sure this is right. It’s a breaking change for sure.

I thought about it, and tried all combinations of old/new client/daemon, and it works either the same or better. Can you think of any other client that could break, so I can check it? Do you mean that with this change, a client can no longer assume the container is removed? This can be fixed in the client (check the container status after getting the response, or just check for response.Error == "").

To me, the current API of not reporting a removal error is sort of useless/broken. If we are interested in the "container was removed" event, we are most probably also interested in a "container removal failure" event.

In case we want to be extra careful about the API, we can deprecate this one and add another one, which works like the one in this PR.

The specific client hang issue could be fixed by after some interval checking
on the status (ie, is it in a “Dead” state.

Surely, but this would look like a kludge, a workaround for a broken API. The current way is simple and elegant; polling for status is not.

@cpuguy83


Contributor

cpuguy83 commented Oct 11, 2017

The issue is more about the caller's expectations when the API returns.
We should at least make sure that older API versions keep the old behavior.

@kolyshkin


Contributor

kolyshkin commented Oct 12, 2017

We should at least make sure that older API versions keep the old behavior.

I have updated the patch to enable the old-style behavior for API version < 1.34. Tested with an unmodified docker-cli (which hangs on a removal error, as before) as well as with a patched client, which works as expected (prints an error).

@cpuguy83 PTAL
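
For illustration only, a tiny sketch (not the exact code of this PR) of how such a version gate can be expressed with the versions helper package; the variable name and the hard-coded client version are placeholders:

package main

import (
    "fmt"

    "github.com/docker/docker/api/types/versions"
)

func main() {
    // In the daemon, the client's negotiated API version comes from the
    // request context; it is hard-coded here for the sake of the example.
    clientVersion := "1.30"

    // Pre-1.34 clients don't know about the new Error field in the
    // ContainerWait response, so the old behavior is kept for them when
    // a removal fails.
    legacyRemovalWaitPre134 := versions.LessThan(clientVersion, "1.34")
    fmt.Println("keep legacy removal-wait behavior:", legacyRemovalWaitPre134)
}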

api/server/router/container/container_routes.go (outdated)
api/types/container/container_wait.go (outdated)
@kolyshkin


Contributor

kolyshkin commented Oct 12, 2017

@cpuguy83 I made changes based on your last review, PTAL once time permits.

@kolyshkin


Contributor

kolyshkin commented Oct 16, 2017

OK, I now have three working versions of this patch, all implementing slightly different behavior for older clients in case of removal failure. They are:

  1. Same as for new clients (i.e. receive an error which they will ignore as they are not aware).
  2. Same as before this patch (i.e. clients keep waiting forever).
  3. Same as when daemon exits (i.e. we close the connection), telling the client the wait operation has failed.

Once again, the above are just variations of how to handle old clients when removal has failed. New clients will receive an error as described in the commit message, and can handle it.

We need to choose one. Number 1 is surely an incompatibility in the API. Number 2 keeps the old (bad) API as is -- the problem is that clients are stuck waiting for something that might never happen. Number 3 is a somewhat ugly way to report an error while keeping the API compatible.

Please share your opinion. As this fixes the client hang, I'd love it to move forward.

@jlhawn


Contributor

jlhawn commented Oct 16, 2017

I personally like option 3 the best because it at least forces an error on the client, but I-Am-Not-A-Maintainer.

@cpuguy83


Contributor

cpuguy83 commented Oct 17, 2017

Did we test that the docker CLI would return a non-zero exit status with option 3?

@kolyshkin

Contributor

kolyshkin commented Oct 18, 2017

Did we test that the docker CLI would return a non-zero exit status with option 3?

I tested all three variations:

1. docker-cli from 17.05-ce (client API 1.29 == using legacy wait)

It does not exit with non-zero when the engine closes the connection. I tested this just by stopping the daemon while having a few docker run processes stuck in wait, and got this:

ERRO[0043] error getting events from daemon: unexpected EOF 
ERRO[0043] error getting events from daemon: unexpected EOF 
ERRO[0043] error getting events from daemon: unexpected EOF 
EXIT CODE 0
EXIT CODE 0
EXIT CODE 0

So, we have a bug in the legacy wait code, but it's not the subject of this PR (although it made me think about looking into the engine's legacy wait implementation, as it suffers from the same issue and I didn't realize it).

2. docker-cli from 17.06-ce (client API 1.30)

Got this:

ERRO[0010] error waiting for container: EOF             
EXIT 125

IMHO it's pretty good (and way better than just being stuck).

3. docker-cli with my patch (API 1.34)

ERRO[0004] Error waiting for container: driver "overlay" failed to remove root filesystem for 5d8a7e9ae4158a3dfce7bed45e9ad1e39fedd8f43f8c94307eed39ea1fbd696b: remove /var/lib/docker/overlay/cb3f2f712b65531bec29d82aa53cf100c01aa138c35335d73539e61dd6397dae/merged: device or resource busy 
EXIT 125
@kolyshkin


Contributor

kolyshkin commented Oct 18, 2017

TL;DR: option number 3 works as intended; we also need to look at legacyWait (but that's a separate issue).

@jlhawn

Contributor

jlhawn commented Oct 18, 2017

@cpuguy83


Contributor

cpuguy83 commented Oct 18, 2017

Three works for me then.

@kolyshkin


Contributor

kolyshkin commented Oct 18, 2017

OK, I have updated the patch to use option 3.

@thaJeztah can you PTAL if API version and docs are fine?

@kolyshkin


Contributor

kolyshkin commented Oct 25, 2017

So, it seems we have a consensus and I am not aware of anything that prevents this from being merged.

@cpuguy83 I made changes based on your last review two weeks ago, PTAL once time permits.

ContainerWait on remove: don't get stuck on rm fail
Currently, if a container removal has failed for some reason,
any client waiting for removal (e.g. `docker run --rm`) is
stuck, waiting for removal to succeed while it has failed already.
For more details and the reproducer, please check
#34945

This commit addresses that by allowing `ContainerWait()` with
`container.WaitCondition == "removed"` argument to return an
error in case of removal failure. The `ContainerWaitOKBody`
structure returned to a client is amended with a pointer to `struct Error`,
containing an error message string, and the `Client.ContainerWait()`
is modified to return the error, if any, to the client.

Note that this feature is only available for API version >= 1.34.
In order for old clients not to get stuck, we just close the connection
without writing anything -- this causes an error on the client side.

Now, docker-cli would need a separate commit to bump the API to 1.34
and to show an error returned, if any.

[v2: recreate the waitRemove channel after closing]
[v3: document; keep legacy behavior for older clients]
[v4: convert Error from string to pointer to a struct]
[v5: don't emulate old behavior, send empty response in error case]
[v6: rename legacy* vars to include version suffix]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
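
For reference, the response shape described above corresponds roughly to the following Go types in api/types/container (a sketch; field names match the description in this PR, struct tags are approximate):

// ContainerWaitOKBodyError holds details of an error that occurred while
// waiting, e.g. a failed container removal.
type ContainerWaitOKBodyError struct {
    // Message contains details of the error.
    Message string `json:"Message,omitempty"`
}

// ContainerWaitOKBody is the OK response to the ContainerWait operation.
type ContainerWaitOKBody struct {
    // Error is set if waiting on the "removed" condition failed.
    Error *ContainerWaitOKBodyError `json:"Error"`

    // StatusCode is the exit code of the container.
    StatusCode int64 `json:"StatusCode"`
}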
@cpuguy83

LGTM

@jlhawn

jlhawn approved these changes Oct 25, 2017

My last review is pretty old, but still LGTM

properties:
  Message:
    description: "Details of an error"
    type: "string"

@thaJeztah

thaJeztah Oct 26, 2017

Member


@thaJeztah

thaJeztah Oct 26, 2017

Member

Hm, guess we can't because of the existing StatusCode in the response 😞

@kolyshkin

kolyshkin Oct 26, 2017

Contributor

AFAIR I tried using ErrorDetail and I could not, as it creates a circular package dependency 😱


@thaJeztah

LGTM, thanks!

@thaJeztah thaJeztah removed this from backlog in maintainers-session Oct 26, 2017

@yongtang yongtang merged commit 220d6c4 into moby:master Oct 29, 2017

6 checks passed

dco-signed All commits are signed
experimental Jenkins build Docker-PRs-experimental 37506 has succeeded
janky Jenkins build Docker-PRs 46198 has succeeded
powerpc Jenkins build Docker-PRs-powerpc 6655 has succeeded
windowsRS1 Jenkins build Docker-PRs-WoW-RS1 17779 has succeeded
z Jenkins build Docker-PRs-s390x 6396 has succeeded

kolyshkin added a commit to kolyshkin/cli that referenced this pull request Dec 15, 2017

Show container wait error
If container wait has failed, show an error from the engine
and return an appropriate exit code.

This requires engine changes from moby/moby#34999

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>

@kolyshkin kolyshkin deleted the kolyshkin:wait-on-rm branch Dec 15, 2017

docker-jenkins pushed a commit to docker/docker-ce that referenced this pull request Jan 17, 2018

Show container wait error
If container wait has failed, show an error from the engine
and return an appropriate exit code.

This requires engine changes from moby/moby#34999

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Upstream-commit: 8471742a8ae3887482d5e8b35d808e045d043ab8
Component: cli