Bump hcsshim to get some fixes. #42270

Merged
merged 2 commits into from Apr 20, 2021


cpuguy83
Member

@cpuguy83 cpuguy83 commented Apr 7, 2021

closes #42269

This also requires bumping winio.

This addresses some customer issues for us.

This also requires bumping winio.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
@tianon
Member

tianon commented Apr 7, 2021

Heh, looks like you lost a small race with @katiewasnothere / #42269, although you've also updated go-winio here (but that's the only difference I see).

Edit: and the link to microsoft/hcsshim@e811ee7 (microsoft/hcsshim#991) over there, which I gather is the specific commit that fixes the customer issues?

Member

@thaJeztah thaJeztah left a comment

LGTM once CI is running again; Windows CI is currently failing (it is being worked on)

@thaJeztah
Member

@cpuguy83 does this need backporting? (if so, also needs #41689)

@olljanat
Contributor

olljanat commented Apr 8, 2021

> This addresses some customer issues for us.

@cpuguy83 @katiewasnothere any possibility to open up about which kind of issue(s) are expected to be fixed by this?

@cpuguy83
Member Author

cpuguy83 commented Apr 8, 2021

@olljanat Specific case is too much logging in errors.

@thaJeztah
Member

Ok this looks like a regression / change in behavior:

=== RUN   TestDockerSuite/TestExecAPIStartInvalidCommand
    --- FAIL: TestDockerSuite/TestExecAPIStartInvalidCommand (1.97s)
        docker_api_exec_test.go:270: assertion failed: 500 (resp.StatusCode int) != 400 (code int): response body: {"message":"container 163342b8273daefbf14ee3cae8f02bdc53ca085e76684c15c0c64bccc07b7fcb encountered an error during hcsshim::System::CreateProcess: failure in a Windows system call: The system cannot find the file specified. (0x2)"}

Does hcsshim return a NotFound error, and do we translate that to HTTP 400?
I think 400 is the correct code for this case, so we should update the test to
check for 400 or 500 errors

But we should update our API docs to reflect that a 400 error can be returned;
it currently only shows 404 or 500, no other errors.

[Image: Screenshot 2021-04-09 at 14 36 41]

@thaJeztah
Member

Hmm... funny, so looking at the test, older API versions could return a 404 in the test (that definitely was a bug: "command not found" !== "container not found");

if versions.LessThan(testEnv.DaemonAPIVersion(), "1.32") {
	startExec(c, id, http.StatusNotFound)
} else {
	startExec(c, id, http.StatusBadRequest)
}
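
To make the version gate above concrete, here is a minimal, self-contained sketch of the same decision. It is not the test's actual code: `lessThan` is a hypothetical stand-in for `versions.LessThan`, and `expectedStartStatus` is a name made up for this illustration. It only assumes that API versions are dot-separated numbers.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"strings"
)

// lessThan reports whether version a is lower than version b, comparing
// dot-separated numeric components one by one (so "1.9" < "1.32").
// Hypothetical stand-in for versions.LessThan; non-numeric or missing
// components are treated as 0.
func lessThan(a, b string) bool {
	as, bs := strings.Split(a, "."), strings.Split(b, ".")
	for i := 0; i < len(as) || i < len(bs); i++ {
		var x, y int
		if i < len(as) {
			x, _ = strconv.Atoi(as[i])
		}
		if i < len(bs) {
			y, _ = strconv.Atoi(bs[i])
		}
		if x != y {
			return x < y
		}
	}
	return false
}

// expectedStartStatus mirrors the branch in the test: API versions before
// 1.32 are expected to answer 404, later ones 400.
func expectedStartStatus(apiVersion string) int {
	if lessThan(apiVersion, "1.32") {
		return http.StatusNotFound
	}
	return http.StatusBadRequest
}

func main() {
	for _, v := range []string{"1.31", "1.32", "1.41"} {
		fmt.Println(v, expectedStartStatus(v))
	}
}
```

Comparing component by component matters here: a plain string comparison would order "1.9" after "1.32".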

@thaJeztah
Member

Actually, I'm now wondering whether 400 is correct; the request itself for this case (start a container) is valid, but the container configuration is invalid, so 🤷‍♂️ depends on how you look at it I guess.

@thaJeztah
Member

I suspect the problem is here;

moby/daemon/errors.go

Lines 141 to 147 in 7b9275c

// if we receive an internal error from the initial start of a container then lets
// return it instead of entering the restart loop
// set to 127 for container cmd not found/does not exist)
if contains(errDesc, cmd) &&
	(contains(errDesc, "executable file not found") ||
		contains(errDesc, "no such file or directory") ||
		contains(errDesc, "system cannot find the file specified") ||

Looking at the error message returned (wrapped for readability);

container 163342b8273daefbf14ee3cae8f02bdc53ca085e76684c15c0c64bccc07b7fcb
encountered an error during hcsshim::System::CreateProcess:
failure in a Windows system call:
The system cannot find the file specified. (0x2)

✅ The string does contain system cannot find the file specified, which we match in that function:

contains(errDesc, "system cannot find the file specified") ||

❌ The string does not contain cmd (invalid), and because of that, contains(errDesc, cmd) won't match;
if contains(errDesc, cmd) &&

Erm... so actually that second point is not true (for some reason); it does match. It looks for the entrypoint (perhaps that's an empty string, which would result in "match anything"):

return translateContainerdStartErr(ec.Entrypoint, ec.SetExitCode, err)

Possibly we can find what it has set for Entrypoint and Args, because we trigger an event:

daemon.LogContainerEventWithAttributes(c, "exec_start: "+ec.Entrypoint+" "+strings.Join(ec.Args, " "), attributes)

And when we do match, we return a startInvalidConfigError(errDesc) error, which is an errdefs.InvalidParameter() ( -> 400).

retErr = startInvalidConfigError(errDesc)

So was it previously broken? Is it broken on Linux (if we return a 500 there)?
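
The two-part match discussed in this comment can be sketched in isolation. This is a simplified illustration, not the code in moby/daemon/errors.go: `matchesCmdNotFound` is a hypothetical name, the `contains` helper lowercasing its first argument is an assumption, and `<id>` replaces the container ID. It shows why an empty command string makes the first condition trivially true: in Go, `strings.Contains(s, "")` returns true for every `s`.

```go
package main

import (
	"fmt"
	"strings"
)

// contains reports whether substr occurs in s, ignoring the case of s.
// Assumption for this sketch: substr is already lowercase.
func contains(s, substr string) bool {
	return strings.Contains(strings.ToLower(s), substr)
}

// matchesCmdNotFound is a hypothetical, reduced version of the quoted check:
// the error text must mention the command AND one of the known
// "not found" fragments.
func matchesCmdNotFound(errDesc, cmd string) bool {
	return contains(errDesc, cmd) &&
		(contains(errDesc, "executable file not found") ||
			contains(errDesc, "no such file or directory") ||
			contains(errDesc, "system cannot find the file specified"))
}

func main() {
	errDesc := "container <id> encountered an error during " +
		"hcsshim::System::CreateProcess: failure in a Windows system call: " +
		"The system cannot find the file specified. (0x2)"

	// The command name is not part of the message, so the check fails.
	fmt.Println(matchesCmdNotFound(errDesc, "invalid")) // false

	// An empty command matches any message, so only the fragment check counts.
	fmt.Println(matchesCmdNotFound(errDesc, "")) // true
}
```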

@thaJeztah
Member

@cpuguy83 could you have a peek at the failures? (I tried finding what's causing it in my comment above)

@cpuguy83
Member Author

It seems this is specifically because we took out the extra error details, so the error check no longer matches.

Whether or not the command path is in the error message is an
implementation detail.
For example, on Windows the only reason this ever matched was because it
dumped the entire container config into the error message, but this had
nothing to do with the actual error.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
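
One way to read this commit message is that the classification should depend only on the known "not found" fragments and not on the command path appearing in the text. The following is a minimal sketch under that assumption, not the actual change in this PR: `isCmdNotFound`, `statusFor`, and `notFoundFragments` are hypothetical names, and mapping every other error to 500 is a simplification.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// notFoundFragments lists the message fragments quoted earlier in this thread.
var notFoundFragments = []string{
	"executable file not found",
	"no such file or directory",
	"system cannot find the file specified",
}

// isCmdNotFound decides from the message alone, without requiring the
// command path to appear in it (hypothetical helper for illustration).
func isCmdNotFound(errDesc string) bool {
	lower := strings.ToLower(errDesc)
	for _, frag := range notFoundFragments {
		if strings.Contains(lower, frag) {
			return true
		}
	}
	return false
}

// statusFor maps a start error to the HTTP status discussed in the thread:
// 400 for an invalid command, 500 for anything else (simplified).
func statusFor(errDesc string) int {
	if isCmdNotFound(errDesc) {
		return http.StatusBadRequest
	}
	return http.StatusInternalServerError
}

func main() {
	fmt.Println(statusFor("failure in a Windows system call: The system cannot find the file specified. (0x2)"))
	fmt.Println(statusFor("hcsshim::PrepareLayer - failed failed in Win32: The device is not ready. (0x15)."))
}
```

Matching on message text remains fragile either way; the sketch only removes the dependency on the runtime echoing the container config back in the error.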
@thaJeztah
Member

Wait.. why is this failing? Broken package? https://ci-next.docker.com/public/blue/organizations/jenkins/moby/detail/PR-42270/3/pipeline/224

[2021-04-14T23:12:37.016Z] #22 43.99       copying Cython/Utility/CppSupport.cpp -> build/lib.linux-aarch64-3.7/Cython/Utility
[2021-04-14T23:12:37.016Z] #22 43.99       running build_ext
[2021-04-14T23:12:37.016Z] #22 43.99       building 'Cython.Plex.Scanners' extension
[2021-04-14T23:12:37.016Z] #22 43.99       creating build/temp.linux-aarch64-3.7
[2021-04-14T23:12:37.016Z] #22 43.99       creating build/temp.linux-aarch64-3.7/tmp
[2021-04-14T23:12:37.016Z] #22 43.99       creating build/temp.linux-aarch64-3.7/tmp/pip-install-jasgbmp7
[2021-04-14T23:12:37.016Z] #22 43.99       creating build/temp.linux-aarch64-3.7/tmp/pip-install-jasgbmp7/Cython
[2021-04-14T23:12:37.016Z] #22 43.99       creating build/temp.linux-aarch64-3.7/tmp/pip-install-jasgbmp7/Cython/Cython
[2021-04-14T23:12:37.016Z] #22 43.99       creating build/temp.linux-aarch64-3.7/tmp/pip-install-jasgbmp7/Cython/Cython/Plex
[2021-04-14T23:12:37.016Z] #22 43.99       aarch64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.7m -c /tmp/pip-install-jasgbmp7/Cython/Cython/Plex/Scanners.c -o build/temp.linux-aarch64-3.7/tmp/pip-install-jasgbmp7/Cython/Cython/Plex/Scanners.o
[2021-04-14T23:12:37.016Z] #22 43.99       /tmp/pip-install-jasgbmp7/Cython/Cython/Plex/Scanners.c:21:10: fatal error: Python.h: No such file or directory
[2021-04-14T23:12:37.016Z] #22 43.99        #include "Python.h"
[2021-04-14T23:12:37.016Z] #22 43.99                 ^~~~~~~~~~
[2021-04-14T23:12:37.016Z] #22 43.99       compilation terminated.
[2021-04-14T23:12:37.016Z] #22 43.99       error: command 'aarch64-linux-gnu-gcc' failed with exit status 1
[2021-04-14T23:12:37.016Z] #22 43.99
[2021-04-14T23:12:37.016Z] #22 43.99       ----------------------------------------
[2021-04-14T23:12:37.016Z] #22 43.99   Command "/usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-jasgbmp7/Cython/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-if5qclwe/install-record.txt --single-version-externally-managed --prefix /tmp/pip-build-env-_dtiuyfw --compile" failed with error code 1 in /tmp/pip-install-jasgbmp7/Cython/
[2021-04-14T23:12:37.016Z] #22 43.99
[2021-04-14T23:12:37.016Z] #22 43.99   ----------------------------------------
[2021-04-14T23:12:37.016Z] #22 44.08 Command "/usr/bin/python3 -m pip install --ignore-installed --no-user --prefix /tmp/pip-build-env-_dtiuyfw --no-warn-script-location --no-binary :none: --only-binary :none: -i https://pypi.org/simple -- setuptools wheel Cython" failed with error code 1 in None
[2021-04-14T23:12:37.592Z] #22 ERROR: executor failed running [/bin/sh -c pip3 install yamllint==1.16.0]: exit code: 1

Let me restart

@cpuguy83
Member Author

This one I've seen a couple of times recently, but it seems like a newly introduced problem (just not on this PR, I think?)

--- FAIL: TestContainerKillOnDaemonStart (14.48s)
    daemon_test.go:46: [dab2d17437a7d] failed to start daemon with arguments [--data-root /go/src/github.com/docker/docker/bundles/test-integration/TestContainerKillOnDaemonStart/dab2d17437a7d/root --exec-root /tmp/dxr/dab2d17437a7d --pidfile /go/src/github.com/docker/docker/bundles/test-integration/TestContainerKillOnDaemonStart/dab2d17437a7d/docker.pid --userland-proxy=true --containerd-namespace dab2d17437a7d --containerd-plugins-namespace dab2d17437a7dp --containerd /var/run/docker/containerd/containerd.sock --host unix:///tmp/docker-integration/dab2d17437a7d.sock --debug --storage-driver overlay2] : [dab2d17437a7d] daemon exited during startup: exit status 1
    daemon_test.go:38: assertion failed: error is not nil: Cannot connect to the Docker daemon at unix:///tmp/docker-integration/dab2d17437a7d.sock. Is the docker daemon running?

This one looks legit, I don't think I've seen that.

=== RUN   TestDockerSuite/TestRunTwoConcurrentContainers
    --- FAIL: TestDockerSuite/TestRunTwoConcurrentContainers (64.13s)
        docker_cli_run_test.go:811: assertion failed: error is not nil: 
            Command:  d:\CI\PR-42270\4\binary\docker.exe run busybox sleep 2
            ExitCode: 125
            Error:    exit status 125
            Stdout:   
            Stderr:   d:\CI\PR-42270\4\binary\docker.exe: Error response from daemon: hcsshim::PrepareLayer - failed failed in Win32: The device is not ready. (0x15).
            
            
            Failures:
            ExitCode was 125 expected 0
            Expected no error

@cpuguy83
Member Author

TestContainerKillOnDaemonStart looks like a libnetwork issue:

time="2021-04-15T13:05:15.796975427Z" level=debug msg="Cleaning up old mountid : done."
failed to start daemon: Error initializing network controller: error obtaining controller instance: unable to add return rule in DOCKER-ISOLATION-STAGE-2 chain:  (iptables failed: iptables --wait -A DOCKER-ISOLATION-STAGE-2 -j RETURN: iptables: No chain/target/match by that name.
 (exit status 1))

Member

@thaJeztah thaJeztah left a comment

LGTM

@thaJeztah
Member

@tianon - good to go?

Member

@tianon tianon left a comment

LGTM 👍

@tianon tianon merged commit 72fef53 into moby:master Apr 20, 2021
4 participants