
vsock crash #50

Closed
tailhook opened this issue Aug 4, 2016 · 24 comments
Comments

@tailhook

tailhook commented Aug 4, 2016

Hi,

We have an annoying crash of Docker for Mac which, AFAIU, boils down to an invalid ringbuf check here.

Log:

Aug  4 19:28:49 estus Docker[42833] <Error>: Socket.TCPV4.read connected TCP: caught Uwt.Uwt_error(Uwt.ECONNRESET, "uwt_read", "") returning Eof
Aug  4 19:28:49 estus Docker[42833] <Error>: Socket.TCPV4.write connected TCP: caught Uwt.Uwt_error(Uwt.EPIPE, "write", "") returning Eof
Aug  4 19:28:49 estus Docker[42833] <Error>: Socket.Stream: caught Uwt.Uwt_error(Uwt.ENOTCONN, "shutdown", "")
Aug  4 19:28:49 estus Docker[42833] <Error>: Socket.TCPV4.write connected TCP: caught Uwt.Uwt_error(Uwt.EPIPE, "write", "") returning Eof
Aug  4 19:28:49 estus Docker[42833] <Error>: Socket.Stream: caught Uwt.Uwt_error(Uwt.ENOTCONN, "shutdown", "")
Aug  4 19:28:55 estus Docker[42833] <Error>: Socket.TCPV4.write connected TCP: caught Uwt.Uwt_error(Uwt.EPIPE, "write", "") returning Eof
Aug  4 19:28:55 estus Docker[42833] <Error>: Socket.Stream: caught Uwt.Uwt_error(Uwt.ENOTCONN, "shutdown", "")
Aug  4 19:28:55 estus Docker[42833] <Error>: Socket.TCPV4.write connected TCP: caught Uwt.Uwt_error(Uwt.EPIPE, "write", "") returning Eof
Aug  4 19:28:55 estus Docker[42833] <Error>: Socket.Stream: caught Uwt.Uwt_error(Uwt.ENOTCONN, "shutdown", "")
Aug  4 19:28:55 estus Docker[42832] <Error>: Fatal unexpected exception: Socket.Closed
Aug  4 19:28:55 estus Docker[42834] <Notice>: virtio-net-vpnkit: initialising, opts="uuid=01b794ee-68a9-4b92-9c79-b2b44e5f271e,path=/Users/free2use/Library/Containers/com.docker.docker/Data/s50,macfile=/Users/free2use/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/mac.0"
Aug  4 19:28:55 estus Docker[42834] <Notice>: Error reading vsyslog length: EOF
Aug  4 19:28:55 estus Docker[42834] <Notice>: Assertion failed: (sc->reply_cons != sc->reply_prod), function send_response_common, file src/pci_virtio_sock.c, line 879.

As far as I can see the error is here:

    if (sc->reply_prod == VTSOCK_REPLYRINGSZ)
        sc->reply_prod = 0;
    /* check for reply ring overflow */
    /* XXX correct check? */
    DPRINTF(("TX: QUEUING REPLY IN SLOT %x (prod %x, cons %x)\n",
         slot, sc->reply_prod, sc->reply_cons));
    assert(sc->reply_cons != sc->reply_prod);

I.e. someone should address that "XXX correct check?" comment :)
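
For context, the standard "keep one slot free" ring-index discipline that the "XXX correct check?" comment is presumably worrying about looks roughly like the sketch below. This is purely illustrative: the names and size are made up, and it is not the hyperkit code.

    #include <stdbool.h>
    #include <stdint.h>

    #define RINGSZ 8u                 /* illustrative size, not VTSOCK_REPLYRINGSZ */

    struct ring {
        uint32_t prod;                /* next slot the producer will fill */
        uint32_t cons;                /* next slot the consumer will drain */
        int slots[RINGSZ];
    };

    /* Empty when the indices coincide. */
    static bool ring_empty(const struct ring *r)
    {
        return r->prod == r->cons;
    }

    /* Full when advancing prod would make it collide with cons; one slot is
     * deliberately left unused so that "full" and "empty" stay distinguishable. */
    static bool ring_full(const struct ring *r)
    {
        return (r->prod + 1) % RINGSZ == r->cons;
    }

    /* Refuses to overwrite unconsumed data instead of asserting. */
    static bool ring_push(struct ring *r, int v)
    {
        if (ring_full(r))
            return false;
        r->slots[r->prod] = v;
        r->prod = (r->prod + 1) % RINGSZ;
        return true;
    }

Judging by the quoted snippet, the hyperkit check runs after the producer index has already been advanced, so reply_prod == reply_cons at that point would mean the producer has caught up with the consumer, i.e. the ring is full, which is what the assert guards against.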

Alternatively, if it's hard to fix, could we make VTSOCK_REPLYRINGSZ bigger (if that would at least make the error rarer)?

@ijc
Collaborator

ijc commented Aug 8, 2016

Hi @tailhook & thanks for your report. Sorry for the delay replying, I've been AFK travelling.

The assertion is checking whether the reply ring is full. I'm reasonably confident the check is correct, although as the comment indicates a second pair of eyes wouldn't go amiss ;-). However, this is slightly moot in light of seeing the assertion actually fire: even if the check were wrong it would only be off by one, which would mean either that the ring had but a single slot left or that it had overflowed by one slot (the latter is obviously worse, but as I'll explain it is still a bit moot).

The issue is that if the reply ring has filled up (+/- 1 if the check is wrong, but one slot either way isn't going to save us or change what follows) then, due to a protocol bug in the vsock protocol spec, there is actually nothing we can do other than crash or deadlock. The assert turns the deadlock into a crash, which is obviously not ideal either, but better than some other component silently locking up. There is a description of the deadlock in 1ea0301, which is where the reply ring was introduced (previously the deadlock was much more obvious; there are also some subsequent follow-up fixes but that is the main one). This all stems from the assumption in the design that vsock is reliable and therefore cannot drop anything, while also not including any back-pressure mechanism, while also including cross-talk between the rx and tx rings (i.e. a tx request can result in a requirement to have a free rx slot).
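
To make the "crash or deadlock" trade-off concrete, here is a hedged sketch of the only options a backend has when its bounded reply ring fills up under these constraints. It reuses the toy struct ring from the earlier sketch, and none of the names are the real hyperkit API.

    #include <assert.h>

    enum full_policy { POLICY_ASSERT, POLICY_DROP, POLICY_BLOCK };

    /* Sketch only: what queueing a reply could do when the ring is full. */
    static int queue_reply(struct ring *r, int reply, enum full_policy p)
    {
        if (ring_push(r, reply))
            return 0;                     /* common case: a slot was free */

        switch (p) {
        case POLICY_ASSERT:
            /* What the code does today: turn the latent deadlock into a loud,
             * debuggable crash. */
            assert(!"reply ring overflow");
            return -1;
        case POLICY_DROP:
            /* Not an option: vsock is specified as reliable, so nothing may be
             * silently dropped. */
            return -1;
        case POLICY_BLOCK:
            /* Waiting for the guest to drain replies can deadlock, because the
             * guest may itself be blocked on this very TX path. */
            while (!ring_push(r, reply))
                ;                         /* spins forever in the bad case */
            return 0;
        }
        return -1;
    }

None of the branches is acceptable in the long run, which is why the fix ultimately has to come from the protocol spec itself.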

I raised this deadlock upstream a while back and had a discussion with the guys doing the protocol specification (as well as the Linux frontend implementation), and they were going to look at remedying it in the next version. I see in my inbox this morning that I have a new set of frontend patches (RFC v6) and a new version of the specification proposal (v7; NB the spec proposal and the frontend implementation are independently numbered, but v6 of the patches corresponds to v7 of the spec) with a changelog which specifically mentions this issue, so hopefully something has changed (or been more tightly specified) which will resolve this issue at the root.

That's going to take a little time to digest and to make the necessary updates on both ends. In the meantime, increasing VTSOCK_REPLYRINGSZ is not a bad idea; as you can see in 1ea0301 I already doubled it once to 2x the ring size, and going to 4x can't hurt.
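
(For reference, the tweak being described is roughly a one-line constant bump along the lines of the sketch below; only VTSOCK_REPLYRINGSZ itself is confirmed by the logs above, and the name of the underlying ring-size constant is an assumption.)

    /* Hypothetical sketch of the bump discussed above, in src/pci_virtio_sock.c;
     * VTSOCK_QSIZE stands in for whatever the real virtio ring-size constant
     * is called. */
    #define VTSOCK_REPLYRINGSZ (4 * VTSOCK_QSIZE)    /* was 2 * VTSOCK_QSIZE */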

If you could give details of your repro case (ideally in the form of a Dockerfile or docker-compose.yml and a simple docker build/run or docker-compose up) that would be handy, since your workload seems to be even better at provoking this behaviour than the ones I've found previously; to be hitting the assert it must be provoking many outstanding vsock tx requests compared with the ring size.

@tailhook
Author

tailhook commented Aug 8, 2016

If you could give details of your repro case

It's hard to do. We have a huge web project in Python. We run nodemon (to start the Python process) and webpack --watch to monitor filesystem events. There are literally tens of thousands of inotify watches (the default 8k limit doesn't work for us). Both run on the (osxfs) volume.

The Python process is a web application with pretty large HTML pages (~10 MB). The crash usually happens when we refresh a page in a web browser. In fact, when we use the minified version of the page (which is about 200 KB, i.e. 50x smaller) the failure is rarer, so it probably depends on the amount of traffic between the host system and the containers. Each page visit also fetches data from a database that is outside of this physical box.

Also, each web page touches a session file which lives on the shared volume. And it may happen (I don't know how to verify it) that some inotify watches follow that file's changes, if we have misconfigured the watchers for whatever reason.

This is the story. I can't publish all of our sources, but if you give me some hints about which exact operations influence ringbuf usage, we can come up with a smaller script to reproduce the issue: do we need filesystem operations? Do network connections or packets fill the ringbuf? What about inotify?

@gluxon

gluxon commented Aug 8, 2016

This is the story. I can't publish all of our sources, but if you give me some hints about which exact operations influence ringbuf usage, we can come up with a smaller script to reproduce the issue: do we need filesystem operations? Do network connections or packets fill the ringbuf? What about inotify?

tailhook

This is very well said. My project is in a similar situation. While I open source all my personal projects, the one this bug affects is at my job, so the source code is difficult to share. I attempted to start an nginx image that loads an HTML page returning many large images and wasn't able to reproduce the bug. The workaround I used, until I discovered I could increase the reply ring size, was just disabling images. I agree that knowing exactly what causes reply ring entries to accumulate would definitely help me upload a test case.

@ijc25 Thank you for your detailed comment! Reading it was very interesting after looking at the source code myself. I appreciate a developer who explains how an issue occurs so the rest of us can learn.

@dsheets
Collaborator

dsheets commented Aug 19, 2016

@ijc any phrase like "fix [issue ref]" will close an issue even if preceded by "not". :-P

@dsheets dsheets reopened this Aug 19, 2016
@ijc
Collaborator

ijc commented Aug 19, 2016

@dsheets: thanks! I was careful with my wording in #51 but apparently got sloppy when I pulled the changes over.

@kudos

kudos commented Sep 10, 2016

I'm seeing a similar crash:

11/09/2016 00:45:09.939 Docker[39599]: Assertion failed: (!REPLY_RING_EMPTY(sc)), function send_response_common, file src/pci_virtio_sock.c, line 950.
11/09/2016 00:45:09.942 Docker[39599]: virtio-net-vpnkit: initialising, opts="uuid=38112b0d-c8d1-483d-8240-07cfe3f720af,path=/Users/jonathan/Library/Containers/com.docker.docker/Data/s50,macfile=/Users/jonathan/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/mac.0"
11/09/2016 00:45:09.943 Docker[39599]: Interface will have uuid 38112b0d-c8d1-483d-8240-07cfe3f720af
11/09/2016 00:45:09.943 Docker[39599]: Connection established with MAC=c0:ff:ee:c0:ff:ee and MTU 1500
11/09/2016 00:45:09.943 Docker[39599]: virtio-9p: initialising path=/Users/jonathan/Library/Containers/com.docker.docker/Data/s40,tag=db
11/09/2016 00:45:09.943 Docker[39599]: virtio-9p: initialising path=/Users/jonathan/Library/Containers/com.docker.docker/Data/s51,tag=port
11/09/2016 00:45:09.943 Docker[39599]: linkname /Users/jonathan/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/tty
11/09/2016 00:45:09.943 Docker[39599]: COM1 connected to /dev/ttys002
11/09/2016 00:45:09.944 Docker[39599]: COM1 linked to /Users/jonathan/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/tty
11/09/2016 00:45:09.947 Docker[39598]: PPP.listen: closing connection
11/09/2016 00:45:09.948 Docker[39597]: Fatal unexpected exception: Socket.Closed
11/09/2016 00:45:09.951 Docker[39598]: Socket.Stream: caught Uwt.Uwt_error(Uwt.ENOTCONN, "shutdown", "")
11/09/2016 00:45:09.953 Docker[21966]: Reap com.docker.osxfs (pid 39597): exit status 1
11/09/2016 00:45:10.011 Docker[21966]: Reap com.docker.driver.amd64-linux (pid 39599): exit status 0
11/09/2016 00:45:10.013 Docker[21976]: VM shutdown at 2016-09-11 00:45:09 +0100 IST

Unlike @tailhook, I was running a single container without any volumes mounted.

@telecoda

Experiencing the same issue here. Using docker-compose with multiple containers. It only occurs with a large amount of I/O.

@yacn

yacn commented Sep 12, 2016

I'm seeing the same crash as @kudos with the latest beta, running two containers (portal and a DB) with the code mounted into the container.

9/12/16 12:00:27.181 PM Docker[47588]: Assertion failed: (!REPLY_RING_EMPTY(sc)), function send_response_common, file src/pci_virtio_sock.c, line 950.
9/12/16 12:00:27.181 PM Docker[47588]: virtio-net-vpnkit: initialising, opts="uuid=32529f86-e3f9-40d5-a388-c47224df3252,path=/Users/iboehman/Library/Containers/com.docker.docker/Data/s50,macfile=/Users/iboehman/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/mac.0"
9/12/16 12:00:27.181 PM Docker[47588]: Interface will have uuid 32529f86-e3f9-40d5-a388-c47224df3252
9/12/16 12:00:27.181 PM Docker[47588]: Connection established with MAC=c0:ff:ee:c0:ff:ee and MTU 1500
9/12/16 12:00:27.181 PM Docker[47588]: virtio-9p: initialising path=/Users/iboehman/Library/Containers/com.docker.docker/Data/s40,tag=db
9/12/16 12:00:27.182 PM Docker[47588]: virtio-9p: initialising path=/Users/iboehman/Library/Containers/com.docker.docker/Data/s51,tag=port
9/12/16 12:00:27.182 PM Docker[47588]: linkname /Users/iboehman/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/tty
9/12/16 12:00:27.182 PM Docker[47588]: COM1 connected to /dev/ttys001
9/12/16 12:00:27.183 PM Docker[47588]: COM1 linked to /Users/iboehman/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/tty
9/12/16 12:00:27.183 PM Docker[47588]: received sigusr1, pausing
9/12/16 12:00:27.183 PM Docker[47588]: received sigusr2, unpausing
9/12/16 12:00:27.183 PM Docker[47588]: received sigusr1, pausing
9/12/16 12:00:27.183 PM Docker[47588]: received sigusr2, unpausing
9/12/16 12:00:27.188 PM Docker[47587]: PPP.listen: closing connection
9/12/16 12:00:27.189 PM Docker[47586]: Fatal unexpected exception: Socket.Closed
9/12/16 12:00:27.201 PM Docker[37616]: Reap com.docker.osxfs (pid 47586): exit status 1
9/12/16 12:00:27.222 PM Docker[37623]: VM shutdown at 2016-09-12 12:00:27 -0400 EDT
9/12/16 12:00:27.229 PM Docker[37616]: Reap com.docker.driver.amd64-linux (pid 47588): exit status 0
9/12/16 12:00:27.244 PM Docker[47587]: Socket.Stream: caught Uwt.Uwt_error(Uwt.ENOTCONN, "shutdown", "")

ijc pushed a commit to ijc/hyperkit that referenced this issue Sep 13, 2016
… of rx

v7 of the virtio vsock spec proposal adds a requirement to continue processing
the TX ring so long as "additional resources" are available, in order to avoid
deadlock. In this implementation those additional resources take the form of
the reply_ring.

In order to avoid deadlock we must ensure that we process the reply ring in
preference to actual data RX frequently enough that we avoid filling the reply
ring too. Since the reply ring is 2x the virtio ring size I _think_ it should
be sufficient to ensure that we check for new reply entries periodically
(something less than the virtio ring size) rather than every time.

This is discussed in several replies to the spec posting at
<1470324277-19300-1-git-send-email-stefanha@redhat.com>.

A more formal consideration of how deadlock is avoided using this scheme should
be forthcoming (see the above thread). Worst case, this change doesn't avoid the
deadlock; it can't make things any worse.

This should hopefully address moby#50.

Signed-off-by: Ian Campbell <ian.campbell@docker.com>
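
A rough sketch of the scheduling idea the commit message describes, i.e. draining the reply ring at least once every N TX descriptors with N kept below the virtio ring size. Every name, helper, and constant below is an assumption for illustration, not the actual hyperkit code.

    /* All names below are illustrative stand-ins for the real hyperkit state
     * and helpers. */
    struct vsock_ctx;                                    /* opaque device state */
    int  tx_ring_has_work(struct vsock_ctx *sc);
    int  reply_ring_full(struct vsock_ctx *sc);
    void handle_one_tx_descriptor(struct vsock_ctx *sc); /* may queue a reply */
    void flush_reply_ring(struct vsock_ctx *sc);         /* sends queued replies to the guest */

    #define ASSUMED_VTSOCK_RINGSZ 256
    #define REPLY_CHECK_PERIOD (ASSUMED_VTSOCK_RINGSZ / 2)  /* must stay < ring size */

    void process_tx_ring(struct vsock_ctx *sc)
    {
        unsigned int handled = 0;

        /* Keep consuming TX so long as the "additional resources" (reply ring
         * slots) are available, per the v7 spec wording quoted above. */
        while (tx_ring_has_work(sc) && !reply_ring_full(sc)) {
            handle_one_tx_descriptor(sc);

            /* Checking every descriptor is not required; draining at least once
             * per REPLY_CHECK_PERIOD descriptors keeps the 2x-sized reply ring
             * from filling up between checks. */
            if (++handled % REPLY_CHECK_PERIOD == 0)
                flush_reply_ring(sc);
        }

        /* Flush whatever replies accumulated before returning to data RX. */
        flush_reply_ring(sc);
    }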
@ijc
Collaborator

ijc commented Sep 16, 2016

I've merged the changes in #59 and the new Linux frontend patches into Docker for Mac's master branch, so they will be present in the next beta, which will be beta27, due (hopefully) next week (but possibly as long as the week after, depending on other bugs etc.).

I think these changes will fix this issue so I am going to close it now while it is on my mind.

Once beta27 is released (if you install beta26 now you should get it via autoupdate) please do retest and file a bug against https://github.com/docker/for-mac (via the "Diagnose & Feedback" entry in the whale menu) if the issue persists or recurs in a different form.

Thanks for your report(s) and your patience.

Ian.

@ijc ijc closed this as completed Sep 16, 2016
@telecoda

@ijc25 - nice work. I have been testing today and I'm not experiencing the implosion of Docker I was getting before with heavy I/O 👍

@ijc
Collaborator

ijc commented Sep 16, 2016

@telecoda the changes referred to here are not yet in any released version of docker for mac (they will be in beta27 when it is released). So I don't know what change has fixed things for you, but I'm glad everything is ok!

@telecoda

@ijc25 sorry, I was a little premature with my congrats. That'll explain why I still have the same issue…

@bryanhelmig

Now that beta27 has landed, it looks like I'm still getting some sort of error in this class; here are some syslog -k Sender Docker logs: https://gist.github.com/bryanhelmig/dca8725d5fcd4649432fe4e5d068dd51.

@ijc
Collaborator

ijc commented Sep 29, 2016

@bryanhelmig Yes, that does look rather similar :-(.

Please could you open a fresh ticket against https://github.com/docker/for-mac with the full output of diagnostics. Any sort of reproduction case would also be useful in tracking this down.

I understand that many of the repro cases involve building proprietary software, but if someone were able to pare their case down to just the basic build-system skeleton required to trigger the issue (removing all the proprietary stuff), or to find a subset of their code which they were willing/able to post, then we'll stand a much better chance of nailing this once and for all.

@bryanhelmig

I will try and set up a build that does the following things we're doing:

  1. a mounted volume backing both js and python
  2. a handful of very large, non-minified debug-build JavaScript files that are served through a Django dev server
  3. a non-trivial Django view utilizing many imports

I will see if we can reproduce it. Would a github repo with a docker-compose.yml be sufficient?

@mpauly

mpauly commented Oct 5, 2016

In a similar configuration (also Django, with django-static-precompiler and less) we encountered similar problems. However, these turned out to be extremely difficult to replicate in a more minimalistic setup. We were able to get around this by re-enabling sendfile in our uwsgi config, which we had previously disabled in our dev setup to work around a bug in an older version of Docker for Mac.
Commenting out disable-sendfile = true in our uwsgi config .ini file solved the problem - I hope this helps.

@bryanhelmig

@mpauly curious to hear a bit about your uwsgi + dev server setup.

@bryanhelmig

@ijc25 Hey Ian - I have a reproducible sample project here: https://github.com/bryanhelmig/docker-compose-crash-repro. It should be as simple as cloning, building and running the container to reproduce. You can tweak the pressure up/down as needed.

Can you take a peek and let me know if it happens for you? Happy to go further - this is a massive pain point for us now.

@mpauly

mpauly commented Oct 7, 2016

Unfortunately I was a bit early on this - even after enabling sendfile the bug recurred, only less frequently. I also created a sample project to reproduce it: https://github.com/mpauly/dockerbug. It looks fairly similar to what @bryanhelmig did.

@bryanhelmig

@ijc25 thanks from us as well - looking forward to giving beta28 a spin.

@bryanhelmig

First pass with beta28 and it looks 👍. If you don't see me complain/link to a new issue, this is resolved in the wild for us!

@mpauly

mpauly commented Oct 18, 2016

Same here, beta28 solves the problem. Thanks!
