
Launching system containers preloaded with lots of inner containers may lead to Sysbox gRPC timeouts #176

Closed
longzmkm opened this issue Jan 5, 2021 · 12 comments
Labels: bug (Something isn't working)
longzmkm commented Jan 5, 2021

newland@student:~$ docker images
REPOSITORY   TAG      IMAGE ID       CREATED        SIZE
test         latest   c68f9e9b046c   10 hours ago   4.24GB
dockerd      dind     fafe8522f3dc   14 hours ago   978MB
docker       dind     8d54f09a2b77   2 weeks ago    256MB
alpine       latest   389fef711851   2 weeks ago    5.58MB
newland@student:~$ docker run --runtime=sysbox-runc -d -p 10004:22 -p 10005:20805 -p 10006:20905 --name test test:latest
1eeef69975b981878a24ccd058b7ef58094f3085d790dd5a120723ff27049e4c
docker: Error response from daemon: OCI runtime create failed: error in the container spec: invalid mount config: failed to request mounts from sysbox-mgr: failed to invoke ReqMounts via grpc: rpc error: code = DeadlineExceeded desc = context deadline exceeded: unknown.
newland@student:~$ docker ps -a
CONTAINER ID   IMAGE         COMMAND    CREATED         STATUS    PORTS                                                                       NAMES
1eeef69975b9   test:latest   "run.sh"   4 minutes ago   Created   0.0.0.0:10004->22/tcp, 0.0.0.0:10005->20805/tcp, 0.0.0.0:10006->20905/tcp   test
newland@student:~$ docker start 1eeef69975b9
1eeef69975b9
newland@student:~$ docker logs -f 1eeef69975b9
time="2021-01-05T01:30:04.047934424Z" level=info msg="Starting up"
time="2021-01-05T01:30:04.108841021Z" level=info msg="libcontainerd: started new containerd process" pid=82
time="2021-01-05T01:30:04.108923121Z" level=info msg="parsed scheme: "unix"" module=grpc
time="2021-01-05T01:30:04.108938621Z" level=info msg="scheme "unix" not registered, fallback to default scheme" module=grpc
time="2021-01-05T01:30:04.108991221Z" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///var/run/docker/containerd/containerd.sock 0 }] }" module=grpc
time="2021-01-05T01:30:04.109026821Z" level=info msg="ClientConn switching balancer to "pick_first"" module=grpc
time="2021-01-05T01:30:04.693753529Z" level=info msg="starting containerd" revision=ea765aba0d05254012b0b9e595e995c09186427f version=v1.3.9
time="2021-01-05T01:30:04.726213974Z" level=info msg="loading plugin "io.containerd.content.v1.content"..." type=io.containerd.content.v1
time="2021-01-05T01:30:04.726610774Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.btrfs"..." type=io.containerd.snapshotter.v1
time="2021-01-05T01:30:04.727539772Z" level=info msg="skip loading plugin "io.containerd.snapshotter.v1.btrfs"..." error="path /var/lib/docker/containerd/daemon/io.containerd.snapshotter.v1.btrfs (shiftfs) must be a btrfs filesystem to be used with the btrfs snapshotter: skip plugin" type=io.containerd.snapshotter.v1
time="2021-01-05T01:30:04.727623172Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.devmapper"..." type=io.containerd.snapshotter.v1
time="2021-01-05T01:30:04.727686072Z" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.devmapper" error="devmapper not configured"
time="2021-01-05T01:30:04.727708772Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.aufs"..." type=io.containerd.snapshotter.v1
time="2021-01-05T01:30:04.729167069Z" level=info msg="skip loading plugin "io.containerd.snapshotter.v1.aufs"..." error="modprobe aufs failed: "modprobe: can't change directory to '/lib/modules': No such file or directory\n": exit status 1: skip plugin" type=io.containerd.snapshotter.v1
time="2021-01-05T01:30:04.729197769Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.native"..." type=io.containerd.snapshotter.v1
time="2021-01-05T01:30:04.729669169Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.overlayfs"..." type=io.containerd.snapshotter.v1
time="2021-01-05T01:30:04.729921068Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.zfs"..." type=io.containerd.snapshotter.v1
time="2021-01-05T01:30:04.730157868Z" level=info msg="skip loading plugin "io.containerd.snapshotter.v1.zfs"..." error="path /var/lib/docker/containerd/daemon/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter: skip plugin" type=io.containerd.snapshotter.v1
time="2021-01-05T01:30:04.730205968Z" level=info msg="loading plugin "io.containerd.metadata.v1.bolt"..." type=io.containerd.metadata.v1
time="2021-01-05T01:30:04.730271767Z" level=warning msg="could not use snapshotter devmapper in metadata plugin" error="devmapper not configured"
time="2021-01-05T01:30:04.730299267Z" level=info msg="metadata content store policy set" policy=shared
time="2021-01-05T01:30:04.744942443Z" level=info msg="loading plugin "io.containerd.differ.v1.walking"..." type=io.containerd.differ.v1
time="2021-01-05T01:30:04.745004543Z" level=info msg="loading plugin "io.containerd.gc.v1.scheduler"..." type=io.containerd.gc.v1
time="2021-01-05T01:30:04.745072242Z" level=info msg="loading plugin "io.containerd.service.v1.containers-service"..." type=io.containerd.service.v1
time="2021-01-05T01:30:04.745105242Z" level=info msg="loading plugin "io.containerd.service.v1.content-service"..." type=io.containerd.service.v1
time="2021-01-05T01:30:04.745135942Z" level=info msg="loading plugin "io.containerd.service.v1.diff-service"..." type=io.containerd.service.v1
time="2021-01-05T01:30:04.745175942Z" level=info msg="loading plugin "io.containerd.service.v1.images-service"..." type=io.containerd.service.v1
time="2021-01-05T01:30:04.745200742Z" level=info msg="loading plugin "io.containerd.service.v1.leases-service"..." type=io.containerd.service.v1
time="2021-01-05T01:30:04.745221942Z" level=info msg="loading plugin "io.containerd.service.v1.namespaces-service"..." type=io.containerd.service.v1
time="2021-01-05T01:30:04.745241842Z" level=info msg="loading plugin "io.containerd.service.v1.snapshots-service"..." type=io.containerd.service.v1
time="2021-01-05T01:30:04.745262742Z" level=info msg="loading plugin "io.containerd.runtime.v1.linux"..." type=io.containerd.runtime.v1
time="2021-01-05T01:30:04.746095041Z" level=info msg="loading plugin "io.containerd.runtime.v2.task"..." type=io.containerd.runtime.v2
time="2021-01-05T01:30:04.746333040Z" level=info msg="loading plugin "io.containerd.monitor.v1.cgroups"..." type=io.containerd.monitor.v1
time="2021-01-05T01:30:04.746855539Z" level=info msg="loading plugin "io.containerd.service.v1.tasks-service"..." type=io.containerd.service.v1
time="2021-01-05T01:30:04.746915239Z" level=info msg="loading plugin "io.containerd.internal.v1.restart"..." type=io.containerd.internal.v1
time="2021-01-05T01:30:04.746973339Z" level=info msg="loading plugin "io.containerd.grpc.v1.containers"..." type=io.containerd.grpc.v1
time="2021-01-05T01:30:04.747013639Z" level=info msg="loading plugin "io.containerd.grpc.v1.content"..." type=io.containerd.grpc.v1
time="2021-01-05T01:30:04.747038839Z" level=info msg="loading plugin "io.containerd.grpc.v1.diff"..." type=io.containerd.grpc.v1
time="2021-01-05T01:30:04.747076939Z" level=info msg="loading plugin "io.containerd.grpc.v1.events"..." type=io.containerd.grpc.v1
time="2021-01-05T01:30:04.747098539Z" level=info msg="loading plugin "io.containerd.grpc.v1.healthcheck"..." type=io.containerd.grpc.v1
time="2021-01-05T01:30:04.747124839Z" level=info msg="loading plugin "io.containerd.grpc.v1.images"..." type=io.containerd.grpc.v1
time="2021-01-05T01:30:04.747181939Z" level=info msg="loading plugin "io.containerd.grpc.v1.leases"..." type=io.containerd.grpc.v1
time="2021-01-05T01:30:04.747239039Z" level=info msg="loading plugin "io.containerd.grpc.v1.namespaces"..." type=io.containerd.grpc.v1
time="2021-01-05T01:30:04.747266039Z" level=info msg="loading plugin "io.containerd.internal.v1.opt"..." type=io.containerd.internal.v1
time="2021-01-05T01:30:04.747713238Z" level=info msg="loading plugin "io.containerd.grpc.v1.snapshots"..." type=io.containerd.grpc.v1
time="2021-01-05T01:30:04.747738838Z" level=info msg="loading plugin "io.containerd.grpc.v1.tasks"..." type=io.containerd.grpc.v1
time="2021-01-05T01:30:04.747756038Z" level=info msg="loading plugin "io.containerd.grpc.v1.version"..." type=io.containerd.grpc.v1
time="2021-01-05T01:30:04.747782538Z" level=info msg="loading plugin "io.containerd.grpc.v1.introspection"..." type=io.containerd.grpc.v1
time="2021-01-05T01:30:04.769498801Z" level=info msg=serving... address=/var/run/docker/containerd/containerd-debug.sock
time="2021-01-05T01:30:04.769676701Z" level=info msg=serving... address=/var/run/docker/containerd/containerd.sock.ttrpc
time="2021-01-05T01:30:04.769789501Z" level=info msg=serving... address=/var/run/docker/containerd/containerd.sock
time="2021-01-05T01:30:04.769813700Z" level=info msg="containerd successfully booted in 0.078494s"
time="2021-01-05T01:30:04.990579626Z" level=info msg="parsed scheme: "unix"" module=grpc
time="2021-01-05T01:30:04.990612726Z" level=info msg="scheme "unix" not registered, fallback to default scheme" module=grpc
time="2021-01-05T01:30:04.990630726Z" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///var/run/docker/containerd/containerd.sock 0 }] }" module=grpc
time="2021-01-05T01:30:04.990640326Z" level=info msg="ClientConn switching balancer to "pick_first"" module=grpc
time="2021-01-05T01:30:04.992267723Z" level=info msg="parsed scheme: "unix"" module=grpc
time="2021-01-05T01:30:04.992316023Z" level=info msg="scheme "unix" not registered, fallback to default scheme" module=grpc
time="2021-01-05T01:30:04.992340223Z" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///var/run/docker/containerd/containerd.sock 0 }] }" module=grpc
time="2021-01-05T01:30:04.992357023Z" level=info msg="ClientConn switching balancer to "pick_first"" module=grpc
time="2021-01-05T01:30:04.994839119Z" level=error msg="AUFS cannot be used in non-init user namespace" storage-driver=aufs
time="2021-01-05T01:30:04.994884219Z" level=error msg="[graphdriver] prior storage driver aufs failed: driver not supported"
time="2021-01-05T01:30:04.995595018Z" level=info msg="stopping healthcheck following graceful shutdown" module=libcontainerd
time="2021-01-05T01:30:04.995596718Z" level=info msg="stopping event stream following graceful shutdown" error="context canceled" module=libcontainerd namespace=plugins.moby
failed to start daemon: error initializing graphdriver: driver not supported
newland@student:~$


ctalledo commented Jan 5, 2021

Hi @longzmkm, thanks for giving Sysbox a try.

$ docker run --runtime=sysbox-runc -d -p 10004:22 -p 10005:20805 -p 10006:20905 --name test test:latest
1eeef69975b981878a24ccd058b7ef58094f3085d790dd5a120723ff27049e4c
docker: Error response from daemon: OCI runtime create failed: error in the container spec: invalid mount config: failed to request mounts from sysbox-mgr: failed to invoke ReqMounts via grpc: rpc error: code = DeadlineExceeded desc = context deadline exceeded: unknown.

gRPC timeouts are not normal, so something is off.

Some questions:

  1. Do you mean to say that the test:latest image is larger than 4GB in size? Can you show the output of docker image ls test:latest?

  2. Is the test:latest image preloaded with inner container images? If so, do you have a sense of what's the combined size of those inner images?

  3. Are you using the Sysbox release package (e.g., v0.2.1) or are you building Sysbox from the GitHub repo source? We recently added a fix to the GitHub repo that may address the problem you are seeing. The fix is not yet in the release package.

  4. Check the sysbox-mgr log (/var/log/sysbox-mgr.log). Is there something there regarding the cause of the error?

Thanks!
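For anyone hitting the same timeout, the checks above can be run roughly like this (the log path and unit name follow the standard Sysbox packaging; adjust if your install differs):

```shell
# Question 1: check the image size
docker image ls test:latest

# A dead sysbox-mgr daemon also manifests as gRPC timeouts, so check it is alive
sudo systemctl status sysbox-mgr

# Question 4: inspect the sysbox-mgr log for errors
sudo tail -n 50 /var/log/sysbox-mgr.log
```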

@ctalledo ctalledo self-assigned this Jan 5, 2021
@ctalledo ctalledo added the bug Something isn't working label Jan 5, 2021

longzmkm commented Jan 5, 2021

  1. newland@student:~$ docker image ls test:latest
    REPOSITORY TAG IMAGE ID CREATED SIZE
    test latest c68f9e9b046c 12 hours ago 4.24GB
  2. The test:latest image is preloaded with more than 20 inner images, about 3.5GB combined.
  3. I use sysbox_0.2.1-0.ubuntu-focal_amd64.deb
  4. cat /var/log/sysbox-mgr.log
    INFO[2021-01-05 03:03:06] Caught OS signal: terminated
    INFO[2021-01-05 03:03:06] Stopping (gracefully) ...
    WARN[2021-01-05 03:03:06] The following containers are active and will stop operating properly:
    WARN[2021-01-05 03:03:06] container id: 99b34185b5f2f24243cac561313c955a4013dbbacf8b58e6087d3a5d6fb86d33
    WARN[2021-01-05 03:03:06] container id: 49d018883e44ee68137d0b7b7bd4503b26214910cb56ccaaaa854db140bd0285
    WARN[2021-01-05 03:03:06] container id: d3c2d3a254dce5b3bbd58f3ece095ea67aa383a432349f9a1b842721d861666d
    WARN[2021-01-05 03:03:06] container id: 696bb65bd869f1afe2d9d487eea4cb0a94d3f38167e1c0ba4cb2eaefab627a6c
    WARN[2021-01-05 03:03:06] container id: 75d21b89b8aa76099fe1a20a7d64ea107787a354ebe06a4f955bf37fddfcd8cc
    WARN[2021-01-05 03:03:06] container id: 36949e381adf5106fe7ebd1310a49e01c7e0b35707b1c2603a91f4a00942d31a
    WARN[2021-01-05 03:03:06] container id: efc1d8db677abb24c2e35714f76e15e8507714730e56d575876834f6931c5f05
    WARN[2021-01-05 03:03:06] container id: 1eeef69975b981878a24ccd058b7ef58094f3085d790dd5a120723ff27049e4c
    WARN[2021-01-05 03:03:07] container id: 1d2c050dd2cbb300778adedc179b8cbc894d80d03d4dd4201974a31058b4419f
    WARN[2021-01-05 03:03:07] dockerVolMgr: failed to destroy volumes for container 696bb65bd869f1afe2d9d487eea4cb0a94d3f38167e1c0ba4cb2eaefab627a6c: failed to stat /var/lib/sysbox/docker/696bb65bd869f1afe2d9d487eea4cb0a94d3f38167e1c0ba4cb2eaefab627a6c: stat /var/lib/sysbox/docker/696bb65bd869f1afe2d9d487eea4cb0a94d3f38167e1c0ba4cb2eaefab627a6c: no such file or directory
    WARN[2021-01-05 03:03:37] dockerVolMgr: failed to destroy volumes for container efc1d8db677abb24c2e35714f76e15e8507714730e56d575876834f6931c5f05: failed to stat /var/lib/sysbox/docker/efc1d8db677abb24c2e35714f76e15e8507714730e56d575876834f6931c5f05: stat /var/lib/sysbox/docker/efc1d8db677abb24c2e35714f76e15e8507714730e56d575876834f6931c5f05: no such file or directory
    WARN[2021-01-05 03:03:39] dockerVolMgr: failed to destroy volumes for container d3c2d3a254dce5b3bbd58f3ece095ea67aa383a432349f9a1b842721d861666d: failed to stat /var/lib/sysbox/docker/d3c2d3a254dce5b3bbd58f3ece095ea67aa383a432349f9a1b842721d861666d: stat /var/lib/sysbox/docker/d3c2d3a254dce5b3bbd58f3ece095ea67aa383a432349f9a1b842721d861666d: no such file or directory
    WARN[2021-01-05 03:03:39] dockerVolMgr: failed to destroy volumes for container 1eeef69975b981878a24ccd058b7ef58094f3085d790dd5a120723ff27049e4c: failed to stat /var/lib/sysbox/docker/1eeef69975b981878a24ccd058b7ef58094f3085d790dd5a120723ff27049e4c: stat /var/lib/sysbox/docker/1eeef69975b981878a24ccd058b7ef58094f3085d790dd5a120723ff27049e4c: no such file or directory
    WARN[2021-01-05 03:03:39] dockerVolMgr: failed to destroy volumes for container 1d2c050dd2cbb300778adedc179b8cbc894d80d03d4dd4201974a31058b4419f: failed to stat /var/lib/sysbox/docker/1d2c050dd2cbb300778adedc179b8cbc894d80d03d4dd4201974a31058b4419f: stat /var/lib/sysbox/docker/1d2c050dd2cbb300778adedc179b8cbc894d80d03d4dd4201974a31058b4419f: no such file or directory
    INFO[2021-01-05 03:04:09] Stopped.
    INFO[2021-01-05 03:04:17] Exiting.

Thanks!


ctalledo commented Jan 5, 2021

Thanks @longzmkm.

From the sysbox-mgr log, I can see that the sysbox manager was terminated by something/someone:

INFO[2021-01-05 03:03:06] Caught OS signal: terminated
...
INFO[2021-01-05 03:04:09] Stopped.
INFO[2021-01-05 03:04:17] Exiting.

Thus, I suspect that the cause of the gRPC timeout you reported earlier is that the sysbox-mgr is not active at the time you are launching the large container (i.e., sysbox-runc sends a gRPC to the sysbox-mgr, but the sysbox-mgr never responds because it was previously killed).

The next question is to figure out what killed the sysbox-mgr. Some suggestions:

  1. Is there anything in the host's systemd log regarding sysbox-mgr being killed? I wonder if systemd killed the sysbox-mgr. I am speculating, but maybe systemd killed it because it used more resources than it was allocated, or the system was low on memory, etc.

  2. Does the problem reproduce 100% of the time when you launch a container with the large test:latest image? Or is it intermittent?

Thanks again.


longzmkm commented Jan 5, 2021

  1. This is the output of systemctl status docker; I don't know if this is what you need:
    docker.service - Docker Application Container Engine
    Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
    Active: active (running) since Tue 2021-01-05 03:06:41 UTC; 2h 23min ago
    Docs: https://docs.docker.com
    Main PID: 1456 (dockerd)
    Tasks: 16
    CGroup: /system.slice/docker.service
    └─1456 /usr/bin/dockerd -H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock

Jan 05 03:39:18 student dockerd[1456]: time="2021-01-05T03:39:18.982169370Z" level=info msg="ignoring event" container=23303cf4c75fc5bcca71fadc3517c0259f3546e6bb23c61f7260251081bca4d1 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jan 05 03:42:52 student dockerd[1456]: time="2021-01-05T03:42:52.848489820Z" level=info msg="ignoring event" container=10a11a72be88c2a02e9b740d002c249bdae6e71d3991ca3291756adec71fbdae module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jan 05 04:46:32 student dockerd[1456]: time="2021-01-05T04:46:32.565838491Z" level=info msg="ignoring event" container=7d30e7d289137659194135c53d6567a903b812f5ffea9fc006d8aca012655a44 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jan 05 04:49:25 student dockerd[1456]: time="2021-01-05T04:49:25.981686136Z" level=info msg="ignoring event" container=353356f58d4eedf2ce942cd635e759280edf22d457ecc970e16d93e803ca8460 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jan 05 05:06:38 student dockerd[1456]: time="2021-01-05T05:06:38.824039145Z" level=info msg="ignoring event" container=004e39abb0a86d154e263eda73055038e5f276f75ec914e3ee5d66175984c2e2 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jan 05 05:19:07 student dockerd[1456]: time="2021-01-05T05:19:07.522718210Z" level=info msg="ignoring event" container=17f28ef78a87fb1087ab8cf3c692fb4f4aa5891141c625c7a0d97348aefcc904 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jan 05 05:26:45 student dockerd[1456]: time="2021-01-05T05:26:45.027953450Z" level=error msg="stream copy error: reading from a closed fifo"
Jan 05 05:26:45 student dockerd[1456]: time="2021-01-05T05:26:45.030304207Z" level=error msg="stream copy error: reading from a closed fifo"
Jan 05 05:26:45 student dockerd[1456]: time="2021-01-05T05:26:45.279203752Z" level=error msg="a4a3a84a26a6bcbc8d632c0abdc94ea719537e893e0093e66d60635ceddca147 cleanup: failed to delete container from containerd: no such container"
Jan 05 05:26:45 student dockerd[1456]: time="2021-01-05T05:26:45.279276251Z" level=error msg="Handler for POST /v1.41/containers/a4a3a84a26a6bcbc8d632c0abdc94ea719537e893e0093e66d60635ceddca147/start returned error: OCI runtime create failed: error in the container spec: invalid mount config: failed to request mounts from sysbox-mgr: failed to invoke ReqMounts via grpc: rpc error: code = DeadlineExceeded desc = context deadline exceeded: unknown"

  2. It reproduces consistently (100% of the time).


ctalledo commented Jan 5, 2021

This is the output of systemctl status docker; I don't know if this is what you need.

No, that won't show much, because Docker only knows that a container failed to deploy. What we need is the systemd journal log (e.g., journalctl -xe); look there for any messages indicating that the sysbox-mgr process was killed.
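Concretely, the journal and kernel log can be searched for a kill event along these lines (standard systemd/Linux tooling; the unit name assumes the stock Sysbox install):

```shell
# Messages from the sysbox-mgr unit around the failure window
sudo journalctl -u sysbox-mgr --since "1 hour ago"

# Kernel log: did the OOM killer terminate anything?
sudo dmesg | grep -iE 'out of memory|oom|killed process'

# Recent journal entries mentioning sysbox, in case the kill came from elsewhere
sudo journalctl -xe | grep -i sysbox
```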

It reproduces consistently.

Got it. Have you tried restarting Sysbox (sudo systemctl restart sysbox) to see whether the problem still reproduces? If it does, check the systemd journal log again.

This is a bit tricky to debug, because something in your system is killing sysbox-mgr (leading to the gRPC timeout). In other words, it's not necessarily a bug in Sysbox itself that is causing the problem; it may be some system constraint that Sysbox is exceeding, causing it to be terminated.

Is the test:latest image you are using public? Can I use it to repro on my side?

And finally: can you provide some info about your host (number of CPUs, RAM size, and the output of lsb_release -a and uname -r)?

Thanks again!


longzmkm commented Jan 6, 2021

I renamed the image to sysbox-test; you can download it with:
docker pull tlqjxd/sysbox-test:V1.0.0

I am using a Microsoft Azure VM.
newland@student:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.5 LTS
Release: 18.04
Codename: bionic
newland@student:~$ uname -r
5.4.0-1034-azure

newland@student:~$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
stepping : 1
microcode : 0xffffffff
cpu MHz : 2294.686
cache size : 51200 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 20
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti tpr_shadow vnmi ept vpid fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt md_clear
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips : 4589.37
clflush size : 64
cache_alignment : 64
address sizes : 44 bits physical, 48 bits virtual
power management:

(processors 1–3 report identical values apart from apicid and core id; 4 vCPUs total)

newland@student:~$ grep MemTotal /proc/meminfo
MemTotal: 16370940 kB

Thanks again!


longzmkm commented Jan 6, 2021

Please let me know if there is anything else you need.


ctalledo commented Jan 6, 2021

Hi @longzmkm, thanks for providing the Docker image. I was able to reproduce it easily on my side; I am investigating further right now.


ctalledo commented Jan 6, 2021

@longzmkm:

How was the tlqjxd/sysbox-test:V1.0.0 image created? In particular, how did you embed the inner container images in it? Did you use Sysbox to do this (see here)?
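For reference, the Sysbox-documented "docker commit" approach to preloading inner images looks roughly like this (the image and container names here are placeholders, not from this thread, and the steps assume the image runs an inner Docker daemon):

```shell
# Start a system container that runs an inner Docker daemon
docker run --runtime=sysbox-runc -d --name preload my-dind-image

# Pull the desired inner images from inside the container
docker exec preload docker pull alpine:latest

# Stop the container and commit it as a new, preloaded image
docker stop preload
docker commit preload my-preloaded-image:latest
docker rm preload
```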

I ask because I can see that the inner container images seem to be corrupted. For example, here are the contents of the container's /var/lib/docker:

# ls -l var/lib/docker/aufs/diff/53fae98765bb18eb0515da0be8b84f1005a9744c2f6c8efb3574b6321daf0d38/var/lib/apt/lists/
ls: cannot access 'var/lib/docker/aufs/diff/53fae98765bb18eb0515da0be8b84f1005a9744c2f6c8efb3574b6321daf0d38/var/lib/apt/lists/partial': No such file or directory
total 0
c????????? ? ? ? ?            ? partial

This appears to be causing problems for Sysbox (though I need to confirm).

Also: I see the inner container images are on aufs, which should be fine but I was expecting them to be on overlayfs (the default storage filesystem for Docker). Any ideas why?

Thanks.

@ctalledo
Copy link
Member

ctalledo commented Jan 6, 2021

This appears to be causing problems for Sysbox (though I need to confirm).

Yes, it is. It causes Sysbox to report the following errors:

docker run --rm --runtime=sysbox-runc -d -p 10004:22 -p 10005:20805 -p 10006:20905 --name test  tlqjxd/sysbox-test:V1.0.0
docker: Error response from daemon: OCI runtime create failed: error in the container spec: invalid mount config: failed to request mounts from sysbox-mgr: failed to invoke ReqMounts via grpc: rpc error: code = Unknown desc = failed to setup mounts backing /var/lib/docker for container 185f7639d622a1d3817600c7fdcdd77ef7e277c80f82e711a3915f07640010e1: volume sync-in failed: failed to sync /var/lib/docker/overlay2/40bbf0cd2d1e68e9660636d2c565ee88442d427d3a8549661a0c22721c84c7cf/merged/var/lib/docker/ to /var/lib/sysbox/docker/185f7639d622a1d3817600c7fdcdd77ef7e277c80f82e711a3915f07640010e1: file has vanished: "/var/lib/docker/overlay2/40bbf0cd2d1e68e9660636d2c565ee88442d427d3a8549661a0c22721c84c7cf/merged/var/lib/docker/aufs/diff/53fae98765bb18eb0515da0be8b84f1005a9744c2f6c8efb3574b6321daf0d38/var/lib/apt/lists/partial"
file has vanished: "/var/lib/docker/overlay2/40bbf0cd2d1e68e9660636d2c565ee88442d427d3a8549661a0c22721c84c7cf/merged/var/lib/docker/aufs/diff/9e5b214f55f97493a0127b340ebe3c7684ea995d51c18202553921fbf15e43e7/usr/src/arm-alignment.patch"
file has vanished: "/var/lib/docker/overlay2/40bbf0cd2d1e68e9660636d2c565ee88442d427d3a8549661a0c22721c84c7cf/merged/var/lib/docker/aufs/diff/9e5b214f55f97493a0127b340ebe3c7684ea995d51c18202553921fbf15e43e7/usr/src/fix-xattrs-glibc.patch"
file has vanished: "/var/lib/docker/overlay2/40bbf0cd2d1e68e9660636d2c565ee88442d427d3a8549661a0c22721c84c7cf/merged/var/lib/docker/aufs/diff/9e5b214f55f97493a0127b340ebe3c7684ea995d51c18202553921fbf15e43e7/usr/src/musl-find_library.patch"
file has vanished: "/var/lib/docker/overlay2/40bbf0cd2d1e68e9660636d2c565ee88442d427d3a8549661a0c22721c84c7cf/merged/var/lib/docker/aufs/diff/d2cefb095bd4886d13a66f1cc78e0a4a897fc99a9a3da394900aa5f8d54245f2/usr/src/requirements.txt"

However, this is not the cause of the gRPC timeout. The gRPC timeout is due to the image being so large that Sysbox is having trouble dealing with it.

More specifically, when a container image that is preloaded with inner containers is launched, Sysbox has to copy those inner container images to a host dir it controls (/var/lib/sysbox). This is needed in order for the inner Docker to work properly (it avoids overlayfs-on-overlayfs scenarios).

This copy takes a long time with this large container image on your fairly small machine: in a VM of the same size, I measured ~1 minute. I suspect it's because your image has a lot of files inside the /var/lib/docker/aufs directory. On a faster machine (32 CPU cores at 3.2GHz with an SSD), the copy takes much less time (~10 seconds).
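If you want to observe this yourself, the copy can be timed and its destination inspected with standard tools (/var/lib/sysbox is the default Sysbox data dir, as seen in the sysbox-mgr log earlier in this thread):

```shell
# Time how long container creation takes end to end
time docker run --runtime=sysbox-runc -d --name timing-test test:latest

# Inspect how much data sysbox-mgr copied for each container
sudo du -sh /var/lib/sysbox/docker/* 2>/dev/null
```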

I don't have a good solution at this point. Some suggestions for a workaround would be:

  1. Try squashing the (outer) container image's layers to make it smaller. This can lead to a significant reduction in image size.

  2. Try squashing the inner container image layers, for the same reason as above.

  3. Try using overlayfs on the inner container image layers (as opposed to aufs).

  4. Use Docker + Sysbox to embed the inner container images into the outer container, as described here. This will likely avoid the corruption issue I described above.
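Workarounds 1 and 3 can be sketched as follows (treat these as illustrative, not from this thread: --squash requires Docker's experimental mode, and the daemon.json snippet would need to be baked into the image at /etc/docker/daemon.json to affect the inner Docker):

```shell
# (1) Squash the outer image's layers during build (experimental feature)
docker build --squash -t test:squashed .

# Alternative flattening: export a container's filesystem and re-import it
docker export some-container | docker import - test:flat

# (3) Configure the inner Docker to use overlay2 instead of aufs
cat <<'EOF' > daemon.json
{
  "storage-driver": "overlay2"
}
EOF
```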

I am sorry that I don't have a "real solution" at this time, only workarounds. The real solution is to avoid the copy mentioned above; we have some ideas on how to do this, but we won't get there for some time.


longzmkm commented Jan 7, 2021

@ctalledo Thank you for your advice. I'll give it a try.

@ctalledo ctalledo changed the title Running more than 4G size images grpc timeout,What should I do? Launching system containers preloaded with lots of inner containers may lead to Sysbox gRPC timeouts Feb 24, 2021
@ctalledo ctalledo added this to To do in Sysbox Dev via automation Feb 24, 2021
@ctalledo

Closing (issue is stale, a work-around was provided).

Sysbox Dev automation moved this from To do to Done Oct 22, 2021