COPY --from consistently fails with ZFS storage driver #1758

Closed
AdamSLevy opened this issue Oct 27, 2020 · 39 comments

@AdamSLevy

Building with any Dockerfile that uses COPY --from, when the previous layer isn't already in the cache, appears to cause the following issue:

$ DOCKER_BUILDKIT=1 docker build -t $CONTAINER_REGISTRY/avengers/multiplexer-operator:adamslevy-test -f ./cmd/cp-metrics-multiplexer/Dockerfile --build-arg GOPROXY=http://localhost .
[+] Building 14.2s (14/14) FINISHED                                                                                                                                                                                  
 => [internal] load build definition from Dockerfile                                                                                                                                                            0.1s
 => => transferring dockerfile: 38B                                                                                                                                                                             0.0s
 => [internal] load .dockerignore                                                                                                                                                                               0.2s
 => => transferring context: 35B                                                                                                                                                                                0.0s
 => [internal] load metadata for docker.io/library/gobase:latest                                                                                                                                                0.0s
 => [internal] load metadata for docker.io/library/alpine:3.10.2                                                                                                                                                0.0s
 => [builder 1/2] FROM docker.io/library/gobase                                                                                                                                                                 0.0s
 => CACHED [stage-1 1/3] FROM docker.io/library/alpine:3.10.2                                                                                                                                                   0.0s
 => [internal] load build context                                                                                                                                                                               1.3s
 => => transferring context: 284.53kB                                                                                                                                                                           1.2s
 => CACHED [builder 2/2] COPY Makefile go.mod go.sum ./                                                                                                                                                         0.0s
 => CACHED [builder 3/2] COPY internal ./internal                                                                                                                                                               0.0s
 => CACHED [builder 4/2] COPY pkg ./pkg                                                                                                                                                                         0.0s
 => [builder 5/2] COPY cmd ./cmd                                                                                                                                                                                2.0s
 => [builder 6/2] RUN make multiplexer-operator                                                                                                                                                                 9.8s
 => CANCELED [stage-1 2/3] COPY --from=builder /src/bin/multiplexer-operator /multiplexer-operator                                                                                                              0.3s 
 => ERROR [stage-1 3/3] COPY --from=builder /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/ca-certificates.crt                                                                                               0.0s 
------
 > [stage-1 3/3] COPY --from=builder /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/ca-certificates.crt:
------
failed to solve with frontend dockerfile.v0: failed to build LLB: failed to compute cache key: error creating zfs mount: mount zroot/ROOT/default/p2ktr8qsvusedj4edsddrepei:/var/lib/docker/zfs/graph/p2ktr8qsvusedj4edsddrepei: no such file or directory

Other info:

$ docker info
Client:
 Debug Mode: false

Server:
 Containers: 35
  Running: 0
  Paused: 0
  Stopped: 35
 Images: 630
 Server Version: 19.03.13-ce
 Storage Driver: zfs
  Zpool: zroot
  Zpool Health: ONLINE
  Parent Dataset: zroot/ROOT/default
  Space Used By Parent: 217493699584
  Space Available: 753881241088
  Parent Quota: no
  Compression: lz4
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: c623d1b36f09f8ef6536a057bd658b3aa8632828.m
 runc version: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 5.9.1-zen2-1-zen
 Operating System: Arch Linux
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 31.25GiB
 Name: clevo.adamarch
 ID: GV3Y:62D5:V5TB:TQZG:U6QA:MOAL:MVLT:7ORI:VJ2V:4W6L:4EQJ:2W7R
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Please let me know if I can provide any further helpful info.

@AdamSLevy
Author

It would be awesome to see this fixed, because BuildKit is otherwise much faster than the classic docker build engine when using the zfs storage driver.

@tonistiigi
Member

Please provide a reproducible testcase.

@codyps

codyps commented Jan 8, 2021

Edit: sorry, was following a link from stackoverflow that talked about the copy range issue and referenced this ticket, but this appears distinct. I opened a new issue on moby about my problem.

Using docker run --rm crossbario/crossbar produces the same error (again, with zfs storage), but for a different command (run instead of build).

docker run --rm crossbario/crossbar
docker: Error response from daemon: failed to copy files: copy file range failed: invalid argument.
See 'docker run --help'.

running: "Docker daemon" commit=8891c58a43 graphdriver(s)=zfs version=20.10.2

$ zfs version
zfs-2.0.1-1
zfs-kmod-2.0.0-1
$ uname -a
Linux franklin 5.10.4-arch2-1 #1 SMP PREEMPT Fri, 01 Jan 2021 05:29:53 +0000 x86_64 GNU/Linux
`docker inspect crossbario/crossbar:latest`
[
    {
        "Id": "sha256:c632c6e5461ac9b49a3bb1d3b0a8b2185e8f9fdb1ce65c9b2ec7a3c236c1fa77",
        "RepoTags": [
            "crossbario/crossbar:latest"
        ],
        "RepoDigests": [
            "crossbario/crossbar@sha256:cb82378bad25cbd1952b76dfb9826f4140d3f0f534ad25c865337484c72bbc8e"
        ],
        "Parent": "",
        "Comment": "",
        "Created": "2020-12-14T23:08:50.072571251Z",
        "Container": "7ba359773724c529813635dd7d5d0833695dd929d5656aeea21a3cb9d7362a0a",
        "ContainerConfig": {
            "Hostname": "7ba359773724",
            "Domainname": "",
            "User": "crossbar",
            "AttachStdin": false,
            "AttachStdout": false,
            "AttachStderr": false,
            "ExposedPorts": {
                "8000/tcp": {},
                "8080/tcp": {}
            },
            "Tty": false,
            "OpenStdin": false,
            "StdinOnce": false,
            "Env": [
                "PATH=/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
                "LANG=C.UTF-8",
                "GPG_KEY=0D96DF4D4110E5C43FBFB17F2D347EA6AA65421D",
                "PYTHON_VERSION=3.7.9",
                "PYTHON_PIP_VERSION=20.3.1",
                "PYTHON_GET_PIP_URL=https://github.com/pypa/get-pip/raw/91630a4867b1f93ba0a12aa81d0ec4ecc1e7eeb9/get-pip.py",
                "PYTHON_GET_PIP_SHA256=d48ae68f297cac54db17e4107b800faae0e5210131f9f386c30c0166bf8d81b7",
                "HOME=/node",
                "DEBIAN_FRONTEND=noninteractive",
                "PYTHONUNBUFFERED=1"
            ],
            "Cmd": [
                "/bin/sh",
                "-c",
                "#(nop) ",
                "ENTRYPOINT [\"crossbar\" \"start\" \"--cbdir\" \"/node/.crossbar\"]"
            ],
            "Image": "sha256:f79c364656d27a11d90a2fc2247678500818fb91cf330dd0299d800c43cc24d3",
            "Volumes": {
                "/node": {}
            },
            "WorkingDir": "/node",
            "Entrypoint": [
                "crossbar",
                "start",
                "--cbdir",
                "/node/.crossbar"
            ],
            "OnBuild": null,
            "Labels": {
                "org.label-schema.build-date": "2020-12-14",
                "org.label-schema.description": "Quickstart template for application development with Crossbar.io",
                "org.label-schema.name": "Crossbar.io Starter Template",
                "org.label-schema.schema-version": "1.0",
                "org.label-schema.url": "http://crossbar.io",
                "org.label-schema.vcs-ref": "a07cc274",
                "org.label-schema.vcs-url": "https://github.com/crossbario/crossbar",
                "org.label-schema.vendor": "The Crossbar.io Project",
                "org.label-schema.version": "20.12.1"
            }
        },
        "DockerVersion": "20.10.0",
        "Author": "The Crossbar.io Project <support@crossbario.com>",
        "Config": {
            "Hostname": "",
            "Domainname": "",
            "User": "crossbar",
            "AttachStdin": false,
            "AttachStdout": false,
            "AttachStderr": false,
            "ExposedPorts": {
                "8000/tcp": {},
                "8080/tcp": {}
            },
            "Tty": false,
            "OpenStdin": false,
            "StdinOnce": false,
            "Env": [
                "PATH=/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
                "LANG=C.UTF-8",
                "GPG_KEY=0D96DF4D4110E5C43FBFB17F2D347EA6AA65421D",
                "PYTHON_VERSION=3.7.9",
                "PYTHON_PIP_VERSION=20.3.1",
                "PYTHON_GET_PIP_URL=https://github.com/pypa/get-pip/raw/91630a4867b1f93ba0a12aa81d0ec4ecc1e7eeb9/get-pip.py",
                "PYTHON_GET_PIP_SHA256=d48ae68f297cac54db17e4107b800faae0e5210131f9f386c30c0166bf8d81b7",
                "HOME=/node",
                "DEBIAN_FRONTEND=noninteractive",
                "PYTHONUNBUFFERED=1"
            ],
            "Cmd": null,
            "Image": "sha256:f79c364656d27a11d90a2fc2247678500818fb91cf330dd0299d800c43cc24d3",
            "Volumes": {
                "/node": {}
            },
            "WorkingDir": "/node",
            "Entrypoint": [
                "crossbar",
                "start",
                "--cbdir",
                "/node/.crossbar"
            ],
            "OnBuild": null,
            "Labels": {
                "org.label-schema.build-date": "2020-12-14",
                "org.label-schema.description": "Quickstart template for application development with Crossbar.io",
                "org.label-schema.name": "Crossbar.io Starter Template",
                "org.label-schema.schema-version": "1.0",
                "org.label-schema.url": "http://crossbar.io",
                "org.label-schema.vcs-ref": "a07cc274",
                "org.label-schema.vcs-url": "https://github.com/crossbario/crossbar",
                "org.label-schema.vendor": "The Crossbar.io Project",
                "org.label-schema.version": "20.12.1"
            }
        },
        "Architecture": "amd64",
        "Os": "linux",
        "Size": 493233558,
        "VirtualSize": 493233558,
        "GraphDriver": {
            "Data": {
                "Dataset": "franklin/TMP/docker/d99bb5c49a0feee1e9a395582530bdb60804ef15c25536ef1f89c8dc6347cb59",
                "Mountpoint": "/var/lib/docker/zfs/graph/d99bb5c49a0feee1e9a395582530bdb60804ef15c25536ef1f89c8dc6347cb59"
            },
            "Name": "zfs"
        },
        "RootFS": {
            "Type": "layers",
            "Layers": [
                "sha256:87c8a1d8f54f3aa4e05569e8919397b65056aa71cdf48b7f061432c98475eee9",
                "sha256:b986ceddf07ce9486d4f577e350ee369fac8f50ebc2f1f8e4529abf2e1b29836",
                "sha256:4b574c535099ee5084888dcf6eef30ad2111b7380151f4f744599145e4a4ecd0",
                "sha256:ffd8021a3fc332bb94cdc169c2307a574683c9db2130ece4ef6b2dbf7923469e",
                "sha256:ad473021ef341f7bd308eec2d22f67e46ebea9911f010f71edc3acb6278d9dcb",
                "sha256:d6e70e892f25dfc9794aa5c135045a4be9b216efc9c9533a8d8500d18c213dea",
                "sha256:026aa3725e8672190828d11e859b170ff1cd8deb1e7f298da3f847fd611d9d4d",
                "sha256:28f952ebcc4aacd4b0dc970d314057a7091f7cec2e72bc31739c11888fcb19a8",
                "sha256:d6f486ad3a0c3bf5aa3335036adfd6c9771151308e487e72e8b39373db51869e",
                "sha256:0e91129494625022b3da7c64adb9c66ab7c7d5dff383902c36633109808d558b",
                "sha256:5809e5568ffcff0b5f7f0eaf2e9f3ae8d41c71b5d47a6c0ca8b11211609eb49a"
            ]
        },
        "Metadata": {
            "LastTagTime": "0001-01-01T00:00:00Z"
        }
    }
]

@jaen

jaen commented Jan 12, 2021

@tonistiigi I think I have a fairly minimal reproduction, a script follows:

#!/usr/bin/env bash

CONTEXT_TAR_PATH="${PWD}/context.tar"

if [ ! -f $CONTEXT_TAR_PATH ]; then
  CONTEXT_TEMP=`mktemp -d`

  echo "test-1" > ${CONTEXT_TEMP}/test-1
  echo "test-2" > ${CONTEXT_TEMP}/test-2

  cat > ${CONTEXT_TEMP}/Dockerfile <<-EODOCKERFILE
    FROM alpine as builder1

    COPY test-1 /test-1

    FROM alpine as builder2

    COPY test-2 /test-2

    FROM alpine

    COPY --from=builder1 /test-1 /stuff/test-1
    COPY --from=builder2 /test-2 /stuff/test-2
EODOCKERFILE

  (
    cd ${CONTEXT_TEMP}
    tar -cf ${CONTEXT_TAR_PATH} .
  )
fi

cat ${CONTEXT_TAR_PATH} | env DOCKER_BUILDKIT=1 docker build -t zfs-issue -

This fails fairly reliably for me if I run docker system prune -a first – sometimes it will just complain about the cache key, sometimes it will also complain about the Dockerfile missing, but not always. Output follows:

╭─jaen@airi /home/jaen/Projects/work/zfs-issue  ‹system›
╰─$ docker system prune -a
WARNING! This will remove:
  - all stopped containers
  - all networks not used by at least one container
  - all images without at least one container associated to them
  - all build cache

Are you sure you want to continue? [y/N] y
Deleted Images:
untagged: zfs-issue:latest
deleted: sha256:76dd7cf63e3a55b8b17656fe6d324d837c5a476500143bba7adbd98ae164987d

Deleted build cache objects:
p1hpzx6t74r6cixg2hjgn2xtq
hwa0qgf36d5b3u284cok6gtig
o2fziihelkrw9x72n4ht9ttg7
p1vvnu475t3mzo5beoka6p3tr
f8u5y0dyg1fmy21lcmc40ue0o
7jbwmrynfa00h0s7mqa4tzr3g
8x5ysndfquqd0cfui4ey0adyk
sha256:777b2c648970480f50f5b4d0af8f9a8ea798eea43dbcf40ce4a8c7118736bdcf

Total reclaimed space: 6.435MB
╭─jaen@airi /home/jaen/Projects/work/zfs-issue  ‹system›
╰─$ ./repro.sh
[+] Building 1.8s (5/8)
 => [internal] load remote build context                                                                                                                                                  0.0s
 => copy /context /                                                                                                                                                                       0.1s
 => [internal] load metadata for docker.io/library/alpine:latest                                                                                                                          1.6s
 => CANCELED [builder2 1/2] FROM docker.io/library/alpine@sha256:3c7497bf0c7af93428242d6176e8f7905f2201d8fc5861f45be7a346b5f23436                                                         0.0s
 => => resolve docker.io/library/alpine@sha256:3c7497bf0c7af93428242d6176e8f7905f2201d8fc5861f45be7a346b5f23436                                                                           0.0s
 => => sha256:3c7497bf0c7af93428242d6176e8f7905f2201d8fc5861f45be7a346b5f23436 1.64kB / 1.64kB                                                                                            0.0s
 => => sha256:074d3636ebda6dd446d0d00304c4454f468237fdacf08fb0eeac90bdbfa1bac7 528B / 528B                                                                                                0.0s
 => => sha256:389fef7118515c70fd6c0e0d50bb75669942ea722ccb976507d7b087e54d5a23 1.47kB / 1.47kB                                                                                            0.0s
 => ERROR [builder2 2/2] COPY test-2 /test-2                                                                                                                                              0.0s
------
 > [builder2 2/2] COPY test-2 /test-2:
------
failed to solve with frontend dockerfile.v0: failed to build LLB: failed to compute cache key: error creating zfs mount: mount airi-storage/local/docker/l1hf17ipwe0d2qrb3ztokt2d1:/var/lib/docker/zfs/graph/l1hf17ipwe0d2qrb3ztokt2d1: no such file or directory
╭─jaen@airi /home/jaen/Projects/work/zfs-issue  ‹system›
╰─$ ./repro.sh                                                                                                                                                                              ↵ 1
[+] Building 0.0s (2/2) FINISHED
 => CACHED [internal] load remote build context                                                                                                                                           0.0s
 => CACHED copy /context /                                                                                                                                                                0.0s
failed to solve with frontend dockerfile.v0: failed to read dockerfile: open /var/lib/docker/tmp/buildkit-mount871595872/Dockerfile: no such file or directory
╭─jaen@airi /home/jaen/Projects/work/zfs-issue  ‹system›
╰─$ ./repro.sh                                                                                                                                                                              ↵ 1
[+] Building 2.5s (9/9) FINISHED
 => CACHED [internal] load remote build context                                                                                                                                           0.0s
 => CACHED copy /context /                                                                                                                                                                0.0s
 => [internal] load metadata for docker.io/library/alpine:latest                                                                                                                          1.5s
 => [builder1 1/2] FROM docker.io/library/alpine@sha256:3c7497bf0c7af93428242d6176e8f7905f2201d8fc5861f45be7a346b5f23436                                                                  0.3s
 => => resolve docker.io/library/alpine@sha256:3c7497bf0c7af93428242d6176e8f7905f2201d8fc5861f45be7a346b5f23436                                                                           0.0s
 => => sha256:3c7497bf0c7af93428242d6176e8f7905f2201d8fc5861f45be7a346b5f23436 1.64kB / 1.64kB                                                                                            0.0s
 => => sha256:074d3636ebda6dd446d0d00304c4454f468237fdacf08fb0eeac90bdbfa1bac7 528B / 528B                                                                                                0.0s
 => => sha256:389fef7118515c70fd6c0e0d50bb75669942ea722ccb976507d7b087e54d5a23 1.47kB / 1.47kB                                                                                            0.0s
 => => sha256:801bfaa63ef2094d770c809815b9e2b9c1194728e5e754ef7bc764030e140cea 2.80MB / 2.80MB                                                                                            0.1s
 => => extracting sha256:801bfaa63ef2094d770c809815b9e2b9c1194728e5e754ef7bc764030e140cea                                                                                                 0.2s
 => [builder1 2/2] COPY test-1 /test-1                                                                                                                                                    0.1s
 => [builder2 2/2] COPY test-2 /test-2                                                                                                                                                    0.1s
 => [stage-2 2/3] COPY --from=builder1 /test-1 /stuff/test-1                                                                                                                              0.1s
 => [stage-2 3/3] COPY --from=builder2 /test-2 /stuff/test-2                                                                                                                              0.1s
 => exporting to image                                                                                                                                                                    0.3s
 => => exporting layers                                                                                                                                                                   0.3s
 => => writing image sha256:87db0be86903b3f1e492adabdfeca3500e210076e7571f43d5ca1cc888f927a5                                                                                              0.0s
 => => naming to docker.io/library/zfs-issue                                                                                                                                              0.0s
╭─jaen@airi /home/jaen/Projects/work/zfs-issue  ‹system›
╰─$ docker system prune -a
WARNING! This will remove:
  - all stopped containers
  - all networks not used by at least one container
  - all images without at least one container associated to them
  - all build cache

Are you sure you want to continue? [y/N] y
Deleted Images:
untagged: zfs-issue:latest
deleted: sha256:87db0be86903b3f1e492adabdfeca3500e210076e7571f43d5ca1cc888f927a5

Deleted build cache objects:
d3mymyhmxy9z4vteoyeo45h6h
unk5g02w5s9mh0hft9mg7bzw7
me58b0an7ofktojhnblhaeg64
lwb1g48q0sjrhd02as2lqpfd2
rz31uqdv7xnu5m3vd547lxh10
vmwj43sf9e6958y1oory3hzrh
l1hf17ipwe0d2qrb3ztokt2d1
jz6hwy3ytqwkyrd4tsufxxywu
sha256:777b2c648970480f50f5b4d0af8f9a8ea798eea43dbcf40ce4a8c7118736bdcf

Total reclaimed space: 6.445MB
╭─jaen@airi /home/jaen/Projects/work/zfs-issue  ‹system›
╰─$ ./repro.sh
[+] Building 1.6s (5/8)
 => [internal] load remote build context                                                                                                                                                  0.0s
 => copy /context /                                                                                                                                                                       0.1s
 => [internal] load metadata for docker.io/library/alpine:latest                                                                                                                          1.5s
 => CANCELED [builder2 1/2] FROM docker.io/library/alpine@sha256:3c7497bf0c7af93428242d6176e8f7905f2201d8fc5861f45be7a346b5f23436                                                         0.0s
 => => resolve docker.io/library/alpine@sha256:3c7497bf0c7af93428242d6176e8f7905f2201d8fc5861f45be7a346b5f23436                                                                           0.0s
 => => sha256:389fef7118515c70fd6c0e0d50bb75669942ea722ccb976507d7b087e54d5a23 1.47kB / 1.47kB                                                                                            0.0s
 => => sha256:3c7497bf0c7af93428242d6176e8f7905f2201d8fc5861f45be7a346b5f23436 1.64kB / 1.64kB                                                                                            0.0s
 => => sha256:074d3636ebda6dd446d0d00304c4454f468237fdacf08fb0eeac90bdbfa1bac7 528B / 528B                                                                                                0.0s
 => ERROR [builder2 2/2] COPY test-2 /test-2                                                                                                                                              0.0s
------
 > [builder2 2/2] COPY test-2 /test-2:
------
failed to solve with frontend dockerfile.v0: failed to build LLB: failed to compute cache key: error creating zfs mount: mount airi-storage/local/docker/e5ql1gcedsvzrojshdpxnnztq:/var/lib/docker/zfs/graph/e5ql1gcedsvzrojshdpxnnztq: no such file or directory
╭─jaen@airi /home/jaen/Projects/work/zfs-issue  ‹system›
╰─$ ./repro.sh                                                                                                                                                                              ↵ 1
[+] Building 1.9s (9/9) FINISHED
 => CACHED [internal] load remote build context                                                                                                                                           0.0s
 => CACHED copy /context /                                                                                                                                                                0.0s
 => [internal] load metadata for docker.io/library/alpine:latest                                                                                                                          0.4s
 => [builder2 1/2] FROM docker.io/library/alpine@sha256:3c7497bf0c7af93428242d6176e8f7905f2201d8fc5861f45be7a346b5f23436                                                                  0.3s
 => => resolve docker.io/library/alpine@sha256:3c7497bf0c7af93428242d6176e8f7905f2201d8fc5861f45be7a346b5f23436                                                                           0.0s
 => => sha256:074d3636ebda6dd446d0d00304c4454f468237fdacf08fb0eeac90bdbfa1bac7 528B / 528B                                                                                                0.0s
 => => sha256:389fef7118515c70fd6c0e0d50bb75669942ea722ccb976507d7b087e54d5a23 1.47kB / 1.47kB                                                                                            0.0s
 => => sha256:801bfaa63ef2094d770c809815b9e2b9c1194728e5e754ef7bc764030e140cea 2.80MB / 2.80MB                                                                                            0.1s
 => => sha256:3c7497bf0c7af93428242d6176e8f7905f2201d8fc5861f45be7a346b5f23436 1.64kB / 1.64kB                                                                                            0.0s
 => => extracting sha256:801bfaa63ef2094d770c809815b9e2b9c1194728e5e754ef7bc764030e140cea                                                                                                 0.2s
 => [builder1 2/2] COPY test-1 /test-1                                                                                                                                                    0.1s
 => [builder2 2/2] COPY test-2 /test-2                                                                                                                                                    0.1s
 => [stage-2 2/3] COPY --from=builder1 /test-1 /stuff/test-1                                                                                                                              0.1s
 => [stage-2 3/3] COPY --from=builder2 /test-2 /stuff/test-2                                                                                                                              0.1s
 => exporting to image                                                                                                                                                                    0.9s
 => => exporting layers                                                                                                                                                                   0.9s
 => => writing image sha256:03d3e39c058c783a76e9e6d1f2569721de55cef8fb04fc2c54fc7b5ffe1bda02                                                                                              0.0s
 => => naming to docker.io/library/zfs-issue                                                                                                                                              0.0s
╭─jaen@airi /home/jaen/Projects/work/zfs-issue  ‹system›
╰─$ ./repro.sh
[+] Building 2.0s (9/9) FINISHED
 => CACHED [internal] load remote build context                                                                                                                                           0.0s
 => CACHED copy /context /                                                                                                                                                                0.0s
 => [internal] load metadata for docker.io/library/alpine:latest                                                                                                                          2.0s
 => [builder2 1/2] FROM docker.io/library/alpine@sha256:3c7497bf0c7af93428242d6176e8f7905f2201d8fc5861f45be7a346b5f23436                                                                  0.0s
 => CACHED [builder1 2/2] COPY test-1 /test-1                                                                                                                                             0.0s
 => CACHED [stage-2 2/3] COPY --from=builder1 /test-1 /stuff/test-1                                                                                                                       0.0s
 => CACHED [builder2 2/2] COPY test-2 /test-2                                                                                                                                             0.0s
 => CACHED [stage-2 3/3] COPY --from=builder2 /test-2 /stuff/test-2                                                                                                                       0.0s
 => exporting to image                                                                                                                                                                    0.0s
 => => exporting layers                                                                                                                                                                   0.0s
 => => writing image sha256:03d3e39c058c783a76e9e6d1f2569721de55cef8fb04fc2c54fc7b5ffe1bda02                                                                                              0.0s
 => => naming to docker.io/library/zfs-issue                                                                                                                                              0.0s
╭─jaen@airi /home/jaen/Projects/work/zfs-issue  ‹system›
╰─$ docker system prune -a
WARNING! This will remove:
  - all stopped containers
  - all networks not used by at least one container
  - all images without at least one container associated to them
  - all build cache

Are you sure you want to continue? [y/N] y
Deleted Images:
untagged: zfs-issue:latest
deleted: sha256:03d3e39c058c783a76e9e6d1f2569721de55cef8fb04fc2c54fc7b5ffe1bda02

Deleted build cache objects:
rto19x1hf0pki9eio1kacywv5
6rsyu03778egdzbbra4vhy4eh
jk2zpsl1veiputkq87a0axurg
xxbqejwh5ap5vzcs8t36iwelj
6ngc4b57grwplbkmfh37w1a2o
e5ql1gcedsvzrojshdpxnnztq
o8o5rr0ty3sgj4e87ei0j3xls
khpp8iekxfzdozyy1tjws3xi8
sha256:777b2c648970480f50f5b4d0af8f9a8ea798eea43dbcf40ce4a8c7118736bdcf

Total reclaimed space: 6.445MB
╭─jaen@airi /home/jaen/Projects/work/zfs-issue  ‹system›
╰─$ ./repro.sh
[+] Building 1.6s (5/8)
 => [internal] load remote build context                                                                                                                                                  0.0s
 => copy /context /                                                                                                                                                                       0.1s
 => [internal] load metadata for docker.io/library/alpine:latest                                                                                                                          1.4s
 => CANCELED [builder2 1/2] FROM docker.io/library/alpine@sha256:3c7497bf0c7af93428242d6176e8f7905f2201d8fc5861f45be7a346b5f23436                                                         0.0s
 => => resolve docker.io/library/alpine@sha256:3c7497bf0c7af93428242d6176e8f7905f2201d8fc5861f45be7a346b5f23436                                                                           0.0s
 => => sha256:3c7497bf0c7af93428242d6176e8f7905f2201d8fc5861f45be7a346b5f23436 1.64kB / 1.64kB                                                                                            0.0s
 => => sha256:074d3636ebda6dd446d0d00304c4454f468237fdacf08fb0eeac90bdbfa1bac7 528B / 528B                                                                                                0.0s
 => => sha256:389fef7118515c70fd6c0e0d50bb75669942ea722ccb976507d7b087e54d5a23 1.47kB / 1.47kB                                                                                            0.0s
 => ERROR [builder2 2/2] COPY test-2 /test-2                                                                                                                                              0.0s
------
 > [builder2 2/2] COPY test-2 /test-2:
------
failed to solve with frontend dockerfile.v0: failed to build LLB: failed to compute cache key: error creating zfs mount: mount airi-storage/local/docker/5le21f5z98do0po1ubji8a6pq:/var/lib/docker/zfs/graph/5le21f5z98do0po1ubji8a6pq: no such file or directory
╭─jaen@airi /home/jaen/Projects/work/zfs-issue  ‹system›
╰─$ ./repro.sh                                                                                                                                                                              ↵ 1
[+] Building 0.0s (2/2) FINISHED
 => CACHED [internal] load remote build context                                                                                                                                           0.0s
 => CACHED copy /context /                                                                                                                                                                0.0s
failed to solve with frontend dockerfile.v0: failed to read dockerfile: open /var/lib/docker/tmp/buildkit-mount536567859/Dockerfile: no such file or directory
╭─jaen@airi /home/jaen/Projects/work/zfs-issue  ‹system›
╰─$ ./repro.sh                                                                                                                                                                              ↵ 1
[+] Building 1.9s (9/9) FINISHED
 => CACHED [internal] load remote build context                                                                                                                                           0.0s
 => CACHED copy /context /                                                                                                                                                                0.0s
 => [internal] load metadata for docker.io/library/alpine:latest                                                                                                                          0.4s
 => [builder2 1/2] FROM docker.io/library/alpine@sha256:3c7497bf0c7af93428242d6176e8f7905f2201d8fc5861f45be7a346b5f23436                                                                  0.4s
 => => resolve docker.io/library/alpine@sha256:3c7497bf0c7af93428242d6176e8f7905f2201d8fc5861f45be7a346b5f23436                                                                           0.0s
 => => sha256:3c7497bf0c7af93428242d6176e8f7905f2201d8fc5861f45be7a346b5f23436 1.64kB / 1.64kB                                                                                            0.0s
 => => sha256:074d3636ebda6dd446d0d00304c4454f468237fdacf08fb0eeac90bdbfa1bac7 528B / 528B                                                                                                0.0s
 => => sha256:389fef7118515c70fd6c0e0d50bb75669942ea722ccb976507d7b087e54d5a23 1.47kB / 1.47kB                                                                                            0.0s
 => => sha256:801bfaa63ef2094d770c809815b9e2b9c1194728e5e754ef7bc764030e140cea 2.80MB / 2.80MB                                                                                            0.1s
 => => extracting sha256:801bfaa63ef2094d770c809815b9e2b9c1194728e5e754ef7bc764030e140cea                                                                                                 0.2s
 => [builder2 2/2] COPY test-2 /test-2                                                                                                                                                    0.1s
 => [builder1 2/2] COPY test-1 /test-1                                                                                                                                                    0.1s
 => [stage-2 2/3] COPY --from=builder1 /test-1 /stuff/test-1                                                                                                                              0.1s
 => [stage-2 3/3] COPY --from=builder2 /test-2 /stuff/test-2                                                                                                                              0.1s
 => exporting to image                                                                                                                                                                    0.8s
 => => exporting layers                                                                                                                                                                   0.8s
 => => writing image sha256:5f9c5af7e4ad6219bf80b9a9e8d9c69b920560081bb95eeb8bf4c15f537d6e32                                                                                              0.0s
 => => naming to docker.io/library/zfs-issue                                                                                                                                              0.0s
╭─jaen@airi /home/jaen/Projects/work/zfs-issue  ‹system›
╰─$

Any idea how hard this could be to fix? I'm using ZFS and I need to use buildkit with it, so it's a fairly blocking issue for me.

@AdamSLevy
Author

Sorry I didn't circle back to this, but the above comment provides a reproducible example that is in line with the behavior I opened this issue for.

@tobia

tobia commented Feb 8, 2021

This is still happening with containerd.io 1.4.3 and docker-ce 20.10.3

It seems to be a race condition that happens when Docker's root is on a ZFS volume and the build is multi-stage, containing a COPY --from=.... If the source stage is not already in cache (and probably when it takes more time to build than the destination stage), the bug is triggered. The reason it seems to work every n-th time is that the race condition is not triggered when the previous build step is already in cache.

@tonistiigi This script reliably reproduces it, without the need for system prune or such workarounds (thanks @jaen)

#!/bin/sh
cd `mktemp -d`
echo test-1 > test-1
echo test-2 > test-2
cat > Dockerfile <<EOF
FROM alpine as builder1
COPY test-1 /test-1
FROM alpine as builder2
COPY test-2 /test-2
FROM alpine
COPY --from=builder1 /test-1 /stuff/test-1
COPY --from=builder2 /test-2 /stuff/test-2
EOF
TAR=`mktemp`
tar -cf $TAR .
env DOCKER_BUILDKIT=1 docker build - < $TAR

I don't know why the TAR step triggers the bug more easily, but it does.

Sometimes you get this error:

[+] Building 0.8s (5/8)                                                                             
 => [internal] load remote build context                                                       0.0s
 => copy /context /                                                                            0.1s
 => [internal] load metadata for docker.io/library/alpine:latest                               0.5s
 => [builder2 1/2] FROM docker.io/library/alpine@sha256:08d6ca16c60fe7490c03d10dc339d9fd8ea67  0.0s
 => ERROR [builder2 2/2] COPY test-2 /test-2                                                   0.0s
------
 > [builder2 2/2] COPY test-2 /test-2:
------
failed to compute cache key: error creating zfs mount: mount tank/docker/x4pmhzudwpq49o1f2cg24mec8:/tank/docker/zfs/graph/x4pmhzudwpq49o1f2cg24mec8: no such file or directory

Other times you get this:

[+] Building 0.2s (2/2) FINISHED                                                                    
 => [internal] load remote build context                                                       0.0s
 => copy /context /                                                                            0.1s
failed to solve with frontend dockerfile.v0: failed to read dockerfile: open /tank/docker/tmp/buildkit-mount210258653/Dockerfile: no such file or directory

PS. I think this is the same issue as moby/moby#37207

@tobia

tobia commented Feb 10, 2021

The only workaround I found for this issue is not to use the zfs storage driver at all (explicitly set "storage-driver": "aufs" in /etc/docker/daemon.json). Of course, if you want to do this on an existing server, you will have to back up your volumes, destroy your data root, and re-create it.

@jaen

jaen commented Feb 13, 2021

@tonistiigi now that there is a reproducible case that multiple people have corroborated, could this please get a higher priority? It seriously hinders the use of buildkit on ZFS.

@tonistiigi
Member

@jaen Sorry, zfs is not at the top of our priorities at the moment, so I can't give an ETA. It has always been a community-maintained driver. Maybe https://github.com/moby/moby/blob/master/daemon/graphdriver/zfs/MAINTAINERS have time to look at it in the meantime.

@jaen

jaen commented Feb 15, 2021

@tonistiigi sure, while somewhat disappointing for me personally, this is at least a clear statement of current priorities, so I know what to expect. Thanks for the update!

@Mic92 @baloo, sorry for the ats, but do any of you have an inkling what's going on here? I'm afraid my ZFS/docker internals/go–fu isn't good enough to even guess at the cause (and possibly not good enough to fix it even if you point me in the right direction, but it's worth a shot).

@tobia

tobia commented Feb 15, 2021

My knowledge of both Docker and ZFS internals is exactly zero, but from an outsider point of view, it seems very likely to be a race condition.

Meaning: one thread (or process, or green thread, I don't know) is building a layer called X with the previous stage; another thread needs layer X as the source of a COPY command. Somehow, with the Docker ZFS driver, the second thread is signaled ("wake up, layer X is ready") before layer X is actually committed to disk, therefore causing a file not found error.

Considering the relative maturity and usage of the Linux ZFS driver and of the Docker ZFS driver, I would wager the issue is in the latter.

If I knew where to look, I would strengthen the locks that are involved in this. The issue is likely to be on the committing side, meaning the code that signals "layer is ready" does so before all underlying operations have been flushed, so maybe it needs an additional lock or system call. Maybe there's a ZFS system call that is usually "fast enough" but is not technically synchronous, so we need an additional system call to wait until it's complete.

The easiest way to "work around" this bug, without having to identify the actual culprit system call, is to add a tiny "busy loop" right after a ZFS layer is supposedly created (but before telling the rest of Docker that it has been created) that tries to access the new layer from disk and waits until it actually succeeds.
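For illustration, here is a minimal Go sketch of that busy-loop idea — waitForMountpoint is a hypothetical helper, not code that exists in the driver, and the fix that was eventually pursued further down the thread took a locking approach instead:

// waitForMountpoint sketches the "busy loop" workaround described above:
// after asking ZFS to create a layer, poll until its mountpoint is actually
// visible on disk, or give up after a timeout. Hypothetical illustration only.
package zfsutil

import (
	"fmt"
	"os"
	"time"
)

func waitForMountpoint(path string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		if _, err := os.Stat(path); err == nil {
			return nil // the directory is there; safe to report the layer as ready
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("mountpoint %s did not appear within %s", path, timeout)
		}
		time.Sleep(10 * time.Millisecond)
	}
}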

Where are the sources of the Docker ZFS driver anyways?

@Mic92

Mic92 commented Feb 15, 2021

Mhm. Never used buildkit. The same script seems to work with both docker and podman.

@Mic92

Mic92 commented Feb 15, 2021

But I can reproduce it with the buildkit option:

sudo bash -x ./test.sh
++ mktemp -d
+ cd /tmp/tmp.TcQfMaFwuW
+ echo test-3
+ echo test-3
+ cat
++ mktemp
+ TAR=/tmp/tmp.pCYTD3vNlG
+ tar -cf /tmp/tmp.pCYTD3vNlG .
+ env DOCKER_BUILDKIT=1 docker build -
[+] Building 0.3s (6/8)
 => [internal] load remote build context                                                                                                                                               
 => copy /context /                                                                                                                                                                    
 => [internal] load metadata for docker.io/library/alpine:latest                                                                                                                       
 => [builder2 1/2] FROM docker.io/library/alpine                                                                                                                                       
 => => resolve docker.io/library/alpine:latest                                                                                                                                         
 => CANCELED [builder2 2/2] COPY test-2 /test-2                                                                                                                                        
 => ERROR [builder1 2/2] COPY test-1 /test-1                                                                                                                                           
------
 > [builder1 2/2] COPY test-1 /test-1:
------
failed to compute cache key: error creating zfs mount: mount data/root/docker/cslaflbgfshloyx5at4vjtwrm:/var/lib/docker/zfs/graph/cslaflbgfshloyx5at4vjtwrm: no such file or directory

@Mic92

Mic92 commented Feb 15, 2021

I have never used buildkit and don't know anything about its architecture, but does it somehow call Get() and Put() from different processes? Otherwise I cannot explain how it would delete the mountpoint before mounting a different one.

@tonistiigi
Member

does it somehow call Get() and Put() from different processes

No, everything is internal to dockerd

@arudnev

arudnev commented Nov 5, 2021

We hit this issue earlier this week and switched our buildx builder from the docker driver (with the zfs storage driver) to docker-container (which points at the local docker, though buildkit seems to use its own storage inside the container, which is kind of expected), and our build times skyrocketed: on multiple large images each build instruction adds tens of seconds instead of sub-seconds, even a COPY of a small entrypoint file or a WORKDIR definition, resulting in hours of build time instead of the few minutes we had with zfs storage.
It looks like for each new layer it copies yet another ~1 GB of whatever we have in the intermediate builder container, creating a lot of IO contention, especially in parallel builds.

Anyway, we found a dirty/ugly workaround that allows us to stay with the default docker builder on zfs. We are not sure yet how reliable it is, as timing is likely of the essence here, but we have not been able to reproduce the error with this approach yet.

Basically, for each stage that will be copied from, we run docker build with --target=stage and no --output before running the final build without any --target.
For example, in the case of test.sh from the reproducible example above, it takes just one extra line at the end to make it work:

env DOCKER_BUILDKIT=1 docker build --target=builder2 - < $TAR
env DOCKER_BUILDKIT=1 docker build - < $TAR

Need to do more cleanup at the end to get rid of temporary images created for those builder stages, but at least it gives us our favorite zfs driver back.

Really looking forward to this defect being fixed, though.

@saiarcot895

I have never used buildkit and don't know anything about its architecture, but does it somehow call Get() and Put() from different processes? Otherwise I cannot explain how it would delete the mountpoint before mounting a different one.

With BuildKit, it is possible for Get() and Put() to be called from different threads within the dockerd process. Additionally, they may be called multiple times in succession with the same ID (i.e. Get() can be called with the same ID 3 times in a row for some cache mount).

Briefly looking at the zfs code, I could maybe see how there can be a race condition related to the mount directory being missing.

Assume that there are two threads, and some layer with ID 824f183 has been mounted once. Now, thread 2 wants to unmount it, and thread 1 wants to mount that same layer. Here's a possible execution sequence which will result in the above error:

  1. Thread 2 decrements the refcount for 824f183's mountpoint from 1 to 0. (When this counter reaches 0, it'll unmount and remove that directory)
  2. Thread 1 increments the refcount for 824f183's mountpoint from 0 to 1.
  3. Thread 2 unmounts 824f183's mountpoint.
  4. Thread 1 creates the mountpoint directory if it doesn't already exist. (In this case, the directory still exists, so nothing really happens here)
  5. Thread 2 removes the mountpoint directory.
  6. Thread 1 tries to mount 824f183, but fails, because that directory no longer exists.

In this sequence, the only mutex/lock that is present is in the refcount increment/decrement (this is from code outside of the ZFS file); the rest don't have any mutexes.

@tonistiigi
Member

@saiarcot895 If you compare this with overlay/aufs, then AFAICS the reference counter logic there is always behind an ID-based locker.

@saiarcot895

@tonistiigi I see locks on the complete Get(), Put(), and Remove() methods for the overlay, aufs, and devmapper drivers, added in moby/moby@fc1cf19. The critical thing here is that the entire method has a lock (for each ID), so that the above case where one thread is going through Put() and another thread is going through Get() concurrently doesn't happen.
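To make the pattern concrete, here is a rough Go sketch of that per-ID locking applied to the zfs driver's Get()/Put(). It assumes the github.com/moby/locker package that the other drivers use; the Driver fields and the mountLayer/unmountLayer placeholders are illustrative stand-ins, not the actual merged code:

package zfs

import "github.com/moby/locker"

// Driver is a stripped-down stand-in for the zfs graphdriver; only the parts
// relevant to the locking pattern are shown. mountLayer and unmountLayer are
// hypothetical placeholders for the driver's real refcounting, mkdir/rmdir
// and zfs mount/umount logic.
type Driver struct {
	locker       *locker.Locker // one lock per layer ID
	mountLayer   func(id, mountLabel string) (string, error)
	unmountLayer func(id string) error
}

func NewDriver(mount func(id, mountLabel string) (string, error), unmount func(id string) error) *Driver {
	return &Driver{locker: locker.New(), mountLayer: mount, unmountLayer: unmount}
}

// Get mounts the layer with the given id and returns its mountpoint.
func (d *Driver) Get(id, mountLabel string) (string, error) {
	d.locker.Lock(id) // serialize with any concurrent Put/Remove for the same id
	defer d.locker.Unlock(id)
	// The refcount increment, the mkdir of the mountpoint and the zfs mount
	// all run while the per-id lock is held, so a concurrent Put can no
	// longer remove the directory in between (the interleaving described above).
	return d.mountLayer(id, mountLabel)
}

// Put unmounts the layer with the given id.
func (d *Driver) Put(id string) error {
	d.locker.Lock(id)
	defer d.locker.Unlock(id)
	// The refcount decrement, the unmount and the removal of the mountpoint
	// directory are protected by the same per-id lock.
	return d.unmountLayer(id)
}

Holding one lock per ID (rather than one global lock) keeps unrelated layers building in parallel, which is the same shape the overlay/aufs/devmapper drivers already use per the commit referenced above.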

@jaen

jaen commented Dec 27, 2021

So does this mean that I should try replicating locks from that commit in the ZFS driver and see if it fixes the issue?

@saiarcot895

I believe so, yes.

I was working on some changes to the btrfs driver (porting patches that another person posted in a merge request to the current version of the code), and hit a similar issue when using buildkit to build the Docker build environment (with the host Docker running with my changes). After I added in the locks that the overlay driver has into the btrfs driver, that issue went away.

@jaen

jaen commented Jan 6, 2022

OK, so I have built docker with this commit, jaen/moby@a4ab9f5, and it seems to fix the issue: I can't reproduce it with the script I posted above anymore.

@thaJeztah
Member

Feel free to open a pull request with your changes to have it reviewed 👍

jaen added a commit to jaen/moby that referenced this issue Jan 10, 2022
Trying to build Docker images with buildkit using a ZFS-backed storage
was unreliable due to apparent race condition between adding and
removing layers to the storage (see: moby/buildkit#1758).
The issue describes a similar problem with the BTRFS driver that was
resolved by adding additional locking based on the scheme used in the
OverlayFS driver. This commit replicates the scheme to the ZFS driver
which makes the problem as reported in the issue stop happening.

Signed-off-by: Tomasz Mańko <hi@jaen.me>
@jaen

jaen commented Jan 10, 2022

Made a PR with it, @saiarcot895 thanks a lot for pointing me in the right direction!

@oramirite

I'm potentially in a good position to test this. I've been testing a deployment using Arch Linux on ZFS, and never hit this until recently (moby/moby#41866), which it seems this would fix. Sorry to ask for hand-holding, but what's the easiest way to get this patch on Arch for testing?

I'm consistently getting the 'failed to copy files: copy file range failed: invalid argument' error when trying to build or run this container: https://github.com/dperson/samba

I can build the container after removing the VOLUME section at the bottom of the Dockerfile, strangely.

@jaen

jaen commented Feb 16, 2022

@oramirite I've since moved to NixOS, so I can't easily verify this works, but I think you would just want to use docker-git's PKGBUILD (see: https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=docker-git) and change the moby repository (the first entry of the sources array) to point to my fork and branch (apparently that means adding #branch=zfs-driver-fix to the end, cf. https://wiki.archlinux.org/title/VCS_package_guidelines#VCS_sources).

@scotte

scotte commented Feb 17, 2022

I can reproduce this at will by running certain containers. For example:

$ docker run --rm google/cloud-sdk:373.0.0
docker: Error response from daemon: failed to copy files: copy file range failed: invalid argument.

I had trouble building with the patch, so I was not able to test the fix.

3nprob pushed a commit to 3nprob/moby that referenced this issue Feb 28, 2022
@3nprob

3nprob commented Feb 28, 2022

FWIW, https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=docker-git is very out of date and does not build as-is. This is not complete and may be broken, and it required me to run containerd manually, but it allowed me to verify that #43136 indeed fixes the issue, by successfully building images that consistently failed without it.

jaen added a commit to jaen/moby that referenced this issue Mar 6, 2022
@frebib

frebib commented May 11, 2022

Is there a chance that the merged fix will make it into a docker-ce release any time soon? Will it be in the 20.10.x series? Some future version?

@jaen

jaen commented May 15, 2022

FWIW I don't think that fix… well, fixes all the issues with the ZFS driver. It's better, but I sometimes still hit errors saying that a snapshot doesn't exist (for example when building or even pulling images). I guess that could be considered a separate issue, though, since the reproduction isn't failing anymore.

@mschout

mschout commented Oct 13, 2022

I installed the 22.06 beta build from the APT test channel about a week ago and have not seen this problem since then. It was happening roughly every other build otherwise. The fix from moby/moby#43136 is in 22.06 beta, but not in the 20.10.x series.

@jerrac

jerrac commented Dec 4, 2022

This issue is still present in Docker 20.10.21 on Debian 11 when my root drive is zfs.

------                                                                                                                             
 > [merge_exts 13/45] COPY --from=install_zip /usr/local/lib/php/extensions/no-debug-non-zts-20210902/zip.so /usr/local/lib/php/extensions/no-debug-non-zts-20210902/zip.so:                                                                                          
------                                                                                                                             
ERROR: failed to solve: failed to compute cache key: error creating zfs mount: mount rpool/ROOT/pve-1/h4qhjvd1apt1w8po18tpe9x6t:/var/lib/docker/zfs/graph/h4qhjvd1apt1w8po18tpe9x6t: no such file or directory

Oddly it only showed up today. No idea why previous builds didn't have the problem. Though I did just reinstall over Thanksgiving, and haven't run many builds locally since... My previous install was on ext4 (I think.)

My workaround is switching to the devicemapper driver. I tried overlay2 first, but docker says it isn't supported?

devicemapper seems to work fine. So I'm good for now. Hopefully this will eventually get fixed.

@mschout

mschout commented Dec 7, 2022

To be clear, it's fixed, but only in the 22.x builds.

@crazy-max
Member

@mschout FYI, 22.x is now 23.x; see moby/moby#44589 for more info.

@mschout

mschout commented Jan 6, 2023

Apologies @crazy-max. 23.x is also immune.

@mschout

mschout commented Feb 27, 2023

Given that 23.x is out now, I feel like this can be closed. I have not seen it at all since upgrading.

@thaJeztah
Member

Yup, let's go ahead and close this one 👍
