Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docker pull on certain images fail on bionic with docker 18.09 #451

Closed
Sid-Sun opened this issue Dec 29, 2021 · 9 comments
Closed

docker pull on certain images fail on bionic with docker 18.09 #451

Sid-Sun opened this issue Dec 29, 2021 · 9 comments
Assignees
Labels
bug Something isn't working
Projects

Comments

@Sid-Sun
Copy link

Sid-Sun commented Dec 29, 2021

When running docker 18.09.7 on Ubuntu Bionic, certain image pulls fail with:

failed to register layer: Error processing tar file(exit status 1): operation not permitted

the issue does not occur with docker versions 19.03 or higher on bionic
the issue also cannot be reproduced on all docker images, only certain layers in some images seem to fail, we have been able to consistently reproduce with rrdockerhub/ros-base-melodic-amd64:latest
might be similar to #254

Environment Details:

  • Running sysbox v0.4.1 on GKE
  • The nodes are running Ubuntu with containerd
  • Node kernel version: 5.4.0-1053-gke
  • Kube version 1.21

Complete output:

root@bionic-test:~# strace -o strace.txt docker pull rrdockerhub/ros-base-melodic-amd64
Using default tag: latest
latest: Pulling from rrdockerhub/ros-base-melodic-amd64
5b7339215d1d: Pull complete 
14ca88e9f672: Pull complete 
a31c3b1caad4: Pull complete 
b054a26005b7: Pull complete 
2d951c0d117b: Extracting [==================================================>]  836.8kB/836.8kB
7a080df8aa29: Download complete 
af7dec1d4371: Download complete 
7ae51eab5aaf: Download complete 
0c1c60218d5d: Download complete 
c81272701b33: Download complete 
04c6d168e31d: Download complete 
51f578a54dfd: Download complete 
6323fa587f09: Download complete 
b4a3394e02f5: Download complete 
failed to register layer: Error processing tar file(exit status 1): operation not permitted

strace:
Posted here

When we initially identified the issue, strace had:

epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=4276727552, u64=140715995279104}}) = -1 EPERM (Operation not permitted)
epoll_ctl(4, EPOLL_CTL_DEL, 3, 0xc4202db87c) = -1 EPERM (Operation not permitted)

However it might have been due to something else, we were using a different image at the time. Using a cleaner image directly from nestybox and only setting specific docker version gets rid of this

Dockerfile used to create the image (same as nestybox modified to specify docker version):

FROM nestybox/ubuntu-bionic-systemd:latest

# Docker install
RUN apt-get update && apt-get install --no-install-recommends -y      \
       apt-transport-https                                            \
       ca-certificates                                                \
       curl                                                           \
       gnupg-agent                                                    \
       software-properties-common &&                                  \
                                                                      \
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg           \
         | apt-key add - &&                                           \
                                                                     \
    apt-key fingerprint 0EBFCD88 &&                                   \
                                                                      \
    add-apt-repository                                                \
       "deb [arch=amd64] https://download.docker.com/linux/ubuntu     \
       $(lsb_release -cs)                                             \
       stable" &&                                                     \
                                                                      \
    apt-get update && apt-get install --no-install-recommends -y      \
       docker-ce=5:18.09.7~3-0~ubuntu-bionic                          \
       docker-ce-cli=5:18.09.7~3-0~ubuntu-bionic                      \
       containerd.io=1.2.6-3 &&                                       \
                                                                      \
    # Housekeeping
    apt-get clean -y &&                                               \
    rm -rf                                                            \
       /var/cache/debconf/*                                           \
       /var/lib/apt/lists/*                                           \
       /var/log/*                                                     \
       /tmp/*                                                         \
       /var/tmp/*                                                     \
       /usr/share/doc/*                                               \
       /usr/share/man/*                                               \
       /usr/share/local/* &&                                          \
                                                                      \
    # Add user "admin" to the Docker group
    usermod -a -G docker admin

# Sshd install
RUN apt-get update && apt-get install --no-install-recommends -y      \
            openssh-server &&                                         \
    mkdir /home/admin/.ssh &&                                         \
    chown admin:admin /home/admin/.ssh

EXPOSE 22

# Set systemd as entrypoint.
ENTRYPOINT [ "/sbin/init", "--log-level=err" ]

Pod spec:

apiVersion: v1
kind: Pod
metadata:
  name: bionic-test
  annotations:
    io.kubernetes.cri-o.userns-mode: "auto:size=65536"
spec:
  runtimeClassName: sysbox-runc
  containers:
  - name: ubu-foc-systemd-docker
    image: sidinbits/sysbox-bionic:docker-18
    startupProbe:
      tcpSocket:
        port: 22
      initialDelaySeconds: 5
      periodSeconds: 10
    ports:
      - containerPort: 22
    imagePullPolicy: Always
  restartPolicy: Never

Thanks :D

@rodnymolina rodnymolina added the bug Something isn't working label Dec 29, 2021
@rodnymolina rodnymolina added this to To do in Sysbox Dev via automation Dec 29, 2021
@rodnymolina
Copy link
Member

Thanks for filing the issue @Sid-Sun.

I haven't been able to reproduce it so far, but will try again in a different node with a similar kernel version -- I'm running 5.11 in my current setup:

rmolina@dev-vm1:$ docker run --runtime=sysbox-runc -it --rm --name test-issue-451 --hostname test-issue-451 issue-451

admin@test-issue-451:~$ docker version
Client:
 Version:           18.09.7
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        2d0083d
 Built:             Thu Jun 27 17:56:23 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.7
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       2d0083d
  Built:            Thu Jun 27 17:23:02 2019
  OS/Arch:          linux/amd64
  Experimental:     false
admin@test-issue-451:~$

admin@test-issue-451:~$ docker pull rrdockerhub/ros-base-melodic-amd64
Using default tag: latest
latest: Pulling from rrdockerhub/ros-base-melodic-amd64
5b7339215d1d: Pull complete
14ca88e9f672: Pull complete
a31c3b1caad4: Pull complete
b054a26005b7: Pull complete
2d951c0d117b: Pull complete
7a080df8aa29: Pull complete
af7dec1d4371: Pull complete
7ae51eab5aaf: Pull complete
0c1c60218d5d: Pull complete
c81272701b33: Pull complete
04c6d168e31d: Pull complete
51f578a54dfd: Pull complete
6323fa587f09: Pull complete
b4a3394e02f5: Pull complete
Digest: sha256:418c1c5d28c4829a22ace56f3e3b120fa57088bda99e678cb46afb7d598853ad
Status: Downloaded newer image for rrdockerhub/ros-base-melodic-amd64:latest
admin@test-issue-451:~$

@ctalledo
Copy link
Member

Hi @Sid-Sun,

root@bionic-test:~# strace -o strace.txt docker pull rrdockerhub/ros-base-melodic-amd64

Please repeat this with the -f flag (i.e., "follow forks"):

strace -f -o strace.txt docker pull rrdockerhub/ros-base-melodic-amd64

then attach the resulting strace output.

Thanks!

@Sid-Sun
Copy link
Author

Sid-Sun commented Dec 30, 2021

@ctalledo

Hey! Thanks for the quick response, here is the strace with forks follow

@rodnymolina That's interesting... I will try to update the kernel on one of the nodes and try to reproduce it thanks :D

@ctalledo
Copy link
Member

Thanks @Sid-Sun, that was helpful though I still don't see the root cause.

Here is what I see:

  • The only EPERM error in the entire trace is from an epoll_ctl syscall on file-descriptor 5:
1424  openat(AT_FDCWD, "/lib/terminfo/x/xterm-256color", O_RDONLY|O_CLOEXEC) = 5
1424  epoll_ctl(4, EPOLL_CTL_ADD, 5, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=154258992, u64=140295260982832}} <unfinished ...> 
1424  <... epoll_ctl resumed> )         = -1 EPERM (Operation not permitted) 
1424  epoll_ctl(4, EPOLL_CTL_DEL, 5, 0xc42019331c <unfinished ...>
1424  <... epoll_ctl resumed> )         = -1 EPERM (Operation not permitted)
  • There are no other EPERM errors, so this must be the cause.

  • However, it's weird: the EPERM occurs very early in the strace (e.g., at lines 1058 and 1063), and the strace continues for much longer before the docker pull error shows up in line 25261: "failed to register layer: Error processing tar file...".

  • Why that epoll_ctl() returns EPERM is not clear to me. Per the man page for epoll_ctl(2), EPERM is only returned when the file on which the epoll occurs does not support polling (e.g., when it's a regular file or directory, rather than a socket or pipe). That would be file /lib/terminfo/x/xterm-256color inside the image, which is a regular file (as it is in my Ubuntu host).

With all of this, I am suspecting this is a Docker error (i.e., it's epolling a regular file), and the fact that you don't see it with later versions of Docker would seem to add weight to this. I certainly can't see how it could be related to Sysbox.

Do you know if this occurs outside of a Sysbox container (e.g., when using the same version of Docker on a regular host)?

Also, it may be worthwhile getting the same strace with Docker 19.03 pulling the same image, so we can compare.

Finally, on a quick web search, it seems it may be related to this Stack Overflow post, tracked by this Moby issue. That would explain it since Sysbox containers are always rootless.

@Sid-Sun
Copy link
Author

Sid-Sun commented Jan 3, 2022

Hi @ctalledo,

I've run a few more tests to get some more data points; on both native (VMs) and sysbox environments.

What I have observed is:

  • Docker 18 (and likely below) on native environments:

    • EPERM error for polling xterm is present but image pull succeeds
  • In sysbox environment with kernel 5.4:

    • EPERM error is present but image pull fails
  • In sysbox environment with kernel 5.11:

    • EPERM error is present yet the image pull succeeds

Docker 19 and above:

  • docker doesn't try to poll xterm at all
  • image pull succeeds in native and sysbox environments

It appears there is some sort of incompatibility between docker 18, sysbox and kernel 5.4 - swapping any one of these seems to resolve the incompatibility

Note: Docker 18 is not available for focal out of the box, i had to use docker's bionic archive to install it

You can find the strace of the tests I ran here

@ctalledo
Copy link
Member

ctalledo commented Jan 4, 2022

Hi @Sid-Sun, thanks for the excellent data points.

Based on the fact that with Docker 18 the "polling xterm" error is always present, yet sometimes the pull fails and sometimes it doesn't, it seems to me that error is a red-herring. This also gels with my observation that this error occurs much earlier in the strace than the failure point.

However the weird thing is that there is no other EPERM error in the strace that would give us a better hint on why the pull fails.

I think I'll need to repro on my side to dig down further.

Let me ask: how important is that you use Docker 18? (wondering how high I need to prioritize this).

Thanks again.

@Sid-Sun
Copy link
Author

Sid-Sun commented Mar 3, 2022

Hi, we've moved to a newer docker version, I guess it would be best to note some potential incompatibility in the documentation until this is nailed down

@ctalledo
Copy link
Member

ctalledo commented Mar 4, 2022

Thanks @Sid-Sun ; I think this issue is probably the best way to document the problem (until we are able to better understand the underlying root cause).

@ctalledo ctalledo self-assigned this Mar 22, 2022
@ctalledo
Copy link
Member

Closing for now, please re-open if this occurs again.

Sysbox Dev automation moved this from To do to Done Mar 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Sysbox Dev
  
Done
Development

No branches or pull requests

3 participants