Kind 0.20.0 pod create error with 0.20.0 node images #3309

Closed
bpfoster opened this issue Jul 17, 2023 · 15 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/external upstream bugs

@bpfoster
Contributor

What happened:

I just upgraded to kind 0.20.0. If I specify any of the node images listed in the release (e.g. kindest/node:v1.22.17@sha256:f5b2e5698c6c9d6d0adc419c0deae21a425c07d81bbf3b6a6834042f25d4fba2 or kindest/node:v1.27.3@sha256:3966ac761ae0136263ffdb6cfd4db23ef8a83cba8a463690e98317add2c9ba72), one of my pods (the only StatefulSet, fwiw) fails to create with the following error:

Failed to pull image "xxx": rpc error: code = Unknown desc = failed to pull and unpack image "xxx": failed to extract layer sha256:yyy: mount callback failed on /var/lib/containerd/tmpmounts/containerd-mount2288311390: failed to convert whiteout file "var/cache/apt/.wh.archives": unlinkat /var/lib/containerd/tmpmounts/containerd-mount2288311390/var/cache/apt/archives: input/output error: unknown

Interestingly, if I switch the node image to one specified in the kind 0.19.0 release while still running kind 0.20.0 (for example kindest/node:v1.22.17@sha256:9af784f45a584f6b28bce2af84c494d947a05bd709151466489008f80a9ce9d5 or kindest/node:v1.27.1@sha256:b7d12ed662b873bd8510879c1846e87c7e676a79fefc93e17b2a52989d3ff42b), it works.

What you expected to happen: Pods run without error

How to reproduce it (as minimally and precisely as possible):

  1. kind create cluster --image=<image mentioned above>
  2. helm install my-app

Anything else we need to know?: Running on rootless podman via systemd user scope

Environment:

  • kind version: (use kind version): kind v0.20.0 go1.20.4 linux/amd64
  • Runtime info: (use docker info or podman info):
host:
  arch: amd64
  buildahVersion: 1.29.0
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.6-1.module+el8.8.0+1265+fa25dd7a.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.6, commit: a88a21e8953a6243d5f369f61a342bcaf0630aa1'
  cpuUtilization:
    idlePercent: 87.51
    systemPercent: 2.75
    userPercent: 9.75
  cpus: 12
  distribution:
    distribution: '"rocky"'
    version: "8.8"
  eventLogger: file
  hostname: x
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 4.18.0-477.15.1.el8_8.x86_64
  linkmode: dynamic
  logDriver: k8s-file
  memFree: 377614336
  memTotal: 33449598976
  networkBackend: cni
  ociRuntime:
    name: runc
    package: runc-1.1.4-1.module+el8.8.0+1265+fa25dd7a.x86_64
    path: /usr/bin/runc
    version: |-
      runc version 1.1.4
      spec: 1.0.2-dev
      go: go1.19.4
      libseccomp: 2.5.2
  os: linux
  remoteSocket:
    exists: true
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_SYS_CHROOT,CAP_NET_RAW,CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.0-2.module+el8.8.0+1265+fa25dd7a.x86_64
    version: |-
      slirp4netns version 1.2.0
      commit: 656041d45cfca7a4176f6b7eed9e4fe6c11e8383
      libslirp: 4.4.0
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.2
  swapFree: 16638275584
  swapTotal: 16793989120
  uptime: 2h 39m 21.00s (Approximately 0.08 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - registry.access.redhat.com
  - registry.redhat.io
  - docker.io
store:
  configFile: /home/x/.config/containers/storage.conf
  containerStore:
    number: 3
    paused: 0
    running: 2
    stopped: 1
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/x/.local/share/containers/storage
  graphRootAllocated: 407822663680
  graphRootUsed: 137751646208
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /home/x/tmp
  imageStore:
    number: 172
  runRoot: /run/user/1000/containers
  transientStore: false
  volumePath: /home/x/.local/share/containers/storage/volumes
version:
  APIVersion: 4.4.1
  Built: 1687991933
  BuiltTime: Wed Jun 28 18:38:53 2023
  GitCommit: ""
  GoVersion: go1.19.9
  Os: linux
  OsArch: linux/amd64
  Version: 4.4.1
  • OS (e.g. from /etc/os-release):
NAME="Rocky Linux"
VERSION="8.8 (Green Obsidian)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="8.8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Rocky Linux 8.8 (Green Obsidian)"
ANSI_COLOR="0;32"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:rocky:rocky:8:GA"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
SUPPORT_END="2029-05-31"
ROCKY_SUPPORT_PRODUCT="Rocky-Linux-8"
ROCKY_SUPPORT_PRODUCT_VERSION="8.8"
REDHAT_SUPPORT_PRODUCT="Rocky Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.8"
  • Kubernetes version: (use kubectl version):
Client Version: v1.24.3
Kustomize Version: v4.5.4
Server Version: v1.22.17
  • Any proxies or other special environment settings?:
    Kind config:
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  # don't pass through host search paths
  dnsSearch: []
@bpfoster bpfoster added the kind/bug Categorizes issue or PR as related to a bug. label Jul 17, 2023
@kundan2707
Contributor

/assign

@kundan2707
Contributor

/remove-kind bug

@k8s-ci-robot k8s-ci-robot added needs-kind Indicates a PR lacks a `kind/foo` label and requires one. and removed kind/bug Categorizes issue or PR as related to a bug. labels Jul 18, 2023
@kundan2707
Contributor

/kind support

@k8s-ci-robot k8s-ci-robot added the kind/support Categorizes issue or PR as a support question. label Jul 18, 2023
@kundan2707
Contributor

@bpfoster
I created a kind cluster with the image and version you mentioned.
I was also able to create various pods successfully.

 kind create cluster --image=kindest/node:v1.27.3@sha256:3966ac761ae0136263ffdb6cfd4db23ef8a83cba8a463690e98317add2c9ba72
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.27.3) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-kind"
You can now use your cluster with:

kubectl cluster-info --context kind-kind

Have a question, bug, or feature request? Let us know! https://kind.sigs.k8s.io/#community 🙂

Is there a specific pod that is failing?

@bpfoster
Contributor Author

bpfoster commented Jul 18, 2023

Thanks @kundan2707

This is specific to a pod running one of our in-house images. I have been able to whittle the real Dockerfile down to the following minimal reproduction image. I understand that it is not an efficient Dockerfile, and I'll be looking at cleaning it up, but the real Dockerfile is much more complex and this is essentially the end result of its RUN steps:

FROM debian:bullseye-slim

RUN apt-get update && \
    apt-get install -y --no-install-recommends curl && \
    apt-get clean && rm -rf /var/lib/apt/lists /var/cache/apt/archives

RUN apt-get update && apt-get upgrade -y

RUN apt-get clean && rm -rf /var/lib/apt/lists /var/cache/apt/archives

To reproduce:

  1. docker build -t foobar:latest .
  2. kind load docker-image foobar:latest

Step 2 fails with kind node image kindest/node:v1.22.17@sha256:f5b2e5698c6c9d6d0adc419c0deae21a425c07d81bbf3b6a6834042f25d4fba2 from the 0.20 release, and succeeds with kind node image kindest/node:v1.22.17@sha256:9af784f45a584f6b28bce2af84c494d947a05bd709151466489008f80a9ce9d5 from the 0.19 release.

This is beyond my understanding, but the errors seem to relate to whiteout files created during the last rm -rf. If I remove the last RUN line, it succeeds.

mount callback failed on /var/lib/containerd/tmpmounts/containerd-mount1472449707: failed to convert whiteout file \"var/cache/apt/.wh.archives\": unlinkat /var/lib/containerd/tmpmounts/containerd-mount1472449707/var/cache/apt/archives: input/output error: unknown

Perhaps this is then a containerd issue..?

@bpfoster
Contributor Author

bpfoster commented Jul 18, 2023

Forget all the apt-get steps, I get a similar whiteout convert error with an image as simple as

FROM debian:bullseye-slim

RUN mkdir /a && touch /a/b
RUN rm -rf /a
ERROR: failed to load image: command "podman exec --privileged -i kind-control-plane ctr --namespace=k8s.io images import --all-platforms --digests --snapshotter=overlayfs -" failed with error: exit status 1
Command Output: unpacking foobar:latest (sha256:c0bfa9fa4dec37ce494b2c1eca50d95b661e9b453886be82aadbeee280da7476)...time="2023-07-18T13:31:54Z" level=info msg="apply failure, attempting cleanup" error="failed to extract layer sha256:18b9b98002bb83ad9d69f81904460f474f9a33f28a42637092f750061d0bf6d4: mount callback failed on /var/lib/containerd/tmpmounts/containerd-mount3730886863: failed to convert whiteout file \".wh.a\": unlinkat /var/lib/containerd/tmpmounts/containerd-mount3730886863/a: input/output error: unknown" key="extract-466388356-4oZW sha256:95cc91e05f5b1e9f023ac497c6ea15aee933a166a2b0ba4ba947e8695bf5a555"
ctr: failed to extract layer sha256:18b9b98002bb83ad9d69f81904460f474f9a33f28a42637092f750061d0bf6d4: mount callback failed on /var/lib/containerd/tmpmounts/containerd-mount3730886863: failed to convert whiteout file ".wh.a": unlinkat /var/lib/containerd/tmpmounts/containerd-mount3730886863/a: input/output error: unknown
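
For readers unfamiliar with the `.wh.` names in the errors above: in the OCI image layer format, a deletion is recorded in the layer tarball as an empty file named `.wh.<name>` next to the deleted entry, and the runtime has to translate that marker when it unpacks the layer. The following is only a minimal Python sketch of how such markers can be spotted in a layer tarball. It is not containerd's implementation; `whiteout_targets` and `build_demo_layer` are hypothetical helpers, and the demo layer is hand-built to resemble the entries named in the error messages rather than exported from a real image.

```python
import io
import posixpath
import tarfile

WHITEOUT_PREFIX = ".wh."
OPAQUE_MARKER = ".wh..wh..opq"  # marks a whole directory as replaced


def whiteout_targets(layer):
    """Return the paths that a layer's whiteout entries mark as deleted."""
    targets = []
    for member in layer.getmembers():
        directory, base = posixpath.split(member.name)
        if base == OPAQUE_MARKER:
            # Opaque marker: hides everything lower layers put in `directory`.
            targets.append(posixpath.join(directory, "*"))
        elif base.startswith(WHITEOUT_PREFIX):
            # ".wh.a" means "the sibling entry named a was deleted".
            targets.append(posixpath.join(directory, base[len(WHITEOUT_PREFIX):]))
    return targets


def build_demo_layer():
    """Build an in-memory tar with the two whiteout names seen in this report."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        # Whiteouts are empty regular files; only their names carry meaning.
        tar.addfile(tarfile.TarInfo(".wh.a"))
        tar.addfile(tarfile.TarInfo("var/cache/apt/.wh.archives"))
    buffer.seek(0)
    return tarfile.open(fileobj=buffer, mode="r")


if __name__ == "__main__":
    for path in whiteout_targets(build_demo_layer()):
        print("layer deletes:", path)
```

Pointing the same function at a layer tarball extracted from a `docker save` archive should show whether a given image contains whiteouts at all, which is what distinguishes the failing images here from ones that load cleanly.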

@BenTheElder
Member

This sounds like a containerd bug, thanks for debugging this so far.
I'm going to be unavailable until next week, but we should look to see if there's an existing containerd bug report.

@bpfoster
Contributor Author

If I run containerd locally, things work fine. So my guess is that it's something to do with running inside a container, and my uneducated guess is that it's related to the overlay mounts.

This looks to be the problematic commit in containerd:
containerd/containerd@fa4720f

Prior to that commit it works, after it I get the error.

I don't know enough here to say if it's a bug in containerd or something that kind needs to change to handle.

@bpfoster
Contributor Author

Had some time to dig around, and it does seem to be a containerd bug. I've opened an issue with them: containerd/containerd#8851

@BenTheElder
Member

Thanks! containerd/containerd#8851 (comment)

Looks like we'll need to upgrade containerd to pick up this fix.

@BenTheElder BenTheElder added kind/bug Categorizes issue or PR as related to a bug. kind/external upstream bugs and removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. kind/support Categorizes issue or PR as a support question. labels Aug 2, 2023
@BenTheElder BenTheElder added this to the v0.21.0 milestone Aug 2, 2023
@BenTheElder
Member

The fix is now in the 1.7 branch as containerd/containerd@2eaeb32 (since 5 days ago). Given that 1.7.3 was released a week ago, we may need to pick up a pre-release commit for a bit.

@bpfoster
Contributor Author

bpfoster commented Aug 3, 2023

Yeah, unfortunately I wasn't familiar with their process and didn't request the 1.7 cherry-pick until after 1.7.3 had been released.
Not sure if there's any expectation on timeliness of a new containerd release, but since the kind 0.19 images seem to be working OK, this isn't a blocker for us at the moment.
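
For anyone who wants to script the workaround described in this thread (staying on kind 0.20.0 but pinning a node image from the 0.19.0 release), here is a rough Python sketch. The image references are copied from this report; the `KNOWN_GOOD_IMAGES` table and `create_cluster_command` helper are hypothetical names used for illustration, and the sketch only builds the command line rather than running it.

```python
import shlex

# Node images from the kind 0.19.0 release that were reported as working
# in this issue. Digests are copied verbatim from the report.
KNOWN_GOOD_IMAGES = {
    "v1.22.17": "kindest/node:v1.22.17@sha256:9af784f45a584f6b28bce2af84c494d947a05bd709151466489008f80a9ce9d5",
    "v1.27.1": "kindest/node:v1.27.1@sha256:b7d12ed662b873bd8510879c1846e87c7e676a79fefc93e17b2a52989d3ff42b",
}


def create_cluster_command(kubernetes_version, name="kind"):
    """Build a `kind create cluster` argv pinned to a known-good node image."""
    try:
        image = KNOWN_GOOD_IMAGES[kubernetes_version]
    except KeyError:
        raise ValueError(
            f"no known-good image recorded for {kubernetes_version}"
        ) from None
    return ["kind", "create", "cluster", "--name", name, f"--image={image}"]


if __name__ == "__main__":
    # Print the command so it can be reviewed before being run by hand.
    print(shlex.join(create_cluster_command("v1.27.1")))
```

Pinning by digest rather than by tag keeps the cluster on the exact image that was tested, which matters here because the same Kubernetes tag (v1.22.17) exists in both the working and the failing release.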

@bpfoster
Contributor Author

@BenTheElder - looks like containerd 1.7.4 was released with this fix, with a 1.7.5 release shortly after. 1.7.4 also bumps runc to 1.1.9.
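
As a small illustration of the version boundary mentioned above, the following Python sketch checks whether a containerd 1.7.x version string is at or past 1.7.4, the first 1.7 release reported here to carry the fix. The helper names are hypothetical, and the check deliberately refuses other release lines because this thread gives no information about them. A `False` result only means the backport is absent; it does not by itself prove that the version is affected.

```python
import re

# First 1.7.x release reported in this thread to include the fix.
FIRST_FIXED_1_7 = (1, 7, 4)


def parse_version(text):
    """Extract the first MAJOR.MINOR.PATCH triple from a version string."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no semantic version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def includes_whiteout_backport(version_text):
    """Return True if a containerd 1.7.x version is 1.7.4 or newer."""
    version = parse_version(version_text)
    if version[:2] != (1, 7):
        # Other release lines are not covered by this report.
        raise ValueError("only the containerd 1.7 release line is handled")
    return version >= FIRST_FIXED_1_7


if __name__ == "__main__":
    for candidate in ("v1.7.3", "v1.7.4", "v1.7.5"):
        print(candidate, includes_whiteout_backport(candidate))
```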

@BenTheElder
Member

We didn't bump containerd further with the k8s release hitting code freeze etc. I'm going to figure out the Go update in #3335, then bump everything and get ready for a v0.21 release, which will include this.

This was a particularly long lifecycle; if workarounds weren't available, we would have pressed forward with something sooner.

@bpfoster
Contributor Author

Great, thanks for the update!
