coredns:v1.8.0 and etcd:3.4.13-0 on arm64 have the wrong architecture in their manifests; kubeadm init fails #104085

Closed
sysedwinistrator opened this issue Aug 2, 2021 · 8 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/release Categorizes an issue or PR as relevant to SIG Release. triage/accepted Indicates an issue or PR is ready to be actively worked on.
@sysedwinistrator

What happened:

Cluster initialization on arm64 via kubeadm init using CRI-O would fail after a timeout.
watch crictl ps wouldn't show any containers starting.
CRI-O logs show the following:

Aug 02 20:42:19 n2 crio[852185]: time="2021-08-02 20:42:19.803721746+02:00" level=info msg="Image operating system mismatch: image uses OS \"linux\"+architecture \"amd64\", expecting one of \"linux+arm64\""

The inspection of the images' manifests seems to confirm the initial assumption that some of them were pulled with the wrong architecture:

# crictl images | awk 'NR>1 {print $3}' | while read -r line; do crictl inspecti $line | jq '"IMAGE: "+.status.repoTags[]+", ARCH: "+ .info.imageSpec.architecture'; done
"IMAGE: k8s.gcr.io/coredns/coredns:v1.8.0, ARCH: amd64"
"IMAGE: k8s.gcr.io/etcd:3.4.13-0, ARCH: amd64"

But all the other images are fine:

"IMAGE: k8s.gcr.io/kube-apiserver:v1.21.3, ARCH: arm64"
"IMAGE: k8s.gcr.io/kube-controller-manager:v1.21.3, ARCH: arm64"
"IMAGE: k8s.gcr.io/kube-proxy:v1.21.3, ARCH: arm64"
"IMAGE: k8s.gcr.io/kube-scheduler:v1.21.3, ARCH: arm64"
"IMAGE: k8s.gcr.io/pause:3.4.1, ARCH: arm64"

To rule out that CRI-O is somehow at fault, I ran the images through skopeo inspect, which confirmed the previous results even when passing --override-arch arm64.
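
For anyone who wants to repeat this check outside of CRI-O, here is a minimal sketch that compares the `architecture` field of an image config with the host architecture. It assumes `skopeo` and `jq` are installed and that the registry is reachable; the function names (`goarch_from_uname`, `image_config_arch`, `check_image_arch`) are made up for illustration and are not part of any tool.

```shell
#!/bin/sh
# Map `uname -m` output to the architecture names used in image configs.
goarch_from_uname() {
    case "$1" in
        aarch64|arm64) echo arm64 ;;
        x86_64|amd64)  echo amd64 ;;
        armv7l|armv6l) echo arm ;;
        *)             echo "$1" ;;   # pass unknown values through unchanged
    esac
}

# Print the architecture recorded in an image's config (needs network access).
# $1 = image reference, $2 = architecture to request from the manifest list.
image_config_arch() {
    skopeo --override-arch "$2" inspect --config "docker://$1" | jq -r '.architecture'
}

# Compare the config architecture with the host's; return non-zero on mismatch.
check_image_arch() {
    want=$(goarch_from_uname "$(uname -m)")
    got=$(image_config_arch "$1" "$want")
    if [ "$got" != "$want" ]; then
        echo "MISMATCH: $1 config says $got, host is $want"
        return 1
    fi
    echo "OK: $1 is $want"
}

# Example (not run automatically):
#   check_image_arch k8s.gcr.io/etcd:3.4.13-0
```

Nothing runs on its own here; the functions only do something once `check_image_arch` is called with an image reference.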

So I checked what architecture the binaries in the container storage location actually are:

# find . -name 'etcd*' -exec file {} \;
./22bafa088de22425c1186532aaaae80ee77823a4dba938b6a09310313804e001/usr/local/bin/etcd: ELF 64-bit LSB executable, ARM aarch64, version 1 (SYSV), statically linked, Go BuildID=C2gD-OhOm0EaEE6O5P-G/E0Rtpt8guBmaGHRO19gQ/OcfznmXfak8UQlpC_k77/c5upPp32vabmtzAPWE5y, not stripped
......
./22bafa088de22425c1186532aaaae80ee77823a4dba938b6a09310313804e001/usr/local/bin/etcdctl: ELF 64-bit LSB executable, ARM aarch64, version 1 (SYSV), statically linked, Go BuildID=Cj-D8mbvp8S4ehUJ-NSL/_qmGy8_7NS8-Lo8YkEWC/dPf7-SxGllpI-tniKRD5/7fldA6Ky3GsgJkND3PO2, not stripped
......
# find . -name 'coredns*' -exec file {} \;
./275d9308c04eff2ed9bc4723b14ea6e45baba29bf6c0bff6ce8ff5353fafa29f/coredns: ELF 64-bit LSB executable, ARM aarch64, version 1 (SYSV), statically linked, Go BuildID=39m9lQWm4aqmpdYUY6OJ/iNA5d9riIL95H_GDUSTD/H-Quk1-OcGkg7U4wkTDn/akmRBs5wZkoZ_CH88xMx, stripped

These are arm64 binaries, so it is the images' manifests that are wrong.
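
If `file` is not available on the host, the same information can be read straight from the ELF header. The following is a rough sketch, not a replacement for `file`: it assumes a little-endian ELF and only knows three machine types. The function name `elf_arch` is hypothetical.

```shell
#!/bin/sh
# Print the target architecture of an ELF file by reading e_machine,
# the 2-byte field at offset 18 of the ELF header.
# Assumes a little-endian ELF, which holds for arm64 and amd64 binaries.
elf_arch() {
    # Check the 4-byte ELF magic (0x7f 'E' 'L' 'F') first.
    magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    [ "$magic" = "7f454c46" ] || { echo "not-elf"; return 1; }

    # od prints the two bytes as unsigned decimals, e.g. " 183   0".
    set -- $(od -An -tu1 -j18 -N2 "$1")
    [ $# -eq 2 ] || { echo "unknown"; return 1; }
    machine=$(( $1 + $2 * 256 ))

    case "$machine" in
        183) echo arm64 ;;   # EM_AARCH64
        62)  echo amd64 ;;   # EM_X86_64
        40)  echo arm ;;     # EM_ARM
        *)   echo "unknown($machine)" ;;
    esac
}

# Example (not run automatically):
#   find . -name 'etcd*' -type f | while read -r f; do echo "$f: $(elf_arch "$f")"; done
```

For the binaries listed above, which `file` reports as ARM aarch64, this should report `arm64`.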

I reproduced this issue on another arm64 host via podman. The image digests are the same, and podman inspect also shows the architecture to be amd64. Unlike CRI-O, however, podman can run the images.

What you expected to happen:

That the images of the correct architecture also list that correct architecture in their manifests.

How to reproduce it (as minimally and precisely as possible):

kubeadm init with CRI-O should do the trick.

Anything else we need to know?:

This has already been reported in #99656.
But both etcd and coredns still show amd64 on their latest tags.

Environment:

  • Kubernetes version (use kubectl version): Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.3", GitCommit:"ca643a4d1f7bfe34773c74f79527be4afd95bf39", GitTreeState:"archive", BuildDate:"2021-08-02T11:40:12Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/arm64"}
  • Cloud provider or hardware configuration: ODROID-N2 SBC
  • OS (e.g: cat /etc/os-release): Arch Linux ARM
  • Kernel (e.g. uname -a): 5.13.3 custom based on Arch Linux ARM config
  • Install tools: kubeadm
  • Network plugin and version (if this is a network-related bug):
  • Others: CRI-O
@sysedwinistrator sysedwinistrator added the kind/bug Categorizes issue or PR as related to a bug. label Aug 2, 2021
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 2, 2021
@sysedwinistrator
Author

@sysedwinistrator: There are no sig labels on this issue. Please add an appropriate label by using one of the following commands:

* `/sig <group-name>`

* `/wg <group-name>`

* `/committee <group-name>`

Please see the group list for a listing of the SIGs, working groups, and committees available.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/sig release

@k8s-ci-robot k8s-ci-robot added sig/release Categorizes an issue or PR as relevant to SIG Release. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 2, 2021
@sysedwinistrator
Author

It turns out the wrong architecture in the manifests was not the reason the pods weren't starting.
I couldn't quite figure out what the issue was. Cluster initialisation worked after running kubeadm reset, completely resetting the container storage and "stopping" all cgroup slice units related to k8s.

Since the latest images of etcd and coredns are still affected, I'm going to keep the issue open.

@saschagrunert
Member

saschagrunert commented Aug 3, 2021

CoreDNS and etcd are each released through their own process and are not in the scope of SIG Release. For etcd, the image lives here: https://github.com/kubernetes/kubernetes/blob/master/cluster/images/etcd

Indeed, the architectures for non-amd64 images are wrong, so we can do two things:

I assume the same applies to CoreDNS, but I'm not sure right now where the image is being built.

@neolit123
Member

I reproduced this issue on another arm64 host via podman. The image digests are the same, and podman inspect also shows the architecture to be amd64. Unlike CRI-O, however, podman can run the images.

is there a way to tell cri-o to temporarily ignore the mismatch in the images?
we should certainly fix these images though.

https://github.com/kubernetes/kubernetes/blob/master/cluster/images/etcd

cc @jpbetz @wenjiaswe

I assume the same applies to CoreDNS, but I'm not sure right now where the image is being built.

cc @chrisohaver @johnbelamaric @rajansandeep

i think the image is built from the Dockerfile at:
https://github.com/coredns/coredns

/sig api-machinery network

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/network Categorizes an issue or PR as relevant to SIG Network. labels Aug 3, 2021
@neolit123
Member

neolit123 commented Aug 3, 2021

related:
#102315

note, this test is catching problems in the etcd, coredns, kube-proxy and conformance images:
https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-all#periodic-manifest-lists
https://storage.googleapis.com/kubernetes-jenkins/logs/periodic-kubernetes-e2e-manifest-lists/1422389464386244608/build-log.txt

WARNING: in config digest sha256:72d94efa1c317db2557d31f62a3384ef4959e5db59fdf9e49ca121a1afe5021e: found architecture "amd64", expected "arm"

WARNING: in config digest sha256:1a771bad15fcca72ab59198b3724a42871b3197bd8afc88feff53a2d045ac8db: found architecture "amd64", expected "arm"

it's just reporting warnings because otherwise this would fail the entire test job, and some container runtimes (CRs) tolerate the arch mismatch.
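
A check of this kind can be sketched roughly as follows. This is not the actual test job's implementation, only an illustration assuming `skopeo` and `jq` are installed: it walks a manifest list, reads the config of each per-platform image, and warns instead of failing when the two architectures disagree. The function names `warn_on_mismatch` and `check_manifest_list` and the warning text are hypothetical.

```shell
#!/bin/sh
# Emit a warning (not an error) when the architecture in an image config
# differs from the platform the manifest list advertises for it.
# $1 = manifest digest, $2 = architecture found in config, $3 = expected architecture.
warn_on_mismatch() {
    digest=$1 found=$2 expected=$3
    if [ "$found" != "$expected" ]; then
        echo "WARNING: manifest $digest: found architecture \"$found\", expected \"$expected\""
    fi
    return 0   # never fail, so one bad image does not break the whole job
}

# Walk a manifest list and check every per-platform image (needs network access).
# $1 = repository (e.g. k8s.gcr.io/etcd), $2 = tag (e.g. 3.4.13-0).
check_manifest_list() {
    repo=$1 tag=$2
    skopeo inspect --raw "docker://$repo:$tag" |
        jq -r '.manifests[] | "\(.digest) \(.platform.architecture)"' |
        while read -r digest expected; do
            # </dev/null keeps the inner command from consuming the loop's input.
            found=$(skopeo inspect --config "docker://$repo@$digest" </dev/null |
                jq -r '.architecture')
            warn_on_mismatch "$digest" "$found" "$expected"
        done
}

# Example (not run automatically):
#   check_manifest_list k8s.gcr.io/etcd 3.4.13-0
```

Returning 0 from `warn_on_mismatch` regardless of the result mirrors the warn-only behaviour described above; switching it to a non-zero return would turn the warning into a hard failure.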

@saschagrunert
Member

Fixes for coredns and etcd are in flight.

@justaugustus justaugustus added this to the v1.23 milestone Aug 4, 2021
@caesarxuchao
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 5, 2021
@saschagrunert
Member

The issue should be solved with the next Kubernetes minor release (v1.23) for etcd, and with the next coredns release.


6 participants