
Multi-arch next steps #1139

Closed
1 of 8 tasks
timja opened this issue Jul 6, 2021 · 36 comments
@timja
Member

timja commented Jul 6, 2021

  • Convert publishing scripts to bake
  • Enable multi-arch on CI; make targets have been added in this PR to show how that will work
  • Add ssh credentials to the trusted ci credentials store for the existing s390x and ppc64le static agents
  • Update the pipeline to load the ssh credentials for the above agents
  • Create an arm64 VM
  • Add ssh credentials to trusted ci for arm and add them to the pipeline
  • Add builder definitions to the trusted ci agent or pipeline, see description in
  • Enable publishing multi-arch builds; should just be a matter of removing a --set '*.platform=linux/amd64' (see the sketch below)
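
A minimal sketch of that change, assuming the publish step invokes docker buildx bake (the docker-bake.hcl file name and the linux group appear in commands later in this thread; the --push flag is an assumption about how publishing is wired):

# Today: every target is pinned to amd64
docker buildx bake -f docker-bake.hcl --set '*.platform=linux/amd64' --push linux

# Multi-arch: drop the --set override so each target builds for all
# platforms declared in docker-bake.hcl
docker buildx bake -f docker-bake.hcl --push linux
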
`docker buildx` config and ssh config
docker buildx create --name remote --use

docker buildx create --name remote \
  --append ssh://jenkins-agent-ppc64le

docker buildx create --name remote \
  --append ssh://jenkins-agent-s390x

# I'm running on arm so didn't actually do this, but for completeness
docker buildx create --name remote \
  --append ssh://jenkins-agent-arm64

Host jenkins-agent-ppc64le
  HostName <actualname>
  IdentityFile ~/.ssh/ppc64le
  User jenkins

Host jenkins-agent-s390x
  HostName <actualname>
  IdentityFile ~/.ssh/s390x
  User jenkins

Host jenkins-agent-arm64
  HostName <actualname>
  IdentityFile ~/.ssh/arm64
  User jenkins

A few questions:

  1. Is it fine to re-use the existing s390x and ppc64le agents across ci.jenkins.io and trusted-ci, or do we need a separate one for each?
  2. Where should the arm64 machine be hosted? As far as I know we have two choices, AWS or Oracle Cloud; any preference?
  3. Should SSH keys and SSH config be loaded per pipeline run on trusted ci or baked into the image? (I assume in pipeline)
  4. Are the IP addresses for the static agents (currently s390x and ppc64le) considered sensitive, or can I create DNS entries for them, e.g. ppc64le-agent.jenkins.io?
  5. For setting up the buildx builders (script above), I was thinking of adding a shell script to the repo that would set up the SSH config (assuming it's non-sensitive) and the builders; any other thoughts? (See the sketch after this list.)
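
A minimal sketch of such a script, assuming the SSH host aliases shown above are already present in ~/.ssh/config (the setup-builders.sh name is hypothetical):

#!/bin/bash
# setup-builders.sh: create one remote buildx builder backed by the
# static agents over SSH
set -euo pipefail

docker buildx create --name remote --use
docker buildx create --name remote --append ssh://jenkins-agent-ppc64le
docker buildx create --name remote --append ssh://jenkins-agent-s390x
docker buildx create --name remote --append ssh://jenkins-agent-arm64

# Start the builder and confirm all endpoints are reachable
docker buildx inspect --bootstrap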

Any help would be hugely appreciated ❤️

cc @olblak @MarkEWaite @slide @dduportal

@dduportal
Contributor

Thanks a lot for writing this issue!

About the ARM instances, is there something blocking us from using the current dynamic agent allocation in ci.jenkins.io (e.g. adding a pipeline parallel branch to execute the make build-arm64 command on a node with the arm64docker label)?

About your questions:

  1. Reusing the same agent between the two instances seems risky. My vote is for two (physically) separate instances, one per Jenkins controller.
  2. Regarding the ARM machines, if we use them dynamically then AWS seems fine, as they are already available in ci.jenkins.io and infra.jenkins as of today. If we use static instances then we should compare the prices per month.
  3. Could you elaborate on what you mean by "baked into the image"?
  4. For the "static VMs", the IPv4 address seems enough, as long as it does not change on VM restarts AND there is a security group restricting access to the VM to our agent + controller networks.
  5. About setting up the buildx builders, I'm really excited to get started on ci.jenkins.io, but we have to start by putting the s390x and ppc64 VMs under configuration management to tackle the following problems:
  • Avoid full disks (=> daily crontab with docker system prune --force --volumes and apt-get autoremove --purge --yes; see the crontab sketch below)
  • Avoid security issues due to being out of date (=> weekly crontab with apt dist-upgrade -y && reboot, plus unattended upgrades enabled)
  • Execute the Docker Engine as non-root (that should be easy), so that the SSH credentials + Docker cannot grant UID 0 or sudo (barring a containerd or kernel CVE).
    => The challenge will be the PPC architecture, which might not be supported by Puppet. We could use shell, Ansible, Terraform, or whatever to maintain it; no problem there.
    => As soon as the VMs are ready, then go go go for adding the buildx shell scripts.
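
A minimal sketch of the two crontab entries described above (the schedules and the /etc/cron.d file name are assumptions):

# /etc/cron.d/agent-maintenance (hypothetical)
# Daily at 03:00: reclaim disk space from unused Docker data and old packages
0 3 * * * root docker system prune --force --volumes && apt-get autoremove --purge --yes
# Weekly on Sunday at 04:00: apply pending upgrades, then reboot
0 4 * * 0 root apt-get update && apt-get dist-upgrade -y && reboot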

Thanks for this huge and awesome work @timja

@dduportal
Contributor

More food for thought if we want to get started with s390x and ppc64: WDYT about adding, right now:

  • Add and use the shell script that starts the buildx workers
  • Still run on the AMD64 machines, but use QEMU, as Docker4Mac does for instance (see the sketch below)
    => that would allow us to get started on the build (at least) and validate this work soon, without being blocked by VM management. WDYT?
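
A minimal sketch of enabling QEMU emulation on an amd64 machine, using the tonistiigi/binfmt image (one common approach, not necessarily what was used here):

# Register QEMU handlers for all supported foreign architectures
docker run --privileged --rm tonistiigi/binfmt --install all

# The emulated platforms should now appear on the builder
docker buildx inspect --bootstrap | grep Platforms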

@timja
Member Author

timja commented Jul 6, 2021

About the ARM instances, is there something blocking us from using the current dynamic agent allocation in ci.jenkins.io (e.g. adding a pipeline parallel branch to execute the make build-arm64 command on a node with the arm64docker label)?

For CI it's fine to run the build / test against each architecture, but for publishing we want to use docker buildx builders, which means running the command from one machine that has SSH access to all the others.

Reusing the same agent between the two instances seems risky. My vote is for two (physically) separate instances, one per Jenkins controller.

👍, is it possible to get more machines, @MarkEWaite / @slide?

For the "static VMs", the IPv4 address seems enough, as long as it does not change on VM restarts AND there is a security group restricting access to the VM to our agent + controller networks.

There doesn't seem to be one at the moment; I can access them from my machine.

For the "static VMs", the IPv4 address seems enough, as long as it does not change on VM restarts AND there is a security group restricting access to the VM to our agent + controller networks.

The main question was: can I put them into DNS to make them easier to manage / reason about? And if IP-only, is it OK for the addresses to be public, or do they need to be loaded from a credential?

@dduportal
Contributor

For CI it's fine to run the build / test against each architecture, but for publishing we want to use docker buildx builders, which means running the command from one machine that has SSH access to all the others.

The ARM capability for trusted.ci is only a configuration away. Can you add this to today's infra meeting? I'll take care of it this week, to open up the possibility here and avoid blocking anything :)

@timja
Member Author

timja commented Jul 6, 2021

I can't see how to do that, and I also can't attend (the meeting time doesn't work so well for me these days).

@dduportal
Contributor

@timja let me handle it, no problem on this (and many thanks for managing this!)

@timja
Member Author

timja commented Jul 6, 2021

(I meant the agenda, btw; the move to HackMD has made that harder than the Google Doc was, and I can't see a published agenda.)

@MarkEWaite
Contributor

MarkEWaite commented Jul 7, 2021

  1. Is it fine to re-use the existing s390x and ppc64le agents across ci.jenkins.io and trusted-ci, or do we need a separate one for each?

I think we should reuse the machines but create separate accounts on the machine for those use cases. I've been using a separate account on the machines for my test cluster without any negative impact that I've detected. I propose to create the following accounts:

  • trusted - agent account for trusted-ci.jenkins.io, with docker permission
  • timja - user account for Tim Jacomb with sudo permission and docker permission (send me a public key that I can add to the account so that you can login with ssh)
  2. Where should the arm64 machine be hosted? As far as I know we have two choices, AWS or Oracle Cloud; any preference?

I'm open to either. I suspect that the operating system is more important than the cloud provider. @olblak and I have been using arm64 machines with Ubuntu 20.04 on Oracle Cloud with good results. I've also run arm64 on Oracle Cloud with Oracle Linux, but it is much less familiar to me than the Ubuntu environment. Oracle Cloud has offered us membership in their Arm accelerator program and a $3000 credit. AWS has donated $60k to the Jenkins project. My initial leaning is towards Oracle Arm just because there are so many other ways that we will use the capacity that AWS is donating.

  3. Should SSH keys and SSH config be loaded per pipeline run on trusted ci or baked into the image? (I assume in pipeline)

I assume in Pipeline, though I'm OK with either.

  4. Are the IP addresses for the static agents (currently s390x and ppc64le) considered sensitive, or can I create DNS entries for them, e.g. ppc64le-agent.jenkins.io?

The IP addresses are not considered sensitive. DNS entries seem like a very good idea.

  5. For setting up the buildx builders (script above), I was thinking of adding a shell script to the repo that would set up the SSH config (assuming it's non-sensitive) and the builders; any other thoughts?

That sounds good to me.

@timja
Member Author

timja commented Jul 7, 2021

timja - user account for Tim Jacomb with sudo permission and docker permission (send me a public key that I can add to the account so that you can login with ssh)

https://github.com/jenkins-infra/jenkins-infra/blob/staging/hieradata/common.yaml#L80

I'm open to either. I suspect that the operating system is more important than the cloud provider. @olblak and I have been using arm64 machines with Ubuntu 20.04 on Oracle Cloud with good results. I've also run arm64 on Oracle Cloud with Oracle Linux, but it is much less familiar to me than the Ubuntu environment. Oracle Cloud has offered us membership in their Arm accelerator program and a $3000 credit. AWS has donated $60k to the Jenkins project. My initial leaning is towards Oracle Arm just because there are so many other ways that we will use the capacity that AWS is donating.

Fine with me; how can we get the machine set up?

@MarkEWaite
Contributor

timja - user account for Tim Jacomb with sudo permission and docker permission (send me a public key that I can add to the account so that you can login with ssh)

https://github.com/jenkins-infra/jenkins-infra/blob/staging/hieradata/common.yaml#L80

Fine with me; how can we get the machine set up?

I'll create the machine and provide you with an account on it with sudo. Are you OK with my proposal to have a trusted account on the machine that is used for access from trusted.ci.jenkins.io?

@timja
Member Author

timja commented Jul 7, 2021

Fine from my POV; @olblak or @dduportal may have different opinions. The machines we already have are quite powerful.

@MarkEWaite
Contributor

I've created a timja account on s390x and on the ppc64le machine with the public key that you provided.

@MarkEWaite
Contributor

MarkEWaite commented Jul 8, 2021

I updated the Ubuntu packages on ppc64le and rebooted (it had 100+ packages that were outdated, including Java versions). The machine has restarted and is working.

@timja
Member Author

timja commented Jul 18, 2021

After the parallel changes are merged I can look at enabling this on CI, at least in a PR.

Do we want the full test suite run on every platform, or just smoke tests?

@timja
Member Author

timja commented Jul 19, 2021

FYI I tried running via QEMU, after the git-lfs update in our Dockerfile, on our agents (the AWS one), and it looks like I hit this:

moby/buildkit#1929

which points to a possibly mis-configured QEMU?

#293 173.1 debconf: delaying package configuration, since apt-utils is not installed
#293 173.3 Fetched 25.0 MB in 2s (16.2 MB/s)
#293 173.5 Error while loading /usr/sbin/dpkg-split: No such file or directory
#293 173.5 Error while loading /usr/sbin/dpkg-deb: No such file or directory
#293 173.5 dpkg: error processing archive /tmp/apt-dpkg-install-Vs2JH4/00-perl-modules-5.28_5.28.1-6+deb10u1_all.deb (--unpack):
#293 173.5  dpkg-deb --control subprocess returned error exit status 1
#293 173.5 Error while loading /usr/sbin/dpkg-split: No such file or directory
#293 173.5 Error while loading /usr/sbin/dpkg-deb: No such file or directory
#293 173.5 dpkg: error processing archive /tmp/apt-dpkg-install-Vs2JH4/01-libgdbm6_1.18.1-4_arm64.deb (--unpack):
#293 173.5  dpkg-deb --control subprocess returned error exit status 1
#293 173.5 Error while loading /usr/sbin/dpkg-split: No such file or directory
#293 173.5 Error while loading /usr/sbin/dpkg-deb: No such file or directory
#293 173.5 dpkg: error processing archive /tmp/apt-dpkg-install-Vs2JH4/02-libgdbm-compat4_1.18.1-4
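
One common cause of these dpkg-split / dpkg-deb "No such file or directory" errors under emulation (consistent with the linked buildkit issue, though not confirmed here) is a QEMU handler registered without the binfmt_misc fix-binary (F) flag, so the emulator can't be located from inside the container. Re-registering with an image that sets the flag is a typical fix:

# Re-register QEMU handlers; tonistiigi/binfmt sets the fix-binary flag
docker run --privileged --rm tonistiigi/binfmt --install all

# The flags line should now include F
cat /proc/sys/fs/binfmt_misc/qemu-aarch64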

@timja timja pinned this issue Jul 22, 2021
@timja
Member Author

timja commented Jul 22, 2021

Setting up QEMU got it further.

It now fails on:

 > [debian_jdk11 linux/arm64  8/15] RUN curl -fsSL https://repo.jenkins-ci.org/public/org/jenkins-ci/main/jenkins-war/2.300/jenkins-war-2.300.war -o /usr/share/jenkins/jenkins.war   && echo "2f6aa548373b038af4fb6a4d6eaa5d13679510008f1712532732bf77c55b9670  /usr/share/jenkins/jenkins.war" | sha256sum -c -:
------
Dockerfile:73
--------------------
  72 |     # see https://github.com/docker/docker/issues/8331
  73 | >>> RUN curl -fsSL ${JENKINS_URL} -o /usr/share/jenkins/jenkins.war \
  74 | >>>   && echo "${JENKINS_SHA}  /usr/share/jenkins/jenkins.war" | sha256sum -c -
error: failed to solve: rpc error: code = Unknown desc = process "/bin/sh -c curl -fsSL ${JENKINS_URL} -o /usr/share/jenkins/jenkins.war   && echo \"${JENKINS_SHA}  /usr/share/jenkins/jenkins.war\" | sha256sum -c -" did not complete successfully: exit code: 18
Makefile:15: recipe for target 'build' failed

not clear why (curl's exit code 18 means only part of the file was transferred, so the download was cut short)

@dduportal
Contributor

Manual test:

export ARCH=arm64
make build-debian_jdk11

  • The resulting image reports aarch64 as expected:

docker run --rm -t docker.io/jenkins/jenkins:2.300-jdk11 uname -m
WARNING: The requested image's platform (linux/arm64) does not match the detected host platform (linux/amd64) and no specific platform was requested
aarch64

=> I assume the error when downloading the war file was a network issue: to be double checked of course :)

@timja
Member Author

timja commented Jul 22, 2021

@dduportal I can reproduce it on the Ubuntu 20 machine; just run:

docker buildx bake -f docker-bake.hcl linux

@dduportal
Contributor

@timja when you say that you can reproduce, do you mean the error? Because the command you provided is successful for me on both the Ubuntu 20.04 machine with QEMU installed (and enabled) and my Intel macOS (with Docker4Mac).

@timja
Member Author

timja commented Jul 23, 2021

I got the error last night on the Ubuntu 20 machine using the above command; I've just retriggered it.

@dduportal
Contributor

@timja thanks for clarifying; I asked because I was not sure I had understood correctly. It means the outcome of the build is not always the same: there is something weird :|

@timja
Member Author

timja commented Jul 23, 2021

it's passing now

@timja
Member Author

timja commented Jul 23, 2021

It takes 19m44.652s with no cache though; maybe we can drop multi-arch for some images that don't need it?

@timja
Member Author

timja commented Jul 23, 2021

I built the full set twice more via QEMU (with --no-cache).

One got stuck on s390x and I cancelled it after 30 minutes.

The other completed in 10 minutes.

This is on the reduced-platform branch.

I also ran it twice on my M1, building with remote builders (non-emulated).

The 1st run failed with

#34 4.130 failure: repodata/repomd.xml from AdoptOpenJDK: [Errno 256] No more mirrors to try.
#34 4.130 https://adoptopenjdk.jfrog.io/adoptopenjdk/rpm/centos/7/x86_64/repodata/repomd.xml: [Errno 14] curl#77 - "Problem with the SSL CA cert (path? access rights?)"
------
Dockerfile:3
--------------------
   2 |
   3 | >>> RUN echo -e '[AdoptOpenJDK]\n\
   4 | >>> name=AdoptOpenJDK\n\
   5 | >>> baseurl=https://adoptopenjdk.jfrog.io/adoptopenjdk/rpm/centos/$releasever/$basearch\n\
   6 | >>> enabled=1\n\
   7 | >>> gpgcheck=1\n\
   8 | >>> gpgkey=https://adoptopenjdk.jfrog.io/adoptopenjdk/api/gpg/key/public' > /etc/yum.repos.d/adoptopenjdk.repo && \
   9 | >>>     yum update -y && yum install -y git curl adoptopenjdk-8-hotspot-8u292_b10-3 freetype fontconfig unzip which && \
  10 | >>>     yum clean all
  11 |
--------------------
error: failed to solve: rpc error: code = Unknown desc = executor failed running [/bin/sh -c echo -e '[AdoptOpenJDK]\nname=AdoptOpenJDK\nbaseurl=https://adoptopenjdk.jfrog.io/adoptopenjdk/rpm/centos/$releasever/$basearch\nenabled=1\ngpgcheck=1\ngpgkey=https://adoptopenjdk.jfrog.io/adoptopenjdk/api/gpg/key/public' > /etc/yum.repos.d/adoptopenjdk.repo &&     yum update -y && yum install -y git curl adoptopenjdk-8-hotspot-8u292_b10-3 freetype fontconfig unzip which &&     yum clean all]: exit code: 1
docker buildx bake --no-cache --pull -f docker-bake.hcl linux  0.82s user 0.59s system 4% cpu 29.141 total

The 2nd run failed with

#155 3.282 + git lfs install
#155 3.588 error: git-lfs died of signal 4
------
Dockerfile:13
--------------------
  12 |     # https://github.com/git-lfs/git-lfs/issues/4546
  13 | >>> RUN GIT_LFS_ARCHIVE="git-lfs-linux-${TARGETARCH}-v${GIT_LFS_VERSION}.tar.gz" \
  14 | >>>     GIT_LFS_RELEASE_URL="https://github.com/git-lfs/git-lfs/releases/download/v${GIT_LFS_VERSION}/${GIT_LFS_ARCHIVE}"\
  15 | >>>     set -x; curl --fail --silent --location --show-error --output "/tmp/${GIT_LFS_ARCHIVE}" "${GIT_LFS_RELEASE_URL}" && \
  16 | >>>     mkdir -p /tmp/git-lfs && \
  17 | >>>     tar xzvf "/tmp/${GIT_LFS_ARCHIVE}" -C /tmp/git-lfs && \
  18 | >>>     bash -x /tmp/git-lfs/install.sh && \
  19 | >>>     rm -rf /tmp/git-lfs*
  20 |
error: failed to solve: rpc error: code = Unknown desc = executor failed running [/bin/sh -c GIT_LFS_ARCHIVE="git-lfs-linux-${TARGETARCH}-v${GIT_LFS_VERSION}.tar.gz"     GIT_LFS_RELEASE_URL="https://github.com/git-lfs/git-lfs/releases/download/v${GIT_LFS_VERSION}/${GIT_LFS_ARCHIVE}"    set -x; curl --fail --silent --location --show-error --output "/tmp/${GIT_LFS_ARCHIVE}" "${GIT_LFS_RELEASE_URL}" &&     mkdir -p /tmp/git-lfs &&     tar xzvf "/tmp/${GIT_LFS_ARCHIVE}" -C /tmp/git-lfs &&     bash -x /tmp/git-lfs/install.sh &&     rm -rf /tmp/git-lfs*]: exit code: 132

signal 4 is SIGILL (exit code 132 = 128 + 4), which means the binary hit an illegal instruction, i.e. machine code the CPU doesn't understand =/

@timja
Member Author

timja commented Jul 23, 2021

I ran the build ~4 times on the s390x machine and all of them failed on the git-lfs install; when I removed the debian buster 11 image from the list it passed first time.

So I've removed it in #1156.

FTR I tried installing git-lfs manually on Ubuntu s390x and it works fine.

@timja
Member Author

timja commented Jul 27, 2021

@dduportal do you think we should continue trying with QEMU, or build on the native architectures?

@MarkEWaite
Contributor

MarkEWaite commented Jul 27, 2021

In past sessions of the platform SIG, Alex Earl mentioned that there were specific feature issues with QEMU (see the Jan 15, 2021 platform SIG notes). Unfortunately, I didn't capture any details in the notes. You can hear the description from @slide at https://youtu.be/MzpL2IEkJ3E?t=530

There was also a comment in the Jan 15, 2020 meeting that Jim Crowley was investigating QEMU: https://docs.google.com/document/d/1q5A72xnoJVPZRKXZhyNnYCSCTuG02LiFcKQkH5rdwXc/edit#heading=h.2ye0o1azc72i

@dduportal
Contributor

So @timja was able to determine why publication was failing (on trusted.ci) while the usual pipeline was working (on ci.jenkins): the Azure VM requires QEMU's binfmt handlers to be enabled before each QEMU build, while the AWS EC2 agent (on ci.jenkins) only requires binfmt to be loaded in the AMI.

#1169 enables multi-arch again and also carries the fix; @timja's tests look OK (custom images published to DockerHub) \o/
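
A sketch of what "enable binfmt before each build" can look like as a shell step at the start of the publish job (the tonistiigi/binfmt image is one common choice, an assumption here):

# Run on the Azure agent before every multi-arch build
docker run --privileged --rm tonistiigi/binfmt --install all
docker buildx bake -f docker-bake.hcl linux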

@timja
Member Author

timja commented Aug 3, 2021

Images are now being published 🎉

@timja timja closed this as completed Aug 3, 2021
@timja timja unpinned this issue Aug 3, 2021
@Nayana-ibm

Images are now being published 🎉

@timja I can see multi-arch images published with the rhel-ubi8 tag, but not for all tags/versions (checking https://hub.docker.com/r/jenkins/jenkins).
Since s390x is mentioned in the description and the comments, is there any plan to publish multi-arch images including s390x for all tags/versions?

@olblak
Member

olblak commented Aug 17, 2021

@Nayana-ibm We do have plans to build and publish images for s390x, as we currently have access to s390x infrastructure from IBM, but there is no ETA at the moment as far as I know.

@timja
Member Author

timja commented Aug 17, 2021

Is there anything specific you’re after? Adding it to each tag makes the build take longer, so we would prefer to do it based on user need.

It’s no problem to enable it for another one though (see the sketch below for roughly what that change looks like).
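
For reference, a hedged sketch of enabling one more architecture for a single target via the same --set override mechanism mentioned in the task list (the target name and platform list are assumptions):

# Build one target for an extra platform without editing docker-bake.hcl
docker buildx bake -f docker-bake.hcl \
  --set 'debian_jdk11.platform=linux/amd64,linux/arm64,linux/s390x' linux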

@timja
Member Author

timja commented Aug 17, 2021

@Nayana-ibm this adds it to the default image; is that enough? #1183

@Nayana-ibm

@timja Thank you for considering s390x for the default images. I can see images are now published with the latest and jdk11 tags.
However, I don't see the jenkins/jenkins:lts-jdk11 and jenkins/jenkins:lts images as multi-arch, even though the tag is referenced in the debian_jdk11 target:

equal(LATEST_LTS, "true") ? "${REGISTRY}/${JENKINS_REPO}:lts" : "",

Am I missing anything here?

@timja
Member Author

timja commented Aug 18, 2021

The next LTS release is scheduled for Wed 25th August, so the lts tags will pick this up then.

@Nayana-ibm

Great! Thank you
