Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci: Build ARM images for core components #7220

Conversation

kimwnasptd
Copy link
Member

This PR extends the GH Actions to also build the "core" images in kubeflow/kubeflow for ARM. This is for kubeflow/manifests#2472 and #2337

The last step is to then modify the build system and makefiles for the example notebook images https://github.com/kubeflow/kubeflow/tree/master/components/example-notebook-servers

@kimwnasptd kimwnasptd force-pushed the feature-kimwnasptd-arm64-core-components branch 5 times, most recently from c1b5900 to 6a09136 Compare July 26, 2023 07:01
@hwang9u
Copy link

hwang9u commented Jul 26, 2023

I'm a beginner at Kuberflow, so I didn't know how to promote it. Thank you for your help! Have a nice day 🤩

@kimwnasptd kimwnasptd force-pushed the feature-kimwnasptd-arm64-core-components branch from 6a09136 to c9b788b Compare July 31, 2023 17:27
@kimwnasptd
Copy link
Member Author

@thesuperzapper can I get a review on this ASAP? The CI is broken right now, due to the multi-arch command having a redundant comma https://github.com/kubeflow/kubeflow/actions/runs/5717274226/job/15490659878

This PR should both fix that issue and also introduce ARM support for most of core components, aside from the notebook server images.

@thesuperzapper
Copy link
Member

@kimwnasptd I want to clarify that we have actually tested the building with ARM64? Because there is no guarantee they will all work, and we probably don't want to push busted images.


Also, in my experience, building complex ARM64 images under QEMU emulation can take a LONG time (and might never finish in extreme cases). In the past, I mitigated this problem using the docker buildx registry cache feature in two ways.

1: Use a separate "ci" GitHub Container Registry (which is local to the runners, and fully supports caching, unlike ECR):

CI_REGISTRY_IMAGE="ghcr.io/<IMAGE_OWNER>/ci/<IMAGE_TITLE>

docker buildx build \
  --cache-from=type=registry,ref=${CI_REGISTRY_IMAGE} \
  --cach-to=type=registry,ref=${CI_REGISTRY_IMAGE},mode=max \
  ./path/to/Dockerfile

2: For builds that never finish, manually populate the cache on an ARM device (like a MacBook):

CI_REGISTRY_IMAGE="ghcr.io/<IMAGE_OWNER>/ci/<IMAGE_TITLE>"

docker buildx build \
  --cache-to=type=registry,ref=${CI_REGISTRY_IMAGE},mode=max \
  ./path/to/Dockerfile

If we are going to do lots of cross-architecture builds I think we must implement the cache feature, and it should be relatively straightforward to host the "ci" image cache on GitHub Container Registry (GHCR).

PS: they are a bit out of date now, but a while ago, I mirrored all the old images onto the Kubeflow Org's GHCR. We can make "ci" versions of all of them in a similar way (and possibly also push the non-ci image there too): https://github.com/orgs/kubeflow/packages

@kimwnasptd
Copy link
Member Author

@thesuperzapper thanks for the review!

@kimwnasptd I want to clarify that we have actually tested the building with ARM64? Because there is no guarantee they will all work, and we probably don't want to push busted images.

Yes, I can confirm they can be build. I've tested all of them in an M2 MacBook and all the above components can be successfully built and run.

If we are going to do lots of cross-architecture builds I think we must implement the cache feature, and it should be relatively straightforward to host the "ci" image cache on GitHub Container Registry (GHCR).

Regarding the cache, I'd say let's evaluate this once we have some numbers first and the performance is indeed very poor. Looking at other projects like KServe they are also building for both x86 and ARM architectures and the builds are quite snappy.

https://github.com/kserve/kserve/actions/runs/5719086923/job/15496326007
https://github.com/kserve/kserve/blob/master/.github/workflows/sklearnserver-docker-publish.yml
https://hub.docker.com/layers/kserve/sklearnserver/latest/images/sha256-c010cf2e69fa014062b1b424014e0cc9cc58e170287caba827216480c02f1d37?context=explore

So I'd suggest to move forward with this and if we see that indeed the actions never complete we can re-evaluate going down a more complicated route afterwards.

@kimwnasptd
Copy link
Member Author

/hold

Let me actually fully flesh this out in a local fork, to make sure we have some numbers and we can re-evaluate

@kimwnasptd
Copy link
Member Author

/hold cancel

@thesuperzapper indeed most the build jobs finish in less than 30 mins. So I we should be good to go for now and not require a cache. You can see a run of GH workflows in the tests from this commit of my fork kimwnasptd@1e02d1e

Note that the 3 crud-web-apps are failing to build, but this is on ppc64le platform and is being tracked in #7226

I've verified that these three images can be built, and also tested them in my cluster. You can always check if an image for any architecture can be build locally with:

cd components/crud-web-apps/jupyter
ARCH=linux/arm64/v8 make docker-build-multi-arch

Since docker buildx is using qemu to emulate the different platforms https://docs.docker.com/build/building/multi-platform/

@kimwnasptd
Copy link
Member Author

Note, I had to do a bump in the python and gevent versions in order for these images to be able to be built for ARM, due to gevent/gevent#1721

@thesuperzapper
Copy link
Member

@kimwnasptd so are we planning to get this merged before 1.8, because I would like to!

However, I see that there is now a merge conflict.

@thesuperzapper
Copy link
Member

@kimwnasptd Also, I see that in kimwnasptd@1e02d1e you are also updating the example notebook images, I cant see the update to those images in this PR, did I just miss it?

@thesuperzapper
Copy link
Member

@kimwnasptd we really need to move this forward so that Kubeflow 1.8 works on ARM!

@pingsutw
Copy link
Member

+1, we need the image for ARM.

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>
Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>
Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>
Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>
Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>
Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>
Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>
Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>
Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>
Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>
Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>
@kimwnasptd kimwnasptd force-pushed the feature-kimwnasptd-arm64-core-components branch 3 times, most recently from c09a781 to 9ed9e4f Compare October 3, 2023 14:58
@kimwnasptd
Copy link
Member Author

@thesuperzapper ping to take a look at the current state. All the tests are passing, but I have a commit that triggered all of them which I'll drop once I get an ACK.

The changes at this point are:

  1. New workflows for building for multi-arch during a PR (this way the integration tests run way faster)
  2. The workflows that build/publish images will build for all archs

With the above we will always be testing on each PR that images can be built accordingly

@kimwnasptd
Copy link
Member Author

Had a sync with @thesuperzapper. We are OK to merge for now, although there are a couple of improvements to make afterwards, like:

  1. Have a common action for a couple of our workflows
  2. Use a matrix for building each image for each different platform, to build in parallel
  3. Move over to ghcr.io to host our images and also utilise a cache

@thesuperzapper I'm moving on with removing the DROP commits so that we can merge

@kimwnasptd kimwnasptd force-pushed the feature-kimwnasptd-arm64-core-components branch from 9ed9e4f to 1886b01 Compare October 4, 2023 18:53
@kimwnasptd
Copy link
Member Author

@thesuperzapper pushed the changes. The actions for the three webapp got triggered since there is a commit for changing some versions, to make sure the apps work in ARM/ppc64le.

Once the tests are good we can merge

@kimwnasptd
Copy link
Member Author

@thesuperzapper should be ready to merge

@thesuperzapper
Copy link
Member

/lgtm
/approve

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: thesuperzapper

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 466d675 into kubeflow:master Oct 4, 2023
8 checks passed
@kimwnasptd kimwnasptd deleted the feature-kimwnasptd-arm64-core-components branch October 5, 2023 07:43
kimwnasptd added a commit to kimwnasptd/kubeflow that referenced this pull request Oct 13, 2023
* ci: Build ARM images for core components

Extend the GH Actions to also build the images on ARM architectures.

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

* crud-web-apps: Update python and gevent versions

In order to successfully build on linux/arm64/v8 we'll need to:
* Update to Python 3.10
* Bump the gevent version

gevent/gevent#1721 (comment)

* Update the workflows for JWA

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

* Update workflows for centraldb

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

* Update workflows for kfam

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

* Update workflows for notebook-controller

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

* Update workflows for PodDefaults

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

* Update workflows for Profile Controller

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

* Update workflows for pvcviewer controller

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

* Update workflows for TensorBoard Controller

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

* Update workflows for TWA

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

* Update workflows for VWA

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

* Update releasing script to include PVCViewers

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

---------

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>
kimwnasptd added a commit to kimwnasptd/kubeflow that referenced this pull request Oct 13, 2023
* ci: Build ARM images for core components

Extend the GH Actions to also build the images on ARM architectures.

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

* crud-web-apps: Update python and gevent versions

In order to successfully build on linux/arm64/v8 we'll need to:
* Update to Python 3.10
* Bump the gevent version

gevent/gevent#1721 (comment)

* Update the workflows for JWA

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

* Update workflows for centraldb

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

* Update workflows for kfam

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

* Update workflows for notebook-controller

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

* Update workflows for PodDefaults

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

* Update workflows for Profile Controller

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

* Update workflows for pvcviewer controller

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

* Update workflows for TensorBoard Controller

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

* Update workflows for TWA

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

* Update workflows for VWA

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

* Update releasing script to include PVCViewers

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

---------

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>
google-oss-prow bot pushed a commit that referenced this pull request Oct 16, 2023
* ci: Build ARM images for core components (#7220)

* ci: Build ARM images for core components

Extend the GH Actions to also build the images on ARM architectures.

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

* crud-web-apps: Update python and gevent versions

In order to successfully build on linux/arm64/v8 we'll need to:
* Update to Python 3.10
* Bump the gevent version

gevent/gevent#1721 (comment)

* Update the workflows for JWA

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

* Update workflows for centraldb

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

* Update workflows for kfam

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

* Update workflows for notebook-controller

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

* Update workflows for PodDefaults

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

* Update workflows for Profile Controller

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

* Update workflows for pvcviewer controller

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

* Update workflows for TensorBoard Controller

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

* Update workflows for TWA

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

* Update workflows for VWA

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

* Update releasing script to include PVCViewers

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

---------

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

* Run build steps for multi-arch sequentially (#7333)

* Run build steps for multi-arch sequentially

We saw that if we try to build using an ENV var containing all the
architectures then docker buildx will run 3 parallel jobs.

This made the GH runners crash in some cases because they were
over-utilizing the resources.

To mitigate this we'll sequentially build each image for each arch.

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

* review: Build first for amd64

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

---------

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

* ci: Ensure we publish for all architectures (#7343)

Previously we omitted the ARCH env var, which resulted in the last step
to not push for all architectures, but only for amd64.

This commit will set a default ENV var with all architectures, which
will be picked up by the build-push make rule we have.

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

* ci: Push the multi-arch image for specific tag (#7345)

The latest changes made the CI to only build the images for different
archs, but not push for the tag value of the commit.

While next steps, for pushing the latest tag or a release version tag
will work we still want to ensure we push an image with a tag.

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>

---------

Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants