New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ci: Build ARM images for core components #7220
ci: Build ARM images for core components #7220
Conversation
c1b5900
to
6a09136
Compare
I'm a beginner at Kuberflow, so I didn't know how to promote it. Thank you for your help! Have a nice day 🤩 |
6a09136
to
c9b788b
Compare
@thesuperzapper can I get a review on this ASAP? The CI is broken right now, due to the multi-arch command having a redundant comma https://github.com/kubeflow/kubeflow/actions/runs/5717274226/job/15490659878 This PR should both fix that issue and also introduce ARM support for most of core components, aside from the notebook server images. |
@kimwnasptd I want to clarify that we have actually tested the building with ARM64? Because there is no guarantee they will all work, and we probably don't want to push busted images. Also, in my experience, building complex ARM64 images under QEMU emulation can take a LONG time (and might never finish in extreme cases). In the past, I mitigated this problem using the 1: Use a separate "ci" GitHub Container Registry (which is local to the runners, and fully supports caching, unlike ECR): CI_REGISTRY_IMAGE="ghcr.io/<IMAGE_OWNER>/ci/<IMAGE_TITLE>
docker buildx build \
--cache-from=type=registry,ref=${CI_REGISTRY_IMAGE} \
--cach-to=type=registry,ref=${CI_REGISTRY_IMAGE},mode=max \
./path/to/Dockerfile 2: For builds that never finish, manually populate the cache on an ARM device (like a MacBook): CI_REGISTRY_IMAGE="ghcr.io/<IMAGE_OWNER>/ci/<IMAGE_TITLE>"
docker buildx build \
--cache-to=type=registry,ref=${CI_REGISTRY_IMAGE},mode=max \
./path/to/Dockerfile If we are going to do lots of cross-architecture builds I think we must implement the cache feature, and it should be relatively straightforward to host the "ci" image cache on GitHub Container Registry (GHCR). PS: they are a bit out of date now, but a while ago, I mirrored all the old images onto the Kubeflow Org's GHCR. We can make "ci" versions of all of them in a similar way (and possibly also push the non-ci image there too): https://github.com/orgs/kubeflow/packages |
@thesuperzapper thanks for the review!
Yes, I can confirm they can be build. I've tested all of them in an M2 MacBook and all the above components can be successfully built and run.
Regarding the cache, I'd say let's evaluate this once we have some numbers first and the performance is indeed very poor. Looking at other projects like KServe they are also building for both x86 and ARM architectures and the builds are quite snappy. https://github.com/kserve/kserve/actions/runs/5719086923/job/15496326007 So I'd suggest to move forward with this and if we see that indeed the actions never complete we can re-evaluate going down a more complicated route afterwards. |
c9b788b
to
b54ee24
Compare
/hold Let me actually fully flesh this out in a local fork, to make sure we have some numbers and we can re-evaluate |
/hold cancel @thesuperzapper indeed most the build jobs finish in less than 30 mins. So I we should be good to go for now and not require a cache. You can see a run of GH workflows in the tests from this commit of my fork kimwnasptd@1e02d1e Note that the 3 crud-web-apps are failing to build, but this is on I've verified that these three images can be built, and also tested them in my cluster. You can always check if an image for any architecture can be build locally with:
Since |
Note, I had to do a bump in the python and gevent versions in order for these images to be able to be built for ARM, due to gevent/gevent#1721 |
@kimwnasptd so are we planning to get this merged before 1.8, because I would like to! However, I see that there is now a merge conflict. |
@kimwnasptd Also, I see that in kimwnasptd@1e02d1e you are also updating the example notebook images, I cant see the update to those images in this PR, did I just miss it? |
@kimwnasptd we really need to move this forward so that Kubeflow 1.8 works on ARM! |
+1, we need the image for ARM. |
Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>
Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>
Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>
Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>
Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>
Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>
Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>
Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>
Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>
Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>
Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>
c09a781
to
9ed9e4f
Compare
@thesuperzapper ping to take a look at the current state. All the tests are passing, but I have a commit that triggered all of them which I'll drop once I get an ACK. The changes at this point are:
With the above we will always be testing on each PR that images can be built accordingly |
Had a sync with @thesuperzapper. We are OK to merge for now, although there are a couple of improvements to make afterwards, like:
@thesuperzapper I'm moving on with removing the DROP commits so that we can merge |
9ed9e4f
to
1886b01
Compare
@thesuperzapper pushed the changes. The actions for the three webapp got triggered since there is a commit for changing some versions, to make sure the apps work in ARM/ppc64le. Once the tests are good we can merge |
@thesuperzapper should be ready to merge |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: thesuperzapper The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
* ci: Build ARM images for core components Extend the GH Actions to also build the images on ARM architectures. Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> * crud-web-apps: Update python and gevent versions In order to successfully build on linux/arm64/v8 we'll need to: * Update to Python 3.10 * Bump the gevent version gevent/gevent#1721 (comment) * Update the workflows for JWA Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> * Update workflows for centraldb Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> * Update workflows for kfam Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> * Update workflows for notebook-controller Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> * Update workflows for PodDefaults Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> * Update workflows for Profile Controller Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> * Update workflows for pvcviewer controller Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> * Update workflows for TensorBoard Controller Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> * Update workflows for TWA Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> * Update workflows for VWA Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> * Update releasing script to include PVCViewers Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> --------- Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>
* ci: Build ARM images for core components Extend the GH Actions to also build the images on ARM architectures. Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> * crud-web-apps: Update python and gevent versions In order to successfully build on linux/arm64/v8 we'll need to: * Update to Python 3.10 * Bump the gevent version gevent/gevent#1721 (comment) * Update the workflows for JWA Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> * Update workflows for centraldb Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> * Update workflows for kfam Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> * Update workflows for notebook-controller Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> * Update workflows for PodDefaults Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> * Update workflows for Profile Controller Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> * Update workflows for pvcviewer controller Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> * Update workflows for TensorBoard Controller Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> * Update workflows for TWA Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> * Update workflows for VWA Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> * Update releasing script to include PVCViewers Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> --------- Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>
* ci: Build ARM images for core components (#7220) * ci: Build ARM images for core components Extend the GH Actions to also build the images on ARM architectures. Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> * crud-web-apps: Update python and gevent versions In order to successfully build on linux/arm64/v8 we'll need to: * Update to Python 3.10 * Bump the gevent version gevent/gevent#1721 (comment) * Update the workflows for JWA Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> * Update workflows for centraldb Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> * Update workflows for kfam Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> * Update workflows for notebook-controller Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> * Update workflows for PodDefaults Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> * Update workflows for Profile Controller Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> * Update workflows for pvcviewer controller Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> * Update workflows for TensorBoard Controller Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> * Update workflows for TWA Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> * Update workflows for VWA Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> * Update releasing script to include PVCViewers Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> --------- Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> * Run build steps for multi-arch sequentially (#7333) * Run build steps for multi-arch sequentially We saw that if we try to build using an ENV var containing all the architectures then docker buildx will run 3 parallel jobs. This made the GH runners crash in some cases because they were over-utilizing the resources. To mitigate this we'll sequentially build each image for each arch. Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> * review: Build first for amd64 Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> --------- Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> * ci: Ensure we publish for all architectures (#7343) Previously we omitted the ARCH env var, which resulted in the last step to not push for all architectures, but only for amd64. This commit will set a default ENV var with all architectures, which will be picked up by the build-push make rule we have. Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> * ci: Push the multi-arch image for specific tag (#7345) The latest changes made the CI to only build the images for different archs, but not push for the tag value of the commit. While next steps, for pushing the latest tag or a release version tag will work we still want to ensure we push an image with a tag. Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com> --------- Signed-off-by: Kimonas Sotirchos <kimwnasptd@gmail.com>
This PR extends the GH Actions to also build the "core" images in kubeflow/kubeflow for ARM. This is for kubeflow/manifests#2472 and #2337
The last step is to then modify the build system and makefiles for the example notebook images https://github.com/kubeflow/kubeflow/tree/master/components/example-notebook-servers