
JobSet release for ARM? #237

Closed
vsoch opened this issue Jul 24, 2023 · 19 comments · Fixed by #694
Labels
good first issue: Denotes an issue ready for a new contributor, according to the "help wanted" guidelines.
help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.

Comments

@vsoch
Contributor

vsoch commented Jul 24, 2023

I'm testing JobSet with an operator, and I saw this error in my operator:

2023-07-24T22:04:51Z	ERROR	Reconciler error	{"controller": "hyperqueue", "controllerGroup": "flux-framework.org", "controllerKind": "Hyperqueue", "Hyperqueue": {"name":"hyperqueue-sample","namespace":"flux-operator"}, "namespace": "flux-operator", "name": "hyperqueue-sample", "reconcileID": "a2514367-0003-4f54-bec9-cde6aed4365e", "error": "Internal error occurred: failed calling webhook \"mjobset.kb.io\": failed to call webhook: Post \"https://jobset-webhook-service.jobset-system.svc:443/mutate-jobset-x-k8s-io-v1alpha2-jobset?timeout=10s\": no endpoints available for service \"jobset-webhook-service\""}

Doh! I forgot that JobSet is probably built for x86!

$ kubectl logs -n jobset-system jobset-controller-manager-547857948-d9m82 
exec /manager: exec format error

Would it be possible to have an ARM build / set of manifests?

@vsoch
Contributor Author

vsoch commented Sep 11, 2023

heyo! I was wondering if this is something we could look into?

@kannon92
Contributor

I think it is doable. I believe that Kueue supports both.

Looks like it's some work in the Makefile (https://github.com/kubernetes-sigs/kueue/blob/main/Makefile) to tell it which platform to build.
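
The pattern there boils down to a buildx invocation parameterized on platform. A minimal sketch, with illustrative variable and image names (not Kueue's exact Makefile):

# Sketch of a multi-arch build target; names are hypothetical
PLATFORMS ?= linux/amd64,linux/arm64
IMG ?= registry.example.com/jobset:latest  # hypothetical image name
PUSH ?=                                    # set PUSH=--push to publish

.PHONY: image-build
image-build:
	docker buildx build --platform=$(PLATFORMS) --tag $(IMG) $(PUSH) .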

@ahg-g
Contributor

ahg-g commented Sep 12, 2023

yeah, we can support that, do you want to send a PR that follows the Kueue pattern?

@vsoch
Contributor Author

vsoch commented Sep 12, 2023

yeah, we can support that, do you want to send a PR that follows the Kueue pattern?

The lab asked me to stop contributing until we have a corporate CLA (I was using an individual one before), so I definitely could, but with some delay while that finalizes (we are pushing fairly hard for it to be signed)! If someone else wants to do it, go ahead; if not, I can once we have that.

@kannon92
Contributor

/help
/good-first-issue

@k8s-ci-robot
Contributor

@kannon92:
This request has been marked as suitable for new contributors.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

/help
/good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the good first issue and help wanted labels Sep 12, 2023
@a-hilaly
Contributor

@vsoch Thanks for the links, very insightful. QQ: Do you also use/publish manifests to group multiple image tags (*-arm, *-amd) under the same tag?

@vsoch
Contributor Author

vsoch commented Sep 22, 2023

I don't use/publish manifests - I have a similar workflow that just builds an explicit arm tag (for a push to main, or a release with the version) and then generates the equivalent CRD yaml file targeting arm. I didn't want to tangle the builds together or assume they would stay the same (aside from the architecture).

@a-hilaly
Contributor

I have a similar workflow that just builds an explicit arm tag

Thanks for the clarification. I'm bringing this up because I'm currently reading about container registry manifests, and I think it would be useful (in the context of this project) to push image tags for each of the supported platforms, plus a manifest mapping the images to their respective platforms. This way users only have to specify one tag, regardless of platform, and leave it up to the container runtime to pull the right image. I want to hear what folks think about this solution/approach. Also happy to contribute and help support this.
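
For reference, that flow with the docker manifest CLI looks roughly like this (image names and tags are hypothetical; buildx can also produce the same result in one step):

# push one image per platform (hypothetical registry/names)
$ docker push registry.example.com/jobset:v0.1.0-amd64
$ docker push registry.example.com/jobset:v0.1.0-arm64

# create one manifest list mapping the shared tag to both images, and push it
$ docker manifest create registry.example.com/jobset:v0.1.0 \
    registry.example.com/jobset:v0.1.0-amd64 \
    registry.example.com/jobset:v0.1.0-arm64
$ docker manifest push registry.example.com/jobset:v0.1.0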

@vsoch
Contributor Author

vsoch commented Sep 22, 2023

I've developed container registries and contributed to OCI, so I'm familiar. My reasons are primarily the drastic differences in build times. The typical approach is to use this action to build and push both at once (same manifest), and that's not reasonable to do within the time of a PR: https://drpdishant.medium.com/multi-arch-images-with-docker-buildx-and-qemu-141e0b6161e7. That said, it would work fine to do both like that on merge to main and on release, and just one for a PR. It's up to you how you want to maintain your images and CRDs - I want to keep them separate. I don't think there is a right choice, just different preferences.
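
For context, that action combo looks roughly like this in a workflow (a sketch; the image name is hypothetical):

# Sketch: multi-arch build, pushing only outside of PRs
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-qemu-action@v3     # arm emulation on an x86 runner
      - uses: docker/setup-buildx-action@v3
      - uses: docker/build-push-action@v5
        with:
          platforms: linux/amd64,linux/arm64  # both land under one manifest
          push: ${{ github.event_name != 'pull_request' }}
          tags: registry.example.com/jobset:latest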

@a-hilaly
Contributor

Thanks for clarifying again! I have a few questions to fill in the missing pieces in my mind:

  • How do CRDs come into the game here? My understanding is that this is about building controller binaries and shipping container images.
  • If I understand correctly, building and pushing images separately for each platform is faster than using docker buildx build/push --platform=$(SOME_PLATFORM)?

@a-hilaly
Contributor

Maybe I brought some confusion with my initial message, but the manifest I'm talking about is the "container registry manifest" (https://docs.docker.com/engine/reference/commandline/manifest/), not the released manifest.yaml file (which contains the CRDs as well).

@vsoch
Contributor Author

vsoch commented Sep 22, 2023

How do CRDs come into the game here? My understanding is that this is about building controller binaries and shipping container images.

A container image for a controller is directly tied to a CRD. If the metadata there is off or wrong, it won't install properly. E.g., here is the container reference in the manifest my users apply (that single file) to do the entire deployment: https://github.com/flux-framework/flux-operator/blob/76c4754f90043dfc7afecc16ade053549c89ba18/examples/dist/flux-operator.yaml#L1398. The chart is just a more separated variant of that, where the version (or tag) is filled in with a variable; e.g., here is the same reference: https://github.com/flux-framework/flux-operator/blob/76c4754f90043dfc7afecc16ade053549c89ba18/chart/templates/deployment.yaml#L58
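
To make that concrete, here are the two styles side by side (a trimmed, hypothetical sketch, not the actual flux-operator files):

# Generated install manifest: the arch is baked into the image tag
containers:
  - name: manager
    image: ghcr.io/example/operator:v0.1.0-arm

# Chart variant: the tag is templated instead
containers:
  - name: manager
    image: "{{ .Values.image }}:{{ .Values.tag }}"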

If I understand correctly, building and pushing images separately for each platform is faster than using docker buildx build/push --platform=$(SOME_PLATFORM)?

If you use the buildx action and give it two architectures, it's going to take (I think) the sum of both, or at best the time of the slower one (if it can somehow build in parallel without losing efficiency). In my operator's case, the build times are 6 minutes for x86 and 33 for arm. And (in the scope of things I've built with QEMU and arm) that is really speedy; if you have to do more compilation it can take hours (or run past action limits). Here are the two paired, which happens on merge to main / releases: https://github.com/flux-framework/flux-operator/actions/runs/6137092811

@vsoch
Contributor Author

vsoch commented Sep 22, 2023

Maybe I brought some confusion with my initial message, but the manifest I'm talking about is the "container registry manifest" (https://docs.docker.com/engine/reference/commandline/manifest/), not the released manifest.yaml file (which contains the CRDs as well).

No, I understand. You want to pull one digest/tag and have the right platform chosen automatically. That happens by packaging them together into the same OCI image manifest. That is separate from the CRD yaml manifests (which I tried to distinguish by name). My other point is that the second set of manifests (for the operator) can have subtle differences that would make sharing one OCI image manifest across platforms potentially problematic. I don't have a good example yet, but a simple one might be that some of the clouds expect a pod selector for arm, which we'd need in one CRD manifest for arm but not the other.
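
Concretely, that shared packaging is an OCI image index: one tag pointing at per-platform image manifests. A trimmed sketch (digests and sizes are fake):

{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.index.v1+json",
  "manifests": [
    { "mediaType": "application/vnd.oci.image.manifest.v1+json",
      "digest": "sha256:aaa...",
      "size": 1234,
      "platform": { "architecture": "amd64", "os": "linux" } },
    { "mediaType": "application/vnd.oci.image.manifest.v1+json",
      "digest": "sha256:bbb...",
      "size": 1234,
      "platform": { "architecture": "arm64", "os": "linux" } }
  ]
}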

@vsoch
Contributor Author

vsoch commented Sep 22, 2023

To be explicitly clear, you can find a good strategy to put both platforms in the same manifest and allow the OCI registry to choose. The complexity, I think, is ensuring that the builds happen in a reasonable time in testing, and, if you can do that, ensuring that (forever) there are no differences in the operator manifests that would make having just one the wrong approach. As I already said, I don't think there is a right answer; it's a choice depending on the needs / use cases for your operator and your choices for CI.

@a-hilaly
Contributor

I understand that CRDs can be tightly coupled with the controller version (e.g., when it comes to the controller's reconciliation logic). Is it accurate to say that the main issue revolves around ensuring compatibility between CRDs (in terms of versioning) and container images, preventing them from diverging?

Additionally, in the case of this project, could there be scenarios where different platforms require distinct "CRD yaml definitions" to accommodate platform-specific requirements?

@vsoch
Contributor Author

vsoch commented Sep 22, 2023

Additionally, in the case of this project, could there be scenarios where different platforms require distinct "CRD yaml definitions" to accommodate platform-specific requirements?

Yes, that's why I have two :) And as I mentioned:

I don't have a good example yet, but a simple one might be that some of the clouds expect a pod selector for arm, which we'd need in one CRD manifest for arm but not the other.

@danielvegamyhre
Contributor

I think it is doable. I believe that Kueue supports both.

Looks like it's some work in the Makefile (https://github.com/kubernetes-sigs/kueue/blob/main/Makefile) to tell it which platform to build.

Reviving this issue since I think it would be useful to do this, if anyone wants to give it a shot :)
