Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JobSet networking - allow setting (shared) name? #136

Closed
Tracked by #145
vsoch opened this issue May 11, 2023 · 6 comments · Fixed by #171
Closed
Tracked by #145

JobSet networking - allow setting (shared) name? #136

vsoch opened this issue May 11, 2023 · 6 comments · Fixed by #171

Comments

@vsoch
Copy link
Contributor

vsoch commented May 11, 2023

What would you like to be added:

It would be nice if two replicated jobs could be put under the same networking space.

Why is this needed:

For cases of a JobSet with two or more ReplicatedJobs that want to interact, the current default (given that networking is enabled) is to create separate headless services (and addresses) for both. This would likely work in many cases, but for our specific case, we have a broker that is expecting an entire cluster (and components from different ReplicatedJobs) to be under the same top level name (just with different beginning pod names). E.g, so this would work:

minicluster-flux-sample-broker-0-0.flux-service.flux-operator.svc.cluster.local
minicluster-flux-sample-worker-0-0.flux-service.flux-operator.svc.cluster.local

But this would not:

minicluster-flux-sample-broker-0-0.flux-broker.flux-operator.svc.cluster.local
minicluster-flux-sample-worker-0-0.flux-worker.flux-operator.svc.cluster.local

In #132 I provide an example of trying to get it working with the current design, and needing to fall back to creating my own headless service (just one) that is shared by the two ReplicatedJobs. So my suggestion is to allow this - to allow specifying the name:

Network: &jobset.Network{
    Name: "flux-service",
    EnableDNSHostnames: &enableDNSHostnames,
},

to generate the working ips (first block above) and doing this by just applying a selector label to each ReplicatedJob. In my case I used the "job-group" label instead of "job-name" because I think the latter needs to be unique.

After this implementation, it should be possible to eliminate all custom network logic from the Flux Operator PR with JobSet I'm testing, and just specify the name of the network that I want created (and shared between the jobs) and I think uniqueness of hostnames should be OK because the job names are still in there, which are required to be unique from one another.

@kannon92
Copy link
Contributor

Hey @vsoch! This sounds like a pretty useful ask. Are you interested in contributing?

API updates can be made here (https://github.com/kubernetes-sigs/jobset/blob/main/api/v1alpha1/jobset_types.go#L124) and I think you can just update this function https://github.com/kubernetes-sigs/jobset/blob/main/pkg/controllers/jobset_controller.go#L677 to use the name in the sub domain.

And then some integration tests to round this out. And no worries if you can't do it. Just wanted to see.

@vsoch
Copy link
Contributor Author

vsoch commented May 30, 2023

I would love to! ❤️ I'm chewing on some autoscaling stuff right now but I'll make some time this week.

@vsoch
Copy link
Contributor Author

vsoch commented May 31, 2023

hey @kannon92 I started on this today! I'm fairly green with testing controllers (terrible I know, but I have to start at some point I suppose!) could you point me to the test file that I should update, and can we talk about the overall strategy for the tests for this? I'm guessing something added to test/integration/controller/jobset_controller_test.go I'll see if there are others in there testing something with networking and try to mimic that.

@kannon92
Copy link
Contributor

hey @kannon92 I started on this today! I'm fairly green with testing controllers (terrible I know, but I have to start at some point I suppose!) could you point me to the test file that I should update, and can we talk about the overall strategy for the tests for this? I'm guessing something added to test/integration/controller/jobset_controller_test.go I'll see if there are others in there testing something with networking and try to mimic that.

I'm mostly on the PTO the next few weeks so I won't be able to help too much. I glanced at your PR and I think you have the right idea.

For integration tests you can't really test correctness with things like services but you can make sure that services match subdomains or something like that. The controller-runtime test library does not run all the controllers so you really only get things like creating the objects.

I think @ahg-g or @danielvegamyhre will give you a detailed review.. ;)

@vsoch
Copy link
Contributor Author

vsoch commented May 31, 2023

Enjoy your PTO! I'll probably do some sanity checks with the Flux Operator soon, because I know how I want/expect it to work there.

@vsoch
Copy link
Contributor Author

vsoch commented Jul 20, 2023

heyo! Just wanted to give quick feedback that I was able to update the hyperqueue operator to use 0.2.0 of JobSet, and the new "use this Subdomain" to get everything on the same network worked great. Thanks again for this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants