Synchronize batch/job.completions with parallelism in case of partial admission #970

Closed
3 tasks
Tracked by #974
trasc opened this issue Jul 11, 2023 · 6 comments · Fixed by #1014
Labels
kind/feature: Categorizes issue or PR as related to a new feature.

Comments

@trasc
Contributor

trasc commented Jul 11, 2023

What would you like to be added:
Synchronize batch/job.completions with parallelism in case of partial admission
Why is this needed:

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.
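
For illustration, here is a minimal sketch in Go of what synchronizing completions with parallelism could look like when a Job is only partially admitted. The package and helper names are hypothetical, not Kueue's actual code; note that mutating completions on an existing Job is only supported on recent Kubernetes versions, which appears to be why the comments below discuss a k8s version check.

```go
package jobsync

import (
	batchv1 "k8s.io/api/batch/v1"
)

// syncCounts is a hypothetical helper: when Kueue admits a Job with fewer
// pods than requested (partial admission), keep completions in lockstep
// with the reduced parallelism so the Job can still finish.
func syncCounts(job *batchv1.Job, admittedParallelism int32) {
	if job.Spec.Completions != nil && job.Spec.Parallelism != nil &&
		*job.Spec.Completions == *job.Spec.Parallelism {
		c := admittedParallelism
		job.Spec.Completions = &c
	}
	p := admittedParallelism
	job.Spec.Parallelism = &p
}
```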

trasc added the kind/feature label on Jul 11, 2023
@trasc
Contributor Author

trasc commented Jul 11, 2023

#971 has a partial implementation for this, but it is missing the k8s version check.

For the version check I see two possibilities:

  1. Get the version of the API server using client-go discovery.
  2. Get the node name via the downward API, fetch the node object from the cluster, and extract the kubelet version.

With 2. we can consider the version to be the same for the full duration of the execution.
With 1. we might be required to get the version multiple times, since we cannot guarantee that the API server is updated before the kubelet managing Kueue's controller-manager.
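
A rough sketch of option 1, assuming a standard in-cluster client-go setup (the function name is illustrative):

```go
package versioncheck

import (
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

// apiServerVersion queries the kube-apiserver for its version via the
// client-go discovery client. This is the call that may need to be
// repeated, since the cluster can be upgraded while Kueue keeps running.
func apiServerVersion() (string, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return "", err
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		return "", err
	}
	info, err := dc.ServerVersion()
	if err != nil {
		return "", err
	}
	return info.GitVersion, nil
}
```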

cc: @alculquicondor

@alculquicondor
Contributor

2 doesn't work, because the node version could be different from the control plane version. And validation happens in the control plane.

So it has to be 1. However, the API call should probably be done once at startup.

@trasc
Contributor Author

trasc commented Jul 11, 2023

2 doesn't work, because the node version could be different from the control plane version. And validation happens in the control plane.

So it has to be 1. However, the API call should probably be done once at startup.

The problem is that there is no guarantee that Kueue's controller-manager is restarted when the control plane version changes.

We could watch for nodes with the node-role.kubernetes.io/control-plane label, and try to resolve potential conflicts when multiple are present. This way we are also using the controller-runtime cache.

@alculquicondor
Contributor

We could watch for nodes with the node-role.kubernetes.io/control-plane label, and try to resolve potential conflicts when multiple are present. This way we are also using the controller-runtime cache.

This is not always possible. In multiple cloud providers, the control-plane VMs are not visible in the API.

@alculquicondor
Contributor

On the other hand, a query every 10 minutes sounds like an acceptable compromise.
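
A minimal sketch of that compromise, assuming a discovery client and apimachinery's version utilities (the type and method names are illustrative, not Kueue's actual implementation): refresh the cached server version on a fixed interval and expose a thread-safe minimum-version check.

```go
package versioncheck

import (
	"context"
	"sync"
	"time"

	utilversion "k8s.io/apimachinery/pkg/util/version"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/discovery"
)

type serverVersionFetcher struct {
	dc      discovery.DiscoveryInterface
	mu      sync.RWMutex
	version *utilversion.Version
}

// Start refreshes the cached kube-apiserver version every 10 minutes
// until the context is cancelled.
func (f *serverVersionFetcher) Start(ctx context.Context) {
	wait.UntilWithContext(ctx, func(ctx context.Context) {
		info, err := f.dc.ServerVersion()
		if err != nil {
			return // keep the last known version on transient errors
		}
		if v, err := utilversion.ParseGeneric(info.GitVersion); err == nil {
			f.mu.Lock()
			f.version = v
			f.mu.Unlock()
		}
	}, 10*time.Minute)
}

// AtLeast reports whether the last fetched version is at least min,
// e.g. f.AtLeast("1.27") before mutating Job completions (illustrative threshold).
func (f *serverVersionFetcher) AtLeast(min string) bool {
	f.mu.RLock()
	defer f.mu.RUnlock()
	return f.version != nil && f.version.AtLeast(utilversion.MustParseGeneric(min))
}
```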

@stuton
Contributor

stuton commented Jul 13, 2023

/assign
