feat: KEP 3328 Automatic Resource Configuration #3559

Open
VassilisVassiliadis wants to merge 2 commits into kubeflow:master from VassilisVassiliadis:kep-3328

@VassilisVassiliadis
Contributor

What this PR does / why we need it:

This is a draft of the KEP for automatic resource configuration (#3328).

It covers:

  • motivation
  • user story
  • pros and cons of the approach
  • pros and cons of alternative approaches

I added enough context to understand how this could be implemented, but I didn't go into design details. I'll fill in the rest of the text once we settle on which of the available approaches to pursue.

You can find more information about my proposal here: https://docs.google.com/document/d/114Cs7rz79GD5exAiP-iOcKNNGaD6rwf5veBodEvkkxA

Which issue(s) this PR fixes
Fixes #3328

Checklist:

  • Docs included if any changes are user facing

I authored the text and used IBM Bob to tidy it up.

Signed-off-by: Vassilis Vassiliadis <vassilis.vassiliadis@ibm.com>
Copilot AI review requested due to automatic review settings May 28, 2026 14:35
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign electronic-waste for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds KEP-3328 proposal documentation describing an extensible mechanism for automatically configuring TrainJob resource requirements via external plugin controllers, with Trainer-enforced guardrails and admission gating.

Changes:

  • Introduces a new KEP document covering motivation, goals/non-goals, risks, and alternatives.
  • Describes the high-level protocol between Kubeflow Trainer and external plugin controllers.
  • Captures integration considerations with Kueue/admission gating and JobSet mutation constraints.
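The "Trainer-enforced guardrails" mentioned in this overview can be pictured with a minimal sketch. Everything here is an assumption for illustration: the names `Guardrails`, `Recommendation`, and `apply_guardrails` are hypothetical and do not come from the KEP, and clamping to a cap (rather than rejecting the recommendation) is only one possible policy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Recommendation:
    """Resources a plugin proposes for a TrainJob (hypothetical shape)."""
    gpus_per_node: int
    num_nodes: int


@dataclass(frozen=True)
class Guardrails:
    """Limits the platform enforces on plugin output (hypothetical shape)."""
    max_gpus_per_node: int
    max_nodes: int
    plugin_timeout_seconds: float


def apply_guardrails(
    rec: Optional[Recommendation],
    defaults: Recommendation,
    guardrails: Guardrails,
    elapsed_seconds: float,
) -> Recommendation:
    """Return the resources to use after enforcing timeout and quota caps."""
    # Assumed policy: a missing or late answer falls back to the defaults.
    if rec is None or elapsed_seconds > guardrails.plugin_timeout_seconds:
        return defaults
    # Assumed policy: over-quota values are clamped rather than rejected.
    return Recommendation(
        gpus_per_node=max(0, min(rec.gpus_per_node, guardrails.max_gpus_per_node)),
        num_nodes=max(1, min(rec.num_nodes, guardrails.max_nodes)),
    )
```

Whether an over-quota recommendation should be clamped, rejected, or surfaced as an error on the TrainJob is a design decision the KEP would need to state; the sketch only fixes one option to make the idea concrete.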

Comment on lines +160 to +162
It is an AI-agent-powered resource recommender for TrainJobs. The platform engineers
provided the AI agent with MCP tools that can interact with the cluster to get more
information about it (e.g. the available GPUs, the running workloads, etc.).
- **Pros:**
  - **Extensible**: platform engineers can develop and deploy custom plugins without modifying Kubeflow Trainer.
  - **Safe**: Trainer enforces guardrails (timeouts, quota caps) on plugin operations.
  - **Observable**: `spec.runtimePatches` shows the exact fields that the plugin is mutating,
<!--
What are the caveats to the proposal?
What are some important details that didn't come across above?
Go in to as much detail as necessary here.
which has the id `ai-agent-recommender`.

It is an AI-agent-powered resource recommender for TrainJobs. The platform engineers
provided the AI agent with MCP tools that can interact with the cluster to get more
know that this has succeeded?
-->

- Automatically configure TrainJob resources before the job is eligible for admission:
  - Integrate with Kueue so that the job is not eligible for Kueue's admission
    flow while it is being auto-configured.
- Support a catalog of plugins: platform engineers can enable/disable
plugins per namespace and users can pick the one they'd like to use for their TrainJob.
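The plugin-catalog goal quoted above (platform engineers enable plugins per namespace, users pick one per TrainJob) could be checked roughly as follows. This is a sketch under assumptions: `resolve_plugin` and the catalog layout are hypothetical, the namespace names are made up, and only the plugin id `ai-agent-recommender` is taken from the quoted KEP text.

```python
from typing import AbstractSet, Mapping, Optional


def resolve_plugin(
    catalog: Mapping[str, AbstractSet[str]],
    namespace: str,
    requested: Optional[str],
) -> Optional[str]:
    """Return the plugin id to run for a TrainJob, or None if none was requested.

    `catalog` maps a namespace to the plugin ids enabled there
    (hypothetical layout, not defined by the KEP).
    """
    if requested is None:
        # Assumed behavior: auto-configuration is opt-in per TrainJob.
        return None
    enabled = catalog.get(namespace, frozenset())
    if requested not in enabled:
        raise PermissionError(
            f"plugin {requested!r} is not enabled in namespace {namespace!r}"
        )
    return requested


# Example catalog with made-up namespaces.
catalog = {"team-a": {"ai-agent-recommender"}, "team-b": set()}
selected = resolve_plugin(catalog, "team-a", "ai-agent-recommender")
```

A request for a plugin that is not enabled in the TrainJob's namespace raises an error here; the KEP may instead choose to report this through a status condition.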
@VassilisVassiliadis
Contributor Author

@tenzen-y @astefanutti @andreyvelich I put together a draft of the KEP that focuses on high-level information about the proposal, as well as the pros and cons of alternative approaches.

Let me know what you think!

Signed-off-by: Vassilis Vassiliadis <vassilis.vassiliadis@ibm.com>

Development

Successfully merging this pull request may close these issues.

KEP Automatic configuration of GPU requests for TrainJobs

2 participants