feat: KEP 3328 Automatic Resource Configuration#3559
VassilisVassiliadis wants to merge 2 commits into
Signed-off-by: Vassilis Vassiliadis <vassilis.vassiliadis@ibm.com>
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds KEP-3328 proposal documentation describing an extensible mechanism for automatically configuring TrainJob resource requirements via external plugin controllers, with Trainer-enforced guardrails and admission gating.
Changes:
- Introduces a new KEP document covering motivation, goals/non-goals, risks, and alternatives.
- Describes the high-level protocol between Kubeflow Trainer and external plugin controllers.
- Captures integration considerations with Kueue/admission gating and JobSet mutation constraints.
| It is an AI agent powered resource recommender for TrainJobs. The platform engineers | ||
| provided the AI agent with MCP tools that can interact with the cluster to get more | ||
| information about it (e.g. the available GPUs, the running workloads, etc). |
| - **Pros:** | ||
| - **Extensible**: platform engineers can develop and deploy custom plugins without modifying Kubeflow Trainer. | ||
| - **Safe**: Trainer enforces guardrails (timeouts, quota caps) on plugin operations. | ||
| - **Observable**: `spec.runtimePatches` shows the exact fields that the plugin is mutating, |
| <!-- | ||
| What are the caveats to the proposal? | ||
| What are some important details that didn't come across above? | ||
| Go in to as much detail as necessary here. |
| which has the id `ai-agent-recommender`. | ||
| know that this has succeeded? | ||
| --> | ||
| - Automatically configure TrainJob resources before the job is eligible for admission: |
| - Integrate with Kueue so that the job is not eligible for Kueue's admission | ||
| flow while it is being auto-configured. | ||
| - Support a catalog of plugins: platform engineers can enable/disable | ||
| plugins per namespace and users can pick the one they'd like to use for their TrainJob. |
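The plugin-catalog goal above (plugins enabled per namespace, with users choosing one for their TrainJob) could be modeled roughly as follows. This is only a sketch under assumptions: the `Catalog` type and its `Resolve` method are hypothetical names, and the KEP draft does not define how the catalog is stored or queried. The plugin id `ai-agent-recommender` is the one used in the KEP text.

```go
package main

import "fmt"

// Catalog is a hypothetical in-memory view of which plugin IDs the
// platform engineers have enabled in each namespace.
type Catalog map[string]map[string]bool

// Resolve returns nil if pluginID is enabled in namespace, and an error
// otherwise. Lookups on a missing namespace are safe because indexing a
// nil map in Go yields the zero value.
func (c Catalog) Resolve(namespace, pluginID string) error {
	if !c[namespace][pluginID] {
		return fmt.Errorf("plugin %q is not enabled in namespace %q", pluginID, namespace)
	}
	return nil
}

func main() {
	catalog := Catalog{
		"team-a": {"ai-agent-recommender": true},
	}
	// The user's TrainJob in team-b asks for a plugin that is only enabled in team-a.
	fmt.Println(catalog.Resolve("team-b", "ai-agent-recommender"))
}
```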
@tenzen-y @astefanutti @andreyvelich I put together a draft of the KEP focusing on high-level information about the proposal, as well as the pros and cons of alternative approaches. Let me know what you think!
What this PR does / why we need it:
This is a draft of the KEP for automatic resource configuration (#3328).
It covers:
I added enough context to understand how this could be implemented, but I didn't go into design details. I'll fill in the rest of the text once we settle on which of the available approaches we'd like to go with.
You can find more information about my proposal here: https://docs.google.com/document/d/114Cs7rz79GD5exAiP-iOcKNNGaD6rwf5veBodEvkkxA
Which issue(s) this PR fixes
Fixes #3328
Checklist:
I authored the text and used IBM Bob to tidy it up.