chore: Specialized Trainers Proposal #286

Closed
szaher wants to merge 1 commit into kubeflow:main from szaher:KEP-285
@szaher (Member) commented Feb 11, 2026

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):

Fixes #

Checklist:

  • Docs included if any changes are user facing

Signed-off-by: Saad Zaher <szaher@redhat.com>
Copilot AI review requested due to automatic review settings February 11, 2026 19:13
@google-oss-prow (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign andreyvelich for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@szaher szaher changed the title Specialized Trainers Proposal docs: Specialized Trainers Proposal Feb 11, 2026
@szaher szaher changed the title docs: Specialized Trainers Proposal feat: Specialized Trainers Proposal Feb 11, 2026
@szaher szaher changed the title feat: Specialized Trainers Proposal chore: Specialized Trainers Proposal Feb 11, 2026
@szaher szaher closed this Feb 11, 2026
@szaher szaher deleted the KEP-285 branch February 11, 2026 19:15
@coveralls

Pull Request Test Coverage Report for Build 21919346642

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 67.948%

Totals
Change from base Build 21919105330: 0.0%
Covered Lines: 2828
Relevant Lines: 4162

💛 - Coveralls

Copilot AI (Contributor) left a comment

Pull request overview

This PR introduces a comprehensive design proposal for specialized trainer abstractions in the Kubeflow SDK. The proposal addresses current limitations in the SDK's trainer subsystem by introducing framework-aware trainer classes and a dedicated runtime configuration system.

Changes:

  • Proposes a BaseTrainer abstract interface enabling polymorphic handling of trainers across the SDK
  • Defines Tier 1 framework-specific trainers (TorchTrainer, MPITrainer, JAXTrainer, XGBoostTrainer) with automatic runtime discovery
  • Introduces RuntimeConfig dataclass to separate runtime environment settings from training logic
  • Provides comprehensive design details including auto-discovery, validation, backward compatibility, test plans, and implementation phases
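As a rough illustration of the structure summarized above (not code from the PR), the sketch below shows how an abstract BaseTrainer, a framework-specific TorchTrainer, and a separate RuntimeConfig dataclass could fit together. Only the three class names come from the overview; the `framework()` method, the `func` and `runtime_config` fields, the `packages` and `env` fields, and the `frameworks_of` helper are hypothetical names chosen for the example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class RuntimeConfig:
    """Runtime environment settings, kept apart from training logic."""

    # Hypothetical fields, used only for illustration.
    packages: list[str] = field(default_factory=list)
    env: dict[str, str] = field(default_factory=dict)


class BaseTrainer(ABC):
    """Common interface so callers can handle every trainer uniformly."""

    @abstractmethod
    def framework(self) -> str:
        """Return a framework label (hypothetical hook for runtime lookup)."""


@dataclass
class TorchTrainer(BaseTrainer):
    """Framework-specific trainer; the fields are assumptions."""

    func: Callable[..., None]
    runtime_config: Optional[RuntimeConfig] = None

    def framework(self) -> str:
        return "torch"


def frameworks_of(trainers: list[BaseTrainer]) -> list[str]:
    """Polymorphic use: works for any BaseTrainer subclass."""
    return [trainer.framework() for trainer in trainers]
```

Because `BaseTrainer` declares an abstract method, instantiating it directly raises `TypeError`, while any concrete subclass can be passed to code written against the base interface.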


from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Callable, Optional

Copilot AI Feb 11, 2026

Missing import for ClassVar, which is used on line 240. Add from typing import ClassVar, or include ClassVar in the existing typing import.

Suggested change
- from typing import Callable, Optional
+ from typing import Callable, Optional, ClassVar
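
For context on why the import matters, here is a minimal sketch of how `ClassVar` behaves inside a dataclass. The class and attribute names (`ExampleTrainer`, `supported_frameworks`, `num_nodes`) are hypothetical and do not come from the proposal. An attribute annotated with `ClassVar` is treated as a class-level attribute and is excluded from the generated dataclass fields; without the import, evaluating the annotation at class-definition time raises `NameError` (unless annotation evaluation is postponed).

```python
from dataclasses import dataclass, fields
from typing import ClassVar


@dataclass
class ExampleTrainer:
    # Class-level attribute: shared by all instances, not an __init__ argument.
    supported_frameworks: ClassVar[tuple[str, ...]] = ("torch",)
    # Regular dataclass field: becomes an __init__ argument.
    num_nodes: int = 1


# fields() lists only real dataclass fields, so the ClassVar is skipped.
field_names = [f.name for f in fields(ExampleTrainer)]
```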

Copilot uses AI. Check for mistakes.
index_urls: list[str] = field(
default_factory=lambda: list(constants.DEFAULT_PIP_INDEX_URLS)
)
quiet: bool = True

Copilot AI Feb 11, 2026

The quiet field has a default value of True, but the guideline notes that booleans should default to False unless there's a specific reason. Consider whether suppressing pip output by default is the desired behavior, or if it should default to False for better debugging visibility.
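
To make the trade-off concrete, the sketch below shows one way the two fields in the snippet could feed a pip invocation. It is an assumption-laden illustration, not the proposal's implementation: the `PipOptions` class name, the `pip_install_args` helper, and the placeholder value of `DEFAULT_PIP_INDEX_URLS` are all hypothetical. Only the `index_urls` and `quiet` field names and the `default_factory` pattern come from the reviewed code; `--quiet`, `--index-url`, and `--extra-index-url` are existing pip options.

```python
from dataclasses import dataclass, field

# Placeholder standing in for constants.DEFAULT_PIP_INDEX_URLS (assumed value).
DEFAULT_PIP_INDEX_URLS = ("https://pypi.org/simple",)


@dataclass
class PipOptions:
    # default_factory copies the constant so each instance owns its list.
    index_urls: list[str] = field(
        default_factory=lambda: list(DEFAULT_PIP_INDEX_URLS)
    )
    # Shown with the False default suggested in the comment above.
    quiet: bool = False


def pip_install_args(packages: list[str], options: PipOptions) -> list[str]:
    """Build a pip command line (hypothetical helper)."""
    args = ["pip", "install"]
    if options.quiet:
        args.append("--quiet")  # suppresses most of pip's output
    if options.index_urls:
        args += ["--index-url", options.index_urls[0]]
        for url in options.index_urls[1:]:
            args += ["--extra-index-url", url]
    return args + list(packages)
```

With `quiet=False`, pip's normal output stays visible in the job logs, which is the debugging benefit the comment refers to; callers who want less noise can still opt in with `PipOptions(quiet=True)`.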

Copilot generated this review using guidance from repository custom instructions.
Comment on lines +473 to +475
"""Trainer for HuggingFace Transformers training.

Wraps HuggingFace's Trainer API and maps to a PyTorch runtime.

Copilot AI Feb 11, 2026

Inconsistent spelling of "Hugging Face". The official company/product name is "Hugging Face" (two words), not "HuggingFace". This appears in multiple locations including class names, comments, and documentation. While HuggingFaceTrainer as a class name (PascalCase without spaces) is acceptable, the documentation text should use "Hugging Face" (two words).

Suggested change
- """Trainer for HuggingFace Transformers training.
- Wraps HuggingFace's Trainer API and maps to a PyTorch runtime.
+ """Trainer for Hugging Face Transformers training.
+ Wraps Hugging Face's Trainer API and maps to a PyTorch runtime.

Comment on lines +473 to +475
"""Trainer for HuggingFace Transformers training.

Wraps HuggingFace's Trainer API and maps to a PyTorch runtime.

Copilot AI Feb 11, 2026

Inconsistent spelling of "Hugging Face". The official company/product name is "Hugging Face" (two words), not "HuggingFace". Use "Hugging Face" in documentation text.

Suggested change
- """Trainer for HuggingFace Transformers training.
- Wraps HuggingFace's Trainer API and maps to a PyTorch runtime.
+ """Trainer for Hugging Face Transformers training.
+ Wraps Hugging Face's Trainer API and maps to a PyTorch runtime.

3. **Deprecating `CustomTrainer` or `BuiltinTrainer`.** Both remain supported.
Specialized trainers are an additional option, not a replacement.
4. **Tier 2 trainer implementations.** This proposal defines the extension mechanism
and interface. Concrete Tier 2 implementations (HuggingFace, DeepSpeed, Unsloth,

Copilot AI Feb 11, 2026

Inconsistent spelling of "Hugging Face". The official company/product name is "Hugging Face" (two words), not "HuggingFace". Use "Hugging Face" in documentation text.

Suggested change
- and interface. Concrete Tier 2 implementations (HuggingFace, DeepSpeed, Unsloth,
+ and interface. Concrete Tier 2 implementations (Hugging Face, DeepSpeed, Unsloth,

3 participants