Update a few args
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
andreyvelich committed Apr 19, 2024
1 parent 9b46305 commit b07dd48
Showing 1 changed file with 12 additions and 11 deletions.
sdk/python/kubeflow/training/api/training_client.py
@@ -116,22 +116,23 @@ def train(
 Trainer to fine-tune LLM. Your cluster should support PVC with ReadOnlyMany access mode
 to distribute data across PyTorchJob workers.
 
-It uses `torchrun` CLI to fine-tune model in distributed mode across multiple PyTorchJob
-workers. Follow this guide to know more about `torchrun`: https://pytorch.org/docs/stable/elastic/run.html
+It uses `torchrun` CLI to fine-tune model in distributed mode with multiple PyTorchJob
+workers. Follow this guide to know more about `torchrun` CLI:
+https://pytorch.org/docs/stable/elastic/run.html
 
 This feature is in alpha stage and Kubeflow community is looking for your feedback.
 Please use #kubeflow-training-operator Slack channel or Kubeflow Training Operator GitHub
 for your questions or suggestions.
 
 Args:
     name: Name of the PyTorchJob.
-    namespace: Namespace for the Job. By default namespace is taken from
+    namespace: Namespace for the PyTorchJob. By default namespace is taken from
         `TrainingClient` object.
-    num_workers: Number of PyTorchJob worker replicas for the Job.
+    num_workers: Number of PyTorchJob workers.
     num_procs_per_worker: Number of processes per PyTorchJob worker for `torchrun` CLI.
-        You can use this parameter if you use more than 1 GPU per PyTorchJob worker.
+        You can use this parameter if you want to use more than 1 GPU per PyTorchJob worker.
     resources_per_worker: A parameter that lets you specify how much
-        resources each Worker container should have. You can either specify a
+        resources each PyTorchJob worker container should have. You can either specify a
         kubernetes.client.V1ResourceRequirements object (documented here:
         https://github.com/kubernetes-client/python/blob/master/kubernetes/docs/V1ResourceRequirements.md)
         or a dictionary that includes one or more of the following keys:
@@ -151,21 +152,21 @@ def train(
         of GPU, pass in a V1ResourceRequirement instance instead, since it's
         more flexible. This parameter is optional and defaults to None.
     model_provider_parameters: Parameters for the model provider in the Storage Initializer.
-        For example, HuggingFace model name and Transformer with this type:
-        AutoModelForSequenceClassification. This parameter must be the type of
+        For example, HuggingFace model name and Transformer type for that model, like:
+        AutoModelForSequenceClassification. This argument must be the type of
         `kubeflow.storage_initializer.hugging_face.HuggingFaceModelParams`
     dataset_provider_parameters: Parameters for the dataset provider in the
         Storage Initializer. For example, name of the HuggingFace dataset or
-        AWS S3 configuration. These parameters must be the type of
+        AWS S3 configuration. This argument must be the type of
         `kubeflow.storage_initializer.hugging_face.HuggingFaceDatasetParams` or
         `kubeflow.storage_initializer.s3.S3DatasetParams`
     trainer_parameters: Parameters for LLM Trainer that will fine-tune pre-trained model
         with the given dataset. For example, LoRA config for parameter-efficient fine-tuning
         and HuggingFace training arguments like optimizer or number of training epochs.
-        These parameters must be the type of
+        This argument must be the type of
         `kubeflow.storage_initializer.HuggingFaceTrainerParams`
     storage_config: Configuration for Storage Initializer PVC to download pre-trained model
-        and dataset.
+        and dataset. You can configure PVC size and storage class name in this argument.
 """
 try:
     import peft
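For context, here is a minimal sketch of how the parameters documented above fit together. It follows the usage pattern from the Kubeflow fine-tuning docs; the model name, dataset, resource values, and LoRA/TrainingArguments settings are illustrative assumptions, not part of this commit.

# Minimal sketch of TrainingClient().train(); all values are illustrative assumptions.
import transformers
from peft import LoraConfig
from kubeflow.training import TrainingClient
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceModelParams,
    HuggingFaceDatasetParams,
    HuggingFaceTrainerParams,
)

# namespace is omitted, so it is taken from the TrainingClient object.
TrainingClient().train(
    name="fine-tune-bert",           # Name of the PyTorchJob.
    num_workers=2,                   # Number of PyTorchJob workers.
    num_procs_per_worker=1,          # Processes per worker for the `torchrun` CLI.
    # Either a dict (as here) or a kubernetes.client.V1ResourceRequirements object.
    resources_per_worker={"gpu": 1, "cpu": 4, "memory": "16Gi"},
    # Model for the Storage Initializer to download.
    model_provider_parameters=HuggingFaceModelParams(
        model_uri="hf://google-bert/bert-base-cased",
        transformer_type=transformers.AutoModelForSequenceClassification,
    ),
    # Dataset for the Storage Initializer to download.
    dataset_provider_parameters=HuggingFaceDatasetParams(repo_id="imdb"),
    # LoRA config and HuggingFace training arguments for the LLM Trainer.
    trainer_parameters=HuggingFaceTrainerParams(
        training_parameters=transformers.TrainingArguments(
            output_dir="test_trainer",
            num_train_epochs=1,
        ),
        lora_config=LoraConfig(r=8, lora_alpha=8, lora_dropout=0.1, bias="none"),
    ),
    # Storage Initializer PVC; size and storage class name are configurable here.
    storage_config={"size": "10Gi", "storage_class": None},
)

The PVC created from storage_config is shared between the Storage Initializer and the workers, which is why the cluster must support the ReadOnlyMany access mode mentioned at the top of the docstring.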
