diff --git a/content/en/_index.html b/content/en/_index.html
index 0cc2184faa..518dd553df 100644
--- a/content/en/_index.html
+++ b/content/en/_index.html
@@ -124,7 +124,7 @@
AutoML
Model Training

- Kubeflow Training Operator is a unified interface for model training on Kubernetes.
+ Kubeflow Training Operator is a unified interface for model training and fine-tuning on Kubernetes. It runs scalable and distributed training jobs for popular frameworks including PyTorch, TensorFlow, MPI, MXNet, PaddlePaddle, and XGBoost.

diff --git a/content/en/docs/components/training/explanation/_index.md b/content/en/docs/components/training/explanation/_index.md
new file mode 100644
index 0000000000..bc2e4865e1
--- /dev/null
+++ b/content/en/docs/components/training/explanation/_index.md
@@ -0,0 +1,5 @@
++++
+title = "Explanation"
+description = "Explanation for Training Operator Features"
+weight = 60
++++
diff --git a/content/en/docs/components/training/explanation/fine-tuning.md b/content/en/docs/components/training/explanation/fine-tuning.md
new file mode 100644
index 0000000000..4e565f1368
--- /dev/null
+++ b/content/en/docs/components/training/explanation/fine-tuning.md
@@ -0,0 +1,63 @@
++++
+title = "LLM Fine-Tuning with Training Operator"
+description = "Why Training Operator needs fine-tuning API"
+weight = 10
++++
+
+{{% alert title="Warning" color="warning" %}}
+This feature is in **alpha** stage and the Kubeflow community is looking for your feedback. Please
+share your experience using the [#kubeflow-training-operator Slack channel](https://kubeflow.slack.com/archives/C985VJN9F)
+or [Kubeflow Training Operator GitHub](https://github.com/kubeflow/training-operator/issues/new).
+{{% /alert %}}
+
+This page explains how the [Training Operator fine-tuning API](/docs/components/training/user-guides/fine-tuning)
+fits into the Kubeflow ecosystem.
+
+In the rapidly evolving landscape of machine learning (ML) and artificial intelligence (AI),
+the ability to fine-tune pre-trained models represents a significant leap towards achieving custom
+solutions with less effort and time. Fine-tuning allows practitioners to adapt large language models
+(LLMs) like BERT or GPT to their specific needs by training these models on custom datasets.
+This process maintains the model's architecture and learned parameters while making it more relevant
+to particular applications. Whether you're working in natural language processing (NLP),
+image classification, or another ML domain, fine-tuning can drastically improve the performance and
+applicability of pre-existing models to new datasets and problems.
+
+## Why Does the Training Operator Fine-Tune API Matter?
+
+The introduction of the Fine-Tune API in the Training Operator Python SDK is a game-changer for ML
+practitioners operating within the Kubernetes ecosystem. Historically, the Training Operator has
+streamlined the orchestration of ML workloads on Kubernetes, making distributed training more
+accessible. However, fine-tuning tasks often require extensive manual intervention, including the
+configuration of training environments and the distribution of data across nodes. The Fine-Tune API
+aims to simplify this process, offering an easy-to-use Python interface that abstracts away the
+complexity involved in setting up and executing fine-tuning tasks on distributed systems.
+
+## The Rationale Behind Kubeflow's Fine-Tune API
+
+Implementing the Fine-Tune API within the Training Operator is a logical step in enhancing the
+platform's capabilities. By providing this API, the Training Operator not only simplifies the user
+experience for ML practitioners but also leverages its existing infrastructure for distributed
+training. This approach aligns with Kubeflow's mission to democratize distributed ML training,
+making it more accessible and less cumbersome for users. The API facilitates a seamless transition
+from model development to deployment, supporting the fine-tuning of LLMs on custom datasets without
+the need for extensive manual setup or specialized knowledge of Kubernetes internals.
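+
+For a sense of what this looks like in practice, the condensed sketch below fine-tunes a BERT model
+with a single `train()` call. It is only an illustration of the interface shape; see the
+[fine-tuning user guide](/docs/components/training/user-guides/fine-tuning) for the full, annotated
+example and its prerequisites.
+
+```python
+import transformers
+from peft import LoraConfig
+
+from kubeflow.training import TrainingClient
+from kubeflow.storage_initializer.hugging_face import (
+    HuggingFaceModelParams,
+    HuggingFaceDatasetParams,
+    HuggingFaceTrainerParams,
+)
+
+TrainingClient().train(
+    name="fine-tune-bert",
+    # Pre-trained model to download and adapt.
+    model_provider_parameters=HuggingFaceModelParams(
+        model_uri="hf://google-bert/bert-base-cased",
+        transformer_type=transformers.AutoModelForSequenceClassification,
+    ),
+    # Dataset to fine-tune the model on.
+    dataset_provider_parameters=HuggingFaceDatasetParams(
+        repo_id="yelp_review_full",
+        split="train[:3000]",
+    ),
+    # HuggingFace Trainer and LoRA configuration.
+    trainer_parameters=HuggingFaceTrainerParams(
+        training_parameters=transformers.TrainingArguments(
+            output_dir="test_trainer",
+            save_strategy="no",
+        ),
+        lora_config=LoraConfig(r=8, lora_alpha=8, lora_dropout=0.1, bias="none"),
+    ),
+    # Distributed PyTorch workers and the resources each worker gets.
+    num_workers=2,
+    num_procs_per_worker=1,
+    resources_per_worker={"gpu": 1, "cpu": 2, "memory": "8G"},
+)
+```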
+
+## Roles and Interests
+
+Different user personas can benefit from this feature:
+
+- **MLOps Engineers:** Can leverage this API to automate and streamline the setup and execution of
+  fine-tuning tasks, reducing operational overhead.
+
+- **Data Scientists:** Can focus more on model experimentation and less on the logistical aspects of
+  distributed training, speeding up the iteration cycle.
+
+- **Business Owners:** Can expect quicker turnaround times for tailored ML solutions, enabling faster
+  response to market needs or operational challenges.
+
+- **Platform Engineers:** Can utilize this API to better operationalize the ML toolkit, ensuring
+  scalability and efficiency in managing ML workflows.
+
+## Next Steps
+
+- Understand [the architecture behind the `train` API](/docs/components/training/reference/fine-tuning).
diff --git a/content/en/docs/components/training/images/fine-tune-llm-api.drawio.svg b/content/en/docs/components/training/images/fine-tune-llm-api.drawio.svg
new file mode 100644
index 0000000000..0aeed6e430
--- /dev/null
+++ b/content/en/docs/components/training/images/fine-tune-llm-api.drawio.svg
@@ -0,0 +1,4 @@
[SVG diagram markup omitted. The diagram shows the `train()` call flow: the Kubeflow Python SDK takes model parameters, dataset parameters, trainer parameters, the number of workers, and worker resources, and creates a PyTorchJob; Worker 0 runs a Storage Initializer InitContainer (Model Provider and Dataset Provider, e.g. an S3 bucket) plus an LLM Trainer container, and Workers 1 and 2 run LLM Trainer containers in the PyTorch cluster.]
\ No newline at end of file
diff --git a/content/en/docs/components/training/reference/fine-tuning.md b/content/en/docs/components/training/reference/fine-tuning.md
new file mode 100644
index 0000000000..ab7362d337
--- /dev/null
+++ b/content/en/docs/components/training/reference/fine-tuning.md
@@ -0,0 +1,57 @@
++++
+title = "LLM Fine-Tuning with Training Operator"
+description = "How Training Operator performs fine-tuning on Kubernetes"
+weight = 10
++++
+
+This page shows how the Training Operator implements the
+[API to fine-tune LLMs](/docs/components/training/user-guides/fine-tuning).
+
+## Architecture
+
+The following diagram shows how the `train` Python API works:
+
+![Fine-Tune API for LLMs](/docs/components/training/images/fine-tune-llm-api.drawio.svg)
+
+- Once a user executes the `train` API, the Training Operator creates a PyTorchJob with the
+  appropriate resources to fine-tune the LLM.
+
+- A storage initializer InitContainer is added to PyTorchJob worker 0 to download the
+  pre-trained model and dataset with the provided parameters.
+
+- A PVC with the [`ReadOnlyMany` access mode](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes)
+  is attached to each PyTorchJob worker to distribute the model and dataset across the Pods. **Note**: Your
+  Kubernetes cluster must support volumes with the `ReadOnlyMany` access mode; otherwise, you can use a
+  single PyTorchJob worker.
+
+- Every PyTorchJob worker runs the LLM Trainer, which fine-tunes the model using the provided parameters.
+
+The Training Operator implements the `train` API with these pre-created components:
+
+### Model Provider
+
+The model provider downloads the pre-trained model. Currently, the Training Operator supports the
+[HuggingFace model provider](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/sdk/python/kubeflow/storage_initializer/hugging_face.py#L56),
+which downloads models from the HuggingFace Hub.
+
+You can implement your own model provider by using [this abstract base class](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/sdk/python/kubeflow/storage_initializer/abstract_model_provider.py#L4).
+
+### Dataset Provider
+
+The dataset provider downloads the dataset. Currently, the Training Operator supports the
+[AWS S3](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/sdk/python/kubeflow/storage_initializer/s3.py#L37)
+and [HuggingFace](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/sdk/python/kubeflow/storage_initializer/hugging_face.py#L92)
+dataset providers.
+
+You can implement your own dataset provider by using [this abstract base class](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/sdk/python/kubeflow/storage_initializer/abstract_dataset_provider.py).
+
+### LLM Trainer
+
+The trainer implements the training loop that fine-tunes the LLM. Currently, the Training Operator
+supports the [HuggingFace trainer](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/sdk/python/kubeflow/trainer/hf_llm_training.py#L118-L139)
+to fine-tune LLMs.
+
+You can implement your own trainer for other ML use cases such as image classification,
+voice recognition, etc.
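+
+As an illustration of these extension points, the rough sketch below outlines a custom dataset
+provider. The method names (`load_config`, `download_dataset`), the target path, and the argument
+format are assumptions for illustration only; check the abstract base class linked above for the
+actual interface your Training Operator version expects.
+
+```python
+import os
+import urllib.request
+
+
+class MyHTTPDatasetProvider:
+    """Sketch of a provider that pulls a dataset dump over plain HTTP.
+
+    In practice, this class should inherit from the abstract dataset provider
+    linked above and be packaged into the storage initializer image.
+    """
+
+    def load_config(self, serialised_args):
+        # Parse provider-specific parameters, e.g. the URL of an internal dataset dump.
+        self.url = serialised_args["url"]
+
+    def download_dataset(self):
+        # Download the dataset into the shared volume that the storage initializer
+        # populates for the PyTorchJob workers (the path below is illustrative).
+        target = os.path.join("/workspace/dataset", "data.json")
+        os.makedirs(os.path.dirname(target), exist_ok=True)
+        urllib.request.urlretrieve(self.url, target)
+```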
diff --git a/content/en/docs/components/training/user-guides/fine-tuning.md b/content/en/docs/components/training/user-guides/fine-tuning.md
new file mode 100644
index 0000000000..26e08b057a
--- /dev/null
+++ b/content/en/docs/components/training/user-guides/fine-tuning.md
@@ -0,0 +1,97 @@
++++
+title = "How to Fine-Tune LLMs with Kubeflow"
+description = "Overview of LLM fine-tuning API in Training Operator"
+weight = 10
++++
+
+{{% alert title="Warning" color="warning" %}}
+This feature is in **alpha** stage and the Kubeflow community is looking for your feedback. Please
+share your experience using the [#kubeflow-training-operator Slack channel](https://kubeflow.slack.com/archives/C985VJN9F)
+or [Kubeflow Training Operator GitHub](https://github.com/kubeflow/training-operator/issues/new).
+{{% /alert %}}
+
+This page describes how to use the [`train` API from the Training Python SDK](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/sdk/python/kubeflow/training/api/training_client.py#L112), which simplifies fine-tuning LLMs with
+distributed PyTorchJob workers.
+
+If you want to learn more about how the fine-tuning API fits into the Kubeflow ecosystem, head to
+the [explanation guide](/docs/components/training/explanation/fine-tuning).
+
+## Prerequisites
+
+You need to install the Training Python SDK [with fine-tuning support](/docs/components/training/installation/#install-python-sdk-with-fine-tuning-capabilities)
+to run this API.
+
+## How to Use the Fine-Tuning API
+
+You need to provide the following parameters to use the `train` API:
+
+- Pre-trained model parameters.
+- Dataset parameters.
+- Trainer parameters.
+- Number of PyTorch workers and resources per worker.
+
+For example, you can use the `train` API as follows to fine-tune a BERT model using the Yelp Review
+dataset from the HuggingFace Hub:
+
+```python
+import transformers
+from peft import LoraConfig
+
+from kubeflow.training import TrainingClient
+from kubeflow.storage_initializer.hugging_face import (
+    HuggingFaceModelParams,
+    HuggingFaceTrainerParams,
+    HuggingFaceDatasetParams,
+)
+
+TrainingClient().train(
+    name="fine-tune-bert",
+    # BERT model URI and type of Transformer to train it.
+    model_provider_parameters=HuggingFaceModelParams(
+        model_uri="hf://google-bert/bert-base-cased",
+        transformer_type=transformers.AutoModelForSequenceClassification,
+    ),
+    # Use 3000 samples from the Yelp dataset.
+    dataset_provider_parameters=HuggingFaceDatasetParams(
+        repo_id="yelp_review_full",
+        split="train[:3000]",
+    ),
+    # Specify HuggingFace Trainer parameters. In this example, we skip evaluation and model checkpoints.
+    trainer_parameters=HuggingFaceTrainerParams(
+        training_parameters=transformers.TrainingArguments(
+            output_dir="test_trainer",
+            save_strategy="no",
+            evaluation_strategy="no",
+            do_eval=False,
+            disable_tqdm=True,
+            log_level="info",
+        ),
+        # Set LoRA config to reduce the number of trainable model parameters.
+        lora_config=LoraConfig(
+            r=8,
+            lora_alpha=8,
+            lora_dropout=0.1,
+            bias="none",
+        ),
+    ),
+    num_workers=4,  # nnodes parameter for the torchrun command.
+    num_procs_per_worker=2,  # nproc-per-node parameter for the torchrun command.
+    resources_per_worker={
+        "gpu": 2,
+        "cpu": 5,
+        "memory": "10G",
+    },
+)
+```
+
+After you execute `train`, the Training Operator will orchestrate the appropriate PyTorchJob
+resources to fine-tune the LLM.
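+
+Once the job is created, you can monitor it with the same `TrainingClient`. The snippet below is a
+minimal sketch that assumes the SDK's standard job helpers (`get_job_logs`, `is_job_succeeded`);
+exact signatures and return formats may vary between SDK versions.
+
+```python
+from kubeflow.training import TrainingClient
+
+client = TrainingClient()
+
+# Fetch training logs from the Pods of the PyTorchJob created by train().
+logs = client.get_job_logs(name="fine-tune-bert")
+print(logs)
+
+# Check whether the PyTorchJob has completed successfully.
+print(client.is_job_succeeded(name="fine-tune-bert"))
+```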
+
+## Next Steps
+
+- Run the example to [fine-tune the TinyLlama LLM](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/examples/pytorch/language-modeling/train_api_hf_dataset.ipynb).
+
+- Check this example to compare the `create_job` and `train` Python APIs for
+  [fine-tuning the BERT LLM](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/examples/pytorch/text-classification/Fine-Tune-BERT-LLM.ipynb).
+
+- Understand [the architecture behind the `train` API](/docs/components/training/reference/fine-tuning).