Fine-tuning in Kubeflow Training Operator #1923
Comments
I will create a proposal in a couple of weeks regarding the new API to be supported.
@johnugeorge @andreyvelich I'm not sure why we need to support this feature. I think we can realize the existing features using Kubeflow Pipelines. Maybe we can construct the following pipeline:
So, I think the role of the training-operator would conflict with pipelines.
@tenzen-y the point discussed is about better data processing for the training framework rather than the infra provisioning done by pipelines. The two are complementary, in my view.
I synced my thoughts with @johnugeorge offline. So, I agreed to support this feature on the training-operator side by expanding our SDK.
Just to add to my point about "Streamline access to training data": I think we need to discuss various capabilities to access data on Training Workers from the Data Preparation step (e.g. using Spark). Sometimes a PVC might not be enough, since it should support …
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/remove-lifecycle stale
/remove-help
/assign @johnugeorge @deepanker13
@tenzen-y: GitHub didn't allow me to assign the following users: deepanker13. Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/remove-lifecycle stale
Today, in the world of large models, Data Scientists usually don't train their models from scratch but take existing Foundation models and fine-tune them.
In the last Training WG Community call, we discussed how the Training Operator can be used to efficiently fine-tune large models on Kubernetes.
There are several challenges that we can address in the Training Operator to improve this.
Streamline access to training data
Usually, training a large model requires many GPUs and Workers, which means every Worker must have access to the training data before it starts. If the data is large, it takes a significant amount of CPU resources to download the data and convert it to, for example, a PyTorch DataLoader.
We can discuss improvements to the data transfer from the data pre-processing step (e.g. using Spark) to the training step (e.g. using PyTorch, TensorFlow, etc.).
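One concrete aspect of "every Worker must have access to the data" is how each worker selects its share of the prepared dataset. A minimal sketch, assuming the common pattern where the launcher injects `RANK` and `WORLD_SIZE` environment variables into each worker pod (as PyTorchJob does) and the pre-processing step leaves sharded files on a shared volume; the file names and `shard_files` helper are hypothetical:

```python
import os

def shard_files(files, rank=None, world_size=None):
    """Assign each training worker a disjoint shard of the dataset files.

    rank/world_size default to the RANK/WORLD_SIZE environment variables
    that the launcher injects into each worker pod.
    """
    rank = int(os.environ.get("RANK", 0)) if rank is None else rank
    world_size = int(os.environ.get("WORLD_SIZE", 1)) if world_size is None else world_size
    # Round-robin split: worker i takes files i, i+N, i+2N, ...
    return files[rank::world_size]

files = [f"part-{i:05d}.parquet" for i in range(10)]
print(shard_files(files, rank=1, world_size=4))
# -> ['part-00001.parquet', 'part-00005.parquet', 'part-00009.parquet']
```

Each worker would then build its DataLoader only over its own shard, so no single worker has to download or parse the full dataset.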
Optimize model download
Before training starts, we need to download the model on every Worker. We could think about how to reduce the cost and resources of this operation.
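One possible direction is to download the model once into a shared cache (e.g. a PVC mounted into every Worker) instead of once per Worker. A minimal sketch, assuming the shared filesystem supports `flock`; `ensure_model` and `download_fn` are hypothetical names, not a Training Operator API:

```python
import fcntl
import pathlib
import tempfile

def ensure_model(cache_dir, model_name, download_fn):
    """Download a model once into a shared (e.g. PVC-mounted) cache.

    Workers racing at startup take an exclusive file lock, so only the
    first one pays the download cost; the rest reuse the cached copy.
    Assumes the shared filesystem honors flock; all names here are
    illustrative, not an actual Training Operator API.
    """
    target = pathlib.Path(cache_dir) / model_name.replace("/", "--")
    marker = target.with_suffix(".done")   # written only after a full download
    lock_path = target.with_suffix(".lock")
    target.parent.mkdir(parents=True, exist_ok=True)
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)   # serialize racing workers
        if not marker.exists():
            download_fn(target)            # e.g. a huggingface_hub snapshot
            marker.touch()
    return target

# Demo: two "workers" racing; the download callback runs only once.
cache = tempfile.mkdtemp()
downloads = []
ensure_model(cache, "org/foo-7b", lambda path: downloads.append(path))
ensure_model(cache, "org/foo-7b", lambda path: downloads.append(path))
print(len(downloads))  # -> 1
```

The same idea could extend to a node-local cache or a pre-warmed volume snapshot to avoid pulling the model over the network at all.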
Quick access to Foundation models
We can build abstractions on top of the HuggingFace Transformers APIs to give users quick access to fine-tune foundation models on Kubernetes using the Training Operator SDK. For example:
The SDK will then generate the appropriate script for the Job's container arguments using the HuggingFace APIs.
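To illustrate the idea, here is a minimal sketch of how such an SDK could map a few high-level fine-tuning parameters onto the container arguments of a generic HuggingFace training script baked into the trainer image. The function and flag names are hypothetical, not the actual SDK API:

```python
def build_container_args(model, dataset, epochs=1, learning_rate=2e-5):
    """Translate high-level fine-tuning parameters into the CLI argument
    list for a generic HuggingFace training script in the trainer image.
    All names here are illustrative, not an actual Training Operator API.
    """
    return [
        "--model_name_or_path", model,
        "--dataset_name", dataset,
        "--num_train_epochs", str(epochs),
        "--learning_rate", str(learning_rate),
    ]

print(build_container_args("bert-base-cased", "imdb", epochs=3))
```

A user would then only specify the model and dataset names, and the SDK would create the PyTorchJob with these arguments on their behalf.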
Avoid Overfitting
Sometimes a model can be overtrained, which means its accuracy decreases and it can forget some features. This is especially important when you want to deploy the model that produces the best results. We can address this issue with EarlyStopping techniques, similar to what we do in Katib.
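The mechanism could look like a simple patience-based monitor on validation loss: stop when the loss has not improved for a few evaluations. A minimal sketch in that spirit (not Katib's actual EarlyStopping API):

```python
class EarlyStopping:
    """Stop fine-tuning when validation loss stops improving.

    A minimal patience-based sketch, similar in spirit to Katib's
    early-stopping algorithms but not its actual API.
    """
    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # evaluations to tolerate without improvement
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best = float("inf")
        self.bad_evals = 0

    def step(self, val_loss):
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience

es = EarlyStopping(patience=2)
print([es.step(loss) for loss in [1.0, 0.9, 0.95, 0.96]])
# -> [False, False, False, True]
```

In a Training Operator integration, the stop signal would presumably terminate the Job and keep the checkpoint with the best validation loss.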
Using MPI/All-reduce style of distributed training
We need to benchmark whether the all-reduce style of distributed training produces better results for training large models. If so, the MPI Operator could be a good candidate to investigate.
In addition to that, we can explore other distributed techniques that improve training performance.
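For intuition on why all-reduce scales well, here is a single-process simulation of ring all-reduce, the communication pattern used by NCCL, Horovod, and common MPI implementations: gradients circulate around a ring in a reduce-scatter phase followed by an all-gather phase, so per-worker traffic stays roughly constant as the worker count grows. It assumes each worker's gradient vector is split into exactly N chunks for N workers:

```python
def ring_allreduce_sum(worker_chunks):
    """Simulate ring all-reduce over N workers.

    worker_chunks[r][c] is worker r's local value for gradient chunk c;
    each vector must have exactly N chunks. Returns every worker's
    vector after both phases, i.e. each worker ends up with the
    element-wise sum. Single-process simulation for illustration only.
    """
    n = len(worker_chunks)
    data = [list(w) for w in worker_chunks]
    # Phase 1, reduce-scatter: each step, worker r forwards one partial
    # chunk to its right neighbor, which accumulates it. After n-1 steps
    # worker r holds the complete sum for chunk (r + 1) % n.
    for step in range(n - 1):
        for r in range(n):
            c = (r - step) % n
            data[(r + 1) % n][c] += data[r][c]
    # Phase 2, all-gather: completed chunks travel around the ring,
    # overwriting the stale partial values on every worker.
    for step in range(n - 1):
        for r in range(n):
            c = (r + 1 - step) % n
            data[(r + 1) % n][c] = data[r][c]
    return data

print(ring_allreduce_sum([[1, 2, 3], [4, 5, 6], [7, 8, 9]])[0])
# -> [12, 15, 18]
```

Benchmarking would then compare this style against parameter-server training for the model sizes we care about.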
Feedback from Training Operator users
We want to hear feedback from Training Operator users about which features they would like to see for training their large models.
Please share your ideas, suggestions, and feature requests on this topic.
cc @kubeflow/wg-training-leads @tenzen-y @kuizhiqing