
Fine-tuning in Kubeflow Training Operator #1923

Open
andreyvelich opened this issue Sep 28, 2023 · 12 comments

@andreyvelich
Member

Today, in the world of large models, Data Scientists usually don't train their models from scratch; instead, they take existing Foundation models and fine-tune them.
In the last Training WG Community call, we discussed how the Training Operator can be used to efficiently fine-tune large models on Kubernetes.

There are several challenges that we can address in the Training Operator to improve this workflow.

Streamline access to training data

Usually, training a large model requires many GPUs and Workers, which means every Worker needs access to the training data before it starts. If the data is large, it takes a significant amount of CPU resources to download it and convert it, for example, into a PyTorch DataLoader.
We can discuss improvements to the data hand-off from the data pre-processing step (e.g. using Spark) to the training step (e.g. using PyTorch, TensorFlow, etc.); a rough sketch of that hand-off is shown below.
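As one illustration of the hand-off being discussed, here is a minimal sketch of a Worker reading Parquet shards produced by a Spark pre-processing job and wrapping them in a PyTorch DataLoader. The paths and column names are hypothetical, and it assumes the shards are reachable from the Worker (e.g. on a shared volume):

# Sketch only: each Worker loads pre-processed Parquet shards and builds a DataLoader.
from datasets import load_dataset          # pip install datasets
from torch.utils.data import DataLoader

# Hypothetical location of shards written by the Spark pre-processing step.
dataset = load_dataset("parquet", data_files="/mnt/shared/preprocessed/*.parquet", split="train")
dataset = dataset.with_format("torch")     # return PyTorch tensors instead of Python objects

loader = DataLoader(dataset, batch_size=32, num_workers=4)
for batch in loader:
    ...  # feed batch["input_ids"], batch["labels"], etc. into the training loop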

Optimize model download

Before training starts, we need to download the model on every Worker. We could think about how to reduce the cost and resources of this operation, for example by downloading the model once to shared storage (see the sketch below).
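As a minimal sketch of that idea (the PVC mount path and model id are assumptions), the Hugging Face Hub client could download the model once, e.g. in an init step, and every Worker would then load it from the shared path:

# Sketch only: download the model once to shared storage instead of on every Worker.
from huggingface_hub import snapshot_download

shared_path = "/mnt/models/llama-2-7b"      # hypothetical shared PVC mount
snapshot_download(repo_id="meta-llama/Llama-2-7b-hf", local_dir=shared_path)
# (gated repos additionally require a Hugging Face access token)

# Every Worker then loads from the local path and skips the network download.
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(shared_path)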

Quick access to Foundation models

We can build abstractions on top of the HuggingFace Transformers APIs to give users a quick way to fine-tune foundation models on Kubernetes using the Training Operator SDK. For example:

TrainingClient().fine_tune(
  model="LLama2",
  dataset="s3://...",
)

The SDK would then generate the appropriate training script and pass it to the Job's container arguments using the HuggingFace APIs; a rough sketch of what such a generated script might look like follows.
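Purely as an assumption of what the generated script could look like (not an existing API), here is a minimal HuggingFace Transformers fine-tuning sketch; the model id, dataset path, and hyperparameters are placeholders:

# Sketch only: the kind of script the SDK could generate for the Job's container.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"                   # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

dataset = load_dataset("json", data_files="/mnt/data/train.jsonl", split="train")  # placeholder dataset
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=dataset.column_names)

args = TrainingArguments(output_dir="/mnt/output",
                         per_device_train_batch_size=2, num_train_epochs=1)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM labels from input_ids
Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()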

Avoid Overfitting

Sometimes a model can be overtrained, which means its accuracy decreases and it may forget some features. This is especially important when you want to deploy the model that produces the best results. We can address this issue with EarlyStopping techniques, similar to what we do in Katib (see the sketch below).
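For instance, a minimal sketch of early stopping with the HuggingFace Trainer (one possible technique, not a decided design), reusing the model and tokenized datasets from the sketch above; the patience value and evaluation settings are assumptions:

# Sketch only: stop fine-tuning when the eval metric stops improving.
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="/mnt/output",
    evaluation_strategy="epoch",        # evaluate once per epoch
    save_strategy="epoch",
    load_best_model_at_end=True,        # keep the best checkpoint, not the last one
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,                        # model and datasets as in the previous sketch
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()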

Using MPI/All-reduce style of distributed training

We need to benchmark whether the all-reduce style of distributed training gives better results for training large models. If so, the MPI Operator could be a good candidate to investigate.
In addition, we can explore other distributed techniques that improve training performance (a minimal all-reduce sketch is shown below).
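To make the terminology concrete, here is a minimal PyTorch DistributedDataParallel sketch of all-reduce style training, purely as an illustration of the communication pattern rather than a proposed implementation; the environment variables are assumed to be injected by the operator:

# Sketch only: all-reduce style data parallelism with PyTorch DDP.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT are assumed to be set by the operator.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda()          # placeholder model
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(10):                              # placeholder training loop
    x = torch.randn(8, 1024, device="cuda")
    loss = model(x).pow(2).mean()
    loss.backward()                                 # gradients are all-reduced across Workers here
    optimizer.step()
    optimizer.zero_grad()

dist.destroy_process_group()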

Feedback from Training Operator users

We want to hear feedback from Training Operator users about what features they would like to see for training their large models.
Please share your ideas, suggestions, and feature requests on this topic.

cc @kubeflow/wg-training-leads @tenzen-y @kuizhiqing

@johnugeorge
Member

I will create a proposal in a couple of weeks regarding the new API to be supported.

@tenzen-y
Member

@johnugeorge @andreyvelich I'm not sure why we need to support this feature. I think we can achieve this with existing features by using Kubeflow Pipelines. Maybe we can construct the following pipeline:

  1. Download Model to PVC.
  2. Do any pre-processing on the downloaded model.
  3. Start fine-tuning using the training-operator with PVC.

So, I think the role of the training-operator would conflict with pipelines.
What is the difference between using pipelines and this new training-operator feature?

@johnugeorge
Member

johnugeorge commented Oct 10, 2023

@tenzen-y the point discussed is about better data processing for the training framework, rather than the infra provisioning done by pipelines. Both are complementary, in my view.

@tenzen-y
Member

> @tenzen-y the point discussed is about better data processing for the training framework, rather than the infra provisioning done by pipelines. Both are complementary, in my view.

I synced with @johnugeorge offline, so I agree to support this feature on the training-operator side by expanding our SDK.

@andreyvelich
Member Author

andreyvelich commented Oct 11, 2023

Just to add to my point about "Streamline access to training data": I think we need to discuss various ways to access data on Training Workers from the Data Preparation step (e.g. using Spark). Sometimes a PVC might not be enough, since it has to support the ReadWriteMany access mode to be readable by multiple microservices (e.g. Workers).
For example, we can investigate how PyArrow can help us on Kubernetes to get data from Spark DataFrames (a rough sketch follows below).
Also, some additional resources can be found in this talk: https://pt.slideshare.net/databricks/simplify-data-conversion-from-spark-to-tensorflow-and-pytorch
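As one possible illustration of the PyArrow idea (an assumption, not a settled design), each Worker could read only its own shard of the Parquet files written by Spark, so no Worker has to scan the whole dataset; the path, environment variables, and column names are hypothetical:

# Sketch only: each Worker reads its own subset of Spark-written Parquet files via PyArrow.
import os
import pyarrow.dataset as ds
import torch

rank = int(os.environ.get("RANK", 0))             # assumed to be injected by the operator
world_size = int(os.environ.get("WORLD_SIZE", 1))

dataset = ds.dataset("/mnt/shared/spark-output", format="parquet")  # hypothetical path
fragments = list(dataset.get_fragments())
my_fragments = fragments[rank::world_size]        # simple round-robin sharding by rank

for fragment in my_fragments:
    table = fragment.to_table(columns=["features", "label"])        # hypothetical columns
    features = torch.tensor(table.column("features").to_pylist())
    labels = torch.tensor(table.column("label").to_pylist())
    ...  # feed (features, labels) into the training loop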


github-actions bot commented Jan 9, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@tenzen-y
Member

tenzen-y commented Jan 9, 2024

/remove-lifecycle stale

@tenzen-y
Member

tenzen-y commented Jan 9, 2024

/remove-help

@tenzen-y
Member

tenzen-y commented Jan 9, 2024

/assign @johnugeorge @deepanker13


@tenzen-y: GitHub didn't allow me to assign the following users: deepanker13.

Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

> /assign @johnugeorge @deepanker13

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


github-actions bot commented Apr 9, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@tenzen-y
Member

tenzen-y commented Apr 9, 2024

/remove-lifecycle stale
