Fine-tuning in Kubeflow Training Operator #1923
Comments
I will create a proposal in a couple of weeks regarding the new API to be supported.
@johnugeorge @andreyvelich I'm not sure why we need to support this feature. I think we can realize the existing features using Kubeflow Pipelines. Maybe we can construct the following pipeline:
So, I think the role of the training-operator would conflict with pipelines.
@tenzen-y the point discussed is about better data processing for the training framework rather than the infra provisioning done by pipelines. The two are complementary, in my view.
I synced my thoughts with @johnugeorge offline. So, I agreed to support this feature on the training-operator side by expanding our SDK.
Just to add to my point about "Streamline access to training data": I think we need to discuss various capabilities to access data on Training Workers from the Data Preparation step (e.g. using Spark). Sometimes a PVC might not be enough, since it should support …
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/remove-lifecycle stale
/remove-help
/assign @johnugeorge @deepanker13
@tenzen-y: GitHub didn't allow me to assign the following users: deepanker13. Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/remove-lifecycle stale
Today, in the world of large models, Data Scientists usually don't train their models from scratch but take existing Foundation models and fine-tune them.
In the last Training WG Community call, we discussed how the Training Operator can be used to efficiently fine-tune large models on Kubernetes.
There are several challenges that we can address in the Training Operator to improve this.
Streamline access to training data
Usually, training a large model requires many GPUs and Workers, which means every Worker must have access to the training data before it starts. If the data is large, it takes a significant amount of CPU resources to download the data and convert it to, for example, a PyTorch DataLoader.
We can discuss improvements to the data transfer from the data pre-processing step (e.g. using Spark) to the training step (e.g. using PyTorch, TensorFlow, etc.).
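One concrete aspect of "every Worker must have access to the data" is how each worker selects its share of the prepared dataset. A minimal sketch, assuming the common pattern where the launcher injects `RANK` and `WORLD_SIZE` environment variables into each worker pod (as PyTorchJob does) and the pre-processing step leaves sharded files on a shared volume; the file names and `shard_files` helper are hypothetical:

```python
import os

def shard_files(files, rank=None, world_size=None):
    """Assign each training worker a disjoint shard of the dataset files.

    rank/world_size default to the RANK/WORLD_SIZE environment variables
    that the launcher injects into each worker pod.
    """
    rank = int(os.environ.get("RANK", 0)) if rank is None else rank
    world_size = int(os.environ.get("WORLD_SIZE", 1)) if world_size is None else world_size
    # Round-robin split: worker i takes files i, i+N, i+2N, ...
    return files[rank::world_size]

files = [f"part-{i:05d}.parquet" for i in range(10)]
print(shard_files(files, rank=1, world_size=4))
# -> ['part-00001.parquet', 'part-00005.parquet', 'part-00009.parquet']
```

Each worker would then build its DataLoader only over its own shard, so no single worker has to download or parse the full dataset.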
Optimize model download
Before training starts, we need to download the model on every Worker. We could think about how to reduce the cost and resources of this operation.
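One possible direction is to download the model once into a shared cache (e.g. a PVC mounted into every Worker) instead of once per Worker. A minimal sketch, assuming the shared filesystem supports `flock`; `ensure_model` and `download_fn` are hypothetical names, not a Training Operator API:

```python
import fcntl
import pathlib
import tempfile

def ensure_model(cache_dir, model_name, download_fn):
    """Download a model once into a shared (e.g. PVC-mounted) cache.

    Workers racing at startup take an exclusive file lock, so only the
    first one pays the download cost; the rest reuse the cached copy.
    Assumes the shared filesystem honors flock; all names here are
    illustrative, not an actual Training Operator API.
    """
    target = pathlib.Path(cache_dir) / model_name.replace("/", "--")
    marker = target.with_suffix(".done")   # written only after a full download
    lock_path = target.with_suffix(".lock")
    target.parent.mkdir(parents=True, exist_ok=True)
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)   # serialize racing workers
        if not marker.exists():
            download_fn(target)            # e.g. a huggingface_hub snapshot
            marker.touch()
    return target

# Demo: two "workers" racing; the download callback runs only once.
cache = tempfile.mkdtemp()
downloads = []
ensure_model(cache, "org/foo-7b", lambda path: downloads.append(path))
ensure_model(cache, "org/foo-7b", lambda path: downloads.append(path))
print(len(downloads))  # -> 1
```

The same idea could extend to a node-local cache or a pre-warmed volume snapshot to avoid pulling the model over the network at all.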
Quick access to Foundation models
We can build abstractions on top of the HuggingFace Transformers APIs to give users quick access to fine-tune foundation models on Kubernetes using the Training Operator SDK. For example:
The SDK will then generate the appropriate script for the Job's container arguments using the HuggingFace APIs.
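To illustrate the idea, here is a minimal sketch of how such an SDK could map a few high-level fine-tuning parameters onto the container arguments of a generic HuggingFace training script baked into the trainer image. The function and flag names are hypothetical, not the actual SDK API:

```python
def build_container_args(model, dataset, epochs=1, learning_rate=2e-5):
    """Translate high-level fine-tuning parameters into the CLI argument
    list for a generic HuggingFace training script in the trainer image.
    All names here are illustrative, not an actual Training Operator API.
    """
    return [
        "--model_name_or_path", model,
        "--dataset_name", dataset,
        "--num_train_epochs", str(epochs),
        "--learning_rate", str(learning_rate),
    ]

print(build_container_args("bert-base-cased", "imdb", epochs=3))
```

A user would then only specify the model and dataset names, and the SDK would create the PyTorchJob with these arguments on their behalf.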
Avoid Overfitting
Sometimes a model can be overtrained, which means its accuracy decreases and it can forget some features. This is especially important when you want to deploy the model that produces the best results. We can address this issue with EarlyStopping techniques, similar to what we do in Katib.
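The mechanism could look like a simple patience-based monitor on validation loss: stop when the loss has not improved for a few evaluations. A minimal sketch in that spirit (not Katib's actual EarlyStopping API):

```python
class EarlyStopping:
    """Stop fine-tuning when validation loss stops improving.

    A minimal patience-based sketch, similar in spirit to Katib's
    early-stopping algorithms but not its actual API.
    """
    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # evaluations to tolerate without improvement
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best = float("inf")
        self.bad_evals = 0

    def step(self, val_loss):
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience

es = EarlyStopping(patience=2)
print([es.step(loss) for loss in [1.0, 0.9, 0.95, 0.96]])
# -> [False, False, False, True]
```

In a Training Operator integration, the stop signal would presumably terminate the Job and keep the checkpoint with the best validation loss.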
Using MPI/All-reduce style of distributed training
We need to benchmark whether the all-reduce style of distributed training produces better results for training large models. If so, the MPI Operator could be a good candidate to investigate.
In addition to that, we can explore other distributed techniques that improve training performance.
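For intuition on why all-reduce scales well, here is a single-process simulation of ring all-reduce, the communication pattern used by NCCL, Horovod, and common MPI implementations: gradients circulate around a ring in a reduce-scatter phase followed by an all-gather phase, so per-worker traffic stays roughly constant as the worker count grows. It assumes each worker's gradient vector is split into exactly N chunks for N workers:

```python
def ring_allreduce_sum(worker_chunks):
    """Simulate ring all-reduce over N workers.

    worker_chunks[r][c] is worker r's local value for gradient chunk c;
    each vector must have exactly N chunks. Returns every worker's
    vector after both phases, i.e. each worker ends up with the
    element-wise sum. Single-process simulation for illustration only.
    """
    n = len(worker_chunks)
    data = [list(w) for w in worker_chunks]
    # Phase 1, reduce-scatter: each step, worker r forwards one partial
    # chunk to its right neighbor, which accumulates it. After n-1 steps
    # worker r holds the complete sum for chunk (r + 1) % n.
    for step in range(n - 1):
        for r in range(n):
            c = (r - step) % n
            data[(r + 1) % n][c] += data[r][c]
    # Phase 2, all-gather: completed chunks travel around the ring,
    # overwriting the stale partial values on every worker.
    for step in range(n - 1):
        for r in range(n):
            c = (r + 1 - step) % n
            data[(r + 1) % n][c] = data[r][c]
    return data

print(ring_allreduce_sum([[1, 2, 3], [4, 5, 6], [7, 8, 9]])[0])
# -> [12, 15, 18]
```

Benchmarking would then compare this style against parameter-server training for the model sizes we care about.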
Feedback from Training Operator users
We want to hear feedback from Training Operator users about which features they would like to see for training their large models.
Please share your ideas, suggestions, and feature requests on this topic.
cc @kubeflow/wg-training-leads @tenzen-y @kuizhiqing