
Modify LLM Trainer to support BERT and Tiny LLaMA #2031

Merged
merged 6 commits into from
Mar 15, 2024

Conversation

andreyvelich
Member

@andreyvelich andreyvelich commented Mar 15, 2024

These changes should allow us to use the train API with BERT and Tiny LLaMA models, so we can demo our Notebook during KubeCon. I will update the Fine-tune BERT Notebook in this PR soon.
List of changes:

  • We need strict version pinning across the various components (e.g. SDK, Storage Initializer, Trainer), since we dump all Trainer settings as container arguments and they must stay compatible across the different backend components.
  • I used the save_to_disk and load_from_disk APIs to download and upload the HuggingFace dataset. That lets us introduce a split parameter to reduce the number of samples before saving the dataset to disk (see the sketch after this list). I understand that save_to_disk might not work with an IterableDataset, but we can discuss further what to do about that.
  • I removed the device_map, pad_token, and add_pad_token settings from AutoTokenizer. Some of these settings don't work with BERT (e.g. device_map): BertForSequenceClassification does not support 'device_map': "auto" yet (huggingface/transformers#25296). In the long term we should discuss whether to expose Tokenizer settings so users can set the appropriate params.
  • The tokenizer function should be set as follows to work with both the BERT and Tiny LLaMA tokenizers:
lambda x: tokenizer(x["text"], padding="max_length", truncation=True)
  • I added a Data Collator only for causal language modeling. In the future, we should discuss how this parameter should be set in the Trainer.
  • If the PyTorchJob has 1 worker, the ReadWriteOnce access mode should be sufficient for the PVC.
  • I removed blank spaces from the example Notebook, as @PeterWrighten suggested here: Add Fine-Tune BERT LLM Example #2021 (comment). That will make it easier for us to integrate CI tests for those Notebooks.
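To make the flow above concrete, here is a minimal sketch of how the download/tokenize steps could fit together. The dataset and model names (yelp_review_full, bert-base-cased) and the volume path are placeholders for illustration, not the values wired into the Trainer:

```python
from datasets import load_dataset, load_from_disk
from transformers import AutoTokenizer

# Placeholder path for illustration only.
DATASET_PATH = "/workspace/dataset"

# Storage Initializer side: download only a slice of the dataset
# (the new split parameter) and persist it to the shared volume.
dataset = load_dataset("yelp_review_full", split="train[:2000]")
dataset.save_to_disk(DATASET_PATH)

# Trainer side: load the persisted dataset and tokenize it so that
# it works with both the BERT and Tiny LLaMA tokenizers.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
dataset = load_from_disk(DATASET_PATH)
dataset = dataset.map(
    lambda x: tokenizer(x["text"], padding="max_length", truncation=True),
    batched=True,
)
```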

Please take a look at these changes.
/assign @johnugeorge @deepanker13 @tenzen-y @kuizhiqing

/hold for the review

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coveralls

coveralls commented Mar 15, 2024

Pull Request Test Coverage Report for Build 8300892008

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 42.908%

Totals Coverage Status
Change from base Build 8270129783: 0.0%
Covered Lines: 3757
Relevant Lines: 8756

💛 - Coveralls

@@ -394,6 +395,10 @@ def get_pvc_spec(
        ),
    )

    # If PyTorchJob has 1 worker, ReadWriteOnce access mode is sufficient for PVC.
    if num_workers == 1:
        pvc_spec.spec.access_modes = ["ReadWriteOnce"]
Member

can we take this in storage config? And use it in line 391 instead of this

Member Author

Good point, let me change it.
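To illustrate what moving this into the storage config could look like, here is a minimal sketch; the storage_config keys and the helper signature below are hypothetical, not the actual SDK API:

```python
from kubernetes import client


def get_pvc_spec(name: str, namespace: str, storage_config: dict) -> client.V1PersistentVolumeClaim:
    # Hypothetical: the caller decides the access modes (e.g. based on the
    # number of PyTorchJob workers) and passes them in via storage_config.
    access_modes = storage_config.get("access_modes", ["ReadWriteOnce"])

    return client.V1PersistentVolumeClaim(
        api_version="v1",
        kind="PersistentVolumeClaim",
        metadata=client.V1ObjectMeta(name=name, namespace=namespace),
        spec=client.V1PersistentVolumeClaimSpec(
            access_modes=access_modes,
            resources=client.V1ResourceRequirements(
                requests={"storage": storage_config.get("size", "10Gi")}
            ),
        ),
    )
```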

if "train" in dataset:
train_data = dataset["train"]
else:
train_data = dataset

try:
eval_data = dataset["eval"]
Member

Is it always dataset["eval"]?

Member Author

@johnugeorge It depends on the dataset.
If the dataset doesn't have eval data, we can use dataset.train_test_split(test_size=0.1, stratify_by_column="label"). In that case, the train and eval datasets will be stored under the train and test keys (see the sketch below).
Should we think about the various use cases in follow-up PRs, @johnugeorge?
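As a rough illustration of that fallback (the dataset name is a placeholder; stratify_by_column requires the column to be a ClassLabel feature):

```python
from datasets import load_dataset

# Placeholder dataset for illustration; it ships a ClassLabel "label" column.
dataset = load_dataset("yelp_review_full", split="train")

# If the dataset has no eval split, carve one out of the train data.
splits = dataset.train_test_split(test_size=0.1, stratify_by_column="label")
train_data = splits["train"]
eval_data = splits["test"]  # note: the eval data lands under the "test" key
```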

Member

SGTM

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@tenzen-y tenzen-y left a comment
Member

Generally lgtm
I'd like to merge this PR ASAP since this is a blocker for all PRs.

@@ -39,6 +39,8 @@ def load_config(self, serialised_args):
        self.config = S3DatasetParams(**json.loads(serialised_args))

    def download_dataset(self):
        import boto3
Member

Should we put this import at the top?

Member Author

@tenzen-y I did this on purpose, so the Training Operator SDK won't depend on boto3 when importing the S3 storage initializer (a small sketch of the pattern is below): https://github.com/kubeflow/training-operator/blob/master/sdk/python/kubeflow/training/api/training_client.py#L125
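A tiny sketch of that pattern, with hypothetical class and call names, just to show where the import lives:

```python
# Hypothetical module layout for illustration; only the import placement matters.

class S3DatasetProvider:
    def download_dataset(self):
        # boto3 is imported lazily so that merely importing this module
        # (e.g. from the Training Operator SDK) does not require boto3.
        import boto3

        s3 = boto3.client("s3")
        # ... download the dataset objects from the bucket here ...
```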

)

# TODO (andreyvelich): Currently, data collator is supported only for casual LM Transformer.
Member

Suggested change
# TODO (andreyvelich): Currently, data collator is supported only for casual LM Transformer.
# TODO (andreyvelich): Currently, data collector is supported only for casual LM Transformer.

Member

What is TODO? I guess that you'd like to support data collector other than casual LM Transformer, right?
If so, could we open an issue?

Member Author

I think it's called Data Collator in HuggingFace: https://huggingface.co/docs/transformers/en/main_classes/data_collator

Member Author

We need to investigate if we want to apply Data Collator for other transformers. I will create an issue.

Member

I see. Thanks.
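For context, a minimal sketch of the HuggingFace Data Collator for causal LM that this TODO refers to; the model id is a placeholder, and the pad-token handling here is only for this standalone snippet, not the PR's tokenizer settings:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Placeholder model id; any causal LM tokenizer works for this sketch.
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
if tokenizer.pad_token is None:
    # Only for this standalone sketch; the PR handles tokenizer settings separately.
    tokenizer.pad_token = tokenizer.eos_token

# mlm=False selects plain causal language modeling (labels are the shifted inputs);
# masked-LM models such as BERT would need a different collator, hence the TODO.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
```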

Fix access modes in storage config

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@johnugeorge
Member

/lgtm
/hold for @tenzen-y

@google-oss-prow google-oss-prow bot added the lgtm label Mar 15, 2024
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@@ -77,11 +94,19 @@ def load_config(self, serialised_args):
        self.config = HfDatasetParams(**json.loads(serialised_args))

    def download_dataset(self):
        print("downloading dataset")
        logger.info("Downloading dataset")
        logger.info("-" * 40)
        import huggingface_hub
        from datasets import load_dataset

        if self.config.access_token:
            huggingface_hub.login(self.config.access_token)

        load_dataset(self.config.repo_id, cache_dir=VOLUME_PATH_DATASET)
Contributor

@andreyvelich Why are we downloading the dataset again?

Member Author

It's a great catch, @deepanker13!
We should remove it.

@google-oss-prow google-oss-prow bot removed the lgtm label Mar 15, 2024

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@andreyvelich
Member Author

/hold cancel

@andreyvelich
Member Author

/assign @deepanker13 @johnugeorge @tenzen-y

@johnugeorge
Member

/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Mar 15, 2024
@google-oss-prow google-oss-prow bot merged commit bb8bba0 into kubeflow:master Mar 15, 2024
37 checks passed
@andreyvelich andreyvelich deleted the distributed-data-train branch March 15, 2024 19:29
tedhtchang pushed a commit to tedhtchang/training-operator that referenced this pull request Apr 5, 2024
* Modify LLM Trainer to support BERT and Tiny LLaMA

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Access PVC access modes to storage config

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Format Python files

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Distribute datasets

Fix access modes in storage config

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update example to fine tune BERT with Train API

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Remove dataset download twice

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
deepanker13 pushed a commit to deepanker13/deepanker-training-operator that referenced this pull request Apr 8, 2024
johnugeorge pushed a commit to johnugeorge/training-operator that referenced this pull request Apr 28, 2024
johnugeorge pushed a commit to johnugeorge/training-operator that referenced this pull request Apr 28, 2024