
Add changes to support FSDP #598

Merged
merged 20 commits into huggingface:main on Jan 23, 2024

Conversation

vivekgoe (Collaborator)

What does this PR do?

Adds changes to support FSDP; BERT-Base is enabled with FSDP as a toy example.
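
For reviewers, a minimal sketch of how FSDP is typically switched on via the fsdp/fsdp_config options that GaudiTrainingArguments inherits from transformers' TrainingArguments; the flag values and file names below are illustrative assumptions, not taken from this PR:

# Hypothetical sketch, not the PR's exact example.
from optimum.habana import GaudiTrainingArguments

training_args = GaudiTrainingArguments(
    output_dir="./bert-base-fsdp",      # hypothetical output path
    use_habana=True,
    use_lazy_mode=False,                # assumption: FSDP runs in eager (non-lazy) mode
    fsdp="full_shard auto_wrap",        # shard params, grads and optimizer states
    fsdp_config="fsdp_config.json",     # hypothetical file holding the wrap policy etc.
    per_device_train_batch_size=8,
)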

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@vivekgoe vivekgoe marked this pull request as ready for review January 3, 2024 09:36
@vivekgoe vivekgoe requested a review from regisss as a code owner January 3, 2024 09:36
@regisss regisss (Collaborator) left a comment

I left a few comments. For the Transformers-related changes, it seems you based them on a newer version (main?). Let's just stick to v4.34.1 for now as newer releases are not supported 🙂

Can you also format the code as follows?

pip install --upgrade ruff
make style

Review threads (all resolved):

  • examples/question-answering/gaudi_config.json
  • optimum/habana/accelerate/utils/dataclasses.py
  • optimum/habana/accelerate/accelerator.py (two threads)
  • optimum/habana/peft/layer.py (two threads)
  • optimum/habana/transformers/trainer.py
  • tests/test_fsdp_examples.py
@@ -46,7 +47,7 @@

from optimum.habana import GaudiConfig, GaudiTrainer, GaudiTrainingArguments
from optimum.habana.utils import set_seed

from optimum.habana.peft.layer import GaudiLoraLayerLinearforward
regisss (Collaborator)

Should we do that every time we use FSDP and LoRA, or is it necessary for this example only?
If it's always necessary, it's better to do it in optimum.habana.transformers.modeling_utils.py.

vivekgoe (Collaborator, Author)

We need to do this every time we use LoRA with torch_compile enabled (see the detailed explanation I added in response to one of your other comments).
I can move it to optimum.habana.transformers.modeling_utils.py; it's OK to do it unconditionally, right?
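
For illustration, a minimal sketch of what an unconditional patch in optimum.habana.transformers.modeling_utils.py could look like; the peft attribute path below is an assumption that depends on the installed peft version:

import peft

from optimum.habana.peft.layer import GaudiLoraLayerLinearforward

def adapt_peft_lora_to_gaudi() -> None:
    # Monkey-patch peft's LoRA Linear layer so its forward is the
    # torch.compile-friendly Gaudi version. The attribute path
    # (peft.tuners.lora.layer.Linear) is an assumption, not confirmed
    # by this thread.
    peft.tuners.lora.layer.Linear.forward = GaudiLoraLayerLinearforward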

regisss (Collaborator)

I think it's okay, yes. Maybe LoRA inference in the text-generation example could be impacted; I'll check that.

vivekgoe (Collaborator, Author)

@regisss My understanding was that LoRA is used only for fine-tuning. But please check and resolve this conversation when you get a chance. Thanks.

@vivekgoe vivekgoe requested a review from libinta January 5, 2024 09:39
@vivekgoe vivekgoe added the synapse1.14 and run-test (Run CI for PRs from external contributors) labels Jan 8, 2024
@vivekgoe vivekgoe added and removed the run-test label on Jan 12, Jan 22, and Jan 23, 2024
vivekgoe (Collaborator, Author)

@regisss Any suggestions on where in the code I should add a warning that FSDP is an experimental feature not ready for use yet? And how can I add test_fsdp_examples.py to the list of tests you run as part of your regular testing?
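
A minimal sketch of what such a warning could look like; the placement (a helper called wherever FSDP gets enabled) and the message wording are assumptions, only the intent comes from the question above:

import warnings

def warn_fsdp_experimental(training_args) -> None:
    # Hypothetical helper: `training_args` is any object exposing the
    # `fsdp` field from transformers' TrainingArguments.
    if training_args.fsdp:
        warnings.warn(
            "FSDP support is experimental and not ready for production use yet.",
            UserWarning,
        )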

@regisss regisss added and removed the run-test label Jan 23, 2024
@regisss regisss (Collaborator) left a comment

Merging now since @libinta told me that 1.14 is getting released.

I'll add the warning and the test in another PR.

@regisss regisss merged commit e238bca into huggingface:main Jan 23, 2024
11 of 12 checks passed
regisss (Collaborator) commented Jan 24, 2024

@vivekgoe When running the FSDP test on Gaudi2 with 1.13, I get errors like

ValueError: Inconsistent compute device and `device_id` on rank 4: hpu:0 vs hpu

And on Gaudi1 with 1.14 I get

RuntimeError: offset_ 0is not matching with the offset of new tensor 13835057943613014016

I launched the test with

pytest tests/test_fsdp_examples.py -v -s

Is there something I'm doing wrong?
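
One guess at the Gaudi2 error, offered as a sketch rather than a confirmed fix: FSDP raises that ValueError when the compute device it detects (hpu:0) differs from the `device_id` it was handed (an un-indexed hpu). Passing an indexed device keeps the two consistent:

import habana_frameworks.torch  # noqa: F401  (assumed available; registers the "hpu" backend)
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def wrap_with_fsdp(model: nn.Module) -> FSDP:
    # Hypothesis, not a fix confirmed in this thread: an indexed device
    # makes FSDP's detected compute device (hpu:0 in the traceback above)
    # match `device_id`, whereas torch.device("hpu") does not.
    return FSDP(model, device_id=torch.device("hpu", 0))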

jychen21 pushed a commit to jychen21/optimum-habana that referenced this pull request Feb 27, 2024
Labels
run-test (Run CI for PRs from external contributors), synapse1.14

3 participants