
[trainer] add tf32-mode control #14606

Merged: 11 commits merged into huggingface:master on Dec 3, 2021

Conversation

@stas00 (Contributor) commented on Dec 3, 2021

This PR adds tf32-mode control support to the HF Trainer for Ampere cards. RFC: #14450

PyTorch has had this mode on by default since pt-1.7, but it is being discussed whether to turn it off in the upcoming release: pytorch/pytorch#67384

Here is the proposed logic (a sketch of the underlying PyTorch switch follows the list):

  1. By default, HF Trainer will enable it. This is marked as experimental in case we discover down the road that it is not a safe default.
  2. --tf32 0 will disable it.
  3. If the setup has an unsupported GPU or too old a torch version, it will silently do nothing, as the setting is irrelevant there.
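
For reference, this is the underlying PyTorch switch the new flag wraps; a minimal sketch using standard PyTorch API, not code from this PR:

```python
import torch

# TF32 is gated by two separate PyTorch flags (present since pt-1.7);
# the --tf32 flag added here drives the matmul one.
torch.backends.cuda.matmul.allow_tf32 = True  # TF32 for matmuls on Ampere+ GPUs
torch.backends.cudnn.allow_tf32 = True        # TF32 inside cuDNN (convolutions etc.)
```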

The PR adds:

  • `is_torch_tf32_available` and `require_torch_tf32` helper utils (an illustrative sketch follows)
  • a basic test
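
An illustrative sketch of what these helpers could look like; the names come from the PR, but the body below is an assumption rather than the PR's actual implementation:

```python
import unittest

import torch


def is_torch_tf32_available():
    # TF32 needs an Ampere-or-newer GPU (compute capability >= 8.0) and CUDA 11+
    if not torch.cuda.is_available() or torch.version.cuda is None:
        return False
    if int(torch.version.cuda.split(".")[0]) < 11:
        return False
    return torch.cuda.get_device_properties(torch.cuda.current_device()).major >= 8


def require_torch_tf32(test_case):
    # unittest-style decorator: skip the test when TF32 cannot be exercised here
    return unittest.skipUnless(
        is_torch_tf32_available(), "test requires a TF32-capable GPU and CUDA 11+"
    )(test_case)
```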

Fixes: #14450

@sgugger, @LysandreJik

@stas00 changed the title from "[trainer] add --tf32 support" to "[trainer] add tf32-mode control" on Dec 3, 2021
@sgugger (Collaborator) left a comment:

As usual, let's operate on a strict no-breaking-change rule. I understand PyTorch activates this feature by default when available (at least in some versions), so I would leave the default for --tf32 as None and let PyTorch decide when tf32=None.

Then obviously True and False will force activate/deactivate the feature.
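
For illustration, the three-state behaviour proposed here would look roughly like this from the user's side (the `tf32` argument name is taken from the diff below; the exact default is whatever the merged code settles on):

```python
from transformers import TrainingArguments

TrainingArguments(output_dir="out")              # tf32=None: leave PyTorch's own default untouched
TrainingArguments(output_dir="out", tf32=True)   # force-enable TF32
TrainingArguments(output_dir="out", tf32=False)  # force-disable TF32
```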

@@ -548,6 +552,12 @@ class TrainingArguments:
        default=False,
        metadata={"help": "Whether to use full float16 evaluation instead of 32-bit"},
    )
    tf32: bool = field(
        default=True,
sgugger (Collaborator):

The default should be whatever PyTorch has by default, so None here and the user can set it to True or False to force/unforce it.

I understand it's True for versions >= 1.7 and < 1.10 but False after?

stas00 (Contributor, Author):

ah, good idea! let the user decide!

stas00 (Contributor, Author):

It's True for versions >= 1.7 and < 1.11, and probably False after; the nightly still has it as True as of today.

@@ -802,6 +812,9 @@ def __post_init__(self):
                "Mixed precision training with AMP or APEX (`--fp16` or `--bf16`) and half precision evaluation (`--fp16_full_eval` or `--bf16_full_eval`) can only be used on CUDA devices."
            )

        if is_torch_available() and is_torch_tf32_available():
            torch.backends.cuda.matmul.allow_tf32 = True if self.tf32 else False
sgugger (Collaborator):

So here we should only change that boolean if the value set was not None. If the value is True, there should be an error when is_torch_tf32_available() is False, so the user is not surprised if they don't get what they want.
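
A rough sketch of the check being asked for, mirroring the `__post_init__` excerpt above (not the final merged code): only touch the PyTorch flag when the user set an explicit value, and fail loudly when TF32 was requested but cannot be honoured.

```python
if self.tf32 is not None:
    if self.tf32:
        if is_torch_tf32_available():
            torch.backends.cuda.matmul.allow_tf32 = True
        else:
            # user explicitly asked for TF32 on a setup that cannot provide it
            raise ValueError("--tf32 requires an Ampere or newer GPU and a recent enough torch/CUDA")
    else:
        if is_torch_tf32_available():
            torch.backends.cuda.matmul.allow_tf32 = False
        # disabling TF32 on a setup without TF32 support is a harmless no-op
```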

stas00 (Contributor, Author):

Thanks for this great feedback, Sylvain. Please have another look.

@@ -492,6 +493,15 @@ def test_mixed_bf16(self):

        # will add more specific tests once there are some bugs to fix

    @require_torch_gpu
    @require_torch_tf32
sgugger (Collaborator):

Out of curiosity, do we have a setup that has the right CUDA version and GPU capabilities?

stas00 (Contributor, Author):

I have an RTX 3090, if that's what you're asking.

Running benchmarks now - will post those shortly.

sgugger (Collaborator):

I was wondering about our testing machines on the automatic CI :-)

stas00 (Contributor, Author):

One day we will have those newer GPUs.

@stas00 (Contributor, Author) commented on Dec 3, 2021

The benchmarks are terrible: #14608. Going to ask for advice from the PyTorch experts.

@sgugger (Collaborator) left a comment:

Thanks a lot for the update, just left some styling nits but it's great!

Resolved review threads on: docs/source/performance.md, src/transformers/file_utils.py, and src/transformers/training_args.py (two threads).

stas00 and others added 2 commits December 3, 2021 09:12
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@stas00 stas00 merged commit 71b1bf7 into huggingface:master Dec 3, 2021
@stas00 stas00 deleted the trainer-tf32 branch December 3, 2021 18:09
Albertobegue pushed a commit to Albertobegue/transformers that referenced this pull request Jan 27, 2022
* [trainer] add --tf32 support

* it's pt>=.17

* it's pt>=.17

* flip the default to True

* add experimental note

* simplify logic

* style

* switch to 3-state logic

* doc

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* re-style code

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>