
Fp8 integration #1086

Merged: 34 commits merged into main on Mar 7, 2023

Conversation

@sgugger (Collaborator) commented Feb 15, 2023

This PR brings FP8 mixed precision training to Accelerate. Using it requires a Hopper GPU or newer (hard to find at the moment!) and the `transformer_engine` library.
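For readers new to the feature, here is a minimal sketch of what enabling FP8 through Accelerate looks like after this PR. It is illustrative only: it assumes a Hopper GPU and `transformer_engine` are available, and the tiny model and random data are placeholders.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

from accelerate import Accelerator

# Sketch: enable FP8 mixed precision (assumes a Hopper GPU and the
# transformer_engine library; the model and data below are placeholders).
accelerator = Accelerator(mixed_precision="fp8")

model = torch.nn.Linear(128, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataset = TensorDataset(torch.randn(64, 128), torch.randint(0, 2, (64,)))
dataloader = DataLoader(dataset, batch_size=8)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
loss_fn = torch.nn.CrossEntropyLoss()

for inputs, targets in dataloader:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    accelerator.backward(loss)
    optimizer.step()
```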

@HuggingFaceDocBuilderDev commented Feb 15, 2023

The documentation is not available anymore as the PR was closed or merged.

@muellerzr (Collaborator) left a review:

Awesome job! I left a few notes and general questions, but great work :)

Review comment on the diff (docstring for the `mixed_precision` argument):

    Whether or not to use mixed precision training (fp16 or bfloat16). Choose from 'no','fp16','bf16 or 'fp8'.
    default to the value in the environment variable `ACCELERATE_MIXED_PRECISION`, which will use the default
    value in the accelerate config of the current system or the flag passed with the `accelerate.launch`
    command. 'fp16' requires pytorch 1.6 or higher. 'bf16' requires pytorch 1.10 or higher.
@muellerzr (Collaborator) suggested a change:

    - Whether or not to use mixed precision training (fp16 or bfloat16). Choose from 'no','fp16','bf16 or 'fp8'.
    + Whether or not to use mixed precision training (fp8, fp16, or bfloat16). Choose from 'no','fp16','bf16 or 'fp8'.

@muellerzr (Collaborator): Based on the earlier comment, this could be "fp16, bfloat16, or fp8", or we remove the () and just have "Choose from".

@sgugger (Collaborator, Author): Removing the parenthesis entirely.
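For context, the default-resolution behaviour described in that docstring can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code; `resolve_mixed_precision` is a hypothetical helper name.

```python
import os


def resolve_mixed_precision(mixed_precision=None):
    # Hypothetical helper: an explicit argument wins; otherwise fall back to
    # the ACCELERATE_MIXED_PRECISION environment variable (set by
    # `accelerate config` or `accelerate launch`), defaulting to "no".
    value = mixed_precision or os.environ.get("ACCELERATE_MIXED_PRECISION", "no")
    if value not in ("no", "fp16", "bf16", "fp8"):
        raise ValueError(f"Unknown mixed precision mode: {value!r}")
    return value
```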

Comment on lines 1177 to 1179 of the diff:

    f"The current device has compute capability of {torch.cuda.get_device_capability()} which is "
    "insufficient for FP8 mixed precision training (requires a GPU Hopper or higher, compute "
    "capability of 9 or higher). Will using FP16 instead."
@muellerzr (Collaborator): Should this only warn, or should it instead explicitly raise an error?

@sgugger (Collaborator, Author): It still uses transformer engine instead of the regular model, so it is useful for testing on A100s.
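For illustration, the capability check under discussion follows this general pattern. It is a sketch under the assumption that the fallback only warns; `check_fp8_capability` is a hypothetical name, not the PR's code.

```python
import warnings

import torch


def check_fp8_capability():
    # Hypothetical helper: true FP8 kernels need Hopper (compute capability 9.0+).
    # On older GPUs (e.g. A100, capability 8.0) we only warn and fall back,
    # which keeps the transformer_engine code path testable.
    major, _ = torch.cuda.get_device_capability()
    if major >= 9:
        return True
    warnings.warn(
        f"The current device has compute capability {torch.cuda.get_device_capability()}, "
        "which is insufficient for FP8 mixed precision training "
        "(requires Hopper or higher). Falling back to FP16."
    )
    return False
```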

Comment on the following lines of the diff:

    import transformer_engine.pytorch as te


    def convert_model(model, to_transformer_engine=True, _convert_linear=True, _convert_ln=True):
@muellerzr (Collaborator): Should this run an explicit try/catch for is_fp8_available and raise an error if not?

@sgugger (Collaborator, Author): It shouldn't be called if `is_fp8_available` is False.
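For illustration, a conversion helper of this shape, guarded by the availability check discussed above, might look roughly like the sketch below. This is an assumption-laden sketch, not the PR's actual `convert_model`: the recursive swap of `nn.Linear`/`nn.LayerNorm` for their `te` counterparts and the `is_fp8_available` guard are illustrative only.

```python
import torch.nn as nn

try:
    import transformer_engine.pytorch as te

    _te_available = True
except ImportError:
    _te_available = False


def is_fp8_available():
    # Hypothetical guard: FP8 support needs transformer_engine to be importable.
    return _te_available


def convert_model_sketch(model, _convert_linear=True, _convert_ln=True):
    # Recursively swap nn.Linear / nn.LayerNorm for their transformer_engine
    # counterparts, copying the existing weights. Illustrative only.
    if not is_fp8_available():
        raise RuntimeError("transformer_engine is required for FP8 training.")
    for name, module in model.named_children():
        if _convert_linear and isinstance(module, nn.Linear):
            te_linear = te.Linear(module.in_features, module.out_features, bias=module.bias is not None)
            te_linear.weight.data.copy_(module.weight.data)
            if module.bias is not None:
                te_linear.bias.data.copy_(module.bias.data)
            setattr(model, name, te_linear)
        elif _convert_ln and isinstance(module, nn.LayerNorm):
            te_ln = te.LayerNorm(module.normalized_shape[0], eps=module.eps)
            te_ln.weight.data.copy_(module.weight.data)
            te_ln.bias.data.copy_(module.bias.data)
            setattr(model, name, te_ln)
        else:
            convert_model_sketch(module, _convert_linear=_convert_linear, _convert_ln=_convert_ln)
    return model
```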

examples/complete_cv_example.py (review comment resolved)
@pacman100 (Contributor) left a review:

Awesome! This is going to be a game-changer for LLM training. LGTM 🚀

src/accelerate/accelerator.py (outdated review comment, resolved)
sgugger and others added 2 commits March 2, 2023 10:58
Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
sgugger merged commit 1bfde6b into main on Mar 7, 2023
sgugger deleted the fp8_integration branch on Mar 7, 2023 at 14:10
@JulesGM (Contributor) commented Apr 2, 2023

I just opened two new issues with fp8:
#1276
#1275
