Revamp TPU internals to be more efficient #441
Conversation
Thanks for fixing up the TPU support with best practices! Looks great to me! 🔥
Thanks for your PR and running the benchmarks in Colab. Could you link to the documentation that shows those are best practices recommended by the torch XLA team? I haven't seen anything personally.
I'd like to make sure those changes are not speeding up the experience in Colab at the cost of the experience on TPU machines, so I'd like the same benchmarks to be run on a TPU VM before fully approving :-)
@sgugger this PR also modifies the TPU selection in
Thanks for revamping this!
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Revamp TPU internals
What does this add?

- `prepare_model` now uses `MpModelWrapper` to distribute the model across all devices efficiently when used
- `prepare_dataloader` now creates an `MpDeviceLoader`, allowing for the dataloaders to be more efficient
- `DataLoaderShard` no longer does `xm.mark_step`; `MpDeviceLoader` will handle this for us. Instead, if on TPU we set the device as `None` in `prepare_dataloader`, letting `MpDeviceLoader` take over with the `device` when needed (see the sketch after this list)
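For readers less familiar with these torch_xla utilities, here is a minimal sketch of the pattern being adopted, not Accelerate's actual internals: the `nn.Linear` model, the synthetic dataset, and `nprocs=8` are illustrative placeholders.

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.distributed.xla_multiprocessing as xmp
from torch.utils.data import DataLoader, TensorDataset

# Wrap the model once, at module scope, so its weights sit in shared memory
# rather than being copied into every spawned process.
WRAPPED_MODEL = xmp.MpModelWrapper(nn.Linear(128, 2))

def _mp_fn(index):
    device = xm.xla_device()
    model = WRAPPED_MODEL.to(device)  # move the shared weights to this replica's device
    data = TensorDataset(torch.randn(512, 128), torch.randint(0, 2, (512,)))
    # MpDeviceLoader preloads batches onto the device and calls
    # xm.mark_step() for us at the end of each iteration, which is why
    # DataLoaderShard no longer needs to do it manually.
    loader = pl.MpDeviceLoader(DataLoader(data, batch_size=64), device)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    for inputs, labels in loader:  # batches arrive already on `device`
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(inputs), labels)
        loss.backward()
        xm.optimizer_step(optimizer)  # all-reduce gradients, then step

if __name__ == "__main__":
    xmp.spawn(_mp_fn, args=(), nprocs=8)
```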
Who is it for?
Why is it needed?
We currently have a number of "bad practices" in our TPU handling that have become outdated as the XLA API improved. Here are some benchmarks I ran on a high-memory Colab TPU instance with the nlp example script:
- Baseline: *(benchmark output not preserved)*
- W/ `MpModelWrapper` and `MpDeviceLoader`: *(benchmark output not preserved)*
- W/ the previous changes and the `default_tensor_type` change: *(benchmark output not preserved)*

Roughly a 40% speed boost once this is all done. Anecdotally, I also saw some speed increases on the initial launch, but I did not time them.
I also saw a 2x speedup when using the new DataLoader class versus the old one.
About `default_tensor_type`:

Though we do set the device to `bf16`, which helps with automatically converting the tensors to the right types, adding `torch.set_default_tensor_type('torch.FloatTensor')` gives a considerable speedup when training on TPUs.

I'm a bit unsure where it would be best to put this: either somewhere hidden when we initialize the Accelerator, or as a util that gets called when you are training on TPUs. Open to ideas!
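As one possible shape for that util, here is a minimal sketch; `set_tpu_tensor_defaults` is a hypothetical name, not something this PR adds:

```python
import torch

def set_tpu_tensor_defaults():
    # Hypothetical helper wrapping the one-liner discussed above: make newly
    # created host tensors default to float32. Would be called once, before
    # training starts, when running on TPU.
    torch.set_default_tensor_type('torch.FloatTensor')
```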
Anticipated maintenance burden? (What will happen in, say, 3 months if something changes)
This is pretty stable and considered "good practice" when running on TPUs w/ XLA, so it's unlikely these will change. However, after this PR a subsequent PR will be opened to change the `nlp_example` and `cv_example` notebooks, as another best practice is to declare the model outside of `xm.spawn`. This includes the internals to make the model as memory efficient as we can, so that will be the last stage needed.