torch-native pipeline parallelism for big models #2345

Merged: 39 commits from pippy-integration-v2 into main on Feb 6, 2024

Conversation

@muellerzr (Collaborator) commented on Jan 16, 2024

Example use:

import torch
from accelerate.inference import prepare_pippy
from accelerate.utils import set_seed
from transformers import T5ForConditionalGeneration, T5Config

set_seed(42)

config = T5Config()
model = T5ForConditionalGeneration(config)
model.eval()

# Create example inputs for the model
input = torch.randint(
    low=0,
    high=config.vocab_size,
    size=(2, 1024),  # bs x seq_len
    device="cpu",
    dtype=torch.int64,
    requires_grad=False,
)

example_inputs = {"input_ids": input, "decoder_input_ids": input}

model = prepare_pippy(model, example_kwargs=example_inputs)

args = (
    example_inputs["input_ids"].to("cuda:0"),
    example_inputs["decoder_input_ids"].to("cuda:0")
)
with torch.no_grad():
    output = model(*args)
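A detail worth noting, not shown in the snippet above (so treat the specifics as assumptions rather than this PR's exact behavior): the script is intended to be launched once per GPU, e.g. with accelerate launch --num_processes 2 example.py, and only the last pipeline stage ends up holding the real output, so post-processing is typically guarded on it:

from accelerate import PartialState

# Only the last pipeline stage receives the actual model output; other ranks
# get a placeholder, so guard any use of `output` on the last process.
if PartialState().is_last_process:
    print(output)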

Speedup:

Measured on 2x 4090s in full precision.

Bert
                       Accelerate/Sequential   PiPPy + Accelerate
First batch            0.2137s                 0.3119s
Average of 5 batches   0.0099s                 0.0062s

GPT2
                       Accelerate/Sequential   PiPPy + Accelerate
First batch            0.1959s                 0.4189s
Average of 5 batches   0.0205s                 0.0126s

T5
                       Accelerate/Sequential   PiPPy + Accelerate
First batch            0.2789s                 0.3809s
Average of 5 batches   0.0198s                 0.0166s

@muellerzr marked this pull request as draft on January 16, 2024.

@SunMarc (Member) left a comment

Very cool API! I like the design and how easy it is to use. I left a few comments, mainly around the split_points.

Comment on lines 93 to 94
# To act like a decorator so that it can be popped when doing `extract_model_from_parallel`
forward.__wrapped__ = model_forward
Nice!
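For readers unfamiliar with the trick being praised here: setting __wrapped__ mirrors what functools.wraps does, which is what lets an unwrapping helper recover the original function later. A minimal, generic sketch of the pattern (not the accelerate implementation):

import functools

def wrap_forward(forward):
    # functools.wraps sets wrapper.__wrapped__ = forward, so the original
    # callable can be "popped" back out later.
    @functools.wraps(forward)
    def wrapper(*args, **kwargs):
        return forward(*args, **kwargs)
    return wrapper

def unwrap(fn):
    # Generic stand-in for an extract_model_from_parallel-style helper.
    return getattr(fn, "__wrapped__", fn)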

Comment on lines 24 to 42
no_split_module_classes = getattr(model, "_no_split_modules", [])
if num_processes == 1:
    return infer_auto_device_map(model, no_split_module_classes=no_split_module_classes, clean_result=False)
model_size, shared = calculate_maximum_sizes(model)

# Split into `n` chunks for each GPU
memory = (model_size + shared[0]) / num_processes
memory = convert_bytes(memory)
value, ending = memory.split(" ")

# Add a chunk to deal with potential extra shared memory instances
memory = math.ceil(float(value)) * 1.1
memory = f"{memory} {ending}"
device_map = infer_auto_device_map(
    model,
    max_memory={i: memory for i in range(num_processes)},
    no_split_module_classes=no_split_module_classes,
    clean_result=False,
)
We can definitely generate a balanced device_map exclusively for pippy (device_map = "balanced_pippy") if the current balanced option is not the best fit here. However, I think it would be great if the user could also use other options like "sequential". I haven't tried it, but what happens when we only fill 2 GPUs out of the 4 available (a possible sequential case)?
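For context, a sequential-style placement that fills only some of the available GPUs can already be expressed through infer_auto_device_map by capping max_memory per device; a minimal sketch, with made-up memory budgets:

from accelerate import infer_auto_device_map

# Hypothetical "sequential" placement over 2 of 4 GPUs: generous budgets for
# devices 0 and 1, nothing for 2 and 3, so layers fill device 0 first, then 1.
device_map = infer_auto_device_map(
    model,
    max_memory={0: "20GiB", 1: "20GiB", 2: "0GiB", 3: "0GiB"},
    no_split_module_classes=getattr(model, "_no_split_modules", []),
)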

Comment on lines 81 to 83
if device_map == "auto":
    device_map = generate_device_map(model, PartialState().num_processes)
stage = build_pipeline(model, device_map, example_args, example_kwargs)
Just a thought about how to handle the split points:

    1. We only expose device_map with predefined options ("sequential", "balanced_pippy").
    2. We let the user pass a custom device_map. The custom case can be complicated, since the user needs to be careful about the order (OrderedDict()) and has to assign the GPUs sequentially because of split_points.append(next(k for k, v in device_map.items() if v == i)), so that can get quite involved.
    3. We let the user pass their own split points as a List[str] (see the sketch after this list).

    I think that 1) is a must. Between 2) and 3), I prefer 3) since it is easier for the user.
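For illustration, option 3 could look something like this from the user's side (the layer names below are made up, and split_points as a keyword argument is still a proposal at this point in the review):

# Hypothetical user-facing call for option 3: the user names the modules at
# which the model is cut into pipeline stages, one cut per stage boundary.
model = prepare_pippy(
    model,
    split_points=["encoder.block.6", "decoder.block.6"],  # made-up module names
    example_kwargs=example_inputs,
)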

muellerzr (Collaborator, Author): Agreed to do 1 and 3.

@SunMarc (Member) left a comment

The API is in good shape! Let's document the main functions a bit and then we can merge it. I left a few comments, but nothing blocking.

Comment on lines 80 to 84
if split_points == "auto":
    device_map = generate_device_map(model, state.num_processes, no_split_module_classes=no_split_module_classes)
    split_points = []
    for i in range(1, state.num_processes):
        split_points.append(next(k for k, v in device_map.items() if v == i))
It would be great to have a sanity check to make sure that we indeed end up with the expected number of split points for self.num_processes, both when we generate the split_points and when the user passes them manually.
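A minimal sketch of such a check, under the assumption that one split point is needed per stage boundary (i.e. num_processes - 1 of them):

# Hypothetical sanity check: the generated or user-provided split points must
# yield exactly one pipeline stage per process.
if len(split_points) != state.num_processes - 1:
    raise ValueError(
        f"Expected {state.num_processes - 1} split points for "
        f"{state.num_processes} processes, got {len(split_points)}."
    )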

muellerzr and others added 9 commits January 25, 2024 14:56
* Allow for dynamic batch padding

* Fix test

* Update src/accelerate/inference.py

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

* Break early after the first valid bs is found

* Less slicy-dicy

* Test cv model

* Start, need to test

* Use dataloader-like logic

* Refactor to utils

* With tests

* Update the source

* Clean

* bs=1 case

* Add test

* add some failing test

* Almost working version

* Much cleaner implementation

* Use pad_input_tensor

* All tests passing!

* Do it at tracing too

---------

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Marc Sun <marc@huggingface.co>
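The "dynamic batch padding" these commits refer to boils down to padding the input batch so it divides evenly into the pipeline's microbatches; a rough sketch of the idea (the helper name is hypothetical, not the utility added in this PR):

import torch

def pad_batch_to_multiple(tensor: torch.Tensor, num_chunks: int) -> torch.Tensor:
    # Hypothetical helper: repeat the last row until the batch dimension is
    # divisible by num_chunks, so every microbatch has the same size.
    remainder = tensor.size(0) % num_chunks
    if remainder == 0:
        return tensor
    pad_rows = tensor[-1:].repeat(num_chunks - remainder, *([1] * (tensor.dim() - 1)))
    return torch.cat([tensor, pad_rows], dim=0)

padded = pad_batch_to_multiple(torch.randint(0, 100, (3, 1024)), num_chunks=2)
print(padded.shape)  # torch.Size([4, 1024])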
@kwen2501 left a comment

Thanks a lot for the integration effort!
LGTM!

@SunMarc (Member) left a comment

Thanks for iterating! LGTM

@kwen2501 left a comment

Thanks for writing the doc so quickly! Looks good to me!

muellerzr and others added 3 commits February 5, 2024 15:47
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
@muellerzr marked this pull request as ready for review on February 5, 2024.
@muellerzr changed the title from "Pippy integration v2" to "torch-native pipeline parallelism for big models" on Feb 6, 2024.
@muellerzr (Collaborator, Author):

cc @MKhalusova for the docs!

@MKhalusova (Contributor) left a comment

Nice work! I left a few comments to polish things in the docs a bit.

@muellerzr (Collaborator, Author):

Final comment before merging: things that still need to be done in a later PR at some point (but are okay not being in the first iteration of this joint effort):

  1. Specify balanced_pippy device map and allow a sequential device_map when making the pipeline via prepare_pippy
  2. Look into supporting model.generate() through an alternative hook into the model forward if possible
  3. Make sure all outputs end up on the CPU so users don't need to check at the end, and we can call them via a .gather (see the sketch after this list)
  4. Migrate the pippy-device-map-playground examples over to here as part of our examples folder

(I'll be doing 3 & 4 this week as a follow-up prior to release)
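For item 3, one way to make the output available everywhere, regardless of which rank ran the last stage, is to broadcast it as a Python object; a hypothetical sketch, not the implementation this PR or its follow-ups ship:

import torch.distributed as dist

def share_output_with_all_ranks(output, last_rank: int):
    # Hypothetical: the last pipeline stage broadcasts its (CPU) output so
    # every process returns the same object and no rank check is needed.
    obj = [output if dist.get_rank() == last_rank else None]
    dist.broadcast_object_list(obj, src=last_rank)
    return obj[0]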

@muellerzr merged commit 0867c09 into main on Feb 6, 2024.
25 checks passed
@muellerzr deleted the pippy-integration-v2 branch on February 6, 2024.