Let's support naive Pipeline Parallelism #210
Conversation
The documentation is not available anymore as the PR was closed or merged. |
Experiments with gpt-neo-1b int8 + peft on multi-GPU: https://wandb.ai/distill-bloom/trl/runs/x3d6fig6?workspace=user-younesbelkada
Ran a DP script with |
lvwerra left a comment:
Looks good overall. One main thing that I think we need to fix soon is the way the different approaches are loaded (peft, PP, int8). This would also allow us to test compatibility of different methods at loading time. Loading a model twice is not very intuitive, but we can fix this in a dedicated PR.
| "The model is offloaded on CPU or disk - CPU & disk offloading is not supported for ValueHead models." | ||
| ) | ||
|
|
||
| first_device = list(set(self.pretrained_model.hf_device_map.values()))[0] |
sets do not necessarily preserve order, this is an issue here, no?
first_device = list(set(self.pretrained_model.hf_device_map.values()))[0]

self.v_head = self.v_head.to(first_device)
why is the head on the first device? naively i would have put it on the last device because it's called last, no?

Because the lm_head is usually on the first device; I modified it a bit to use the lm_head device instead.
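A minimal sketch of that deterministic alternative; `get_value_head_device` is a hypothetical helper written only for illustration, not part of trl:

```python
def get_value_head_device(hf_device_map):
    """Pick the device the value head should live on (hypothetical helper).

    hf_device_map maps module names (e.g. "lm_head", "transformer.h.0") to
    devices. Using the lm_head entry is deterministic, unlike indexing into
    a set of the map's values, whose order is not guaranteed.
    """
    if "lm_head" in hf_device_map:
        return hf_device_map["lm_head"]
    # Illustrative fallback: the first entry in the map's insertion order.
    return next(iter(hf_device_map.values()))

# Toy example (values would normally be GPU indices or "cpu"):
print(get_value_head_device({"transformer": 0, "lm_head": 1}))  # -> 1
```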
pretrained_model = AutoModelForCausalLM.from_pretrained(
    config.model_name, load_in_8bit=True, device_map="balanced", max_memory={0: "800MB", 1: "800MB"}
)
I am thinking mid-term we should integrate that into the model classes as well. It's not very intuitive to load AutoModelForCausalLM and later AutoModelForCausalLMWithValueHead.
Same with peft. We could just pass the configs as kwargs, right?

Hmm, for now we can't, as we need to do it in 2 stages:
1- load the transformers model
2- pass it to get_peft_model
We can open a follow-up PR to make that simpler. A minimal sketch of the current two-stage flow is shown below.
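The sketch assumes a LoRA config; the model name and hyper-parameters are illustrative and not taken from this PR:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Stage 1: load the transformers model (int8 + device_map requires bitsandbytes/accelerate).
base_model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neo-1.3B",
    load_in_8bit=True,
    device_map="balanced",
)

# Stage 2: wrap it with peft adapters.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(base_model, lora_config)
```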
def set_device_hook(module, input, outputs):
    # Forward hook: move every tensor in the model's outputs back to the
    # device that holds the lm_head / value head (`first_device`), so that
    # downstream loss computation happens on a single device.
    new_output = ()
    for output in outputs:
        if isinstance(output, torch.Tensor):
            new_output += (output.to(first_device),)
        else:
            new_output += (output,)
    return new_output

# Register the hook and mark the model as naively pipeline-parallel.
self.register_forward_hook(set_device_hook)
self.is_sequential_parallel = True
an explanation of what this does would be useful. maybe some comments :)
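For context, a PyTorch forward hook that returns a non-None value replaces the module's output; a minimal standalone sketch of the same idea on a toy module (the device choice is illustrative):

```python
import torch
import torch.nn as nn

target_device = "cuda:0" if torch.cuda.is_available() else "cpu"  # illustrative

def move_output_hook(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output,
    # so callers transparently receive tensors on `target_device`.
    return output.to(target_device)

model = nn.Linear(4, 4)
model.register_forward_hook(move_output_hook)
out = model(torch.randn(2, 4))
print(out.device)  # -> target_device
```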
accelerator_kwargs: Optional[dict] = {},
tracker_project_name: Optional[str] = "trl",
max_grad_norm: Optional[float] = None,
optimize_cuda_cache: Optional[bool] = False,
are there drawbacks to setting it to true?

also, the order in the docstring and the kwargs is different, i think it's better to be consistent :)

Fixed the order!
The drawback might be the computational time of the step function - I haven't benchmarked that though.
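For context, options like this typically free cached CUDA memory at the end of each optimization step, trading a bit of step time for lower peak memory. A hedged sketch of that mechanism, not the exact trl implementation:

```python
import gc

import torch

def maybe_free_cuda_cache(optimize_cuda_cache: bool) -> None:
    # Releasing cached allocator blocks lowers peak memory between PPO steps,
    # but forces the allocator to re-reserve memory later, which costs time.
    if optimize_cuda_cache and torch.cuda.is_available():
        gc.collect()
        torch.cuda.empty_cache()
```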
* add fixes in to support PP
* add same logic for enc-dec
* add more checks
* fix 20b issues
* clean up
* update scripts
* dp safety checker
* added multi gpu tests
* fix order
* change
* fix script
What does this PR do?
Loading a model on a single device is cool, but what if we could split the model across multiple devices?
Users will just have to pass a custom `device_map` when loading the model, and it should work out of the box. This PR adds support for "Sequential Parallelism" - termed naive Pipeline Parallelism here, since real Pipeline Parallelism involves multi-processing and gradient synchronisation, which cannot be handled easily. A minimal usage sketch is shown below.
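The sketch assumes the value-head wrapper accepts an already-loaded transformers model, as the review discussion above suggests; the model name and memory limits are illustrative:

```python
from transformers import AutoModelForCausalLM
from trl import AutoModelForCausalLMWithValueHead

# Split the base model across two GPUs via a device map / memory budget.
pretrained_model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neo-1.3B",
    device_map="balanced",
    max_memory={0: "800MB", 1: "800MB"},
)

# Wrap it with a value head; the head is placed on the lm_head's device and
# outputs are moved back there by the forward hook shown earlier.
model = AutoModelForCausalLMWithValueHead.from_pretrained(pretrained_model)
```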
This PR depends on the following PRs:
- accelerate: [Accelerator] We should not call `to` on modules that wraps `accelerate` loaded models (accelerate#1172)
- peft: [core] Fix peft multi-gpu issue (peft#145)

TODOs:
cc @lvwerra @edbeeching