Let's support naive Pipeline Parallelism #210
Conversation
The documentation is not available anymore as the PR was closed or merged. |
Experiments with gpt-neo-1b int8 + peft on multi-GPU: https://wandb.ai/distill-bloom/trl/runs/x3d6fig6?workspace=user-younesbelkada
Ran a DP script with |
lvwerra left a comment:
Looks good overall. One main thing that I think we need to fix soon is the way the different approaches are loaded (peft, PP, int8). This would also allow us to test compatibility of different methods at loading time. Loading a model twice is not very intuitive, but we can fix this in a dedicated PR.
| "The model is offloaded on CPU or disk - CPU & disk offloading is not supported for ValueHead models." | ||
| ) | ||
|
|
||
| first_device = list(set(self.pretrained_model.hf_device_map.values()))[0] |
sets do not necessarily preserve order, this is an issue here, no?
first_device = list(set(self.pretrained_model.hf_device_map.values()))[0]

self.v_head = self.v_head.to(first_device)
why is the head on the first device? naively i would have put it on the last device because it's called last, no?

Because the lm_head is usually on the first device; I modified it a bit to use the lm_head device instead.
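A minimal sketch of that deterministic alternative; `get_value_head_device` is a hypothetical helper written only for illustration, not part of trl:

```python
def get_value_head_device(hf_device_map):
    """Pick the device the value head should live on (hypothetical helper).

    hf_device_map maps module names (e.g. "lm_head", "transformer.h.0") to
    devices. Using the lm_head entry is deterministic, unlike indexing into
    a set of the map's values, whose order is not guaranteed.
    """
    if "lm_head" in hf_device_map:
        return hf_device_map["lm_head"]
    # Illustrative fallback: the first entry in the map's insertion order.
    return next(iter(hf_device_map.values()))

# Toy example (values would normally be GPU indices or "cpu"):
print(get_value_head_device({"transformer": 0, "lm_head": 1}))  # -> 1
```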
pretrained_model = AutoModelForCausalLM.from_pretrained(
    config.model_name, load_in_8bit=True, device_map="balanced", max_memory={0: "800MB", 1: "800MB"}
)
I am thinking mid-term we should integrate that into the model classes as well. It's not very intuitive to load AutoModelForCausalLM and later AutoModelForCausalLMWithValueHead.
Same with peft. We could just pass the configs as kwargs, right?

Hmm, for now we can't, as we need to do it in 2 stages:
1- load the transformers model
2- pass it to get_peft_model
We can open a follow-up PR to make that simpler. A minimal sketch of the current two-stage flow is shown below.
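The sketch assumes a LoRA config; the model name and hyper-parameters are illustrative and not taken from this PR:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Stage 1: load the transformers model (int8 + device_map requires bitsandbytes/accelerate).
base_model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neo-1.3B",
    load_in_8bit=True,
    device_map="balanced",
)

# Stage 2: wrap it with peft adapters.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(base_model, lora_config)
```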
def set_device_hook(module, input, outputs):
    # Forward hook: move every tensor in the model's outputs back to the
    # device that holds the lm_head / value head (`first_device`), so that
    # downstream loss computation happens on a single device.
    new_output = ()
    for output in outputs:
        if isinstance(output, torch.Tensor):
            new_output += (output.to(first_device),)
        else:
            new_output += (output,)
    return new_output

# Register the hook and mark the model as naively pipeline-parallel.
self.register_forward_hook(set_device_hook)
self.is_sequential_parallel = True
an explanation of what this does would be useful. maybe some comments :)
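For context, a PyTorch forward hook that returns a non-None value replaces the module's output; a minimal standalone sketch of the same idea on a toy module (the device choice is illustrative):

```python
import torch
import torch.nn as nn

target_device = "cuda:0" if torch.cuda.is_available() else "cpu"  # illustrative

def move_output_hook(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output,
    # so callers transparently receive tensors on `target_device`.
    return output.to(target_device)

model = nn.Linear(4, 4)
model.register_forward_hook(move_output_hook)
out = model(torch.randn(2, 4))
print(out.device)  # -> target_device
```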
accelerator_kwargs: Optional[dict] = {},
tracker_project_name: Optional[str] = "trl",
max_grad_norm: Optional[float] = None,
optimize_cuda_cache: Optional[bool] = False,
are there drawbacks to setting it to true?

also, the order in the docstring and the kwargs is different, i think it's better to be consistent :)

Fixed the order!
The drawback might be the computational time of the step function - I haven't benchmarked that though.
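For context, options like this typically free cached CUDA memory at the end of each optimization step, trading a bit of step time for lower peak memory. A hedged sketch of that mechanism, not the exact trl implementation:

```python
import gc

import torch

def maybe_free_cuda_cache(optimize_cuda_cache: bool) -> None:
    # Releasing cached allocator blocks lowers peak memory between PPO steps,
    # but forces the allocator to re-reserve memory later, which costs time.
    if optimize_cuda_cache and torch.cuda.is_available():
        gc.collect()
        torch.cuda.empty_cache()
```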
* add fixes in to support PP
* add same logic for enc-dec
* add more checks
* fix 20b issues
* clean up
* update scripts
* dp safety checker
* added multi gpu tests
* fix order
* change
* fix script
What does this PR do?
Loading a model on a single device is cool, but what if we could split the model across multiple devices?
Users will just have to pass a custom `device_map` when loading the model, and it should work out of the box. This PR adds support for "Sequential Parallelism" - termed naive Pipeline Parallelism here, since real Pipeline Parallelism involves multi-processing and gradient synchronisation, which cannot be handled easily. A minimal usage sketch is shown below.
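The sketch assumes the value-head wrapper accepts an already-loaded transformers model, as the review discussion above suggests; the model name and memory limits are illustrative:

```python
from transformers import AutoModelForCausalLM
from trl import AutoModelForCausalLMWithValueHead

# Split the base model across two GPUs via a device map / memory budget.
pretrained_model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neo-1.3B",
    device_map="balanced",
    max_memory={0: "800MB", 1: "800MB"},
)

# Wrap it with a value head; the head is placed on the lm_head's device and
# outputs are moved back there by the forward hook shown earlier.
model = AutoModelForCausalLMWithValueHead.from_pretrained(pretrained_model)
```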
This PR depends on the following PRs:
- accelerate: [Accelerator] We should not call `to` on modules that wraps `accelerate` loaded models (accelerate#1172)
- peft: [core] Fix peft multi-gpu issue (peft#145)

TODOs:
cc @lvwerra @edbeeching