Extend DeepSpeed integration to ZeRO-{1,2,3}#758
Conversation
```diff
 machine_rank: 0
 main_training_function: main
-mixed_precision: 'no'
+mixed_precision: 'bf16'
```
We can now set this as the default since we initialise both the reference and active models with DeepSpeed
```python
# NOTE: gpt2 models use Conv1D instead of Linear layers which are not yet supported in 8 bit mode
# models like gpt-neo* models are more suitable.
model_name: Optional[str] = field(default="lvwerra/gpt2-imdb", metadata={"help": "the model name"})
reward_model_name: Optional[str] = field(
```
I've added this arg to make it easier to configure this script from the command line.
```python
deepspeed_plugin = self.accelerator.state.deepspeed_plugin
batch_size_per_device = deepspeed_plugin.deepspeed_config["train_micro_batch_size_per_gpu"]
# See DeepSpeed docs for definition of these parameters: https://deepspeed.readthedocs.io/en/latest/zero3.html
config_kwargs = {
```
All these parameters are set automatically by accelerate, so they don't need duplicating. One check I need to make is the inclusion of gradient accumulation.
Update: yes, train_batch_size does reflect gradient accumulation as well, so this is fine to remove IMO.
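For context on the update above, DeepSpeed validates that its `train_batch_size` equals the per-GPU micro batch size times gradient accumulation steps times the number of processes, which is why the accelerate-populated value already accounts for accumulation. A minimal sketch of that arithmetic (the numbers are illustrative, not taken from this PR's config):

```python
# How DeepSpeed derives the effective train batch size:
# train_batch_size = micro_batch_per_gpu * grad_accum_steps * world_size
micro_batch_per_gpu = 32        # train_micro_batch_size_per_gpu
grad_accum_steps = 4            # gradient_accumulation_steps
world_size = 8                  # number of GPUs/processes

train_batch_size = micro_batch_per_gpu * grad_accum_steps * world_size
print(train_batch_size)  # 1024
```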
@lewtun nice work! I love that there are gpt2 runs across different ZeRO stages. Could you also test ZeRO-2 and ZeRO-3 on larger models such as Falcon 7B or Cerebras-GPT 6.7B?
Yes, I'm running the Cerebras models as we speak and will report back when the runs are done :)
Update on running 3 x 6.7B models with DeepSpeed on a single node.

Here's the command I used to test:

```shell
accelerate launch --config_file=examples/accelerate_configs/deepspeed_zero{2,3}.yaml \
    examples/scripts/sentiment_tuning.py --batch_size 32 --mini_batch_size 32 \
    --log_with wandb --model_name cerebras/Cerebras-GPT-6.7B --reward_model_name cerebras/Cerebras-GPT-6.7B
```

Interestingly, although ZeRO-3 is less memory intensive, the savings aren't as high as I would have expected on a single node:
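One hedged, back-of-envelope explanation for the modest single-node savings: ZeRO-2 already shards the gradients and the (large) optimizer states, so the only extra thing ZeRO-3 shards is the fp16 parameter copy, roughly 2 bytes/param. A sketch using the ZeRO paper's 2 + 2 + 12 bytes/param accounting for model states (activations, buffers, and the frozen reference/reward models are ignored, so these are not the PR's measured numbers):

```python
def model_state_gb(n_params, n_gpus, stage):
    """Approximate per-GPU memory (GB) for ZeRO model states of the trained model."""
    params_bytes, grads_bytes, optim_bytes = 2.0, 2.0, 12.0  # fp16 params, fp16 grads, Adam states
    if stage >= 1:
        optim_bytes /= n_gpus   # ZeRO-1 shards optimizer states
    if stage >= 2:
        grads_bytes /= n_gpus   # ZeRO-2 additionally shards gradients
    if stage >= 3:
        params_bytes /= n_gpus  # ZeRO-3 additionally shards parameters
    return n_params * (params_bytes + grads_bytes + optim_bytes) / 1024**3

n = 6.7e9  # Cerebras-GPT-6.7B
for stage in (1, 2, 3):
    print(f"ZeRO-{stage}: ~{model_state_gb(n, 8, stage):.1f} GB/GPU")
```

Under this rough accounting, the stage-2 to stage-3 drop is only the sharded 2 bytes/param, which is small relative to the optimizer states that stages 1 and 2 already shard.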
```python
# Some tokenizers like GPT-2's don't have a padding token by default, so we set one here.
if sentiment_pipe.tokenizer.pad_token_id is None:
    sentiment_pipe.tokenizer.pad_token_id = tokenizer.pad_token_id

if sentiment_pipe.model.config.pad_token_id is None:
    sentiment_pipe.model.config.pad_token_id = tokenizer.pad_token_id
```
Do you know why this was not needed before?
It's usually not needed if you've already trained a proper reward model because this comes with a proper padding token. However, if you want to plug and play with any causal LM on the Hub then this is typically needed to avoid throwing errors in the pipeline
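The fallback pattern being discussed can be sketched with stand-in objects, independent of `transformers` (the pipeline/tokenizer attribute names follow the diff; the stand-in classes and the specific id are illustrative):

```python
from types import SimpleNamespace

def ensure_pad_token(sentiment_pipe, tokenizer):
    # Mirror of the diff's logic: inherit the training tokenizer's pad token
    # when the pipeline's tokenizer or model config lack one.
    if sentiment_pipe.tokenizer.pad_token_id is None:
        sentiment_pipe.tokenizer.pad_token_id = tokenizer.pad_token_id
    if sentiment_pipe.model.config.pad_token_id is None:
        sentiment_pipe.model.config.pad_token_id = tokenizer.pad_token_id

# Stand-ins for a GPT-2-style pipeline with no pad token set
pipe = SimpleNamespace(
    tokenizer=SimpleNamespace(pad_token_id=None),
    model=SimpleNamespace(config=SimpleNamespace(pad_token_id=None)),
)
train_tok = SimpleNamespace(pad_token_id=50256)  # GPT-2 convention: reuse eos as pad

ensure_pad_token(pipe, train_tok)
print(pipe.tokenizer.pad_token_id)  # 50256
```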
Great work @lewtun
Thanks in advance!
Hi @uahmad235! Here are answers to your questions:
It will likely be tight to fit 3 x 7B models on 2 x A6000s, so one possibility would be to quantize the reward model by passing […]. Hope that helps!
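A rough fit check supporting the "tight" assessment above (all bytes/param figures are illustrative ZeRO-paper-style estimates; activations, KV cache, and offloading are ignored, and the 8-bit figure assumes weight-only quantization of the reward model):

```python
# 3 x 7B models (policy + frozen reference + reward) on 2 x A6000 (48 GB each)
GPU_MEM_GB = 48
N_GPUS = 2
params = 7e9

policy_states = params * (2 + 2 + 12) / N_GPUS / 1024**3  # ZeRO-3 shards params/grads/optimizer
ref_model     = params * 2 / N_GPUS / 1024**3              # frozen fp16, sharded under ZeRO-3
reward_fp16   = params * 2 / 1024**3                       # pipeline model, unsharded
reward_int8   = params * 1 / 1024**3                       # 8-bit quantized pipeline model

total_fp16 = policy_states + ref_model + reward_fp16
total_int8 = policy_states + ref_model + reward_int8
print(f"fp16 reward: ~{total_fp16:.1f} GB/GPU, int8 reward: ~{total_int8:.1f} GB/GPU")
```

Even with the reward model in 8-bit, this crude estimate lands above 48 GB per GPU before activations are counted, which is consistent with the suggestion that a pair of A100s (or offloading) may be needed.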
Thanks for the info @lewtun. However, using […] Seems like I might have to go for a pair of A100s.
* Generalise deepspeed
* Refactor
* Add reward model arg
* Fix pipeline tokenizer
* Fix deprecation
* Pin deepspeed lower
* Fix docs
* Revert top_k change
* Add ZeRO-3 context manager
* Revert docs change
* Fix docs
* Polish docs
* Update docs/source/customization.mdx

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

This PR extends the DeepSpeed initialization of the reference model to work with all stages of DeepSpeed ZeRO.
I'll share some plots of the GPT-2 runs on sentiment tuning shortly, but the code should be good for a review.
Tested with:
```shell
accelerate launch --config_file=examples/accelerate_configs/deepspeed_zero{1,2,3}.yaml \
    examples/scripts/sentiment_tuning.py --batch_size 32 --mini_batch_size 32 --log_with wandb
```

Here are screenshots of the various runs on wandb: https://wandb.ai/huggingface/trl?workspace=user-lewtun
Overall, we're getting good agreement between the baseline (no DeepSpeed) and stages 1 & 2, while stage 3 shows a noticeable discrepancy in the value loss that's worth digging into in a separate issue IMO.
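For readers reproducing these runs, an accelerate DeepSpeed config along the lines of the `examples/accelerate_configs/deepspeed_zero3.yaml` file referenced above might look like this (the field names are standard accelerate config keys, but the specific values here are illustrative, not copied from this PR):

```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero_stage: 3
distributed_type: DEEPSPEED
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
```

Setting `zero_stage` to 1 or 2 (and `zero3_init_flag: false`) would correspond to the other configs tested in this PR.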