[REQUEST] Add documentation on how to run fast inference of transformers models with ZeRO-3
#5498
Labels: enhancement (New feature or request)
Is your feature request related to a problem? Please describe.
Hello DeepSpeed team, while looking at how to accelerate text generation in TRL with ZeRO-3, we learned from @pacman100 that the most efficient method is to remove/add hooks within a context manager as follows:
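The original snippet is not reproduced here, but the pattern it implements can be sketched without DeepSpeed: a context manager detaches the ZeRO-3 forward hooks before generation and restores them afterwards. Everything below is illustrative; `Module`, `remove_zero3_hooks`, and the hook names are stand-ins, not DeepSpeed or TRL APIs.

```python
# Framework-free sketch of the hook remove/re-add pattern, assuming a
# hypothetical `forward_hooks` list on the model. In real code these would
# be DeepSpeed's ZeRO-3 partitioning hooks on each submodule.
from contextlib import contextmanager


class Module:
    """Stand-in for a model whose submodules carry ZeRO-3 forward hooks."""

    def __init__(self):
        self.forward_hooks = ["zero3_gather_hook", "zero3_release_hook"]


@contextmanager
def remove_zero3_hooks(model):
    # Detach the partitioning hooks so generation runs on gathered weights.
    saved = model.forward_hooks
    model.forward_hooks = []
    try:
        yield model
    finally:
        # Re-attach the hooks so ZeRO-3 training can resume afterwards.
        model.forward_hooks = saved


model = Module()
with remove_zero3_hooks(model) as unwrapped:
    # model.generate(...) would go here, free of per-forward gather overhead
    pass
```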
This works well for inference, but during DPO training we hit a rather cryptic error, and only when the number of gradient accumulation steps was greater than 1:
The solution @pacman100 found is that one must carefully deregister all still-active parameters while removing the hooks, which led to this fix in TRL: huggingface/trl#1617
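The shape of that fix can be sketched as follows. The `ds_active_sub_modules` attribute mirrors the bookkeeping DeepSpeed keeps on ZeRO-3 partitioned parameters, but treat it, and the `Param`/`Engine` classes, as illustrative assumptions rather than the real API: the key idea is simply to clear each parameter's active-submodule record before dropping the hooks, so the next accumulation step does not trip over stale state.

```python
# Illustrative sketch of the TRL fix: deregister active parameters
# before removing the forward hooks. Classes are stand-ins, not
# DeepSpeed internals.
class Param:
    def __init__(self):
        # ZeRO-3 tracks which submodules currently hold this parameter
        # "active"; stale entries here caused the cryptic error.
        self.ds_active_sub_modules = {0, 1}


class Engine:
    def __init__(self):
        self.params = [Param(), Param()]
        self.forward_hooks = ["gather", "release"]


def remove_hooks(engine):
    # The crux of the fix: clear every parameter's active-submodule
    # record first, only then drop the hooks.
    for p in engine.params:
        p.ds_active_sub_modules.clear()
    engine.forward_hooks.clear()
```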
Getting to the bottom of this issue was quite tricky, and the DeepSpeed documentation unfortunately contains no guidance on how to do this. I'm sharing the issue here for broader visibility, in case others are trying to speed up ZeRO-3 generation during training.
Describe the solution you'd like
An example in the documentation that shows how to run fast text generation with ZeRO-3 inside a training loop. This is very useful for online methods like PPO.
Describe alternatives you've considered
N/A
Additional context