[BUG]: shardformer: pipeline forward error with customized layer distribution #5187

Closed
insujang opened this issue Dec 15, 2023 · 4 comments · Fixed by #5189

insujang (Contributor):

🐛 Describe the bug

Hi, I am trying to implement a custom shard policy with a different layer distribution, but it seems all built-in policies share the following inconsistency:

In get_held_layers(), a policy uses self.distribute_layers() and self.get_stage_index(), which are customizable:

layers_per_stage = self.distribute_layers(len(module.h), stage_manager.num_stages)
if stage_manager.is_first_stage():
held_layers.append(module.wte)
held_layers.append(module.wpe)
held_layers.append(module.drop)
start_idx, end_idx = self.get_stage_index(layers_per_stage, stage_manager.stage)

But in set_pipeline_forward(), the policy uses Policy.distribute_layers() and Policy.get_stage_index():

layers_per_stage = Policy.distribute_layers(len(module.h), stage_manager.num_stages)
stage_index = Policy.get_stage_index(layers_per_stage, stage_manager.stage)

so if these functions are overridden, the two code paths compute different layer assignments, and the pipeline forward fails at runtime.
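To illustrate the dispatch difference outside of ColossalAI, here is a minimal sketch; the class names mirror the real Policy API, but the even-split logic below is a simplified stand-in, not the actual implementation:

from typing import List


class Policy:
    """Stand-in for colossalai.shardformer.policies.base_policy.Policy."""

    @staticmethod
    def distribute_layers(num_layers: int, num_stages: int) -> List[int]:
        # Simplified even split; the real implementation differs in detail.
        quotient, remainder = divmod(num_layers, num_stages)
        return [quotient + (1 if i < remainder else 0) for i in range(num_stages)]


class CustomPolicy(Policy):
    @staticmethod
    def distribute_layers(num_layers: int, num_stages: int) -> List[int]:
        # Give the first stage 4 extra layers.
        layers_per_stage = Policy.distribute_layers(num_layers - 4, num_stages)
        layers_per_stage[0] += 4
        return layers_per_stage


policy = CustomPolicy()
print(policy.distribute_layers(12, 4))  # [6, 2, 2, 2]  <- dynamic dispatch, as in get_held_layers()
print(Policy.distribute_layers(12, 4))  # [3, 3, 3, 3]  <- hard-coded base-class call, as in set_pipeline_forward()

Because set_pipeline_forward() hard-codes the base-class call, the stage schedule it builds never sees the override; the straightforward fix is to have it call self.distribute_layers() and self.get_stage_index() as well, so both paths honor the same (possibly overridden) implementation.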

How to reproduce

I tested with examples/language/gpt/hybridparallelism/finetune.py.
For the hybrid_parallel plugin, add a custom policy:

elif args.plugin == "hybrid_parallel":
    BATCH_SIZE = 128
    from typing import List

    from colossalai.shardformer.policies.base_policy import Policy
    from colossalai.shardformer.policies.gpt2 import GPT2ForSequenceClassificationPolicy

    class CustomGPT2Policy(GPT2ForSequenceClassificationPolicy):
        @staticmethod
        def distribute_layers(num_layers: int, num_stages: int) -> List[int]:
            # Same as the default split, but move 4 extra layers to the first stage.
            layers_per_stage = Policy.distribute_layers(num_layers - 4, num_stages)
            layers_per_stage[0] += 4
            return layers_per_stage

    plugin = HybridParallelPlugin(
        tp_size=1,
        pp_size=4,
        num_microbatches=None,
        microbatch_size=8,
        zero_stage=0,
        precision="fp16",
        initial_scale=1,
        custom_policy=CustomGPT2Policy(),
    )

which distributes layers in a slightly different way: the first stage holds 4 more layers.
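To make the mismatch concrete, here are hypothetical numbers assuming a 12-layer GPT-2 checkpoint and pp_size=4 (actual values depend on the loaded model):

CustomGPT2Policy.distribute_layers(12, 4)  # [6, 2, 2, 2] -> used by get_held_layers()
Policy.distribute_layers(12, 4)            # [3, 3, 3, 3] -> used by set_pipeline_forward()
# With the first split, stage 1 holds blocks 6-7; with the second, stage 1 is
# scheduled to run blocks 3-5, whose parameters were released on that rank
# (they are None there), which surfaces as the addmm error below.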

This leads to the following error:

...
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/gpt2/modeling_gpt2.py", line 312, in forward
    query, key, value = self.c_attn(hidden_states).split(self.split_size, dim=2)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/pytorch_utils.py", line 107, in forward
    x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)
TypeError: addmm(): argument 'input' (position 1) must be Tensor, not NoneType

Environment

torch 2.1.0 + cu118

insujang added the bug label on Dec 15, 2023
CWHer (Contributor) commented on Dec 15, 2023:

Thanks for reporting. Would you like to submit a PR to solve this issue? :)

insujang (Contributor, Author):
Submitted!

insujang reopened this on Dec 15, 2023
CWHer (Contributor) commented on Mar 27, 2024:

Sorry for the delayed update; I was assigned to another task for the last several months. This issue is finally resolved.
