🐛 Describe the bug
Hi, I am trying to implement a custom shard policy with a different layer distribution, but it seems all built-in policies share the following inconsistent implementation:
In get_held_layers(), a policy uses self.distribute_layers() and self.get_stage_index(), which are customizable:
ColossalAI/colossalai/shardformer/policies/gpt2.py, lines 170 to 175 at commit 79718fa
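For reference, the code at those lines looks roughly like the following paraphrase (not verbatim; `module.h` here is the list of GPT-2 transformer blocks):

```python
# Inside GPT2Policy.get_held_layers(): dispatches through `self`,
# so a subclass override of either method takes effect here.
layers_per_stage = self.distribute_layers(len(module.h), stage_manager.num_stages)
start_idx, end_idx = self.get_stage_index(layers_per_stage, stage_manager.stage)
held_layers.extend(module.h[start_idx:end_idx])
```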
But in set_pipeline_forward(), the policy uses Policy.distribute_layers() and Policy.get_stage_index():
ColossalAI/colossalai/shardformer/policies/gpt2.py, lines 192 to 193 at commit 79718fa
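whereas the code at these lines calls the base class directly (again a paraphrase, not verbatim):

```python
# Inside GPT2Policy.set_pipeline_forward(): calls the Policy base class
# explicitly, so an override in a custom subclass is bypassed here.
layers_per_stage = Policy.distribute_layers(len(module.h), stage_manager.num_stages)
stage_index = Policy.get_stage_index(layers_per_stage, stage_manager.stage)
```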
which will raise an error during the pipeline forward pass, due to the inconsistent layer assignment, if those functions are overridden.
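The divergence is easy to demonstrate in isolation. Below is a minimal, self-contained sketch (hypothetical names; it assumes both helpers are staticmethods, as the `Policy.*` call style suggests):

```python
class Policy:
    @staticmethod
    def distribute_layers(num_layers: int, num_stages: int) -> list:
        # Default: near-even split, remainder going to the earliest stages.
        q, r = divmod(num_layers, num_stages)
        return [q + (1 if i < r else 0) for i in range(num_stages)]

class CustomPolicy(Policy):
    @staticmethod
    def distribute_layers(num_layers: int, num_stages: int) -> list:
        # Override: the first stage takes 4 extra layers.
        split = Policy.distribute_layers(num_layers - 4, num_stages)
        split[0] += 4
        return split

policy = CustomPolicy()
print(policy.distribute_layers(12, 2))  # [8, 4] -- the get_held_layers() path
print(Policy.distribute_layers(12, 2))  # [6, 6] -- the set_pipeline_forward() path
```

Each stage therefore holds the parameters for one split but runs its forward over the other.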
How to reproduce
I tested with examples/language/gpt/hybridparallelism/finetune.py. For the hybrid_parallel plugin, add a custom policy (sketched below) which distributes layers in a slightly different way: the first stage holds 4 more layers.
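The original snippet is not reproduced above, so here is a hypothetical reconstruction under stated assumptions: the base class (`GPT2ForSequenceClassificationPolicy`, matching the finetuning example's model), the `custom_policy` plugin argument, and the `distribute_layers` signature are inferred rather than confirmed.

```python
from colossalai.shardformer.policies.gpt2 import GPT2ForSequenceClassificationPolicy

class UnevenGPT2Policy(GPT2ForSequenceClassificationPolicy):  # hypothetical name
    @staticmethod
    def distribute_layers(num_layers, num_stages):
        # Skew the default near-even split: the first stage gets 4 extra layers.
        q, r = divmod(num_layers - 4, num_stages)
        split = [q + (1 if i < r else 0) for i in range(num_stages)]
        split[0] += 4
        return split

# Assumed wiring (not verbatim from the original report):
# plugin = HybridParallelPlugin(..., custom_policy=UnevenGPT2Policy())
```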
This leads to the following error:
...
File "/usr/local/lib/python3.10/dist-packages/transformers/models/gpt2/modeling_gpt2.py", line 312, in forward
query, key, value = self.c_attn(hidden_states).split(self.split_size, dim=2)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/pytorch_utils.py", line 107, in forward
x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)
TypeError: addmm(): argument 'input' (position 1) must be Tensor, not NoneType
Presumably the stage runs its forward over blocks that get_held_layers() never marked as held, so their parameters were released and self.bias is None by the time c_attn is called.
Environment
torch 2.1.0 + cu118