[BUG] apply_tensor_parallelism() is not executed in Zero3 without self.mpu #4080

Open
devamanyu opened this issue Aug 3, 2023 · 3 comments
Labels: bug (Something isn't working), deepspeed-chat (Related to DeepSpeed-Chat)

Comments

@devamanyu

Describe the bug

In Hybrid Engine, apply_tensor_parallelism() is never called when the model's inference containers require tp > 1 but self.mpu is None. For example, for a large model under ZeRO-3 (which by itself does not set a model-parallel unit), apply_tensor_parallelism() is skipped entirely.
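For reference, the relevant control flow looks roughly like this (a paraphrased sketch built around the call quoted under "To Reproduce"; the surrounding guard and loop structure is assumed, not copied from the source):

```python
# Sketch of the Hybrid Engine path described in this issue (structure assumed).
if self.mpu is not None:  # ZeRO-3 alone does not set a model-parallel unit, so mpu stays None
    for layer_id in range(len(self._inference_containers)):
        self._inference_containers[layer_id].apply_tensor_parallelism(self.mp_replace)
# With inference_tp_size > 1 but self.mpu is None, this branch is skipped and the
# inference weights are never sharded across the tensor-parallel ranks.
```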

To Reproduce

The guarded call site (excerpt from the linked code):

```python
self._inference_containers[layer_id].apply_tensor_parallelism(self.mp_replace,
```

To reproduce: load any large model, say LLaMA 30B, with ZeRO-3 + Hybrid Engine.

devamanyu added the bug and deepspeed-chat labels on Aug 3, 2023
@wang990099

I'm hitting the same issue. When I call apply_tensor_parallelism() manually, I get the wrong attention tensor sizes as well; in auto_tp.py, the tensor-parallel copy allocates the original (full) size rather than the per-rank tp shard size.

@hxdtest

hxdtest commented Oct 4, 2023

```python
config = {
    "train_batch_size": 32,
    "train_micro_batch_size_per_gpu": 2,
    "steps_per_print": 10,
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu"
        },
        "stage3_param_persistence_threshold": 0
    },
    "fp16": {
        "enabled": True,
        "loss_scale_window": 100
    },
    "hybrid_engine": {
        "enabled": True,
        "inference_tp_size": 2
    },
    "gradient_clipping": 1.0,
}

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import deepspeed
from deepspeed.accelerator import get_accelerator

deepspeed.runtime.utils.see_memory_usage('pre test', force=True)

model = AutoModelForCausalLM.from_pretrained(
    "/ossfs/workspace/opt-350m", trust_remote_code=True
).half().to(get_accelerator().device_name())

deepspeed.runtime.utils.see_memory_usage('post test', force=True)

m, _, _, _ = deepspeed.initialize(model=model, config=config)
m.eval()

tokenizer = AutoTokenizer.from_pretrained("/ossfs/workspace/opt-350m", use_fast=False)

prompt = "Hello, I'm am conscious and"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

generated_ids = m.generate(input_ids)
```
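(Note: with `inference_tp_size: 2`, the script needs to run on at least two ranks for the tensor-parallel path to be exercised, e.g. launched via `deepspeed --num_gpus 2 repro.py`; the launch command and script name are assumed, not shown in the original comment.)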

I'm hitting the same issue, too. I disabled the `if self.mpu is not None:` check.

Log output:

```
/opt/conda/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/ds_attention.py:117
in _merge_qkv

  114     def _merge_qkv(self):
  115
  116         qvkw = DeepSpeedSelfAttention._qkv_buffers[0]
> 117         qvkw[:self.hidden_size_per_partition, :] = self.attn_qw  # type: ignore
  118         qvkw[self.hidden_size_per_partition:2 * self.hidden_size_per_partition, :] = self.attn_kw
  119         qvkw[2 * self.hidden_size_per_partition:, :] = self.attn_vw  # type: ignore
  120         if self.attn_qb is not None:

RuntimeError: The expanded size of the tensor (1024) must match the existing size (0) at
non-singleton dimension 1. Target sizes: [512, 1024]. Tensor sizes: [0]
```
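The numbers in the error line up with opt-350m's hidden size of 1024 and `inference_tp_size: 2`: `_merge_qkv` expects a per-rank shard of shape `[hidden_size_per_partition, hidden_size] = [512, 1024]`, while `attn_qw` is an empty tensor because the ZeRO-3 partitioned parameter was never gathered and sharded. A minimal standalone reproduction of just the shape mismatch (sizes assumed from the traceback):

```python
import torch

# hidden_size = 1024 and inference_tp_size = 2 give hidden_size_per_partition = 512.
qkvw = torch.empty(3 * 512, 1024)  # pre-allocated fused QKV buffer for one TP rank
attn_qw = torch.empty(0)           # stand-in for a ZeRO-3 parameter whose data was never materialized
qkvw[:512, :] = attn_qw
# RuntimeError: The expanded size of the tensor (1024) must match the existing
# size (0) at non-singleton dimension 1. Target sizes: [512, 1024]. Tensor sizes: [0]
```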

@LSC527

LSC527 commented Nov 1, 2023

@hxdtest The size-[0] tensor appears to be caused by the tp weights not being set properly.
Besides disabling `if self.mpu is not None:`, I made the following change in split_qkv.py, and tp now works for me:

```python
    def attention_qkv_mp(self, mp_replace, reversed_dim=False):
        # Only need to alter
        if self.module.attention.attn_qkvw is None:
            # params = [
            #     (self.module.attention.attn_qw, self.qw),
            #     (self.module.attention.attn_qb, self.qb),
            #     (self.module.attention.attn_kw, self.kw),
            #     (self.module.attention.attn_kb, self.kb),
            #     (self.module.attention.attn_vw, self.vw),
            #     (self.module.attention.attn_vb, self.vb),
            # ]
            # for dst, src in params:
            #     dst = mp_replace.copy(
            #         dst[:self.qw.shape[0] // mp_replace.mp_size], src, int8=reversed_dim,
            #         allocate_tensor=reversed_dim) if src is not None else None
            shard = self.qw.shape[0] // mp_replace.mp_size  # rows per tensor-parallel rank
            attn = self.module.attention
            attn.attn_qw = mp_replace.copy(attn.attn_qw[:shard], self.qw, int8=reversed_dim,
                                           allocate_tensor=reversed_dim) if self.qw is not None else None
            attn.attn_qb = mp_replace.copy(attn.attn_qb[:shard], self.qb, int8=reversed_dim,
                                           allocate_tensor=reversed_dim) if self.qb is not None else None
            attn.attn_kw = mp_replace.copy(attn.attn_kw[:shard], self.kw, int8=reversed_dim,
                                           allocate_tensor=reversed_dim) if self.kw is not None else None
            attn.attn_kb = mp_replace.copy(attn.attn_kb[:shard], self.kb, int8=reversed_dim,
                                           allocate_tensor=reversed_dim) if self.kb is not None else None
            attn.attn_vw = mp_replace.copy(attn.attn_vw[:shard], self.vw, int8=reversed_dim,
                                           allocate_tensor=reversed_dim) if self.vw is not None else None
            attn.attn_vb = mp_replace.copy(attn.attn_vb[:shard], self.vb, int8=reversed_dim,
                                           allocate_tensor=reversed_dim) if self.vb is not None else None
        else:
            super().attention_qkv_mp(mp_replace)
```

With the original code, the attention weights do not persist after apply_tensor_parallelism() once the GatheredParameters context exits.
However, I am not sure whether the tp weights produced this way are actually correct.
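Beyond the GatheredParameters lifetime issue, one reason the commented-out loop form cannot work is ordinary Python name rebinding (an illustrative aside, not from the original comment):

```python
# Rebinding a loop variable does not write back to the attribute it was read from.
class Attention:
    def __init__(self):
        self.attn_qw = "old"

attn = Attention()
for dst, src in [(attn.attn_qw, "new")]:
    dst = src              # rebinds the local name 'dst' only

print(attn.attn_qw)        # still "old" -- the loop had no effect

attn.attn_qw = "new"       # assigning through the attribute, as the patch does, persists
print(attn.attn_qw)        # "new"
```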
