Question about how to use DistributedGroupedDataParallel #116

Fragile-azalea · 2022-05-31T15:08:50Z

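Hello authors, I would like to use FastMoE to try to improve ViT results on CIFAR-10.
I have 2 GPUs and want 4 experts per FFN, with 2 experts on GPU0 and the other 2 on GPU1; everything else should use data parallelism.
However, the results got worse after I made this change. To simulate the 2-GPU setup, I then ran on 1 GPU with the batch size doubled, and the results improved.
Is my understanding of num_expert, world_size, and DistributedGroupedDataParallel correct? Should these two experiments produce similar results? Thank you.
Describe the bug
With GPU=2, Batch_Size=256, I obtain the ViT-CIFAR10 baseline.
With GPU=2, Batch_Size=256 and FMoE parameters num_expert=2, world_size=2, gate=GShardGate, the result is worse than the baseline (purple).
With GPU=1, Batch_Size=512 and FMoE parameters num_expert=4, world_size=1, gate=GShardGate, the result is better than the baseline (pink).
With all other parameters unchanged, should the two runs give similar results?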
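The comparison above can be checked with a little arithmetic. The sketch below is only an illustration: `MoEConfig` is a hypothetical helper, not part of FastMoE, and it assumes that Batch_Size is the per-GPU batch size and that num_expert counts the experts held by each worker, so the total expert count is num_expert × world_size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEConfig:
    """Hypothetical summary of one experiment (not a FastMoE class)."""
    gpus: int
    batch_per_gpu: int  # assumption: Batch_Size is given per GPU
    num_expert: int     # experts held by each worker
    world_size: int     # number of workers that hold experts

    @property
    def total_experts(self) -> int:
        return self.num_expert * self.world_size

    @property
    def global_batch(self) -> int:
        return self.gpus * self.batch_per_gpu


two_gpu = MoEConfig(gpus=2, batch_per_gpu=256, num_expert=2, world_size=2)
one_gpu = MoEConfig(gpus=1, batch_per_gpu=512, num_expert=4, world_size=1)

print(two_gpu.total_experts, two_gpu.global_batch)
print(one_gpu.total_experts, one_gpu.global_batch)
```

Under these assumptions both runs use the same total number of experts and the same global batch size, which is why similar results would be expected. Matching totals alone would not guarantee identical curves, since routing and random seeds can still differ.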

Relevant code

from torch import Tensor
from fmoe import FMoE
from fmoe.gates import GShardGate

class _ExpertFF(FMoE):
    def __init__(self,
                 num_expert=32,
                 d_model=1024,
                 world_size=1,
                 top_k=2,
                 gate=GShardGate,
                 expert=None):
        super().__init__(num_expert, d_model, world_size,
                         top_k=top_k, gate=gate, expert=expert)
        self.mark_parallel_comm()

    def forward(self, inp: Tensor):
        b, p, h = inp.shape
        inp = inp.view((-1, h))
        oup = super().forward(inp)
        oup = oup.view((b, p, -1))
        return oup

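def expert_fn(dim):
    # FeedForward, mlp_dim, and dropout are defined elsewhere in the ViT code (not shown)
    return FeedForward(dim, mlp_dim, dropout=dropout)

_ExpertFF(4, dim, 1, expert=expert_fn)  # when #GPU=1: num_expert=4, world_size=1
_ExpertFF(2, dim, 2, expert=expert_fn)  # when #GPU=2: num_expert=2, world_size=2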

Distributed initialization code

from fmoe.distributed import DistributedGroupedDataParallel as DDP
model = ViT(image_size=32,
            patch_size=4,
            num_classes=10,
            dim=512,
            depth=6,
            heads=8,
            mlp_dim=512,
            dropout=0.1,
            emb_dropout=0.1).to(rank)
model = DDP(model)  # no group is passed in

laekov · 2022-06-01T02:28:25Z

Your use of num_expert and world_size is fine. One possible problem: when using DistributedGroupedDataParallel, you need to call model.allreduce_params() manually before optimizer.step().

Fragile-azalea · 2022-06-01T02:53:35Z

Hello, I used the following code for the update:

# compute gradient and do SGD/ADAM step
optimizer.zero_grad()
loss.backward()
model.allreduce_params()
optimizer.step()

In addition, I trained ResNet-20 (with no experts at all) on CIFAR-10:
DistributedDataParallel, #GPU=1, Batch_size=128: Epoch: 10, Loss: 0.61720, Acc@1: 78.30
DistributedDataParallel, #GPU=2, Batch_size=64: Epoch: 10, Loss: 0.63077, Acc@1: 77.96
DistributedGroupedDataParallel, #GPU=1, Batch_size=128: Epoch: 10, Loss: 0.62823, Acc@1: 77.98
DistributedGroupedDataParallel (without allreduce_params()), #GPU=2, Batch_size=64: Epoch: 10, Loss: 0.75496, Acc@1: 73.40
DistributedGroupedDataParallel (with allreduce_params()), #GPU=2, Batch_size=64: Epoch: 10, Loss: 0.91942, Acc@1: 67.26

Is there anything else that needs special attention when replacing DistributedDataParallel with DistributedGroupedDataParallel?

My code is a modified version of https://github.com/naga-karthik/ddp-resnet-cifar/blob/master/mainCIFAR10.py

laekov · 2022-06-01T03:05:32Z

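I think your usage is fine. Without allreduce_params(), gradients are not synchronized, so the models on the two GPUs simply diverge. It is quite strange that the accuracy is even higher than with it. :(
Could you try NaiveGate instead of GShardGate, to rule out problems that GShardGate might introduce?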
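To illustrate the remark about unsynchronized gradients, here is a toy single-process simulation in plain Python. It is not FastMoE code: the two "ranks" are just list entries, the model is a single scalar weight, and the data shards are made up so that each rank would prefer a different weight.

```python
def grad(w, x, y):
    """Gradient of the squared error (w * x - y) ** 2 with respect to w."""
    return 2.0 * (w * x - y) * x


def train(shards, sync, lr=0.05, steps=20, w0=0.0):
    """Run SGD on one replica per shard; optionally average gradients."""
    weights = [w0 for _ in shards]
    for step in range(steps):
        grads = []
        for w, shard in zip(weights, shards):
            x, y = shard[step % len(shard)]
            grads.append(grad(w, x, y))
        if sync:
            # Stand-in for an all-reduce: every replica gets the mean gradient.
            mean = sum(grads) / len(grads)
            grads = [mean] * len(grads)
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights


# Made-up shards: rank 0 sees y = 2x, rank 1 sees y = 3x.
shards = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(1.0, 3.0), (2.0, 6.0)],
]

unsynced = train(shards, sync=False)
synced = train(shards, sync=True)
print("without sync:", unsynced)
print("with sync:   ", synced)
```

With averaging, both replicas apply identical updates and stay identical; without it, each replica drifts toward the solution of its own shard.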

Fragile-azalea · 2022-06-01T03:36:32Z

In the ResNet-20 experiment I did not use MoE; I just ran plain ResNet-20.
In the ViT experiment, I also have results with NaiveGate.

laekov · 2022-06-01T03:49:48Z

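Then we can basically conclude that the way DistributedGroupedDataParallel synchronizes parameters is incompatible with your model in some way. Could you check at fmoe/distributed.py:48 whether the parameter order is the same in the two processes?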
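One way to carry out such an order check is to reduce the parameter list to a short fingerprint on every rank and compare the printed values. The helper below is a hypothetical sketch, not part of FastMoE; it works on plain (name, shape) pairs so it can run anywhere. In a real run the pairs could be built from `model.named_parameters()`.

```python
import hashlib


def param_fingerprint(named_shapes):
    """Hash parameter names and shapes in iteration order.

    named_shapes: iterable of (name, shape) pairs, for example
    [(n, tuple(p.shape)) for n, p in model.named_parameters()].
    """
    h = hashlib.sha256()
    for name, shape in named_shapes:
        h.update(f"{name}:{tuple(shape)};".encode("utf-8"))
    return h.hexdigest()


# Made-up parameter lists standing in for two processes.
rank0 = [("fc1.weight", (512, 512)), ("fc1.bias", (512,))]
rank1_same = [("fc1.weight", (512, 512)), ("fc1.bias", (512,))]
rank1_swapped = [("fc1.bias", (512,)), ("fc1.weight", (512, 512))]

print(param_fingerprint(rank0) == param_fingerprint(rank1_same))
print(param_fingerprint(rank0) == param_fingerprint(rank1_swapped))
```

If the fingerprints printed by the two processes differ, the parameters are visited in a different order, and any position-based synchronization would pair up the wrong tensors.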

Fragile-azalea · 2022-06-01T04:34:10Z

I think I found the cause: after initialization, the initial values differ on each machine, because of

fastmoe/fmoe/distributed.py

Line 83 in 527e66a

if not p.requires_grad or p.grad is None:

At the very beginning no parameter has a grad, so the parameters are not synchronized across machines after initialization.

After modifying this line, the ResNet-20 result improves to Epoch: 10, Loss: 0.64957, Acc@1: 77.12, which is back to normal.
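The effect of the quoted condition can be reproduced with a toy model in plain Python. Everything here is hypothetical scaffolding: `Param` is a stand-in for a framework parameter, and the "broadcast" is a simple copy inside one process. The thread does not show how the line was actually changed, so dropping the `p.grad is None` test is only an assumed fix for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Param:
    """Hypothetical stand-in for a framework parameter (not a torch type)."""
    value: float
    requires_grad: bool = True
    grad: Optional[float] = None


def skipped_by_quoted_check(p: Param) -> bool:
    # Same condition as the line quoted from fmoe/distributed.py.
    return (not p.requires_grad) or (p.grad is None)


def skipped_without_grad_check(p: Param) -> bool:
    # Assumed alternative: ignore p.grad when syncing initial values.
    return not p.requires_grad


def broadcast_from_rank0(ranks, skip):
    """Copy rank 0's values to every other rank, honoring the skip rule."""
    source = ranks[0]
    for replica in ranks[1:]:
        for src, dst in zip(source, replica):
            if not skip(src):
                dst.value = src.value


def make_ranks():
    # Two replicas with different initial values and no gradients yet.
    return [[Param(0.1), Param(-0.4)], [Param(0.7), Param(0.2)]]


def values(ranks):
    return [[p.value for p in replica] for replica in ranks]


with_grad_check = make_ranks()
broadcast_from_rank0(with_grad_check, skipped_by_quoted_check)

without_grad_check = make_ranks()
broadcast_from_rank0(without_grad_check, skipped_without_grad_check)

print(values(with_grad_check))     # replicas still differ
print(values(without_grad_check))  # replicas now match rank 0
```

Right after construction every `grad` is `None`, so the quoted condition skips every parameter and the replicas keep their different initial values.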

Qianshaowei · 2024-02-27T09:23:37Z

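In the ResNet-20 experiment I did not use MoE; I just ran plain ResNet-20. In the ViT experiment, I also have results with NaiveGate.

Hello, may I ask how you saved the model and ran inference?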

Fragile-azalea closed this as completed Jun 1, 2022

laekov mentioned this issue Jun 1, 2022

Add a hint for DGDP synchronization #117

Merged
