[NCCL Error] Enable distributed expert feature #27

Closed
BinHeRunning opened this issue Apr 8, 2021 · 6 comments
@BinHeRunning

Hi,

I installed fastmoe using USE_NCCL=1 python setup.py install.

When I set "expert_dp_comm" to "dp", training runs fine. But when I set "expert_dp_comm" to "none" (i.e., each worker serves several unique expert networks), the process fails with an NCCL error:

NCCL Error at /home/h/code_gpt/fmoe-package/cuda/moe_comm_kernel.cu:29 value 4

I'm looking forward to your help!

My environment:
pytorch 1.8
nccl 2.8.3
cuda 10.1

@laekov
Owner

laekov commented Apr 8, 2021

This may be because you did not initialize NCCL. Can you please provide a minimal script that reproduces the error? Thx.
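For context, here is a minimal sketch of the kind of initialization being suggested, assuming a torchrun-style launcher that sets LOCAL_RANK; the FMoETransformerMLP arguments follow my reading of the FastMoE README and are illustrative rather than taken from this thread:

    # Hypothetical minimal example; run with one process per GPU via a
    # torch.distributed launcher. Only the torch.distributed calls are
    # standard PyTorch; the FMoETransformerMLP arguments are assumptions.
    import os
    import torch
    import torch.distributed as dist
    from fmoe import FMoETransformerMLP

    local_rank = int(os.environ["LOCAL_RANK"])  # set by the launcher (assumed)
    torch.cuda.set_device(local_rank)

    # The NCCL process group must be initialized before any distributed-expert
    # communication takes place.
    dist.init_process_group(backend="nccl")

    # Each worker holds its own experts; tokens are exchanged across workers.
    moe_layer = FMoETransformerMLP(
        num_expert=4,                      # experts per worker (illustrative)
        d_model=768,
        d_hidden=1536,
        world_size=dist.get_world_size(),  # > 1 enables cross-worker experts
    ).cuda()

    x = torch.randn(8, 768, device="cuda")  # dummy token batch
    y = moe_layer(x)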

@BinHeRunning
Author

> This may be because you did not initialize NCCL. Can you please provide a minimal script that reproduces the error? Thx.

  1. download nccl package from https://github.com/NVIDIA/nccl/archive/refs/tags/v2.8.3-1.tar.gz

  2. build and install nccl as described in NCCL repository

    $ cd nccl
    $ make -j src.build

    $ sudo apt install build-essential devscripts debhelper fakeroot
    $ make pkg.debian.build
    $ ls build/pkg/deb/

    $ dpkg -i libnccl2_2.8.3-1+cuda10.1_amd64.deb
    $ dpkg -i libnccl-dev_2.8.3-1+cuda10.1_amd64.deb

  3. apt search nccl shows that nccl 2.8.3 is installed

    Sorting... Done
    Full Text Search... Done
    libhttpasyncclient-java/bionic 4.1.3-1 all
    HTTP/1.1 compliant asynchronous HTTP agent implementation

    libnccl-dev/now 2.8.3-1+cuda10.1 amd64 [installed,local]
    NVIDIA Collective Communication Library (NCCL) Development Files

    libnccl2/now 2.8.3-1+cuda10.1 amd64 [installed,local]
    NVIDIA Collective Communication Library (NCCL) Runtime

    libpuppetlabs-http-client-clojure/bionic 0.9.0-1 all
    Clojure wrapper around libhttpasyncclient-java

    libvncclient1/bionic-security,bionic-updates 0.9.11+dfsg-1ubuntu1.4 amd64
    API to write one's own VNC server - client library

    libvncclient1-dbg/bionic-security,bionic-updates 0.9.11+dfsg-1ubuntu1.4 amd64
    debugging symbols for libvncclient

    python-ncclient/bionic 0.5.3-4 all
    Python library for NETCONF clients (Python 2)

    python-ncclient-doc/bionic 0.5.3-4 all
    Documentation for python-ncclient (Python library for NETCONF clients)

    python3-ncclient/bionic 0.5.3-4 all
    Python library for NETCONF clients (Python 3)

  4. install fastmoe using USE_NCCL=1 python setup.py install
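As an additional sanity check (my own suggestion, not part of the original report), the CUDA and NCCL versions that PyTorch itself was built with can be queried from Python; note they may differ from the system-wide libnccl installed above:

    # Quick check of the versions PyTorch was built against.
    import torch

    print(torch.__version__)          # e.g. 1.8.0
    print(torch.version.cuda)         # e.g. 10.1
    print(torch.cuda.is_available())  # True if the GPU runtime is usable
    print(torch.cuda.nccl.version())  # NCCL version bundled with PyTorch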

@BinHeRunning
Author

> This may be because you did not initialize NCCL. Can you please provide a minimal script that reproduces the error? Thx.

"The repository is currently tested with PyTorch v1.8.0 and CUDA 10, with designed compatibility to older versions."
Can you provide a docker image with CUDA 10?

@Sengxian
Collaborator

Sengxian commented Apr 15, 2021

We built a docker image with PyTorch 1.8.0, CUDA 10.2, and NCCL 2.7.8, and we have verified that it can be used directly to install FastMoE with the distributed expert feature.

It can be found on Docker Hub: co1lin/fastmoe:pytorch1.8.0-cuda10.2-cudnn7-nccl2708

@BinHeRunning
Author

> We built a docker image with PyTorch 1.8.0, CUDA 10.2, and NCCL 2.7.8, and we have verified that it can be used directly to install FastMoE with the distributed expert feature.
>
> It can be found on Docker Hub: co1lin/fastmoe:pytorch1.8.0-cuda10.2-cudnn7-nccl2708

Thanks for the docker image.

I installed fastmoe using USE_NCCL=1, but when I run GPT-2 (L12-H768, intermediate size 1536, top-2) on an 8-GPU machine, the largest number of experts I can use is 32. However, the FastMoE paper reports 96 experts.

When I increase the number of experts to 48 (batch size per GPU: 1), CUDA OOM occurs.

It seems that the distributed expert feature was not activated. Do you have any suggestions?

@laekov
Owner

laekov commented Apr 26, 2021

The distributed expert feature is enabled by default in fmoefy. You may want to double-check where you call the function.
In our experiment, we use NVIDIA V100 32GB GPUs. 12 experts are placed on each GPU; in other words, our --num-expert is set to 12.
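For concreteness, the expert count in that setting works out as follows (illustrative arithmetic; the variable names are mine):

    # With distributed experts, each rank owns its own `--num-expert` experts,
    # so the model-wide count is per-GPU experts times the number of workers.
    num_expert_per_gpu = 12   # --num-expert used in the paper's configuration
    world_size = 8            # 8x NVIDIA V100 32GB
    total_experts = num_expert_per_gpu * world_size
    print(total_experts)      # 96, the figure reported in the FastMoE paper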
