[NCCL Error] Enable distributed expert feature #27

Closed
BinHeRunning opened this issue Apr 8, 2021 · 6 comments
@BinHeRunning

Hi,

I installed fastmoe using USE_NCCL=1 python setup.py install.

When I set "expert_dp_comm" to "dp", training runs fine. But when I set "expert_dp_comm" to "none" (i.e., each worker serves several unique expert networks), the process fails with an NCCL error:

NCCL Error at /home/h/code_gpt/fmoe-package/cuda/moe_comm_kernel.cu:29 value 4

I'm looking forward to your help!

My environment:
pytorch 1.8
nccl 2.8.3
cuda 10.1

@laekov
Owner

laekov commented Apr 8, 2021

This may be because you did not initialize NCCL. Can you please provide a minimal script that reproduces the error? Thx.
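For context, here is a minimal sketch of the kind of initialization being suggested, assuming a torchrun-style launcher that sets LOCAL_RANK; the FMoETransformerMLP arguments follow my reading of the FastMoE README and are illustrative rather than taken from this thread:

    # Hypothetical minimal example; run with one process per GPU via a
    # torch.distributed launcher. Only the torch.distributed calls are
    # standard PyTorch; the FMoETransformerMLP arguments are assumptions.
    import os
    import torch
    import torch.distributed as dist
    from fmoe import FMoETransformerMLP

    local_rank = int(os.environ["LOCAL_RANK"])  # set by the launcher (assumed)
    torch.cuda.set_device(local_rank)

    # The NCCL process group must be initialized before any distributed-expert
    # communication takes place.
    dist.init_process_group(backend="nccl")

    # Each worker holds its own experts; tokens are exchanged across workers.
    moe_layer = FMoETransformerMLP(
        num_expert=4,                      # experts per worker (illustrative)
        d_model=768,
        d_hidden=1536,
        world_size=dist.get_world_size(),  # > 1 enables cross-worker experts
    ).cuda()

    x = torch.randn(8, 768, device="cuda")  # dummy token batch
    y = moe_layer(x)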

@BinHeRunning
Author

> This may be because you did not initialize NCCL. Can you please provide a minimal script that reproduces the error? Thx.

  1. download nccl package from https://github.com/NVIDIA/nccl/archive/refs/tags/v2.8.3-1.tar.gz

  2. build and install nccl as described in NCCL repository

    $ cd nccl
    $ make -j src.build

    $ sudo apt install build-essential devscripts debhelper fakeroot
    $ make pkg.debian.build
    $ ls build/pkg/deb/

    $ dpkg -i libnccl2_2.8.3-1+cuda10.1_amd64.deb
    $ dpkg -i libnccl-dev_2.8.3-1+cuda10.1_amd64.deb

  3. apt search nccl shows that nccl 2.8.3 is installed

    Sorting... Done
    Full Text Search... Done
    libhttpasyncclient-java/bionic 4.1.3-1 all
    HTTP/1.1 compliant asynchronous HTTP agent implementation

    libnccl-dev/now 2.8.3-1+cuda10.1 amd64 [installed,local]
    NVIDIA Collective Communication Library (NCCL) Development Files

    libnccl2/now 2.8.3-1+cuda10.1 amd64 [installed,local]
    NVIDIA Collective Communication Library (NCCL) Runtime

    libpuppetlabs-http-client-clojure/bionic 0.9.0-1 all
    Clojure wrapper around libhttpasyncclient-java

    libvncclient1/bionic-security,bionic-updates 0.9.11+dfsg-1ubuntu1.4 amd64
    API to write one's own VNC server - client library

    libvncclient1-dbg/bionic-security,bionic-updates 0.9.11+dfsg-1ubuntu1.4 amd64
    debugging symbols for libvncclient

    python-ncclient/bionic 0.5.3-4 all
    Python library for NETCONF clients (Python 2)

    python-ncclient-doc/bionic 0.5.3-4 all
    Documentation for python-ncclient (Python library for NETCONF clients)

    python3-ncclient/bionic 0.5.3-4 all
    Python library for NETCONF clients (Python 3)

  4. install fastmoe using USE_NCCL=1 python setup.py install
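As an additional sanity check (my own suggestion, not part of the original report), the CUDA and NCCL versions that PyTorch itself was built with can be queried from Python; note they may differ from the system-wide libnccl installed above:

    # Quick check of the versions PyTorch was built against.
    import torch

    print(torch.__version__)          # e.g. 1.8.0
    print(torch.version.cuda)         # e.g. 10.1
    print(torch.cuda.is_available())  # True if the GPU runtime is usable
    print(torch.cuda.nccl.version())  # NCCL version bundled with PyTorch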

@BinHeRunning
Author

> This may be because you did not initialize NCCL. Can you please provide a minimal script that reproduces the error? Thx.

"The repository is currently tested with PyTorch v1.8.0 and CUDA 10, with designed compatibility to older versions."
Can you provide a docker image with CUDA 10?

@Sengxian
Collaborator

Sengxian commented Apr 15, 2021

We built a docker image with PyTorch 1.8.0, CUDA 10.2, and NCCL 2.7.8, and we have verified that it can be used directly to install FastMoE with the distributed expert feature.

It can be found on Docker Hub: co1lin/fastmoe:pytorch1.8.0-cuda10.2-cudnn7-nccl2708

@BinHeRunning
Author

> We built a docker image with PyTorch 1.8.0, CUDA 10.2, and NCCL 2.7.8, and we have verified that it can be used directly to install FastMoE with the distributed expert feature.
>
> It can be found on Docker Hub: co1lin/fastmoe:pytorch1.8.0-cuda10.2-cudnn7-nccl2708

Thanks for the docker image.

I installed fastmoe using USE_NCCL=1, but when I run GPT-2 (L12-H768, intermediate size 1536, top-2) on an 8-GPU machine, the largest number of experts I can use is 32. However, the FastMoE paper reports 96 experts.

When I increase the number of experts to 48 (batch size per GPU: 1), CUDA OOM occurs.

It seems that the distributed expert feature was not activated. Do you have any suggestions?

@laekov
Owner

laekov commented Apr 26, 2021

The distributed expert feature is enabled by default in fmoefy. You may want to double-check where you call the function.
In our experiment, we use NVIDIA V100 32GB GPUs. 12 experts are placed on each GPU; in other words, our --num-expert is set to 12.
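For concreteness, the expert count in that setting works out as follows (illustrative arithmetic; the variable names are mine):

    # With distributed experts, each rank owns its own `--num-expert` experts,
    # so the model-wide count is per-GPU experts times the number of workers.
    num_expert_per_gpu = 12   # --num-expert used in the paper's configuration
    world_size = 8            # 8x NVIDIA V100 32GB
    total_experts = num_expert_per_gpu * world_size
    print(total_experts)      # 96, the figure reported in the FastMoE paper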
