Unable to generate GPU traces for MSCCL #18

Closed
JasonFantl opened this issue Dec 19, 2022 · 5 comments

@JasonFantl

I have 8 machines, each with a single GPU. When I follow the build instructions for NCCL, I get traces for both CPU and GPU events, but after following the steps for MSCCL I only get traces for CPU events. Below are the steps I took to try to get GPU traces with MSCCL.

git clone https://github.com/microsoft/NPKit.git
cd NPKit
git clone https://github.com/microsoft/msccl msccl-master-e52c525
cd msccl-master-e52c525
git checkout e52c525
find ../npkit_for_msccl_master_e52c525/ | grep '.diff$' | awk '{print "git apply "$1}' | bash
make -j src.build NVCC_GENCODE="-arch=sm_80" NPKIT_FLAGS="-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_EXIT -DENABLE_NPKIT_EVENT_NET_SEND_ENTRY -DENABLE_NPKIT_EVENT_NET_SEND_EXIT -DENABLE_NPKIT_EVENT_NET_RECV_ENTRY -DENABLE_NPKIT_EVENT_NET_RECV_EXIT"

cd ..
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests/
make MPI=1 MPI_HOME=/usr/local/openmpi/ NCCL_HOME=/home/jasonfantl/NPKit/MSCCL/NPKit/msccl-master-e52c525/build -j

cd ..
mkdir dump_files
mkdir trace_files

# root directory copied to all machines

mpirun -H navi,quiritis,saria,oshun,midna,parvati,rhiannon,tara rm -f /home/jasonfantl/NPKit/MSCCL/NPKit/dump_files/* && \
mpirun \
    --tag-output \
    -H navi,quiritis,saria,oshun,midna,parvati,rhiannon,tara \
    -x PATH \
    -x LD_PRELOAD=/home/jasonfantl/NPKit/MSCCL/NPKit/msccl-master-e52c525/build/lib/libnccl.so.2 \
    -x LD_LIBRARY_PATH=/home/jasonfantl/NPKit/MSCCL/NPKit/msccl-master-e52c525/build:/usr/local/openmpi/lib:/usr/local/cuda/lib64:/usr/local/openmpi/lib:$LD_LIBRARY_PATH  \
    -x NCCL_P2P_DISABLE=1 \
    -x NCCL_SHM_DISABLE=1 \
    -x NCCL_SOCKET_IFNAME=wan0 \
    -x NCCL_NET=IB \
    -x NCCL_IB_GID_INDEX=3 \
    -x NCCL_IB_HCA=mlx5 \
    -x NCCL_NET_GDR_LEVEL=SYS  \
    -x NCCL_ALGO=MSCCL \
    -x NCCL_PROTO=LL \
    -x NPKIT_DUMP_DIR=/home/jasonfantl/NPKit/MSCCL/NPKit/dump_files \
    -x MSCCL_XML_FILES=/home/jasonfantl/NPKit/MSCCL/NPKit/msccl_samples/msccl_algo_sample.xml \
    /home/jasonfantl/NPKit/MSCCL/NPKit/nccl-tests/build/all_reduce_perf -b 1048576 -e 1048576 -f 2 -g 1 -c 1 -n 100 -w 100 -z 0

python /home/jasonfantl/NPKit/MSCCL/NPKit/msccl-master-e52c525/samples/npkit/npkit_post_process.py \
  --npkit_dump_dir=/home/jasonfantl/NPKit/MSCCL/NPKit/dump_files \
  --npkit_event_header_path=/home/jasonfantl/NPKit/MSCCL/NPKit/msccl-master-e52c525/src/include/npkit/npkit_event.h \
  --output_dir=/home/jasonfantl/NPKit/MSCCL/NPKit/trace_files

A potentially useful note: when trying different settings with NCCL_ALGO=RING, I noticed that NCCL_PROTO=LL (built with -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_EXIT) doesn't produce GPU traces, but NCCL_PROTO=LL128 (built with -DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_EXIT) does. I also believe there is a typo on line 77 of npkit_post_process.py: curr_cpu_base_time needs to be replaced with curr_gpu_base_time for the dump files to parse.
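
The one-line change described above could be applied roughly as follows. This is only a sketch: the helper name `patch_line` is hypothetical, and it assumes the identifier to change is the first occurrence on the reported line, so check the printed line before relying on the result.

```shell
#!/usr/bin/env bash
# Hypothetical helper: replace the first occurrence of OLD with NEW on a
# single line of FILE. The original file is kept next to it as FILE.bak.
patch_line() {
    local file="$1" line="$2" old="$3" new="$4"
    # Print the line before touching it, so a wrong line number is easy to spot.
    sed -n "${line}p" "$file"
    # Restrict the substitution to the given line number only.
    sed -i.bak "${line}s/${old}/${new}/" "$file"
}

# Example, run from the root of the MSCCL checkout used above:
# patch_line samples/npkit/npkit_post_process.py 77 curr_cpu_base_time curr_gpu_base_time
```

If the printed line does not contain `curr_cpu_base_time`, the script version differs from the one described here and the file is left effectively unchanged.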

The current MSCCL example npkit_runner.sh uses NPKIT=1 as the build flag, which does not seem to enable any traces at all. I saw that an earlier version of the MSCCL example used -DENABLE_NPKIT_EVENT_MSCCL_REDUCE_ENTRY -DENABLE_NPKIT_EVENT_MSCCL_REDUCE_EXIT, but building with those flags didn't produce traces either.

What are the correct build flags to generate GPU traces with MSCCL?
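
Before post-processing, it can help to check which kinds of dump files a run actually wrote. The sketch below groups the files in a dump directory by name pattern, replacing every run of digits with `N`. It makes no assumption about NPKit's actual file names, and the function name `summarize_dumps` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the files in a dump directory per name pattern.
# Defaults to $NPKIT_DUMP_DIR, or the current directory if that is unset.
summarize_dumps() {
    local dir="${1:-${NPKIT_DUMP_DIR:-.}}"
    # Take file names only, collapse digit runs (rank or channel indices)
    # into "N", then count how many files share each resulting pattern.
    find "$dir" -maxdepth 1 -type f -exec basename {} \; \
        | sed -E 's/[0-9]+/N/g' \
        | sort \
        | uniq -c
}

# Example:
# summarize_dumps /home/jasonfantl/NPKit/MSCCL/NPKit/dump_files
```

If no pattern for GPU events shows up at all, that may point to the build or run configuration instead of the post-processing script.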

@JasonFantl
Author

When comparing NPKit/npkit_for_msccl_master_e52c525/src/collectives/device to NPKit/npkit_for_nccl_v2.10.3-1/src/collectives/device, I find that some files are missing from the MSCCL version and that others contain fewer modifications, so not all traces are captured. Does NPKit support fewer tracing options for MSCCL, specifically for the LL and Simple protocols?

@yzygitzh
Member

Hi,
Please try following the instructions in the msccl_samples folder. It supports the latest MSCCL and fixes several bugs.

@yzygitzh
Member

The latest MSCCL enables primitive-level events by default with just the NPKIT=1 build option.

@JasonFantl
Author

I used the scripts in msccl_samples, but didn't get any traces or even dump files. After adding some print statements to MSCCL, I found that ENABLE_NPKIT isn't defined. I also only see NPKIT_FLAGS referenced in the makefiles, with no reference to NPKIT. Do I need different makefiles?
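
One way to check which build variables a checkout actually understands is to search its makefiles. The sketch below assumes the makefiles are named `Makefile` or end in `.mk`; the function name `find_make_var` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list makefile lines under ROOT that reference VAR.
# Returns non-zero if nothing references it.
find_make_var() {
    local var="$1" root="${2:-.}"
    # -w matches whole words only, so NPKIT does not also match NPKIT_FLAGS.
    # For the same reason it will not find a name glued to a prefix, such as
    # ENABLE_NPKIT inside -DENABLE_NPKIT; use it for make variable names only.
    grep -rnw --include='Makefile' --include='*.mk' -e "$var" "$root"
}

# Example:
# find_make_var NPKIT_FLAGS msccl-master-e52c525
# find_make_var NPKIT msccl-master-e52c525
```

If the second search returns nothing, that suggests the makefiles of that checkout ignore `NPKIT=1`, which would be consistent with `ENABLE_NPKIT` never being defined.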

@JasonFantl
Author

MSCCL already has NPKit integrated (https://github.com/microsoft/msccl#npkit-integration), so there is no need to download the NPKit repo. All the steps I took above were unnecessary; following the steps in the given link produced traces for everything.
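
For reference, a rough sketch of what the simplified build might look like. It assumes the `NPKIT=1` option mentioned by the maintainer above, plus the same make target and `NVCC_GENCODE` value as in the original steps. The clone directory name, the function names, and the `DRY_RUN` switch are illustrative only; the linked README section remains the authoritative source for the exact commands.

```shell
#!/usr/bin/env bash
# Print each command by default; set DRY_RUN=0 to actually execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Illustrative build: clone MSCCL directly (no separate NPKit repo, no patches)
# and pass NPKIT=1 in place of a hand-written NPKIT_FLAGS list.
build_msccl_with_npkit() {
    run git clone https://github.com/microsoft/msccl.git
    run make -C msccl -j src.build NVCC_GENCODE="-arch=sm_80" NPKIT=1
}

# Example: show the commands without running them.
# build_msccl_with_npkit
```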
