Unable to generate GPU traces for MSCCL #18
Comments
When comparing
Hi,
The latest MSCCL enables primitive-level events by default with just the NPKIT=1 option.
I used the scripts in msccl_samples but didn't get any traces or even dump files. After adding some print statements to MSCCL, I found that ENABLE_NPKIT isn't defined. I also only see NPKIT_FLAGS referenced in the makefiles, with no reference to NPKIT. Do I need different makefiles?
MSCCL already has NPKit enabled (https://github.com/microsoft/msccl#npkit-integration), so there is no need to download the NPKit repo. All the steps I took above were unnecessary, and following the steps in the given link produced traces for everything.
I have 8 machines, each with a single GPU. When following the build instructions for NCCL I get traces for both CPU and GPU events, but after following the steps for MSCCL I only get traces for CPU events. Below is each step taken to try and get GPU traces with MSCCL.
A potentially useful note: while trying different settings with NCCL_ALGO=RING, I noticed that NCCL_PROTO=LL (built with -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_EXIT) doesn't produce GPU traces, but NCCL_PROTO=LL128 (built with -DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_EXIT) does. I also believe there's a typo on line 77 of npkit_post_process.py: curr_cpu_base_time needs to be replaced with curr_gpu_base_time for the script to parse the dumps.
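To keep the protocol and the matching event flags in sync between runs, the pairing described above can be written down as a small lookup. This is only a sketch: the helper name `npkit_flags` and the table `PRIM_EVENT_FLAGS` are hypothetical, and the table contains only the two flag pairs quoted in this issue, not a full list of NPKit events.

```python
# Event-flag pairs quoted in this issue, keyed by the NCCL_PROTO value
# they were tested with. Other protocols are deliberately left out.
PRIM_EVENT_FLAGS = {
    "LL": (
        "-DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_ENTRY",
        "-DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_EXIT",
    ),
    "LL128": (
        "-DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_ENTRY",
        "-DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_EXIT",
    ),
}


def npkit_flags(proto: str) -> str:
    """Return the space-separated compile flags recorded for `proto`."""
    try:
        flags = PRIM_EVENT_FLAGS[proto]
    except KeyError:
        # Fail loudly rather than guess flag names for untested protocols.
        raise ValueError(f"no flags recorded for NCCL_PROTO={proto!r}") from None
    return " ".join(flags)


if __name__ == "__main__":
    for proto in PRIM_EVENT_FLAGS:
        print(f"NCCL_PROTO={proto}: {npkit_flags(proto)}")
```

The returned string is what would be handed to the build for that protocol; how it is passed (for example, through a make variable) depends on the makefiles in use.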
The current MSCCL example npkit_runner.sh uses NPKIT=1 as the build flag, which does not seem to enable any traces at all. The MSCCL example previously used -DENABLE_NPKIT_EVENT_MSCCL_REDUCE_ENTRY -DENABLE_NPKIT_EVENT_MSCCL_REDUCE_EXIT, which also didn't produce traces.
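A quick way to tell "no dump files at all" apart from "CPU dumps only" is to count the files in the dump directory by name prefix. The sketch below assumes the dump files start with `cpu_events` and `gpu_events`; those prefixes and the helper name `summarize_dump_dir` are assumptions and should be adjusted to whatever the build actually writes.

```python
from pathlib import Path


def summarize_dump_dir(dump_dir, prefixes=("cpu_events", "gpu_events")):
    """Count regular files in `dump_dir` whose names start with each prefix.

    The default prefixes are an assumption about the dump file naming.
    """
    counts = {prefix: 0 for prefix in prefixes}
    for path in Path(dump_dir).iterdir():
        if not path.is_file():
            continue
        for prefix in prefixes:
            if path.name.startswith(prefix):
                counts[prefix] += 1
                break  # count each file once
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python summarize_dumps.py <dump directory>
    for prefix, count in summarize_dump_dir(sys.argv[1]).items():
        print(f"{prefix}: {count} file(s)")
```

A zero count for the GPU prefix alongside a non-zero CPU count would match the symptom described here, while zeros for both would point at NPKit not being compiled in at all.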
What are the correct build flags to generate GPU traces with MSCCL?