
Fix multi-node H100 CI: CUDA compat, deploy improvements #781

Merged
Binyang2014 merged 30 commits into main from binyli/multinode-ci
Apr 14, 2026

Conversation

@Binyang2014
Contributor

Summary

  • Multi-node H100 CI setup: Improve architecture detection and GPU configuration
  • Remove hardcoded VMSS hostnames from deploy files
  • Fix CUDA compat library issue: Remove stale compat paths from Docker image for CUDA 12+. Instead, peer_access_test now returns a distinct exit code (2) for CUDA init failure, and setup.sh conditionally adds compat libs only when needed. This fixes cudaErrorSystemNotReady (error 803) when the host driver is newer than the container's compat libs.
  • Speed up deploy: Replace recursive parallel-scp with tar+scp+untar to avoid per-file SSH overhead.
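The conditional compat-retry flow described above can be sketched as follows. This is a hedged illustration, not the repository's actual setup.sh: the compat directory path and function name are assumptions; the only detail taken from the PR is that peer_access_test exits with code 2 specifically on CUDA init failure, and that compat libraries are added only in that case.

```shell
# Hedged sketch of the conditional compat retry (paths illustrative).
# peer_access_test exits 2 only when CUDA initialization itself fails;
# any other failure is reported as-is, without touching the library path.
run_peer_access_test() {
  bin=$1
  rc=0
  "$bin" || rc=$?
  if [ "$rc" -eq 2 ] && [ -d /usr/local/cuda/compat ]; then
    echo "CUDA init failed (exit 2); retrying with compat libraries" >&2
    rc=0
    LD_LIBRARY_PATH="/usr/local/cuda/compat${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" \
      "$bin" || rc=$?
  fi
  return "$rc"
}
```

This keeps the compat libraries out of the default library path, so a host driver newer than the container's compat libs no longer trips cudaErrorSystemNotReady (803) on the common path.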

Binyang2014 and others added 9 commits April 8, 2026 03:30
Add vmssName pipeline parameter and generate config, hostfile, and
hostfile_mpi dynamically. Update run_tests.sh to derive the head host
from hostfile_mpi instead of hardcoding it. Delete the static deploy
files that previously hardcoded mscclpp-h100-multinode-ci hostnames.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ction

The multi-nodes-test pipeline was failing on H100 GPUs with CUDA error 803
(cudaErrorSystemNotReady) because it still included the cuda11.8 Docker
image in its matrix. All other H100 CI jobs (ut, integration-test, nccl-api-test)
already use only cuda12.9. This aligns the multi-node config accordingly.

Also adds gpuArch: '90' to the deploy template call for consistent H100 builds,
and improves the peer-access-test Makefile to detect GPU compute capability via
nvidia-smi instead of relying solely on -arch=native, which silently falls back
to an old default architecture inside Docker containers.
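The detection idea can be sketched like this. It is an illustrative shape, not the Makefile change itself: the function name is hypothetical, and it assumes a driver new enough to support nvidia-smi's `compute_cap` query field.

```shell
# Hedged sketch of compute-capability detection for nvcc. -arch=native can
# silently fall back to a default architecture inside containers, so query
# nvidia-smi explicitly and use native only as a last resort.
detect_nvcc_arch() {
  cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n1)
  if [ -n "$cap" ]; then
    echo "sm_$(echo "$cap" | tr -d '.')"   # e.g. "9.0" -> "sm_90" on H100
  else
    echo "native"                          # no nvidia-smi: fall back
  fi
}
NVCC_ARCH_FLAG="-arch=$(detect_nvcc_arch)"
```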

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The host driver on the multi-node H100 VMs is CUDA 13.0 (driver 580.126.16),
so the container image must match.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace recursive parallel-scp with tar+scp+untar to avoid
per-file SSH overhead.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Binyang2014
Contributor Author

/azp run mscclpp-ut

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Binyang2014 and others added 7 commits April 10, 2026 00:31
Tar contents directly (-C ${ROOT_DIR} .) instead of the parent
directory, and extract into ${DST_DIR} explicitly. The previous
approach used dirname/basename which produced wrong directory names
(e.g., 's' from '/__w/1/s/') causing 'No such file or directory'
in the container.
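The corrected packing can be demonstrated locally. This is a minimal sketch, not deploy.sh itself; in the real pipeline the tarball travels via scp to each node, and here two temp directories stand in for the source tree and the remote destination:

```shell
# Archive the *contents* of ROOT_DIR (tar -C "$ROOT_DIR" .) and extract
# into DST_DIR explicitly, instead of tarring the parent with
# dirname/basename, which reproduced names like 's' from '/__w/1/s/'.
ROOT_DIR=$(mktemp -d)
DST_DIR=$(mktemp -d)
mkdir -p "$ROOT_DIR/test/deploy"
echo payload > "$ROOT_DIR/test/deploy/file.txt"
tar -czf /tmp/deploy_src.tar.gz -C "$ROOT_DIR" .
# (scp /tmp/deploy_src.tar.gz to the remote node would happen here)
tar -xzf /tmp/deploy_src.tar.gz -C "$DST_DIR"
cat "$DST_DIR/test/deploy/file.txt"
```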

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When a RegisteredMemory has both CudaIpc and IB transports, the
import path was trying CudaIpc (PosixFd) even for cross-node memory.
PosixFd uses unix domain sockets which are node-local, causing
'No such file or directory' crashes.

For cross-node memory:
- If Fabric is available, try it (works with IMEX daemon)
- If Fabric fails and IB is available, fall back to IB
- If neither works, throw a clear error

Same-host behavior is unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Binyang2014
Contributor Author

/azp run mscclpp-ut

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

The static config file was removed. Generate SSH config at runtime
from the dynamically created hostfile_mpi. For single-node tests
where hostfile_mpi doesn't exist, skip config generation.
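A hedged sketch of that generation step, assuming hostfile_mpi lines of the form `hostname slots=N`; the SSH options emitted and the function name are illustrative, not copied from setup.sh:

```shell
# Generate an SSH config from the dynamically created hostfile_mpi.
# When the file is absent (single-node run), skip generation entirely.
gen_ssh_config() {
  hostfile=$1
  config=$2
  if [ ! -f "$hostfile" ]; then
    echo "no $hostfile; skipping SSH config generation" >&2
    return 0
  fi
  : > "$config"
  while read -r host _; do
    [ -n "$host" ] || continue
    printf 'Host %s\n  StrictHostKeyChecking no\n  UserKnownHostsFile /dev/null\n' \
      "$host" >> "$config"
  done < "$hostfile"
}
```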

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Binyang2014
Contributor Author

/azp run mscclpp-ut

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Binyang2014 and others added 6 commits April 10, 2026 19:02
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Resolve HEAD_HOST to its eth0 IP address to ensure TcpBootstrap
connects on the correct interface, fixing timeout in
ResumeWithIpPortPair test.
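One way to do this resolution, sketched under the assumption that `getent` is available on the nodes (the function name is hypothetical; the real run_tests.sh may resolve the address differently):

```shell
# Resolve HEAD_HOST to a concrete IPv4 address so TcpBootstrap uses the
# intended interface. getent covers both /etc/hosts entries and DNS.
resolve_head_ip() {
  getent ahostsv4 "$1" | awk '{ print $1; exit }'
}
HEAD_IP=$(resolve_head_ip "${HEAD_HOST:-localhost}")
```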

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add continueOnError parameter to run-remote-task template and set it
for the perf test step. The step will show as failed but subsequent
steps (unit tests, python tests, benchmark) will still run.
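The template change has roughly this shape; the parameter and step names below are illustrative, not copied from run-remote-task.yml:

```yaml
# Hypothetical sketch of a continueOnError passthrough in an Azure
# Pipelines step template.
parameters:
  - name: continueOnError
    type: boolean
    default: false

steps:
  - script: ./run_remote_task.sh
    displayName: Run remote task
    continueOnError: ${{ parameters.continueOnError }}
```

The perf-test caller then passes `continueOnError: true`, so a baseline regression marks the step failed without blocking the unit tests, python tests, and benchmark that follow.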

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Check isSameHost first (the common/simpler path) before handling
the cross-node Fabric fallback logic, improving readability.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Binyang2014
Contributor Author

/azp run mscclpp-ut

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).


Copilot AI left a comment

Pull request overview

This PR updates the multi-node H100 CI/deploy flow to be less environment-specific and more robust across CUDA driver/toolkit mismatches, while also speeding up deployment.

Changes:

  • Add a distinct exit code for CUDA init failure in peer_access_test, and retry with CUDA compat libs only when needed during remote setup.
  • Remove hardcoded multi-node hostnames from tracked deploy files; generate deploy hostfiles/config dynamically in the pipeline and improve runtime GPU/baseline selection.
  • Speed up remote deploy by switching from recursive parallel-scp to tar+scp+untar, and tighten cross-node CUDA IPC behavior to avoid non-functional handle types.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.

Show a summary per file
| File | Description |
| --- | --- |
| tools/peer-access-test/peer_access_test.cu | Adds exit code (2) for CUDA init failure to enable conditional compat retry. |
| test/deploy/setup.sh | Generates SSH config dynamically and retries peer-access test with compat libs on init failure. |
| test/deploy/run_tests.sh | Uses build/bin paths, resolves head node IP, selects perf baseline by GPU type, centralizes mpirun env/args. |
| test/deploy/perf_ndmv5.jsonl | Adds/extends H100 (NDmv5) perf baseline entries. |
| test/deploy/hostfile_mpi | Removes hardcoded hostnames from repo (now generated in pipeline). |
| test/deploy/hostfile | Removes hardcoded hostnames from repo (now generated in pipeline). |
| test/deploy/deploy.sh | Deploys source via tarball to reduce per-file SSH overhead. |
| test/deploy/config | Removes hardcoded SSH config from repo (now generated in pipeline/setup). |
| src/core/registered_memory.cc | Restricts cross-node CUDA IPC to Fabric handles and allows IB fallback behavior. |
| docker/build.sh | Removes CUDA compat LD_LIBRARY_PATH injection from image build. |
| .azure-pipelines/templates/run-remote-task.yml | Adds continueOnError parameter passthrough for remote tasks. |
| .azure-pipelines/multi-nodes-test.yml | Updates H100 multi-node CI settings, generates deploy files at runtime, adjusts pool/subscription/resource group. |

Binyang2014 and others added 3 commits April 10, 2026 22:10
- Gate CUDA compat-lib retry on PLATFORM==cuda to avoid misleading
  errors on HIP
- Fix hostfile/hostfile_mpi leading whitespace from YAML indentation
  by using printf instead of echo
- Fix /etc/hosts duplicate check by iterating hostEntries per line
  instead of matching the entire multi-line string
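Both generation fixes can be sketched together. This is an illustration with stand-in hostnames and file paths, not the pipeline's actual YAML-embedded script:

```shell
# printf avoids the leading whitespace that echo inherits from YAML
# indentation: one host per call, no indentation carried over.
hostfile=/tmp/hostfile
: > "$hostfile"
for h in node-a node-b; do            # illustrative hostnames
  printf '%s\n' "$h" >> "$hostfile"
done

# /etc/hosts deduplication entry by entry: append only when this exact
# line is not already present, instead of matching the whole multi-line
# string at once.
add_hosts_entry() {
  grep -qxF "$1" "$2" || printf '%s\n' "$1" >> "$2"
}
hosts_file=/tmp/etc_hosts
: > "$hosts_file"
add_hosts_entry '10.0.0.4 node-a' "$hosts_file"
add_hosts_entry '10.0.0.4 node-a' "$hosts_file"   # duplicate: ignored
```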

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add gdrdrv kernel module installation for CUDA VMs before Docker
container launch. Skips if the module is already loaded. Applies
to both single-node and multi-node CI pipelines.
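The load-if-missing gate might look like this. A hedged sketch only: the function name is hypothetical, and the real CI installs the module (e.g. via the gdrcopy DKMS package) before this check runs:

```shell
# Skip when gdrdrv is already loaded; otherwise try to load it and report.
ensure_gdrdrv() {
  if lsmod 2>/dev/null | grep -q '^gdrdrv'; then
    echo "gdrdrv already loaded; skipping"
  elif modprobe gdrdrv 2>/dev/null; then
    echo "gdrdrv loaded"
  else
    echo "gdrdrv not available on this host"
  fi
}
ensure_gdrdrv
```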

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Binyang2014 and others added 2 commits April 13, 2026 18:38
…back logic

- Extract duplicated create/map/get into importCudaIpc lambda
- Add comment explaining MNNVL failure as the caught error
- Document CudaIpc | IB fallback use case in comments

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Binyang2014
Contributor Author

/azp run mscclpp-ut

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Binyang2014 and others added 2 commits April 13, 2026 23:00
…host failures

- Remove hasFabric pre-check; let GpuIpcMem::create try all handle types
- Remove isSameHost branching for import; always try with IB fallback
- Catch BaseError to cover both Error and CudaError/CuError
- WARN on same-host CudaIpc failure (unexpected), INFO on cross-host (expected)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Binyang2014 Binyang2014 merged commit ecd3372 into main Apr 14, 2026
14 checks passed
@Binyang2014 Binyang2014 deleted the binyli/multinode-ci branch April 14, 2026 04:51