
Fix multi-node H100 CI: CUDA compat, deploy improvements #781

Merged
Binyang2014 merged 30 commits into main from binyli/multinode-ci
Apr 14, 2026

Conversation

@Binyang2014
Contributor

Summary

  • Multi-node H100 CI setup: Improve architecture detection and GPU configuration
  • Remove hardcoded VMSS hostnames from deploy files
  • Fix CUDA compat library issue: Remove stale compat paths from Docker image for CUDA 12+. Instead, peer_access_test now returns a distinct exit code (2) for CUDA init failure, and setup.sh conditionally adds compat libs only when needed. This fixes cudaErrorSystemNotReady (error 803) when the host driver is newer than the container's compat libs.
  • Speed up deploy: Replace recursive parallel-scp with tar+scp+untar to avoid per-file SSH overhead.
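The conditional compat-retry flow described above can be sketched as follows. This is a hedged illustration, not the repository's actual setup.sh: the compat directory path and function name are assumptions; the only detail taken from the PR is that peer_access_test exits with code 2 specifically on CUDA init failure, and that compat libraries are added only in that case.

```shell
# Hedged sketch of the conditional compat retry (paths illustrative).
# peer_access_test exits 2 only when CUDA initialization itself fails;
# any other failure is reported as-is, without touching the library path.
run_peer_access_test() {
  bin=$1
  rc=0
  "$bin" || rc=$?
  if [ "$rc" -eq 2 ] && [ -d /usr/local/cuda/compat ]; then
    echo "CUDA init failed (exit 2); retrying with compat libraries" >&2
    rc=0
    LD_LIBRARY_PATH="/usr/local/cuda/compat${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" \
      "$bin" || rc=$?
  fi
  return "$rc"
}
```

This keeps the compat libraries out of the default library path, so a host driver newer than the container's compat libs no longer trips cudaErrorSystemNotReady (803) on the common path.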

Binyang2014 and others added 9 commits April 8, 2026 03:30
Add vmssName pipeline parameter and generate config, hostfile, and
hostfile_mpi dynamically. Update run_tests.sh to derive the head host
from hostfile_mpi instead of hardcoding it. Delete the static deploy
files that previously hardcoded mscclpp-h100-multinode-ci hostnames.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ction

The multi-nodes-test pipeline was failing on H100 GPUs with CUDA error 803
(cudaErrorSystemNotReady) because it still included the cuda11.8 Docker
image in its matrix. All other H100 CI jobs (ut, integration-test, nccl-api-test)
already use only cuda12.9. This aligns the multi-node config accordingly.

Also adds gpuArch: '90' to the deploy template call for consistent H100 builds,
and improves the peer-access-test Makefile to detect GPU compute capability via
nvidia-smi instead of relying solely on -arch=native, which silently falls back
to an old default architecture inside Docker containers.
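The detection idea can be sketched like this. It is an illustrative shape, not the Makefile change itself: the function name is hypothetical, and it assumes a driver new enough to support nvidia-smi's `compute_cap` query field.

```shell
# Hedged sketch of compute-capability detection for nvcc. -arch=native can
# silently fall back to a default architecture inside containers, so query
# nvidia-smi explicitly and use native only as a last resort.
detect_nvcc_arch() {
  cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n1)
  if [ -n "$cap" ]; then
    echo "sm_$(echo "$cap" | tr -d '.')"   # e.g. "9.0" -> "sm_90" on H100
  else
    echo "native"                          # no nvidia-smi: fall back
  fi
}
NVCC_ARCH_FLAG="-arch=$(detect_nvcc_arch)"
```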

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The host driver on the multi-node H100 VMs is CUDA 13.0 (driver 580.126.16),
so the container image must match.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace recursive parallel-scp with tar+scp+untar to avoid
per-file SSH overhead.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Binyang2014
Contributor Author

/azp run mscclpp-ut

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Binyang2014 and others added 7 commits April 10, 2026 00:31
Tar contents directly (-C ${ROOT_DIR} .) instead of the parent
directory, and extract into ${DST_DIR} explicitly. The previous
approach used dirname/basename which produced wrong directory names
(e.g., 's' from '/__w/1/s/') causing 'No such file or directory'
in the container.
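The corrected packing can be demonstrated locally. This is a minimal sketch, not deploy.sh itself; in the real pipeline the tarball travels via scp to each node, and here two temp directories stand in for the source tree and the remote destination:

```shell
# Archive the *contents* of ROOT_DIR (tar -C "$ROOT_DIR" .) and extract
# into DST_DIR explicitly, instead of tarring the parent with
# dirname/basename, which reproduced names like 's' from '/__w/1/s/'.
ROOT_DIR=$(mktemp -d)
DST_DIR=$(mktemp -d)
mkdir -p "$ROOT_DIR/test/deploy"
echo payload > "$ROOT_DIR/test/deploy/file.txt"
tar -czf /tmp/deploy_src.tar.gz -C "$ROOT_DIR" .
# (scp /tmp/deploy_src.tar.gz to the remote node would happen here)
tar -xzf /tmp/deploy_src.tar.gz -C "$DST_DIR"
cat "$DST_DIR/test/deploy/file.txt"
```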

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When a RegisteredMemory has both CudaIpc and IB transports, the
import path was trying CudaIpc (PosixFd) even for cross-node memory.
PosixFd uses unix domain sockets which are node-local, causing
'No such file or directory' crashes.

For cross-node memory:
- If Fabric is available, try it (works with IMEX daemon)
- If Fabric fails and IB is available, fall back to IB
- If neither works, throw a clear error

Same-host behavior is unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Binyang2014
Contributor Author

/azp run mscclpp-ut

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

The static config file was removed. Generate SSH config at runtime
from the dynamically created hostfile_mpi. For single-node tests
where hostfile_mpi doesn't exist, skip config generation.
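A hedged sketch of that generation step, assuming hostfile_mpi lines of the form `hostname slots=N`; the SSH options emitted and the function name are illustrative, not copied from setup.sh:

```shell
# Generate an SSH config from the dynamically created hostfile_mpi.
# When the file is absent (single-node run), skip generation entirely.
gen_ssh_config() {
  hostfile=$1
  config=$2
  if [ ! -f "$hostfile" ]; then
    echo "no $hostfile; skipping SSH config generation" >&2
    return 0
  fi
  : > "$config"
  while read -r host _; do
    [ -n "$host" ] || continue
    printf 'Host %s\n  StrictHostKeyChecking no\n  UserKnownHostsFile /dev/null\n' \
      "$host" >> "$config"
  done < "$hostfile"
}
```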

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Binyang2014
Contributor Author

/azp run mscclpp-ut

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Binyang2014 and others added 6 commits April 10, 2026 19:02
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Resolve HEAD_HOST to its eth0 IP address to ensure TcpBootstrap
connects on the correct interface, fixing timeout in
ResumeWithIpPortPair test.
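One way to do this resolution, sketched under the assumption that `getent` is available on the nodes (the function name is hypothetical; the real run_tests.sh may resolve the address differently):

```shell
# Resolve HEAD_HOST to a concrete IPv4 address so TcpBootstrap uses the
# intended interface. getent covers both /etc/hosts entries and DNS.
resolve_head_ip() {
  getent ahostsv4 "$1" | awk '{ print $1; exit }'
}
HEAD_IP=$(resolve_head_ip "${HEAD_HOST:-localhost}")
```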

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add continueOnError parameter to run-remote-task template and set it
for the perf test step. The step will show as failed but subsequent
steps (unit tests, python tests, benchmark) will still run.
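The template change has roughly this shape; the parameter and step names below are illustrative, not copied from run-remote-task.yml:

```yaml
# Hypothetical sketch of a continueOnError passthrough in an Azure
# Pipelines step template.
parameters:
  - name: continueOnError
    type: boolean
    default: false

steps:
  - script: ./run_remote_task.sh
    displayName: Run remote task
    continueOnError: ${{ parameters.continueOnError }}
```

The perf-test caller then passes `continueOnError: true`, so a baseline regression marks the step failed without blocking the unit tests, python tests, and benchmark that follow.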

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Check isSameHost first (the common/simpler path) before handling
the cross-node Fabric fallback logic, improving readability.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Binyang2014
Contributor Author

/azp run mscclpp-ut

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).


Copilot AI left a comment

Pull request overview

This PR updates the multi-node H100 CI/deploy flow to be less environment-specific and more robust across CUDA driver/toolkit mismatches, while also speeding up deployment.

Changes:

  • Add a distinct exit code for CUDA init failure in peer_access_test, and retry with CUDA compat libs only when needed during remote setup.
  • Remove hardcoded multi-node hostnames from tracked deploy files; generate deploy hostfiles/config dynamically in the pipeline and improve runtime GPU/baseline selection.
  • Speed up remote deploy by switching from recursive parallel-scp to tar+scp+untar, and tighten cross-node CUDA IPC behavior to avoid non-functional handle types.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.

Show a summary per file
| File | Description |
| --- | --- |
| tools/peer-access-test/peer_access_test.cu | Adds exit code (2) for CUDA init failure to enable conditional compat retry. |
| test/deploy/setup.sh | Generates SSH config dynamically and retries peer-access test with compat libs on init failure. |
| test/deploy/run_tests.sh | Uses build/bin paths, resolves head node IP, selects perf baseline by GPU type, centralizes mpirun env/args. |
| test/deploy/perf_ndmv5.jsonl | Adds/extends H100 (NDmv5) perf baseline entries. |
| test/deploy/hostfile_mpi | Removes hardcoded hostnames from repo (now generated in pipeline). |
| test/deploy/hostfile | Removes hardcoded hostnames from repo (now generated in pipeline). |
| test/deploy/deploy.sh | Deploys source via tarball to reduce per-file SSH overhead. |
| test/deploy/config | Removes hardcoded SSH config from repo (now generated in pipeline/setup). |
| src/core/registered_memory.cc | Restricts cross-node CUDA IPC to Fabric handles and allows IB fallback behavior. |
| docker/build.sh | Removes CUDA compat LD_LIBRARY_PATH injection from image build. |
| .azure-pipelines/templates/run-remote-task.yml | Adds continueOnError parameter passthrough for remote tasks. |
| .azure-pipelines/multi-nodes-test.yml | Updates H100 multi-node CI settings, generates deploy files at runtime, adjusts pool/subscription/resource group. |

Binyang2014 and others added 3 commits April 10, 2026 22:10
- Gate CUDA compat-lib retry on PLATFORM==cuda to avoid misleading
  errors on HIP
- Fix hostfile/hostfile_mpi leading whitespace from YAML indentation
  by using printf instead of echo
- Fix /etc/hosts duplicate check by iterating hostEntries per line
  instead of matching the entire multi-line string
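Both generation fixes can be sketched together. This is an illustration with stand-in hostnames and file paths, not the pipeline's actual YAML-embedded script:

```shell
# printf avoids the leading whitespace that echo inherits from YAML
# indentation: one host per call, no indentation carried over.
hostfile=/tmp/hostfile
: > "$hostfile"
for h in node-a node-b; do            # illustrative hostnames
  printf '%s\n' "$h" >> "$hostfile"
done

# /etc/hosts deduplication entry by entry: append only when this exact
# line is not already present, instead of matching the whole multi-line
# string at once.
add_hosts_entry() {
  grep -qxF "$1" "$2" || printf '%s\n' "$1" >> "$2"
}
hosts_file=/tmp/etc_hosts
: > "$hosts_file"
add_hosts_entry '10.0.0.4 node-a' "$hosts_file"
add_hosts_entry '10.0.0.4 node-a' "$hosts_file"   # duplicate: ignored
```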

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add gdrdrv kernel module installation for CUDA VMs before Docker
container launch. Skips if the module is already loaded. Applies
to both single-node and multi-node CI pipelines.
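The load-if-missing gate might look like this. A hedged sketch only: the function name is hypothetical, and the real CI installs the module (e.g. via the gdrcopy DKMS package) before this check runs:

```shell
# Skip when gdrdrv is already loaded; otherwise try to load it and report.
ensure_gdrdrv() {
  if lsmod 2>/dev/null | grep -q '^gdrdrv'; then
    echo "gdrdrv already loaded; skipping"
  elif modprobe gdrdrv 2>/dev/null; then
    echo "gdrdrv loaded"
  else
    echo "gdrdrv not available on this host"
  fi
}
ensure_gdrdrv
```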

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Binyang2014 and others added 2 commits April 13, 2026 18:38
…back logic

- Extract duplicated create/map/get into importCudaIpc lambda
- Add comment explaining MNNVL failure as the caught error
- Document CudaIpc | IB fallback use case in comments

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Binyang2014
Contributor Author

/azp run mscclpp-ut

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Binyang2014 and others added 2 commits April 13, 2026 23:00
…host failures

- Remove hasFabric pre-check; let GpuIpcMem::create try all handle types
- Remove isSameHost branching for import; always try with IB fallback
- Catch BaseError to cover both Error and CudaError/CuError
- WARN on same-host CudaIpc failure (unexpected), INFO on cross-host (expected)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Binyang2014 Binyang2014 merged commit ecd3372 into main Apr 14, 2026
14 checks passed
@Binyang2014 Binyang2014 deleted the binyli/multinode-ci branch April 14, 2026 04:51