CI: enhance NVLS tests by ikryukov · Pull Request #1269 · openucx/ucc

ikryukov · 2026-02-27T13:28:26Z

What

Added correctness tests for NVLS collectives
Updated CUDA to 13.1.1 for NVLS path
Use official CUDA base image nvcr.io/nvidia/cuda:${CUDA_VER}-devel-ubuntu24.04 (more lightweight)
Use official HPC-X binaries instead of building from source
Added NVLink smoke testing on allocated nodes to distinguish node issues from UCC issues

ikryukov · 2026-02-27T13:28:49Z

/build

ikryukov · 2026-02-27T13:55:50Z

/build

greptile-apps · 2026-02-27T13:55:51Z

Greptile Summary

This PR enhances the NVLS CI pipeline by introducing dedicated correctness tests (MPI and perftest), a new fabric smoke-test script, an official CUDA base image, and official HPC-X binaries. These are well-motivated changes that reduce build complexity. Three issues were found:

Missing reduce_scatter test coverage: run_tests_ucc_nvls_all.sh states that reduce_scatter is "tested via MPI tests", but the corresponding test in run_tests_ucc_nvls_mpi.sh is also commented out. Neither script exercises the reduce_scatter NVLS path.
Potential premature Slurm allocation teardown: The "Run UCC NVLS perftest" step uses onfail to free the Slurm allocation on failure. If the CI framework proceeds to the next step after onfail (rather than aborting), the "Run UCC NVLS MPI tests" step will execute against an already-freed allocation, causing a guaranteed failure and a redundant stop_slurm_allocation call from its own always block.
No checksum on the HPC-X tarball: The wget download in Dockerfile.nvls lacks a SHA-256 digest check; consider adding one to guard against a tampered/substituted package.

Confidence Score: 3/5

Safe to merge with minor fixes; no functional regressions expected in the perftest path, but test coverage gaps and a pipeline ordering risk should be addressed.
The structural changes (Dockerfile, build configurability, fabric smoke test, refactored Slurm runner) are solid. The reduce_scatter coverage gap is a real omission given the explicit cross-file comment that claims it is covered. The onfail/always ordering in the pipeline is a latent failure mode that could silently fail MPI tests if perftest fails first.
Files needing closer attention: .ci/pipeline/test_nvls_matrix.yaml (onfail ordering) and .ci/scripts/run_tests_ucc_nvls_mpi.sh (the commented-out reduce_scatter test contradicts the comment in its sister script).

Important Files Changed

Filename	Overview
.ci/Dockerfile.nvls	New Dockerfile switching to the official CUDA base image and installing HPC-X from a tarball. The HPC-X download lacks checksum verification, which is a minor supply chain risk.
.ci/pipeline/test_nvls_matrix.yaml	Pipeline updated to split tests into perftest and MPI steps. The `onfail` on the perftest step can prematurely free the Slurm allocation before the MPI tests step executes, potentially causing a cascading failure.
.ci/scripts/build_ucc.sh	Makes the `--with-tls` configure flag configurable via `UCC_BUILD_TLS`; default value preserves the previous hardcoded list. Clean, safe change.
.ci/scripts/check_nvls_fabric.sh	New smoke-test script that validates NVLink fabric registration, state, and P2P topology before running UCC tests. Logic for counting completed GPUs and failure detection is correct.
.ci/scripts/run_nvls_slurm.sh	Renamed and refactored to accept the container script and tasks-per-node as arguments; also upgrades MPI from `pmi2` to `pmix` and improves SSH quoting. Well structured.
.ci/scripts/run_tests_ucc_nvls_all.sh	New perftest runner script. The comment that `reduce_scatter` is "tested via MPI tests instead" is inaccurate — it is also commented out in the MPI test script, leaving no `reduce_scatter` NVLS coverage.
.ci/scripts/run_tests_ucc_nvls_mpi.sh	New MPI correctness test runner for NVLS allreduce. The `reduce_scatter` test variant is commented out, inconsistent with the comment in `run_tests_ucc_nvls_all.sh`.

Comments Outside Diff (3)

.ci/scripts/run_tests_ucc_nvls_all.sh, line 22-25 (link)

reduce_scatter effectively untested

The comment here says reduce_scatter NVLS is "tested via MPI tests instead", but in run_tests_ucc_nvls_mpi.sh the reduce_scatter test is also commented out (lines 23–25). As a result, reduce_scatter NVLS coverage is completely absent from the CI pipeline.

Either enable the reduce_scatter test in run_tests_ucc_nvls_mpi.sh, or update this comment to reflect that it is currently disabled everywhere.
.ci/Dockerfile.nvls, line 34-40 (link)

No integrity verification for the HPC-X tarball

The HPC-X tarball is downloaded with wget -q over HTTPS but without any checksum verification. A compromised or substituted tarball would be silently accepted. Consider adding a SHA-256 digest check after the download, e.g.:
```
echo "<expected_sha256>  ${HPCX_FILENAME}.tbz" | sha256sum -c -
```
This is especially relevant for a Dockerfile that is baked and distributed as a shared CI image.
.ci/pipeline/test_nvls_matrix.yaml, line 72-92 (link)

Premature Slurm allocation teardown may break the MPI tests step

The "Run UCC NVLS perftest" step has an onfail block that calls stop_slurm_allocation.sh. If the CI framework continues to the next step after executing onfail (rather than aborting the pipeline), the "Run UCC NVLS MPI tests" step will run against a Slurm job that has already been freed, causing it to fail with a stale job ID.

The "Run UCC NVLS MPI tests" step then also calls stop_slurm_allocation.sh in its always block, resulting in a double-stop.

If the intent is "run MPI tests only when perftest succeeds", consider removing the onfail from the perftest step and relying solely on the always block of the MPI tests step for cleanup — both on success and failure paths.
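The suggested control flow can be modelled in plain shell. This is only a sketch of the ordering, not the CI framework's YAML semantics: `run_perftest`, `run_mpi_tests`, and `stop_allocation` are hypothetical stand-ins that do not exist in the UCC CI scripts, and the perftest failure is simulated.

```shell
#!/bin/bash
# Hypothetical stand-ins for the pipeline steps; names are illustrative only.
stop_calls=0
mpi_ran=0

stop_allocation() { stop_calls=$((stop_calls + 1)); }  # models stop_slurm_allocation.sh
run_perftest()    { return 1; }                        # simulate a perftest failure
run_mpi_tests()   { mpi_ran=1; }

# MPI tests run only when perftest succeeded.
if run_perftest; then
    run_mpi_tests
fi

# A single cleanup point covers both the success and the failure path,
# so the allocation is never freed twice or freed before a later step.
stop_allocation

echo "mpi_ran=${mpi_ran} stop_calls=${stop_calls}"
```

With one cleanup point, a failing perftest skips the MPI step and the allocation is released exactly once.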

Last reviewed commit: 9c23025

greptile-apps

8 files reviewed, 1 comment

.ci/scripts/check_nvls_fabric.sh

ikryukov · 2026-02-27T14:22:01Z

/build

ikryukov · 2026-02-27T14:33:42Z

/build

ikryukov · 2026-02-27T15:11:35Z

/build

greptile-apps · 2026-02-27T15:15:51Z

Additional Comments (3)

.ci/scripts/check_nvls_fabric_slurm.sh, line 30
Empty string case is unreachable since line 10 already exits if SLURM_HEAD_NODE is empty

    *)
        echo "Connecting to SLURM head node: ${SLURM_HEAD_NODE}"
        eval "${SSH_CMD} ${SLURM_HEAD_NODE} ${SLURM_COMMAND}"
        ;;

.ci/scripts/run_tests_ucc_nvls_all_slurm.sh, line 29
Empty string case is unreachable since line 10 already exits if SLURM_HEAD_NODE is empty

    *)
        echo "Connecting to SLURM head node: ${SLURM_HEAD_NODE}"
        eval "${SSH_CMD} ${SLURM_HEAD_NODE} ${SLURM_COMMAND}"
        ;;

.ci/scripts/run_tests_ucc_nvls_mpi_slurm.sh, line 30
Empty string case is unreachable since line 10 already exits if SLURM_HEAD_NODE is empty

    *)
        echo "Connecting to SLURM head node: ${SLURM_HEAD_NODE}"
        eval "${SSH_CMD} ${SLURM_HEAD_NODE} ${SLURM_COMMAND}"
        ;;

ikryukov · 2026-02-27T15:37:23Z

/build

greptile-apps · 2026-02-27T15:41:53Z

Additional Comments (3)

.ci/scripts/check_nvls_fabric_slurm.sh, line 29
SLURM_COMMAND contains DOCKER_IMAGE_NAME which includes a # character (see test_nvls_matrix.yaml:77). The default case should escape it like the dlcluster* case does.

        SLURM_COMMAND_ESCAPED="${SLURM_COMMAND//\#/\\#}"
        eval "${SSH_CMD} ${SLURM_HEAD_NODE} \"${SLURM_COMMAND_ESCAPED}\""

.ci/scripts/run_tests_ucc_nvls_all_slurm.sh, line 28
Same issue as check_nvls_fabric_slurm.sh - missing # escaping in default case.

        SLURM_COMMAND_ESCAPED="${SLURM_COMMAND//\#/\\#}"
        eval "${SSH_CMD} ${SLURM_HEAD_NODE} \"${SLURM_COMMAND_ESCAPED}\""

.ci/scripts/run_tests_ucc_nvls_mpi_slurm.sh, line 29
Missing # escaping in default case - needs same fix as other SLURM wrappers.

        SLURM_COMMAND_ESCAPED="${SLURM_COMMAND//\#/\\#}"
        eval "${SSH_CMD} ${SLURM_HEAD_NODE} \"${SLURM_COMMAND_ESCAPED}\""
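To make the suggested substitution concrete, the sketch below applies the same parameter expansion to a sample command string. The image name is a made-up placeholder, not the value used in test_nvls_matrix.yaml, and the example shows only what the expansion produces, not how SSH or the remote shell later interprets it.

```shell
#!/bin/bash
# Hypothetical command string; the registry and image path are placeholders.
SLURM_COMMAND='srun --container-image=registry.example.com#team/ucc:latest hostname'

# Replace every '#' with '\#' using bash pattern substitution.
# Pattern '\#' matches a literal '#'; replacement '\\#' yields backslash + '#'.
SLURM_COMMAND_ESCAPED="${SLURM_COMMAND//\#/\\#}"

printf '%s\n' "${SLURM_COMMAND_ESCAPED}"
```

The escaped string differs from the original only in the backslash inserted before each `#`.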

ikryukov · 2026-02-27T16:19:52Z

/build

ikryukov · 2026-02-27T16:41:09Z

/build

greptile-apps · 2026-02-27T16:44:25Z

Additional Comments (2)

.ci/scripts/run_tests_ucc_nvls_all_slurm.sh, line 9
Inconsistent MPI launcher: SRUN_PERFTEST uses --mpi=pmix while run_tests_ucc_nvls_mpi_slurm.sh (which runs the same run_tests_ucc_nvls_mpi.sh script) uses --mpi=pmi2. This could cause unexpected behavior differences between test runs.

readonly SRUN_BASE="srun --jobid=${SLURM_JOB_ID} --nodes=${SLURM_NODES} --mpi=pmi2 --ntasks-per-node=${NVLS_MPI_PPN:-4} --container-image=${DOCKER_IMAGE_NAME}"

.ci/scripts/run_tests_ucc_nvls_all.sh, line 25
This script is identical to run_tests_ucc_nvls.sh. Consider using a symlink or sourcing one from the other to avoid maintenance issues when changes need to be synchronized.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

janjust · 2026-02-27T16:51:03Z

Marked ready for review only to enable CI runs.

greptile-apps · 2026-02-27T16:55:40Z

Additional Comments (1)

.ci/scripts/run_tests_ucc_nvls_all_slurm.sh, line 7
Uses --mpi=pmix while other NVLS test scripts use --mpi=pmi2. Verify pmix is available in your environment and this inconsistency is intentional.

ikryukov · 2026-02-27T18:15:20Z

/build

ikryukov · 2026-03-02T15:36:57Z

/build

ikryukov · 2026-03-03T10:09:36Z

/build

ikryukov · 2026-03-03T15:21:31Z

/build

greptile-apps · 2026-03-03T15:26:00Z

Additional Comments (3)

.ci/scripts/run_tests_ucc_nvls_all.sh, line 24
The comment states that perftest srun uses ntasks-per-node=1, but SRUN_BASE in run_tests_ucc_nvls_all_slurm.sh (line 7) now uses --ntasks-per-node=${NVLS_MPI_PPN:-4} (defaults to 4). This satisfies the "multiple ranks per node" requirement for reduce_scatter NVLS, so the stated justification for disabling this test may no longer be accurate.

Either re-enable the reduce_scatter test if it now works with ntasks-per-node=4, or update the comment to reflect the current reason for keeping it disabled (e.g., if it's tested via MPI instead for a different reason).

# Disabled: reduce_scatter NVLS is tested via MPI tests instead.
#echo "INFO: NVLS perftest (reduce_scatter) ..."

.ci/scripts/run_tests_ucc_nvls_all_slurm.sh, line 32
The default SSH case (line 31) omits double-quotes around ${SLURM_COMMAND} unlike the dlcluster branch (line 27). The SLURM_COMMAND variable contains spaces and potentially special characters (e.g., docker registry paths with colons from DOCKER_IMAGE_NAME). Without quotes, the SSH client receives a series of arguments rather than a single remote command string, which can cause shell misparsing of the srun invocation.

            eval "${SSH_CMD} ${SLURM_HEAD_NODE} \"${SLURM_COMMAND}\""
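The effect of the missing quotes on argument splitting can be seen without SSH. In this minimal sketch, `count_args` is a hypothetical stand-in for the SSH client that only reports how many arguments it received, and the command string is a simplified placeholder.

```shell
#!/bin/bash
# Stand-in for ssh: prints the number of arguments it was given.
count_args() { echo "$#"; }

SLURM_COMMAND='srun --nodes=2 hostname'   # simplified placeholder command

# Unquoted: eval re-splits the command, so each word becomes its own argument.
unquoted=$(eval "count_args headnode ${SLURM_COMMAND}")

# Quoted: the whole command reaches the callee as one argument after the host.
quoted=$(eval "count_args headnode \"${SLURM_COMMAND}\"")

echo "unquoted=${unquoted} quoted=${quoted}"
```

Passing the command as a single argument also keeps the local shell from interpreting any special characters in it before it is sent to the remote side.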

.ci/Dockerfile.nvls, line 40
The HPC-X tarball is fetched from an external host without verifying a checksum. If the upstream server returns corrupted data or is ever compromised, the build will silently succeed with an unverified binary. This reduces build reproducibility and introduces a supply-chain risk.

Consider adding a sha256sum verification step after the wget:

wget -q "https://content.mellanox.com/hpc/hpc-x/${HPCX_VERSION}_${HPCX_CUDA}/${HPCX_FILENAME}.tbz" && \
    echo "EXPECTED_SHA256_HERE  ${HPCX_FILENAME}.tbz" | sha256sum -c - && \
    tar xf "${HPCX_FILENAME}.tbz" && \

You will need to obtain the correct SHA256 hash from the official HPC-X release documentation.
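A self-contained sketch of the verification step is shown below. It is an illustration under stated assumptions: a locally generated file stands in for the HPC-X tarball, `verify_sha256` is a hypothetical helper, and the "expected" digest is computed on the spot, whereas a real Dockerfile would pin a digest obtained from a trusted source.

```shell
#!/bin/bash
# Sketch only: the file and digests below are placeholders, not HPC-X data.

# verify_sha256 FILE DIGEST: succeeds only if FILE hashes to DIGEST.
verify_sha256() {
    # sha256sum -c reads "<digest>  <filename>" (two spaces) from stdin.
    echo "$2  $1" | sha256sum -c - >/dev/null 2>&1
}

workdir=$(mktemp -d)
tarball="${workdir}/package.tbz"           # hypothetical file name
printf 'placeholder payload\n' > "${tarball}"

expected=$(sha256sum "${tarball}" | awk '{print $1}')
wrong=$(printf '%064d' 0)                  # 64 zeros, deliberately incorrect

good_result=fail
verify_sha256 "${tarball}" "${expected}" && good_result=ok

bad_result=rejected
verify_sha256 "${tarball}" "${wrong}" && bad_result=accepted

rm -rf "${workdir}"
echo "matching digest: ${good_result}; wrong digest: ${bad_result}"
```

In a Dockerfile `RUN` chain, the non-zero exit status from a mismatch would stop the build before the archive is unpacked.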

greptile-apps · 2026-03-03T15:52:40Z

Additional Comments (1)

.ci/scripts/run_tests_ucc_nvls_all.sh, line 6
The fabric check runs once per MPI task, not once per node. Since this script serves as the srun entrypoint with --ntasks-per-node=4, the fabric check executes 4 times simultaneously per node, producing redundant output.

Since fabric validation is a per-node operation, consider running it only on local rank 0 to avoid duplication:

if [ "${SLURM_LOCALID:-0}" = "0" ]; then
    "${SCRIPT_DIR}/check_nvls_fabric.sh"
fi

Alternatively, run the fabric check in a separate srun step with --ntasks-per-node=1 before the perftest step.

ikryukov · 2026-03-03T16:08:10Z

/build

greptile-apps · 2026-03-03T16:11:11Z

Additional Comments (1)

.ci/scripts/check_nvls_fabric.sh, line 1
Missing set -o pipefail option. Without it, failures in the left side of pipes will be silently masked. For example, line 7 (nvidia-smi ... | wc -l) will succeed overall even if nvidia-smi fails — the pipeline exit status comes only from wc -l.

#!/bin/bash -xe
set -o pipefail

This is consistent with build_ucc.sh which includes this option.
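The difference is easy to reproduce in isolation. In the sketch below, `false` is a stand-in for a failing `nvidia-smi` call; the rest mirrors the `... | wc -l` shape mentioned above.

```shell
#!/bin/bash
# 'false' stands in for a failing nvidia-smi invocation.

# Without pipefail, the pipeline status is that of the last command (wc -l).
set +o pipefail
status_without=0
false | wc -l >/dev/null || status_without=$?

# With pipefail, the pipeline reports the failure of the command on the left.
set -o pipefail
status_with=0
false | wc -l >/dev/null || status_with=$?

echo "without pipefail: ${status_without}"
echo "with pipefail: ${status_with}"
```

Combined with `-e` in the shebang line, the second behaviour is what makes the script stop when the left-hand command fails.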

ikryukov · 2026-03-04T16:13:30Z

/build

.ci/scripts/run_tests_ucc_nvls_all_slurm.sh

.ci/scripts/run_tests_ucc_nvls_mpi.sh

.ci/Dockerfile.nvls

ikryukov · 2026-03-06T10:51:57Z

/build

.ci/pipeline/test_nvls_matrix.yaml

ikryukov · 2026-03-06T14:04:51Z

/build

ikryukov · 2026-03-09T09:01:32Z

/build

Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>

Sergei-Lebedev · 2026-03-10T10:18:05Z

/build

ikryukov added the Ready-for-Review label Feb 27, 2026

greptile-apps bot reviewed Feb 27, 2026

Review thread on .ci/scripts/check_nvls_fabric.sh (outdated, resolved)

ikryukov self-assigned this Feb 27, 2026

janjust added the WIP - Don't Merge label Feb 27, 2026

ikryukov force-pushed the nvls_tests branch from 82308b8 to a51b8d7 Compare February 27, 2026 18:15

ikryukov force-pushed the nvls_tests branch from a51b8d7 to e731933 Compare March 2, 2026 15:32

ikryukov requested review from Sergei-Lebedev and janjust March 2, 2026 16:01

ikryukov removed the WIP - Don't Merge label Mar 2, 2026

ikryukov requested a review from dpressle March 3, 2026 11:09

janjust approved these changes Mar 3, 2026

ikryukov force-pushed the nvls_tests branch 2 times, most recently from e731933 to f80fb69 Compare March 3, 2026 15:20

ikryukov force-pushed the nvls_tests branch from f80fb69 to bb7e22e Compare March 3, 2026 15:45

ikryukov force-pushed the nvls_tests branch from bb7e22e to b4f9292 Compare March 3, 2026 16:06

dpressle requested changes Mar 5, 2026

Review thread on .ci/scripts/run_tests_ucc_nvls_all_slurm.sh (outdated, resolved)

Review thread on .ci/scripts/run_tests_ucc_nvls_mpi.sh (resolved)

Review thread on .ci/Dockerfile.nvls (outdated, resolved)

ikryukov requested a review from dpressle March 5, 2026 16:54

ikryukov force-pushed the nvls_tests branch from e2b7fb7 to 0f260fa Compare March 6, 2026 10:51

dpressle requested changes Mar 6, 2026

Review thread on .ci/pipeline/test_nvls_matrix.yaml (resolved)

ikryukov requested a review from dpressle March 6, 2026 14:04

dpressle approved these changes Mar 8, 2026

ikryukov force-pushed the nvls_tests branch from 2e7af4e to d2955ba Compare March 9, 2026 09:01

Sergei-Lebedev enabled auto-merge (squash) March 9, 2026 09:36

ikryukov added 3 commits March 10, 2026 11:17

CI: enhance NVLS tests

a10abdd

Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>

CI: addressed comments

2934d66

Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>

CI: fix

9c23025

Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>

Sergei-Lebedev force-pushed the nvls_tests branch from d2955ba to 9c23025 Compare March 10, 2026 10:17

Sergei-Lebedev merged commit b7d4f76 into openucx:master Mar 10, 2026
11 of 12 checks passed
