Skip to content

CI: enhance NVLS tests#1269

Merged
Sergei-Lebedev merged 3 commits intoopenucx:masterfrom
ikryukov:nvls_tests
Mar 10, 2026
Merged

CI: enhance NVLS tests#1269
Sergei-Lebedev merged 3 commits intoopenucx:masterfrom
ikryukov:nvls_tests

Conversation

@ikryukov
Copy link
Collaborator

@ikryukov ikryukov commented Feb 27, 2026

What

  1. Added correctness tests for NVLS collectives
  2. Updated CUDA to 13.1.1 for NVLS path
  3. Use official CUDA base image nvcr.io/nvidia/cuda:${CUDA_VER}-devel-ubuntu24.04 (more lightweight)
  4. Use official HPCX instead of building from source
  5. Added smoke testing of nvlink on allocated nodes to separate node issues from UCC

@ikryukov
Copy link
Collaborator Author

/build

1 similar comment
@ikryukov
Copy link
Collaborator Author

/build

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 27, 2026

Greptile Summary

This PR enhances the NVLS CI pipeline by introducing dedicated correctness tests (MPI and perftest), a new fabric smoke-test script, an official CUDA base image, and official HPC-X binaries — all well-motivated changes that reduce build complexity. Two actionable issues were found:

  • Missing reduce_scatter test coverage: run_tests_ucc_nvls_all.sh states that reduce_scatter is "tested via MPI tests", but the corresponding test in run_tests_ucc_nvls_mpi.sh is also commented out. Neither script exercises the reduce_scatter NVLS path.
  • Potential premature Slurm allocation teardown: The "Run UCC NVLS perftest" step uses onfail to free the Slurm allocation on failure. If the CI framework proceeds to the next step after onfail (rather than aborting), the "Run UCC NVLS MPI tests" step will execute against an already-freed allocation, causing a guaranteed failure and a redundant stop_slurm_allocation call from its own always block.
  • No checksum on the HPC-X tarball: The wget download in Dockerfile.nvls lacks a SHA-256 digest check; consider adding one to guard against a tampered/substituted package.

Confidence Score: 3/5

  • Safe to merge with minor fixes; no functional regressions expected in the perftest path, but test coverage gaps and a pipeline ordering risk should be addressed.
  • The structural changes (Dockerfile, build configurability, fabric smoke test, refactored Slurm runner) are solid. The reduce_scatter coverage gap is a real omission given the explicit cross-file comment that claims it is covered. The onfail/always ordering in the pipeline is a latent failure mode that could silently fail MPI tests if perftest fails first.
  • .ci/pipeline/test_nvls_matrix.yaml (onfail ordering) and .ci/scripts/run_tests_ucc_nvls_mpi.sh (commented-out reduce_scatter contradicts sister script comment).

Important Files Changed

Filename Overview
.ci/Dockerfile.nvls New Dockerfile switching to the official CUDA base image and installing HPC-X from a tarball. The HPC-X download lacks checksum verification, which is a minor supply chain risk.
.ci/pipeline/test_nvls_matrix.yaml Pipeline updated to split tests into perftest and MPI steps. The onfail on the perftest step can prematurely free the Slurm allocation before the MPI tests step executes, potentially causing a cascading failure.
.ci/scripts/build_ucc.sh Makes the --with-tls configure flag configurable via UCC_BUILD_TLS; default value preserves the previous hardcoded list. Clean, safe change.
.ci/scripts/check_nvls_fabric.sh New smoke-test script that validates NVLink fabric registration, state, and P2P topology before running UCC tests. Logic for counting completed GPUs and failure detection is correct.
.ci/scripts/run_nvls_slurm.sh Renamed and refactored to accept the container script and tasks-per-node as arguments; also upgrades MPI from pmi2 to pmix and improves SSH quoting. Well structured.
.ci/scripts/run_tests_ucc_nvls_all.sh New perftest runner script. The comment that reduce_scatter is "tested via MPI tests instead" is inaccurate — it is also commented out in the MPI test script, leaving no reduce_scatter NVLS coverage.
.ci/scripts/run_tests_ucc_nvls_mpi.sh New MPI correctness test runner for NVLS allreduce. The reduce_scatter test variant is commented out, inconsistent with the comment in run_tests_ucc_nvls_all.sh.

Comments Outside Diff (3)

  1. .ci/scripts/run_tests_ucc_nvls_all.sh, line 22-25 (link)

    reduce_scatter effectively untested

    The comment here says reduce_scatter NVLS is "tested via MPI tests instead", but in run_tests_ucc_nvls_mpi.sh the reduce_scatter test is also commented out (lines 23–25). As a result, reduce_scatter NVLS coverage is completely absent from the CI pipeline.

    Either enable the reduce_scatter test in run_tests_ucc_nvls_mpi.sh, or update this comment to reflect that it is currently disabled everywhere.

  2. .ci/Dockerfile.nvls, line 34-40 (link)

    No integrity verification for the HPC-X tarball

    The HPC-X tarball is downloaded with wget -q over HTTPS but without any checksum verification. A compromised or substituted tarball would be silently accepted. Consider adding a SHA-256 digest check after the download, e.g.:

    echo "<expected_sha256>  ${HPCX_FILENAME}.tbz" | sha256sum -c -

    This is especially relevant for a Dockerfile that is baked and distributed as a shared CI image.

  3. .ci/pipeline/test_nvls_matrix.yaml, line 72-92 (link)

    Premature Slurm allocation teardown may break the MPI tests step

    The "Run UCC NVLS perftest" step has an onfail block that calls stop_slurm_allocation.sh. If the CI framework continues to the next step after executing onfail (rather than aborting the pipeline), the "Run UCC NVLS MPI tests" step will run against a Slurm job that has already been freed, causing it to fail with a stale job ID.

    The "Run UCC NVLS MPI tests" step then also calls stop_slurm_allocation.sh in its always block, resulting in a double-stop.

    If the intent is "run MPI tests only when perftest succeeds", consider removing the onfail from the perftest step and relying solely on the always block of the MPI tests step for cleanup — both on success and failure paths.

Last reviewed commit: 9c23025

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

8 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

@ikryukov
Copy link
Collaborator Author

/build

1 similar comment
@ikryukov
Copy link
Collaborator Author

/build

@ikryukov ikryukov self-assigned this Feb 27, 2026
@ikryukov
Copy link
Collaborator Author

/build

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 27, 2026

Additional Comments (3)

.ci/scripts/check_nvls_fabric_slurm.sh, line 30
Empty string case is unreachable since line 10 already exits if SLURM_HEAD_NODE is empty

    *)
        echo "Connecting to SLURM head node: ${SLURM_HEAD_NODE}"
        eval "${SSH_CMD} ${SLURM_HEAD_NODE} ${SLURM_COMMAND}"
        ;;

.ci/scripts/run_tests_ucc_nvls_all_slurm.sh, line 29
Empty string case is unreachable since line 10 already exits if SLURM_HEAD_NODE is empty

    *)
        echo "Connecting to SLURM head node: ${SLURM_HEAD_NODE}"
        eval "${SSH_CMD} ${SLURM_HEAD_NODE} ${SLURM_COMMAND}"
        ;;

.ci/scripts/run_tests_ucc_nvls_mpi_slurm.sh, line 30
Empty string case is unreachable since line 10 already exits if SLURM_HEAD_NODE is empty

    *)
        echo "Connecting to SLURM head node: ${SLURM_HEAD_NODE}"
        eval "${SSH_CMD} ${SLURM_HEAD_NODE} ${SLURM_COMMAND}"
        ;;

@ikryukov
Copy link
Collaborator Author

/build

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 27, 2026

Additional Comments (3)

.ci/scripts/check_nvls_fabric_slurm.sh, line 29
SLURM_COMMAND contains DOCKER_IMAGE_NAME which includes a # character (see test_nvls_matrix.yaml:77). The default case should escape it like the dlcluster* case does.

        SLURM_COMMAND_ESCAPED="${SLURM_COMMAND//\#/\\#}"
        eval "${SSH_CMD} ${SLURM_HEAD_NODE} \"${SLURM_COMMAND_ESCAPED}\""

.ci/scripts/run_tests_ucc_nvls_all_slurm.sh, line 28
Same issue as check_nvls_fabric_slurm.sh - missing # escaping in default case.

        SLURM_COMMAND_ESCAPED="${SLURM_COMMAND//\#/\\#}"
        eval "${SSH_CMD} ${SLURM_HEAD_NODE} \"${SLURM_COMMAND_ESCAPED}\""

.ci/scripts/run_tests_ucc_nvls_mpi_slurm.sh, line 29
Missing # escaping in default case - needs same fix as other SLURM wrappers.

        SLURM_COMMAND_ESCAPED="${SLURM_COMMAND//\#/\\#}"
        eval "${SSH_CMD} ${SLURM_HEAD_NODE} \"${SLURM_COMMAND_ESCAPED}\""

@ikryukov
Copy link
Collaborator Author

/build

1 similar comment
@ikryukov
Copy link
Collaborator Author

/build

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 27, 2026

Additional Comments (2)

.ci/scripts/run_tests_ucc_nvls_all_slurm.sh, line 9
Inconsistent MPI launcher: SRUN_PERFTEST uses --mpi=pmix while run_tests_ucc_nvls_mpi_slurm.sh (which runs the same run_tests_ucc_nvls_mpi.sh script) uses --mpi=pmi2. This could cause unexpected behavior differences between test runs.

readonly SRUN_BASE="srun --jobid=${SLURM_JOB_ID} --nodes=${SLURM_NODES} --mpi=pmi2 --ntasks-per-node=${NVLS_MPI_PPN:-4} --container-image=${DOCKER_IMAGE_NAME}"

.ci/scripts/run_tests_ucc_nvls_all.sh, line 25
This script is identical to run_tests_ucc_nvls.sh. Consider using a symlink or sourcing one from the other to avoid maintenance issues when changes need to be synchronized.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@janjust
Copy link
Collaborator

janjust commented Feb 27, 2026

ready for review only to enable runs

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 27, 2026

Additional Comments (1)

.ci/scripts/run_tests_ucc_nvls_all_slurm.sh, line 7
Uses --mpi=pmix while other NVLS test scripts use --mpi=pmi2. Verify pmix is available in your environment and this inconsistency is intentional.

@ikryukov
Copy link
Collaborator Author

/build

@ikryukov
Copy link
Collaborator Author

ikryukov commented Mar 2, 2026

/build

@ikryukov
Copy link
Collaborator Author

ikryukov commented Mar 3, 2026

/build

@ikryukov ikryukov requested a review from dpressle March 3, 2026 11:09
@ikryukov ikryukov force-pushed the nvls_tests branch 2 times, most recently from e731933 to f80fb69 Compare March 3, 2026 15:20
@ikryukov
Copy link
Collaborator Author

ikryukov commented Mar 3, 2026

/build

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 3, 2026

Additional Comments (3)

.ci/scripts/run_tests_ucc_nvls_all.sh, line 24
The comment states that perftest srun uses ntasks-per-node=1, but SRUN_BASE in run_tests_ucc_nvls_all_slurm.sh (line 7) now uses --ntasks-per-node=${NVLS_MPI_PPN:-4} (defaults to 4). This satisfies the "multiple ranks per node" requirement for reduce_scatter NVLS, so the stated justification for disabling this test may no longer be accurate.

Either re-enable the reduce_scatter test if it now works with ntasks-per-node=4, or update the comment to reflect the current reason for keeping it disabled (e.g., if it's tested via MPI instead for a different reason).

# Disabled: reduce_scatter NVLS is tested via MPI tests instead.
#echo "INFO: NVLS perftest (reduce_scatter) ..."

.ci/scripts/run_tests_ucc_nvls_all_slurm.sh, line 32
The default SSH case (line 31) omits double-quotes around ${SLURM_COMMAND} unlike the dlcluster branch (line 27). The SLURM_COMMAND variable contains spaces and potentially special characters (e.g., docker registry paths with colons from DOCKER_IMAGE_NAME). Without quotes, the SSH client receives a series of arguments rather than a single remote command string, which can cause shell misparsing of the srun invocation.

            eval "${SSH_CMD} ${SLURM_HEAD_NODE} \"${SLURM_COMMAND}\""

.ci/Dockerfile.nvls, line 40
The HPC-X tarball is fetched from an external host without verifying a checksum. If the upstream server returns corrupted data or is ever compromised, the build will silently succeed with an unverified binary. This reduces build reproducibility and introduces a supply-chain risk.

Consider adding a sha256sum verification step after the wget:

wget -q "https://content.mellanox.com/hpc/hpc-x/${HPCX_VERSION}_${HPCX_CUDA}/${HPCX_FILENAME}.tbz" && \
    echo "EXPECTED_SHA256_HERE ${HPCX_FILENAME}.tbz" | sha256sum -c - && \
    tar xf "${HPCX_FILENAME}.tbz" && \

You will need to obtain the correct SHA256 hash from the official HPC-X release documentation.

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 3, 2026

Additional Comments (1)

.ci/scripts/run_tests_ucc_nvls_all.sh, line 6
The fabric check runs once per MPI task, not once per node. Since this script serves as the srun entrypoint with --ntasks-per-node=4, the fabric check executes 4 times simultaneously per node, producing redundant output.

Since fabric validation is a per-node operation, consider running it only on local rank 0 to avoid duplication:

if [ "${SLURM_LOCALID:-0}" = "0" ]; then
    "${SCRIPT_DIR}/check_nvls_fabric.sh"
fi

Alternatively, run the fabric check in a separate srun step with --ntasks-per-node=1 before the perftest step.

@ikryukov
Copy link
Collaborator Author

ikryukov commented Mar 3, 2026

/build

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 3, 2026

Additional Comments (1)

.ci/scripts/check_nvls_fabric.sh, line 1
Missing set -o pipefail option. Without it, failures in the left side of pipes will be silently masked. For example, line 7 (nvidia-smi ... | wc -l) will succeed overall even if nvidia-smi fails — the pipeline exit status comes only from wc -l.

#!/bin/bash -xe
set -o pipefail

This is consistent with build_ucc.sh which includes this option.

@ikryukov
Copy link
Collaborator Author

ikryukov commented Mar 4, 2026

/build

@ikryukov
Copy link
Collaborator Author

ikryukov commented Mar 6, 2026

/build

@ikryukov ikryukov requested a review from dpressle March 6, 2026 14:04
@ikryukov
Copy link
Collaborator Author

ikryukov commented Mar 6, 2026

/build

@ikryukov
Copy link
Collaborator Author

ikryukov commented Mar 9, 2026

/build

@Sergei-Lebedev Sergei-Lebedev enabled auto-merge (squash) March 9, 2026 09:36
Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>
Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>
Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>
@Sergei-Lebedev
Copy link
Contributor

/build

@Sergei-Lebedev Sergei-Lebedev merged commit b7d4f76 into openucx:master Mar 10, 2026
11 of 12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants