CI: enhance NVLS tests #1269
/build |
| Filename | Overview |
|---|---|
| .ci/Dockerfile.nvls | New Dockerfile switching to the official CUDA base image and installing HPC-X from a tarball. The HPC-X download lacks checksum verification, which is a minor supply chain risk. |
| .ci/pipeline/test_nvls_matrix.yaml | Pipeline updated to split tests into perftest and MPI steps. The onfail on the perftest step can prematurely free the Slurm allocation before the MPI tests step executes, potentially causing a cascading failure. |
| .ci/scripts/build_ucc.sh | Makes the --with-tls configure flag configurable via UCC_BUILD_TLS; default value preserves the previous hardcoded list. Clean, safe change. |
| .ci/scripts/check_nvls_fabric.sh | New smoke-test script that validates NVLink fabric registration, state, and P2P topology before running UCC tests. Logic for counting completed GPUs and failure detection is correct. |
| .ci/scripts/run_nvls_slurm.sh | Renamed and refactored to accept the container script and tasks-per-node as arguments; also upgrades MPI from pmi2 to pmix and improves SSH quoting. Well structured. |
| .ci/scripts/run_tests_ucc_nvls_all.sh | New perftest runner script. The comment that reduce_scatter is "tested via MPI tests instead" is inaccurate — it is also commented out in the MPI test script, leaving no reduce_scatter NVLS coverage. |
| .ci/scripts/run_tests_ucc_nvls_mpi.sh | New MPI correctness test runner for NVLS allreduce. The reduce_scatter test variant is commented out, inconsistent with the comment in run_tests_ucc_nvls_all.sh. |
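The `build_ucc.sh` row describes a configure flag that becomes overridable while keeping its old default. A minimal sketch of that pattern follows. The function name `ucc_tls_flag` and the `DEFAULT_TLS` value are placeholders for illustration; the actual default list in `build_ucc.sh` is not shown in this thread.

```shell
#!/usr/bin/env bash
# Sketch only: DEFAULT_TLS is a placeholder, not the list used in build_ucc.sh.
DEFAULT_TLS="tl_a,tl_b"

# Print the --with-tls flag, using UCC_BUILD_TLS when it is set and non-empty
# and falling back to the default list otherwise.
ucc_tls_flag() {
    # "--" stops printf from treating the format string as an option.
    printf -- '--with-tls=%s\n' "${UCC_BUILD_TLS:-${DEFAULT_TLS}}"
}
```

With this shape, existing pipelines that never set `UCC_BUILD_TLS` keep their previous behavior, and a job can narrow the list by exporting the variable before the build.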
Comments Outside Diff (3)
- `.ci/scripts/run_tests_ucc_nvls_all.sh`, line 22-25: **`reduce_scatter` effectively untested**

  The comment here says `reduce_scatter` NVLS is "tested via MPI tests instead", but in `run_tests_ucc_nvls_mpi.sh` the `reduce_scatter` test is also commented out (lines 23–25). As a result, `reduce_scatter` NVLS coverage is completely absent from the CI pipeline.

  Either enable the `reduce_scatter` test in `run_tests_ucc_nvls_mpi.sh`, or update this comment to reflect that it is currently disabled everywhere.

- `.ci/Dockerfile.nvls`, line 34-40: **No integrity verification for the HPC-X tarball**

  The HPC-X tarball is downloaded with `wget -q` over HTTPS but without any checksum verification. A compromised or substituted tarball would be silently accepted. Consider adding a SHA-256 digest check after the download, e.g.: `echo "<expected_sha256> ${HPCX_FILENAME}.tbz" | sha256sum -c -`

  This is especially relevant for a Dockerfile that is baked and distributed as a shared CI image.

- `.ci/pipeline/test_nvls_matrix.yaml`, line 72-92: **Premature Slurm allocation teardown may break the MPI tests step**

  The "Run UCC NVLS perftest" step has an `onfail` block that calls `stop_slurm_allocation.sh`. If the CI framework continues to the next step after executing `onfail` (rather than aborting the pipeline), the "Run UCC NVLS MPI tests" step will run against a Slurm job that has already been freed, causing it to fail with a stale job ID.

  The "Run UCC NVLS MPI tests" step then also calls `stop_slurm_allocation.sh` in its `always` block, resulting in a double-stop.

  If the intent is "run MPI tests only when perftest succeeds", consider removing the `onfail` from the perftest step and relying solely on the `always` block of the MPI tests step for cleanup, on both success and failure paths.
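If both cleanup hooks were to stay in the pipeline, the double-stop described in the last comment could instead be made harmless with a run-once guard. The sketch below is a different approach from the one the review recommends. `run_once` is a hypothetical helper, and it assumes both steps can see the same marker file; how `stop_slurm_allocation.sh` is actually invoked is not shown in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a cleanup command at most once per marker file.
# Usage: run_once MARKER_FILE COMMAND [ARGS...]
run_once() {
    local marker="$1"
    shift
    # With noclobber set, the redirection fails if the marker already exists,
    # so only the first caller gets to run the command.
    if ( set -o noclobber; : > "${marker}" ) 2>/dev/null; then
        "$@"
    else
        echo "cleanup already done (${marker}), skipping" >&2
    fi
}
```

Each hook would then wrap its teardown call in `run_once` with a marker path derived from the job ID. This only removes the duplicate teardown; it does not address the MPI step running against an allocation that was already freed.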
Last reviewed commit: 9c23025
ready for review only to enable runs |
e731933 to f80fb69
Additional Comments (3)
Either re-enable the `reduce_scatter` test if it now works with
Consider adding a
You will need to obtain the correct SHA256 hash from the official HPC-X release documentation.
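The digest check suggested for the HPC-X download can be sketched as a small shell helper. `verify_sha256` is a hypothetical name, and the expected digest has to come from the official HPC-X release; no real digest is shown here. The sketch assumes GNU coreutils `sha256sum`.

```shell
#!/usr/bin/env bash
# Sketch: verify a downloaded file against an expected SHA-256 digest.
# Usage: verify_sha256 FILE EXPECTED_HEX_DIGEST
verify_sha256() {
    local file="$1" expected="$2"
    # sha256sum -c reads "DIGEST  FILENAME" lines (two spaces) from stdin and
    # exits non-zero on a mismatch or a missing file; --status silences output.
    printf '%s  %s\n' "${expected}" "${file}" | sha256sum -c --status -
}
```

In a Dockerfile `RUN` step, the call would be chained with `&&` directly after the download so that a mismatch fails the image build before the tarball is unpacked.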
Additional Comments (1)
Since fabric validation is a per-node operation, consider running it only on local rank 0 to avoid duplication:

    if [ "${SLURM_LOCALID:-0}" = "0" ]; then
        "${SCRIPT_DIR}/check_nvls_fabric.sh"
    fi

Alternatively, run the fabric check in a separate `srun` step with
Additional Comments (1)
This is consistent with |
Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>
d2955ba to 9c23025
What
Switch the base image to `nvcr.io/nvidia/cuda:${CUDA_VER}-devel-ubuntu24.04` (more lightweight)