>>> ref
array([-128, -128, -127, -128, -128, -128, -128, -128, -126, -122, -128,
-128, -128, -128, -125, -127, -127, -128, -128, -127, -128, -126,
-127, -128, -128, -128, -128, -127, -128, -128, -128, -128, -128,
-127, -128, -128, -128, -128, -128, -128, -128, -128, -128, -127,
-125, -128, -125, -128, -128, -128, -127, -128, -123, -128, -128,
-128, -127, -128, -125, -127, -128, -128, -128, -127, -128, -127,
-126, -127, -126, -128, -128, -127, -127, -127, -127, -128, -128,
-128, -128, -127, -128, -128, -127, -128, -128, -128, -128, -127,
-125, -127, -128, -128, -127, -126, -128, -128, -127, -128, -124,
-128, -128, -128, -128, -127, -126, -128, -128, -127, -128, -127,
-125, -128, -128, -128, -127, -128, -128, -125, -128, -128, -128,
-126, -128, -127, -128, -128, -128, -128, -127, -126, -127, -128,
-127, -127, -125, -128, -128, -125, -127, -128, -128, -125, -127,
-126, -128, -128, -128, -127, -127, -128, -126, -128, -128, -128,
-128, -126, -126, -127, -128, -127, -126, -127, -127, -127, -128,
-127, -128, -128, -127, -128, -127, -127, -128, -127, -128, -127,
-126, -126, -127, -128, -128, -127, -127, -128, -128, -126, -127,
-128, -128, -128, -128, -128, -128, -126, -128, -128, -128, -128,
-128, -128, -128, -127, -128, -127, -128, -128, -128, -128, -128,
-127, -128, -128, -128, -125, -128, -128, -127, -128, -128, -126,
-128, -128, -127, -127, -126, -128, -125, -128, -125, -128, -128,
-128, -128, -128, -126, -126, -128, -128, -126, -128, -128, -127,
-127, -126, -128, -128, -128, -125, -127, -126, -127, -127, -127,
-128, -128, -125], dtype=int8)
>>> got
array([-128, -126, -123, -126, -128, -128, -127, -126, -126, -121, -110,
-127, -128, -127, -122, -124, -127, -119, -127, -126, -128, -114,
-126, -124, -128, -127, -125, -122, -128, -118, -128, -128, -128,
-123, -122, -119, -128, -124, -122, -117, -128, -112, -126, -121,
-125, -120, -124, -128, -128, -128, -120, -128, -123, -120, -128,
-122, -127, -121, -116, -94, -128, -119, -113, -122, -128, -127,
-125, -125, -126, -121, -112, -127, -127, -127, -126, -127, -128,
-123, -113, -109, -128, -121, -125, -127, -128, -127, -127, -120,
-125, -93, -118, -108, -127, -126, -128, -123, -127, -122, -1,
-1, -128, -124, -1, -1, -126, -124, -1, -1, -128, -127,
-1, -1, -128, -115, -1, -1, -128, -107, -1, -1, -128,
-125, -1, -1, -128, -106, -127, -117, -127, -114, -111, -124,
-127, -114, -117, -120, -128, -112, -120, -125, -128, -125, -119,
-126, -128, -119, -122, -124, -127, -125, -124, -125, -128, -122,
-123, -112, -126, -125, -125, -121, -126, -125, -117, -114, -128,
-94, -118, -122, -127, -110, -118, -123, -128, -114, -96, -125,
-126, -115, -126, -124, -128, -110, -124, -128, -128, -124, -126,
-116, -128, -113, -128, -124, -128, -124, -128, -1, -128, -122,
-128, -1, -128, -125, -119, -1, -128, -126, -128, -1, -128,
-110, -126, -1, -128, -124, -125, -1, -127, -104, -125, -1,
-128, -122, -127, -1, -126, -105, -125, -127, -125, -124, -128,
-1, -128, -127, -116, -126, -128, -126, -126, -1, -128, -110,
-121, -1, -128, -128, -125, -123, -127, -126, -124, -123, -127,
-127, -126, -125], dtype=int8)
Software versions
System info:
Python : 3.13.11 | packaged by conda-forge | (main, Jan 26 2026, 23:57:06) [GCC 14.3.0]
Platform : Linux-5.15.0-174-generic-x86_64-with-glibc2.35
GPU driver : 580.105.08
GPU devices :
GPU 0 : NVIDIA A30X
Package versions:
legion : legion-25.12.0-49-g27b2c7fec-dirty (commit: 27b2c7fec5979298aca6fae935e8d857b58004eb)
legate : 26.05.00.dev0
cupynumeric : 26.05.00.dev+6.g2b1bfee8
numpy : 2.3.5
scipy : 1.16.3
numba : (failed to detect)
Legate build configuration:
build_type : Release
use_openmp : True
use_cuda : True
networks : ucx
conduit :
configure_options : --LEGATE_ARCH=arch-conda;--with-python;--with-cc=/tmp/conda-croot/legate/_build_env/bin/x86_64-conda-linux-gnu-cc;--with-cxx=/tmp/conda-croot/legate/_build_env/bin/x86_64-conda-linux-gnu-c++;--build-march=haswell;--cmake-generator=Ninja;--with-openmp;--with-cuda;--build-type=release;--with-ucx
Package details:
cuda-version : cuda-version-13.1-h2ff5cdb_3 (conda-forge)
legate : legate-26.5.0.dev0-pypi_0 (pypi)
cupynumeric : cupynumeric-26.05.00.dev6-cuda13_py313_gpu_g2b1bfee8_6 (legate-nightly)
Jupyter notebook / Jupyter Lab version
No response
Expected behavior
Int8 unary reductions along an axis/dim produce incorrect results on GPU: cupynumeric fails to find the correct minimum along the given axis. This does not occur on the CPU-only build path. Maximum can fail as well. I have not tested other unary reductions with an axis argument.
This occurs on GPU builds of 25.10 (the prebuilt binaries for cuNumeric.jl) and 26.05 (the nightly build described in the legate-issue environment information above).
Int8 unary reductions without the axis kwarg return correct results.
My reproducer below is expected to print "mismatches 0".
Observed behavior
In our reproducer, we expect no mismatches between numpy and cupynumeric.
Our reproducer creates a 256x256 array in both numpy and cupynumeric. The cupynumeric array is constructed from the numpy array, which ensures that both libraries operate on the same data.
We then run a unary minimum reduction on axis=0, which should yield 256 separate results. Checking multiple results was necessary because cupynumeric's reported minimum is sometimes correct. Roughly half of the 256 results differ between numpy and cupynumeric.
The output shown in this report was produced on an A30X GPU; the same failure also occurred on a 5060 GPU.
Example code or instructions
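A minimal sketch of the reproducer as described under "Observed behavior", not necessarily the exact script that produced the output in this report. The seed, the use of `default_rng`, and the helper names `make_input` and `count_mismatches` are assumptions made for illustration; the 256x256 shape, the int8 dtype, the axis=0 minimum, and the "mismatches" output come from the description.

```python
import numpy as np

try:
    # A GPU-enabled cupynumeric build is needed to observe the failure.
    import cupynumeric as cn
except ImportError:
    cn = None


def make_input(seed=0, shape=(256, 256)):
    """Random int8 matrix covering the full [-128, 127] range (seed is an assumption)."""
    rng = np.random.default_rng(seed)
    return rng.integers(-128, 128, size=shape, dtype=np.int8)


def count_mismatches(xp, a_np, op="min", axis=0):
    """Run reduction `op` in numpy and in module `xp`, return how many results differ."""
    ref = getattr(np, op)(a_np, axis=axis)
    # Build the xp array from the numpy array so both sides see identical data.
    a_xp = xp.array(a_np)
    got = np.asarray(getattr(xp, op)(a_xp, axis=axis))
    return int(np.count_nonzero(ref != got))


if cn is not None:
    a = make_input()
    # Expected to print "mismatches 0"; on the affected GPU builds it does not.
    print("mismatches", count_mismatches(cn, a, "min", axis=0))
```

Passing `"max"` as `op` checks the maximum reduction the same way, and `axis=None` exercises the path without the axis kwarg, which returns correct results.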
Stack traceback or browser console output
No response