
Native code order of magnitude slower than translated code on Apple M1 #17989

Closed
neurolabusc opened this issue Dec 12, 2020 · 20 comments · Fixed by #20131

Comments

@neurolabusc

I realize NumPy is built with experimental compilers for native M1 builds and still has some bugs, so it may be premature to discuss optimizations; perhaps this is a feature request rather than a bug. However, one would expect native ARM code to be at least as fast as translated x86-64 code. I noticed that the nibabel bench_finite_range.py test is much slower with native code than with translated code: translated code (Python 3.8.3, NumPy 1.19.4) is roughly 10x faster than native code (Python 3.9.1rc1, NumPy 1.19.4).

Reproducing code example:

# -*- coding: utf-8 -*-

import numpy as np
from numpy.testing import measure
# Example where translated code (Python 3.8.3, NumPy 1.19.4) is ~10x faster
# than native code (Python 3.9.1rc1, NumPy 1.19.4).
rng = np.random.RandomState(20111001)
img_shape = (128, 128, 64, 10)
repeat = 100
arr = rng.normal(size=img_shape)
mtime = measure('np.max(arr)', repeat)
print('%30s %6.2f' % ('max all finite', mtime))
mtime = measure('np.min(arr)', repeat)
print('%30s %6.2f' % ('min all finite', mtime))
arr[:, :, :, 1] = np.nan  # introduce NaNs into one volume of the 4D array
mtime = measure('np.max(arr)', repeat)
print('%30s %6.2f' % ('max all nan', mtime))
mtime = measure('np.min(arr)', repeat)
print('%30s %6.2f' % ('min all nan', mtime))

Performance:

Translated:

$ time ./numpy_native_slower_than_translated.py
                max all finite   0.18
                min all finite   0.18
                   max all nan   0.18
                   min all nan   0.19
./numpy_native_slower_than_translated.py  1.32s user 1.28s system 214% cpu 1.213 total

Native:

$ time ./numpy_native_slower_than_translated.py
                max all finite   1.98
                min all finite   1.99
                   max all nan   1.99
                   min all nan   1.98
./numpy_native_slower_than_translated.py  8.49s user 0.14s system 104% cpu 8.237 total

NumPy/Python version information:

Translated:

  • 1.19.4 3.8.3 (default, May 19 2020, 13:54:14)
    [Clang 10.0.0 ]

Native:

  • 1.19.4 3.9.1rc1 | packaged by conda-forge | (default, Nov 28 2020, 22:21:58)
    [Clang 11.0.0 ]
@mattip
Member

mattip commented Dec 12, 2020

Is there a way to test HEAD or numpy 1.20rc1 natively? We have started to use SIMD intrinsics in a way that might make this faster.

@neurolabusc
Author

Happy to help. Are there instructions for building on this platform? I have Clang 12.0.0 and gfortran 11.0.0 20201114 (experimental), but not gcc proper. It looks like there are dependencies for BLAS; do I use brew to install from source?

$pip install numpy==1.20rc1 
Collecting numpy==1.20rc1
  Using cached numpy-1.20.0rc1.zip (7.7 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Building wheels for collected packages: numpy
  Building wheel for numpy (PEP 517) ... error
  ERROR: Command errored out with exit status 1:
   command: /Users/chris/miniforge3/bin/python3.9 /Users/chris/miniforge3/lib/python3.9/site-packages/pip/_vendor/pep517/_in_process.py build_wheel /var/folders/4p/cxb1ys0s4wz6dh3rtj22pq8w0000gn/T/tmpsavxx_s9
       cwd: /private/var/folders/4p/cxb1ys0s4wz6dh3rtj22pq8w0000gn/T/pip-install-t6cbgk1m/numpy_96013a2b4f96408886c142437efacd24
  Complete output (708 lines):
  Running from numpy source directory.
  numpy/random/_bounded_integers.pxd.in has not changed
  numpy/random/_philox.pyx has not changed
  numpy/random/_bounded_integers.pyx.in has not changed
  numpy/random/_sfc64.pyx has not changed
  numpy/random/_mt19937.pyx has not changed
  numpy/random/bit_generator.pyx has not changed
  Processing numpy/random/_bounded_integers.pyx
  numpy/random/mtrand.pyx has not changed
  numpy/random/_generator.pyx has not changed
  numpy/random/_pcg64.pyx has not changed
  numpy/random/_common.pyx has not changed
  Cythonizing sources
  blas_opt_info:
  blas_mkl_info:
  customize UnixCCompiler
    libraries mkl_rt not found in ['/Users/chris/miniforge3/lib', '/usr/local/lib', '/usr/lib', '/opt/local/lib']
    NOT AVAILABLE
  
  blis_info:
    libraries blis not found in ['/Users/chris/miniforge3/lib', '/usr/local/lib', '/usr/lib', '/opt/local/lib']
    NOT AVAILABLE
  
  openblas_info:
    libraries openblas not found in ['/Users/chris/miniforge3/lib', '/usr/local/lib', '/usr/lib', '/opt/local/lib']
    NOT AVAILABLE
  
  atlas_3_10_blas_threads_info:
  Setting PTATLAS=ATLAS
    libraries tatlas not found in ['/Users/chris/miniforge3/lib', '/usr/local/lib', '/usr/lib', '/opt/local/lib']
    NOT AVAILABLE
  
  atlas_3_10_blas_info:
    libraries satlas not found in ['/Users/chris/miniforge3/lib', '/usr/local/lib', '/usr/lib', '/opt/local/lib']
    NOT AVAILABLE
  
  atlas_blas_threads_info:
  Setting PTATLAS=ATLAS
    libraries ptf77blas,ptcblas,atlas not found in ['/Users/chris/miniforge3/lib', '/usr/local/lib', '/usr/lib', '/opt/local/lib']
    NOT AVAILABLE
  
  atlas_blas_info:
    libraries f77blas,cblas,atlas not found in ['/Users/chris/miniforge3/lib', '/usr/local/lib', '/usr/lib', '/opt/local/lib']
    NOT AVAILABLE
  
  /private/var/folders/4p/cxb1ys0s4wz6dh3rtj22pq8w0000gn/T/pip-install-t6cbgk1m/numpy_96013a2b4f96408886c142437efacd24/numpy/distutils/system_info.py:1989: UserWarning:
      Optimized (vendor) Blas libraries are not found.
      Falls back to netlib Blas library which has worse performance.
      A better performance should be easily gained by switching
      Blas library.
    if self._calc_info(blas):
  blas_info:
  C compiler: clang -pthread -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /Users/chris/miniforge3/include -arch arm64 -fPIC -O2 -isystem /Users/chris/miniforge3/include -arch arm64
  
  creating /var/folders/4p/cxb1ys0s4wz6dh3rtj22pq8w0000gn/T/tmpyamaythh/var
  creating /var/folders/4p/cxb1ys0s4wz6dh3rtj22pq8w0000gn/T/tmpyamaythh/var/folders
  creating /var/folders/4p/cxb1ys0s4wz6dh3rtj22pq8w0000gn/T/tmpyamaythh/var/folders/4p
  creating /var/folders/4p/cxb1ys0s4wz6dh3rtj22pq8w0000gn/T/tmpyamaythh/var/folders/4p/cxb1ys0s4wz6dh3rtj22pq8w0000gn
  creating /var/folders/4p/cxb1ys0s4wz6dh3rtj22pq8w0000gn/T/tmpyamaythh/var/folders/4p/cxb1ys0s4wz6dh3rtj22pq8w0000gn/T
  creating /var/folders/4p/cxb1ys0s4wz6dh3rtj22pq8w0000gn/T/tmpyamaythh/var/folders/4p/cxb1ys0s4wz6dh3rtj22pq8w0000gn/T/tmpyamaythh
  compile options: '-I/usr/local/include -I/opt/local/include -I/Users/chris/miniforge3/include -c'
  clang: /var/folders/4p/cxb1ys0s4wz6dh3rtj22pq8w0000gn/T/tmpyamaythh/source.c
  /var/folders/4p/cxb1ys0s4wz6dh3rtj22pq8w0000gn/T/tmpyamaythh/source.c:1:10: fatal error: 'cblas.h' file not found
  #include <cblas.h>
           ^~~~~~~~~
  1 error generated.

@mattip
Member

mattip commented Dec 13, 2020

brew install openblas or so?

@neurolabusc
Author

I have been able to install many native libraries with Homebrew by building from source. However, the successful formulae appear to compile with clang, not gcc. It looks to me like openblas requires gcc:

$export ARCHFLAGS='-arch arm64'
$brew install -s openblas

Warning: You are running macOS on a arm64 CPU architecture.
We do not provide support for this (yet).
Reinstall Homebrew under Rosetta 2 until we support it.
You will encounter build failures with some formulae.
Please create pull requests instead of asking for help on Homebrew's GitHub,
Twitter or any other official channels. You are responsible for resolving
any issues you experience while you are running this
unsupported configuration.

==> Downloading https://raw.githubusercontent.com/Homebrew/formula-patches/7baf6e2f/gcc/bigsur.diff
######################################################################## 100.0%
==> Downloading https://ftp.gnu.org/gnu/gcc/gcc-10.2.0/gcc-10.2.0.tar.xz
Already downloaded: /Users/chrisrorden/Library/Caches/Homebrew/downloads/c8d4ca732a98ae691d04472b15de6d9e06a09016499af6ff16c4f55081bfc6b9--gcc-10.2.0.tar.xz

...
Error: You are running macOS on a arm64 CPU architecture.

As a Hail Mary, I tried to compile Iain Sandoe's experimental gcc branch from source. I could not find any explicit instructions, so I used a generic recipe, which did not go well...

$git clone https://github.com/iains/gcc-darwin-arm64
$cd gcc-darwin-arm64
$contrib/download_prerequisites
$mkdir build && cd build
$../configure --prefix=/usr/local/gcc-11 \
              --enable-checking=release \
              --enable-languages=c,c++,fortran \
              --disable-multilib \
              --with-sysroot=/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk \
              --program-suffix=-11

$make -j 
...

make[3]: /Users/chrisrorden/src/gcc-darwin-arm64/build/./prev-gcc/xg++: Resource temporarily unavailable
xg++: fatal error: cannot execute ‘/Users/chrisrorden/src/gcc-darwin-arm64/build/./prev-gcc/cc1plus’: vfork: Operation timed out
compilation terminated.

@stweil

stweil commented Dec 13, 2020

I compiled OpenBLAS (git master) with clang and built numpy (git master) with it. The result is not faster:

            max all finite   1.97
            min all finite   1.97
               max all nan   1.97
               min all nan   1.98
   11.37 real         9.84 user         0.30 sys

A second run starts faster:

    8.28 real        10.09 user         0.03 sys

@mattip
Member

mattip commented Dec 14, 2020

I wonder what SIMD features are exposed in native and in translated mode. Could you try this code with 1.20?

python -c "from numpy.core._multiarray_umath import __cpu_features__ as feat; print([f for f in feat if feat[f]])"

At some point we should expose a nicer interface to the __cpu_features__ dictionary of SIMD features detected at runtime.
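
A minimal sketch of what such a nicer interface might look like (hypothetical: print_cpu_features is not an existing NumPy API; __cpu_features__ is the private mapping of feature names to booleans used above):

```python
from numpy.core._multiarray_umath import __cpu_features__

def print_cpu_features():
    """Print which SIMD features NumPy detected at runtime."""
    enabled = sorted(name for name, on in __cpu_features__.items() if on)
    disabled = sorted(name for name, on in __cpu_features__.items() if not on)
    print('enabled: ', ', '.join(enabled))
    print('disabled:', ', '.join(disabled))

print_cpu_features()
```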

@stweil

stweil commented Dec 14, 2020

Native ARM with numpy git master:

% python -c "from numpy.core._multiarray_umath import __cpu_features__ as feat; print([f for f in feat if feat[f]])"
['NEON', 'NEON_FP16', 'NEON_VFPV4', 'ASIMD', 'FPHP', 'ASIMDHP']

Do min and max as in the test code above use SIMD features?

@mattip
Member

mattip commented Dec 14, 2020

What does the translated build show? There is SIMD code for avx512f, but none for Neon (yet).

@stweil

stweil commented Dec 14, 2020

That might explain the results. I don't have a translated Python installed. @neurolabusc, could you please test that?

@stweil

stweil commented Dec 14, 2020

There is SIMD code for avx512f, but none for Neon (yet).

There is a numpy/core/src/common/simd/neon/ in git master. Which part is missing?

@mattip
Member

mattip commented Dec 14, 2020

The code in numpy/core/src/common/simd/neon is the infrastructure used by universal intrinsics to abstract away architecture-specific code inside the actual implementations. What is missing is that someone needs to rewrite the avx512f code for minimum and maximum to use the universal intrinsics. xref gh-17985, which is doing the rewrite for add/subtract/multiply/divide.
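
For reference, you can inspect the baseline and runtime-dispatched SIMD targets a build was compiled with from Python (a sketch using private attributes present in 1.20-era builds, which may change):

```python
from numpy.core._multiarray_umath import __cpu_baseline__, __cpu_dispatch__

print('baseline:', __cpu_baseline__)  # features every compiled path assumes
print('dispatch:', __cpu_dispatch__)  # features with extra runtime-dispatched paths
```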

neurolabusc added a commit to neurolabusc/simd that referenced this issue Dec 14, 2020
@neurolabusc
Author

Rosetta supports different forms of SSE, but does not support any variant of AVX. The time for running each test on a macOS Intel i5-8259U is ~0.6s, and on a Linux AMD 3900X ~0.33s. Neither of those computers supports avx512f.

Note that for many benchmarks the native code outperforms translated code. Here are the nibabel benchmarks, which rely on numpy. The only test from that battery which showed a regression was bench_finite_range.py, which led me to drill down and determine that the np.min and np.max functions were specifically slow.

I also wrote a C program that reports the maximum for an array of 64-bit doubles of the same size. The program times both scalar and SIMD (SSE/NEON) instructions. The translated x86 code is much slower than native code for the scalar path, but they perform equally fast for the SIMD path. One caveat is whether this test correctly mimics numpy's NaN propagation behavior (e.g. amax vs nanmax).

$g++ -O3 -o tstX86 main.cpp  -target x86_64-apple-macos10.12 -DmyDisableAVX; ./tstX86 4
Reporting minimum time for 4 tests
max64=1: min/mean	2091	2091	ms
max64SSE=1: min/mean	332	332	ms
max64(NaN)=1: min/mean	2086	2088	ms
max64SSE(NaN)=0.793516: min/mean	332	332	ms


$make -j; ./tst 4
g++ -O3 -o tst main.cpp -march=armv8-a+fp+simd+crypto+crc
Reporting minimum time for 4 tests
max64=1: min/mean	659	659	ms
max64SSE=1: min/mean	330	330	ms
max64(NaN)=1: min/mean	658	659	ms
max64SSE(NaN)=nan: min/mean	330	330	ms
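
Regarding the NaN caveat: a minimal illustration of the two NumPy behaviors the C program tries to mirror (my snippet, not part of the benchmark):

```python
import numpy as np

arr = np.array([1.0, 2.0, np.nan])
print(np.max(arr))     # nan -- np.max (amax) propagates NaN
print(np.nanmax(arr))  # 2.0 -- np.nanmax ignores NaN
```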

@neurolabusc
Author

Just an update: the same performance is observed with numpy 1.20.1 under Python 3.9.2.

@mattip
Member

mattip commented Mar 23, 2021

Nothing will change until someone rewrites the code for max, min to use universal intrinsics.

@akbir

akbir commented May 5, 2021

Hi @mattip, I have access to hardware and would love to try to get this moving. Can we make a specific issue for this so I can get started?

I will certainly need help, but I am keen to sink some hours into this!

@neurolabusc
Author

@Developer-Ecosystem-Engineering numpy is a tremendously useful and popular tool. Extending ARM support for universal intrinsics could have a profound impact.

@Developer-Ecosystem-Engineering
Contributor

@Developer-Ecosystem-Engineering numpy is a tremendously useful and popular tool. Extending ARM support for universal intrinsics could have a profound impact.

Accelerate support was re-enabled in #18874; it's worth checking with that support enabled.
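
One way to check whether a given build is actually linked against Accelerate is to print NumPy's build configuration (the exact section names vary between NumPy versions):

```python
import numpy as np

# Prints the BLAS/LAPACK configuration NumPy was built with; an
# Accelerate-backed build should mention the Accelerate framework here.
np.show_config()
```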

Developer-Ecosystem-Engineering added a commit to Developer-Ecosystem-Engineering/numpy that referenced this issue Oct 18, 2021
This fixes numpy#17989 by adding ARM NEON implementations for min/max and fmin/fmax.

Before: Rosetta faster than native arm64 by `1.2x - 8.6x`.
After: Native arm64 faster than Rosetta by `1.6x - 6.7x`.  (2.8x - 15.5x improvement)

**Benchmarks**
```
       before           after         ratio
     [b0e1a44]       [8301ffd7]
     <main>           <gh-issue-17989/improve-neon-min-max>
+     32.6±0.04μs      37.5±0.08μs     1.15  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 2, 1, 'd')
+     32.6±0.06μs      37.5±0.04μs     1.15  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 2, 1, 'd')
+     37.8±0.09μs      43.2±0.09μs     1.14  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 4, 4, 'f')
+     37.7±0.09μs       42.9±0.1μs     1.14  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 2, 2, 'd')
+      37.9±0.2μs      43.0±0.02μs     1.14  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 2, 2, 'd')
+     37.7±0.01μs         42.3±1μs     1.12  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'>, 2, 2, 'd')
+     34.2±0.07μs      38.1±0.05μs     1.12  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 4, 2, 'f')
+     32.6±0.03μs      35.8±0.04μs     1.10  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 4, 1, 'f')
+      37.1±0.1μs       40.3±0.1μs     1.09  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 1, 2, 'd')
+      37.2±0.1μs      40.3±0.04μs     1.08  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 2, 4, 'f')
+     37.1±0.09μs      40.3±0.07μs     1.08  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 1, 2, 'd')
+      68.6±0.5μs       74.2±0.3μs     1.08  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 4, 4, 'd')
+      37.1±0.2μs       40.0±0.1μs     1.08  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'>, 1, 2, 'd')
+        2.42±0μs      2.61±0.05μs     1.08  bench_core.CountNonzero.time_count_nonzero_axis(3, 100, <class 'numpy.int16'>)
+      69.1±0.7μs       73.5±0.7μs     1.06  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'>, 4, 4, 'd')
+      54.7±0.3μs       58.0±0.2μs     1.06  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 2, 4, 'd')
+      54.5±0.2μs       57.8±0.2μs     1.06  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'>, 2, 4, 'd')
+     3.78±0.04μs      4.00±0.02μs     1.06  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 100, <class 'str'>)
+      54.8±0.2μs       57.9±0.3μs     1.06  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 2, 4, 'd')
+     3.68±0.01μs      3.87±0.02μs     1.05  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 100, <class 'object'>)
+      69.6±0.2μs       73.1±0.2μs     1.05  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 4, 4, 'd')
+         229±2μs        241±0.2μs     1.05  bench_random.Bounded.time_bounded('PCG64', [<class 'numpy.uint64'>, 1535])
-      73.0±0.8μs       69.5±0.2μs     0.95  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'trunc'>, 4, 4, 'd')
-      37.6±0.1μs       35.7±0.3μs     0.95  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'trunc'>, 1, 4, 'f')
-     88.7±0.04μs       84.2±0.7μs     0.95  bench_lib.Pad.time_pad((256, 128, 1), 1, 'wrap')
-      57.9±0.2μs       54.8±0.2μs     0.95  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'trunc'>, 2, 4, 'd')
-      39.9±0.2μs      37.2±0.04μs     0.93  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'positive'>, 1, 2, 'd')
-     2.66±0.01μs      2.47±0.01μs     0.93  bench_lib.Nan.time_nanmin(200, 0)
-     2.65±0.02μs      2.46±0.04μs     0.93  bench_lib.Nan.time_nanmin(200, 50.0)
-     2.64±0.01μs      2.45±0.01μs     0.93  bench_lib.Nan.time_nanmax(200, 90.0)
-        2.64±0μs      2.44±0.02μs     0.92  bench_lib.Nan.time_nanmax(200, 0)
-     2.68±0.02μs         2.48±0μs     0.92  bench_lib.Nan.time_nanmax(200, 2.0)
-     40.2±0.01μs       37.1±0.1μs     0.92  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 2, 4, 'f')
-        2.69±0μs         2.47±0μs     0.92  bench_lib.Nan.time_nanmin(200, 2.0)
-     2.70±0.02μs      2.48±0.02μs     0.92  bench_lib.Nan.time_nanmax(200, 0.1)
-        2.70±0μs         2.47±0μs     0.91  bench_lib.Nan.time_nanmin(200, 90.0)
-        2.70±0μs         2.46±0μs     0.91  bench_lib.Nan.time_nanmin(200, 0.1)
-        2.70±0μs      2.42±0.01μs     0.90  bench_lib.Nan.time_nanmax(200, 50.0)
-      11.8±0.6ms       10.6±0.6ms     0.89  bench_core.CountNonzero.time_count_nonzero_axis(2, 1000000, <class 'str'>)
-      42.7±0.1μs      37.8±0.02μs     0.88  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'positive'>, 2, 2, 'd')
-     42.8±0.03μs       37.8±0.2μs     0.88  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rint'>, 2, 2, 'd')
-      43.1±0.2μs      37.7±0.09μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 4, 4, 'f')
-     37.5±0.07μs      32.6±0.06μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rint'>, 2, 1, 'd')
-     41.7±0.03μs      36.3±0.07μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc '_ones_like'>, 1, 4, 'd')
-       166±0.8μs          144±1μs     0.87  bench_ufunc.UFunc.time_ufunc_types('fmin')
-      11.6±0.8ms      10.0±0.01ms     0.87  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 1000000, <class 'str'>)
-       167±0.9μs          144±2μs     0.86  bench_ufunc.UFunc.time_ufunc_types('minimum')
-         168±4μs        143±0.5μs     0.85  bench_ufunc.UFunc.time_ufunc_types('fmax')
-         167±1μs        142±0.8μs     0.85  bench_ufunc.UFunc.time_ufunc_types('maximum')
-        7.10±0μs      4.97±0.01μs     0.70  bench_ufunc_strides.AVX_BFunc.time_ufunc('minimum', 'd', 2)
-     7.11±0.07μs      4.96±0.01μs     0.70  bench_ufunc_strides.AVX_BFunc.time_ufunc('maximum', 'd', 2)
-     7.05±0.07μs         4.68±0μs     0.66  bench_ufunc_strides.AVX_BFunc.time_ufunc('minimum', 'f', 4)
-        7.13±0μs      4.68±0.01μs     0.66  bench_ufunc_strides.AVX_BFunc.time_ufunc('maximum', 'f', 4)
-       461±0.2μs          297±7μs     0.64  bench_app.MaxesOfDots.time_it
-     7.04±0.07μs         3.95±0μs     0.56  bench_ufunc_strides.AVX_BFunc.time_ufunc('maximum', 'f', 2)
-     7.06±0.06μs      3.95±0.01μs     0.56  bench_ufunc_strides.AVX_BFunc.time_ufunc('minimum', 'f', 2)
-     7.09±0.06μs         3.24±0μs     0.46  bench_ufunc_strides.AVX_BFunc.time_ufunc('minimum', 'd', 1)
-     7.12±0.07μs      3.25±0.02μs     0.46  bench_ufunc_strides.AVX_BFunc.time_ufunc('maximum', 'd', 1)
-     14.5±0.02μs         3.98±0μs     0.27  bench_reduce.MinMax.time_max(<class 'numpy.int64'>)
-      14.6±0.1μs      4.00±0.01μs     0.27  bench_reduce.MinMax.time_min(<class 'numpy.int64'>)
-     6.88±0.06μs         1.34±0μs     0.19  bench_ufunc_strides.AVX_BFunc.time_ufunc('maximum', 'f', 1)
-        7.00±0μs         1.33±0μs     0.19  bench_ufunc_strides.AVX_BFunc.time_ufunc('minimum', 'f', 1)
-     39.4±0.01μs      3.95±0.01μs     0.10  bench_reduce.MinMax.time_min(<class 'numpy.float64'>)
-     39.4±0.01μs      3.95±0.02μs     0.10  bench_reduce.MinMax.time_max(<class 'numpy.float64'>)
-      254±0.02μs       22.8±0.2μs     0.09  bench_lib.Nan.time_nanmax(200000, 50.0)
-       253±0.1μs       22.7±0.1μs     0.09  bench_lib.Nan.time_nanmin(200000, 0)
-      254±0.06μs      22.7±0.09μs     0.09  bench_lib.Nan.time_nanmin(200000, 2.0)
-      254±0.01μs      22.7±0.03μs     0.09  bench_lib.Nan.time_nanmin(200000, 0.1)
-      254±0.04μs      22.7±0.02μs     0.09  bench_lib.Nan.time_nanmin(200000, 50.0)
-       253±0.1μs      22.7±0.04μs     0.09  bench_lib.Nan.time_nanmax(200000, 0.1)
-      253±0.03μs      22.7±0.04μs     0.09  bench_lib.Nan.time_nanmin(200000, 90.0)
-      253±0.02μs      22.7±0.07μs     0.09  bench_lib.Nan.time_nanmax(200000, 0)
-      254±0.03μs      22.7±0.02μs     0.09  bench_lib.Nan.time_nanmax(200000, 90.0)
-      254±0.09μs      22.7±0.04μs     0.09  bench_lib.Nan.time_nanmax(200000, 2.0)
-     39.2±0.01μs      2.51±0.01μs     0.06  bench_reduce.MinMax.time_max(<class 'numpy.float32'>)
-     39.2±0.01μs      2.50±0.01μs     0.06  bench_reduce.MinMax.time_min(<class 'numpy.float32'>)
```

Size change of _multiarray_umath.cpython-39-darwin.so:
Before: 3,890,723
After: 3,924,035
Change: +33,312 (~ +0.856 %)
@rgommers
Member

@neurolabusc there's a fix for this in PR gh-20131, thanks to @Developer-Ecosystem-Engineering. It'd be great if you could test or review that PR.

@neurolabusc
Author

@rgommers thanks for bringing this to my attention. While I develop other tools, both NumPy internals and SIMD intrinsics are well outside my expertise, so I do not think I am a suitable reviewer for this PR. @Developer-Ecosystem-Engineering thanks for this PR, which covers my specific test and also provides SIMD intrinsics for a wide range of computations. This will benefit macOS users as well as those using other ARM CPUs. This looks like a tremendous contribution!

Once the PR is accepted, this issue can be closed.

@neurolabusc
Author

This fix was introduced in numpy 1.23, which was released on June 22, 2022. On the same computer as my original post:

                max all finite   0.14
                min all finite   0.14
                   max all nan   0.14
                   min all nan   0.14
python ./numpy_native_slower_than_translated.py   2.76s  user 0.07s system 295% cpu 0.959 total

Thanks!
