@seiko2plus seiko2plus commented Sep 6, 2025

numpy-simd-routines is added as a subrepo in the Meson subprojects
directory. The current FP configuration is static: ~1 ULP for double precision
and ~4 ULP for single precision, with handling of floating-point errors and
special cases, extended precision for large arguments,
and subnormals enabled by default.
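
For readers unfamiliar with ULP bounds, here is a minimal sketch of how a "within N ULP" check for float32 can be expressed. It uses only the Python standard library; the helper names (`f32_ordered_bits`, `ulp_distance_f32`, `within_ulp_f32`) are hypothetical and are not part of numpy-simd-routines or NumPy's test suite. It does not handle NaN inputs.

```python
import struct


def f32_ordered_bits(x: float) -> int:
    """Round x to float32 and map it to an integer so adjacent floats differ by 1."""
    (bits,) = struct.unpack("<i", struct.pack("<f", x))
    # float32 is sign-magnitude: mirror negatives so the mapping is monotonic
    # and +0.0 / -0.0 both land on 0.
    return bits if bits >= 0 else -(bits & 0x7FFFFFFF)


def ulp_distance_f32(result: float, reference: float) -> int:
    """Distance between result and reference, counted in float32 steps."""
    return abs(f32_ordered_bits(result) - f32_ordered_bits(reference))


def within_ulp_f32(result: float, reference: float, max_ulp: int) -> bool:
    """True if result is at most max_ulp float32 steps away from reference."""
    return ulp_distance_f32(result, reference) <= max_ulp
```

In practice the reference would come from a higher-precision evaluation (for example `math.sin(x)` in double precision), and `max_ulp` would be the bound under test, such as 4 for the single-precision configuration described above. Because the distance is counted on bit patterns, subnormals are covered by the same check.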

numpy-simd-routines supports every SIMD extension supported
by Google Highway, including non-FMA extensions, and is fully independent
of libm, so results are consistent across all compilers and
platforms.

Full benchmarks will be provided within the pull request. The following
benchmark was run with clang-19 on an x86 CPU (Ryzen 7 7700X)
with AVX512 enabled.

Note: previously, SIMD optimization for sin/cos was enabled
only for single precision, not for double precision.

| Before        | After       |  Ratio | Benchmark (Parameter)                    |
|---------------|-------------|--------|------------------------------------------|
| 713±6μs       | 633±6μs     |   0.89 | UnaryFP(<ufunc 'cos'>, 1, 2, 'f')        |
| 717±9μs       | 637±6μs     |   0.89 | UnaryFP(<ufunc 'cos'>, 4, 1, 'f')        |
| 705±3μs       | 607±10μs    |   0.86 | UnaryFP(<ufunc 'sin'>, 4, 1, 'f')        |
| 714±10μs      | 595±0.5μs   |   0.83 | UnaryFP(<ufunc 'sin'>, 1, 2, 'f')        |
| 370±0.3μs     | 277±4μs     |   0.75 | UnaryFP(<ufunc 'cos'>, 1, 1, 'f')        |
| 373±2μs       | 236±0.6μs   |   0.63 | UnaryFP(<ufunc 'sin'>, 1, 1, 'f')        |
| 1.06±0.01ms   | 648±3μs     |   0.61 | UnaryFP(<ufunc 'cos'>, 4, 2, 'f')        |
| 1.06±0.01ms   | 617±30μs    |   0.58 | UnaryFP(<ufunc 'sin'>, 4, 2, 'f')        |
| 5.06±0.06ms   | 2.61±0.3ms  |   0.52 | UnaryFPSpecial(<ufunc 'cos'>, 4, 2, 'd') |
| 1.48±0ms      | 715±5μs     |   0.48 | UnaryFPSpecial(<ufunc 'sin'>, 1, 2, 'f') |
| 1.50±0.01ms   | 639±6μs     |   0.43 | UnaryFPSpecial(<ufunc 'cos'>, 1, 2, 'f') |
| 5.15±0.1ms    | 1.96±0.01ms |   0.38 | UnaryFPSpecial(<ufunc 'cos'>, 4, 1, 'd') |
| 5.72±0.02ms   | 2.09±0.1ms  |   0.37 | UnaryFP(<ufunc 'cos'>, 4, 2, 'd')        |
| 5.76±0.01ms   | 2.03±0.08ms |   0.35 | UnaryFP(<ufunc 'sin'>, 4, 2, 'd')        |
| 5.07±0.08ms   | 1.76±0.2ms  |   0.35 | UnaryFPSpecial(<ufunc 'cos'>, 1, 2, 'd') |
| 6.04±0.04ms   | 2.05±0.09ms |   0.34 | UnaryFPSpecial(<ufunc 'sin'>, 4, 2, 'd') |
| 5.79±0.03ms   | 1.90±0.2ms  |   0.33 | UnaryFP(<ufunc 'sin'>, 4, 1, 'd')        |
| 2.29±0.1ms    | 762±40μs    |   0.33 | UnaryFPSpecial(<ufunc 'sin'>, 4, 1, 'f') |
| 5.72±0.1ms    | 1.75±0.07ms |   0.31 | UnaryFP(<ufunc 'cos'>, 4, 1, 'd')        |
| 6.04±0.03ms   | 1.82±0.2ms  |   0.3  | UnaryFPSpecial(<ufunc 'sin'>, 4, 1, 'd') |
| 2.49±0.1ms    | 748±30μs    |   0.3  | UnaryFPSpecial(<ufunc 'sin'>, 4, 2, 'f') |
| 2.23±0.1ms    | 634±6μs     |   0.28 | UnaryFPSpecial(<ufunc 'cos'>, 4, 1, 'f') |
| 1.31±0.03ms   | 367±5μs     |   0.28 | UnaryFPSpecial(<ufunc 'sin'>, 1, 1, 'f') |
| 2.55±0.09ms   | 654±30μs    |   0.26 | UnaryFPSpecial(<ufunc 'cos'>, 4, 2, 'f') |
| 4.97±0.03ms   | 1.14±0ms    |   0.23 | UnaryFPSpecial(<ufunc 'cos'>, 1, 1, 'd') |
| 5.67±0.01ms   | 1.22±0.03ms |   0.22 | UnaryFP(<ufunc 'cos'>, 1, 2, 'd')        |
| 5.76±0.03ms   | 1.28±0.06ms |   0.22 | UnaryFP(<ufunc 'sin'>, 1, 2, 'd')        |
| 1.26±0.01ms   | 272±2μs     |   0.22 | UnaryFPSpecial(<ufunc 'cos'>, 1, 1, 'f') |
| 7.03±0.02ms   | 1.31±0.01ms |   0.19 | UnaryFPSpecial(<ufunc 'sin'>, 1, 2, 'd') |
| 5.67±0.01ms   | 810±9μs     |   0.14 | UnaryFP(<ufunc 'cos'>, 1, 1, 'd')        |
| 5.71±0.01ms   | 817±40μs    |   0.14 | UnaryFP(<ufunc 'sin'>, 1, 1, 'd')        |
| 7.05±0.03ms   | 915±4μs     |   0.13 | UnaryFPSpecial(<ufunc 'sin'>, 1, 1, 'd') |
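
The Ratio column reads as After divided by Before, so values below 1 mean the new code is faster. The sketch below shows that arithmetic for timing strings written in the same `mean±spread unit` form as the results above. The helper names (`parse_timing`, `ratio`) are hypothetical and exist only for this illustration.

```python
import re

# Seconds per unit; both the Greek mu and the micro sign are accepted.
_UNIT_TO_SECONDS = {"s": 1.0, "ms": 1e-3, "μs": 1e-6, "µs": 1e-6, "us": 1e-6, "ns": 1e-9}


def parse_timing(text: str) -> float:
    """Return the mean of a timing such as '5.67±0.01ms', in seconds."""
    match = re.fullmatch(r"([\d.]+)±[\d.]+(ms|μs|µs|us|ns|s)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized timing: {text!r}")
    mean, unit = match.groups()
    return float(mean) * _UNIT_TO_SECONDS[unit]


def ratio(before: str, after: str) -> float:
    """After/Before; values below 1 mean the 'after' run is faster."""
    return parse_timing(after) / parse_timing(before)
```

For example, `ratio("5.67±0.01ms", "810±9μs")` rounds to 0.14, which corresponds to roughly a 7x speedup for double-precision `cos` on contiguous input and output.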

@seiko2plus force-pushed the brings_npsr branch 4 times, most recently from 09414c8 to af5d98a, on September 7, 2025 22:33
@rgommers added the "component: SIMD" label (Issues in SIMD (fast instruction sets) code or machinery) on Sep 10, 2025
@seiko2plus added this to the 2.4.0 release milestone on Sep 13, 2025
Allow up to 3 ULP error for float32 sin/cos when native
FMA is not available.

Labels

01 - Enhancement, component: SIMD
