SIMD: Replace raw SIMD of sin/cos with NPYV(universal intrinsics) #17587

Merged (3 commits) on Dec 26, 2020

seiko2plus (Member) commented on Oct 19, 2020

Merge after #17790, #17789

SIMD: Replace raw SIMD of sin/cos with NPYV

The new code speeds up non-contiguous memory access for the output array
without any performance regression.
On PPC64LE performance increases by a factor of 2 to 3, and on aarch64 by a factor of 1.5 to 2.
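
To make "non-contiguous memory access for the output array" concrete, here is a minimal scalar sketch of what a strided unary operation computes. It is not the NPYV kernel from this PR; the helper name `strided_unary` and the sample buffers are hypothetical and exist only for illustration.

```python
import math
from array import array


def strided_unary(func, src, src_stride, dst, dst_stride, count):
    """Scalar reference: dst[i * dst_stride] = func(src[i * src_stride])."""
    for i in range(count):
        dst[i * dst_stride] = func(src[i * src_stride])
    return dst


count = 4
# Input read with stride 2: only every second element is used.
src = array("f", [0.0, 9.0, math.pi, 9.0, 0.0, 9.0, math.pi, 9.0])
# Output written with stride 3: the 7.0 filler marks slots that must stay untouched.
dst = array("f", [7.0] * (count * 3))

strided_unary(math.cos, src, 2, dst, 3, count)
print(list(dst))
```

A vectorized kernel has to produce the same result while gathering inputs and scattering outputs at these strides, which is the case the benchmarks below exercise with stride combinations such as `f::2 -> f::3`.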

TODO:

Performance tests(ASV)

Args

--bench-compare master bench_ufunc_strides.Unary -- --sort name --cpu-affinity 1,5

X86

I had to rely on my local machine because I couldn't get stable ratios on AWS.
See the standalone benchmark for AVX512F.

CPU
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               142
Model name:          Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
Stepping:            10
CPU MHz:             1800.344
CPU max MHz:         4000.0000
CPU min MHz:         400.0000
BogoMIPS:            3984.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            8192K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx
OS
Linux ac6279ab1a82 4.19.0-13-amd64 #1 SMP Debian 4.19.160-2 (2020-11-28) x86_64 x86_64 x86_64 GNU/Linux
gcc (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0

Benchmark

AVX2 & FMA3 - Changed only
       before           after         ratio
     [098a3b41]       [a0322ee9]
     <master>         <to_npyv_sincos_f32>
        259~3us       55.1~0.2us     0.21  bench_ufunc_strides.Unary.time_ufunc('cos', 1, 2, 'f')
        260~4us       56.2~0.2us     0.22  bench_ufunc_strides.Unary.time_ufunc('cos', 1, 4, 'f')
      334~0.8us      60.4~0.07us     0.18  bench_ufunc_strides.Unary.time_ufunc('cos', 2, 2, 'f')
      335~0.9us       61.5~0.2us     0.18  bench_ufunc_strides.Unary.time_ufunc('cos', 2, 4, 'f')
      337~0.4us       62.1~0.2us     0.18  bench_ufunc_strides.Unary.time_ufunc('cos', 4, 2, 'f')
        339~2us       61.2~0.6us     0.18  bench_ufunc_strides.Unary.time_ufunc('cos', 4, 4, 'f')
       266~10us       54.9~0.2us     0.21  bench_ufunc_strides.Unary.time_ufunc('sin', 1, 2, 'f')
       270~20us       55.6~0.2us     0.21  bench_ufunc_strides.Unary.time_ufunc('sin', 1, 4, 'f')
        331~3us       60.3~0.1us     0.18  bench_ufunc_strides.Unary.time_ufunc('sin', 2, 2, 'f')
        332~2us       61.0~0.3us     0.18  bench_ufunc_strides.Unary.time_ufunc('sin', 2, 4, 'f')
        336~1us       61.7~0.3us     0.18  bench_ufunc_strides.Unary.time_ufunc('sin', 4, 2, 'f')
      335~0.2us       61.5~0.4us     0.18  bench_ufunc_strides.Unary.time_ufunc('sin', 4, 4, 'f')
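
As a reading aid for the ratio column: the values in the table are consistent with after divided by before, so anything below 1.0 is a speedup. The sketch below recomputes a few rows from the rounded timings shown above; the helper name `asv_ratio` is hypothetical, and small differences from the printed ratios can arise because the table rounds its inputs.

```python
def asv_ratio(before_us, after_us):
    """Ratio as shown in the table: after / before (below 1.0 means faster)."""
    return after_us / before_us


# (benchmark parameters, before in us, after in us), copied from the table above
rows = [
    ("cos, 1, 2, 'f'", 259.0, 55.1),
    ("cos, 2, 2, 'f'", 334.0, 60.4),
    ("sin, 4, 4, 'f'", 335.0, 61.5),
]

for name, before, after in rows:
    ratio = asv_ratio(before, after)
    # The inverse of the ratio is the speedup factor.
    print(f"{name}: ratio={ratio:.2f}, speedup={1 / ratio:.1f}x")
```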

Power little-endian

CPU
Architecture:                    ppc64le
Byte Order:                      Little Endian
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       8
NUMA node(s):                    1
Model:                           2.2 (pvr 004e 1202)
Model name:                      POWER9 (architected), altivec supported
L1d cache:                       256 KiB
L1i cache:                       256 KiB
NUMA node0 CPU(s):               0-7
Vulnerability L1tf:              Not affected
Vulnerability Meltdown:          Mitigation; RFI Flush
Vulnerability Spec store bypass: Mitigation; Kernel entry/exit barrier (eieio)
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Vulnerable

processor   : 7
cpu     : POWER9 (architected), altivec supported
clock       : 2200.000000MHz
revision    : 2.2 (pvr 004e 1202)

timebase    : 512000000
platform    : pSeries
model       : IBM pSeries (emulated by qemu)
machine     : CHRP IBM pSeries (emulated by qemu)
MMU     : Radix


OS
Linux 8b2db3b0dfac 4.19.0-2-powerpc64le
gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2) 

Benchmark

VSX2(ISA >= 2.07) - Changed only
       before           after         ratio
     [098a3b41]       [a0322ee9]
     <master>         <to_npyv_sincos_f32>
       120±0.2μs      44.7±0.02μs     0.37  bench_ufunc_strides.Unary.time_ufunc('cos', 1, 1, 'f')
       121±0.5μs      48.9±0.04μs     0.40  bench_ufunc_strides.Unary.time_ufunc('cos', 1, 2, 'f')
       121±0.3μs      49.1±0.04μs     0.41  bench_ufunc_strides.Unary.time_ufunc('cos', 1, 4, 'f')
       121±0.2μs      48.7±0.02μs     0.40  bench_ufunc_strides.Unary.time_ufunc('cos', 2, 1, 'f')
       121±0.1μs      52.4±0.04μs     0.43  bench_ufunc_strides.Unary.time_ufunc('cos', 2, 2, 'f')
       121±0.1μs      52.5±0.05μs     0.43  bench_ufunc_strides.Unary.time_ufunc('cos', 2, 4, 'f')
       121±0.2μs      48.8±0.06μs     0.40  bench_ufunc_strides.Unary.time_ufunc('cos', 4, 1, 'f')
       122±0.6μs      52.6±0.04μs     0.43  bench_ufunc_strides.Unary.time_ufunc('cos', 4, 2, 'f')
      122±0.09μs      53.0±0.01μs     0.43  bench_ufunc_strides.Unary.time_ufunc('cos', 4, 4, 'f')
       126±0.6μs      44.1±0.01μs     0.35  bench_ufunc_strides.Unary.time_ufunc('sin', 1, 1, 'f')
       131±0.5μs      48.2±0.02μs     0.37  bench_ufunc_strides.Unary.time_ufunc('sin', 1, 2, 'f')
       130±0.7μs      48.4±0.02μs     0.37  bench_ufunc_strides.Unary.time_ufunc('sin', 1, 4, 'f')
       131±0.6μs      47.9±0.02μs     0.37  bench_ufunc_strides.Unary.time_ufunc('sin', 2, 1, 'f')
       131±0.5μs      51.4±0.04μs     0.39  bench_ufunc_strides.Unary.time_ufunc('sin', 2, 2, 'f')
       131±0.6μs      51.6±0.02μs     0.39  bench_ufunc_strides.Unary.time_ufunc('sin', 2, 4, 'f')
       130±0.7μs      48.1±0.02μs     0.37  bench_ufunc_strides.Unary.time_ufunc('sin', 4, 1, 'f')
       131±0.2μs      51.7±0.02μs     0.39  bench_ufunc_strides.Unary.time_ufunc('sin', 4, 2, 'f')
       131±0.4μs      52.0±0.05μs     0.40  bench_ufunc_strides.Unary.time_ufunc('sin', 4, 4, 'f')

Performance tests(standalone #15987)

Args used within #15987

--filter "(sin|cos)::.*[f]" --strides 1 2 3 10 --msleep 1 --iteration 100

Note: --msleep 1 forces the running thread to sleep for 1 millisecond before collecting each sample
so the CPU can recover from any frequency reduction, since throttling seems to affect wall time when AVX512F is enabled.
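
The sampling idea described in the note can be sketched as follows. This is not the standalone tool from #15987; it is a hypothetical Python re-creation in which `sample_gmean_ms` and its parameters are made-up names. It sleeps before each sample and reports the geometric mean in milliseconds, matching the "gmean, ms" metric used in the tables below.

```python
import math
import time
from statistics import geometric_mean


def sample_gmean_ms(func, iterations=100, sleep_ms=1.0):
    """Time func() `iterations` times, sleeping before each sample.

    Returns the geometric mean of the wall-clock samples in milliseconds.
    """
    samples = []
    for _ in range(iterations):
        # Idle briefly so the core can leave any throttled frequency state.
        time.sleep(sleep_ms / 1000.0)
        start = time.perf_counter_ns()
        func()
        elapsed_ns = max(time.perf_counter_ns() - start, 1)  # keep samples positive
        samples.append(elapsed_ns / 1e6)
    return geometric_mean(samples)


data = [i * 0.001 for i in range(1024)]
gmean_ms = sample_gmean_ms(lambda: [math.cos(v) for v in data], iterations=20)
print(f"gmean: {gmean_ms:.4f} ms")
```

The geometric mean is less sensitive to a few slow outlier samples than the arithmetic mean, which is one plausible reason to prefer it for noisy wall-clock timings.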

X86

CPU
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              2
Core(s) per socket:              2
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
Stepping:                        7
CPU MHz:                         3604.410
BogoMIPS:                        6000.00
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       64 KiB
L1i cache:                       64 KiB
L2 cache:                        2 MiB
L3 cache:                        35.8 MiB
NUMA node0 CPU(s):               0-3
Vulnerability Itlb multihit:     KVM: Vulnerable
Vulnerability L1tf:              Mitigation; PTE Inversion
Vulnerability Mds:               Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, STIBP disabled, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
OS
Linux ip-172-31-28-146 5.4.0-1025-aws
gcc version 7.5.0 (Ubuntu 7.5.0-6ubuntu2)

Benchmark

AVX512F - Contiguous only

metric: gmean, units: ms

name of test before_contig_avx512f after_contig_avx512f after_contig_avx512f vs before_contig_avx512f
cos::1024      f::1  ->  f::1 0.0011 0.0011 1.01
cos::2048      f::1  ->  f::1 0.0018 0.0018 1.01
cos::4096      f::1  ->  f::1 0.0033 0.0029 1.13
sin::1024      f::1  ->  f::1 0.0011 0.0011 1.01
sin::2048      f::1  ->  f::1 0.0018 0.0018 0.98
sin::4096      f::1  ->  f::1 0.0032 0.0030 1.07

AVX512F

metric: gmean, units: ms

name of test before_avx512f after_avx512f after_avx512f vs before_avx512f
cos::1024      f::1  ->  f::1 0.0011 0.0011 1.01
cos::2048      f::1  ->  f::1 0.0018 0.0018 0.99
cos::4096      f::1  ->  f::1 0.0034 0.0032 1.05
cos::1024      f::1  ->  f::2 0.0139 0.0010 14.02
cos::2048      f::1  ->  f::2 0.0278 0.0019 14.76
cos::4096      f::1  ->  f::2 0.0561 0.0046 12.17
cos::1024      f::1  ->  f::3 0.0140 0.0010 13.88
cos::2048      f::1  ->  f::3 0.0280 0.0019 14.76
cos::4096      f::1  ->  f::3 0.0565 0.0045 12.54
cos::1024      f::1  ->  f::10 0.0140 0.0012 12.03
cos::2048      f::1  ->  f::10 0.0280 0.0020 14.18
cos::4096      f::1  ->  f::10 0.0562 0.0046 12.13
cos::1024      f::2  ->  f::1 0.0010 0.0009 1.07
cos::2048      f::2  ->  f::1 0.0019 0.0017 1.08
cos::4096      f::2  ->  f::1 0.0038 0.0035 1.09
cos::1024      f::2  ->  f::2 0.0200 0.0012 16.48
cos::2048      f::2  ->  f::2 0.0400 0.0025 16.22
cos::4096      f::2  ->  f::2 0.0799 0.0048 16.82
cos::1024      f::2  ->  f::3 0.0200 0.0012 16.53
cos::2048      f::2  ->  f::3 0.0400 0.0024 16.88
cos::4096      f::2  ->  f::3 0.0800 0.0047 17.02
cos::1024      f::2  ->  f::10 0.0200 0.0013 15.8
cos::2048      f::2  ->  f::10 0.0400 0.0025 16.08
cos::4096      f::2  ->  f::10 0.0801 0.0050 16.02
cos::1024      f::3  ->  f::1 0.0010 0.0009 1.07
cos::2048      f::3  ->  f::1 0.0019 0.0017 1.09
cos::4096      f::3  ->  f::1 0.0039 0.0035 1.11
cos::1024      f::3  ->  f::2 0.0200 0.0012 16.6
cos::2048      f::3  ->  f::2 0.0400 0.0024 16.65
cos::4096      f::3  ->  f::2 0.0802 0.0048 16.53
cos::1024      f::3  ->  f::3 0.0200 0.0012 16.69
cos::2048      f::3  ->  f::3 0.0400 0.0024 16.92
cos::4096      f::3  ->  f::3 0.0802 0.0047 17.09
cos::1024      f::3  ->  f::10 0.0200 0.0013 15.8
cos::2048      f::3  ->  f::10 0.0400 0.0025 16.01
cos::4096      f::3  ->  f::10 0.0804 0.0050 16.23
cos::1024      f::10  ->  f::1 0.0011 0.0010 1.08
cos::2048      f::10  ->  f::1 0.0021 0.0019 1.11
cos::4096      f::10  ->  f::1 0.0042 0.0038 1.11
cos::1024      f::10  ->  f::2 0.0200 0.0013 15.14
cos::2048      f::10  ->  f::2 0.0400 0.0026 15.54
cos::4096      f::10  ->  f::2 0.0801 0.0052 15.46
cos::1024      f::10  ->  f::3 0.0200 0.0013 14.92
cos::2048      f::10  ->  f::3 0.0400 0.0026 15.54
cos::4096      f::10  ->  f::3 0.0799 0.0051 15.59
cos::1024      f::10  ->  f::10 0.0200 0.0014 14.02
cos::2048      f::10  ->  f::10 0.0400 0.0028 14.4
cos::4096      f::10  ->  f::10 0.0802 0.0055 14.49
sin::1024      f::1  ->  f::1 0.0011 0.0011 1.01
sin::2048      f::1  ->  f::1 0.0017 0.0017 1.03
sin::4096      f::1  ->  f::1 0.0033 0.0032 1.02
sin::1024      f::1  ->  f::2 0.0132 0.0013 10.26
sin::2048      f::1  ->  f::2 0.0264 0.0020 13.36
sin::4096      f::1  ->  f::2 0.0533 0.0046 11.55
sin::1024      f::1  ->  f::3 0.0132 0.0013 10.5
sin::2048      f::1  ->  f::3 0.0267 0.0020 13.49
sin::4096      f::1  ->  f::3 0.0532 0.0046 11.61
sin::1024      f::1  ->  f::10 0.0132 0.0014 9.35
sin::2048      f::1  ->  f::10 0.0264 0.0021 12.63
sin::4096      f::1  ->  f::10 0.0528 0.0047 11.31
sin::1024      f::2  ->  f::1 0.0012 0.0011 1.04
sin::2048      f::2  ->  f::1 0.0020 0.0019 1.06
sin::4096      f::2  ->  f::1 0.0038 0.0035 1.07
sin::1024      f::2  ->  f::2 0.0181 0.0015 12.21
sin::2048      f::2  ->  f::2 0.0361 0.0023 15.41
sin::4096      f::2  ->  f::2 0.0723 0.0047 15.53
sin::1024      f::2  ->  f::3 0.0181 0.0014 12.73
sin::2048      f::2  ->  f::3 0.0364 0.0023 15.76
sin::4096      f::2  ->  f::3 0.0723 0.0047 15.26
sin::1024      f::2  ->  f::10 0.0181 0.0015 12.2
sin::2048      f::2  ->  f::10 0.0362 0.0024 14.85
sin::4096      f::2  ->  f::10 0.0724 0.0049 14.82
sin::1024      f::3  ->  f::1 0.0012 0.0011 1.04
sin::2048      f::3  ->  f::1 0.0020 0.0019 1.05
sin::4096      f::3  ->  f::1 0.0038 0.0036 1.08
sin::1024      f::3  ->  f::2 0.0181 0.0015 12.45
sin::2048      f::3  ->  f::2 0.0362 0.0024 15.39
sin::4096      f::3  ->  f::2 0.0724 0.0047 15.44
sin::1024      f::3  ->  f::3 0.0181 0.0014 12.65
sin::2048      f::3  ->  f::3 0.0365 0.0023 15.71
sin::4096      f::3  ->  f::3 0.0724 0.0047 15.26
sin::1024      f::3  ->  f::10 0.0181 0.0015 12.29
sin::2048      f::3  ->  f::10 0.0362 0.0025 14.77
sin::4096      f::3  ->  f::10 0.0726 0.0049 14.92
sin::1024      f::10  ->  f::1 0.0013 0.0012 1.04
sin::2048      f::10  ->  f::1 0.0022 0.0021 1.08
sin::4096      f::10  ->  f::1 0.0042 0.0038 1.09
sin::1024      f::10  ->  f::2 0.0181 0.0015 11.79
sin::2048      f::10  ->  f::2 0.0361 0.0025 14.26
sin::4096      f::10  ->  f::2 0.0725 0.0051 14.28
sin::1024      f::10  ->  f::3 0.0181 0.0015 11.81
sin::2048      f::10  ->  f::3 0.0364 0.0025 14.37
sin::4096      f::10  ->  f::3 0.0723 0.0052 13.91
sin::1024      f::10  ->  f::10 0.0181 0.0016 11.17
sin::2048      f::10  ->  f::10 0.0362 0.0027 13.24
sin::4096      f::10  ->  f::10 0.0725 0.0055 13.3

AVX2 & FMA3 - Contiguous only

metric: gmean, units: ms

name of test before_contig_avx2_fma3 after_contig_avx2_fma3 after_contig_avx2_fma3 vs before_contig_avx2_fma3
cos::1024      f::1  ->  f::1 0.0015 0.0014 1.05
cos::2048      f::1  ->  f::1 0.0027 0.0026 1.05
cos::4096      f::1  ->  f::1 0.0053 0.0051 1.04
sin::1024      f::1  ->  f::1 0.0014 0.0014 1.02
sin::2048      f::1  ->  f::1 0.0026 0.0026 1.03
sin::4096      f::1  ->  f::1 0.0051 0.0050 1.03

AVX2 & FMA3

metric: gmean, units: ms

name of test before_avx2_fma3 after_avx2_fma3 after_avx2_fma3 vs before_avx2_fma3
cos::1024      f::1  ->  f::1 0.0015 0.0014 1.05
cos::2048      f::1  ->  f::1 0.0027 0.0026 1.05
cos::4096      f::1  ->  f::1 0.0052 0.0050 1.05
cos::1024      f::1  ->  f::2 0.0139 0.0019 7.24
cos::2048      f::1  ->  f::2 0.0279 0.0037 7.46
cos::4096      f::1  ->  f::2 0.0556 0.0073 7.59
cos::1024      f::1  ->  f::3 0.0139 0.0019 7.2
cos::2048      f::1  ->  f::3 0.0279 0.0037 7.51
cos::4096      f::1  ->  f::3 0.0552 0.0073 7.61
cos::1024      f::1  ->  f::10 0.0139 0.0019 7.39
cos::2048      f::1  ->  f::10 0.0279 0.0037 7.57
cos::4096      f::1  ->  f::10 0.0557 0.0072 7.72
cos::1024      f::2  ->  f::1 0.0018 0.0018 0.99
cos::2048      f::2  ->  f::1 0.0035 0.0035 0.99
cos::4096      f::2  ->  f::1 0.0066 0.0069 0.96
cos::1024      f::2  ->  f::2 0.0188 0.0023 8.2
cos::2048      f::2  ->  f::2 0.0376 0.0048 7.83
cos::4096      f::2  ->  f::2 0.0750 0.0088 8.55
cos::1024      f::2  ->  f::3 0.0188 0.0023 8.34
cos::2048      f::2  ->  f::3 0.0376 0.0044 8.52
cos::4096      f::2  ->  f::3 0.0751 0.0088 8.52
cos::1024      f::2  ->  f::10 0.0188 0.0023 8.28
cos::2048      f::2  ->  f::10 0.0376 0.0045 8.33
cos::4096      f::2  ->  f::10 0.0752 0.0090 8.32
cos::1024      f::3  ->  f::1 0.0018 0.0018 1.0
cos::2048      f::3  ->  f::1 0.0035 0.0035 1.0
cos::4096      f::3  ->  f::1 0.0067 0.0072 0.93
cos::1024      f::3  ->  f::2 0.0188 0.0022 8.43
cos::2048      f::3  ->  f::2 0.0375 0.0044 8.51
cos::4096      f::3  ->  f::2 0.0752 0.0092 8.15
cos::1024      f::3  ->  f::3 0.0188 0.0023 8.31
cos::2048      f::3  ->  f::3 0.0376 0.0044 8.54
cos::4096      f::3  ->  f::3 0.0750 0.0093 8.1
cos::1024      f::3  ->  f::10 0.0188 0.0024 7.93
cos::2048      f::3  ->  f::10 0.0375 0.0045 8.36
cos::4096      f::3  ->  f::10 0.0753 0.0094 8.04
cos::1024      f::10  ->  f::1 0.0019 0.0020 0.96
cos::2048      f::10  ->  f::1 0.0036 0.0037 0.96
cos::4096      f::10  ->  f::1 0.0072 0.0073 0.98
cos::1024      f::10  ->  f::2 0.0188 0.0025 7.66
cos::2048      f::10  ->  f::2 0.0375 0.0048 7.79
cos::4096      f::10  ->  f::2 0.0748 0.0096 7.8
cos::1024      f::10  ->  f::3 0.0188 0.0025 7.56
cos::2048      f::10  ->  f::3 0.0376 0.0048 7.78
cos::4096      f::10  ->  f::3 0.0750 0.0097 7.74
cos::1024      f::10  ->  f::10 0.0188 0.0025 7.52
cos::2048      f::10  ->  f::10 0.0375 0.0049 7.65
cos::4096      f::10  ->  f::10 0.0753 0.0098 7.69
sin::1024      f::1  ->  f::1 0.0015 0.0014 1.05
sin::2048      f::1  ->  f::1 0.0027 0.0025 1.05
sin::4096      f::1  ->  f::1 0.0051 0.0048 1.07
sin::1024      f::1  ->  f::2 0.0139 0.0018 7.5
sin::2048      f::1  ->  f::2 0.0277 0.0037 7.51
sin::4096      f::1  ->  f::2 0.0555 0.0071 7.8
sin::1024      f::1  ->  f::3 0.0138 0.0018 7.5
sin::2048      f::1  ->  f::3 0.0278 0.0037 7.6
sin::4096      f::1  ->  f::3 0.0556 0.0072 7.75
sin::1024      f::1  ->  f::10 0.0139 0.0018 7.56
sin::2048      f::1  ->  f::10 0.0277 0.0036 7.67
sin::4096      f::1  ->  f::10 0.0556 0.0071 7.88
sin::1024      f::2  ->  f::1 0.0018 0.0018 1.02
sin::2048      f::2  ->  f::1 0.0034 0.0034 0.99
sin::4096      f::2  ->  f::1 0.0065 0.0067 0.97
sin::1024      f::2  ->  f::2 0.0190 0.0022 8.48
sin::2048      f::2  ->  f::2 0.0382 0.0047 8.1
sin::4096      f::2  ->  f::2 0.0766 0.0086 8.95
sin::1024      f::2  ->  f::3 0.0190 0.0022 8.77
sin::2048      f::2  ->  f::3 0.0383 0.0043 8.84
sin::4096      f::2  ->  f::3 0.0762 0.0087 8.77
sin::1024      f::2  ->  f::10 0.0191 0.0022 8.68
sin::2048      f::2  ->  f::10 0.0382 0.0044 8.6
sin::4096      f::2  ->  f::10 0.0762 0.0087 8.72
sin::1024      f::3  ->  f::1 0.0018 0.0018 1.02
sin::2048      f::3  ->  f::1 0.0034 0.0034 0.99
sin::4096      f::3  ->  f::1 0.0066 0.0067 0.98
sin::1024      f::3  ->  f::2 0.0191 0.0022 8.77
sin::2048      f::3  ->  f::2 0.0382 0.0044 8.77
sin::4096      f::3  ->  f::2 0.0761 0.0086 8.87
sin::1024      f::3  ->  f::3 0.0191 0.0022 8.87
sin::2048      f::3  ->  f::3 0.0383 0.0043 8.87
sin::4096      f::3  ->  f::3 0.0760 0.0087 8.78
sin::1024      f::3  ->  f::10 0.0191 0.0022 8.76
sin::2048      f::3  ->  f::10 0.0383 0.0044 8.75
sin::4096      f::3  ->  f::10 0.0761 0.0088 8.66
sin::1024      f::10  ->  f::1 0.0019 0.0018 1.02
sin::2048      f::10  ->  f::1 0.0035 0.0035 1.0
sin::4096      f::10  ->  f::1 0.0068 0.0069 0.99
sin::1024      f::10  ->  f::2 0.0191 0.0022 8.63
sin::2048      f::10  ->  f::2 0.0381 0.0045 8.56
sin::4096      f::10  ->  f::2 0.0765 0.0088 8.74
sin::1024      f::10  ->  f::3 0.0192 0.0022 8.69
sin::2048      f::10  ->  f::3 0.0382 0.0044 8.6
sin::4096      f::10  ->  f::3 0.0765 0.0089 8.64
sin::1024      f::10  ->  f::10 0.0191 0.0022 8.52
sin::2048      f::10  ->  f::10 0.0382 0.0045 8.46
sin::4096      f::10  ->  f::10 0.0766 0.0090 8.55

ARM8 64-bit

CPU
Architecture:                    aarch64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              1
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       ARM
Model:                           1
Model name:                      Neoverse-N1
Stepping:                        r3p1
BogoMIPS:                        243.75
L1d cache:                       256 KiB
L1i cache:                       256 KiB
L2 cache:                        4 MiB
L3 cache:                        32 MiB
NUMA node0 CPU(s):               0-3
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs

OS
Linux ip-172-31-6-63 5.4.0-1024-aws #24-Ubuntu SMP Sat Sep 5 06:17:48 UTC 2020 aarch64 aarch64 aarch64 GNU/Linux
gcc-7 (Ubuntu/Linaro 7.5.0-6ubuntu2) 7.5.0

Benchmark

ASIMD - Contiguous only

metric: gmean, units: ms

name of test before_contig after_contig after_contig vs before_contig
cos::1024      f::1  ->  f::1 0.0072 0.0037 1.93
cos::2048      f::1  ->  f::1 0.0149 0.0074 2.0
cos::4096      f::1  ->  f::1 0.0313 0.0149 2.11
sin::1024      f::1  ->  f::1 0.0072 0.0037 1.97
sin::2048      f::1  ->  f::1 0.0148 0.0073 2.03
sin::4096      f::1  ->  f::1 0.0305 0.0146 2.09

ASIMD

metric: gmean, units: ms

name of test before after after vs before
cos::1024      f::1  ->  f::1 0.0057 0.0037 1.53
cos::2048      f::1  ->  f::1 0.0125 0.0074 1.68
cos::4096      f::1  ->  f::1 0.0260 0.0149 1.75
cos::1024      f::1  ->  f::2 0.0057 0.0042 1.37
cos::2048      f::1  ->  f::2 0.0124 0.0083 1.49
cos::4096      f::1  ->  f::2 0.0260 0.0166 1.56
cos::1024      f::1  ->  f::3 0.0057 0.0042 1.37
cos::2048      f::1  ->  f::3 0.0124 0.0083 1.49
cos::4096      f::1  ->  f::3 0.0260 0.0167 1.56
cos::1024      f::1  ->  f::10 0.0057 0.0042 1.37
cos::2048      f::1  ->  f::10 0.0127 0.0086 1.48
cos::4096      f::1  ->  f::10 0.0262 0.0167 1.57
cos::1024      f::2  ->  f::1 0.0060 0.0040 1.5
cos::2048      f::2  ->  f::1 0.0125 0.0080 1.56
cos::4096      f::2  ->  f::1 0.0261 0.0160 1.63
cos::1024      f::2  ->  f::2 0.0060 0.0044 1.36
cos::2048      f::2  ->  f::2 0.0125 0.0088 1.42
cos::4096      f::2  ->  f::2 0.0261 0.0177 1.47
cos::1024      f::2  ->  f::3 0.0061 0.0044 1.37
cos::2048      f::2  ->  f::3 0.0125 0.0088 1.41
cos::4096      f::2  ->  f::3 0.0262 0.0177 1.48
cos::1024      f::2  ->  f::10 0.0060 0.0044 1.36
cos::2048      f::2  ->  f::10 0.0126 0.0089 1.42
cos::4096      f::2  ->  f::10 0.0264 0.0177 1.49
cos::1024      f::3  ->  f::1 0.0057 0.0042 1.35
cos::2048      f::3  ->  f::1 0.0126 0.0084 1.51
cos::4096      f::3  ->  f::1 0.0265 0.0168 1.57
cos::1024      f::3  ->  f::2 0.0057 0.0047 1.22
cos::2048      f::3  ->  f::2 0.0126 0.0093 1.36
cos::4096      f::3  ->  f::2 0.0265 0.0187 1.42
cos::1024      f::3  ->  f::3 0.0057 0.0047 1.22
cos::2048      f::3  ->  f::3 0.0127 0.0093 1.36
cos::4096      f::3  ->  f::3 0.0266 0.0187 1.42
cos::1024      f::3  ->  f::10 0.0057 0.0047 1.22
cos::2048      f::3  ->  f::10 0.0128 0.0094 1.37
cos::4096      f::3  ->  f::10 0.0266 0.0187 1.43
cos::1024      f::10  ->  f::1 0.0060 0.0048 1.26
cos::2048      f::10  ->  f::1 0.0125 0.0095 1.31
cos::4096      f::10  ->  f::1 0.0263 0.0190 1.38
cos::1024      f::10  ->  f::2 0.0061 0.0051 1.2
cos::2048      f::10  ->  f::2 0.0125 0.0102 1.23
cos::4096      f::10  ->  f::2 0.0263 0.0204 1.29
cos::1024      f::10  ->  f::3 0.0061 0.0051 1.18
cos::2048      f::10  ->  f::3 0.0125 0.0102 1.22
cos::4096      f::10  ->  f::3 0.0263 0.0204 1.29
cos::1024      f::10  ->  f::10 0.0061 0.0052 1.16
cos::2048      f::10  ->  f::10 0.0126 0.0102 1.23
cos::4096      f::10  ->  f::10 0.0264 0.0206 1.28
sin::1024      f::1  ->  f::1 0.0073 0.0037 2.0
sin::2048      f::1  ->  f::1 0.0147 0.0073 2.01
sin::4096      f::1  ->  f::1 0.0300 0.0146 2.06
sin::1024      f::1  ->  f::2 0.0074 0.0041 1.79
sin::2048      f::1  ->  f::2 0.0146 0.0082 1.78
sin::4096      f::1  ->  f::2 0.0300 0.0164 1.83
sin::1024      f::1  ->  f::3 0.0073 0.0041 1.79
sin::2048      f::1  ->  f::3 0.0146 0.0082 1.78
sin::4096      f::1  ->  f::3 0.0300 0.0164 1.83
sin::1024      f::1  ->  f::10 0.0073 0.0041 1.78
sin::2048      f::1  ->  f::10 0.0147 0.0085 1.74
sin::4096      f::1  ->  f::10 0.0301 0.0164 1.83
sin::1024      f::2  ->  f::1 0.0072 0.0039 1.85
sin::2048      f::2  ->  f::1 0.0147 0.0078 1.89
sin::4096      f::2  ->  f::1 0.0301 0.0156 1.93
sin::1024      f::2  ->  f::2 0.0072 0.0044 1.65
sin::2048      f::2  ->  f::2 0.0147 0.0088 1.68
sin::4096      f::2  ->  f::2 0.0301 0.0176 1.71
sin::1024      f::2  ->  f::3 0.0073 0.0044 1.66
sin::2048      f::2  ->  f::3 0.0147 0.0088 1.68
sin::4096      f::2  ->  f::3 0.0302 0.0176 1.72
sin::1024      f::2  ->  f::10 0.0073 0.0044 1.66
sin::2048      f::2  ->  f::10 0.0148 0.0088 1.68
sin::4096      f::2  ->  f::10 0.0302 0.0175 1.72
sin::1024      f::3  ->  f::1 0.0073 0.0042 1.75
sin::2048      f::3  ->  f::1 0.0146 0.0083 1.76
sin::4096      f::3  ->  f::1 0.0299 0.0167 1.79
sin::1024      f::3  ->  f::2 0.0073 0.0046 1.59
sin::2048      f::3  ->  f::2 0.0146 0.0091 1.6
sin::4096      f::3  ->  f::2 0.0299 0.0183 1.63
sin::1024      f::3  ->  f::3 0.0073 0.0046 1.57
sin::2048      f::3  ->  f::3 0.0147 0.0091 1.6
sin::4096      f::3  ->  f::3 0.0299 0.0183 1.64
sin::1024      f::3  ->  f::10 0.0073 0.0046 1.59
sin::2048      f::3  ->  f::10 0.0147 0.0092 1.61
sin::4096      f::3  ->  f::10 0.0300 0.0183 1.64
sin::1024      f::10  ->  f::1 0.0073 0.0047 1.57
sin::2048      f::10  ->  f::1 0.0147 0.0094 1.57
sin::4096      f::10  ->  f::1 0.0301 0.0187 1.61
sin::1024      f::10  ->  f::2 0.0073 0.0050 1.45
sin::2048      f::10  ->  f::2 0.0146 0.0101 1.45
sin::4096      f::10  ->  f::2 0.0301 0.0201 1.5
sin::1024      f::10  ->  f::3 0.0073 0.0050 1.46
sin::2048      f::10  ->  f::3 0.0146 0.0101 1.45
sin::4096      f::10  ->  f::3 0.0301 0.0201 1.5
sin::1024      f::10  ->  f::10 0.0074 0.0051 1.45
sin::2048      f::10  ->  f::10 0.0147 0.0101 1.45
sin::4096      f::10  ->  f::10 0.0302 0.0201 1.5

Power little-endian

CPU
Architecture:                    ppc64le
Byte Order:                      Little Endian
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       8
NUMA node(s):                    1
Model:                           2.2 (pvr 004e 1202)
Model name:                      POWER9 (architected), altivec supported
L1d cache:                       256 KiB
L1i cache:                       256 KiB
NUMA node0 CPU(s):               0-7
Vulnerability L1tf:              Not affected
Vulnerability Meltdown:          Mitigation; RFI Flush
Vulnerability Spec store bypass: Mitigation; Kernel entry/exit barrier (eieio)
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Vulnerable

processor   : 7
cpu     : POWER9 (architected), altivec supported
clock       : 2200.000000MHz
revision    : 2.2 (pvr 004e 1202)

timebase    : 512000000
platform    : pSeries
model       : IBM pSeries (emulated by qemu)
machine     : CHRP IBM pSeries (emulated by qemu)
MMU     : Radix


OS
Linux 8b2db3b0dfac 4.19.0-2-powerpc64le
gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2) 

Benchmark

VSX2(ISA >= 2.07) - Contiguous only

metric: gmean, units: ms

name of test before_contig after_contig after_contig vs before_contig
cos::1024      f::1  ->  f::1 0.0131 0.0044 2.94
cos::2048      f::1  ->  f::1 0.0265 0.0089 2.99
cos::4096      f::1  ->  f::1 0.0535 0.0176 3.03
sin::1024      f::1  ->  f::1 0.0134 0.0042 3.16
sin::2048      f::1  ->  f::1 0.0265 0.0085 3.13
sin::4096      f::1  ->  f::1 0.0542 0.0169 3.2

VSX2(ISA >= 2.07)

metric: gmean, units: ms

name of test before after after vs before
cos::1024      f::1  ->  f::1 0.0127 0.0044 2.86
cos::2048      f::1  ->  f::1 0.0264 0.0088 2.99
cos::4096      f::1  ->  f::1 0.0538 0.0184 2.92
cos::1024      f::1  ->  f::2 0.0129 0.0048 2.7
cos::2048      f::1  ->  f::2 0.0268 0.0095 2.83
cos::4096      f::1  ->  f::2 0.0546 0.0189 2.88
cos::1024      f::1  ->  f::3 0.0129 0.0047 2.72
cos::2048      f::1  ->  f::3 0.0268 0.0095 2.82
cos::4096      f::1  ->  f::3 0.0547 0.0189 2.89
cos::1024      f::1  ->  f::10 0.0129 0.0047 2.72
cos::2048      f::1  ->  f::10 0.0269 0.0095 2.84
cos::4096      f::1  ->  f::10 0.0546 0.0190 2.87
cos::1024      f::2  ->  f::1 0.0131 0.0048 2.73
cos::2048      f::2  ->  f::1 0.0266 0.0095 2.79
cos::4096      f::2  ->  f::1 0.0547 0.0191 2.87
cos::1024      f::2  ->  f::2 0.0130 0.0051 2.55
cos::2048      f::2  ->  f::2 0.0266 0.0102 2.61
cos::4096      f::2  ->  f::2 0.0544 0.0210 2.58
cos::1024      f::2  ->  f::3 0.0131 0.0051 2.56
cos::2048      f::2  ->  f::3 0.0281 0.0103 2.74
cos::4096      f::2  ->  f::3 0.0544 0.0208 2.62
cos::1024      f::2  ->  f::10 0.0131 0.0051 2.55
cos::2048      f::2  ->  f::10 0.0266 0.0102 2.6
cos::4096      f::2  ->  f::10 0.0544 0.0205 2.65
cos::1024      f::3  ->  f::1 0.0130 0.0048 2.7
cos::2048      f::3  ->  f::1 0.0271 0.0096 2.84
cos::4096      f::3  ->  f::1 0.0543 0.0191 2.83
cos::1024      f::3  ->  f::2 0.0129 0.0051 2.53
cos::2048      f::3  ->  f::2 0.0271 0.0102 2.65
cos::4096      f::3  ->  f::2 0.0543 0.0205 2.65
cos::1024      f::3  ->  f::3 0.0130 0.0051 2.53
cos::2048      f::3  ->  f::3 0.0272 0.0102 2.66
cos::4096      f::3  ->  f::3 0.0543 0.0205 2.65
cos::1024      f::3  ->  f::10 0.0130 0.0053 2.46
cos::2048      f::3  ->  f::10 0.0280 0.0102 2.73
cos::4096      f::3  ->  f::10 0.0563 0.0204 2.75
cos::1024      f::10  ->  f::1 0.0133 0.0048 2.76
cos::2048      f::10  ->  f::1 0.0265 0.0096 2.77
cos::4096      f::10  ->  f::1 0.0551 0.0191 2.89
cos::1024      f::10  ->  f::2 0.0133 0.0051 2.6
cos::2048      f::10  ->  f::2 0.0266 0.0102 2.59
cos::4096      f::10  ->  f::2 0.0552 0.0205 2.7
cos::1024      f::10  ->  f::3 0.0133 0.0051 2.59
cos::2048      f::10  ->  f::3 0.0266 0.0102 2.59
cos::4096      f::10  ->  f::3 0.0552 0.0205 2.7
cos::1024      f::10  ->  f::10 0.0133 0.0051 2.58
cos::2048      f::10  ->  f::10 0.0265 0.0102 2.59
cos::4096      f::10  ->  f::10 0.0552 0.0205 2.7
sin::1024      f::1  ->  f::1 0.0134 0.0042 3.16
sin::2048      f::1  ->  f::1 0.0271 0.0085 3.2
sin::4096      f::1  ->  f::1 0.0535 0.0169 3.17
sin::1024      f::1  ->  f::2 0.0133 0.0046 2.9
sin::2048      f::1  ->  f::2 0.0268 0.0091 2.93
sin::4096      f::1  ->  f::2 0.0530 0.0183 2.9
sin::1024      f::1  ->  f::3 0.0133 0.0046 2.9
sin::2048      f::1  ->  f::3 0.0268 0.0091 2.94
sin::4096      f::1  ->  f::3 0.0530 0.0188 2.82
sin::1024      f::1  ->  f::10 0.0133 0.0047 2.83
sin::2048      f::1  ->  f::10 0.0268 0.0094 2.87
sin::4096      f::1  ->  f::10 0.0530 0.0183 2.9
sin::1024      f::2  ->  f::1 0.0131 0.0046 2.87
sin::2048      f::2  ->  f::1 0.0264 0.0092 2.89
sin::4096      f::2  ->  f::1 0.0530 0.0183 2.9
sin::1024      f::2  ->  f::2 0.0131 0.0050 2.65
sin::2048      f::2  ->  f::2 0.0264 0.0099 2.68
sin::4096      f::2  ->  f::2 0.0531 0.0198 2.68
sin::1024      f::2  ->  f::3 0.0131 0.0049 2.66
sin::2048      f::2  ->  f::3 0.0264 0.0099 2.68
sin::4096      f::2  ->  f::3 0.0530 0.0198 2.68
sin::1024      f::2  ->  f::10 0.0131 0.0050 2.65
sin::2048      f::2  ->  f::10 0.0265 0.0099 2.68
sin::4096      f::2  ->  f::10 0.0531 0.0197 2.69
sin::1024      f::3  ->  f::1 0.0130 0.0046 2.82
sin::2048      f::3  ->  f::1 0.0263 0.0092 2.86
sin::4096      f::3  ->  f::1 0.0532 0.0183 2.9
sin::1024      f::3  ->  f::2 0.0130 0.0050 2.61
sin::2048      f::3  ->  f::2 0.0263 0.0099 2.65
sin::4096      f::3  ->  f::2 0.0533 0.0198 2.69
sin::1024      f::3  ->  f::3 0.0136 0.0050 2.74
sin::2048      f::3  ->  f::3 0.0263 0.0099 2.65
sin::4096      f::3  ->  f::3 0.0533 0.0198 2.69
sin::1024      f::3  ->  f::10 0.0130 0.0050 2.61
sin::2048      f::3  ->  f::10 0.0263 0.0099 2.66
sin::4096      f::3  ->  f::10 0.0532 0.0198 2.69
sin::1024      f::10  ->  f::1 0.0128 0.0046 2.78
sin::2048      f::10  ->  f::1 0.0264 0.0092 2.88
sin::4096      f::10  ->  f::1 0.0537 0.0184 2.91
sin::1024      f::10  ->  f::2 0.0133 0.0050 2.67
sin::2048      f::10  ->  f::2 0.0264 0.0099 2.66
sin::4096      f::10  ->  f::2 0.0537 0.0198 2.72
sin::1024      f::10  ->  f::3 0.0128 0.0050 2.58
sin::2048      f::10  ->  f::3 0.0264 0.0099 2.67
sin::4096      f::10  ->  f::3 0.0537 0.0198 2.71
sin::1024      f::10  ->  f::10 0.0128 0.0050 2.58
sin::2048      f::10  ->  f::10 0.0264 0.0099 2.67
sin::4096      f::10  ->  f::10 0.0537 0.0198 2.71
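
A quick sanity check for rows like the ones above is to recompute the last column from the two timings. The sketch below assumes each row reads as ufunc::size, input dtype::stride -> output dtype::stride, followed by two timings and their quotient (first timing divided by second). That reading is inferred from the numbers and is not stated in this part of the thread. `parse_row` and its field names are hypothetical helpers, not part of ASV or NumPy.

```python
import re

# Assumed row layout (inferred from the table, not confirmed by the PR):
#   <ufunc>::<size>  <dtype>::<in stride> -> <dtype>::<out stride>  <t1> <t2> <ratio>
ROW = re.compile(
    r"(?P<ufunc>\w+)::(?P<size>\d+)\s+"
    r"(?P<in_dtype>\w+)::(?P<in_stride>\d+)\s+->\s+"
    r"(?P<out_dtype>\w+)::(?P<out_stride>\d+)\s+"
    r"(?P<t1>\d+\.\d+)\s+(?P<t2>\d+\.\d+)\s+(?P<ratio>\d+(?:\.\d+)?)\s*$"
)


def parse_row(line):
    """Parse one benchmark row and recompute the ratio from its two timings."""
    m = ROW.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized row: {line!r}")
    t1, t2 = float(m["t1"]), float(m["t2"])
    return {
        "ufunc": m["ufunc"],
        "size": int(m["size"]),
        "in_stride": int(m["in_stride"]),
        "out_stride": int(m["out_stride"]),
        "reported_ratio": float(m["ratio"]),
        # The timings are printed with four decimals, so the recomputed value
        # only matches the reported one approximately.
        "recomputed_ratio": t1 / t2,
    }


rows = [
    "cos::1024      f::1  ->  f::10 0.0129 0.0047 2.72",
    "sin::4096      f::10  ->  f::10 0.0537 0.0198 2.71",
]
for row in map(parse_row, rows):
    print(row["ufunc"], row["size"], row["in_stride"], row["out_stride"],
          f"{row['recomputed_ratio']:.2f} vs {row['reported_ratio']:.2f}")
```

Small differences between the recomputed and reported ratios are expected, since the printed timings are already rounded.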

@seiko2plus seiko2plus force-pushed the to_npyv_sincos_f32 branch 12 times, most recently from 7161e30 to e01dc6e Compare October 21, 2020 10:25
@seiko2plus seiko2plus marked this pull request as ready for review October 21, 2020 10:25
@seiko2plus seiko2plus force-pushed the to_npyv_sincos_f32 branch 2 times, most recently from 518fd92 to 2a01e5f Compare November 1, 2020 16:39
@charris charris added 01 - Enhancement component: SIMD Issues in SIMD (fast instruction sets) code or machinery labels Nov 2, 2020
@seiko2plus seiko2plus force-pushed the to_npyv_sincos_f32 branch 2 times, most recently from 360472c to bb08eb2 Compare November 11, 2020 03:40
@seiko2plus
Member Author

The new NPYV intrinsics have moved to separate pull requests: #17790 and #17789.

@seiko2plus seiko2plus force-pushed the to_npyv_sincos_f32 branch 7 times, most recently from b958d43 to a0322ee Compare December 26, 2020 10:48
@seiko2plus
Member Author

ping @mattip

@@ -0,0 +1,230 @@
/*@targets
** $maxopt $werror baseline
Member Author

Suggested change
- ** $maxopt $werror baseline
+ ** $maxopt baseline

Remove $werror (treating warnings as errors) once the CI passes the tests.

Member

CI is passing

Member Author

Done. I used this policy temporarily during development to catch any warnings.

@mattip
Member

mattip commented Dec 26, 2020

Nice speedups. Is this for 32-bit float only or also for 64-bit?

Edit: 32-bit only.

   The new code improves the performance of non-contiguous memory access
   for the output array without any reduction in performance.
   On PPC64LE performance improved by a factor of 2.0-3.0, and by 1.5-2.0 on aarch64.
  This test should not be exclusive to AVX. This patch also
  extends the unary test to cover different sets of output strides.
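
To make "different sets of output strides" concrete, here is a small pure-Python reference model of a strided unary loop, checked over several input/output stride pairs. It assumes strides are counted in elements and reuses the values 1, 2, 3 and 10 that appear in the benchmark rows. `strided_unary` and `check_all_strides` are hypothetical helpers for illustration only; they are neither the NumPy inner loop nor the test added by this patch.

```python
import math
from itertools import product


def strided_unary(func, src, in_stride, dst, out_stride, n):
    """Apply func to n elements of src, reading and writing with element strides."""
    for i in range(n):
        dst[i * out_stride] = func(src[i * in_stride])


def check_all_strides(func, n=8, strides=(1, 2, 3, 10)):
    """Run the loop for every (input stride, output stride) pair and verify it."""
    for in_stride, out_stride in product(strides, repeat=2):
        src = [0.1 * k for k in range(n * in_stride)]
        dst = [None] * (n * out_stride)  # None marks slots the loop must not touch
        strided_unary(func, src, in_stride, dst, out_stride, n)
        for i in range(n):
            assert dst[i * out_stride] == func(src[i * in_stride])
        # Everything between the strided output slots stays untouched.
        untouched = [v for j, v in enumerate(dst) if j % out_stride != 0]
        assert all(v is None for v in untouched)
    return len(strides) ** 2


print(check_all_strides(math.sin), "stride combinations checked for sin")
print(check_all_strides(math.cos), "stride combinations checked for cos")
```

A vectorized kernel would typically use different load/store paths for the contiguous and non-contiguous cases; this model only pins down the result both paths are expected to produce.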
@seiko2plus
Member Author

@mattip, I only replaced the raw f32 SIMD code with NPYV.

@mattip mattip merged commit ce82028 into numpy:master Dec 26, 2020
@mattip
Member

mattip commented Dec 26, 2020

Thanks @seiko2plus

Labels
01 - Enhancement component: SIMD Issues in SIMD (fast instruction sets) code or machinery