
SIMD optimizations with Highway #3618

Merged: 29 commits, Sep 26, 2023

Conversation

kleisauke
Member

This PR optimizes the reduce{h,v}, convi, and morph operations using portable SIMD/vector instructions through Highway. The liborc paths serve as fallbacks whenever Highway >= v1.0.5 is unavailable[1].

Motivation

Traditionally, libvips depends on liborc's runtime compiler to dynamically generate optimized SIMD/vector code specifically for the target architecture. However, maintaining this code proved challenging and it didn't generalize to other architectures (such as WebAssembly). Additionally, it lacked support for newer instruction sets (like AVX2 and AVX-512), and the vector paths of liborc didn't match the precision of the C paths (as noted here).

Highway is a C++ library with carefully-chosen functions that map well to CPU instructions without extensive compiler transformations. Because Highway is a library (rather than a code generator or compiler) it facilitates straightforward development, debugging, and maintenance of the code. Highway supports five architectures[2]; the same application code can target various instruction sets, including those with 'scalable' vectors (size unknown at compile time).
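
For a flavor of the programming model, here is a minimal sketch using Highway's static dispatch (illustrative only, not code from this PR; it assumes n is a multiple of the vector lane count):

#include <cstddef>
#include <hwy/highway.h>

namespace hn = hwy::HWY_NAMESPACE;

// Add two float arrays using whatever vector width the compiled
// target provides; on 'scalable' targets (SVE, RVV) the lane count
// Lanes(d) is only known at runtime.
void AddArrays(const float* HWY_RESTRICT a, const float* HWY_RESTRICT b,
               float* HWY_RESTRICT out, size_t n) {
  const hn::ScalableTag<float> d;
  for (size_t i = 0; i < n; i += hn::Lanes(d)) {
    hn::StoreU(hn::Add(hn::LoadU(d, a + i), hn::LoadU(d, b + i)),
               d, out + i);
  }
}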

Usage

Users can view available targets for their platform using the --targets flag:

$ vips --targets
builtin targets:   AVX3_SPR AVX3_ZEN4 AVX3 AVX2 SSE4 SSSE3 SSE2
supported targets: AVX3_ZEN4 AVX3_DL AVX3 AVX2 SSE4 SSSE3 SSE2

Additionally, users can specify which available targets to use at runtime via the VIPS_VECTOR environment variable, particularly useful for testing and benchmarking:

$ VIPS_VECTOR="-2049" vips --targets
builtin targets:   AVX3_SPR AVX3_ZEN4 AVX3 AVX2 SSE4 SSSE3 SSE2
supported targets: SSE4

As always, you have the option to disable vector paths with the --vips-novector flag or the VIPS_NOVECTOR environment variable.
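
As an aside on the value above: VIPS_VECTOR appears to be interpreted as a bitmask of targets to disable, using Highway's target bits. -2049 has every bit set except bit 11 (2048), so all targets except SSE4 get disabled. The programmatic equivalent via the new API looks roughly like this (a sketch; the function name is illustrative, and the 1 << 11 bit value for SSE4 comes from Highway and may change between versions):

#include <vips/vips.h>

/* Equivalent of VIPS_VECTOR="-2049": disable every target except the
 * one at bit 11 (2048, SSE4 on this build). */
static void
keep_only_sse4(void)
{
    vips_vector_disable_targets(~((gint64) 1 << 11));
}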

Accuracy and performance

This PR underwent accuracy and speed testing on the following targets:
https://gist.github.com/kleisauke/1f28a9fc156c753bcb1239b6fc1a2e62

It produces identical output to the C paths on these targets, addressing issue #2047.

On my AMD Ryzen 9 7900 workstation, this implementation shows a noticeable speed improvement, ranging from ~15% to ~2.5× faster depending on the number of worker threads. See the benchmark results at:
https://github.com/kleisauke/vips-microbench/blob/master/results/simd-highway.md

Feel free to benchmark this across additional architectures!

Backward compatibility

Several liborc-specific functions are now deprecated; see API changes below for details[3].

This PR should not affect backward compatibility. The abi-compliance-checker result is available at:
https://kleisauke.nl/compat_reports/vips/master_to_simd-highway/compat_report.html

References

[1]: Highway packaging status

[2]: Highway targets

Highway currently targets the following 'clusters' of features:

  • x86:
    • SSE2 (any x64)
    • SSSE3 (~Intel Core)
    • SSE4 (~Nehalem)
    • AVX2 (~Haswell)
    • AVX3 (~Skylake)
    • AVX3_DL (~Icelake)
    • AVX3_ZEN4 (~Zen4)
    • AVX3_SPR (~Sapphire Rapids)
  • Arm:
    • NEON (Armv7+)
    • SVE (plus its specialization for 256-bit vectors SVE_256)
    • SVE2 (plus its specialization for 128-bit vectors SVE2_128)
  • POWER:
    • PPC8 (v2.07)
    • PPC9 (v3.0)
    • PPC10 (v3.1B)
  • RISC-V:
    • RVV (1.0)
  • WebAssembly:
    • WASM
    • WASM_EMU256 (a 2x unrolled version of wasm128)
[3]: API Changes

memory.h:

void vips_tracked_free(void *s);
+void vips_tracked_aligned_free(void *s);
void *vips_tracked_malloc(size_t size);
+void *vips_tracked_aligned_alloc(size_t size, size_t align);

A new function to allocate memory aligned on a specific boundary, along with a function for releasing that memory.
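
A minimal usage sketch (the 64-byte alignment and the function name are illustrative; 64 bytes matches a cache line and the widest AVX-512 vectors):

#include <vips/vips.h>

static void
aligned_alloc_example(void)
{
    /* Allocate a 64-byte-aligned buffer for aligned SIMD loads and
     * stores; release it with the matching aligned free, not
     * vips_tracked_free(). */
    void *buf = vips_tracked_aligned_alloc(4096, 64);

    if (buf) {
        /* ... SIMD work on buf ... */
        vips_tracked_aligned_free(buf);
    }
}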

vector.h:

-void vips_vector_init(void);
gboolean vips_vector_isenabled(void);
void vips_vector_set_enabled(gboolean enabled);

-void vips_vector_free(VipsVector *vector);
-VipsVector *vips_vector_new(const char *name, int dsize);
-void vips_vector_constant(VipsVector *vector, char *name, int value, int size);
-void vips_vector_source_scanline(VipsVector *vector, char *name, int line, int size);
-int vips_vector_source_name(VipsVector *vector, const char *name, int size);
-void vips_vector_temporary(VipsVector *vector, const char *name, int size);
-int vips_vector_parameter(VipsVector *vector, const char *name, int size);
-int vips_vector_destination(VipsVector *vector, const char *name, int size);
-void vips_vector_asm2(VipsVector *vector, const char *op, const char *a, const char *b);
-void vips_vector_asm3(VipsVector *vector, const char *op, const char *a, const char *b, const char *c);
-gboolean vips_vector_full(VipsVector *vector);
-gboolean vips_vector_compile(VipsVector *vector);
-void vips_vector_print(VipsVector *vector);
-void vips_executor_set_program(VipsExecutor *executor, VipsVector *vector, int n);
-void vips_executor_set_scanline(VipsExecutor *executor, VipsRegion *ir, int x, int y);
-void vips_executor_set_destination(VipsExecutor *executor, void *value);
-void vips_executor_set_parameter(VipsExecutor *executor, int var, int value);
-void vips_executor_set_array(VipsExecutor *executor, int var, void *value);
-void vips_executor_run(VipsExecutor *executor);
-void vips_vector_to_fixed_point(double *in, int *out, int n, int scale);
+gint64 vips_vector_get_builtin_targets(void);
+gint64 vips_vector_get_supported_targets(void);
+const char *vips_vector_target_name(gint64 target);
+void vips_vector_disable_targets(gint64 disabled_targets);

New functions to obtain or disable specific targets; the previous VipsVector / VipsExecutor APIs are deprecated.
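
As a usage sketch (not the actual implementation), roughly what the supported-targets line of vips --targets prints could be reproduced like this:

#include <stdio.h>
#include <vips/vips.h>

/* Walk the bits of the supported-targets mask and print each target
 * name; in Highway, lower bits correspond to better targets. */
static void
print_supported_targets(void)
{
    gint64 targets = vips_vector_get_supported_targets();
    int i;

    printf("supported targets: ");
    for (i = 0; i < 63; i++) {
        gint64 bit = (gint64) 1 << i;
        const char *name = vips_vector_target_name(bit);

        if ((targets & bit) && name)
            printf("%s ", name);
    }
    printf("\n");
}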

Commit messages:

  • In preparation for Highway. Also, don't use liborc for `vips_abs()`, as that didn't yield any usable speedup.
  • In addition to disabling SIMD completely using `--vips-novector` or `VIPS_NOVECTOR`, one can selectively override specific SIMD targets using the `VIPS_VECTOR` environment variable or the `vips_vector_disable_targets()` function. Handy for testing and benchmarking purposes.
  • In favor of `InterleaveLower` / `InterleaveUpper`.
  • Just fall back to the C paths if SIMD is not supported.
  • For images with 3 or 4 bands.
  • By casting back to the unpremultiplied format immediately after `vips_premultiply()`; the fixed-point coefficients are 16-bit.
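
On the alpha handling in the last note: vips_premultiply() produces float output, so casting straight back to the input format keeps the 16-bit fixed-point reduce paths usable. A rough sketch of that pipeline shape using the public API (function name illustrative, error and ref-count handling elided; the real logic lives inside the resize path):

#include <vips/vips.h>

/* Premultiply, cast back to the original format, then reduce; a
 * final vips_unpremultiply() would follow on the result. */
static int
reduce_with_alpha(VipsImage *in, VipsImage **out,
    double hshrink, double vshrink)
{
    VipsImage *t[2];

    return vips_premultiply(in, &t[0], NULL) ||
        vips_cast(t[0], &t[1], in->BandFmt, NULL) ||
        vips_reduce(t[1], out, hshrink, vshrink, NULL);
}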
@jcupitt
Member

jcupitt commented Aug 20, 2023

This is fantastic Kleis, what a huge project, and congratulations on getting it over the line.

I'll run some tests here.

@jcupitt
Copy link
Member

jcupitt commented Aug 21, 2023

I tried a few things:

$ time vips gaussblur wtc.jpg x.jpg 10   # master (liborc)

real	0m1.092s
user	0m7.167s
sys	0m0.224s
$ time vips gaussblur wtc.jpg x.jpg 10   # this branch (Highway)

real	0m0.879s
user	0m2.247s
sys	0m0.184s

This is limited by jpg encode and decode, but you can see a nice improvement in CPU time. If you make it more CPU limited, the speedup is more obvious:

$ time vips gaussblur wtc.jpg x.jpg 100   # master (liborc)

real	0m14.970s
user	7m50.982s
sys	0m0.380s
$ time vips gaussblur wtc.jpg x.jpg 100   # this branch (Highway)

real	0m1.991s
user	0m20.958s
sys	0m0.800s

Haha 7x faster in real time because sigma 100 will make master fall off the orc path.

reduceh is 2.5x faster, though it makes little difference to image resize times. Morph is slightly quicker.

I like the new highway infrastructure. It should make it relatively simple to add more highway paths, for example to VipsInterpolate, or maybe even composite (I expect highway could beat the compiler thing we use now).

I've not noticed any bad results.

@lovell
Member

lovell commented Aug 21, 2023

Wow, this is great, thank you Kleis! I'll go away and do some testing, but please don't let that stop you merging.

We don't currently include vector paths via oss-fuzz but perhaps we might want to consider doing so?

kleisauke added a commit to kleisauke/oss-fuzz that referenced this pull request Aug 22, 2023
@kleisauke
Member Author

reduceh is 2.5x faster, though it makes little difference to image resize times.

Indeed, somehow vips_resize() / vips_thumbnail() doesn't really benefit from this. It seems that shrink{h,v} + reduce{h,v} is actually slower than reduce{h,v} alone, so we might consider defaulting to gap = 0.0, as noticed in libvips/pyvips#148 (comment).

We don't currently include vector paths via oss-fuzz but perhaps we might want to consider doing so?

I just opened PR google/oss-fuzz#10868 for this. I'll update fuzz/oss_fuzz_build.sh after that lands.

@jcupitt
Member

jcupitt commented Aug 22, 2023

so we might consider defaulting to gap = 0.0, as noticed in libvips/pyvips#148 (comment).

I think this would be a bad idea for large shrinks -- if you are shrinking by x100, for example, reducev would need to read the input image in huge chunks. shrinkv has the nice property of never fetching too many input scanlines in one go.

@kleisauke
Member Author

I think this would be a bad idea for large shrinks

Ah, you're right. I tested this with:

Details

Benchmark script: https://gist.github.com/kleisauke/ea7f7e12ae043aa1151dbc09987600a7

$ curl -LO https://github.com/kleisauke/vips-microbench/raw/master/images/x.jpg
$ python3 gap-bench.py --gap=2.0 -o gap-2.0.json
$ python3 gap-bench.py --gap=0.0 -o gap-0.0.json
$ python3 -m pyperf compare_to gap-2.0.json gap-0.0.json --table
+----------------+---------+----------------------+
| Benchmark      | gap-2.0 | gap-0.0              |
+================+=========+======================+
| 4x             | 567 ms  | 305 ms: 1.86x faster |
+----------------+---------+----------------------+
| 8x             | 424 ms  | 306 ms: 1.39x faster |
+----------------+---------+----------------------+
| 9.4x           | 391 ms  | 303 ms: 1.29x faster |
+----------------+---------+----------------------+
| 16x            | 355 ms  | 315 ms: 1.12x faster |
+----------------+---------+----------------------+
| 64x            | 338 ms  | 415 ms: 1.23x slower |
+----------------+---------+----------------------+
| Geometric mean | (ref)   | 1.17x faster         |
+----------------+---------+----------------------+

Benchmark hidden because not significant (2): 2x, 32x
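
(For anyone wanting to experiment: gap can be overridden per call rather than changing the default. A minimal sketch in C, with an illustrative function name:)

#include <vips/vips.h>

/* Request gap=0.0 for this call only: skip the shrink{h,v} pass and
 * reduce directly, at the cost of fetching input in larger chunks. */
static int
resize_no_gap(VipsImage *in, VipsImage **out)
{
    return vips_resize(in, out, 1.0 / 9.4, "gap", 0.0, NULL);
}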

DavidKorczynski pushed a commit to google/oss-fuzz that referenced this pull request Aug 22, 2023
@jcupitt
Member

jcupitt commented Aug 23, 2023

A big difference in memory use too (`%M` is peak RSS in KB, `%e` elapsed seconds):

john@banana ~/pics $ /usr/bin/time -f %M:%e vips resize x.jpg x2.jpg 0.01
150484:0.42
john@banana ~/pics $ /usr/bin/time -f %M:%e vips resize x.jpg x2.jpg 0.01 --gap 0
511796:0.80

@lovell
Member

lovell commented Sep 1, 2023

My initial testing on an Intel i7-1255U laptop (2 performance cores with hyperthreading plus 8 efficiency cores without, 12 hardware threads in total) with AVX2 suggests there is noticeable variance in multi-threaded resize performance compared with liborc, ranging from +15% at best to -5% at worst, seemingly at random.

I've yet to dig into the details, but it could be a clock speed reduction of the non-hyperthreaded cores when hot, AVX2 "heavy" operations causing slowdown due to throttling / lane widening, or perhaps some operations now running fast enough to cause more cache evictions.

@kleisauke
Member Author

kleisauke commented Sep 10, 2023

Thanks for testing @lovell! The 5% slowdown sounds like a CPU clock throttling issue; does it also occur with `export VIPS_CONCURRENCY=1` or `export VIPS_VECTOR=-2049`?

It could also be due to the over-computation issue mentioned in #2757, which could be circumvented by forcing random access (I'm not sure whether this can be done from the CLI). I'll have a look to see if I can reproduce this on my old AVX2 laptop.

@kleisauke
Member Author

I'll have a look to see if I can reproduce this on my old AVX2 laptop.

I could not reproduce this on my old AVX2 laptop. Tested with:

Details

Benchmark script: https://gist.github.com/kleisauke/a2669ec11118de41f36401415e144fd7

Test environment

  • HP 250 G5 - i5-6200U
  • Fedora 38
  • $ vips --targets
    builtin targets:   AVX3_SPR AVX3_ZEN4 AVX3 AVX2 SSE4 SSE2
    supported targets: AVX2 SSE4 SSSE3 SSE2

Images

Image                                   Dimensions
2569067123_aca715a2ee_o.jpg             2725×2225
alpha-premultiply-2048x1536-paper.png   2048×1536
4.webp                                  1024×772

Results

$ curl -LO https://github.com/lovell/sharp/raw/main/test/fixtures/2569067123_aca715a2ee_o.jpg
$ curl -LO https://github.com/lovell/sharp/raw/main/test/fixtures/alpha-premultiply-2048x1536-paper.png
$ curl -LO https://github.com/lovell/sharp/raw/main/test/fixtures/4.webp

$ python3 thumbnail-bench.py 2569067123_aca715a2ee_o.jpg -o jpeg-highway.json
.....................
720x: Mean +- std dev: 49.2 ms +- 0.8 ms

$ python3 thumbnail-bench.py alpha-premultiply-2048x1536-paper.png -o png-highway.json
.....................
720x: Mean +- std dev: 83.8 ms +- 1.6 ms

$ python3 thumbnail-bench.py 4.webp -o webp-highway.json
.....................
720x: Mean +- std dev: 30.3 ms +- 0.4 ms

$ python3 thumbnail-bench.py 2569067123_aca715a2ee_o.jpg -o jpeg-orc.json
.....................
720x: Mean +- std dev: 60.0 ms +- 1.9 ms

$ python3 thumbnail-bench.py alpha-premultiply-2048x1536-paper.png -o png-orc.json
.....................
720x: Mean +- std dev: 100 ms +- 2 ms

$ python3 thumbnail-bench.py 4.webp -o webp-orc.json
.....................
720x: Mean +- std dev: 30.3 ms +- 0.4 ms

$ python3 -m pyperf compare_to jpeg-orc.json jpeg-highway.json  --table
+-----------+----------+-----------------------+
| Benchmark | jpeg-orc | jpeg-highway          |
+===========+==========+=======================+
| 720x      | 60.0 ms  | 49.2 ms: 1.22x faster |
+-----------+----------+-----------------------+

$ python3 -m pyperf compare_to png-orc.json png-highway.json --table
+-----------+---------+-----------------------+
| Benchmark | png-orc | png-highway           |
+===========+=========+=======================+
| 720x      | 100 ms  | 83.8 ms: 1.20x faster |
+-----------+---------+-----------------------+

$ python3 -m pyperf compare_to webp-orc.json webp-highway.json --table
Benchmark hidden because not significant (1): 720x

Notes

  • The liborc benchmark is done on this PR by compiling with -Dhighway=disabled; this ensures it still benefits from the improvement made in commit 2ece8c2.
  • WebP scale-on-load shrinks directly to target dimensions, so it's expected that there are no performance improvements in the WebP benchmark.
  • The laptop charger was plugged in and the power profile was set to "performance" during the benchmarks.
    $ powerprofilesctl get
    performance

So on this benchmark, Highway is ~16% to ~22% faster when compared with liborc.

@lovell
Member

lovell commented Sep 26, 2023

I've done more testing and can confirm VIPS_CONCURRENCY definitely impacts performance. When set to 1, this branch is consistently ~25% faster. As I increase the concurrency, the variability increases, and performance starts to drop at around a value of 4, which I think suggests CPU throttling. I guess this change increases the importance of limiting concurrency to the number of physical cores.

@jcupitt
Member

jcupitt commented Sep 26, 2023

I think Intel brands this as Turbo Boost.

If the CPU is mostly using just one core, that single core gets about a 20% or 30% clock bump above the standard rated frequency. Once you start to run a couple of cores hot, it'll clock down to normal speeds.

Maybe disable turbo boost and try benchmarking again? I always forget how to do this, but SO suggests:

https://askubuntu.com/a/620114

The other factor might be the cache. Your cores will share L2/L3, so single core performance will in effect get a cache boost.

@kleisauke kleisauke merged commit b32cb5e into libvips:master Sep 26, 2023
6 checks passed
kleisauke added a commit to kleisauke/highway that referenced this pull request Sep 26, 2023
@kleisauke kleisauke added this to the 8.15 milestone Sep 26, 2023
@kleisauke
Member Author

\o/, this will be in libvips 8.15.

@jan-wassenberg

Highway main author here. Great to see this, congrats @kleisauke on the great results and thanks for letting us know :)

I guess this change increases the importance of limiting concurrency to the max physical cores.

Agreed. Hyperthreads share the vector units of a core; on Intel there are essentially two arithmetic ports plus one for shuffles, and it is pretty easy to keep them busy with a single thread. Scheduling threads onto the same core will not help and may actually hurt. When benchmarking, I usually use taskset or numactl to pin threads to the first hyperthread in each core.

In addition to that, faster vectorization means we are closer to being memory-bound, and in particular influenced by background activity that happens to use more of the bandwidth during the test.

@jcupitt
Member

jcupitt commented Sep 26, 2023

Congratulations on landing this huge thing Kleis!

@kmartinez
Member

Just catching up with this - it's "music to my ears" as you can expect ;-) Well done Kleis!
Can't wait to test it on my 7900 too!
