
SIMD optimizations with Highway #3618

Merged: 29 commits, Sep 26, 2023

Conversation

kleisauke
Member

This PR optimizes the reduce{h,v}, convi, and morph operations using portable SIMD/vector instructions through Highway. The liborc paths serve as fallbacks whenever Highway >= v1.0.5 is unavailable[1].

Motivation

Traditionally, libvips depends on liborc's runtime compiler to dynamically generate optimized SIMD/vector code specifically for the target architecture. However, maintaining this code proved challenging and it didn't generalize to other architectures (such as WebAssembly). Additionally, it lacked support for newer instruction sets (like AVX2 and AVX-512), and the vector paths of liborc didn't match the precision of the C paths (as noted here).

Highway is a C++ library with carefully-chosen functions that map well to CPU instructions without extensive compiler transformations. Because Highway is a library (rather than a code generator or compiler) it facilitates straightforward development, debugging, and maintenance of the code. Highway supports five architectures[2]; the same application code can target various instruction sets, including those with 'scalable' vectors (size unknown at compile time).
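
For a flavor of the programming model, here is a minimal sketch using Highway's static dispatch (illustrative only, not code from this PR; it assumes n is a multiple of the vector lane count):

#include <cstddef>
#include <hwy/highway.h>

namespace hn = hwy::HWY_NAMESPACE;

// Add two float arrays using whatever vector width the compiled
// target provides; on 'scalable' targets (SVE, RVV) the lane count
// Lanes(d) is only known at runtime.
void AddArrays(const float* HWY_RESTRICT a, const float* HWY_RESTRICT b,
               float* HWY_RESTRICT out, size_t n) {
  const hn::ScalableTag<float> d;
  for (size_t i = 0; i < n; i += hn::Lanes(d)) {
    hn::StoreU(hn::Add(hn::LoadU(d, a + i), hn::LoadU(d, b + i)),
               d, out + i);
  }
}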

Usage

Users can view available targets for their platform using the --targets flag:

$ vips --targets
builtin targets:   AVX3_SPR AVX3_ZEN4 AVX3 AVX2 SSE4 SSSE3 SSE2
supported targets: AVX3_ZEN4 AVX3_DL AVX3 AVX2 SSE4 SSSE3 SSE2

Additionally, users can specify which available targets to use at runtime via the VIPS_VECTOR environment variable, particularly useful for testing and benchmarking:

$ VIPS_VECTOR="-2049" vips --targets
builtin targets:   AVX3_SPR AVX3_ZEN4 AVX3 AVX2 SSE4 SSSE3 SSE2
supported targets: SSE4

As always, you have the option to disable vector paths with the --vips-novector flag or the VIPS_NOVECTOR environment variable.
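
As an aside on the value above: VIPS_VECTOR appears to be interpreted as a bitmask of targets to disable, using Highway's target bits. -2049 has every bit set except bit 11 (2048), so all targets except SSE4 get disabled. The programmatic equivalent via the new API looks roughly like this (a sketch; the function name is illustrative, and the 1 << 11 bit value for SSE4 comes from Highway and may change between versions):

#include <vips/vips.h>

/* Equivalent of VIPS_VECTOR="-2049": disable every target except the
 * one at bit 11 (2048, SSE4 on this build). */
static void
keep_only_sse4(void)
{
    vips_vector_disable_targets(~((gint64) 1 << 11));
}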

Accuracy and performance

This PR underwent accuracy and speed testing on the following targets:
https://gist.github.com/kleisauke/1f28a9fc156c753bcb1239b6fc1a2e62

It produces identical output to the C paths on these targets, addressing issue #2047.

On my AMD Ryzen 9 7900 workstation, this implementation shows a noticeable speed improvement, ranging from ~15% to ~2.5× faster depending on the number of worker threads. See the benchmark results at:
https://github.com/kleisauke/vips-microbench/blob/master/results/simd-highway.md

Feel free to benchmark this across additional architectures!

Backward compatibility

Several liborc-specific functions are now deprecated; see API changes below for details[3].

This PR should not affect backward compatibility. The abi-compliance-checker result is available at:
https://kleisauke.nl/compat_reports/vips/master_to_simd-highway/compat_report.html

References

[1]: Highway packaging status

[2]: Highway targets

Highway currently targets the following 'clusters' of features:

  • x86:
    • SSE2 (any x64)
    • SSSE3 (~Intel Core)
    • SSE4 (~Nehalem)
    • AVX2 (~Haswell)
    • AVX3 (~Skylake)
    • AVX3_DL (~Icelake)
    • AVX3_ZEN4 (~Zen4)
    • AVX3_SPR (~Sapphire Rapids)
  • Arm:
    • NEON (Armv7+)
    • SVE (plus its specialization for 256-bit vectors SVE_256)
    • SVE2 (plus its specialization for 128-bit vectors SVE2_128)
  • POWER:
    • PPC8 (v2.07)
    • PPC9 (v3.0)
    • PPC10 (v3.1B)
  • RISC-V:
    • RVV (1.0)
  • WebAssembly:
    • WASM
    • WASM_EMU256 (a 2x unrolled version of wasm128)
[3]: API Changes

memory.h:

void vips_tracked_free(void *s);
+void vips_tracked_aligned_free(void *s);
void *vips_tracked_malloc(size_t size);
+void *vips_tracked_aligned_alloc(size_t size, size_t align);

A new function to allocate memory aligned on a specific boundary, along with a function for releasing that memory.
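
A minimal usage sketch (the 64-byte alignment and the function name are illustrative; 64 bytes matches a cache line and the widest AVX-512 vectors):

#include <vips/vips.h>

static void
aligned_alloc_example(void)
{
    /* Allocate a 64-byte-aligned buffer for aligned SIMD loads and
     * stores; release it with the matching aligned free, not
     * vips_tracked_free(). */
    void *buf = vips_tracked_aligned_alloc(4096, 64);

    if (buf) {
        /* ... SIMD work on buf ... */
        vips_tracked_aligned_free(buf);
    }
}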

vector.h:

-void vips_vector_init(void);
gboolean vips_vector_isenabled(void);
void vips_vector_set_enabled(gboolean enabled);

-void vips_vector_free(VipsVector *vector);
-VipsVector *vips_vector_new(const char *name, int dsize);
-void vips_vector_constant(VipsVector *vector, char *name, int value, int size);
-void vips_vector_source_scanline(VipsVector *vector, char *name, int line, int size);
-int vips_vector_source_name(VipsVector *vector, const char *name, int size);
-void vips_vector_temporary(VipsVector *vector, const char *name, int size);
-int vips_vector_parameter(VipsVector *vector, const char *name, int size);
-int vips_vector_destination(VipsVector *vector, const char *name, int size);
-void vips_vector_asm2(VipsVector *vector, const char *op, const char *a, const char *b);
-void vips_vector_asm3(VipsVector *vector, const char *op, const char *a, const char *b, const char *c);
-gboolean vips_vector_full(VipsVector *vector);
-gboolean vips_vector_compile(VipsVector *vector);
-void vips_vector_print(VipsVector *vector);
-void vips_executor_set_program(VipsExecutor *executor, VipsVector *vector, int n);
-void vips_executor_set_scanline(VipsExecutor *executor, VipsRegion *ir, int x, int y);
-void vips_executor_set_destination(VipsExecutor *executor, void *value);
-void vips_executor_set_parameter(VipsExecutor *executor, int var, int value);
-void vips_executor_set_array(VipsExecutor *executor, int var, void *value);
-void vips_executor_run(VipsExecutor *executor);
-void vips_vector_to_fixed_point(double *in, int *out, int n, int scale);
+gint64 vips_vector_get_builtin_targets(void);
+gint64 vips_vector_get_supported_targets(void);
+const char *vips_vector_target_name(gint64 target);
+void vips_vector_disable_targets(gint64 disabled_targets);

New functions to obtain or disable specific targets; the previous VipsVector / VipsExecutor APIs are deprecated.
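
As a usage sketch (not the actual implementation), roughly what the supported-targets line of vips --targets prints could be reproduced like this:

#include <stdio.h>
#include <vips/vips.h>

/* Walk the bits of the supported-targets mask and print each target
 * name; in Highway, lower bits correspond to better targets. */
static void
print_supported_targets(void)
{
    gint64 targets = vips_vector_get_supported_targets();
    int i;

    printf("supported targets: ");
    for (i = 0; i < 63; i++) {
        gint64 bit = (gint64) 1 << i;
        const char *name = vips_vector_target_name(bit);

        if ((targets & bit) && name)
            printf("%s ", name);
    }
    printf("\n");
}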

Commit messages:

  • In preparation for Highway. Also, don't use liborc for `vips_abs()`, as that didn't yield any usable speedup.
  • In addition to disabling SIMD completely using `--vips-novector` or `VIPS_NOVECTOR`, one can selectively override specific SIMD targets using the `VIPS_VECTOR` environment variable or the `vips_vector_disable_targets()` function. Handy for testing and benchmarking purposes.
  • In favor of `InterleaveLower` / `InterleaveUpper`.
  • Just fall back to the C paths if SIMD is not supported.
  • For images with 3 or 4 bands.
  • By casting back to the unpremultiplied format immediately after `vips_premultiply()`; the fixed-point coefficients are 16-bit.
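
On the alpha handling in the last note: vips_premultiply() produces float output, so casting straight back to the input format keeps the 16-bit fixed-point reduce paths usable. A rough sketch of that pipeline shape using the public API (function name illustrative, error and ref-count handling elided; the real logic lives inside the resize path):

#include <vips/vips.h>

/* Premultiply, cast back to the original format, then reduce; a
 * final vips_unpremultiply() would follow on the result. */
static int
reduce_with_alpha(VipsImage *in, VipsImage **out,
    double hshrink, double vshrink)
{
    VipsImage *t[2];

    return vips_premultiply(in, &t[0], NULL) ||
        vips_cast(t[0], &t[1], in->BandFmt, NULL) ||
        vips_reduce(t[1], out, hshrink, vshrink, NULL);
}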
@jcupitt
Member

jcupitt commented Aug 20, 2023

This is fantastic Kleis, what a huge project, and congratulations on getting it over the line.

I'll run some tests here.

@jcupitt
Copy link
Member

jcupitt commented Aug 21, 2023

I tried a few things:

$ time vips gaussblur wtc.jpg x.jpg 10   # master (liborc)

real	0m1.092s
user	0m7.167s
sys	0m0.224s
$ time vips gaussblur wtc.jpg x.jpg 10   # this branch (Highway)

real	0m0.879s
user	0m2.247s
sys	0m0.184s

This is limited by jpg encode and decode, but you can see a nice improvement in CPU time. If you make it more CPU limited, the speedup is more obvious:

$ time vips gaussblur wtc.jpg x.jpg 100   # master (liborc)

real	0m14.970s
user	7m50.982s
sys	0m0.380s
$ time vips gaussblur wtc.jpg x.jpg 100   # this branch (Highway)

real	0m1.991s
user	0m20.958s
sys	0m0.800s

Haha 7x faster in real time because sigma 100 will make master fall off the orc path.

reduceh is 2.5x faster, though it makes little difference to image resize times. Morph is slightly quicker.

I like the new highway infrastructure. It should make it relatively simple to add more highway paths, for example to VipsInterpolate, or maybe even composite (I expect highway could beat the compiler thing we use now).

I've not noticed any bad results.

@lovell
Member

lovell commented Aug 21, 2023

Wow, this is great, thank you Kleis! I'll go away and do some testing, but please don't let that stop you merging.

We don't currently include vector paths via oss-fuzz but perhaps we might want to consider doing so?

kleisauke added a commit to kleisauke/oss-fuzz that referenced this pull request Aug 22, 2023
@kleisauke
Member Author

reduceh is 2.5x faster, though it makes little difference to image resize times.

Indeed, somehow vips_resize() / vips_thumbnail() doesn't really benefit from this. It seems that shrink{h,v} + reduce{h,v} is actually slower than reduce{h,v} alone, so we might consider defaulting to gap = 0.0, as noticed in libvips/pyvips#148 (comment).

We don't currently include vector paths via oss-fuzz but perhaps we might want to consider doing so?

I just opened PR google/oss-fuzz#10868 for this. I'll update fuzz/oss_fuzz_build.sh after that lands.

@jcupitt
Member

jcupitt commented Aug 22, 2023

so we might consider defaulting to gap = 0.0, as noticed in libvips/pyvips#148 (comment).

I think this would be a bad idea for large shrinks -- if you are shrinking by x100, for example, reducev would need to read the input image in huge chunks. shrinkv has the nice property of never fetching too many input scanlines in one go.

@kleisauke
Member Author

I think this would be a bad idea for large shrinks

Ah, you're right. I tested this with:

Details

Benchmark script: https://gist.github.com/kleisauke/ea7f7e12ae043aa1151dbc09987600a7

$ curl -LO https://github.com/kleisauke/vips-microbench/raw/master/images/x.jpg
$ python3 gap-bench.py --gap=2.0 -o gap-2.0.json
$ python3 gap-bench.py --gap=0.0 -o gap-0.0.json
$ python3 -m pyperf compare_to gap-2.0.json gap-0.0.json --table
+----------------+---------+----------------------+
| Benchmark      | gap-2.0 | gap-0.0              |
+================+=========+======================+
| 4x             | 567 ms  | 305 ms: 1.86x faster |
+----------------+---------+----------------------+
| 8x             | 424 ms  | 306 ms: 1.39x faster |
+----------------+---------+----------------------+
| 9.4x           | 391 ms  | 303 ms: 1.29x faster |
+----------------+---------+----------------------+
| 16x            | 355 ms  | 315 ms: 1.12x faster |
+----------------+---------+----------------------+
| 64x            | 338 ms  | 415 ms: 1.23x slower |
+----------------+---------+----------------------+
| Geometric mean | (ref)   | 1.17x faster         |
+----------------+---------+----------------------+

Benchmark hidden because not significant (2): 2x, 32x
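
(For anyone wanting to experiment: gap can be overridden per call rather than changing the default. A minimal sketch in C, with an illustrative function name:)

#include <vips/vips.h>

/* Request gap=0.0 for this call only: skip the shrink{h,v} pass and
 * reduce directly, at the cost of fetching input in larger chunks. */
static int
resize_no_gap(VipsImage *in, VipsImage **out)
{
    return vips_resize(in, out, 1.0 / 9.4, "gap", 0.0, NULL);
}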

DavidKorczynski pushed a commit to google/oss-fuzz that referenced this pull request Aug 22, 2023
@jcupitt
Member

jcupitt commented Aug 23, 2023

A big difference in memory use too (`%M` is peak RSS in KB, `%e` elapsed seconds):

john@banana ~/pics $ /usr/bin/time -f %M:%e vips resize x.jpg x2.jpg 0.01
150484:0.42
john@banana ~/pics $ /usr/bin/time -f %M:%e vips resize x.jpg x2.jpg 0.01 --gap 0
511796:0.80

@lovell
Member

lovell commented Sep 1, 2023

My initial testing on an Intel i7-1255U laptop (2 performance cores with hyperthreading plus 8 efficiency cores without, 12 hardware threads in total) with AVX2 suggests there is noticeable variance in multi-threaded resize performance compared with liborc, ranging from +15% at best to -5% at worst, seemingly at random.

I've yet to dig into the details, but it could be a clock speed reduction of the non-hyperthreaded cores when hot, AVX2 "heavy" operations causing slowdown due to throttling / lane widening, or perhaps some operations now running fast enough to cause more cache evictions.

@kleisauke
Member Author

kleisauke commented Sep 10, 2023

Thanks for testing @lovell! The 5% slowdown sounds like a CPU clock throttling issue; does it also occur with `export VIPS_CONCURRENCY=1` or `export VIPS_VECTOR=-2049`?

It could also be due to the over-computation issue mentioned in #2757, which could be circumvented by forcing random access (I'm not sure whether this can be done from the CLI). I'll have a look to see if I can reproduce this on my old AVX2 laptop.

@kleisauke
Member Author

I'll have a look to see if I can reproduce this on my old AVX2 laptop.

I could not reproduce this on my old AVX2 laptop. Tested with:

Details

Benchmark script: https://gist.github.com/kleisauke/a2669ec11118de41f36401415e144fd7

Test environment

  • HP 250 G5 - i5-6200U
  • Fedora 38
  • $ vips --targets
    builtin targets:   AVX3_SPR AVX3_ZEN4 AVX3 AVX2 SSE4 SSE2
    supported targets: AVX2 SSE4 SSSE3 SSE2

Images

Image                                   Dimensions
2569067123_aca715a2ee_o.jpg             2725×2225
alpha-premultiply-2048x1536-paper.png   2048×1536
4.webp                                  1024×772

Results

$ curl -LO https://github.com/lovell/sharp/raw/main/test/fixtures/2569067123_aca715a2ee_o.jpg
$ curl -LO https://github.com/lovell/sharp/raw/main/test/fixtures/alpha-premultiply-2048x1536-paper.png
$ curl -LO https://github.com/lovell/sharp/raw/main/test/fixtures/4.webp

$ python3 thumbnail-bench.py 2569067123_aca715a2ee_o.jpg -o jpeg-highway.json
.....................
720x: Mean +- std dev: 49.2 ms +- 0.8 ms

$ python3 thumbnail-bench.py alpha-premultiply-2048x1536-paper.png -o png-highway.json
.....................
720x: Mean +- std dev: 83.8 ms +- 1.6 ms

$ python3 thumbnail-bench.py 4.webp -o webp-highway.json
.....................
720x: Mean +- std dev: 30.3 ms +- 0.4 ms

$ python3 thumbnail-bench.py 2569067123_aca715a2ee_o.jpg -o jpeg-orc.json
.....................
720x: Mean +- std dev: 60.0 ms +- 1.9 ms

$ python3 thumbnail-bench.py alpha-premultiply-2048x1536-paper.png -o png-orc.json
.....................
720x: Mean +- std dev: 100 ms +- 2 ms

$ python3 thumbnail-bench.py 4.webp -o webp-orc.json
.....................
720x: Mean +- std dev: 30.3 ms +- 0.4 ms

$ python3 -m pyperf compare_to jpeg-orc.json jpeg-highway.json  --table
+-----------+----------+-----------------------+
| Benchmark | jpeg-orc | jpeg-highway          |
+===========+==========+=======================+
| 720x      | 60.0 ms  | 49.2 ms: 1.22x faster |
+-----------+----------+-----------------------+

$ python3 -m pyperf compare_to png-orc.json png-highway.json --table
+-----------+---------+-----------------------+
| Benchmark | png-orc | png-highway           |
+===========+=========+=======================+
| 720x      | 100 ms  | 83.8 ms: 1.20x faster |
+-----------+---------+-----------------------+

$ python3 -m pyperf compare_to webp-orc.json webp-highway.json --table
Benchmark hidden because not significant (1): 720x

Notes

  • The liborc benchmark is done on this PR by compiling with -Dhighway=disabled; this ensures it still benefits from the improvement made in commit 2ece8c2.
  • WebP scale-on-load shrinks directly to target dimensions, so it's expected that there are no performance improvements in the WebP benchmark.
  • The laptop charger was plugged in and the power profile was set to "performance" during the benchmarks.
    $ powerprofilesctl get
    performance

So on this benchmark, Highway is ~16% to ~22% faster when compared with liborc.

@lovell
Member

lovell commented Sep 26, 2023

I've done more testing and can confirm VIPS_CONCURRENCY definitely impacts performance. When set to 1, this branch is consistently ~25% faster. As I increase the concurrency, the variability increases, and performance starts to drop at around a value of 4, which I think suggests CPU throttling. I guess this change increases the importance of limiting concurrency to the number of physical cores.

@jcupitt
Member

jcupitt commented Sep 26, 2023

I think Intel brands this as Turbo Boost.

If the CPU is mostly using just one core, that single core gets about a 20% or 30% clock bump above the standard rated frequency. Once you start to run a couple of cores hot, it'll clock down to normal speeds.

Maybe disable turbo boost and try benchmarking again? I always forget how to do this, but SO suggests:

https://askubuntu.com/a/620114

The other factor might be the cache. Your cores will share L2/L3, so single core performance will in effect get a cache boost.

@kleisauke kleisauke merged commit b32cb5e into libvips:master Sep 26, 2023
6 checks passed
kleisauke added a commit to kleisauke/highway that referenced this pull request Sep 26, 2023
@kleisauke kleisauke added this to the 8.15 milestone Sep 26, 2023
@kleisauke
Member Author

\o/, this will be in libvips 8.15.

@jan-wassenberg

Highway main author here. Great to see this, congrats @kleisauke on the great results and thanks for letting us know :)

I guess this change increases the importance of limiting concurrency to the max physical cores.

Agreed. Hyperthreads share the vector units of a core; on Intel there are essentially two arithmetic ports plus one for shuffles, and it is pretty easy to keep them busy with a single thread. Scheduling threads onto the same core will not help and may actually hurt. When benchmarking, I usually use taskset or numactl to pin threads to the first hyperthread in each core.

In addition to that, faster vectorization means we are closer to being memory-bound, and in particular influenced by background activity that happens to use more of the bandwidth during the test.

@jcupitt
Member

jcupitt commented Sep 26, 2023

Congratulations on landing this huge thing Kleis!

@kmartinez
Member

Just catching up with this - it's "music to my ears" as you can expect ;-) Well done Kleis!
Can't wait to test it on my 7900 too!
