[WIP] SIMD-accelerated IDCT #146

Closed · wants to merge 16 commits
Conversation

lovasoa (Contributor) commented Apr 13, 2020

This commit is a first step towards SIMD-accelerated IDCT.
It optimizes only the first part of the IDCT, and no fallback has been implemented for non-nightly compilers.

Benchmark results:

Benchmarking decode a 512x512 JPEG: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 11.3s or reduce sample count to 40
decode a 512x512 JPEG   time:   [2.2417 ms 2.2633 ms 2.2846 ms]                                   
                        change: [-26.302% -23.320% -20.299%] (p = 0.00 < 0.05)
                        Performance has improved.

Closes: #79
Please merge #144 first

lovasoa (Contributor, Author) commented Apr 13, 2020

@fintelia , @HeroicKatora : This PR is not finalized, but I'd love to get your feedback early.

@lilith : This may be of some interest to you.

lovasoa changed the title from "[WIP] SIMD-accelerated DCT" to "[WIP] SIMD-accelerated IDCT" on Apr 13, 2020
fintelia (Contributor) left a comment

I like this direction! We need to figure out a way to handle cases where packed_simd is unavailable, but perhaps just making it an optional feature and duplicating the dequantize_and_idct_block_* functions would be enough?
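A minimal sketch of the feature-gated approach fintelia suggests (the function name and bodies here are hypothetical stand-ins, not the crate's actual code): the same function resolves to either a SIMD or a scalar implementation depending on whether a `packed_simd` Cargo feature is enabled, so callers never need to know which one they got.

```rust
/// Scalar fallback: dequantize an 8x8 block coefficient by coefficient.
/// Compiled only when the `packed_simd` feature is disabled.
#[cfg(not(feature = "packed_simd"))]
fn dequantize_block(coefficients: &[i16; 64], quantization: &[u16; 64]) -> [i32; 64] {
    let mut out = [0i32; 64];
    for i in 0..64 {
        out[i] = i32::from(coefficients[i]) * i32::from(quantization[i]);
    }
    out
}

/// SIMD version, compiled only when the feature (and a nightly compiler)
/// is available.
#[cfg(feature = "packed_simd")]
fn dequantize_block(coefficients: &[i16; 64], quantization: &[u16; 64]) -> [i32; 64] {
    // ... packed_simd implementation would go here ...
    unimplemented!()
}

fn main() {
    // With the feature off, the scalar path is used transparently.
    let out = dequantize_block(&[2i16; 64], &[3u16; 64]);
    assert!(out.iter().all(|&v| v == 6));
    println!("ok");
}
```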

lovasoa (Contributor, Author) commented Apr 14, 2020

> I like this direction! We need to figure out a way to handle cases where packed_simd is unavailable, but perhaps just making it an optional feature and duplicating the dequantize_and_idct_block_* functions would be enough?

Yes, we can do that. But ideally, I would have wanted even non-SIMD targets to benefit from this change by using emulated SIMD, which is still faster than plain loops. Unfortunately, the API of ssimd seems to be out of date, and the crate seems to be unmaintained...
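A hypothetical sketch of the ssimd idea being described (type and operators invented here for illustration): a SIMD vector emulated by a plain struct wrapping a fixed-size array, with elementwise operators. With optimizations enabled, LLVM's auto-vectorizer can often lower these fixed-length loops to real SIMD instructions, even on a stable compiler.

```rust
use std::ops::{Add, Mul};

/// A "fake" 8-lane i32 vector: just a plain struct, no nightly features.
#[derive(Clone, Copy, PartialEq, Debug)]
struct I32x8([i32; 8]);

impl Add for I32x8 {
    type Output = Self;
    fn add(self, rhs: Self) -> Self {
        // Fixed-length elementwise loop: a prime target for auto-vectorization.
        let mut out = [0i32; 8];
        for i in 0..8 {
            out[i] = self.0[i] + rhs.0[i];
        }
        I32x8(out)
    }
}

impl Mul for I32x8 {
    type Output = Self;
    fn mul(self, rhs: Self) -> Self {
        let mut out = [0i32; 8];
        for i in 0..8 {
            out[i] = self.0[i] * rhs.0[i];
        }
        I32x8(out)
    }
}

fn main() {
    let a = I32x8([1, 2, 3, 4, 5, 6, 7, 8]);
    let b = I32x8([10; 8]);
    assert_eq!((a + b).0, [11, 12, 13, 14, 15, 16, 17, 18]);
    assert_eq!((a * b).0, [10, 20, 30, 40, 50, 60, 70, 80]);
    println!("ok");
}
```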

Review comments on src/idct.rs (outdated, resolved)
lovasoa (Contributor, Author) commented Apr 15, 2020

I have started adapting ssimd to the latest packed_simd interface, so that we can use it with stable compilers here. The amount of work is moderate, since we only use a few SIMD types and operators.

https://github.com/lovasoa/ssimd

Benchmark:
decode a 512x512 JPEG   time:   [2.1719 ms 2.1900 ms 2.2063 ms]
decode a 512x512 JPEG   time:   [2.1308 ms 2.1523 ms 2.1742 ms]
The packed_simd feature enables SIMD.
If the feature isn't enabled, ssimd is used, which uses plain structs to emulate SIMD vectors.
When optimizations are enabled and the target architecture supports it, the compiler should still
emit SIMD instructions for the ssimd code.
This requires transposing the result matrix at the end of the IDCT,
but benchmarking still shows a ~5% performance improvement with this
(on the 512x512 JPEG decode benchmark, when compiled on nightly Rust
with target-cpu=native and the packed_simd feature).
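The commit note above mentions that operating on whole rows as SIMD vectors forces a transpose of the 8x8 result matrix at the end of the IDCT. A scalar sketch of that transpose, assuming a row-major `[i32; 64]` block (a SIMD version would use shuffle/unpack instructions instead of element swaps):

```rust
/// Transpose an 8x8 row-major matrix in place by swapping each element
/// above the diagonal with its mirror below it.
fn transpose_8x8(m: &mut [i32; 64]) {
    for row in 0..8 {
        for col in (row + 1)..8 {
            m.swap(row * 8 + col, col * 8 + row);
        }
    }
}

fn main() {
    // Fill with m[row][col] = row * 8 + col, transpose, then check that
    // element (row, col) now holds the old (col, row) value.
    let mut m = [0i32; 64];
    for i in 0..64 {
        m[i] = i as i32;
    }
    transpose_8x8(&mut m);
    for row in 0..8 {
        for col in 0..8 {
            assert_eq!(m[row * 8 + col], (col * 8 + row) as i32);
        }
    }
    println!("ok");
}
```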
lovasoa (Contributor, Author) commented Apr 16, 2020

Final benchmark

Before this PR

$ RUSTFLAGS="-C llvm-args=--vectorize-slp -C target-cpu=native" cargo bench 'decode a 512x512 JPEG'

decode a 512x512 JPEG   time:   [2.6583 ms 2.6783 ms 2.6984 ms]      

After this PR

all optimizations, packed_simd feature disabled

$ RUSTFLAGS="-C llvm-args=--vectorize-slp -C target-cpu=native" cargo bench 'decode a 512x512 JPEG'

decode a 512x512 JPEG   time:   [3.5087 ms 3.5182 ms 3.5281 ms]                                   

**~ 30% worse**

all optimizations, packed_simd feature enabled

$ RUSTFLAGS="-C llvm-args=--vectorize-slp -C target-cpu=native" cargo +nightly bench --features="packed_simd" 'decode a 512x512 JPEG'

decode a 512x512 JPEG   time:   [2.0040 ms 2.0151 ms 2.0262 ms]                                   

**~ 20% better**

So, contrary to what I expected, the benchmarks didn't uniformly improve: they worsened for the non-SIMD case.

HeroicKatora (Member) commented
Is there something you still wanted to do here? It's still marked as a draft, and the comment above alludes to a partial regression.

lovasoa (Contributor, Author) commented Jun 15, 2020

Hi @HeroicKatora !
Yes, there is a performance regression for the non-SIMD version, so this PR should probably not be merged as-is. I contacted @lilith about this PR, but she told me her company no longer needs this and thus isn't ready to fund it, so I lost a little bit of the motivation needed to work on this.

I wanted to switch from packed_simd to simdeez, but then I found an inconsistency in simdeez, for which I proposed a PR. It took some time, but it is now merged, so there is no more blocker. Are you interested in working on this, @HeroicKatora ?

HeroicKatora (Member) commented

I see, and I can totally understand how that would sap some motivation. Yeah, a total of 20% performance is significant. Switching to a stable 1.0 crate would be positive as well, and if it helps avoid the regression, even better!

I'm looking for a project to focus on next after apng is done, so maybe. The alternative is that I focus on [image-canvas] instead, while still shepherding this if you think it has potential.

veluca93 (Contributor) commented

Hi all!
I was wondering if there'd be some interest in picking this up and, if not, whether anybody here would mind me trying my hand at some handcrafted SIMD optimizations :) (probably with intrinsics for x86/SSE and ARM/NEON, as I think portable SIMD is not stabilized yet)
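A sketch of the intrinsics-based structure veluca93 describes, using the x86 saturating-add intrinsic as a hypothetical stand-in for the real IDCT kernels: a compile-time `cfg` plus runtime feature detection selects the SIMD path, with a portable scalar fallback for every other target (an ARM/NEON path would get its own `cfg` block alongside).

```rust
/// Saturating add of two 16-byte slices, with an SSE2 fast path on x86_64.
fn saturating_add_16(a: &[u8; 16], b: &[u8; 16]) -> [u8; 16] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse2") {
            // SAFETY: we just checked that SSE2 is available at runtime.
            unsafe {
                use std::arch::x86_64::*;
                let va = _mm_loadu_si128(a.as_ptr() as *const __m128i);
                let vb = _mm_loadu_si128(b.as_ptr() as *const __m128i);
                let mut out = [0u8; 16];
                _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, _mm_adds_epu8(va, vb));
                return out;
            }
        }
    }
    // Scalar fallback, compiled for all targets.
    let mut out = [0u8; 16];
    for i in 0..16 {
        out[i] = a[i].saturating_add(b[i]);
    }
    out
}

fn main() {
    // 250 + 10 saturates to 255 on both the SIMD and scalar paths.
    assert_eq!(saturating_add_16(&[250u8; 16], &[10u8; 16]), [255u8; 16]);
    assert_eq!(saturating_add_16(&[1u8; 16], &[2u8; 16]), [3u8; 16]);
    println!("ok");
}
```

Either path produces identical results, which also makes it easy to unit-test the SIMD kernel against the scalar one.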

lovasoa (Contributor, Author) commented Dec 15, 2021

Yes, please pick this up!

Successfully merging this pull request may close these issues:

- Consider accelerated IDCT? (#79)

4 participants