[WIP] SIMD-accelerated IDCT #146

Closed · wants to merge 16 commits
Conversation

lovasoa (Contributor) commented Apr 13, 2020

This commit is a first step towards SIMD-accelerated IDCT.
It optimizes only the first part of the IDCT, and no fallback has been implemented for non-nightly compilers.

Benchmark results:

Benchmarking decode a 512x512 JPEG: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 11.3s or reduce sample count to 40
decode a 512x512 JPEG   time:   [2.2417 ms 2.2633 ms 2.2846 ms]                                   
                        change: [-26.302% -23.320% -20.299%] (p = 0.00 < 0.05)
                        Performance has improved.

Closes: #79
Please merge #144 first

lovasoa (Contributor, Author) commented Apr 13, 2020

@fintelia , @HeroicKatora : This PR is not finalized, but I'd love to get your feedback early.

@lilith : This may be of some interest to you.

lovasoa changed the title from "[WIP] SIMD-accelerated DCT" to "[WIP] SIMD-accelerated IDCT" on Apr 13, 2020
fintelia (Contributor) left a comment

I like this direction! We need to figure out a way to handle cases where packed_simd is unavailable, but perhaps just making it an optional feature and duplicating the dequantize_and_idct_block_* functions would be enough?
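A minimal sketch of the feature-gated approach fintelia suggests (the function name and bodies here are hypothetical stand-ins, not the crate's actual code): the same function resolves to either a SIMD or a scalar implementation depending on whether a `packed_simd` Cargo feature is enabled, so callers never need to know which one they got.

```rust
/// Scalar fallback: dequantize an 8x8 block coefficient by coefficient.
/// Compiled only when the `packed_simd` feature is disabled.
#[cfg(not(feature = "packed_simd"))]
fn dequantize_block(coefficients: &[i16; 64], quantization: &[u16; 64]) -> [i32; 64] {
    let mut out = [0i32; 64];
    for i in 0..64 {
        out[i] = i32::from(coefficients[i]) * i32::from(quantization[i]);
    }
    out
}

/// SIMD version, compiled only when the feature (and a nightly compiler)
/// is available.
#[cfg(feature = "packed_simd")]
fn dequantize_block(coefficients: &[i16; 64], quantization: &[u16; 64]) -> [i32; 64] {
    // ... packed_simd implementation would go here ...
    unimplemented!()
}

fn main() {
    // With the feature off, the scalar path is used transparently.
    let out = dequantize_block(&[2i16; 64], &[3u16; 64]);
    assert!(out.iter().all(|&v| v == 6));
    println!("ok");
}
```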

lovasoa (Contributor, Author) commented Apr 14, 2020

> I like this direction! We need to figure out a way to handle cases where packed_simd is unavailable, but perhaps just making it an optional feature and duplicating the dequantize_and_idct_block_* functions would be enough?

Yes, we can do that. But ideally, I would have wanted even non-SIMD targets to benefit from this change by using emulated SIMD, which is still faster than plain loops. Unfortunately, the API of ssimd seems to be out of date, and the crate seems to be unmaintained...
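A hypothetical sketch of the ssimd idea being described (type and operators invented here for illustration): a SIMD vector emulated by a plain struct wrapping a fixed-size array, with elementwise operators. With optimizations enabled, LLVM's auto-vectorizer can often lower these fixed-length loops to real SIMD instructions, even on a stable compiler.

```rust
use std::ops::{Add, Mul};

/// A "fake" 8-lane i32 vector: just a plain struct, no nightly features.
#[derive(Clone, Copy, PartialEq, Debug)]
struct I32x8([i32; 8]);

impl Add for I32x8 {
    type Output = Self;
    fn add(self, rhs: Self) -> Self {
        // Fixed-length elementwise loop: a prime target for auto-vectorization.
        let mut out = [0i32; 8];
        for i in 0..8 {
            out[i] = self.0[i] + rhs.0[i];
        }
        I32x8(out)
    }
}

impl Mul for I32x8 {
    type Output = Self;
    fn mul(self, rhs: Self) -> Self {
        let mut out = [0i32; 8];
        for i in 0..8 {
            out[i] = self.0[i] * rhs.0[i];
        }
        I32x8(out)
    }
}

fn main() {
    let a = I32x8([1, 2, 3, 4, 5, 6, 7, 8]);
    let b = I32x8([10; 8]);
    assert_eq!((a + b).0, [11, 12, 13, 14, 15, 16, 17, 18]);
    assert_eq!((a * b).0, [10, 20, 30, 40, 50, 60, 70, 80]);
    println!("ok");
}
```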

Review comments on src/idct.rs (outdated, resolved)
lovasoa (Contributor, Author) commented Apr 15, 2020

I have started adapting ssimd to the latest packed_simd interface, so that we can use it with stable compilers here. The amount of work is moderate, since we only use a few SIMD types and operators.

https://github.com/lovasoa/ssimd

Benchmark:
decode a 512x512 JPEG   time:   [2.1719 ms 2.1900 ms 2.2063 ms]
decode a 512x512 JPEG   time:   [2.1308 ms 2.1523 ms 2.1742 ms]
The packed_simd feature enables SIMD.
If the feature isn't enabled, ssimd is used, which uses plain structs to emulate SIMD vectors.
When optimizations are enabled and the target architecture supports it, the compiler should still
emit SIMD instructions for the ssimd code.
This requires transposing the result matrix at the end of the IDCT,
but benchmarking still shows a ~5% performance improvement with this
(on the 512x512 JPEG decode benchmark, when compiled on nightly Rust
with target-cpu=native and the packed_simd feature).
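The commit note above mentions that operating on whole rows as SIMD vectors forces a transpose of the 8x8 result matrix at the end of the IDCT. A scalar sketch of that transpose, assuming a row-major `[i32; 64]` block (a SIMD version would use shuffle/unpack instructions instead of element swaps):

```rust
/// Transpose an 8x8 row-major matrix in place by swapping each element
/// above the diagonal with its mirror below it.
fn transpose_8x8(m: &mut [i32; 64]) {
    for row in 0..8 {
        for col in (row + 1)..8 {
            m.swap(row * 8 + col, col * 8 + row);
        }
    }
}

fn main() {
    // Fill with m[row][col] = row * 8 + col, transpose, then check that
    // element (row, col) now holds the old (col, row) value.
    let mut m = [0i32; 64];
    for i in 0..64 {
        m[i] = i as i32;
    }
    transpose_8x8(&mut m);
    for row in 0..8 {
        for col in 0..8 {
            assert_eq!(m[row * 8 + col], (col * 8 + row) as i32);
        }
    }
    println!("ok");
}
```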
lovasoa (Contributor, Author) commented Apr 16, 2020

Final benchmark

Before this PR

$ RUSTFLAGS="-C llvm-args=--vectorize-slp -C target-cpu=native" cargo bench 'decode a 512x512 JPEG'

decode a 512x512 JPEG   time:   [2.6583 ms 2.6783 ms 2.6984 ms]      

After this PR

all optimizations, packed_simd feature disabled

$ RUSTFLAGS="-C llvm-args=--vectorize-slp -C target-cpu=native" cargo bench 'decode a 512x512 JPEG'

decode a 512x512 JPEG   time:   [3.5087 ms 3.5182 ms 3.5281 ms]                                   

**~ 30% worse**

all optimizations, packed_simd feature enabled

$ RUSTFLAGS="-C llvm-args=--vectorize-slp -C target-cpu=native" cargo +nightly bench --features="packed_simd" 'decode a 512x512 JPEG'

decode a 512x512 JPEG   time:   [2.0040 ms 2.0151 ms 2.0262 ms]                                   

**~ 20% better**

So, contrary to what I expected, the benchmarks didn't uniformly improve: they worsened for the non-SIMD case.

HeroicKatora (Member) commented
Is there something you still wanted to do here? It's still marked as a draft, and the comment above alludes to a partial regression.

lovasoa (Contributor, Author) commented Jun 15, 2020

Hi @HeroicKatora !
Yes, there is a performance regression for the non-SIMD version, so this PR should probably not be merged as-is. I contacted @lilith about this PR, but she told me her company no longer needs this and thus isn't ready to fund it, so I lost a little bit of the motivation needed to work on this.

I wanted to switch from packed_simd to simdeez, but then I found an inconsistency in simdeez, for which I proposed a PR. It took some time, but it is now merged, so there is no more blocker. Are you interested in working on this, @HeroicKatora ?

HeroicKatora (Member) commented

I see, and I can totally understand how that would sap some motivation. Yeah, a total of 20% performance is significant. Switching to a stable 1.0 crate would be positive as well, and if it helps avoid the regression, even better!

I'm looking for a project to focus on next after apng is done, so maybe. The alternative is that I focus on [image-canvas] instead, while still shepherding this if you think it has potential.

veluca93 (Contributor) commented

Hi all!
I was wondering if there'd be some interest in picking this up and, if not, whether anybody here would mind me trying my hand at some handcrafted SIMD optimizations :) (probably with intrinsics for x86/SSE and ARM/NEON, as I think portable SIMD is not stabilized yet)
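A sketch of the intrinsics-based structure veluca93 describes, using the x86 saturating-add intrinsic as a hypothetical stand-in for the real IDCT kernels: a compile-time `cfg` plus runtime feature detection selects the SIMD path, with a portable scalar fallback for every other target (an ARM/NEON path would get its own `cfg` block alongside).

```rust
/// Saturating add of two 16-byte slices, with an SSE2 fast path on x86_64.
fn saturating_add_16(a: &[u8; 16], b: &[u8; 16]) -> [u8; 16] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse2") {
            // SAFETY: we just checked that SSE2 is available at runtime.
            unsafe {
                use std::arch::x86_64::*;
                let va = _mm_loadu_si128(a.as_ptr() as *const __m128i);
                let vb = _mm_loadu_si128(b.as_ptr() as *const __m128i);
                let mut out = [0u8; 16];
                _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, _mm_adds_epu8(va, vb));
                return out;
            }
        }
    }
    // Scalar fallback, compiled for all targets.
    let mut out = [0u8; 16];
    for i in 0..16 {
        out[i] = a[i].saturating_add(b[i]);
    }
    out
}

fn main() {
    // 250 + 10 saturates to 255 on both the SIMD and scalar paths.
    assert_eq!(saturating_add_16(&[250u8; 16], &[10u8; 16]), [255u8; 16]);
    assert_eq!(saturating_add_16(&[1u8; 16], &[2u8; 16]), [3u8; 16]);
    println!("ok");
}
```

Either path produces identical results, which also makes it easy to unit-test the SIMD kernel against the scalar one.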

lovasoa (Contributor, Author) commented Dec 15, 2021

Yes, please pick this up!

Successfully merging this pull request may close these issues:

- Consider accelerated IDCT? (#79)

4 participants