Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MP3: move bounds checks out of the hot loop #101

Merged
merged 8 commits into from Jan 30, 2022

Conversation

Shnatsel
Copy link
Contributor

@Shnatsel Shnatsel commented Jan 28, 2022

Move bounds checks out of the hot loop to be performed only once outside it. This should provide a modest performance boost, especially on CPUs with less powerful speculative execution.

Reduces the output of cargo asm for this function from 488 to 412 lines, when annotated with #[inline(never)] to make it easy to inspect.

Instruction counts before 125 movss 92 mov 85 lea 25 cmp 18 call 16 mulss 14 xor 14 jmp 13 addss 11 jne 6 ud2 6 push 6 pop 6 jae 5 jb 5 add 3 je 1 subss 1 sub 1 ret
Instruction counts after 125 movss 74 lea 68 mov 16 mulss 14 xor 14 jmp 14 cmp 13 addss 12 call 11 jne 6 push 6 pop 5 add 3 je 1 subss 1 sub 1 ret

This doesn't introduce any new panics - the code would have panicked anyway, now the check is simply performed once per function call outside the hot loop.

There might be a more elegant iterator-based way than .try_into().unwrap(); I haven't really looked into it.

Many other functions in this file can be given a similar treatment; this should provide a total boost of a few % in terms of performance for end-to-end decoding. This PR is meant to demonstrate the approach. I could proceed with converting the rest to this idiom, if you wish.

@pdeljanov
Copy link
Owner

I actually used this technique in the new fft module before the fft32 call. It worked pretty well there so I'd be interested in making similar changes to the MP3 decoder if it offers us any performance uplift.

I use this script to benchmark changes against either ffmpeg or a baseline version.

#!/bin/bash
IN="${1@Q}"
hyperfine -m 20 "ffmpeg -threads 1 -benchmark -v 0 -i ${IN} -f null -" "./target/release/symphonia-play --decode-only ${IN}"
# hyperfine -m 20 "./symphonia-play-baseline --decode-only ${IN}" "./target/release/symphonia-play --decode-only ${IN}"

What kind of numbers are you seeing?

@Shnatsel
Copy link
Contributor Author

My machine is usually too noisy to pick up the difference of a few % (which is why I've been relying on instruction counts), but I'll give it a shot.

I know there is some way to measure end-to-end instruction counts too, I just don't know what it is. IIRC rustc benchmarking uses it.

@Shnatsel
Copy link
Contributor Author

I had to crank the test count all the way up to -m 1000 to get results that are reproducible from test to test, but yes, I am seeing a 1% improvement in runtime from this. No difference in instruction count according to perf stat -e instructions, curiously.

@Shnatsel
Copy link
Contributor Author

Shnatsel commented Jan 28, 2022

I've applied the same technique to imdct36() and the loop inside hybrid_synthesis(), and got the total performance increase to 3% compared to master.

The instruction count for hybrid_synthesis() exploded from 1154 to 1658; I think the removed bounds checks have allowed the compiler to unroll the fixed-size loop in hybrid_synthesis().

@pdeljanov
Copy link
Owner

So this is fun. These changes result in incorrect decoding (try running symphonia-check on a few files). I was confused for a while because the code looked correct, but turns out this line:

let sub_band: &mut [f32; 18] = &mut samples[start..(start + 18)].try_into().unwrap();

needs to be

let sub_band: &mut [f32; 18] = (&mut samples[start..(start + 18)]).try_into().unwrap();

I believe in the first case try_into is making a [f32; 18] from the 18 samples, and then you're getting getting a mutable slice to that copy. Whereas in the second case we're getting a mutable slice &mut [f32] first, and then try_into yields &mut [f32; 18] to the original samples.

@Shnatsel
Copy link
Contributor Author

Yep, that seems to have been it! With the fix applied symphonia-check seems to pass. But the performance improvement is back down to 1%.

My apologies for causing breakage - I did not expect such a simple change to cause so much trouble.

@pdeljanov
Copy link
Owner

No problem. It was a very non-obvious issue.

I'm measuring about a 1-3% change on my system (Linux 5.16, Intel Core i7 4790k), though it varies. It's something, but this definitely isn't the hottest part of the decoder so major gains are hard to come by.

Sorry for forgetting to mention it earlier, but there's a clippy warning too. I'll merge the PR after that's cleaned up and I test a bit more.

Thanks!

@pdeljanov pdeljanov added this to the v0.5 milestone Jan 29, 2022
@Shnatsel
Copy link
Contributor Author

Yeah, eliminating bounds checks usually results only in single-digit gains even in hot codepaths. At least x86_64 CPUs are very good at speculating past them.

@Shnatsel
Copy link
Contributor Author

Clippy lint fixed!

@pdeljanov pdeljanov merged commit 3ec7761 into pdeljanov:master Jan 30, 2022
@pdeljanov
Copy link
Owner

Thanks, merged!

@Shnatsel Shnatsel deleted the less-bounds-checks branch February 1, 2022 01:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants