Merged
Conversation
197g
approved these changes
Mar 9, 2026
Comment on lines
+579
to
+583
| // This weird casting to u64 and back to u16 is to help the compiler | ||
| // optimize RGBA8 -> RGBA16 conversions. Without it, the conversion is | ||
| // not properly vectorized and about 30% slower. | ||
| let x: u64 = c8.into(); | ||
| ((x << 8) | x) as u16 |
Member
There was a problem hiding this comment.
What the hell 😂 I'll take it but maybe we should make a more stable version of it by specializing on Primitive or adding a method converting 4 of these at a time.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
I spent a little time optimizing color conversions for
DynamicImage.Changes
f32 -> u8/u16: This changed the most. I replaced
f32::roundwith a simple+ 0.5and removed unnecessary clamping. The mapping of NaN -> 1.0 (uN::MAX) is done usingf32::min(which compiles to a single instruction on x86).This made those conversions around 4x faster. This was a little surprising to me, since
f32::roundis around 10x slower than the addition I replaced it with. I suspect the optimized conversion is new memory-bound on my machine.u8/u16 -> f32: I removed the unnecessary clamping.
I couldn't measure any change in performance. This conversion is memory-bound on my machine. Dropping the image size to 32x32 to ensure everything is in L1 cache makes the difference measurable, but it's very small (around 2%). Replacing division with multiplication would speed it up around 25% for small 32x32 images, but I didn't do that, because it likely won't be faster in most cases in practice and because the results would be slightly different (1 ulp error).
u16 -> u8: I cleaned up the code a little and replaced the comment with a simple derivation for why the implemented expression is correct.
I couldn't measure any change in performance. This doesn't surprise me, since everything I cleaned up, LLVM can optimize away as well.
u8 -> u16: I cleaned up the code a little and added a comment explaining why it does bit operations in u64.
I expanded the benchmark to measure the changes in this PR.
Benchmark
I'll only list the benched functions that changed in perf:
All other benches in
benches/convert.rsremained unchanged.Overall, I'm a little disappointed that those are the only ones I could make faster. But it really seems like all conversions involving f32 operations are now memory-bound on my machine. I could be wrong about that though.
Full benchmark output of this PR