Faster from primitive by RunDevelopment · Pull Request #2834 · image-rs/image

RunDevelopment · 2026-03-09T16:07:30Z

I spent a little time optimizing color conversions for DynamicImage.

Changes

f32 -> u8/u16: This changed the most. I replaced f32::round with a simple + 0.5 and removed unnecessary clamping. The mapping of NaN -> 1.0 (uN::MAX) is done using f32::min (which compiles to a single instruction on x86).

This made those conversions around 4x faster. This was a little surprising to me, since f32::round is around 10x slower than the addition I replaced it with. I suspect the optimized conversion is now memory-bound on my machine.

u8/u16 -> f32: I removed the unnecessary clamping.

I couldn't measure any change in performance. This conversion is memory-bound on my machine. Dropping the image size to 32x32 to ensure everything is in L1 cache makes the difference measurable, but it's very small (around 2%). Replacing division with multiplication would speed it up around 25% for small 32x32 images, but I didn't do that, because it likely won't be faster in most cases in practice and because the results would be slightly different (1 ulp error).

u16 -> u8: I cleaned up the code a little and replaced the comment with a simple derivation for why the implemented expression is correct.

I couldn't measure any change in performance. This doesn't surprise me, since everything I cleaned up, LLVM can optimize away as well.

u8 -> u16: I cleaned up the code a little and added a comment explaining why it does bit operations in u64.

I expanded the benchmark to measure the changes in this PR.
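
To make the float conversions described above concrete, here is a minimal sketch. It is not the code from this PR: the function names are hypothetical, and it assumes normalized samples where 1.0 maps to uN::MAX. It relies on two documented Rust behaviors: f32::min returns the other operand when one operand is NaN, and float-to-integer `as` casts saturate, so no explicit lower clamp is needed.

```rust
/// Hypothetical helper (not the crate's API): normalized f32 -> u8.
fn f32_to_u8(x: f32) -> u8 {
    // `f32::min` returns the non-NaN operand, so NaN becomes 1.0 here.
    // Values above 1.0 (including +inf) are capped at 1.0 as well.
    let capped = x.min(1.0);
    // Adding 0.5 and truncating stands in for `f32::round`.
    // The `as` cast saturates, so negative inputs (and -inf) become 0.
    (capped * 255.0 + 0.5) as u8
}

/// Hypothetical helper: the same idea for u16.
fn f32_to_u16(x: f32) -> u16 {
    (x.min(1.0) * 65535.0 + 0.5) as u16
}

/// u8 -> f32 using division, the variant the description says was kept.
fn u8_to_f32_div(x: u8) -> f32 {
    x as f32 / 255.0
}

/// u8 -> f32 using multiplication by a rounded reciprocal, the variant the
/// description says was rejected because results can differ slightly.
fn u8_to_f32_mul(x: u8) -> f32 {
    x as f32 * (1.0 / 255.0)
}
```

Comparing u8_to_f32_div and u8_to_f32_mul over all 256 inputs is a quick way to see how far the two variants can drift apart on a given platform.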

Benchmark

I'll only list the benched functions that changed in perf:

Bench	Old	New	Change
cast_dynamic_rgba32f_rgb8	764.72 µs	211.68 µs	-72.525%
cast_dynamic_rgba32f_rgba8	1.0827 ms	221.45 µs	-79.431%
cast_dynamic_rgba32f_rgb16	953.11 µs	228.23 µs	-72.957%
cast_dynamic_rgba32f_rgba16	1.0224 ms	225.19 µs	-78.107%

All other benches in benches/convert.rs remained unchanged.

Overall, I'm a little disappointed that those are the only ones I could make faster. But it really seems like all conversions involving f32 operations are now memory-bound on my machine. I could be wrong about that though.

Full benchmark output of this PR

cast_dynamic_rgba8_rgb8 time:   [36.431 µs 36.511 µs 36.598 µs]
                        change: [-0.5191% +0.4075% +1.4537%] (p = 0.45 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

cast_dynamic_rgba8_rgba8
                        time:   [6.8210 µs 6.8350 µs 6.8513 µs]
                        change: [-3.3241% -2.4207% -1.4005%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  6 (6.00%) high mild
  7 (7.00%) high severe

cast_dynamic_rgba8_rgb16
                        time:   [64.219 µs 64.427 µs 64.653 µs]
                        change: [-2.1792% -1.0303% +0.1578%] (p = 0.09 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  1 (1.00%) high mild
  5 (5.00%) high severe

cast_dynamic_rgba8_rgba16
                        time:   [26.398 µs 26.516 µs 26.643 µs]
                        change: [+0.0070% +1.3666% +2.5613%] (p = 0.03 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

cast_dynamic_rgba8_rgb32f
                        time:   [125.84 µs 126.16 µs 126.51 µs]
                        change: [-1.6634% -0.3934% +0.6685%] (p = 0.55 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe

cast_dynamic_rgba8_rgba32f
                        time:   [239.84 µs 244.21 µs 248.30 µs]
                        change: [-3.1970% -0.6553% +1.8503%] (p = 0.62 > 0.05)
                        No change in performance detected.
Found 30 outliers among 100 measurements (30.00%)
  15 (15.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high mild
  11 (11.00%) high severe

cast_dynamic_luma8_luma16
                        time:   [3.1003 µs 3.1088 µs 3.1179 µs]
                        change: [+1.3416% +2.2472% +3.1901%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe

cast_dynamic_luma8_luma_alpha8
                        time:   [3.1374 µs 3.1443 µs 3.1520 µs]
                        change: [+0.8242% +1.6446% +2.3048%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  7 (7.00%) high mild
  4 (4.00%) high severe

cast_dynamic_luma8_luma_alpha16
                        time:   [7.6762 µs 7.6945 µs 7.7159 µs]
                        change: [-1.6185% -0.7899% +0.0411%] (p = 0.06 > 0.05)
                        No change in performance detected.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) high mild
  10 (10.00%) high severe

cast_dynamic_luma_alpha8_luma_alpha16
                        time:   [7.8329 µs 7.8484 µs 7.8658 µs]
                        change: [+1.6826% +2.6069% +3.4518%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  2 (2.00%) low severe
  2 (2.00%) low mild
  4 (4.00%) high mild
  5 (5.00%) high severe

cast_dynamic_luma_alpha8_luma8
                        time:   [27.516 µs 27.579 µs 27.652 µs]
                        change: [-0.9844% -0.0644% +0.9082%] (p = 0.90 > 0.05)
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe

cast_dynamic_luma_alpha8_luma16
                        time:   [23.351 µs 23.396 µs 23.447 µs]
                        change: [-1.6613% -0.7873% +0.1536%] (p = 0.08 > 0.05)
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  6 (6.00%) high mild
  6 (6.00%) high severe

cast_dynamic_luma_alpha16_luma_alpha16
                        time:   [6.7388 µs 6.7536 µs 6.7701 µs]
                        change: [+1.6782% +2.6128% +3.6859%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe

cast_dynamic_luma_alpha16_luma8
                        time:   [21.634 µs 21.688 µs 21.751 µs]
                        change: [-0.7914% -0.0336% +0.6721%] (p = 0.94 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

cast_dynamic_luma_alpha16_luma16
                        time:   [13.000 µs 13.027 µs 13.058 µs]
                        change: [+0.1235% +1.0986% +2.0638%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) high mild
  8 (8.00%) high severe

cast_dynamic_rgba32f_rgb8
                        time:   [211.16 µs 211.68 µs 212.31 µs]
                        change: [-73.168% -72.525% -72.090%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

cast_dynamic_rgba32f_rgba8
                        time:   [220.60 µs 221.45 µs 222.45 µs]
                        change: [-79.622% -79.431% -79.229%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  11 (11.00%) high severe

cast_dynamic_rgba32f_rgb16
                        time:   [227.73 µs 228.23 µs 228.79 µs]
                        change: [-74.802% -72.957% -71.155%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

cast_dynamic_rgba32f_rgba16
                        time:   [224.58 µs 225.19 µs 225.84 µs]
                        change: [-78.398% -78.107% -77.840%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  3 (3.00%) high severe

cast_dynamic_rgba32f_rgb32f
                        time:   [56.053 µs 56.211 µs 56.393 µs]
                        change: [-4.6539% -3.6654% -2.6325%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe

cast_dynamic_rgba32f_rgba32f
                        time:   [232.61 µs 237.75 µs 242.48 µs]
                        change: [-9.0033% -6.8416% -4.5868%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 28 outliers among 100 measurements (28.00%)
  15 (15.00%) low severe
  1 (1.00%) low mild
  5 (5.00%) high mild
  7 (7.00%) high severe

cast_dynamic_rgba8_l8   time:   [351.94 µs 355.01 µs 359.06 µs]
                        change: [+5.4899% +10.878% +17.032%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
  1 (1.00%) high mild
  15 (15.00%) high severe

cast_dynamic_rgba8_l16  time:   [363.22 µs 364.91 µs 367.05 µs]
                        change: [-3.0519% -1.7687% -0.1828%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  4 (4.00%) high mild
  9 (9.00%) high severe

cast_dynamic_rgba8_la16 time:   [818.04 µs 820.30 µs 823.14 µs]
                        change: [-1.8467% -0.6053% +0.4350%] (p = 0.35 > 0.05)
                        No change in performance detected.
Found 16 outliers among 100 measurements (16.00%)
  7 (7.00%) high mild
  9 (9.00%) high severe

197g · 2026-03-09T16:44:10Z

src/color.rs

+        // This weird casting to u64 and back to u16 is to help the compiler
+        // optimize RGBA8 -> RGBA16 conversions. Without it, the conversion is
+        // not properly vectorized and about 30% slower.
+        let x: u64 = c8.into();
+        ((x << 8) | x) as u16


What the hell 😂 I'll take it but maybe we should make a more stable version of it by specializing on Primitive or adding a method converting 4 of these at a time.
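
For readers puzzling over the snippet above: `(x << 8) | x` on an 8-bit value is the same as multiplying by 257, which is exactly the ratio 65535 / 255. The sketch below shows that equivalence, plus two things that are assumptions rather than code from this PR: a rounding inverse for u16 -> u8 (the PR description mentions a derivation but does not quote the expression), and a hypothetical four-at-a-time helper along the lines of the suggestion above. All function names are made up for illustration.

```rust
/// u8 -> u16 by bit replication, routed through u64 as in the reviewed snippet.
fn u8_to_u16(c8: u8) -> u16 {
    let x: u64 = c8.into();
    // (x << 8) | x == x * 256 + x == x * 257, and 255 * 257 == 65535.
    ((x << 8) | x) as u16
}

/// Assumed rounding inverse (not quoted in the PR): round(c16 / 257).
/// Since 257 is odd, c16 / 257 never lands exactly on a .5 tie, so
/// floor((c16 + 128) / 257) equals rounding to nearest.
fn u16_to_u8(c16: u16) -> u8 {
    ((u32::from(c16) + 128) / 257) as u8
}

/// Hypothetical helper converting one RGBA pixel (4 channels) at a time.
fn rgba8_to_rgba16(px: [u8; 4]) -> [u16; 4] {
    px.map(u8_to_u16)
}
```

Whether a helper like rgba8_to_rgba16 vectorizes as well as the u64 trick would have to be measured; the sketch only illustrates the shape of the idea.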

RunDevelopment added 4 commits March 9, 2026 14:19

Optimized FromPrimitive implementations

6d753c8

Recover lost perf

a52c95f

Revert int -> f32 since perf didn't change

cf42b87

Expand benchmark

2da5953

197g approved these changes Mar 9, 2026

197g merged commit d928046 into image-rs:main Mar 9, 2026
31 checks passed

RunDevelopment deleted the faster-from-primitive branch March 9, 2026 17:07

RunDevelopment mentioned this pull request Mar 10, 2026

Don't use f32::round for Cicp #2837

Merged
