Faster from primitive #2834

Merged
197g merged 4 commits into image-rs:main from RunDevelopment:faster-from-primitive
Mar 9, 2026
Conversation

@RunDevelopment
Member

I spent a little time optimizing color conversions for DynamicImage.

Changes

  • f32 -> u8/u16: This changed the most. I replaced f32::round with a simple + 0.5 and removed unnecessary clamping. The mapping of NaN -> 1.0 (uN::MAX) is done using f32::min (which compiles to a single instruction on x86).

    This made those conversions around 4x faster. This was a little surprising to me, since f32::round is around 10x slower than the addition I replaced it with. I suspect the optimized conversion is now memory-bound on my machine.

  • u8/u16 -> f32: I removed the unnecessary clamping.

    I couldn't measure any change in performance. This conversion is memory-bound on my machine. Dropping the image size to 32x32 to ensure everything is in L1 cache makes the difference measurable, but it's very small (around 2%). Replacing division with multiplication would speed it up around 25% for small 32x32 images, but I didn't do that, because it likely won't be faster in most cases in practice and because the results would be slightly different (1 ulp error).

  • u16 -> u8: I cleaned up the code a little and replaced the comment with a simple derivation for why the implemented expression is correct.

    I couldn't measure any change in performance. This doesn't surprise me, since everything I cleaned up, LLVM can optimize away as well.

  • u8 -> u16: I cleaned up the code a little and added a comment explaining why it does bit operations in u64.

  • I expanded the benchmark to measure the changes in this PR.
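
A minimal sketch of the f32 -> u8/u16 approach described in the first bullet, assuming normalized inputs in [0, 1]. The function names are hypothetical and this is not the code from the PR; it only illustrates how `f32::min` can absorb NaN and how Rust's saturating `as` cast makes a lower clamp unnecessary.

```rust
/// Hypothetical helper: map a normalized f32 to u8.
pub fn f32_to_u8(x: f32) -> u8 {
    // f32::min returns the non-NaN operand, so NaN becomes 1.0 here,
    // and values above 1.0 are capped in the same step.
    let capped = x.min(1.0);
    // Adding 0.5 before the truncating cast stands in for f32::round.
    // `as u8` saturates, so negative inputs (including -inf) become 0
    // without an explicit lower clamp.
    (capped * 255.0 + 0.5) as u8
}

/// Hypothetical helper: the same idea for u16.
pub fn f32_to_u16(x: f32) -> u16 {
    (x.min(1.0) * 65535.0 + 0.5) as u16
}

/// Hypothetical helper: u8 -> f32 by division. The result is always
/// within [0, 1], which is why no clamp is needed in this direction.
pub fn u8_to_f32(x: u8) -> f32 {
    x as f32 / 255.0
}
```

With these definitions, `f32_to_u8(f32::NAN)` yields 255 and `f32_to_u8(-3.0)` yields 0. The sketch makes no claim of matching `f32::round` bit-for-bit on inputs that sit extremely close to a rounding boundary.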

Benchmark

I'll only list the benched functions that changed in perf:

| Bench | Old | New | Change |
| --- | --- | --- | --- |
| cast_dynamic_rgba32f_rgb8 | 764.72 µs | 211.68 µs | -72.525% |
| cast_dynamic_rgba32f_rgba8 | 1.0827 ms | 221.45 µs | -79.431% |
| cast_dynamic_rgba32f_rgb16 | 953.11 µs | 228.23 µs | -72.957% |
| cast_dynamic_rgba32f_rgba16 | 1.0224 ms | 225.19 µs | -78.107% |

All other benches in benches/convert.rs remained unchanged.
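
To relate the timings above to the "around 4x" figure, the speedup factor is simply old time divided by new time. The snippet below is an illustrative calculation using the four old/new pairs from the table (all converted to µs); the names `speedup` and `ROWS` are made up for this example.

```rust
/// Speedup factor given old and new timings in the same unit.
pub fn speedup(old: f64, new: f64) -> f64 {
    old / new
}

/// (bench name, old time in µs, new time in µs), copied from the table.
pub const ROWS: [(&str, f64, f64); 4] = [
    ("cast_dynamic_rgba32f_rgb8", 764.72, 211.68),
    ("cast_dynamic_rgba32f_rgba8", 1082.7, 221.45),
    ("cast_dynamic_rgba32f_rgb16", 953.11, 228.23),
    ("cast_dynamic_rgba32f_rgba16", 1022.4, 225.19),
];

/// Returns each bench name paired with its speedup factor.
pub fn speedups() -> Vec<(&'static str, f64)> {
    ROWS.iter()
        .map(|&(name, old, new)| (name, speedup(old, new)))
        .collect()
}
```

All four factors fall between 3.5x and 5x.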

Overall, I'm a little disappointed that those are the only ones I could make faster. But it really seems like all conversions involving f32 operations are now memory-bound on my machine. I could be wrong about that though.

Full benchmark output of this PR
cast_dynamic_rgba8_rgb8 time:   [36.431 µs 36.511 µs 36.598 µs]
                        change: [-0.5191% +0.4075% +1.4537%] (p = 0.45 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

cast_dynamic_rgba8_rgba8
                        time:   [6.8210 µs 6.8350 µs 6.8513 µs]
                        change: [-3.3241% -2.4207% -1.4005%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  6 (6.00%) high mild
  7 (7.00%) high severe

cast_dynamic_rgba8_rgb16
                        time:   [64.219 µs 64.427 µs 64.653 µs]
                        change: [-2.1792% -1.0303% +0.1578%] (p = 0.09 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  1 (1.00%) high mild
  5 (5.00%) high severe

cast_dynamic_rgba8_rgba16
                        time:   [26.398 µs 26.516 µs 26.643 µs]
                        change: [+0.0070% +1.3666% +2.5613%] (p = 0.03 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

cast_dynamic_rgba8_rgb32f
                        time:   [125.84 µs 126.16 µs 126.51 µs]
                        change: [-1.6634% -0.3934% +0.6685%] (p = 0.55 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe

cast_dynamic_rgba8_rgba32f
                        time:   [239.84 µs 244.21 µs 248.30 µs]
                        change: [-3.1970% -0.6553% +1.8503%] (p = 0.62 > 0.05)
                        No change in performance detected.
Found 30 outliers among 100 measurements (30.00%)
  15 (15.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high mild
  11 (11.00%) high severe

cast_dynamic_luma8_luma16
                        time:   [3.1003 µs 3.1088 µs 3.1179 µs]
                        change: [+1.3416% +2.2472% +3.1901%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe

cast_dynamic_luma8_luma_alpha8
                        time:   [3.1374 µs 3.1443 µs 3.1520 µs]
                        change: [+0.8242% +1.6446% +2.3048%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  7 (7.00%) high mild
  4 (4.00%) high severe

cast_dynamic_luma8_luma_alpha16
                        time:   [7.6762 µs 7.6945 µs 7.7159 µs]
                        change: [-1.6185% -0.7899% +0.0411%] (p = 0.06 > 0.05)
                        No change in performance detected.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) high mild
  10 (10.00%) high severe

cast_dynamic_luma_alpha8_luma_alpha16
                        time:   [7.8329 µs 7.8484 µs 7.8658 µs]
                        change: [+1.6826% +2.6069% +3.4518%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  2 (2.00%) low severe
  2 (2.00%) low mild
  4 (4.00%) high mild
  5 (5.00%) high severe

cast_dynamic_luma_alpha8_luma8
                        time:   [27.516 µs 27.579 µs 27.652 µs]
                        change: [-0.9844% -0.0644% +0.9082%] (p = 0.90 > 0.05)
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe

cast_dynamic_luma_alpha8_luma16
                        time:   [23.351 µs 23.396 µs 23.447 µs]
                        change: [-1.6613% -0.7873% +0.1536%] (p = 0.08 > 0.05)
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  6 (6.00%) high mild
  6 (6.00%) high severe

cast_dynamic_luma_alpha16_luma_alpha16
                        time:   [6.7388 µs 6.7536 µs 6.7701 µs]
                        change: [+1.6782% +2.6128% +3.6859%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe

cast_dynamic_luma_alpha16_luma8
                        time:   [21.634 µs 21.688 µs 21.751 µs]
                        change: [-0.7914% -0.0336% +0.6721%] (p = 0.94 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

cast_dynamic_luma_alpha16_luma16
                        time:   [13.000 µs 13.027 µs 13.058 µs]
                        change: [+0.1235% +1.0986% +2.0638%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) high mild
  8 (8.00%) high severe

cast_dynamic_rgba32f_rgb8
                        time:   [211.16 µs 211.68 µs 212.31 µs]
                        change: [-73.168% -72.525% -72.090%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

cast_dynamic_rgba32f_rgba8
                        time:   [220.60 µs 221.45 µs 222.45 µs]
                        change: [-79.622% -79.431% -79.229%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  11 (11.00%) high severe

cast_dynamic_rgba32f_rgb16
                        time:   [227.73 µs 228.23 µs 228.79 µs]
                        change: [-74.802% -72.957% -71.155%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

cast_dynamic_rgba32f_rgba16
                        time:   [224.58 µs 225.19 µs 225.84 µs]
                        change: [-78.398% -78.107% -77.840%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  3 (3.00%) high severe

cast_dynamic_rgba32f_rgb32f
                        time:   [56.053 µs 56.211 µs 56.393 µs]
                        change: [-4.6539% -3.6654% -2.6325%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe

cast_dynamic_rgba32f_rgba32f
                        time:   [232.61 µs 237.75 µs 242.48 µs]
                        change: [-9.0033% -6.8416% -4.5868%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 28 outliers among 100 measurements (28.00%)
  15 (15.00%) low severe
  1 (1.00%) low mild
  5 (5.00%) high mild
  7 (7.00%) high severe

cast_dynamic_rgba8_l8   time:   [351.94 µs 355.01 µs 359.06 µs]
                        change: [+5.4899% +10.878% +17.032%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
  1 (1.00%) high mild
  15 (15.00%) high severe

cast_dynamic_rgba8_l16  time:   [363.22 µs 364.91 µs 367.05 µs]
                        change: [-3.0519% -1.7687% -0.1828%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  4 (4.00%) high mild
  9 (9.00%) high severe

cast_dynamic_rgba8_la16 time:   [818.04 µs 820.30 µs 823.14 µs]
                        change: [-1.8467% -0.6053% +0.4350%] (p = 0.35 > 0.05)
                        No change in performance detected.
Found 16 outliers among 100 measurements (16.00%)
  7 (7.00%) high mild
  9 (9.00%) high severe

Comment on lines +579 to +583
// This weird casting to u64 and back to u16 is to help the compiler
// optimize RGBA8 -> RGBA16 conversions. Without it, the conversion is
// not properly vectorized and about 30% slower.
let x: u64 = c8.into();
((x << 8) | x) as u16
Member

What the hell 😂 I'll take it but maybe we should make a more stable version of it by specializing on Primitive or adding a method converting 4 of these at a time.
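
For context, a small sketch of the integer conversions under discussion. `u8_to_u16` mirrors the quoted snippet; `u16_to_u8` uses an assumed rounding formula, `(c + 128) / 257`, that the text above does not spell out; and `u8x4_to_u16x4` is a purely hypothetical take on the "4 at a time" idea, not an existing API. The reasoning: since 65535 = 255 * 257, scaling a u8 up to u16 is a multiplication by 257, which equals `(x << 8) | x`, and scaling back down is a rounded division by 257.

```rust
/// u8 -> u16 by replicating the byte, as in the quoted snippet.
/// (x << 8) | x equals x * 257 because x < 256, so the bits never overlap.
pub fn u8_to_u16(c8: u8) -> u16 {
    let x: u64 = c8.into();
    ((x << 8) | x) as u16
}

/// u16 -> u8 with rounding (assumed formula, for illustration only).
/// round(c * 255 / 65535) = round(c / 257) = floor((c + 128) / 257),
/// because 257 is odd and so c / 257 can never land exactly on a .5 tie.
pub fn u16_to_u8(c16: u16) -> u8 {
    ((u32::from(c16) + 128) / 257) as u8
}

/// Hypothetical "4 at a time" variant operating on one RGBA pixel.
pub fn u8x4_to_u16x4(px: [u8; 4]) -> [u16; 4] {
    px.map(|c| u16::from(c) * 257)
}
```

Whether a per-pixel variant like this vectorizes as well as the u64 trick would need to be measured; the sketch only shows that the results are identical.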

@197g 197g merged commit d928046 into image-rs:main Mar 9, 2026
31 checks passed
@RunDevelopment RunDevelopment deleted the faster-from-primitive branch March 9, 2026 17:07
2 participants