perf(vello_cpu): statically dispatch f32 image sampling quality#1343

Merged: tomcur merged 4 commits into linebender:main from tomcur:f32-static-quality-dispatch on Jan 15, 2026

@tomcur (Member) commented Jan 6, 2026

By using a const generic to dispatch statically between bilinear and bicubic sampling, we get a roughly 7% reduction in run time on x86 for bilinear sampling (medium quality) in the f32 pipeline. Timings for bicubic sampling (high quality) appear to be unaffected.
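For readers unfamiliar with the pattern, here is a minimal sketch of static dispatch through a `const` generic. It is not the code from this PR: the names `Quality`, `fetch`, `sample`, `sample_row` and `sample_row_dispatch`, the use of a `bool` parameter, and the 1-D linear/Catmull-Rom kernels (standing in for bilinear/bicubic) are all assumptions made for illustration. The idea is that the quality is matched once per row, and each monomorphized copy of the inner loop contains only one of the two kernels.

```rust
/// Hypothetical quality setting: `Medium` stands for bilinear, `High` for bicubic.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Quality {
    Medium,
    High,
}

/// Fetches a texel, clamping the index to the valid range.
/// Assumes `texels` is non-empty.
fn fetch(texels: &[f32], i: isize) -> f32 {
    let last = texels.len() as isize - 1;
    texels[i.clamp(0, last) as usize]
}

/// Samples a 1-D row of texels at fractional position `x`.
/// `BICUBIC` is known at compile time, so each instantiation of this
/// function keeps only one arm of the `if` below.
fn sample<const BICUBIC: bool>(texels: &[f32], x: f32) -> f32 {
    let base = x.floor();
    let t = x - base;
    let i = base as isize;
    if BICUBIC {
        // 1-D Catmull-Rom cubic as a stand-in for bicubic filtering.
        let p0 = fetch(texels, i - 1);
        let p1 = fetch(texels, i);
        let p2 = fetch(texels, i + 1);
        let p3 = fetch(texels, i + 2);
        0.5 * (2.0 * p1
            + (p2 - p0) * t
            + (2.0 * p0 - 5.0 * p1 + 4.0 * p2 - p3) * t * t
            + (3.0 * p1 - p0 - 3.0 * p2 + p3) * t * t * t)
    } else {
        // 1-D linear interpolation as a stand-in for bilinear filtering.
        let a = fetch(texels, i);
        let b = fetch(texels, i + 1);
        a + (b - a) * t
    }
}

/// Inner loop, instantiated once per kernel.
fn sample_row<const BICUBIC: bool>(texels: &[f32], xs: &[f32], out: &mut [f32]) {
    for (o, &x) in out.iter_mut().zip(xs) {
        *o = sample::<BICUBIC>(texels, x);
    }
}

/// The runtime quality value is inspected once, outside the per-pixel loop.
pub fn sample_row_dispatch(quality: Quality, texels: &[f32], xs: &[f32], out: &mut [f32]) {
    match quality {
        Quality::Medium => sample_row::<false>(texels, xs, out),
        Quality::High => sample_row::<true>(texels, xs, out),
    }
}
```

Calling `sample_row_dispatch(Quality::Medium, &texels, &xs, &mut out)` runs the linear-only loop. Whether this pattern yields a speedup in practice depends on what the compiler was already able to hoist, which is why the benchmark numbers matter.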

The benchmark was performed by temporarily adding medium and high quality variants of transform::rotate in
sparse_strips/vello_bench/src/fine/image.rs.

bench

fine/image/transform/rotate_medium_f32_scalar
                        time:   [10.137 µs 10.144 µs 10.151 µs]
                        change: [-7.5548% -7.4067% -7.2867%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  2 (2.00%) low mild
  5 (5.00%) high mild
  6 (6.00%) high severe

fine/image/transform/rotate_high_f32_scalar
                        time:   [31.163 µs 31.293 µs 31.556 µs]
                        change: [-0.1547% +0.4795% +1.2307%] (p = 0.19 > 0.05)
                        No change in performance detected.
Found 13 outliers among 100 measurements (13.00%)
  3 (3.00%) high mild
  10 (10.00%) high severe

@nicoburns added the C-cpu (Applies to the vello_cpu crate) label Jan 7, 2026

By using a `const` generic to dispatch statically between bilinear and
bicubic sampling, we get a 7% timing decrease for bilinear sampling
(medium quality) in the f32 pipeline. It appears not to impact bicubic
sampling (high quality).

The benchmark was performed by temporarily adding medium and high
quality variants of `transform::rotate` in
`sparse_strips/vello_bench/src/fine/image.rs`.

```
fine/image/transform/rotate_medium_f32_scalar
                        time:   [10.142 µs 10.149 µs 10.157 µs]
                        change: [-7.5294% -7.3741% -7.2353%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 26 outliers among 100 measurements (26.00%)
  4 (4.00%) low severe
  9 (9.00%) low mild
  6 (6.00%) high mild
  7 (7.00%) high severe

fine/image/transform/rotate_high_f32_scalar
                        time:   [31.144 µs 31.175 µs 31.215 µs]
                        change: [-0.3454% +0.1490% +0.4994%] (p = 0.57 > 0.05)
                        No change in performance detected.
Found 14 outliers among 100 measurements (14.00%)
  2 (2.00%) high mild
  12 (12.00%) high severe
```
@tomcur tomcur force-pushed the f32-static-quality-dispatch branch from 93359a5 to bbedcf1 Compare January 15, 2026 13:33
@tomcur tomcur enabled auto-merge January 15, 2026 13:46
@tomcur tomcur added this pull request to the merge queue Jan 15, 2026
Merged via the queue into linebender:main with commit e2ddf9e Jan 15, 2026
17 checks passed
@tomcur tomcur deleted the f32-static-quality-dispatch branch January 15, 2026 14:10
Labels

C-cpu Applies to the vello_cpu crate

3 participants