batched f16 conversion #191

johannesvollmer · 2023-01-07T16:10:05Z

(and also fix round up division missing documentation)

@Shnatsel will this approach work in terms of optimization? i had to add a few copy operations for technical reasons

to do:

unit tests
before and after benchmark
add documentation about how to unlock all the optimizations (compiler flags and such)
use newest half dependency
clean up and refactor
improve documentation of the functions


johannesvollmer · 2023-01-07T16:18:44Z

before merging i want to have unit tests for that function and i want to clean it up, deduplicate the code, make it rusty

johannesvollmer · 2023-01-07T16:36:35Z

also i wonder whether the batch size of 4 can allow the compiler to optimize away all of the chunking logic in HalfFloatSliceExt::convert_from_f32_slice? Unfortunately we need to define a batch size, in order to avoid allocating a temporary buffer on the heap
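A minimal sketch of the batching idea described above, not the code from this PR: samples are decoded into a fixed-size stack array and handed to a slice-based converter one batch at a time, so no temporary heap buffer is needed. The names `BATCH`, `convert_batched` and the `convert_slice` callback are hypothetical; in the real code the callback's role would be played by a slice conversion such as the one from the half crate.

```rust
/// Hypothetical batch size; the constant makes the stack buffer size known at compile time.
const BATCH: usize = 16;

/// Decodes little-endian 16-bit samples from `bytes` and converts them batch by batch.
/// `convert_slice` stands in for a slice-based conversion routine (an assumption of this sketch).
fn convert_batched<T>(bytes: &[u8], out: &mut [T], convert_slice: impl Fn(&[u16], &mut [T])) {
    assert_eq!(bytes.len(), out.len() * 2, "two bytes per 16-bit sample");

    // Fixed-size buffer on the stack, reused for every batch.
    let mut batch = [0_u16; BATCH];

    for (byte_chunk, out_chunk) in bytes.chunks(BATCH * 2).zip(out.chunks_mut(BATCH)) {
        // The last chunk may be shorter than BATCH.
        let len = out_chunk.len();

        for (slot, pair) in batch.iter_mut().zip(byte_chunk.chunks_exact(2)) {
            *slot = u16::from_le_bytes([pair[0], pair[1]]);
        }

        // Convert the whole batch with one call, so the converter can work on many values at once.
        convert_slice(&batch[..len], out_chunk);
    }
}
```

Whether the compiler removes the chunking overhead for a given batch size is something only a benchmark or the generated assembly can confirm.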

Shnatsel

The broad strokes look about right.

I'll need to benchmark it and see if this actually improves performance, maybe also inspect the assembly to make sure these things are actually vectorized.


Shnatsel · 2023-01-07T17:24:45Z

Benchmarks: with half = {git = "https://github.com/starkat99/half-rs", features = ["use-intrinsics"]} in Cargo.toml I'm seeing the conversion time drop so much that it becomes unnoticeable, indistinguishable from the regular f32 to f32 codepath.

The f32 to u32 path doesn't get any faster, however. Maybe autovectorization doesn't kick in. I've tried with -C target-cpu=native for comparison, no difference.

Shnatsel · 2023-01-07T18:27:04Z

ARM has native instructions for casting from f32 to unsigned integers (details). This would explain why I'm seeing good results for f32 to u32 conversion on ARM, even without this change.

I haven't found any mentions of native f32 to u32 casts on x86. ChatGPT mentioned some but I think it's making it up, and when I corrected it, it said this:

Or here's a human discussion of a similar conversion (albeit to u8): https://stackoverflow.com/questions/29856006/sse-intrinsics-convert-32-bit-floats-to-unsigned-8-bit-integers

Shnatsel · 2023-01-07T18:40:39Z

This also regresses the fallback path on half 2.2.0 but not on the latest git; we will need to switch to latest half so that we don't introduce regressions when intrinsics are not available.

Shnatsel · 2023-01-07T18:56:24Z

Yeah, there seems to be no native conversion from f32 to u32 on x86_64 without AVX-512: https://stackoverflow.com/questions/41144668/how-to-efficiently-perform-double-int64-conversions-with-sse-avx

There is a conversion from f32 to i32, but that would truncate large numbers that fit into u32 but not i32, and therefore produce incorrect results (although u32 cannot represent all f32 values in the first place).
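To make the i32 concern concrete, here is a small sketch using Rust's scalar `as` casts, which saturate at the bounds of the target type. It only illustrates the value ranges involved, not what a particular SIMD instruction returns for out-of-range inputs. The function names `via_i32` and `direct` are made up for this example.

```rust
/// Goes through i32 first, as a signed-only conversion would have to.
fn via_i32(x: f32) -> u32 {
    (x as i32) as u32
}

/// Direct scalar cast; saturates to the range 0..=u32::MAX.
fn direct(x: f32) -> u32 {
    x as u32
}

fn print_examples() {
    // 3_000_000_000 fits into u32 but not into i32.
    let big = 3_000_000_000.0_f32;
    println!("via i32: {}, direct: {}", via_i32(big), direct(big));

    // Values outside the u32 range cannot be represented at all and get clamped.
    println!("-1.0 -> {}, 1e20 -> {}", direct(-1.0), direct(1e20));
}
```

With these casts, the detour through i32 clamps the large value to `i32::MAX`, while the direct cast keeps it intact.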

Why is u32 the chosen format, anyway?

Shnatsel

There's an outstanding nit about a comment and we'll need to bump to the latest half once the next release ships, otherwise looks good.

johannesvollmer · 2023-01-07T20:38:58Z

Thanks for the feedback! Conversions from and to u32 are only there for completeness. I don't expect them to ever happen in the real world. We should not worry about inferior performance here at all. As you said, it's not even accurate for large numbers.

johannesvollmer · 2023-01-07T20:41:07Z

> I'm seeing the conversion time drop so much that it becomes unnoticeable, indistinguishable from the regular f32 to f32 codepath.

Awesome, didn't expect that much of a speed up!

johannesvollmer · 2023-01-07T20:41:42Z

Do we have any regression concerning f32 -> f32?

johannesvollmer · 2023-01-07T20:45:44Z

> Why is u32 the chosen format, anyway?

the purpose of u32 samples is to assign unique IDs to different areas in the image. The common case, rgba images, is either full f32 or full f16.

In the cases where u32 is used, it is certainly planned and not converted to any float type.

Shnatsel · 2023-01-07T21:15:12Z

> Do we have any regression concerning f32 -> f32?

Nope, it's the exact same on my machine. I guess the buffer does fit entirely into the L1 cache, it's not big.

johannesvollmer · 2023-01-08T13:26:48Z

added a Todo list in the pr text. anything else to add to that list?

Shnatsel · 2023-01-11T14:02:50Z

The necessary changes to half have shipped in version 2.2.1


johannesvollmer · 2023-07-02T22:16:26Z

Added more benchmarks, everything looks as expected still. neat!

test read_f16_as_f16_uncompressed_1thread ... bench:  11,665,580 ns/iter (+/- 2,527,286)
test read_f16_as_u32_uncompressed_1thread ... bench:  11,732,710 ns/iter (+/- 957,454)
test read_f16_as_f32_uncompressed_1thread ... bench:  11,661,750 ns/iter (+/- 716,012)
test read_f16_as_f16_zip_nthreads         ... bench:  13,345,020 ns/iter (+/- 1,558,281)
test read_f16_as_f32_zip_nthreads         ... bench:  12,881,160 ns/iter (+/- 4,175,510)
test read_f16_as_f16_zip_1thread          ... bench:  28,832,260 ns/iter (+/- 2,584,587)
test read_f16_as_f32_zip_1thread          ... bench:  26,279,960 ns/iter (+/- 2,138,992)

test read_f32_as_f32_uncompressed_1thread ... bench:  17,843,730 ns/iter (+/- 1,008,381)
test read_f32_as_u32_uncompressed_1thread ... bench:  17,952,880 ns/iter (+/- 2,185,665)
test read_f32_as_f16_uncompressed_1thread ... bench:  17,965,450 ns/iter (+/- 2,524,674)
test read_f32_as_f32_zips_nthreads        ... bench:  26,873,920 ns/iter (+/- 3,032,381)
test read_f32_as_f16_zips_nthreads        ... bench:  26,641,840 ns/iter (+/- 2,400,515)
test read_f32_as_f32_zips_1thread         ... bench: 101,547,150 ns/iter (+/- 6,313,799)
test read_f32_as_f16_zips_1thread         ... bench: 100,998,820 ns/iter (+/- 6,737,638)

previously (without SIMD batching, but with intrinsic conversions)

test read_f16_as_f16_uncompressed_1thread ... bench:  13,896,960 ns/iter (+/- 1,866,398)
test read_f16_as_u32_uncompressed_1thread ... bench:  13,760,660 ns/iter (+/- 583,555)
test read_f16_as_f32_uncompressed_1thread ... bench:  13,805,060 ns/iter (+/- 1,905,708)
test read_f16_as_f16_zip_nthreads         ... bench:  14,468,520 ns/iter (+/- 1,170,083)
test read_f16_as_f32_zip_nthreads         ... bench:  14,479,990 ns/iter (+/- 4,490,935)
test read_f16_as_f16_zip_1thread          ... bench:  29,224,890 ns/iter (+/- 1,293,434)
test read_f16_as_f32_zip_1thread          ... bench:  29,319,380 ns/iter (+/- 826,762)

test read_f32_as_f32_uncompressed_1thread ... bench:  30,926,660 ns/iter (+/- 2,187,303)
test read_f32_as_u32_uncompressed_1thread ... bench:  30,900,850 ns/iter (+/- 4,375,285)
test read_f32_as_f16_uncompressed_1thread ... bench:  30,854,990 ns/iter (+/- 1,294,175)
test read_f32_as_f32_zips_nthreads        ... bench:  48,464,580 ns/iter (+/- 7,056,668)
test read_f32_as_f16_zips_nthreads        ... bench:  48,596,240 ns/iter (+/- 5,171,012)
test read_f32_as_f32_zips_1thread         ... bench: 113,928,800 ns/iter (+/- 7,434,780)
test read_f32_as_f16_zips_1thread         ... bench: 113,377,860 ns/iter (+/- 5,173,657)
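As a rough way to read the two runs above, the sketch below computes the relative time saved for a few of the f16 benchmarks. The numbers are copied from the listings; the helper names are made up for illustration, and the wide error bars in the output mean the percentages are only approximate.

```rust
/// Relative time saved going from `before_ns` to `after_ns`, as a fraction of `before_ns`.
fn relative_saving(before_ns: u64, after_ns: u64) -> f64 {
    (before_ns as f64 - after_ns as f64) / before_ns as f64
}

/// (benchmark name, ns/iter without batching, ns/iter with batching)
const RUNS: [(&str, u64, u64); 3] = [
    ("read_f16_as_f32_uncompressed_1thread", 13_805_060, 11_661_750),
    ("read_f16_as_f32_zip_nthreads", 14_479_990, 12_881_160),
    ("read_f16_as_f32_zip_1thread", 29_319_380, 26_279_960),
];

fn print_savings() {
    for (name, before, after) in RUNS {
        let percent = 100.0 * relative_saving(before, after);
        println!("{name}: {percent:.1}% less time per iteration");
    }
}
```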


johannesvollmer · 2023-07-04T21:21:03Z

(sorry for not merging yet, I'm abusing this branch to fix the github workflow. the CI should have caught the MSRV breaking change, but it is broken apparently)

johannesvollmer · 2023-07-06T15:58:43Z

Fixed it - now the only question is whether we want to go 2.0.0 and Rust 1.70.0 for this...

Shnatsel · 2023-07-06T16:08:52Z

As it stands, Cargo.lock does require Rust 1.70 but Cargo.toml does not, meaning that downstream users are free to configure the library to use older half with an older MSRV if they need to. I think that's a fair balance. It would be nice to call it out in the README.

johannesvollmer · 2023-07-06T18:38:31Z

If we allow half = "2.3", we should also raise our own rust-version = "1.70.0" in the Cargo.toml, right? Do you mean we do that, and also hint at a workaround? The workaround being that our users specify an older version of half and can then compile using rustc 1.59? This makes sense, I didn't think of that, good idea :)

Shnatsel · 2023-07-06T20:22:26Z

You can put half = "2.2" into Cargo.toml, so when someone adds exrs as a dependency it will default to the latest at the time of the installation (currently 2.3.1) but will also allow downgrading to 2.2 if this is needed by the users for MSRV reasons.

And just don't put rust-version in there I guess, so you could have a Cargo.lock for development with the latest half for best performance, and also if anyone wants to run the benchmarks on the repo.
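The reason a requirement like half = "2.2" permits both 2.3.1 and a downgrade to 2.2.x is that Cargo treats a bare version as a caret requirement. The sketch below is a deliberately simplified model of that rule for versions with a major number of 1 or higher; it ignores the special handling of 0.x versions and pre-release tags, and `caret_allows` is a made-up name, not a Cargo API.

```rust
/// (major, minor, patch)
type Version = (u64, u64, u64);

/// Simplified caret rule for major >= 1: same major version, and not older than the requirement.
fn caret_allows(requirement: Version, candidate: Version) -> bool {
    candidate.0 == requirement.0 && candidate >= requirement
}

fn print_half_examples() {
    let requirement = (2, 2, 0); // half = "2.2"
    for candidate in [(2, 2, 0), (2, 2, 1), (2, 3, 1), (3, 0, 0), (2, 1, 9)] {
        println!("{candidate:?} allowed: {}", caret_allows(requirement, candidate));
    }
}
```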

johannesvollmer · 2023-07-06T20:44:19Z

The Cargo.lock is no longer in the repo, as is suggested for Rust libraries. But anyways, the plan still sounds good. I'll find out what the MSRV in the Cargo.toml actually means, and decide whether to put the Rust version into the Cargo.toml. Thanks for your help with this PR :)

johannesvollmer · 2023-07-07T23:48:04Z

Actually, let's merge all of this except for the version upgrade of half. then release a minor version with the small performance improvements. then release 2.0.0 with the new version of half, including the new intrinsics, and a new msrv.

the reason being that the batching alone gives us 10% speed improvement (measured with intrinsics active, assuming it will also be relevant without intrinsics)

sorry for all the discussion and for all the strategy changes :D


johannesvollmer · 2023-07-08T00:19:57Z

cargo-msrv verify succeeds on my local machine... ci seems to be broken still

Shnatsel · 2023-07-08T09:13:48Z

I am convinced that bumping semver for MSRV reasons alone is a bad idea, because now several crates using exr have to all manually upgrade in tandem. If one uses exr 1.x and the other uses 2.x they can no longer interoperate; and anyone can upgrade to 2.x only when everyone has upgraded to 2.x, splitting the ecosystem. Please don't release 2.x just because of an MSRV change.

johannesvollmer · 2023-07-08T10:29:11Z

Hmmm maybe I shouldn't make spontaneous decisions at 3AM :D

Yes, I see your point. On the other hand, most people will have specified the dependency to exr = "1.5.6" or similar in their project. So running cargo update will bump that to anything <2.0.0 without asking, and then break their build if they do not use the absolutely newest Rust version. Is this expected behaviour? I thought it would be nice to explicitly opt-in to that, with more than just cargo update. However, I agree that splitting the ecosystem would not be good either.

johannesvollmer · 2023-07-08T11:02:30Z

I've opened #217 and would like to continue the discussion there, as it seems more appropriate. I yanked version 2.0.0 for now, which I released earlier at 3AM before going to bed, so I might not have put enough thought into it back then, whoopsie

prototype unoptimized batched f16 conversion + fix round up division missing documentation

baf90df

johannesvollmer changed the title ~~prototype unoptimized batched f16 conversion~~ prototype batched f16 conversion Jan 7, 2023

Merge branch 'master' into f16_batch_conversion

c13b24e

rename some stuff and improve error message

7b41d0d

Shnatsel reviewed Jan 7, 2023

src/block/samples.rs (outdated)

src/image/read/specific_channels.rs (outdated)

use batch size of 16

284da36

Shnatsel previously approved these changes Jan 7, 2023


Shnatsel mentioned this pull request Jan 7, 2023

Use 8x SIMD conversions when appropriate starkat99/half-rs#66

Closed

johannesvollmer linked an issue Jan 8, 2023 that may be closed by this pull request

Pixel format conversions are slow #178

Closed

Merge branch 'master' into f16_batch_conversion

b4b9518

johannesvollmer added 2 commits January 20, 2023 21:33

improve comments, update to new half version

9de05cc

Merge remote-tracking branch 'origin/f16_batch_conversion' into f16_batch_conversion

27aca4d

johannesvollmer dismissed Shnatsel’s stale review via 27aca4d January 20, 2023 20:34

johannesvollmer added 3 commits January 20, 2023 21:49

Merge branch 'master' into f16_batch_conversion

5d96f18

revert an incomplete refactoring

376ae08

refactor batch conversion function to reduce code duplication

6475c87

johannesvollmer marked this pull request as ready for review January 20, 2023 21:45

add more benchmarks

3ece29d

Shnatsel reviewed Jul 2, 2023

src/image/read/specific_channels.rs

Shnatsel reviewed Jul 2, 2023

src/block/samples.rs

johannesvollmer added 5 commits July 3, 2023 09:48

Merge branch 'master' into f16_batch_conversion

4236bf3

# Conflicts: # Cargo.lock # Cargo.toml

inline-hint closures

a572e94

undo use inline attribute (experimental, not supported yet)

f559647

use inline syntax for run commands

73a62b9

Merge branch 'master' into f16_batch_conversion

f0052ad

johannesvollmer mentioned this pull request Jul 3, 2023

Cargo resolves different versions of dependencies (with different MSRV) from Cargo 1.60 and up foresterre/cargo-msrv#728

Open

johannesvollmer added 4 commits July 6, 2023 17:07

attempt ci without cache

1d5fd6b

attempt fix ci

980369b

refactor

e9bff52

bump rust version to take advantage of f16 intrinsics

79940c5

johannesvollmer added 2 commits July 8, 2023 01:49

Merge branch 'master' into f16_batch_conversion

3ce3c05

retain backwards compatibility for this pr, only break the release after the next release

c3cb590

johannesvollmer changed the title ~~prototype batched f16 conversion~~ batched f16 conversion Jul 8, 2023

Merge branch 'master' into f16_batch_conversion

85a311d

johannesvollmer merged commit 3e0f6cd into master Jul 8, 2023
