Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

batched f16 conversion #191

Merged
merged 33 commits into from
Jul 8, 2023
Merged

batched f16 conversion #191

merged 33 commits into from
Jul 8, 2023

Conversation

johannesvollmer
Copy link
Owner

@johannesvollmer johannesvollmer commented Jan 7, 2023

(and also fix round up division missing documentation)

@Shnatsel will this approach work in terms of optimization? i had to add a few copy operations for technical reasons

to do:

  • unit tests
  • before and after benchmark
  • add documentation about how to unlock all the optimizations (compiler flags and such)
  • use newest half dependency
  • clean up and refactor
  • improve documentation of the functions

@johannesvollmer johannesvollmer changed the title prototype unoptimized batched f16 conversion prototype batched f16 conversion Jan 7, 2023
@johannesvollmer
Copy link
Owner Author

before merging i want to have unit tests for that function and i want to clean it up, deduplicate the code, make it rusty

@johannesvollmer
Copy link
Owner Author

also i wonder whether the batch size of 4 can allow the compiler to optimize away all of the chunking logic in HalfFloatSliceExt::convert_from_f32_slice? Unfortunately we need to define a batch size, in order to avoid allocating a temporary buffer on the heap

Copy link
Contributor

@Shnatsel Shnatsel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The broad strokes look about right.

I'll need to benchmark it and see if this actually improves performance, maybe also inspect the assembly to make sure these things are actually vectorized.

src/block/samples.rs Outdated Show resolved Hide resolved
src/image/read/specific_channels.rs Outdated Show resolved Hide resolved
@Shnatsel
Copy link
Contributor

Shnatsel commented Jan 7, 2023

Benchmarks: with half = {git = "https://github.com/starkat99/half-rs", features = ["use-intrinsics"]} in Cargo.toml I'm seeing the conversion time drop so much that it becomes unnoticeable, indistinguishable from the regular f32 to f32 codepath.

The f32 to u32 path doesn't get any faster, however. Maybe autovectorization doesn't kick in. I've tried with -C target-cpu=native for comparison, no difference.

@Shnatsel
Copy link
Contributor

Shnatsel commented Jan 7, 2023

ARM has native instructions for casting from f32 to unsigned integers (details). This would explain why I'm seeing good results for f32 to u32 conversion on ARM, even without this change.

I haven't found any mentions of native f32 to u32 casts on x86. ChatGPT mentioned some but I think it's making it up, and when I corrected it, it said this:

Снимок экрана от 2023-01-07 19-26-25

Or here's a human discussion of a similar conversion (albeit to u8): https://stackoverflow.com/questions/29856006/sse-intrinsics-convert-32-bit-floats-to-unsigned-8-bit-integers

@Shnatsel
Copy link
Contributor

Shnatsel commented Jan 7, 2023

This also regresses the fallback path on half 2.2.0 but not on the latest git; we will need to switch to latest half so that we don't introduce regressions when intrinsics are not available.

@Shnatsel
Copy link
Contributor

Shnatsel commented Jan 7, 2023

Yeah, there seems to be no native conversion from f32 to u32 on x86_64 without AVX-512: https://stackoverflow.com/questions/41144668/how-to-efficiently-perform-double-int64-conversions-with-sse-avx

There is a conversion from f32 to i32, but that would truncate large numbers that fit into u32 but not i32, and therefore produce incorrect results (although u32 cannot represent all f32 values in the first place).

Why is u32 is the chosen format, anyway?

Shnatsel
Shnatsel previously approved these changes Jan 7, 2023
Copy link
Contributor

@Shnatsel Shnatsel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's an outstanding nit about a comment and we'll need to bump to the latest half once the next release ships, otherwise looks good.

@johannesvollmer
Copy link
Owner Author

johannesvollmer commented Jan 7, 2023

Thanks for the feedback! Conversion from and to u32 are only for completeness. I don't expect it to ever happen in the real world. We should not worry about inferior performance here at all. As you said, it's not even accurate for large numbers.

@johannesvollmer
Copy link
Owner Author

I'm seeing the conversion time drop so much that it becomes unnoticeable, indistinguishable from the regular f32 to f32 codepath.

Awesome, didn't expect that much of a speed up!

@johannesvollmer
Copy link
Owner Author

Do we have any regression concerning f32 -> f32?

@johannesvollmer
Copy link
Owner Author

johannesvollmer commented Jan 7, 2023

Why is u32 is the chosen format, anyway?

the purpose of u32 samples is too assign different areas in the image unique IDs. The common case, rgba images, are either full f32 or full f16.

In the cases where u32 is used, it is certainly planned and not converted to any float type.

@Shnatsel
Copy link
Contributor

Shnatsel commented Jan 7, 2023

Do we have any regression concerning f32 -> f32?

Nope, it's the exact same on my machine. I guess the buffer does fit entirely into the L1 cache, it's not big.

@johannesvollmer
Copy link
Owner Author

johannesvollmer commented Jan 8, 2023

added a Todo list in the pr text. anything else to add to that list?

@Shnatsel
Copy link
Contributor

The necessary changes to half have shipped in version 2.2.1

@johannesvollmer johannesvollmer marked this pull request as ready for review January 20, 2023 21:45
@johannesvollmer
Copy link
Owner Author

johannesvollmer commented Jul 2, 2023

Added more benchmarks, everything looks as expected still. neat!

test read_f16_as_f16_uncompressed_1thread ... bench:  11,665,580 ns/iter (+/- 2,527,286)
test read_f16_as_u32_uncompressed_1thread ... bench:  11,732,710 ns/iter (+/- 957,454)
test read_f16_as_f32_uncompressed_1thread ... bench:  11,661,750 ns/iter (+/- 716,012)
test read_f16_as_f16_zip_nthreads         ... bench:  13,345,020 ns/iter (+/- 1,558,281)
test read_f16_as_f32_zip_nthreads         ... bench:  12,881,160 ns/iter (+/- 4,175,510)
test read_f16_as_f16_zip_1thread          ... bench:  28,832,260 ns/iter (+/- 2,584,587)
test read_f16_as_f32_zip_1thread          ... bench:  26,279,960 ns/iter (+/- 2,138,992)

test read_f32_as_f32_uncompressed_1thread ... bench:  17,843,730 ns/iter (+/- 1,008,381)
test read_f32_as_u32_uncompressed_1thread ... bench:  17,952,880 ns/iter (+/- 2,185,665)
test read_f32_as_f16_uncompressed_1thread ... bench:  17,965,450 ns/iter (+/- 2,524,674)
test read_f32_as_f32_zips_nthreads        ... bench:  26,873,920 ns/iter (+/- 3,032,381)
test read_f32_as_f16_zips_nthreads        ... bench:  26,641,840 ns/iter (+/- 2,400,515)
test read_f32_as_f32_zips_1thread         ... bench: 101,547,150 ns/iter (+/- 6,313,799)
test read_f32_as_f16_zips_1thread         ... bench: 100,998,820 ns/iter (+/- 6,737,638)

previously (without SIMD batching, but with intrinsic conversions)

test read_f16_as_f16_uncompressed_1thread ... bench:  13,896,960 ns/iter (+/- 1,866,398)
test read_f16_as_u32_uncompressed_1thread ... bench:  13,760,660 ns/iter (+/- 583,555)
test read_f16_as_f32_uncompressed_1thread ... bench:  13,805,060 ns/iter (+/- 1,905,708)
test read_f16_as_f16_zip_nthreads         ... bench:  14,468,520 ns/iter (+/- 1,170,083)
test read_f16_as_f32_zip_nthreads         ... bench:  14,479,990 ns/iter (+/- 4,490,935)
test read_f16_as_f16_zip_1thread          ... bench:  29,224,890 ns/iter (+/- 1,293,434)
test read_f16_as_f32_zip_1thread          ... bench:  29,319,380 ns/iter (+/- 826,762)

test read_f32_as_f32_uncompressed_1thread ... bench:  30,926,660 ns/iter (+/- 2,187,303)
test read_f32_as_u32_uncompressed_1thread ... bench:  30,900,850 ns/iter (+/- 4,375,285)
test read_f32_as_f16_uncompressed_1thread ... bench:  30,854,990 ns/iter (+/- 1,294,175)
test read_f32_as_f32_zips_nthreads        ... bench:  48,464,580 ns/iter (+/- 7,056,668)
test read_f32_as_f16_zips_nthreads        ... bench:  48,596,240 ns/iter (+/- 5,171,012)
test read_f32_as_f32_zips_1thread         ... bench: 113,928,800 ns/iter (+/- 7,434,780)
test read_f32_as_f16_zips_1thread         ... bench: 113,377,860 ns/iter (+/- 5,173,657)

@johannesvollmer
Copy link
Owner Author

johannesvollmer commented Jul 4, 2023

(sorry for not merging yet, I'm abusing this branch to fix the github workflow. the CI should have catched the MSRV breaking change, but it is broken apparently)

@johannesvollmer
Copy link
Owner Author

Fixed it - now the only question is whether we want to go 2.0.0 and Rust 1.70.0 for this...

@Shnatsel
Copy link
Contributor

Shnatsel commented Jul 6, 2023

As it stands, Cargo.lock does require Rust 1.70 but Cargo.toml does not, meaning that downstream users are free to configure the library to use older half with an older MSRV if they need to. I think that's a fair balance. It would be nice to call it out in the README.

@johannesvollmer
Copy link
Owner Author

johannesvollmer commented Jul 6, 2023

If we allow half = "2.3", we should also raise our own rust-version = 1.70.0 in the Cargo.toml, right? Do you mean we do that, and also hint at a workaround? The workaround being our users specify an older version of half and can then compile using rustc 1.59? This makes sense, I didn't think of that, good idea :)

@Shnatsel
Copy link
Contributor

Shnatsel commented Jul 6, 2023

You can put half = 2.2 into Cargo.toml, so when someone adds exrs as a dependency it will default to the latest at the time of the installation (currently 2.3.1) but will also allow downgrading to 2.2 if this is needed by the users for MSRV reasons.

And just don't put rust-version in there I guess, so you could have a Cargo.lock for development with the latest half for best performance, and also if anyone wants to run the benchmarks on the repo.

@johannesvollmer
Copy link
Owner Author

The Cargo.lock is no longer in the repo, as is suggested for Rust libraries. But anyways, the plan still sounds good. I'll find out what the MSRV in the Cargo.toml actually means, and decide whether to put the Rust version into the Cargo.toml. Thanks for your help with this PR :)

@johannesvollmer
Copy link
Owner Author

johannesvollmer commented Jul 7, 2023

Actually, let's merge all of this except for the version upgrade of half. then release a major version with the small performance improvements. then release 2.0.0 with the new version of half, including the new intrinsics, and a new msrv.

the reason being that the batching alone gives us 10% speed improvement (measured with intrinsics active, assuming it will also be relevant without intrinsics)

sorry for all the discussion and for all the strategy changes :D

@johannesvollmer johannesvollmer changed the title prototype batched f16 conversion batched f16 conversion Jul 8, 2023
@johannesvollmer
Copy link
Owner Author

johannesvollmer commented Jul 8, 2023

cargo-msrv verify succeeds on my local machine... ci seems to be broken still

@johannesvollmer johannesvollmer merged commit 3e0f6cd into master Jul 8, 2023
@Shnatsel
Copy link
Contributor

Shnatsel commented Jul 8, 2023

I am convinced that bumping semver for MSRV reasons alone is a bad idea, because now several crates using exr have to all manually upgrade in tandem. If one uses exr 1.x and the other uses 2.x they can no longer interoperate; and anyone can upgrade to 2.x only when everyone has upgraded to 2.x, splitting the ecosystem. Please don't release 2.x just because of an MSRV change.

@johannesvollmer
Copy link
Owner Author

Hmmm maybe I shouldn't make spontaneous decisions at 3AM :D

Yes, I see your point. On the other hand, most people will have specified the dependency to exr = "1.5.6" or similar in their project. So running cargo update will bump that to anything <2.0.0 without asking, and then break their build if they do not use the absolutely newest Rust version. Is this expected behaviour? I thought it would be nice to explicitly opt-in to that, with more than just cargo update. However, I agree that splitting the ecosystem would not be good either.

@johannesvollmer
Copy link
Owner Author

I've opened #217 and would like to continue the discussion there, as it seems more appropriate. I yanked version 2.0.0 for now, which I released earlier at 3AM before going to bed, so I have might not put enough thought into it back then, whoopsie

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pixel format conversions are slow
2 participants