batched f16 conversion #191
…missing documentation
Before merging, I want to add unit tests for that function, clean it up, deduplicate the code, and make it more idiomatic Rust.
also i wonder whether the batch size of 4 can allow the compiler to optimize away all of the chunking logic in |
The broad strokes look about right.
I'll need to benchmark it and see if this actually improves performance, maybe also inspect the assembly to make sure these things are actually vectorized.
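For readers following along, the batching pattern under discussion can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from this PR: `convert_one`, `convert_batch` and `convert_slice` are hypothetical names, and the scalar conversion is a stand-in that simply keeps the upper 16 bits of the `f32` (it is not a correct `f16` conversion). The point is the shape: fixed-size batches of 4 that the compiler has a chance to vectorize, plus a scalar loop for the remainder.

```rust
/// Stand-in scalar conversion (assumption for illustration only):
/// keeps the upper 16 bits of the f32 bit pattern.
/// This is NOT a correct f32 -> f16 conversion.
fn convert_one(value: f32) -> u16 {
    (value.to_bits() >> 16) as u16
}

/// Converts one fixed-size batch of 4 values.
/// Because the length is known at compile time, the loop over the
/// batch can be fully unrolled and possibly vectorized.
fn convert_batch(batch: [f32; 4]) -> [u16; 4] {
    [
        convert_one(batch[0]),
        convert_one(batch[1]),
        convert_one(batch[2]),
        convert_one(batch[3]),
    ]
}

/// Converts `input` into `output` in batches of 4,
/// falling back to the scalar path for the last 0..=3 elements.
fn convert_slice(input: &[f32], output: &mut [u16]) {
    assert_eq!(input.len(), output.len());

    let mut in_chunks = input.chunks_exact(4);
    let mut out_chunks = output.chunks_exact_mut(4);

    // Full batches: copy into a fixed-size array, convert, copy back.
    for (src, dst) in (&mut in_chunks).zip(&mut out_chunks) {
        let batch = [src[0], src[1], src[2], src[3]];
        dst.copy_from_slice(&convert_batch(batch));
    }

    // Remainder: fewer than 4 elements left, convert one by one.
    let rest_in = in_chunks.remainder();
    let rest_out = out_chunks.into_remainder();
    for (src, dst) in rest_in.iter().zip(rest_out.iter_mut()) {
        *dst = convert_one(*src);
    }
}
```

Whether the extra copies into and out of the fixed-size array are optimized away is exactly the kind of thing that would need to be confirmed with benchmarks or by inspecting the generated assembly.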
Benchmarks: with The |
ARM has native instructions for casting from I haven't found any mentions of native Or here's a human discussion of a similar conversion (albeit to u8): https://stackoverflow.com/questions/29856006/sse-intrinsics-convert-32-bit-floats-to-unsigned-8-bit-integers |
This also regresses the fallback path on half |
Yeah, there seems to be no native conversion from There is a conversion from Why is |
There's an outstanding nit about a comment and we'll need to bump to the latest half
once the next release ships, otherwise looks good.
Thanks for the feedback! Conversion from and to |
Awesome, didn't expect that much of a speed up! |
Do we have any regression concerning |
the purpose of In the cases where |
Nope, it's the exact same on my machine. I guess the buffer does fit entirely into the L1 cache, it's not big. |
added a Todo list in the pr text. anything else to add to that list? |
The necessary changes to |
Added more benchmarks, everything looks as expected still. neat!
previously (without SIMD batching, but with intrinsic conversions)
(Sorry for not merging yet, I'm abusing this branch to fix the GitHub workflow. The CI should have caught the MSRV-breaking change, but it is apparently broken.)
Fixed it - now the only question is whether we want to go 2.0.0 and Rust 1.70.0 for this... |
As it stands, |
If we allow |
You can put And just don't put |
The |
Actually, let's merge all of this except for the version upgrade of half. Then release a major version with the small performance improvements. Then release 2.0.0 with the new version of half, including the new intrinsics, and a new MSRV. The reason is that the batching alone gives us a 10% speed improvement (measured with intrinsics active, assuming it will also be relevant without intrinsics). Sorry for all the discussion and for all the strategy changes :D
I am convinced that bumping semver for MSRV reasons alone is a bad idea, because now several crates using |
Hmmm maybe I shouldn't make spontaneous decisions at 3AM :D Yes, I see your point. On the other hand, most people will have specified the dependency to |
I've opened #217 and would like to continue the discussion there, as it seems more appropriate. I yanked version |
(and also fix round up division missing documentation)
@Shnatsel will this approach work in terms of optimization? i had to add a few copy operations for technical reasons
to do:
half
dependency