
🚀 float NaN handling #21

Merged - jvdd merged 49 commits into main from nans_v3 on Feb 26, 2023

Conversation

jvdd (Owner) commented Feb 8, 2023

Handle NaNs, closes #16

✔️ no behavior changes in this PR (except for f16)

Previous (and also current) behavior for floats:

  • .argminmax: ignores NaNs (while being even faster for floats)
    => Only "downside": if the data contains ONLY NaNs and/or +inf/-inf, this will return 0
    (I believe we can accept this unexpected behavior for now - it seems like a very uncommon use case)
  • .nanargminmax: (new function 🎉) returns the index of the first NaN value (instead of ignoring it)
    To realize this functionality, we use the transformation as detailed in 💪 handle NaNs #16 & explored in 🚧 POC - support NaNs for SSE & AVX2 f32 #18

❗ for f16 we do not have an IgnoreNaN implementation yet (previously, .argminmax for f16 corresponded to the ReturnNaN case, as we use the ord_transform to efficiently handle the non-hardware-supported f16).
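
For illustration, a minimal sketch of the kind of bit-level transformation this relies on (hypothetical helper, not the crate's actual code; the real transform lives in the SIMD modules and may differ in details). It maps f32 bit patterns to i32 values whose signed ordering matches the float ordering, with positive-sign NaNs landing above +inf, so plain integer min/max can detect them:

```rust
/// Hypothetical sketch of an ordinal transform for f32 (the crate's actual
/// implementation may differ). Positive floats keep their bit pattern;
/// negative floats get their exponent/mantissa bits flipped, so comparing the
/// resulting i32 values matches the float ordering. NaNs with a clear sign
/// bit end up above +inf, which is what a "return NaN" argminmax exploits.
fn ord_transform(x: f32) -> i32 {
    let bits = x.to_bits() as i32;
    bits ^ ((bits >> 31) & i32::MAX)
}

fn main() {
    let vals = [-2.5f32, -0.0, 1.0, f32::INFINITY, f32::NAN];
    let keys: Vec<i32> = vals.iter().map(|&v| ord_transform(v)).collect();
    // The transformed keys are strictly increasing: -2.5 < -0.0 < 1.0 < +inf < NaN
    assert!(keys.windows(2).all(|w| w[0] < w[1]));
}
```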

Changing the "architecture":

  • SIMD ops in stand-alone trait (better w.r.t. coupling & cohesion)
    • create SIMDOps trait & add additional traits (see sketch below)
    • add auto-implement for SIMDCore traits
    • implement the SIMDOps trait for the various data types
      • x86 / x86_64
      • arm / Aarch64
  • implement IgnoreNaN and ReturnNaN variants for floats in the SIMDInstructionSet structs
    • IgnoreNaN
    • ReturnNaN

Changing the default behavior

  • switch from IgnoreNaN to ReturnNaN for argminmax
    => we use IgnoreNaN as the default - see ♻️ change nan default handling behavior to SkipNa #28
    • simd_f16.rs should implement SIMDInstructionSet and not the IgnoreNaN structs
    • document the header of the IgnoreNaN & ReturnNaN implementations
    • add tests for IgnoreNaN
    • add tests for ReturnNaN
    • update the benches (currently we only bench FloatIgnoreNaN)
      -> first ensure that we currently have no regressions (still the case!); if so, change the FloatIgnoreNaN benches to FloatReturnNaN (which will most likely introduce some regressions) and move the FloatIgnoreNaN benches to a dedicated bench file.
    • update the ArgMinMax trait
  • add scalar support
    • update tests with correct scalar implementation
    • implement SCALARIgnoreNaN
    • double check correct scalar support for f16

Minor TODOs during this (waaay too large) PR:


Overview of the new architecture

(architecture overview diagram)

  • the default SIMDInstructionSet structs (e.g., AVX2) implement argminmax with "return NaN" semantics (e.g., simd_f32.rs)
  • "ignore NaN" is served by a distinct struct (e.g., AVX2FloatIgnoreNaN)

codspeed-hq bot commented Feb 8, 2023

CodSpeed Performance Report

Merging #21 nans_v3 (d53e09c) will not alter performances.

Summary

🔥 0 improvements
❌ 0 regressions
✅ 32 untouched benchmarks

🆕 20 new benchmarks
⁉️ 12 dropped benchmarks

Benchmarks breakdown

Benchmark | main | nans_v3 | Change
🆕 scalar_random_long_f16 | N/A | 3.3 ms | N/A
🆕 sse_random_long_f16 | N/A | 476.4 µs | N/A
🆕 avx2_random_long_f16 | N/A | 236.5 µs | N/A
🆕 impl_random_long_f16 | N/A | 236.6 µs | N/A
🆕 scalar_nanargminmax_f32 | N/A | 2.2 ms | N/A
🆕 sse_nanargminmax_f32 | N/A | 968 µs | N/A
🆕 avx2_nanargminmax_f32 | N/A | 466.9 µs | N/A
🆕 impl_nanargminmax_f32 | N/A | 467.1 µs | N/A
🆕 scalar_random_long_f32 | N/A | 1.7 ms | N/A
🆕 sse_random_long_f32 | N/A | 712.1 µs | N/A
🆕 avx_random_long_f32 | N/A | 403 µs | N/A
🆕 impl_random_long_f32 | N/A | 403.2 µs | N/A
🆕 scalar_nanargminmax_f64 | N/A | 2.4 ms | N/A
🆕 sse_nanargminmax_f64 | N/A | 2.3 ms | N/A
🆕 avx2_nanargminmax_f64 | N/A | 1.1 ms | N/A
🆕 impl_nanargminmax_f64 | N/A | 1.1 ms | N/A
🆕 scalar_random_long_f64 | N/A | 1.9 ms | N/A
🆕 sse_random_long_f64 | N/A | 1.4 ms | N/A
🆕 avx_random_long_f64 | N/A | 804.6 µs | N/A
🆕 impl_random_long_f64 | N/A | 804.8 µs | N/A
⁉️ scalar_random_long_f32 | 1.7 ms | N/A | N/A
⁉️ sse_random_long_f32 | 712.1 µs | N/A | N/A
⁉️ avx_random_long_f32 | 402.9 µs | N/A | N/A
⁉️ impl_random_long_f32 | 403.2 µs | N/A | N/A
⁉️ scalar_random_long_f16 | 2.3 ms | N/A | N/A
⁉️ sse_random_long_f16 | 475.6 µs | N/A | N/A
⁉️ avx2_random_long_f16 | 235.6 µs | N/A | N/A
⁉️ impl_random_long_f16 | 235.8 µs | N/A | N/A
⁉️ scalar_random_long_f64 | 1.9 ms | N/A | N/A
⁉️ sse_random_long_f64 | 1.4 ms | N/A | N/A
⁉️ avx_random_long_f64 | 804.6 µs | N/A | N/A
⁉️ impl_random_long_f64 | 804.8 µs | N/A | N/A

jvdd mentioned this pull request on Feb 9, 2023
varon (Contributor) commented Feb 9, 2023

Thanks so much for tackling this, @jvdd ! Any smaller tasks I can help with?

varon (Contributor) commented Feb 9, 2023

Gave it a shot at filling in the API values for NEON + Arm64.
https://github.com/varon/argminmax/tree/neon-nan-v3

You can fish the exact commit out here - 728e310

Feel free to cherry-pick this into your PR if you judge it useful!

jvdd (Owner, Author) commented Feb 9, 2023

Thx @varon, I appreciate your help!

I'll first try to merge PR #23 (which does some major refactoring in terms of traits & structs) - it should make the codebase a lot more flexible (it separates floats from the other datatypes without any real code overhead). I'll document this tomorrow. (I fear that this merge will result in quite a lot of merge conflicts with your PR - my apologies for this :/)

Once PR #23 is merged, I do not plan to change anything to the traits / structs (& implementation) of ints & uints. So implementing the ARM/Aarch64 SIMD for those dtypes should then be quite safe :)

jvdd mentioned this pull request on Feb 9, 2023: ♻️ major refactoring
src/simd/simd_u32.rs (outdated, resolved)
jvdd (Owner, Author) commented Feb 11, 2023

What I tried in commit 07a5e66 does not work. You can't pass a const as an attribute to derive :/

Related issue rust-lang/rust#52393

varon (Contributor) commented Feb 25, 2023

Any action I can help with?

jvdd (Owner, Author) commented Feb 26, 2023

Hey @varon, after reviewing my own code today, I believe this PR is finally ready for merging! 🎉

If you'd like to help out, there are a couple of things you could do:

  • Give feedback on the architecture. I'm still learning Rust, so I'm not entirely sure whether I've improved the architecture - to be honest I do think it might require a second iteration (in a separate PR). Your input would be much appreciated.
  • Review the code. If you're short on time, you can skip the benchmarks and tests. Any feedback you can provide would be extremely helpful.

Thanks in advance for your help!

Comment on lines +25 to +26
# rstest = { version = "0.16", default-features = false}
# rstest_reuse = "0.5"
jvdd (Owner, Author): This is something I experimented with and will use in a future PR (parameterizing the tests).
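
For reference, a hypothetical example of what such rstest-parameterized tests could look like (the #[case] syntax is rstest's; the function under test is a scalar stand-in, not this crate's actual API):

```rust
use rstest::rstest;

/// Scalar stand-in for the function under test (not the crate's real API).
fn scalar_argmin(data: &[f32]) -> usize {
    data.iter()
        .enumerate()
        .fold(0, |best, (i, &v)| if v < data[best] { i } else { best })
}

#[rstest]
#[case(&[1.0f32, 5.0, -3.0], 2)]
#[case(&[7.0f32], 0)]
fn argmin_parameterized(#[case] data: &[f32], #[case] expected: usize) {
    assert_eq!(scalar_argmin(data), expected);
}
```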

Comment on lines +37 to +41
# TODO: support this
# [[bench]]
# name = "bench_f16_ignore_nan"
# harness = false
# required-features = ["half"]
jvdd (Owner, Author): This is currently not supported, as we use the ord_transform to provide SIMD support for the non-hardware-supported f16 datatype (see #1).

data
}

// TODO: rename _random_long_ to _nanargminmax_
jvdd (Owner, Author): Will do this in a separate PR (cleaning up the benchmarks: renaming + removing unused benchmarks).

benches/bench_f64_return_nan.rs (outdated, resolved)
Comment on lines +44 to +47
// TODO: split this up
// pub trait NaNArgMinMax {
// fn nanargminmax(&self) -> (usize, usize);
// }
jvdd (Owner, Author): This is for a future pull request.

src/lib.rs (outdated)
Comment on lines 211 to 212
// Scalar is faster for 64-bit numbers
// TODO: double check this (observed different things for new float implementation)
jvdd (Owner, Author): The /benches/results indicate that for most CPUs this is indeed faster! => will look into this in a separate PR

);
(minmax_tuple.0, minmax_tuple.2)
}
// TODO: previously we had dedicated non x86_64 code for f16 (see below)
jvdd (Owner, Author): Will revisit this in a separate PR.

varon (Contributor) commented Feb 26, 2023

Will review shortly!

jvdd (Owner, Author) left a review comment

LGTM 🙃

jvdd (Owner, Author) commented Feb 26, 2023

Excellent timing @varon! I just finished my review, so this is the perfect moment to jump in.

varon (Contributor) left a review comment

I added some general comments - overall it seems really solid.

As a singular point, I would try to make the relations between the modules and types really clear in the code. For instance, in each of the data-type implementations, refer to the trait they implement, then in that trait explain how it fits into the overall packaging/system, how it's generated, etc.

That will help users navigate the codebase significantly more easily, because there are useful breadcrumbs explaining where to look to go up/down the abstraction hierarchy.

The only other structural comment I would make is to try to migrate the test code out of the implementation files. It makes them seem considerably more intimidating than they actually are. Ultimately, where that's placed is a matter of opinion/Rust best practices, which I can't say I'm familiar with, but it was quite a surprise to come across the tests there.

Lastly, as a total aside, have you looked at doing a CUDA/ROCm implementation here? I suspect all of the copying back and forth of the results would probably be slower than this, but maybe with the right subdivision algorithm and keeping the data GPU-side, it could be possible to only copy it over once.

However, in the case of Metal, especially on the new Apple chips, they use unified memory - this would mean there's no requirement to copy the data over, and it would likely be dramatically faster than the NEON-based instructions.

src/simd/generic.rs (outdated, resolved)
src/lib.rs (outdated, resolved)
src/lib.rs (outdated, resolved)
src/simd/config.rs (outdated, resolved)
src/simd/config.rs (resolved)
src/simd/generic.rs (outdated, resolved)
src/simd/generic.rs (outdated, resolved)
src/simd/generic.rs (outdated, resolved)
Comment on lines +139 to +141
for _ in 0..arr.len() / LANE_SIZE - 1 {
// Increment the index
new_index = Self::_mm_add(new_index, Self::INDEX_INCREMENT);
varon (Contributor):

Performance question:

How does this look if we iterate on each of these arrays in separate for loops? Would that not increase our cache hits? We can't make assumptions about the locality of the data, but doing it this way means we're operating on what likely could be data with better locality.

This is for consideration, but if you do know the answer/have tried it, maybe throw in a comment explaining why that approach isn't faster than this.

jvdd (Owner, Author):

Very relevant question!

To make sure we are on the same page: you are suggesting that, instead of iterating over the entire array in a single loop, it might be more efficient to perform some sort of "chunked" iteration (e.g., 4*LANE_SIZE elements at a time) in an inner loop?
I can see how something like this could potentially decrease cache misses (as smaller "chunks" that fully fit in cache can be reused in the inner loop).

Although I did not explore exactly what you described here, I did try loop unrolling - to no avail.

varon (Contributor):

Ah, let me clarify.

In this loop, we're reading from 3 different arrays at once for every step, with one for loop.

What I am suggesting may be faster is to iterate over each array separately (i.e. use 2x separate for-loops), as this would maximise our chances of getting cache & prefetch hits.

for _ in 0..arr.len() / LANE_SIZE - 1 {
    // do stuff on index high
    ...
}

for _ in 0..arr.len() / LANE_SIZE - 1 {
    // do stuff on index low
    ...
}

jvdd (Owner, Author):

Oh I see!

Guess assessing the potential performance gain of this can be done rather quickly - I'll analyze the cache misses some time next week :)

jvdd (Owner, Author):

I thought about this further and quickly ran some experiments (benchmarks + some basic perf analysis). I did not observe any performance gains (only consistent performance degradations).

My 5 cents regarding this:

  • the code iterates over just one (and the same) array
  • all the other variables are SIMD registers (which should not require extra memory accesses?)

=> splitting the code up into 2 for loops would thus read the same data twice (and reading the data was already the bottleneck of this code), plus there is some additional overhead, as the index increment is now performed twice (instead of once in the single-loop implementation).
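
A scalar sketch of that trade-off (illustrative only, not the crate's SIMD code): the fused loop streams the memory-bound array once while both running extrema stay in registers, whereas the split version streams the same data twice.

```rust
fn minmax_fused(data: &[f32]) -> (f32, f32) {
    // One pass over memory; both running extrema stay in registers.
    data.iter().fold((f32::INFINITY, f32::NEG_INFINITY), |(lo, hi), &v| {
        (lo.min(v), hi.max(v))
    })
}

fn minmax_split(data: &[f32]) -> (f32, f32) {
    // Two passes: the same (memory-bound) data is streamed twice.
    let lo = data.iter().copied().fold(f32::INFINITY, f32::min);
    let hi = data.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    (lo, hi)
}

fn main() {
    let data: Vec<f32> = (0..1_000_000).map(|i| (i % 997) as f32).collect();
    assert_eq!(minmax_fused(&data), minmax_split(&data)); // same result, ~2x the memory traffic
}
```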

varon (Contributor) commented Feb 26, 2023

Also - I'd love it if you could drop me an email so I can get in touch outside of GitHub and hopefully set up closer collaboration/easier contact in the future.

You can reach me at varon-github@outlook.com.

jvdd mentioned this pull request on Feb 26, 2023
jvdd (Owner, Author) commented Feb 26, 2023

Thank you @varon for your feedback on the pull request! Here is an answer addressing your comments:

  1. Regarding making the relations between the modules and types clear in the code: I agree that this is an important aspect of code organization and understandability. To address this, I've documented all the methods of the SIMD traits and provided a detailed explanation for the SIMD structs (e.g., SSE), including the traits they implement and where to find this code.
  2. Moving the unit test code out of the implementation files is a good suggestion. I've created a separate issue for this (Improve unit tests #30), as imo this PR already encompasses quite a lot of large changes.
  3. Regarding implementing the algorithm using CUDA/ROCm, I think this is a very interesting idea that is worth exploring - especially for Apple silicon (with its unified memory). However, on other systems, I am afraid as well that the data transfer time between the CPU & GPU may present a significant bottleneck. I'll give this some more thought and create an issue with some more details. (Btw, I do not have access to any Apple silicon device, so it's rather unlikely that I will be able to develop this in the near future - I tested the aarch64/arm implementation on a Raspberry Pi 3 🙃)

Once again, thank you for your feedback and suggestions! It was quite fun implementing this with your support / feedback :)


I'll send you an email shortly.

jvdd merged commit 29ad172 into main on Feb 26, 2023
jvdd deleted the nans_v3 branch on March 26, 2023, 08:34
Successfully merging this pull request may close these issues: Test infinity support, 💪 handle NaNs.