
🚀 float NaN handling #21

Merged - jvdd merged 49 commits into main from nans_v3 on Feb 26, 2023

Conversation

jvdd (Owner) commented Feb 8, 2023

Handle NaNs, closes #16

✔️ no behavior changes in this PR (except for f16)

Previous (and also current) behavior for floats:

  • .argminmax: ignores NaNs (while being even faster for floats)
    => Only "downside": if the data contains ONLY NaNs and/or +inf/-inf, this will return 0
    (I believe we can accept this unexpected behavior for now - it seems like a very uncommon use case)
  • .nanargminmax: (new function 🎉) returns the index of the first NaN value (instead of ignoring it)
    To realize this functionality, we use the transformation as detailed in 💪 handle NaNs #16 & explored in 🚧 POC - support NaNs for SSE & AVX2 f32 #18

❗ for f16 we do not have an IgnoreNaN implementation yet (previously, .argminmax for f16 corresponded to the ReturnNaN case, as we use the ord_transform to efficiently handle the non-hardware-supported f16).
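
For illustration, a minimal sketch of the kind of bit-level transformation this relies on (hypothetical helper, not the crate's actual code; the real transform lives in the SIMD modules and may differ in details). It maps f32 bit patterns to i32 values whose signed ordering matches the float ordering, with positive-sign NaNs landing above +inf, so plain integer min/max can detect them:

```rust
/// Hypothetical sketch of an ordinal transform for f32 (the crate's actual
/// implementation may differ). Positive floats keep their bit pattern;
/// negative floats get their exponent/mantissa bits flipped, so comparing the
/// resulting i32 values matches the float ordering. NaNs with a clear sign
/// bit end up above +inf, which is what a "return NaN" argminmax exploits.
fn ord_transform(x: f32) -> i32 {
    let bits = x.to_bits() as i32;
    bits ^ ((bits >> 31) & i32::MAX)
}

fn main() {
    let vals = [-2.5f32, -0.0, 1.0, f32::INFINITY, f32::NAN];
    let keys: Vec<i32> = vals.iter().map(|&v| ord_transform(v)).collect();
    // The transformed keys are strictly increasing: -2.5 < -0.0 < 1.0 < +inf < NaN
    assert!(keys.windows(2).all(|w| w[0] < w[1]));
}
```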

Changing the "architecture":

  • SIMD ops in stand-alone trait (better w.r.t. coupling & cohesion)
    • create SIMDOps trait & add additional traits (see sketch below)
    • add auto-implement for SIMDCore traits
    • implement the SIMDOps trait for the various data types
      • x86 / x86_64
      • arm / Aarch64
  • implement IgnoreNaN and ReturnNaN variants for floats in the SIMDInstructionSet structs
    • IgnoreNaN
    • ReturnNaN

Changing the default behavior

  • switch from IgnoreNaN to ReturnNaN for argminmax
    => we use IgnoreNaN as the default - see ♻️ change nan default handling behavior to SkipNa #28
    • simd_f16.rs should implement SIMDInstructionSet and not the IgnoreNaN structs
    • document the header of the IgnoreNaN & ReturnNaN implementations
    • add tests for IgnoreNaN
    • add tests for ReturnNaN
    • update the benches (currently we only bench FloatIgnoreNaN)
      -> first ensure that we currently have no regressions (still the case!); if so, change the FloatIgnoreNaN benches to FloatReturnNaN (which will most likely introduce some regressions) and move the FloatIgnoreNaN benches to a dedicated bench file.
    • update the ArgMinMax trait
  • add scalar support
    • update tests with correct scalar implementation
    • implement SCALARIgnoreNaN
    • double check correct scalar support for f16

Minor TODOs during this (waaay too large) PR:


Overview of the new architecture

(architecture overview diagram)

  • the default SIMDInstructionSet structs (e.g., AVX2) implement argminmax with "return NaN" semantics (e.g., simd_f32.rs)
  • "ignore NaN" is served by a distinct struct (e.g., AVX2FloatIgnoreNaN)

codspeed-hq bot commented Feb 8, 2023

CodSpeed Performance Report

Merging #21 nans_v3 (d53e09c) will not alter performances.

Summary

🔥 0 improvements
❌ 0 regressions
✅ 32 untouched benchmarks

🆕 20 new benchmarks
⁉️ 12 dropped benchmarks

Benchmarks breakdown

Benchmark | main | nans_v3 | Change
🆕 scalar_random_long_f16 | N/A | 3.3 ms | N/A
🆕 sse_random_long_f16 | N/A | 476.4 µs | N/A
🆕 avx2_random_long_f16 | N/A | 236.5 µs | N/A
🆕 impl_random_long_f16 | N/A | 236.6 µs | N/A
🆕 scalar_nanargminmax_f32 | N/A | 2.2 ms | N/A
🆕 sse_nanargminmax_f32 | N/A | 968 µs | N/A
🆕 avx2_nanargminmax_f32 | N/A | 466.9 µs | N/A
🆕 impl_nanargminmax_f32 | N/A | 467.1 µs | N/A
🆕 scalar_random_long_f32 | N/A | 1.7 ms | N/A
🆕 sse_random_long_f32 | N/A | 712.1 µs | N/A
🆕 avx_random_long_f32 | N/A | 403 µs | N/A
🆕 impl_random_long_f32 | N/A | 403.2 µs | N/A
🆕 scalar_nanargminmax_f64 | N/A | 2.4 ms | N/A
🆕 sse_nanargminmax_f64 | N/A | 2.3 ms | N/A
🆕 avx2_nanargminmax_f64 | N/A | 1.1 ms | N/A
🆕 impl_nanargminmax_f64 | N/A | 1.1 ms | N/A
🆕 scalar_random_long_f64 | N/A | 1.9 ms | N/A
🆕 sse_random_long_f64 | N/A | 1.4 ms | N/A
🆕 avx_random_long_f64 | N/A | 804.6 µs | N/A
🆕 impl_random_long_f64 | N/A | 804.8 µs | N/A
⁉️ scalar_random_long_f32 | 1.7 ms | N/A | N/A
⁉️ sse_random_long_f32 | 712.1 µs | N/A | N/A
⁉️ avx_random_long_f32 | 402.9 µs | N/A | N/A
⁉️ impl_random_long_f32 | 403.2 µs | N/A | N/A
⁉️ scalar_random_long_f16 | 2.3 ms | N/A | N/A
⁉️ sse_random_long_f16 | 475.6 µs | N/A | N/A
⁉️ avx2_random_long_f16 | 235.6 µs | N/A | N/A
⁉️ impl_random_long_f16 | 235.8 µs | N/A | N/A
⁉️ scalar_random_long_f64 | 1.9 ms | N/A | N/A
⁉️ sse_random_long_f64 | 1.4 ms | N/A | N/A
⁉️ avx_random_long_f64 | 804.6 µs | N/A | N/A
⁉️ impl_random_long_f64 | 804.8 µs | N/A | N/A

jvdd mentioned this pull request on Feb 9, 2023
varon (Contributor) commented Feb 9, 2023

Thanks so much for tackling this, @jvdd ! Any smaller tasks I can help with?

varon (Contributor) commented Feb 9, 2023

Gave it a shot at filling in the API values for NEON + Arm64.
https://github.com/varon/argminmax/tree/neon-nan-v3

You can fish the exact commit out here - 728e310

Feel free to cherry-pick this into your PR if you judge it useful!

jvdd (Owner, Author) commented Feb 9, 2023

Thx @varon, I appreciate your help!

I'll first try to merge PR #23 (which does some major refactoring in terms of traits & structs) - it should make the codebase a lot more flexible (it separates floats from the other datatypes without any real code overhead). I'll document this tomorrow. (I fear that this merge will result in quite a lot of merge conflicts with your PR - my apologies for this :/)

Once PR #23 is merged, I do not plan to change anything to the traits / structs (& implementation) of ints & uints. So implementing the ARM/Aarch64 SIMD for those dtypes should then be quite safe :)

jvdd mentioned this pull request on Feb 9, 2023: ♻️ major refactoring
src/simd/simd_u32.rs (outdated, resolved)
jvdd (Owner, Author) commented Feb 11, 2023

What I tried in commit 07a5e66 does not work. You can't pass a const as an attribute to derive :/

Related issue rust-lang/rust#52393

varon (Contributor) commented Feb 25, 2023

Any action I can help with?

jvdd (Owner, Author) commented Feb 26, 2023

Hey @varon, after reviewing my own code today, I believe this PR is finally ready for merging! 🎉

If you'd like to help out, there are a couple of things you could do:

  • Give feedback on the architecture. I'm still learning Rust, so I'm not entirely sure whether I've improved the architecture - to be honest I do think it might require a second iteration (in a separate PR). Your input would be much appreciated.
  • Review the code. If you're short on time, you can skip the benchmarks and tests. Any feedback you can provide would be extremely helpful.

Thanks in advance for your help!

Comment on lines +25 to +26
# rstest = { version = "0.16", default-features = false}
# rstest_reuse = "0.5"
jvdd (Owner, Author): This is something I experimented with and will use in a future PR (parameterizing the tests).
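
For reference, a hypothetical example of what such rstest-parameterized tests could look like (the #[case] syntax is rstest's; the function under test is a scalar stand-in, not this crate's actual API):

```rust
use rstest::rstest;

/// Scalar stand-in for the function under test (not the crate's real API).
fn scalar_argmin(data: &[f32]) -> usize {
    data.iter()
        .enumerate()
        .fold(0, |best, (i, &v)| if v < data[best] { i } else { best })
}

#[rstest]
#[case(&[1.0f32, 5.0, -3.0], 2)]
#[case(&[7.0f32], 0)]
fn argmin_parameterized(#[case] data: &[f32], #[case] expected: usize) {
    assert_eq!(scalar_argmin(data), expected);
}
```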

Comment on lines +37 to +41
# TODO: support this
# [[bench]]
# name = "bench_f16_ignore_nan"
# harness = false
# required-features = ["half"]
jvdd (Owner, Author): This is currently not supported, as we use the ord_transform to provide SIMD support for the non-hardware-supported f16 datatype (see #1).

data
}

// TODO: rename _random_long_ to _nanargminmax_
jvdd (Owner, Author): Will do this in a separate PR (cleaning up the benchmarks: renaming + removing unused benchmarks).

benches/bench_f64_return_nan.rs (outdated, resolved)
Comment on lines +44 to +47
// TODO: split this up
// pub trait NaNArgMinMax {
// fn nanargminmax(&self) -> (usize, usize);
// }
jvdd (Owner, Author): This is for a future pull request.

src/lib.rs (outdated)
Comment on lines 211 to 212
// Scalar is faster for 64-bit numbers
// TODO: double check this (observed different things for new float implementation)
jvdd (Owner, Author): The /benches/results indicate that for most CPUs this is indeed faster! => will look into this in a separate PR

);
(minmax_tuple.0, minmax_tuple.2)
}
// TODO: previously we had dedicated non x86_64 code for f16 (see below)
jvdd (Owner, Author): Will revisit this in a separate PR.

varon (Contributor) commented Feb 26, 2023

Will review shortly!

jvdd (Owner, Author) left a review comment

LGTM 🙃

jvdd (Owner, Author) commented Feb 26, 2023

Excellent timing @varon! I just finished my review, so this is the perfect moment to jump in.

varon (Contributor) left a review comment

I added some general comments - overall it seems really solid.

As a singular point, I would try to make the relations between the modules and types really clear in the code. For instance, in each of the data-type implementations, refer to the trait they implement, then in that trait explain how it fits into the overall packaging/system, how it's generated, etc.

That will help users navigate the codebase significantly more easily, because there are useful breadcrumbs explaining where to look to go up/down the abstraction hierarchy.

The only other structural comment I would make is to try to migrate the test code out of the implementation files. It makes them seem considerably more intimidating than they actually are. Ultimately, where that's placed is a matter of opinion/Rust best practices, which I can't say I'm familiar with, but it was quite a surprise to come across the tests there.

Lastly, as a total aside, have you looked at doing a CUDA/ROCm implementation here? I suspect all of the copying back and forth of the results would probably be slower than this, but maybe with the right subdivision algorithm and keeping the data GPU-side, it could be possible to only copy it over once.

However, in the case of Metal, especially on the new Apple chips, they use unified memory - this would mean there's no requirement to copy the data over, and it would likely be dramatically faster than the NEON-based instructions.

src/simd/generic.rs (outdated, resolved)
src/lib.rs (outdated, resolved)
src/lib.rs (outdated, resolved)
src/simd/config.rs (outdated, resolved)
src/simd/config.rs (resolved)
src/simd/generic.rs (outdated, resolved)
src/simd/generic.rs (outdated, resolved)
src/simd/generic.rs (outdated, resolved)
Comment on lines +139 to +141
for _ in 0..arr.len() / LANE_SIZE - 1 {
// Increment the index
new_index = Self::_mm_add(new_index, Self::INDEX_INCREMENT);
varon (Contributor):

Performance question:

How does this look if we iterate on each of these arrays in separate for loops? Would that not increase our cache hits? We can't make assumptions about the locality of the data, but doing it this way means we're operating on what likely could be data with better locality.

This is for consideration, but if you do know the answer/have tried it, maybe throw in a comment explaining why that approach isn't faster than this.

jvdd (Owner, Author):

Very relevant question!

To make sure we are on the same page: you are suggesting that, instead of iterating over the entire array in a single loop, it might be more efficient to perform some sort of "chunked" iteration (e.g., 4*LANE_SIZE elements at a time) in an inner loop?
I can see how something like this could potentially decrease cache misses (as smaller "chunks" that fully fit in cache can be reused in the inner loop).

Although I did not explore exactly what you described here, I did try loop unrolling - to no avail.

varon (Contributor):

Ah, let me clarify.

In this loop, we're reading from 3 different arrays at once for every step, with one for loop.

What I am suggesting may be faster is to iterate over each array separately (i.e. use 2x separate for-loops), as this would maximise our chances of getting cache & prefetch hits.

for _ in 0..arr.len() / LANE_SIZE - 1 {
    // do stuff on index high
    ...
}

for _ in 0..arr.len() / LANE_SIZE - 1 {
    // do stuff on index low
    ...
}

jvdd (Owner, Author):

Oh I see!

Guess assessing the potential performance gain of this can be done rather quickly - I'll analyze the cache misses some time next week :)

jvdd (Owner, Author):

I thought about this further and quickly ran some experiments (benchmarks + some basic perf analysis). I did not observe any performance gains (only consistent performance degradations).

My 5 cents regarding this:

  • the code iterates over just one (and the same) array
  • all the other variables are SIMD registers (which should not require extra memory accesses?)

=> splitting the code up into 2 for loops would thus read the same data twice (and reading the data was already the bottleneck of this code), plus there is some additional overhead, as the index increment is now performed twice (instead of once in the single-loop implementation).
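
A scalar sketch of that trade-off (illustrative only, not the crate's SIMD code): the fused loop streams the memory-bound array once while both running extrema stay in registers, whereas the split version streams the same data twice.

```rust
fn minmax_fused(data: &[f32]) -> (f32, f32) {
    // One pass over memory; both running extrema stay in registers.
    data.iter().fold((f32::INFINITY, f32::NEG_INFINITY), |(lo, hi), &v| {
        (lo.min(v), hi.max(v))
    })
}

fn minmax_split(data: &[f32]) -> (f32, f32) {
    // Two passes: the same (memory-bound) data is streamed twice.
    let lo = data.iter().copied().fold(f32::INFINITY, f32::min);
    let hi = data.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    (lo, hi)
}

fn main() {
    let data: Vec<f32> = (0..1_000_000).map(|i| (i % 997) as f32).collect();
    assert_eq!(minmax_fused(&data), minmax_split(&data)); // same result, ~2x the memory traffic
}
```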

varon (Contributor) commented Feb 26, 2023

Also - I'd love it if you could drop me an email so I can get in touch outside of GitHub and hopefully set up closer collaboration/easier contact in the future.

You can reach me at varon-github@outlook.com.

jvdd mentioned this pull request on Feb 26, 2023
jvdd (Owner, Author) commented Feb 26, 2023

Thank you @varon for your feedback on the pull request! Here is an answer addressing your comments:

  1. Regarding making the relations between the modules and types clear in the code: I agree that this is an important aspect of code organization and understandability. To address this, I've documented all the methods of the SIMD traits and provided a detailed explanation for the SIMD structs (e.g., SSE), including the traits they implement and where to find this code.
  2. Moving the unit test code out of the implementation files is a good suggestion. I've created a separate issue for this (Improve unit tests #30), as imo this PR already encompasses quite a lot of large changes.
  3. Regarding implementing the algorithm using CUDA/ROCm, I think this is a very interesting idea that is worth exploring - especially for Apple silicon (with its unified memory). However, on other systems, I am afraid as well that the data transfer time between the CPU & GPU may present a significant bottleneck. I'll give this some more thought and create an issue with some more details. (Btw, I do not have access to any Apple silicon device, so it's rather unlikely that I will be able to develop this in the near future - I tested the aarch64/arm implementation on a Raspberry Pi 3 🙃)

Once again, thank you for your feedback and suggestions! It was quite fun implementing this with your support / feedback :)


I'll send you an email shortly.

jvdd merged commit 29ad172 into main on Feb 26, 2023
jvdd deleted the nans_v3 branch on March 26, 2023, 08:34
Successfully merging this pull request may close these issues: Test infinity support, 💪 handle NaNs.