This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Auto vectorized vs packed_simd in arrow #255

Closed
sundy-li opened this issue Aug 5, 2021 · 10 comments
Labels
no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog question Further information is requested

Comments

@sundy-li
Collaborator

sundy-li commented Aug 5, 2021

I found that if the primitive array has no null values, the auto-vectorized version can outperform the manual SIMD one.

arrow2-sum 2^13 u32     time:   [545.01 ns 546.26 ns 547.72 ns]                                 
                        change: [-0.2234% +0.2358% +0.8664%] (p = 0.41 > 0.05)
                        No change in performance detected.


arrow2-sum 2^13 u32 nosimd                                                                            
                        time:   [316.59 ns 317.36 ns 318.20 ns]
                        change: [+0.4290% +0.8618% +1.2727%] (p = 0.00 < 0.05)
                        Change within noise threshold.

arrow2-sum null 2^13 u32                                                                             
                        time:   [4.3197 us 4.3290 us 4.3394 us]
                        change: [-0.1470% +0.3333% +0.8057%] (p = 0.19 > 0.05)


arrow2-sum null 2^13 u32 nosimd                                                                             
                        time:   [10.149 us 10.168 us 10.190 us]
                        change: [-0.8429% -0.4038% +0.1192%] (p = 0.11 > 0.05)

Code:

fn bench_sum(arr_a: &PrimitiveArray<u32>) {
    sum(criterion::black_box(arr_a)).unwrap();
}

fn bench_sum_nosimd(arr_a: &PrimitiveArray<u32>) {
    nosimd_sum_u32(criterion::black_box(arr_a)).unwrap();
}

fn nosimd_sum_u32(arr_a: &PrimitiveArray<u32>) -> Option<u32> {
    let mut sum = 0u32;
    if arr_a.null_count() == 0 {
        arr_a.values().as_slice().iter().for_each(|f| {
            sum += *f;
        });
    } else if let Some(validity) = arr_a.validity() {
        // the validity bit is true for valid (non-null) values, so
        // multiplying by it zeroes out null slots without branching
        arr_a
            .values()
            .as_slice()
            .iter()
            .zip(validity.iter())
            .for_each(|(f, is_valid)| sum += *f * (is_valid as u32));
    }
    // observe the result so the optimizer cannot remove the loop
    assert!(sum > 0);
    Some(sum)
}

I have not yet used Godbolt to inspect the generated assembly...
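The two code paths above can be sketched over plain slices (a minimal, self-contained sketch that does not use the arrow2 types; `sum_no_nulls` and `sum_with_validity` are hypothetical names chosen for illustration):

```rust
/// Sum without a validity mask: a plain reduction over a slice.
/// This is the loop shape LLVM auto-vectorizes well.
fn sum_no_nulls(values: &[u32]) -> u32 {
    values.iter().copied().fold(0u32, u32::wrapping_add)
}

/// Sum with a validity mask: the bit is true for valid (non-null)
/// values, so multiplying by it zeroes out null slots without branching.
fn sum_with_validity(values: &[u32], validity: &[bool]) -> u32 {
    values
        .iter()
        .zip(validity.iter())
        .fold(0u32, |acc, (v, is_valid)| {
            acc.wrapping_add(*v * (*is_valid as u32))
        })
}

fn main() {
    let values = [1u32, 2, 3, 4];
    let validity = [true, false, true, true];
    assert_eq!(sum_no_nulls(&values), 10);
    assert_eq!(sum_with_validity(&values, &validity), 8); // slot 1 is null
}
```

The branch-free multiply in the null path keeps the loop body uniform, which is what makes it amenable to vectorization at all; the no-null path simply has less work per element.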

@jorgecarleitao
Owner

Yeap, it has been a battle. I actually have not used packed_simd for a while, but the null case was so important, and I was unable to hit the right instructions with auto-vectorization, that I ended up adding it.

@sundy-li
Collaborator Author

sundy-li commented Aug 5, 2021

OK, so is it possible to change the non-null branch to the auto-vectorized version?

@jorgecarleitao
Copy link
Owner

it is faster; it is simpler => definitely :)

@leiysky

leiysky commented Aug 11, 2021

Hi @jorgecarleitao, I have a question about the compatibility of vectorization here.

There are many SIMD instruction sets (e.g. SSE, AVX, FMA), which are coupled to the microarchitecture (e.g. Intel Skylake, AMD Zen 2). If we only do simple cross-compilation, that is, only specify the target architecture, we may not utilize SIMD well.

AFAIK, this issue is usually solved by function multiversioning.

In the C++ world, there are approaches like the GCC target attribute, which can generate multiple versions of a function (typically with different SIMD instruction sets) and dispatch between them at load time.

I also noticed that there is a multiversion crate (https://docs.rs/multiversion/0.6.1/multiversion/), but I haven't tested it yet.

Is it possible to support this in arrow2?

@Dandandan
Collaborator

Hey @leiysky

Yes, we use the multiversion crate right now for achieving auto-vectorization with specific SIMD instructions.

See for an example here:
https://github.com/jorgecarleitao/arrow2/blob/main/src/compute/aggregate/sum.rs#L22

@leiysky

leiysky commented Aug 11, 2021

Hey @leiysky

Yes, we use the multiversion crate right now for achieving auto-vectorization with specific SIMD instructions.

See for an example here:

https://github.com/jorgecarleitao/arrow2/blob/main/src/compute/aggregate/sum.rs#L22

Nice!

I only read the code here and found that there seems to be no special handling.

https://github.com/jorgecarleitao/arrow2/blob/main/src/compute/arithmetics/basic/add.rs

Sorry for my misunderstanding.

@sundy-li
Collaborator Author

I only read the code here, and find there seems no special handling.

I had some doubts before; I think it may not work on platforms without AVX support. But I have not tested it.

@ritchie46
Collaborator

Yes, we use the multiversion crate right now for achieving auto-vectorization with specific SIMD instructions.

To utilize AVX at runtime, this still has to be compiled on a machine that supports it, right? Or can it cross-compile for different targets?

@leiysky

leiysky commented Aug 14, 2021

Yes, we use the multiversion crate right now for achieving auto-vectorization with specific SIMD instructions.

To utilize avx at runtime. This still has to be compiled with a machine that supports this right? Or can it cross compile for different targets?

Multiversioning allows you to define targets (e.g. AVX, SSE) for a function; the compiler will then always produce the specified versions of the function and dispatch between them at load time (not runtime, which is how it achieves zero overhead).

@Dandandan
Collaborator

Dandandan commented Aug 14, 2021

Yes, we use the multiversion crate right now for achieving auto-vectorization with specific SIMD instructions.

To utilize avx at runtime. This still has to be compiled with a machine that supports this right? Or can it cross compile for different targets?

Rust can cross-compile to a different target architecture if you like, but this code only generates the extra versions when compiling for the specified target. E.g. x86_64+avx creates the different compiled versions only when compiling with x86_64 as the target, but won't do so for aarch64 or x86 (it wouldn't make sense, as the code would be invalid).
When the target matches, it includes the multiple versions and does feature detection at the first call of the function.
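That first-call detection can be sketched by hand (a minimal illustration of the dispatch pattern, not the code the multiversion crate actually generates; `kernel_name` is a hypothetical example, and the choice is cached so detection runs only once):

```rust
use std::sync::OnceLock;

// Pick the best available kernel on the first call, then cache the choice.
// Subsequent calls pay only for reading the cached value.
fn kernel_name() -> &'static str {
    static CHOICE: OnceLock<&'static str> = OnceLock::new();
    *CHOICE.get_or_init(|| {
        #[cfg(target_arch = "x86_64")]
        {
            // runtime CPU feature detection, independent of compile flags
            if is_x86_feature_detected!("avx2") {
                return "avx2";
            }
        }
        // non-x86_64 targets (and x86_64 CPUs without AVX2) fall back here
        "fallback"
    })
}

fn main() {
    // the result depends on the CPU this runs on
    println!("dispatching to the {} kernel", kernel_name());
}
```

The `cfg` attribute mirrors the point above: the AVX2 probe is only compiled in when the target architecture could ever execute it.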

@jorgecarleitao jorgecarleitao added no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog question Further information is requested labels Sep 18, 2021
Repository owner locked and limited conversation to collaborators Sep 30, 2021

