This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Auto vectorized vs packed_simd in arrow #255

Closed
sundy-li opened this issue Aug 5, 2021 · 10 comments
Labels
no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog question Further information is requested

Comments

@sundy-li
Collaborator

sundy-li commented Aug 5, 2021

I found that if the primitive array has no null values, the auto-vectorized version can outperform the manual SIMD one.

arrow2-sum 2^13 u32     time:   [545.01 ns 546.26 ns 547.72 ns]                                 
                        change: [-0.2234% +0.2358% +0.8664%] (p = 0.41 > 0.05)
                        No change in performance detected.


arrow2-sum 2^13 u32 nosimd                                                                            
                        time:   [316.59 ns 317.36 ns 318.20 ns]
                        change: [+0.4290% +0.8618% +1.2727%] (p = 0.00 < 0.05)
                        Change within noise threshold.

arrow2-sum null 2^13 u32                                                                             
                        time:   [4.3197 us 4.3290 us 4.3394 us]
                        change: [-0.1470% +0.3333% +0.8057%] (p = 0.19 > 0.05)


arrow2-sum null 2^13 u32 nosimd                                                                             
                        time:   [10.149 us 10.168 us 10.190 us]
                        change: [-0.8429% -0.4038% +0.1192%] (p = 0.11 > 0.05)

Code:

fn bench_sum(arr_a: &PrimitiveArray<u32>) {
    sum(criterion::black_box(arr_a)).unwrap();
}

fn bench_sum_nosimd(arr_a: &PrimitiveArray<u32>) {
    nosimd_sum_u32(criterion::black_box(arr_a)).unwrap();
}

fn nosimd_sum_u32(arr_a: &PrimitiveArray<u32>) -> Option<u32> {
    let mut sum = 0u32;
    if arr_a.null_count() == 0 {
        arr_a.values().as_slice().iter().for_each(|f| {
            sum += *f;
        });
    } else if let Some(validity) = arr_a.validity() {
        // the validity bit is true for valid (non-null) values, so
        // multiplying by it zeroes out null slots without branching
        arr_a
            .values()
            .as_slice()
            .iter()
            .zip(validity.iter())
            .for_each(|(f, is_valid)| sum += *f * (is_valid as u32));
    }
    // observe the result so the optimizer cannot remove the loop
    assert!(sum > 0);
    Some(sum)
}

I have not yet used Godbolt to inspect the generated assembly...
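The two code paths above can be sketched over plain slices (a minimal, self-contained sketch that does not use the arrow2 types; `sum_no_nulls` and `sum_with_validity` are hypothetical names chosen for illustration):

```rust
/// Sum without a validity mask: a plain reduction over a slice.
/// This is the loop shape LLVM auto-vectorizes well.
fn sum_no_nulls(values: &[u32]) -> u32 {
    values.iter().copied().fold(0u32, u32::wrapping_add)
}

/// Sum with a validity mask: the bit is true for valid (non-null)
/// values, so multiplying by it zeroes out null slots without branching.
fn sum_with_validity(values: &[u32], validity: &[bool]) -> u32 {
    values
        .iter()
        .zip(validity.iter())
        .fold(0u32, |acc, (v, is_valid)| {
            acc.wrapping_add(*v * (*is_valid as u32))
        })
}

fn main() {
    let values = [1u32, 2, 3, 4];
    let validity = [true, false, true, true];
    assert_eq!(sum_no_nulls(&values), 10);
    assert_eq!(sum_with_validity(&values, &validity), 8); // slot 1 is null
}
```

The branch-free multiply in the null path keeps the loop body uniform, which is what makes it amenable to vectorization at all; the no-null path simply has less work per element.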

@jorgecarleitao
Owner

Yeap, it has been a battle. I actually have not used packed_simd for a while, but the null case was so important, and I was unable to hit the right instructions with auto-vectorization, that I ended up adding it.

@sundy-li
Collaborator Author

sundy-li commented Aug 5, 2021

OK, so is it possible to change the non-null branch to the auto-vectorized version?

@jorgecarleitao
Copy link
Owner

it is faster; it is simpler => definitely :)

@leiysky

leiysky commented Aug 11, 2021

Hi @jorgecarleitao, I have a question about the compatibility of vectorization here.

There are many SIMD instruction sets (e.g. SSE, AVX, FMA), which are coupled to the microarchitecture (e.g. Intel Skylake, AMD Zen 2). If we only do simple cross-compilation, that is, only specify the target architecture, we may not utilize SIMD well.

AFAIK, this issue is usually solved by function multiversioning.

In the C++ world, there are approaches like the GCC target attribute, which can generate multiple versions of a function (typically with different SIMD instruction sets) and dispatch between them at load time.

I also noticed that there is a multiversion crate (https://docs.rs/multiversion/0.6.1/multiversion/), but I haven't tested it yet.

Is it possible to support this in arrow2?

@Dandandan
Collaborator

Hey @leiysky

Yes, we use the multiversion crate right now for achieving auto-vectorization with specific SIMD instructions.

See for an example here:
https://github.com/jorgecarleitao/arrow2/blob/main/src/compute/aggregate/sum.rs#L22

@leiysky

leiysky commented Aug 11, 2021

Hey @leiysky

Yes, we use the multiversion crate right now for achieving auto-vectorization with specific SIMD instructions.

See for an example here:

https://github.com/jorgecarleitao/arrow2/blob/main/src/compute/aggregate/sum.rs#L22

Nice!

I only read the code here and found that there seems to be no special handling.

https://github.com/jorgecarleitao/arrow2/blob/main/src/compute/arithmetics/basic/add.rs

Sorry for my misunderstanding.

@sundy-li
Collaborator Author

I only read the code here, and find there seems no special handling.

I had some doubts before; I think it may not work on platforms without AVX support. But I have not tested it.

@ritchie46
Collaborator

Yes, we use the multiversion crate right now for achieving auto-vectorization with specific SIMD instructions.

To utilize AVX at runtime, this still has to be compiled on a machine that supports it, right? Or can it cross-compile for different targets?

@leiysky

leiysky commented Aug 14, 2021

Yes, we use the multiversion crate right now for achieving auto-vectorization with specific SIMD instructions.

To utilize avx at runtime. This still has to be compiled with a machine that supports this right? Or can it cross compile for different targets?

Multiversioning allows you to define targets (e.g. AVX, SSE) for a function; the compiler will then always produce the specified versions of the function and dispatch between them at load time (not runtime, which is how it achieves zero overhead).

@Dandandan
Collaborator

Dandandan commented Aug 14, 2021

Yes, we use the multiversion crate right now for achieving auto-vectorization with specific SIMD instructions.

To utilize avx at runtime. This still has to be compiled with a machine that supports this right? Or can it cross compile for different targets?

Rust can cross-compile to a different target architecture if you like, but this code only generates the extra versions when compiling for the specified target. E.g. x86_64+avx creates the different compiled versions only when compiling with x86_64 as the target, but won't do so for aarch64 or x86 (it wouldn't make sense, as the code would be invalid).
When the target matches, it includes the multiple versions and does feature detection at the first call of the function.
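That first-call detection can be sketched by hand (a minimal illustration of the dispatch pattern, not the code the multiversion crate actually generates; `kernel_name` is a hypothetical example, and the choice is cached so detection runs only once):

```rust
use std::sync::OnceLock;

// Pick the best available kernel on the first call, then cache the choice.
// Subsequent calls pay only for reading the cached value.
fn kernel_name() -> &'static str {
    static CHOICE: OnceLock<&'static str> = OnceLock::new();
    *CHOICE.get_or_init(|| {
        #[cfg(target_arch = "x86_64")]
        {
            // runtime CPU feature detection, independent of compile flags
            if is_x86_feature_detected!("avx2") {
                return "avx2";
            }
        }
        // non-x86_64 targets (and x86_64 CPUs without AVX2) fall back here
        "fallback"
    })
}

fn main() {
    // the result depends on the CPU this runs on
    println!("dispatching to the {} kernel", kernel_name());
}
```

The `cfg` attribute mirrors the point above: the AVX2 probe is only compiled in when the target architecture could ever execute it.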

@jorgecarleitao jorgecarleitao added no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog question Further information is requested labels Sep 18, 2021
Repository owner locked and limited conversation to collaborators Sep 30, 2021

