Replaced own allocator by `std::Vec`. #385

jorgecarleitao · 2021-09-06T16:19:11Z

This PR moves the custom allocator to a feature gate cache_aligned, replacing it by std::Vec in the default, full, etc.

This allows users to create Buffer and MutableBuffer directly from a Vec at zero cost (MutableBuffer becomes a thin wrapper of Vec). This opens a whole range of possibilities related to crates that assume that the backing container is a Vec.

Note that this is fully compatible with arrow spec; we just do not follow the recommendation that the allocations follow 64 bytes (which neither this nor arrow-rs was following, since we have been using 128 byte alignments in x86_64). I was unable to observe any performance difference between 128-byte, 64-byte and std's (i.e. aligned with T) allocations.

This is backward incompatible, see #449 for details. Closes #449.

codecov · 2021-09-06T16:32:07Z

Codecov Report

Merging #385 (0995cbc) into main (db92910) will decrease coverage by 0.61%.
The diff coverage is 30.22%.

@@            Coverage Diff             @@
##             main     #385      +/-   ##
==========================================
- Coverage   80.64%   80.03%   -0.62%     
==========================================
  Files         372      371       -1     
  Lines       22691    22709      +18     
==========================================
- Hits        18300    18175     -125     
- Misses       4391     4534     +143

Impacted Files	Coverage Δ
arrow-pyarrow-integration-testing/src/lib.rs	`0.00% <ø> (ø)`
src/alloc/mod.rs	`0.00% <ø> (-93.03%)`	⬇️
src/types/bit_chunk.rs	`48.00% <ø> (ø)`
src/types/mod.rs	`22.22% <0.00%> (-6.10%)`	⬇️
src/vec.rs	`0.00% <0.00%> (ø)`
src/buffer/bytes.rs	`53.84% <50.00%> (+0.27%)`	⬆️
src/buffer/mutable.rs	`86.46% <90.00%> (-1.84%)`	⬇️
src/bitmap/bitmap_ops.rs	`95.69% <100.00%> (ø)`
src/buffer/immutable.rs	`95.34% <100.00%> (+0.22%)`	⬆️
tests/it/array/utf8/mutable.rs	`100.00% <100.00%> (ø)`
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update db92910...0995cbc. Read the comment docs.

jorgecarleitao · 2021-09-06T20:52:44Z

Added bench for addition of two arrays,

add 2^10                time:   [364.46 ns 368.20 ns 373.48 ns]                     
                        change: [-19.711% -18.123% -16.664%] (p = 0.00 < 0.05)
add 2^12                time:   [1.5722 us 1.5812 us 1.5923 us]                      
                        change: [-21.703% -21.215% -20.647%] (p = 0.00 < 0.05)
add 2^14                time:   [5.7770 us 5.7963 us 5.8227 us]                      
                        change: [-18.389% -18.078% -17.705%] (p = 0.00 < 0.05)
add 2^16                time:   [44.481 us 44.532 us 44.588 us]                      
                        change: [-25.319% -24.865% -24.493%] (p = 0.00 < 0.05)
add 2^18                time:   [277.21 us 277.83 us 278.68 us]                     
                        change: [+3.3622% +3.6555% +3.9314%] (p = 0.00 < 0.05)
add 2^20                time:   [1.1558 ms 1.1600 ms 1.1647 ms]
                        change: [-0.6701% +0.2683% +1.1049%] (p = 0.57 > 0.05)

other arithmetics benches show no differences.

ritchie46 · 2021-09-07T06:08:02Z

Added bench for addition of two arrays,

add 2^10                time:   [364.46 ns 368.20 ns 373.48 ns]                     
                        change: [-19.711% -18.123% -16.664%] (p = 0.00 < 0.05)
add 2^12                time:   [1.5722 us 1.5812 us 1.5923 us]                      
                        change: [-21.703% -21.215% -20.647%] (p = 0.00 < 0.05)
add 2^14                time:   [5.7770 us 5.7963 us 5.8227 us]                      
                        change: [-18.389% -18.078% -17.705%] (p = 0.00 < 0.05)
add 2^16                time:   [44.481 us 44.532 us 44.588 us]                      
                        change: [-25.319% -24.865% -24.493%] (p = 0.00 < 0.05)
add 2^18                time:   [277.21 us 277.83 us 278.68 us]                     
                        change: [+3.3622% +3.6555% +3.9314%] (p = 0.00 < 0.05)
add 2^20                time:   [1.1558 ms 1.1600 ms 1.1647 ms]
                        change: [-0.6701% +0.2683% +1.1049%] (p = 0.57 > 0.05)

other arithmetics benches show no differences.

How should I read this? Is positive change longer runtime?

Dandandan · 2021-09-07T09:04:24Z

Great cleanup!

My thoughts:

If manual alignment is still be preferred, it would be possible to implement/using a custom allocator (https://doc.rust-lang.org/std/alloc/trait.Allocator.html) that rounds up to use 64/128 byte alignment (still nightly unfortunately) and use it for the underlying Vec (and just use the global alloc in the implementation)
Also AFAIK some (global) allocators, like mimalloc and mimalloc also make some choices w.r.t. alignment, so might be good to do some benchmarking with those as well.

ritchie46 · 2021-09-07T10:13:08Z

If manual alignment is still be preferred, it would be possible to implement/using a custom allocator (https://doc.rust-lang.org/std/alloc/trait.Allocator.html) that rounds up to use 64/128 byte alignment (still nightly unfortunately) and use it for the underlying Vec (and just use the global alloc in the implementation)

IMO having the same alignment as the std vec is really worth something. This allows zero copy moving in an out of arrow, improvin interop with the rust landcape.

I also did some benchmarks on polars and the db-benchmark groupby questions (this is with mimalloc):

* 64 byte aligned allocator
    - query time 33.52 s    
    - compilation time 4m 50s
* vec backed allocator
    - query time 33.47 s
    - compilation time 4m 06s

So the runtime difference seem to be negligible for that benchmark. The compilation times are getting better. 🙂

src/buffer/mutable.rs

Dandandan · 2021-09-07T10:50:57Z

If manual alignment is still be preferred, it would be possible to implement/using a custom allocator (https://doc.rust-lang.org/std/alloc/trait.Allocator.html) that rounds up to use 64/128 byte alignment (still nightly unfortunately) and use it for the underlying Vec (and just use the global alloc in the implementation)

IMO having the same alignment as the std vec is really worth something. This allows zero copy moving in an out of arrow, improvin interop with the rust landcape.
* 64 byte aligned allocator
    - query time 33.52 s    
    - compilation time 4m 50s

My suggestion above still would be to use a Vec (with the same benefits as you mention) but providing a custom Allocator implementation to the Vec (second type parameter to Vec: https://doc.rust-lang.org/std/vec/struct.Vec.html# ) to round it up to be a multiple of 64/128 bytes and then just use the global allocator as currently is done. AFAIK interoperability for Vec doesn't rely on alignment of the Vec, the Allocator is free to allocate more than requested.
This would be only necessary when there is still a convincing argument to do this alignment at all.

For some use cases (like DataFusion) there might be a good argument to remove the 128 byte alignment, e.g. to remove some memory overhead for small batches.

ritchie46 · 2021-09-07T13:22:32Z

My suggestion above still would be to use a Vec (with the same benefits as you mention) but providing a custom Allocator implementation to the Vec (second type parameter to Vec: https://doc.rust-lang.org/std/vec/struct.Vec.html# ) to round it up to be a multiple of 64/128 bytes and then just use the global allocator as currently is done. AFAIK interoperability for Vec doesn't rely on alignment of the Vec, the Allocator is free to allocate more than requested.
This would be only necessary when there is still a convincing argument to do this alignment at all.

You mean, keep the same allocator generic and default to the standard one? As I understand it, deallocating with a different alignment than the allocation is UB. This would indeed not happen if we are generic over the allocator, same as Vec.

jorgecarleitao · 2021-09-07T13:28:17Z

I think @Dandandan is suggesting switching to Vec and, in nightly (or in the future), offer an implementation of the trait std::Allocator in case people would like to use the 64/128 byte allocation.

In other words, there is no loss of generality here for nightly if we offer the 128 byte implementation of std::Allocator. Then users can use Vec with custom if they want to keep the alignment or use Vec for usual allocation, and Vec will handle all the details for us (since Vec is generic over the allocator)

ritchie46 · 2021-09-07T13:43:01Z

I think @Dandandan is suggesting switching to Vec and, in nightly (or in the future), offer an implementation of the trait std::Allocator in case people would like to use the 64/128 byte allocation.
In other words, there is no loss of generality here for nightly if we offer the 128 byte implementation of std::Allocator. Then users can use Vec with custom if they want to keep the alignment or use Vec for usual allocation, and Vec will handle all the details for us (since Vec is generic over the allocator)

Check! 👍

Dandandan · 2021-09-07T14:06:52Z

You mean, keep the same allocator generic and default to the standard one? As I understand it, deallocating with a different alignment than the allocation is UB. This would indeed not happen if we are generic over the allocator, same as Vec.

I think @Dandandan is suggesting switching to Vec and, in nightly (or in the future), offer an implementation of the trait std::Allocator in case people would like to use the 64/128 byte allocation.
In other words, there is no loss of generality here for nightly if we offer the 128 byte implementation of std::Allocator. Then users can use Vec with custom if they want to keep the alignment or use Vec for usual allocation, and Vec will handle all the details for us (since Vec is generic over the allocator)

Yes, the proposal in that case would be to use Vec<T, ArrowAllocator> for the Vec to be used by arrow2

e.g. :

struct ArrowAllocator;

impl Allocator for ArrowAllocator {
    fn allocate(&self, layout: Layout) -> Result<NonNull<[u8]>, AllocError> {
       // round up to desired alignment
       let custom_layout = Layout::from_size_align(layout.size, round_up_to_128(layout.align))?;
       // use global allocator to actually allocate the data, with changed aligment
       let alloc = alloc::alloc(custom_layout);
       // return as slice 
       Ok(from_raw_parts(..))
    }
    // other methods
    ...
}

type ArrowVec<T> = Vec<T, ArrowAllocator>;

You can now use the same API for this ArrowVec as Vec for building the arrow buffers, with some minor changes.

ritchie46 · 2021-09-07T14:15:30Z

Yes, the proposal in that case would be to use Vec<T, ArrowAllocator> for the Vec to be used by arrow2

I think it'd be better to have MutableBuffer<T, Allocator> in such a case, where implementors (e.g. DataFusion) could define their own allocator.

In the setup you propose a Vec<T> would have a different allocation than a Vec<T, ArrowAllocator>, meaning that zero copy of Vec<T> into arrow would not be possible.

Dandandan · 2021-09-07T14:36:04Z

Yes, the proposal in that case would be to use Vec<T, ArrowAllocator> for the Vec to be used by arrow2

I think it'd be better to have MutableBuffer<T, Allocator> in such a case, where implementors (e.g. DataFusion) could define their own allocator.

In the setup you propose a Vec<T> would have a different allocation than a Vec<T, ArrowAllocator>, meaning that zero copy of Vec<T> into arrow would not be possible.

Yes, good suggestion.
I'm not sure, but it might be possible to do it without making the allocator generic/public by using some methods in the same (experimental) API like https://doc.rust-lang.org/std/vec/struct.Vec.html#method.from_raw_parts_in which allow to change the allocator of the new vec (without copy).

elferherrera · 2021-09-14T18:54:02Z

I just found out about this one from @ritchie46. This is a very nice clean up.
Out of curiosity, how will it work with foreign arrays? I thought that was one of the reasons for having its own buffer

jorgecarleitao · 2021-09-15T03:16:32Z

Out of curiosity, how will it work with foreign arrays? I thought that was one of the reasons for having its own buffer

It works because Buffer continues to be immutable and we still wrap everything under Bytes. The layout becomes

struct Buffer<T> {
     data: Arc<Bytes<T>>
     offset: usize
}

/// Mode of deallocating memory regions
pub enum Deallocation {
    /// Native deallocation
    Native(usize),
    // Foreign interface, via a callback
    Foreign(Arc<ffi::ArrowArray>),
}

pub struct Bytes<T: NativeType> {
    ptr: NonNull<T>,
    len: usize,
    deallocation: Deallocation,
}

impl<T: NativeType> Drop for Bytes<T> {
    #[inline]
    fn drop(&mut self) {
        match &self.deallocation {
            Deallocation::Native(capacity) => {
                unsafe { Vec::from_raw_parts(self.ptr, self.len, *capacity) };
            }
            // foreign interface knows how to deallocate itself.
            Deallocation::Foreign(_) => (),
        }
    }
}

i.e. replace alloc by Vec's own allocation strategy.

What we lose are allocations along cache lines (e.g. 64 bytes).

There are indications that, when adding two arrays, there is a penalty in not having them aligned along cache lines.

elferherrera · 2021-09-15T08:18:16Z

I see... you are creating a vector here

Deallocation::Native(capacity) => {
   unsafe { Vec::from_raw_parts(self.ptr, self.len, *capacity) };
}

and then you let rust drop it.

I'll dig deeper later once it has been merged into master. Nice work indeed

jorgecarleitao · 2021-09-23T21:54:50Z

I have taken an alternative, less drastic approach in the latest push: made the aligned allocation feature gated under cache_aligned.

Essentially:

cache_aligned: we use our custom allocator incompatible with Vec
not cache_aligned: we use Vec, [Mutable]Buffer implements from_vec(data: Vec<T>) and Vec<T> implements From<MutableBuffer<T>>
neither default nor full use cache_aligned, which means that people must opt-in to aligned allocations.

jorgecarleitao · 2021-09-25T08:37:00Z

With MIRI on CI for both branches, we are ready to merge. If anyone has any objection, go ahead :P

jorgecarleitao added the feature A new feature label Sep 6, 2021

jorgecarleitao changed the title ~~Replaced custom allocator by std::Vec.~~ Replaced own allocator by std::Vec. Sep 6, 2021

jorgecarleitao force-pushed the vec branch from 4315a94 to 1731a09 Compare September 6, 2021 20:50

Dandandan reviewed Sep 7, 2021

View reviewed changes

src/buffer/mutable.rs Outdated Show resolved Hide resolved

jorgecarleitao mentioned this pull request Sep 21, 2021

Question about design mbrobbel/narrow#15

Closed

jorgecarleitao force-pushed the vec branch 3 times, most recently from 06473b1 to 7fb641f Compare September 23, 2021 21:51

jorgecarleitao force-pushed the vec branch 4 times, most recently from 9927ebe to 47e143c Compare September 25, 2021 07:42

jorgecarleitao force-pushed the vec branch from 47e143c to fd39e3d Compare September 25, 2021 08:58

jorgecarleitao force-pushed the vec branch from fd39e3d to 306e57a Compare September 25, 2021 08:58

jorgecarleitao added 4 commits September 25, 2021 20:24

Added bench of arithmetics.

77e153d

Removed custom allocator.

f87ba22

Made Vec optional.

1bf9224

Added MIRI checks and coverage for custom allocator

0995cbc

jorgecarleitao force-pushed the vec branch from 306e57a to 0995cbc Compare September 26, 2021 04:47

jorgecarleitao merged commit 235b7f5 into main Sep 26, 2021

jorgecarleitao deleted the vec branch September 26, 2021 06:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replaced own allocator by `std::Vec`. #385

Replaced own allocator by `std::Vec`. #385

jorgecarleitao commented Sep 6, 2021 •

edited

codecov bot commented Sep 6, 2021 •

edited

jorgecarleitao commented Sep 6, 2021

ritchie46 commented Sep 7, 2021

Dandandan commented Sep 7, 2021

ritchie46 commented Sep 7, 2021

Dandandan commented Sep 7, 2021

ritchie46 commented Sep 7, 2021

jorgecarleitao commented Sep 7, 2021 •

edited

ritchie46 commented Sep 7, 2021

Dandandan commented Sep 7, 2021 •

edited

ritchie46 commented Sep 7, 2021

Dandandan commented Sep 7, 2021 •

edited

elferherrera commented Sep 14, 2021

jorgecarleitao commented Sep 15, 2021

elferherrera commented Sep 15, 2021 •

edited

jorgecarleitao commented Sep 23, 2021 •

edited

jorgecarleitao commented Sep 25, 2021

Replaced own allocator by std::Vec. #385

Replaced own allocator by std::Vec. #385

Conversation

jorgecarleitao commented Sep 6, 2021 • edited

codecov bot commented Sep 6, 2021 • edited

Codecov Report

jorgecarleitao commented Sep 6, 2021

ritchie46 commented Sep 7, 2021

Dandandan commented Sep 7, 2021

ritchie46 commented Sep 7, 2021

Dandandan commented Sep 7, 2021

ritchie46 commented Sep 7, 2021

jorgecarleitao commented Sep 7, 2021 • edited

ritchie46 commented Sep 7, 2021

Dandandan commented Sep 7, 2021 • edited

ritchie46 commented Sep 7, 2021

Dandandan commented Sep 7, 2021 • edited

elferherrera commented Sep 14, 2021

jorgecarleitao commented Sep 15, 2021

elferherrera commented Sep 15, 2021 • edited

jorgecarleitao commented Sep 23, 2021 • edited

jorgecarleitao commented Sep 25, 2021

Replaced own allocator by `std::Vec`. #385

Replaced own allocator by `std::Vec`. #385

jorgecarleitao commented Sep 6, 2021 •

edited

codecov bot commented Sep 6, 2021 •

edited

jorgecarleitao commented Sep 7, 2021 •

edited

Dandandan commented Sep 7, 2021 •

edited

Dandandan commented Sep 7, 2021 •

edited

elferherrera commented Sep 15, 2021 •

edited

jorgecarleitao commented Sep 23, 2021 •

edited