Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Replaced own allocator by std::Vec. #385

Merged
merged 4 commits into from Sep 26, 2021
Merged

Replaced own allocator by std::Vec. #385

merged 4 commits into from Sep 26, 2021

Conversation

jorgecarleitao
Copy link
Owner

@jorgecarleitao jorgecarleitao commented Sep 6, 2021

This PR moves the custom allocator to a feature gate cache_aligned, replacing it by std::Vec in the default, full, etc.

This allows users to create Buffer and MutableBuffer directly from a Vec at zero cost (MutableBuffer becomes a thin wrapper of Vec). This opens a whole range of possibilities related to crates that assume that the backing container is a Vec.

Note that this is fully compatible with arrow spec; we just do not follow the recommendation that the allocations follow 64 bytes (which neither this nor arrow-rs was following, since we have been using 128 byte alignments in x86_64). I was unable to observe any performance difference between 128-byte, 64-byte and std's (i.e. aligned with T) allocations.

This is backward incompatible, see #449 for details. Closes #449.

@jorgecarleitao jorgecarleitao added the feature A new feature label Sep 6, 2021
@jorgecarleitao jorgecarleitao changed the title Replaced custom allocator by std::Vec. Replaced own allocator by std::Vec. Sep 6, 2021
@codecov
Copy link

codecov bot commented Sep 6, 2021

Codecov Report

Merging #385 (0995cbc) into main (db92910) will decrease coverage by 0.61%.
The diff coverage is 30.22%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #385      +/-   ##
==========================================
- Coverage   80.64%   80.03%   -0.62%     
==========================================
  Files         372      371       -1     
  Lines       22691    22709      +18     
==========================================
- Hits        18300    18175     -125     
- Misses       4391     4534     +143     
Impacted Files Coverage Δ
arrow-pyarrow-integration-testing/src/lib.rs 0.00% <ø> (ø)
src/alloc/mod.rs 0.00% <ø> (-93.03%) ⬇️
src/types/bit_chunk.rs 48.00% <ø> (ø)
src/types/mod.rs 22.22% <0.00%> (-6.10%) ⬇️
src/vec.rs 0.00% <0.00%> (ø)
src/buffer/bytes.rs 53.84% <50.00%> (+0.27%) ⬆️
src/buffer/mutable.rs 86.46% <90.00%> (-1.84%) ⬇️
src/bitmap/bitmap_ops.rs 95.69% <100.00%> (ø)
src/buffer/immutable.rs 95.34% <100.00%> (+0.22%) ⬆️
tests/it/array/utf8/mutable.rs 100.00% <100.00%> (ø)
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update db92910...0995cbc. Read the comment docs.

@jorgecarleitao
Copy link
Owner Author

Added bench for addition of two arrays,

add 2^10                time:   [364.46 ns 368.20 ns 373.48 ns]                     
                        change: [-19.711% -18.123% -16.664%] (p = 0.00 < 0.05)
add 2^12                time:   [1.5722 us 1.5812 us 1.5923 us]                      
                        change: [-21.703% -21.215% -20.647%] (p = 0.00 < 0.05)
add 2^14                time:   [5.7770 us 5.7963 us 5.8227 us]                      
                        change: [-18.389% -18.078% -17.705%] (p = 0.00 < 0.05)
add 2^16                time:   [44.481 us 44.532 us 44.588 us]                      
                        change: [-25.319% -24.865% -24.493%] (p = 0.00 < 0.05)
add 2^18                time:   [277.21 us 277.83 us 278.68 us]                     
                        change: [+3.3622% +3.6555% +3.9314%] (p = 0.00 < 0.05)
add 2^20                time:   [1.1558 ms 1.1600 ms 1.1647 ms]
                        change: [-0.6701% +0.2683% +1.1049%] (p = 0.57 > 0.05)

other arithmetics benches show no differences.

@ritchie46
Copy link
Collaborator

Added bench for addition of two arrays,

add 2^10                time:   [364.46 ns 368.20 ns 373.48 ns]                     
                        change: [-19.711% -18.123% -16.664%] (p = 0.00 < 0.05)
add 2^12                time:   [1.5722 us 1.5812 us 1.5923 us]                      
                        change: [-21.703% -21.215% -20.647%] (p = 0.00 < 0.05)
add 2^14                time:   [5.7770 us 5.7963 us 5.8227 us]                      
                        change: [-18.389% -18.078% -17.705%] (p = 0.00 < 0.05)
add 2^16                time:   [44.481 us 44.532 us 44.588 us]                      
                        change: [-25.319% -24.865% -24.493%] (p = 0.00 < 0.05)
add 2^18                time:   [277.21 us 277.83 us 278.68 us]                     
                        change: [+3.3622% +3.6555% +3.9314%] (p = 0.00 < 0.05)
add 2^20                time:   [1.1558 ms 1.1600 ms 1.1647 ms]
                        change: [-0.6701% +0.2683% +1.1049%] (p = 0.57 > 0.05)

other arithmetics benches show no differences.

How should I read this? Is positive change longer runtime?

@Dandandan
Copy link
Collaborator

Great cleanup!

My thoughts:

  • If manual alignment is still be preferred, it would be possible to implement/using a custom allocator (https://doc.rust-lang.org/std/alloc/trait.Allocator.html) that rounds up to use 64/128 byte alignment (still nightly unfortunately) and use it for the underlying Vec (and just use the global alloc in the implementation)
  • Also AFAIK some (global) allocators, like mimalloc and mimalloc also make some choices w.r.t. alignment, so might be good to do some benchmarking with those as well.

@ritchie46
Copy link
Collaborator

If manual alignment is still be preferred, it would be possible to implement/using a custom allocator (https://doc.rust-lang.org/std/alloc/trait.Allocator.html) that rounds up to use 64/128 byte alignment (still nightly unfortunately) and use it for the underlying Vec (and just use the global alloc in the implementation)

IMO having the same alignment as the std vec is really worth something. This allows zero copy moving in an out of arrow, improvin interop with the rust landcape.

I also did some benchmarks on polars and the db-benchmark groupby questions (this is with mimalloc):

* 64 byte aligned allocator
    - query time 33.52 s    
    - compilation time 4m 50s
* vec backed allocator
    - query time 33.47 s
    - compilation time 4m 06s

So the runtime difference seem to be negligible for that benchmark. The compilation times are getting better. 🙂

src/buffer/mutable.rs Outdated Show resolved Hide resolved
@Dandandan
Copy link
Collaborator

If manual alignment is still be preferred, it would be possible to implement/using a custom allocator (https://doc.rust-lang.org/std/alloc/trait.Allocator.html) that rounds up to use 64/128 byte alignment (still nightly unfortunately) and use it for the underlying Vec (and just use the global alloc in the implementation)

IMO having the same alignment as the std vec is really worth something. This allows zero copy moving in an out of arrow, improvin interop with the rust landcape.

* 64 byte aligned allocator
    - query time 33.52 s    
    - compilation time 4m 50s

My suggestion above still would be to use a Vec (with the same benefits as you mention) but providing a custom Allocator implementation to the Vec (second type parameter to Vec: https://doc.rust-lang.org/std/vec/struct.Vec.html# ) to round it up to be a multiple of 64/128 bytes and then just use the global allocator as currently is done. AFAIK interoperability for Vec doesn't rely on alignment of the Vec, the Allocator is free to allocate more than requested.
This would be only necessary when there is still a convincing argument to do this alignment at all.

For some use cases (like DataFusion) there might be a good argument to remove the 128 byte alignment, e.g. to remove some memory overhead for small batches.

@ritchie46
Copy link
Collaborator

My suggestion above still would be to use a Vec (with the same benefits as you mention) but providing a custom Allocator implementation to the Vec (second type parameter to Vec: https://doc.rust-lang.org/std/vec/struct.Vec.html# ) to round it up to be a multiple of 64/128 bytes and then just use the global allocator as currently is done. AFAIK interoperability for Vec doesn't rely on alignment of the Vec, the Allocator is free to allocate more than requested.
This would be only necessary when there is still a convincing argument to do this alignment at all.

You mean, keep the same allocator generic and default to the standard one? As I understand it, deallocating with a different alignment than the allocation is UB. This would indeed not happen if we are generic over the allocator, same as Vec.

@jorgecarleitao
Copy link
Owner Author

jorgecarleitao commented Sep 7, 2021

I think @Dandandan is suggesting switching to Vec and, in nightly (or in the future), offer an implementation of the trait std::Allocator in case people would like to use the 64/128 byte allocation.

In other words, there is no loss of generality here for nightly if we offer the 128 byte implementation of std::Allocator. Then users can use Vec with custom if they want to keep the alignment or use Vec for usual allocation, and Vec will handle all the details for us (since Vec is generic over the allocator)

@ritchie46
Copy link
Collaborator

I think @Dandandan is suggesting switching to Vec and, in nightly (or in the future), offer an implementation of the trait std::Allocator in case people would like to use the 64/128 byte allocation.
In other words, there is no loss of generality here for nightly if we offer the 128 byte implementation of std::Allocator. Then users can use Vec with custom if they want to keep the alignment or use Vec for usual allocation, and Vec will handle all the details for us (since Vec is generic over the allocator)

Check! 👍

@Dandandan
Copy link
Collaborator

Dandandan commented Sep 7, 2021

You mean, keep the same allocator generic and default to the standard one? As I understand it, deallocating with a different alignment than the allocation is UB. This would indeed not happen if we are generic over the allocator, same as Vec.

I think @Dandandan is suggesting switching to Vec and, in nightly (or in the future), offer an implementation of the trait std::Allocator in case people would like to use the 64/128 byte allocation.
In other words, there is no loss of generality here for nightly if we offer the 128 byte implementation of std::Allocator. Then users can use Vec with custom if they want to keep the alignment or use Vec for usual allocation, and Vec will handle all the details for us (since Vec is generic over the allocator)

Yes, the proposal in that case would be to use Vec<T, ArrowAllocator> for the Vec to be used by arrow2

e.g. :

struct ArrowAllocator;

impl Allocator for ArrowAllocator {
    fn allocate(&self, layout: Layout) -> Result<NonNull<[u8]>, AllocError> {
       // round up to desired alignment
       let custom_layout = Layout::from_size_align(layout.size, round_up_to_128(layout.align))?;
       // use global allocator to actually allocate the data, with changed aligment
       let alloc = alloc::alloc(custom_layout);
       // return as slice 
       Ok(from_raw_parts(..))
    }
    // other methods
    ...
}

type ArrowVec<T> = Vec<T, ArrowAllocator>;

You can now use the same API for this ArrowVec as Vec for building the arrow buffers, with some minor changes.

@ritchie46
Copy link
Collaborator

Yes, the proposal in that case would be to use Vec<T, ArrowAllocator> for the Vec to be used by arrow2

I think it'd be better to have MutableBuffer<T, Allocator> in such a case, where implementors (e.g. DataFusion) could define their own allocator.

In the setup you propose a Vec<T> would have a different allocation than a Vec<T, ArrowAllocator>, meaning that zero copy of Vec<T> into arrow would not be possible.

@Dandandan
Copy link
Collaborator

Dandandan commented Sep 7, 2021

Yes, the proposal in that case would be to use Vec<T, ArrowAllocator> for the Vec to be used by arrow2

I think it'd be better to have MutableBuffer<T, Allocator> in such a case, where implementors (e.g. DataFusion) could define their own allocator.

In the setup you propose a Vec<T> would have a different allocation than a Vec<T, ArrowAllocator>, meaning that zero copy of Vec<T> into arrow would not be possible.

Yes, good suggestion.
I'm not sure, but it might be possible to do it without making the allocator generic/public by using some methods in the same (experimental) API like https://doc.rust-lang.org/std/vec/struct.Vec.html#method.from_raw_parts_in which allow to change the allocator of the new vec (without copy).

@elferherrera
Copy link
Contributor

I just found out about this one from @ritchie46. This is a very nice clean up.
Out of curiosity, how will it work with foreign arrays? I thought that was one of the reasons for having its own buffer

@jorgecarleitao
Copy link
Owner Author

Out of curiosity, how will it work with foreign arrays? I thought that was one of the reasons for having its own buffer

It works because Buffer continues to be immutable and we still wrap everything under Bytes. The layout becomes

struct Buffer<T> {
     data: Arc<Bytes<T>>
     offset: usize
}

/// Mode of deallocating memory regions
pub enum Deallocation {
    /// Native deallocation
    Native(usize),
    // Foreign interface, via a callback
    Foreign(Arc<ffi::ArrowArray>),
}

pub struct Bytes<T: NativeType> {
    ptr: NonNull<T>,
    len: usize,
    deallocation: Deallocation,
}

impl<T: NativeType> Drop for Bytes<T> {
    #[inline]
    fn drop(&mut self) {
        match &self.deallocation {
            Deallocation::Native(capacity) => {
                unsafe { Vec::from_raw_parts(self.ptr, self.len, *capacity) };
            }
            // foreign interface knows how to deallocate itself.
            Deallocation::Foreign(_) => (),
        }
    }
}

i.e. replace alloc by Vec's own allocation strategy.

What we lose are allocations along cache lines (e.g. 64 bytes).

There are indications that, when adding two arrays, there is a penalty in not having them aligned along cache lines.

@elferherrera
Copy link
Contributor

elferherrera commented Sep 15, 2021

I see... you are creating a vector here

Deallocation::Native(capacity) => {
   unsafe { Vec::from_raw_parts(self.ptr, self.len, *capacity) };
}

and then you let rust drop it.

I'll dig deeper later once it has been merged into master. Nice work indeed

@jorgecarleitao
Copy link
Owner Author

jorgecarleitao commented Sep 23, 2021

I have taken an alternative, less drastic approach in the latest push: made the aligned allocation feature gated under cache_aligned.

Essentially:

  • cache_aligned: we use our custom allocator incompatible with Vec
  • not cache_aligned: we use Vec, [Mutable]Buffer implements from_vec(data: Vec<T>) and Vec<T> implements From<MutableBuffer<T>>
  • neither default nor full use cache_aligned, which means that people must opt-in to aligned allocations.

@jorgecarleitao
Copy link
Owner Author

With MIRI on CI for both branches, we are ready to merge. If anyone has any objection, go ahead :P

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
feature A new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Removed ALIGNMENT invariant from [Mutable]Buffer
4 participants