Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Added buffer interoperability with arrow-rs #1437

Merged
merged 3 commits into from Mar 22, 2023

Conversation

tustvold
Copy link
Contributor

As part of #1429 we want to provide an interoperability story between arrow2 and arrow-rs.

The original proposal involved porting arrow-rs and arrow2 to have a common base array representation. This was to preserve the original spirit of @jorgecarleitao 's proposal in apache/arrow-rs#1176 (comment). However, doing this in an incremental fashion whilst not introducing performance regressions or major breaking changes is complicated and extremely time consuming.

Taking a step-back, all we really want is a reasonably fast way to convert between array representations, to facilitate interoperability and potentially incremental migration of codebases. Whilst perhaps less "pure", simply providing a safe API to convert between ArrayData and Box<dyn arrow2::Array> is likely sufficient.

The major things this would change are:

  • Conversion is no longer just a memmove, but likely on a similar order of magnitude
  • The converted arrays would not be convertible back to Vec as they would be opaque allocations

However, it would allow us to provide an interoperability story in a matter of days instead of weeks/months.

In this vein, this PR adds zero-copy conversion between the buffer representations, as this is all that is really necessary to permit this. The rest of the conversion logic is fairly mechanical, I already have it mostly implemented but wanted to get feedback first.

}

pub(crate) type Bytes<T> = foreign_vec::ForeignVec<BytesAllocator, T>;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the real meat of the conversion

@@ -14,6 +15,7 @@ pub trait NativeType:
+ Send
+ Sync
+ Sized
+ RefUnwindSafe
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is necessary to ensure that arrow2::Bytes<T> is RefUnwindSafe which is important to arrow::Buffer

@codecov
Copy link

codecov bot commented Mar 17, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.02 🎉

Comparison is base (db87f71) 83.76% compared to head (a952e10) 83.78%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1437      +/-   ##
==========================================
+ Coverage   83.76%   83.78%   +0.02%     
==========================================
  Files         375      376       +1     
  Lines       41024    41074      +50     
==========================================
+ Hits        34364    34415      +51     
+ Misses       6660     6659       -1     
Impacted Files Coverage Δ
src/types/native.rs 68.46% <ø> (ø)
src/bitmap/immutable.rs 88.42% <100.00%> (+1.21%) ⬆️
src/buffer/immutable.rs 87.38% <100.00%> (+0.72%) ⬆️
src/buffer/mod.rs 100.00% <100.00%> (ø)
src/ffi/array.rs 83.12% <100.00%> (ø)

... and 4 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@alamb
Copy link
Collaborator

alamb commented Mar 17, 2023

Is there anything about this API that would preclude an (eventual) unification of the underlying buffer types? If not, it then seems quite reasonable to me to introduce an (optional) migration path and then work on the unifying buffer types to get Vec convertibility over time / as resources allow

@tustvold
Copy link
Contributor Author

Is there anything about this API that would preclude an (eventual) unification of the underlying buffer types

No, although if this approach is given the green light, it is unclear that such a unification would be worth the fairly significant effort, I certainly would not be intending to undertake it.

@tustvold
Copy link
Contributor Author

Integration test failure does not appear to be related to this PR

@ritchie46
Copy link
Collaborator

I do think it is a regression if we cannot get back to Vec anymore, In polars we convert back sometimes. Could we make this a feature gate? I could feature gate that behavior out in polars as well.

Whilst perhaps less "pure", simply providing a safe API to convert between ArrayData and Box is likely sufficient.

Couldn't we already do this with arrow FFI spec? What are the pro's and cons against this route? As we would still need to compile both libraries if we convert between the two.

@tustvold
Copy link
Contributor Author

tustvold commented Mar 21, 2023

if we cannot get back to Vec anymore

It's only you can't go back to Vec from an array created initially by the other library and then converted, i.e. the conversion loses the ability to go back to a vec. Arrow-rs arrays created from vec can still be converted back, and the same for arrow2

ffi

The conversion is safe and ergonomic, ffi is neither 😅

This approach should also be marginally faster as it doesn't need to marshal back and forth from the c data layout (which may need to recompute null buffers)

still need to compile both

You only need to compile an extremely small part of arrow-rs, it won't register in the compile times at all

@alamb
Copy link
Collaborator

alamb commented Mar 21, 2023

It's only you can't go back to Vec from an array created initially by the other library and then converted, i.e. the conversion loses the ability to go back to a vec. Arrow-rs arrays created from vec can still be converted back, and the same for arrow2

Maybe we can add a test demonstrating going back/forth to vec (and when it doesn't work) as a way to document the limitiation?

@ritchie46
Copy link
Collaborator

It's only you can't go back to Vec from an array created initially by the other library and then converted, i.e. the conversion loses the ability to go back to a vec. Arrow-rs arrays created from vec can still be converted back, and the same for arrow2

Right, I misunderstood that part. In that case this looks great! 👍

Copy link
Collaborator

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR makes sense to me -- thank you @tustvold

However, I am not super familiar with the arrow2 codebase so I defer to @ritchie46 / @jorgecarleitao / @sundy-li for final approval

Copy link
Owner

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have reviewed and this is very ingenious and well implemented. Thank you @tustvold !

My only minor concern is that because arrow-buffer bumps major version every 2 weeks, we need to update this repo every 2 weeks, but this is only a procedural issue as the crate is not changing much.

Thank you again, @tustvold 🙇

@jorgecarleitao jorgecarleitao added the feature A new feature label Mar 22, 2023
@jorgecarleitao jorgecarleitao changed the title Buffer interoperability with arrow-rs (#1429) Added buffer interoperability with arrow-rs (#1429) Mar 22, 2023
@jorgecarleitao jorgecarleitao merged commit 1073211 into jorgecarleitao:main Mar 22, 2023
@jorgecarleitao jorgecarleitao changed the title Added buffer interoperability with arrow-rs (#1429) Added buffer interoperability with arrow-rs Mar 22, 2023
@alamb
Copy link
Collaborator

alamb commented Mar 22, 2023

My only minor concern is that because arrow-buffer bumps major version every 2 weeks, we need to update this repo every 2 weeks, but this is only a procedural issue as the crate is not changing much.

We might be able to publish new versions of arrow2 with minor (e.g. 0.16.1) with just version updates if that turns out to be an issue. I think bumping dependents is semantically compatible

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
feature A new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants