Use pyarrow Tensor dtype #5272
Comments
Hi ! We're using the Arrow format for the datasets, and PyArrow tensors are not part of the Arrow format AFAIK:
source: apache/arrow#4802 (comment)
Hey @franz101 & @lhoestq!
The work stalled a little because it was not clear where TensorArray would live. However, the Arrow community recently agreed to create a well-known-extension-type document, and I would like apache/arrow#8510 to land there, along with an implementation in C++/Python plus another language. Is that something you would find beneficial?
That is a great update, thank you.
TensorArray sounds great ! Looking forward to it :) We've had our own ExtensionArray for fixed shape tensors for a while now, hoping to see something more standardized by the arrow community. Also super interested in the extension array for tensors of different sizes cc @mariosasko
FixedShapeTensor ExtensionType was merged and will be in Arrow 12.0.0 (release is planned mid April).
@rok Thanks for keeping us updated! I think it's best to introduce a new feature type that would use this extension type under the hood. I'll create an issue to discuss the design with the community in the coming days. Also, is there a tentative time frame for the variable-shape Tensor extension type?
@mariosasko please tag me in the discussion, perhaps I can contribute. As for the variable shape tensor array - I'd be interested in working on it but haven't seen much interest in the community yet. Are you saying
pyarrow 12 is out 🎉, will have a look if I can work on it for the ExtensionArray |
I think these two issues need to be fixed first on the Arrow side before adding the tensor feature type here: apache/arrow#35573 and apache/arrow#35599. @rok We've had a couple of requests for supporting variable-shape tensors on the forum/GH, but I did not manage to find the concrete issues using the search. TF/TFDS (and PyTorch with the |
That does make sense indeed. We should probably also be careful about memory layout to enable zero-copy interface to TF/PyTorch. |
So there is no way we can use pyarrow.Tensor ? |
Not with the Arrow format, and therefore not in
There is also an open issue to enable the conversion of |
We started a mailing list discussion about potential |
…ns, implemented using ExtensionType (#37166)
### Rationale for this change
For use cases where the underlying datatype and number of dimensions in tensors are equal but the actual shape is not, we want to add a `VariableShapeTensorType`. See #24868 and huggingface/datasets#5272
### What changes are included in this PR?
This introduces the definition of the `arrow.variable_shape_tensor` extension, its C++ implementation, and a Python wrapper.
### Are these changes tested?
Yes.
### Are there any user-facing changes?
This introduces a new extension type to the user.
* Closes: #24868
Lead-authored-by: Rok Mihevc <rok@mihevc.org>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Feature request
I was going through the discussion of converting tensors to lists.
Is there a way to leverage pyarrow's Tensors for nested arrays / embeddings?
For example:
Apache docs
Maybe this belongs in the pyarrow features / repo.
Motivation
Working with big data, we need to make sure to use the best data structures and IO out there.
Your contribution
I can try to open a PR if code changes are necessary.