Use pyarrow Tensor dtype #5272

Open
franz101 opened this issue Nov 20, 2022 · 16 comments
Labels
enhancement New feature or request

Comments

@franz101

franz101 commented Nov 20, 2022

Feature request

I was going through the discussion on converting tensors to lists.
Is there a way to leverage pyarrow's Tensors for nested arrays / embeddings?

For example:

import pyarrow as pa
import numpy as np
x = np.array([[2, 2, 4], [4, 5, 100]], np.int32)
pa.Tensor.from_numpy(x, dim_names=["dim1","dim2"])

Apache docs

Maybe this belongs in the pyarrow features / repo.

Motivation

When working with big data, we need to make sure we use the best data structures and I/O available.

Your contribution

I can try to open a PR if code changes are necessary.

@franz101 franz101 added the enhancement New feature or request label Nov 20, 2022
@lhoestq
Member

lhoestq commented Nov 21, 2022

Hi ! We're using the Arrow format for the datasets, and PyArrow tensors are not part of the Arrow format AFAIK:

There is no direct support in the arrow columnar format to store Tensors as column values.

source: apache/arrow#4802 (comment)

@franz101
Author

franz101 commented Nov 21, 2022

@wesm @rok it's been around three years. Are there any updates regarding Arrow tensor support for datasets? 🙏 I know you must be very busy, but I would appreciate learning what the state of the art is. I saw that the PR #8510 is still open.

@rok

rok commented Nov 21, 2022

Hey @franz101 & @lhoestq!
There is a plan and a PR to create an ExtensionArray of Tensors of equal sizes, as well as a plan to do the same for Tensors of different sizes (ARROW-8714).

@rok

rok commented Nov 21, 2022

The work stalled a little because it was not clear where TensorArray would live. However, the Arrow community recently agreed to create a well-known-extension-type document, and I would like apache/arrow#8510 to land there, along with an implementation in C++/Python plus one other language. Is that something you would find beneficial?

@franz101
Author

franz101 commented Nov 21, 2022

That is a great update, thank you.
It looks like this feature would benefit the datasets implementation of ArrayExtensionArray. Is that correct @eladsegal @lhoestq?

@lhoestq
Member

lhoestq commented Nov 21, 2022

TensorArray sounds great ! Looking forward to it :)

We've had our own ExtensionArray for fixed shape tensors for a while now, hoping to see something more standardized by the arrow community.

Also super interested in the extension array for tensors of different sizes cc @mariosasko

@rok

rok commented Apr 4, 2023

FixedShapeTensor ExtensionType was merged and will be in Arrow 12.0.0 (release is planned mid April).

@mariosasko
Collaborator

@rok Thanks for keeping us updated! I think it's best to introduce a new feature type that would use this extension type under the hood. I'll create an issue to discuss the design with the community in the coming days.

Also, is there a tentative time frame for the variable-shape Tensor extension type?

@rok

rok commented Apr 7, 2023

@mariosasko please tag me in the discussion, perhaps I can contribute.

As for the variable-shape tensor array: I'd be interested in working on it, but I haven't seen much interest from the community yet. Are you saying huggingface/datasets could use it?

@franz101
Author

franz101 commented May 3, 2023

pyarrow 12 is out 🎉. I will have a look at whether I can work on it for the ExtensionArray.

@mariosasko
Collaborator

I think these two issues need to be fixed first on the Arrow side before adding the tensor feature type here: apache/arrow#35573 and apache/arrow#35599.

@rok We've had a couple of requests for supporting variable-shape tensors on the forum/GH, but I did not manage to find the concrete issues using the search. TF/TFDS (and PyTorch with the nested_tensor API) support them, so it makes sense for us to do the same eventually (the Ray project has an extension type to support this case)

@rok

rok commented May 16, 2023

@rok We've had a couple of requests for supporting variable-shape tensors on the forum/GH, but I did not manage to find the concrete issues using the search. TF/TFDS (and PyTorch with the nested_tensor API) support them, so it makes sense for us to do the same eventually (the Ray project has an extension type to support this case)

That does make sense indeed. We should probably also be careful about the memory layout to enable a zero-copy interface to TF/PyTorch.

@hfawaz
Contributor

hfawaz commented Jun 29, 2023

So there is no way we can use pyarrow.Tensor ?

@lhoestq
Member

lhoestq commented Jun 29, 2023

Not with the Arrow format, and therefore not in datasets. But they released a new FixedShapeTensorArray to store tensors in the Arrow format. We plan to support this in datasets at some point !

@AlenkaF

AlenkaF commented Jul 4, 2023

There is also an open issue to enable the conversion of pyarrow.Tensor to pyarrow.FixedShapeTensorType: apache/arrow#35068. This way one could indirectly use pyarrow.Tensor in Arrow format.

@rok

rok commented Aug 17, 2023

We started a mailing list discussion about a potential VariableShapeTensor extension array; please check it out and give feedback. For more details, there is also a PR: apache/arrow#37166.

jorisvandenbossche added a commit to apache/arrow that referenced this issue Oct 11, 2023
…ns, implemented using ExtensionType (#37166)

### Rationale for this change

For use cases where underlying datatype and number of dimensions in tensors are equal but not the actual shape we want to add a `VariableShapeTensorType`.
See #24868 and huggingface/datasets#5272

### What changes are included in this PR?

This introduces the definition of the `arrow.variable_shape_tensor` extension, its C++ implementation, and a Python wrapper.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

This introduces a new extension type to the user.
* Closes: #24868

Lead-authored-by: Rok Mihevc <rok@mihevc.org>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
raulcd pushed a commit to apache/arrow that referenced this issue Oct 11, 2023
JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue Oct 23, 2023
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024