StringSplit operator #5371
Conversation
Signed-off-by: Aditya Goel <agoel4512@gmail.com>
@@ -212,6 +212,7 @@
 from onnx.reference.ops.op_squeeze import Squeeze_1, Squeeze_11, Squeeze_13
 from onnx.reference.ops.op_stft import STFT
 from onnx.reference.ops.op_string_normalizer import StringNormalizer
+from onnx.reference.ops.op_string_split import StringSplit
Check notice (Code scanning / CodeQL): Unused import.
I am a bit unclear about how the output is encoded. Unfortunately, ONNX does not support the type Tensor of Sequence (of Tensors). It supports only Tensors of primitive types (unstructured types). It seems to me that one possible encoding is that of RaggedTensors, using two tensors: a padded-split-output that adds a new dimension to the tensor (of size = max-split-size), which will need to be padded with empty strings at the end for strings that have fewer than max-split-size elements, and a second tensor that specifies the number of splits for each input. Would that make sense?
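For illustration, a rough numpy sketch of the padded encoding described above (the function name and layout here are my own, not code from the PR):

```python
import numpy as np

def padded_string_split(x, delimiter):
    """Illustrative padded 'RaggedTensor-style' encoding for a 1D string input."""
    # Split each element; `splits` is a Python list of lists of substrings.
    splits = [s.split(delimiter) for s in x]
    num_splits = np.array([len(p) for p in splits], dtype=np.int64)
    max_splits = int(num_splits.max()) if len(splits) else 0
    # Pad with empty strings up to max_splits, adding a new trailing axis.
    padded = np.array(
        [p + [""] * (max_splits - len(p)) for p in splits], dtype=object
    )
    return padded, num_splits

padded, counts = padded_string_split(np.array(["a,b,c", "d"], dtype=object), ",")
# padded -> [["a", "b", "c"], ["d", "", ""]], counts -> [3, 1]
```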
The encoding I was proposing essentially has as many nested sequences as the input tensor has dimensions. So for input with shape
That makes sense to me and was something I also considered. The only potential issue with this approach is that it could require runtimes to allocate fairly large output tensors if even one input element happens to have significantly more "splits" than the others. I am not sure whether that is a necessary concern in this space, however (potentially some fusing could help for common cases like StringSplit then Gather/Slice).
Another compact representation would be to have a single tensor(string) representing all the split words, and another int64 tensor to indicate which words belong to which input. E.g., this is what onnxruntime-extensions seems to do. Other variations are possible, but maybe compatibility with onnxruntime-extensions would help.
My understanding is that this is a variant of the COO format for sparse matrices. We can assume that the input is a 1D tensor of strings, and the output is a 1D tensor of strings, and a 2D tensor of int64 encoding the coordinates of each output word.
My understanding is that this approach returns three tensors, but perhaps @xadupre will be able to correct me if anything I say here is inaccurate. As input it takes a 1D string tensor of size
We could adapt this approach to support inputs of > 1 dimension as well - the indices tensor can just return an

A few examples:
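As an illustration of this compact "flat words + indices" encoding (the examples originally shown here were lost in the page capture; the sketch below is my own, not the PR's code, and is meant to show how it could generalise to inputs of any rank):

```python
import numpy as np

def coo_string_split(x, delimiter):
    """Illustrative COO-style encoding: a flat 1D tensor of all split words,
    plus an int64 indices tensor whose rows give each word's input coordinates."""
    words, indices = [], []
    for coord in np.ndindex(x.shape):  # iterates coordinates for any input rank
        for w in str(x[coord]).split(delimiter):
            words.append(w)
            indices.append(coord)
    return (
        np.array(words, dtype=object),
        np.array(indices, dtype=np.int64).reshape(len(words), x.ndim),
    )

words, idx = coo_string_split(np.array([["a,b", "c"]], dtype=object), ",")
# words -> ["a", "b", "c"]; each row of idx is the (row, col) of a word's source.
```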
This would be a compact representation but we can still have > 1D inputs. What do you think @gramalingam?
Actually, thinking about it a bit more, I'm not sure how you could do a fairly common operation like "StringSplit then take the last substring" using this representation. You can do this quite easily with the approach mentioned before, which returns a padded-split-output and a max-split-size tensor (you can use a Gather-type operator on the outputs from StringSplit). I'd prefer to go with this approach so the operator is sufficiently usable, over the more compact representation, @gramalingam.
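As a concrete sketch of the "take the last substring" case mentioned above, assuming the padded-output and split-count tensors from the earlier proposal (numpy indexing stands in here for a runtime Gather; the tensor values are illustrative):

```python
import numpy as np

# Assumed StringSplit outputs for inputs ["a,b,c", "d"]:
# a padded split tensor plus the number of real splits per row.
padded = np.array([["a", "b", "c"], ["d", "", ""]], dtype=object)
counts = np.array([3, 1], dtype=np.int64)

# The last real substring of row i sits at column counts[i] - 1, which a
# Gather/GatherElements-style op can pick out directly from the padded output.
last = padded[np.arange(len(counts)), counts - 1]
# last -> ["c", "d"]
```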
StringSplit in onnxruntime-extensions was added to convert text models from tensorflow. You can see how they are used in tensorflow-onnx here: https://github.com/onnx/tensorflow-onnx/blob/main/tf2onnx/custom_opsets/string_ops.py. We chose this output format because it was easier to convert tensorflow models with it. Tensorflow uses a RaggedTensor, a specific container, and we decided to use the existing containers to represent it. However, with that approach the shape is not easy to propagate, and it is lost with the current shape inference algorithm. A unique tensor of strings is easier, but it would have N rows and C columns, where C is the greatest number of tokens in a string. Having a specific container lets the runtime implement it in its own way and possibly improve it in the long term with less impact from the user's point of view. Should we add a new kind of tensor?
Could you please elaborate on what you mean by "new kind"? If this is an entirely new container type (on par with Tensor, Sequence) within ONNX, my opinion would be no. It isn't clear to me that the need for this is that strong and it would place undue burden on backends as well as require new ops to go between it and the existing Tensor for insufficient gain.
I would be happy with the (N, C) tensor. I can also see StringSplit + Gather and StringSplit + Slice fusion being possible on the runtime side, which would lower memory use, and I would be happy to contribute.
An (N, C) tensor is ok when the batch is small. When it is bigger, it could use lots of memory if it is not sparse. A new container could be a RaggedTensor, a sparse tensor of strings, or others. My question is more an open question: do we need to have shape_inference working for StringSplit...
My thoughts on the above discussion: adding a new type/container is a non-trivial effort and will require more discussion (with the broader community, as well). In fact, we already have a sparse-tensor as a type (which is more general-purpose than RaggedTensor), but no operators support it. One option for this may be to go the same way as quantization is handled via the QDQ approach (introduce ops to transform between a sparse and dense representation, use ops defined only for dense tensors, and rely on backends to recognize sparse computations and replace them with a specialized implementation when available). I personally feel that the (N, max-split) padded tensor is a reasonable choice (not perfect, but ok). In the DNN world, at least, the key is always easy parallelization, and the dense representation makes that easier.
Description
This PR introduces the `StringSplit` operator, as originally discussed in #5341. `StringSplit` takes a string tensor as input and splits each element based on a `delimiter` attribute and a `maxsplit` attribute. The operator returns a `sequence` of `tensors` such that the output sequence has the same "shape" as the input tensor. Further details can be found in #5341.

The `delimiter: string` attribute denotes the substring on which each input string is split. The `maxsplit: int` attribute allows the user to control how many splits are performed at most, from left to right. If left unset, it does not apply. Examples are as follows:
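For illustration (the original examples were lost from this capture), a minimal Python sketch of the behaviour described above; this is a hedged stand-in using plain Python/numpy, not the PR's reference implementation:

```python
import numpy as np

def string_split(x, delimiter, maxsplit=None):
    """Illustrative semantics: each input string yields one 1D string tensor,
    and maxsplit caps the number of left-to-right splits."""
    n = maxsplit if maxsplit is not None else -1  # -1 means unlimited splits
    # One output tensor per input element; the sequence mirrors the input shape.
    return [np.array(s.split(delimiter, n), dtype=object) for s in x]

out = string_split(["a,b,c", "a,b,c"], ",", maxsplit=1)
# out -> [array(["a", "b,c"]), array(["a", "b,c"])]
```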
This directly matches the behaviour of tf.strings.split and numpy.char.split, which return RaggedTensors and arrays of list objects respectively to handle the variability in the number of splits, but otherwise behave identically.
Motivation and Context
Closes #5341