
StringSplit operator #5371

Merged (4 commits, Aug 2, 2023)

Conversation

adityagoel4512
Contributor

@adityagoel4512 adityagoel4512 commented Jun 27, 2023

Description

This PR introduces the StringSplit operator, as originally discussed in #5341.

StringSplit takes a string tensor as input and splits each element based on a delimiter attribute and a maxsplit attribute. The operator returns a sequence of tensors such that the output sequence has the same "shape" as the input tensor. Further details can be found in #5341.

The delimiter: string attribute denotes the substring on which each input string is split. The maxsplit: int attribute lets the user bound the number of splits performed, applied from left to right. If left unset, no limit applies.

Examples are as follows:

StringSplit(["hello world", "a b c"], delimiter=" ", maxsplit=None)
==> output: seq[tensor[string]] = [["hello", "world"], ["a", "b", "c"]]
StringSplit(["hello world", "a b c"], delimiter=" ", maxsplit=1)
==> output: seq[tensor[string]] = [["hello", "world"], ["a", "b c"]]
StringSplit(["eggs,milk,potatoes"], delimiter=",", maxsplit=None)
==> output: seq[tensor[string]] = [["eggs", "milk", "potatoes"]]
StringSplit([["hello world", "def.net"], ["o n n x", "the quick brown fox"]], delimiter=" ", maxsplit=2)
==> output: seq[seq[tensor[string]]] = [
    [["hello", "world"], ["def.net"]],
    [["o", "n", "n x"], ["the", "quick", "brown fox"]]
]

This directly mirrors tf.strings.split and numpy.char.split, which return RaggedTensors and arrays of list objects respectively to handle the variable number of splits, but otherwise exhibit identical behaviour.
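For illustration, the numpy analogue behaves the same way. This is a minimal sketch: np.char.split returns an object array of Python lists, which plays the role of the proposed sequence of string tensors.

```python
import numpy as np

# np.char.split returns an object array of lists, one list of
# substrings per input element, mirroring seq[tensor[string]].
inputs = np.array(["hello world", "a b c"])
out = np.char.split(inputs, sep=" ")
# out[0] == ["hello", "world"], out[1] == ["a", "b", "c"]

# maxsplit bounds the number of splits, left to right.
limited = np.char.split(inputs, sep=" ", maxsplit=1)
# limited[1] == ["a", "b c"]
```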

Motivation and Context

Closes #5341

Signed-off-by: Aditya Goel <agoel4512@gmail.com>
@adityagoel4512 adityagoel4512 requested review from a team as code owners June 27, 2023 22:05
@@ -212,6 +212,7 @@
from onnx.reference.ops.op_squeeze import Squeeze_1, Squeeze_11, Squeeze_13
from onnx.reference.ops.op_stft import STFT
from onnx.reference.ops.op_string_normalizer import StringNormalizer
from onnx.reference.ops.op_string_split import StringSplit

Check notice (Code scanning / CodeQL): Unused import
Import of 'StringSplit' is not used.
onnx/defs/nn/defs.cc (review thread, outdated, resolved)
@gramalingam
Contributor

I am a bit unclear about how the output is encoded. Unfortunately, ONNX does not support the type tensor-of-sequence(-of-tensors); it supports only tensors of primitive (unstructured) types. It seems to me that one possible encoding is that of RaggedTensors, using two tensors: a padded split output that adds a new dimension of size max-split-size to the tensor, padding with empty strings for inputs that have fewer than max-split-size elements, and a second tensor that specifies the number of splits for each input.

Would that make sense?
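A minimal numpy sketch of the padded encoding described above (the helper name padded_string_split is hypothetical):

```python
import numpy as np

def padded_string_split(x, delimiter):
    # x: 1-D string array. Returns an (N, max-split-size) string tensor
    # padded with "" plus the number of splits for each input element.
    splits = [s.split(delimiter) for s in x]
    num_splits = np.array([len(p) for p in splits], dtype=np.int64)
    width = int(num_splits.max())
    padded = np.array([p + [""] * (width - len(p)) for p in splits])
    return padded, num_splits
```

For example, padded_string_split(np.array(["a,b,c", "d"]), ",") yields a (2, 3) tensor [["a", "b", "c"], ["d", "", ""]] together with the split counts [3, 1].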

@gramalingam gramalingam added the operator Issues related to ONNX operators label Jul 5, 2023
@adityagoel4512
Contributor Author

adityagoel4512 commented Jul 6, 2023

I am a bit unclear about how the output is encoded. Unfortunately, ONNX does not support the type Tensor of Sequence (of Tensors). It supports only Tensors of primitive types (unstructured types).

The encoding I was proposing essentially has as many nested sequences as the input tensor has dimensions. So for an input with shape (2, 3, 1) we get an output of type sequence(sequence(sequence(tensor[string]))). That way we never need a tensor of sequences; it is a way around the fact that we cannot vary the number of elements in the final dimension of the input tensor.

It seems to me that one possible encoding is that of RaggedTensors using two tensors: a padded-split-output that adds a new dimension to the tensor (of size = max-split-size), which will need to pad with empty strings in the end for strings that do not have max-split-size elements, and a second tensor that specifies the number of splits for each input.

Would that make sense?

That makes sense to me and was something I also considered. The only potential issue with this approach is that it could require runtimes to allocate fairly large output tensors even if only one input element happens to have significantly more "splits" than the others. I am not sure whether that is a serious concern in this space, however.

(potentially some fusing could be of help for common cases like StringSplit then Gather/Slice)

@gramalingam
Contributor

The only potential issue with this approach would be that it could require runtimes to allocate fairly large output tensors if even only one input element happens to have significantly more "splits" than the others. Not sure whether this is a necessary concern to have in this space however

Another compact representation would be to have a single tensor(string) holding all the split words, and another int64 tensor indicating which words belong to which input. E.g., this is what onnxruntime-extensions seems to do. Other variations are possible, but compatibility with onnxruntime-extensions might help.

@gramalingam
Contributor

My understanding is that this is a variant of the COO format for sparse matrices. We can assume that the input is a 1D tensor of strings, and the output is a 1D tensor of strings, and a 2D tensor of int64 encoding the coordinates of each output word.

@adityagoel4512
Contributor Author

Another compact representation would be to have a single tensor(string) representing all the split words, and another int64 tensor to indicate which words belong to which input. Eg., this is what onnxruntime extensions seems to do.

My understanding is that this approach returns three tensors, but perhaps @xadupre will be able to correct me if anything I say here is inaccurate. As input it takes a 1D string tensor of shape (N,) and outputs the following:

  1. a 1D string tensor of shape (M,) containing all the substrings produced by the splits
  2. an int64 indices tensor of shape (M, 2). The first column indicates which input string each substring comes from (between 0 and N-1). The second column is an index into the string output tensor indicating which substring it represents (between 0 and M-1).
  3. an int tensor of shape (2,) containing the length of the input tensor and the maximum number of splits for any input. I think we can safely drop this.

We could adapt this approach to support inputs of more than one dimension as well: the indices tensor would instead have shape (M, input tensor rank), where each row gives the index in the input tensor that the substring at the same row comes from.

A few examples:

StringSplit(["a,b,c", "d,e,f"], ",") 
=> 
substrings: ["a", "b", "c", "d", "e", "f"]
indices: [[0], [0], [0], [1], [1], [1]] 
explanation:

- "a", "b", and "c" come from the string at index [0] in the input tensor.
- "d", "e" and "f" come from the string at index [1]

StringSplit([["a,b", "d"], ["d,e,f", "g,h"]], ",") 
=> 
substrings: ["a", "b", "d", "d", "e", "f", "g", "h"]
indices: [[0, 0], [0, 0], [0, 1], [1, 0], [1, 0], [1, 0], [1, 1], [1, 1]]
explanation:

- "a", "b" come from index [0, 0] in the input
- the first "d" comes from [0, 1]
- the second "d" and "e", "f" comes from [1, 0]
- "g", "h" come from [1, 1] in the input

This would remain a compact representation while still supporting inputs of more than one dimension. What do you think @gramalingam?
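A minimal numpy sketch of this rank-generalised indices encoding (the helper name coo_string_split is hypothetical):

```python
import numpy as np

def coo_string_split(x, delimiter):
    # x: string tensor of any rank. Returns a flat (M,) tensor of all
    # substrings and an (M, rank) int64 tensor of source coordinates.
    substrings, indices = [], []
    for idx in np.ndindex(x.shape):
        for part in x[idx].split(delimiter):
            substrings.append(part)
            indices.append(idx)
    return np.array(substrings), np.array(indices, dtype=np.int64)
```

Running it on the first example above, coo_string_split(np.array(["a,b,c", "d,e,f"]), ",") returns the substrings ["a", "b", "c", "d", "e", "f"] with indices [[0], [0], [0], [1], [1], [1]].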

@adityagoel4512
Contributor Author

adityagoel4512 commented Jul 9, 2023

Actually, thinking about it a bit more, I'm not sure how you could do a fairly common operation like "StringSplit then take the last substring" using this representation.

You can do this quite easily with the previously mentioned approach of returning a padded-split-output and a max-split-size tensor (using a Gather-type operator on the StringSplit outputs).
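For instance, assuming the padded-output encoding discussed earlier in the thread, "take the last substring" reduces to a gather along the last axis (numpy sketch; the variable names are hypothetical):

```python
import numpy as np

# Hypothetical StringSplit outputs for ["hello world", "a b c"]:
padded = np.array([["hello", "world", ""], ["a", "b", "c"]])
num_splits = np.array([2, 3], dtype=np.int64)

# Gather the last real substring of each input, ignoring the padding.
last = np.take_along_axis(padded, (num_splits - 1)[:, None], axis=-1)[:, 0]
# last == ["world", "c"]
```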

I'd prefer to go with that approach over the more compact representation, @gramalingam, so that the operator is sufficiently usable.

@xadupre
Contributor

xadupre commented Jul 10, 2023

StringSplit in onnxruntime-extensions was added to convert text models from TensorFlow. You can see how it is used in tensorflow-onnx here: https://github.com/onnx/tensorflow-onnx/blob/main/tf2onnx/custom_opsets/string_ops.py. We chose this output format because it made converting TensorFlow models easier. TensorFlow uses a RaggedTensor, a specific container, and we decided to use the existing containers to represent it. However, with that approach the shape is not easy to propagate and is lost by the current shape inference algorithm. A single tensor of strings is easier, but it would have N rows and C columns, where C is the greatest number of tokens in a string. Having a specific container lets the runtime implement it in its own way and possibly improve it in the long term with less impact from the user's point of view. Should we add a new kind of tensor?

@adityagoel4512
Contributor Author

adityagoel4512 commented Jul 10, 2023

Should we add a new kind of tensor?

Could you please elaborate on what you mean by "new kind"?

If this is an entirely new container type (on a par with Tensor and Sequence) within ONNX, my opinion would be no. It isn't clear to me that the need is strong enough, and it would place an undue burden on backends as well as require new ops to convert between it and the existing Tensor type, for insufficient gain.

it would have N rows and C columns where C is the greater number of tokens

I would be happy with the (N, C) tensor.

I can also see StringSplit + Gather and StringSplit + Slice fusion being possible on the runtime side, which would lower memory use, and I would be happy to contribute that.

@xadupre
Contributor

xadupre commented Jul 10, 2023

An (N, C) tensor is fine when the batch is small. When it is bigger, it could use a lot of memory if the data is not sparse. A new container could be a RaggedTensor, a sparse tensor of strings, or something else. My question is more of an open one: do we need to have shape_inference working for StringSplit?

@gramalingam
Contributor

My thoughts on the above discussion: adding a new type/container is a non-trivial effort and will require more discussion (with the broader community as well). In fact, we already have a sparse tensor as a type (which is more general-purpose than RaggedTensor), but no operators support it. One option may be to go the same way quantization is handled via the QDQ approach: introduce ops that transform between sparse and dense representations, define ops only for dense tensors, and rely on backends to recognize sparse computations and replace them with a specialized implementation when available.

I personally feel that the (N, max-split) padded tensor is a reasonable choice (not perfect, but ok). In the DNN world, at least, the key is always easy parallelization, and the dense representation makes that easier.

onnx/defs/text/defs.cc (4 review threads, outdated, resolved)
Signed-off-by: Aditya Goel <agoel4512@gmail.com>
Signed-off-by: Aditya Goel <agoel4512@gmail.com>
@gramalingam gramalingam added this pull request to the merge queue Aug 2, 2023
Merged via the queue into onnx:main with commit e724cc3 Aug 2, 2023
33 checks passed
@adityagoel4512 adityagoel4512 deleted the string_split_operator branch August 6, 2023 10:32
Labels
operator Issues related to ONNX operators
Development

Successfully merging this pull request may close these issues.

New Operator: StringSplit
3 participants