
StringSplit operator #5371

Merged (4 commits, Aug 2, 2023)

Conversation

adityagoel4512
Contributor

@adityagoel4512 adityagoel4512 commented Jun 27, 2023

Description

This PR introduces the StringSplit operator, as originally discussed in #5341.

StringSplit takes a string tensor as input and splits each element based on a delimiter attribute and a maxsplit attribute. The operator returns a sequence of tensors such that the output sequence has the same "shape" as the input tensor. Further details can be found in #5341.

The delimiter: string attribute denotes the substring on which each input string is split. The maxsplit: int attribute lets the user bound the number of splits performed, applied from left to right. If left unset, no limit applies.

Examples are as follows:

StringSplit(["hello world", "a b c"], delimiter=" ", maxsplit=None)
==> output: seq[tensor[string]] = [["hello", "world"], ["a", "b", "c"]]
StringSplit(["hello world", "a b c"], delimiter=" ", maxsplit=1)
==> output: seq[tensor[string]] = [["hello", "world"], ["a", "b c"]]
StringSplit(["eggs,milk,potatoes"], delimiter=",", maxsplit=None)
==> output: seq[tensor[string]] = [["eggs", "milk", "potatoes"]]
StringSplit([["hello world", "def.net"], ["o n n x", "the quick brown fox"]], delimiter=" ", maxsplit=2)
==> output: seq[seq[tensor[string]]] = [
    [["hello", "world"], ["def.net"]],
    [["o", "n", "n x"], ["the", "quick", "brown fox"]]
]

This directly mirrors tf.strings.split and numpy.char.split, which return RaggedTensors and arrays of list objects respectively to handle the variable number of splits, but otherwise exhibit identical behaviour.
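For illustration, the numpy analogue behaves the same way. This is a minimal sketch: np.char.split returns an object array of Python lists, which plays the role of the proposed sequence of string tensors.

```python
import numpy as np

# np.char.split returns an object array of lists, one list of
# substrings per input element, mirroring seq[tensor[string]].
inputs = np.array(["hello world", "a b c"])
out = np.char.split(inputs, sep=" ")
# out[0] == ["hello", "world"], out[1] == ["a", "b", "c"]

# maxsplit bounds the number of splits, left to right.
limited = np.char.split(inputs, sep=" ", maxsplit=1)
# limited[1] == ["a", "b c"]
```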

Motivation and Context

Closes #5341

Signed-off-by: Aditya Goel <agoel4512@gmail.com>
@adityagoel4512 adityagoel4512 requested review from a team as code owners June 27, 2023 22:05
@@ -212,6 +212,7 @@
from onnx.reference.ops.op_squeeze import Squeeze_1, Squeeze_11, Squeeze_13
from onnx.reference.ops.op_stft import STFT
from onnx.reference.ops.op_string_normalizer import StringNormalizer
from onnx.reference.ops.op_string_split import StringSplit

Check notice (Code scanning / CodeQL): Unused import
Import of 'StringSplit' is not used.
onnx/defs/nn/defs.cc (review thread, outdated, resolved)
@gramalingam
Contributor

I am a bit unclear about how the output is encoded. Unfortunately, ONNX does not support the type tensor-of-sequence(-of-tensors); it supports only tensors of primitive (unstructured) types. It seems to me that one possible encoding is that of RaggedTensors, using two tensors: a padded split output that adds a new dimension of size max-split-size to the tensor, padding with empty strings for inputs that have fewer than max-split-size elements, and a second tensor that specifies the number of splits for each input.

Would that make sense?
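A minimal numpy sketch of the padded encoding described above (the helper name padded_string_split is hypothetical):

```python
import numpy as np

def padded_string_split(x, delimiter):
    # x: 1-D string array. Returns an (N, max-split-size) string tensor
    # padded with "" plus the number of splits for each input element.
    splits = [s.split(delimiter) for s in x]
    num_splits = np.array([len(p) for p in splits], dtype=np.int64)
    width = int(num_splits.max())
    padded = np.array([p + [""] * (width - len(p)) for p in splits])
    return padded, num_splits
```

For example, padded_string_split(np.array(["a,b,c", "d"]), ",") yields a (2, 3) tensor [["a", "b", "c"], ["d", "", ""]] together with the split counts [3, 1].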

@gramalingam gramalingam added the operator Issues related to ONNX operators label Jul 5, 2023
@adityagoel4512
Contributor Author

adityagoel4512 commented Jul 6, 2023

I am a bit unclear about how the output is encoded. Unfortunately, ONNX does not support the type Tensor of Sequence (of Tensors). It supports only Tensors of primitive types (unstructured types).

The encoding I was proposing essentially has as many nested sequences as the input tensor has dimensions. So for an input with shape (2, 3, 1) we get an output of type sequence(sequence(sequence(tensor[string]))). That way we never need a tensor of sequences; it is a way around the fact that we cannot vary the number of elements in the final dimension of the input tensor.

It seems to me that one possible encoding is that of RaggedTensors using two tensors: a padded-split-output that adds a new dimension to the tensor (of size = max-split-size), which will need to pad with empty strings in the end for strings that do not have max-split-size elements, and a second tensor that specifies the number of splits for each input.

Would that make sense?

That makes sense to me and was something I also considered. The only potential issue with this approach is that it could require runtimes to allocate fairly large output tensors even if only one input element happens to have significantly more "splits" than the others. I am not sure whether that is a serious concern in this space, however.

(potentially some fusing could be of help for common cases like StringSplit then Gather/Slice)

@gramalingam
Contributor

The only potential issue with this approach would be that it could require runtimes to allocate fairly large output tensors if even only one input element happens to have significantly more "splits" than the others. Not sure whether this is a necessary concern to have in this space however

Another compact representation would be to have a single tensor(string) holding all the split words, and another int64 tensor indicating which words belong to which input. E.g., this is what onnxruntime-extensions seems to do. Other variations are possible, but compatibility with onnxruntime-extensions might help.

@gramalingam
Contributor

My understanding is that this is a variant of the COO format for sparse matrices. We can assume that the input is a 1D tensor of strings, and the output is a 1D tensor of strings, and a 2D tensor of int64 encoding the coordinates of each output word.

@adityagoel4512
Contributor Author

Another compact representation would be to have a single tensor(string) representing all the split words, and another int64 tensor to indicate which words belong to which input. Eg., this is what onnxruntime extensions seems to do.

My understanding is that this approach returns three tensors, but perhaps @xadupre will be able to correct me if anything I say here is inaccurate. As input it takes a 1D string tensor of shape (N,) and outputs the following:

  1. a 1D string tensor of shape (M,) containing all the substrings produced by the splits
  2. an int64 indices tensor of shape (M, 2). The first column indicates which input string each substring comes from (between 0 and N-1). The second column is an index into the string output tensor indicating which substring it represents (between 0 and M-1).
  3. an int tensor of shape (2,) containing the length of the input tensor and the maximum number of splits for any input. I think we can safely drop this.

We could adapt this approach to support inputs of more than one dimension as well: the indices tensor would instead have shape (M, input tensor rank), where each row gives the index in the input tensor that the substring at the same row comes from.

A few examples:

StringSplit(["a,b,c", "d,e,f"], ",") 
=> 
substrings: ["a", "b", "c", "d", "e", "f"]
indices: [[0], [0], [0], [1], [1], [1]] 
explanation:

- "a", "b", and "c" come from the string at index [0] in the input tensor.
- "d", "e" and "f" come from the string at index [1]

StringSplit([["a,b", "d"], ["d,e,f", "g,h"]], ",") 
=> 
substrings: ["a", "b", "d", "d", "e", "f", "g", "h"]
indices: [[0, 0], [0, 0], [0, 1], [1, 0], [1, 0], [1, 0], [1, 1], [1, 1]]
explanation:

- "a", "b" come from index [0, 0] in the input
- the first "d" comes from [0, 1]
- the second "d" and "e", "f" comes from [1, 0]
- "g", "h" come from [1, 1] in the input

This would remain a compact representation while still supporting inputs of more than one dimension. What do you think @gramalingam?
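A minimal numpy sketch of this rank-generalised indices encoding (the helper name coo_string_split is hypothetical):

```python
import numpy as np

def coo_string_split(x, delimiter):
    # x: string tensor of any rank. Returns a flat (M,) tensor of all
    # substrings and an (M, rank) int64 tensor of source coordinates.
    substrings, indices = [], []
    for idx in np.ndindex(x.shape):
        for part in x[idx].split(delimiter):
            substrings.append(part)
            indices.append(idx)
    return np.array(substrings), np.array(indices, dtype=np.int64)
```

Running it on the first example above, coo_string_split(np.array(["a,b,c", "d,e,f"]), ",") returns the substrings ["a", "b", "c", "d", "e", "f"] with indices [[0], [0], [0], [1], [1], [1]].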

@adityagoel4512
Contributor Author

adityagoel4512 commented Jul 9, 2023

Actually, thinking about it a bit more, I'm not sure how you could do a fairly common operation like "StringSplit then take the last substring" using this representation.

You can do this quite easily with the previously mentioned approach of returning a padded-split-output and a max-split-size tensor (using a Gather-type operator on the StringSplit outputs).
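For instance, assuming the padded-output encoding discussed earlier in the thread, "take the last substring" reduces to a gather along the last axis (numpy sketch; the variable names are hypothetical):

```python
import numpy as np

# Hypothetical StringSplit outputs for ["hello world", "a b c"]:
padded = np.array([["hello", "world", ""], ["a", "b", "c"]])
num_splits = np.array([2, 3], dtype=np.int64)

# Gather the last real substring of each input, ignoring the padding.
last = np.take_along_axis(padded, (num_splits - 1)[:, None], axis=-1)[:, 0]
# last == ["world", "c"]
```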

I'd prefer to go with that approach over the more compact representation, @gramalingam, so that the operator is sufficiently usable.

@xadupre
Contributor

xadupre commented Jul 10, 2023

StringSplit in onnxruntime-extensions was added to convert text models from TensorFlow. You can see how it is used in tensorflow-onnx here: https://github.com/onnx/tensorflow-onnx/blob/main/tf2onnx/custom_opsets/string_ops.py. We chose this output format because it made converting TensorFlow models easier. TensorFlow uses a RaggedTensor, a specific container, and we decided to use the existing containers to represent it. However, with that approach the shape is not easy to propagate and is lost by the current shape inference algorithm. A single tensor of strings is easier, but it would have N rows and C columns, where C is the greatest number of tokens in a string. Having a specific container lets the runtime implement it in its own way and possibly improve it in the long term with less impact from the user's point of view. Should we add a new kind of tensor?

@adityagoel4512
Contributor Author

adityagoel4512 commented Jul 10, 2023

Should we add a new kind of tensor?

Could you please elaborate on what you mean by "new kind"?

If this is an entirely new container type (on a par with Tensor and Sequence) within ONNX, my opinion would be no. It isn't clear to me that the need is strong enough, and it would place an undue burden on backends as well as require new ops to convert between it and the existing Tensor type, for insufficient gain.

it would have N rows and C columns where C is the greater number of tokens

I would be happy with the (N, C) tensor.

I can also see StringSplit + Gather and StringSplit + Slice fusion being possible on the runtime side, which would lower memory use, and I would be happy to contribute that.

@xadupre
Contributor

xadupre commented Jul 10, 2023

An (N, C) tensor is fine when the batch is small. When it is bigger, it could use a lot of memory if the data is not sparse. A new container could be a RaggedTensor, a sparse tensor of strings, or something else. My question is more of an open one: do we need to have shape_inference working for StringSplit?

@gramalingam
Contributor

My thoughts on the above discussion: adding a new type/container is a non-trivial effort and will require more discussion (with the broader community as well). In fact, we already have a sparse tensor as a type (which is more general-purpose than RaggedTensor), but no operators support it. One option may be to go the same way quantization is handled via the QDQ approach: introduce ops that transform between sparse and dense representations, define ops only for dense tensors, and rely on backends to recognize sparse computations and replace them with a specialized implementation when available.

I personally feel that the (N, max-split) padded tensor is a reasonable choice (not perfect, but ok). In the DNN world, at least, the key is always easy parallelization, and the dense representation makes that easier.

onnx/defs/text/defs.cc (4 review threads, outdated, resolved)
Signed-off-by: Aditya Goel <agoel4512@gmail.com>
Signed-off-by: Aditya Goel <agoel4512@gmail.com>
@gramalingam gramalingam added this pull request to the merge queue Aug 2, 2023
Merged via the queue into onnx:main with commit e724cc3 Aug 2, 2023
33 checks passed
@adityagoel4512 adityagoel4512 deleted the string_split_operator branch August 6, 2023 10:32
Labels
operator Issues related to ONNX operators
Development

Successfully merging this pull request may close these issues.

New Operator: StringSplit
3 participants