How can I use ONNX operators to do string replacements with regular expressions?
I have a preprocessing step in my ML pipeline that I would like to persist as a graph of ONNX operators. This step takes a string as input and makes a number of replacements using regular expressions. However, I can't find a way to do this with the operators in the ONNX standard; every option I've come up with has failed.
Idea 1: Use a Regular Expression Operator
Problem: No such regular expression operator exists.
Idea 2: Evaluate the string sequentially, character by character and hard code some logic which will, in effect, produce the same output as a regex operator would.
Problem: This would require comparisons between characters within the string and predefined constants. But OnnxEqual does not support comparisons between strings.
Idea 3: Translate the input string, character by character, to its ASCII decimal equivalent and then perform idea 2.
Problem: OnnxCast cannot cast non-strictly numeric strings to a type compatible with OnnxEqual, and there is no translation operator in the vein of GNU tr available.
Idea 4: Use the OnnxUnique operator and its inverse_indices output to make a lookup table of integers representing each character in the string, then perform idea 2.
Problem: This requires prepending a key string \t\n\r !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_``abcdefghijklmnopqrstuvwxyz{|}~ to the beginning of the input string (so that the numerical values found by OnnxUnique's inverse_indices output have a consistent definition) and splitting the input string into a sequence of tensors of one character each. Unfortunately, OnnxSplit errors when trying to split a string tensor (see code example below), and OnnxSequenceInsert does not concatenate strings into a single-element tensor; it only collects a sequence of single-element tensors into a single tensor with multiple elements.
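To make the intent of Idea 4 concrete, here is a minimal sketch in plain Python (not ONNX) of the lookup that a sorted Unique with inverse indices would produce. It assumes the input is already split into single characters, which is exactly the step that fails in ONNX. `KEY` and `char_ids` are hypothetical names, and `KEY` is an assumed stand-in for the key string above (tab, newline, carriage return, then printable ASCII).

```python
from typing import List

# Assumed key: tab, newline, carriage return, then printable ASCII 32..126.
KEY = "\t\n\r" + "".join(chr(c) for c in range(32, 127))


def char_ids(text: str) -> List[int]:
    """Emulate a sorted Unique + inverse indices over KEY + text."""
    chars = list(KEY + text)
    # Sorted unique values, as Unique is assumed to return when sorting is on.
    uniques = sorted(set(chars))
    index = {ch: i for i, ch in enumerate(uniques)}
    # Inverse indices: position of each input character in `uniques`.
    inverse = [index[ch] for ch in chars]
    # Drop the ids that belong to the prepended key.
    return inverse[len(KEY):]
```

Because the key already contains every character that is expected to appear, the set of unique values never changes, so a given character maps to the same integer for any input. A character outside the key would shift the ids, which is the limitation of this approach.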
How is one to properly manipulate strings with the available ONNX operators?
Further information
Relevant Area: operators
Is this issue related to a specific model?
Model name: N/A
Model opset: 17
Notes
Using the sub-project sklearn-onnx, OnnxSplit errors when trying to split a string with one element:
Hi @NolantheNerd
That's not the first time people have asked for better string support in ONNX (cf. #1474 and #3016),
but not much has been done.
Anyway, you would be better off converting your string input to UTF-32 and then creating a tensor of 32-bit unsigned integers (u32), so that every character has the same size and fits easily into a tensor. Still, there is no regex op in ONNX at the moment, I guess.
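A rough sketch of that conversion, done outside the graph in plain Python with only the standard library. The function names are hypothetical, and `replace_char` only illustrates that once characters are fixed-width integers, a single-character substitution reduces to an integer equality test followed by a select. It is not a regex replacement.

```python
import struct
from typing import List


def to_code_points(text: str) -> List[int]:
    """Encode text as UTF-32 (little-endian, no BOM) and unpack to u32 values."""
    raw = text.encode("utf-32-le")
    return list(struct.unpack("<%dI" % (len(raw) // 4), raw))


def from_code_points(codes: List[int]) -> str:
    """Pack u32 values back into UTF-32 bytes and decode to a string."""
    return struct.pack("<%dI" % len(codes), *codes).decode("utf-32-le")


def replace_char(codes: List[int], old: str, new: str) -> List[int]:
    """Swap one character for another using integer comparison only."""
    old_code, new_code = ord(old), ord(new)
    return [new_code if c == old_code else c for c in codes]
```

For example, `from_code_points(replace_char(to_code_points("a\tb"), "\t", " "))` turns the tab into a space. Anything beyond fixed, single-character rules (variable-length matches, groups, and so on) would still have to be hand-built on top of this.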
Thanks for getting back to me @WilliamTambellini. It would be quite the set of hoops to jump through to convert the string input to u32s, perform a regex operation, then convert back to strings for an eventual TFIDF operation. Thank you for your suggestion, but I may just need to approach this problem with an alternative to ONNX.