ONNX Operators for String Replacements #4450

Closed
NolantheNerd opened this issue Aug 17, 2022 · 2 comments
Labels
question Questions about ONNX

@NolantheNerd

Question

How can I use ONNX operators to do string replacements with regular expressions?

I have a preprocessing step in my ML pipeline that I would like to persist as a graph of ONNX operators. This step takes a string as input and makes a number of replacements using regular expressions. However, I can't find a way to do this with the operators in the ONNX standard; every option I've come up with has failed.

Idea 1: Use a Regular Expression Operator

Problem: No such regular expression operator exists.

Idea 2: Evaluate the string sequentially, character by character and hard code some logic which will, in effect, produce the same output as a regex operator would.

Problem: This would require comparisons between characters within the string and predefined constants. But OnnxEqual does not support comparisons between strings.

Idea 3: Translate the input string, character by character, to its ASCII decimal equivalent and then perform idea 2.

Problem: OnnxCast cannot cast non-strictly numeric strings to a type compatible with OnnxEqual and there is no translation operator in the vein of GNU tr available.

Idea 4: Use the OnnxUnique operator and its inverse_indices output to make a lookup table of integers representing each character in the string, then perform idea 2.

Problem: This requires prepending a key string \t\n\r !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_``abcdefghijklmnopqrstuvwxyz{|}~ to the beginning of the input string (so that the numerical values in OnnxUnique's inverse_indices output have a consistent definition) and splitting the input string into a sequence of tensors of one character each. Unfortunately, OnnxSplit errors when trying to split a string tensor (see code example below), and OnnxSequenceInsert does not join strings into a single-element tensor; it only combines a sequence of single-element tensors into a single tensor with multiple elements.
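
To make the intent of Idea 4 concrete, here is a minimal sketch of the lookup in plain NumPy, outside of any ONNX graph. It uses np.unique with return_inverse=True as a stand-in for OnnxUnique's inverse_indices output. KEY and char_ids are hypothetical names chosen for illustration, and the key is rebuilt from character codes rather than copied from the string above. It assumes every input character already appears in the key.

```python
import numpy as np

# Hypothetical key: tab, newline, carriage return, then printable ASCII.
# It is sorted and free of duplicates, so the sorted unique values of
# (KEY + text) are exactly KEY as long as text only uses key characters.
KEY = "\t\n\r" + "".join(chr(c) for c in range(32, 127))


def char_ids(text):
    """Map each character of text to its position in KEY."""
    chars = np.array(list(KEY + text))  # one character per element
    _, inverse = np.unique(chars, return_inverse=True)
    # Drop the ids that belong to the prepended key itself.
    return inverse[len(KEY):]


ids = char_ids("3m ")
```

If the input contains a character that is missing from the key, the sorted unique values no longer line up with KEY and the ids shift, which is why the key has to cover the whole expected alphabet.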

How is one to properly manipulate strings with the available ONNX operators?

Further information

  • Relevant Area: operators

  • Is this issue related to a specific model?
    Model name: N/A
    Model opset: 17

Notes

Using the sub-project sklearn-onnx, OnnxSplit errors when trying to split a string with one element:

import re
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from skl2onnx import to_onnx, update_registered_converter
from skl2onnx.common.data_types import StringTensorType
from skl2onnx.algebra.onnx_ops import OnnxSplit, OnnxConstant
from onnxruntime import InferenceSession

class MyTransformer(BaseEstimator, TransformerMixin):
    def fit_transform(self, X, y=None):
        return re.sub("(?<=[0-9])m ", " meters ", X)

def shape_function(operator):
    input = StringTensorType([1])
    output = StringTensorType([None, 1])
    operator.inputs[0].type = input
    operator.outputs[0].type = output

def converter_function(scope, operator, container):
    op = operator.raw_operator
    opv = container.target_opset
    out = operator.outputs

    X = operator.inputs[0]

    one_tensor = OnnxConstant(value_int=1, op_version=opv)
    string_tensor = OnnxConstant(value_strings=["ab"], op_version=opv)
    string_split_tensor = OnnxSplit(string_tensor, one_tensor, op_version=opv, output_names=out[:1])

    string_split_tensor.add_to(scope, container)

update_registered_converter(MyTransformer, "MyTransformer", shape_function, converter_function)
my_transformer = MyTransformer()
onnx_model = to_onnx(my_transformer, initial_types=[["X", StringTensorType([None, 1])]])

test_string = "The Empire State Building is 443m tall."
sess = InferenceSession(onnx_model.SerializeToString())
output = sess.run(None, {"X": np.array([test_string])})

Gives:

2022-08-16 12:35:46.235861185 [W:onnxruntime:, graph.cc:106 MergeShapeInfo] Error merging shape info for output. 'variable' source:{1} target:{,1}. Falling back to lenient merge.
2022-08-16 12:35:46.237767860 [E:onnxruntime:, inference_session.cc:1530 operator()] Exception during initialization: /onnxruntime_src/onnxruntime/core/optimizer/optimizer_execution_frame.cc:75 onnxruntime::OptimizerExecutionFrame::Info::Info(const std::vector<const onnxruntime::Node*>&, const InitializedTensorSet&, const onnxruntime::Path&, const onnxruntime::IExecutionProvider&, const std::function<bool(const std::__cxx11::basic_string<char>&)>&) [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : string tensor can not use pre-allocated buffer
@NolantheNerd added the question (Questions about ONNX) label Aug 17, 2022
@NolantheNerd changed the title from "Regex for String Replacements" to "ONNX Operators for String Replacements" Aug 17, 2022
@WilliamTambellini

Hi @NolantheNerd
That's not the first time people have asked for better string support in ONNX; cf.:
#1474
#3016
but not much has been done.
Anyway, you would be better off converting your string input to UTF-32 and then creating a tensor of u32 values, so that every character has the same size and fits easily into a tensor. Still, there is no regex op in ONNX at the moment, I guess.
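
A minimal sketch of that conversion, assuming NumPy and assuming it runs as a pre/post-processing step outside the ONNX graph. The helper names to_code_points and from_code_points are hypothetical, not part of any ONNX or onnxruntime API.

```python
import numpy as np


def to_code_points(text):
    """Encode a string as a 1-D tensor of 32-bit code points (UTF-32)."""
    # "utf-32-le" fixes the byte order and omits the byte-order mark,
    # so each character becomes exactly one little-endian uint32.
    return np.frombuffer(text.encode("utf-32-le"), dtype="<u4")


def from_code_points(codes):
    """Decode a tensor of code points back into a string."""
    return np.asarray(codes).astype("<u4").tobytes().decode("utf-32-le")


codes = to_code_points("443m tall.")
```

The resulting integer tensor can be compared against constants with ordinary numeric operators, which is the part that string tensors do not allow in opset 17 according to the question above.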

@NolantheNerd

Thanks for getting back to me @WilliamTambellini. It would be quite the set of hoops to jump through to convert the string input to u32s, perform a regex operation, then convert back to strings for an eventual TFIDF operation. Thank you for your suggestion, but I may just need to approach this problem with an alternative to ONNX.
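
For a sense of what those hoops look like, here is a rough sketch in NumPy (not ONNX operators) that hard-codes the single replacement from the example, `(?<=[0-9])m ` to " meters ", on UTF-32 code points and converts back to a string. The function name replace_m_with_meters is hypothetical, and the final loop is plain Python that would still need an ONNX equivalent.

```python
import re
import numpy as np


def replace_m_with_meters(text):
    """Emulate re.sub("(?<=[0-9])m ", " meters ", text) on code points."""
    cp = np.frombuffer(text.encode("utf-32-le"), dtype="<u4")
    is_digit = (cp >= ord("0")) & (cp <= ord("9"))

    # match[i] is True when cp[i] is 'm', cp[i + 1] is a space,
    # and cp[i - 1] is an ASCII digit (the lookbehind).
    match = np.zeros(cp.shape, dtype=bool)
    match[1:-1] = (cp[1:-1] == ord("m")) & (cp[2:] == ord(" ")) & is_digit[:-2]

    replacement = [ord(c) for c in " meters "]
    out = []
    i = 0
    while i < len(cp):
        if match[i]:
            out.extend(replacement)
            i += 2  # skip the 'm' and the space that follows it
        else:
            out.append(int(cp[i]))
            i += 1
    return np.array(out, dtype="<u4").tobytes().decode("utf-32-le")


test_string = "The Empire State Building is 443m tall."
result = replace_m_with_meters(test_string)
```

Every additional regular expression would need its own hand-written matching logic like this, which illustrates why a general regex replacement is hard to express with numeric operators alone.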
