[QNN Quant] Add preprocessing option to transpose graph inputs/outputs to channel-last #19731
Merged
jywu-msft approved these changes on Mar 2, 2024
zz002 pushed a commit to zz002/onnxruntime that referenced this pull request on Mar 7, 2024
…s to channel-last (microsoft#19731)

### Description
Adds the optional parameters `inputs_to_make_channel_last` and `outputs_to_make_channel_last` to the `qnn_preprocess_model()` function.

```python
"""
inputs_to_make_channel_last: List of graph input names to transpose to be "channel-last". For example,
    if "input0" originally has the shape (N, C, D1, D2, ..., Dn), the resulting model will change input0's
    shape to (N, D1, D2, ..., Dn, C) and add a transpose node after it.

    Original:
        input0 (N, C, D1, D2, ..., Dn) --> <Nodes>

    Updated:
        input0 (N, D1, D2, ..., Dn, C) --> Transpose --> input0_chanfirst (N, C, D1, D2, ..., Dn) --> <Nodes>

    This can potentially improve inference latency for QDQ models running on QNN EP because the
    additional transpose node may allow other transpose nodes inserted during ORT layout transformation
    to cancel out.
outputs_to_make_channel_last: List of graph output names to transpose to be "channel-last". For example,
    if "output0" originally has the shape (N, C, D1, D2, ..., Dn), the resulting model will change output0's
    shape to (N, D1, D2, ..., Dn, C) and add a transpose node before it.

    Original:
        <Nodes> --> output0 (N, C, D1, D2, ..., Dn)

    Updated:
        <Nodes> --> output0_chanfirst (N, C, D1, D2, ..., Dn) --> Transpose --> output0 (N, D1, D2, ..., Dn, C)

    This can potentially improve inference latency for QDQ models running on QNN EP because the
    additional transpose node may allow other transpose nodes inserted during ORT layout transformation
    to cancel out.
"""
```

**NOTE: If you use these options with the quantization scripts, you'll have to make sure your data_reader feeds in transposed input data. It won't happen automatically.**

### Motivation and Context
Native QNN operators use the channel-last data layout, but ONNX uses channel-first. To bridge the gap, ORT's layout transformer inserts transposes around layout-sensitive nodes and updates their domain to indicate that they now operate on channel-last data. The transpose optimizer is able to remove most of these inserted transposes, but not all transposes can always be removed (i.e., some could remain at the graph's inputs and outputs).

We've found that these extra transpose nodes can significantly degrade inference latency on QNN EP. One workaround (provided by this PR) is to add _additional_ transpose nodes at the graph inputs or outputs. These additional nodes can often help the ORT transpose optimizer cancel out any remaining transpose nodes, which significantly improves latency.

Additionally, it may make more sense for some kinds of inputs to just be in channel-last form (e.g., images), avoiding the need to pre-transpose the input data before inference.

Example at the input:
```
Original:
    input0 (N, C, D1, D2, ..., Dn) --> <Nodes>

Updated:
    input0 (N, D1, D2, ..., Dn, C) --> Transpose --> input0_chanfirst (N, C, D1, D2, ..., Dn) --> <Nodes>
```

Example at the output:
```
Original:
    <Nodes> --> output0 (N, C, D1, D2, ..., Dn)

Updated:
    <Nodes> --> output0_chanfirst (N, C, D1, D2, ..., Dn) --> Transpose --> output0 (N, D1, D2, ..., Dn, C)
```
Description
Adds the optional parameters `inputs_to_make_channel_last` and `outputs_to_make_channel_last` to the `qnn_preprocess_model()` function.

**NOTE: If you use these options with the quantization scripts, you'll have to make sure your data_reader feeds in transposed input data. It won't happen automatically.**
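One way to satisfy the note about the data reader is to wrap an existing reader so that the selected inputs are transposed before they are fed to calibration. This is only a sketch under assumptions: `ChannelLastReader` and `to_channel_last_perm` are hypothetical names that are not part of onnxruntime, the wrapped reader is assumed to expose `get_next()` returning a dict of input name to array (or `None` when exhausted), and the arrays are assumed to offer a NumPy-style `.shape` and `.transpose(axes)`.

```python
def to_channel_last_perm(rank):
    """Axes order that turns (N, C, D1, ..., Dn) into (N, D1, ..., Dn, C)."""
    if rank < 2:
        raise ValueError("expected at least the (N, C) dimensions")
    return (0, *range(2, rank), 1)


class ChannelLastReader:
    """Hypothetical wrapper that transposes selected inputs from another reader."""

    def __init__(self, inner_reader, channel_last_inputs):
        self.inner_reader = inner_reader
        self.channel_last_inputs = set(channel_last_inputs)

    def get_next(self):
        feeds = self.inner_reader.get_next()
        if feeds is None:  # the wrapped reader is exhausted
            return None
        transposed = {}
        for name, array in feeds.items():
            if name in self.channel_last_inputs:
                # Only inputs that were made channel-last in the model are transposed.
                array = array.transpose(to_channel_last_perm(len(array.shape)))
            transposed[name] = array
        return transposed
```

The list passed as `channel_last_inputs` would normally be the same list given as `inputs_to_make_channel_last`, so that the data layout and the preprocessed model stay in agreement.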
Motivation and Context
Native QNN operators use the channel-last data layout, but ONNX uses channel-first. To bridge the gap, ORT's layout transformer inserts transposes around layout-sensitive nodes and updates their domain to indicate that they now operate on channel-last data. The transpose optimizer is able to remove most of these inserted transposes, but not all transposes can always be removed (i.e., some could remain at the graph's inputs and outputs).
We've found that these extra transpose nodes can significantly degrade inference latency on QNN EP. One workaround (provided by this PR) is to add additional transpose nodes at the graph inputs or outputs. These additional nodes can often help the ORT transpose optimizer cancel out any remaining transpose nodes, which significantly improves latency.
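To illustrate why an extra transpose can help, note that two back-to-back Transpose nodes behave like a single permutation, and when that permutation is the identity both nodes can be dropped. The following is a plain-Python sketch of that idea only, not the ORT transpose optimizer; the helper names are hypothetical.

```python
def compose_perms(first, second):
    """Permutation equivalent to applying Transpose(first), then Transpose(second)."""
    # Output axis i of the second transpose reads intermediate axis second[i],
    # which in turn reads original axis first[second[i]].
    return [first[axis] for axis in second]


def cancels_out(first, second):
    """True when the two transposes together leave the layout unchanged."""
    return compose_perms(first, second) == list(range(len(first)))


to_channel_first = [0, 3, 1, 2]  # (N, H, W, C) -> (N, C, H, W)
to_channel_last = [0, 2, 3, 1]   # (N, C, H, W) -> (N, H, W, C)

print(cancels_out(to_channel_first, to_channel_last))  # this pair is a no-op
print(cancels_out(to_channel_last, to_channel_last))   # this pair is not
```

In this picture, the transpose added at a graph input plays the role of `to_channel_first`, and a transpose left behind by layout transformation plays the role of `to_channel_last`.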
Additionally, it may make more sense for some kinds of inputs to just be in channel-last form (e.g., images), avoiding the need to pre-transpose the input data before inference.
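Example at the input: the original graph `input0 (N, C, D1, D2, ..., Dn) --> <Nodes>` is updated to `input0 (N, D1, D2, ..., Dn, C) --> Transpose --> input0_chanfirst (N, C, D1, D2, ..., Dn) --> <Nodes>`.

Example at the output: the original graph `<Nodes> --> output0 (N, C, D1, D2, ..., Dn)` is updated to `<Nodes> --> output0_chanfirst (N, C, D1, D2, ..., Dn) --> Transpose --> output0 (N, D1, D2, ..., Dn, C)`.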
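The shape changes at a channel-last graph input and output can be worked through with a few helpers. This is a minimal sketch with hypothetical function names; it assumes the ONNX Transpose convention that output dimension `i` is input dimension `perm[i]`, and it does not reproduce the PR's actual implementation.

```python
def channel_last_shape(shape):
    """(N, C, D1, ..., Dn) -> (N, D1, ..., Dn, C)."""
    return [shape[0], *shape[2:], shape[1]]


def perm_to_channel_first(rank):
    """Perm for the Transpose placed after a channel-last graph input."""
    return [0, rank - 1, *range(1, rank - 1)]


def perm_to_channel_last(rank):
    """Perm for the Transpose placed before a channel-last graph output."""
    return [0, *range(2, rank), 1]


def apply_perm(shape, perm):
    """Shape produced by a Transpose: output dim i is input dim perm[i]."""
    return [shape[axis] for axis in perm]


original = ["N", "C", "D1", "D2"]
new_input = channel_last_shape(original)

# Shape of the updated graph input.
print(new_input)
# Shape of input0_chanfirst, recovered by the added Transpose.
print(apply_perm(new_input, perm_to_channel_first(len(new_input))))
# Shape of the updated graph output, produced from output0_chanfirst.
print(apply_perm(original, perm_to_channel_last(len(original))))
```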