[QNN Quant] Add preprocessing option to transpose graph inputs/outputs to channel-last by adrianlizarraga · Pull Request #19731 · microsoft/onnxruntime

adrianlizarraga · 2024-03-01T00:29:33Z

Description

Adds the optional parameters inputs_to_make_channel_last and outputs_to_make_channel_last to the qnn_preprocess_model() function.
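A minimal usage sketch, assuming the function is imported from `onnxruntime.quantization.execution_providers.qnn` and takes the input and output model paths as its first two arguments. The wrapper name `preprocess_with_channel_last_io` and the tensor names "input0" and "output0" are placeholders for illustration, not part of this PR.

```python
from pathlib import Path


def preprocess_with_channel_last_io(model_in: Path, model_out: Path):
    """Hypothetical wrapper: make 'input0' and 'output0' channel-last."""
    # Imported lazily so this sketch can be loaded without onnxruntime installed.
    # The import path is an assumption about the installed package layout.
    from onnxruntime.quantization.execution_providers.qnn import qnn_preprocess_model

    # "input0" and "output0" are placeholder names; use the actual graph
    # input/output names of your model.
    return qnn_preprocess_model(
        model_in,
        model_out,
        inputs_to_make_channel_last=["input0"],
        outputs_to_make_channel_last=["output0"],
    )
```

Both parameters are optional, so either list can be omitted if only inputs or only outputs should be changed.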

"""
 inputs_to_make_channel_last: List of graph input names to transpose to be "channel-last". For example,
      if "input0" originally has the shape (N, C, D1, D2, ..., Dn), the resulting model will change input0's
      shape to (N, D1, D2, ..., Dn, C) and add a transpose node after it.
      Original:
          input0 (N, C, D1, D2, ..., Dn) --> <Nodes>
      Updated:
          input0 (N, D1, D2, ..., Dn, C) --> Transpose --> input0_chanfirst (N, C, D1, D2, ..., Dn) --> <Nodes>
      This can potentially improve inference latency for QDQ models running on QNN EP because the
      additional transpose node may allow other transpose nodes inserted during ORT layout transformation
      to cancel out.
 outputs_to_make_channel_last: List of graph output names to transpose to be "channel-last". For example,
      if "output0" originally has the shape (N, C, D1, D2, ..., Dn), the resulting model will change output0's
      shape to (N, D1, D2, ..., Dn, C) and add a transpose node before it.
      Original:
          <Nodes> --> output0 (N, C, D1, D2, ..., Dn)
      Updated:
          <Nodes> --> output0_chanfirst (N, C, D1, D2, ..., Dn) --> Transpose --> output0 (N, D1, D2, ..., Dn, C)
      This can potentially improve inference latency for QDQ models running on QNN EP because the
      additional transpose node may allow other transpose nodes inserted during ORT layout transformation
      to cancel out.
"""
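The shapes in the docstring imply a specific permutation for each added Transpose node. The sketch below derives those permutations for an arbitrary rank (3 or higher) under ONNX Transpose semantics, where output dimension i is input dimension perm[i]. The helper names are hypothetical and this is not the PR's implementation.

```python
def channel_last_to_first_perm(rank: int) -> list[int]:
    """Perm for the Transpose added after an input: (N, D1..Dn, C) -> (N, C, D1..Dn)."""
    return [0, rank - 1] + list(range(1, rank - 1))


def channel_first_to_last_perm(rank: int) -> list[int]:
    """Perm for the Transpose added before an output: (N, C, D1..Dn) -> (N, D1..Dn, C)."""
    return [0] + list(range(2, rank)) + [1]


def apply_perm(shape, perm):
    """ONNX Transpose semantics: output dimension i is input dimension perm[i]."""
    return [shape[p] for p in perm]


nchw = ["N", "C", "H", "W"]
nhwc = apply_perm(nchw, channel_first_to_last_perm(4))
print(nhwc)  # ['N', 'H', 'W', 'C']
print(apply_perm(nhwc, channel_last_to_first_perm(4)))  # ['N', 'C', 'H', 'W']
```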

NOTE: If you use these options with the quantization scripts, you'll have to make sure your data_reader feeds in transposed input data. It won't happen automatically.
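One way to handle this is to wrap an existing data reader so it transposes the affected inputs before handing them to the quantizer. The sketch below assumes the wrapped reader's get_next() returns a dict of input name to numpy array (or None when exhausted) and that the data is originally channel-first. The class name `ChannelLastDataReader` is hypothetical, and in real use it would likely subclass the quantization tool's CalibrationDataReader.

```python
class ChannelLastDataReader:
    """Hypothetical wrapper that feeds channel-last data for selected inputs."""

    def __init__(self, inner_reader, channel_last_inputs):
        self.inner_reader = inner_reader
        self.channel_last_inputs = set(channel_last_inputs)

    def get_next(self):
        feed = self.inner_reader.get_next()
        if feed is None:  # the wrapped reader is exhausted
            return None
        transposed = {}
        for name, value in feed.items():
            if name in self.channel_last_inputs:
                rank = len(value.shape)
                # (N, C, D1, ..., Dn) -> (N, D1, ..., Dn, C)
                perm = (0, *range(2, rank), 1)
                # For numpy arrays this returns a view; wrap the result in
                # np.ascontiguousarray if a contiguous buffer is needed.
                value = value.transpose(perm)
            transposed[name] = value
        return transposed
```

Only the names passed in `channel_last_inputs` are transposed, so this list should match the one given to inputs_to_make_channel_last.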

Motivation and Context

Native QNN operators use the channel-last data layout, but ONNX uses channel-first. To bridge the gap, ORT's layout transformer inserts transposes around layout-sensitive nodes and updates their domain to indicate that they now operate on channel-last data. The transpose optimizer can remove most of these inserted transposes, but not always all of them (e.g., some can remain at the graph's inputs and outputs).

We've found that these extra transpose nodes can significantly degrade inference latency on QNN EP. One workaround (provided by this PR) is to add additional transpose nodes at the graph inputs or outputs. These additional nodes can often help the ORT transpose optimizer cancel out any remaining transpose nodes, which significantly improves latency.
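To illustrate why an extra transpose can help, the sketch below composes two back-to-back 4D permutations and checks that they reduce to the identity, which is the condition under which an optimizer can drop both nodes. The `compose` helper is hypothetical and only models the permutation arithmetic, not ORT's transpose optimizer.

```python
def compose(first, second):
    """Perm equivalent to applying Transpose(first), then Transpose(second)."""
    # Output dim j of the second transpose is dim second[j] of the intermediate
    # tensor, which in turn is dim first[second[j]] of the original tensor.
    return [first[p] for p in second]


to_channel_first = [0, 3, 1, 2]  # NHWC -> NCHW (the Transpose added at the graph input)
to_channel_last = [0, 2, 3, 1]   # NCHW -> NHWC (e.g., left over from layout transformation)

combined = compose(to_channel_first, to_channel_last)
print(combined)  # [0, 1, 2, 3]

# An identity permutation means the pair of transposes is a no-op.
is_identity = combined == list(range(len(combined)))
print(is_identity)  # True
```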

Additionally, it may make more sense for some kinds of inputs (e.g., images) to simply be in channel-last form, which avoids the need to pre-transpose the input data before inference.

Example at the input:

Original:
    input0 (N, C, D1, D2, ..., Dn) --> <Nodes>
Updated:
    input0 (N, D1, D2, ..., Dn, C) --> Transpose --> input0_chanfirst (N, C, D1, D2, ..., Dn) --> <Nodes>

Example at the output:

Original:
   <Nodes> --> output0 (N, C, D1, D2, ..., Dn)
Updated:
   <Nodes> --> output0_chanfirst (N, C, D1, D2, ..., Dn) --> Transpose --> output0 (N, D1, D2, ..., Dn, C)


adrianlizarraga added 2 commits February 29, 2024 16:27

[QNN Quant] Add preprocessing option to transpose inputs and/or outputs to channel-last.

88ef984

Add preliminary test

0357110

jywu-msft requested review from HectorSVC and jywu-msft March 1, 2024 02:19

adrianlizarraga added 3 commits February 29, 2024 18:34

Update test

07d1566

Merge branch 'main' into adrianl/quant-qnn-preproc-channel-last-io

f66a4a6

Clean up and improve tests

8f32b92

adrianlizarraga marked this pull request as ready for review March 1, 2024 10:12

jywu-msft approved these changes Mar 2, 2024

jywu-msft merged commit 2d79052 into main Mar 2, 2024

jywu-msft deleted the adrianl/quant-qnn-preproc-channel-last-io branch March 2, 2024 02:39
