Enable inference with a merged decoder in ORTModelForCausalLM
#647
Conversation
The documentation is not available anymore as the PR was closed or merged.
ORTModelForCausalLM
@@ -263,20 +288,57 @@ def prepare_io_binding(

        return io_binding, output_shapes, output_buffers

    def prepare_inputs_for_merged(
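A minimal sketch of what a method like `prepare_inputs_for_merged` could do, under the assumption (the signature, dimensions, and helper names here are illustrative, not Optimum's actual implementation) that a merged decoder always declares `past_key_values` inputs plus a boolean `use_cache_branch` input, so the first generation step must feed zero-length dummies:

```python
import numpy as np

def prepare_inputs_for_merged(past_key_values, num_heads=4, head_dim=8):
    """Hypothetical sketch: in a merged decoder, the first generation step
    has no real past, but ONNX Runtime still requires every declared input.
    So we feed dummy past_key_values with sequence length 0 and a boolean
    use_cache_branch tensor selecting the control-flow branch to run."""
    if past_key_values is None:
        # No cache yet: take the no-cache branch, with zero-length dummies.
        use_cache_branch = np.array([False])
        dummy = np.zeros((1, num_heads, 0, head_dim), dtype=np.float32)
        past_key_values = [(dummy, dummy)]
    else:
        # Real cache available: take the with-cache branch.
        use_cache_branch = np.array([True])
    return use_cache_branch, past_key_values

branch, pkv = prepare_inputs_for_merged(None)
print(branch[0], pkv[0][0].shape)  # False (1, 4, 0, 8)
```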
Can you add a short description of this method?
Co-authored-by: Michael Benayoun <mickbenayoun@gmail.com>
Some questions
@@ -507,6 +517,8 @@ def __init__(
                f"use_past = {use_past} is different than use_present_in_outputs = {use_present_in_outputs}, the value "
                "of use_present_in_outputs value will be used for the outputs."
            )
        self.is_merged = False
        self.use_cache_branch = None
What's the difference between use_cache_branch, use_past, and use_past_in_inputs? I mean that use_cache_branch must be for the merged-decoder case, but why do we need to distinguish them? And does use_cache_branch require use_past=True?
does use_cache_branch require use_past=True?

Yes, in other cases use_cache_branch does not make sense.
About the difference between use_past and use_past_in_inputs, it seems like legacy code that could be simplified. Or am I missing something @michaelbenayoun?

use_cache_branch is a flag indicating that, in the merged decoder case, we use the cache branch of the control flow. This flag is used in several places:
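The branch selection that `use_cache_branch` describes can be sketched as a toy Python function (this mimics the If-node semantics of a merged decoder export; the function and variable names are illustrative, not Optimum's API):

```python
def run_merged_decoder(tokens, past, use_cache_branch):
    """Toy model of the control flow in a merged decoder: a single graph
    contains both subgraphs, and use_cache_branch selects which one runs."""
    if use_cache_branch:
        # With-cache branch: only the newest token is processed,
        # the rest of the context comes from past_key_values.
        new_past = past + [tokens[-1]]
    else:
        # No-cache (prefill) branch: the whole prompt is processed.
        new_past = list(tokens)
    return new_past

past = run_merged_decoder([1, 2, 3], [], use_cache_branch=False)      # prefill
past = run_merged_decoder([1, 2, 3, 4], past, use_cache_branch=True)  # decode
print(past)  # [1, 2, 3, 4]
```

This is also why use_cache_branch only makes sense together with past key values: the with-cache branch is meaningless without them.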
use_past is the legacy here.
Basically you have two "use past":
- use_past_in_inputs: inputs will have past key values
- use_present_in_outputs: outputs will have past key values
If you set only use_past, it sets both.
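The flag semantics described above can be condensed into a small illustrative config class (this is a sketch of the described behavior, not Optimum's actual OnnxConfig; the class name is made up):

```python
class DecoderOnnxConfig:
    """Illustrative sketch: use_past is a coarse legacy flag that, when set
    alone, enables both fine-grained flags at once."""

    def __init__(self, use_past=False, use_past_in_inputs=None,
                 use_present_in_outputs=None):
        # Each fine-grained flag falls back to use_past when not given.
        self.use_past_in_inputs = (
            use_past if use_past_in_inputs is None else use_past_in_inputs
        )
        self.use_present_in_outputs = (
            use_past if use_present_in_outputs is None else use_present_in_outputs
        )

cfg = DecoderOnnxConfig(use_past=True)
print(cfg.use_past_in_inputs, cfg.use_present_in_outputs)  # True True
```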
Co-authored-by: Jingya HUANG <44135271+JingyaHuang@users.noreply.github.com>
Other questions for you @fxmarty
LGTM! Thanks a lot for helping to wrap up the PR.
Co-authored-by: Jingya HUANG <44135271+JingyaHuang@users.noreply.github.com>
Co-authored-by: Michael Benayoun <mickbenayoun@gmail.com>
What does this PR do?
Enable the use of merged decoders in ORT modeling:
- ORTModelDecoder
- ORTDecoder (new input use_cache + dummy inputs for past_key_values)
- … (use_cache)

To discuss:
In the current logic, if use_merged=True, the merging will be automatically inferred and applied if necessary. But maybe we could also add a merging option in the exporter.
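The "automatically inferred and applied if necessary" logic could be sketched like this (the file names and the returned action strings are assumptions for illustration; the real decision logic lives inside Optimum's model-loading code):

```python
from pathlib import Path

def decide_merge_action(model_dir, use_merged=True):
    """Hypothetical sketch: when the user requests a merged decoder but only
    the two separate decoder ONNX files are present, merging would be
    applied on the fly; otherwise the existing merged file is used."""
    model_dir = Path(model_dir)
    merged = model_dir / "decoder_model_merged.onnx"
    if use_merged and not merged.exists():
        decoder = model_dir / "decoder_model.onnx"
        with_past = model_dir / "decoder_with_past_model.onnx"
        if decoder.exists() and with_past.exists():
            return "merge"  # the merging utility would be invoked here
        raise FileNotFoundError("need both decoder variants to merge")
    return "use_existing" if merged.exists() else "no_merge"
```

An exporter-side option, as discussed, would simply move the "merge" action to export time instead of load time.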