The exported ONNX model of Qwen/Qwen1.5-0.5B-Chat does not produce a cache-enabled model. #1747

Closed
anilmartha opened this issue Mar 7, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@anilmartha

System Info

transformers-4.38.2
optimum-1.17.1

Who can help?

Hi @michaelbenayoun,

I have exported the Qwen/Qwen1.5-0.5B-Chat model with the text-generation-with-past task. When loading the exported ONNX model with the ORTModelForCausalLM class, the following error is raised:
File "/proj/mldata/users/anilm/repos/qwen/run.py", line 11, in
model = ORTModelForCausalLM.from_pretrained("Qwen1.5-0.5B-Chat")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/proj/mldata/users/anilm/workspace/AIE/miniconda/envs/py311/lib/python3.11/site-packages/optimum/onnxruntime/modeling_ort.py", line 662, in from_pretrained
return super().from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/proj/mldata/users/anilm/workspace/AIE/miniconda/envs/py311/lib/python3.11/site-packages/optimum/modeling_base.py", line 399, in from_pretrained
return from_pretrained_method(
^^^^^^^^^^^^^^^^^^^^^^^
File "/proj/mldata/users/anilm/workspace/AIE/miniconda/envs/py311/lib/python3.11/site-packages/optimum/onnxruntime/modeling_decoder.py", line 559, in _from_pretrained
return init_cls(
^^^^^^^^^
File "/proj/mldata/users/anilm/workspace/AIE/miniconda/envs/py311/lib/python3.11/site-packages/optimum/onnxruntime/modeling_decoder.py", line 169, in init
raise ValueError(
ValueError: use_cache was set to True but the loaded model only supports use_cache=False. Please load your current model with use_cache=False or export the original model once again with use_cache=True when calling the from_pretrained method. To export your model, simply set export=True
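
For reference, the two remedies suggested by the error message look roughly like this (a sketch; the local directory name matches the output of the export script below):

from optimum.onnxruntime import ORTModelForCausalLM

# Remedy 1: load the existing (cache-less) export without past key values
model = ORTModelForCausalLM.from_pretrained("Qwen1.5-0.5B-Chat", use_cache=False)

# Remedy 2: re-export from the original checkpoint with the cache enabled
model = ORTModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-0.5B-Chat", export=True, use_cache=True
)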

I have added the custom export script below.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

from optimum.exporters.onnx import main_export

from transformers import AutoConfig

from optimum.exporters.onnx.config import TextDecoderOnnxConfig, TextDecoderWithPositionIdsOnnxConfig
from optimum.exporters.onnx.base import ConfigBehavior
from optimum.utils import NormalizedTextConfig, DummyPastKeyValuesGenerator
from typing import Dict
import os
import shutil


class QwenDummyPastKeyValuesGenerator(DummyPastKeyValuesGenerator):

    def generate(self, input_name: str, framework: str = "pt"):
        past_key_shape = (
            self.batch_size,
            self.num_attention_heads,
            self.hidden_size // self.num_attention_heads,
            self.sequence_length,
        )
        past_value_shape = (
            self.batch_size,
            self.num_attention_heads,
            self.sequence_length,
            self.hidden_size // self.num_attention_heads,
        )
        return [
            (
                self.random_float_tensor(past_key_shape, framework=framework),
                self.random_float_tensor(past_value_shape, framework=framework),
            )
            for _ in range(self.num_layers)
        ]

class CustomQwenOnnxConfig(TextDecoderOnnxConfig):
    DUMMY_INPUT_GENERATOR_CLASSES = (QwenDummyPastKeyValuesGenerator,) + TextDecoderOnnxConfig.DUMMY_INPUT_GENERATOR_CLASSES
    DUMMY_PKV_GENERATOR_CLASS = QwenDummyPastKeyValuesGenerator

    DEFAULT_ONNX_OPSET = 15  # aten::tril operator requires opset>=14
    NORMALIZED_CONFIG_CLASS = NormalizedTextConfig


    def add_past_key_values(self, inputs_or_outputs: Dict[str, Dict[int, str]], direction: str):
        if direction not in ["inputs", "outputs"]:
            raise ValueError(f'direction must either be "inputs" or "outputs", but {direction} was given')

        if direction == "inputs":
            decoder_sequence_name = "past_sequence_length"
            name = "past_key_values"
        else:
            decoder_sequence_name = "past_sequence_length + 1"
            name = "present"

        for i in range(self._normalized_config.num_layers):
            inputs_or_outputs[f"{name}.{i}.key"] = {0: "batch_size", 3: decoder_sequence_name}
            inputs_or_outputs[f"{name}.{i}.value"] = {0: "batch_size", 2: decoder_sequence_name}


model_id = "Qwen/Qwen1.5-0.5B-Chat"
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)


onnx_config = CustomQwenOnnxConfig(
    config=config,
    task="text-generation",
    use_past=True,
    use_past_in_inputs=False,
)
onnx_config_with_past = CustomQwenOnnxConfig(config, task="text-generation", use_past=True)

custom_onnx_configs = {
    "model": onnx_config_with_past,
}


main_export(
    model_id,
    output="Qwen1.5-0.5B-Chat",
    task="text-generation-with-past",
    trust_remote_code=True,
    custom_onnx_configs=custom_onnx_configs,
    no_post_process=True,
    opset=15
)
Running
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM
from optimum.utils import NormalizedTextConfig, NormalizedConfigManager
NormalizedConfigManager._conf['qwen2'] = NormalizedTextConfig

import torch
model_id = "Qwen/Qwen1.5-0.5B-Chat"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = ORTModelForCausalLM.from_pretrained("Qwen1.5-0.5B-Chat")

Expected behavior

I am exporting the model with the text-generation-with-past task, so the exported ONNX model should load with use_cache=True and work seamlessly.
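
Concretely, the exported model is expected to load and generate along these lines (a sketch; the prompt and generation arguments are only illustrative):

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-0.5B-Chat")
# Load the locally exported ONNX model with the KV cache enabled
model = ORTModelForCausalLM.from_pretrained("Qwen1.5-0.5B-Chat", use_cache=True)

inputs = tokenizer("Hello, who are you?", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))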

@anilmartha anilmartha added the bug Something isn't working label Mar 7, 2024
@fxmarty
Contributor

fxmarty commented Mar 20, 2024

Hi @anilmartha, thank you for the issue. #1746 should be merged today, which should make the export as straightforward as:

optimum-cli export onnx --model Qwen/Qwen1.5-0.5B-Chat qwen_onnx

Now, regarding your code: Qwen/Qwen1.5-0.5B-Chat does not seem to use custom modeling code anymore (maybe it did in the past), as the qwen2 model type is now supported natively in Transformers.

So you do not need to use trust_remote_code=True. Also, the past-key-values generator is not correct: Qwen2 is similar to Llama, and the following:

from optimum.exporters.onnx import main_export
from transformers import AutoConfig
from optimum.exporters.onnx.model_configs import LlamaOnnxConfig

class CustomQwenOnnxConfig(LlamaOnnxConfig):
    pass

model_id = "fxmarty/tiny-dummy-qwen2"
config = AutoConfig.from_pretrained(model_id)

onnx_config_with_past = CustomQwenOnnxConfig(config, task="text-generation", use_past=True)

custom_onnx_configs = {
    "model": onnx_config_with_past,
}

main_export(
    model_id,
    output="Qwen1.5-0.5B-Chat",
    task="text-generation-with-past",
    custom_onnx_configs=custom_onnx_configs,
)

just works.

Note that #1746 is needed to have the exported model work with ORTModelForCausalLM.
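
Once that PR is in, loading the folder produced by the CLI command above should be as simple as (a sketch; "qwen_onnx" is the output directory from the optimum-cli command above):

from optimum.onnxruntime import ORTModelForCausalLM

model = ORTModelForCausalLM.from_pretrained("qwen_onnx", use_cache=True)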

@fxmarty fxmarty closed this as completed Mar 20, 2024
@MrRace

MrRace commented Apr 10, 2024

@fxmarty
I used the following command to export the ONNX format model:

optimum-cli export onnx --model /share_model_zoo/LLM/Qwen/Qwen1.5-0.5B-Chat --task text-generation-with-past /share_model_zoo/LLM/Qwen/optimum_onnx/Qwen1.5-0.5B-Chat

The resulting files are as follows:

-rw-r--r-- 1 root root  704 Apr 10 11:50 config.json
-rw-r--r-- 1 root root  205 Apr 10 11:50 generation_config.json
-rw-r--r-- 1 root root 1.4K Apr 10 11:50 tokenizer_config.json
-rw-r--r-- 1 root root  367 Apr 10 11:50 special_tokens_map.json
-rw-r--r-- 1 root root   80 Apr 10 11:50 added_tokens.json
-rw-r--r-- 1 root root 2.7M Apr 10 11:50 vocab.json
-rw-r--r-- 1 root root 1.6M Apr 10 11:50 merges.txt
-rw-r--r-- 1 root root 6.8M Apr 10 11:50 tokenizer.json
-rw-r--r-- 1 root root 8.0M Apr 10 12:08 _model_layers.0_self_attn_rotary_emb_Constant_attr__value
-rw-r--r-- 1 root root 8.0M Apr 10 12:08 _model_layers.0_self_attn_rotary_emb_Constant_5_attr__value
-rw-r--r-- 1 root root 1.8G Apr 10 12:29 model.onnx

What are the files _model_layers.0_self_attn_rotary_emb_Constant_attr__value and _model_layers.0_self_attn_rotary_emb_Constant_5_attr__value? Why are these two files generated, and how do I use them during inference? Thank you very much.

@fxmarty
Contributor

fxmarty commented Apr 10, 2024

Hi @MrRace, these files are an artifact of a step in the ONNX export where all external data are consolidated into a single model.onnx_data file. These leftover files are not needed and should be deleted, but currently they are not. You can simply use model.onnx.

Basically, for models larger than 2 GB, torch.onnx.export saves the weight data as many independent files (onnx__MatMul_5747, onnx__MatMul_5748, etc.). These are then fused into a single file (as are the Constant_attr__value files), but the originals are left behind.

This will be fixed in #1808.
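
Until then, a manual cleanup could look like this (a sketch; it only removes the leftover *_attr__value artifacts from the listing above and keeps model.onnx):

import os

export_dir = "/share_model_zoo/LLM/Qwen/optimum_onnx/Qwen1.5-0.5B-Chat"
for name in os.listdir(export_dir):
    # Leftover rotary-embedding constants from the export, per the file listing above
    if name.endswith("_attr__value"):
        os.remove(os.path.join(export_dir, name))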
