
Error loading quantized Whisper models with onnxruntime 1.25 #28306

@istupakov

Description

Describe the issue

I get an error when loading the uint8-quantized version of onnx-community/whisper-tiny with onnxruntime 1.25. With onnxruntime 1.24 and earlier versions everything works fine. I don't see any mention of a breaking change that would explain this.

To reproduce

Code:

```python
from huggingface_hub import hf_hub_download
import onnxruntime as rt

# Download the uint8-quantized merged decoder of whisper-tiny
path = hf_hub_download("onnx-community/whisper-tiny", "onnx/decoder_model_merged_uint8.onnx")
# Fails on onnxruntime 1.25 with the error below; works on 1.24 and earlier
model = rt.InferenceSession(path)
```

Error:

```
[ONNXRuntimeError] : 1 : FAIL : qdq_actions.cc:136 TransposeDQWeightsForMatMulNBits Missing required scale: model.decoder.embed_tokens.weight_merged_0_scale for node: model.decoder.embed_tokens.weight_transposed_DequantizeLinear
```
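Until the regression is fixed, one defensive option (a sketch, not part of onnx-asr; the version cutoff is an assumption based on the report that 1.24 works and the 1.25 line fails) is to detect an affected onnxruntime version before attempting to load a uint8-quantized model:

```python
def is_affected_ort(version: str) -> bool:
    """Return True if this onnxruntime version is hit by this issue.

    Per the report, 1.24 and older load the uint8-quantized Whisper
    models fine, while the 1.25 line fails with the missing-scale error.
    """
    # Compare only the (major, minor) components of a release version
    # string such as "1.25.1"; later releases are assumed affected
    # until a fix ships.
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (1, 25)
```

A caller could check `is_affected_ort(rt.__version__)` and warn or fall back to a non-quantized model variant instead of raising the opaque session-creation error. Alternatively, pinning `onnxruntime<1.25` in requirements avoids the problem entirely.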

Urgency

This broke the use of popular Whisper models with my onnx-asr library.

Platform

Linux

OS Version

Debian 13

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.25.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response


    Labels

    quantization (issues related to quantization)
