Dynamic quantization of multilingual miniLM - output does not match float32 version. Onnxruntime 1.9.0  #9599

@jobergum

Description

Describe the bug
First thanks a lot for all your hard work on onnxruntime, I love it! To the issue at hand:

The output of the quantized version of the ONNX model does not look anything like that of the float32 ONNX model, and downstream accuracy suffers greatly. I suspect something about quantization is off for this particular model.

The model is a multilingual MiniLM model fine-tuned for sequence classification with a classification head. The model card is on the Hugging Face Hub.

When exported to ONNX in float32, the model works as expected, with scores and downstream accuracy very close to torch. After quantization, however, the output scores no longer match the float32 version, and downstream task accuracy drops significantly (to near-random behaviour).

Urgency
None

System information

  • OS: macOS
  • ONNX Runtime version: 1.9.0
  • Python version: 3.8.5

To Reproduce
The model can be downloaded from the internet; instructions are in the linked notebook, which also demonstrates the output score difference.

Expected behavior
I would expect the quantized output scores to be roughly the same as the float32 version's. I've successfully quantized subword-tokenized MiniLM models with classification heads with little to no accuracy drop compared to float32, but I'm having trouble with this model, so I'm reaching out for help 💯

Labels

quantization (issues related to quantization); stale (issues that have not been addressed in a while; categorized by a bot)
