Describe the bug
First, thanks a lot for all your hard work on onnxruntime, I love it! To the issue at hand:
The output of the quantized version of the onnx model does not look anything like the output of the float onnx model, and downstream accuracy suffers greatly. I'm thinking that something is off with this particular model and quantization.
The model is a multilingual MiniLM model fine-tuned for sequence classification with a classification head. The model card is on the Hugging Face Hub.
When exported to onnx with float32 the model works as expected, with scores and downstream accuracy very similar to torch. After quantization, however, the output scores no longer match the float version's and the downstream task accuracy drops significantly (close to random behaviour). A sketch of the quantization step is shown below.
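For context, this is a minimal sketch of how the model was quantized, assuming dynamic (weight-only) quantization via `onnxruntime.quantization.quantize_dynamic`; the file paths are placeholders, not the actual paths from the notebook:

```python
# Minimal sketch of the quantization step (dynamic quantization assumed).
# File names below are hypothetical placeholders.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="minilm_classifier.onnx",        # float32 onnx export
    model_output="minilm_classifier.quant.onnx", # quantized output model
    weight_type=QuantType.QInt8,                 # weights quantized to signed int8
)
```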
Urgency
None
System information
- macOS
- ONNX Runtime version: 1.9.0
- Python version: 3.8.5
To Reproduce
The model can be downloaded from the internet; instructions are in the following notebook, which also demonstrates the difference in output scores. See notebook
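To show what the comparison in the notebook boils down to, here is a hedged reproduction sketch: run the same tokenized input through the float32 and quantized sessions and compare the classifier logits. The tokenizer name and file paths are assumptions for illustration, not taken from the notebook:

```python
# Sketch: compare float32 vs. quantized model outputs on one input.
# Model/tokenizer names and paths are assumed placeholders.
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Assumed base tokenizer; substitute the actual model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("microsoft/Multilingual-MiniLM-L12-H384")
enc = tokenizer("Ein kurzer Beispielsatz.", return_tensors="np")

sess_fp32 = ort.InferenceSession("minilm_classifier.onnx")
sess_int8 = ort.InferenceSession("minilm_classifier.quant.onnx")

# Feed only the inputs the graph declares, cast to int64 as the export expects.
feed = {i.name: enc[i.name].astype(np.int64) for i in sess_fp32.get_inputs()}

logits_fp32 = sess_fp32.run(None, feed)[0]
logits_int8 = sess_int8.run(None, feed)[0]

# For a healthy quantization these should agree closely; here they diverge.
print("max abs diff:", np.abs(logits_fp32 - logits_int8).max())
```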
Expected behavior
I would expect the quantized output scores to be roughly the same as the float version's. I've successfully quantized subword-tokenized MiniLM models with classification heads with little to no accuracy drop compared to the float32 version, but I'm having trouble with this model, so I'm reaching out for help 💯