A model quantization example using ONNX.
For more details, please read my blog post in Chinese or in English.
Make sure git-lfs is installed on your computer, then fetch the submodules and install the Python environment:

```shell
git lfs install
git submodule update --init
pip install poetry
poetry install --no-root
```
If you are using a GPU, swap in the GPU runtime after `poetry install --no-root`:

```shell
poetry remove onnxruntime
poetry add onnxruntime-gpu
```

Run the example:

```shell
poetry run python main.py
```
This example experiments with one of the DistilBERT models fine-tuned on the IMDB dataset from HuggingFace, available here.
The results below were measured on a MacBook Air M1 CPU and on Windows 10 WSL with an Intel i5-8400 CPU (results may vary on other platforms):
| Model | Size | Inference Time per Instance | Accuracy |
|---|---|---|---|
| PyTorch Model (Mac) | 256MB | 71.1ms | 93.8% |
| ONNX Model (Mac) | 256MB | 113.5ms | 93.8% |
| ONNX 8-bit Model (Mac) | 64MB | 87.7ms | 93.75% |
| PyTorch Model (Win) | 256MB | 78.6ms | 93.8% |
| ONNX Model (Win) | 256MB | 85.1ms | 93.8% |
| ONNX 8-bit Model (Win) | 64MB | 61.1ms | 93.85% |
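The roughly 4x size reduction in the table (256MB to 64MB) comes from storing each weight as a 1-byte int8 instead of a 4-byte float32. A minimal, self-contained sketch of the affine (scale/zero-point) mapping that 8-bit quantization applies per tensor — function and variable names here are illustrative, not taken from this repo or from ONNX Runtime:

```python
def quantize(weights, num_bits=8):
    """Map float weights onto the signed integer range, e.g. [-128, 127] for 8 bits."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / (qmax - qmin)          # float step represented by one integer step
    zero_point = round(qmin - lo / scale)      # integer that represents float 0.0
    q = [max(qmin, min(qmax, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float weights from the quantized integers."""
    return [(v - zero_point) * scale for v in q]

weights = [-0.42, -0.1, 0.0, 0.27, 0.8]
q, scale, zp = quantize(weights)
recovered = dequantize(q, scale, zp)
# The rounding error per weight is bounded by the scale, which is why
# accuracy in the table barely moves (93.8% vs. 93.75%/93.85%).
max_err = max(abs(w - r) for w, r in zip(weights, recovered))
```

In practice ONNX Runtime applies this per tensor (or per channel) with its own calibration logic; this sketch only shows the arithmetic that makes the accuracy loss small while shrinking storage 4x.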