Static quantization issue #24
We have recently enabled the PyTorch FX graph mode for quantization-aware training and post-training quantization, so there is no need to add QuantStub and DeQuantStub to the model anymore. If you are interested in using the PyTorch eager mode for post-training quantization, and in how the model was modified by the Intel Neural Compressor team, you can take a look at: https://github.com/intel/neural-compressor/blob/master/examples/pytorch/eager/language_translation/ptq/transformers/modeling_bert.py
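To illustrate the eager-mode requirement mentioned above, here is a minimal sketch of PyTorch eager-mode post-training static quantization. The `TinyModel` class and its layer sizes are illustrative, not taken from the linked BERT example; the point is where `QuantStub`/`DeQuantStub` sit relative to the quantized op:

```python
# Minimal sketch: eager-mode post-training static quantization in PyTorch.
# Without QuantStub/DeQuantStub, the converted model would feed a plain CPU
# float tensor into quantized::linear and raise the NotImplementedError below.
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # QuantStub/DeQuantStub mark where tensors convert between
        # float and quantized representations in eager mode.
        self.quant = torch.quantization.QuantStub()
        self.linear = nn.Linear(4, 2)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)       # float -> quantized
        x = self.linear(x)      # runs as quantized::linear after convert()
        return self.dequant(x)  # quantized -> float

model = TinyModel().eval()
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")
prepared = torch.quantization.prepare(model)
prepared(torch.randn(8, 4))                     # calibration pass (observers record ranges)
quantized = torch.quantization.convert(prepared)
out = quantized(torch.randn(8, 4))              # runs on the QuantizedCPU backend
```

With FX graph mode (`torch.quantization.quantize_fx`), these stubs are inserted automatically during tracing, which is why they are no longer needed.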
Thanks a lot @echarlaix. Works like a charm with torch FX graph mode. However, when running the static quantization test (https://github.com/huggingface/optimum/blob/main/tests/intel/test_lpot.py) and changing the batch size, I always run into an error whenever the number of samples divided by the batch size leaves a remainder, so that the last batch has fewer samples than the actual batch size (e.g. with batch size 16):
Is there some way to circumvent this?
Currently our tracing of the model with torch FX does not support dynamic input shapes, and we are working towards it. In the meantime, a simple fix could be to set `dataloader_drop_last` of the `TrainingArguments` to `True`.
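For context, `TrainingArguments(dataloader_drop_last=True)` ultimately sets `drop_last=True` on the underlying `torch.utils.data.DataLoader`, so the effect can be sketched with plain PyTorch (the dataset of 35 samples and batch size 16 below are illustrative):

```python
# Sketch of what dataloader_drop_last=True does: the incomplete final
# batch (35 % 16 = 3 samples here) is dropped, so every batch the traced
# model sees has exactly the shape it was traced with.
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.arange(35, dtype=torch.float32).unsqueeze(1))
loader = DataLoader(ds, batch_size=16, drop_last=True)
batches = [b[0].shape[0] for b in loader]  # [16, 16] -- the 3-sample remainder is gone
```

This avoids the shape mismatch but silently skips the trailing samples, which matters at inference time (see the workaround below).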
That would definitely be a good solution for training. Since I do not want to lose any samples at inference, I just duplicated the last sample to fill up the batch and later removed the duplicates again :). I'll close the issue since my initial problem is fixed with torch FX. Thanks for your help!
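The duplicate-and-trim workaround described above can be sketched as follows. `pad_batch` is a hypothetical helper, and the doubling lambda stands in for the actual quantized model call:

```python
# Sketch: pad the last batch by repeating its final sample so it matches
# the fixed batch size the traced model expects, then drop the padded
# outputs so no real sample is lost.
def pad_batch(batch, batch_size):
    """Duplicate the last sample until the batch is full; return (padded, original length)."""
    n = len(batch)
    if n < batch_size:
        batch = batch + [batch[-1]] * (batch_size - n)
    return batch, n

samples = list(range(35))       # 35 samples, batch size 16 -> last batch has 3
outputs = []
for i in range(0, len(samples), 16):
    padded, n = pad_batch(samples[i:i + 16], 16)
    preds = [x * 2 for x in padded]  # stand-in for the quantized model forward pass
    outputs.extend(preds[:n])        # keep only predictions for real samples
```

Every batch handed to the model has length 16, yet `outputs` contains exactly one prediction per original sample.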
Hello,
thanks a lot for your code and examples! I'm trying to get static quantization working in the example code, but I always get
NotImplementedError: Could not run 'quantized::linear' with arguments from the 'CPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'quantized::linear' is only available for these backends: [QuantizedCPU, BackendSelect, Named, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, UNKNOWN_TENSOR_TYPE_ID, AutogradMLC, Tracer, Autocast, Batched, VmapMode].
Could you please give me a hint on how to get this running? From what I have found out, we need to add QuantStub() and DeQuantStub() layers at the beginning and end of the BERT model? If so, is there already a class that I can use?