Thanks a lot for sharing the code.
I have a quick question: is the GELU layer fully quantized as well? I see you mention that the vision transformer is fully quantized, but I was not able to find where GELU is quantized in the code, nor any description of it in the paper. Did I miss something?
Thanks a lot for clarifying.
@Kevinpsk
Hi! Thanks for your interest in our work!
We quantize the output of GELU using QAct; the code is here.
This matches how quantized models are actually deployed. Since PTQ must guarantee that quantized inference stays consistent with the original full-precision inference, we must keep the non-linear GELU layer itself. Thus, to minimize hardware cost for the subsequent operations (e.g., the next linear layer), quantization is usually applied to the output of GELU, while the GELU function itself is evaluated on a CPU unit or implemented with a Look-Up Table (LUT).
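The LUT approach mentioned above can be made concrete with a small sketch: when the input to GELU is already an 8-bit code, there are only 256 possible inputs, so the quantized GELU output for each code can be precomputed once and inference reduces to a table lookup. The scales and zero-points below are illustrative assumptions, not values from the repo, and this is not the repo's actual QAct implementation.

```python
import math

def gelu(x):
    # Exact GELU: 0.5 * x * (1 + erf(x / sqrt(2)))
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def build_gelu_lut(in_scale, in_zp, out_scale, out_zp, n_bits=8):
    # For every possible quantized input code, precompute the quantized
    # GELU output once; at inference time GELU becomes a single lookup.
    qmax = 2 ** n_bits - 1
    lut = []
    for q in range(qmax + 1):
        x = (q - in_zp) * in_scale            # dequantize the input code
        y = gelu(x)                           # full-precision GELU
        qy = round(y / out_scale) + out_zp    # requantize the output
        lut.append(max(0, min(qmax, qy)))     # clamp to the valid range
    return lut

# Illustrative (assumed) quantization parameters:
lut = build_gelu_lut(in_scale=0.05, in_zp=128, out_scale=0.04, out_zp=10)
```

Since the table is built offline from the calibrated scales, no floating-point GELU evaluation is needed on the accelerator at inference time.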
If you are interested in a purely fixed-point implementation of GELU inference, we recommend following the work of I-BERT, and you are also welcome to try our optimizations for LayerNorm and Softmax!
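For reference, the idea behind I-BERT's i-GELU is to replace erf with a saturating second-order polynomial so that no transcendental function is needed. The sketch below uses the coefficients reported in the I-BERT paper (a = -0.2888, b = -1.769) but evaluates in floating point for clarity; I-BERT itself carries out the computation entirely in integer arithmetic, so treat this as an illustration of the approximation, not of their integer kernel.

```python
import math

def erf_poly(x, a=-0.2888, b=-1.769):
    # Second-order polynomial approximation of erf on [b, -b],
    # saturating to +/-1 outside that interval (I-BERT-style).
    sgn = 1.0 if x >= 0 else -1.0
    x_clipped = min(abs(x), -b)
    return sgn * (a * (x_clipped + b) ** 2 + 1.0)

def gelu_approx(x):
    # GELU with the polynomial erf substituted in.
    return 0.5 * x * (1.0 + erf_poly(x / math.sqrt(2.0)))

def gelu_exact(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
```

The polynomial only needs a multiply, an add, and a square, all of which map directly onto integer hardware once the scales are folded in.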