Hi,
I've noticed that the QuantAct layers preceding IntLayerNorm in the IBertSelfOutput and IBertOutput modules specify a 22-bit activation width, while the QuantAct layer preceding IntLayerNorm in IBertEmbedding specifies a 16-bit activation width.
I couldn't find any mention of these bit width choices in the paper. Could you please explain why these choices have been made?
Thank you!
Those numbers are manually chosen to (1) avoid overflow and (2) minimize accuracy degradation.
We find that activations in Embedding layers are fairly regular and contain fewer outliers, allowing 16-bit quantization without accuracy degradation. In contrast, activations in Transformer layers contain more outliers (sometimes orders of magnitude larger), so assigning 16 bits to them can have a significant impact on accuracy. We find that 22 bits is a large enough bit width to avoid a performance drop while still avoiding overflow in the subsequent IntLayerNorm layers. It would therefore also be fine to use 22 bits in the Embedding layers - that would simply be a more conservative choice.
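To see why outliers make the bit width matter, here is a minimal sketch (not the actual I-BERT code; the function names and sample values are illustrative) of symmetric uniform quantization. A single large outlier stretches the quantization scale, so at 16 bits the typical, small-magnitude activations land on a coarse grid, while 22 bits keeps their resolution:

```python
# Hedged sketch: effect of activation bit width under symmetric uniform
# quantization when an outlier stretches the dynamic range.
# Names and values are illustrative, not taken from the I-BERT source.

def quantize(x, num_bits):
    """Symmetric uniform quantization of a list of floats.

    Returns the integer codes and the scale (step size)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(abs(v) for v in x) / qmax
    return [round(v / scale) for v in x], scale

def max_rel_error(x, q, scale, outlier_threshold=10.0):
    """Worst relative rounding error on the 'typical' (non-outlier) values."""
    return max(
        abs(qi * scale - xi) / abs(xi)
        for xi, qi in zip(x, q)
        if 0 < abs(xi) < outlier_threshold
    )

# Typical activations around ~1.0 plus one outlier ~1000x larger,
# mimicking the outliers described for Transformer activations.
acts = [0.8, -1.2, 0.5, 2.0, -0.9, 1000.0]

q16, s16 = quantize(acts, 16)
q22, s22 = quantize(acts, 22)

print(max_rel_error(acts, q16, s16))  # on the order of 1e-2
print(max_rel_error(acts, q22, s22))  # on the order of 1e-4
```

With the outlier present, the 16-bit grid already introduces percent-level rounding error on the small values, whereas 22 bits keeps the error orders of magnitude smaller - consistent with the observation above that 16 bits suffices only where outliers are rare.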