We are very interested in two post-training quantization papers from Han Lab!
SmoothQuant uses W8A8 for efficient GPU computation, while AWQ uses W4/3A16 for lower memory requirements and higher effective memory throughput.
But which one is faster in actual production? If you have any data on this, could you share it with us?
W4A16 is the fastest. I believe this is discussed in the paper, something along the lines of “loading the weights makes up the majority of the latency”. Most layers in transformers are linear layers, so naturally you will see a large benefit from quantizing their weights.
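As a rough back-of-envelope illustration of why weight loading dominates at small batch sizes (the layer shape and the `linear_layer_bytes` helper below are hypothetical, not from the papers), compare the bytes of weights versus activations a single linear layer has to move per decode step:

```python
# Sketch: bytes moved by one linear layer during a single decode step.
# The 4096x4096 shape is an assumed Llama-7B-style projection, and FP16
# activations are assumed throughout; the exact numbers are illustrative only.

def linear_layer_bytes(in_features, out_features, batch, weight_bits, act_bits=16):
    """Approximate bytes read/written for one forward pass of a linear layer."""
    weight_bytes = in_features * out_features * weight_bits / 8
    act_bytes = batch * (in_features + out_features) * act_bits / 8
    return weight_bytes, act_bytes

for bits in (16, 8, 4):
    w, a = linear_layer_bytes(4096, 4096, batch=1, weight_bits=bits)
    print(f"W{bits}A16: weights {w / 1e6:.1f} MB vs activations {a / 1e3:.1f} KB per token")
```

At batch size 1 the weights are megabytes while the activations are kilobytes, which is why cutting weight precision from 16 to 4 bits translates almost directly into lower per-token latency.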
I don’t have benchmarks comparing it against SmoothQuant; the authors seem to prefer AWQ for its usability and its speed with TinyChat.
Hi @codertimo, usually W8A8 (SmoothQuant) is better for compute-bound scenarios (e.g., large batch size, targeting high throughput), and W4A16 (AWQ) is better for memory-bound scenarios (smaller batch size, lower latency). Let me know if you have more questions.
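A simple roofline-style sketch makes the compute-bound vs. memory-bound split concrete. The GPU numbers below (roughly A100-class peak TFLOPS and bandwidth), the layer shape, and the `bound_for_batch` helper are all assumptions for illustration, not measurements from either paper:

```python
# Sketch: decide whether a GEMM of a given batch size is limited by compute
# or by weight-loading bandwidth. Assumed: ~312 TFLOPS peak, ~2 TB/s HBM,
# a 4096x4096 FP16 weight matrix; weight traffic dominates memory time.

def bound_for_batch(batch, in_f=4096, out_f=4096,
                    weight_bits=16, peak_tflops=312, bandwidth_gbs=2000):
    flops = 2 * batch * in_f * out_f                 # multiply-add count for the GEMM
    bytes_moved = in_f * out_f * weight_bits / 8     # weights dominate the traffic
    compute_time = flops / (peak_tflops * 1e12)
    memory_time = bytes_moved / (bandwidth_gbs * 1e9)
    return "memory-bound" if memory_time > compute_time else "compute-bound"

for b in (1, 8, 64, 512):
    print(f"batch {b:>3}: {bound_for_batch(b)}")
```

With these assumed numbers, small batches come out memory-bound (so 4-bit weights as in AWQ cut the dominant cost), while large batches cross over to compute-bound (so INT8 compute as in SmoothQuant pays off).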