Integrate Core AutoAWQ Inference Components into Transformers #42256
Conversation
Hi @fanqiNO1! Thanks a lot for taking the time to work on this! From my side, the repository structure you suggest makes sense — I don't have any strong opinions as long as the exposed API stays the same. Just an FYI: we're currently refactoring the quantization API, but since AWQ will be inference-only, I don't expect any friction for you. Regarding the quantized linear layers, I agree it's better to keep them in `transformers`. For anything related to kernels or quantization, feel free to reach out — I can help you build the kernels and add them to the community repo.
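On the community-kernels route mentioned above, here is a minimal sketch of what consuming such a kernel could look like with the `kernels` library; the repo id below is hypothetical and only illustrates the idea, not an existing published kernel:

```python
# Sketch only: fetching a prebuilt kernel from the Hugging Face Hub with the
# `kernels` library. The repo id is hypothetical; an AWQ kernel repo would have
# to be published under kernels-community first.
from kernels import get_kernel

awq_kernels = get_kernel("kernels-community/awq")  # hypothetical repo id

# A published kernel exposes its compiled ops as attributes, e.g. roughly:
# out = awq_kernels.gemm_forward_cuda(x, qweight, scales, qzeros, 8)
```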
Thank you very much for your reply! I will migrate the components accordingly.
Thanks for this nice design proposal! If there is too much code to put into a single `awq.py` file, feel free to divide it as you proposed. As for the quantized linear layers, let's maybe focus first on this forward dispatch:

```python
if awq_ext is not None:
    # CUDA kernels from autoawq-kernels are available.
    FP16_MATMUL_HEURISTIC_CONDITION = x.shape[0] * x.shape[1] >= 1024
    if FP16_MATMUL_HEURISTIC_CONDITION:
        # Large inputs: dequantize to fp16 and use a regular matmul.
        out = awq_ext.dequantize_weights_cuda(
            qweight, scales, qzeros, 0, 0, 0, False
        )
        out = torch.matmul(x, out)
    else:
        # Small inputs: call the fused int4 GEMM kernel directly.
        out = awq_ext.gemm_forward_cuda(
            x.reshape(-1, x.shape[-1]), qweight, scales, qzeros, 8
        )
elif TRITON_AVAILABLE:
    # Triton fallback, same large-vs-small input heuristic.
    FP16_MATMUL_HEURISTIC_CONDITION = x.shape[0] * x.shape[1] >= 1024
    if FP16_MATMUL_HEURISTIC_CONDITION:
        out = awq_dequantize_triton(qweight, scales, qzeros)
        out = torch.matmul(x, out.to(x.dtype))
    else:
        out = awq_gemm_triton(
            x.reshape(-1, x.shape[-1]), qweight, scales, qzeros, split_k_iters=8,
        )
else:
    # Pure-PyTorch fallback: dequantize then matmul (slow), warn only once.
    global user_has_been_warned
    if not user_has_been_warned:
        warnings.warn("Using naive (slow) implementation." + msg)
        user_has_been_warned = True
    out = dequantize_gemm(qweight, qzeros, scales, w_bit, group_size)
    out = torch.matmul(x, out)
```
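For reference, the final branch relies on a pure-PyTorch `dequantize_gemm`. Below is a rough, self-contained sketch of what such a 4-bit dequantization can look like; the packing-order constant and shapes are assumptions about the AWQ GEMM layout, not the verbatim AutoAWQ code:

```python
# Rough sketch of a pure-PyTorch 4-bit AWQ dequantization fallback.
# Assumptions: 8 x 4-bit values packed per int32, group-wise zeros/scales,
# and the interleaved packing order given by AWQ_REVERSE_ORDER below.
import torch

AWQ_REVERSE_ORDER = [0, 4, 1, 5, 2, 6, 3, 7]  # assumed nibble order inside each int32


def unpack_int32(packed: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Split every packed int32 column into 32 // bits small unsigned integers."""
    shifts = torch.arange(0, 32, bits, device=packed.device)
    unpacked = (packed[:, :, None] >> shifts[None, None, :]) & ((1 << bits) - 1)
    return unpacked.reshape(packed.shape[0], -1)


def dequantize_gemm_sketch(qweight, qzeros, scales, bits: int = 4, group_size: int = 128):
    # qweight: (in_features, out_features // 8), int32
    # qzeros:  (in_features // group_size, out_features // 8), int32
    # scales:  (in_features // group_size, out_features), fp16
    iweight = unpack_int32(qweight, bits)
    izeros = unpack_int32(qzeros, bits)

    # Undo the interleaved nibble order (assumption, see AWQ_REVERSE_ORDER).
    order = torch.arange(iweight.shape[-1], device=iweight.device)
    order = order.reshape(-1, 32 // bits)[:, AWQ_REVERSE_ORDER].reshape(-1)
    iweight, izeros = iweight[:, order], izeros[:, order]

    # Broadcast the group-wise parameters over each group's rows and dequantize.
    scales = scales.repeat_interleave(group_size, dim=0)
    izeros = izeros.repeat_interleave(group_size, dim=0)
    return (iweight - izeros) * scales
```

Keeping a slow but dependency-free path like this around means checkpoints still load and run (with a warning) when neither the CUDA nor the Triton kernels are installed.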
Just to let you know @fanqiNO1, the creator of gptq-model also wants to maintain the AWQ inference backend. We are still thinking about what the best solution could be.
What does this PR do?
Since the `AutoAWQ` repository has been archived, this PR proposes integrating its essential inference-related components into `transformers` to ensure continued support and maintenance.
AutoAWQ Module Analysis

- `awq.evaluation`
- `awq.models` (`AutoAWQForCausalLM`): covered by the existing `transformers` model classes.
- `awq.modules.fused`: fused modules; depend on `autoawq-kernels`.
- `awq.modules.linear` (`WQLinear_*`): the core inference layers of `AutoAWQ`. Additionally, newly introduced models may bring compatibility issues, for instance Qwen3 support. Depends on `autoawq-kernels`.
- `awq.modules.triton`: `gemm` implemented in Triton, used when no `autoawq-kernels` build is available; excluded for now.
- `awq.modules.act`: `ScaledActivation`.
- `awq.quantize`
- `awq.utils`

`AutoAWQ-Kernels` consists of the following four components:

- the kernels behind `WQLinear_GEMM` in `awq.modules.linear`;
- the kernels behind `WQLinear_GEMVFast` in `awq.modules.linear` (we do not need this component, as we do not use `WQLinear_GEMVFast`; however, support could potentially be added in the future);
- the kernels behind `WQLinear_Exllama` in `awq.modules.linear`;
- the kernels behind `WQLinear_ExllamaV2` in `awq.modules.linear`.
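To make the `WQLinear_GEMM` buffer layout concrete, here is a minimal inference-only sketch; the shapes follow the usual AWQ GEMM packing (8 x 4-bit values per int32, group-wise zeros/scales), and the forward reuses the `dequantize_gemm_sketch` helper from the sketch above, so this is an assumption-laden illustration rather than the actual AutoAWQ class:

```python
import torch
from torch import nn


class WQLinearGEMMSketch(nn.Module):
    """Minimal sketch of an AWQ GEMM-packed linear layer (inference only)."""

    def __init__(self, w_bit: int, group_size: int, in_features: int, out_features: int):
        super().__init__()
        pack_factor = 32 // w_bit  # 8 int4 values per int32
        self.w_bit, self.group_size = w_bit, group_size
        # Packed weights plus group-wise quantization parameters.
        self.register_buffer("qweight", torch.zeros(in_features, out_features // pack_factor, dtype=torch.int32))
        self.register_buffer("qzeros", torch.zeros(in_features // group_size, out_features // pack_factor, dtype=torch.int32))
        self.register_buffer("scales", torch.zeros(in_features // group_size, out_features, dtype=torch.float16))
        self.register_buffer("bias", torch.zeros(out_features, dtype=torch.float16))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Placeholder path: dequantize to a dense weight and matmul. A real layer
        # would dispatch to the CUDA/Triton kernels shown in the quoted forward above.
        weight = dequantize_gemm_sketch(self.qweight, self.qzeros, self.scales, self.w_bit, self.group_size)
        return torch.matmul(x, weight.to(x.dtype)) + self.bias
```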
Design Proposal

To maintain backward compatibility and minimize disruption:
- `transformers.integrations.awq` remains functional.

@SunMarc @MekkCyber
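As a concrete check of that requirement, loading an existing AWQ checkpoint through the current public API is expected to keep working unchanged; a minimal example (the checkpoint id is just an illustration):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any AWQ-quantized checkpoint should keep loading through the same public API
# after the integration; the model id below is only an example.
model_id = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Hello, AWQ!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```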