Exllama v2 #60
Conversation
LGTM! Just a nit and a question about deprecation for v1 kernels. I didn't review the CUDA kernels themselves, but I assume those are coming from the original exllama2 repo.
@@ -32,6 +32,7 @@
 HAS_EXLLAMA = False
 try:
     from lorax_server.utils.gptq.exllama import Ex4bitLinear
+    from lorax_server.utils.gptq.exllamav2 import QuantLinear as exllamav2QuantLinear
Since we're still building the v1 kernels, would it make sense to fallback to v1 if v2 is unavailable?
HAS_EXLLAMA_V2 = True
try:
from lorax_server.utils.gptq.exllamav2 import QuantLinear as exllamav2QuantLinear
except ImportError:
HAS_EXLLAMA_V2 = False
HAS_EXLLAMA_V1 = False
if not HAS_EXLLAMA_V2:
HAS_EXLLAMA_V1 = True
try:
from lorax_server.utils.gptq.exllama import Ex4bitLinear
except ImportError:
HAS_EXLLAMA_V1 = False
...
if use_exllama:
if HAS_EXLLAMA_V2:
linear = exllamav2QuantLinear(qweight, qzeros, scales, g_idx, bias, bits, groupsize)
else:
linear = Ex4bitLinear(qweight, qzeros, scales, g_idx, bias, bits, groupsize)
Happy to defer to your judgement if you think it wouldn't make sense (in which case, we may wish to remove the v1 kernels from the repo and build process).
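The fallback suggested above boils down to probing for the v2 module at import time and only falling back to v1 when it is missing. Here is a minimal, self-contained sketch of that pattern using `importlib.util.find_spec`; the module names `exllamav2_kernels` and `exllama_kernels` are illustrative stand-ins for the real extension modules, not the actual names used in the repo.

```python
# Sketch of the v2-first, v1-fallback import probe discussed above.
# Module names are hypothetical placeholders for the compiled kernel extensions.
import importlib.util

def pick_exllama_backend(v2_module="exllamav2_kernels", v1_module="exllama_kernels"):
    """Return "v2" if the v2 extension is importable, "v1" if only v1 is, else None."""
    if importlib.util.find_spec(v2_module) is not None:
        return "v2"
    if importlib.util.find_spec(v1_module) is not None:
        return "v1"
    return None
```

The advantage over bare `try`/`except ImportError` blocks is that the probe can be unit-tested by passing in any module name, and the selection logic (`if HAS_EXLLAMA_V2: ... else: ...`) stays in one place.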
As far as I can see there is no problem with older architectures here, unlike the flash v1 vs. v2 situation.
I'll take a deeper look; I'd prefer to remove v1 to reduce the Docker image size and improve maintainability.
Hey @flozi00 just wanted to double-check if this was ready to merge before landing. Let me know :)
From my side it looks good
The shell outputs seem to be okay.
Could you please also run more tests?