Exllama v2 #60

Merged
merged 5 commits into from
Nov 27, 2023
flozi00
Collaborator

@flozi00 flozi00 commented Nov 23, 2023

shell outputs:

root@71c94e4b85b3:/workspaces/lorax# curl 127.0.0.1:80/generate     -X POST     -d '{"inputs": "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]", "parameters": {"max_new_tokens": 64}}'     -H 'Content-Type: application/json'
{"generated_text":"\n\n> Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia altogether in April and May?\n\n> Natalia sold clips to 48 of her friends in April, and then she"}

root@71c94e4b85b3:/workspaces/lorax# curl 127.0.0.1:80/generate     -X POST     -d '{"inputs": "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]", "parameters": {"max_new_tokens": 64}}'     -H 'Content-Type: application/json'
{"generated_text":"\n\n> Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia altogether in April and May?\n\n> Natalia sold clips to 48 of her friends in April, and then she"}

root@71c94e4b85b3:/workspaces/lorax# curl 127.0.0.1:80/generate     -X POST     -d '{"inputs": "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]", "parameters": {"max_new_tokens": 64}}'     -H 'Content-Type: application/json'
{"generated_text":"\n\n> Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia altogether in April and May?\n\n> Natalia sold clips to 48 of h

root@71c94e4b85b3:/workspaces/lorax# curl 127.0.0.1:80/generate     -X POST     -d '{"inputs": "### System: Du bist ein Chatbot Namens Egino. ### User: Wer bist du ? ### Assistant: ", "parameters": {"max_new_tokens": 64}}'     -H 'Content-Type: application/json'
{"generated_text":"\n\nIch bin Egino, ein Chatbot, der Fragen beantwortet. \n\nWie kann ich dir heute helfen? "}

root@71c94e4b85b3:/workspaces/lorax# 

root@71c94e4b85b3:/workspaces/lorax# 

root@71c94e4b85b3:/workspaces/lorax# curl 127.0.0.1:80/generate     -X POST     -d '{"inputs": "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]", "parameters": {"max_new_tokens": 64, "adapter_id": "flozi00/Mistral-7B-german-assistant-v6"}}'     -H 'Content-Type: application/json'
{"generated_text":"\n\n> Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia altogether in April and May?\n\n> Natalia sold clips to 48 of h

root@71c94e4b85b3:/workspaces/lorax# curl 127.0.0.1:80/generate     -X POST     -d '{"inputs": "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]", "parameters": {"max_new_tokens": 64, "adapter_id": "qblocks/mistral_7b_norobots"}}'     -H 'Content-Type: application/json'
{"generated_text":"\n\n> Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia altogether in April and May?\n\n> Natalia sold clips to 48 of her friends in April, and then she"}

root@71c94e4b85b3:/workspaces/lorax# 

Seems to be okay.
Could you please also run some more tests?
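
For repeating the requests above, a small script may be easier than retyping the curl commands. The following is only a sketch: the URL, prompt, `max_new_tokens` value, and adapter ids are copied from the shell output above, while the helper names (`build_payload`, `generate`, `identical_to_base`, `run_smoke_test`) are made up for illustration and are not part of lorax.

```python
import json
import urllib.request

# Endpoint taken from the curl commands above; adjust for your deployment.
GENERATE_URL = "http://127.0.0.1:80/generate"

PROMPT = (
    "[INST] Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did Natalia sell altogether in "
    "April and May? [/INST]"
)

# Adapter ids used in the shell output above; None means the base model.
ADAPTERS = [None, "flozi00/Mistral-7B-german-assistant-v6", "qblocks/mistral_7b_norobots"]


def build_payload(prompt, max_new_tokens=64, adapter_id=None):
    """Build the same JSON body that the curl commands send (hypothetical helper)."""
    parameters = {"max_new_tokens": max_new_tokens}
    if adapter_id is not None:
        parameters["adapter_id"] = adapter_id
    return {"inputs": prompt, "parameters": parameters}


def generate(prompt, adapter_id=None, url=GENERATE_URL, timeout=60):
    """POST one request and return the "generated_text" field of the response."""
    body = json.dumps(build_payload(prompt, adapter_id=adapter_id)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["generated_text"]


def identical_to_base(outputs):
    """Return the adapter ids whose output equals the base model output (key None)."""
    base = outputs[None]
    return [name for name, text in outputs.items() if name is not None and text == base]


def run_smoke_test():
    """Query the base model and each adapter, then report matching outputs."""
    outputs = {adapter: generate(PROMPT, adapter_id=adapter) for adapter in ADAPTERS}
    for adapter, text in outputs.items():
        print(f"{adapter or 'base model'}: {text!r}")
    print("same output as base model:", identical_to_base(outputs))
```

Calling `run_smoke_test()` against a running server prints each response and lists the adapters whose text matches the base model exactly, which makes it easy to see whether an adapter changed the generation at all.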

@flozi00 flozi00 marked this pull request as ready for review November 23, 2023 08:24
Contributor

@tgaddair tgaddair left a comment

LGTM! Just a nit and a question about deprecation for v1 kernels. I didn't review the CUDA kernels themselves, but I assume those are coming from the original exllama2 repo.

@@ -32,6 +32,7 @@
HAS_EXLLAMA = False
try:
    from lorax_server.utils.gptq.exllama import Ex4bitLinear
    from lorax_server.utils.gptq.exllamav2 import QuantLinear as exllamav2QuantLinear

Since we're still building the v1 kernels, would it make sense to fall back to v1 if v2 is unavailable?

HAS_EXLLAMA_V2 = True
try:
    from lorax_server.utils.gptq.exllamav2 import QuantLinear as exllamav2QuantLinear
except ImportError:
    HAS_EXLLAMA_V2 = False

HAS_EXLLAMA_V1 = False
if not HAS_EXLLAMA_V2:
    HAS_EXLLAMA_V1 = True
    try:
        from lorax_server.utils.gptq.exllama import Ex4bitLinear
    except ImportError:
        HAS_EXLLAMA_V1 = False

...

if use_exllama:
    if HAS_EXLLAMA_V2:
        linear = exllamav2QuantLinear(qweight, qzeros, scales, g_idx, bias, bits, groupsize)
    else:
        linear = Ex4bitLinear(qweight, qzeros, scales, g_idx, bias, bits, groupsize)

Happy to defer to your judgement if you think it wouldn't make sense (in which case, we may wish to remove the v1 kernels from the repo and build process).
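
The same fallback idea can also be written as an ordered list of candidates, which avoids the nested try blocks. This is only a sketch of the pattern, not what the PR implements: `first_available` and `EXLLAMA_CANDIDATES` are hypothetical names, and only the module paths and class names are taken from the diff above.

```python
import importlib


def first_available(candidates):
    """Return (module_path, attribute) for the first candidate that imports.

    `candidates` is a list of (module_path, attribute_name) pairs in order of
    preference. Returns (None, None) when no candidate can be imported.
    """
    for module_path, attribute_name in candidates:
        try:
            module = importlib.import_module(module_path)
        except ImportError:
            # Module (or one of its compiled extensions) is missing; try the next one.
            continue
        attribute = getattr(module, attribute_name, None)
        if attribute is not None:
            return module_path, attribute
    return None, None


# Preference order: v2 kernels first, then v1 as the fallback.
EXLLAMA_CANDIDATES = [
    ("lorax_server.utils.gptq.exllamav2", "QuantLinear"),
    ("lorax_server.utils.gptq.exllama", "Ex4bitLinear"),
]

selected_module, exllama_linear_cls = first_available(EXLLAMA_CANDIDATES)
HAS_EXLLAMA = exllama_linear_cls is not None
```

With this shape, dropping the v1 kernels later would only mean deleting the second entry from the list.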

Collaborator Author

@flozi00 flozi00 Nov 23, 2023

As far as I can see, there is no problem with older architectures, like there is with flash v1 vs v2.
I will take a deeper look, but I would prefer to remove v1 to reduce the Docker image size and the maintenance burden.

server/lorax_server/utils/gptq/exllamav2.py (outdated, resolved)
@flozi00 flozi00 changed the title from "port exllamav2" to "WIP port exllamav2" Nov 26, 2023
@tgaddair
Contributor

Hey @flozi00 just wanted to double-check if this was ready to merge before landing. Let me know :)

@flozi00
Collaborator Author

flozi00 commented Nov 27, 2023

From my side it looks good.
Maybe you could do another run to make sure everything is running correctly?

@tgaddair tgaddair merged commit 8347f58 into predibase:main Nov 27, 2023
1 check failed
@tgaddair tgaddair changed the title from "WIP port exllamav2" to "Exllama v2" Dec 4, 2023

2 participants