Exllama v2 #60
Conversation
LGTM! Just a nit and a question about deprecation for v1 kernels. I didn't review the CUDA kernels themselves, but I assume those are coming from the original exllama2 repo.
@@ -32,6 +32,7 @@
 HAS_EXLLAMA = False
 try:
     from lorax_server.utils.gptq.exllama import Ex4bitLinear
+    from lorax_server.utils.gptq.exllamav2 import QuantLinear as exllamav2QuantLinear
Since we're still building the v1 kernels, would it make sense to fallback to v1 if v2 is unavailable?
HAS_EXLLAMA_V2 = True
try:
from lorax_server.utils.gptq.exllamav2 import QuantLinear as exllamav2QuantLinear
except ImportError:
HAS_EXLLAMA_V2 = False
HAS_EXLLAMA_V1 = False
if not HAS_EXLLAMA_V2:
HAS_EXLLAMA_V1 = True
try:
from lorax_server.utils.gptq.exllama import Ex4bitLinear
except ImportError:
HAS_EXLLAMA_V1 = False
...
if use_exllama:
if HAS_EXLLAMA_V2:
linear = exllamav2QuantLinear(qweight, qzeros, scales, g_idx, bias, bits, groupsize)
else:
linear = Ex4bitLinear(qweight, qzeros, scales, g_idx, bias, bits, groupsize)
Happy to defer to your judgement if you think it wouldn't make sense (in which case, we may wish to remove the v1 kernels from the repo and build process).
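The fallback suggested above boils down to probing for the v2 module at import time and only falling back to v1 when it is missing. Here is a minimal, self-contained sketch of that pattern using `importlib.util.find_spec`; the module names `exllamav2_kernels` and `exllama_kernels` are illustrative stand-ins for the real extension modules, not the actual names used in the repo.

```python
# Sketch of the v2-first, v1-fallback import probe discussed above.
# Module names are hypothetical placeholders for the compiled kernel extensions.
import importlib.util

def pick_exllama_backend(v2_module="exllamav2_kernels", v1_module="exllama_kernels"):
    """Return "v2" if the v2 extension is importable, "v1" if only v1 is, else None."""
    if importlib.util.find_spec(v2_module) is not None:
        return "v2"
    if importlib.util.find_spec(v1_module) is not None:
        return "v1"
    return None
```

The advantage over bare `try`/`except ImportError` blocks is that the probe can be unit-tested by passing in any module name, and the selection logic (`if HAS_EXLLAMA_V2: ... else: ...`) stays in one place.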
As far as I can see there is no problem with older architectures here, unlike the flash v1 vs. v2 situation.
I'll take a deeper look; I'd prefer to remove v1 to reduce the Docker image size and improve maintainability.
Hey @flozi00 just wanted to double-check if this was ready to merge before landing. Let me know :)
From my side it looks good
The shell outputs seem to be okay.
Could you please also run more tests?