
Update to support GPTQ triton commit c90adef #1229

Merged

merged 5 commits into oobabooga:main on Apr 17, 2023

Conversation

@sgsdxzy (Contributor) commented Apr 15, 2023

The new fused_mlp seems not to work on some cards (qwopqwop200/GPTQ-for-LLaMa#179). If you pass --no-fused_mlp, everything should work.
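For example, a webui launch command along these lines should avoid it; the model name and the other flags here are only illustrative, the relevant part is appending --no-fused_mlp:

python server.py --model llama-30b-4bit-128g --wbits 4 --groupsize 128 --no-fused_mlp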

@sgsdxzy changed the title from "Update to support triton commit c90adef" to "Update to support GPTQ triton commit c90adef" on Apr 15, 2023
@EyeDeck (Contributor) commented Apr 15, 2023

Runs on my machine, but the new GPTQ-for-LLaMA code gives garbage output. Seems to either pick a token at random and just spam it endlessly, or start spewing irrelevant nonsense code, or with some luck it might generate a vaguely sensical paragraph in English that has nothing to do with the input.

Additionally, there's still the same significant memory usage increase as in #1073.

However, if I skip the quant.make_quant_attn(model) call by running with --no-quant_attn, it works fine again. The output looks normal, and it's back to crashing at precisely the 1980th input token (rather than the ~1300th). Hmm...
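For reference, the working run is essentially the same kind of launch with the quantized-attention path disabled; everything except --no-quant_attn below is illustrative:

python server.py --model llama-30b-4bit-128g --wbits 4 --groupsize 128 --no-quant_attn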

@sgsdxzy (Contributor, Author) commented Apr 15, 2023

> Runs on my machine, but the new GPTQ-for-LLaMA code gives garbage output. Seems to either pick a token at random and just spam it endlessly, or start spewing irrelevant nonsense code, or with some luck it might generate a vaguely sensical paragraph in English that has nothing to do with the input.
>
> Additionally, there's still the same significant memory usage increase as in #1073.
>
> However, if I skip the quant.make_quant_attn(model) call by running with --no-quant_attn, it works fine again. The output looks normal, and it's back to crashing at precisely the 1980th input token. Hmm...

What's your GPU? I have a 2080 Ti, and:

  1. fused_mlp fails for me.
  2. (EDIT) quant_attn before commit c90adef is indeed ~10% faster for me, but after today's new change it makes the model give garbage output too.

@EyeDeck (Contributor) commented Apr 15, 2023

3090 in my case.

If I run llama_inference.py directly... I'm not sure if it works either.

CUDA_VISIBLE_DEVICES=0 python llama_inference.py ../../models/LLaMA-30B-4bit-128g/ --wbits 4 --groupsize 128 --load ../../models/LLaMA-30B-4bit-128g/LLaMA-30B-4bit-128g-tsao.pt --text "this is llama"
Loading model ...
Found 3 unique KN Linear values.
Warming up autotune cache ...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:27<00:00,  2.27s/it]
Found 1 unique fused mlp KN values.
Warming up autotune cache ...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:30<00:00,  2.52s/it]
Done.
  <s>this is llama who is currently in his last year of school.
Sadly, some of the older students who were involved in the project in the past have been completely removed from the school system. They were taken to a place where

CUDA_VISIBLE_DEVICES=0 python llama_inference.py ../../models/LLaMA-30B-4bit-128g/ --wbits 4 --groupsize 128 --load ../../models/LLaMA-30B-4bit-128g/LLaMA-30B-4bit-128g-tsao.pt --text "this is llama" --min_length 1000 --max_length 1000
[...]
  <s>this is llama_2000_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00

CUDA_VISIBLE_DEVICES=0 python llama_inference.py ../../models/LLaMA-30B-4bit-128g/ --wbits 4 --groupsize 128 --load ../../models/LLaMA-30B-4bit-128g/LLaMA-30B-4bit-128g-tsao.pt --text "this is llama" --min_length 1000 --max_length 1000
[...]
  <s>this is llama100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

CUDA_VISIBLE_DEVICES=0 python llama_inference.py ../../models/LLaMA-30B-4bit-128g/ --wbits 4 --groupsize 128 --load ../../models/LLaMA-30B-4bit-128g/LLaMA-30B-4bit-128g-tsao.pt --text "this is llama" --min_length 1000 --max_length 1000
[...]
  <s>this is llama 1000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

That first run was...almost sensical.

@sgsdxzy (Contributor, Author) commented Apr 16, 2023

Update: I found that qwopqwop200/GPTQ-for-LLaMa@47dd6b3 breaks quant_attn. After reverting to qwopqwop200/GPTQ-for-LLaMa@508de42, quant_attn works and gives me a ~10% speed increase.
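For anyone wanting to try the same revert, pinning the checkout to the working commit looks roughly like this, assuming GPTQ-for-LLaMa sits in the webui's usual repositories/ folder:

cd repositories/GPTQ-for-LLaMa
git checkout 508de42  # last commit where quant_attn works here; 47dd6b3 breaks it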

@sgsdxzy (Contributor, Author) commented Apr 17, 2023

@oobabooga I think this PR is ready to go. It lets users run the latest triton branch while giving them the option to disable specific functionalities.

@oobabooga (Owner) commented Apr 17, 2023

Thanks for the confirmation, @sgsdxzy. I'll merge now.

@oobabooga merged commit b57ffc2 into oobabooga:main on Apr 17, 2023
@sgsdxzy deleted the triton branch on Apr 17, 2023 at 04:27
Ph0rk0z pushed a commit to Ph0rk0z/text-generation-webui-testing that referenced this pull request on Apr 17, 2023