
Update to support GPTQ triton commit c90adef #1229

Merged

merged 5 commits into oobabooga:main on Apr 17, 2023

Conversation

@sgsdxzy (Contributor) commented Apr 15, 2023

The new fused_mlp seems not to work on some cards (qwopqwop200/GPTQ-for-LLaMa#179). If you pass --no-fused_mlp, everything should work.
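For example, a webui launch command along these lines should avoid it; the model name and the other flags here are only illustrative, the relevant part is appending --no-fused_mlp:

python server.py --model llama-30b-4bit-128g --wbits 4 --groupsize 128 --no-fused_mlp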

@sgsdxzy changed the title from "Update to support triton commit c90adef" to "Update to support GPTQ triton commit c90adef" on Apr 15, 2023
@EyeDeck (Contributor) commented Apr 15, 2023

Runs on my machine, but the new GPTQ-for-LLaMA code gives garbage output. Seems to either pick a token at random and just spam it endlessly, or start spewing irrelevant nonsense code, or with some luck it might generate a vaguely sensical paragraph in English that has nothing to do with the input.

Additionally, there's still the same significant memory usage increase as in #1073.

However, if I skip the quant.make_quant_attn(model) call by running with --no-quant_attn, it works fine again. The output looks normal, and it's back to crashing at precisely the 1980th input token (rather than the ~1300th). Hmm...
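For reference, the working run is essentially the same kind of launch with the quantized-attention path disabled; everything except --no-quant_attn below is illustrative:

python server.py --model llama-30b-4bit-128g --wbits 4 --groupsize 128 --no-quant_attn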

@sgsdxzy (Contributor, Author) commented Apr 15, 2023

> Runs on my machine, but the new GPTQ-for-LLaMA code gives garbage output. Seems to either pick a token at random and just spam it endlessly, or start spewing irrelevant nonsense code, or with some luck it might generate a vaguely sensical paragraph in English that has nothing to do with the input.
>
> Additionally, there's still the same significant memory usage increase as in #1073.
>
> However, if I skip the quant.make_quant_attn(model) call by running with --no-quant_attn, it works fine again. The output looks normal, and it's back to crashing at precisely the 1980th input token. Hmm...

What's your GPU? I have a 2080 Ti, and:

  1. fused_mlp fails for me.
  2. (EDIT) quant_attn before commit c90adef is indeed ~10% faster for me, but after today's new change it makes the model give garbage output too.

@EyeDeck (Contributor) commented Apr 15, 2023

3090 in my case.

If I run llama_inference.py directly... I'm not sure if it works either.

CUDA_VISIBLE_DEVICES=0 python llama_inference.py ../../models/LLaMA-30B-4bit-128g/ --wbits 4 --groupsize 128 --load ../../models/LLaMA-30B-4bit-128g/LLaMA-30B-4bit-128g-tsao.pt --text "this is llama"
Loading model ...
Found 3 unique KN Linear values.
Warming up autotune cache ...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:27<00:00,  2.27s/it]
Found 1 unique fused mlp KN values.
Warming up autotune cache ...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:30<00:00,  2.52s/it]
Done.
  <s>this is llama who is currently in his last year of school.
Sadly, some of the older students who were involved in the project in the past have been completely removed from the school system. They were taken to a place where

CUDA_VISIBLE_DEVICES=0 python llama_inference.py ../../models/LLaMA-30B-4bit-128g/ --wbits 4 --groupsize 128 --load ../../models/LLaMA-30B-4bit-128g/LLaMA-30B-4bit-128g-tsao.pt --text "this is llama" --min_length 1000 --max_length 1000
[...]
  <s>this is llama_2000_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00_00

CUDA_VISIBLE_DEVICES=0 python llama_inference.py ../../models/LLaMA-30B-4bit-128g/ --wbits 4 --groupsize 128 --load ../../models/LLaMA-30B-4bit-128g/LLaMA-30B-4bit-128g-tsao.pt --text "this is llama" --min_length 1000 --max_length 1000
[...]
  <s>this is llama100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

CUDA_VISIBLE_DEVICES=0 python llama_inference.py ../../models/LLaMA-30B-4bit-128g/ --wbits 4 --groupsize 128 --load ../../models/LLaMA-30B-4bit-128g/LLaMA-30B-4bit-128g-tsao.pt --text "this is llama" --min_length 1000 --max_length 1000
[...]
  <s>this is llama 1000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

That first run was...almost sensical.

@sgsdxzy (Contributor, Author) commented Apr 16, 2023

Update: I found that qwopqwop200/GPTQ-for-LLaMa@47dd6b3 breaks quant_attn. After reverting to qwopqwop200/GPTQ-for-LLaMa@508de42, quant_attn works and gives me a ~10% speed increase.
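For anyone wanting to try the same revert, pinning the checkout to the working commit looks roughly like this, assuming GPTQ-for-LLaMa sits in the webui's usual repositories/ folder:

cd repositories/GPTQ-for-LLaMa
git checkout 508de42  # last commit where quant_attn works here; 47dd6b3 breaks it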

@sgsdxzy (Contributor, Author) commented Apr 17, 2023

@oobabooga I think this PR is ready to go. It lets users run the latest triton branch while giving them the option to disable specific functionalities.

@oobabooga (Owner) commented Apr 17, 2023

Thanks for the confirmation, @sgsdxzy. I'll merge now.

@oobabooga merged commit b57ffc2 into oobabooga:main on Apr 17, 2023
@sgsdxzy deleted the triton branch on Apr 17, 2023 at 04:27
Ph0rk0z pushed a commit to Ph0rk0z/text-generation-webui-testing that referenced this pull request on Apr 17, 2023