
Generalize GPTQ_loader, support any model #615

Merged: 6 commits merged into oobabooga:main from mayaeary:feature/gpt-j-4bit-v2 on Mar 28, 2023

Conversation

@mayaeary (Contributor) commented Mar 28, 2023

Improved version of #521

Generalized version of the quantized loader: it now auto-detects the model type from the model file, so GPT-J and Pygmalion-6B can be loaded without juggling repositories.

I'll try to make a generalized offload version, but for now only LLaMA supports it.

You can quantize models using my fork: https://github.com/mayaeary/GPTQ-for-LLaMa/tree/gptj-v2
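The auto-detection idea, roughly: every Hugging Face checkpoint ships a config.json whose model_type field names the architecture, so a loader can dispatch on that instead of hard-coding LLaMA. A minimal sketch of that detection step, assuming a standard checkpoint layout (this is not the PR's actual code):

```python
import json
from pathlib import Path

def detect_model_type(model_dir: str) -> str:
    """Return the architecture declared in the checkpoint's config.json."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    return config["model_type"]  # e.g. "llama", "gptj", "gpt_neox"
```

A generalized loader can then map that string to the matching transformers model class and apply the same 4-bit layer replacement to whichever architecture it finds.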

Pre-quantized (--wbits 4 --groupsize 128):

```
# Download
python download-model.py https://huggingface.co/mayaeary/pygmalion-6b-4bit-128g

# Launch
python server.py --model pygmalion-6b-4bit-128g --wbits 4 --groupsize 128 --cai-chat
```

@Ph0rk0z (Contributor) commented Mar 28, 2023

This needs a PR to GPTQ; technically GPT-NeoX and GPT-Neo can be handled the same way. This repo doesn't want to keep swapping out the GPTQ dependency.

The steps are: get this merged upstream, then get LoRA support into upstream and PEFT as well. Otherwise these changes never get merged.

@oobabooga (Owner)

This is very impressive.

@oobabooga (Owner)

I have done some basic sanity tests to check whether everything is equivalent to the current code, and the answer is yes.

@Ph0rk0z I agree that it would be nice to have this functionality merged upstream in GPTQ-for-LLaMa, but I see no reason not to use Maya's code until that happens. The only caveat is that we will have to watch for future changes to the upstream make_quant and load_quant functions.
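One defensive pattern for that caveat, sketched under the assumption that the GPTQ-for-LLaMa code is importable as a local quant module (the module name and the optional groupsize parameter are assumptions for illustration):

```python
import inspect

import quant  # assumed: the vendored GPTQ-for-LLaMa module

def load_quant_compat(model, checkpoint, wbits, groupsize):
    """Call quant.load_quant, passing groupsize only if this revision accepts it."""
    params = inspect.signature(quant.load_quant).parameters
    if "groupsize" in params:
        return quant.load_quant(model, checkpoint, wbits, groupsize)
    return quant.load_quant(model, checkpoint, wbits)
```

A shim like this keeps the webui working across upstream revisions that add or drop arguments, instead of failing with a TypeError when the signature drifts.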

For reference, these are the VRAM usages that I have seen for Pygmalion:

  • Soon after loading: 4.5 GB
  • Full context length: 7.8 GB

Test results

Prompt:

```
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Who is best waifu, Rei or Asuka?
### Response:
```

Alpaca-30B-Int4

```
python server.py --wbits 4 --model Alpaca-30B-Int4 --listen
```

Maya:

```
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Who is best waifu, Rei or Asuka?
### Response:
 It's a difficult choice, but I think Rei is the best waifu. She is kind, caring, and loyal, and she always puts others before herself. She is also a powerful warrior and a great pilot, making her a great choice for a waifu.
```

Output generated in 4.48 seconds (13.39 tokens/s, 60 tokens, context 47)

Main:

```
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Who is best waifu, Rei or Asuka?
### Response:
 It's a difficult choice, but I think Rei is the best waifu. She is kind, caring, and loyal, and she always puts others before herself. She is also a powerful warrior and a great pilot, making her a great choice for a waifu.
```

Output generated in 4.50 seconds (13.33 tokens/s, 60 tokens, context 47)

alpaca-native-4bit

```
python server.py --model alpaca-native-4bit --wbits 4 --groupsize 128 --listen
```

Maya:

```
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Who is best waifu, Rei or Asuka?
### Response:
 Personally, I think Rei is the best waifu. She is wise, kind, and always looks out for her friends. She is also strong and courageous, never backing down from a challenge. Asuka is also very powerful, but she can be impulsive and reckless at times. Rei, on the other hand, is more thoughtful and calculated in her decisions.
```

Output generated in 2.36 seconds (34.81 tokens/s, 82 tokens, context 47)

Main:

```
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Who is best waifu, Rei or Asuka?
### Response:
 Personally, I think Rei is the best waifu. She is wise, kind, and always looks out for her friends. She is also strong and courageous, never backing down from a challenge. Asuka is also very powerful, but she can be impulsive and reckless at times. Rei, on the other hand, is more thoughtful and calculated in her decisions.
```

Output generated in 2.33 seconds (35.18 tokens/s, 82 tokens, context 47)

@oobabooga oobabooga merged commit b2f356a into oobabooga:main Mar 28, 2023
@mayaeary mayaeary deleted the feature/gpt-j-4bit-v2 branch March 29, 2023 08:03
@treshphilip

Can someone convert the model https://huggingface.co/TehVenom/PPO_Pygway-V8p4_Dev-6b to 4-bit, please? I don't have enough GPU memory to do it myself.

@mayaeary (Contributor, Author)

> Can someone convert the model https://huggingface.co/TehVenom/PPO_Pygway-V8p4_Dev-6b to 4-bit, please? I don't have enough GPU memory to do it myself.

Done, added to the first post.
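For anyone converting similar models themselves, the invocation against the fork would look roughly like this (the gptj.py script name, the c4 calibration set, and the argument order are assumed by analogy with upstream's llama.py, not verified against the fork):

```
# Hypothetical invocation of mayaeary/GPTQ-for-LLaMa (gptj-v2 branch)
python gptj.py models/PPO_Pygway-V8p4_Dev-6b c4 --wbits 4 --groupsize 128 --save ppo-pygway-6b-4bit-128g.pt
```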

@treshphilip

> > Can someone convert the model https://huggingface.co/TehVenom/PPO_Pygway-V8p4_Dev-6b to 4-bit, please? I don't have enough GPU memory to do it myself.
>
> Done, added to the first post.

Thank you very much! I will go test it.

Ph0rk0z pushed a commit to Ph0rk0z/text-generation-webui-testing that referenced this pull request Apr 17, 2023
…y/feature/gpt-j-4bit-v2)

This includes Pygmalion 4bit