# GPTQ
GPTQ is a post-training quantization format optimized for GPU inference. It uses a calibration dataset to reduce the error introduced by quantization.
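
To make low-bit quantization concrete, here is a minimal sketch of plain round-to-nearest, group-wise quantization in pure Python. It is not the GPTQ algorithm itself, which additionally uses calibration data to compensate for rounding error; the function names `quantize_groupwise` and `dequantize_groupwise` are hypothetical and only illustrate what the number of bits and the group size control.

```python
def quantize_groupwise(weights, bits=4, group_size=128):
    """Round-to-nearest quantization with one scale and offset per group."""
    levels = 2 ** bits - 1  # 4 bits -> integer codes 0..15
    groups = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        w_min, w_max = min(group), max(group)
        scale = (w_max - w_min) / levels
        if scale == 0:
            scale = 1.0  # constant group: any scale reproduces it exactly
        codes = [round((w - w_min) / scale) for w in group]
        groups.append((codes, scale, w_min))
    return groups


def dequantize_groupwise(groups):
    """Map integer codes back to approximate floating-point weights."""
    restored = []
    for codes, scale, w_min in groups:
        restored.extend(code * scale + w_min for code in codes)
    return restored


# Toy weights, split into two groups of four for illustration.
weights = [-0.52, -0.11, 0.03, 0.27, 0.41, 0.98, -0.30, 0.64]
groups = quantize_groupwise(weights, bits=4, group_size=4)
restored = dequantize_groupwise(groups)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"max reconstruction error: {max_error:.4f}")
```

Smaller groups give each scale fewer weights to cover, which lowers the rounding error at the cost of storing more scales.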

### Quantizing with [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ)

Let's do a short demo and quantize Mistral 7B.

First, we install `auto-gptq`, which lets us quantize models with GPTQ and run inference on the results.

In [4]:
!pip install auto-gptq --no-build-isolation



Once that's done, we can download the model we want to quantize. First, let's log in with a read access token so that we can access the model.

Note: You must first accept the terms on the model's repository page.

In [5]:
from huggingface_hub import login

login("read_token")

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful
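
A token typed directly into a cell is easy to leak when the notebook is shared. One possible alternative, sketched here under the assumption that the token is stored in an environment variable named `HF_TOKEN`, is to read it at runtime. The helpers `read_hf_token` and `login_from_env` are hypothetical names, not part of `huggingface_hub`.

```python
import os


def read_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in `env_var`, or raise if it is missing."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your read token.")
    return token


def login_from_env(env_var="HF_TOKEN"):
    """Log in to the Hugging Face Hub with the token found in `env_var`."""
    # Imported lazily so read_hf_token works without huggingface_hub installed.
    from huggingface_hub import login

    login(token=read_hf_token(env_var))
```

Calling `login_from_env()` would then replace the `login("read_token")` call above.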


Now everything is ready, so we can load the model and quantize it. Here, we quantize the model to 4-bit.

In [7]:
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import logging

pretrained_model_dir = "mistralai/Mistral-7B-Instruct-v0.3"
quantized_model_dir = "mistral_gptq_quant"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)

examples = [
    tokenizer(
        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on the GPTQ algorithm."
    )
]

quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=False,  # False can significantly speed up inference, at the cost of slightly worse perplexity
)

model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)

# quantize the model; examples must be a list of dicts whose only keys are "input_ids" and "attention_mask"
model.quantize(examples)

INFO - Start quantizing layer 1/32
INFO - Quantizing self_attn.k_proj in layer 1/32...
INFO - Quantizing self_attn.v_proj in layer 1/32...
INFO - Quantizing self_attn.q_proj in layer 1/32...
INFO - Quantizing self_attn.o_proj in layer 1/32...
INFO - Quantizing mlp.up_proj in layer 1/32...
INFO - Quantizing mlp.gate_proj in layer 1/32...
INFO - Quantizing mlp.down_proj in layer 1/32...
INFO - Start quantizing layer 2/32
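
The cell above calibrates on a single sentence, which keeps the demo short. In practice a larger and more representative set of texts is likely to give better results. The sketch below shows one way to build such a list while keeping only the two keys that `quantize` accepts. `build_calibration_examples` is a hypothetical helper, and `toy_tokenize` is a stand-in so the example runs without downloading a tokenizer.

```python
def build_calibration_examples(tokenize, texts, max_length=512):
    """Tokenize `texts`, keeping only the keys that `quantize` accepts."""
    examples = []
    for text in texts:
        encoded = tokenize(text)
        examples.append({
            # Truncate long samples so calibration memory stays bounded.
            "input_ids": list(encoded["input_ids"])[:max_length],
            "attention_mask": list(encoded["attention_mask"])[:max_length],
        })
    return examples


def toy_tokenize(text):
    """Stand-in tokenizer: one fake id per word, plus an extra key to be dropped."""
    ids = [len(word) for word in text.split()]
    return {
        "input_ids": ids,
        "attention_mask": [1] * len(ids),
        "token_type_ids": [0] * len(ids),
    }


texts = ["GPTQ uses calibration data.", "More samples usually help."]
examples = build_calibration_examples(toy_tokenize, texts, max_length=3)
```

With the real tokenizer from the cell above, the call would be `build_calibration_examples(tokenizer, texts)`, and the result can be passed to `model.quantize`.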

Now that the model is quantized, we can save it to share it or load it later. Since quantizing with GPTQ takes time and compute, it's advisable to always save the quantized model.

In [8]:
# save the quantized weights in safetensors format
model.save_quantized(quantized_model_dir, use_safetensors=True)

# save the tokenizer alongside the weights so the directory is self-contained
tokenizer.save_pretrained(quantized_model_dir)

Model quantized and saved to GPTQ 4-bit precision!
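
As a rough sanity check on the saved model's size, the arithmetic below estimates the storage needed for 4-bit weights with a group size of 128. It is a back-of-the-envelope sketch that assumes about 7.25 billion parameters, one 16-bit scale and one 4-bit zero point per group, and that every weight is quantized. Layers left unquantized, such as `lm_head`, make the real files larger. `estimate_size_gb` is a hypothetical helper.

```python
def estimate_size_gb(n_params, bits=4, group_size=128, scale_bits=16):
    """Approximate storage for quantized weights, in gigabytes (1e9 bytes)."""
    # Each group of `group_size` weights shares one scale and one zero point.
    overhead_bits = (scale_bits + bits) / group_size
    bits_per_weight = bits + overhead_bits
    return n_params * bits_per_weight / 8 / 1e9


n_params = 7.25e9  # assumed parameter count for Mistral 7B
fp16_gb = n_params * 16 / 8 / 1e9
gptq_gb = estimate_size_gb(n_params, bits=4, group_size=128)
print(f"fp16: {fp16_gb:.1f} GB, 4-bit GPTQ: {gptq_gb:.1f} GB")
```

Under these assumptions, each weight costs a little over 4 bits, so the quantized model needs roughly a quarter of the fp16 storage.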

You can also load it for inference using `auto-gptq` as follows:

In [10]:
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0") # loads quantized model to the first GPU
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir)

conversation = [{"role": "user", "content": "How are you today?"}]

prompt = tokenizer.apply_chat_template(
    conversation=conversation,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")  # moves tensors to the first GPU

outputs = model.generate(**inputs, max_new_tokens=32)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(response)

1. You disabled CUDA extensions compilation by setting BUILD_CUDA_EXT=0 when install auto_gptq from source.
2. You are using pytorch without CUDA support.
3. CUDA and nvcc are not installed in your device.
INFO - The layer lm_head is not quantized.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


How are you today? I don't have feelings, but I'm here and ready to help you with your questions or problems! How can I assist you today? 😊
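
The printed response starts with the user's question because, for a decoder-only model, `generate` returns the prompt tokens followed by the newly generated ones. A minimal sketch of keeping only the new tokens is shown below, using plain lists with made-up token ids in place of tensors. `strip_prompt` is a hypothetical helper.

```python
def strip_prompt(output_ids, prompt_length):
    """Return only the tokens generated after the prompt."""
    return output_ids[prompt_length:]


prompt_ids = [1, 3, 1128, 4]                # made-up ids standing in for the prompt
output_ids = prompt_ids + [1083, 1717, 2]   # prompt followed by new tokens
new_ids = strip_prompt(output_ids, len(prompt_ids))
print(new_ids)
```

In the cell above, the equivalent slice would be `outputs[0][inputs["input_ids"].shape[1]:]`, applied before calling `tokenizer.decode`.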
