# AWQ
Like GPTQ, AWQ (Activation-aware Weight Quantization) is optimized for GPU inference. It builds on the observation that only a small fraction of the weights (around 1%) contributes disproportionately to the model's accuracy. AWQ runs a calibration dataset through the model, analyzes the activation distributions, and uses them to identify these salient weights so they can be handled with extra care during quantization.
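To make the idea concrete, here is a toy sketch in plain Python. It is not AutoAWQ's implementation: it uses simple symmetric rounding, a single hand-picked scale factor `s`, and hand-picked numbers so the effect is visible. The salient channel is the one with the largest activation; its weight is scaled up before quantization and its activation is scaled down by the same factor, so the exact output is unchanged.

```python
def quantize_dequantize(weights, n_bits=4):
    """Symmetric round-to-nearest quantization of one weight group (toy version)."""
    q_max = 2 ** (n_bits - 1) - 1                      # 7 for 4-bit
    step = max(abs(w) for w in weights) / q_max        # one step size per group
    return [max(-q_max, min(q_max, round(w / step))) * step for w in weights]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hand-picked toy values (assumption): channel 2 sees much larger activations.
weights = [0.7, -0.3, 0.14, 0.2]
activations = [0.1, 0.1, 10.0, 0.1]
exact = dot(weights, activations)

# Plain quantization: every weight is treated the same way.
err_plain = abs(dot(quantize_dequantize(weights), activations) - exact)

# Activation-aware variant: scale the salient weight up by s and its
# activation down by s. The product w * x is mathematically unchanged.
s = 2.0  # hypothetical scale; real AWQ searches for per-channel scales
salient = max(range(len(activations)), key=lambda i: abs(activations[i]))
scaled_w = [w * s if i == salient else w for i, w in enumerate(weights)]
scaled_x = [x / s if i == salient else x for i, x in enumerate(activations)]
err_scaled = abs(dot(quantize_dequantize(scaled_w), scaled_x) - exact)

print(f"output error without scaling: {err_plain:.3f}")
print(f"output error with scaling:    {err_scaled:.3f}")
```

For a single weight the rounding error can go either way, so treat this as an illustration of the mechanism rather than a guarantee for every value.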

### Quantizing with [AutoAWQ](https://github.com/casper-hansen/AutoAWQ)

Let's do a short demo and quantize Mistral 7B!

First, we install `autoawq`, which lets us quantize models and run inference on AWQ models. By default, AutoAWQ also provides a `pile-val` dataset that is used as calibration data during quantization.

In [1]:
!pip install autoawq

Collecting autoawq
  Downloading autoawq-0.2.6-cp310-cp310-manylinux2014_x86_64.whl.metadata (18 kB)
Collecting datasets (from autoawq)
  Downloading datasets-2.21.0-py3-none-any.whl.metadata (21 kB)
Collecting zstandard (from autoawq)
  Downloading zstandard-0.23.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.0 kB)
Collecting autoawq-kernels (from autoawq)
  Downloading autoawq_kernels-0.0.7-cp310-cp310-manylinux2014_x86_64.whl.metadata (2.0 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch==2.3.1->autoawq)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch==2.3.1->autoawq)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch==2.3.1->autoawq)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting n

Once we're done, we can download the model we want to quantize. First, let's log in with a read access token so we have access to the models.

Note: You first need to accept the terms on the model's repo page.

In [2]:
from huggingface_hub import login

login("read_token")  # replace "read_token" with your own read access token

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful
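Hard-coding a token in a notebook makes it easy to leak when the notebook is shared. One possible alternative, sketched below, is to read it from an environment variable. The helper name `read_hf_token` is hypothetical, and the variable name `HF_TOKEN` is an assumption about how you store the token.

```python
import os


def read_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in `env_var`, or fail with a clear message."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to a read access token first."
        )
    return token


# Usage in the notebook (requires huggingface_hub):
# from huggingface_hub import login
# login(read_hf_token())
```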


Now everything is ready, so we can load the model and quantize it! Here, we will quantize the model to 4-bit!
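To get a feel for what 4-bit buys us, here is a back-of-the-envelope estimate. It assumes a round figure of 7 billion parameters, groups of 128 weights, and one 16-bit scale plus one 4-bit zero point stored per group. All of these are simplifying assumptions, and layers such as embeddings usually stay in higher precision, so a real checkpoint will be somewhat larger.

```python
def bits_per_weight(w_bit=4, group_size=128, scale_bits=16, zero_bits=4):
    """Average storage per weight, including per-group scale and zero point."""
    return w_bit + (scale_bits + zero_bits) / group_size


def size_gib(n_params, bits):
    """Size in GiB of n_params values stored at `bits` bits each."""
    return n_params * bits / 8 / 2**30


n_params = 7e9  # assumption: round figure for a 7B model
fp16_size = size_gib(n_params, 16)
awq_size = size_gib(n_params, bits_per_weight())

print(f"bits per weight: {bits_per_weight()}")
print(f"fp16:  {fp16_size:.1f} GiB")
print(f"4-bit: {awq_size:.1f} GiB")
```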

In [3]:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

pretrained_model_dir = "mistralai/Mistral-7B-Instruct-v0.3"
quantized_model_dir = "mistral_awq_quant"

model = AutoAWQForCausalLM.from_pretrained(
    pretrained_model_dir, low_cpu_mem_usage=True, use_cache=False
)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, trust_remote_code=True)

# quantize the model: 4-bit weights, groups of 128 weights, one zero point per group
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


AWQ: 100%|██████████| 32/32 [17:47<00:00, 33.37s/it]
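The `w_bit`, `q_group_size`, and `zero_point` settings describe group-wise quantization with a zero point. The sketch below shows one common formulation of that scheme in plain Python. It is an illustration under stated assumptions, not AutoAWQ's kernel code: the function names are hypothetical, and production implementations typically also keep the zero point within the integer range and pack the integers into wider words.

```python
import math


def quantize_group(weights, w_bit=4):
    """Map one group of floats to integers in [0, 2**w_bit - 1]."""
    q_max = 2 ** w_bit - 1                      # 15 for 4-bit
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / q_max
    if scale == 0:                              # constant group: avoid dividing by zero
        scale = 1.0
    zero = round(-w_min / scale)                # integer that represents 0.0
    q = [max(0, min(q_max, round(w / scale) + zero)) for w in weights]
    return q, scale, zero


def dequantize_group(q, scale, zero):
    """Recover approximate floats from the stored integers."""
    return [(v - zero) * scale for v in q]


def quantize_row(row, q_group_size=128, w_bit=4):
    """Quantize a row of weights group by group; each group has its own scale and zero."""
    return [
        quantize_group(row[i:i + q_group_size], w_bit)
        for i in range(0, len(row), q_group_size)
    ]


# Toy weights (assumption): 256 values, so two groups of 128.
row = [math.sin(i) for i in range(256)]
groups = quantize_row(row)

worst = 0.0
for g_idx, (q, scale, zero) in enumerate(groups):
    original = row[g_idx * 128:(g_idx + 1) * 128]
    restored = dequantize_group(q, scale, zero)
    worst = max(worst, max(abs(a - b) for a, b in zip(original, restored)))

print(f"groups: {len(groups)}, worst reconstruction error: {worst:.4f}")
```

Smaller groups give each scale fewer weights to cover, which tends to lower the error at the cost of storing more scales and zero points.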


Now that the model is quantized, we can save it to share it or load it later! Since quantizing with AWQ takes time and compute, it's a good idea to always save the result.

In [4]:
model.save_quantized(quantized_model_dir)

tokenizer.save_pretrained(quantized_model_dir)

Note that `shard_checkpoint` is deprecated and will be removed in v4.44. We recommend you using split_torch_state_dict_into_shards from huggingface_hub library


('mistral_awq_quant/tokenizer_config.json',
 'mistral_awq_quant/special_tokens_map.json',
 'mistral_awq_quant/tokenizer.model',
 'mistral_awq_quant/added_tokens.json',
 'mistral_awq_quant/tokenizer.json')

The model is now quantized to 4-bit AWQ precision and saved!
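A quick sanity check after saving is to look at how much space the output directory takes on disk. The helper below is a small sketch using only the standard library; the name `dir_size_gib` is hypothetical.

```python
import os


def dir_size_gib(path):
    """Total size in GiB of all files under `path`, including subdirectories."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(root, name))
    return total_bytes / 2**30


# Usage in the notebook, after saving:
# print(f"{dir_size_gib(quantized_model_dir):.2f} GiB")
```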

You can also load it for inference using `autoawq` as follows:

In [5]:
model = AutoAWQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0") # loads quantized model to the first GPU
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir)

conversation = [{"role": "user", "content": "How are you today?"}]

prompt = tokenizer.apply_chat_template(
    conversation,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")  # moves tensors to the first GPU

outputs = model.generate(**inputs, max_new_tokens=32)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(response)

Replacing layers...: 100%|██████████| 32/32 [00:07<00:00,  4.23it/s]
Fusing layers...: 100%|██████████| 32/32 [00:00<00:00, 130.00it/s]
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


How are you today? I'm an AI and don't have feelings, but I'm here and ready to help you with your questions! How can I assist you today
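The decoded response starts with the prompt because `generate` returns the prompt tokens followed by the newly generated ones. If you only want the model's reply, one option is to drop the first `prompt_length` tokens before decoding. The helper name below is hypothetical, and the sketch works on any sliceable sequence of token ids.

```python
def new_tokens_only(output_ids, prompt_length):
    """Return only the tokens generated after the prompt."""
    return output_ids[prompt_length:]


# Usage in the notebook (variables from the cell above):
# prompt_length = inputs["input_ids"].shape[1]
# reply_ids = new_tokens_only(outputs[0], prompt_length)
# print(tokenizer.decode(reply_ids, skip_special_tokens=True))
```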
