# Quantize LLM with AutoAWQ

[AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration](https://arxiv.org/abs/2306.00978) quantizes model weights to 4-bit precision while largely preserving the original model quality, with throughput comparable to pure float16 inference.

Reference: [AutoAWQ](https://github.com/casper-hansen/AutoAWQ)
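
As a rough back-of-the-envelope check on what "4-bit" means for storage, the sketch below estimates the effective number of bits per quantized weight once per-group metadata is included. It assumes each group of weights stores one float16 scale and one 4-bit zero point, and uses the group size of 128 from the `quant_config` defined later in this notebook. The function names are hypothetical and are not part of AutoAWQ.

```python
def effective_bits_per_weight(w_bit: int = 4, group_size: int = 128,
                              scale_bits: int = 16, zero_bits: int = 4) -> float:
    """Bits per weight including per-group scale and zero point (assumed layout)."""
    return w_bit + (scale_bits + zero_bits) / group_size


def compression_vs_fp16(w_bit: int = 4, group_size: int = 128) -> float:
    """Approximate size reduction of the quantized layers relative to float16."""
    return 16 / effective_bits_per_weight(w_bit, group_size)


print(effective_bits_per_weight())  # 4.15625
print(round(compression_vs_fp16(), 2))
```

This only covers the quantized linear layers. Any tensors that stay in float16 reduce the overall compression ratio of the saved model.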

In [None]:
!pip install -qU transformers accelerate

In [None]:
!nvidia-smi

Wed Jan 24 03:28:05 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   48C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

**The latest AutoAWQ release on PyPI is built for CUDA 12.1. (Colab now provides CUDA 12.2.)**

If you cannot use CUDA 12.1, you can still use CUDA 11.8 and install the wheel from the latest release.
```
pip install https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.6/autoawq-0.1.6+cu118-cp310-cp310-linux_x86_64.whl
```
See [AutoAWQ repo](https://github.com/casper-hansen/AutoAWQ)
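
The choice between the PyPI package and the CUDA 11.8 wheel can be sketched as a small helper that reads the `CUDA Version` field from the `nvidia-smi` header. The helper names and the version thresholds are assumptions made for illustration. Note that `nvidia-smi` reports the highest CUDA version the driver supports, not the installed toolkit, and that the wheel URL above targets Python 3.10 on Linux x86_64 only.

```python
import re
from typing import Optional, Tuple

# Wheel URL copied from the install instructions above.
CU118_WHEEL = (
    "https://github.com/casper-hansen/AutoAWQ/releases/download/"
    "v0.1.6/autoawq-0.1.6+cu118-cp310-cp310-linux_x86_64.whl"
)


def parse_cuda_version(nvidia_smi_output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from the 'CUDA Version: X.Y' header field."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def autoawq_pip_target(cuda_version: Tuple[int, int]) -> str:
    """Return what to pass to `pip install` (thresholds are an assumption)."""
    if cuda_version >= (12, 1):
        return "autoawq"      # PyPI build for CUDA 12.1
    if cuda_version >= (11, 8):
        return CU118_WHEEL    # prebuilt CUDA 11.8 wheel
    raise ValueError(f"No prebuilt option assumed for CUDA {cuda_version}")


header = "| NVIDIA-SMI 535.104.05   Driver Version: 535.104.05   CUDA Version: 12.2 |"
print(autoawq_pip_target(parse_cuda_version(header)))  # autoawq
```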

In [None]:
!pip install -qU autoawq

## AutoAWQ integration with Transformers

In [None]:
model_name = "facebook/opt-125m"
quant_path = "opt-125m-AWQ"

In [None]:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer


quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoAWQForCausalLM.from_pretrained(model_name)



In [None]:
# Quantize
model.quantize(tokenizer, quant_config=quant_config)


AWQ: 100%|██████████| 12/12 [02:05<00:00, 10.43s/it]
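
To make the `quant_config` fields more concrete, here is a minimal pure-Python sketch of group-wise asymmetric quantization: weights are split into groups of `q_group_size`, and each group gets its own scale and zero point (`zero_point=True`) for mapping values to `w_bit`-bit integers. This illustrates only the round-to-nearest step. It does not implement AWQ's activation-aware scale search, and the function names are hypothetical rather than AutoAWQ APIs.

```python
import math
from typing import List, Tuple


def quantize_group(weights: List[float], w_bit: int = 4) -> Tuple[List[int], float, int]:
    """Quantize one group to integers in [0, 2**w_bit - 1] with a zero point."""
    q_max = 2 ** w_bit - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / q_max
    if scale == 0.0:
        scale = 1.0  # constant group: any positive scale works
    zero = max(0, min(q_max, round(-w_min / scale)))
    q = [max(0, min(q_max, round(w / scale) + zero)) for w in weights]
    return q, scale, zero


def dequantize_group(q: List[int], scale: float, zero: int) -> List[float]:
    """Map quantized integers back to approximate float weights."""
    return [(v - zero) * scale for v in q]


def fake_quantize_row(row: List[float], group_size: int = 128, w_bit: int = 4) -> List[float]:
    """Quantize then dequantize a row, one group at a time."""
    out: List[float] = []
    for start in range(0, len(row), group_size):
        q, scale, zero = quantize_group(row[start:start + group_size], w_bit)
        out.extend(dequantize_group(q, scale, zero))
    return out


# Toy weights spanning negative and positive values, two groups of 128.
row = [0.1 * math.sin(i) for i in range(256)]
approx = fake_quantize_row(row, group_size=128, w_bit=4)
max_err = max(abs(a - b) for a, b in zip(row, approx))
print(f"max abs error: {max_err:.5f}")
```

Smaller groups track the local weight range more closely at the cost of storing more scales and zero points.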


In [None]:
from transformers import AwqConfig

# Modify the config so that it is compatible with the Transformers integration
quantization_config = AwqConfig(
    bits=quant_config["w_bit"],
    group_size=quant_config["q_group_size"],
    zero_point=quant_config["zero_point"],
    version=quant_config["version"].lower(),
).to_dict()

# The pretrained Transformers model is stored in the `model` attribute,
# and the quantization config must be passed as a dict
model.model.config.quantization_config = quantization_config
# An alternative would be to use AutoConfig and push to the hub (as done in llm-awq)

# save model weights
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)



('opt-125m-awq/tokenizer_config.json',
 'opt-125m-awq/special_tokens_map.json',
 'opt-125m-awq/vocab.json',
 'opt-125m-awq/merges.txt',
 'opt-125m-awq/added_tokens.json',
 'opt-125m-awq/tokenizer.json')
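
Before loading or uploading the quantized model, it can help to check what was written to disk. The sketch below lists the files in the output folder with their sizes using only the standard library. The helper names are hypothetical, and the folder name is assumed to match `quant_path` from earlier in the notebook.

```python
import os
from typing import List, Tuple


def list_files_with_sizes(folder: str) -> List[Tuple[str, int]]:
    """Return (relative path, size in bytes) for every file under `folder`."""
    entries = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            entries.append((os.path.relpath(path, folder), os.path.getsize(path)))
    return sorted(entries)


def total_size_mb(folder: str) -> float:
    """Total size of all files under `folder`, in megabytes (10**6 bytes)."""
    return sum(size for _, size in list_files_with_sizes(folder)) / 1e6


quant_path = "opt-125m-AWQ"  # assumed to match the value set earlier
if os.path.isdir(quant_path):
    for rel_path, size in list_files_with_sizes(quant_path):
        print(f"{size / 1e6:10.2f} MB  {rel_path}")
    print(f"{total_size_mb(quant_path):10.2f} MB  total")
```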

## Run inference

Test the quantized model by loading it with Transformers and generating a short completion.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(quant_path)
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="cuda")

text = "The reason the sky is blue is because"
inputs = tokenizer(text, return_tensors="pt").to("cuda")

out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))

The reason the sky is blue is because it's a blue sky.
I'm not sure if you're being serious or not, but I think you're right.


## Push to hub

To push your model to the hub, you'll need to add your Hugging Face token (https://huggingface.co/settings/tokens) as a secret named `HF_TOKEN` in Google Colab's "Secrets" tab.
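
Outside Colab, the `google.colab` module is not available. A minimal sketch of a token lookup that falls back to an `HF_TOKEN` environment variable could look like this. The helper `get_hf_token` is hypothetical, and the fallback behaviour is an assumption rather than something this notebook relies on.

```python
import os
from typing import Optional


def get_hf_token(name: str = "HF_TOKEN") -> Optional[str]:
    """Read the token from Colab secrets if possible, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        return os.environ.get(name)
    try:
        return userdata.get(name)
    except Exception:
        # Secret missing or notebook access not granted: fall back to env var.
        return os.environ.get(name)


token = get_hf_token()
print("token found" if token else "no token configured")
```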

In [None]:
# !pip install -qU huggingface_hub

In [None]:
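from huggingface_hub import HfApi
from google.colab import userdata

username = "kesamet"

# Authenticate with the token stored in Colab's "Secrets" tab
api = HfApi(token=userdata.get("HF_TOKEN"))

# Create the repo through the authenticated client so the same token is used
api.create_repo(
    repo_id=f"{username}/{quant_path}",
    repo_type="model",
    exist_ok=True,
)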

RepoUrl('https://huggingface.co/kesamet/opt-125m-awq-AWQ', endpoint='https://huggingface.co', repo_type='model', repo_id='kesamet/opt-125m-awq-AWQ')

In [None]:
api.upload_folder(
    folder_path=quant_path,
    repo_id=f"{username}/{quant_path}",
)

CommitInfo(commit_url='https://huggingface.co/kesamet/opt-125m-awq-AWQ/commit/b1360c6e00c8f76943386956d28825e7f7fe0c78', commit_message='Upload folder using huggingface_hub', commit_description='', oid='b1360c6e00c8f76943386956d28825e7f7fe0c78', pr_url=None, pr_revision=None, pr_num=None)