# Transformers Model Quantization Techniques: AWQ

![img](https://huggingface.co/datasets/ybelkada/documentation-images/resolve/main/Thumbnail.png)

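In June 2023, Ji Lin et al. published the paper [AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration](https://arxiv.org/pdf/2306.00978.pdf).

The paper describes an activation-aware weight quantization algorithm that can compress any Transformer-based language model with only a minor drop in performance. For a detailed introduction to the AWQ algorithm, see the [project page from Professor Song Han's lab at MIT](https://hanlab.mit.edu/projects/awq).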
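The "activation-aware" idea can be illustrated with a small NumPy sketch. It is a simplified toy, not the AutoAWQ or LLM-AWQ implementation: the helper names (`fake_quantize`, `output_error`, `search_scale`), the per-row quantizer, the random data, and the grid size are all assumptions made for illustration. Input channels that see large activations are scaled up before rounding and scaled back afterwards, and the exponent `alpha` is chosen by a grid search that minimizes the error of the layer output.

```python
import numpy as np


def fake_quantize(w, n_bits=4):
    """Round-to-nearest asymmetric quantization per output row, then dequantize."""
    q_max = 2 ** n_bits - 1
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = np.maximum(w_max - w_min, 1e-8) / q_max
    zero = np.round(-w_min / scale)
    q = np.clip(np.round(w / scale) + zero, 0, q_max)
    return (q - zero) * scale


def output_error(x, w, w_hat):
    """Mean squared error between the original and the quantized layer output."""
    return float(np.mean((x @ w.T - x @ w_hat.T) ** 2))


def search_scale(x, w, n_bits=4, n_grid=20):
    """Grid-search alpha in [0, 1); alpha = 0 is plain round-to-nearest."""
    act_mag = np.abs(x).mean(axis=0)  # per-input-channel activation magnitude
    best_alpha, best_err = 0.0, np.inf
    for i in range(n_grid):
        alpha = i / n_grid
        s = act_mag ** alpha
        s = s / np.sqrt(s.max() * s.min())  # keep the scales centred around 1
        w_hat = fake_quantize(w * s, n_bits) / s  # scale up, quantize, undo the scale
        err = output_error(x, w, w_hat)
        if err < best_err:
            best_alpha, best_err = alpha, err
    return best_alpha, best_err


rng = np.random.default_rng(0)
x = rng.normal(size=(256, 64))
x[:, :4] *= 20.0  # a few input channels carry much larger activations
w = rng.normal(size=(32, 64))

rtn_err = output_error(x, w, fake_quantize(w))
alpha, awq_err = search_scale(x, w)
print(f"round-to-nearest error: {rtn_err:.4f}")
print(f"best alpha = {alpha:.2f}, scaled error: {awq_err:.4f}")
```

Because `alpha = 0` is part of the grid, the searched result can never be worse than plain round-to-nearest on this calibration data.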

transformers now supports two open-source AWQ implementations:

- [AutoAWQ](https://github.com/casper-hansen/AutoAWQ)
- [LLM-AWQ](https://github.com/mit-han-lab/llm-awq)

Because LLM-AWQ does not support the Nvidia T4 GPU (the GPU used for the course demos), we use the AutoAWQ library to introduce and demonstrate AWQ model quantization.

## Testing text generation with the model before quantization

In [1]:
from transformers import pipeline

model_path = "facebook/opt-6.7b"

# Load the original (unquantized) model from model_path on GPU 0
generator = pipeline('text-generation',
                     model=model_path,
                     device=0,
                     do_sample=True,
                     num_return_sequences=3)


#### Measured GPU memory usage: after loading the OPT-125m model

```shell
Sun Dec 24 15:11:33 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:0D.0 Off |                    0 |
| N/A   47C    P0              26W /  70W |    635MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
```
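To compare readings like the one above programmatically, the memory column can be pulled out of the `nvidia-smi` text with a regular expression. This is a minimal sketch: `parse_memory_usage` is a hypothetical helper, and the sample line is copied from the dump above. In a live session the text could come from running `nvidia-smi` through `subprocess`.

```python
import re


def parse_memory_usage(smi_text):
    """Return (used_MiB, total_MiB) for every GPU row found in nvidia-smi output."""
    pattern = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB")
    return [(int(used), int(total)) for used, total in pattern.findall(smi_text)]


sample = "| N/A   47C    P0              26W /  70W |    635MiB / 15360MiB |      0%      Default |"
usage = parse_memory_usage(sample)
print(usage)  # [(635, 15360)]
```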

In [2]:
generator("The woman worked as a")

[{'generated_text': 'The woman worked as a housewoman in the office. After that, she moved to the shop as'},
 {'generated_text': 'The woman worked as a salesperson at the store and was said to display the utmost professionalism in her'},
 {'generated_text': 'The woman worked as a nurse there in 2018. He was fired from his job at the hospital on'}]

In [3]:
generator("The man worked as a")

[{'generated_text': 'The man worked as a bus driver.  A job that requires you to interact with a large number'},
 {'generated_text': 'The man worked as a contractor at a property for years, and was recently fired when he reported it'},
 {'generated_text': 'The man worked as a security guard last night. He must be in trouble.\nThere’'}]

## Quantizing the model with AutoAWQ

Below we take the `facebook/opt-6.7b` model as an example and quantize it with the AWQ algorithm as implemented in the `AutoAWQ` library.

In [1]:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "facebook/opt-6.7b"
quant_path = "models/opt-6.7b-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)


In [2]:
# Quantize the model
model.quantize(tokenizer, quant_config=quant_config)


#### Measured GPU memory usage: peak of nearly 4 GB while quantizing the model

```shell
Sun Dec 24 15:12:50 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:0D.0 Off |                    0 |
| N/A   48C    P0              32W /  70W |   3703MiB / 15360MiB |      2%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
```

In [3]:
quant_config

{'zero_point': True, 'q_group_size': 128, 'w_bit': 4, 'version': 'GEMM'}
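To make the three numeric settings concrete, here is a hedged NumPy sketch of what `w_bit=4`, `q_group_size=128`, and `zero_point=True` mean for a weight vector. It is not the AutoAWQ kernel: `quantize_groups` and `dequantize_groups` are hypothetical helpers, and the storage estimate assumes one fp16 scale and one 4-bit zero point per group.

```python
import numpy as np

quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}


def quantize_groups(w, w_bit, group_size):
    """Asymmetric (zero-point) quantization of a 1-D weight vector, group by group."""
    q_max = 2 ** w_bit - 1  # 15 for 4-bit
    groups = w.reshape(-1, group_size)
    g_min = groups.min(axis=1, keepdims=True)
    g_max = groups.max(axis=1, keepdims=True)
    scale = np.maximum(g_max - g_min, 1e-8) / q_max  # one scale per group
    zero = np.clip(np.round(-g_min / scale), 0, q_max)  # one zero point per group
    q = np.clip(np.round(groups / scale) + zero, 0, q_max).astype(np.uint8)
    return q, scale, zero


def dequantize_groups(q, scale, zero):
    """Map the stored integers back to approximate floating-point weights."""
    return ((q.astype(np.float32) - zero) * scale).reshape(-1)


rng = np.random.default_rng(0)
w = rng.normal(size=4 * quant_config["q_group_size"])  # 4 groups of 128 weights

q, scale, zero = quantize_groups(w, quant_config["w_bit"], quant_config["q_group_size"])
w_hat = dequantize_groups(q, scale, zero)

print("integer range:", q.min(), "to", q.max())
print("max abs error:", np.abs(w - w_hat).max())

# Storage estimate (assumption: one fp16 scale and one w_bit zero point per group)
bits_per_weight = quant_config["w_bit"] + (16 + quant_config["w_bit"]) / quant_config["q_group_size"]
print("effective bits per weight:", bits_per_weight)  # 4.15625
```

Smaller groups track the local weight range more closely but add more scale and zero-point overhead per weight.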

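#### Transformers compatibility configuration

To make `quant_config` compatible with transformers, we need to modify the configuration: instantiate the quantization configuration with `transformers.AwqConfig`.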

In [4]:
from transformers import AwqConfig

# Build a transformers-compatible quantization config from the AutoAWQ settings
quantization_config = AwqConfig(
    bits=quant_config["w_bit"],
    group_size=quant_config["q_group_size"],
    zero_point=quant_config["zero_point"],
    version=quant_config["version"].lower(),
).to_dict()

# The pretrained transformers model is stored in the `model` attribute; the config must be passed as a dict
model.model.config.quantization_config = quantization_config

In [5]:
# Save the quantized model weights
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)  # Save the tokenizer



('models/opt-6.7b-awq\\tokenizer_config.json',
 'models/opt-6.7b-awq\\special_tokens_map.json',
 'models/opt-6.7b-awq\\vocab.json',
 'models/opt-6.7b-awq\\merges.txt',
 'models/opt-6.7b-awq\\added_tokens.json',
 'models/opt-6.7b-awq\\tokenizer.json')
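Before loading the saved directory with transformers, it can help to confirm that the quantization settings were actually written out. The sketch below assumes that `save_quantized` wrote a `config.json` carrying a `quantization_config` field. `read_quantization_config` is a hypothetical helper, and it returns `None` if the file or the field is missing.

```python
import json
from pathlib import Path


def read_quantization_config(model_dir):
    """Return the `quantization_config` dict from config.json, or None if absent."""
    config_file = Path(model_dir) / "config.json"
    if not config_file.is_file():
        return None
    with config_file.open(encoding="utf-8") as f:
        return json.load(f).get("quantization_config")


# Path taken from the quant_path used in this notebook
print(read_quantization_config("models/opt-6.7b-awq"))
```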

### Loading the quantized model on the GPU

In [6]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(quant_path)
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="cuda")

In [7]:
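def generate_text(text):
    # Tokenize the prompt and move the input tensors to GPU 0
    inputs = tokenizer(text, return_tensors="pt").to(0)

    # Generate up to 64 new tokens, then decode the first returned sequence
    out = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(out[0], skip_special_tokens=True)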


In [9]:
result = generate_text("Don't worry.I will help you to")
print(result)

Don't worry.I will help you to get rid of this problem.

I have a very good experience in this field.I have done many projects in this field.I am sure that I will do this project perfectly.

I am a professional web developer. I have more than 5 years of experience in web development. I have worked on many


In [10]:
result = generate_text("The girl worked as a")
print(result)

The girl worked as a waitress at a restaurant in the city of Krasnoyarsk, in the Urals region of Russia.

The man, who was not named, was a regular customer at the restaurant.

He was a regular customer at the restaurant.

He was a regular customer at the restaurant.



## Homework: quantize the Facebook OPT-6.7B model with AWQ

Facebook OPT models: https://huggingface.co/facebook?search_models=opt