## [Documentation link](https://huggingface.co/docs/transformers/main/en/main_classes/quantization#quantization)

## GPTQ

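### GPTQ: a quantization technique designed for pretrained Transformer models (ICLR 2023)
GPTQ (*GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers*) is an efficient and accurate post-training quantization technique aimed at large GPT-style models. It substantially reduces model size and compute requirements while keeping accuracy close to the original model and speeding up inference.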
***
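Key technical characteristics of the GPTQ algorithm:

1. Designed for GPT models: GPTQ targets very large GPT-style models (for example, models with 175 billion parameters) and addresses the high compute and storage costs that come with their size.
2. One-shot weight quantization: GPTQ is a weight quantization method based on approximate second-order information and quantizes the model in a single pass.
3. High efficiency: GPTQ can quantize a 175-billion-parameter GPT model in roughly four GPU hours.
4. Low bit width: reducing the weights to 3 or 4 bits each significantly shrinks the model.
5. Accuracy retention: even after this reduction in bit width, accuracy stays close to that of the uncompressed model.
6. Extreme quantization: GPTQ can also go down to 2-bit or even ternary quantization while keeping reasonable accuracy.
7. Faster inference: GPTQ-quantized models achieve speedups of about 3.25x on high-end GPUs (such as the NVIDIA A100) and about 4.5x on more cost-effective GPUs (such as the NVIDIA A6000).
8. Single-GPU inference: GPTQ makes it possible to run generative inference for such large models on a single GPU, which significantly lowers the hardware requirements for deployment.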
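
To make items 4 and 8 concrete, here is a back-of-the-envelope estimate of weight storage at different bit widths. It is only a sketch: it counts the weights themselves and ignores quantization metadata (scales, zero points), activations, and the KV cache. The function name `weight_storage_gb` is made up for this illustration.

```python
def weight_storage_gb(n_params: float, bits_per_weight: float) -> float:
    """Storage needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 175e9  # the 175-billion-parameter scale mentioned above
for bits in (16, 4, 3):
    print(f"{bits:>2}-bit weights: {weight_storage_gb(n_params, bits):.1f} GB")
```

Under these assumptions FP16 weights need 350 GB, while 3-bit weights need about 65.6 GB, which is below the 80 GB of a single A100-80GB.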

![image](../static/微信图片_20240304084405.png)

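### Core procedure of the GPTQ quantization algorithm
Core step: blocks of consecutive columns (shown in bold in the figure) are quantized using the inverse Hessian information stored in its Cholesky decomposition, and the remaining weights (shown in blue) are updated at the end of each step. Inside each block the quantization procedure is applied recursively (the white column in the middle is the one currently being quantized).

The key steps are:

1. Block quantization: select a block of consecutive columns (bold in the figure) as the quantization target of the current step.
2. Use of the Cholesky decomposition: quantize the selected block using the inverse Hessian information obtained from a Cholesky decomposition. The Cholesky form is a numerically stable way to work with the inverse matrix, which matters for keeping the quantization accurate.
3. Weight update: at the end of each step, update the remaining, not yet quantized weights (blue in the figure) so that they compensate for the quantization error introduced so far.
4. Recursive quantization: within each selected block the same procedure is applied recursively. Columns are quantized one at a time, and the error of each column is propagated to the columns of the block that have not been quantized yet.

This lets GPTQ process a very large number of weights efficiently while keeping accuracy high, which is what makes it practical for large models such as the GPT family, where quantization has to be done carefully to avoid accuracy loss.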
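
The four steps can be sketched in NumPy as follows. This is a simplified illustration, not the reference implementation: it assumes one uniform asymmetric grid per weight row, a small dense layer, and a tiny block size, and the helper names (`make_grid`, `round_to_grid`, `gptq_quantize`) are made up for this example.

```python
import numpy as np


def make_grid(W, bits):
    """Per-row asymmetric uniform grid: returns (scale, zero, maxq)."""
    maxq = 2 ** bits - 1
    wmin = np.minimum(W.min(axis=1), 0.0)
    wmax = np.maximum(W.max(axis=1), 0.0)
    scale = (wmax - wmin) / maxq
    scale[scale == 0] = 1.0  # guard against all-zero rows
    zero = np.round(-wmin / scale)
    return scale, zero, maxq


def round_to_grid(w, scale, zero, maxq):
    """Round-to-nearest onto the grid; the result is returned dequantized."""
    q = np.clip(np.round(w / scale) + zero, 0, maxq)
    return scale * (q - zero)


def gptq_quantize(W, X, bits=4, block_size=4, percdamp=0.01):
    """W: (rows, cols) weights, X: (cols, n_samples) calibration inputs."""
    W = np.array(W, dtype=np.float64)  # work on a copy
    rows, cols = W.shape
    scale, zero, maxq = make_grid(W, bits)

    # Hessian of the layer-wise squared error ||WX - QX||^2, plus damping.
    H = 2.0 * X @ X.T
    H += percdamp * np.mean(np.diag(H)) * np.eye(cols)
    # Upper-triangular Cholesky factor U of the inverse Hessian: H^-1 = U^T U.
    U = np.linalg.cholesky(np.linalg.inv(H)).T

    Q = np.zeros_like(W)
    for start in range(0, cols, block_size):          # step 1: pick a block
        end = min(start + block_size, cols)
        E = np.zeros((rows, end - start))
        for j in range(start, end):                   # step 4: column by column
            Q[:, j] = round_to_grid(W[:, j], scale, zero, maxq)
            E[:, j - start] = (W[:, j] - Q[:, j]) / U[j, j]   # step 2
            # Propagate this column's error to the rest of the block.
            W[:, j:end] -= np.outer(E[:, j - start], U[j, j:end])
        # Step 3: update all columns after the block in one batched operation.
        W[:, end:] -= E @ U[start:end, end:]
    return Q


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
X = rng.normal(size=(16, 16)) @ rng.normal(size=(16, 256))  # correlated inputs
Q = gptq_quantize(W, X)
scale, zero, maxq = make_grid(W, 4)
Q_rtn = round_to_grid(W, scale[:, None], zero[:, None], maxq)
print("GPTQ output error:", np.linalg.norm(W @ X - Q @ X))
print("RTN  output error:", np.linalg.norm(W @ X - Q_rtn @ X))
```

The two printed numbers compare the layer output error of this sketch with plain round-to-nearest (RTN) on the same grid. When the inputs are uncorrelated the Hessian is diagonal, nothing is propagated, and the procedure reduces to RTN.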

## AWQ

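### Activation-aware Weight Quantization (AWQ)
Activation-aware Weight Quantization (AWQ) starts from the observation that not all weights matter equally: instead of treating every weight the same way during quantization, it protects the small fraction (about 1%) of weights that are critical to LLM performance. Its main characteristics are:

1. Low-bit weight quantization: AWQ is designed for large language models (LLMs) and supports low-bit weight quantization, which effectively reduces model size.
2. Protecting salient weights: because weight importance is uneven, protecting only about 1% of salient weights is enough to greatly reduce quantization error.
3. Looking at activations rather than weights: salient weights are identified from the activation distribution, not from the weight distribution.
4. No backpropagation or reconstruction: AWQ does not rely on backpropagation or reconstruction, so it preserves the model's generalization ability better and avoids overfitting to a specific dataset.
5. Broad applicability: AWQ performs well on a range of language modeling tasks and domain-specific benchmarks, including instruction-tuned language models and multimodal language models.
6. Efficient inference framework: AWQ comes with an efficient inference framework tailored to LLMs that delivers significant speedups on both desktop and mobile GPUs.
7. Edge deployment: the approach makes it possible to deploy large models such as the 70B Llama-2 model on edge devices with limited memory and compute (for example, the NVIDIA Jetson Orin 64GB).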
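
In the AWQ paper the salient weights are protected by per-channel scaling rather than by keeping them in higher precision: input channels with large activations are scaled up before quantization, and the inverse scale is folded back afterwards. The sketch below illustrates that idea with a small grid search. It assumes symmetric per-row 4-bit rounding instead of the group-wise scheme used in practice, and the names `fake_quant` and `search_awq_scales` are made up for this example.

```python
import numpy as np


def fake_quant(W, bits=4):
    """Symmetric round-to-nearest per output row, returned dequantized."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0
    return np.clip(np.round(W / scale), -qmax - 1, qmax) * scale


def search_awq_scales(W, X, bits=4, n_grid=20):
    """W: (out, in) weights, X: (n_samples, in) calibration activations."""
    act_mag = np.maximum(np.abs(X).mean(axis=0), 1e-4)  # per input channel
    ref = X @ W.T
    best_err, best_s = np.inf, None
    for alpha in np.arange(n_grid) / n_grid:  # alpha = 0 means "no scaling"
        s = act_mag ** alpha
        s = s / np.sqrt(s.max() * s.min())
        # Scale weights up before quantization and fold 1/s back afterwards
        # (equivalently, the activations would be divided by s at run time).
        W_q = fake_quant(W * s, bits) / s
        err = np.mean((X @ W_q.T - ref) ** 2)
        if err < best_err:
            best_err, best_s = err, s
    return best_s, best_err


rng = np.random.default_rng(0)
W = rng.normal(size=(32, 64))
X = rng.normal(size=(128, 64))
X[:, :2] *= 30.0  # two input channels with much larger activations
s, err = search_awq_scales(W, X)
rtn_err = np.mean((X @ fake_quant(W).T - X @ W.T) ** 2)
print(f"output MSE without scaling: {rtn_err:.4f}")
print(f"output MSE with searched scales: {err:.4f}")
```

Because `alpha = 0` (no scaling) is part of the search grid, the searched result can never be worse than plain rounding on the calibration data. No gradients are needed, only forward passes over the calibration activations.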

![image](../static/微信图片_20240304092755.png)

## Comparison

![image](../static/微信图片_20240304093236.png)

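## BNB


### Introduction to BitsAndBytes
BitsAndBytes (BNB) is a lightweight wrapper around custom CUDA functions, in particular 8-bit optimizers, matrix multiplication, and quantization functions. Its main features are:

- 8-bit matrix multiplication with mixed-precision decomposition
- LLM.int8() inference
- 8-bit optimizers: Adam, AdamW, RMSProp, LARS, LAMB, Lion (saving 75% of memory)
- Stable embedding layer: improved stability through better initialization and normalization
- 8-bit quantization: quantile, linear, and dynamic quantization
- Fast quantile estimation: up to 100x faster than other algorithms

Among the quantization schemes available in Transformers, BNB is the simplest way to quantize a model to 8-bit or 4-bit.

- 8-bit quantization handles outlier values in fp16 and non-outlier values in int8, converts the int8 results back to fp16, and adds the two parts together to return an fp16 result. This reduces the degradation that outliers would otherwise cause in model performance.
- 4-bit quantization compresses the model further and is commonly used together with QLoRA to fine-tune quantized LLMs (large language models).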
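
The mixed-precision decomposition behind 8-bit matrix multiplication can be sketched in NumPy. This is an illustration of the idea only, not the bitsandbytes CUDA kernels: it assumes vector-wise absmax scaling and an outlier threshold of 6.0, and the function names are made up for this example.

```python
import numpy as np


def absmax_int8(A, axis):
    """Vector-wise absmax quantization to int8 along the given axis."""
    scale = np.abs(A).max(axis=axis, keepdims=True) / 127.0
    scale[scale == 0] = 1.0
    q = np.clip(np.round(A / scale), -127, 127).astype(np.int8)
    return q, scale


def mixed_int8_matmul(X, W, threshold=6.0):
    """X: (n, k) activations, W: (k, m) weights; approximates X @ W."""
    outlier = np.abs(X).max(axis=0) >= threshold  # feature dims with outliers
    regular = ~outlier
    # Outlier dimensions stay in floating point.
    out = X[:, outlier] @ W[outlier, :]
    if regular.any():
        Xq, sx = absmax_int8(X[:, regular], axis=1)  # one scale per row of X
        Wq, sw = absmax_int8(W[regular, :], axis=0)  # one scale per column of W
        acc = Xq.astype(np.int32) @ Wq.astype(np.int32)  # integer accumulation
        out = out + acc * sx * sw  # dequantize and add the two parts
    return out


rng = np.random.default_rng(0)
X = rng.normal(size=(32, 64))
X[:, 5] *= 20.0  # one feature dimension with outliers
W = rng.normal(size=(64, 16))
ref = X @ W
rel_err = np.linalg.norm(mixed_int8_matmul(X, W) - ref) / np.linalg.norm(ref)
print(f"relative error of the mixed int8 product: {rel_err:.4f}")
```

Keeping the outlier dimension out of the int8 path matters because a single large value would otherwise stretch the absmax scale of every row it appears in and wash out the precision of all other entries.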

## Half

In [None]:
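from transformers import AutoModelForCausalLM
import torch

# Load the weights in half precision (fp16); device_map='auto' already places the model on the available device,
# so an extra .to(device) call is not needed. The raw string keeps the Windows path backslashes intact.
model = AutoModelForCausalLM.from_pretrained(r'E:\model\language\opt-125m', torch_dtype=torch.float16, trust_remote_code=True, device_map='auto')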

# Summary: AWQ comes out on top

# Code walkthrough

***

### AWQ

![image](../static/微信图片_20240305102346.png)

In [1]:
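# Configuration
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, AwqConfig
from awq import AutoAWQForCausalLM
import torch
# Raw strings keep the backslashes in Windows paths from being read as escape sequences
model_path = r'E:\model\language\opt-125m'
quant_path = r'E:\model\language\quant\opt-125m-awq'
# 4-bit weights, groups of 128 weights sharing one scale and zero point, GEMM kernels
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")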
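
What the `w_bit`, `q_group_size`, and `zero_point` settings mean can be sketched with plain NumPy. This is a minimal illustration of group-wise asymmetric quantization under those settings, not AutoAWQ's actual kernels, and the function names are made up for this example.

```python
import numpy as np


def quantize_groupwise(w, w_bit=4, q_group_size=128):
    """Asymmetric (zero-point) quantization of a 1-D weight vector, group by group."""
    groups = np.asarray(w, dtype=np.float64).reshape(-1, q_group_size)
    max_int = 2 ** w_bit - 1
    w_max = groups.max(axis=1, keepdims=True)
    w_min = groups.min(axis=1, keepdims=True)
    scale = np.maximum(w_max - w_min, 1e-5) / max_int  # one scale per group
    zero = np.clip(-np.round(w_min / scale), 0, max_int)  # one zero point per group
    q = np.clip(np.round(groups / scale) + zero, 0, max_int).astype(np.uint8)
    return q, scale, zero


def dequantize_groupwise(q, scale, zero):
    """Map the integer codes back to floating point."""
    return ((q.astype(np.float64) - zero) * scale).reshape(-1)


rng = np.random.default_rng(0)
w = rng.normal(size=768)  # e.g. one row of a 768-wide projection
q, scale, zero = quantize_groupwise(w)
w_hat = dequantize_groupwise(q, scale, zero)
print("codes per group:", q.shape, "max abs error:", np.abs(w - w_hat).max())
```

Each group of 128 weights stores 4-bit codes plus its own scale and zero point, so a smaller group size gives finer scales at the cost of more metadata.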



In [2]:
device

device(type='cuda')

In [3]:
# Load the model
model = AutoAWQForCausalLM.from_pretrained(model_path,trust_remote_code=True, device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

In [5]:
# Quantize the model
model.quantize(tokenizer, quant_config=quant_config)

Repo card metadata block was not found. Setting CardData to empty.
  table = cls._concat_blocks(blocks, axis=0)
AWQ: 100%|██████████| 12/12 [01:57<00:00,  9.78s/it]


In [8]:
# Save the model configuration
from transformers import AwqConfig, AutoConfig

# Adapt the config so that it is compatible with the transformers integration
quantization_config = AwqConfig(
    bits=quant_config["w_bit"],
    group_size=quant_config["q_group_size"],
    zero_point=quant_config["zero_point"],
    version=quant_config["version"].lower(),
).to_dict()

# The pretrained transformers model is stored in the `model` attribute; the config must be passed as a dict
model.model.config.quantization_config = quantization_config

In [7]:
quantization_config

{'quant_method': <QuantizationMethod.AWQ: 'awq'>,
 'bits': 4,
 'group_size': 128,
 'zero_point': True,
 'version': <AWQLinearVersion.GEMM: 'gemm'>,
 'backend': <AwqBackendPackingMethod.AUTOAWQ: 'autoawq'>,
 'fuse_max_seq_len': None,
 'modules_to_not_convert': None,
 'modules_to_fuse': None,
 'do_fuse': False}

In [9]:
model.model.config.quantization_config

{'quant_method': <QuantizationMethod.AWQ: 'awq'>,
 'bits': 4,
 'group_size': 128,
 'zero_point': True,
 'version': <AWQLinearVersion.GEMM: 'gemm'>,
 'backend': <AwqBackendPackingMethod.AUTOAWQ: 'autoawq'>,
 'fuse_max_seq_len': None,
 'modules_to_not_convert': None,
 'modules_to_fuse': None,
 'do_fuse': False}

In [10]:
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

('E:\\model\\language\\quant\\opt-125m-awq\\tokenizer_config.json',
 'E:\\model\\language\\quant\\opt-125m-awq\\special_tokens_map.json',
 'E:\\model\\language\\quant\\opt-125m-awq\\vocab.json',
 'E:\\model\\language\\quant\\opt-125m-awq\\merges.txt',
 'E:\\model\\language\\quant\\opt-125m-awq\\added_tokens.json',
 'E:\\model\\language\\quant\\opt-125m-awq\\tokenizer.json')

In [7]:
model.eval

<bound method Module.eval of OptAWQForCausalLM(
  (model): OPTForCausalLM(
    (model): OPTModel(
      (decoder): OPTDecoder(
        (embed_tokens): Embedding(50272, 768, padding_idx=1)
        (embed_positions): OPTLearnedPositionalEmbedding(2050, 768)
        (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (layers): ModuleList(
          (0-11): 12 x OPTDecoderLayer(
            (self_attn): OPTAttention(
              (k_proj): WQLinear_GEMM(in_features=768, out_features=768, bias=True, w_bit=4, group_size=128)
              (v_proj): WQLinear_GEMM(in_features=768, out_features=768, bias=True, w_bit=4, group_size=128)
              (q_proj): WQLinear_GEMM(in_features=768, out_features=768, bias=True, w_bit=4, group_size=128)
              (out_proj): WQLinear_GEMM(in_features=768, out_features=768, bias=True, w_bit=4, group_size=128)
            )
            (activation_fn): ReLU()
            (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, 

In [9]:
n=0
for name, param in model.named_parameters():
    n=n+1
    print(f"Parameter {name} data type: {param.dtype}")
print(n)

Parameter model.model.decoder.embed_tokens.weight data type: torch.float16
Parameter model.model.decoder.embed_positions.weight data type: torch.float16
Parameter model.model.decoder.final_layer_norm.weight data type: torch.float16
Parameter model.model.decoder.final_layer_norm.bias data type: torch.float16
Parameter model.model.decoder.layers.0.self_attn_layer_norm.weight data type: torch.float16
Parameter model.model.decoder.layers.0.self_attn_layer_norm.bias data type: torch.float16
Parameter model.model.decoder.layers.0.final_layer_norm.weight data type: torch.float16
Parameter model.model.decoder.layers.0.final_layer_norm.bias data type: torch.float16
Parameter model.model.decoder.layers.1.self_attn_layer_norm.weight data type: torch.float16
Parameter model.model.decoder.layers.1.self_attn_layer_norm.bias data type: torch.float16
Parameter model.model.decoder.layers.1.final_layer_norm.weight data type: torch.float16
Parameter model.model.decoder.layers.1.final_layer_norm.bias data

## GPTQ

![image](../static/微信图片_20240305102351.png)

In [22]:
from transformers import AutoModelForCausalLM,AutoTokenizer,GPTQConfig
import torch
# Raw strings keep the backslashes in Windows paths from being read as escape sequences
model_path = r'E:\model\language\opt-125m'
quant_path = r'E:\model\language\quant\opt-125m-gptq'
quant_config=GPTQConfig(
    bits=4,
    group_size=128,
    dataset=["auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."],
    desc_act=False,
)



In [23]:
tokenizer=AutoTokenizer.from_pretrained(model_path,trust_remote_code=True)
model=AutoModelForCausalLM.from_pretrained(model_path,quantization_config=quant_config, torch_dtype=torch.float16, device_map="auto",trust_remote_code=True)

Quantizing model.decoder.layers blocks :   0%|          | 0/12 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/6 [00:00<?, ?it/s]


In [24]:
model.model.decoder.layers[0].self_attn.q_proj.__dict__

{'training': True,
 '_parameters': OrderedDict(),
 '_buffers': OrderedDict([('qweight',
               tensor([[ 1712808666, -1248295259, -2025411892,  ..., -1486452502,
                         2019142072, -1735820810],
                       [-2000132747,  -578262345,  1484081337,  ..., -1230600537,
                        -2019252056, -2023311003],
                       [ -710293850, -1153090188,  1431922298,  ..., -1768449094,
                        -1984337253,  2022406582],
                       ...,
                       [-1451935592, -1494580055, -1772844344,  ..., -1517635426,
                         -664417400,  -409622870],
                       [-2007473565,  1218733898,  1737251004,  ...,  1741199510,
                        -1732560249, -1754850968],
                       [ 1999202918, -1986294939,  1737140825,  ..., -1461086871,
                        -1450465416, -1756087955]], device='cuda:0', dtype=torch.int32)),
              ('qzeros',
               tensor(
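
The `qweight` buffer shown above has dtype `torch.int32` because several low-bit codes are packed into each 32-bit word, which is why the printed values look like arbitrary large integers. Here is a minimal pure-Python sketch of such packing, assuming eight 4-bit codes per word with the first code in the lowest bits. The exact layout used by the GPTQ kernels may differ, and the function names are made up for this example.

```python
def pack_int4(codes):
    """Pack 4-bit codes (0..15) into signed 32-bit words, eight codes per word."""
    if len(codes) % 8 != 0:
        raise ValueError("number of codes must be a multiple of 8")
    words = []
    for i in range(0, len(codes), 8):
        word = 0
        for k, code in enumerate(codes[i:i + 8]):
            if not 0 <= code < 16:
                raise ValueError("codes must fit in 4 bits")
            word |= code << (4 * k)  # code k occupies bits 4k .. 4k+3
        if word >= 2 ** 31:  # reinterpret the bit pattern as a signed int32
            word -= 2 ** 32
        words.append(word)
    return words


def unpack_int4(words):
    """Inverse of pack_int4."""
    codes = []
    for word in words:
        word &= 0xFFFFFFFF  # back to the unsigned bit pattern
        codes.extend((word >> (4 * k)) & 0xF for k in range(8))
    return codes


codes = list(range(16))
packed = pack_int4(codes)
print(packed)
print(unpack_int4(packed) == codes)
```

Negative values appear whenever the highest code in a word is 8 or larger, since that sets the sign bit of the int32.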

In [25]:
n=0
for name, param in model.named_parameters():
    n=n+1
    print(f"Parameter {name} data type: {param.dtype}")
print(n)

Parameter model.decoder.embed_tokens.weight data type: torch.float16
Parameter model.decoder.embed_positions.weight data type: torch.float16
Parameter model.decoder.final_layer_norm.weight data type: torch.float16
Parameter model.decoder.final_layer_norm.bias data type: torch.float16
Parameter model.decoder.layers.0.self_attn_layer_norm.weight data type: torch.float16
Parameter model.decoder.layers.0.self_attn_layer_norm.bias data type: torch.float16
Parameter model.decoder.layers.0.final_layer_norm.weight data type: torch.float16
Parameter model.decoder.layers.0.final_layer_norm.bias data type: torch.float16
Parameter model.decoder.layers.1.self_attn_layer_norm.weight data type: torch.float16
Parameter model.decoder.layers.1.self_attn_layer_norm.bias data type: torch.float16
Parameter model.decoder.layers.1.final_layer_norm.weight data type: torch.float16
Parameter model.decoder.layers.1.final_layer_norm.bias data type: torch.float16
Parameter model.decoder.layers.2.self_attn_layer_no

In [27]:
text = "Merry Christmas! I'm glad to"
inputs = tokenizer(text, return_tensors="pt").to(0)

out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))

Merry Christmas! I'm glad to be a good to be a good to be a good to be a good to be a good to be a good to be a good to a good to a good to a good to a good to a good to a good to a good to a good to a good to a good to a good to a good to


In [28]:
model.save_pretrained(quant_path)
tokenizer.save_pretrained(quant_path)

('E:\\model\\language\\quant\\opt-125m-gptq\\tokenizer_config.json',
 'E:\\model\\language\\quant\\opt-125m-gptq\\special_tokens_map.json',
 'E:\\model\\language\\quant\\opt-125m-gptq\\vocab.json',
 'E:\\model\\language\\quant\\opt-125m-gptq\\merges.txt',
 'E:\\model\\language\\quant\\opt-125m-gptq\\added_tokens.json',
 'E:\\model\\language\\quant\\opt-125m-gptq\\tokenizer.json')

## BNB

![image](../static/微信图片_20240305102356.png)

In [None]:
import torch
from transformers import AutoModel, BitsAndBytesConfig

_compute_dtype_map = {
    'fp32': torch.float32,
    'fp16': torch.float16,
    'bf16': torch.bfloat16
}

# QLoRA quantization configuration: 4-bit NF4 weights, double quantization, bf16 compute
q_config = BitsAndBytesConfig(load_in_4bit=True,
                              bnb_4bit_quant_type='nf4',
                              bnb_4bit_use_double_quant=True,
                              bnb_4bit_compute_dtype=_compute_dtype_map['bf16'])
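
The effect of `bnb_4bit_use_double_quant=True` on storage can be estimated with a little arithmetic. The sketch below assumes the block sizes described in the QLoRA paper: 64 weights per block with one 32-bit constant each, and, with double quantization, 8-bit constants plus one 32-bit constant per 256 blocks. The function name `nf4_bits_per_param` is made up for this example.

```python
def nf4_bits_per_param(double_quant: bool, block: int = 64, dq_block: int = 256) -> float:
    """Average storage per weight in bits, including quantization constants."""
    if not double_quant:
        return 4 + 32 / block  # one fp32 constant per block of weights
    # 8-bit constants per block, plus one fp32 constant per dq_block blocks
    return 4 + 8 / block + 32 / (block * dq_block)


print(nf4_bits_per_param(double_quant=False))
print(nf4_bits_per_param(double_quant=True))
```

Under these assumptions, double quantization reduces the overhead of the constants from 0.5 bits to roughly 0.127 bits per parameter.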


In [None]:
# model_name_or_path must be set beforehand to a local model directory or a Hub model ID
model = AutoModel.from_pretrained(model_name_or_path,
                                  quantization_config=q_config,
                                  device_map='auto',
                                  trust_remote_code=True)

In [None]:
# Get the GPU memory currently occupied by the model (any gap to the total reported usage is memory reserved by PyTorch)
memory_footprint_bytes = model.get_memory_footprint()
memory_footprint_mib = memory_footprint_bytes / (1024 ** 2)  # convert to MiB

print(f"{memory_footprint_mib:.2f}MiB")

# Using the quantized model

In [11]:
from transformers import AutoTokenizer,AutoModelForCausalLM,AutoConfig,AwqConfig
import torch
model_path = r'E:\model\language\opt-125m'
quant_path = r'E:\model\language\quant\opt-125m-awq'
device=torch.device("cuda" if torch.cuda.is_available() else "cpu")




In [12]:

print(torch.__version__)
device

2.2.0+cu118


device(type='cuda')

In [14]:
model=AutoModelForCausalLM.from_pretrained(quant_path,trust_remote_code=True).cuda()
tokenizer=AutoTokenizer.from_pretrained(quant_path,trust_remote_code=True)



`low_cpu_mem_usage` was None, now set to True since model is quantized.


In [15]:
def generate_text(text):
    inputs = tokenizer(text, return_tensors="pt").to(0)
    print(inputs)
    out = model.generate(**inputs, max_new_tokens=64)
    print(out)
    return tokenizer.decode(out[0], skip_special_tokens=True)

In [16]:
generate_text('hello')

{'input_ids': tensor([[    2, 42891]], device='cuda:0'), 'attention_mask': tensor([[1, 1]], device='cuda:0')}
tensor([[    2, 42891,     6,  1437,    38,    95,   300,   103,     9,   127,
          1437,  6351,    11,    10,   186,     6,     8,    24,  1326,  5500,
             6,    38,   437,  6908,   213,   120,   127,    78,    65,    11,
           204,   377,     6,    98,    38,   429,   888,   907,    24, 50118,
         13987,    47,   313,   328, 15151,    47,   101,    24,   328,  4832,
           495,     2]], device='cuda:0')


"hello,  I just got some of my  gear in a week, and it looks fantastic, I'm gonna go get my first one in 4 months, so I might actually buy it\nThank you man! Glad you like it! :D"