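## 1. Introduction

### 1.1 Task Overview

In intelligent customer-service scenarios, semantic understanding often comes down to deciding whether two natural-language questions are semantically equivalent. We want to solve this with a lightweight, small-scale language model and evaluate it under several quantization settings across multiple application scenarios, so that business requirements can be met at minimal compute cost.

We use **ERNIE-4.5-0.3B-Base** as the base model. It has only 0.3B parameters and is the base pre-trained version; its small size and fast inference make it well suited to resource-constrained deployments.

FastDeploy supports several quantized inference modes, including INT8, INT4 and 2-bit. Precision can be set separately for three kinds of tensors (model weights, activations and the KVCache), which covers low-cost, low-latency and long-context inference scenarios.

The table below briefly compares the quantization schemes used here: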

| Quantization method | Weight precision | Activation precision | KVCache precision | Online/Offline | Supported hardware |
|----------|-----------|-----------|----------------|-------------|------------|
| WINT8    | **INT8**  | BF16      | BF16           | Online      | GPU, XPU   |
| WINT4    | **INT4**  | BF16      | BF16           | Online      | GPU, XPU   |
| WINT2    | **2-bit** | BF16      | BF16           | Offline     | GPU        |

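Next, we evaluate the original model (BASE) and its INT8, INT4 and INT2 (WINT2, 2-bit weights) quantized versions across multiple application scenarios, and analyze speed and accuracy at each precision to provide data for deployment decisions.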
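To get a rough sense of what the weight precisions in the table mean for a 0.3B-parameter model, the sketch below estimates weight storage at each precision. It is a back-of-the-envelope calculation under stated assumptions: exactly 0.3e9 parameters, every weight quantized, and no overhead for scales, zero points or layers that stay unquantized. Real footprints will be larger. The function name `weight_footprint_mb` is hypothetical and not part of FastDeploy.

```python
# Back-of-the-envelope weight storage for a 0.3B-parameter model.
# Assumptions (not taken from FastDeploy): exactly 0.3e9 parameters, all of
# them quantized, and no extra storage for scales, zero points or embeddings.
NUM_PARAMS = 0.3e9

BITS_PER_WEIGHT = {"BF16": 16, "WINT8": 8, "WINT4": 4, "WINT2": 2}


def weight_footprint_mb(num_params: float, bits: int) -> float:
    """Return approximate weight storage in megabytes (1 MB = 1e6 bytes)."""
    return num_params * bits / 8 / 1e6


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name:>6}: {weight_footprint_mb(NUM_PARAMS, bits):7.1f} MB")
```

Under these assumptions each halving of the bit width halves the weight storage, while activations and the KVCache stay in BF16 for all three schemes.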

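### 1.2 Data

The evaluation uses two datasets:

* Baidu DuQM test set

Built by applying perturbations such as substitution and insertion to original questions from search question-answering scenarios, filtering out questions that never occur in real usage so that the perturbed questions stay natural and fluent, and then manually screening the pairs and annotating them for semantic matching.

* OPPO XiaoBu short-text dialogue test set

Sampled from real dialogues with XiaoBu, OPPO's voice assistant, then manually screened and annotated for semantic matching.

Given a pair of questions, the task is to decide whether the two questions match (are equivalent) semantically. For example:

|Type|Question 1|Question 2|Label|
|-|-|-|-|
|Match|胎儿什么时候入盆 (When does the fetus engage in the pelvis?)|胚胎什么时候入盆 (When does the embryo engage in the pelvis?)|1|
|No match|人民币怎么换港币 (How do I exchange RMB for HKD?)|港币怎么换人民币 (How do I exchange HKD for RMB?)|0|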

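### 1.3 Evaluation Metric

The metric is macro-averaged accuracy (Macro-Accuracy): accuracy is first computed separately for each of the 14 dimensions, and the per-dimension accuracies are then averaged (macro-averaging).

Scoring formula: $Acc_{macro} = \frac{\sum_{i=1}^N Acc_i}{N}$

where $N = 14$ is the number of dimensions and $Acc_i= \frac{TP_i + TN_i}{TP_i + TN_i + FP_i + FN_i}$, with *TP = true positives, TN = true negatives, FP = false positives, FN = false negatives*.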
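The formula can be sketched in a few lines of Python. This is a minimal illustration, not the official scoring script: the dimension names and confusion counts are toy values, and the helper names `accuracy` and `macro_accuracy` are hypothetical.

```python
from typing import Dict, Tuple

# (TP, TN, FP, FN) for one dimension.
Counts = Tuple[int, int, int, int]


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Acc_i = (TP + TN) / (TP + TN + FP + FN) for a single dimension."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("dimension has no samples")
    return (tp + tn) / total


def macro_accuracy(per_dimension: Dict[str, Counts]) -> float:
    """Unweighted mean of the per-dimension accuracies."""
    accs = [accuracy(*counts) for counts in per_dimension.values()]
    return sum(accs) / len(accs)


# Toy example: a large dimension at 80% and a small one at 50%.
toy = {"dim_a": (40, 40, 10, 10), "dim_b": (5, 5, 5, 5)}
print(f"{macro_accuracy(toy):.3f}")
```

Because every dimension gets equal weight, the toy macro-accuracy is about 0.65, whereas pooling all 120 samples would give 0.75. Small dimensions therefore influence the final score as much as large ones.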


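## 2. Preparation

### 2.1 Text Length

To make sure inputs will not be cut off by a length limit, we first check the length of the longest line in the text to be processed.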

In [2]:

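# Find the longest line in the test file, measured in characters (not tokens).
# Each line is counted whole, including the tab separator and trailing newline.
max_l = 0
with open("test/test.tsv", "r") as f:
    for line in f:
        max_l = max(max_l, len(line))
print(f"MAX {max_l}")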


MAX 97


### 2.2 Scripts


* **0.init.sh / Environment setup**

In [3]:
! cat 0.init.sh


pip install pandarallel

clear
path=/home/aistudio/data/models
model=ERNIE-4.5-0.3B-Base-Paddle

echo $model
rm -rf $path/$model
aistudio download --model PaddlePaddle/$model --local_dir $path/$model
ls -l $path/$model


* **1.server-X.sh / Starting the server**

In [4]:
! cat 1.server*.sh


clear
path=/home/aistudio/data/models
model=ERNIE-4.5-0.3B-Base-Paddle

python -m fastdeploy.entrypoints.openai.api_server --model $path/$model --port 8180 --metrics-port 8181 --engine-worker-queue-port 8182 --max-model-len 3072 --max-num-seqs 128

clear
path=/home/aistudio/data/models
model=ERNIE-4.5-0.3B-Base-Paddle

python -m fastdeploy.entrypoints.openai.api_server --model $path/$model --port 8180 --metrics-port 8181 --engine-worker-queue-port 8182 --max-model-len 3072 --max-num-seqs 128 --quantization wint2

clear
path=/home/aistudio/data/models
model=ERNIE-4.5-0.3B-Base-Paddle

python -m fastdeploy.entrypoints.openai.api_server --model $path/$model --port 8180 --metrics-port 8181 --engine-worker-queue-port 8182 --max-model-len 3072 --max-num-seqs 128 --quantization wint4

clear
path=/home/aistudio/data/models
model=ERNIE-4.5-0.3B-Base-Paddle

python -m fastdeploy.entrypoints.openai.api_server --model $path/$model --port 8180 --metrics-port 8181 --engine-wo

* **2.run.py / Model invocation**

In [None]:
import os
import sys
import json
import openai
import pandas as pd
from tqdm import tqdm

os.system("clear")

# Import
from pandarallel import pandarallel
# Initialization
pandarallel.initialize(nb_workers=64, progress_bar=True)

host = "0.0.0.0"
port = "8180"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")

_system = {
    "role": "system", 
    "content": """
你是一个自然语言处理专家，现在需要进行问题匹配，不匹配返回0，匹配返回1，

输入：
{"A": "婴儿吃什么蔬菜好", "B": "婴儿吃什么绿色蔬菜好"}

输出：
{"result": 0}

严格按照输出的json格式。
"""
}
_E1, _E2 = 0, 0


def get(_c):
    response = client.chat.completions.create(
        model="null",
        messages=[
            _system, 
            {
                "role": "user", 
                "content": json.dumps(_c, ensure_ascii=False)
            }
        ],
        stream=False,
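        # Constrain the reply to a JSON object with an integer "result" field.
        # JSON Schema names this type "integer" ("int" is not a valid type name).
        response_format={
            "type": "json_object",
            "schema": {
                "type": "object",
                "properties": {
                    "result": {"type": "integer"}
                },
                "required": ["result"]
            }
        }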
    )

    try:
        _j = json.loads(
            response.choices[0].message.content.replace("`","").replace("json","")
        )
        _r = int(_j.get("result", 0))
    except Exception as e:
        print(f"\n \033[1;36m ERROR:\033[0m \n{response.choices[0].message.content[:100]}\n{e}", )
        _r = 0
    return _r if _r == 0 else 1


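# Smoke test: pass a dict, the same shape get() receives for the real data
# below, so that json.dumps() does not double-encode an already-serialized string.
t = {"A": "婴儿吃什么蔬菜好", "B": "婴儿吃什么绿色蔬菜好"}
print(f"""Test "{get(t)}".\n""")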


data = pd.read_csv(
    "test/test.tsv", 
    sep="\t", header=None,
    names=["A", "B"],
)
print(data.shape)

data["text"] = [
    {"A": f"{_1}", "B": f"{_2}"}
    for _1, _2 in zip(data["A"], data["B"])
]
data["result"] = data["text"].parallel_apply(get)
print(data)
print(data["result"].value_counts())

with open("test/predict.csv", "w") as f:
    for i in data["result"]:
        f.write(f"{i}\n")
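
The parsing step in `get` strips backticks and the word `json` from the reply before calling `json.loads`. As an alternative sketch (not the author's method), the helper below pulls the first flat `{...}` object out of the reply with a regular expression and applies the same "0 stays 0, anything else becomes 1" rule. The name `parse_result` is hypothetical, and nested JSON objects are deliberately not handled.

```python
import json
import re


def parse_result(content: str, default: int = 0) -> int:
    """Extract a 0/1 label from a model reply; fall back to `default`."""
    # Take the first flat {...} span, ignoring code fences or extra text.
    match = re.search(r"\{.*?\}", content, flags=re.DOTALL)
    if match is None:
        return default
    try:
        value = int(json.loads(match.group(0)).get("result", default))
    except (ValueError, TypeError):
        # Covers malformed JSON (JSONDecodeError is a ValueError) and
        # "result" values that cannot be converted to int.
        return default
    # Same normalization as in get(): 0 stays 0, any other value counts as 1.
    return 0 if value == 0 else 1


print(parse_result('```json\n{"result": 1}\n```'))
print(parse_result("not a JSON reply"))
```

Falling back to `0` on a parse failure mirrors the original script, but it also means malformed replies are silently scored as "no match", so it may be worth counting them separately when interpreting accuracy.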



* **3.moni.py / Run monitoring**

In [None]:
import warnings
warnings.filterwarnings("ignore")

import psutil
from pynvml import *
import subprocess
import time


class ResourceMonitor:
    def __init__(self):
        nvmlInit()
        self.handle = nvmlDeviceGetHandleByIndex(0)
        
    def get_stats(self):
        cpu = psutil.cpu_percent(interval=0.1)
        mem = psutil.virtual_memory().percent
        gpu = nvmlDeviceGetUtilizationRates(self.handle).gpu
        return {"CPU": cpu, "Mem": mem, "GPU": gpu}


monitor = ResourceMonitor()
process = subprocess.Popen("python 2.run.py > 2.run.log", shell=True)


try:
    while True:
        stats = monitor.get_stats()
        print(f"CPU:\t{stats['CPU']:.2f}%\t|Mem:\t{stats['Mem']:.2f}%\t|GPU:\t{stats['GPU']:.2f}%")
        if process.poll() is not None: break
        time.sleep(60)
finally:
    process.terminate()
    nvmlShutdown()
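
The monitor prints one `CPU / Mem / GPU` line per sample. To compare runs, those lines can be condensed into averages. The sketch below is one hypothetical way to do that; it assumes the exact line format printed by `3.moni.py`, and the `summarize` helper and its `skip_idle` option are illustrative names, not part of the original scripts.

```python
import re
from statistics import mean

# Matches lines such as "CPU:\t8.90%\t|Mem:\t22.00%\t|GPU:\t44.00%".
LINE_RE = re.compile(
    r"CPU:\s*([\d.]+)%\s*\|Mem:\s*([\d.]+)%\s*\|GPU:\s*([\d.]+)%"
)


def summarize(log_text: str, skip_idle: bool = True) -> dict:
    """Average CPU, memory and GPU utilization over the monitor samples."""
    rows = [tuple(map(float, m.groups())) for m in LINE_RE.finditer(log_text)]
    if skip_idle:
        # Drop samples taken before the job starts or after it ends (GPU at 0%).
        rows = [r for r in rows if r[2] > 0]
    if not rows:
        raise ValueError("no usable samples in log")
    cpu, mem, gpu = zip(*rows)
    return {"samples": len(rows), "CPU": mean(cpu), "Mem": mean(mem), "GPU": mean(gpu)}


sample = (
    "CPU:\t7.30%\t|Mem:\t21.90%\t|GPU:\t0.00%\n"
    "CPU:\t8.90%\t|Mem:\t22.00%\t|GPU:\t44.00%\n"
    "CPU:\t5.60%\t|Mem:\t22.00%\t|GPU:\t45.00%\n"
)
print(summarize(sample))
```

Since the monitor samples only once every 60 seconds, such averages are coarse and best used for rough comparisons between runs.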



## 3. Running the Evaluation

We recommend starting the server separately in its own terminal.


### 3.1 ERNIE-4.5-0.3B-Base

* **Start the server**

```shell
sh 1.server-base.sh &
```

* **Run the evaluation**


In [5]:
! cat 1.server-base.sh


clear
path=/home/aistudio/data/models
model=ERNIE-4.5-0.3B-Base-Paddle

python -m fastdeploy.entrypoints.openai.api_server --model $path/$model --port 8180 --metrics-port 8181 --engine-worker-queue-port 8182 --max-model-len 3072 --max-num-seqs 128


In [44]:
%%time
! python 3.moni.py


CPU:	7.30%	|Mem:	21.90%	|GPU:	0.00%
CPU:	8.90%	|Mem:	22.00%	|GPU:	44.00%
CPU:	5.60%	|Mem:	22.00%	|GPU:	45.00%
CPU:	6.50%	|Mem:	22.00%	|GPU:	43.00%
CPU:	7.70%	|Mem:	22.00%	|GPU:	43.00%
CPU:	7.00%	|Mem:	22.00%	|GPU:	43.00%
CPU:	15.90%	|Mem:	22.00%	|GPU:	40.00%
CPU:	5.70%	|Mem:	22.00%	|GPU:	44.00%
CPU:	6.50%	|Mem:	22.00%	|GPU:	39.00%
CPU:	6.10%	|Mem:	22.00%	|GPU:	46.00%
CPU:	9.30%	|Mem:	22.00%	|GPU:	41.00%
CPU:	8.60%	|Mem:	22.00%	|GPU:	44.00%
CPU:	6.10%	|Mem:	22.00%	|GPU:	40.00%
CPU:	7.90%	|Mem:	22.00%	|GPU:	40.00%
CPU:	6.30%	|Mem:	22.00%	|GPU:	39.00%
CPU:	6.50%	|Mem:	22.00%	|GPU:	43.00%
CPU:	7.50%	|Mem:	22.00%	|GPU:	43.00%
CPU:	6.80%	|Mem:	22.00%	|GPU:	39.00%
CPU:	7.00%	|Mem:	22.00%	|GPU:	42.00%
CPU:	5.10%	|Mem:	21.60%	|GPU:	0.00%
CPU times: user 8.7 s, sys: 1.45 s, total: 10.1 s
Wall time: 19min 2s


### 3.2 ERNIE-4.5-0.3B-INT8

* **Start the server**

```shell
sh 1.server-int8.sh &
```

* **Run the evaluation**


In [6]:
! cat 1.server-int8.sh


clear
path=/home/aistudio/data/models
model=ERNIE-4.5-0.3B-Base-Paddle

python -m fastdeploy.entrypoints.openai.api_server --model $path/$model --port 8180 --metrics-port 8181 --engine-worker-queue-port 8182 --max-model-len 3072 --max-num-seqs 128 --quantization wint8


In [33]:
%%time
! python 3.moni.py


CPU:	5.30%	|Mem:	21.90%	|GPU:	0.00%
CPU:	6.10%	|Mem:	21.90%	|GPU:	48.00%
CPU:	7.30%	|Mem:	21.90%	|GPU:	48.00%
CPU:	7.10%	|Mem:	21.90%	|GPU:	44.00%
CPU:	12.00%	|Mem:	21.90%	|GPU:	48.00%
CPU:	7.60%	|Mem:	21.90%	|GPU:	44.00%
CPU:	6.60%	|Mem:	21.90%	|GPU:	48.00%
CPU:	7.10%	|Mem:	21.90%	|GPU:	44.00%
CPU:	8.20%	|Mem:	22.00%	|GPU:	44.00%
CPU:	14.30%	|Mem:	22.00%	|GPU:	45.00%
CPU:	8.00%	|Mem:	22.00%	|GPU:	46.00%
CPU:	6.20%	|Mem:	22.00%	|GPU:	45.00%
CPU:	16.50%	|Mem:	22.00%	|GPU:	49.00%
CPU:	10.30%	|Mem:	22.00%	|GPU:	43.00%
CPU:	8.10%	|Mem:	22.00%	|GPU:	48.00%
CPU:	8.10%	|Mem:	22.00%	|GPU:	46.00%
CPU:	5.90%	|Mem:	22.00%	|GPU:	47.00%
CPU:	15.50%	|Mem:	22.00%	|GPU:	46.00%
CPU:	5.20%	|Mem:	21.90%	|GPU:	0.00%
CPU times: user 8.03 s, sys: 1.5 s, total: 9.53 s
Wall time: 18min 3s


### 3.3 ERNIE-4.5-0.3B-INT4

* **Start the server**

```shell
sh 1.server-int4.sh &
```

* **Run the evaluation**


In [7]:
! cat 1.server-int4.sh


clear
path=/home/aistudio/data/models
model=ERNIE-4.5-0.3B-Base-Paddle

python -m fastdeploy.entrypoints.openai.api_server --model $path/$model --port 8180 --metrics-port 8181 --engine-worker-queue-port 8182 --max-model-len 3072 --max-num-seqs 128 --quantization wint4


In [38]:
%%time
! python 3.moni.py


CPU:	6.80%	|Mem:	21.80%	|GPU:	0.00%
CPU:	8.80%	|Mem:	21.90%	|GPU:	43.00%
CPU:	6.50%	|Mem:	21.90%	|GPU:	50.00%
CPU:	8.50%	|Mem:	21.90%	|GPU:	49.00%
CPU:	7.40%	|Mem:	21.90%	|GPU:	44.00%
CPU:	7.00%	|Mem:	21.90%	|GPU:	47.00%
CPU:	7.90%	|Mem:	21.90%	|GPU:	47.00%
CPU:	13.50%	|Mem:	21.90%	|GPU:	42.00%
CPU:	6.70%	|Mem:	21.90%	|GPU:	46.00%
CPU:	6.50%	|Mem:	21.90%	|GPU:	46.00%
CPU:	7.10%	|Mem:	22.00%	|GPU:	47.00%
CPU:	7.60%	|Mem:	21.90%	|GPU:	50.00%
CPU:	12.30%	|Mem:	21.90%	|GPU:	47.00%
CPU:	7.30%	|Mem:	22.00%	|GPU:	45.00%
CPU:	6.60%	|Mem:	22.00%	|GPU:	51.00%
CPU:	14.90%	|Mem:	22.00%	|GPU:	47.00%
CPU:	7.50%	|Mem:	22.00%	|GPU:	47.00%
CPU:	7.90%	|Mem:	22.00%	|GPU:	48.00%
CPU:	7.50%	|Mem:	21.90%	|GPU:	0.00%
CPU times: user 8.77 s, sys: 1.48 s, total: 10.2 s
Wall time: 18min 3s


### 3.4 ERNIE-4.5-0.3B-INT2

* **Start the server**

```shell
sh 1.server-int2.sh &
```

* **Run the evaluation**


In [8]:
! cat 1.server-int2.sh


clear
path=/home/aistudio/data/models
model=ERNIE-4.5-0.3B-Base-Paddle

python -m fastdeploy.entrypoints.openai.api_server --model $path/$model --port 8180 --metrics-port 8181 --engine-worker-queue-port 8182 --max-model-len 3072 --max-num-seqs 128 --quantization wint2


In [43]:
%%time
! python 3.moni.py


CPU:	6.80%	|Mem:	21.90%	|GPU:	0.00%
CPU:	8.30%	|Mem:	22.00%	|GPU:	41.00%
CPU:	13.10%	|Mem:	22.00%	|GPU:	39.00%
CPU:	8.50%	|Mem:	22.00%	|GPU:	39.00%
CPU:	6.50%	|Mem:	22.00%	|GPU:	43.00%
CPU:	7.30%	|Mem:	22.00%	|GPU:	44.00%
CPU:	7.30%	|Mem:	22.00%	|GPU:	41.00%
CPU:	7.40%	|Mem:	22.00%	|GPU:	44.00%
CPU:	9.40%	|Mem:	22.00%	|GPU:	40.00%
CPU:	7.40%	|Mem:	22.00%	|GPU:	40.00%
CPU:	7.50%	|Mem:	22.00%	|GPU:	42.00%
CPU:	6.70%	|Mem:	22.00%	|GPU:	44.00%
CPU:	6.90%	|Mem:	22.00%	|GPU:	41.00%
CPU:	8.00%	|Mem:	22.00%	|GPU:	45.00%
CPU:	7.90%	|Mem:	22.00%	|GPU:	36.00%
CPU:	6.90%	|Mem:	22.00%	|GPU:	45.00%
CPU:	5.70%	|Mem:	22.00%	|GPU:	44.00%
CPU:	7.20%	|Mem:	22.00%	|GPU:	44.00%
CPU:	11.50%	|Mem:	22.00%	|GPU:	47.00%
CPU:	6.60%	|Mem:	21.90%	|GPU:	0.00%
CPU times: user 8.76 s, sys: 1.59 s, total: 10.4 s
Wall time: 19min 3s


## 4. Evaluation Results

### 4.1 Experimental Data

|                                    |        BASE|        INT8|        INT4|        INT2|
|------------------------------------|------------|------------|------------|------------|
| score                              |   48.329   |   48.591   | **49.258** |   48.620   |
| OPPO                               |   46.475   |   46.315   |   46.255   | **49.785** |
| DuQM_pos                           |   46.201   |   45.482   |   46.288   | **55.759** |
| DuQM_named_entity                  |   54.265   |   53.529   |   51.250   | **58.456** |
| DuQM_synonym                       |   51.194   |   51.592   | **53.981** |   43.312   |
| DuQM_antonym                       |   31.148   |   35.082   |   38.361   | **50.164** |
| DuQM_negation                      |   44.160   |   45.584   |   40.456   | **49.003** |
| DuQM_temporal                      |   48.718   |   40.598   |   39.744   | **53.419** |
| DuQM_symmetry                      |   60.225   | **61.351** |   60.976   |   51.595   |
| DuQM_asymmetry                     |   43.058   |   43.662   |   45.272   | **50.704** |
| DuQM_neg_asymmetry                 |   46.939   |   48.980   | **57.143** |   42.857   |
| DuQM_voice                         |   49.618   |   50.382   | **53.435** |   46.565   |
| DuQM_misspelling                   | **48.291** |   47.650   |   46.795   |   39.957   |
| DuQM_discourse_particle(simple)    | **58.216** |   55.869   |   57.746   |   47.887   |
| DuQM_discourse_particle(complex)   |   48.092   | **54.198** |   51.908   |   41.221   |

### 4.2 Summary


In [9]:

# pip install rich
from rich.console import Console
from rich.markdown import Markdown

console = Console()

result = ""
with open("4.result.log", "r") as f:
    for i in f:
        result += i
console.print(Markdown(result))
