Why is inference slower after quantization? #548

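I fine-tuned Qwen1.5-1.8B-Chat with LoRA, then used `swift export` to merge the LoRA weights and quantize the model to INT4. With either GPTQ or AWQ, inference is slower than with the model that was only LoRA-merged and not quantized.

I run inference through the API below. Is there something wrong with my inference code?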

```py
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType, get_default_template_type
)

ckpt_dir = '/Output/qwen1half-1_8b-chat/v7-20240306-171317/checkpoint-15000-merged-gptq-int4/'
model_type = ModelType.qwen1half_1_8b_chat
template_type = get_default_template_type(model_type)

model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': 'auto'},
                                       model_id_or_path=ckpt_dir)

template = get_template(template_type, tokenizer)

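# Placeholder inputs: the original post does not show how `query` and `history` are set.
query = 'Hello'
history = None
response, history = inference(model, template, query, history=history)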
```
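To make the "slower" comparison concrete, here is a minimal sketch of how the two checkpoints could be timed under the same conditions. It is not part of swift: the helper name `measure_throughput` and its parameters are hypothetical, and it assumes the caller supplies a `generate_fn` that runs one generation and returns the number of new tokens.

```python
import time


def measure_throughput(generate_fn, prompts, warmup=1, sync=None,
                       timer=time.perf_counter):
    """Return (total_tokens, elapsed_seconds, tokens_per_second).

    generate_fn(prompt) must return the number of newly generated tokens.
    sync is an optional callable (for example torch.cuda.synchronize) that is
    invoked before each clock read so pending GPU work is included.
    """
    # Warm-up calls are excluded from the timing, since the first calls may
    # pay one-off costs such as kernel loading or cache allocation.
    for prompt in prompts[:warmup]:
        generate_fn(prompt)

    if sync is not None:
        sync()
    start = timer()

    total_tokens = 0
    for prompt in prompts:
        total_tokens += generate_fn(prompt)

    if sync is not None:
        sync()
    elapsed = timer() - start

    tokens_per_second = total_tokens / elapsed if elapsed > 0 else float('inf')
    return total_tokens, elapsed, tokens_per_second
```

With the setup above, `generate_fn` could wrap the existing call, for example by running `inference(model, template, prompt)` and returning `len(tokenizer(response)['input_ids'])`. Running the same prompts against the merged checkpoint and the merged INT4 checkpoint then gives tokens-per-second figures that can be compared directly.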
