
Why is inference slower after quantization? #548

@lisenjie757

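I fine-tuned Qwen1.5-1.8B-Chat with LoRA, then used swift export to merge the LoRA weights and quantize the model to INT4. Whether I quantize with GPTQ or AWQ, inference is slower than with the model that was merged but not quantized.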

I am running inference through the API shown below. Is there something wrong with my inference code?

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType, get_default_template_type
)

ckpt_dir = '/Output/qwen1half-1_8b-chat/v7-20240306-171317/checkpoint-15000-merged-gptq-int4/'
model_type = ModelType.qwen1half_1_8b_chat
template_type = get_default_template_type(model_type)

model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': 'auto'},
                                       model_id_or_path=ckpt_dir)

template = get_template(template_type, tokenizer)

# `query` and `history` are defined elsewhere in the calling code (not shown in this post)
response, history = inference(model, template, query, history=history)
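To put numbers on the "slower" observation, the two checkpoints could be timed under identical conditions. The following is a minimal sketch, not part of swift: `measure_throughput` and the `generate` callable are hypothetical names introduced here, and `generate` is assumed to return the generated token ids for one prompt as a CPU-side sequence.

```python
import time
from typing import Callable, Sequence


def measure_throughput(
    generate: Callable[[str], Sequence[int]],
    prompts: Sequence[str],
    warmup: int = 1,
    repeats: int = 3,
) -> float:
    """Return generated tokens per second over `repeats` passes of `prompts`.

    `generate` is a hypothetical wrapper around whatever inference call is
    being compared; it must return the generated token ids for one prompt.
    """
    # Warm-up passes are excluded from timing so that one-off costs
    # (kernel compilation, cache allocation) do not skew the result.
    for _ in range(warmup):
        for prompt in prompts:
            generate(prompt)

    total_tokens = 0
    start = time.perf_counter()
    for _ in range(repeats):
        for prompt in prompts:
            total_tokens += len(generate(prompt))
    elapsed = time.perf_counter() - start

    # Guard against a zero interval on very coarse clocks.
    return total_tokens / max(elapsed, 1e-9)


if __name__ == "__main__":
    # Stand-in generator so the sketch runs without a model or GPU.
    def fake_generate(prompt: str) -> Sequence[int]:
        return [0] * 16

    tps = measure_throughput(fake_generate, ["hello", "world"])
    print(f"{tps:.1f} tokens/s")
```

With the code above, `generate` could wrap the `inference(model, template, query)` call and tokenize the returned response to count tokens. Running the same prompts through the merged checkpoint and the INT4 checkpoint then gives two directly comparable tokens-per-second figures. If `generate` returns tensors that are still on the GPU, the timing would also need an explicit device synchronization before the clock is read.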

Labels: question (Further information is requested)
