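## Introduction

The model evaluation script `evaluate.py` from the previous article has had one persistent problem: every evaluation run is slow. As shown below, 2,348 records take nearly 20 minutes.
`progress: 100%|██████████| 2348/2348 [19:12<00:00,  2.04it/s]`
The model was trained with a batch_size setting, so it supports batched prediction. It is therefore worth refactoring the evaluation script to predict in batches and shorten each evaluation run.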


## Initialization

In [1]:
%run evaluate.py

In [2]:
testdata_path = '/data2/anti_fraud/dataset/eval0819.jsonl'
model_path = '/data2/anti_fraud/models/modelscope/hub/Qwen/Qwen2-1___5B-Instruct'
checkpoint_path = '/data2/anti_fraud/models/Qwen2-1___5B-Instruct_ft_0826/checkpoint-900'
device = 'cuda'

Load the model:

In [5]:
model, tokenizer = load_model(model_path, checkpoint_path, device)

Load the dataset and build a small batch (8 records) for testing.

In [6]:
dataset = load_jsonl(testdata_path)
batch_data = dataset[0: 8]
contents = [item['input'] for item in batch_data]

## Feasibility Check

Wrap prompt construction in a helper function.

In [7]:
def build_prompt(content):
    prompt = f"下面是一段对话文本, 请分析对话内容是否有诈骗风险，只以json格式输出你的判断结果(is_fraud: true/false)。\n\n{content}"
    return [{"role": "user", "content": prompt}]

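Fill the prompts into the format the model expects.

> Note: in single-record mode, apply_chat_template was called with `tokenize=True` and `return_tensors="pt"`, which handled tokenization and tensor conversion in one step. In batch mode the sequences also have to be aligned to a common length (padding), so here we pass `tokenize=False` to get plain text back and leave tokenization to a separate tokenizer call.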

In [8]:
prompts = [build_prompt(content) for content in contents]
tokenized = tokenizer.apply_chat_template(prompts, add_generation_prompt=True, tokenize=False)
print(f"apply_chat_template:{type(tokenized)}, len: {len(tokenized)}, tokenized[0]:{tokenized[0]}")

apply_chat_template:<class 'list'>, len: 8, tokenized[0]:<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
下面是一段对话文本, 请分析对话内容是否有诈骗风险，只以json格式输出你的判断结果(is_fraud: true/false)。

发言人3: 如果投资一个亿就能回收，并且后面全全都是他的那个效益。<|im_end|>
<|im_start|>assistant
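
The layout printed above can be reproduced by hand to make clear what the template step contributes. The sketch below is a hand-written approximation based only on the output shown here; `render_chatml` and `DEFAULT_SYSTEM` are hypothetical names, and the real template shipped with the tokenizer may handle more cases.

```python
from typing import Dict, List

# Default system turn, copied from the rendered output above.
DEFAULT_SYSTEM = "You are a helpful assistant."


def render_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Approximate the ChatML layout seen in the rendered prompt (hypothetical helper)."""
    # Prepend the default system turn when the conversation has none.
    if not messages or messages[0]["role"] != "system":
        messages = [{"role": "system", "content": DEFAULT_SYSTEM}] + messages
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    # The trailing assistant header tells the model that it should answer next.
    if add_generation_prompt:
        text += "<|im_start|>assistant\n"
    return text


rendered = render_chatml([{"role": "user", "content": "hello"}])
print(rendered)
```

Every rendered prompt ends with the same assistant header, which is why the position of that header inside a padded batch matters in the steps that follow.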



In [9]:
inputs = tokenizer(tokenized, padding=True, return_tensors="pt").to(device)
print(f"inputs: {inputs}")

inputs: {'input_ids': tensor([[151644,   8948,    198,  ..., 151643, 151643, 151643],
        [151644,   8948,    198,  ..., 151643, 151643, 151643],
        [151644,   8948,    198,  ..., 151643, 151643, 151643],
        ...,
        [151644,   8948,    198,  ..., 151643, 151643, 151643],
        [151644,   8948,    198,  ..., 151643, 151643, 151643],
        [151644,   8948,    198,  ..., 151644,  77091,    198]],
       device='cuda:0'), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1]], device='cuda:0')}


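> Note the padding=True argument. Without it, the call fails with ```ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`input_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).```

> In other words, in batch mode all sequences in a batch must have the same length before they can be stacked into a tensor. There are two ways to get there: 1) pad the shorter sequences up to the longest one, via the `padding` parameter; 2) cut sequences that exceed a maximum length, via the `truncation` parameter.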
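
The two alignment options can be illustrated without any library. This is a minimal sketch on plain Python lists, assuming a made-up pad id of 0 and a hypothetical helper name `pad_batch`; it mimics the idea behind `padding` and `truncation`, not the tokenizer's actual implementation.

```python
from typing import List, Optional, Tuple

PAD_ID = 0  # hypothetical pad id, chosen only for this illustration


def pad_batch(
    seqs: List[List[int]], pad_id: int = PAD_ID, max_length: Optional[int] = None
) -> Tuple[List[List[int]], List[List[int]]]:
    """Align token-id lists to one length and build the matching attention mask."""
    # Truncation: cut every sequence that is longer than max_length.
    if max_length is not None:
        seqs = [s[:max_length] for s in seqs]
    # Padding: extend shorter sequences (on the right here) up to the longest one.
    longest = max(len(s) for s in seqs)
    input_ids = [s + [pad_id] * (longest - len(s)) for s in seqs]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(s) + [0] * (longest - len(s)) for s in seqs]
    return input_ids, attention_mask


ids, mask = pad_batch([[5, 6, 7], [8]])
print(ids)   # [[5, 6, 7], [8, 0, 0]]
print(mask)  # [[1, 1, 1], [1, 0, 0]]

short_ids, _ = pad_batch([[5, 6, 7], [8]], max_length=2)
print(short_ids)  # [[5, 6], [8, 0]]
```

Once every row has the same length, the batch can be stacked into a rectangular tensor, which is what `return_tensors="pt"` requires.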

封装一个predict_with_tensors函数，用于批量生成文本。

In [10]:
def predict_with_tensors(model, tokenizer, inputs, device='cuda', debug=False):
    # Fallback used when a reply cannot be parsed as JSON.
    default_response = {'is_fraud': False}
    # top_k=1 keeps only the most likely token, so sampling is effectively greedy.
    gen_kwargs = {"max_new_tokens": 2048, "do_sample": True, "top_k": 1}
    
    with torch.no_grad():
        outputs = model.generate(**inputs, **gen_kwargs)
        responses = []
        for i in range(outputs.size(0)):
            # Drop the prompt tokens and keep only the newly generated ones.
            output = outputs[i, inputs['input_ids'].shape[1]:]
            response = tokenizer.decode(output, skip_special_tokens=True)
            responses.append(safe_loads(response, default_response))
        return responses

In [11]:
%%time
predict_with_tensors(model, tokenizer, inputs, device, debug=False)

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


invalid json: 1. {"is_fraud": false}
invalid json: 10,12 {"is_fraud": false}
invalid json: 
invalid json: peeches
invalid json: 10. is_fraud: true
CPU times: user 1.82 s, sys: 272 ms, total: 2.09 s
Wall time: 2.09 s


[{'is_fraud': False},
 {'is_fraud': False},
 {'is_fraud': False},
 {'is_fraud': True},
 100,
 {'is_fraud': False},
 {'is_fraud': False},
 {'is_fraud': True}]

The run produced many `invalid json` errors, along with some odd outputs such as `10,12 {"is_fraud": false}`.
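
The result list also contains a bare `100`. One plausible reading is that the reply text `100` is itself valid JSON, so it was returned as an integer instead of a dict. The implementation of `safe_loads` in evaluate.py is not shown here, so the following is only a sketch of a stricter parser under that assumption; `parse_prediction` is a hypothetical name.

```python
import json
from typing import Any, Dict


def parse_prediction(text: str, default: Dict[str, Any]) -> Dict[str, Any]:
    """Parse a model reply, accepting only a dict with a boolean is_fraud field."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError:
        # Not JSON at all (e.g. "peeches" or an empty string).
        return dict(default)
    # Valid JSON is not enough: "100" parses to an int, which has no is_fraud key.
    if isinstance(value, dict) and isinstance(value.get("is_fraud"), bool):
        return value
    return dict(default)


default_response = {"is_fraud": False}
print(parse_prediction('{"is_fraud": true}', default_response))  # {'is_fraud': True}
print(parse_prediction("100", default_response))                 # {'is_fraud': False}
print(parse_prediction("peeches", default_response))             # {'is_fraud': False}
```

Checking the shape of the parsed value keeps a later lookup such as `prediction['is_fraud']` from failing on a non-dict element.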

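Looking closely, there is a warning above the errors. It says that right padding was detected while using a decoder-only architecture, that this can affect the generated results, and that the tokenizer should be initialized with padding_side set to 'left'. A decoder-only model continues from the last position of each row, so with right padding the shorter prompts are continued from pad tokens rather than from the end of the prompt.

Looking back at the inputs above, the right side of input_ids is full of 151643. What is 151643?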

In [12]:
tokenizer

Qwen2TokenizerFast(name_or_path='/data2/anti_fraud/models/modelscope/hub/Qwen/Qwen2-1___5B-Instruct', vocab_size=151643, model_max_length=32768, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>']}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

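The special-token list above shows that 151643 maps to `<|endoftext|>`, which is exactly the `pad_token` in special_tokens.

The warning matches the symptoms, so let's follow its advice and initialize a new tokenizer with padding_side='left'.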

In [14]:
tokenizer_left = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, padding_side="left")

In [16]:
inputs = tokenizer_left(tokenized, padding=True, return_tensors="pt").to(device)
print(f"inputs: {inputs}")

inputs: {'input_ids': tensor([[151643, 151643, 151643,  ..., 151644,  77091,    198],
        [151643, 151643, 151643,  ..., 151644,  77091,    198],
        [151643, 151643, 151643,  ..., 151644,  77091,    198],
        ...,
        [151643, 151643, 151643,  ..., 151644,  77091,    198],
        [151643, 151643, 151643,  ..., 151644,  77091,    198],
        [151644,   8948,    198,  ..., 151644,  77091,    198]],
       device='cuda:0'), 'attention_mask': tensor([[0, 0, 0,  ..., 1, 1, 1],
        [0, 0, 0,  ..., 1, 1, 1],
        [0, 0, 0,  ..., 1, 1, 1],
        ...,
        [0, 0, 0,  ..., 1, 1, 1],
        [0, 0, 0,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1]], device='cuda:0')}


In this tokenization result, the 151643 pad tokens have all moved to the left, which looks right.

Decode one row to check that it still reads correctly.

In [18]:
tokenizer_left.decode(inputs['input_ids'][0], skip_special_tokens=True)

'system\nYou are a helpful assistant.\nuser\n下面是一段对话文本, 请分析对话内容是否有诈骗风险，只以json格式输出你的判断结果(is_fraud: true/false)。\n\n发言人3: 如果投资一个亿就能回收，并且后面全全都是他的那个效益。\nassistant\n'

Run the batch prediction function again.

In [19]:
%%time
predict_with_tensors(model, tokenizer, inputs, device, debug=False)

CPU times: user 615 ms, sys: 43 ms, total: 658 ms
Wall time: 655 ms


[{'is_fraud': False},
 {'is_fraud': False},
 {'is_fraud': False},
 {'is_fraud': True},
 {'is_fraud': True},
 {'is_fraud': False},
 {'is_fraud': True},
 {'is_fraud': True}]
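
Why the padding side matters can be shown with plain lists. The sketch below uses made-up string tokens and a hypothetical `pad` helper; it only illustrates which token ends up in the last column, the position a decoder-only model continues from.

```python
from typing import List

PAD = "<pad>"  # placeholder pad token for this illustration


def pad(tokens: List[str], length: int, side: str) -> List[str]:
    """Pad one token list to the given length on the chosen side."""
    fill = [PAD] * (length - len(tokens))
    return fill + tokens if side == "left" else tokens + fill


prompts = [["a", "b", "c"], ["d"]]
width = max(len(p) for p in prompts)

right = [pad(p, width, "right") for p in prompts]
left = [pad(p, width, "left") for p in prompts]

# Generation appends new tokens after the last column of every row.
print([row[-1] for row in right])  # ['c', '<pad>']
print([row[-1] for row in left])   # ['c', 'd']
```

With left padding every row ends on a real prompt token, so slicing the output at the shared prompt width also returns exactly the newly generated tokens for each row.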

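## Refactoring the Script

Wrap the steps above into a proper predict_batch function for batched prediction, add a matching run_test_batch function that collects labels batch by batch, and put both into evaluate.py.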


In [None]:
def predict_batch(model, tokenizer, contents: List[str], device='cuda', debug=False):
    # NOTE: the tokenizer passed in must use padding_side='left' (see the
    # feasibility check above); right padding corrupts the generated text.
    prompts = [build_prompt(content) for content in contents]
    inputs = tokenizer(
        tokenizer.apply_chat_template(prompts, add_generation_prompt=True, tokenize=False),
        padding=True,
        return_tensors="pt"
    ).to(device)
    
    # Fallback used when a reply cannot be parsed as JSON.
    default_response = {'is_fraud': False}
    # top_k=1 keeps only the most likely token, so sampling is effectively greedy.
    gen_kwargs = {"max_new_tokens": 2048, "do_sample": True, "top_k": 1}
    
    with torch.no_grad():
        outputs = model.generate(**inputs, **gen_kwargs)
        responses = []
        for i in range(outputs.size(0)):
            # Drop the prompt tokens and keep only the newly generated ones.
            output = outputs[i, inputs['input_ids'].shape[1]:]
            response = tokenizer.decode(output, skip_special_tokens=True)
            responses.append(safe_loads(response, default_response))
        return responses

def run_test_batch(model, tokenizer, test_data: List[Dict], batch_size: int = 8, device='cuda', debug=False):
    print(f"run in batch mode, batch_size={batch_size}")
    real_labels = []
    pred_labels = []
    pbar = tqdm(total=len(test_data), desc='progress')
    
    # Walk through the data in steps of batch_size; the last batch may be smaller.
    for i in range(0, len(test_data), batch_size):
        batch_data = test_data[i:i + batch_size]
        dialog_inputs = [item['input'] for item in batch_data]
        real_batch_labels = [item['label'] for item in batch_data]
        
        predictions = predict_batch(model, tokenizer, dialog_inputs, device, debug=debug)
        pred_batch_labels = [prediction['is_fraud'] for prediction in predictions]
        
        real_labels.extend(real_batch_labels)
        pred_labels.extend(pred_batch_labels)
        
        pbar.update(len(batch_data))
    
    # Close the bar so it does not linger in later cell outputs.
    pbar.close()
    return real_labels, pred_labels
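
The slicing pattern in the loop can be checked on its own. The sketch below assumes a hypothetical generator named `iter_batches` that is not part of evaluate.py, and uses the dataset size and batch size from this article to show how the final, shorter batch comes out.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        # Slicing past the end is safe in Python, so the last batch is just shorter.
        yield items[start:start + batch_size]


sizes = [len(batch) for batch in iter_batches(list(range(2348)), 8)]
print(len(sizes), sizes[-1])  # 294 4
```

Since the last batch holds only 4 records, the progress bar is advanced by `len(batch_data)` rather than by a fixed `batch_size`.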

Then make a small change to the evaluate function: add a `batch` parameter (True/False) to enable batch evaluation.

In [None]:
def evaluate_with_model(model, tokenizer, testdata_path, device='cuda', batch=False, debug=False):
    dataset = load_jsonl(testdata_path)
    run_test_func = run_test_batch if batch else run_test
    true_labels, pred_labels = run_test_func(model, tokenizer, dataset, device=device, debug=debug)
    precision, recall = precision_recall(true_labels, pred_labels, debug=debug)
    print(f"precision: {precision}, recall: {recall}")

def evaluate(model_path, checkpoint_path, testdata_path, device='cuda', batch=False, debug=False):    
    model, tokenizer = load_model(model_path, checkpoint_path, device)
    evaluate_with_model(model, tokenizer, testdata_path, device, batch, debug)

## Batch Evaluation Test

In [26]:
%%time
evaluate(model_path, checkpoint_path, testdata_path, device, batch=True, debug=True)

progress:   0%|          | 0/2348 [05:08<?, ?it/s]


run in batch mode, batch_size=8


progress: 100%|██████████| 2348/2348 [03:14<00:00, 12.09it/s]

tn：1160, fp:5, fn:103, tp:1080
precision: 0.9953917050691244, recall: 0.9129332206255283
CPU times: user 3min 16s, sys: 21.9 s, total: 3min 38s
Wall time: 3min 16s





For the same amount of data (2,348 records), single-record mode took `19min 12s`, while batch mode takes only `3min 16s`, roughly 1/6 of the original time.
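
The ratio can be verified from the two wall-clock times quoted above. The helper name `to_seconds` is introduced here only for the calculation.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a minutes/seconds pair into total seconds."""
    return minutes * 60 + seconds


single = to_seconds(19, 12)   # single-record mode: 1152 seconds
batched = to_seconds(3, 16)   # batch mode (batch_size=8): 196 seconds

speedup = single / batched
print(f"speedup: {speedup:.2f}x")  # speedup: 5.88x
```

The speedup is below the batch size of 8, which is expected: a batch runs for as long as its longest generation, and the padded positions still cost compute.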

In [27]:
%%time
checkpoint_path_1200 = '/data2/anti_fraud/models/Qwen2-1___5B-Instruct_ft_0826/checkpoint-1200'
evaluate(model_path, checkpoint_path_1200, testdata_path, device, batch=True, debug=True)

run in batch mode, batch_size=8


progress: 100%|██████████| 2348/2348 [03:12<00:00, 12.18it/s]

tn：1160, fp:5, fn:89, tp:1094
precision: 0.9954504094631483, recall: 0.9247675401521556
CPU times: user 3min 15s, sys: 21.8 s, total: 3min 37s
Wall time: 3min 15s
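
The precision and recall printed for both checkpoints follow directly from the confusion-matrix counts in the logs. The sketch below recomputes them; `precision_recall_from_counts` is a hypothetical helper, not the `precision_recall` function in evaluate.py, whose implementation is not shown here.

```python
from typing import Tuple


def precision_recall_from_counts(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Compute precision and recall from confusion-matrix counts."""
    # Precision: share of predicted frauds that really are fraud.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of real frauds that the model caught.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Counts copied from the two evaluation logs above.
p900, r900 = precision_recall_from_counts(tp=1080, fp=5, fn=103)
p1200, r1200 = precision_recall_from_counts(tp=1094, fp=5, fn=89)

print(f"checkpoint-900:  precision={p900:.4f}, recall={r900:.4f}")
print(f"checkpoint-1200: precision={p1200:.4f}, recall={r1200:.4f}")
```

Both checkpoints produce the same 5 false positives, so the difference between them is entirely in recall: checkpoint-1200 misses 89 fraud cases where checkpoint-900 misses 103.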





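**Summary**: By exploring batched text generation, this article added batch prediction to the evaluation functions. That made evaluation faster and also gave a better understanding of when padding is needed and how left and right padding differ.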

## Related Reading
-  [Fraud Text Classification Fine-Tuning (5): Model Evaluation](https://golfxiao.blog.csdn.net/article/details/141355995)