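1. Preprocess your raw QA dataset `qa_data.jsonl`, in which each record typically consists of a question and a reference answer. **You generally need to write this preprocessing logic yourself; after processing, each record must have exactly two fields, `"question"` and `"right_answer"`, holding the question and the reference answer respectively.** The output file is `qa_data_ready.jsonl`. The implementation below can serve as a reference.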


In [None]:
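import json
from tqdm import tqdm

# Convert each raw record into the two-field format used by the later steps.
with open('/byllm/qa_data.jsonl', 'r', encoding='utf-8') as f, \
        open('/byllm/qa_data_ready.jsonl', 'w', encoding='utf-8') as f2:
    # First pass: count the lines so tqdm can show progress, then rewind.
    total_lines = sum(1 for _ in f)
    f.seek(0)
    for line in tqdm(f, total=total_lines, desc="Processing lines"):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        item = json.loads(line)
        # This example dataset keeps background text in 'knowledge'; prepend it to the question.
        q = item['knowledge'] + ' so ' + item['question']
        my_dict = {"question": q, "right_answer": item['answer']}
        # ensure_ascii=False keeps non-ASCII text readable in the output file.
        f2.write(json.dumps(my_dict, ensure_ascii=False) + '\n')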
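
Before moving on, it may help to confirm that every record in the output really has exactly the two expected fields. The following is a minimal sketch under the assumption that the file holds one JSON object per line; `check_ready_file` is a hypothetical helper name and not part of the original workflow.

```python
import json

def check_ready_file(path):
    """Return the 1-based line numbers of records that do not have exactly
    the fields "question" and "right_answer" (hypothetical helper)."""
    expected = {"question", "right_answer"}
    bad_lines = []
    with open(path, "r", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            try:
                item = json.loads(line)
            except json.JSONDecodeError:
                bad_lines.append(lineno)  # not valid JSON
                continue
            if not isinstance(item, dict) or set(item) != expected:
                bad_lines.append(lineno)  # wrong or missing fields
    return bad_lines
```

Calling `check_ready_file('/byllm/qa_data_ready.jsonl')` should return an empty list when the preprocessing worked as intended.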

2. Load the model to be evaluated; here we evaluate `Qwen1.5-0.5B-Chat`. Then define the batch answering function `batch_inference`.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
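# Model name or local path of the checkpoint to evaluate.
modelpath = "Qwen1.5-0.5B-Chat"
model = AutoModelForCausalLM.from_pretrained(
    modelpath,
    torch_dtype="auto",
    device_map="auto"
).eval()
# Left padding is needed for batched generation with decoder-only models.
tokenizer = AutoTokenizer.from_pretrained(modelpath, trust_remote_code=True, padding_side="left")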
def batch_inference(prompts:list[str])->list[str]:
    texts=[]
    for prompt in prompts:
        messages = [
            # Some models allow the system prompt to be omitted.
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ]
        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        texts.append(text)
    # Tokenize the whole batch and move it to the device the model is on.
    model_inputs = tokenizer(texts, return_tensors="pt", padding=True).to(model.device)

    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=512,
    )
    # Drop the prompt tokens so that only the newly generated tokens are decoded.
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    responses = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    return responses
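
The slicing step in `batch_inference` relies on two facts: for decoder-only models, `generate` returns each prompt followed by its continuation, and left padding makes every prompt in a batch the same length. The toy sketch below illustrates the idea with plain Python lists; the token ids, the `PAD` value, and the helper names are made up for illustration.

```python
PAD = 0  # made-up padding id, for illustration only

def left_pad(sequences, pad_id=PAD):
    """Left-pad lists of token ids to a common length."""
    width = max(len(s) for s in sequences)
    return [[pad_id] * (width - len(s)) + s for s in sequences]

def strip_prompts(input_ids, output_ids):
    """Keep only the tokens that come after each (padded) prompt."""
    return [out[len(inp):] for inp, out in zip(input_ids, output_ids)]

prompts = left_pad([[11, 12, 13], [21]])
# Pretend the model appended two new tokens to each padded prompt.
outputs = [prompts[0] + [101, 102], prompts[1] + [201, 202]]
new_tokens = strip_prompts(prompts, outputs)
```

With right padding, the pad tokens would sit between a short prompt and its continuation, which is why the tokenizer is created with `padding_side="left"`.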


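3. Now call the model above to answer the `question` field of every record in `qa_data_ready.jsonl` in batches. The output file is `qa_data_answer.jsonl`, in which each record has three fields: `"question"`, `"your_answer"`, and `"right_answer"`; `"your_answer"` is the model's answer. Adjust the batch size `batch_size` to your available GPU resources: larger batches run faster, as long as they fit in GPU memory.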


In [None]:
from torch.utils.data import DataLoader
import json
from tqdm import tqdm
from datasets import load_dataset

batch_size = 50

# Read the JSONL file with the built-in "json" loader.
eval_dataset = load_dataset("json", data_files="/byllm/qa_data_ready.jsonl", split="train")
# Evaluation does not need shuffling; keeping the order makes the output easy to compare with the input.
data_loader = DataLoader(dataset=eval_dataset, batch_size=batch_size, shuffle=False, num_workers=8)
with open('/byllm/qa_data_answer.jsonl', 'w', encoding='utf-8') as f2:
    for batch in tqdm(data_loader, total=len(data_loader)):
        # The default collate function turns a batch of records into a dict of lists.
        ans = batch_inference(batch['question'])
        for q, a, ra in zip(batch['question'], ans, batch['right_answer']):
            my_dict = {"question": q, "your_answer": a, "right_answer": ra}
            f2.write(json.dumps(my_dict, ensure_ascii=False) + '\n')
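
The loop above relies on the `DataLoader` turning each batch of records into a dictionary of lists, for example `{"question": [...], "right_answer": [...]}`. The same batching can be sketched in plain Python to make that shape explicit; `iter_batches` is a hypothetical helper and is not needed by the code above.

```python
def iter_batches(records, batch_size):
    """Yield dict-of-lists batches from a list of dict records, in order."""
    for start in range(0, len(records), batch_size):
        chunk = records[start:start + batch_size]
        # Regroup the chunk column by column, mirroring the default collate behaviour.
        yield {key: [r[key] for r in chunk] for key in chunk[0]}

records = [{"question": f"q{i}", "right_answer": f"a{i}"} for i in range(5)]
batches = list(iter_batches(records, batch_size=2))
```

The last batch is simply shorter when the number of records is not a multiple of the batch size, so no record is dropped.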





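4. Now call a hosted LLM API to compare your model's answers in `qa_data_answer.jsonl` with the reference answers and judge whether they agree. If they agree for a record, its `"label"` field is 1; otherwise it is 0. The final scoring file `qa_data_answer_judge.jsonl` contains five fields: `"question"`, `"your_answer"`, `"right_answer"`, `"label"`, and `"response"` (the judge model's raw reply). The accuracy is printed at the end. The API used here is [DeepSeek](https://platform.deepseek.com/api_keys). Follow the link to apply for an API key, then substitute it in the code.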

In [None]:
from openai import OpenAI
from tqdm import tqdm
import json

# Enter the API key you applied for here
api_key='sk-af2903a7da03f06dddbnwaubda'

client = OpenAI(
    api_key=api_key, base_url='https://api.deepseek.com'
)


def func(s):
    # Takes a prompt string and returns the judge model's reply as a string.
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": s},
        ],
        max_tokens=100,
        temperature=0.7,
        stream=False,
    )
    return response.choices[0].message.content


def llm_judge(answer_file, judge_func):
    # answer_file: a JSONL file in which each line has "question", "your_answer", and "right_answer".
    # judge_func: takes a prompt string and returns the judge's reply as a string.
    judge_file = answer_file[:-6] + "_judge.jsonl"
    all_q = 0
    right_a = 0
    template = (
        "Now I give you one question and two answers to it. One of the answers is the student's answer, "
        "the other is the right answer. Based on the given right answer, please judge whether the student's "
        "answer is correct. If the student got it right, respond with 'yes' and your reasons; otherwise "
        "respond with 'no' and your reasons.\n"
        "Here is the question: {question}\n"
        "Student's answer: {your_answer}\n"
        "Right answer: {right_answer}"
    )

    with open(judge_file, "w", encoding="utf-8") as f2:

        with open(answer_file, "r", encoding="utf-8") as f:
            total_lines = sum(1 for _ in f)
            f.seek(0)
            for line in tqdm(f, total=total_lines, desc="Processing lines"):
                item = json.loads(line.strip())
                pro = template.format(
                    question=item["question"],
                    your_answer=item["your_answer"],
                    right_answer=item["right_answer"],
                )
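                try:
                    response = judge_func(pro)
                except Exception:
                    # If the judge call fails (for example, the model refuses to answer), discard this record.
                    continue
                label = 0
                # Treat the reply as "yes" if "yes" appears in its first few characters; otherwise treat it as "no".
                if "yes" in response.lower()[:5]:
                    right_a += 1
                    label = 1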

                result = {
                    "question": item["question"],
                    "your_answer": item["your_answer"],
                    "right_answer": item["right_answer"],
                    "label": label,
                    "response":response
                }
                f2.write(json.dumps(result) + "\n")
                all_q += 1
               
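    # Guard against division by zero when no record could be judged.
    accuracy = right_a / all_q if all_q else 0.0
    return right_a, all_q, accuracy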


right_a, all_q, accuracy = llm_judge(
    "/byllm/qa_data_answer.jsonl", func
)


print(f"right answers ={right_a}, all = {all_q}, accuracy ={accuracy} ")
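
The check `"yes" in response.lower()[:5]` is deliberately simple. If the judge tends to prefix its reply with punctuation or markup, a slightly more tolerant parser may help. The following is only a sketch: `parse_verdict` is a hypothetical helper, and the rule it applies (look at the first alphabetic word of the reply) is an assumption rather than the method used above.

```python
import re

def parse_verdict(response):
    """Return 1 if the first alphabetic word of the reply is 'yes', else 0."""
    # `response or ""` also covers the case where the API returns no content.
    match = re.search(r"[A-Za-z]+", response or "")
    return 1 if match and match.group(0).lower() == "yes" else 0
```

Under this rule, a reply such as `**Yes**, the answers agree` counts as correct, while `No. The student says yes, but ...` does not.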


This completes a simple evaluation of Qwen 0.5B.