## Step 3. Inference

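**This example uses qwen2-7B-instruct as the base model for inference and generates the result file submission.jsonl.
The main workflow is: (1) load the model, dataset, and tokenizer; (2) write helper functions that process the context; (3) run inference and generate submission.jsonl.**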


### **Environment setup**

In [1]:
!pip install torch
!pip install PyPDF2
!pip install tqdm
!pip install transformers
!pip install accelerate


### Imports

In [2]:
import os
import re
import sys
import json
import warnings
import PyPDF2
import numpy as np
from tqdm import tqdm
from transformers import AutoModelForCausalLM,AutoTokenizer
import torch




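### Set paths
**Note for participants: DATA_PATH is set automatically by the system during evaluation.
For local debugging, change it to the appropriate location.**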

In [6]:
# Basic parameter settings
DATA_PATH=os.getenv('DATA_PATH')
device="cuda"
model_path="/bohr/model-i7fa/v1/" # set this to the path of the dataset where the model is mounted

In [7]:
# During testing, DATA_PATH is automatically set to the test-set path
if not DATA_PATH:
    DATA_PATH='/bohr/exampleData-pi6b/v6'
    print("Warning: DATA_PATH environment variable is not set. Using default path:", DATA_PATH)

PDF_PATH=DATA_PATH+'/pdfs/'

test_input_path=DATA_PATH+'/question.jsonl' # questions
test_output_path='submission.jsonl' # the generated answers are written next to the system's eval.ipynb
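
The path setup above can also be written with `pathlib`, which avoids missing or doubled slashes when joining path segments. This is a minimal sketch that assumes the same layout (`pdfs/` and `question.jsonl` under `DATA_PATH`); the helper name `resolve_paths` is hypothetical and not part of the baseline.

```python
import os
from pathlib import Path

def resolve_paths(default_data_path: str = "/bohr/exampleData-pi6b/v6") -> dict:
    """Return the dataset paths, preferring the DATA_PATH environment variable."""
    # resolve_paths is a hypothetical helper, not part of the baseline.
    data_path = os.getenv("DATA_PATH")
    if not data_path:
        # Fall back to the local default when the evaluator has not set DATA_PATH.
        print("Warning: DATA_PATH is not set. Using default path:", default_data_path)
        data_path = default_data_path
    root = Path(data_path)
    return {
        "data": root,
        "pdfs": root / "pdfs",
        "questions": root / "question.jsonl",
    }

paths = resolve_paths()
print(paths["questions"])
```

The returned `Path` objects can be passed directly to `open()`; wrap them in `str()` where a plain string is needed.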



### Load the model, dataset, and tokenizer

In [8]:
# Load the model and tokenizer
model=AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_path,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer=AutoTokenizer.from_pretrained(model_path)
# Switch to evaluation mode
model.eval()

Loading checkpoint shards: 100%|██████████| 4/4 [00:31<00:00,  7.81s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [9]:
# Helper function: read the test-set JSONL file into a list
def read_jsonl(file_path):
    data=[]
    with open(file_path,'r',encoding='utf-8') as f:
        for line in tqdm(f):
            try:
                obj=json.loads(line.strip())
                data.append(obj)
            except json.JSONDecodeError as e :
                print(f"Error decoding JSON:{e}")
    return data

test_input_list=read_jsonl(test_input_path)

210it [00:00, 30502.97it/s]
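
One detail of `read_jsonl` above is that a blank line (for example, a trailing newline at the end of the file) is passed to `json.loads` and reported as a decoding error. The sketch below is one possible variant that skips blank lines and reports the line number of any malformed record; the name `read_jsonl_strict` is hypothetical.

```python
import json

def read_jsonl_strict(file_path: str) -> list:
    """Read a JSONL file, skipping blank lines and reporting bad line numbers."""
    # read_jsonl_strict is a hypothetical variant of read_jsonl, not part of the baseline.
    data = []
    with open(file_path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # a blank line is not treated as an error
            try:
                data.append(json.loads(line))
            except json.JSONDecodeError as e:
                print(f"Line {line_no}: error decoding JSON: {e}")
    return data
```

It is called the same way as the original, for example `read_jsonl_strict(test_input_path)`.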


### Define helper functions (for processing context and parsing PDFs)

In [11]:
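# Helper function. Participants should design and implement their own PDF-handling logic (how to parse, whether to use RAG, the token limit, and so on).
# This is a simple example: parse with PyPDF2, append the text of the pages given by "pages" as an extra user message, and keep only the first 2048 characters.
#  Extract text from a PDF file and return it as a list of strings.
#    Args:
#        pdf_path: path to the PDF file.
#        add_page_num: whether to prefix each page's text with its page number.
#    Returns:
#        texts: a list of strings, one per page.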

def extract_text(pdf_path, add_page_num: bool = False) -> list[str]:

    # Open the PDF file
    texts = []
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)

        # Iterate through each page and extract text
        for page_num in range(len(reader.pages)):
            page = reader.pages[page_num]
            text = page.extract_text()
            text = f"Page {page_num + 1}:\n{text}\n" if add_page_num else text + "\n"
            texts.append(text)
    return texts

In [12]:

# Read the "doi" field, parse the corresponding PDF, select the context given by the "pages" field, and append it to the original input list as a user message
def parse_pdf_and_concate(obj):
    pdf_path=obj["doi"]
    pdf_path = pdf_path.replace('/', '_').replace(' (Supporting Information)', '_si')
    pdf_path=PDF_PATH+pdf_path+'.pdf'
    attach_content_list=extract_text(pdf_path=pdf_path)
    if "pages" in obj and obj["pages"] != [1,-1] :
        # e.g. pages=[5,6] selects the 5th and 6th strings in attach_content_list (indices 4 and 5)
        index=obj["pages"]
        attach_content_list=attach_content_list[index[0]-1:index[1]]
    
    attached_file_content = "\nThe file is as follows:\n\n" + "".join(attach_content_list)
    attached_file_content = attached_file_content[:2048]   
    obj["input"].append({"role":"user","content":attached_file_content})
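
The two conventions used by `parse_pdf_and_concate` (the DOI-to-file-name mapping and the 1-based, inclusive `pages` range) can be checked in isolation without opening a PDF. The sketch below mirrors that logic on plain strings; the helper names `doi_to_pdf_name` and `select_pages` and the example DOI are made up for illustration.

```python
def doi_to_pdf_name(doi: str) -> str:
    """Map a "doi" field to a PDF file name using the same replacements as above."""
    name = doi.replace("/", "_").replace(" (Supporting Information)", "_si")
    return name + ".pdf"

def select_pages(page_texts: list, pages: list) -> list:
    """Select a 1-based, inclusive [start, end] page range; [1, -1] means all pages."""
    if pages == [1, -1]:
        return page_texts
    start, end = pages
    return page_texts[start - 1:end]

# Hypothetical DOI, for illustration only.
print(doi_to_pdf_name("10.1000/example.123 (Supporting Information)"))
# -> 10.1000_example.123_si.pdf

page_texts = [f"page {i}\n" for i in range(1, 9)]
print(select_pages(page_texts, [5, 6]))
# -> ['page 5\n', 'page 6\n']
```

Note that the baseline truncates the joined text by characters (`[:2048]`), not by tokens, so the number of tokens actually sent to the model depends on the content.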

### 推理

In [13]:
# Run inference with the model and collect the results in a list
test_out_list=[]
for obj in tqdm(test_input_list):

    if "pages" in obj:
        # Parse the PDF and append its content
        parse_pdf_and_concate(obj)
    

    message=obj["input"]
    text=tokenizer.apply_chat_template(
        message,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs=tokenizer([text],return_tensors="pt").to(device)

    with torch.no_grad():
        generated_ids=model.generate(
            model_inputs.input_ids,
            max_new_tokens=512,
            temperature=0.2,
    
        )
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    response=tokenizer.batch_decode(generated_ids,skip_special_tokens=True)[0]
    obj["ideal"]=response
    # Remove the appended document content, if any, so the saved record keeps the original input
    if "pages" in obj:
        obj["input"].pop()
    test_out_list.append(obj)

  0%|          | 0/210 [00:00<?, ?it/s]The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
100%|██████████| 210/210 [17:26<00:00,  4.98s/it] 
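
For a decoder-only model, `generate` returns the prompt tokens followed by the newly generated tokens, which is why the loop above slices each output with `output_ids[len(input_ids):]` before decoding. The sketch below shows that slicing on plain Python lists; the token IDs are made up and the function name `strip_prompt` is hypothetical.

```python
def strip_prompt(input_ids_batch, output_ids_batch):
    """Drop the echoed prompt tokens from each generated sequence."""
    return [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(input_ids_batch, output_ids_batch)
    ]

# Made-up token IDs: the prompt is echoed at the start of the generated sequence.
prompt = [[101, 7, 8, 9]]
generated = [[101, 7, 8, 9, 42, 43, 2]]
print(strip_prompt(prompt, generated))
# -> [[42, 43, 2]]
```

Separately, the attention-mask warning in the output above arises because only `input_ids` is passed to `generate`. Passing the tokenizer's `attention_mask` as well, for example with `model.generate(**model_inputs, ...)`, is the usual way to address it.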


### Save the results as submission.jsonl

In [16]:
# Write the results to a JSONL file, one record per line
with open(test_output_path,'w',encoding='utf-8') as f:
    for item in test_out_list:
        json.dump(item, f, ensure_ascii=False)
        f.write('\n')
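
Before submitting, it can be useful to read the file back and confirm that every line is valid JSON. The sketch below is one way to do that. It assumes each record should contain the `input` and `ideal` keys used in the loop above; whether the evaluator requires exactly these keys is an assumption, and `check_jsonl` is a hypothetical helper.

```python
import json

def check_jsonl(path, required_keys=("input", "ideal")):
    """Return the number of records; raise if a line is invalid or lacks a key."""
    # required_keys reflects the fields used in this notebook, not a documented spec.
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            record = json.loads(line)  # raises json.JSONDecodeError on a bad line
            missing = [k for k in required_keys if k not in record]
            if missing:
                raise ValueError(f"line {line_no} is missing keys: {missing}")
            count += 1
    return count
```

For example, `check_jsonl(test_output_path)` should return the same number as `len(test_out_list)`.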

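### [Supplement] Platform usage help
The location of the model output in the baseline, and the location where model weights are saved in the fine-tuning/merging examples, can both be changed as needed. Consider saving them to your personal folder (/personal/) and then mounting that folder into a dataset you create.
For the rules on using platform datasets, see:
https://bohrium-doc.dp.tech/docs/userguide/Dataset/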