# Prompting the models to classify the statements

In [3]:
import pandas as pd

embedding = "te3l"  # alternative: "te3s"
grobid_model = "full_model_texts"
no_prev_chunking = True

path = f"../data/dfs/{embedding}{'_no_prev_chunking' if no_prev_chunking else ''}/{grobid_model}/ReferenceErrorDetection_data_with_chunk_info.pkl"
print(path)

# read the dataframe from a pickle file
df = pd.read_pickle(path)

../data/dfs/te3l_no_prev_chunking/full_model_texts/ReferenceErrorDetection_data_with_chunk_info.pkl


In [4]:
df.head()

Unnamed: 0,Source,Citing Article ID,Citing Article DOI,Citing Article Title,Citing Article Retracted,Citing Article Downloaded,Domain,Statement with Citation,Reference Article ID,Reference Article DOI,Reference Article Title,Reference Article Abstract,Reference Article PDF Available,Reference Article Retracted,Reference Article Downloaded,Label,Explanation,Top_3_Chunk_IDs,Top_3_Chunk_Texts
0,PubPeer,c001,10.1016/j.est.2021.103553,Heating a residential building using the heat ...,Yes,Yes,Engineering,Others have aimed to reduce irreversibility or...,r001,10.1155/2021/2087027,A Fault Analysis Method for Three-Phase Induct...,The fault prediction and abductive fault diagn...,Yes,No,Yes,Unsubstantiate,Irrelevant,"[b6b9b6c8-c29a-476a-9753-77b7720f014d, 55516ca...","[-us, they cannot effectively diagnose multipl..."
1,PubPeer,c001,10.1016/j.est.2021.103553,Heating a residential building using the heat ...,Yes,Yes,Engineering,Some researchers have also studied various hea...,r002,10.1016/j.physa.2018.12.031,Develop 24 dissimilar ANNs by suitable archite...,The artificial neural network optimization met...,Yes,No,Yes,Unsubstantiate,Irrelevant,"[0381ff83-6aae-46d7-a27d-39289fafa3db, a0be0af...",[The evaluations of nanofluid thermo-physical ...
2,PubPeer,c002,10.1155/2022/4601350,Oxidative Potential and Nanoantioxidant Activi...,Yes,Yes,Chemistry,The relative content of total flavonoids in th...,r003,10.1088/1742-6596/1937/1/012038,Lipid Data Acquisition for devices Treatment o...,"Recently, the widespread deployment of smart p...",Yes,No,Yes,Unsubstantiate,Irrelevant,"[894cd136-5607-4243-a49a-da5a76a3185c, a421f5a...",[The correspondence curve for our photochemica...
3,PubPeer,c003,10.1155/2022/2408685,The Choice of Anesthetic Drugs in Outpatient H...,Yes,Yes,Medicine,Research has shown that remimazolam tosylate e...,r004,10.1186/s12871-018-0543-3,"Effect of propofol on breast cancer cell, the ...",Breast cancer is the second leading cause of c...,Yes,No,Yes,Unsubstantiate,Irrelevant,"[7e0ac930-9428-4cf4-a6ce-03b06a04aba6, 9722f8a...",[The second study analyzed 325 patients with 1...
4,PubPeer,c004,10.1155/2022/4783847,A Fault-Tolerant Structure for Nano-Power Comm...,Yes,Yes,Engineering,if the efficiency of the routing algorithm is ...,r005,10.36410/jcpr.2022.23.3.312,Analysis and research hotspots of ceramic mate...,"From the perspective of scientometrics, comb t...",Yes,No,Yes,Unsubstantiate,Irrelevant,"[557d7f97-ca5f-4bda-b009-6c72783fe2d7, 9e35981...","[In the table, China's intermediary centrality..."


## Create the prompts

In [5]:
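def format_excerpts(excerpt_list):
    """Join the retrieved chunks into one numbered block of text for the prompt."""
    excerpts_text = ""
    # Number excerpts from 1; `number` avoids shadowing the built-in `id`
    for number, excerpt in enumerate(excerpt_list, start=1):
        excerpts_text += f"Excerpt {number}: \n{excerpt}\n\n"
    return excerpts_text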

In [6]:
print(format_excerpts(df.iloc[0]['Top_3_Chunk_Texts']))

Excerpt 1: 
-us, they cannot effectively diagnose multiple faults, not achieving the requirement of performing an overall fault analysis of the whole machine. -erefore, how to improve the abovementioned fault prediction and abductive fault diagnosis methods or put forward new ones is the main issue in the corresponding engineering domain for the motors. On the other hand, with the rapid development of artificial intelligence technology, intelligent analysis and diagnosis methods are gradually developed, such as expert systems (ESs)  [15] , artificial neural networks (ANNs)  [16] [17] [18] [19] [20] , Petri nets (PNs)  [21] [22] [23] , tissue P systems (TPSs)  [24] [25] [26] , and spiking neural P systems (SNPSs)  [27] [28] [29] [30] [31] [32] [33] [34] . Specifically, SNPS is a novel high-performance bioinspired distributed parallel computing model with powerful information processing ability.

Excerpt 2: 
When k � 5, N - r 5 � (O 18 ) T . -us, the termination condition is satisfied an

In [7]:
def create_prompt(df_row):
    title = df_row['Citing Article Title']
    statement = df_row['Statement with Citation']
    reference_title = df_row['Reference Article Title']
    reference_abstract = df_row['Reference Article Abstract']
    reference_excerpts = format_excerpts(df_row['Top_3_Chunk_Texts'])

    prompt = f"""   
You are an experienced scientific writer and editor. 
You will be given a statement from an article that cites a reference article and information from the reference article. 
You will determine and explain if the reference article supports the statement.  
    
Specifically, choose a label from "Fully substantiate", "Partially substantiate", and "Unsubstantiate". 
Further explanations of the labels are as follows: 
"Fully substantiate": The reference article fully substantiates the relevant part of the statement from the citing article. 
"Partially substantiate": According to the reference article, there is a minor error in the statement, but the error does not invalidate the purpose of the statement. 
"Unsubstantiate": The reference article does not substantiate any part of the statement. This could be because the statement is contradictory to, unrelated to, or simply missing from the reference article.  
    
Format your answer in JSON with two elements: "label" and "explanation". 
Your explanation should be short and concise. 
    
# The citing article
Title: {title} 
Statement: {statement}
    
# The reference article 
Title: {reference_title} 
Abstract: {reference_abstract} 
Excerpts: \n\n{reference_excerpts}
"""

    return prompt

In [8]:
example_prompt = create_prompt(df.iloc[22])
print(example_prompt)

   
You are an experienced scientific writer and editor. 
You will be given a statement from an article that cites a reference article and information from the reference article. 
You will determine and explain if the reference article supports the statement.  
    
Specifically, choose a label from "Fully substantiate", "Partially substantiate", and "Unsubstantiate". 
Further explanations of the labels are as follows: 
"Fully substantiate": The reference article fully substantiates the relevant part of the statement from the citing article. 
"Partially substantiate": According to the reference article, there is a minor error in the statement, but the error does not invalidate the purpose of the statement. 
"Unsubstantiate": The reference article does not substantiate any part of the statement. This could be because the statement is contradictory to, unrelated to, or simply missing from the reference article.  
    
Format your answer in JSON with two elements: "label" and "explanation"

## Prompting the models (batch processing)

In [39]:
import os
import json

def create_batch_file(df, model="gpt-3.5-turbo-0125"):
    """Write one chat-completion request per downloaded reference to a JSONL batch file."""
    output_dir = f"../data/batch_files/{embedding}{'_no_prev_chunking' if no_prev_chunking else ''}/{grobid_model}"
    os.makedirs(output_dir, exist_ok=True)
    output_file = os.path.join(output_dir, "prompt_batch.jsonl")

    # Open once in write mode, which also truncates any previous batch file
    with open(output_file, "w") as f:
        for index, row in df.iterrows():
            # Skip rows whose reference article could not be downloaded
            if row['Reference Article Downloaded'] != 'Yes':
                continue
            request = {
                "custom_id": f"request-{index}",  # maps the response back to the dataframe row
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": model,
                    "messages": [{"role": "user", "content": create_prompt(row)}],
                    "temperature": 0,
                },
            }
            f.write(json.dumps(request) + "\n")

    return output_file

In [40]:
batch_file_path = create_batch_file(df)
batch_file_path

'../data/batch_files/te3l_no_prev_chunking/full_model_texts/prompt_batch.jsonl'
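
Before uploading, it can be worth checking that every line of the batch file parses as JSON and that the `custom_id` values are unique, since they are later used to map responses back to dataframe rows. The following is a minimal sketch; `validate_batch_lines` and the `sample` requests are hypothetical and not part of the original notebook.

```python
import json

def validate_batch_lines(lines):
    """Check JSONL request lines and return their custom_ids in order.

    Hypothetical helper: raises ValueError on malformed JSON,
    duplicate custom_ids, or an empty message list.
    """
    seen = set()
    ids = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        request = json.loads(line)  # JSONDecodeError is a ValueError
        custom_id = request["custom_id"]
        if custom_id in seen:
            raise ValueError(f"Duplicate custom_id {custom_id!r} on line {line_no}")
        if not request["body"]["messages"]:
            raise ValueError(f"No messages in request on line {line_no}")
        seen.add(custom_id)
        ids.append(custom_id)
    return ids

# Illustrative requests in the same shape as those written by create_batch_file
sample = [
    json.dumps({
        "custom_id": f"request-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-3.5-turbo-0125",
            "messages": [{"role": "user", "content": "example prompt"}],
            "temperature": 0,
        },
    })
    for i in range(3)
]
print(validate_batch_lines(sample))
```

On the real file, the same check could be run with `validate_batch_lines(open(batch_file_path))`, since iterating over a file yields its lines.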

In [41]:
# Read the content of open_ai_key.txt into a variable
with open('../open_ai_key.txt', 'r') as file:
    open_ai_key = file.read().strip()

In [42]:
from openai import OpenAI
client = OpenAI(api_key=open_ai_key)

# Upload the batch file; the context manager closes the file handle afterwards
with open(batch_file_path, "rb") as f:
    batch_input_file = client.files.create(file=f, purpose="batch")

print(batch_input_file)

FileObject(id='file-SYDuxSAEvuU9N7kjTFXTvp', bytes=1473189, created_at=1742824666, filename='prompt_batch.jsonl', object='file', purpose='batch', status='processed', expires_at=None, status_details=None)


In [43]:
batch_input_file_id = batch_input_file.id
batch_creation_response = client.batches.create(
    input_file_id=batch_input_file_id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)

In [44]:
batch_creation_response

Batch(id='batch_67e164dd8e0881908b90679ecf92f9c1', completion_window='24h', created_at=1742824669, endpoint='/v1/chat/completions', input_file_id='file-SYDuxSAEvuU9N7kjTFXTvp', object='batch', status='validating', cancelled_at=None, cancelling_at=None, completed_at=None, error_file_id=None, errors=None, expired_at=None, expires_at=1742911069, failed_at=None, finalizing_at=None, in_progress_at=None, metadata=None, output_file_id=None, request_counts=BatchRequestCounts(completed=0, failed=0, total=0))

### Check the batch status

In [45]:
batch_creation_response.id

'batch_67e164dd8e0881908b90679ecf92f9c1'

In [46]:
batch = client.batches.retrieve(batch_creation_response.id)
print(batch.status)

in_progress


In [50]:
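import time

def wait_for_batch_completion(batch_id, client, interval=10):
    """Poll the batch until it reaches a terminal status and return it."""
    # Statuses after which the batch will not change any more
    terminal_statuses = {'completed', 'failed', 'expired', 'cancelled'}
    while True:
        batch = client.batches.retrieve(batch_id)
        print(f"Current status: {batch.status}")
        if batch.status == 'completed':
            print("Batch processing completed.")
            break
        if batch.status in terminal_statuses:
            # Stop instead of polling forever when the batch cannot complete
            raise RuntimeError(f"Batch {batch_id} ended with status '{batch.status}'.")
        time.sleep(interval)
    return batch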

In [51]:
batch = wait_for_batch_completion(batch_creation_response.id, client)

Current status: completed
Batch processing completed.


In [55]:
model_responses = client.files.content(batch.output_file_id).text
print(model_responses)

{"id": "batch_req_67e1651cc6a4819095839d505ade6537", "custom_id": "request-0", "response": {"status_code": 200, "request_id": "b3b6f6afd69978c582f5e9bdffaa7bb6", "body": {"id": "chatcmpl-BEcbHw0prG9O1spmKFU6ivIqJ6qSd", "object": "chat.completion", "created": 1742824671, "model": "gpt-3.5-turbo-0125", "choices": [{"index": 0, "message": {"role": "assistant", "content": "{\n    \"label\": \"Unsubstantiate\",\n    \"explanation\": \"The reference article does not mention anything related to reducing irreversibility or optimizing energy-consumed devices, so it does not support the statement.\"\n}", "refusal": null, "annotations": []}, "logprobs": null, "finish_reason": "stop"}], "usage": {"prompt_tokens": 1018, "completion_tokens": 47, "total_tokens": 1065, "prompt_tokens_details": {"cached_tokens": 0, "audio_tokens": 0}, "completion_tokens_details": {"reasoning_tokens": 0, "audio_tokens": 0, "accepted_prediction_tokens": 0, "rejected_prediction_tokens": 0}}, "service_tier": "default", "sy

In [57]:
import json

# Parse the model_responses into a list of objects
responses_list = [json.loads(line) for line in model_responses.splitlines()]

# Print the parsed responses
print(responses_list)

[{'id': 'batch_req_67e1651cc6a4819095839d505ade6537', 'custom_id': 'request-0', 'response': {'status_code': 200, 'request_id': 'b3b6f6afd69978c582f5e9bdffaa7bb6', 'body': {'id': 'chatcmpl-BEcbHw0prG9O1spmKFU6ivIqJ6qSd', 'object': 'chat.completion', 'created': 1742824671, 'model': 'gpt-3.5-turbo-0125', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': '{\n    "label": "Unsubstantiate",\n    "explanation": "The reference article does not mention anything related to reducing irreversibility or optimizing energy-consumed devices, so it does not support the statement."\n}', 'refusal': None, 'annotations': []}, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 1018, 'completion_tokens': 47, 'total_tokens': 1065, 'prompt_tokens_details': {'cached_tokens': 0, 'audio_tokens': 0}, 'completion_tokens_details': {'reasoning_tokens': 0, 'audio_tokens': 0, 'accepted_prediction_tokens': 0, 'rejected_prediction_tokens': 0}}, 'service_tier': 'default', 'system_fi

In [58]:
responses_dict = {int(response['custom_id'].split('-')[1]): response for response in responses_list}
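
Individual requests in a batch can fail even when the batch as a whole completes, so it may help to separate usable responses from failed ones before writing them to the dataframe. The sketch below is an assumption-laden illustration: `split_batch_responses` is a hypothetical helper, and it assumes each output line carries a `response` object with a `status_code` (as in the output above) plus an `error` field that is empty on success.

```python
def split_batch_responses(responses):
    """Split parsed batch output lines into successes and failures.

    Hypothetical helper. Returns two dicts keyed by dataframe row index:
    message contents for successful requests, and the error (or status
    code) for failed ones.
    """
    ok, failed = {}, {}
    for item in responses:
        # custom_id has the form "request-<row index>"
        row_index = int(item["custom_id"].rsplit("-", 1)[1])
        response = item.get("response") or {}
        if item.get("error") is None and response.get("status_code") == 200:
            ok[row_index] = response["body"]["choices"][0]["message"]["content"]
        else:
            failed[row_index] = item.get("error") or response.get("status_code")
    return ok, failed

# Illustrative data only; the shape of the "error" entry is assumed
sample_responses = [
    {
        "custom_id": "request-0",
        "error": None,
        "response": {
            "status_code": 200,
            "body": {"choices": [{"message": {"content": '{"label": "Unsubstantiate"}'}}]},
        },
    },
    {"custom_id": "request-7", "error": {"code": "server_error"}, "response": None},
]

ok, failed = split_batch_responses(sample_responses)
print(ok)
print(failed)
```

Rows that end up in `failed` could then be retried through the non-batched path by adding their reference IDs to a list of rows to re-prompt.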

In [67]:
# Create a new column in the dataframe to store the responses
if 'Model Classification' not in df.columns:
    df['Model Classification'] = None

# Iterate through the dataframe
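for index, row in df.iterrows():
    if row['Reference Article Downloaded'] == 'Yes':
        response = responses_dict.get(index)
        if response is None or response.get('response') is None:
            # No usable response came back for this row; leave the cell empty
            print(f"No response for request-{index}")
            continue

        model_response = response['response']['body']['choices'][0]['message']['content']

        # Save the response to the new column
        df.at[index, 'Model Classification'] = model_response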

In [68]:
df.to_pickle(f"../data/dfs/{embedding}{'_no_prev_chunking' if no_prev_chunking else ''}/{grobid_model}/ReferenceErrorDetection_data_with_prompt_results_batched.pkl")

## Prompting the models (no batching)

In [33]:
# Read the content of open_ai_key.txt into a variable
with open('../open_ai_key.txt', 'r') as file:
    open_ai_key = file.read().strip()

In [34]:
from openai import OpenAI
client = OpenAI(api_key=open_ai_key)

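def send_prompt(prompt, model="gpt-3.5-turbo-0125"):
    """Send a single user prompt and return the text of the model's reply."""
    completion = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": prompt
            }
        ],
        temperature=0,  # keep the classification as deterministic as possible
        timeout=30,
    )
    return completion.choices[0].message.content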

In [35]:
send_prompt(example_prompt)

'{\n    "label": "Unsubstantiate",\n    "explanation": "The reference article on DeepCleave does not provide any information related to the problem of difficulty in arranging classes and summarizing grades in college music education courses."\n}'
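
The model returns its answer as a JSON string, which is stored unparsed in the `Model Classification` column. A minimal sketch of extracting the label and explanation follows; `parse_classification` and `ALLOWED_LABELS` are hypothetical names, and the normalisation of a trailing "d" (e.g. "Unsubstantiated") is an assumption about how a model might vary the label, not observed behaviour.

```python
import json

# Labels offered to the model in the prompt
ALLOWED_LABELS = {"Fully substantiate", "Partially substantiate", "Unsubstantiate"}

def parse_classification(raw):
    """Return (label, explanation) from a raw model reply.

    Hypothetical helper: the label is None when the reply is not valid
    JSON or does not contain one of the allowed labels.
    """
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        return None, None
    if not isinstance(parsed, dict):
        return None, None

    label = parsed.get("label")
    explanation = parsed.get("explanation")
    if isinstance(label, str):
        label = label.strip()
        # Assumed variation: "...substantiated" instead of "...substantiate"
        if label.endswith("ated"):
            label = label[:-1]
    if label not in ALLOWED_LABELS:
        return None, explanation
    return label, explanation

raw_reply = '{\n    "label": "Unsubstantiate",\n    "explanation": "The reference does not mention the topic."\n}'
print(parse_classification(raw_reply))
```

Applied to the dataframe, something like `df['Model Classification'].apply(parse_classification)` would yield one tuple per row, with `(None, None)` for rows that were never prompted.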

In [None]:
# path = f"../data/dfs/{embedding}{'_no_prev_chunking' if no_prev_chunking else ''}/{grobid_model}/ReferenceErrorDetection_data_with_prompt_results.pkl"
# df = pd.read_pickle(path)

In [None]:
ids_to_prompt = []

In [38]:
%%time

# Create a new column in the dataframe to store the responses
if 'Model Classification' not in df.columns:
    df['Model Classification'] = None

# Iterate through the dataframe
for index, row in df.iterrows():
    if row['Reference Article Downloaded'] == 'Yes':
        # If ids_to_prompt is non-empty, only prompt for the listed reference articles
        if ids_to_prompt and row['Reference Article ID'] not in ids_to_prompt:
            continue

        print(f"Processing: {row['Reference Article ID']}")

        # Create the prompt
        prompt = create_prompt(row)
        
        # Send the prompt and get the response
        response = send_prompt(prompt)
        
        # Save the response to the new column
        df.at[index, 'Model Classification'] = response

Processing: r071
Processing: r075
Processing: r147
CPU times: user 92.7 ms, sys: 4.03 ms, total: 96.7 ms
Wall time: 3.24 s
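
Because the loop above sends requests one at a time, a single timeout or transient error aborts the whole run. One option is to wrap each call in a small retry helper with exponential backoff. This is only a sketch: `call_with_retries` and `flaky` are hypothetical, and catching every `Exception` is a simplification that would likely be narrowed to the client's transient error types in practice.

```python
import time

def call_with_retries(func, *args, max_attempts=3, base_delay=1.0, **kwargs):
    """Call func(*args, **kwargs), retrying with exponential backoff.

    Hypothetical helper: re-raises the last exception once
    max_attempts calls have failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception as exc:  # simplification: retries on any error
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay} s")
            time.sleep(delay)

# Stand-in for send_prompt that fails twice before succeeding
calls = {"count": 0}

def flaky(prompt):
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return f"response to {prompt}"

result = call_with_retries(flaky, "example prompt", base_delay=0)
print(result)
```

In the loop, `response = call_with_retries(send_prompt, prompt)` would take the place of the direct `send_prompt(prompt)` call.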


In [39]:
df.to_pickle(f"../data/dfs/{embedding}{'_no_prev_chunking' if no_prev_chunking else ''}/{grobid_model}/ReferenceErrorDetection_data_with_prompt_results.pkl")