In [1]:
import pandas as pd

chunking = "256_20"
only_text = False

path = f"../data/dfs/{'only_text' if only_text else ''}{chunking}/ReferenceErrorDetection_data_with_chunk_info.pkl"
print(path)

# read the dataframe from a pickle file
df = pd.read_pickle(path)

../data/dfs/256_20/ReferenceErrorDetection_data_with_chunk_info.pkl
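
Before building prompts, it can help to confirm that the loaded dataframe has every column the later cells read. The sketch below is an assumption-laden helper, not part of the original notebook: `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, and the column list is copied from the prompt-building functions further down. It takes any iterable of column names, so it runs without the pickle file.

```python
# Column names taken from the prompt-building code in this notebook.
REQUIRED_COLUMNS = [
    "Citing Article Title",
    "Corrected Statement",
    "Reference Number",
    "Reference Article Title",
    "Reference Article Abstract",
    "Top_3_Chunk_Texts",
]

def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]

# With the dataframe loaded above, the call would be: missing_columns(df.columns)
# Toy example with a plain list so the snippet runs on its own:
print(missing_columns(["Citing Article Title", "Reference Number"]))
```

An empty list means all expected columns are present.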


## Create the prompts

In [4]:
import re

def normalize_whitespace(text):
    return re.sub(r'\s+', ' ', text).strip()

In [5]:
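def format_excerpts(excerpt_list):
    """Number each excerpt (starting at 1) and normalize its whitespace."""
    excerpts_text = ""
    # `idx` instead of `id` avoids shadowing the built-in id()
    for idx, excerpt in enumerate(excerpt_list, start=1):
        excerpts_text += f"Excerpt {idx}: \n{normalize_whitespace(excerpt)}\n\n"
    return excerpts_text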
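
To make the effect of the two helpers visible, here is a small self-contained sketch on made-up input. The helpers are repeated only so the snippet runs on its own, and `toy_chunks` is a hypothetical stand-in for a `Top_3_Chunk_Texts` entry.

```python
import re

def normalize_whitespace(text):
    # Collapse any run of whitespace (spaces, tabs, newlines) into one space
    return re.sub(r'\s+', ' ', text).strip()

def format_excerpts(excerpt_list):
    excerpts_text = ""
    for idx, excerpt in enumerate(excerpt_list, start=1):
        excerpts_text += f"Excerpt {idx}: \n{normalize_whitespace(excerpt)}\n\n"
    return excerpts_text

# Hypothetical chunks with messy spacing, tabs, and line breaks
toy_chunks = ["  first   chunk\nwith a line break ", "second\tchunk"]
formatted = format_excerpts(toy_chunks)
print(repr(formatted))
# 'Excerpt 1: \nfirst chunk with a line break\n\nExcerpt 2: \nsecond chunk\n\n'
```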

In [6]:
print(format_excerpts(df.iloc[0]['Top_3_Chunk_Texts']))

Excerpt 1: 
Automatic implementation of fuzzy reasoning spiking neural P systems for diagnosing faults in complex power systems HNRong KYi GXZhang JDong PPaul ZHuang Complexity 2019 16 2019 A rough set-based bioinspired fault diagnosis method for electrical substations TWang WLiu JBZhao International Journal of Electrical Power & Energy Systems 119 105961 2020 Simplified and yet turing universal spiking neural P systems with polarizations optimized by anti-spikes TWu TZhang FXu Neurocomputing 414 2020

Excerpt 2: 
A novel controllable crowbar based on fault type protection technique for DFIG wind energy conversion system using adaptive neuro-fuzzy inference system ONoureldeen IHamdan Protection and Control of Modern Power Systems 3 1 2018 Adaptive fault diagnosis of motors using comprehensive learning particle swarm optimizer with fuzzy petri net XZCheng CGWang JMLi Computing and Informatics 39 1-2 2020 Detection and localization of asymmetry in stator winding of three phase induction 

In [11]:
def create_prompt(df_row):
    title = df_row['Citing Article Title']

    statement = df_row["Corrected Statement"]
    assert statement is not None and statement != '', "Statement cannot be None or empty"

    reference_number = df_row['Reference Number']
    reference_title = df_row['Reference Article Title']
    reference_abstract = df_row['Reference Article Abstract']
    reference_excerpts = format_excerpts(df_row['Top_3_Chunk_Texts'])

    prompt = f"""   
You are an experienced scientific writer and editor. 
You will be given a citation statement from an article that cites a reference article.
From this same reference article you will receive the additional information, including the title, the abstract of the article and the top 3 most relevant excerpts from the reference article. The relevance of the excerpts was previously determined by another large language model based on the citation statement.
Your task is to determine and explain if the reference article supports the given citation statement.  

The statement sentence can contain multiple citations and can refer to multiple reference articles, which are all cited in IEEE style.
You are given the number of the reference article that you should check ("Reference Number"). When for example the statement is "X is true [37, 38] and Y is false under certain conditions [39]", and you are given the reference number 37, you should only check the first part of the statement that refers to the reference article 37.
    
As your classification result, decide between the two labels "Substantiated" and "Unsubstantiated". 
Further explanations of the labels are as follows: 
"Substantiated": The reference article fully substantiates the relevant part of the presented citation statement. This means that only based on the information from the reference article, the statement does not contain errors and can be considered correct. 
"Unsubstantiated": The reference article does not substantiate the relevant part of the presented citation statement. This could be because the statement is contradictory to, unrelated to, or simply missing from the reference article. All of these options would indicate that the citation is incorrect based on the cited references.
    
Format your answer in JSON with two elements: "label" and "explanation". 
Your explanation should be short and concise. 
    
# The citing article
-- Title: {title} 
-- Statement: {statement}
    
# The reference article 
-- Reference Number: {reference_number}
-- Title: {reference_title} 
-- Abstract: {reference_abstract} 
-- Excerpts: \n{reference_excerpts}
"""

    return prompt
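
The `assert` in `create_prompt` rejects `None` and the empty string, but a `NaN` value would pass it, since `NaN` is neither `None` nor equal to `''`. If missing statements in the dataframe are stored as `NaN` (an assumption about the data, not something this notebook confirms), a stricter check along the following lines may be useful. `is_missing` and `require_statement` are hypothetical helper names.

```python
import math

def is_missing(value):
    """True for None, NaN, or a string that is empty after stripping."""
    if value is None:
        return True
    # NaN is a float that is not equal to itself; math.isnan detects it
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str) and value.strip() == "":
        return True
    return False

def require_statement(row):
    """Return the statement from a row-like mapping, or raise if it is missing."""
    statement = row["Corrected Statement"]
    if is_missing(statement):
        raise ValueError("Statement cannot be None, NaN, or empty")
    return statement

# A plain dict stands in for a dataframe row here
print(require_statement({"Corrected Statement": "X is true [37, 38]"}))
```

Raising `ValueError` instead of using `assert` also keeps the check active when Python runs with the `-O` flag, which strips assertions.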

In [12]:
example_prompt = create_prompt(df.iloc[22])
print(example_prompt)

   
You are an experienced scientific writer and editor. 
You will be given a citation statement from an article that cites a reference article.
From this same reference article you will receive the additional information, including the title, the abstract of the article and the top 3 most relevant excerpts from the reference article. The relevance of the excerpts was previously determined by another large language model based on the citation statement.
Your task is to determine and explain if the reference article supports the given citation statement.  

The statement sentence can contain multiple citations and can refer to multiple reference articles, which are all cited in IEEE style.
You are given the number of the reference article that you should check ("Reference Number"). When for example the statement is "X is true [37, 38] and Y is false under certain conditions [39]", and you are given the reference number 37, you should only check the first part of the statement that refer

In [27]:
def create_prompt_ai_improved(df_row):
    title = df_row['Citing Article Title']

    statement = df_row["Corrected Statement"]
    assert statement is not None and statement != '', "Statement cannot be None or empty"

    reference_number = df_row['Reference Number']
    reference_title = df_row['Reference Article Title']
    reference_abstract = df_row['Reference Article Abstract']
    reference_excerpts = format_excerpts(df_row['Top_3_Chunk_Texts'])

    prompt = f"""
You are an experienced scientific writer and editor. Your task is to evaluate whether a reference article supports a given citation statement from another article.

Inputs you will receive:
1. A citation statement from a citing article. This statement may include multiple IEEE-style citations (e.g., "[37, 38, 39]").
2. The reference article being evaluated, including:
   - The reference number (e.g., 37)
   - The title
   - The abstract
   - The top 3 most relevant excerpts (pre-selected by a language model based on the citation statement)

Your task:
Determine whether the relevant part of the citation statement is substantiated by the provided reference article.

Important rules:
- Only assess the portion of the citation statement that corresponds to the given reference number.
  For example, if the statement is:
  "X is true [37, 38] and Y is false under certain conditions [39]"
  and the reference number is 37, you should only evaluate whether the claim "X is true" is substantiated by reference 37.
- Use only the abstract and provided excerpts to make your decision. Do not assume additional content not given.

Classification labels:
- "Substantiated": The relevant part of the statement is fully supported by the reference article. It is consistent with, clearly stated in, or directly derived from the reference content.
- "Unsubstantiated": The reference does not support the statement. This includes if it contradicts the statement, omits key claims, or addresses unrelated topics.

Your output should be in JSON format with two fields:
{{
  "label": "Substantiated" | "Unsubstantiated",
  "explanation": "A short, clear explanation (1-3 sentences) justifying your label"
}}

Be concise, but ensure your explanation shows your reasoning clearly.

# The citing article
-- Title: {title} 
-- Statement: {statement}
    
# The reference article 
-- Reference Number: {reference_number}
-- Title: {reference_title} 
-- Abstract: {reference_abstract} 
-- Excerpts: \n{reference_excerpts}
"""
    
    return prompt
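
Both prompts ask for a JSON answer with the fields "label" and "explanation". A minimal sketch of how such a reply could be parsed and validated follows. It assumes the reply is available as a plain string and that the model may wrap the JSON in extra text or a code fence; `parse_model_response` and `VALID_LABELS` are hypothetical names, not part of the original notebook.

```python
import json
import re

VALID_LABELS = {"Substantiated", "Unsubstantiated"}

def parse_model_response(response_text):
    """Extract the JSON object from a model reply and validate its label."""
    # Take everything from the first "{" to the last "}" so that
    # surrounding prose or code fences are ignored.
    match = re.search(r"\{.*\}", response_text, flags=re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in response")
    result = json.loads(match.group(0))
    label = result.get("label")
    if label not in VALID_LABELS:
        raise ValueError(f"Unexpected label: {label!r}")
    return {"label": label, "explanation": result.get("explanation", "")}

# Made-up reply, wrapped in a code fence as some models do
reply = '```json\n{"label": "Unsubstantiated", "explanation": "Not mentioned."}\n```'
print(parse_model_response(reply))
```

Rejecting unknown labels early keeps malformed replies out of any later evaluation step.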

In [28]:
example_prompt = create_prompt_ai_improved(df.iloc[22])
print(example_prompt)


You are an experienced scientific writer and editor. Your task is to evaluate whether a reference article supports a given citation statement from another article.

Inputs you will receive:
1. A citation statement from a citing article. This statement may include multiple IEEE-style citations (e.g., "[37, 38, 39]").
2. The reference article being evaluated, including:
   - The reference number (e.g., 37)
   - The title
   - The abstract
   - The top 3 most relevant excerpts (pre-selected by a language model based on the citation statement)

Your task:
Determine whether the relevant part of the citation statement is substantiated by the provided reference article.

Important rules:
- Only assess the portion of the citation statement that corresponds to the given reference number.
  For example, if the statement is:
  "X is true [37, 38] and Y is false under certain conditions [39]"
  and the reference number is 37, you should only evaluate whether the claim "X is true" is substantiated