# **Title and Abstract Screening Baseline and Uncertainty**

## Project Structure

1. [Introduction](#1)
2. [Loading the Relevant Libraries and Packages](#2)
3. [Loading Dataset](#3)
4. [Prompt Engineering and Title and Abstract Screening](#4)
5. [Calculation of Entropy-based Uncertainty](#5)
6. [Calculation of Performance Metrics for each individual Run](#6)
7. [Summarization of Zero-shot Performance in Boxplot graph](#7)


## 1. Introduction <a id = 1></a>

This Jupyter notebook implements the refinement phase of my Bachelor's thesis on active prompting for large language model-assisted title and abstract screening. It builds on the uncertainty evaluation phase and uses the pools of uncertain and certain examples selected there.

The notebook performs a single run of GPT-4o-mini for abstract screening with either a few-shot or a chain-of-thought (CoT) prompt. Before the run, the examples from the respective pool must be inserted manually into the prompt; the experiment depends on this step. Few-shot prompts include only the abstract and its solution, while CoT prompts additionally include the reasoning chain.

After screening, the notebook calculates the same performance metrics as the uncertainty evaluation phase, so the results are directly comparable. The goal of the refinement phase is to compare active prompting (examples from the uncertain pool) against a control group (examples from the certain pool) and to quantify the effect of each on the model's screening performance.
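
The difference between the two prompt variants can be illustrated with a small formatting helper. This is only a sketch: the field names (`abstract`, `evaluation`, `reasoning`), the helper names, and the example entry are assumptions made for illustration and are not taken from the thesis pools.

```python
from typing import Dict, List, Optional


def format_exemplar(abstract: str, label: int, reasoning: Optional[str] = None) -> str:
    """Render one exemplar: abstract and label, plus reasoning for CoT prompts."""
    lines = [f"Abstract: {abstract.strip()}"]
    if reasoning is not None:
        lines.append(f"Reasoning: {reasoning.strip()}")
    lines.append(f"Evaluation: {label}")
    return "\n".join(lines)


def format_exemplar_block(exemplars: List[Dict], chain_of_thought: bool = False) -> str:
    """Join all exemplars into one text block for the '## Examples' part of the prompt."""
    rendered = []
    for i, ex in enumerate(exemplars, start=1):
        # Few-shot prompts drop the reasoning chain even if the pool contains one
        reasoning = ex.get("reasoning") if chain_of_thought else None
        body = format_exemplar(ex["abstract"], ex["evaluation"], reasoning)
        rendered.append(f"Example {i}:\n{body}")
    return "\n\n".join(rendered)


# Hypothetical exemplar, for illustration only
pool = [{
    "abstract": "We study AutoML pipelines ...",
    "evaluation": 0,
    "reasoning": "AutoML is mentioned, but reproducibility is not.",
}]
few_shot_block = format_exemplar_block(pool)
cot_block = format_exemplar_block(pool, chain_of_thought=True)
```

The resulting text would then be pasted (or interpolated) under the `## Examples` heading of the prompt.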


## 2. Loading the Relevant Libraries and Packages <a id = 2></a>

In [None]:
# Basic Python packages
import os
import time
import json
import re

# Data preprocessing libraries
import pandas as pd
from IPython.display import display
import math

# OpenAI library
from openai import OpenAI

# load environment variables from .env
from dotenv import load_dotenv
load_dotenv()

In [20]:
# variables
api_key = os.getenv("OPENAI_API_KEY")
temperature = 0.1
max_tokens = 500
n_value = 1
llm_model = "gpt-4o-mini"
rate_limit_timeout = 0.15
gold_standard_file = os.path.join('Experiment with Certain Exemplars', 'gold_standard_certain_exemplars_removed.xlsx') 
output_file = os.path.join('Experiment with Certain Exemplars', '3_TN_Exemplars_Few-Shot.xlsx')
iterations = 1

In [None]:
# Load the gold standard 
df = pd.read_excel(gold_standard_file, index_col=None)
df = df[df.filter(regex='^(?!Unnamed)').columns]
df.head(5)

In [22]:
def generate_prompt(abstract): 
    return  f"""
          You are an experienced researcher tasked with evaluating the relevance of scientific papers. Below is an abstract along with inclusion and exclusion criteria. Determine whether it should be included or excluded based on the criteria and provide a brief explanation for your decision. 

          ## Inclusion Criteria:

          1. The abstract must explicitly mention automated machine learning (AutoML) or related concepts such as low-code/no-code machine learning tools or neural architecture search.
          2. The abstract must explicitly mention reproducibility or related concepts such as transparency or explainability in the context of AutoML or the related concepts mentioned above.

          ## Exclusion Criteria:

          None

          ## Instructions
          - Evaluate the abstract against each criterion separately.
          - For each criterion, state whether it is MET or NOT MET, and provide a brief explanation.
          - If you are UNCERTAIN about Criterion A, treat it as MET (include).
          - If you are UNCERTAIN about Criterion B, treat it as NOT MET (exclude).
          - After evaluating both criteria, provide a combined evaluation (INCLUDE or EXCLUDE).
          - For the final evaluation, use 1 for INCLUDE and 0 for EXCLUDE.
          - Provide a concise summary (max. 2 sentences) of the key factors influencing inclusion or exclusion.
          - Make sure to use double quotes in the response format.

          ## Examples: 
          // Manually insert the examples from the few-shot or chain-of-thought pool here

          
          ## Return the response in JSON format:

          {{
            "evaluation":  // 0 = exclude, 1 = include,
            "explanation": // "reason for exclusion"
          }}

          ## Abstract to Evaluate:
          {abstract}
          """

In [23]:
# Initialize the OpenAI client
client = OpenAI(api_key=api_key)

def clean_json_string(json_string):
    # The model often wraps its JSON answer in a Markdown fence (```json ... ```); strip it
    pattern = r'^```json\s*(.*?)\s*```$'
    cleaned_string = re.sub(pattern, r'\1', json_string.strip(), flags=re.DOTALL)
    return cleaned_string.strip()

def query_openai(abstract):
    # Create a chat completion request using the model configured in llm_model
    response = client.chat.completions.create(
        model=llm_model,
        messages=[{"role": "system", "content": generate_prompt(abstract)}],
        max_tokens=max_tokens,
        temperature=temperature,
        n=n_value,
    )

    content = response.choices[0].message.content
    try:
        return json.loads(clean_json_string(content))
    except json.JSONDecodeError:
        # Handle parsing errors gracefully: report the raw answer and signal failure with None
        print("Error parsing JSON response:", content)
        return None
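
The fence-stripping step above only matches answers that consist of exactly one fenced JSON block. A slightly more tolerant variant could look like the following sketch. The function name `extract_json` and the fallback to the outermost `{...}` span are my own assumptions for illustration, not part of the original experiment.

```python
import json
import re
from typing import Optional

# Three backticks, built programmatically so this snippet can itself sit inside a fence
FENCE = "`" * 3
FENCE_PATTERN = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, flags=re.DOTALL)


def extract_json(text: str) -> Optional[dict]:
    """Return the JSON object embedded in `text`, or None if none can be parsed."""
    # Prefer the content of a Markdown fence, with or without the "json" tag
    fenced = FENCE_PATTERN.search(text)
    candidate = fenced.group(1) if fenced else text

    # Fall back to the outermost {...} span in case prose surrounds the object
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(candidate[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None
```

Returning `None` for every failure mode keeps the caller's check simple: a single `is None` test decides whether a row was screened successfully.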

In [None]:
eval_col_name = "refined_evaluation"
expl_col_name = "refined_explanation"

for j, row in df.iterrows():
    abstract = row['abstract']

    # request classification from openai
    data = query_openai(abstract)

    # save response data to df; skip rows whose response could not be parsed
    if isinstance(data, dict) and 'evaluation' in data:
        df.at[j, eval_col_name] = int(data['evaluation'])
        df.at[j, expl_col_name] = data.get('explanation')
    else:
        print(f"No valid response for row {j}; leaving evaluation empty")

    if j % 10 == 0:
        print(f"{j} papers scanned")

    # add timeout to prevent exceeding the openai rate-limit
    time.sleep(rate_limit_timeout)
 
df.head(1)
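
The loop relies on a fixed pause between requests and has no recovery if a single request fails. One possible hardening is a small retry wrapper with exponential backoff, sketched below. The helper `call_with_retry` and its parameters are hypothetical, and the sketch assumes that the wrapped call signals failure by raising an exception or returning `None`.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], Optional[T]],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call `fn` until it returns a non-None value or the attempts run out.

    Exceptions and None results both count as failed attempts. The wait
    doubles after each failure: base_delay, 2 * base_delay, and so on.
    """
    for attempt in range(max_attempts):
        try:
            result = fn()
            if result is not None:
                return result
        except Exception as exc:  # broad on purpose: this is only a sketch
            print(f"Attempt {attempt + 1} failed: {exc}")
        # No need to wait after the final attempt
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return None
```

Inside the loop, the request could then be written as `data = call_with_retry(lambda: query_openai(abstract))`. The `sleep` parameter exists only so the waiting behaviour can be replaced in tests.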

In [25]:
# convert string evaluations to int; this is important for the further analysis
df['gold_standard_evaluation'] = pd.to_numeric(df['gold_standard_evaluation'], errors='coerce')
df['refined_evaluation'] = pd.to_numeric(df['refined_evaluation'], errors='coerce')

In [None]:
analysis_data = []

column_name = 'refined_evaluation'

# True Positives (TP): Gold standard is 1, and the model predicted 1
tp = ((df['gold_standard_evaluation'] == 1) & (df[column_name] == 1)).sum()
    
# True Negatives (TN): Gold standard is 0, and the model predicted 0
tn = ((df['gold_standard_evaluation'] == 0) & (df[column_name] == 0)).sum()
    
# False Positives (FP): Gold standard is 0, but the model predicted 1
fp = ((df['gold_standard_evaluation'] == 0) & (df[column_name] == 1)).sum()
    
# False Negatives (FN): Gold standard is 1, but the model predicted 0
fn = ((df['gold_standard_evaluation'] == 1) & (df[column_name] == 0)).sum()

# Calculate Accuracy 
accuracy = (tp + tn) / (tp + tn + fp + fn) if (tp + tn + fp + fn) != 0 else 0 # Avoid division by zero

# Calculate Recall 
recall = tp / (tp + fn) if (tp + fn) != 0 else 0  # Avoid division by zero

# Calculate Precision
precision = tp / (tp + fp) if (tp + fp) != 0 else 0 # Avoid division by zero
    
# Calculate Specificity
specificity = tn / (tn + fp) if (tn + fp) != 0 else 0 # Avoid division by zero

# Calculate Negative Predictive Value NPV
npv = tn / (tn + fn) if (tn + fn) != 0 else 0 # Avoid division by zero

# Calculate F1 Score
f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) != 0 else 0 # Avoid division by zero

analysis_data.append({
    'True Positives': tp,
    'True Negatives': tn,
    'False Positives': fp,
    'False Negatives': fn,
    'Accuracy': accuracy,
    'Recall': recall,
    'Precision': precision,
    'Specificity': specificity,
    'NPV': npv,
    'F1': f1_score
})

# Create analysis DataFrame
df_analysis = pd.DataFrame(analysis_data)

# Display analysis results
print(df_analysis)
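
As a sanity check, the same metric definitions can be applied to a handful of hand-made labels where the expected values are easy to verify by hand. The sketch below uses plain Python lists. The function names and the toy labels are assumptions for illustration and are not thesis data.

```python
from typing import Dict, Iterable


def safe_ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def confusion_metrics(y_true: Iterable[float], y_pred: Iterable[float]) -> Dict[str, float]:
    """Count the confusion-matrix cells for 0/1 labels and derive the metrics."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        # Any other pair (e.g. NaN after a failed parse) is counted in no cell

    precision = safe_ratio(tp, tp + fp)
    recall = safe_ratio(tp, tp + fn)
    return {
        "TP": tp, "TN": tn, "FP": fp, "FN": fn,
        "Accuracy": safe_ratio(tp + tn, tp + tn + fp + fn),
        "Recall": recall,
        "Precision": precision,
        "Specificity": safe_ratio(tn, tn + fp),
        "NPV": safe_ratio(tn, tn + fn),
        "F1": safe_ratio(2 * precision * recall, precision + recall),
    }


# Toy labels for illustration only: 2 TP, 2 TN, 1 FP, 1 FN
gold = [1, 1, 0, 0, 0, 1]
pred = [1, 0, 0, 1, 0, 1]
metrics = confusion_metrics(gold, pred)
```

With these labels, accuracy is 4/6 and recall, precision, specificity, NPV and F1 are all 2/3, which makes a miscounted cell easy to spot.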


In [27]:
# Create an Excel writer object
with pd.ExcelWriter(output_file, engine='openpyxl') as writer:  
    # Write each DataFrame to a different sheet
    df.to_excel(writer, sheet_name='Original Data', index=False)  
    df_analysis.to_excel(writer, sheet_name='Analysis Results', index=False) 