# Title and Abstract Screening: Baseline Evaluation and Uncertainty Calculation

## Project Structure

1. [Introduction](#1)
2. [Environment Setup](#2)
3. [Parameter Configuration](#3)
4. [OpenAI API Connection and Query Definition](#4)
5. [Gold Standard Import](#5)
6. [Query Execution](#6)
7. [Uncertainty Calculation](#7)
8. [Performance Evaluation](#8)
9. [Formatting & Export](#9)

## 1. Introduction <a id = 1></a>

This Jupyter notebook implements the uncertainty evaluation phase of my bachelor's thesis on active prompting for large language model-assisted title and abstract screening. GPT-4o-mini screens every abstract several times in a zero-shot setting and decides on inclusion or exclusion each time. The code then calculates the entropy of these repeated predictions to quantify how uncertain the model is about each abstract. It also computes performance metrics for each iteration, which serve as a baseline for later comparison. Based on these results, two example pools are generated:

1. An uncertain pool with high-entropy examples, following active prompting principles.
2. A certain pool with low-entropy examples, serving as a control group.

These pools will be used for few-shot and chain-of-thought annotation in the subsequent refinement phase, allowing for comparative analysis of the active prompting approach.
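
One way to draw the two pools is to rank the abstracts by entropy and take the top and bottom of the ranking. The sketch below is only an illustration, not the selection code used in the thesis: the helper `select_pools`, the pool size `k`, and the toy data are assumptions; only the `entropy` column name matches this notebook.

```python
import pandas as pd

def select_pools(df, k=2, entropy_col="entropy"):
    """Return (uncertain_pool, certain_pool) with k rows each.

    Hypothetical helper: the pool size k and the handling of ties
    (pandas keeps the first occurrences) are assumptions for illustration.
    """
    uncertain_pool = df.nlargest(k, entropy_col)   # highest entropy first
    certain_pool = df.nsmallest(k, entropy_col)    # lowest entropy first
    return uncertain_pool, certain_pool

# Toy data, not taken from the gold standard
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "entropy": [0.0, 1.0, 0.47, 0.88, 0.0],
})
uncertain_pool, certain_pool = select_pools(toy, k=2)
print(uncertain_pool["title"].tolist())
print(certain_pool["title"].tolist())
```

Because many abstracts can share the same entropy value, a real selection would probably need an explicit rule for breaking ties, for example random sampling within a tied group.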

## 2. Environment Setup <a id = 2></a>

In [None]:
# Basic Python packages
import os
import time
import json
import re

# Data preprocessing libraries
import pandas as pd
from IPython.display import display
import math

# OpenAI library
from openai import OpenAI

# load environment variables from .env
from dotenv import load_dotenv
load_dotenv()

## 3. Parameter Configuration <a id = 3></a>

In [12]:
# variables
api_key = os.getenv("OPENAI_API_KEY")
temperature = 0.7
max_tokens = 250
n_value = 1
llm_model = "gpt-4o-mini"
rate_limit_timeout = 0.15
gold_standard_file = os.path.join('data', 'gold_standard.xlsx') 
output_file = os.path.join('output', 'baseline_and_uncertainty_evaluation.xlsx')
iterations = 10

def generate_prompt(abstract): 
    return  f"""
You are an experienced researcher tasked with evaluating the relevance of scientific papers. Below is an abstract along with inclusion and exclusion criteria. Determine whether it should be included or excluded based on the criteria and provide a brief explanation for your decision. 

## Inclusion Criteria:

1. The abstract must explicitly mention automated machine learning (AutoML) or related concepts such as low-code/no-code machine learning tools or neural architecture search.
2. The abstract must explicitly mention reproducibility or related concepts such as transparency or explainability in the context of AutoML or the related concepts mentioned above.

## Exclusion Criteria:

None

## Instructions
- Provide a concise summary (max. 2 sentences) of the key factors influencing inclusion or exclusion.
- Make sure to use double quotes in response format

Return the response in json format:

{{
  "evaluation": // 0 = exclude, 1 = include
  "explanation": // "reason for exclusion"  
}}
 
## Abstract:
{abstract}
"""


## 4. OpenAI API Connection and Query Definition  <a id = 4></a>

In [13]:
#initiate openai client
client = OpenAI(api_key=api_key)

def query_openai(abstract):
    # Create a chat completion request using the gpt-4o-mini model
    response = client.chat.completions.create(
            model=llm_model,
            messages=[{"role": "system", "content": generate_prompt(abstract)}],
            max_tokens=max_tokens,        
            temperature=temperature,         
            n=n_value,                   
        )

    # The model often wraps its JSON answer in a Markdown code fence; strip the fence before parsing
    def clean_json_string(json_string):
        pattern = r'^```json\s*(.*?)\s*```$'
        cleaned_string = re.sub(pattern, r'\1', json_string, flags=re.DOTALL)
        return cleaned_string.strip()

    try:
        json_string = clean_json_string(response.choices[0].message.content)
        parsed = json.loads(json_string)
        return parsed
    # Handle parsing errors gracefully
    except json.JSONDecodeError:
        print("Error parsing JSON response:", response)
        return None  # a single None signals a failed parse to the caller
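
The fence-stripping pattern above only matches when the whole response is a single fenced block. As a hedged alternative, the sketch below pulls the first JSON object out of a response even when it is bare or surrounded by extra text. The helper `extract_json` and the sample strings are hypothetical and use only the standard library.

```python
import json
import re

def extract_json(text):
    """Return the JSON object found in text as a dict, or None.

    Hypothetical helper: handles a bare object, an object wrapped in a
    Markdown code fence, and an object surrounded by extra prose.
    """
    # Take everything from the first "{" to the last "}"
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None

fence = "`" * 3  # build the fence marker without writing it literally
fenced = fence + 'json\n{"evaluation": 1, "explanation": "Mentions AutoML."}\n' + fence
bare = '{"evaluation": 0, "explanation": "No AutoML."}'
broken = "I cannot decide."

for sample in (fenced, bare, broken):
    print(extract_json(sample))
```

Returning `None` for anything that is not a JSON object gives the calling loop a single failure value to check.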

## 5. Gold Standard Import <a id = 5></a>

In [None]:
# Load the gold standard 
df = pd.read_excel(gold_standard_file, index_col=None, dtype={'publication_year': str, 'gold_standard_evaluation': int})
df.head(5)
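
Before spending API calls, it can help to check that the imported sheet has the columns the later cells rely on. The sketch below is an assumption about what such a check might look like: `validate_gold_standard` is a hypothetical helper, and the toy frame stands in for the real file. The column names `abstract` and `gold_standard_evaluation` are the ones used in this notebook.

```python
import pandas as pd

def validate_gold_standard(df):
    """Return a list of problems found; an empty list means the frame looks usable."""
    problems = []
    for col in ("abstract", "gold_standard_evaluation"):
        if col not in df.columns:
            problems.append(f"missing column: {col}")
    if problems:
        return problems
    n_empty = int(df["abstract"].isna().sum())
    if n_empty:
        problems.append(f"{n_empty} rows without an abstract")
    bad_labels = ~df["gold_standard_evaluation"].isin([0, 1])
    if bad_labels.any():
        problems.append(f"{int(bad_labels.sum())} rows with a label other than 0 or 1")
    return problems

# Toy frame with one missing abstract and one invalid label
toy = pd.DataFrame({
    "abstract": ["AutoML and reproducibility", None, "Unrelated topic"],
    "gold_standard_evaluation": [1, 0, 2],
})
issues = validate_gold_standard(toy)
print(issues)
```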

## 6. Query Execution <a id = 6></a>

In [None]:
for i in range(iterations):
    for j, row in df.iterrows():
        abstract = row['abstract']
        eval_col_name = f"eval_{i + 1}"
        expl_col_name = f"expl_{i + 1}"

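        # request classification from openai
        data = query_openai(abstract)

        # save response data to df; skip rows whose response could not be parsed
        if isinstance(data, dict) and 'evaluation' in data:
            df.at[j, eval_col_name] = int(data['evaluation'])
            df.at[j, expl_col_name] = data.get('explanation')
        else:
            print(f"No valid response for row {j} in iteration {i + 1}")

        if j % 10 == 0:
            print(f"{j} papers of iteration {i + 1} scanned")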

        # add timeout to prevent exceeding the openai rate-limit
        time.sleep(rate_limit_timeout)

    print(f'iteration_{i + 1} complete')
 
df.head(1)
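
A fixed pause between requests does not help when a single request fails or returns an unparseable answer. One possible extension, sketched below under stated assumptions, is to retry a call a few times with a growing delay. `call_with_retries`, its default values, and the stand-in `flaky_query` are hypothetical; in the notebook the wrapped function would be `query_openai`.

```python
import time

def call_with_retries(func, *args, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func(*args) until it returns a dict or the attempts run out.

    Hypothetical helper: max_attempts, base_delay and the doubling
    schedule are illustrative defaults, not values from the notebook.
    """
    for attempt in range(max_attempts):
        try:
            result = func(*args)
            if isinstance(result, dict):
                return result
        except Exception as exc:  # in practice, catch the specific API errors
            print(f"attempt {attempt + 1} failed: {exc}")
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 0.5 s, 1 s, 2 s, ...
    return None

# Stand-in for query_openai: fails twice, then returns a parsed response
calls = {"count": 0}

def flaky_query(abstract):
    calls["count"] += 1
    if calls["count"] < 3:
        return None
    return {"evaluation": 1, "explanation": "stub"}

delays = []  # record the delays instead of actually sleeping
result = call_with_retries(flaky_query, "some abstract", sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the helper easy to test without waiting.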

## 7. Uncertainty Calculation <a id = 7></a>

In [None]:
def binary_entropy(p):
    """Calculates the binary entropy (in bits) of a probability p."""
    if p == 0 or p == 1:
        return 0 
    else:
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Share of "include" votes per abstract across all runs, then its entropy
run_columns = [col for col in df.columns if col.startswith('eval_')]
positive_probabilities = df[run_columns].astype(float).mean(axis=1)
df['entropy'] = positive_probabilities.apply(binary_entropy)

df.head(1)
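
To make the entropy values easier to interpret, the sketch below works through three vote patterns for a single abstract screened 10 times. `vote_entropy` is a hypothetical stand-alone variant of `binary_entropy` that takes the raw 0/1 votes; the vote lists are made up.

```python
import math

def vote_entropy(votes):
    """Binary entropy (in bits) of a list of 0/1 votes."""
    p = sum(votes) / len(votes)  # share of "include" votes
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

unanimous = vote_entropy([1] * 10)        # 10 of 10 runs say include
leaning = vote_entropy([1] * 9 + [0])     # 9 of 10 runs say include
split = vote_entropy([1] * 5 + [0] * 5)   # runs are evenly divided

print(round(unanimous, 3), round(leaning, 3), round(split, 3))  # 0.0 0.469 1.0
```

With 10 runs the share of positive votes can only take 11 values, and entropy is symmetric around 0.5, so at most 6 distinct entropy values can occur. Ties between abstracts are therefore common.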

## 8. Performance Evaluation <a id = 8></a>

In [None]:
analysis_data = []

for i in range(iterations):
    column_name = f'eval_{i + 1}'

    # True Positives (TP): Gold standard is 1, and the model predicted 1
    tp = ((df['gold_standard_evaluation'] == 1) & (df[column_name] == 1)).sum()
    
    # True Negatives (TN): Gold standard is 0, and the model predicted 0
    tn = ((df['gold_standard_evaluation'] == 0) & (df[column_name] == 0)).sum()
    
    # False Positives (FP): Gold standard is 0, but the model predicted 1
    fp = ((df['gold_standard_evaluation'] == 0) & (df[column_name] == 1)).sum()
    
    # False Negatives (FN): Gold standard is 1, but the model predicted 0
    fn = ((df['gold_standard_evaluation'] == 1) & (df[column_name] == 0)).sum()

    # Calculate Accuracy 
    accuracy = (tp + tn) / (tp + tn + fp + fn) if (tp + tn + fp + fn) != 0 else 0 # Avoid division by zero

    # Calculate Recall 
    recall = tp / (tp + fn) if (tp + fn) != 0 else 0  # Avoid division by zero

    # Calculate Precision
    precision = tp / (tp + fp) if (tp + fp) != 0 else 0 # Avoid division by zero
    
    # Calculate Specificity
    specificity = tn / (tn + fp) if (tn + fp) != 0 else 0 # Avoid division by zero

    # Calculate Negative Predictive Value NPV
    npv = tn / (tn + fn) if (tn + fn) != 0 else 0 # Avoid division by zero

    # Calculate F1 Score
    f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) != 0 else 0 # Avoid division by zero

    analysis_data.append({
        'Iteration': i + 1,
        'True Positives': tp,
        'True Negatives': tn,
        'False Positives': fp,
        'False Negatives': fn,
        'Accuracy': accuracy,
        'Recall': recall,
        'Precision': precision,
        'Specificity': specificity,
        'NPV': npv,
        'F1': f1_score
    })

# Create analysis DataFrame
df_analysis = pd.DataFrame(analysis_data)

# Display analysis results
print(df_analysis)
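
Since the per-iteration metrics are meant as a baseline, it may be convenient to condense them into one summary table. The sketch below shows one way to do this with pandas; the toy frame only imitates the shape of `df_analysis`, and its values are made up.

```python
import pandas as pd

# Toy per-iteration metrics in the shape of df_analysis (values are made up)
toy_analysis = pd.DataFrame({
    "Iteration": [1, 2, 3],
    "Accuracy": [0.80, 0.90, 0.85],
    "Recall": [0.70, 0.80, 0.90],
    "Precision": [0.60, 0.60, 0.60],
})

metric_cols = [c for c in toy_analysis.columns if c != "Iteration"]
# One row per statistic, one column per metric
summary = toy_analysis[metric_cols].agg(["mean", "std", "min", "max"])
print(summary.round(3))
```

The standard deviation across iterations gives a rough sense of how much the zero-shot results vary at the chosen temperature.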


## 9. Formatting & Export <a id = 9></a>

In [18]:
eval_cols = [col for col in df.columns if col.startswith('eval_')]
expl_cols = [col for col in df.columns if col.startswith('expl_')]
entropy_col = [col for col in df.columns if col.startswith('entropy')]
other_cols = [col for col in df.columns if not (col in eval_cols or col in expl_cols or col in entropy_col)]

df_formatted = df[other_cols + eval_cols + entropy_col + expl_cols] 

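# Make sure the output folder exists, then create an Excel writer object
os.makedirs(os.path.dirname(output_file) or '.', exist_ok=True)
with pd.ExcelWriter(output_file, engine='openpyxl') as writer: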
    # Write each DataFrame to a different sheet
    df_formatted.to_excel(writer, sheet_name='Original Data', index=False)  
    df_analysis.to_excel(writer, sheet_name='Analysis Results', index=False) 