# Modular Fact-Checking System Implementation

**Henry Zelenak | Last updated: 05/12/2025**

This code implements a modular fact-checking system using Python.

**Note that the system is currently configured for "tuned_GPT-sBERTn1024-sentEx-rephsHist" runs, where fine-tuned GPT-sentEx and sBERTn1024-sentEx models are used and rephrasing history is provided to GPT-rephrase on each iteration of Module 2.**

## 0. Imports and Introduction
<a href="https://drive.google.com/file/d/1TYzbcR3QkzQNZR0IX34vhDwzFcJcukHJ/view?usp=sharing" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


Version Notes:
v4: Added claim rephrasing history for GPT_rephrase (submodule 2.1) to ensure new claims are not repeated. Limit the number of rephrasings to 5, reverting to the original claim after 5 iterations.
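
The v4 behavior can be sketched in isolation as follows. This is a minimal illustration rather than the notebook's implementation: `choose_claim` and `fake_rephrase` are hypothetical names, and the stub replaces the GPT call so the example runs offline. It mirrors the iteration check used in Module 2: the first iteration uses the original claim, later iterations below the cap request a rephrasing that is not already in the history, and iterations at or past the cap revert to the original claim.

```python
def fake_rephrase(claim, history):
    """Hypothetical stand-in for the GPT rephraser: returns a variant not in history."""
    n = 1
    candidate = f"{claim} (variant {n})"
    while candidate in history:
        n += 1
        candidate = f"{claim} (variant {n})"
    return candidate


def choose_claim(claim, iteration, history, max_rephrasings=5, rephrase=fake_rephrase):
    """Pick the claim text for one iteration and record it in the history."""
    if iteration == 0 or iteration >= max_rephrasings:
        this_claim = claim  # first pass, or past the cap: use the original claim
    else:
        this_claim = rephrase(claim, history)
    history.append(this_claim)
    return this_claim


history = []
claims = [choose_claim("The sky is blue.", i, history) for i in range(7)]
for i, c in enumerate(claims):
    print(i, c)
```

Keeping the history as a list makes the "not already used" check explicit; the notebook instead passes the history to the model as newline-separated text.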

In [None]:
# Ensure fever-scorer is installed correctly (assuming previous steps worked)
!git clone -b release-v2.0 https://github.com/sheffieldnlp/fever-scorer.git
%cd fever-scorer
!pip install -r requirements.txt

# Overwrite line 12 of setup.py with 'license="MIT"', then write the file back
with open('setup.py', 'r') as f:
    lines = f.readlines()
lines[11] = 'license="MIT"\n'
with open('setup.py', 'w') as f:
    f.writelines(lines)  # The with-block closes the file; no explicit close() needed
print("setup.py updated")
!pip install .
%cd ..

# Install necessary libraries
!pip install rouge-score sentence-transformers wikipedia
import pandas as pd
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from rouge_score import rouge_scorer
import openai
from openai import OpenAI
import numpy as np
from nltk import Tree, pos_tag, word_tokenize, ne_chunk
from nltk.corpus import stopwords
from fever.scorer import fever_score # Import the FEVER scorer
from nltk import RegexpParser
import json
from sentence_transformers import SentenceTransformer, util
import requests
from bs4 import BeautifulSoup
import re
import ast
import time # For logging
import wikipedia # For fetching wikipedia content
import os
from tqdm import tqdm
tqdm.pandas()
import gc
import datetime

# Download necessary NLTK data files (ensure they are downloaded)
nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger_eng', quiet=True) # Added _eng suffix, common naming
nltk.download('maxent_ne_chunker', quiet=True)
nltk.download('words', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('treebank', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('maxent_ne_chunker_tab', quiet=True)


## 1. Load Data and Models
<a id="1"></a>

In [None]:
# Mount google drive (if using Colab)
try:
    from google.colab import drive, userdata
    drive.mount('/content/drive')
    # Adjust path as needed
    BASE_DIR = '/content/drive/My Drive/SUNY_Poly_DSA598/'
    DATA_DIR = os.path.join(BASE_DIR, 'datasets/FEVER/')
    # Ensure the directory exists
    # os.makedirs(DATA_DIR, exist_ok=True)

    # Set API Key securely
    api_key = userdata.get('openaikey')

    # Assuming data files are in DATA_DIR after mounting
    test_path = "/content/drive/My Drive/SUNY_Poly_DSA598/datasets/FEVER/paper_test.jsonl" # Explicit path example

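except ModuleNotFoundError:
    print("Not running in Colab or google libraries not found. Assuming local setup.")

    # Local fallbacks (assumptions): read the key from the OPENAI_API_KEY environment
    # variable and resolve model paths relative to the working directory
    api_key = os.environ.get("OPENAI_API_KEY")
    BASE_DIR = "./"

    # Fallback path if not in Colab drive structure
    if os.path.exists("./datasets/FEVER/paper_test.jsonl"):
        test_path = "./datasets/FEVER/paper_test.jsonl"
    else:
        test_path = "paper_test.jsonl"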

def load_jsonl(file_path, encoding='utf-8'):
    """Loads a JSON Lines file into a list of Python objects."""
    data = []
    try:
        with open(file_path, 'r', encoding=encoding) as f:
            for line in f:
                try:
                    data.append(json.loads(line))
                except json.JSONDecodeError:
                    print(f"Warning: Skipping invalid JSON line in {file_path}: {line.strip()}")
    except FileNotFoundError:
        print(f"ERROR: File not found at {file_path}")
        return None # Return None or empty list on error
    return data

# Load test dataset
test_data = load_jsonl(test_path)
if test_data is None:
    # Raising SystemExit stops the cell with a clear message
    raise SystemExit("Exiting due to missing test data file.")

print(f"Loaded {len(test_data)} items from {test_path}")

# Initialize OpenAI client
if api_key:
    query_client = OpenAI(api_key=api_key)
    sentEx_client = OpenAI(api_key=api_key)
    rephrase_client = OpenAI(api_key=api_key)
    nli_client = OpenAI(api_key=api_key)
    print("OpenAI API key found. Client initialized.")
else:
    raise SystemExit("ERROR: OpenAI API key not found. Please set it up.")

#### INITIALIZE SBERT ####
# sbert_model = SentenceTransformer('all-MiniLM-L6-v2')
# More performant model
#sbert_model = SentenceTransformer('all-mpnet-base-v2')
# FINE TUNED MODEL
model_path = f"{BASE_DIR}models/sBERT/all-mpnet-base-v2_n1024_04-20_12:22_(ORCL_TEST)"
sbert_model = SentenceTransformer(model_path)

#### FINE-TUNED GPT-4o MODEL NAMES (GPT_sentEx active; GPT_clf and GPT_query commented out)
sentEx_ft_model = 'ft:gpt-4o-2024-08-06:personal::BOUBY8Au'
#clf_ft_model = 'ft:gpt-4o-2024-08-06:personal::BOUBZKQy'
#query_ft_model = 'ft:gpt-4o-2024-08-06:personal::BR9osbAZ:ckpt-step-150'

Mounted at /content/drive
Loaded 9999 items from /content/drive/My Drive/SUNY_Poly_DSA598/datasets/FEVER/paper_test.jsonl
OpenAI API key found. Client initialized.
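
For reference, the JSON Lines handling used above can be exercised without the dataset. The sketch below writes a two-line temporary file and reads it back; `read_jsonl` is a hypothetical helper, and the sample record is invented (its field names follow the public FEVER format as an assumption, not something read from `paper_test.jsonl`). The second line is deliberately malformed to show that a bad line is skipped rather than aborting the load.

```python
import json
import os
import tempfile


def read_jsonl(path, encoding="utf-8"):
    """Parse one JSON object per line, counting lines that fail to decode."""
    records, skipped = [], 0
    with open(path, "r", encoding=encoding) as f:
        for line in f:
            if not line.strip():
                continue  # ignore blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                skipped += 1
    return records, skipped


# Invented sample rows; the second line is deliberately malformed
sample_lines = [
    json.dumps({"id": 1, "label": "SUPPORTS", "claim": "Example claim.",
                "evidence": [[[None, None, "Example_Page", 0]]]}),
    "{not valid json",
]

with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
    tmp.write("\n".join(sample_lines) + "\n")
    tmp_path = tmp.name

records, skipped = read_jsonl(tmp_path)
os.remove(tmp_path)
print(f"parsed={len(records)} skipped={skipped}")
```

Returning the skipped count alongside the records makes it easy to notice a corrupted file, which a printed warning alone can hide in long notebook output.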


## 2. Module 1: Document Retrieval
<a id="2"></a>

### 2.1 Helper Functions (Entity/Keyword Extraction, Near Match) (Submodule 1.1)
<a id="2.1"></a>

In [4]:
# --- Entity and Keyword Extraction (Submodule 1.1) ---
stop_words = set(stopwords.words('english'))

def extract_entities(text):
    """Extracts named entities using NLTK."""
    tokens = word_tokenize(text)
    tagged_tokens = pos_tag(tokens)
    named_entities = ne_chunk(tagged_tokens)
    entities = []
    for subtree in named_entities:
        if isinstance(subtree, Tree):
            # Improve entity extraction: filter by common NE types if needed
            # if hasattr(subtree, 'label') and subtree.label() in ['PERSON', 'ORGANIZATION', 'GPE', 'LOCATION']:
            entity = " ".join([word for word, tag in subtree.leaves()])
            entities.append(entity)
    # Simple post-processing: remove duplicates and potentially filter short/generic entities
    entities = sorted(list(set(entities)), key=len, reverse=True) # Prioritize longer entities
    return entities

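def extract_keywords(text, max_keywords=5):
    """
    Extracts keywords by TF-IDF weight.
    Note: the vectorizer is fit on a single document, so the IDF term is the same
    for every word and the ranking reduces to normalized term frequency.
    """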
    try:
        vectorizer = TfidfVectorizer(stop_words='english', max_features=50) # Use more features initially
        tfidf_matrix = vectorizer.fit_transform([text])
        feature_names = vectorizer.get_feature_names_out()
        # Get scores for the single document
        scores = tfidf_matrix.toarray().flatten()
        # Get indices of top N scores
        top_indices = scores.argsort()[-max_keywords:][::-1]
        keywords = [feature_names[i] for i in top_indices]
        return keywords
    except ValueError:
        # Handle case where text might be too short or only contains stop words
        return []


# --- Near Match Function ---
def near_match(a, b, threshold=0.9, verbose=0):
    """
    Checks if two strings are similar based on Jaccard similarity of their
    lowercase, whitespace-split word sets. Returns False if either string is empty.
    """
    if not a or not b: # Handle empty strings
        return False
    set_a = set(a.lower().split())
    set_b = set(b.lower().split())
    intersection = len(set_a.intersection(set_b))
    union = len(set_a.union(set_b))
    if union == 0: # Both strings contained only whitespace
        return a == b # Match only if the strings are identical
    sim = intersection / union # Jaccard similarity

    if verbose >= 1:
        print(f"Comparing:\n  A: '{a}'\n  B: '{b}'\n  Similarity: {sim:.4f} (Threshold: {threshold}) -> Match: {sim >= threshold}")
    return sim >= threshold
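
A short worked example of the Jaccard computation used by `near_match` may help when choosing a threshold. The `jaccard` helper below is a hypothetical re-statement of the same formula (intersection size over union size of the lowercase word sets), written separately so the raw score can be inspected. Because tokens are split on whitespace only, punctuation stays attached to words, so a single trailing period changes the score.

```python
def jaccard(a, b):
    """Jaccard similarity over lowercase, whitespace-split word sets."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Same words, different casing: the sets are identical
same = jaccard("Paris is the capital of France.", "paris is the capital of france.")

# One trailing period differs: "france." and "france" are distinct tokens,
# so the intersection has 5 words and the union has 7
close = jaccard("Paris is the capital of France.", "Paris is the capital of France")

print(f"same={same:.3f} close={close:.3f}")
```

With the default threshold of 0.9, the second pair would not count as a near match even though the sentences differ by one character, which is worth keeping in mind when LLM output is compared against candidate sentences.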

In [5]:
# --- Module 1: Document Retrieval ---
query_client = OpenAI(api_key=api_key)
def query_generator(claim, keywords, entities, max_pages_to_fetch, temp=0.3, debug=False):
    """
    **UPDATED:** Generates candidate Wikipedia page titles for a claim with an LLM call (gpt-4o-mini).
    The model is asked for a bracketed list of lowercase, underscore-separated titles,
    which is parsed with ast.literal_eval.

    Args:
        claim (str): The input claim (context).
        keywords (list of str): Keywords (currently unused).
        entities (list of str): Extracted entities (currently used only in debug output).
        max_pages_to_fetch (int): Maximum number of titles to return.
        temp (float): Sampling temperature for the LLM call.
        debug (bool): Enable debug printing.

    Returns:
        list: Up to max_pages_to_fetch unique page titles (strings), shortest first.
    """
    # Earlier version (no LLM call): use the entities directly as page titles,
    # replacing spaces with underscores (common Wikipedia format)
    #potential_titles = [entity.replace(" ", "_") for entity in entities]

    potential_titles = []
    try:
        prompt = f"Given the claim '{claim}', list the most relevant Wikipedia page titles likely to contain evidence. Include key facts about the claim, such as the type of items mentioned. Respond with only a bracketed list of lowercase page titles with spaces as underscores, each title wrapped in single quotes and separated by a comma."
        response = query_client.chat.completions.create(
            model='gpt-4o-mini',
            messages=[
                {"role": "system", "content": "You are an assistant that identifies relevant Wikipedia page titles based on a claim and entities."},
                {"role": "user", "content": prompt},
            ],
            max_tokens=256,
            temperature=temp,
        )
        llm_titles_str = response.choices[0].message.content.strip()
        if debug:
            print(f"DEBUG 1.0 (query_generator):")
            print(f"\tClaim: {claim}")
            print(f"\tEntities: {entities}")
            print(f"\tLLM Output: {llm_titles_str}")
            print("-_-" * 5)
        try:
            llm_titles = ast.literal_eval(llm_titles_str)
            if isinstance(llm_titles, list):
                potential_titles.extend(llm_titles)
        except (ValueError, SyntaxError):
            print(f"Warning: LLM returned non-list format for titles: {llm_titles_str}")
    except Exception as e:
         print(f"Warning: LLM call for query generation failed: {e}")

    # Remove duplicates and limit the number of titles
    unique_titles = sorted(list(set(potential_titles)), key=len) # Keep unique, maybe shorter titles are base articles

    # Limit the number of pages to fetch to avoid excessive API calls/cost
    selected_titles = unique_titles[:max_pages_to_fetch]


    if debug:
        print(f"DEBUG 1.1 (query_generator):")
        print(f"\tEntities: {entities}")
        print(f"\tGenerated Potential Titles: {unique_titles}")
        print(f"\tSelected Titles for Retrieval: {selected_titles}")
        print("-_-" * 10)

    return selected_titles


disambiguate_options_client = OpenAI(api_key=api_key)
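def retrieve_documents_from_wikipedia(page_titles, claim, entities, num_search_results=2, temp=0.2, debug=False):
    """
    **UPDATED:** Retrieves document content (introduction) from specific Wikipedia page titles.
    Uses the 'wikipedia' library for API access.

    Args:
        page_titles (list of str): List of Wikipedia page titles to fetch.
        claim (str): The claim, used to pick the closest search result and to disambiguate.
        entities (list of str): Entities from the claim, passed to the disambiguation prompt.
        num_search_results (int): Number of search results to request per title.
        temp (float): Currently unused; the disambiguation call uses a fixed temperature of 0.3.
        debug (bool): Enable debug printing.

    Returns:
        tuple: (list of str, list of str, list of str, int):
                 - documents: List of retrieved document introduction texts.
                 - document_sources: List of corresponding page titles from which content was retrieved.
                 - total_tokens: NLTK word tokens of all retrieved documents concatenated.
                 - Number of retrieved documents.
    """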
    documents = []
    document_sources = []
    wikipedia.set_lang("en") # Ensure English Wikipedia

    if not page_titles:
        if debug:
            print("DEBUG 1.2 (retrieve_documents): No page titles provided.")
        return [], [], [], 0 # Same 4-tuple shape as the normal return

    for title in page_titles:
        try:
            # Suggestion handling: wikipedia library can sometimes find pages even with slight title variations
            search_results = wikipedia.search(title, results=num_search_results)
            if not search_results:
                 if debug:
                     print(f"DEBUG 1.2: No Wikipedia page found for potential title: '{title}'")
                 continue

            # Get the closest match in the search result titles to the claim using sBERT
            claim_embedding = sbert_model.encode(claim, convert_to_tensor=True)
            search_results_embeddings = sbert_model.encode(search_results, convert_to_tensor=True)
            similarities = util.pytorch_cos_sim(claim_embedding, search_results_embeddings)[0]
            closest_index = similarities.argmax().item()
            actual_title = search_results[closest_index]

            # Get the page object (handle disambiguation / page errors)
            page = wikipedia.page(actual_title, auto_suggest=False, redirect=True) # Use actual title now

            # Extract introduction (summary)
            # The library's summary often captures the intro well. No sentence limit is applied here.
            intro_text = page.summary
            sentences = nltk.sent_tokenize(intro_text)
            content = " ".join(sentences)

            # Basic cleaning (redundant if summary is clean, but good practice)
            content = re.sub(r'\s+', ' ', content).strip() # Normalize whitespace

            if content:
              if page.title not in document_sources:
                documents.append(content)
                document_sources.append(page.title) # Use the canonical title from the page object
                if debug:
                    print(f"DEBUG 1.2: Successfully retrieved intro from '{page.title}' (searched for '{title}')")
              else:
                if debug:
                    print(f"DEBUG 1.2: Skipping duplicate intro for '{page.title}' (searched for '{title}')")
            else:
                 if debug:
                    print(f"DEBUG 1.2: Empty content retrieved for page '{page.title}'")

        except wikipedia.exceptions.PageError:
            if debug:
                print(f"DEBUG 1.2: PageError - Wikipedia page not found for title: '{title}' (or '{actual_title}')")
        except wikipedia.exceptions.DisambiguationError as e:
            if debug:
                print(f"DEBUG 1.2: DisambiguationError for title: '{title}'. Options: {e.options}")
            # Match the options to the claim and entities with gpt-4o-mini
            prompt = f"Given the claim '{claim}' and the entities '{entities}', choose the most relevant Wikipedia page title from the following options: {e.options}. Respond with only the selected title."
            response = disambiguate_options_client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": "You are an assistant that selects the most relevant Wikipedia page title from a list of options."},
                    {"role": "user", "content": prompt},
                ],
                max_tokens=100,
                temperature=0.3,
            )
            selected_title = response.choices[0].message.content.strip()
            if selected_title in e.options:
                try:
                    page = wikipedia.page(selected_title, auto_suggest=False, redirect=True)
                    intro_text = page.summary
                    sentences = nltk.sent_tokenize(intro_text)
                    content = " ".join(sentences)
                    content = re.sub(r'\s+', ' ', content).strip() # Normalize whitespace

                    if content:
                        documents.append(content)
                        document_sources.append(page.title) # Use the canonical title from the page object
                        if debug:
                            print(f"DEBUG 1.2: Successfully retrieved intro from '{page.title}' (disambiguated to '{selected_title}')")
                    else:
                         if debug:
                            print(f"DEBUG 1.2: Empty content retrieved for disambiguated page '{page.title}'")
                except Exception as e:
                    print(f"Warning: Error retrieving disambiguated page '{selected_title}': {e}")
            else:
                print(f"Warning: Selected title '{selected_title}' not in disambiguation options.")
        except requests.exceptions.RequestException as e:
            print(f"Warning: Network error retrieving '{title}': {e}")
            # Optional: Implement retry logic
            time.sleep(1) # Basic wait on error
        except Exception as e:
            print(f"Warning: Unexpected error retrieving '{title}': {e}")

    if debug:
        print(f"DEBUG 1.2: Retrieved content for {len(documents)} pages out of {len(page_titles)} potential titles.")
        print("-_-" * 10)
        print("-------------------------------------------------------------------\n")

    # Concatenate the documents into one string and tokenize it with nltk
    all_text = " ".join(documents)
    total_tokens = nltk.word_tokenize(all_text)

    return documents, document_sources, total_tokens, len(documents)
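
The title-parsing step in `query_generator` can be tested without an API call. The sketch below is an illustration under stated assumptions: `parse_title_list` is a hypothetical helper, the reply strings are invented, and the alphabetical tie-break is an addition (the notebook sorts by length only). It shows how `ast.literal_eval` accepts a well-formed bracketed list and rejects free-text replies.

```python
import ast


def parse_title_list(raw, max_titles=3):
    """Parse an LLM reply expected to look like "['a', 'b']" into a short title list."""
    try:
        parsed = ast.literal_eval(raw.strip())
    except (ValueError, SyntaxError):
        return []  # reply was not a Python literal
    if not isinstance(parsed, list):
        return []
    titles = {t for t in parsed if isinstance(t, str)}  # de-duplicate, keep strings only
    # Shortest first; ties broken alphabetically so the order is reproducible
    return sorted(titles, key=lambda t: (len(t), t))[:max_titles]


good = parse_title_list("['eiffel_tower', 'paris', 'paris', 'gustave_eiffel']", max_titles=2)
bad = parse_title_list("Here are some titles: eiffel_tower, paris")
print(good, bad)
```

Sorting a set by length alone leaves the order of equal-length titles unspecified, so the extra sort key is one way to make repeated runs select the same pages.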

## 3. Module 2: Evidence Sentence Extraction
<a id="3"></a>

In [None]:
# --- Module 2 Helper: GPT-4o-mini claim rephrasing (New) --- #
rephrase_client = OpenAI(api_key=api_key)
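def rephrase_claim(claim, rephrs_history="", rephrs_temp=0.5, debug=False):
  """
  **NEW:** Rephrases the claim using GPT-4o-mini.

  Args:
      claim (str): The input claim.
      rephrs_history (str): Newline-separated earlier versions of the claim that must not be reused.
      rephrs_temp (float): Sampling temperature for the LLM call.
      debug (bool): Enable debug printing.

  Returns:
      str: The rephrased claim.
  """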

  if rephrs_history != "":
    prompt = f"Rephrase the claim '{claim}' to encompass the same meaning but with different wording. Do not change the meaning or add any new information. Do not use any of the previous versions:\n {rephrs_history}"
  else:
    prompt = f"Rephrase the claim '{claim}' to encompass the same meaning but with different wording. Do not change the meaning or add any new information."


  response = rephrase_client.chat.completions.create(
    model="gpt-4o-mini", # Use specific model
    messages=[
        {"role": "system", "content": prompt},
        {"role": "user", "content": claim},
    ],
    max_tokens=512, # Adjust based on expected output length
    n=1,
    stop=None,
    temperature=rephrs_temp, # Lower temp for more deterministic output
  )
  rephrased_claim = response.choices[0].message.content.strip()

  if debug:
      print(f"DEBUG 2.1.1 (rephrase_claim):")
      #print(f"\tOriginal Claim: {claim}")
      print(f"\tRephrased Claim: {rephrased_claim}")
      print("-_-" * 5)

  # Return the rephrased claim
  return rephrased_claim


# --- Module 2 Helper: sBERT Filtering (Updated) ---

def sbert_slide_filter(documents, doc_sources, claim, sbert_threshold, debug=False):
    """
    **UPDATED:** Performs sentence filtering using sBERT similarity.
    Processes each document individually, assigning sentence IDs.
    Returns candidates as [source, id, text, score].

    Args:
        documents (list of str): List of document texts (introductions).
        doc_sources (list of str): Corresponding source identifiers (page titles).
        claim (str): The claim text.
        sbert_threshold (float): The similarity threshold.
        debug (bool): Enable debug printing.

    Returns:
        tuple: (list, int):
                 - candidates: [[page_title, sentence_id, sentence_text, similarity_score], ...], sorted by score (descending).
                 - total_tokens: Whitespace-split token count (int).
    """
    all_candidates = []
    total_tokens = 0 # Running token count across all documents
    claim_embedding = sbert_model.encode(claim, convert_to_tensor=True)

    if len(documents) != len(doc_sources):
        print("Error: Mismatch between documents and sources count in sbert_slide_filter.")
        return [], 0 # Keep the (candidates, total_tokens) return shape

    for doc_text, source_id in zip(documents, doc_sources):
        sentences = nltk.sent_tokenize(doc_text)
        if not sentences:
            continue

        # Accumulate the whitespace-split token count for this document
        total_tokens += sum(len(sent.split()) for sent in sentences)

        # sBERT was trained on sentences with the page title appended, using the FEVER bracket encoding.
        # Converting the page title to that encoding (disabled block below) RESULTED IN A LOWER SCORE
        # FOR THE FINE-TUNED MODEL: TESTED 04-25-25 AND REMOVED 04-26-25.
        # Results from that test:
        """
        --- FEVER Scoring Results ---
        Strict Score (Exact Match): 43.33%
        Label Accuracy: 70.00%
        Evidence Precision: 41.50%
        Evidence Recall: 35.00%
        Evidence F1 Score: 37.97%
        Number of test cases scored: 30
        """

        """
        bracket_mapping = {
            "(": "-LRB-",
            ")": "-RRB-",
            "[": "-LSB-",
            "]": "-RSB-"
        }
        # Convert the source_id to the same format
        for key, value in bracket_mapping.items():
            source_id = source_id.replace(key, value)
        """

        ############################################################################
        ### FINE-TUNED MODEL MODE: APPEND THE PAGE TITLE TO EACH CANDIDATE SENTENCE
        # Append the source_id to the end of each sentence for the fine-tuned model
        for i, sentence in enumerate(sentences):
            sentences[i] = sentence + " " + source_id # Append source_id (page title) with a space separator
        ############################################################################

        # Encode all sentences in the document at once for efficiency
        sentence_embeddings = sbert_model.encode(sentences, convert_to_tensor=True)

        # Calculate cosine similarities between claim and all sentences in this doc
        similarities = util.pytorch_cos_sim(claim_embedding, sentence_embeddings)[0] # Shape [1, num_sentences] -> [num_sentences]

        for i, sentence in enumerate(sentences):
            similarity_score = similarities[i].item() # Get scalar value

            if similarity_score >= sbert_threshold:
                candidate = [source_id, i, sentence, similarity_score]
                all_candidates.append(candidate)
                if debug == 2: # More verbose debug
                     print(f"DEBUG 2.2.1 (sbert_filter):")
                     print(f"\tClaim: {claim[:50]}...")
                     print(f"\tDoc: {source_id}, Sent ID: {i}")
                     print(f"\tSentence: {sentence[:100]}...")
                     # print(f"\tSentence+Source (optional): {sentence_with_source[:100]}...")
                     print(f"\tSimilarity: {similarity_score:.4f} (Threshold: {sbert_threshold}) -> PASSED")
                     print("-_-" * 5)

    # Sort candidates by similarity score (descending) - helps LLM prioritize
    all_candidates.sort(key=lambda x: x[3], reverse=True)

    if debug:
       print(f"DEBUG 2.2.2 (sbert_filter):")
       print(f"\tTotal candidates found across all docs: {len(all_candidates)}")
       # print(f"\tTop 3 candidates: {all_candidates[:3]}") # Print top few if needed
       print("-_-" * 10)

    return all_candidates, total_tokens


# --- Module 2 Helper: LLM Sentence Selection (Updated Prompt) ---
sentEx_client = OpenAI(api_key=api_key)
def extract_sentences_with_llm(claim, candidate_sentences_text, prompt, sentEx_temp, debug=False):
    """
    **UPDATED:** Extracts sentences using an LLM based on provided candidates.
    Prompt adjusted for selection task.

    Args:
        claim (str): The input claim.
        candidate_sentences_text (list of str): Candidate sentences provided by sBERT.
        prompt (str): The specific prompt for the LLM (should guide selection).
        sentEx_temp (float): Sampling temperature for the LLM call.
        debug (bool): Enable debug printing.

    Returns:
        list of str: Selected sentences (as strings). Returns ["NOT ENOUGH INFO"] on failure or specific LLM response.
    """

    if not candidate_sentences_text:
      if debug:
        print("Warning: No candidates from sBERT to select from.")
      return ["NOT ENOUGH INFO"] # No candidates to select from

    # Format candidates for the prompt (e.g., numbered list)
    formatted_candidates = "\n".join([f"{i+1}. {s}" for i, s in enumerate(candidate_sentences_text)])

    full_prompt = f"{prompt}\n\nClaim: {claim}\n\nSelect from these candidate sentences:\n{formatted_candidates}"

    if debug > 1:
        print(f"DEBUG 2.3.2 (LLM Selection):")
        print(f"\tLLM Prompt (partial):\n{prompt}\n...") # Show base prompt
        print(f"\tNum Candidates Sent to LLM: {len(candidate_sentences_text)}")
        # print(f"\tCandidates: {formatted_candidates}")
        print("-_-" * 5)

    try:
        response = sentEx_client.chat.completions.create(
            model=sentEx_ft_model, # Use the specified model
            messages=[
                {"role": "system", "content": "You are a helpful assistant. Select sentences from the provided list that are evidence for the claim. Return ONLY the selected sentences, each on a new line. If none are relevant, respond ONLY with 'NOT ENOUGH INFO'."},
                {"role": "user", "content": full_prompt},
            ],
            max_tokens=512, # Adjust based on expected output length
            n=1,
            stop=None,
            temperature=sentEx_temp, # Lower temp for more deterministic selection
        )
        llm_output = response.choices[0].message.content.strip()

        if debug:
            print(f"DEBUG 2.3.3 (LLM Selection):")
            print(f"\tLLM Raw Output:\n{llm_output}")
            print("-_-" * 5)

        if "NOT ENOUGH INFO" in llm_output.upper():
            # NEI mixed with other text is still treated as NEI, because the model
            # was instructed to return NEI only when nothing is relevant.
            if llm_output.upper() != "NOT ENOUGH INFO":
                print("Warning: LLM output contained 'NOT ENOUGH INFO' along with other text. Interpreting as NEI.")
            return ["NOT ENOUGH INFO"]


        # Split the response into sentences, removing empty lines
        selected_sentences = [s.strip() for s in llm_output.split('\n') if s.strip()]

        # Optional: Post-process LLM output - remove potential numbering (e.g., "1. Sentence text")
        processed_sentences = []
        for s in selected_sentences:
            match = re.match(r'^\s*\d+\.\s*(.*)', s) # Matches "1. ", " 2. ", etc.
            if match:
                processed_sentences.append(match.group(1).strip())
            else:
                processed_sentences.append(s) # Keep as is if no numbering pattern

        return processed_sentences

    except Exception as e:
        print(f"Error during LLM call in extract_sentences_with_llm: {e}")
        return ["NOT ENOUGH INFO"] # Treat errors as failure to find info


# --- Module 2 Main Control Flow (Updated) ---

def module_2_2_controls(claim, documents, doc_sources, entities, keywords, initial_sbert_thresh=0.2, min_sbert_thresh=0.1, thresh_decay=0.05, max_evidence=5, max_iterations=5, near_match_thresh=0.9, rephrs_temp=0.3, sentEx_temp=0.3, verbose=0, debug=False):
    """
    **UPDATED:** Module 2 implementing iterative sBERT -> LLM selection with reassociation.

    Args:
        claim (str): The input claim.
        documents (list of str): List of retrieved document texts.
        doc_sources (list of str): Corresponding source identifiers (page titles).
        entities (list of str): Entities from the claim.
        keywords (list of str): Keywords from the claim.
        initial_sbert_thresh (float): Starting sBERT similarity threshold.
        min_sbert_thresh (float): Minimum sBERT threshold allowed.
        thresh_decay (float): Amount to decrease threshold if LLM selects few sentences.
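        max_evidence (int): Target number of evidence sentences.
        max_iterations (int): Maximum number of sBERT -> LLM iterations.
        near_match_thresh (float): Jaccard similarity threshold for near-match comparisons.
        rephrs_temp (float): Sampling temperature for GPT-rephrase.
        sentEx_temp (float): Sampling temperature for GPT-sentEx.
        verbose (int): Verbosity level.
        debug (bool): Enable debug printing.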

    Returns:
        tuple: (list, str, dict):
                 - final_evidence_ids: List of selected evidence: [[page_title, sentence_id], ...]
                 - status: "OK" or "NOT ENOUGH INFO".
                 - report: Dictionary with run details.
    """
    if verbose: print("###### M2: Starting Evidence Extraction ######")

    final_evidence_ids = [] # Stores [[page_title, sentence_id]]
    all_sbert_candidates_map = {} # Store all candidates found { (title, id) : [title, id, text, score] } to avoid duplicates and for re-association
    selected_candidate_keys = set() # Keep track of (title, id) keys already selected

    current_sbert_thresh = initial_sbert_thresh
    sbert_total_tokens = 0

    llm_total_tokens = 0
    llm_total_sentences = 0

    # Define rephrs_history for rephrasing
    rephrs_history = ""

    # Define prompts (using entities/keywords)
    entity_str = ", ".join(entities) if entities else "relevant entities"
    keyword_str = ", ".join(keywords) if keywords else "relevant keywords"
    prompts = {
      "init": f"Retrieve sentences from the list that either support or refute the following claim. Specifically, focus on sentences mentioning {entity_str}. Order the sentences by relevance, highest first, and return a list separated by the return character. If there are no relevant sentences, respond with 'NOT ENOUGH INFO'. DO NOT CREATE ANY SENTENCES THAT ARE NOT IN THE PROVIDED LIST, AND DO NOT TRUNCATE THE SENTENCE.",
      "followup": f"You didn’t find enough sentences. Find additional (new) sentences that are relevant to key points in the claim. Order the sentences by relevance, highest first, and return a list separated by the return character. If there are no relevant sentences, respond with 'NOT ENOUGH INFO'. DO NOT CREATE ANY SENTENCES THAT ARE NOT IN THE PROVIDED LIST, AND DO NOT TRUNCATE THE SENTENCE.",
    }

    if debug:
        print(f"DEBUG 2.1 (module_2_controls):")
        print(f"\tClaim: {claim}")
        print(f"\tEntities: {entities}")
        print(f"\tKeywords: {keywords}")
        print(f"\tInitial sBERT Thresh: {initial_sbert_thresh}, Min Thresh: {min_sbert_thresh}")
        print(f"\tMax Evidence Target: {max_evidence}, Max Iterations: {max_iterations}")
        print("-_-" * 10)

    for iteration in range(max_iterations):
        if len(final_evidence_ids) >= max_evidence:
            if verbose: print(f"M2 Iter {iteration}: Reached target evidence count ({len(final_evidence_ids)}).")
            break

        if debug:
            print(f"DEBUG 2. Iteration {iteration+1}/{max_iterations}, Current sBERT Thresh: {current_sbert_thresh:.3f}")

        ##### GPT_REPHRASE CALL #####
        # Rephrase the claim for better context
        if iteration == 0:
            this_claim = claim # Use original claim for the first iteration
        elif iteration < 5:
            this_claim = rephrase_claim(claim, rephrs_history, rephrs_temp, debug) # Ask for a rephrasing not already in the history
        else:
            this_claim = claim # Use original claim again if too many iterations (too expensive)
        rephrs_history += f"\n{this_claim}" # Add to history for next iteration

        ##### SBERT_SENTEX CALL #####
        # 1. Get sBERT Candidates (at current threshold)
        # 1.1: Calculate the total number of tokens across all documents
        sbert_candidates, iter_tokens = sbert_slide_filter(documents, doc_sources, this_claim, current_sbert_thresh, debug=debug)
        sbert_total_tokens += iter_tokens

        # Store new candidates and identify *new* ones for this iteration's LLM input
        new_candidates_for_llm = []
        current_iter_candidate_details = [] # Store details [[title, id, text],...] for re-association
        for cand in sbert_candidates:
            key = (cand[0], cand[1]) # (title, id)
            if key not in all_sbert_candidates_map:
                all_sbert_candidates_map[key] = cand # Store full details
            # Only consider candidates not already selected for the LLM input
            if key not in selected_candidate_keys:
                 new_candidates_for_llm.append(cand[2]) # Add sentence text to LLM input list
                 current_iter_candidate_details.append([cand[0], cand[1], cand[2]]) # Store [title, id, text] for matching

        if not new_candidates_for_llm:
            if verbose: print(f"M2 Iter {iteration+1}: No new candidates found by sBERT at threshold {current_sbert_thresh:.3f}.")
            # Option: Lower threshold aggressively or break if already low
            if current_sbert_thresh > min_sbert_thresh:
                 current_sbert_thresh = max(min_sbert_thresh, current_sbert_thresh - thresh_decay * 2) # Faster decay if nothing found
                 if verbose: print(f"   Lowering threshold to {current_sbert_thresh:.3f} for next try.")
                 continue # Try again with lower threshold
            else:
                 if verbose: print(f"M2 Iter {iteration+1}: No new candidates and threshold at minimum ({current_sbert_thresh:.3f}). Stopping.")
                 break # Stop if threshold is already at minimum

        ##### GPT_SENTEX CALL #####
        # 2. LLM Selection
        # 2.1: Calculate the number of tokens sent to the LLM at this iteration
        llm_total_tokens += sum(len(sent.split()) for sent in new_candidates_for_llm)
        llm_total_sentences += len(new_candidates_for_llm)
        if verbose: print(f"M2 Iter {iteration+1}: Sending {len(new_candidates_for_llm)} new candidates to LLM for selection.")
        # Use the appropriate prompt based on the iteration
        if iteration == 0:
            this_prompt = prompts["init"]
        else:
            this_prompt = prompts["followup"]
        selected_sentences_text = extract_sentences_with_llm(this_claim, new_candidates_for_llm, this_prompt, sentEx_temp, debug=debug)

        # 3. Process LLM Output & Re-association
        num_selected_this_iter = 0
        if selected_sentences_text == ["NOT ENOUGH INFO"]:
            if verbose: print(f"M2 Iter {iteration+1}: LLM indicated 'NOT ENOUGH INFO' from the provided candidates.")
            # Decide how to proceed: lower threshold, stop?
            # Lower threshold if LLM found nothing on the first or second tries
            if iteration < 2 and current_sbert_thresh > min_sbert_thresh:
                current_sbert_thresh = max(min_sbert_thresh, current_sbert_thresh - thresh_decay)
                if verbose: print(f"   Lowering threshold to {current_sbert_thresh:.3f} for next try.")
                continue # Try again with lower threshold
            else:
                if verbose: print(f"M2 Iter {iteration+1}: Stopping because LLM found no evidence after two tries.")
                break
        else:
            if verbose: print(f"M2 Iter {iteration+1}: LLM selected {len(selected_sentences_text)} sentences.")
            # Re-associate selected text with [title, id] using near_match
            for llm_sent in selected_sentences_text:
                best_match_key = None
                highest_sim = -1.0
                # Find the best match among the candidates sent to the LLM this iteration
                for title, sent_id, orig_text in current_iter_candidate_details:
                    key = (title, sent_id)
                    # Skip if this candidate was already selected in this iteration by a previous LLM sentence match
                    # Or if it was selected in a *previous* iteration
                    if key in selected_candidate_keys:
                        continue

                    # Jaccard word overlap between the LLM output and the original sBERT candidate text.
                    # It is used only to rank candidates that pass near_match below.
                    llm_words = set(llm_sent.lower().split())
                    orig_words = set(orig_text.lower().split())
                    union_words = llm_words | orig_words
                    similarity = len(llm_words & orig_words) / len(union_words) if union_words else 0.0 # Guard against empty strings

                    # Accept a candidate only if near_match passes
                    if near_match(llm_sent, orig_text, threshold=near_match_thresh, verbose=debug>1):
                         # Crude way to find the 'best' match if multiple near-matches exist
                         if similarity > highest_sim:
                              highest_sim = similarity
                              best_match_key = key

                if best_match_key:
                    if best_match_key not in selected_candidate_keys:
                        final_evidence_ids.append([best_match_key[0], best_match_key[1]]) # Store [title, id]
                        #######################################################################################
                        # NOTE: NOT IMPLEMENTED FOR TESTING (THROWAWAY IDEA)
                        ### Idea: add 1 to each sentence index. This is based on observation alone: some (<40%) of the predicted evidence items have the correct page title but a sentence ID that is one lower than expected.
                        ### THIS MAY BE VERY CONSEQUENTIAL

                        #######################################################################################
                        selected_candidate_keys.add(best_match_key) # Mark as selected
                        num_selected_this_iter += 1
                        if verbose > 1: print(f"   Matched: '{llm_sent[:50]}...' -> {best_match_key}")
                        if len(final_evidence_ids) >= max_evidence:
                            break # Stop if max evidence reached during re-association
                    # else: (already selected) - do nothing
                else:
                    if verbose: print(f"   Warning: Could not re-associate LLM output with retrieved data: '{llm_sent[:80]}...'")


        # 4. Dynamic Threshold Adjustment (Based on LLM selection)
        if num_selected_this_iter < len(new_candidates_for_llm) / 4 and len(new_candidates_for_llm) > 0: # Example: If LLM selected less than 25% of candidates
            if current_sbert_thresh > min_sbert_thresh:
                current_sbert_thresh = max(min_sbert_thresh, current_sbert_thresh - thresh_decay)
                if verbose: print(f"M2 Iter {iteration+1}: LLM selected few items ({num_selected_this_iter}). Lowering sBERT threshold to {current_sbert_thresh:.3f}")
        # Optional: Increase threshold slightly if LLM selects almost everything? (Less common)
        elif num_selected_this_iter > len(new_candidates_for_llm) * 0.8:
            current_sbert_thresh = min(initial_sbert_thresh, current_sbert_thresh + thresh_decay / 2)
            if verbose: print(f"   LLM selected many items. Slightly increasing threshold to {current_sbert_thresh:.3f}")

        if debug:
            print(f"DEBUG 2. End Iter {iteration+1}: Total evidence found: {len(final_evidence_ids)}")
            print("-_-" * 10)


    # Final Status and Report
    status = "OK" if final_evidence_ids else "NOT ENOUGH INFO"
    if not final_evidence_ids and verbose:
        print("M2: Finished iterations. No evidence selected.")
        print("-------------------------------------------------------------------\n")
    elif verbose:
        print(f"M2: Finished. Selected {len(final_evidence_ids)} evidence items.")
        print("-------------------------------------------------------------------\n")


    # Store all sentences found by sbert (for analysis) and selected ones
    # Need to retrieve text for selected IDs for the report
    all_sbert_sentences_text = [details[2] for details in all_sbert_candidates_map.values()]
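    # Build the texts in the same order as final_evidence_ids (iterating the set gives an arbitrary order)
    selected_evidence_texts = [all_sbert_candidates_map[(title, sent_id)][2] for title, sent_id in final_evidence_ids if (title, sent_id) in all_sbert_candidates_map]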

    report = {
        "claim": claim,
        "final_evidence_ids": final_evidence_ids, # [[title, id], ...]
        "selected_evidence_texts": selected_evidence_texts, # List of text for selected evidence
        "status": status,
        "iterations_run": iteration + 1,
        "max_evidence": max_evidence,
        "max_iterations": max_iterations,
        "mod_2_total_documents": len(documents),
        "sbert_total_sentences": len(all_sbert_candidates_map),
        "sbert_total_tokens": sbert_total_tokens,
        "initial_sbert_thresh": initial_sbert_thresh,
        "final_sbert_threshold": current_sbert_thresh,
        "min_sbert_thresh": min_sbert_thresh,
        "thresh_decay": thresh_decay,
        # "all_sbert_candidates_text": all_sbert_sentences_text, # Can be large
        "llm_total_sentences": llm_total_sentences,
        "llm_total_tokens": llm_total_tokens,
        "near_match_thresh": near_match_thresh,
    }

    return final_evidence_ids, status, report
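
The threshold schedule in step 4 of the loop can be isolated as a small pure function, which makes it easier to test apart from the sBERT and LLM calls. The sketch below is a hypothetical helper (`adjust_threshold` is not part of the notebook) that mirrors the 25% and 80% rules above, assuming the parameters carry the same meaning as in Module 2.

```python
def adjust_threshold(current, initial, minimum, decay, num_selected, num_candidates):
    """Hypothetical helper mirroring step 4 of the Module 2 loop.

    Lowers the sBERT threshold when the LLM kept fewer than 25% of the
    candidates, raises it by half a decay step when the LLM kept more
    than 80%, and otherwise leaves it unchanged.
    """
    if num_candidates <= 0:
        return current  # Nothing was sent to the LLM, so there is nothing to adapt to

    if num_selected < num_candidates / 4:
        # Few selections: widen the net, but never go below the minimum
        if current > minimum:
            return max(minimum, current - decay)
        return current

    if num_selected > num_candidates * 0.8:
        # Almost everything selected: tighten slightly, capped at the initial value
        return min(initial, current + decay / 2)

    return current


# Example with the values used in the execution cell (0.4 initial, 0.1 minimum, 0.05 decay)
lowered = adjust_threshold(0.4, 0.4, 0.1, 0.05, num_selected=1, num_candidates=10)
raised = adjust_threshold(0.3, 0.4, 0.1, 0.05, num_selected=9, num_candidates=10)
print(round(lowered, 3), round(raised, 3))
```

Keeping the rule in one place like this would also make it straightforward to log or plot the threshold trajectory per claim.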



## 4. Module 3: Claim Classification
<a id="4"></a>


In [7]:
# --- Module 3: Classification (Updated) ---
nli_client = OpenAI(api_key=api_key)
def module_3_classification(claim, evidence_texts, nli_client_temp=0.1, verbose=0, debug=False):
    """
    **UPDATED:** Classifies the claim based on the TEXT of the extracted evidence sentences.

    Args:
        claim (str): The input claim.
        evidence_texts (list of str): List of extracted evidence sentence texts.
        nli_client_temp (float): Sampling temperature for the classification call.
        verbose (int): Verbosity level.
        debug (bool): Enable debug printing.

    Returns:
        tuple: (str, str, str):
                 - classification_result: "SUPPORTS", "REFUTES", or "NOT ENOUGH INFO".
                 - exit_status: "OK" or "NOT ENOUGH INFO".
                 - prompt: The prompt used for classification.
    """

    if verbose: print("###### M3: Starting Classification ######")

    if not evidence_texts:
        if verbose: print("M3: No evidence text provided. Classifying as NOT ENOUGH INFO.")
        # Return structure consistent with LLM call but without making one
        return "NOT ENOUGH INFO", "NOT ENOUGH INFO", "No evidence provided to LLM."

    # Format evidence for the prompt, skipping empty or whitespace-only entries
    non_empty_evidence = [e for e in evidence_texts if e and e.strip()]
    if not non_empty_evidence: # Handle case where list contains only empty strings
        if verbose: print("M3: Evidence text list contained only empty or whitespace-only strings. Classifying as NOT ENOUGH INFO.")
        return "NOT ENOUGH INFO", "NOT ENOUGH INFO", "Evidence text was empty."
    formatted_evidence = "\n".join([f"- {e}" for e in non_empty_evidence])


    # 3.1 Prompt
    base_prompt = f"Based ONLY on the following evidence sentences, classify the claim as SUPPORTS, REFUTES, or NOT ENOUGH INFO.\n\nClaim: '{claim}'\n\nEvidence:\n{formatted_evidence}\n\nRespond ONLY with SUPPORTS, REFUTES, or NOT ENOUGH INFO."
    ft_prompt = f"Given the claim, classify the stance of the potentially relevant evidence out of the following categories: '1' (if the claim is supported by the evidence), '0' (if the claim is refuted by the evidence), '2' (if you do not have enough info to make a confident decision). Respond with a single digit label. Do not use any other labels.\n\nClaim: '{claim}'\n\nEvidence:\n{formatted_evidence}"

    if debug:
        print(f"DEBUG 3.1 (module_3_classification):")
        print(f"\tClaim: {claim}")
        print(f"\tEvidence Texts Sent ({len(evidence_texts)}):")
        # for i, txt in enumerate(evidence_texts): print(f"\t  {i+1}. {txt[:100]}...") # Print snippet
        print(f"\tPrompt (partial): {base_prompt[:200]}...")
        print("-_-" * 10)

    # 3.2 Classification Call
    try:
        response = nli_client.chat.completions.create(
            model='gpt-4o-mini', # Use specified model
            messages=[
                {"role": "system", "content": "You are a claim classification assistant. Respond only with 'SUPPORTS', 'REFUTES', or 'NOT ENOUGH INFO'."},
                {"role": "user", "content": base_prompt},
            ],
            max_tokens=10,  # Classification is short
            n=1,
            stop=None,
            temperature=nli_client_temp, # Low temperature for classification
        )
        classification_result = response.choices[0].message.content.strip().upper() # Normalize output

        # For the fine tuned model that outputs encoded labels
        """
        # Convert numerical labels to text
        if classification_result == "1":
            classification_result = "SUPPORTS"
        elif classification_result == "0":
            classification_result = "REFUTES"
        elif classification_result == "2":
            classification_result = "NOT ENOUGH INFO"
        else:
            print(f"Warning: Module 3 LLM returned invalid label '{classification_result}'. Defaulting to NOT ENOUGH INFO.")
            classification_result = "NOT ENOUGH INFO"
        """
        # Validate output
        valid_labels = ["SUPPORTS", "REFUTES", "NOT ENOUGH INFO"]
        if classification_result not in valid_labels:
             print(f"Warning: Module 3 LLM returned invalid label '{classification_result}'. Defaulting to NOT ENOUGH INFO.")
             classification_result = "NOT ENOUGH INFO"

    except Exception as e:
         print(f"Error during Module 3 classification LLM call: {e}")
         classification_result = "NOT ENOUGH INFO" # Default on error


    # 3.3 Exit Status
    exit_status = "OK" if classification_result in ["SUPPORTS", "REFUTES"] else "NOT ENOUGH INFO"

    if debug:
        print(f"DEBUG 3.2/3.3 (module_3_classification):")
        print(f"\tLLM Classification Result: {classification_result}")
        print(f"\tExit Status: {exit_status}")
        print("-_-" * 10)

    return classification_result, exit_status, base_prompt
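
The two label schemes handled above (text labels from the base model, digit labels from the fine-tuned model) can be normalized by one function, so switching prompts does not require editing the validation code. This is a minimal sketch: `normalize_label` and the two constants are hypothetical names, and the digit mapping is taken from `ft_prompt` ('1' supported, '0' refuted, '2' not enough info).

```python
# Digit labels as described in ft_prompt
LABEL_BY_DIGIT = {"1": "SUPPORTS", "0": "REFUTES", "2": "NOT ENOUGH INFO"}
VALID_LABELS = set(LABEL_BY_DIGIT.values())


def normalize_label(raw):
    """Hypothetical helper: map raw model output to a FEVER label.

    Accepts either a text label or a single-digit label, and falls back
    to NOT ENOUGH INFO for anything else (including None).
    """
    text = (raw or "").strip().upper()
    text = LABEL_BY_DIGIT.get(text, text)  # Decode digits, leave text labels as they are
    return text if text in VALID_LABELS else "NOT ENOUGH INFO"


for raw_output in [" supports ", "0", "2", "maybe", None]:
    print(repr(raw_output), "->", normalize_label(raw_output))
```

Defaulting to NOT ENOUGH INFO matches the behavior of the validation step in `module_3_classification`.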

## 5. Module 0: System Control & Execution
<a id="5"></a>

In [8]:
# --- Helper to get test claim from paper_test.jsonl ---
global_test_data_index = 0

def get_test_claim(test_data_list, verbose=0, debug=False):
    """
    **UPDATED:** Gets the next claim from the loaded paper_test.jsonl data.

    Args:
        test_data_list (list): The list loaded from paper_test.jsonl.
        verbose (int): Verbosity level.
        debug (bool): Enable debug printing.

    Returns:
        tuple: (claim_id, claim_text) or (None, None) if index is out of bounds.
    """
    global global_test_data_index
    if global_test_data_index >= len(test_data_list):
        print("Reached end of test data.")
        return None, None # Signal end

    item = test_data_list[global_test_data_index]
    claim_id = item.get("id")
    claim_text = item.get("claim")

    if claim_id is None or claim_text is None:
        print(f"Warning: Skipping item at index {global_test_data_index} due to missing 'id' or 'claim'. Item: {item}")
        global_test_data_index += 1
        return get_test_claim(test_data_list, verbose, debug) # Recursively get next

    if verbose > 1:
        print(f"Getting Test Claim {global_test_data_index + 1}/{len(test_data_list)}: ID={claim_id}, Claim='{claim_text[:100]}...'")

    global_test_data_index += 1 # Increment for next call
    return claim_id, claim_text

# --- Main System Control Flow (Updated) ---

def module_0_sys_control(test_items_list, test_size, initial_sbert_thresh, min_sbert_thresh, thresh_decay, max_evidence, max_iterations, near_match_thresh, max_pages_to_fetch, num_search_results, query_client_temp, rephrase_client_temp, sentEx_client_temp, nli_client_temp, disambiguate_client_temp, verbose=0, debug=False):
    """
    **UPDATED:** Orchestrates the full fact-checking pipeline for test data.

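    Args:
        test_items_list (list): List of dicts loaded from paper_test.jsonl.
        test_size (int): Number of items to process from the list.
        initial_sbert_thresh, min_sbert_thresh, thresh_decay (float): sBERT threshold schedule passed to Module 2.
        max_evidence (int): Target number of evidence items per claim (Module 2).
        max_iterations (int): Maximum number of Module 2 iterations per claim.
        near_match_thresh (float): Threshold for re-associating LLM output with sBERT candidates.
        max_pages_to_fetch (int): Passed to query_generator.
        num_search_results (int): Passed to retrieve_documents_from_wikipedia.
        query_client_temp, rephrase_client_temp, sentEx_client_temp, nli_client_temp, disambiguate_client_temp (float): Temperatures for the respective LLM calls.
        verbose (int): Verbosity level.
        debug (bool): Enable detailed debug printing.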

    Returns:
        tuple: (list, pd.DataFrame):
                 - predictions_list: List of prediction dicts for output/scoring.
                 - run_report_df: DataFrame containing detailed logs for each claim.
    """
    global global_test_data_index
    global_test_data_index = 0 # Reset index at the start of a run

    predictions_list = [] # Stores final formatted predictions
    # Column names must match the keys logged in step 8; keys missing here are dropped from the DataFrame
    report_columns = [
        'id', 'claim', 'time_to_check', 'entities', 'keywords',
        'max_pages_to_fetch', 'max_search_results', 'retrieved_page_titles',
        'module2_status', 'predicted_evidence_ids', 'predicted_evidence_texts',
        'module3_result', 'module3_status', 'module3_prompt', 'module1_report_details',
        'module2_report_details' # Store the nested report dict from M2
    ]
    run_report_list = [] # Collect data for DataFrame

    # Store original documents fetched by Module 1 for later lookup
    # Useful for getting text for Module 3 without re-fetching/storing large texts repeatedly
    document_store = {} # { page_title: text }

    # Limit processing to test_size or available data
    actual_test_size = min(test_size, len(test_items_list))
    if actual_test_size <= 0:
         print("Error: No test data items to process.")
         return [], pd.DataFrame(columns=report_columns)

    print(f"Starting system control. Processing {actual_test_size} test claims...")

    for i in tqdm(range(actual_test_size), desc="Processing Claims"):
        # 0. Start timer
        start_time = time.time()

        # 1. Get Claim from Test Data
        claim_id, claim = get_test_claim(test_items_list, verbose=verbose, debug=debug)
        if claim_id is None: # Reached end or error
            break

        if verbose: print(f"\n--- Processing Claim ID: {claim_id} ---")
        if verbose > 1: print(f"Claim: {claim}")

        document_store.clear() # Clear store for new claim

        # 2. Extract Entities & Keywords
        entities = extract_entities(claim)
        keywords = extract_keywords(claim)
        if verbose > 1: print(f"Entities: {entities}, Keywords: {keywords}")

        # --- Module 1 ---

        # 3. Generate Potential Page Titles
        potential_titles = query_generator(claim, keywords, entities, max_pages_to_fetch, query_client_temp, debug=debug)

        # 4. Retrieve Documents
        if verbose: print("###### M1: Retrieving Documents ######")
        documents, doc_sources, total_document_tokens, mod_1_total_documents = retrieve_documents_from_wikipedia(potential_titles, claim, entities, num_search_results, disambiguate_client_temp, debug=debug)
        retrieved_pages_str = ", ".join(doc_sources) if doc_sources else "None"

        # Store fetched documents for Module 3 lookup
        for title, text in zip(doc_sources, documents):
            document_store[title] = text

        # Create the Module 1 report
        module_1_report = {
            "mod_1_total_pages": mod_1_total_documents,
            "total_document_tokens": total_document_tokens,
            "potential_titles": potential_titles,
            "retrieved_titles": retrieved_pages_str
        }

        if not documents:
            if verbose: print("M1: No documents retrieved. Cannot proceed.\n-------------------------------------------------------\n")
            # Handle case with no documents: classify as NEI directly
            predicted_evidence_ids = []
            predicted_evidence_texts = []
            classification_result = "NOT ENOUGH INFO"
            mod2_status = "NOT ENOUGH INFO"
            mod3_status = "NOT ENOUGH INFO"
            mod3_prompt = "Skipped - No documents from M1"
            mod2_report_details = {"status": "Skipped - No documents from M1"}
        else:
            # --- Module 2 ---
            # 5. Extract Evidence Sentences ([title, id])
            # Use appropriate thresholds
            predicted_evidence_ids, mod2_status, mod2_report = module_2_2_controls(
                claim, documents, doc_sources, entities, keywords,
                initial_sbert_thresh=initial_sbert_thresh, # Slightly higher initial threshold?
                min_sbert_thresh=min_sbert_thresh,
                thresh_decay=thresh_decay,
                max_evidence=max_evidence,
                max_iterations=max_iterations,
                near_match_thresh=near_match_thresh,
                rephrs_temp=rephrase_client_temp,
                sentEx_temp=sentEx_client_temp,
                verbose=verbose,
                debug=debug
            )
            predicted_evidence_texts = mod2_report.get("selected_evidence_texts", [])
            mod2_report_details = mod2_report # Store the whole M2 report

            # --- Module 3 ---
            # 6. Classify Claim based on evidence TEXT
            classification_result, mod3_status, mod3_prompt = module_3_classification(
                claim,
                predicted_evidence_texts, # Pass the actual text
                nli_client_temp=nli_client_temp,
                verbose=verbose,
                debug=debug
            )

        # 7. Format Output for FEVER Scorer / Final JSON
        # 7.1 Encode the brackets (-LRB-, -RRB-, -LSB-, -RSB-) and replace spaces with underscores for each page title
        bracket_mapping = {
            "(": "-LRB-",
            ")": "-RRB-",
            "[": "-LSB-",
            "]": "-RSB-"
        }
        for item in predicted_evidence_ids:
            # Encode the page title like the test set (brackets and underscores)
            item[0] = "".join(bracket_mapping.get(c, c) for c in item[0])
            item[0] = item[0].replace(" ", "_")

        prediction_item = {
            "id": claim_id,
            "predicted_label": classification_result,
            "predicted_evidence": predicted_evidence_ids # List of [page_title, sentence_id]
        }
        # The FEVER scorer needs gold labels/evidence
        # We add dummy fields here if there is no gold evidence.
        if "label" in test_items_list[i]: # Check if gold data exists
            if verbose: print("Adding gold label/evidence to prediction.")
            prediction_item["label"] = test_items_list[i]["label"]
            # Set the first two fields of each gold evidence item to None, then exclude duplicate evidence sets.
            # Deduplicating after the fields are cleared catches sets that differed only in those fields.
            unique_evdc_items = []
            for evidence_set in test_items_list[i]["evidence"]:
                for item in evidence_set:
                    item[0] = None
                    item[1] = None
                if evidence_set not in unique_evdc_items:
                    unique_evdc_items.append(evidence_set)
            if test_items_list[i]["label"] == "NOT ENOUGH INFO":
                unique_evdc_items = []
            prediction_item["evidence"] = unique_evdc_items
            if verbose: print("##########################################################################\n")
        else:
            if verbose: print("Adding dummy gold label/evidence to prediction because gold data is missing.")
            # Add placeholder fields if running the scorer function is desired,
            # otherwise, these can be omitted if just generating predictions.
            prediction_item["label"] = "NOT ENOUGH INFO" # Dummy
            prediction_item["evidence"] = [] # Dummy
        if verbose: print("##########################################################################\n")

        predictions_list.append(prediction_item)

        time_to_check = time.time() - start_time
        if verbose: print(f"Time to process claim: {time_to_check:.2f} seconds.\n------------------------------------------------\n")

        # 8. Log Run Details
        run_report_list.append({
            'id': claim_id,
            'claim': claim,
            'time_to_check': time_to_check,
            'entities': ", ".join(entities) if entities else "",
            'keywords': ", ".join(keywords) if keywords else "",
            'max_pages_to_fetch': max_pages_to_fetch,
            'max_search_results': num_search_results,
            'retrieved_page_titles': retrieved_pages_str,
            'module2_status': mod2_status,
            'predicted_evidence_ids': json.dumps(predicted_evidence_ids), # Store as JSON string
            'predicted_evidence_texts': json.dumps(predicted_evidence_texts), # Store as JSON string
            'module3_result': classification_result,
            'module3_status': mod3_status,
            'module3_prompt': mod3_prompt,
            'module1_report_details': json.dumps(module_1_report), # Store M1 report dict as JSON string
            'module2_report_details': json.dumps(mod2_report_details) # Store M2 report dict as JSON string
        })

        # Optional: Garbage collect periodically if memory usage is high
        if i % 50 == 0:
            gc.collect()

    print(f"\nFinished processing {len(predictions_list)} claims.")
    run_report_df = pd.DataFrame(run_report_list, columns=report_columns)
    return predictions_list, run_report_df
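
The title encoding in step 7.1 can be checked in isolation. The sketch below wraps the same bracket mapping and space replacement in a hypothetical helper, `encode_fever_title`, and applies it to a title that appears in the run output further down.

```python
# Same mapping as in step 7.1 of module_0_sys_control
BRACKET_MAPPING = {"(": "-LRB-", ")": "-RRB-", "[": "-LSB-", "]": "-RSB-"}


def encode_fever_title(title):
    """Hypothetical helper: encode a Wikipedia page title like the test set.

    Brackets become -LRB-/-RRB-/-LSB-/-RSB- tokens and spaces become underscores.
    """
    encoded = "".join(BRACKET_MAPPING.get(char, char) for char in title)
    return encoded.replace(" ", "_")


print(encode_fever_title("Grease (film)"))  # Grease_-LRB-film-RRB-
```

Only the characters handled in step 7.1 are covered here; any other escapes used by the test set would need to be added to the mapping.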


In [9]:
# --- Execution & Scoring ---
# Get the time now (UTC + 8 hours)
time_str = datetime.datetime.now(datetime.timezone(datetime.timedelta(hours=8))).strftime("%y%m%d_%H%M")

# Ensure test_data is loaded
if test_data:
    # Run the system on a subset of the test data
    NUM_TEST_CLAIMS = 30
    initial_sbert_thresh=0.4
    min_sbert_thresh=0.1
    thresh_decay=0.05
    max_evidence=8
    max_iterations=10
    near_match_thresh=0.8
    max_pages_to_fetch=25
    num_search_results=20
    query_client_temp=0.5
    rephrase_client_temp=0.9 # RESET THIS HIGHER IF CONTINUING TESTING WITH CLAIM REPHRASING. OTHERWISE, THIS IS A WARNING PLACEHOLDER FOR VISUALIZATIONS LATER.
    sentEx_client_temp=0.2
    nli_client_temp=0.1
    disambiguate_client_temp=0.2
    predictions, report_df = module_0_sys_control(test_data, NUM_TEST_CLAIMS, initial_sbert_thresh, min_sbert_thresh, thresh_decay, max_evidence, max_iterations, near_match_thresh, max_pages_to_fetch, num_search_results, query_client_temp, rephrase_client_temp, sentEx_client_temp, nli_client_temp, disambiguate_client_temp, verbose=1, debug=True)

    # --- FEVER Scoring ---
    # NOTE: This will run the scorer, but scores are only meaningful if 'predictions'
    # includes the GOLD 'label' and 'evidence' fields.
    # The 'strict_score' might be somewhat informative if NEI predictions align.
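    print("\n--- FEVER Scoring Results ---")
    # Default the metrics so the run report can still be built if scoring is skipped or fails
    strict_score = label_accuracy = precision = recall = f1 = None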
    if predictions:
        # The scorer expects 'predicted_evidence' to be a list of [page_title, sentence_id] pairs.
        for p in predictions:
            # Fall back to an empty list if the field is missing or malformed
            if not isinstance(p.get("predicted_evidence"), list):
                p["predicted_evidence"] = []
            # NEI predictions carry no evidence
            if p["predicted_label"] == "NOT ENOUGH INFO":
                p["predicted_evidence"] = []

        try:
            strict_score, label_accuracy, precision, recall, f1 = fever_score(predictions)
            print(f"Strict Score (Exact Match): {strict_score*100:.2f}%")
            print(f"Label Accuracy: {label_accuracy*100:.2f}%")
            print(f"Evidence Precision: {precision*100:.2f}%")
            print(f"Evidence Recall: {recall*100:.2f}%")
            print(f"Evidence F1 Score: {f1*100:.2f}%")
            print(f"Number of test cases scored: {len(predictions)}")
        except Exception as e:
            print(f"Error during FEVER scoring: {e}")
            print("Scoring skipped. Check prediction format and scorer compatibility.")

    else:
        print("No predictions generated to score.")

    # --- Display Results ---
    print("\n--- Sample Predictions (Output Format) ---")
    for item in predictions[:20]: # Show first 20 predictions
        print(json.dumps(item, indent=2))

    print("\n--- Run Report ---")
    # Configure pandas display options
    pd.set_option('display.max_rows', 10)
    pd.set_option('display.max_columns', 25)
    pd.set_option('display.width', 1000)
    pd.set_option('display.max_colwidth', 50) # Limit column width

    if not report_df.empty:
        # Add the strict_score, label_accuracy, precision, recall, f1
        # Create new columns
        report_df['query_client_temp'] = [query_client_temp] * len(report_df)
        report_df['rephrase_client_temp'] = [rephrase_client_temp] * len(report_df)
        report_df['sentEx_client_temp'] = [sentEx_client_temp] * len(report_df)
        report_df['nli_client_temp'] = [nli_client_temp] * len(report_df)
        report_df['disambiguate_client_temp'] = [disambiguate_client_temp] * len(report_df)
        report_df['strict_score'] = [strict_score] * len(report_df)
        report_df['label_accuracy'] = [label_accuracy] * len(report_df)
        report_df['precision'] = [precision] * len(report_df)
        report_df['recall'] = [recall] * len(report_df)
        report_df['f1'] = [f1] * len(report_df)
        print(report_df)
        # Save the report
        try:
             report_filename = f'/content/drive/MyDrive/SUNY_Poly_DSA598/datasets/FEVER/paper_test_results/tuned_GPT-sBERTn1024-sentEx-RephsHist-T4_run_report_test_n{len(predictions)}_{time_str}.csv'
             report_df.to_csv(report_filename, index=False)
             print(f"\nReport saved to: {report_filename}")
        except Exception as e:
            print(f"Error saving report: {e}")
    else:
        print("Report DataFrame is empty.")

else:
    print("Test data not loaded. Cannot run system.")

Starting system control. Processing 30 test claims...


Processing Claims:   0%|          | 0/30 [00:00<?, ?it/s]


--- Processing Claim ID: 113501 ---
DEBUG 1.0 (query_generator):
	Claim: Grease had bad reviews.
	Entities: ['Grease']
	LLM Output: ['grease_(film)', 'critical_reception', 'list_of_films_with_lowest_ratings', 'rotten_tomatoes', 'metacritic']
-_--_--_--_--_-
DEBUG 1.1 (query_generator):
	Entities: ['Grease']
	Generated Potential Titles: ['metacritic', 'grease_(film)', 'rotten_tomatoes', 'critical_reception', 'list_of_films_with_lowest_ratings']
	Selected Titles for Retrieval: ['metacritic', 'grease_(film)', 'rotten_tomatoes', 'critical_reception', 'list_of_films_with_lowest_ratings']
-_--_--_--_--_--_--_--_--_--_-
###### M1: Retrieving Documents ######
DEBUG 1.2: Successfully retrieved intro from 'List of video games notable for negative reception' (searched for 'metacritic')






DEBUG 1.2: DisambiguationError for title: 'grease_(film)'. Options: ['Grease (lubricant)', 'petroleum', 'Brown grease', 'Yellow grease', 'Hydrogenated vegetable oil', 'Vegetable shortening', 'bribe', 'killing', 'Pomade', 'Grease (musical)', 'Grease (film)', 'Grease 2', '"Grease" (song)', '1971 musical play', 'Grease: The Original Soundtrack from the Motion Picture', 'Grease: The New Broadway Cast Recording (2007 album)', 'Grease: Live', "Grease: You're the One that I Want!", 'Grease is the Word', 'Extreme Ghostbusters episode 14', 'The Keith & Paddy Picture Show season 2, episode 1', 'Grease (franchise)', 'Grease (video game)', 'Mud fever', 'Aglossa cuprina', 'Grease (networking)', 'All pages with titles beginning with Grease ', 'All pages with titles containing Grease', 'Greaser (disambiguation)', 'Greasy (disambiguation)', 'Greece (disambiguation)']...
DEBUG 1.2: Successfully retrieved intro from 'Grease (film)' (disambiguated to 'Grease (film)')
DEBUG 1.2: Successfully retrieved int

Processing Claims:   3%|▎         | 1/30 [00:44<21:22, 44.24s/it]

DEBUG 3.2/3.3 (module_3_classification):
	LLM Classification Result: REFUTES
	Exit Status: OK
-_--_--_--_--_--_--_--_--_--_-
Adding gold label/evidence to prediction.
##########################################################################

##########################################################################

Time to process claim: 43.82 seconds.
------------------------------------------------


--- Processing Claim ID: 163803 ---
DEBUG 1.0 (query_generator):
	Claim: Ukrainian Soviet Socialist Republic was a founding participant of the UN.
	Entities: ['Socialist Republic', 'Ukrainian', 'Soviet', 'UN']
	LLM Output: ['ukrainian_soviet_socialist_republic', 'founding_members_of_the_un', 'united_nations', 'history_of_the_united_nations', 'soviet_union']
-_--_--_--_--_-
DEBUG 1.1 (query_generator):
	Entities: ['Socialist Republic', 'Ukrainian', 'Soviet', 'UN']
	Generated Potential Titles: ['soviet_union', 'united_nations', 'founding_members_of_the_un', 'history_of_the_united_nations

Processing Claims:   7%|▋         | 2/30 [01:30<21:19, 45.71s/it]

DEBUG 3.2/3.3 (module_3_classification):
	LLM Classification Result: NOT ENOUGH INFO
	Exit Status: NOT ENOUGH INFO
-_--_--_--_--_--_--_--_--_--_-
Adding gold label/evidence to prediction.
##########################################################################

##########################################################################

Time to process claim: 46.74 seconds.
------------------------------------------------


--- Processing Claim ID: 70041 ---
DEBUG 1.0 (query_generator):
	Claim: 2 Hearts is a musical composition by Minogue.
	Entities: ['Minogue']
	LLM Output: ['2_hearts', 'kylie_minogue', 'musical_composition']
-_--_--_--_--_-
DEBUG 1.1 (query_generator):
	Entities: ['Minogue']
	Generated Potential Titles: ['2_hearts', 'kylie_minogue', 'musical_composition']
	Selected Titles for Retrieval: ['2_hearts', 'kylie_minogue', 'musical_composition']
-_--_--_--_--_--_--_--_--_--_-
###### M1: Retrieving Documents ######
DEBUG 1.2: Successfully retrieved intro from 'Two Hearts (K

Processing Claims:  10%|█         | 3/30 [02:36<24:34, 54.60s/it]

DEBUG 3.2/3.3 (module_3_classification):
	LLM Classification Result: SUPPORTS
	Exit Status: OK
-_--_--_--_--_--_--_--_--_--_-
Adding gold label/evidence to prediction.
##########################################################################

##########################################################################

Time to process claim: 65.18 seconds.
------------------------------------------------


--- Processing Claim ID: 202314 ---
DEBUG 1.0 (query_generator):
	Claim: The New Jersey Turnpike has zero shoulders.
	Entities: ['New Jersey Turnpike']
	LLM Output: ['new_jersey_turnpike', 'highway_safety', 'road_design', 'traffic_engineering', 'roadway_geometry']
-_--_--_--_--_-
DEBUG 1.1 (query_generator):
	Entities: ['New Jersey Turnpike']
	Generated Potential Titles: ['road_design', 'highway_safety', 'roadway_geometry', 'traffic_engineering', 'new_jersey_turnpike']
	Selected Titles for Retrieval: ['road_design', 'highway_safety', 'roadway_geometry', 'traffic_engineering', 'new_jer

Processing Claims:  13%|█▎        | 4/30 [03:22<22:15, 51.38s/it]

DEBUG 3.2/3.3 (module_3_classification):
	LLM Classification Result: REFUTES
	Exit Status: OK
-_--_--_--_--_--_--_--_--_--_-
Adding gold label/evidence to prediction.
##########################################################################

##########################################################################

Time to process claim: 46.43 seconds.
------------------------------------------------


--- Processing Claim ID: 57085 ---
DEBUG 1.0 (query_generator):
	Claim: Legendary Entertainment is the owner of Wanda Cinemas.
	Entities: ['Wanda Cinemas']
	LLM Output: ['legendary_entertainment', 'wanda_cinemas', 'legendary_entertainment#ownership', 'wanda_group', 'film_industry']
-_--_--_--_--_-
DEBUG 1.1 (query_generator):
	Entities: ['Wanda Cinemas']
	Generated Potential Titles: ['wanda_group', 'film_industry', 'wanda_cinemas', 'legendary_entertainment', 'legendary_entertainment#ownership']
	Selected Titles for Retrieval: ['wanda_group', 'film_industry', 'wanda_cinemas', 'legendary

Processing Claims:  17%|█▋        | 5/30 [03:52<18:13, 43.75s/it]

DEBUG 2.3.3 (LLM Selection):
	LLM Raw Output:
NOT ENOUGH INFO
-_--_--_--_--_-
M2 Iter 3: LLM indicated 'NOT ENOUGH INFO' from the provided candidates.
M2 Iter 3: Stopping because LLM found no evidence after two tries.
M2: Finished iterations. No evidence selected.
-------------------------------------------------------------------

###### M3: Starting Classification ######
M3: No evidence text provided. Classifying as NOT ENOUGH INFO.
Adding gold label/evidence to prediction.
##########################################################################

##########################################################################

Time to process claim: 30.24 seconds.
------------------------------------------------


--- Processing Claim ID: 6032 ---
DEBUG 1.0 (query_generator):
	Claim: Aruba is the only ABC Island.
	Entities: ['ABC Island', 'Aruba']
	LLM Output: ['aruba', 'abc_islands', 'caribbean', 'islands_of_the_caribbean']
-_--_--_--_--_-
DEBUG 1.1 (query_generator):
	Entities: ['ABC I

Processing Claims:  20%|██        | 6/30 [04:29<16:34, 41.43s/it]

DEBUG 3.2/3.3 (module_3_classification):
	LLM Classification Result: REFUTES
	Exit Status: OK
-_--_--_--_--_--_--_--_--_--_-
Adding gold label/evidence to prediction.
##########################################################################

##########################################################################

Time to process claim: 36.92 seconds.
------------------------------------------------


--- Processing Claim ID: 176630 ---
DEBUG 1.0 (query_generator):
	Claim: Great white sharks do not prefer dolphins as prey.
	Entities: ['Great']
	LLM Output: ['great_white_shark', 'dolphin', 'predation', 'marine_ecosystem', 'shark_behavior', 'food_web', 'great_white_shark_behavior', 'shark_diet']
-_--_--_--_--_-
DEBUG 1.1 (query_generator):
	Entities: ['Great']
	Generated Potential Titles: ['dolphin', 'food_web', 'predation', 'shark_diet', 'shark_behavior', 'marine_ecosystem', 'great_white_shark', 'great_white_shark_behavior']
	Selected Titles for Retrieval: ['dolphin', 'food_web', 'pr

Processing Claims:  23%|██▎       | 7/30 [05:51<20:54, 54.53s/it]

DEBUG 3.2/3.3 (module_3_classification):
	LLM Classification Result: NOT ENOUGH INFO
	Exit Status: NOT ENOUGH INFO
-_--_--_--_--_--_--_--_--_--_-
Adding gold label/evidence to prediction.
##########################################################################

##########################################################################

Time to process claim: 81.50 seconds.
------------------------------------------------


--- Processing Claim ID: 130048 ---
DEBUG 1.0 (query_generator):
	Claim: Burbank, California has always been completely void of industry.
	Entities: ['California', 'Burbank']
	LLM Output: ['burbank,_california', 'history_of_burbank,_california', 'economy_of_burbank,_california', 'list_of_industries_in_burbank,_california', 'burbank_studios', 'media_industry_in_burbank']
-_--_--_--_--_-
DEBUG 1.1 (query_generator):
	Entities: ['California', 'Burbank']
	Generated Potential Titles: ['burbank_studios', 'burbank,_california', 'media_industry_in_burbank', 'economy_of_bur

Processing Claims:  27%|██▋       | 8/30 [06:24<17:28, 47.67s/it]

DEBUG 3.2/3.3 (module_3_classification):
	LLM Classification Result: REFUTES
	Exit Status: OK
-_--_--_--_--_--_--_--_--_--_-
Adding gold label/evidence to prediction.
##########################################################################

##########################################################################

Time to process claim: 32.97 seconds.
------------------------------------------------


--- Processing Claim ID: 100046 ---
DEBUG 1.0 (query_generator):
	Claim: The Guthrie Theater's second building began operating in 1963.
	Entities: ['Guthrie']
	LLM Output: ['guthrie_theater', 'guthrie_theater#history', 'theater_building', '1963_in_architecture']
-_--_--_--_--_-
DEBUG 1.1 (query_generator):
	Entities: ['Guthrie']
	Generated Potential Titles: ['guthrie_theater', 'theater_building', '1963_in_architecture', 'guthrie_theater#history']
	Selected Titles for Retrieval: ['guthrie_theater', 'theater_building', '1963_in_architecture', 'guthrie_theater#history']
-_--_--_--_--_--_-

Processing Claims:  30%|███       | 9/30 [06:40<13:15, 37.86s/it]

DEBUG 2.3.3 (LLM Selection):
	LLM Raw Output:
NOT ENOUGH INFO
-_--_--_--_--_-
M2 Iter 3: LLM indicated 'NOT ENOUGH INFO' from the provided candidates.
M2 Iter 3: Stopping because LLM found no evidence after two tries.
M2: Finished iterations. No evidence selected.
-------------------------------------------------------------------

###### M3: Starting Classification ######
M3: No evidence text provided. Classifying as NOT ENOUGH INFO.
Adding gold label/evidence to prediction.
##########################################################################

##########################################################################

Time to process claim: 16.29 seconds.
------------------------------------------------


--- Processing Claim ID: 204575 ---
DEBUG 1.0 (query_generator):
	Claim: Commodore is ranked above a rear admiral.
	Entities: ['Commodore']
	LLM Output: ['commodore', 'rear_admiral', 'naval_ranks', 'military_rank', 'officer_rank']
-_--_--_--_--_-
DEBUG 1.1 (query_generator):
	E

Processing Claims:  33%|███▎      | 10/30 [07:20<12:48, 38.41s/it]

DEBUG 3.2/3.3 (module_3_classification):
	LLM Classification Result: REFUTES
	Exit Status: OK
-_--_--_--_--_--_--_--_--_--_-
Adding gold label/evidence to prediction.
##########################################################################

##########################################################################

Time to process claim: 39.65 seconds.
------------------------------------------------


--- Processing Claim ID: 107539 ---
DEBUG 1.0 (query_generator):
	Claim: Moscovium is a halogen.
	Entities: ['Moscovium']
	LLM Output: ['moscovium', 'halogen', 'periodic_table', 'chemical_element', 'list_of_chemical_elements']
-_--_--_--_--_-
DEBUG 1.1 (query_generator):
	Entities: ['Moscovium']
	Generated Potential Titles: ['halogen', 'moscovium', 'periodic_table', 'chemical_element', 'list_of_chemical_elements']
	Selected Titles for Retrieval: ['halogen', 'moscovium', 'periodic_table', 'chemical_element', 'list_of_chemical_elements']
-_--_--_--_--_--_--_--_--_--_-
###### M1: Retrievi

Processing Claims:  37%|███▋      | 11/30 [07:53<11:37, 36.72s/it]

DEBUG 2.3.3 (LLM Selection):
	LLM Raw Output:
NOT ENOUGH INFO
-_--_--_--_--_-
M2 Iter 3: LLM indicated 'NOT ENOUGH INFO' from the provided candidates.
M2 Iter 3: Stopping because LLM found no evidence after two tries.
M2: Finished iterations. No evidence selected.
-------------------------------------------------------------------

###### M3: Starting Classification ######
M3: No evidence text provided. Classifying as NOT ENOUGH INFO.
Adding gold label/evidence to prediction.
##########################################################################

##########################################################################

Time to process claim: 32.88 seconds.
------------------------------------------------


--- Processing Claim ID: 164883 ---
DEBUG 1.0 (query_generator):
	Claim: Hezbollah received a type of training from Iran.
	Entities: ['Hezbollah', 'Iran']
	LLM Output: ['hezbollah', 'iran', 'hezbollah–iran_relations', 'military_training', 'iranian_revolutionary_guard']
-_--_--_

Processing Claims:  40%|████      | 12/30 [09:03<14:05, 46.98s/it]

DEBUG 3.2/3.3 (module_3_classification):
	LLM Classification Result: SUPPORTS
	Exit Status: OK
-_--_--_--_--_--_--_--_--_--_-
Adding gold label/evidence to prediction.
##########################################################################

##########################################################################

Time to process claim: 70.44 seconds.
------------------------------------------------


--- Processing Claim ID: 54298 ---
DEBUG 1.0 (query_generator):
	Claim: In states still employing the electric chair to execute people, the prisoner is allowed the choice of lethal injection as an alternative method.
	Entities: []
	LLM Output: ['electric_chair', 'lethal_injection', 'capital_punishment_in_the_United_States', 'death_penalty', 'methods_of_execution', 'prisoner_rights']
-_--_--_--_--_-
DEBUG 1.1 (query_generator):
	Entities: []
	Generated Potential Titles: ['death_penalty', 'electric_chair', 'prisoner_rights', 'lethal_injection', 'methods_of_execution', 'capital_punishmen

Processing Claims:  43%|████▎     | 13/30 [10:11<15:06, 53.35s/it]

DEBUG 3.2/3.3 (module_3_classification):
	LLM Classification Result: SUPPORTS
	Exit Status: OK
-_--_--_--_--_--_--_--_--_--_-
Adding gold label/evidence to prediction.
##########################################################################

##########################################################################

Time to process claim: 68.01 seconds.
------------------------------------------------


--- Processing Claim ID: 222749 ---
DEBUG 1.0 (query_generator):
	Claim: Practical Magic is an American romantic drama film.
	Entities: ['Practical', 'American', 'Magic']
	LLM Output: ['practical_magic', 'romantic_drama_film', 'list_of_american_romantic_drama_films', 'film_genre', 'film_adaptation']
-_--_--_--_--_-
DEBUG 1.1 (query_generator):
	Entities: ['Practical', 'American', 'Magic']
	Generated Potential Titles: ['film_genre', 'film_adaptation', 'practical_magic', 'romantic_drama_film', 'list_of_american_romantic_drama_films']
	Selected Titles for Retrieval: ['film_genre', 'film_

Processing Claims:  47%|████▋     | 14/30 [10:51<13:08, 49.29s/it]

DEBUG 2.3.3 (LLM Selection):
	LLM Raw Output:
NOT ENOUGH INFO
-_--_--_--_--_-
M2 Iter 3: LLM indicated 'NOT ENOUGH INFO' from the provided candidates.
M2 Iter 3: Stopping because LLM found no evidence after two tries.
M2: Finished iterations. No evidence selected.
-------------------------------------------------------------------

###### M3: Starting Classification ######
M3: No evidence text provided. Classifying as NOT ENOUGH INFO.
Adding gold label/evidence to prediction.
##########################################################################

##########################################################################

Time to process claim: 39.90 seconds.
------------------------------------------------


--- Processing Claim ID: 219675 ---
DEBUG 1.0 (query_generator):
	Claim: Corsica belongs to Italy.
	Entities: ['Corsica', 'Italy']
	LLM Output: ['corsica', 'italy', 'history_of_corsica', 'political_status_of_corsica', 'territorial_claims_of_italy']
-_--_--_--_--_-
DEBUG 1.1 (qu

Processing Claims:  50%|█████     | 15/30 [11:31<11:39, 46.65s/it]

DEBUG 3.2/3.3 (module_3_classification):
	LLM Classification Result: REFUTES
	Exit Status: OK
-_--_--_--_--_--_--_--_--_--_-
Adding gold label/evidence to prediction.
##########################################################################

##########################################################################

Time to process claim: 40.55 seconds.
------------------------------------------------


--- Processing Claim ID: 134850 ---
DEBUG 1.0 (query_generator):
	Claim: Ice-T refused to ever make hip-hop music.
	Entities: []
	LLM Output: ['ice-t', 'hip_hop_music', 'ice-t_discography', 'ice-t_career', 'hip_hop', 'rap_music']
-_--_--_--_--_-
DEBUG 1.1 (query_generator):
	Entities: []
	Generated Potential Titles: ['ice-t', 'hip_hop', 'rap_music', 'ice-t_career', 'hip_hop_music', 'ice-t_discography']
	Selected Titles for Retrieval: ['ice-t', 'hip_hop', 'rap_music', 'ice-t_career', 'hip_hop_music', 'ice-t_discography']
-_--_--_--_--_--_--_--_--_--_-
###### M1: Retrieving Documents ###

Processing Claims:  53%|█████▎    | 16/30 [12:20<11:01, 47.28s/it]

DEBUG 3.2/3.3 (module_3_classification):
	LLM Classification Result: REFUTES
	Exit Status: OK
-_--_--_--_--_--_--_--_--_--_-
Adding gold label/evidence to prediction.
##########################################################################

##########################################################################

Time to process claim: 48.73 seconds.
------------------------------------------------


--- Processing Claim ID: 124578 ---
DEBUG 1.0 (query_generator):
	Claim: The Gettysburg Address is a speech.
	Entities: ['Gettysburg Address']
	LLM Output: ['gettysburg_address', 'speech', 'abraham_lincoln', 'american_civil_war', 'historical_speeches']
-_--_--_--_--_-
DEBUG 1.1 (query_generator):
	Entities: ['Gettysburg Address']
	Generated Potential Titles: ['speech', 'abraham_lincoln', 'american_civil_war', 'gettysburg_address', 'historical_speeches']
	Selected Titles for Retrieval: ['speech', 'abraham_lincoln', 'american_civil_war', 'gettysburg_address', 'historical_speeches']
-_--_

Processing Claims:  57%|█████▋    | 17/30 [14:26<15:20, 70.84s/it]

DEBUG 3.2/3.3 (module_3_classification):
	LLM Classification Result: SUPPORTS
	Exit Status: OK
-_--_--_--_--_--_--_--_--_--_-
Adding gold label/evidence to prediction.
##########################################################################

##########################################################################

Time to process claim: 125.62 seconds.
------------------------------------------------


--- Processing Claim ID: 134126 ---
DEBUG 1.0 (query_generator):
	Claim: Jason Bourne removed Riz Ahmed from the movie's cast.
	Entities: ['Riz Ahmed', 'Bourne', 'Jason']
	LLM Output: ['jason_bourne', 'riz_ahmed', 'movie_cast', 'the_bourne_identity', 'the_bourne_suppremacy', 'the_bourne_ultimatum', 'the_bourne_legacy', 'jason_bourne_(film)']
-_--_--_--_--_-
DEBUG 1.1 (query_generator):
	Entities: ['Riz Ahmed', 'Bourne', 'Jason']
	Generated Potential Titles: ['riz_ahmed', 'movie_cast', 'jason_bourne', 'the_bourne_legacy', 'the_bourne_identity', 'jason_bourne_(film)', 'the_bourne_ultim

Processing Claims:  60%|██████    | 18/30 [14:52<11:29, 57.50s/it]

DEBUG 2.3.3 (LLM Selection):
	LLM Raw Output:
NOT ENOUGH INFO
-_--_--_--_--_-
M2 Iter 3: LLM indicated 'NOT ENOUGH INFO' from the provided candidates.
M2 Iter 3: Stopping because LLM found no evidence after two tries.
M2: Finished iterations. No evidence selected.
-------------------------------------------------------------------

###### M3: Starting Classification ######
M3: No evidence text provided. Classifying as NOT ENOUGH INFO.
Adding gold label/evidence to prediction.
##########################################################################

##########################################################################

Time to process claim: 26.44 seconds.
------------------------------------------------


--- Processing Claim ID: 125577 ---
DEBUG 1.0 (query_generator):
	Claim: Ron Dennis is unemployed.
	Entities: ['Dennis', 'Ron']
	LLM Output: ['ron_dennis', 'mclaren', 'formula_one', 'automobile_racing']
-_--_--_--_--_-
DEBUG 1.1 (query_generator):
	Entities: ['Dennis', 'Ron']
	

Processing Claims:  63%|██████▎   | 19/30 [15:32<09:32, 52.03s/it]

DEBUG 2.3.3 (LLM Selection):
	LLM Raw Output:
NOT ENOUGH INFO
-_--_--_--_--_-
M2 Iter 3: LLM indicated 'NOT ENOUGH INFO' from the provided candidates.
M2 Iter 3: Stopping because LLM found no evidence after two tries.
M2: Finished iterations. No evidence selected.
-------------------------------------------------------------------

###### M3: Starting Classification ######
M3: No evidence text provided. Classifying as NOT ENOUGH INFO.
Adding gold label/evidence to prediction.
##########################################################################

##########################################################################

Time to process claim: 39.29 seconds.
------------------------------------------------


--- Processing Claim ID: 132244 ---
DEBUG 1.0 (query_generator):
	Claim: Wolfgang Amadeus Mozart showed he was a child protege.
	Entities: ['Amadeus Mozart', 'Wolfgang']
	LLM Output: ['wolfgang_amadeus_mozart', 'child_prodigy', 'history_of_classical_music', 'classical_music', '

Processing Claims:  67%|██████▋   | 20/30 [16:34<09:11, 55.13s/it]

DEBUG 3.2/3.3 (module_3_classification):
	LLM Classification Result: SUPPORTS
	Exit Status: OK
-_--_--_--_--_--_--_--_--_--_-
Adding gold label/evidence to prediction.
##########################################################################

##########################################################################

Time to process claim: 62.37 seconds.
------------------------------------------------


--- Processing Claim ID: 225798 ---
DEBUG 1.0 (query_generator):
	Claim: Chinatown's writer is a convicted statutory rapist.
	Entities: ['Chinatown']
	LLM Output: ['chinatown', 'robert_towne', 'statutory_rape', 'conviction', 'criminal_law', 'sexual_offenses']
-_--_--_--_--_-
DEBUG 1.1 (query_generator):
	Entities: ['Chinatown']
	Generated Potential Titles: ['chinatown', 'conviction', 'criminal_law', 'robert_towne', 'statutory_rape', 'sexual_offenses']
	Selected Titles for Retrieval: ['chinatown', 'conviction', 'criminal_law', 'robert_towne', 'statutory_rape', 'sexual_offenses']
-_--_-

Processing Claims:  70%|███████   | 21/30 [17:18<07:46, 51.80s/it]

DEBUG 2.3.3 (LLM Selection):
	LLM Raw Output:
NOT ENOUGH INFO
-_--_--_--_--_-
M2 Iter 3: LLM indicated 'NOT ENOUGH INFO' from the provided candidates.
M2 Iter 3: Stopping because LLM found no evidence after two tries.
M2: Finished iterations. No evidence selected.
-------------------------------------------------------------------

###### M3: Starting Classification ######
M3: No evidence text provided. Classifying as NOT ENOUGH INFO.
Adding gold label/evidence to prediction.
##########################################################################

##########################################################################

Time to process claim: 44.03 seconds.
------------------------------------------------


--- Processing Claim ID: 46810 ---
DEBUG 1.0 (query_generator):
	Claim: One Dance has always been banned in the Netherlands.
	Entities: ['Netherlands', 'Dance']
	LLM Output: ['one_dance', 'netherlands', 'music_ban', 'drake', 'censorship_in_music', 'dance_music']
-_--_--_--_--_-

Processing Claims:  73%|███████▎  | 22/30 [17:59<06:27, 48.47s/it]

DEBUG 2.3.3 (LLM Selection):
	LLM Raw Output:
NOT ENOUGH INFO
-_--_--_--_--_-
M2 Iter 3: LLM indicated 'NOT ENOUGH INFO' from the provided candidates.
M2 Iter 3: Stopping because LLM found no evidence after two tries.
M2: Finished iterations. No evidence selected.
-------------------------------------------------------------------

###### M3: Starting Classification ######
M3: No evidence text provided. Classifying as NOT ENOUGH INFO.
Adding gold label/evidence to prediction.
##########################################################################

##########################################################################

Time to process claim: 40.69 seconds.
------------------------------------------------


--- Processing Claim ID: 85923 ---
DEBUG 1.0 (query_generator):
	Claim: Adidas designs items.
	Entities: ['Adidas']
	LLM Output: ['adidas', 'adidas_products', 'sportswear', 'fashion_design', 'athletic_shoes']
-_--_--_--_--_-
DEBUG 1.1 (query_generator):
	Entities: ['Adidas']
	G

Processing Claims:  77%|███████▋  | 23/30 [18:26<04:54, 42.12s/it]

DEBUG 3.2/3.3 (module_3_classification):
	LLM Classification Result: SUPPORTS
	Exit Status: OK
-_--_--_--_--_--_--_--_--_--_-
Adding gold label/evidence to prediction.
##########################################################################

##########################################################################

Time to process claim: 27.32 seconds.
------------------------------------------------


--- Processing Claim ID: 181252 ---
DEBUG 1.0 (query_generator):
	Claim: Sean Gunn is an American poet.
	Entities: ['American', 'Gunn', 'Sean']
	LLM Output: ['sean_gunn', 'american_poets', 'list_of_american_poets']
-_--_--_--_--_-
DEBUG 1.1 (query_generator):
	Entities: ['American', 'Gunn', 'Sean']
	Generated Potential Titles: ['sean_gunn', 'american_poets', 'list_of_american_poets']
	Selected Titles for Retrieval: ['sean_gunn', 'american_poets', 'list_of_american_poets']
-_--_--_--_--_--_--_--_--_--_-
###### M1: Retrieving Documents ######
DEBUG 1.2: Successfully retrieved intro from

Processing Claims:  80%|████████  | 24/30 [18:49<03:37, 36.29s/it]

DEBUG 2.3.3 (LLM Selection):
	LLM Raw Output:
NOT ENOUGH INFO
-_--_--_--_--_-
M2 Iter 3: LLM indicated 'NOT ENOUGH INFO' from the provided candidates.
M2 Iter 3: Stopping because LLM found no evidence after two tries.
M2: Finished iterations. No evidence selected.
-------------------------------------------------------------------

###### M3: Starting Classification ######
M3: No evidence text provided. Classifying as NOT ENOUGH INFO.
Adding gold label/evidence to prediction.
##########################################################################

##########################################################################

Time to process claim: 22.71 seconds.
------------------------------------------------


--- Processing Claim ID: 1933 ---
DEBUG 1.0 (query_generator):
	Claim: Dissociative identity disorder is known as multiple personality disorder.
	Entities: []
	LLM Output: ['dissociative_identity_disorder', 'multiple_personality_disorder', 'mental_disorders', 'psychological_con

Processing Claims:  83%|████████▎ | 25/30 [19:31<03:10, 38.05s/it]

DEBUG 3.2/3.3 (module_3_classification):
	LLM Classification Result: SUPPORTS
	Exit Status: OK
-_--_--_--_--_--_--_--_--_--_-
Adding gold label/evidence to prediction.
##########################################################################

##########################################################################

Time to process claim: 42.16 seconds.
------------------------------------------------


--- Processing Claim ID: 88894 ---
DEBUG 1.0 (query_generator):
	Claim: Zoe Saldana is a Leo.
	Entities: ['Saldana', 'Zoe']
	LLM Output: ['zoe_saldana', 'leo_(astrology)', 'zodiac_signs', 'astrology']
-_--_--_--_--_-
DEBUG 1.1 (query_generator):
	Entities: ['Saldana', 'Zoe']
	Generated Potential Titles: ['astrology', 'zoe_saldana', 'zodiac_signs', 'leo_(astrology)']
	Selected Titles for Retrieval: ['astrology', 'zoe_saldana', 'zodiac_signs', 'leo_(astrology)']
-_--_--_--_--_--_--_--_--_--_-
###### M1: Retrieving Documents ######
DEBUG 1.2: Successfully retrieved intro from 'Astrology'

Processing Claims:  87%|████████▋ | 26/30 [20:06<02:28, 37.15s/it]

DEBUG 2.3.3 (LLM Selection):
	LLM Raw Output:
NOT ENOUGH INFO
-_--_--_--_--_-
M2 Iter 3: LLM indicated 'NOT ENOUGH INFO' from the provided candidates.
M2 Iter 3: Stopping because LLM found no evidence after two tries.
M2: Finished iterations. No evidence selected.
-------------------------------------------------------------------

###### M3: Starting Classification ######
M3: No evidence text provided. Classifying as NOT ENOUGH INFO.
Adding gold label/evidence to prediction.
##########################################################################

##########################################################################

Time to process claim: 35.03 seconds.
------------------------------------------------


--- Processing Claim ID: 17915 ---
DEBUG 1.0 (query_generator):
	Claim: Fred Seibert has produced comedy programs.
	Entities: ['Seibert', 'Fred']
	LLM Output: ['fred_seibert', 'comedy_television', 'animation', 'television_producer', 'list_of_animated_television_series']
-_--_--

Processing Claims:  90%|█████████ | 27/30 [20:39<01:47, 35.89s/it]

DEBUG 3.2/3.3 (module_3_classification):
	LLM Classification Result: NOT ENOUGH INFO
	Exit Status: NOT ENOUGH INFO
-_--_--_--_--_--_--_--_--_--_-
Adding gold label/evidence to prediction.
##########################################################################

##########################################################################

Time to process claim: 32.94 seconds.
------------------------------------------------


--- Processing Claim ID: 58396 ---
DEBUG 1.0 (query_generator):
	Claim: Konidela Production Company was established.
	Entities: ['Production Company', 'Konidela']
	LLM Output: ['konidela_production_company', 'chiranjeevi', 'telugu_cinema', 'film_production_companies_in_india']
-_--_--_--_--_-
DEBUG 1.1 (query_generator):
	Entities: ['Production Company', 'Konidela']
	Generated Potential Titles: ['chiranjeevi', 'telugu_cinema', 'konidela_production_company', 'film_production_companies_in_india']
	Selected Titles for Retrieval: ['chiranjeevi', 'telugu_cinema', 'konid

Processing Claims:  93%|█████████▎| 28/30 [21:08<01:07, 33.81s/it]

DEBUG 3.2/3.3 (module_3_classification):
	LLM Classification Result: SUPPORTS
	Exit Status: OK
-_--_--_--_--_--_--_--_--_--_-
Adding gold label/evidence to prediction.
##########################################################################

##########################################################################

Time to process claim: 28.96 seconds.
------------------------------------------------


--- Processing Claim ID: 150751 ---
DEBUG 1.0 (query_generator):
	Claim: Paul von Hindenburg was a man.
	Entities: ['Paul']
	LLM Output: ['paul_von_hindenburg', 'man', 'german_field_marshals', 'politicians_of_the_weimar_republic', 'presidents_of_germany']
-_--_--_--_--_-
DEBUG 1.1 (query_generator):
	Entities: ['Paul']
	Generated Potential Titles: ['man', 'paul_von_hindenburg', 'presidents_of_germany', 'german_field_marshals', 'politicians_of_the_weimar_republic']
	Selected Titles for Retrieval: ['man', 'paul_von_hindenburg', 'presidents_of_germany', 'german_field_marshals', 'politici

Processing Claims:  97%|█████████▋| 29/30 [21:43<00:34, 34.36s/it]

DEBUG 3.2/3.3 (module_3_classification):
	LLM Classification Result: SUPPORTS
	Exit Status: OK
-_--_--_--_--_--_--_--_--_--_-
Adding gold label/evidence to prediction.
##########################################################################

##########################################################################

Time to process claim: 35.66 seconds.
------------------------------------------------


--- Processing Claim ID: 179831 ---
DEBUG 1.0 (query_generator):
	Claim: Vic Mensa was born June 12, 1993.
	Entities: ['Mensa', 'Vic']
	LLM Output: ['vic_mensa', 'list_of_rappers_from_chicago', '2010s_rap_music', 'hip_hop_music']
-_--_--_--_--_-
DEBUG 1.1 (query_generator):
	Entities: ['Mensa', 'Vic']
	Generated Potential Titles: ['vic_mensa', 'hip_hop_music', '2010s_rap_music', 'list_of_rappers_from_chicago']
	Selected Titles for Retrieval: ['vic_mensa', 'hip_hop_music', '2010s_rap_music', 'list_of_rappers_from_chicago']
-_--_--_--_--_--_--_--_--_--_-
###### M1: Retrieving Documents 

Processing Claims: 100%|██████████| 30/30 [22:11<00:00, 44.38s/it]

DEBUG 3.2/3.3 (module_3_classification):
	LLM Classification Result: REFUTES
	Exit Status: OK
-_--_--_--_--_--_--_--_--_--_-
Adding gold label/evidence to prediction.
##########################################################################

##########################################################################

Time to process claim: 27.34 seconds.
------------------------------------------------


Finished processing 30 claims.
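
The per-claim timings in the log above can be summarized without rerunning the system. The following is a minimal sketch, assuming the cell output has been captured as a string; `claim_times`, `TIME_PATTERN`, and `sample_log` are hypothetical names that do not appear in the notebook, and the three sample lines are copied from the first three claims in the log.

```python
import re
from statistics import mean
from typing import List

# Matches the timing lines printed after each claim, e.g.
# "Time to process claim: 43.82 seconds."
TIME_PATTERN = re.compile(r"Time to process claim: (\d+(?:\.\d+)?) seconds\.")

def claim_times(log_text: str) -> List[float]:
    """Return every per-claim processing time (in seconds) found in the log text."""
    return [float(value) for value in TIME_PATTERN.findall(log_text)]

# Three timing lines copied from the log above (hypothetical captured output).
sample_log = """\
Time to process claim: 43.82 seconds.
------------------------------------------------
Time to process claim: 46.74 seconds.
------------------------------------------------
Time to process claim: 65.18 seconds.
"""

times = claim_times(sample_log)
print(f"{len(times)} claims, mean {mean(times):.2f} s, max {max(times):.2f} s")
```

Applied to the full captured output, the same function would give one value per processed claim, which makes slow outliers such as the 125.62-second claim easy to spot.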

--- FEVER Scoring Results ---
Strict Score (Exact Match): 50.00%
Label Accuracy: 76.67%
Evidence Precision: 47.50%
Evidence Recall: 35.00%
Evidence F1 Score: 40.30%
Number of test cases scored: 30
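
The reported evidence F1 is consistent with the harmonic mean of the evidence precision and recall printed above. A minimal sketch of that arithmetic, assuming the standard F1 definition; the helper name `f1_from_pr` is hypothetical and is not part of the notebook or of the FEVER scorer.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values copied from the scoring output above, expressed as fractions.
evidence_precision = 0.4750
evidence_recall = 0.3500

evidence_f1 = f1_from_pr(evidence_precision, evidence_recall)
print(f"Evidence F1 Score: {evidence_f1:.2%}")
```

With 30 scored claims, the strict score of 50.00% and the label accuracy of 76.67% correspond to 15 and 23 claims respectively.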

--- Sample Predictions (Output Format) ---
{
  "id": 113501,
  "predicted_label": "REFUTES",
  "predicted_evidence": [
    [
      "Grease_-LRB-film-RRB-",
      3
    ],
    [
      "Grease_-LRB-film-RRB-",
      5
    ]
  ],
  "label": "NOT ENOUGH INFO",
  "evidence": []
}
{
  "id": 163803,
  "predicted_label": "NOT ENOUGH INFO",
  "predicted_evidence": [],






Report saved to: /content/drive/MyDrive/SUNY_Poly_DSA598/datasets/FEVER/paper_test_results/tuned_GPT-sBERTn1024-sentEx-RephsHist-T4_run_report_test_n30_250504_2348.csv
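
The gap between label accuracy and strict score comes from the evidence requirement: a claim counts toward the strict score only if the label is right and, for verifiable claims, a complete gold evidence group was retrieved. The sketch below illustrates that rule under simplifying assumptions. `strict_match` is a hypothetical helper, not the `fever.scorer` implementation, and gold evidence is assumed to be already reduced to groups of `[page, sentence_index]` pairs.

```python
from typing import Iterable, List, Sequence

def strict_match(
    predicted_label: str,
    predicted_evidence: Iterable[Sequence],
    gold_label: str,
    gold_evidence_groups: List[List[Sequence]],
    max_evidence: int = 5,
) -> bool:
    """Simplified strict-correctness check for one claim (illustrative only)."""
    # A wrong label can never be strictly correct.
    if predicted_label != gold_label:
        return False
    # NOT ENOUGH INFO claims have no gold evidence to match.
    if gold_label == "NOT ENOUGH INFO":
        return True
    # Only the first `max_evidence` predicted sentences are considered.
    predicted = {tuple(e) for e in list(predicted_evidence)[:max_evidence]}
    # At least one non-empty gold group must be fully covered by the predictions.
    return any(
        len(group) > 0 and all(tuple(e) in predicted for e in group)
        for group in gold_evidence_groups
    )

# Sample prediction for claim 113501 from the output above:
# predicted REFUTES with two evidence sentences, gold label NOT ENOUGH INFO.
print(strict_match(
    "REFUTES",
    [["Grease_-LRB-film-RRB-", 3], ["Grease_-LRB-film-RRB-", 5]],
    "NOT ENOUGH INFO",
    [],
))
```

For claim 113501 the check fails on the label alone, so the retrieved evidence sentences do not affect the outcome.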
