# This is a *quickstart* notebook for the benchmark **GreekBarBench** and the accompanying judge meta-evaluation benchmark **GBB-JME**.

This notebook is part of the research presented in the paper:

***GreekBarBench: A Challenging Benchmark for Free-Text Legal Reasoning and Citations***

The code is available at <https://github.com/nlpaueb/greek-bar-bench>.

The dataset is available at <https://huggingface.co/datasets/AUEB-NLP/greek-bar-bench>.

Please cite this paper using the following BibTeX entry:

```bibtex
@misc{chlapanis2025greekbarbenchchallengingbenchmarkfreetext,
      title={GreekBarBench: A Challenging Benchmark for Free-Text Legal Reasoning and Citations},
      author={Odysseas S. Chlapanis and Dimitrios Galanis and Nikolaos Aletras and Ion Androutsopoulos},
      year={2025},
      eprint={2505.17267},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.17267},
}
```

MIT License

Copyright (c) 2025 NLP AUEB

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

<small>
The implementation for the SPA calculation is taken as is from:
</small>

https://github.com/google-research/mt-metrics-eval/blob/main/mt_metrics_eval/pce.py

<small>
Based on the method described in the following paper:

*Brian Thompson, Nitika Mathur, Daniel Deutsch, and Huda Khayrallah. 2024. Improving Statistical Significance in Human Evaluation of Automatic Metrics via Soft Pairwise Accuracy. In Proceedings of the Ninth Conference on Machine Translation, pages 1222–1234, Miami, Florida, USA. Association for Computational Linguistics.*
</small>

## Upgrade datasets

We need to upgrade the `datasets` library for compatibility and to avoid path issues in Google Colab.

**Important:** After this cell runs, you **must** restart the runtime (`Runtime > Restart runtime`) before proceeding. Then, you can run the rest of the notebook (`Runtime > Run all`).

In [None]:
! pip install -q --upgrade datasets

## Initialize parameters

In [33]:
MODEL = "gemini-2.0-flash-001"
JUDGE = "gpt-4.1-2025-04-14"
JUDGE_TYPE = "spans"  # "spans" or "simple"; must match the checks in evaluate_judge_instance

OPENAI_API_KEY = "YOUR_API_KEY"
GOOGLE_API_KEY = "YOUR_API_KEY"

In [3]:
import os


os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY
os.environ['GOOGLE_API_KEY'] = GOOGLE_API_KEY
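
Hard-coding keys in a notebook makes them easy to leak when the notebook is shared. Below is a minimal sketch of an alternative, assuming the keys may already be exported in the environment. `resolve_api_key` is a hypothetical helper and is not part of the benchmark code.

```python
import os
from getpass import getpass


def resolve_api_key(env_name, placeholder="YOUR_API_KEY", prompt_fn=getpass):
    """Return the key stored in the environment variable `env_name`.

    If the variable is unset, empty, or still equal to the placeholder,
    ask for the key via `prompt_fn` (getpass by default, so the key is not
    echoed) and store the result in the environment.
    """
    value = os.environ.get(env_name, "")
    if not value or value == placeholder:
        value = prompt_fn(f"Enter {env_name}: ")
        os.environ[env_name] = value
    return value
```

Calling `resolve_api_key("OPENAI_API_KEY")` and `resolve_api_key("GOOGLE_API_KEY")` before the client cells would then replace the two assignments above.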

## Load datasets

In [4]:
from datasets import load_dataset

dataset_name = "AUEB-NLP/greek-bar-bench"
split_name = "test"
name = "greekbarbench"
name_jme = "gbb-jme"
dataset = load_dataset(dataset_name, name=name, split=split_name)

dataset_jme = load_dataset(dataset_name, name=name_jme, split=split_name)


In [58]:
dataset

Dataset({
    features: ['facts', 'question', 'answer', 'chapters', 'spans', 'area', 'date', 'articles', 'number', 'index'],
    num_rows: 284
})

In [59]:
dataset_jme

Dataset({
    features: ['number', 'model', 'response', 'facts', 'articles', 'analysis', 'avg', 'area', 'date', 'reasoning', 'index'],
    num_rows: 305
})

## Set up OpenAI and Google generation

In [5]:
import os
from openai import OpenAI


def gpt_prompt(system, prompt):
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": prompt},
    ]
    return messages


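def call_gpt(model, system, prompt):
    """Query an OpenAI model via the Responses API; return the text, or None on failure."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    try:
        response = client.responses.create(
            model=model,
            input=gpt_prompt(system, prompt),
            text={"format": {"type": "text"}},
            tools=[],
            temperature=1,
            max_output_tokens=8192,
            top_p=1,
        )
    except Exception as e:
        # Return None so that callers (e.g. the judge retry loop) can retry or skip
        print(f"OpenAI call failed: {e}")
        return None
    # output_text concatenates the text of all output items in the response
    return response.output_text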

In [6]:
# # Example usage
# result = call_gpt("gpt-4.1-mini", "", "What is the Greek Constitution?")
# print(result)

The Greek Constitution is the fundamental law of Greece that establishes the structure, functions, and principles of the Greek state. It defines the organization of government, the separation of powers, the rights and duties of citizens, and the framework within which laws are made and enforced. The Constitution serves as the supreme legal authority in Greece, meaning that all laws and government actions must comply with its provisions.

The current Greek Constitution, often referred to as the Constitution of 1975 (with several subsequent amendments), was adopted after the fall of the military junta in 1974 and the restoration of democracy. It enshrines democratic principles, the rule of law, human rights, and the role of Greece as a parliamentary republic. The Constitution also establishes important institutions such as the Parliament, the Presidency, the Judiciary, and defines Greece's relationship with the European Union.

In summary, the Greek Constitution is the foundational legal

In [7]:
from google import genai
from google.genai import types



def call_gemini(model, system, prompt):
    client = genai.Client()
    response = client.models.generate_content(
        model=model,
        config=types.GenerateContentConfig(
            system_instruction=system),
        contents=[prompt]
    )

    return response.text

In [8]:
# result = call_gemini("gemini-2.0-flash", "", "What is the Greek Constitution?")
# print(result)

The Greek Constitution is the supreme law of Greece. It outlines the fundamental principles of the state, the rights and freedoms of its citizens, and the structure and powers of the government. Here's a breakdown of key aspects:

*   **Foundation of the Greek State:** It establishes Greece as a parliamentary republic.
*   **Fundamental Rights:** It guarantees a wide range of human rights and fundamental freedoms, including freedom of speech, assembly, religion, and the press, as well as protection from discrimination and arbitrary arrest.
*   **Separation of Powers:** It divides governmental power among three branches:
    *   **Legislative:** The Hellenic Parliament, elected by the people, is responsible for making laws.
    *   **Executive:** The President of the Republic, elected by Parliament, is the head of state. The Prime Minister, appointed by the President, is the head of government and leads the cabinet.
    *   **Judicial:** The courts are responsible for interpreting and a

# Evaluate model on **GreekBarBench**

## Prompts for candidates

In [9]:
candidate_system_prompt = '''You are a legal assistant who answers questions in Greek, focusing on the legal system and the laws of Greece. You analyze your reasoning and respond with well-supported answers and correct references. You only respond in txt format and with only one short paragraph without headings.
'''

In [10]:
candidate_user_prompt = '''You are given the numbered facts of a legal case, the current relevant legislation of Greece, and a question regarding this case. After carefully reading the entire text, you are to provide a comprehensive answer to the question, analyzing your reasoning. You should answer with references to the relevant legislation using the appropriate abbreviations for the laws (for example, you can say: "according to article X CC" to refer to article "X" of the Civil Code), where necessary. Additionally, you must provide references to the facts of the case (for example, you can say: "according to statement Y of the case data"), where necessary. Answer strictly within the extent of one short paragraph.
'''

## Get user prompt for sample

In [11]:
def get_user_prompt(instance, user_prompt):
    """
    Construct a prompt from a dataset sample by combining the user prompt with the facts, question, and legal code chapters of the instance.
    """
    prompt = (
        f"{user_prompt}\n"
        f"Case facts:\n{instance['facts']}\n"
        f"Question:\n{instance['question']}\n\n"
        f"{instance['chapters']}"
    )
    return prompt
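
To see the layout of the resulting prompt without loading the dataset, here is a small sketch that applies the same function to a toy instance. The instance values and the short instruction string are made up for illustration; real instances come from the dataset and the real instruction is `candidate_user_prompt`.

```python
def get_user_prompt(instance, user_prompt):
    # Same logic as the cell above, repeated so this example runs on its own
    return (
        f"{user_prompt}\n"
        f"Case facts:\n{instance['facts']}\n"
        f"Question:\n{instance['question']}\n\n"
        f"{instance['chapters']}"
    )


# Hypothetical toy instance (not taken from the benchmark)
toy_instance = {
    "facts": "1. A lent a tractor to company X.\n2. The loan ended on 12-9-2019.",
    "question": "Is the claim for return of the tractor well founded?",
    "chapters": "<text of the relevant Civil Code articles>",
}

toy_prompt = get_user_prompt(toy_instance, "Answer in one short paragraph.")
print(toy_prompt)
```

The sections appear in the order instruction, case facts, question, legislation, so the (often very long) legal code chapters come last.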

## Generation

For demonstration purposes, we select the subset whose `date` field is `A_2023` and evaluate the model set in `MODEL` (`gemini-2.0-flash-001`).

In [12]:
subset = dataset.filter(
    lambda example: example['date'] == 'A_2023'
)
print(len(subset))

30


In [13]:
from pprint import pprint


instance = subset[0]
prompt = get_user_prompt(instance, candidate_user_prompt)
# print(prompt)
print(f"Question:\n{instance['question']}")
answer = call_gemini(MODEL, candidate_system_prompt, prompt)
pprint(answer)

Question:
Πώς αξιολογείτε από άποψη νομικής βασιμότητας την αγωγή καθ' όλα τα ως άνω αιτήματα και βάσεις της;
('Η αγωγή του Α αξιολογείται ως νομικά βάσιμη ως προς την απόδοση του '
 'γεωργικού μηχανήματος, δεδομένου ότι, σύμφωνα με τα αναφερόμενα στην αγωγή '
 '(1-4), στοιχειοθετείται σύμβαση χρησιδανείου κατά το άρθρο 810 ΑΚ, η οποία '
 'έληξε, αλλά η εναγόμενη αρνείται την απόδοση του πράγματος. Ωστόσο, όσον '
 'αφορά την αξίωση αποζημίωσης για την μη επιστροφή του μηχανήματος (5-6), η '
 'νομική της βασιμότητα εξαρτάται από την απόδειξη της υπαιτιότητας της '
 'εναγόμενης για την καθυστέρηση και της ζημίας του ενάγοντος, σύμφωνα με τα '
 'άρθρα 330, 335, 340 και 343 ΑΚ, καθώς και την ύπαρξη αιτιώδους συνάφειας '
 'μεταξύ της καθυστέρησης και της ζημίας. Η ένσταση έλλειψης ενεργητικής '
 'νομιμοποίησης (7) θα κριθεί από το δικαστήριο βάσει των αποδείξεων περί του '
 'ποιος ήταν ο συμβαλλόμενος στη σύμβαση χρησιδανείου, ενώ ο ισχυρισμός της '
 'εναγόμενης περί οικονομικής αδυναμίας (

In [69]:
# from tqdm import tqdm
# import json

# answers = []

# for instance in tqdm(subset, desc="Processing dataset"):
#   prompt = get_user_prompt(instance, candidate_user_prompt)
#   answer = call_gemini(MODEL, candidate_system_prompt, prompt)
#   answers.append(answer)

# output_filename = f"answers_{MODEL}.json"

# try:
#     with open(output_filename, 'w', encoding='utf-8') as f:
#         json.dump(answers, f, indent=4, ensure_ascii=False)
#     print(f"Successfully saved answers (JSON) to {output_filename}")
# except IOError as e:
#     print(f"Error saving JSON file {output_filename}: {e}")
# except Exception as e:
#     print(f"An error occurred during JSON serialization: {e}")

In [22]:
import json


input_filename = f"answers_{MODEL}.json"

loaded_answers_from_json = []

try:
    with open(input_filename, 'r', encoding='utf-8') as f:
        loaded_answers_from_json = json.load(f)

    if isinstance(loaded_answers_from_json, list) and all(isinstance(item, str) for item in loaded_answers_from_json):
        print(f"Successfully loaded answers (JSON) from {input_filename}")
    else:
        print(f"Warning: Data loaded from {input_filename} is not a list of strings as expected.")

except FileNotFoundError:
    print(f"Error: The file {input_filename} was not found.")
except json.JSONDecodeError:
    print(f"Error: Could not decode JSON from {input_filename}. The file might be corrupted or not valid JSON.")
except IOError as e:
    print(f"Error reading file {input_filename}: {e}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

answers = loaded_answers_from_json
pprint(loaded_answers_from_json[:2])

Successfully loaded answers (JSON) from answers_gemini-2.0-flash-001.json
['Η αγωγή του Α φαίνεται να έχει νομική βάση, ειδικά όσον αφορά το αίτημα για '
 'την απόδοση του γεωργικού μηχανήματος, καθώς σύμφωνα με το άρθρο 810 AK, ο '
 'χρησάμενος (εν προκειμένω η εταιρεία X) έχει υποχρέωση να αποδώσει το πράγμα '
 'μετά τη λήξη της σύμβασης χρησιδανείου, η οποία έληξε στις 12-9-2019 σύμφωνα '
 'με το στοιχείο 2 της υπόθεσης. Όσον αφορά την αποζημίωση για τη μη επιστροφή '
 'του μηχανήματος, ο ενάγων θα πρέπει να αποδείξει τη ζημία του και την '
 'αιτιώδη συνάφεια με την καθυστέρηση της απόδοσης, σύμφωνα με τα άρθρα 298, '
 '330 και 343 AK, κάτι που φαίνεται να επιχειρεί με την αναφορά στη μίσθωση '
 'άλλου μηχανήματος και την απώλεια κερδών, σύμφωνα με το στοιχείο 5. Η '
 'ένσταση της εναγόμενης περί έλλειψης ενεργητικής νομιμοποίησης (στοιχείο 7) '
 'είναι κρίσιμη και θα κριθεί από το δικαστήριο βάσει των αποδείξεων. Η '
 'πρόσθετη παρέμβαση του Δ υπέρ του Α βασίζεται στην από 30.8.202

## Evaluation

### Evaluation prompts

#### Prompt for Simple-Judge

In [15]:
simple_judge_system_prompt = '''You are a legal exam evaluator. You will be given the following:
    1. The facts of a case
    2. The relevant legislation
    3. A question
    4. An ideal reference answer
    5. An answer for evaluation
You will need to evaluate the answer with three scores and an explanation for each. Each score consists of an integer from 1 to 10, with 10 being excellent. The reference answer is considered excellent (10 in all).
The Facts Score concerns the facts of the case. If the ideal reference answer mentions certain specific facts from the case, while the answer for evaluation does not mention them, points should be deducted. Similarly, if the answer for evaluation mentions facts that are not useful for the answer, points should also be deducted.
The Legislation Score concerns the answer's references to the relevant articles of laws. It is necessary to refer to specific articles of laws. If no such reference is made or if reference is made to wrong articles, points should be deducted from the Legislation Score. Also, points should be deducted if the interpretation of the law is wrong.
The Analysis Score concerns a general evaluation of whether the answer for evaluation covered the original question, with correct and valid legal argumentation. Points already evaluated in the above criteria are not assessed here. At this point, the final conclusion of the answer is also evaluated. If the answer for evaluation reaches a wrong conclusion or if it does not mention a critical argument, points should be deducted.
Use plain txt text, without markdown.
Your answer should follow the template shown below, where X, Y, Z are integers (1-10):
Explanation of Facts Score: <your explanation for the score...>
Facts Score: X
Explanation of Legislation Score: <...>
Legislation Score: Y
Explanation of Analysis Score: <...>
Analysis Score: Z
'''

#### Prompt for Span-Judge

In [16]:
span_judge_system_prompt= '''You are a legal exam evaluator. You will be given the following:
    1. The facts of a case
    2. The relevant legislation
    3. A question
    4. An ideal reference answer
    5. An answer for evaluation
    6. The evaluation spans (json file)
The evaluation spans are verbatim excerpts from the text of the ideal reference answer with tags that refer to each of the three scores (facts, articles, analysis). This means that for the evaluation of a score, emphasis should be placed on whether the information from the corresponding span of the ideal reference answer exists in the answer for evaluation, and thus the appropriate score should be given. For example, for the Facts Score (facts), the spans should be present in the answer for evaluation. If spans are not present, it means that very important facts (or laws or analysis) that absolutely must be mentioned are missing. However, points can still be deducted if the answer for evaluation adds facts (or laws or analysis) that are incorrect. There are also important spans, which indicate which parts of the answer are more important for the evaluation.
You will need to evaluate the answer with three scores and an explanation for each. Each score consists of an integer from 1 to 10, with 10 being excellent. The reference answer is considered excellent (10 in all).
The Facts Score concerns the facts of the case. If the ideal reference answer mentions certain specific facts from the case, while the answer for evaluation does not mention them, points should be deducted. Similarly, if the answer for evaluation mentions facts that are not useful for the answer, points should also be deducted.
The Cited Articles Score concerns the answer's references to the relevant articles of laws. It is necessary to refer to specific articles of laws. If no such reference is made or if reference is made to wrong articles, points should be deducted from the Cited Articles Score. Also, points should be deducted if the interpretation of the law is wrong.
The Analysis Score concerns a general evaluation of whether the answer for evaluation covered the original question, with correct and valid legal argumentation. Points already evaluated in the above criteria are not assessed here. At this point, the final conclusion of the answer is also evaluated. If the answer for evaluation reaches a wrong conclusion or if it does not mention a critical argument, points should be deducted.
Use plain txt text, without markdown.
Your answer should follow the template shown below, where X, Y, Z are integers (1-10):
Explanation of Facts Score: <your explanation for the score...>
Facts Score: X
Explanation of Cited Articles Score: <...>
Cited Articles Score: Y
Explanation of Analysis Score: <...>
Analysis Score: Z
'''

### Evaluation utils

In [18]:
import re

def parse_judgement(text):
    """
    Parses a text containing judgement explanations and scores, extracting
    the explanations and the corresponding integer scores using regex.
    Ensures that extracted scores are integers between 1 and 10.

    Args:
        text: The input string containing the judgement breakdown in a specific format.

    Returns:
        A tuple containing six elements in the order:
        facts_explanation (str), articles_explanation (str), analysis_explanation (str),
        facts (int or None), articles (int or None), analysis (int or None).
        Returns empty string for missing explanations and None for missing scores
        or scores outside the 1-10 range.
    """
    # Use re.DOTALL flag so '.' matches newline characters within the explanation text
    flags = re.DOTALL

    # Define patterns for each section.
    # Each pattern looks for the Explanation title, captures the text (non-greedily .*?),
    # then looks for the corresponding Score title, and captures the digits (\d+).
    # \s* handles any whitespace (including newlines) around the titles and scores.
    # Added \b around the score title to prevent partial matches like "Total Score:"
    # The articles pattern accepts both judge templates: "Cited Articles Score"
    # (Span-Judge prompt) and "Legislation Score" (Simple-Judge prompt).
    facts_pattern = r"Explanation of Facts Score:\s*(.*?)\s*\bFacts Score:\s*(\d+)\b"
    articles_pattern = r"Explanation of (?:Cited Articles|Legislation) Score:\s*(.*?)\s*\b(?:Cited Articles|Legislation) Score:\s*(\d+)\b"
    analysis_pattern = r"Explanation of Analysis Score:\s*(.*?)\s*\bAnalysis Score:\s*(\d+)\b"

    # Search for each pattern in the text
    facts_match = re.search(facts_pattern, text, flags)
    articles_match = re.search(articles_pattern, text, flags)
    analysis_match = re.search(analysis_pattern, text, flags)

    # --- Extract and validate Facts score and explanation ---
    facts_explanation = ""
    facts = None
    if facts_match:
        facts_explanation = facts_match.group(1).strip()
        score_str = facts_match.group(2)
        try:
            score_int = int(score_str)
            # Validate the score is between 1 and 10
            if 1 <= score_int <= 10:
                facts = score_int
            else:
                # Score found but outside the valid range
                facts = None
        except ValueError:
            # Should not happen with \d+ pattern, but handle defensively
            facts = None

    # --- Extract and validate Cited Articles score and explanation ---
    articles_explanation = ""
    articles = None
    if articles_match:
        articles_explanation = articles_match.group(1).strip()
        score_str = articles_match.group(2)
        try:
            score_int = int(score_str)
            # Validate the score is between 1 and 10
            if 1 <= score_int <= 10:
                articles = score_int
            else:
                 # Score found but outside the valid range
                articles = None
        except ValueError:
             # Should not happen with \d+ pattern, but handle defensively
            articles = None

    # --- Extract and validate Analysis score and explanation ---
    analysis_explanation = ""
    analysis = None
    if analysis_match:
        analysis_explanation = analysis_match.group(1).strip()
        score_str = analysis_match.group(2)
        try:
            score_int = int(score_str)
            # Validate the score is between 1 and 10
            if 1 <= score_int <= 10:
                analysis = score_int
            else:
                 # Score found but outside the valid range
                analysis = None
        except ValueError:
             # Should not happen with \d+ pattern, but handle defensively
            analysis = None

    # Return the results in the specified order
    return (facts_explanation, articles_explanation, analysis_explanation,
            facts, articles, analysis)
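
The regex approach can be checked offline on a hand-written judge response. The sketch below is a compact stand-in for `parse_judgement`, not a replacement: `extract_score` is a hypothetical helper that handles one criterion at a time, and the sample text is invented to follow the judge template.

```python
import re


def extract_score(text, label):
    """Return (explanation, score) for one criterion.

    Yields ("", None) when the section is absent, and (explanation, None)
    when the score is outside the 1-10 range.
    """
    escaped = re.escape(label)
    pattern = rf"Explanation of {escaped}:\s*(.*?)\s*\b{escaped}:\s*(\d+)\b"
    match = re.search(pattern, text, re.DOTALL)  # DOTALL: explanations may span lines
    if not match:
        return "", None
    score = int(match.group(2))
    return match.group(1).strip(), (score if 1 <= score <= 10 else None)


# Invented judge output following the template in the system prompt
sample = """Explanation of Facts Score: Mentions facts 1 and 2 but omits fact 5.
Facts Score: 7
Explanation of Cited Articles Score: Cites article 810 CC correctly.
Cited Articles Score: 8
Explanation of Analysis Score: Reaches the right conclusion.
Analysis Score: 11
"""

for label in ("Facts Score", "Cited Articles Score", "Analysis Score"):
    print(label, extract_score(sample, label))
```

The out-of-range `Analysis Score: 11` is kept as an explanation but its score becomes `None`, which mirrors the validation in the function above.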


In [19]:
import json
from typing import Optional, Tuple


def evaluate_judge_instance(
    instance: dict,
    candidate_answer: str,
    model_id: str,
    judge_model: str,
    judge_type: str,       # Pass JUDGE_TYPE here ("spans" or "simple")
    system_prompt: str,    # This will be the English system prompt text
) -> dict | None:
    """
    Evaluates a single candidate answer against a single dataset instance
    using an LLM judge based on the provided English system prompt (1-10 scores),
    with a retry mechanism if initial parsing fails for key aspects.

    Args:
        instance: A dictionary containing data for one evaluation item
                  (e.g., {'facts': '...', 'number': '...', 'question': '...',
                   'answer': '...', 'chapters': '...', 'spans': {...}}).
        candidate_answer: A single string, the candidate answer to evaluate.
        model_id: A string identifier for the model that generated this answer.
        judge_model: The name of the LLM model to use as the judge.
        judge_type: Indicates the type of judge ("spans" or "simple").
        system_prompt: The system instructions for the judge model (should be the English text expecting 1-10 scores).

    Returns:
        A dictionary with the evaluation result for this answer (with scores 1-10),
        or None if critical data is missing (like spans when required).
        The dictionary will contain None values for explanations/scores if parsing
        failed after all attempts or the LLM call failed.
    """

    # --- Extract necessary data from the dataset instance ---
    number = instance.get('number', 'N/A') # Use .get for safety
    date = instance.get('date', 'N/A') # Use .get for safety
    try:
        facts = instance.get('facts', '')
        question = instance.get('question', '')
        gold_answer = instance.get('answer', '')
        chapters = instance.get('chapters', '')

        spans = instance.get('spans', None)

        # Check if spans are required and missing
        if judge_type == "spans" and spans is None:
             print(f"Error: 'spans' key not found in instance for Question {number} and JUDGE_TYPE is 'spans'. Skipping evaluation for model_id: {model_id}.")
             return None # Return None if spans are critically missing when needed

    except Exception as e:
        print(f"An unexpected error occurred while processing dataset instance (Question {number}): {e}. Cannot proceed with evaluation for model_id: {model_id}.")
        return None # Return None for critical data extraction errors

    # --- Initialize variables for parsed results (will be updated in the loop) ---
    facts_explanation, articles_explanation, analysis_explanation = None, None, None
    fact_score, articles_score, analysis_score = None, None, None # Will store 1-10 ints or None

    # --- Construct the user prompt for the judge ---
    # Ensure field names match what the *English* system prompt expects
    spans_str = ""
    if judge_type == "spans" and spans is not None:
         # Assuming the system prompt refers to "Evaluation Spans"
         # Use json.dumps to format the spans dictionary
         try:
             spans_str = "\nEvaluation Spans:\n" + json.dumps(spans, indent=2, ensure_ascii=False)
         except Exception as e:
             print(f"Warning: Could not JSON dump spans for Question {number}, Model: {model_id}: {e}. Proceeding without spans in prompt.")
             spans_str = "" # Ensure it's empty if dumping fails

    # If judge_type is 'simple', we don't add the spans JSON.

    prompt = (f"Facts:\n{facts}\n\nRelevant Legislation:\n{chapters}\n\n"
              f"Question:\n{question}\n\nReference Answer:\n{gold_answer}\n\n"
              f"Answer for Evaluation:\n{candidate_answer}"
              f"{spans_str}") # Append spans_str only if non-empty and judge_type is spans

    # --- Retry Logic ---
    max_retries = 1 # 0 means no retries (1 attempt), 1 means 1 retry (2 attempts total)
    response = None # Initialize response outside the loop

    for attempt in range(max_retries + 1):
        # print(f"--- Judging Question {number}, Model: {model_id} (Attempt {attempt + 1}/{max_retries + 1}) ---")

        # --- Call the Judge LLM Model ---
        # call_gpt should handle potential internal API errors and return None if it cannot get a response
        response = call_gpt(judge_model, system_prompt, prompt)

        # --- Parse the Judge's Response ---
        if response:
            (temp_facts_explanation, temp_articles_explanation, temp_analysis_explanation,
             temp_fact_score, temp_articles_score, temp_analysis_score) = parse_judgement(response)

            # Check if parsing was successful for all *required* elements (explanation and score for each aspect)
            # The condition is: NONE of the 6 primary results are None.
            all_parsed_ok = all(x is not None for x in [
                temp_facts_explanation, temp_articles_explanation, temp_analysis_explanation,
                temp_fact_score, temp_articles_score, temp_analysis_score
            ])

            if all_parsed_ok:
                # print(f"Parsing successful on attempt {attempt + 1}. All fields found.")
                # Assign the successfully parsed temporary results to the main variables
                facts_explanation, articles_explanation, analysis_explanation = temp_facts_explanation, temp_articles_explanation, temp_analysis_explanation
                fact_score, articles_score, analysis_score = temp_fact_score, temp_articles_score, temp_analysis_score
                break # Exit the retry loop as we have a complete result

            else:
                # Identify which fields are None for logging
                missing_fields = []
                if temp_facts_explanation is None: missing_fields.append('facts_explanation')
                if temp_articles_explanation is None: missing_fields.append('articles_explanation')
                if temp_analysis_explanation is None: missing_fields.append('analysis_explanation')
                if temp_fact_score is None: missing_fields.append('fact_score')
                if temp_articles_score is None: missing_fields.append('articles_score')
                if temp_analysis_score is None: missing_fields.append('analysis_score')

                print(f"Parsing failed on attempt {attempt + 1} ({', '.join(missing_fields)} are None).")
                if attempt < max_retries:
                    print("Retrying judge call...")
                else:
                    print("Max retries reached. Using the result from the last attempt (may contain None values).")
                    # Even if parsing failed, we keep the results from this last attempt,
                    # which might be partially valid, rather than sticking to the initial None.
                    # This is important if the second attempt parsed *some* fields correctly.
                    # Let's update the main variables with the results from the last attempt,
                    # even if partial.
                    facts_explanation, articles_explanation, analysis_explanation = temp_facts_explanation, temp_articles_explanation, temp_analysis_explanation
                    fact_score, articles_score, analysis_score = temp_fact_score, temp_articles_score, temp_analysis_score
                    break # Exit the loop after the last attempt

        else:
            # LLM call failed (response is None)
            print(f"LLM call failed on attempt {attempt + 1} (response is None).")
            if attempt < max_retries:
                print("Retrying judge call...")
            else:
                print("Max retries reached. LLM call failed.")
                # Variables remain None as initialized before the loop
                break # Exit the loop after the last attempt

    # --- After the loop: Calculate Average Score ---
    # Average is (fact_score + articles_score + analysis_score) / 3.0 if all are valid numbers (1-10)
    avg = None # Default to None
    # Collect scores that are not None and are numeric (integers are fine)
    # Use the *final* values of the score variables (fact_score, etc.) which might be None
    valid_scores = [score for score in [fact_score, articles_score, analysis_score] if isinstance(score, (int, float))]

    if len(valid_scores) == 3: # Only calculate average if all three scores are valid numbers
         try:
             avg = sum(valid_scores) / 3.0
         except Exception as e: # Catch any potential errors during summation/division
              print(f"Error calculating average for Question {number}, Model: {model_id}. Scores: {valid_scores}. Error: {e}. Keeping avg as None.")
              avg = None
    elif response is None and all(x is None for x in [facts_explanation, articles_explanation, analysis_explanation, fact_score, articles_score, analysis_score]):
         # This check is more specific after the retry loop
         print(f"Warning: LLM call failed after retries or resulted in completely unparseable response for Question {number}, Model: {model_id}. All scores/explanations are None. Result will be stored with None scores.")
         # We still proceed to store the judgement dict with None scores.
    elif any(x is None for x in [facts_explanation, articles_explanation, analysis_explanation, fact_score, articles_score, analysis_score]):
        # Some fields are None after retries due to partial parsing
        score_values = {'facts_exp': facts_explanation, 'articles_exp': articles_explanation, 'analysis_exp': analysis_explanation,
                        'facts_score': fact_score, 'articles_score': articles_score, 'analysis_score': analysis_score}
        none_fields = {k: v for k, v in score_values.items() if v is None}
        print(f"Warning: Some fields are None after parsing retries for Question {number}, Model: {model_id}. Missing fields: {list(none_fields.keys())}. Average cannot be calculated and is None.")
        # We still proceed to store the judgement dict with partial results and None average.
    else:
         # This case should only happen if all parsed fields are NOT None, but somehow valid_scores isn't 3 items.
         # This indicates a potential logic error in valid_scores or the all_parsed_ok check,
         # but print a warning just in case.
         print(f"Warning: Unexpected state - all parsed fields are not None, but could not calculate average for Question {number}, Model: {model_id} ({valid_scores}). Keeping avg as None.")


    # --- Create the result dictionary for this single answer ---
    # This dictionary is created regardless of whether scores were successfully parsed or averaged.
    # Missing/invalid data will be represented by None values.
    judgement = {
        'number': number,
        'date': date,
        'model': model_id,
        'judge_model': judge_model,
        'judge_type': judge_type,
        'facts_explanation': facts_explanation,
        'articles_explanation': articles_explanation,
        'analysis_explanation': analysis_explanation,
        'facts': fact_score,     # Facts Score (1-10 or None)
        'articles': articles_score, # Cited Articles Score (1-10 or None)
        'analysis': analysis_score, # Analysis Score (1-10 or None)
        'avg': avg               # Average (calculated on 1-10 scale or None)
    }

    return judgement
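
The averaging rule used above (an average is reported only when all three scores are present and numeric) can be illustrated in isolation. The following is a minimal sketch, not the notebook's own helper: the function name `average_if_complete` is hypothetical, and excluding `bool` values is an extra assumption of this sketch.

```python
from typing import Optional


def average_if_complete(*scores: Optional[float]) -> Optional[float]:
    """Return the mean of the scores, or None if any score is missing or non-numeric."""
    # bool is a subclass of int, so it is excluded explicitly (an assumption of this sketch).
    valid = [s for s in scores if isinstance(s, (int, float)) and not isinstance(s, bool)]
    if not scores or len(valid) != len(scores):
        return None
    return sum(valid) / len(valid)


print(average_if_complete(8, 7, 6))     # all three scores present
print(average_if_complete(8, None, 6))  # one score missing, so no average
```

Returning `None` instead of a partial average keeps incomplete judgements from being mistaken for complete ones in later aggregation.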


### Visualize results

In [20]:
import statistics

def print_results_table(judgements_list):
    """
    Calculates and prints the average scores for facts, articles, analysis,
    and avg from a list of judgement dictionaries, ignoring None values.
    Prints the results in a formatted table along with model and judge model.

    Args:
        judgements_list (list): A list of dictionaries, each representing a judgement.
                                 Expected structure: {'model': ..., 'judge_model': ...,
                                 'facts': float or None, 'articles': float or None,
                                 'analysis': float or None, 'avg': float or None, ...}
    """
    if not judgements_list:
        print("The list of judgements is empty.")
        return

    # Assume model and judge_model are the same for all judgements,
    # so we can just take them from the first one.
    first_judgement = judgements_list[0]
    model = first_judgement.get('model', 'N/A')
    judge_model = first_judgement.get('judge_model', 'N/A')

    # Collect all non-None scores for each category
    facts_scores = [j['facts'] for j in judgements_list if j.get('facts') is not None]
    articles_scores = [j['articles'] for j in judgements_list if j.get('articles') is not None]
    analysis_scores = [j['analysis'] for j in judgements_list if j.get('analysis') is not None]
    avg_scores = [j['avg'] for j in judgements_list if j.get('avg') is not None] # Avg of the calculated avgs

    # Calculate averages, handling cases where there are no non-None scores
    avg_facts = statistics.mean(facts_scores) if facts_scores else 'N/A'
    avg_articles = statistics.mean(articles_scores) if articles_scores else 'N/A'
    avg_analysis = statistics.mean(analysis_scores) if analysis_scores else 'N/A'
    avg_overall_avg = statistics.mean(avg_scores) if avg_scores else 'N/A'

    # Format the output as a table
    headers = ["Model", "Judge Model", "Facts", "Cited Articles", "Analysis", "Avg"]
    # Define column widths for alignment
    col_widths = [25, 25, 12, 14, 14, 18]

    # Print header
    header_line = " | ".join(f"{{:<{w}}}" for w in col_widths).format(*headers)
    print(header_line)
    print("-|-".join("-" * w for w in col_widths)) # Separator line

    # Format the average values for printing (e.g., 2 decimal places)
    # Handle 'N/A' string case
    formatted_avg_facts = f"{avg_facts:.2f}" if isinstance(avg_facts, (int, float)) else avg_facts
    formatted_avg_articles = f"{avg_articles:.2f}" if isinstance(avg_articles, (int, float)) else avg_articles
    formatted_avg_analysis = f"{avg_analysis:.2f}" if isinstance(avg_analysis, (int, float)) else avg_analysis
    formatted_avg_overall_avg = f"{avg_overall_avg:.2f}" if isinstance(avg_overall_avg, (int, float)) else avg_overall_avg


    # Print data row
    data_row = [
        model,
        judge_model,
        formatted_avg_facts,
        formatted_avg_articles,
        formatted_avg_analysis,
        formatted_avg_overall_avg
    ]
    print(" | ".join(f"{{:<{w}}}" for w in col_widths).format(*data_row))
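
As a small illustration of the aggregation performed by `print_results_table`, the sketch below computes per-column means while skipping `None` values. The helper name `column_means` and the toy rows are hypothetical and only serve to show the behavior on incomplete data.

```python
import statistics


def column_means(rows, keys=("facts", "articles", "analysis", "avg")):
    """Mean of each key over the rows, skipping None; None when a column has no values."""
    means = {}
    for key in keys:
        values = [row[key] for row in rows if row.get(key) is not None]
        means[key] = statistics.mean(values) if values else None
    return means


# Hypothetical judgements: the second one is missing its articles score and its average.
toy_rows = [
    {"facts": 8, "articles": 7, "analysis": 6, "avg": 7.0},
    {"facts": 9, "articles": None, "analysis": 8, "avg": None},
]
means = column_means(toy_rows)
print({k: (f"{v:.2f}" if v is not None else "N/A") for k, v in means.items()})
```

Because each column skips its own missing values, the columns may be averaged over different numbers of judgements, so the "Avg" column is not guaranteed to equal the mean of the other three columns when some scores are `None`.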


### Evaluate with LLM-Judge

In [77]:
# from tqdm import tqdm
# import json

# judgements = []

# for data, answer in tqdm(zip(subset, answers), desc="Judging answers", total=len(subset)):
#     # Assuming judge_instance function handles prompt creation and model call internally
#     judgement = evaluate_judge_instance(data, answer, MODEL, JUDGE, JUDGE_TYPE, span_judge_system_prompt) # Replace with actual judge function call
#     judgements.append(judgement)

# output_filename = f"judgements_{JUDGE}.json"

# try:
#     with open(output_filename, 'w', encoding='utf-8') as f:
#         json.dump(judgements, f, indent=4, ensure_ascii=False)
#     print(f"\nSuccessfully saved judgements (JSON) to {output_filename}")
# except IOError as e:
#     print(f"Error saving JSON file {output_filename}: {e}")
# except Exception as e:
#     print(f"An error occurred during JSON serialization: {e}")

In [23]:
import json


input_filename = f"judgements_{JUDGE}.json"

loaded_judgements_from_json = []

try:
    with open(input_filename, 'r', encoding='utf-8') as f:
        loaded_judgements_from_json = json.load(f)

    if isinstance(loaded_judgements_from_json, list) and all(isinstance(item, dict) for item in loaded_judgements_from_json):
        print(f"Successfully loaded judgements (JSON) from {input_filename}")
    else:
        print(f"Warning: Data loaded from {input_filename} is not a list of dicts as expected.")

except FileNotFoundError:
    print(f"Error: The file {input_filename} was not found.")
except json.JSONDecodeError:
    print(f"Error: Could not decode JSON from {input_filename}. The file might be corrupted or not valid JSON.")
except IOError as e:
    print(f"Error reading file {input_filename}: {e}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

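# Show the first judgement only if a non-empty list was actually loaded
if isinstance(loaded_judgements_from_json, list) and loaded_judgements_from_json:
    pprint(loaded_judgements_from_json[0])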

Successfully loaded judgements (JSON) from judgements_gpt-4.1-2025-04-14.json
{'analysis': 6,
 'analysis_explanation': 'Η ανάλυση ακολουθεί σωστή νομική πορεία ως προς την '
                         'κύρια βάση της απαίτησης περί απόδοσης του '
                         'πράγματος, και αναγνωρίζει την αναγκαιότητα ύπαρξης '
                         'σύμβασης και τις ασυμφωνίες επί της ενεργητικής '
                         'νομιμοποίησης. Ωστόσο, δεν διακρίνει καθαρά την '
                         'επικουρικότητα της αγωγής από αδικαιολόγητο '
                         'πλουτισμό (ότι δεν μπορεί να στοιχειοθετηθεί αν '
                         'υπάρχει ισχυρή ενοχή και αγωγή ex contractu), κάτι '
                         'που είναι βασικό κρίσιμο σημείο του reference. Ως '
                         'προς την αποζημίωση, γίνεται αναφορά στην ανάγκη '
                         'αιτιώδους συνάφειας και απόδειξης ζημίας, αλλά '
                         'λείπει η κριτική στην επάρκεια των αγωγι
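
The save and load cells above rely on `json.dump(..., ensure_ascii=False)` with UTF-8 files so that the Greek explanations stay readable on disk. The following is a minimal round-trip sketch under that assumption; the toy judgement and the temporary file name are hypothetical.

```python
import json
import os
import tempfile

# Hypothetical judgement with Greek text and a missing average.
toy_judgements = [{"number": 1, "analysis_explanation": "Η ανάλυση είναι σωστή.", "avg": None}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_judgements.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(toy_judgements, f, indent=4, ensure_ascii=False)
    with open(path, "r", encoding="utf-8") as f:
        raw_text = f.read()

restored = json.loads(raw_text)

print("ανάλυση" in raw_text)       # Greek characters are stored unescaped
print(restored == toy_judgements)  # None is written as null and read back as None
```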

In [24]:
print_results_table(loaded_judgements_from_json)

Model                     | Judge Model               | Facts        | Cited Articles | Analysis       | Avg               
--------------------------|---------------------------|--------------|----------------|----------------|-------------------
gemini-2.0-flash-001      | gpt-4.1-2025-04-14        | 8.30         | 7.37           | 7.33           | 7.67              


# Meta-evaluate judge-model on **GBB-JME**

## Generate judgements with an LLM-judge on GBB-JME

### subset

In [25]:
subset_jme = dataset_jme.filter(
    lambda example: example['date'] == 'A_2023' and example['area'] == 'criminal'
)
print(len(subset_jme))
pprint(subset_jme[0])

25
{'analysis': 9.0,
 'area': 'criminal',
 'articles': 9.0,
 'avg': 9.0,
 'date': 'A_2023',
 'facts': 9.0,
 'index': 160,
 'model': 'gemini-2.0-flash-001',
 'number': 1,
 'reasoning': None,
 'response': 'Η ποινική αξιολόγηση της συμπεριφοράς των Α και Β έχει ως εξής: '
             'Ο Β, εισερχόμενος στην οικία της Π με σκοπό την αφαίρεση '
             "χρημάτων και τιμαλφών (προτάσεις 1, 3), διαπράττει κλοπή κατ' "
             "άρθρο 372 ΠΚ. Η πράξη του μετατρέπεται σε ληστεία κατ' άρθρο 380 "
             'παρ. 1 ΠΚ, καθώς ασκεί σωματική βία εναντίον της Π για να '
             'αφαιρέσει τα χρήματα και τα τιμαλφή (προτάσεις 5, 7). Επιπλέον, '
             'η πράξη τελέστηκε με καλυμμένα ή αλλοιωμένα χαρακτηριστικά, '
             'γεγονός που συνιστά επιβαρυντική περίσταση. Ο Α, ενεργώντας με '
             'δόλο, προκαλεί τον Β να διαπράξει τη ληστεία (προτάσεις 1, 2), '
             'συνεπώς είναι ηθικός αυτουργός κατά το άρθρο 46 παρ. 1α ΠΚ, και '
             'τιμωρείται με τη

### GBB-JME judgements

In [84]:
# from tqdm import tqdm
# import json

# jme_judgements = []

# for jme_data in tqdm(subset_jme, desc=f"Judging GBB-JME answers using the {JUDGE} judge"):
#     index = jme_data.get('index', 'N/A')
#     model = jme_data.get('model', 'N/A')
#     answer = jme_data.get('response', 'N/A')
#     data = dataset[index]
#     judgement = evaluate_judge_instance(data, answer, model, JUDGE, JUDGE_TYPE, span_judge_system_prompt)
#     jme_judgements.append(judgement)

# output_filename = f"jme_judgements_{JUDGE}.json"

# try:
#     with open(output_filename, 'w', encoding='utf-8') as f:
#         json.dump(jme_judgements, f, indent=4, ensure_ascii=False)
#     print(f"\nSuccessfully saved jme_judgements (JSON) to {output_filename}")
# except IOError as e:
#     print(f"Error saving JSON file {output_filename}: {e}")
# except Exception as e:
#     print(f"An error occurred during JSON serialization: {e}")

In [28]:
import json


input_filename = f"jme_judgements_{JUDGE}.json"

jme_judgements = []

try:
    with open(input_filename, 'r', encoding='utf-8') as f:
        jme_judgements = json.load(f)

    if isinstance(jme_judgements, list) and all(isinstance(item, dict) for item in jme_judgements):
        print(f"Successfully loaded judgements (JSON) from {input_filename}")
    else:
        print(f"Warning: Data loaded from {input_filename} is not a list of dicts as expected.")

except FileNotFoundError:
    print(f"Error: The file {input_filename} was not found.")
except json.JSONDecodeError:
    print(f"Error: Could not decode JSON from {input_filename}. The file might be corrupted or not valid JSON.")
except IOError as e:
    print(f"Error reading file {input_filename}: {e}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

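# Show the first judgement only if a non-empty list was actually loaded
if isinstance(jme_judgements, list) and jme_judgements:
    pprint(jme_judgements[0])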

Successfully loaded judgements (JSON) from jme_judgements_gpt-4.1-2025-04-14.json
{'analysis': 7,
 'analysis_explanation': 'Η απάντηση τελειώνει σωστά με το συμπέρασμα ότι ο Β '
                         'είναι αυτουργός ληστείας και ο Α ηθικός αυτουργός '
                         'στη ληστεία, συνδέοντας επαρκώς τα πραγματικά '
                         'περιστατικά με τη νομική κατάταξη και τους σχετικούς '
                         'ρόλους (αυτουργός-ηθικός αυτουργός). Η ανάλυση είναι '
                         'ωστόσο λιγότερο λεπτομερής από το ιδανικό ως προς τη '
                         'δομή και δεν επεξηγεί πλήρως, αφενός, το ζήτημα της '
                         'βαρύτερης πράξης (αν ο Α θα ήταν ηθικός αυτουργός '
                         'μόνο για κλοπή ή και για ληστεία – δεν εξηγεί το '
                         'σκεπτικό του ενδεχόμενου δόλου), αφετέρου, δεν '
                         'αναφέρει το κλασικό όριο της "ειδικότητα προτροπής" '
                         '(ότι αρκεί 

## Meta-evaluation of the LLM-judge on GBB-JME

#### SPA

In [29]:
# Copyright 2024 Brian Thompson. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


import numpy as np


def compute_pairwise_p_values(seg_scores, num_permutations=1000, seed: int = 4):
    """
    Author: Brian Thompson
    Date: June 2024

    Suppose we have a test set consisting of L=5 segments, and two systems, systemA and systemB,
    for which we have segment-level scores scoresA and scoresB:
       scoresA = [0.8, 0.9, 0.7, 1.0, 0.6]
       scoresB = [0.2, 0.3, 0.1, 0.4, 0.0]

    Typically we would average segment-level scores to get system level scores, but for convenience later on
    we will define system scores to be the sum of segment-level scores. This gives us a delta system-level score of:
        test_delta = sum(scoresA) - sum(scoresB) = 4.0 - 1.0 = 3.0

    To run a paired permutation test, we first generate a new set of scores scores0,
    where each score0[i] is randomly selected from either scoresA[i] or scoresB[i].
    Let's define a random boolean mask:
       m = [1, 0, 0, 1, 1]

    and use it to select scores0:
       scores0 = m.*scoresA + (1-m).*scoresB = [0.8, 0.3, 0.1, 1.0, 0.6]   # selected from [A, B, B, A, A], respectively

    Likewise, we compose scores1 using all the scores which were not selected for scores0:
       scores1 = (1-m).*scoresA + m.*scoresB = [0.2, 0.9, 0.7, 0.4, 0.0]   # selected from [B, A, A, B, B], respectively

    To get the delta system-level score for our two mock systems, we need to compute:
       null_delta = sum(scores0) - sum(scores1)
                  = sum(m.*scoresA + (1-m).*scoresB) - sum((1-m).*scoresA + m.*scoresB)
                  = sum((2m-1).*scoresA) - sum((2m-1).*scoresB)
                  = (2m-1) * scoresA.T - (2m-1) * scoresB.T
                  = [ 1, -1, -1,  1,  1] * [[0.8],  -  [ 1, -1, -1,  1,  1] * [[0.2],  =  0.8 - 0.2  =  0.6
                                            [0.9],                             [0.3],
                                            [0.7],                             [0.1],
                                            [1.0],                             [0.4],
                                            [0.6]]                             [0.0]]

    To compute many different permutations, we replace the vector m with a matrix of size (num_permutations, L):
       null_delta = [[ 1,  1, -1, -1, -1], * [[0.8],  -  [[ 1,  1, -1, -1, -1], * [[0.2],  = [[-0.6],  - [[ 0.0],   =  [[-0.6]
                     [ 1, -1,  1, -1,  1],    [0.9],      [ 1, -1,  1, -1,  1],    [0.3],     [ 0.2],     [-0.4],       [ 0.6],
                     [ 1, -1,  1,  1, -1],    [0.7],      [ 1, -1,  1,  1, -1],    [0.1],     [ 1.0],     [ 0.4],       [ 0.6],
                     [-1,  1, -1, -1,  1],    [1.0],      [-1,  1, -1, -1,  1],    [0.4],     [-1.0],     [-0.4],       [-0.6],
                     [ 1,  1,  1, -1,  1],    [0.6]]      [ 1,  1,  1, -1,  1],    [0.0]]     [ 2.0],     [ 0.2],       [ 1.8],
                     [-1,  1, -1,  1, -1],                [-1,  1, -1,  1, -1],               [-0.2],     [ 0.4],       [-0.6],
                     [ 1,  1,  1,  1,  1],                [ 1,  1,  1,  1,  1],               [ 4.0],     [ 1.0],       [ 3.0],
                     [ 1, -1,  1, -1,  1],                [ 1, -1,  1, -1,  1],               [ 0.2],     [-0.4],       [ 0.6],
                     [ 1,  1, -1, -1,  1],                [ 1,  1, -1, -1,  1],               [ 0.6],     [ 0.0],       [ 0.6],
                     [-1,  1, -1, -1, -1]]                [-1,  1, -1, -1, -1]]               [-2.2]]     [-0.4]]       [-1.8]]

    To test the significance that system A is better than system B, we compute:
       null_delta >= test_delta  =  [[-0.6]  >= 3   =   [[False],
                                     [ 0.6],             [False],
                                     [ 0.6],             [False],
                                     [-0.6],             [False],
                                     [ 1.8],             [False],
                                     [-0.6],             [False],
                                     [ 3.0],             [True ],
                                     [ 0.6],             [False],
                                     [ 0.6],             [False],
                                     [-1.8]]             [False]]

    The p value is the fraction of the time that null_delta >= test_delta, in this case 1/10 = 0.1

    The above discussion was for a single system pair, but we actually need to compute p values for each pair
    within a set of systems systemA, systemB, ... systemN. In practice, the computation bottleneck is generating
    the random boolean vector m, so we generate m once and use it for all pairs of systems.

    Reusing m also allows us to avoid most of the N^2 computations by pre-computing (2m-1) * scoresA.T,
    (2m-1) * scoresB.T, ..., (2m-1) * scoresN.T.

    Test speed:
    python -m timeit -s "import numpy as np; from pairwise_paired_permutation_test import compute_pairwise_p_values; x=np.random.random(size=(14,1300))" "compute_pairwise_p_values(x, num_permutations=1000)"

    :param seg_scores: segment-level scores, with shape (num_systems, num_segments)
    :param num_permutations: Number of permutations for permutation test
    :param seed: The random seed
    :return: np.array of size (num_systems, num_systems), where the upper triangle has been populated
       with p-values for the hypothesis that system[i] > system[j]
    """
    num_systems, num_segments = seg_scores.shape

    rng = np.random.default_rng(seed)
    # initialize in range [0, 1)
    two_m_minus_one = rng.random(size=(num_permutations, num_segments), dtype=np.float32)
    # quantize to 0 or 1, in place
    np.rint(two_m_minus_one, out=two_m_minus_one, casting='same_kind')
    # scale and shift to get -1.0 and +1.0, in place
    two_m_minus_one *= 2.0
    two_m_minus_one -= 1.0

    seg_scores = seg_scores.astype(np.float32)  # shape: (num_systems, num_segments)
    sys_scores = np.sum(seg_scores, axis=1)  # shape: (num_systems, )

    partial = np.matmul(two_m_minus_one, seg_scores.T)  # shape: (num_permutations, num_systems)

    # initialize p value matrix to NaN
    p_vals = np.empty((num_systems, num_systems,)) * np.nan
    # populate upper triangle
    for ii in range(num_systems):
        for jj in range(ii + 1, num_systems):
            null_delta = partial[:, ii] - partial[:, jj]  # shape: (num_permutations, )
            test_delta = sys_scores[ii] - sys_scores[jj]  # float
            p_vals[ii, jj] = np.sum(null_delta >= test_delta) / num_permutations

    return p_vals


def compute_one_minus_pce(human_pairwise_p_vals, metric_pairwise_p_vals):
    """
    Author: Brian Thompson
    Date: June 2024

    Pairwise Confidence Error (PCE) is the absolute difference between
      the p value for the conclusion that one system is better than another given human judgements and
      the p value for the conclusion for the same system comparison given metric judgements,
      averaged over all system pairings for a set of systems.

    We return 1-PCE to be comparable with pairwise accuracy [i.e. range from 0 to 1, higher is better]

    :param human_pairwise_p_vals: np.array of shape (num_systems, num_systems),
        where the upper triangle has been populated with p-values for system[i] > system[j]
        computed from human judgements
    :param metric_pairwise_p_vals: np.array of shape (num_systems, num_systems),
        where the upper triangle has been populated with p-values for system[i] > system[j]
        computed from metric scores
    :return: 1-PCE
    """
    num_systems = human_pairwise_p_vals.shape[0]
    upper_tri_idxs = np.triu_indices(num_systems, 1)
    return 1.0 - np.mean(np.abs(human_pairwise_p_vals - metric_pairwise_p_vals)[upper_tri_idxs])
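
The worked example in the docstring of `compute_pairwise_p_values` can be checked numerically. The sketch below is an illustration under stated assumptions rather than part of the original code: it reuses the docstring's `scoresA` and `scoresB`, enumerates all 2^5 sign vectors instead of sampling them, and applies a small tolerance to the `>=` comparison to guard against floating-point rounding. The toy p-value matrices at the end are hypothetical.

```python
from itertools import product

import numpy as np

scores_a = np.array([0.8, 0.9, 0.7, 1.0, 0.6])
scores_b = np.array([0.2, 0.3, 0.1, 0.4, 0.0])
diff = scores_a - scores_b
test_delta = diff.sum()  # 3.0 up to floating-point rounding

# The docstring's mask m = [1, 0, 0, 1, 1] corresponds to the signs 2m - 1.
signs = 2 * np.array([1, 0, 0, 1, 1]) - 1
null_delta_single = signs @ diff  # 0.6 up to rounding

# Enumerate every sign vector (32 in total) rather than sampling permutations.
all_signs = np.array(list(product([-1, 1], repeat=len(diff))))
null_deltas = all_signs @ diff
# The tolerance is an assumption of this sketch, not part of the original function.
exact_p = float(np.mean(null_deltas >= test_delta - 1e-9))

# 1 - PCE for a hypothetical pair of systems: only the upper triangle is used.
human_p = np.array([[np.nan, 0.10], [np.nan, np.nan]])
metric_p = np.array([[np.nan, 0.30], [np.nan, np.nan]])
upper = np.triu_indices(2, 1)
one_minus_pce = 1.0 - np.mean(np.abs(human_p - metric_p)[upper])

print(round(float(test_delta), 6), round(float(null_delta_single), 6), exact_p)
print(round(float(one_minus_pce), 6))
```

In this example only the all-positive sign vector reaches `test_delta`, so the exact p-value is 1/32. The docstring's value of 0.1 comes from its illustrative sample of ten permutations.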

In [30]:
# coding=utf-8
# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""PairwiseConfidenceError or Soft Pairwise Accuracy (SPA)
"""


from typing import Callable

import numpy as np
import numpy.typing



ArrayLike = numpy.typing.ArrayLike


def PairwiseConfidenceError(
    gold_scores: list[ArrayLike],
    metric_scores: list[ArrayLike],
    num_sys: int,
    num_permutations: int = 1000,
) -> tuple[float, ...]:
  """Calculates 1 - PCE (pairwise confidence error), i.e. Soft Pairwise Accuracy (SPA); higher is better."""

  # Convert the gold and metric scores into N x M matrices where N is the
  # number of systems and M is the number of segments.
  gold = _Reshape(gold_scores, num_sys, 'sys')
  metric = _Reshape(metric_scores, num_sys, 'sys')

  gold_pvalues = compute_pairwise_p_values(
      gold, num_permutations=num_permutations
  )
  metric_pvalues = compute_pairwise_p_values(
      metric, num_permutations=num_permutations
  )
  return (compute_one_minus_pce(gold_pvalues, metric_pvalues),)


def _Reshape(vector, num_sys, average_by):
  """Reshape a packed vector into a matrix for row averaging."""
  if average_by == 'none':
    return np.asarray(vector).reshape(1, -1)
  elif average_by == 'sys':
    return np.asarray(vector).reshape(num_sys, -1)
  elif average_by == 'item':
    return np.asarray(vector).reshape(num_sys, -1).transpose()
  else:
    raise ValueError(f'Unknown averaging option: {average_by}')

In [31]:
import pandas as pd


def gather_scores_for_pce(systems, judgements):
    """
    Gathers scores from a list of judgements for given systems and structures them
    for PCE calculation. Returns four lists:
    1. total_flat: [s1q1f, s1q1r, s1q1a, s2q1f, ...] (length N_s * N_q * 3)
    2. facts_pce:  [s1q1f, s2q1f, ..., snq1f, s1q2f, ...] (length N_s * N_q)
    3. articles_pce:  [s1q1r, s2q1r, ..., snq1r, s1q2r, ...] (length N_s * N_q)
    4. analysis_pce: [s1q1a, s2q1a, ..., snq1a, s1q2a, ...] (length N_s * N_q)

    Where sX is system X, qY is question Y, f/r/a are facts/articles/analysis scores,
    N_s is number of systems, N_q is number of questions.
    Scores are ordered by question (ascending 'number'), then by system (as provided in 'systems' list)
    within the metric-specific lists (facts_pce, articles_pce, analysis_pce).
    Scores for total_flat are ordered by question, then by system, then by metric (facts, articles, analysis).
    Assumes each judgement dict contains 'number', 'model', 'facts', 'articles', 'analysis'.
    Handles missing judgements for a system/question by using None.
    """
    total_flat = []
    facts_pce = []
    articles_pce = []
    analysis_pce = []

    if not judgements: # Check if the input list is empty
        print("Warning: Input judgements list is empty.")
        return [], [], [], []

    # Structure data by question number and then by system name
    # { q_id: { sys_name: { 'facts': score, 'articles': score, 'analysis': score } } }
    data_by_question = {}
    all_q_ids_set = set() # Use a set to collect unique question IDs efficiently

    for judgement in judgements:
        q_id = judgement.get('number')
        sys = judgement.get('model')
        f_score = judgement.get('facts')
        r_score = judgement.get('articles')
        a_score = judgement.get('analysis')

        # Ensure we have necessary keys
        if q_id is None or sys is None:
             # print(f"Warning: Skipping judgement with missing 'number' or 'model': {judgement}")
             continue # Skip entries without essential info

        all_q_ids_set.add(q_id)

        if q_id not in data_by_question:
            data_by_question[q_id] = {}
        if sys not in data_by_question[q_id]:
             data_by_question[q_id][sys] = {}

        # Store scores, converting to float and handling None or potential non-numeric types
        try:
            data_by_question[q_id][sys]['facts'] = float(f_score) if f_score is not None else None
        except (ValueError, TypeError):
            print(f"Warning: Could not convert facts score '{f_score}' to float for number {q_id}, model {sys}. Using None.")
            data_by_question[q_id][sys]['facts'] = None

        try:
            data_by_question[q_id][sys]['articles'] = float(r_score) if r_score is not None else None
        except (ValueError, TypeError):
             print(f"Warning: Could not convert articles score '{r_score}' to float for number {q_id}, model {sys}. Using None.")
             data_by_question[q_id][sys]['articles'] = None

        try:
            data_by_question[q_id][sys]['analysis'] = float(a_score) if a_score is not None else None
        except (ValueError, TypeError):
             print(f"Warning: Could not convert analysis score '{a_score}' to float for number {q_id}, model {sys}. Using None.")
             data_by_question[q_id][sys]['analysis'] = None


    # Get sorted unique question IDs for consistent ordering
    all_q_ids = sorted(all_q_ids_set)

    # Now flatten the scores into the required list structures
    # Ensure ordering by q_id then sys according to the 'systems' list
    for q_id in all_q_ids:
        for sys in systems: # Iterate through the *expected* systems list
            # Retrieve scores for this q_id and sys. Use .get() with default
            # to handle cases where a system from the input 'systems' list
            # doesn't appear in the judgements for this specific question.
            scores = data_by_question.get(q_id, {}).get(sys, {'facts': None, 'articles': None, 'analysis': None})

            f_score = scores['facts']
            r_score = scores['articles']
            a_score = scores['analysis']

            # total_flat structure: [s1q1f, s1q1r, s1q1a, s2q1f, s2q1r, s2q1a, ...]
            # Append scores for the current system/question combination
            total_flat.extend([f_score, r_score, a_score])

            # metric_pce structure: [s1q1, s2q1, ..., snq1, s1q2, s2q2, ...]
            # Append the specific score for the current metric (ordered by Q then S)
            facts_pce.append(f_score)
            articles_pce.append(r_score)
            analysis_pce.append(a_score)

    return total_flat, facts_pce, articles_pce, analysis_pce
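
The docstring above describes the metric-specific lists as question-major (`[s1q1, s2q1, ..., snq1, s1q2, ...]`), while `compute_pairwise_p_values` documents its input as a matrix of shape `(num_systems, num_segments)`. The sketch below uses hypothetical toy values to show how a question-major list maps onto such a matrix, and how this differs from cutting the list into consecutive chunks with `reshape(num_sys, -1)`. It may be worth verifying that the ordering of the flat lists matches the reshape applied downstream.

```python
import numpy as np

num_sys, num_q = 3, 2
# Question-major flat list: [s1q1, s2q1, s3q1, s1q2, s2q2, s3q2].
# Toy encoding: the tens digit is the system, the units digit is the question.
question_major = [11, 21, 31, 12, 22, 32]

# Reshaping to (num_q, num_sys) gives one row per question;
# transposing then gives one row per system.
by_system = np.asarray(question_major).reshape(num_q, num_sys).T
print(by_system.tolist())

# reshape(num_sys, -1) cuts the flat list into consecutive chunks instead,
# which yields one row per system only when the list is system-major.
consecutive_chunks = np.asarray(question_major).reshape(num_sys, -1)
print(consecutive_chunks.tolist())
```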


#### Meta-evaluation results

In [32]:
N = 1000
systems = ["gemini-2.0-flash-001", "o1-2024-12-17", "us.anthropic.claude-3-7-sonnet-20250219-v1:0", "gpt-4o-2024-11-20", "us.deepseek.r1-v1:0"]

judge_total_scores, judge_facts_scores, judge_articles_scores, judge_analysis_scores = gather_scores_for_pce(systems, jme_judgements)

annotators_total_scores, annotators_facts_scores, annotators_articles_scores, annotators_analysis_scores = gather_scores_for_pce(systems, subset_jme)

num_systems = len(systems) # The number of systems being compared pairwise

# Calculate SPA for each scoring dimension by comparing judge's scores to annotators' scores
total_spa = PairwiseConfidenceError(annotators_total_scores, judge_total_scores, num_systems, N)[0]
facts_spa = PairwiseConfidenceError(annotators_facts_scores, judge_facts_scores, num_systems, N)[0]
articles_spa = PairwiseConfidenceError(annotators_articles_scores, judge_articles_scores, num_systems, N)[0]
analysis_spa = PairwiseConfidenceError(annotators_analysis_scores, judge_analysis_scores, num_systems, N)[0]

horizontal_spa_results_dict = {
    "Judge": [JUDGE],
    "Facts": [facts_spa],
    "Articles": [articles_spa],
    "Analysis": [analysis_spa],
    "Total": [total_spa],

}

df_spa_results_horizontal = pd.DataFrame(
    horizontal_spa_results_dict
)

df_spa_results_horizontal

|   | Judge              | Facts  | Articles | Analysis | Total  |
|---|--------------------|--------|----------|----------|--------|
| 0 | gpt-4.1-2025-04-14 | 0.6266 | 0.6745   | 0.8344   | 0.7064 |
