# Acting Like Humans? Evaluating Large Language Models as Proxies in Linguistic Experiments


This notebook accompanies the paper "Acting Like Humans? Evaluating Large Language Models as Proxies in Linguistic Experiments", which uses LLMs to replicate linguistic experimental pipelines originally run with human participants.


It is intended to be used for further research.



# Code structure: #

In the **first block** of the code, the required libraries (such as OpenAI) are installed and imported. They provide ready-to-use functions, so we don't have to implement them ourselves.

The second cell also serves as a test to ensure that everything has been installed and imported correctly.

In [None]:
%%capture
!pip install openai==1.55.3 httpx==0.27.2 --force-reinstall --quiet

In [None]:
import os

os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"  # placeholder: replace with your own key
from openai import OpenAI

client = OpenAI()

# prompt example, a test:
chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user", "content": "Say this is a test, it works",
        }
    ],
    model="o4-mini",
)
print(chat_completion.choices[0].message.content)

In the **second block** of the code, we define the prompt-engineering functions used in our analysis: zero-shot and few-shot prompting. You are welcome to try other prompting techniques that we have explored in the seminar or that you have found online; this can help in designing new and improved experiments.
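
To make the few-shot format concrete, the sketch below builds the same kind of message list that `few_shot_prompting` assembles, but without calling the API. The helper name `build_few_shot_messages` and the example sentences are illustrative assumptions, not part of the notebook.

```python
def build_few_shot_messages(examples, prompt, system="You are an assistant"):
    """Hypothetical helper: return the message list for a few-shot request."""
    messages = [{"role": "system", "content": system}]
    # Each example becomes a user turn followed by the expected assistant turn
    for example in examples:
        messages.append({"role": "user", "content": example["input"]})
        messages.append({"role": "assistant", "content": example["output"]})
    # The actual prompt always comes last
    messages.append({"role": "user", "content": prompt})
    return messages


# Made-up demonstration pairs (not taken from the study materials)
examples = [
    {"input": "Judge: 'Le chat dort.'", "output": "a"},
    {"input": "Judge: 'Chat le dort.'", "output": "d"},
]
messages = build_few_shot_messages(examples, "Judge: 'Il pleut.'")
for message in messages:
    print(message["role"], "->", message["content"])
```

Inspecting the list this way before sending it helps to check that the examples and the final prompt appear in the intended order.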

In [None]:
import openai
from openai import OpenAI

# Function to perform zero-shot prompting
def zero_shot_prompting(task, prompt):
    """
    Performs zero-shot prompting by sending a prompt to the language model
    without providing any prior examples.

    Args:
    task (str): The task to be performed (kept as a label; it is not sent to the model).
    prompt (str): The text input sent to the language model.

    Returns:
    str: The response from the language model.
    """
    # Create the chat message and send it to the language model
    response = client.chat.completions.create(
        model="o4-mini",
        messages=[
            {"role": "system", "content": "You are an assistant"},  # CHANGE HERE IF DESIRED
            {"role": "user", "content": prompt}  # User message with the actual prompt
        ]
    )
    # Return the content of the model’s first response message
    return response.choices[0].message.content


# Function to perform few-shot prompting
def few_shot_prompting(task, examples, prompt):
    """
    Performs few-shot prompting by providing some examples
    before sending the actual prompt to the language model.

    Args:
    task (str): The task to be performed (kept as a label; it is not sent to the model).
    examples (list of dict): A list of examples, each example being a dictionary with 'input' and 'output'.
    prompt (str): The text input sent to the language model.

    Returns:
    str: The response from the language model.
    """
    # Initialize the messages list with a system message
    messages = [{"role": "system", "content": "You are an assistant"}]  # CHANGE HERE IF DESIRED

    # Add the examples to the messages
    for example in examples:
        messages.append({"role": "user", "content": example['input']})
        messages.append({"role": "assistant", "content": example['output']})

    # Add the actual prompt to the messages list
    messages.append({"role": "user", "content": prompt})

    # Create the chat message and send it to the language model
    response = client.chat.completions.create(
        model="o4-mini",
        messages=messages
    )
    # Return the content of the model’s first response message
    return response.choices[0].message.content


# Replication pipeline step 1
In the **third block**, the code reads in your uploaded data so that the items can be presented to the LLM during prompting.

Data handling code is provided for both replications. Depending on which dataset is read in this cell, keep the corresponding passage and comment the other one out.
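
As a rough illustration of the data structure produced in this step, the sketch below builds an index-to-row dictionary and an item list from a tiny in-memory CSV. It uses only the standard library and made-up rows; the notebook itself uses pandas (`df.to_dict(orient="index")`), which yields a dictionary of the same shape.

```python
import csv
import io

# Made-up two-row stand-in for the stimulus file (not the real data)
sample_csv = io.StringIO(
    "French Sentences,status,gold_ratings\n"
    "Phrase un,critical,3.5\n"
    "Phrase deux,baseline,4.8\n"
)

reader = csv.DictReader(sample_csv)
# Map each row index to a dictionary of column name -> value
stimuli = {i: row for i, row in enumerate(reader)}
# Collect only the sentences that will be shown to the model
items = [row["French Sentences"] for row in stimuli.values()]

print(stimuli[0])
print(items)
```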

In [None]:
import pandas as pd
import time
import json
import gc
from psutil import virtual_memory
from datetime import datetime
import random

# Load data
daten = 'Barbosa&Cat_2019.csv'  # Adjust the filename as needed
df = pd.read_csv(daten)

# Extract specific columns and pack them into a dictionary
columns_to_extract = ['French Sentences', 'syntactic position', 'trace position', 'status', 'gold_ratings']
selected_data = df[columns_to_extract].astype(str)
stimuli_dict = selected_data.to_dict(orient="index")

# Create list of items
all_items = [entry['French Sentences'] for entry in stimuli_dict.values()]
random.shuffle(all_items)
print("The materials for the study are:")
print(all_items)

# Replication pipeline step 2
In the **fourth block** we apply the prompt-engineering functions from block 2 and formulate our prompts. Here we can test different prompting strategies with one LLM-participant on a limited subset of the dataset.
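
Before spending API calls, it can help to check the prompt template with a stand-in model. The sketch below is a dry run under that assumption: `fake_model` and `run_subset` are hypothetical names, and the fixed answer "a" is a placeholder, not a model output.

```python
def fake_model(task, prompt):
    """Stand-in for zero_shot_prompting: returns a fixed letter, no API call."""
    return "a"


def run_subset(items, template, ask=fake_model, limit=3):
    """Fill the template for the first `limit` items and collect the answers."""
    answers = {}
    for i, text in enumerate(items[:limit]):
        prompt = template.format(text=text)
        answers[i] = {"Prompt": prompt, "Response": ask("Task description", prompt)}
    return answers


demo = run_subset(["s1", "s2", "s3", "s4"], "Rate this sentence: '{text}'")
for index, entry in demo.items():
    print(index, entry["Prompt"], "->", entry["Response"])
```

Passing the real prompting function as `ask` would run the same loop against the API.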

In [None]:
# To test one "LLM-participant" with a subset of the items

# Preparing to store the model's responses
answers_zero_shot = {}

# === Zero-Shot Prompting ===
print("=== Zero-Shot ===")

# Iterating through the subset for Zero-Shot
for i, text in enumerate(all_items[:3]):
    zero_shot_prompt = f"Insert your instructions here: '{text}'"

    # Performing Zero-Shot Prompting
    zero_shot = zero_shot_prompting("Task description", zero_shot_prompt)

    # Storing the response in the dictionary with the index as the key
    answers_zero_shot[i] = {
        "Prompt": zero_shot_prompt,
        "Response": zero_shot
    }

    # Printing the input and the corresponding output
    print(f"Input: {text}")
    print(f"Output: {zero_shot}")
    print()

In the **fifth block** we save the results in an Excel file and a CSV file. These files store the answers of ONE LLM-participant.

In [None]:
import pandas as pd
import openpyxl
from openpyxl.utils import get_column_letter
from datetime import datetime
import csv


def convert_csv_to_excel(csv_path, excel_path):
    """
    Converts a CSV file into an Excel file (.xlsx).
    """
    df = pd.read_csv(csv_path)
    df.to_excel(excel_path, index=False)

def load_or_create_excel(file_path):
    """
    Loads an existing Excel file or creates a new one if the file does not exist.
    """
    try:
        workbook = openpyxl.load_workbook(file_path)
        sheet = workbook.active
    except FileNotFoundError:
        workbook = openpyxl.Workbook()
        sheet = workbook.active
        print(f"File '{file_path}' not found. This is an error; check the name of the file that is loaded in block 3.")
    return workbook, sheet

def add_columns_to_excel(sheet, new_columns):
    """
    Adds new column headers to the Excel sheet.
    """
    existing_columns = sheet.max_column
    for idx, col_name in enumerate(new_columns, start=existing_columns + 1):
        sheet[f"{get_column_letter(idx)}1"] = col_name

def add_data_to_excel(sheet, data_dict, start_row):
    """
    Adds the contents of a dictionary to the Excel sheet, starting at start_row.
    """
    for idx, (key, entry) in enumerate(data_dict.items(), start=start_row):
        sheet[f"A{idx}"] = key + 1
        sheet[f"B{idx}"] = entry['Response']
        sheet[f"C{idx}"] = entry['Prompt']
        sheet[f"D{idx}"] = entry['Response']  # original response
        sheet[f"E{idx}"] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    print("Data added.")

def save_excel(workbook, file_path):
    """
    Saves the workbook to the given file.
    """
    workbook.save(file_path)
    print(f"File successfully saved to: {file_path}")

def extend_csv(csv_input_path, csv_output_path, data_dict):
    """
    Extends a CSV file directly with new columns and saves it as a new CSV file.
    """
    with open(csv_input_path, mode='r', newline='') as infile, open(csv_output_path, mode='w', newline='') as outfile:
        reader = csv.DictReader(infile)
        fieldnames = reader.fieldnames + ["Prompt", "Antwort_vom_Modell", "Datum"]
        writer = csv.DictWriter(outfile, fieldnames=fieldnames)

        writer.writeheader()

        # We only go as far as data_dict has entries.
        # Note: all_items is shuffled in block 3, so entry idx follows the
        # shuffled order, not the row order of the CSV file.
        for idx, row in enumerate(reader):
            if idx < len(data_dict):  # Only process rows that have a matching entry in the dictionary
                row["Prompt"] = data_dict[idx]["Prompt"]
                row["Antwort_vom_Modell"] = data_dict[idx]["Response"]
                row["Datum"] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
                writer.writerow(row)

    print(f"Extended CSV file saved to: {csv_output_path}")


# --- Parameters ---
csv_file_path = daten
excel_file_path = "ergebnisse_ein_prob.xlsx"
csv_output_path = "ergebnisse_ein_prob.csv"  # this file can also be opened here in Google Colab with a double click

# 1. Convert CSV -> Excel and extend the Excel file
convert_csv_to_excel(csv_file_path, excel_file_path)
workbook, sheet = load_or_create_excel(excel_file_path)
add_columns_to_excel(sheet, ["Prompt", "Antwort_vom_Modell", "Datum"])
add_data_to_excel(sheet, answers_zero_shot, start_row=sheet.max_row + 1)  # CHANGE HERE: answers_zero_shot --> answers_few_shot, etc.
save_excel(workbook, excel_file_path)

# 2. Alternatively: extend the CSV directly
extend_csv(csv_file_path, csv_output_path, answers_zero_shot)  # CHANGE HERE: answers_zero_shot --> answers_few_shot, etc.



# Replication pipeline step 3
In the **sixth block** we repeat the prompting strategy that worked best, once for each subject we have (or want to have), using all items in the corpus.
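
Because this step issues many requests in a row, a single failed request would abort the whole loop. One possible safeguard, sketched here as an assumption rather than as part of the original pipeline, is a small retry wrapper with exponential backoff. The names `call_with_retry` and `flaky` are hypothetical, and `flaky` only simulates a function that fails twice before succeeding.

```python
import time


def call_with_retry(fn, *args, retries=3, base_delay=1.0, sleep=time.sleep):
    """Hypothetical wrapper: call fn(*args), retrying on any exception.

    The waiting time doubles after each failed attempt (exponential backoff).
    The last failure is re-raised so that errors are not silently swallowed.
    """
    for attempt in range(retries):
        try:
            return fn(*args)
        except Exception:
            if attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Simulated unreliable function: fails on the first two calls
calls = {"n": 0}

def flaky(prompt):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "b"


# The sleep function is replaced so that the demonstration does not wait
result = call_with_retry(flaky, "some prompt", sleep=lambda seconds: None)
print(result, calls["n"])
```

In the loop, a call such as `call_with_retry(zero_shot_prompting, task, prompt)` could replace the direct call. Note that any waiting time would then be included in the measured time per item.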

In [None]:
antworten_null_shot_all = {}
probanden_zahl= 60
reaction_time_per_participant=list()

import time  # Import the time module

for iteration in range(1, probanden_zahl + 1):  # Repeat the process
    print(f"### LLM-informant {iteration} ###")

    antworten_null_shot = {}  # Dictionary for null-shot responses in this iteration

    # === Null-Shot ===
    # Record the start time
    start_time_pro_part = time.time()

    for i, text in enumerate(all_items):
        start_time = time.time()
        zero_shot_prompt = f"Vous participez à une étude. Vous êtes de langue maternelle française. Tache: Vous allez maintenant voir deux phrases. La phrase a. constitue le contexte, la phrase b. est la phrase que vous devez juger, pas la phrase a. Vous devez exprimer votre jugement seulement sur la phrase b. en utilisant le schéma suivant: a. Je pourrais dire cela. b. Je pourrais dire cela, mais dans un autre contexte. c. Je ne pourrais pas dire cela, mais je connais des gens qui pourraient le faire. d. Personne ne dirait cela. e. Je ne sais pas. Vous donnez seulement la lettre, pas la justification. Penser: Vous lisez le contenu (phrase a), puis vous vous concentrez sur la phrase b. Vous la lisez attentivement et vous vous demandez si cette formulation est acceptable. Vous vous posez la question: Est-ce que je, ou quelqu’un d’autre, pourrais dire cela ? Votre réponse vous aide à choisir la lettre qui exprimera votre jugement en la tache. Voici les deux phrases: '{text}' Réponse:"
        zero_shot = zero_shot_prompting("jugement d'acceptabilité", zero_shot_prompt)

        # Record the end time and calculate the elapsed time for this sentence
        elapsed_time = time.time() - start_time

        # Save the response in the dictionary with the index as the key
        antworten_null_shot[i] = {
            "Prompt": zero_shot_prompt,
            "Antwort": zero_shot,
            "Zeit": elapsed_time
        }

        #print(f"Input: {text}")
        print(f"Output: {zero_shot}")
        print(i,f"Time taken: {elapsed_time:.2f} seconds")
        print()

    # Calculate elapsed time for the iteration
    elapsed_time_pro_part = time.time() - start_time_pro_part

    # Save the null-shot results of this iteration in the parent dictionary
    antworten_null_shot_all[iteration] = antworten_null_shot
    reaction_time_per_participant.append(elapsed_time_pro_part)
    print(f"Elapsed time for iteration {iteration}: {elapsed_time_pro_part:.2f} seconds")
    print()

print(antworten_null_shot_all)
print("Mean reaction time per participant:", sum(reaction_time_per_participant)/len(reaction_time_per_participant))




In the **seventh block** we save the results from all test subjects in an Excel file and a CSV file. These files store the answers of MULTIPLE LLM-participants.
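
Since the items are shuffled before prompting, each response has to be matched back to its stimulus by the sentence text contained in the prompt. The sketch below illustrates one way to do this with a lookup dictionary; `extract_item`, the sample stimulus, and the shortened prompt are illustrative assumptions.

```python
def extract_item(prompt, start="deux phrases:", end="Réponse"):
    """Return the text between the two markers, without the surrounding quotes."""
    return prompt.split(start)[1].split(end)[0].strip().strip("'")


# Made-up stimulus entry in the same shape as stimuli_dict
stimuli = {
    0: {"French Sentences": "a. Phrase A. b. Phrase B.", "status": "critical"},
}
# Index the metadata by sentence text for direct lookup
by_sentence = {row["French Sentences"].strip(): row for row in stimuli.values()}

prompt = "... Voici les deux phrases: 'a. Phrase A. b. Phrase B.' Réponse:"
item = extract_item(prompt)
meta = by_sentence.get(item)
print(item)
print(meta["status"])
```

A dictionary lookup avoids comparing every response against every stimulus, and `by_sentence.get` returns `None` for prompts that cannot be matched, which makes missing matches easy to detect.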

In [None]:
import pandas as pd
from datetime import datetime
columns_to_extract = ['French Sentences', 'syntactic position', 'trace position', 'status', 'gold_ratings']
# Dictionary to DataFrame conversion
def dict_to_dataframe(antworten_dict):
    """Flatten {informant: {index: entry}} into one row per response, joined with the stimulus metadata."""
    rows = []
    for iteration, prompts in antworten_dict.items():
        for index, entry in prompts.items():
            # Recover the item from the prompt: it sits between "deux phrases:" and "Réponse"
            prompt_text = entry['Prompt']
            prompt_cot = prompt_text.split("deux phrases:")[1].strip().replace("'", "")
            item_text = prompt_cot.split("Réponse")[0].strip()
            for val in stimuli_dict.values():
                # Straight apostrophes are removed on both sides so that items containing them still match
                if val['French Sentences'].replace("'", "").strip() == item_text:
                    rows.append({
                        "French Sentences": val['French Sentences'],
                        "syntactic position": val['syntactic position'],
                        "trace position": val['trace position'],
                        "status": val['status'],
                        "gold_ratings": val['gold_ratings'],
                        "Informant": iteration,
                        "Index": index,
                        "Prompt": entry["Prompt"],
                        "models_answer": entry["Antwort"],
                        "time": entry["Zeit"],
                        "date": datetime.now().strftime("%Y-%m-%d %H:%M:%S")
                    })
    return pd.DataFrame(rows)

# Convert the dictionary to a DataFrame
new_data_df = dict_to_dataframe(antworten_null_shot_all)

# Save the DataFrame as new files
# The file name matches the model used above and the file read in the evaluation step
new_data_df.to_excel('all_accept_cotshot_o4-mini.xlsx', index=False)
new_data_df.to_csv('all_accept_cotshot_o4-mini.csv', index=False)

print("Data successfully saved!")


# Replication pipeline step 4
Finally, we evaluate the results.
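
The core of the evaluation is to translate each answer letter into a numeric score and to compare the mean model score per sentence with the gold rating. A small worked example of that arithmetic follows; the mapping is the `grade_map` used in the evaluation cell, while the four answers and the gold rating of 3.0 are made up for illustration.

```python
grade_map = {'a': 5, 'b': 3, 'c': 1, 'd': -5, 'e': 0}

# Made-up answers of four LLM-participants for one sentence
answers = ["a", "b", "a", "d"]
gold_rating = 3.0  # made-up gold rating for the same sentence

# Translate letters to scores and average them
scores = [grade_map[letter] for letter in answers]
mean_model = sum(scores) / len(scores)      # (5 + 3 + 5 - 5) / 4 = 2.0

# Signed and absolute difference between model and gold
diff_signed = mean_model - gold_rating      # 2.0 - 3.0 = -1.0
diff_abs = abs(diff_signed)                 # 1.0

print(scores, mean_model, diff_signed, diff_abs)
```

A negative signed difference means that the model rated the sentence lower than the human participants did; the absolute difference is what the binning at the end of the cell counts.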

In [None]:
import csv
from collections import defaultdict, Counter
import numpy as np
import string
import unicodedata

results = 'all_accept_cotshot_o4-mini.csv'
experimental_file = 'experimental_items_X1_to_X68.csv'

sent_gold_ratings = {}
ratings_per_sentence = defaultdict(list)
status_dict = {}
syntactic_dict = {}
trace_dict = {}
label_dict = {}
grade_map = {'a': 5, 'b': 3, 'c': 1, 'd': -5, 'e': 0}
translated_scores = defaultdict(list)
rating_diff_data = []
critical_sentence_set = set()  # Track sentences marked "critical" in column 7 (index 6)
baseline_sentence_set = set()

def normalize(text):
    if not text:
        return ""
    text = unicodedata.normalize('NFKD', text)
    text = text.encode('ascii', 'ignore').decode('utf-8')
    text = text.replace("’", "'").replace("‘", "'").replace("“", '"').replace("”", '"')
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text.strip().lower()

# Step 1: Read gold labels

with open(experimental_file, mode='r', encoding='utf-8') as label_file:
    label_reader = csv.reader(label_file)
    next(label_reader)
    for row in label_reader:
        sentence_id = normalize(row[0])
        sentence_text = normalize(row[1])
        label_dict[sentence_text] = sentence_id
        status = row[6].strip().lower()  # Column 7 contains 'critical' or 'baseline'

        if status == 'critical':
            critical_sentence_set.add(sentence_text)
        elif status == 'baseline':
            baseline_sentence_set.add(sentence_text)

# Step 2: Read model results

with open(results, mode='r', encoding='utf-8') as file:
    csv_reader = csv.reader(file)
    next(csv_reader)

    for row in csv_reader:
        sentence = normalize(row[0])
        syntactic_position = row[1]
        trace_position = row[2]
        status = row[3]
        gold_rating = row[4]
        model_answer = row[8]

        sent_gold_ratings[sentence] = gold_rating
        ratings_per_sentence[sentence].append({
            'answer': model_answer,
            'gold': gold_rating,
            'status': status,
            'syntactic': syntactic_position,
            'trace': trace_position
        })

        status_dict[sentence] = status
        syntactic_dict[sentence] = syntactic_position
        trace_dict[sentence] = trace_position

# Step 3: Score comparison and collect rating differences
model_ratings_letter_all=[]
model_ratings_letter_critical=[]
model_ratings_letter_baseline=[]
for sentence, answers in ratings_per_sentence.items():
    model_ratings = []
    gold_ratings = []
    model_ratings_permodel = []

    for entry in answers:
        raw = entry['answer']
        cleaned = normalize(raw)
        if not cleaned:
            continue  # Skip empty answers

        # The first character of the normalized answer is taken as the chosen letter
        letter = cleaned[0]
        model_ratings_letter_all.append(letter)
        model_rating = grade_map.get(letter)
        if model_rating is not None:
            gold_rating = float(sent_gold_ratings.get(sentence, 0))
            model_ratings.append(model_rating)
            gold_ratings.append(gold_rating)

        if entry['status'] == 'critical':
            model_ratings_letter_critical.append(letter)
        elif entry['status'] == 'baseline':
            model_ratings_letter_baseline.append(letter)

    if model_ratings and gold_ratings:
        mean_model_rating = np.mean(model_ratings)
        mean_gold_rating = np.mean(gold_ratings)
        rating_diff_abs = abs(mean_model_rating - mean_gold_rating)
        rating_diff_signed = mean_model_rating - mean_gold_rating
        label = label_dict.get(sentence, 'unknown')
        status = answers[0]['status'] if answers else 'unknown'
        rating_diff_data.append({
            'sentence': sentence,
            'label': label,
            'mean_model': mean_model_rating,
            'mean_gold': mean_gold_rating,
            'rating_diff_abs': rating_diff_abs,
            'rating_diff_signed': rating_diff_signed,
            'status': status
        })

print("Rating differences per sentence:", rating_diff_data)
# Step 4: Check for sentences missing in model results
missing_model_sentences = set(label_dict.keys()) - set(ratings_per_sentence.keys())
print(f"Sentences in gold data but missing in model results: {len(missing_model_sentences)}")
for s in missing_model_sentences:
    print(f"Missing: {s}")

# Step 5: Summary stats
max_rating_diff_entry = max(rating_diff_data, key=lambda x: x['rating_diff_abs']) if rating_diff_data else None
count_large_diffs_4 = (
    sum(1 for item in rating_diff_data if item['rating_diff_abs'] >= 4),
    [item['label'] for item in rating_diff_data if item['rating_diff_abs'] >= 4]
)
count_large_diffs_2 = sum(1 for item in rating_diff_data if item['rating_diff_abs'] >= 2)

# Step 6: Aggregate rating differences by condition
def rating_diff_stats_by_condition(attribute_dict):
    grouped = defaultdict(list)
    model_ratings_group = defaultdict(list)
    gold_ratings_group = defaultdict(list)

    for item in rating_diff_data:
        sentence = item['sentence']
        key = attribute_dict.get(sentence, 'unknown')
        grouped[key].append(item['rating_diff_abs'])
        model_ratings_group[key].append(item['mean_model'])
        gold_ratings_group[key].append(item['mean_gold'])

    stats = {}
    for group in grouped:
        mismatches = grouped[group]
        model_vals = model_ratings_group[group]
        gold_vals = gold_ratings_group[group]

        mean_model = round(np.mean(model_vals), 2)
        mean_gold = round(np.mean(gold_vals), 2)
        mean_diff = round(abs(mean_model - mean_gold), 2)

        stats[group] = {
            'count': len(mismatches),
            'mean_rating_diff_abs': round(np.mean(mismatches), 2),
            'std_rating_diff': round(np.std(mismatches), 2),
            'mean_model': mean_model,
            'mean_gold': mean_gold,
            'mean_model_vs_gold_diff': mean_diff
        }

    return stats

status_rating_diff_stats = rating_diff_stats_by_condition(status_dict)
syntactic_rating_diff_stats = rating_diff_stats_by_condition(syntactic_dict)
trace_rating_diff_stats = rating_diff_stats_by_condition(trace_dict)
label_rating_diff_stats = rating_diff_stats_by_condition(label_dict)

# Step 7: Print summary
print("\n--- Global Summary ---")
print("Total sentences analyzed:", len(rating_diff_data))
if max_rating_diff_entry is not None:  # None if no sentence could be scored
    print(f"Max Rating Gap: Sentence '{max_rating_diff_entry['sentence']}' → Model: {max_rating_diff_entry['mean_model']}, Gold: {max_rating_diff_entry['mean_gold']}, Gap: {max_rating_diff_entry['rating_diff_abs']:.2f}")
print(f"Count of Rating Gaps ≥ 4: {count_large_diffs_4}")
print(f"Count of Rating Gaps ≥ 2: {count_large_diffs_2}")

def print_rating_diff_stats(name, stats):
    print(f"\n{name} Rating Difference Stats:")
    for key, val in sorted(stats.items()):
        print(f"{key:<20} → gap: {val['mean_rating_diff_abs']:.2f}±{val['std_rating_diff']:.2f}, "
              f"model: {val['mean_model']}, gold: {val['mean_gold']}, "
              f"|Δ model-gold|: {val['mean_model_vs_gold_diff']}, n={val['count']}")

print_rating_diff_stats("Status", status_rating_diff_stats)
print_rating_diff_stats("Syntactic Position", syntactic_rating_diff_stats)
print_rating_diff_stats("Trace Position", trace_rating_diff_stats)
print_rating_diff_stats("Label", label_rating_diff_stats)

def bin_rating_diffs(data):
    """Count entries per bin of absolute rating difference (upper bounds are exclusive)."""
    bins = {"0-1": 0, "1-2": 0, "2-3": 0, "3-4": 0, "≥4": 0}
    for item in data:
        rating_diff = item['rating_diff_abs']
        if rating_diff < 1:
            bins["0-1"] += 1
        elif rating_diff < 2:
            bins["1-2"] += 1
        elif rating_diff < 3:
            bins["2-3"] += 1
        elif rating_diff < 4:
            bins["3-4"] += 1
        else:
            bins["≥4"] += 1
    return bins

# Separate the data into critical and baseline
critical_rating_diff_data = []
baseline_rating_diff_data = []

for entry in rating_diff_data:
    sentence = entry['sentence']
    if sentence in critical_sentence_set:
        critical_rating_diff_data.append(entry)
    elif sentence in baseline_sentence_set:
        baseline_rating_diff_data.append(entry)

# Now, we can print the results for both categories:
print("ALL rating diff data count:", len(rating_diff_data))
print("Critical rating diff data count:", len(critical_rating_diff_data))
print("Baseline rating diff data count:", len(baseline_rating_diff_data))

# Bin the rating differences for all, critical, and baseline sentences
binned_rating_diffs_all = bin_rating_diffs(rating_diff_data)
binned_rating_diffs_critical = bin_rating_diffs(critical_rating_diff_data)
binned_rating_diffs_baseline = bin_rating_diffs(baseline_rating_diff_data)

# Print the results for all sentences
total_all = len(rating_diff_data)
print("\n--- Binned Rating Differences ALL ---")
for bin_label, count in binned_rating_diffs_all.items():
    percentage = (count / total_all * 100) if total_all else 0
    print(f"{bin_label} → {count} sentences ({percentage:.2f}%)")

# Print the results for critical sentences
total_critical = len(critical_rating_diff_data)
print("\n--- Binned Rating Differences (Critical only) ---")
for bin_label, count in binned_rating_diffs_critical.items():
    percentage = (count / total_critical * 100) if total_critical else 0
    print(f"{bin_label} → {count} sentences ({percentage:.2f}%)")

# Print the results for baseline sentences
total_baseline = len(baseline_rating_diff_data)
print("\n--- Binned Rating Differences (Baseline only) ---")
for bin_label, count in binned_rating_diffs_baseline.items():
    percentage = (count / total_baseline * 100) if total_baseline else 0
    print(f"{bin_label} → {count} sentences ({percentage:.2f}%)")

# Step 8: Model rating usage (critical vs. baseline)
grade_letters = ['a', 'b', 'c', 'd', 'e']

def print_grade_distribution(rating_list, label):
    counter = Counter(rating_list)
    total = sum(counter.values())

    print(f"\nGrade Distribution for {label}:")
    for grade in grade_letters:
        count = counter.get(grade, 0)
        percentage = (count / total * 100) if total > 0 else 0
        print(f"{grade.upper()} → {count} times ({percentage:.2f}%)")

print_grade_distribution(model_ratings_letter_all, "All")
print_grade_distribution(model_ratings_letter_critical, "Critical")
print_grade_distribution(model_ratings_letter_baseline, "Baseline")

