# Beyond Correlation: Why Likert-Scale Calibration Challenges LLM-as-Participant Designs



# Code structure: #

In the **first block** of the code, the required libraries (such as openai) are imported and your own OpenAI API key is set.

In [None]:
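import os
from openai import OpenAI

os.environ["OPENAI_API_KEY"] = ""  # insert your OpenAI API key here

# Create the client used by the prompting functions in the second block;
# OpenAI() reads the key from the OPENAI_API_KEY environment variable.
client = OpenAI()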
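
Hard-coding the key in the notebook makes it easy to leak when the file is shared. The following is a minimal sketch of an alternative, assuming the key is either already exported in the shell or typed in interactively. The helper name resolve_api_key is hypothetical and not part of the original notebook.

```python
import os
from getpass import getpass


def resolve_api_key(env=os.environ, ask=getpass):
    """Return the API key from the environment, asking the user only if it is missing."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        # Hypothetical fallback: prompt for the key without echoing it to the screen
        key = ask("OpenAI API key: ").strip()
    if not key:
        raise ValueError("No OpenAI API key provided")
    return key
```

To use it, assign the result to os.environ["OPENAI_API_KEY"] before the OpenAI client is created.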




The **second block** of the code contains the prompt engineering functions that we will use later. It provides zero-shot and few-shot prompting as example functions. However, only the zero-shot function is applied in the following steps, since few-shot prompting performed poorly in initial pilot studies.

In [None]:
import openai
from openai import OpenAI

# Function for zero-shot prompting
def zero_shot_prompting(task, prompt):
    """
    Performs zero-shot prompting by sending a prompt to the language model
    without providing any prior examples.

    Args:
    task (str): The task to be performed (kept as a label; not sent to the model).
    prompt (str): The text sent to the language model as input.

    Returns:
    str: The language model's response.
    """
    # Build the chat messages and send them to the language model
    response = client.chat.completions.create(
        model="o4-mini",
        messages=[
            {"role": "system", "content": "Tu es un(e) assistant(e)"}, # CHANGE HERE IF DESIRED
            {"role": "user", "content": prompt} # user message containing the actual prompt
        ]
    )
    # Return the content of the model's first response message
    return response.choices[0].message.content



# Function for few-shot prompting
def few_shot_prompting(task, examples, prompt):
    """
    Performs few-shot prompting by providing a few examples
    before the actual prompt is sent to the language model.

    Args:
    task (str): The task to be performed (kept as a label; not sent to the model).
    examples (list of dict): A list of examples, where each example is a dictionary with 'input' and 'output'.
    prompt (str): The text sent to the language model as input.

    Returns:
    str: The language model's response.
    """
    # Initialize the message list with a system message
    messages = [{"role": "system", "content": "Tu es un(e) assistant(e)"}] # CHANGE HERE IF DESIRED

    # Add the examples to the messages as user/assistant turns
    for example in examples:
        messages.append({"role": "user", "content": example['input']})
        messages.append({"role": "assistant", "content": example['output']})

    # Add the actual prompt to the message list
    messages.append({"role": "user", "content": prompt})

    # Send the chat messages to the language model
    response = client.chat.completions.create(
        model="o4-mini",
        messages=messages
    )
    # Return the content of the model's first response message
    return response.choices[0].message.content
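
The shape of the examples argument is easiest to see without calling the API. The sketch below reproduces only the message assembly from few_shot_prompting as a separate helper. The helper name build_few_shot_messages is hypothetical, and the two example ratings are the anchor pairs quoted in the French task instructions used later in this notebook.

```python
def build_few_shot_messages(system_prompt, examples, prompt):
    """Assemble the message list: system message, example turns, then the real prompt."""
    messages = [{"role": "system", "content": system_prompt}]
    for example in examples:
        # Each example becomes one user turn followed by one assistant turn
        messages.append({"role": "user", "content": example["input"]})
        messages.append({"role": "assistant", "content": example["output"]})
    messages.append({"role": "user", "content": prompt})
    return messages


# Illustrative examples (assumption: ratings are passed as strings)
examples = [
    {"input": "professeur-enseignant", "output": "6"},
    {"input": "lune-route", "output": "1"},
]
messages = build_few_shot_messages("Tu es un(e) assistant(e)", examples, "abeille-guêpe")
print([m["role"] for m in messages])
```

Each example adds two messages, so two examples plus the system message and the final prompt give six messages in total.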


In the **third block** of the code, the uploaded data are read in. This makes it possible to present the stimuli to the LLM during prompting.

In [None]:
import pandas as pd
import time
import json
import gc
from psutil import virtual_memory
from datetime import datetime
import random

# Load data
daten = '630similarFrenchWordPair.csv'  # Adjust the filename as needed
df = pd.read_csv(daten)

# Extract specific columns and pack them into a dictionary
columns_to_extract = ['wordPairs', 'PairMeanConcreteness', 'clusters', 'MeanPairSimilarity']
selected_data = df[columns_to_extract].astype(str)
stimuli_dict = selected_data.to_dict(orient="index")

# Create list of items
all_items = [entry['wordPairs'] for entry in stimuli_dict.values()]
#random.shuffle(all_items)
print("The materials for the study are:")

n = len(all_items) // 6

part1 = all_items[0*n : 1*n]
print("list 1:", part1)
part2 = all_items[1*n : 2*n]
part3 = all_items[2*n : 3*n]
part4 = all_items[3*n : 4*n]
part5 = all_items[4*n : 5*n]
part6 = all_items[5*n : 6*n]

The materials for the study are:
list 1: ['abdiquer-abandonner', 'abdomen-ventre', 'abeille-guêpe', 'abolir-annuler', 'absence-manque', 'abstrait-irréel', 'absurde-ridicule', 'acceptation-accord', 'activisme-militantisme', 'addition-somme', 'admirer-célébrer', 'adulte-enfant', 'aérien-vaporeux', 'aéroport-gare', 'affichage-présentation', 'agence-bureau', 'agile-adroit', 'agilité-habileté', 'agitation-troubles', 'agiter-secouer', 'agriculteur-chasseur', 'aigle-faucon', 'album-classeur', 'algèbre-maths', 'alias-pseudonyme', 'allégorie-métaphore', 'alphabet-chiffre', 'altruisme-bonté', 'ambassade-consulat', 'ambition-désir', 'amer-acide', 'amour-adoration', 'ampoule-bougie', 'analogie-exemple', 'anatomie-forme', 'ancre-grappin', 'animal-bête', 'anormal-étrange', 'anxiété-crainte', 'apathie-ennui', 'appartenir-dépendre', 'apprendre-savoir', 'après-futur', 'arbitraire-injuste', 'arbitre-juge', 'architecture-construction', 'argent-monnaie', 'arrivée-approche', 'arrogance-vanité', 'artisanat-
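
The slicing above relies on the number of items being divisible by 6; otherwise the integer division silently drops the remaining items. A minimal sketch of a more general split follows. The helper name split_into_lists is hypothetical, and the figure of 630 items is an assumption based on the filename.

```python
def split_into_lists(items, k):
    """Split items into k consecutive lists whose sizes differ by at most one."""
    base, extra = divmod(len(items), k)
    parts, start = [], 0
    for idx in range(k):
        # The first `extra` lists receive one additional item each
        size = base + (1 if idx < extra else 0)
        parts.append(items[start:start + size])
        start += size
    return parts


# Stand-in for all_items (assumption: 630 word pairs)
demo_items = [f"pair{i}" for i in range(630)]
parts = split_into_lists(demo_items, 6)
print([len(p) for p in parts])
```

With 630 items this gives six lists of 105 items each, matching part1 to part6 above.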

In the **fourth block** we apply the prompt engineering functions from block 2. The prompts are formulated in this cell, so different prompting strategies can be tested here. We prompt the model as many times as we have (or want to have) participants to simulate.

In [None]:
# Prompting with all participants and with the entire corpus

antworten_null_shot_all = {}
probanden_zahl= 62
items = part1
reaction_time_per_participant=list()

import time  # Import the time module

# Note: range(probanden_zahl + 1) runs probanden_zahl + 1 iterations (informants 0 to 62)
for iteration in range(probanden_zahl + 1):  # Repeat the process
    print(f"### LLM-informant {iteration} ###")

    antworten_null_shot = {}  # Dictionary for zero-shot responses in this iteration

    # === Zero-shot ===
    # Record the start time
    start_time_pro_part = time.time()

    for i, text in enumerate(items):
        start_time = time.time()
        zero_shot_prompt = f"Nous allons vous présenter deux mots separées de un -. Votre tâche consiste à réfléchir au sens de ces mots et indiquer sur une échelle en 7 points à quel point vous considérez que ces deux mots sont similaires sémantiquement (qu’ils ont le même sens). Si vous considérez que les deux mots de la paire sont très similaires sémantiquement, vous pouvez choisir le point 7 (« tout à fait similaires »). Si vous considérez que les deux mots de la paire sont très différents sémantiquement, vous pouvez choisir le point 1 (« pas du tout similaires »). Vous pouvez utiliser tous les points de l’échelle. Par exemple, si la paire présentée est professeur – enseignant, vous pouvez choisir le point 6. Si la paire présentée est lune – route, vous pouvez choisir le point 1. N’hésitez pas à utiliser toute la gamme des chiffres comprise entre 1 et 7 mais ne vous préoccupez pas du nombre de fois que vous avez utilisé un chiffre en particulier tant qu’il fait référence à votre véritable jugement. Il n'y a pas de bonne ou de mauvaise réponse. Veuillez donc répondre le plus spontanément possible. Ne donnez pas d'éxplications. Les mots sont: '{text}'"
        zero_shot = zero_shot_prompting("jugement", zero_shot_prompt)

        # Record the end time and calculate the elapsed time for this item
        elapsed_time = time.time() - start_time

        # Save the response in the dictionary with the index as the key
        antworten_null_shot[i] = {
            "Prompt": zero_shot_prompt,
            "Antwort": zero_shot,
            "Zeit": elapsed_time
        }

        #print(f"Input: {text}")
        print(f"Output: {zero_shot}")
        print(i,f"Time taken: {elapsed_time:.2f} seconds")
        print()

    # Calculate elapsed time for the iteration
    elapsed_time_pro_part = time.time() - start_time_pro_part

    # Save the zero-shot results of this iteration in the parent dictionary
    antworten_null_shot_all[iteration] = antworten_null_shot
    reaction_time_per_participant.append(elapsed_time_pro_part)
    print(f"Elapsed time for iteration {iteration}: {elapsed_time_pro_part:.2f} seconds")
    print()

print(antworten_null_shot_all)
print("Mean reaction time per participant:", sum(reaction_time_per_participant)/len(reaction_time_per_participant))




### LLM-informant 51 ###
Output: 7
0 Time taken: 2.35 seconds

Output: 3
1 Time taken: 2.50 seconds

Output: 5
2 Time taken: 2.35 seconds

Output: 4
3 Time taken: 2.14 seconds

Output: 3
4 Time taken: 3.59 seconds

Output: 5
5 Time taken: 3.70 seconds

Output: 4
6 Time taken: 2.71 seconds

Output: 4
7 Time taken: 2.87 seconds

Output: 5
8 Time taken: 2.96 seconds

Output: 6
9 Time taken: 2.65 seconds

Output: 6
10 Time taken: 1.50 seconds

Output: 7
11 Time taken: 1.52 seconds

Output: 5
12 Time taken: 3.20 seconds

Output: 4
13 Time taken: 5.78 seconds

Output: 3
14 Time taken: 2.40 seconds

Output: 7
15 Time taken: 2.08 seconds

Output: 4
16 Time taken: 2.78 seconds

Output: 5
17 Time taken: 3.60 seconds

Output: 4
18 Time taken: 3.48 seconds

Output: 2
19 Time taken: 3.12 seconds

Output: 7
20 Time taken: 2.43 seconds

Output: 3
21 Time taken: 2.83 seconds

Output: 5
22 Time taken: 3.09 seconds

Output: 5
23 Time taken: 2.27 seconds

Output: 4
24 Time taken: 2.97 seconds

Output: 5
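
In the output shown above the model returns a bare digit, but nothing in the prompt guarantees this for every call. Before analysis, the raw answers could be validated against the 7-point scale. The following is a minimal sketch under that assumption. The helper name parse_likert is hypothetical and not part of the original notebook.

```python
import re


def parse_likert(answer, low=1, high=7):
    """Return the rating as an int, or None if the answer holds no single valid rating."""
    numbers = re.findall(r"\d+", answer)
    if len(numbers) != 1:
        # No number at all, or an ambiguous answer such as "entre 3 et 4"
        return None
    value = int(numbers[0])
    return value if low <= value <= high else None
```

Answers that map to None can then be inspected manually or re-prompted rather than silently entering the analysis.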


In the **fifth block** we save the results from all simulated participants to a CSV file (the Excel export is commented out). This file stores the answers of all LLM participants.

In [None]:
import pandas as pd
from datetime import datetime
import ast


# Dictionary to DataFrame conversion
def dict_to_dataframe(antworten_dict):
    rows = []
    for iteration, prompts in antworten_dict.items():
        for index, entry in prompts.items():
            prompt_text = entry['Prompt']
            # Recover the word pair from the end of the prompt ("Les mots sont: '<pair>'")
            prompted_pair = prompt_text.split("Les mots sont:")[1].strip().replace("'", "")
            for val in stimuli_dict.values():
                # Attach the stimulus metadata to the answer whose prompt contains this pair
                if val['wordPairs'].strip() == prompted_pair:
                    rows.append({
                        "wordPairs": val['wordPairs'],
                        "PairMeanConcreteness": val['PairMeanConcreteness'],
                        "clusters": val['clusters'],
                        "MeanPairSimilarity": val['MeanPairSimilarity'],
                        "Informant": iteration,
                        "Index": index,
                        "Prompt": entry["Prompt"],
                        "models_answer": entry["Antwort"],
                        "time": 0,  # placeholder; the measured latency in entry["Zeit"] is not exported
                        "date": datetime.now().strftime("%Y-%m-%d %H:%M:%S")
                    })
    return pd.DataFrame(rows)

# Convert the dictionary to a DataFrame
new_data_df = dict_to_dataframe(antworten_null_shot_all)

# Save the DataFrame as new files
#new_data_df.to_excel('list1_Lakhzoum_gpt-40-mini_fewshot.xlsx', index=False)
# Note: the filename says "fewshot", but the answers saved here come from the zero-shot loop
new_data_df.to_csv('list1_Lakhzoum_o4-mini_fewshot.csv', index=False)

print("Data successfully saved!")


Data successfully saved!
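
The conversion function scans every stimulus for every answer. The same matching can be sketched with a dictionary keyed by word pair, which needs one lookup per answer. The helper names below are hypothetical, and the tiny data set uses placeholder values rather than real stimulus norms.

```python
def extract_word_pair(prompt_text, marker="Les mots sont:"):
    """Return the word pair quoted at the end of the prompt."""
    return prompt_text.split(marker)[1].strip().replace("'", "")


def index_by_word_pair(stimuli):
    """Map each word pair to its stimulus record for direct lookup."""
    return {val["wordPairs"].strip(): val for val in stimuli.values()}


# Tiny illustrative stand-in for stimuli_dict (placeholder metadata)
stimuli = {
    0: {"wordPairs": "abeille-guêpe", "clusters": "placeholder-a"},
    1: {"wordPairs": "aigle-faucon", "clusters": "placeholder-b"},
}
lookup = index_by_word_pair(stimuli)

prompt = "Ne donnez pas d'éxplications. Les mots sont: 'aigle-faucon'"
pair = extract_word_pair(prompt)
print(pair, lookup[pair]["clusters"])
```

One difference from the original loop: if the same word pair occurred in several stimulus rows, the dictionary would keep only the last one, whereas the loop appends one row per match.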
