# Acting Like Humans? Evaluating Large Language Models as Proxies in Linguistic Experiments


This notebook refers to the paper ***Acting Like Humans? Evaluating Large Language Models as Proxies in Linguistic Experiments***, which aims to replicate linguistic experimental pipelines with human participants using LLMs.


It is intended to be used for further research.



# Code structure: #

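In the **first block** of the code, the required libraries (such as `openai`) are imported, your own OpenAI API key is set, and the client used by the prompting functions is created.

In [None]:
import os
from openai import OpenAI

os.environ["OPENAI_API_KEY"] = ""  # insert your OpenAI API key here

# Create the client used by the prompting functions below;
# it reads the key from the OPENAI_API_KEY environment variable
client = OpenAI()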

The **second block** of the code contains the prompt engineering functions that we will use later. Zero-shot and few-shot prompting are provided as example functions. However, only the zero-shot function is applied in the next steps, since few-shot prompting performed poorly in the first pilot studies.

In [None]:
# Function for Zero-Shot Prompting
def zero_shot_prompting(task, prompt):
    """
    Performs zero-shot prompting by sending a prompt to the language model
    without providing any previous examples.

    Args:
    task (str): The task to be performed.
    prompt (str): The text to be sent as input to the language model.

    Returns:
    str: The response from the language model.
    """
    # Creating the chat message and sending it to the language model
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are an assistant"},
            {"role": "user", "content": prompt}
        ]
    )
    # Returning the content of the first response message from the model
    return response.choices[0].message.content



# Function for Few-Shot Prompting
def few_shot_prompting(task, examples, prompt):
    """
    Performs few-shot prompting by providing some examples before sending the actual
    prompt to the language model.

    Args:
    task (str): The task to be performed.
    examples (list of dict): A list of examples, where each example is a dictionary
                              containing 'input' and 'output' keys.
    prompt (str): The text to be sent as input to the language model.

    Returns:
    str: The response from the language model.
    """
    # Initializing the messages list with a system message
    messages = [{"role": "system", "content": "You are an assistant"}]

    # Adding the examples to the messages
    for example in examples:
        messages.append({"role": "user", "content": example['input']})
        messages.append({"role": "assistant", "content": example['output']})

    # Adding the actual prompt to the messages list
    messages.append({"role": "user", "content": prompt})

    # Creating the chat message and sending it to the language model
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages
    )
    # Returning the content of the first response message from the model
    return response.choices[0].message.content
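
To make the expected shape of `examples` concrete, the following is a minimal sketch that builds the message list in the same way as `few_shot_prompting`, but without calling the API. The helper name `build_few_shot_messages` and the example sentences are hypothetical and serve only as illustration; they are not taken from the study materials.

```python
def build_few_shot_messages(examples, prompt, system="You are an assistant"):
    """Build the chat messages for few-shot prompting (no API call)."""
    messages = [{"role": "system", "content": system}]
    # Each example becomes one user turn followed by one assistant turn
    for example in examples:
        messages.append({"role": "user", "content": example["input"]})
        messages.append({"role": "assistant", "content": example["output"]})
    # The actual prompt is always the last user turn
    messages.append({"role": "user", "content": prompt})
    return messages


# Hypothetical examples, not taken from the study materials
examples = [
    {"input": "Complete with an article: '___ casa'", "output": "la"},
    {"input": "Complete with an article: '___ libro'", "output": "el"},
]
messages = build_few_shot_messages(examples, "Complete with an article: '___ mesa'")
for message in messages:
    print(message["role"], "->", message["content"])
```

Each example contributes two messages, so two examples plus the system message and the final prompt yield six messages in total.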


# Replication pipeline step 1

The **third block** contains the code for reading in your uploaded data, so that the items can be presented to the LLM during prompting.

Data handling code is provided for both replications. Depending on which dataset is read in the cell, keep the corresponding passage and comment the other one out.

In [None]:
import pandas as pd
import time
import json
import gc
from psutil import virtual_memory
from datetime import datetime

# Load data
daten = 'Lombard_replicat_appendix.csv'  # Adjust the filename as needed; it must match the replication passage that is active below
df = pd.read_csv(daten)
shuffled_df = df.sample(frac=1).reset_index(drop=True)

# Data handling for replication 1 (Cruz_23)
columns_to_extract = ['sentences', 'target_item ', 'animacy', 'congruent-gender', 'status']
selected_data = shuffled_df[columns_to_extract].astype(str)
stimuli_dict = selected_data.to_dict(orient="index")

all_items=list()
print("The materials for the study are:")
for entry in stimuli_dict.values():
    all_items.append(entry['sentences'])
print(all_items[:5])

'''# Data handling for replication 2 (Lombard_21)
columns_to_extract = ['change', 'regularity', 'process', 'neologism', 'target_sent']
selected_data = shuffled_df[columns_to_extract].astype(str)
stimuli_dict = selected_data.to_dict(orient="index")

# Create list of items
all_items = [entry['target_sent'] for entry in stimuli_dict.values()]
print("The materials for the study are:")
print(all_items)'''
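
Note that the column name `'target_item '` used for replication 1 carries a trailing space, so it has to match the CSV header exactly. The following is a minimal sketch of one way to guard against such mismatches, assuming it is acceptable to strip whitespace from the header names. The helper `read_stimuli` and the inline CSV content are hypothetical, and only the standard library is used.

```python
import csv
import io


def read_stimuli(handle, required_columns):
    """Read CSV rows, strip whitespace from header names, and check columns."""
    reader = csv.DictReader(handle)
    rows = [{key.strip(): value for key, value in row.items()} for row in reader]
    header = [name.strip() for name in (reader.fieldnames or [])]
    missing = [col for col in required_columns if col not in header]
    if missing:
        raise KeyError(f"Missing columns: {missing}")
    return rows


# Hypothetical two-row file; note the trailing space after "target_item"
sample = io.StringIO(
    "sentences,target_item ,animacy\n"
    "Sentence one,casa,inanimate\n"
    "Sentence two,hermano,animate\n"
)
stimuli = read_stimuli(sample, ["sentences", "target_item", "animacy"])
print(stimuli[0]["target_item"])
```

With headers cleaned this way, the later code could refer to `'target_item'` without the trailing space.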

# Replication pipeline step 2

In the **fourth block** we apply the prompt engineering functions from block 2 and formulate our prompts. Here we can test different prompting strategies with a single LLM-participant and on a limited subset of the dataset.

In [None]:
# To test one "LLM-participant" with a subset of the items

# Preparing to store the model's responses
answers_zero_shot = {}

# === Zero-Shot Prompting ===
print("=== Zero-Shot ===")

# Iterating through the subset for Zero-Shot
for i, text in enumerate(all_items[:3]):
    zero_shot_prompt = f"Insert your instructions here: '{text}'"

    # Performing Zero-Shot Prompting
    zero_shot = zero_shot_prompting("Task description", zero_shot_prompt)

    # Storing the response in the dictionary with the index as the key
    answers_zero_shot[i] = {
        "Prompt": zero_shot_prompt,
        "Response": zero_shot
    }

    # Printing the input and the corresponding output
    print(f"Input: {text}")
    print(f"Output: {zero_shot}")
    print()
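
Prompt variants can also be compared offline before spending API calls. The sketch below assumes a stand-in function `fake_model` in place of `zero_shot_prompting`; this function, the helper `run_pilot`, and the template strings are all hypothetical.

```python
def run_pilot(items, template, ask):
    """Fill `template` with each item and collect the responses from `ask`."""
    answers = {}
    for i, text in enumerate(items):
        prompt = template.format(text=text)
        answers[i] = {"Prompt": prompt, "Response": ask(prompt)}
    return answers


def fake_model(prompt):
    """Stand-in for the real model call; reports the prompt length."""
    return f"{len(prompt)} characters received"


# Hypothetical prompt variants to compare on the same subset
templates = {
    "plain": "Insert your instructions here: '{text}'",
    "role": "You are a participant in an experiment. Insert your instructions here: '{text}'",
}
pilot_items = ["item A", "item B", "item C"]
pilot = {name: run_pilot(pilot_items, template, fake_model)
         for name, template in templates.items()}
print(pilot["plain"][0]["Prompt"])
```

Replacing `fake_model` with a function that wraps `zero_shot_prompting` would run the same comparison against the real model.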


In the **fifth block** we save the results in an Excel file and, alternatively, in a CSV file. These files store the answers of ***ONE*** LLM-participant.

In [None]:
# Here to store the responses of ++ONE++ PARTICIPANT

import pandas as pd
import openpyxl
from openpyxl.utils import get_column_letter
from datetime import datetime
import csv


def convert_csv_to_excel(csv_path, excel_path):
    """
    Converts a CSV file into an Excel file (.xlsx).
    """
    df = pd.read_csv(csv_path)
    df.to_excel(excel_path, index=False)

def load_or_create_excel(file_path):
    """
    Loads an existing Excel file or creates a new one if the file does not exist.
    """
    try:
        workbook = openpyxl.load_workbook(file_path)
        sheet = workbook.active
    except FileNotFoundError:
        workbook = openpyxl.Workbook()
        sheet = workbook.active
        print(f"File '{file_path}' not found. This is an error, check the name of the file uploaded in Block 3.")
    return workbook, sheet

def add_columns_to_excel(sheet, new_columns):
    """
    Adds new columns to the Excel file.
    """
    existing_columns = sheet.max_column
    for idx, col_name in enumerate(new_columns, start=existing_columns + 1):
        sheet[f"{get_column_letter(idx)}1"] = col_name

def add_data_to_excel(sheet, data_dict, start_row):
    """
    Adds the contents of a dictionary to the Excel sheet.
    """
    for idx, (key, entry) in enumerate(data_dict.items(), start=start_row):
        sheet[f"A{idx}"] = key + 1
        sheet[f"B{idx}"] = entry['Response']
        sheet[f"C{idx}"] = entry['Prompt']
        sheet[f"D{idx}"] = entry.get('Response_Original', '')  # optional key; not set by the prompting loop in block 4
        sheet[f"E{idx}"] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    print("Data added.")

def save_excel(workbook, file_path):
    """
    Saves the workbook to the specified file.
    """
    workbook.save(file_path)
    print(f"File successfully saved at: {file_path}")

def extend_csv(csv_input_path, csv_output_path, data_dict):
    """
    Extends a CSV file by adding new columns and saves it as a new CSV file.
    """
    with open(csv_input_path, mode='r', newline='') as infile, open(csv_output_path, mode='w', newline='') as outfile:
        reader = csv.DictReader(infile)
        fieldnames = reader.fieldnames + ["Prompt", "Response_from_Model", "Date"]
        writer = csv.DictWriter(outfile, fieldnames=fieldnames)

        writer.writeheader()

        # Only process as many rows as there are entries in the data_dict
        for idx, row in enumerate(reader):
            if idx < len(data_dict):  # Process only if there is a corresponding entry in the dictionary
                row["Prompt"] = data_dict[idx]["Prompt"]
                row["Response_from_Model"] = data_dict[idx]["Response"]
                row["Date"] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
                writer.writerow(row)

    print(f"Extended CSV file saved at: {csv_output_path}")


# --- Parameters ---
csv_file_path = daten
excel_file_path = "results_one_participant.xlsx"
csv_output_path = "results_one_participant.csv"
# 1. Convert CSV -> Excel and extend it
convert_csv_to_excel(csv_file_path, excel_file_path)
workbook, sheet = load_or_create_excel(excel_file_path)
add_columns_to_excel(sheet, ["Prompt", "Response_from_Model", "Date"])
add_data_to_excel(sheet, answers_zero_shot, start_row=sheet.max_row + 1)
save_excel(workbook, excel_file_path)

# 2. Alternatively: Extend CSV directly
extend_csv(csv_file_path, csv_output_path, answers_zero_shot)


# Replication pipeline step 3

In the **sixth block** we repeat the prompting strategy that worked best as many times as we have (or want to have) participants, using all items in the corpus.

In [None]:
# Prompting with all participants and the entire corpus

answers_null_shot_all = {}
number_of_participants = 10
reaction_time_per_participant = list()

import time  # Import the time module

for iteration in range(1, number_of_participants + 1):  # Repeat the process for each participant
    print(f"### LLM-informant {iteration} ###")

    answers_null_shot = {}  # Dictionary to store zero-shot responses for this iteration

    # === Zero-Shot ===
    # Record the start time for the participant
    start_time_for_participant = time.time()

    for i, text in enumerate(all_items):  # Iterate through each item in the corpus
        start_time = time.time()  # Record the start time for each prompt
        zero_shot_prompt = f"Insert your instructions here: '{text}'"
        zero_shot = zero_shot_prompting("Task description", zero_shot_prompt)  # Call the zero-shot function

        # Record the end time and calculate the elapsed time for this sentence
        elapsed_time = time.time() - start_time

        # Store the response in the dictionary with the index as the key
        answers_null_shot[i] = {
            "Prompt": zero_shot_prompt,
            "Answer": zero_shot,
            "Time": elapsed_time
        }

        # Print the time taken for this particular prompt
        print(i, f"Time taken: {elapsed_time:.2f} seconds")
        print()

    # Calculate the total time taken for this participant's iteration
    elapsed_time_for_participant = time.time() - start_time_for_participant

    # Save the zero-shot results for this iteration in the parent dictionary
    answers_null_shot_all[iteration] = answers_null_shot
    reaction_time_per_participant.append(elapsed_time_for_participant)

    # Print the total time taken for the current participant
    print(f"Elapsed time for iteration {iteration}: {elapsed_time_for_participant:.2f} seconds")
    print()

# Print the results for all participants
print(answers_null_shot_all)

# Calculate and print the mean reaction time per participant
print("Mean reaction time per participant:", sum(reaction_time_per_participant)/len(reaction_time_per_participant))
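
The nested dictionary built above maps each participant to item indices and each item index to an entry with `Prompt`, `Answer`, and `Time`. As a minimal sketch, per-participant timing summaries can be derived from that structure. The timings below are made up, and `summarise_times` is a hypothetical helper.

```python
from statistics import mean


def summarise_times(answers_all):
    """Return {participant: (total_seconds, mean_seconds_per_item)}."""
    summary = {}
    for participant, answers in answers_all.items():
        times = [entry["Time"] for entry in answers.values()]
        summary[participant] = (sum(times), mean(times))
    return summary


# Made-up timings in the same shape as `answers_null_shot_all`
toy_answers = {
    1: {0: {"Prompt": "p0", "Answer": "la", "Time": 1.0},
        1: {"Prompt": "p1", "Answer": "el", "Time": 3.0}},
    2: {0: {"Prompt": "p0", "Answer": "la", "Time": 2.0},
        1: {"Prompt": "p1", "Answer": "la", "Time": 2.0}},
}
for participant, (total, per_item) in summarise_times(toy_answers).items():
    print(f"participant {participant}: total {total:.2f}s, mean per item {per_item:.2f}s")
```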


In the **seventh block** we save the results from all LLM-participants in an Excel file and a CSV file. These files store the answers of ***MULTIPLE*** LLM-participants.

Code is provided for both replications. According to the case study, choose the corresponding passage and comment the other one out.

In [None]:
import pandas as pd
from datetime import datetime

## To save results from replication 1 (Cruz_23)
# Dictionary to DataFrame conversion
def dict_to_dataframe(antworten_dict):
    rows = []
    for iteration, prompts in antworten_dict.items():
        for index, entry in prompts.items():
            for val in stimuli_dict.values():
                if val['sentences'] == entry['Prompt'].split(": ")[-1].replace("'", "").strip():
                  #print(val)
                  rows.append({
                        "sentences": val['sentences'],
                        "target_item": val['target_item '],
                        "animacy": val['animacy'],
                        "congruent-gender": val['congruent-gender'],
                        "status": val['status'],
                        "Informant": iteration,
                        "Index": index,
                        "Prompt": entry["Prompt"],
                        "models_answer": entry["Answer"],  # key as stored in block 6
                        "time": entry["Time"],  # key as stored in block 6
                        "date": datetime.now().strftime("%Y-%m-%d %H:%M:%S")
                    })
    return pd.DataFrame(rows)

## To save results from replication 2 (Lombard_21)
'''# Dictionary to DataFrame conversion
def dict_to_dataframe(antworten_dict):
    rows = []
    for iteration, prompts in antworten_dict.items():
        for index, entry in prompts.items():
            for val in stimuli_dict.values():
                prompt_text = entry['Prompt']
                #print(val['target_sent'].replace("'",""))
                #print(prompt_text.split("'non':")[1].split("Si")[0].strip().replace("'", ""))
                if val['target_sent'].replace("'","").strip() == prompt_text.split("'non':")[1].split("Si")[0].strip().replace("'", ""):
                  #print("yes",val['target_sent'].replace("'","").strip(),prompt_text.split("'non':")[1].split("Si")[0].strip().replace("'", ""))
                  rows.append({
                        "neologism": val['neologism'],
                        "sentences": val['target_sent'],
                        "change": val['change'],
                        "regularity": val['regularity'],
                        "process": val['process'],
                        "Informant": iteration,
                        "Index": index,
                        "Prompt": entry["Prompt"],
                        "models_answer": entry["Answer"],  # key as stored in block 6
                        "time": entry["Time"],  # key as stored in block 6
                        "date": datetime.now().strftime("%Y-%m-%d %H:%M:%S")
                    })
                else:
                  continue
                  #print("no",val['target_sent'].replace("'","").strip(),prompt_text.split("'non':")[1].split("Si")[0].strip().replace("'", ""))
    return pd.DataFrame(rows)'''

# Convert the dictionary to a DataFrame
new_data_df = dict_to_dataframe(answers_null_shot_all)

# Save the DataFrame as new files
new_data_df.to_excel('all_results.xlsx', index=False)
new_data_df.to_csv('all_results.csv', index=False)

print("Data successfully saved!")
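
Since `all_items` is built by iterating over `stimuli_dict` in order, the item index used during prompting should coincide with the key of the matching entry in `stimuli_dict`. Under that assumption, the lookup could be done by index instead of by parsing the prompt text. The following is a minimal sketch with toy data; the helper `flatten_answers` is hypothetical.

```python
def flatten_answers(answers_all, stimuli):
    """Join each response with its stimulus metadata via the item index."""
    rows = []
    for informant, answers in answers_all.items():
        for index, entry in answers.items():
            row = dict(stimuli[index])  # copy the stimulus metadata
            row.update({
                "Informant": informant,
                "Index": index,
                "Prompt": entry["Prompt"],
                "models_answer": entry["Answer"],
                "time": entry["Time"],
            })
            rows.append(row)
    return rows


# Toy data in the same shape as `stimuli_dict` and `answers_null_shot_all`
toy_stimuli = {0: {"sentences": "s0", "status": "critical"},
               1: {"sentences": "s1", "status": "filler"}}
toy_answers = {1: {0: {"Prompt": "p0", "Answer": "la", "Time": 0.5},
                   1: {"Prompt": "p1", "Answer": "el", "Time": 0.7}}}
rows = flatten_answers(toy_answers, toy_stimuli)
print(rows[0])
```

The resulting list of dictionaries can be passed to `pd.DataFrame(rows)` and saved in the same way as above. This avoids the nested loop over all stimuli and does not depend on the exact wording of the prompt.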


# Replication pipeline step 4

Finally, we evaluate the results.

Code is provided for both replications. According to the case study, choose the corresponding passage and comment the other one out.

In [None]:
import pandas as pd
import csv
from collections import Counter

# Upload the results
results = 'all_results.csv'  # File written in block 7; adjust the name if your results file differs

## To evaluate replication 1 (Cruz_23)
correct = 0
mistakes = {}

# Open the CSV file
with open(results, mode='r', encoding='utf-8') as file:
    csv_reader = csv.reader(file)

    # Optional: Get the header if the file has one
    header = next(csv_reader, None)  # Skip the header if present

    # Access and process each line
    for row in csv_reader:
      if row[4] == "critical":

        if row[3] == "feminine":
          if row[8] == "la":
            correct +=1
          else:
            if row[8] != "la":
              key= row[5], row[6]
              mistakes[key] = row[1],row[2],row[3]

        if row[3] == "masculine":
          if row[8] == "el":
            correct +=1
          else:
            if row[8] != "el":
              key= row[5], row[6]
              mistakes[key] = row[1],row[2],row[3]

      else:
        continue

wrong_target_words = []
for m in mistakes.values():
  wrong_target_words.append(m[0])
mistakes_counts = dict(Counter(wrong_target_words))

whole_duration=sum(reaction_time_per_participant)
print("correct",correct)
print("tot responses", correct+len(mistakes))
print("mistakes",len(mistakes))#, ":", mistakes)
print("wrong target words",set(wrong_target_words))
print("wrong target words freq",mistakes_counts)
print("mean reaction time per participant:", sum(reaction_time_per_participant)/len(reaction_time_per_participant))
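from datetime import timedelta  # `datetime` imported earlier is the class, which has no `timedelta` attribute
print("whole study duration:", str(timedelta(seconds=whole_duration)))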

# error analysis
male_animate=0
female_animate=0
male_inanimate=0
female_inanimate=0
for val in mistakes.values():
  if val[2] == 'masculine' and val[1] == 'animate':
      male_animate +=1
  elif val[2] == 'feminine' and val[1] == 'animate':
      female_animate +=1
  elif val[2] == 'masculine' and val[1] == 'inanimate':
      male_inanimate +=1
  elif val[2] == 'feminine' and val[1] == 'inanimate':
      female_inanimate +=1
  else:
    print(val)

print("male animate", male_animate)
print("female animate", female_animate)
print("male inanimate", male_inanimate)
print("female inanimate", female_inanimate)


## To evaluate replication 2 (Lombard_21)
'''correct_oui = 0
correct_neo = 0
fillers = 0
fillers_wrong = 0
mistakes = {}
filler_mistakes = {}
filler_error_list=[]

# Open the CSV file
with open(results, mode='r', encoding='utf-8') as file:
    csv_reader = csv.reader(file)

    # Optional: Get the header if the file has one
    header = next(csv_reader, None)  # Skip the header if present

    # Access and process each line
    for row in csv_reader:
        #print(row)
        neologism = row[0].lower().strip()
        answer = row[8].lower().strip()
        #print(answer)

        if row[2] != "Filler":

          if answer.startswith("oui"):
            correct_oui +=1

            parts = answer.split()
            neo = parts[1] if len(parts) > 1 else ""  # guard against a bare "oui" without a following word
            #print(neologism, neo)
            if neo.startswith(neologism):
              correct_neo +=1

          else:
            key= row[5], row[6]
            mistakes[key] = row[0].strip(),row[2].strip(),row[3].strip(),row[4].strip()

        elif row[2] == "Filler":
          fillers +=1
          if answer.startswith("oui"):
            fillers_wrong +=1
            if row[1] not in filler_mistakes.keys():
              filler_mistakes[row[1]] = [answer]
            else:
              filler_mistakes[row[1]].append(answer)

            filler_error_list.append(row[1])

        else:
          print(row[1], answer)

wrong_fillers_counts= dict(Counter(filler_error_list))
wrong_target_words = []
for m in mistakes.values():
  wrong_target_words.append(m[0])
mistakes_counts = dict(Counter(wrong_target_words))

print("wrong fillers",fillers_wrong)
print(wrong_fillers_counts)


whole_duration=sum(reaction_time_per_participant)
print("tot responses", correct_oui+len(mistakes))
print("correct",correct_oui)
print("fillers", fillers) #40
print("mistakes",len(mistakes), ":", mistakes)
print("wrong target words",set(wrong_target_words))
print("wrong target words freq",mistakes_counts)
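from datetime import timedelta  # `datetime` imported earlier is the class, which has no `timedelta` attribute
print("whole study duration:", str(timedelta(seconds=whole_duration)))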

# error analysis
morph_reg=0
morph_irreg=0
sem_reg=0
sem_irreg=0
for val in mistakes.values():
  #print(val)
  if val[1] == 'Morphological' and val[2] == 'Irregular':
      morph_irreg +=1
  elif val[1] == 'Morphological' and val[2] == 'Regular':
      morph_reg +=1
  elif val[1] == 'Semantic' and val[2] == 'Irregular':
      sem_irreg +=1
  elif val[1] == 'Semantic' and val[2] == 'Regular':
      sem_reg +=1
  else:
    print(val)

print("Morphological irregular", morph_irreg)
print("Morphological regular", morph_reg)
print("Semantic irregular", sem_irreg)
print("Semantic regular", sem_reg)'''
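
The evaluation for replication 1 addresses columns by position (`row[8]`, `row[3]`, ...) and compares answers verbatim, so a response such as `La.` would count as a mistake. The following sketch shows one possible alternative that reads columns by name and normalises the answer first. The normalisation rule (lowercasing, stripping whitespace and surrounding punctuation) is an assumption and not the procedure used in the paper; the helpers `normalise` and `score_critical` are hypothetical, and the inline CSV is made up.

```python
import csv
import io

EXPECTED_ARTICLE = {"feminine": "la", "masculine": "el"}


def normalise(answer):
    """Lowercase the answer and strip whitespace and surrounding punctuation."""
    return answer.strip().strip(".,;:!?'\"").lower()


def score_critical(handle):
    """Count correct and incorrect answers on critical items."""
    correct, wrong = 0, 0
    for row in csv.DictReader(handle):
        if row["status"] != "critical":
            continue  # fillers are not scored
        expected = EXPECTED_ARTICLE.get(row["congruent-gender"])
        if expected is None:
            continue  # unknown gender label: skip the row
        if normalise(row["models_answer"]) == expected:
            correct += 1
        else:
            wrong += 1
    return correct, wrong


# Made-up results file with only the columns needed here
sample = io.StringIO(
    "sentences,congruent-gender,status,models_answer\n"
    "s0,feminine,critical,La.\n"
    "s1,masculine,critical,la\n"
    "s2,masculine,filler,el\n"
)
result = score_critical(sample)
print(result)
```

Reading columns by name also keeps the evaluation working if columns are added to or reordered in the results file.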