# Conditional Acceptability in Large Language Models and Humans – Data Processing

This notebook prepares judgement data from humans and language models for analysis. The human data section is typically run once, while the model section can be re-run for new outputs. Both datasets are processed into a shared dataframe format to enable direct comparison in statistical models.

<br>

| **Aspect** | **Human Judgement Data** | **LLM Judgement Data** |
|------------|---------------------------|-------------------------|
| **Source** | Skovgaard-Olsen et al. (2016), [OSF](https://osf.io/7axdv/files/osfstorage) | Outputs from the prompting step (`output_{tested_model}_{metric}_(no_)context_{style}.jsonl`) |
| **Files Used** | • `dat_finaldw1.csv`: includes P(B\|A) and **Prob(If A, then B)**  <br> • `dat_finaldw2.csv`: includes P(B\|A) and **Acc(If A, then B)** | One `.jsonl` file per configuration (metric/context/style) |
| **Processing** | Rearranged into unified format for easier handling | Regex applied to extract numeric probability or acceptability judgements |


#### Unified format
Both datasets are converted to the same structure. Each row contains:
+ **Identifiers:** `sample_number`, `scenario_number`, `statement_type`, `relation_type`
+ **Judgements:**
    + `c_prob`: P(B|A)
    + `if_prob`: Prob(If A, then B)
    + `if_acc`: Acc(If A, then B)
+ **Instance:** `instance_id`: participant ID (humans) or prompt cycle (LLMs)

> For LLMs, `instance_id` reflects the zero-based prompt repetition index (0–4), since models are tested individually.
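As a rough illustration of the unified format, the sketch below builds a two-row dataframe with the columns listed above. The human row uses made-up values (including a hypothetical participant ID); the model row reuses the first row of the model output shown later in this notebook.

```python
import numpy as np
import pandas as pd

# Column order of the unified format described above.
UNIFIED_COLUMNS = [
    "sample_number", "scenario_number", "statement_type", "relation_type",
    "c_prob", "if_prob", "if_acc", "instance_id",
]

# Illustrative human row: values and participant ID are made up.
# A participant rated either probability or acceptability, so one column stays NaN.
human_row = {
    "sample_number": 1, "scenario_number": 1,
    "statement_type": "HH", "relation_type": "POS",
    "c_prob": 90, "if_prob": 85, "if_acc": np.nan,
    "instance_id": 17,
}

# Model row: all three judgements are available for each prompt repetition.
model_row = {
    "sample_number": 1, "scenario_number": 1,
    "statement_type": "HH", "relation_type": "POS",
    "c_prob": 99, "if_prob": 99, "if_acc": 98,
    "instance_id": 0,
}

example_df = pd.DataFrame([human_row, model_row], columns=UNIFIED_COLUMNS)
print(example_df)
```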

## Setup

We begin by importing all necessary packages for loading data, extracting scores, and handling dataframes.

In [6]:
import json
import re
import numpy as np
import pandas as pd

## Load Scenario Metadata and Initialise Reference Dataframe

This function loads metadata about the statements, such as statement type and relation type, from `statements.jsonl` and initialises the base dataframe.

The resulting dataframe serves as a reference structure for building both the human and model datasets.

In [7]:
def initialize_base_dataframe():
    
    with open("statements.jsonl", "r", encoding="utf8") as file:
        data = [json.loads(line) for line in file]

    # Extract the relevant fields into a list of dicts
    extracted = [
        {
            "sample_number": item["sample_number"],
            "scenario_number": item["scenario_number"],
            "statement_type": item["statement_type"],
            "relation_type": item["relation_type"]
        }
        for item in data
    ]

    return pd.DataFrame(extracted)
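A more compact way to get the same reference structure might be `pd.read_json` with `lines=True`. The sketch below is only an illustration under assumptions: the two JSON lines are made up, and the extra `statement` field is hypothetical (the real file may carry different additional fields).

```python
import io
import pandas as pd

REFERENCE_COLUMNS = ["sample_number", "scenario_number", "statement_type", "relation_type"]

def read_reference(source):
    """Read a JSON Lines source and keep only the reference columns."""
    return pd.read_json(source, lines=True)[REFERENCE_COLUMNS]

# Two made-up lines standing in for statements.jsonl.
sample = io.StringIO(
    '{"sample_number": 1, "scenario_number": 1, "statement_type": "HH", '
    '"relation_type": "POS", "statement": "If A, then B"}\n'
    '{"sample_number": 2, "scenario_number": 1, "statement_type": "HL", '
    '"relation_type": "POS", "statement": "If A, then B"}\n'
)

reference_example = read_reference(sample)
print(reference_example)
```

For the real file, `read_reference("statements.jsonl")` would be the equivalent call.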

## Human Data *(run once)*

+ **Reads:** `dat_finaldw1.csv`, `dat_finaldw2.csv`
+ **Produces:** `human_df`, `dataframe_human.csv`

This section processes human judgement data from Skovgaard-Olsen et al. (2016) into a structured format for comparison with LLM responses.

##### Purpose

Participants in the original study rated conditional statements ("If A, then B") for either **probability** or **acceptability** across various scenarios. This section extracts and formats those ratings for analysis alongside model outputs.

##### Raw Data Summary

- The original dataset (in R format) contains additional material not used here
- We extract only the relevant data and structure it to match the LLM format, enabling consistent analysis

---

### Load and Format Human Judgement Data

This function loads probability and acceptability judgements from the original study’s CSV files. Each entry is matched to its corresponding scenario in the reference dataframe (initialised from `statements.jsonl`) based on scenario number, statement type, and relation type.

The resulting dataframe is structured identically to the LLM data, ensuring compatibility for subsequent analysis.


In [None]:
def load_human_judgement_data():
    reference_df = initialize_base_dataframe()
    all_rows = []

    # Source files and associated judgement type
    judgement_sources = [
        ("dat_finaldw1.csv", "prob"),  # Probability judgements
        ("dat_finaldw2.csv", "acc")    # Acceptability judgements
    ]

    for csv_file, metric in judgement_sources:
        df = pd.read_csv(csv_file, index_col=0)

        for _, row in df.iterrows():
            scenario_number = row["le_nr"]
            statement_type = row["type"][-2:]  # Last two characters
            relation_type = row["rel_cond"]
            c_prob = row.get("CgivenA", None)
            instance_id = row.get("lfdn", None)

            # Extract the relevant judgement based on the source
            judgement = row.get("P", None) if metric == "prob" else row.get("ACC", None)

            # Match against reference rows
            mask = (
                (reference_df["scenario_number"] == scenario_number) &
                (reference_df["statement_type"] == statement_type) &
                (reference_df["relation_type"].str[:2] == relation_type)
            )

            for _, matched_row in reference_df[mask].iterrows():
                entry = matched_row.to_dict()
                entry["c_prob"] = c_prob
                entry["if_prob"] = judgement if metric == "prob" else np.nan
                entry["if_acc"] = judgement if metric == "acc" else np.nan
                entry["instance_id"] = instance_id
                all_rows.append(entry)

    return pd.DataFrame(all_rows)

#load_human_judgement_data()
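The row-by-row matching above can also be expressed as a single `merge`, which tends to be faster on larger files. This is a minimal sketch on made-up data, not a replacement for the function: the raw column names follow the ones read above, while the `type` values and the helper column `rel_prefix` are assumptions for illustration.

```python
import pandas as pd

# Small stand-in for the reference dataframe.
reference = pd.DataFrame({
    "sample_number": [1, 2],
    "scenario_number": [1, 1],
    "statement_type": ["HH", "HH"],
    "relation_type": ["POS", "IRR"],
})

# Made-up raw rows; the exact format of "type" in the real CSV is an assumption.
raw = pd.DataFrame({
    "le_nr": [1, 1],
    "type": ["cond_HH", "cond_HH"],
    "rel_cond": ["PO", "IR"],
    "CgivenA": [80, 20],
    "P": [75, 10],
    "lfdn": [101, 102],
})

# Derive the three matching keys on both sides.
keys = raw.assign(
    scenario_number=raw["le_nr"],
    statement_type=raw["type"].str[-2:],   # last two characters
    rel_prefix=raw["rel_cond"],
)
ref = reference.assign(rel_prefix=reference["relation_type"].str[:2])

merged = ref.merge(keys, on=["scenario_number", "statement_type", "rel_prefix"], how="inner")

human_example = merged.rename(
    columns={"CgivenA": "c_prob", "P": "if_prob", "lfdn": "instance_id"}
)[["sample_number", "scenario_number", "statement_type", "relation_type",
   "c_prob", "if_prob", "instance_id"]]
print(human_example)
```

Note that the merged result is ordered by the reference rows, whereas the loop above appends rows in the order of the raw CSV.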

### Execution

Load and process the human judgement data, then save the dataframe to a `.csv` file.

In [None]:
human_df = load_human_judgement_data()

human_df.to_csv("dataframe_human.csv", encoding="utf-8", index=False, sep=";")
print("Successfully saved dataframe_human.csv.")

Successfully saved dataframe_human.csv.


## Model Data *(re-runnable)*

+ **Reads:** `output_{tested_model}_{metric}_(no_)context_{style}.jsonl`
+ **Produces:** `model_df`, `dataframe_{tested_model}_(no_)context_{style}.csv`

This section processes LLM judgement data from the prompting step into a structured format for comparison with human responses.

##### Purpose

The language model was prompted to provide a numeric rating from 0–100 in response to one of the following:

+ A conditional probability query (*Assume A. How probable is B?*)

+ A conditional statement judgement (*How probable/acceptable is “If A, then B”?*)

These outputs are parsed and formatted for analysis.

##### Raw Data Summary

- Model responses are expected to include a numeric score (0–100), typically at the beginning of the response.
- Each statement was given to the model five times. The zero-based prompt cycle index (0–4) serves as an ID, mirroring the participant ID in the human data.
- Each `.jsonl` file represents a specific configuration (combination of metric, context, and prompt style) for one model

---

### Configuration

**Changeable Parameters:**
- `context`: Whether context is included in the prompt (True/False)
- `style`: The prompt style used ("vanilla", "fewshot", "cot")
- `tested_model`: Name of the model being evaluated (used for filenames)

**Fixed Settings:**
- `metrics`: List of evaluation metrics (`"c_prob"`, `"if_prob"`, `"if_acc"`)
- `pattern`: Regex pattern to extract numeric scores from the model output
- `output_file`: Output file for the processed dataframe, named after the model configuration

**Purpose:**
These parameters allow for flexible evaluation across different configurations, metrics, and models.

In [8]:
# Changeable:
context = True      # Options: True = context included, False = context excluded
style = "cot"   # Options: "vanilla", "fewshot", "cot"

tested_model = "llama70b"     # Name of the evaluated model (used for filenames) # Options: llama3, qwen2, llama70b, qwen72b


# Fixed:
# All metrics that we need to iterate over
metrics = ["c_prob", "if_prob", "if_acc"]

# Regex pattern to extract the score from the model output: digits following "answer:" or "assistant",
# with optional whitespace and an optional ":" or "-" in between (applied case-insensitively)
pattern = r"(?:answer:|assistant)\s*[:\-]?\s*(\d+)"

# File to store evaluation results
output_file = f"dataframe_{tested_model}_context_{style}.csv" if context else f"dataframe_{tested_model}_no_context_{style}.csv"
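To see what the pattern does and does not capture, it can be tried on a few strings. The responses below are made up for illustration and are not real model output; `find_scores` is a hypothetical helper name.

```python
import re

# Same pattern as in the configuration cell.
pattern = r"(?:answer:|assistant)\s*[:\-]?\s*(\d+)"

def find_scores(text):
    """Return every number the pattern captures, as strings."""
    return re.findall(pattern, text, flags=re.IGNORECASE)

# Made-up responses.
samples = [
    "Answer: 85",          # the colon after "answer" is required
    "assistant\n\n42",     # whitespace and newlines before the number are skipped
    "The answer is 90",    # no colon after "answer", so nothing is captured
    "Answer: 85.5",        # only the integer part is captured
]
for text in samples:
    print(repr(text), "->", find_scores(text))
```

Two things worth noticing: "answer" without a colon is not matched, and decimal answers are truncated to their integer part.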

### Extract Scores from Model Outputs

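This function reads the `.jsonl` file of model responses for one metric and extracts a numeric score (0–100) from each response using the regex `pattern`. A response is expected to yield either one match (vanilla prompts) or four matches (few-shot and CoT prompts); the last match is taken as the score. If there is no match, an unexpected number of matches, or a score outside 0–100, a warning is printed and the judgement is recorded as `NaN`.

Each line in the file corresponds to one sample, and each sample has 5 responses. The function returns a dataframe with `sample_number` (the zero-based line position in the file), `instance_id` (0–4), `metric`, and `judgement`.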


In [9]:
def extract_scores(metric): 
    filename = (
        f"output_{tested_model}_{metric}_context_{style}.jsonl"
        if context
        else f"output_{tested_model}_{metric}_no_context_{style}.jsonl"
    )

    rows = []

    with open(filename, "r", encoding="utf8") as file:
        lines = [json.loads(line) for line in file]
        outputs = [entry["output"] for entry in lines]

    for sample_number, output_list in enumerate(outputs):
        for instance_id, response in enumerate(output_list):
            # Try to match numeric score using regex
            match = re.findall(pattern, response, flags=re.IGNORECASE)

            judgement = np.nan # default value

            if match:

                if len(match) in {1, 4}: # we expect either 1 matched number (vanilla) or 4 matched numbers (few-shot, CoT)
                    try:
                        score = int(match[-1])      # Use the last matched number
                        if 0 <= score <= 100:
                            judgement = score
                        else:
                            print(f"Invalid score in sample {sample_number}, instance {instance_id}: {response[-100:]}")
                    except ValueError:
                        print(f"Could not convert to int in sample {sample_number}, instance {instance_id}: {response[-100:]}")
                else:
                    print(f"Unexpected number of matches ({len(match)}) in sample {sample_number}, instance {instance_id}: {response[-100:]}")
            else:
                print(f"No match in sample {sample_number}, instance {instance_id}: {response[-100:]}")



            rows.append({
                "sample_number": sample_number,
                "instance_id": instance_id,
                "metric": metric,
                "judgement": judgement
            })

    return pd.DataFrame(rows)
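The acceptance rules inside the loop (one or four matches, last match wins, score within 0–100) could be isolated in a small pure function so they can be checked without any files. This is a sketch under that assumption; `parse_score` and `expected_counts` are hypothetical names, and the function omits the warning messages.

```python
import re
import numpy as np

PATTERN = r"(?:answer:|assistant)\s*[:\-]?\s*(\d+)"

def parse_score(response, expected_counts=(1, 4)):
    """Return the last matched score if the response passes all checks, else NaN."""
    matches = re.findall(PATTERN, response, flags=re.IGNORECASE)
    if len(matches) not in expected_counts:
        return np.nan                      # no match or unexpected number of matches
    score = int(matches[-1])               # use the last matched number
    return score if 0 <= score <= 100 else np.nan

# Made-up responses for illustration.
print(parse_score("Answer: 85"))
print(parse_score("Answer: 150"))                       # out of range
print(parse_score("Answer: 10 ... Answer: 20"))         # two matches are rejected
```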

### Combine Model Scores Across Metrics

This function iterates over all metrics (`c_prob`, `if_prob`, `if_acc`) and uses `extract_scores()` to extract model judgements from `.jsonl` files.

Each judgement is mapped back onto the reference structure (`sample_number`, `scenario_number`, etc.), and the scores are inserted into the appropriate field. The resulting dataframe has the same format as the human data, which simplifies later comparisons.

In [None]:
def load_model_judgement_data():
    reference_df = initialize_base_dataframe()
    combined_rows = {}

    for metric in metrics:
        judgement_df = extract_scores(metric)

        for _, row in judgement_df.iterrows():
            sample_number = row["sample_number"]
            instance_id = row["instance_id"]
            judgement = row["judgement"]

            key = sample_number, instance_id

            if key not in combined_rows:
                # Start from base row copy
                base_row = reference_df.iloc[sample_number].copy()
                # Initialize all metrics to NaN
                base_row["c_prob"] = np.nan
                base_row["if_prob"] = np.nan
                base_row["if_acc"] = np.nan
                base_row["instance_id"] = instance_id
                combined_rows[key] = base_row

            # Assign judgement to correct metric column
            combined_rows[key][metric] = judgement

    return pd.DataFrame(list(combined_rows.values()))


load_model_judgement_data()

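| | sample_number | scenario_number | statement_type | relation_type | c_prob | if_prob | if_acc | instance_id |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | HH | POS | 99 | 99 | 98 | 0 |
| 0 | 1 | 1 | HH | POS | 98 | 98 | 100 | 1 |
| 0 | 1 | 1 | HH | POS | 98 | 98 | 99 | 2 |
| 0 | 1 | 1 | HH | POS | 99 | 98 | 99 | 3 |
| 0 | 1 | 1 | HH | POS | 98 | 99 | 100 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 143 | 144 | 12 | LH | IRR | 50 | 0 | 0 | 0 |
| 143 | 144 | 12 | LH | IRR | 42 | 4 | 0 | 1 |
| 143 | 144 | 12 | LH | IRR | 50 | 4 | 0 | 2 |
| 143 | 144 | 12 | LH | IRR | 50 | 0 | 0 | 3 |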
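The combining step is essentially a long-to-wide reshape: one row per (`sample_number`, `instance_id`) and one column per metric. As an alternative sketch, the same reshape can be done with `DataFrame.pivot` on the concatenated outputs of `extract_scores()`. The long-format data below is made up, and joining the result back onto the reference dataframe is left out.

```python
import numpy as np
import pandas as pd

# Made-up long-format judgements, shaped like the output of extract_scores().
long_df = pd.DataFrame({
    "sample_number": [0, 0, 0, 0, 0, 0],
    "instance_id":   [0, 1, 0, 1, 0, 1],
    "metric": ["c_prob", "c_prob", "if_prob", "if_prob", "if_acc", "if_acc"],
    "judgement": [99, 98, 99, 98, 98, np.nan],   # NaN marks a failed extraction
})

# One row per (sample_number, instance_id), one column per metric.
wide = (
    long_df.pivot(index=["sample_number", "instance_id"], columns="metric", values="judgement")
    .reindex(columns=["c_prob", "if_prob", "if_acc"])   # fix the column order
    .reset_index()
)
wide.columns.name = None   # drop the leftover "metric" axis label
print(wide)
```

Unlike the dictionary-based loop, `pivot` raises an error if the same (`sample_number`, `instance_id`, `metric`) combination appears twice, which can double as a sanity check.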


### Execution

Load and process the model judgement data, then save the dataframe to a `.csv` file.

In [11]:
#execution cell
model_df = load_model_judgement_data()

model_df.to_csv(output_file, encoding='utf-8', index=False, sep = ";")
print (f"Successfully saved {output_file}.")

Successfully saved dataframe_llama70b_context_cot.csv.
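Because both files share the same columns and the `;` separator, they can later be read back and stacked for joint analysis. The sketch below uses in-memory stand-ins for the two CSV files with made-up human values; the `source` column is a hypothetical addition for telling the two datasets apart.

```python
import io
import pandas as pd

header = "sample_number;scenario_number;statement_type;relation_type;c_prob;if_prob;if_acc;instance_id\n"

# In-memory stand-ins for dataframe_human.csv and a model CSV.
human_csv = io.StringIO(header + "1;1;HH;POS;90;85;;17\n")
model_csv = io.StringIO(header + "1;1;HH;POS;99;99;98;0\n")

# The files are written with sep=";", so the same separator is needed when reading.
human = pd.read_csv(human_csv, sep=";").assign(source="human")
model = pd.read_csv(model_csv, sep=";").assign(source="llama70b")

combined = pd.concat([human, model], ignore_index=True)
print(combined)
```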
