# Conditional Acceptability in Large Language Models and Humans - Post-Hoc Analyses

This notebook contains two additional analyses:
* rating consistency, measured by the intraclass correlation
* correlation between our directly elicited ratings and the models' sentence probability and perplexity

### Imports

In [2]:
import pandas as pd
import json
import pingouin as pg

## Rating Consistency

Because LLM inference is non-deterministic, the consistency of numerical ratings is an important aspect to consider. To account for potential instabilities in model ratings, we run repeated trials, prompting each model five times per data sample.

We then quantify rating consistency by computing the intraclass correlation (ICC) across the five model scores per sample, separately for each score type and model.
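
To make the statistic concrete, here is a minimal sketch of the single-rater, two-way random-effects ICC, which is the `ICC2` row read from pingouin's output in this notebook. The helper name `icc2_single` and the toy matrix are illustrative assumptions, not part of the study code or data. In the analysis itself, the five repeated runs per sample play the role of the raters.

```python
import numpy as np

def icc2_single(ratings):
    """ICC(2,1) for a complete (n_targets, k_raters) matrix of scores."""
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    # Mean squares of a two-way ANOVA without replication
    ms_targets = k * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    ms_raters = n * ((x.mean(axis=0) - grand) ** 2).sum() / (k - 1)
    ss_error = ((x - grand) ** 2).sum() - ms_targets * (n - 1) - ms_raters * (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    # Absolute agreement: rater variance counts against consistency
    return (ms_targets - ms_error) / (
        ms_targets + (k - 1) * ms_error + k * (ms_raters - ms_error) / n
    )

# Toy data (hypothetical, not from the study): 6 targets rated by 4 raters
toy = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc_toy = icc2_single(toy)
print(round(icc_toy, 2))  # 0.29
```

The toy raters rank the targets similarly but use very different score levels, which an absolute-agreement ICC penalizes. Values close to 1 mean that repeated runs return nearly identical scores for the same sample.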

In [None]:
models = ["llama3", "llama70b", "qwen2", "qwen72b"]
metrics = ["vanilla", "fewshot", "cot"]

collections = {}

for metric in metrics:
    for model in models:
    
        if metric == "cot" and model in ["llama3", "qwen2"]:
            continue

        df = pd.read_csv(f"dataframe_{model}_context_{metric}.csv", sep=";")

        icc_overall = {}

        # Keep only samples rated more than once (expected, but a cheap safeguard).
        # The filter does not depend on the score type, so it is computed once.
        counts = df.groupby("sample_number")["instance_id"].count()
        valid_samples = counts[counts > 1].index
        df_valid = df[df["sample_number"].isin(valid_samples)]

        for var in ["c_prob", "if_prob", "if_acc"]:
            # Reshape the dataframe into the long format pingouin expects
            long = df_valid[[var, "instance_id", "sample_number"]].rename(
                columns={"instance_id": "rater", "sample_number": "target", var: "score"}
            )

            icc = pg.intraclass_corr(data=long, targets="target", raters="rater", ratings="score")

            # ICC2: two-way random effects, absolute agreement, single rater
            icc_overall[var] = icc.loc[icc["Type"] == "ICC2", "ICC"].values[0]

        collections[f"{model} ({metric})"] = icc_overall
print(pd.DataFrame(collections).transpose().round(4))


                    c_prob  if_prob  if_acc
llama3 (vanilla)    0.7685   0.7724  0.7529
llama70b (vanilla)  0.9468   0.9344  0.9201
qwen2 (vanilla)     0.9768   0.9716  0.9450
qwen72b (vanilla)   0.9693   0.9673  0.9784
llama3 (fewshot)    0.7930   0.7538  0.7596
llama70b (fewshot)  0.9438   0.9556  0.8820
qwen2 (fewshot)     0.9542   0.9556  0.9596
qwen72b (fewshot)   0.9559   0.9437  0.9768
llama70b (cot)      0.9014   0.8497  0.8757
qwen72b (cot)       0.9331   0.9127  0.9661


## Sentence Probability and Perplexity Correlation


In our study, we retrieve LLM probability and acceptability judgments of conditional statements through direct elicitation. Judgments could also be elicited through sentence probabilities (the probability a model assigns to a given string).

Both direct elicitation and sentence probability are commonly used approaches to assess the probability that a model assigns to a given statement. Each has advantages and disadvantages, and the two differ in important ways:
* **Sentence probability**:
    * measures the probability a model assigns to a given string
    * captures string form rather than semantics
    * does not necessarily reflect how likely the model assesses the given statement to be true
    * is highly sensitive to paraphrasing and tokenization, and exhibits a strong length bias
* **Direct elicitation**:
    * can be used to directly ask the model for the probability that a given statement is true
    * for instruct/chat models in particular, has been shown to be better calibrated than token likelihoods, as it leverages the model’s internal latent knowledge and reasoning


Nonetheless, this notebook includes an additional analysis in which we compare sentence probabilities for both the conditional probability $P(B \mid A)$ and the if-probability $P(\text{If } A \text{, then } B)$ with our directly elicited ratings. \
Because sentence probability depends strongly on sentence length, we additionally compute the model’s perplexity over the statements. To study how well these probability-based judgments correlate with the model’s judgments obtained via direct elicitation, we compute both Pearson and Spearman correlations for each metric.
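
The length dependence can be illustrated with a minimal sketch. It assumes that sentence probability is the product of per-token probabilities and that perplexity is the exponentiated mean negative log-likelihood per token. The exact definitions used to produce the JSONL files are not shown in this notebook, and the function name and log-probabilities below are hypothetical.

```python
import math

def sentence_prob_and_perplexity(token_logprobs):
    """Return (sentence probability, perplexity) from natural-log token probabilities."""
    total = sum(token_logprobs)
    prob = math.exp(total)  # product of the token probabilities
    perplexity = math.exp(-total / len(token_logprobs))  # exp of mean negative log-likelihood
    return prob, perplexity

# Hypothetical log-probabilities: every token has p = 0.5,
# but the second sentence is twice as long as the first.
short = [math.log(0.5)] * 4
long_ = [math.log(0.5)] * 8

p_short, ppl_short = sentence_prob_and_perplexity(short)
p_long, ppl_long = sentence_prob_and_perplexity(long_)

print(p_short, p_long)      # the probability shrinks with length
print(ppl_short, ppl_long)  # the perplexity is identical
```

Under these assumptions, the longer sentence receives a much smaller probability even though the model is equally confident about every token, whereas perplexity normalizes by length. Note that lower perplexity corresponds to higher likelihood.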


In [None]:
def compute_correlations(csv_filename, jsonl_filename, dir_varname, metric_varname):
    """Correlate directly elicited ratings with sentence probability and perplexity."""

    # Average the repeated direct-elicitation ratings per sample
    dir_df = pd.read_csv(csv_filename, sep=";")
    dir_df = dir_df.groupby("sample_number")[[dir_varname]].mean().reset_index()

    # Read one probability/perplexity pair per JSONL line
    collection = []
    with open(jsonl_filename) as file:
        for line in file:
            data = json.loads(line)
            unnest = data[metric_varname]
            collection.append({
                "prob": unnest["prob"],
                "perplexity": unnest["perplexity"],
            })

    new_df = pd.DataFrame(collection)

    # Rows are paired by position: the JSONL lines must follow the same
    # ascending sample_number order that groupby produces.
    df = pd.DataFrame({
        "dir_elicit": dir_df[dir_varname].values,
        "model_prob": new_df["prob"].values,
        "perplexity": new_df["perplexity"].values,
    })

    return {
        "sent_prob_pearson": df["dir_elicit"].corr(df["model_prob"], method="pearson"),
        "sent_prob_spearman": df["dir_elicit"].corr(df["model_prob"], method="spearman"),
        "perplex_pearson": df["dir_elicit"].corr(df["perplexity"], method="pearson"),
        "perplex_spearman": df["dir_elicit"].corr(df["perplexity"], method="spearman"),
    }

In [None]:
configs = [
    ("Llama 8B cond. prob.", "dataframe_llama3_context_vanilla.csv",
     "llama3_8B_sentence_probs.jsonl", "c_prob", "sentence_conditional_probability"),

    ("Llama 8B if-prob.", "dataframe_llama3_context_vanilla.csv",
     "llama3_8B_sentence_probs.jsonl", "if_prob", "sentence_if_probability"),

    ("Llama 70B cond. prob.", "dataframe_llama70b_context_vanilla.csv",
     "llama31_70B_sentence_probs.jsonl", "c_prob", "sentence_conditional_probability"),

    ("Llama 70B if-prob.", "dataframe_llama70b_context_vanilla.csv",
     "llama31_70B_sentence_probs.jsonl", "if_prob", "sentence_if_probability"),

    ("Qwen 7B cond. prob.", "dataframe_qwen2_context_vanilla.csv",
     "qwen25_7B_sentence_probs.jsonl", "c_prob", "sentence_conditional_probability"),

    ("Qwen 7B if-prob.", "dataframe_qwen2_context_vanilla.csv",
     "qwen25_7B_sentence_probs.jsonl", "if_prob", "sentence_if_probability"),

    ("Qwen 72B cond. prob.", "dataframe_qwen72b_context_vanilla.csv",
     "qwen25_72B_sentence_probs.jsonl", "c_prob", "sentence_conditional_probability"),

    ("Qwen 72B if-prob.", "dataframe_qwen72b_context_vanilla.csv",
     "qwen25_72B_sentence_probs.jsonl", "if_prob", "sentence_if_probability"),
]

In [11]:
rows = []

for label, csv_fn, jsonl_fn, dir_var, metric_var in configs:
    corr = compute_correlations(
        csv_fn, jsonl_fn, dir_var, metric_var
    )
    corr["Model"] = label
    rows.append(corr)

df_table = pd.DataFrame(rows).set_index("Model")


In [None]:
df_table = df_table[[
    "sent_prob_pearson",
    "sent_prob_spearman",
    "perplex_pearson",
    "perplex_spearman"
]]

df_table.columns = pd.MultiIndex.from_product(
    [["Sentence Probability", "Perplexity"],
     ["Pearson", "Spearman"]]
)

print(df_table.round(2))

                      Sentence Probability          Perplexity         
                                   Pearson Spearman    Pearson Spearman
Model                                                                  
Llama 8B cond. prob.                  0.15     0.25      -0.28    -0.32
Llama 8B if-prob.                     0.19     0.34      -0.32    -0.48
Llama 70B cond. prob.                 0.16     0.26      -0.28    -0.38
Llama 70B if-prob.                    0.11     0.31      -0.23    -0.31
Qwen 7B cond. prob.                   0.08     0.22      -0.17    -0.27
Qwen 7B if-prob.                      0.07     0.20      -0.08    -0.27
Qwen 72B cond. prob.                  0.16     0.33      -0.21    -0.36
Qwen 72B if-prob.                     0.14     0.19      -0.19    -0.31
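
Pearson and Spearman can differ for the same pair of variables because Pearson measures linear association, while Spearman measures monotonic association and is equivalent to Pearson computed on ranks. The following is a minimal sketch with made-up numbers, not study data. Ranking first and then calling `corr` is an illustrative choice that should match what `method="spearman"` computes.

```python
import pandas as pd

# Made-up data: y increases monotonically, but not linearly, with x.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = x ** 4

pearson = x.corr(y, method="pearson")
# Spearman correlation expressed as Pearson correlation of the ranks
spearman = x.rank().corr(y.rank(), method="pearson")

print(round(pearson, 3), round(spearman, 3))
```

Here the rank-based coefficient is exactly 1 because the relationship is perfectly monotonic, while the linear coefficient is lower.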
