# Data Wrangling
Starting point:
- 848 datasets (18 tasks per model, minus the NON_IDEAL_OUTPUTS) containing the log-probs of all answer options of each subtask for all ~1,500 participants.

What this script does:
- read all survey datasets
- normalise the log-prob scores into probabilities that sum to 1 over all answer options of an item
- extract the probability that the LLM assigned to the answer option the participant actually chose (for each item)
- flip back answers where the scale was flipped to cope with prompt-sensitivity issues
- weight the human answers by these probabilities and divide by the sum of the probabilities for that item, giving one number per item per model (within the range of the answer options, though not necessarily an exact option value)
- second weighting strategy: the model's score per item is weighted using only the top n highest probabilities the model assigned
- for some scales: add subcategories (each item belongs to a subcategory)
- for some tasks (still unsure): reverse scores back for items that are asked on a reversed scale due to the nature of the task
- concatenate the item-wise scores of each task per model into one final `all_data` data frame and save it
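
The first steps of the list above can be sketched on toy data. This is only an illustration under assumptions: the toy frame and the `chosen_prob` column are made up here, and the selection step is a hypothetical stand-in for what `filter_pred_prob` from `utils` is described as doing (its implementation is not shown in this notebook).

```python
import numpy as np
import pandas as pd

# Toy data: two participants answer one item with the options "1" and "2".
# The columns "1" and "2" hold the model's log-probabilities per option.
toy = pd.DataFrame({
    "item": [1, 1],
    "human_number": [1, 2],      # answer each participant actually chose
    "1": np.log([0.6, 0.1]),
    "2": np.log([0.2, 0.3]),
})
options = ["1", "2"]

# Step 1: renormalise so that the option probabilities sum to 1 per row.
exp_vals = np.exp(toy[options])
probs = exp_vals.div(exp_vals.sum(axis=1), axis=0)

# Step 2: select the probability of the option the participant chose
# (assumed behaviour; `chosen_prob` is a hypothetical column name).
col_idx = probs.columns.get_indexer(toy["human_number"].astype(str))
toy["chosen_prob"] = probs.to_numpy()[np.arange(len(toy)), col_idx]

# Step 3: probability-weighted mean of the human answers for this item.
score = (toy["human_number"] * toy["chosen_prob"]).sum() / toy["chosen_prob"].sum()
```

Because the score is a weighted mean of the human answers, it always lies between the smallest and largest answer given for that item.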

Specials:
- for every scale, I compared Frey's materials `quest_raw.csv` and `quest_proc.csv`, tried to trace how they got from one to the other, and then applied the same transformations
- therefore, for AUDIT, FTND and some others, the scores are mapped onto a different scale (a point system depending on the given answer) at the `human_number` level (before they are combined with the LLM-assigned probabilities)
- ! for other scales, the scores might have to be transferred into another point system as well!
- ! I think some SSSV items must be reversed, but the results are awful when this is done!?

Goal:
- first, have one value per item per model
- then transform those values into "outcomes" for each subscale (as Frey did)
- end up with 36 values per model (one per (sub)scale)
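
As a rough sketch of the second goal, item-level scores can be collapsed into one value per model and subscale. The aggregation used here (a plain mean over items) is an assumption for illustration only; how Frey's outcomes are actually computed is not shown in this notebook, and the toy values are invented.

```python
import pandas as pd

# Toy item-level scores in the same layout as the per-task score frames.
item_scores = pd.DataFrame({
    "experiment": ["BARRAT"] * 4,
    "model": ["model_a"] * 4,
    "item": [1, 2, 3, 7],
    "score": [2.0, 3.0, 1.0, 4.0],
    "category": ["BISn", "BISm", "BISm", "BISn"],
})

# One value per (experiment, model, subscale); the mean is a placeholder
# for whatever outcome definition the scale actually requires.
subscale_scores = (
    item_scores
    .groupby(["experiment", "model", "category"], as_index=False)["score"]
    .mean()
)
```

Note that `groupby` drops rows with a missing `category` by default, so tasks without subcategories would need their own grouping (or `dropna=False`).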

## Packages & Helpers

In [1]:
# packages
import pandas as pd
import numpy as np
import glob
import os
import matplotlib.pyplot as plt
from utils import load_dataframes, filter_pred_prob

In [2]:
# ----------------------------------------------------------------------------------------------
# Compute numerator and denominator per model × item
# ----------------------------------------------------------------------------------------------
# - numerator   = sum of (answer * probability)
# - denominator = sum of (probabilities)

def compute_weighted_score(group):
    numerator = (group["human_number"] * group["prob_pred"]).sum()
    denominator = group["prob_pred"].sum()
    return numerator / denominator if denominator > 0 else None


# Function for the top-n weighted score
def compute_top_n_weighted_score(group, n=100):
    # sort the rows by probability in descending order and keep the first n
    top_n = group.sort_values("prob_pred", ascending=False).head(n)
    # numerator = sum of (answer * probability)
    numerator = (top_n["human_number"] * top_n["prob_pred"]).sum()
    # denominator = sum of the probabilities of the top n rows
    denominator = top_n["prob_pred"].sum()
    return numerator / denominator if denominator > 0 else None




# produce df with one value per model per item --------------------------------------------------
# more compact version (that runs faster)
def get_LLM_value_per_item(data):
    grouped = data.groupby(["experiment", "model", "item"])
    score = (grouped["human_number"].apply(lambda x: (x * data.loc[x.index, "prob_pred"]).sum())
             / grouped["prob_pred"].sum())
    return score.reset_index(name="score")

# produce df with one value per model per item for top n version --------------------------------
def get_LLM_value_per_item_top_n(data):
    new_df = (
        data.groupby(["experiment", "model", "item"])[["human_number", "prob_pred"]]
        .apply(compute_top_n_weighted_score)
        .reset_index(name="score_top_n")
    )
    return new_df
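
To see how the top-n weighting differs from the full weighting, here is a small self-contained sketch on made-up numbers. `top_n_weighted_score` is a hypothetical compact variant of the helper above that uses `nlargest` instead of sorting; it may differ from the original in how ties are ordered.

```python
import pandas as pd

def top_n_weighted_score(group, n):
    """Weighted mean of human answers over the n rows with the highest prob_pred."""
    top = group.nlargest(n, "prob_pred")
    denom = top["prob_pred"].sum()
    return (top["human_number"] * top["prob_pred"]).sum() / denom if denom > 0 else float("nan")

# One model × item group with four participants (toy values).
group = pd.DataFrame({
    "human_number": [1, 2, 3, 4],
    "prob_pred": [0.1, 0.4, 0.3, 0.2],
})

full_score = top_n_weighted_score(group, n=len(group))  # uses all rows
top2_score = top_n_weighted_score(group, n=2)           # uses only the two most likely rows
```

With the default `n=100`, any group with at most 100 rows yields the same value for both strategies; the two only diverge for larger groups.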





## AUDIT SCALE

In [3]:
# load data
AUDIT_data = load_dataframes(task_name="AUDIT")

Merged DataFrame shape: (712264, 11)
Total models: 46


In [4]:
# normalise so that the answer options sum to one (we act as if we had a very good prompt, under which the LLM would only choose among the possible answer options; the renormalisation simulates this)
mask = (AUDIT_data["item"] == 1)
AUDIT_data.loc[mask, "prob_1"] = np.exp(AUDIT_data.loc[mask, "1"])/(np.exp(AUDIT_data.loc[mask, "1"]) + np.exp(AUDIT_data.loc[mask, "2"]))
AUDIT_data.loc[mask, "prob_2"] = np.exp(AUDIT_data.loc[mask, "2"])/(np.exp(AUDIT_data.loc[mask, "1"]) + np.exp(AUDIT_data.loc[mask, "2"]))

mask = (AUDIT_data["item"].isin([10, 11]))
AUDIT_data.loc[mask, "prob_1"] = np.exp(AUDIT_data.loc[mask, "1"])/(np.exp(AUDIT_data.loc[mask, "1"]) + np.exp(AUDIT_data.loc[mask, "2"]) + np.exp(AUDIT_data.loc[mask, "3"]))
AUDIT_data.loc[mask, "prob_2"] = np.exp(AUDIT_data.loc[mask, "2"])/(np.exp(AUDIT_data.loc[mask, "1"]) + np.exp(AUDIT_data.loc[mask, "2"]) + np.exp(AUDIT_data.loc[mask, "3"]))
AUDIT_data.loc[mask, "prob_3"] = np.exp(AUDIT_data.loc[mask, "3"])/(np.exp(AUDIT_data.loc[mask, "1"]) + np.exp(AUDIT_data.loc[mask, "2"]) + np.exp(AUDIT_data.loc[mask, "3"]))


mask = (AUDIT_data["item"].isin([2, 3, 4, 5, 6, 7, 8, 9]))
AUDIT_data.loc[mask, "prob_1"] = np.exp(AUDIT_data.loc[mask, "1"])/(np.exp(AUDIT_data.loc[mask, "1"]) + np.exp(AUDIT_data.loc[mask, "2"]) + np.exp(AUDIT_data.loc[mask, "3"]) + np.exp(AUDIT_data.loc[mask, "4"]) + np.exp(AUDIT_data.loc[mask, "5"]))
AUDIT_data.loc[mask, "prob_2"] = np.exp(AUDIT_data.loc[mask, "2"])/(np.exp(AUDIT_data.loc[mask, "1"]) + np.exp(AUDIT_data.loc[mask, "2"]) + np.exp(AUDIT_data.loc[mask, "3"]) + np.exp(AUDIT_data.loc[mask, "4"]) + np.exp(AUDIT_data.loc[mask, "5"]))
AUDIT_data.loc[mask, "prob_3"] = np.exp(AUDIT_data.loc[mask, "3"])/(np.exp(AUDIT_data.loc[mask, "1"]) + np.exp(AUDIT_data.loc[mask, "2"]) + np.exp(AUDIT_data.loc[mask, "3"]) + np.exp(AUDIT_data.loc[mask, "4"]) + np.exp(AUDIT_data.loc[mask, "5"]))
AUDIT_data.loc[mask, "prob_4"] = np.exp(AUDIT_data.loc[mask, "4"])/(np.exp(AUDIT_data.loc[mask, "1"]) + np.exp(AUDIT_data.loc[mask, "2"]) + np.exp(AUDIT_data.loc[mask, "3"]) + np.exp(AUDIT_data.loc[mask, "4"]) + np.exp(AUDIT_data.loc[mask, "5"]))
AUDIT_data.loc[mask, "prob_5"] = np.exp(AUDIT_data.loc[mask, "5"])/(np.exp(AUDIT_data.loc[mask, "1"]) + np.exp(AUDIT_data.loc[mask, "2"]) + np.exp(AUDIT_data.loc[mask, "3"]) + np.exp(AUDIT_data.loc[mask, "4"]) + np.exp(AUDIT_data.loc[mask, "5"]))
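
The per-item normalisation above repeats the same expression for every option. A more compact sketch is shown below; `add_option_probs` and the `item_options` mapping are hypothetical names introduced for illustration, and subtracting the row maximum before exponentiating is an added numerical-stability step that does not change the resulting probabilities.

```python
import numpy as np
import pandas as pd

def add_option_probs(df, item_options):
    """Add prob_<option> columns, normalised over each item's own options.

    item_options maps an item number to the list of log-prob column names
    that are valid answer options for that item.
    """
    df = df.copy()
    for item, options in item_options.items():
        mask = df["item"] == item
        logp = df.loc[mask, options].to_numpy(dtype=float)
        # shift by the row maximum so that np.exp cannot overflow
        shifted = np.exp(logp - logp.max(axis=1, keepdims=True))
        probs = shifted / shifted.sum(axis=1, keepdims=True)
        for j, opt in enumerate(options):
            df.loc[mask, f"prob_{opt}"] = probs[:, j]
    return df

# Toy frame: item 1 has two options, item 2 has three.
toy = pd.DataFrame({
    "item": [1, 2],
    "1": np.log([0.5, 0.2]),
    "2": np.log([0.25, 0.2]),
    "3": np.log([0.25, 0.4]),
})
toy = add_option_probs(toy, {1: ["1", "2"], 2: ["1", "2", "3"]})
```

Options that are not valid for an item (here `prob_3` for item 1) stay `NaN`, which mirrors the masked assignments in the cell above.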


In [5]:
# extract the probability the LLM assigned to the answer the participant actually gave
AUDIT_data = filter_pred_prob(AUDIT_data)

In [6]:
# # flip back human answers where they were flipped
# mask = (AUDIT_data["flipped"] == "yes") & (AUDIT_data["item"] == 1)
# AUDIT_data.loc[mask, "human_number"] = 3 - AUDIT_data.loc[mask, "human_number"]
# mask = (AUDIT_data["flipped"] == "yes") & (AUDIT_data["item"].isin([10, 11]))
# AUDIT_data.loc[mask, "human_number"] = 4 - AUDIT_data.loc[mask, "human_number"]
# mask = (AUDIT_data["flipped"] == "yes") & (AUDIT_data["item"].isin([2, 3, 4, 5, 6, 7, 8, 9]))
# AUDIT_data.loc[mask, "human_number"] = 6 - AUDIT_data.loc[mask, "human_number"]
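
The commented-out flip-back lines above all follow one pattern: an answer on a scale from `low` to `high` is mirrored as `(low + high) - answer`, which gives `3 - x` for two options, `4 - x` for three and `6 - x` for five. A minimal sketch with a hypothetical helper `flip_back` and toy data:

```python
import pandas as pd

def flip_back(answers, low, high):
    """Mirror answers on a scale running from `low` to `high` (inclusive)."""
    return (low + high) - answers

toy = pd.DataFrame({
    "flipped": ["yes", "no", "yes"],
    "human_number": [1, 2, 5],
})

# Only rows presented with a flipped scale are mirrored back (5-point scale here).
mask = toy["flipped"] == "yes"
toy.loc[mask, "human_number"] = flip_back(toy.loc[mask, "human_number"], low=1, high=5)
```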


In [7]:
# # Define mappings for each AUDIT question:
# audit_maps = {
#     1: {1: 1, 2: 0},                            # AlcSplit
#     2: {1: 0, 2: 1, 3: 2, 4: 3, 5: 4},          # AUDIT_1 (in proc data Frey)
#     3: {1: 0, 2: 1, 3: 2, 4: 3, 5: 4},          # AUDIT_2 (in proc data Frey)
#     4: {1: 0, 2: 1, 3: 2, 4: 3, 5: 4},          # AUDIT_3 (in proc data Frey)
#     5: {1: 0, 2: 1, 3: 2, 4: 3, 5: 4},          # AUDIT_4 (in proc data Frey)
#     6: {1: 0, 2: 1, 3: 2, 4: 3, 5: 4},          # AUDIT_5 (in proc data Frey)
#     7: {1: 0, 2: 1, 3: 2, 4: 3, 5: 4},          # AUDIT_6 (in proc data Frey)
#     8: {1: 0, 2: 1, 3: 2, 4: 3, 5: 4},          # AUDIT_7 (in proc data Frey)
#     9: {1: 0, 2: 1, 3: 2, 4: 3, 5: 4},          # AUDIT_8 (in proc data Frey)
#     10: {1: 0, 2: 2, 3: 4},                     # AUDIT_10 (in proc data Frey)
#     11: {1: 0, 2: 2, 3: 4},                     # AUDIT_11 (in proc data Frey)

# }

# # Apply mapping row-wise based on item number
# def recode_audit(row):
#     mapping = audit_maps.get(row["item"])
#     if mapping is not None:
#         return mapping.get(row["human_number"], None)  # None if invalid code
#     return row["human_number"]  

# AUDIT_data["human_number"] = AUDIT_data.apply(recode_audit, axis=1)
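
The row-wise `apply` in the commented-out recoding above can be slow on several hundred thousand rows. One possible alternative, sketched here on toy data with a made-up mapping and a hypothetical output column `human_number_recoded`, is to loop over the items and use `Series.map` on each masked subset:

```python
import pandas as pd

# Toy mapping in the same shape as audit_maps: item -> {raw code: points}.
maps = {
    1: {1: 1, 2: 0},
    2: {1: 0, 2: 1, 3: 2},
}

toy = pd.DataFrame({
    "item": [1, 1, 2, 2, 3],
    "human_number": [1, 2, 3, 1, 4],
})

# Start from the raw answers; items without a mapping stay unchanged.
recoded = toy["human_number"].astype(float)
for item, mapping in maps.items():
    mask = toy["item"] == item
    # codes missing from the mapping become NaN, like the None in the row-wise version
    recoded.loc[mask] = toy.loc[mask, "human_number"].map(mapping)

toy["human_number_recoded"] = recoded
```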


In [8]:
# produce df with one value per model per item 
model_item_scores_AUDIT = get_LLM_value_per_item(AUDIT_data)
model_item_scores_AUDIT_top_n = get_LLM_value_per_item_top_n(AUDIT_data)

# Merge them on the grouping keys
model_item_scores_AUDIT = model_item_scores_AUDIT.merge(
    model_item_scores_AUDIT_top_n,
    on=["experiment", "model", "item"],
    how="inner" 
)

## BARRAT SCALE

In [9]:
# load data
BARRAT_data = load_dataframes(task_name="BARRAT")

Merged DataFrame shape: (2082420, 10)
Total models: 46


In [10]:
# normalise answer option sum to one
BARRAT_data["prob_1"] = np.exp(BARRAT_data["1"])/(np.exp(BARRAT_data["1"]) + np.exp(BARRAT_data["2"]) + np.exp(BARRAT_data["3"]) + np.exp(BARRAT_data["4"]))
BARRAT_data["prob_2"] = np.exp(BARRAT_data["2"])/(np.exp(BARRAT_data["1"]) + np.exp(BARRAT_data["2"]) + np.exp(BARRAT_data["3"]) + np.exp(BARRAT_data["4"]))
BARRAT_data["prob_3"] = np.exp(BARRAT_data["3"])/(np.exp(BARRAT_data["1"]) + np.exp(BARRAT_data["2"]) + np.exp(BARRAT_data["3"]) + np.exp(BARRAT_data["4"]))
BARRAT_data["prob_4"] = np.exp(BARRAT_data["4"])/(np.exp(BARRAT_data["1"]) + np.exp(BARRAT_data["2"]) + np.exp(BARRAT_data["3"]) + np.exp(BARRAT_data["4"]))


In [11]:
# extract the probability the LLM assigned to the answer the participant actually gave
BARRAT_data = filter_pred_prob(BARRAT_data)

In [12]:
# # flip back human answers where they were flipped
# mask = (BARRAT_data["flipped"] == True)
# BARRAT_data.loc[mask, "human_number"] = 5 - BARRAT_data.loc[mask, "human_number"]


In [13]:
# # reverse human answers (again) where the items were reverse-phrased

# # add whether item was reverse coded
# reverse_coded = {
#     1: True, 2: False, 3: False,  4: False, 5: False,  6: False,  7: True,  8: True,  9: True,  10: True,
#     11: False, 12: True, 13: True,  14: False, 15: True,  16: False,  17: False,  18: False,  19: False,  20: True,
#     21: False, 22: False, 23: False,  24: False, 25: False,  26: False,  27: False,  28: False,  29: True,  30: True
#     }

# # Apply mapping row-wise based on item number
# BARRAT_data["reverse_coded"] = BARRAT_data["item"].map(reverse_coded)


# # flip back answers that were reverse coded
# mask = (BARRAT_data["reverse_coded"] == True)
# BARRAT_data.loc[mask, "human_number"] = 5 - BARRAT_data.loc[mask, "human_number"]
# # drop reverse-coded column (not needed in final data)
# BARRAT_data = BARRAT_data.drop(columns=["reverse_coded"])


In [14]:
# produce df with one value per model per item 
model_item_scores_BARRAT = get_LLM_value_per_item(BARRAT_data)
model_item_scores_BARRAT_top_n = get_LLM_value_per_item_top_n(BARRAT_data)

# Merge them on the grouping keys
model_item_scores_BARRAT = model_item_scores_BARRAT.merge(
    model_item_scores_BARRAT_top_n,
    on=["experiment", "model", "item"],
    how="inner" 
)

In [15]:
# add task-specific categories to be stored in all_data

# add item categories
item_to_category = {
    1: "BISn", 2: "BISm", 3: "BISm",  4: "BISm", 5: "BISa",  6: "BISa",  7: "BISn",  8: "BISn",  9: "BISa",  10: "BISn",
    11: "BISa", 12: "BISn", 13: "BISn",  14: "BISn", 15: "BISn",  16: "BISm",  17: "BISm",  18: "BISn",  19: "BISm",  20: "BISa",
    21: "BISm", 22: "BISm", 23: "BISm",  24: "BISa", 25: "BISm",  26: "BISa",  27: "BISn",  28: "BISa",  29: "BISn",  30: "BISm"
}

model_item_scores_BARRAT["category"] = model_item_scores_BARRAT["item"].map(item_to_category)

In [16]:
# merge dfs
all_data = pd.concat([model_item_scores_AUDIT, model_item_scores_BARRAT], ignore_index=True)


## CARE TASK

In [17]:
# load data
CARE_data = load_dataframes(task_name="CARE")

Merged DataFrame shape: (1320614, 106)
Total models: 46


In [18]:
# get probabilities out of log-probabilities

cols = [str(i) for i in range(0, 100)]
# Compute normalized probabilities
exp_vals = np.exp(CARE_data[cols])
prob_vals = exp_vals.div(exp_vals.sum(axis=1), axis=0)

# Rename columns all at once
prob_vals.columns = [f"prob_{i}" for i in range(0, 100)]

# Join to original dataframe in one step
CARE_data = pd.concat([CARE_data, prob_vals], axis=1).copy()

In [19]:
# extract the probability the LLM assigned to the answer the participant actually gave
CARE_data = filter_pred_prob(CARE_data)

In [20]:
# produce df with one value per model per item 
model_item_scores_CARE = get_LLM_value_per_item(CARE_data)
model_item_scores_CARE_top_n = get_LLM_value_per_item_top_n(CARE_data)

# Merge them on the grouping keys
model_item_scores_CARE = model_item_scores_CARE.merge(
    model_item_scores_CARE_top_n,
    on=["experiment", "model", "item"],
    how="inner" 
)

In [21]:
# add task-specific categories to be stored in all_data
# add item categories
item_to_category = {
    1: "CAREa", 2: "CAREa", 3: "CAREa",  4: "CAREa", 5: "CAREa",  6: "CAREa",  7: "CAREa",  8: "CAREa",  9: "CAREa",  10: "CAREs",
    11: "CAREs", 12: "CAREs", 13: "CAREs",  14: "CAREs", 15: "CAREs",  16: "CAREw",  17: "CAREw",  18: "CAREw",  19: "CAREw"
}

model_item_scores_CARE["category"] = model_item_scores_CARE["item"].map(item_to_category)


In [22]:
all_data = pd.concat([all_data, model_item_scores_CARE], ignore_index=True)


## DAST SCALE

In [23]:
# load data
DAST_data = load_dataframes(task_name="DAST")

Merged DataFrame shape: (1391040, 8)
Total models: 46


In [24]:
# normalise answer option sum to one 
DAST_data["prob_1"] = np.exp(DAST_data["1"])/(np.exp(DAST_data["1"]) + np.exp(DAST_data["2"]))
DAST_data["prob_2"] = np.exp(DAST_data["2"])/(np.exp(DAST_data["1"]) + np.exp(DAST_data["2"]))

In [25]:
# extract the probability the LLM assigned to the answer the participant actually gave
DAST_data = filter_pred_prob(DAST_data)

In [26]:
# # flip back human answers where they were flipped
# mask = (DAST_data["flipped"] == True) 
# DAST_data.loc[mask, "human_number"] = 3 - DAST_data.loc[mask, "human_number"]

In [27]:
# # Define mappings for each DAST question:
# dast_maps = {
#     1: {1: 1, 2: 0},                           
#     2: {1: 1, 2: 0},
#     3: {1: 1, 2: 0},
#     4: {1: 0, 2: 1},
#     5: {1: 0, 2: 1},
#     6: {1: 1, 2: 0},
#     7: {1: 1, 2: 0},
#     8: {1: 1, 2: 0},
#     9: {1: 1, 2: 0},
#     10: {1: 1, 2: 0},
#     11: {1: 1, 2: 0},
#     12: {1: 1, 2: 0},
#     13: {1: 1, 2: 0},
#     14: {1: 1, 2: 0},
#     15: {1: 1, 2: 0},
#     16: {1: 1, 2: 0},
#     17: {1: 1, 2: 0},
#     18: {1: 1, 2: 0},
#     19: {1: 1, 2: 0},
#     20: {1: 1, 2: 0}
# }

# # Apply mapping row-wise based on item number
# def recode_dast(row):
#     mapping = dast_maps.get(row["item"])
#     if mapping is not None:
#         return mapping.get(row["human_number"], None)  # None if invalid code
#     return row["human_number"]  

# DAST_data["human_number"] = DAST_data.apply(recode_dast, axis=1)


In [28]:
# produce df with one value per model per item 
model_item_scores_DAST = get_LLM_value_per_item(DAST_data)
model_item_scores_DAST_top_n = get_LLM_value_per_item_top_n(DAST_data)

# Merge them on the grouping keys
model_item_scores_DAST = model_item_scores_DAST.merge(
    model_item_scores_DAST_top_n,
    on=["experiment", "model", "item"],
    how="inner" 
)

In [29]:
# merge dfs
all_data = pd.concat([all_data, model_item_scores_DAST], ignore_index=True)


## DM SCALE

In [30]:
# load data
DM_data = load_dataframes(task_name="DM")

Merged DataFrame shape: (1318866, 10)
Total models: 46


In [31]:
# normalise answer option sum to one
DM_data["prob_1"] = np.exp(DM_data["1"])/(np.exp(DM_data["1"]) + np.exp(DM_data["2"]) + np.exp(DM_data["3"]) + np.exp(DM_data["4"]))
DM_data["prob_2"] = np.exp(DM_data["2"])/(np.exp(DM_data["1"]) + np.exp(DM_data["2"]) + np.exp(DM_data["3"]) + np.exp(DM_data["4"]))
DM_data["prob_3"] = np.exp(DM_data["3"])/(np.exp(DM_data["1"]) + np.exp(DM_data["2"]) + np.exp(DM_data["3"]) + np.exp(DM_data["4"]))
DM_data["prob_4"] = np.exp(DM_data["4"])/(np.exp(DM_data["1"]) + np.exp(DM_data["2"]) + np.exp(DM_data["3"]) + np.exp(DM_data["4"]))


In [32]:
# extract the probability the LLM assigned to the answer the participant actually gave
DM_data = filter_pred_prob(DM_data)

In [33]:
# # flip back human answers where they were flipped
# mask = (DM_data["flipped"] == True) 
# DM_data.loc[mask, "human_number"] = 5 - DM_data.loc[mask, "human_number"]

In [34]:
# # Define mappings for DM, so that all 4s are transformed to 1s (as in the original Frey dataset):
# # this deviates from the usual approach of following Frey's quest_proc df; since the values would otherwise be converted later, they are mapped to the 0-2 scale right away
# mapping = {
#     4: 1,
#     3: 2,
#     2: 1,
#     1: 0
# }
# DM_data["human_number"] = DM_data["human_number"].map(mapping)


In [35]:
# produce df with one value per model per item 
model_item_scores_DM = get_LLM_value_per_item(DM_data)
model_item_scores_DM_top_n = get_LLM_value_per_item_top_n(DM_data)

# Merge them on the grouping keys
model_item_scores_DM = model_item_scores_DM.merge(
    model_item_scores_DM_top_n,
    on=["experiment", "model", "item"],
    how="inner" 
)

In [36]:
# merge dfs
all_data = pd.concat([all_data, model_item_scores_DM], ignore_index=True)


## DOSPERT SCALE

In [37]:
# load data
DOSPERT_data = load_dataframes(task_name="DOSPERT")

Merged DataFrame shape: (2780240, 11)
Total models: 46


In [38]:
# normalise answer option sum to one
DOSPERT_data["prob_1"] = np.exp(DOSPERT_data["1"])/(np.exp(DOSPERT_data["1"]) + np.exp(DOSPERT_data["2"]) + np.exp(DOSPERT_data["3"]) + np.exp(DOSPERT_data["4"]) + np.exp(DOSPERT_data["5"]))
DOSPERT_data["prob_2"] = np.exp(DOSPERT_data["2"])/(np.exp(DOSPERT_data["1"]) + np.exp(DOSPERT_data["2"]) + np.exp(DOSPERT_data["3"]) + np.exp(DOSPERT_data["4"]) + np.exp(DOSPERT_data["5"]))
DOSPERT_data["prob_3"] = np.exp(DOSPERT_data["3"])/(np.exp(DOSPERT_data["1"]) + np.exp(DOSPERT_data["2"]) + np.exp(DOSPERT_data["3"]) + np.exp(DOSPERT_data["4"]) + np.exp(DOSPERT_data["5"]))
DOSPERT_data["prob_4"] = np.exp(DOSPERT_data["4"])/(np.exp(DOSPERT_data["1"]) + np.exp(DOSPERT_data["2"]) + np.exp(DOSPERT_data["3"]) + np.exp(DOSPERT_data["4"]) + np.exp(DOSPERT_data["5"]))
DOSPERT_data["prob_5"] = np.exp(DOSPERT_data["5"])/(np.exp(DOSPERT_data["1"]) + np.exp(DOSPERT_data["2"]) + np.exp(DOSPERT_data["3"]) + np.exp(DOSPERT_data["4"]) + np.exp(DOSPERT_data["5"]))


In [39]:
# extract the probability the LLM assigned to the answer the participant actually gave
DOSPERT_data = filter_pred_prob(DOSPERT_data)

In [40]:
# # flip back human answers where they were flipped
# mask = (DOSPERT_data["flipped"] == 'yes') 
# DOSPERT_data.loc[mask, "human_number"] = 6 - DOSPERT_data.loc[mask, "human_number"]

In [41]:
# produce df with one value per model per item 
model_item_scores_DOSPERT = get_LLM_value_per_item(DOSPERT_data)
model_item_scores_DOSPERT_top_n = get_LLM_value_per_item_top_n(DOSPERT_data)

# Merge them on the grouping keys
model_item_scores_DOSPERT = model_item_scores_DOSPERT.merge(
    model_item_scores_DOSPERT_top_n,
    on=["experiment", "model", "item"],
    how="inner" 
)


In [42]:
# add task-specific categories to be stored in all_data

# add item categories
item_to_category = {
    1: "Social", 10: "Social", 16: "Social", 19: "Social", 23: "Social", 26: "Social", 34: "Social", 35: "Social",
    2: "Recreational", 6: "Recreational", 15: "Recreational", 17: "Recreational", 21: "Recreational", 31: "Recreational", 37: "Recreational", 38: "Recreational",
    3: "Gambling", 11: "Gambling", 22: "Gambling", 33: "Gambling",
    4: "Health", 8: "Health", 27: "Health", 29: "Health", 32: "Health", 36: "Health", 39: "Health", 40: "Health",
    5: "Ethical", 9: "Ethical", 12: "Ethical", 13: "Ethical", 14: "Ethical", 20: "Ethical", 25: "Ethical", 28: "Ethical",
    7: "Investment", 18: "Investment", 24: "Investment", 30: "Investment"
}

model_item_scores_DOSPERT["category"] = model_item_scores_DOSPERT["item"].map(item_to_category)


In [43]:
# merge dfs
all_data = pd.concat([all_data, model_item_scores_DOSPERT], ignore_index=True)


## FTND SCALE

In [44]:
# load data
FTND_data = load_dataframes(task_name="FTND")


Merged DataFrame shape: (163162, 10)
Total models: 46


In [45]:
# normalise so that the answer options sum to one (we act as if we had a very good prompt, under which the LLM would only choose among the possible answer options; the renormalisation simulates this)
mask = (FTND_data["item"] == 1)
FTND_data.loc[mask, "prob_1"] = np.exp(FTND_data.loc[mask, "1"])/(np.exp(FTND_data.loc[mask, "1"]) + np.exp(FTND_data.loc[mask, "2"]) + np.exp(FTND_data.loc[mask, "3"]))
FTND_data.loc[mask, "prob_2"] = np.exp(FTND_data.loc[mask, "2"])/(np.exp(FTND_data.loc[mask, "1"]) + np.exp(FTND_data.loc[mask, "2"]) + np.exp(FTND_data.loc[mask, "3"]))
FTND_data.loc[mask, "prob_3"] = np.exp(FTND_data.loc[mask, "3"])/(np.exp(FTND_data.loc[mask, "1"]) + np.exp(FTND_data.loc[mask, "2"]) + np.exp(FTND_data.loc[mask, "3"]))

mask = (FTND_data["item"].isin([3, 4, 6, 7]))
FTND_data.loc[mask, "prob_1"] = np.exp(FTND_data.loc[mask, "1"])/(np.exp(FTND_data.loc[mask, "1"]) + np.exp(FTND_data.loc[mask, "2"]))
FTND_data.loc[mask, "prob_2"] = np.exp(FTND_data.loc[mask, "2"])/(np.exp(FTND_data.loc[mask, "1"]) + np.exp(FTND_data.loc[mask, "2"]))

mask = (FTND_data["item"].isin([2, 5]))
FTND_data.loc[mask, "prob_1"] = np.exp(FTND_data.loc[mask, "1"])/(np.exp(FTND_data.loc[mask, "1"]) + np.exp(FTND_data.loc[mask, "2"]) + np.exp(FTND_data.loc[mask, "3"]) + np.exp(FTND_data.loc[mask, "4"]))
FTND_data.loc[mask, "prob_2"] = np.exp(FTND_data.loc[mask, "2"])/(np.exp(FTND_data.loc[mask, "1"]) + np.exp(FTND_data.loc[mask, "2"]) + np.exp(FTND_data.loc[mask, "3"]) + np.exp(FTND_data.loc[mask, "4"]))
FTND_data.loc[mask, "prob_3"] = np.exp(FTND_data.loc[mask, "3"])/(np.exp(FTND_data.loc[mask, "1"]) + np.exp(FTND_data.loc[mask, "2"]) + np.exp(FTND_data.loc[mask, "3"]) + np.exp(FTND_data.loc[mask, "4"]))
FTND_data.loc[mask, "prob_4"] = np.exp(FTND_data.loc[mask, "4"])/(np.exp(FTND_data.loc[mask, "1"]) + np.exp(FTND_data.loc[mask, "2"]) + np.exp(FTND_data.loc[mask, "3"]) + np.exp(FTND_data.loc[mask, "4"]))


In [46]:
# extract the probability the LLM assigned to the answer the participant actually gave
FTND_data = filter_pred_prob(FTND_data)

In [47]:
# # flip back human answers where they were flipped
# mask = (FTND_data["flipped"] == True) & (FTND_data["item"] == 1)
# FTND_data.loc[mask, "human_number"] = 4 - FTND_data.loc[mask, "human_number"]
# mask = (FTND_data["flipped"] == True) & (FTND_data["item"].isin([3, 4, 6, 7]))
# FTND_data.loc[mask, "human_number"] = 3 - FTND_data.loc[mask, "human_number"]
# mask = (FTND_data["flipped"] == True) & (FTND_data["item"].isin([2, 5]))
# FTND_data.loc[mask, "human_number"] = 5 - FTND_data.loc[mask, "human_number"]


In [48]:
# # Define mappings for each FTND question:
# ftnd_maps = {
#     1: {1: 2, 2: 0, 3: 1},             # FTND_1Eingangsfrage (entry question): smoke?
#     2: {1: 3, 2: 2, 3: 1, 4: 0},       # FTND_1: time until first cigarette
#     3: {1: 1, 2: 0},                   # FTND_2: difficult to refrain
#     4: {1: 1, 2: 0},                   # FTND_3: which cigarette hardest to give up
#     5: {1: 0, 2: 1, 3: 2, 4: 3},       # FTND_4: cigarettes per day
#     6: {1: 1, 2: 0},                   # FTND_5: smoke more frequently in morning
#     7: {1: 1, 2: 0}                    # FTND_6: smoke when ill
# }

# # Apply mapping row-wise based on item number
# def recode_ftnd(row):
#     mapping = ftnd_maps.get(row["item"])
#     if mapping is not None:
#         return mapping.get(row["human_number"], None) 
#     return row["human_number"]  

# FTND_data["human_number"] = FTND_data.apply(recode_ftnd, axis=1)


In [49]:
# produce df with one value per model per item 
model_item_scores_FTND = get_LLM_value_per_item(FTND_data)
model_item_scores_FTND_top_n = get_LLM_value_per_item_top_n(FTND_data)

# Merge them on the grouping keys
model_item_scores_FTND = model_item_scores_FTND.merge(
    model_item_scores_FTND_top_n,
    on=["experiment", "model", "item"],
    how="inner" 
)

In [50]:
# merge dfs
all_data = pd.concat([all_data, model_item_scores_FTND], ignore_index=True)


## GABS SCALE

In [51]:
# load data
GABS_data = load_dataframes(task_name="GABS")

Merged DataFrame shape: (581210, 10)
Total models: 46


In [52]:
# normalise answer option sum to one

# columns representing log-probabilities
answer_cols = ["1", "2", "3", "4"]

# make a copy to avoid SettingWithCopy warnings
GABS_data = GABS_data.copy()

# case 1: item == 1 → only options 1 and 2
mask_item1 = GABS_data["item"] == 1
exp_vals_item1 = np.exp(GABS_data.loc[mask_item1, ["1", "2"]])
probs_item1 = exp_vals_item1.div(exp_vals_item1.sum(axis=1), axis=0)
probs_item1.columns = ["prob_1", "prob_2"]

# case 2: items 2–17 → options 1–4
mask_item2plus = GABS_data["item"].between(2, 17)
exp_vals_item2plus = np.exp(GABS_data.loc[mask_item2plus, answer_cols])
probs_item2plus = exp_vals_item2plus.div(exp_vals_item2plus.sum(axis=1), axis=0)
probs_item2plus.columns = [f"prob_{c}" for c in answer_cols]

# merge both parts back into original df
GABS_data = GABS_data.join(pd.concat([probs_item1, probs_item2plus]))



In [53]:
# extract the probability the LLM assigned to the answer the participant actually gave
GABS_data = filter_pred_prob(GABS_data)

In [54]:
# # flip back human answers where they were flipped
# mask = (GABS_data["flipped"] == True) & (GABS_data["item"] == 1)
# GABS_data.loc[mask, "human_number"] = 3 - GABS_data.loc[mask, "human_number"]
# mask = (GABS_data["flipped"] == True) & (GABS_data["item"].isin(range(2, 18)))  # items 2-17, matching the normalisation above
# GABS_data.loc[mask, "human_number"] = 5 - GABS_data.loc[mask, "human_number"]


In [55]:
# produce df with one value per model per item 
model_item_scores_GABS = get_LLM_value_per_item(GABS_data)
model_item_scores_GABS_top_n = get_LLM_value_per_item_top_n(GABS_data)

# Merge them on the grouping keys
model_item_scores_GABS = model_item_scores_GABS.merge(
    model_item_scores_GABS_top_n,
    on=["experiment", "model", "item"],
    how="inner" 
)

In [56]:
# merge dfs
all_data = pd.concat([all_data, model_item_scores_GABS], ignore_index=True)


## PG SCALE

In [57]:
# load data
PG_data = load_dataframes(task_name="PG")

Merged DataFrame shape: (1127322, 13)
Total models: 46


In [58]:
# normalise answer option sum to one 

mask = (PG_data["item"].isin([1, 26]))
PG_data.loc[mask, "prob_1"] = np.exp(PG_data.loc[mask, "1"])/(np.exp(PG_data.loc[mask, "1"]) + np.exp(PG_data.loc[mask, "2"]))
PG_data.loc[mask, "prob_2"] = np.exp(PG_data.loc[mask, "2"])/(np.exp(PG_data.loc[mask, "1"]) + np.exp(PG_data.loc[mask, "2"]))



mask = (PG_data["item"].isin(range(2, 21)))
PG_data.loc[mask, "prob_1"] = np.exp(PG_data.loc[mask, "1"])/(np.exp(PG_data.loc[mask, "1"]) + np.exp(PG_data.loc[mask, "2"]) + np.exp(PG_data.loc[mask, "3"]) + np.exp(PG_data.loc[mask, "4"]) + np.exp(PG_data.loc[mask, "5"]))
PG_data.loc[mask, "prob_2"] = np.exp(PG_data.loc[mask, "2"])/(np.exp(PG_data.loc[mask, "1"]) + np.exp(PG_data.loc[mask, "2"]) + np.exp(PG_data.loc[mask, "3"]) + np.exp(PG_data.loc[mask, "4"]) + np.exp(PG_data.loc[mask, "5"]))
PG_data.loc[mask, "prob_3"] = np.exp(PG_data.loc[mask, "3"])/(np.exp(PG_data.loc[mask, "1"]) + np.exp(PG_data.loc[mask, "2"]) + np.exp(PG_data.loc[mask, "3"]) + np.exp(PG_data.loc[mask, "4"]) + np.exp(PG_data.loc[mask, "5"]))
PG_data.loc[mask, "prob_4"] = np.exp(PG_data.loc[mask, "4"])/(np.exp(PG_data.loc[mask, "1"]) + np.exp(PG_data.loc[mask, "2"]) + np.exp(PG_data.loc[mask, "3"]) + np.exp(PG_data.loc[mask, "4"]) + np.exp(PG_data.loc[mask, "5"]))
PG_data.loc[mask, "prob_5"] = np.exp(PG_data.loc[mask, "5"])/(np.exp(PG_data.loc[mask, "1"]) + np.exp(PG_data.loc[mask, "2"]) + np.exp(PG_data.loc[mask, "3"]) + np.exp(PG_data.loc[mask, "4"]) + np.exp(PG_data.loc[mask, "5"]))



mask = (PG_data["item"] == 25)
PG_data.loc[mask, "prob_1"] = np.exp(PG_data.loc[mask, "1"])/(np.exp(PG_data.loc[mask, "1"]) + np.exp(PG_data.loc[mask, "2"]) + np.exp(PG_data.loc[mask, "3"]) + np.exp(PG_data.loc[mask, "4"]) + np.exp(PG_data.loc[mask, "5"]) + np.exp(PG_data.loc[mask, "6"]))
PG_data.loc[mask, "prob_2"] = np.exp(PG_data.loc[mask, "2"])/(np.exp(PG_data.loc[mask, "1"]) + np.exp(PG_data.loc[mask, "2"]) + np.exp(PG_data.loc[mask, "3"]) + np.exp(PG_data.loc[mask, "4"]) + np.exp(PG_data.loc[mask, "5"]) + np.exp(PG_data.loc[mask, "6"]))
PG_data.loc[mask, "prob_3"] = np.exp(PG_data.loc[mask, "3"])/(np.exp(PG_data.loc[mask, "1"]) + np.exp(PG_data.loc[mask, "2"]) + np.exp(PG_data.loc[mask, "3"]) + np.exp(PG_data.loc[mask, "4"]) + np.exp(PG_data.loc[mask, "5"]) + np.exp(PG_data.loc[mask, "6"]))
PG_data.loc[mask, "prob_4"] = np.exp(PG_data.loc[mask, "4"])/(np.exp(PG_data.loc[mask, "1"]) + np.exp(PG_data.loc[mask, "2"]) + np.exp(PG_data.loc[mask, "3"]) + np.exp(PG_data.loc[mask, "4"]) + np.exp(PG_data.loc[mask, "5"]) + np.exp(PG_data.loc[mask, "6"]))
PG_data.loc[mask, "prob_5"] = np.exp(PG_data.loc[mask, "5"])/(np.exp(PG_data.loc[mask, "1"]) + np.exp(PG_data.loc[mask, "2"]) + np.exp(PG_data.loc[mask, "3"]) + np.exp(PG_data.loc[mask, "4"]) + np.exp(PG_data.loc[mask, "5"]) + np.exp(PG_data.loc[mask, "6"]))
PG_data.loc[mask, "prob_6"] = np.exp(PG_data.loc[mask, "6"])/(np.exp(PG_data.loc[mask, "1"]) + np.exp(PG_data.loc[mask, "2"]) + np.exp(PG_data.loc[mask, "3"]) + np.exp(PG_data.loc[mask, "4"]) + np.exp(PG_data.loc[mask, "5"]) + np.exp(PG_data.loc[mask, "6"]))


mask = (PG_data["item"].isin([21, 22, 23, 24, 27, 28, 29, 30, 31, 32]))
PG_data.loc[mask, "prob_0"] = np.exp(PG_data.loc[mask, "0"])/(np.exp(PG_data.loc[mask, "0"]) + np.exp(PG_data.loc[mask, "1"]))
PG_data.loc[mask, "prob_1"] = np.exp(PG_data.loc[mask, "1"])/(np.exp(PG_data.loc[mask, "0"]) + np.exp(PG_data.loc[mask, "1"]))


In [59]:
# extract the probability the LLM assigned to the answer the participant actually gave
PG_data = filter_pred_prob(PG_data)

In [60]:
# # flip back human answers where they were flipped
# mask = (PG_data["flipped"] == True) & (PG_data["item"].isin([1, 26]))
# PG_data.loc[mask, "human_number"] = 3 - PG_data.loc[mask, "human_number"]

# mask = (PG_data["flipped"] == True) & (PG_data["item"].isin(range(2, 21)))
# PG_data.loc[mask, "human_number"] = 6 - PG_data.loc[mask, "human_number"]

# mask = (PG_data["flipped"] == True) & (PG_data["item"] == 25)
# PG_data.loc[mask, "human_number"] = 7 - PG_data.loc[mask, "human_number"]

# mask = (PG_data["flipped"] == True) & (PG_data["item"].isin([21, 22, 23, 24, 27, 28, 29, 30, 31, 32]))
# PG_data.loc[mask, "human_number"] = 1 - PG_data.loc[mask, "human_number"]



In [61]:
# # Define mappings for each PG question:
# pg_maps = {     
#     1: {1: 1, 2: 0},                      
#     2: {1: 4, 2: 3, 3: 2, 4: 1, 5: 0},
#     3: {1: 4, 2: 3, 3: 2, 4: 1, 5: 0},
#     4: {1: 4, 2: 3, 3: 2, 4: 1, 5: 0},
#     5: {1: 4, 2: 3, 3: 2, 4: 1, 5: 0},
#     6: {1: 4, 2: 3, 3: 2, 4: 1, 5: 0},
#     7: {1: 4, 2: 3, 3: 2, 4: 1, 5: 0},
#     8: {1: 4, 2: 3, 3: 2, 4: 1, 5: 0},
#     9: {1: 4, 2: 3, 3: 2, 4: 1, 5: 0},
#     10: {1: 4, 2: 3, 3: 2, 4: 1, 5: 0},
#     11: {1: 4, 2: 3, 3: 2, 4: 1, 5: 0},
#     12: {1: 4, 2: 3, 3: 2, 4: 1, 5: 0},
#     13: {1: 4, 2: 3, 3: 2, 4: 1, 5: 0},
#     14: {1: 4, 2: 3, 3: 2, 4: 1, 5: 0},
#     15: {1: 4, 2: 3, 3: 2, 4: 1, 5: 0},
#     16: {1: 4, 2: 3, 3: 2, 4: 1, 5: 0},
#     17: {1: 4, 2: 3, 3: 2, 4: 1, 5: 0},
#     18: {1: 4, 2: 3, 3: 2, 4: 1, 5: 0},
#     19: {1: 4, 2: 3, 3: 2, 4: 1, 5: 0},
#     20: {1: 4, 2: 3, 3: 2, 4: 1, 5: 0},
#     26: {1: 1, 2: 0}
# }

# # Apply mapping row-wise based on item number
# def recode_pg(row):
#     mapping = pg_maps.get(row["item"])
#     if mapping is not None:
#         return mapping.get(row["human_number"], None)  # None if invalid code
#     return row["human_number"]  

# PG_data["human_number"] = PG_data.apply(recode_pg, axis=1)

# # now it is consistent with Frey's quest_proc df (except for the items where, within the same scale, this was apparently forgotten(?) by Frey at ESS_GABS_ausserh_01-10)
# # but in my opinion, to be able to convert the scale into binned factors, everything still has to be brought into the same direction;
# # items 1 and 26 point in the wrong direction! -> I have fixed this now, even though it is a deviation

In [62]:
# produce df with one value per model per item 
model_item_scores_PG = get_LLM_value_per_item(PG_data)
model_item_scores_PG_top_n = get_LLM_value_per_item_top_n(PG_data)

# Merge them on the grouping keys
model_item_scores_PG = model_item_scores_PG.merge(
    model_item_scores_PG_top_n,
    on=["experiment", "model", "item"],
    how="inner" 
)
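
An inner merge silently drops any (experiment, model, item) key that appears in only one of the two frames. A minimal sketch of a guarded merge, assuming both helpers return exactly one row per key; the toy frames, the model name `m1` and all values are made up, and the `score`/`score_top_n` column names are taken from the final `all_data` display:

```python
import pandas as pd

keys = ["experiment", "model", "item"]

# Toy stand-ins for the outputs of the two scoring helpers (values are made up).
scores = pd.DataFrame({
    "experiment": ["PG scale"] * 2, "model": ["m1"] * 2,
    "item": [1, 2], "score": [1.2, 3.4],
})
scores_top_n = pd.DataFrame({
    "experiment": ["PG scale"] * 2, "model": ["m1"] * 2,
    "item": [1, 2], "score_top_n": [1.0, 4.0],
})

# validate="one_to_one" raises a MergeError if a key is duplicated on either side.
merged = scores.merge(scores_top_n, on=keys, how="inner", validate="one_to_one")

# An inner join drops unmatched keys without warning, so compare row counts.
assert len(merged) == len(scores) == len(scores_top_n), "merge dropped rows"
```

The same check would apply unchanged to the PRI, SOEP and SSSV merges, since they use the same keys.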

In [63]:
# merge dfs
all_data = pd.concat([all_data, model_item_scores_PG], ignore_index=True)


## PRI SCALE

In [64]:
# load data
PRI_data = load_dataframes(task_name="PRI")

Merged DataFrame shape: (1110624, 13)
Total models: 46


In [65]:
# normalise answer option sum to one 

# odd items (decision) have 2 answer options, even items (certainty) have 7
for items, n_options in [(range(1, 17, 2), 2), (range(2, 17, 2), 7)]:
    mask = PRI_data["item"].isin(items)
    cols = [str(i) for i in range(1, n_options + 1)]
    # shared denominator: sum of exp(log-prob) over all options of the item
    denom = np.exp(PRI_data.loc[mask, cols]).sum(axis=1)
    for c in cols:
        PRI_data.loc[mask, f"prob_{c}"] = np.exp(PRI_data.loc[mask, c]) / denom

In [66]:
# extract the probability the LLM assigned to the answer the participant actually chose
PRI_data = filter_pred_prob(PRI_data)

In [67]:
# # flip back human answers where they were flipped
# mask = (PRI_data["flipped"] == True) & (PRI_data["item"].isin([1, 3, 5, 7, 9, 11, 13, 15]))
# PRI_data.loc[mask, "human_number"] = 3 - PRI_data.loc[mask, "human_number"]

# mask = (PRI_data["flipped"] == True) & (PRI_data["item"].isin([2, 4, 6, 8, 10, 12, 14, 16]))
# PRI_data.loc[mask, "human_number"] = 8 - PRI_data.loc[mask, "human_number"]


In [68]:
# produce df with one value per model per item 
model_item_scores_PRI = get_LLM_value_per_item(PRI_data)
model_item_scores_PRI_top_n = get_LLM_value_per_item_top_n(PRI_data)

# Merge them on the grouping keys
model_item_scores_PRI = model_item_scores_PRI.merge(
    model_item_scores_PRI_top_n,
    on=["experiment", "model", "item"],
    how="inner" 
)

In [69]:
# Adding task specific categories to save in all data

# add item categories
item_to_category = {
     1: "decision", 3: "decision", 5: "decision", 7: "decision", 9: "decision", 11: "decision", 13: "decision", 15: "decision",
     2: "certainty", 4: "certainty", 6: "certainty", 8: "certainty", 10: "certainty", 12: "certainty", 14: "certainty", 16: "certainty"
}

model_item_scores_PRI["category"] = model_item_scores_PRI["item"].map(item_to_category)
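
`Series.map` with a dict returns NaN for every item number that is missing from the dict, so a typo in the mapping only shows up later as an empty category. A small sketch of a check, using a shortened toy mapping rather than the real one:

```python
import pandas as pd

item_to_category_toy = {1: "decision", 2: "certainty"}  # shortened toy mapping

toy_scores = pd.DataFrame({"item": [1, 2, 3]})
toy_scores["category"] = toy_scores["item"].map(item_to_category_toy)

# Items missing from the dict end up as NaN; list them instead of failing later.
unmapped = sorted(toy_scores.loc[toy_scores["category"].isna(), "item"].unique())
print("items without category:", unmapped)
```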


In [70]:
# merge dfs
all_data = pd.concat([all_data, model_item_scores_PRI], ignore_index=True)

## SOEP SCALE

In [71]:
# load data
SOEP_data = load_dataframes(task_name="SOEP")

Merged DataFrame shape: (486542, 17)
Total models: 46


In [72]:
# get probabilities out of log-probabilities

cols = [str(i) for i in range(1, 12)]
# Compute normalized probabilities
exp_vals = np.exp(SOEP_data[cols])
prob_vals = exp_vals.div(exp_vals.sum(axis=1), axis=0)

# Rename columns all at once
prob_vals.columns = [f"prob_{i}" for i in range(1, 12)]

# Join to original dataframe in one step
SOEP_data = pd.concat([SOEP_data, prob_vals], axis=1).copy()
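
Calling `np.exp` on raw log-probabilities can underflow to 0 when all values in a row are very negative, which turns the division into 0/0 = NaN. A sketch of the same normalisation with the row maximum subtracted first (mathematically identical result); `softmax_logprobs` is a hypothetical helper, not part of the notebook, and it assumes the selected columns contain no missing values:

```python
import numpy as np
import pandas as pd

def softmax_logprobs(df, cols):
    """Row-wise softmax over log-prob columns, shifted by the row maximum."""
    logp = df[cols].to_numpy(dtype=float)
    shifted = logp - logp.max(axis=1, keepdims=True)  # largest entry becomes 0
    exp_vals = np.exp(shifted)
    probs = exp_vals / exp_vals.sum(axis=1, keepdims=True)
    return pd.DataFrame(probs, index=df.index, columns=[f"prob_{c}" for c in cols])

# Toy log-probs: the first row would underflow with a plain np.exp.
toy = pd.DataFrame({"1": [-1000.0, -0.5], "2": [-1001.0, -1.5]})
probs = softmax_logprobs(toy, ["1", "2"])
```

The result could be joined to the original frame with `pd.concat([...], axis=1)` exactly as in the cell above.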

In [73]:
# extract the probability the LLM assigned to the answer the participant actually chose
SOEP_data = filter_pred_prob(SOEP_data)

In [74]:
# # flip back human answers where they were flipped
# mask = (SOEP_data["flipped"] == "yes") 
# SOEP_data.loc[mask, "human_number"] = 12 - SOEP_data.loc[mask, "human_number"]
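
The constants in the commented-out flips (3, 8 and 12) follow from one rule: on a scale coded 1..k, the mirrored answer is (k + 1) - x. A small sketch with a hypothetical helper `reverse_score` that makes the rule explicit:

```python
def reverse_score(value, n_options, lowest=1):
    """Mirror a response on a scale running from `lowest` to `lowest + n_options - 1`."""
    return (2 * lowest + n_options - 1) - value

# 2, 7 and 11 options starting at 1 give the constants 3, 8 and 12.
assert reverse_score(1, 2) == 2      # 3 - 1
assert reverse_score(7, 7) == 1      # 8 - 7
assert reverse_score(4, 11) == 8     # 12 - 4
```

Note that the SOEP cell compares `flipped` against the string "yes" while the other cells compare against `True`; which one matches depends on how the column is stored in each dataset.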


In [75]:
# produce df with one value per model per item 
model_item_scores_SOEP = get_LLM_value_per_item(SOEP_data)
model_item_scores_SOEP_top_n = get_LLM_value_per_item_top_n(SOEP_data)

# Merge them on the grouping keys
model_item_scores_SOEP = model_item_scores_SOEP.merge(
    model_item_scores_SOEP_top_n,
    on=["experiment", "model", "item"],
    how="inner" 
)

In [76]:
# Adding task specific categories to save in all data

# add item categories
item_to_category = {
     1: "SOEP", 2: "SOEPdri", 3: "SOEPfin",  4: "SOEPrec", 5: "SOEPocc",  6: "SOEPhea",  7: "SOEPsoc"
}

model_item_scores_SOEP["category"] = model_item_scores_SOEP["item"].map(item_to_category)


In [77]:
# merge dfs
all_data = pd.concat([all_data, model_item_scores_SOEP], ignore_index=True)

## SSSV SCALE

In [78]:
# load data
SSSV_data = load_dataframes(task_name="SSSV")

Merged DataFrame shape: (2776560, 8)
Total models: 46


In [79]:
# normalise answer option sum to one
SSSV_data["prob_1"] = np.exp(SSSV_data["1"])/(np.exp(SSSV_data["1"]) + np.exp(SSSV_data["2"]))
SSSV_data["prob_2"] = np.exp(SSSV_data["2"])/(np.exp(SSSV_data["1"]) + np.exp(SSSV_data["2"]))

In [80]:
# extract the probability the LLM assigned to the answer the participant actually chose
SSSV_data = filter_pred_prob(SSSV_data)

In [81]:
# # flip back human answers where they were flipped
# mask = (SSSV_data["flipped"] == True) 
# SSSV_data.loc[mask, "human_number"] = 3 - SSSV_data.loc[mask, "human_number"]


In [82]:
# # reverse human answers (again) where the items were reverse-phrased

# # add whether item was reverse coded
# reverse_coded = {
#      1: True, 2: False, 3: True, 4: False, 5: True, 6: True, 7: False, 8: True, 9: True, 10: False, 
#      11: False, 12: False, 13: False, 14: True, 15: False, 16: True, 17: True, 18: True, 19: False, 20: False,
#      21: False, 22: True, 23: True, 24: True, 25: False, 26: False, 27: False, 28: True, 29: True, 30: False,
#      31: False, 32: True, 33: False, 34: True, 35: False, 36: True, 37: False, 38: False, 39: True, 40: False

# }

# # Apply mapping row-wise based on item number
# SSSV_data["reverse_coded"] = SSSV_data["item"].map(reverse_coded)

# # flip back answers that were reverse coded
# mask = (SSSV_data["reverse_coded"] == True)
# SSSV_data.loc[mask, "human_number"] = 3 - SSSV_data.loc[mask, "human_number"]
# # drop reverse-coded column (not needed in final data)
# SSSV_data = SSSV_data.drop(columns=["reverse_coded"])


In [83]:
# produce df with one value per model per item 
model_item_scores_SSSV = get_LLM_value_per_item(SSSV_data)
model_item_scores_SSSV_top_n = get_LLM_value_per_item_top_n(SSSV_data)

# Merge them on the grouping keys
model_item_scores_SSSV = model_item_scores_SSSV.merge(
    model_item_scores_SSSV_top_n,
    on=["experiment", "model", "item"],
    how="inner" 
)

In [84]:
# Adding task specific categories to save in all data

# add item categories
item_to_category = {
     3: "SStas", 11: "SStas", 16: "SStas", 17: "SStas", 20: "SStas", 21: "SStas", 23: "SStas", 28: "SStas", 38: "SStas", 40: "SStas",
     4: "SSexp", 6: "SSexp", 9: "SSexp", 10: "SSexp", 14: "SSexp", 18: "SSexp", 19: "SSexp", 22: "SSexp", 26: "SSexp", 37: "SSexp",
     1: "SSdis", 12: "SSdis", 13: "SSdis", 25: "SSdis", 29: "SSdis", 30: "SSdis", 32: "SSdis", 33: "SSdis", 35: "SSdis", 36: "SSdis",
     2: "SSbor", 5: "SSbor", 7: "SSbor", 8: "SSbor", 15: "SSbor", 24: "SSbor", 27: "SSbor", 31: "SSbor", 34: "SSbor", 39: "SSbor"
}

model_item_scores_SSSV["category"] = model_item_scores_SSSV["item"].map(item_to_category)

In [85]:
# merge dfs
all_data = pd.concat([all_data, model_item_scores_SSSV], ignore_index=True)
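
Because every task appends to `all_data` with `pd.concat`, re-running one of these cells in the notebook adds the same rows a second time. A sketch of a duplicate check on the grouping keys, using a made-up one-row block:

```python
import pandas as pd

keys = ["experiment", "model", "item"]
block = pd.DataFrame({
    "experiment": ["SSSV scale"], "model": ["m1"], "item": [1], "score": [1.3],
})

# Simulate running the concat cell twice.
all_data_toy = pd.concat([block, block], ignore_index=True)

# Count rows whose key already appeared earlier, then keep the first copy only.
n_dupes = int(all_data_toy.duplicated(subset=keys).sum())
all_data_toy = all_data_toy.drop_duplicates(subset=keys, ignore_index=True)
```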

# Saving new processed dataframe

In [86]:
# save data
all_data.to_csv('processed_data/no_change_item_data.csv', index=False)
#all_data.to_csv("processed_data/items_per_LLM_random_simulation.csv", index=False)
#all_data.to_csv("processed_data/items_per_LLM_semi_random_simulation.csv", index=False)
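
`to_csv` does not create missing parent folders, so the save fails if `processed_data/` does not exist yet. A sketch that creates the folder first; it writes a made-up one-row frame into a temporary directory so that it can be run without touching the real output:

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({
    "experiment": ["AUDIT scale"], "model": ["m1"], "item": [1], "score": [1.7],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "processed_data" / "no_change_item_data.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)  # read back to confirm the file is intact
    same_shape = reloaded.shape == toy.shape
```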


In [87]:
all_data

Unnamed: 0,experiment,model,item,score,score_top_n,category
0,AUDIT scale,Apertus-70B-Instruct-2509,1,1.667793,2.000000,
1,AUDIT scale,Apertus-70B-Instruct-2509,2,2.534622,2.000000,
2,AUDIT scale,Apertus-70B-Instruct-2509,3,2.056587,1.000000,
3,AUDIT scale,Apertus-70B-Instruct-2509,4,2.477461,1.824985,
4,AUDIT scale,Apertus-70B-Instruct-2509,5,3.412819,5.000000,
...,...,...,...,...,...,...
11817,SSSV scale,zephyr-7b-beta,36,1.330655,1.000000,SSdis
11818,SSSV scale,zephyr-7b-beta,37,1.362839,1.000000,SSexp
11819,SSSV scale,zephyr-7b-beta,38,1.367971,1.000000,SStas
11820,SSSV scale,zephyr-7b-beta,39,1.355507,1.000000,SSbor
