# Appendix Tables: Accuracy and Error Metrics

This notebook replicates the appendix tables from the paper:

- **Table A6**: Average accuracy and error metrics by EPQR-A scale for Base population vs input reference
- **Table A7**: Average accuracy and error metrics by EPQR-A scale for MaxN/MaxP populations vs input reference

## Metrics Explained

For each EPQR-A question, we compare the LLM-generated answer with the reference (input) answer:

### Binary Classification Metrics

- **Accuracy**: Percentage of answers that match the reference: (TP + TN) / Total
- **Precision**: Of all positive predictions, how many were correct? TP / (TP + FP)
- **Recall (Sensitivity)**: Of all actual positives, how many were identified? TP / (TP + FN)
- **Specificity**: Of all actual negatives, how many were identified? TN / (TN + FP)

Where:
- **TP (True Positive)**: Both reference and LLM answered "Yes"
- **TN (True Negative)**: Both reference and LLM answered "No"
- **FP (False Positive)**: Reference="No", LLM="Yes"
- **FN (False Negative)**: Reference="Yes", LLM="No"
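
The four ratios above can be sketched directly from confusion counts. The function name `binary_metrics` and the example counts are illustrative assumptions, not part of the `evaluations` package; values are returned as percentages to match the tables in this notebook.

```python
def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Return accuracy, precision, recall and specificity as percentages.

    Hypothetical helper for illustration only. A ratio whose denominator
    is zero is reported as None instead of raising an error.
    """
    def pct(num, den):
        return round(100.0 * num / den, 2) if den else None

    total = tp + tn + fp + fn
    return {
        "accuracy": pct(tp + tn, total),
        "precision": pct(tp, tp + fp),
        "recall": pct(tp, tp + fn),
        "specificity": pct(tn, tn + fp),
    }


# Made-up counts: 40 TP, 45 TN, 5 FP, 10 FN out of 100 answers
example = binary_metrics(tp=40, tn=45, fp=5, fn=10)
print(example)
```

Returning `None` for an empty denominator is one possible design choice; it keeps a scale with no positive reference answers from silently showing up as 0% recall.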

### Error Metrics (Scale-level)

For each personality and category, we sum the evaluations (1 for "Yes", 0 for "No") to get a scale score:

- **MAE (Mean Absolute Error)**: Average absolute difference between LLM and reference scale scores
- **RMSE (Root Mean Squared Error)**: Square root of average squared differences between LLM and reference scale scores

These metrics measure how well the LLM reproduces the **overall trait intensity** rather than individual question accuracy.
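
As a minimal sketch of the two scale-level error metrics, the snippet below computes MAE and RMSE from paired scale scores. The helper name `scale_errors` and the scores are made up for illustration; this is not the implementation used by `table_accuracy.compute_error_metrics()`.

```python
import math


def scale_errors(llm_scores, ref_scores):
    """MAE and RMSE between paired scale scores (hypothetical helper)."""
    if len(llm_scores) != len(ref_scores) or not llm_scores:
        raise ValueError("need two non-empty sequences of equal length")
    diffs = [a - b for a, b in zip(llm_scores, ref_scores)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse


# Made-up scale scores (0 to 6) for four personalities on one scale
llm = [3, 5, 0, 6]
ref = [3, 4, 2, 6]
mae, rmse = scale_errors(llm, ref)
print(f"MAE={mae:.3f} RMSE={rmse:.3f}")
```

RMSE is never smaller than MAE, and the gap widens when a few personalities carry large errors, which is why the tables report both.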

## EPQR-A Categories

- **E (Extraversion)**: Questions 2, 4, 13, 15, 20, 23
- **N (Neuroticism)**: Questions 1, 9, 11, 14, 18, 21
- **P (Psychoticism)**: Questions 3, 6, 8, 12, 16, 22
- **L (Lie scale)**: Questions 5, 7, 10, 17, 19, 24
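
The item-to-scale assignment can also be read from the `category` column of the loaded data, which is what the calculations in this notebook actually use. The sketch below rebuilds the mapping from the first five reference rows shown in this notebook's output; the frame name `ref_head` is an assumption made for illustration.

```python
import pandas as pd

# First five reference rows, copied from the reference_data.head() output
ref_head = pd.DataFrame({
    "question_number": [1, 2, 3, 4, 5],
    "category": ["N", "E", "P", "E", "L"],
})

# Collect the question numbers that belong to each scale
items_by_scale = (
    ref_head.drop_duplicates(["question_number", "category"])
    .groupby("category")["question_number"]
    .apply(list)
    .to_dict()
)
print(items_by_scale)
```

Running the same grouping on the full `reference_data` frame would list all 24 items, six per scale.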

## Setup and Data Loading

In [13]:
# Standard libraries
import pandas as pd
import numpy as np

# Database connection
from personas_backend.db.db_handler import DatabaseHandler
from personas_backend import ACTIVE_SCHEMA

# Evaluations package
from evaluations import data_access
from evaluations import table_accuracy

# Configuration
SCHEMA = "personality_trap"
print(f"Using schema: {SCHEMA}")
print(f"Active schema: {ACTIVE_SCHEMA}")

Using schema: personality_trap
Active schema: test_validation_schema


In [14]:
# Connect to database
db_handler = DatabaseHandler()
conn = db_handler.connection

print("✅ Connected to database")
conn

✅ Connected to database


Engine(postgresql://personas:***@localhost:5432/personas)

## Load Experiment and Reference Data

In [15]:
# Load EPQR-A questionnaire experiment data
with conn.connect() as connection:
    epqra_data = data_access.load_questionnaire_experiments(
        connection, 
        schema=SCHEMA,
        questionnaires=["epqra"]
    )

print(f"Total EPQR-A records: {len(epqra_data)}")
print(f"Experiment groups: {sorted(epqra_data['experiments_group_id'].unique())}")
print(f"Models: {sorted(epqra_data['model_clean'].unique())}")
print(f"Populations: {sorted(epqra_data['population_display'].unique())}")
print(f"Categories: {sorted(epqra_data['category'].unique())}")

epqra_data.head()

Loading questionnaire data from personality_trap.experiments_evals...
Loaded 297360 questionnaire records
Models: ['GPT-3.5' 'GPT-4o' 'Claude-3.5-s' 'Llama3.2-3B' 'Llama3.1-70B']
Populations: ['gpt35' 'gpt4o' 'maxN_gpt4o' 'maxP_gpt4o' 'claude35sonnet' 'llama323B'
 'llama3170B' 'maxN_claude35sonnet' 'maxP_claude35sonnet' 'maxN_gpt35'
 'maxP_gpt35' 'maxN_llama3170B' 'maxP_llama3170B' 'maxN_llama323B'
 'maxP_llama323B']
Total EPQR-A records: 297360
Experiment groups: [np.int64(307), np.int64(308), np.int64(312), np.int64(313), np.int64(343), np.int64(344), np.int64(345), np.int64(356), np.int64(357), np.int64(358), np.int64(359), np.int64(360), np.int64(361), np.int64(362), np.int64(363)]
Models: ['Claude-3.5-s', 'GPT-3.5', 'GPT-4o', 'Llama3.1-70B', 'Llama3.2-3B']
Populations: ['Base', 'MaxN', 'MaxP']
Categories: ['E', 'L', 'N', 'P']

Unnamed: 0,experiments_group_id,model_provider,model,questionnaire,population,personality_id,repeated,experiment_id,question_number,answer,category,key,eval,model_clean,population_mapped,population_display
0,307,openai,gpt-3.5-turbo-0125,epqra,generated_gpt35_spain826,1,0,96033,1,1,N,1,1,GPT-3.5,gpt35,Base
1,307,openai,gpt-3.5-turbo-0125,epqra,generated_gpt35_spain826,1,0,96033,2,0,E,1,0,GPT-3.5,gpt35,Base
2,307,openai,gpt-3.5-turbo-0125,epqra,generated_gpt35_spain826,1,0,96033,3,1,P,0,0,GPT-3.5,gpt35,Base
3,307,openai,gpt-3.5-turbo-0125,epqra,generated_gpt35_spain826,1,0,96033,4,0,E,1,0,GPT-3.5,gpt35,Base
4,307,openai,gpt-3.5-turbo-0125,epqra,generated_gpt35_spain826,1,0,96033,5,0,L,0,1,GPT-3.5,gpt35,Base


In [16]:
# Load reference questionnaire data (input personality profiles)
print("Loading reference questionnaire data...")
reference_data = data_access.load_reference_questionnaires(conn, schema=SCHEMA)

print(f"\nLoaded {len(reference_data)} reference questionnaire records")
print(f"Unique personalities: {reference_data['personality_id'].nunique()}")
print(f"Questions per personality: {len(reference_data) // reference_data['personality_id'].nunique()}")
print(f"Categories: {sorted(reference_data['category'].unique())}")

reference_data.head()

Loading reference questionnaire data...

Loaded 19824 reference questionnaire records
Unique personalities: 826
Questions per personality: 24
Categories: ['E', 'L', 'N', 'P']


Unnamed: 0,id,personality_id,question_number,question,category,key,answer,ref_eval
0,1,1,1,Does your mood often go up and down?,N,True,True,1
1,2,1,2,Are you a talkative person?,E,True,False,0
2,3,1,3,Would being in debt worry you?,P,False,True,0
3,4,1,4,Are you rather lively?,E,True,False,0
4,5,1,5,Were you ever greedy by helping yourself to mo...,L,False,False,1


## Helper Functions from table_accuracy Module

The following functions are now available from the `evaluations.table_accuracy` module:

- `table_accuracy.calculate_accuracy_metrics()` - Binary classification metrics
- `table_accuracy.compute_error_metrics()` - MAE and RMSE calculations
- `table_accuracy.format_metrics_table()` - Publication formatting
- `table_accuracy.prepare_accuracy_data()` - Data preparation
- `table_accuracy.create_table_a6()` - Complete Table A6 generator
- `table_accuracy.create_table_a7()` - Complete Table A7 generator

## Prepare Data for Accuracy Calculations

In [17]:
# Merge experiment data with reference data and calculate binary classification components
accuracy_data = table_accuracy.prepare_accuracy_data(
    experiment_df=epqra_data,
    reference_df=reference_data
)

print(f"Prepared {len(accuracy_data)} records for accuracy analysis")
print(f"Columns: {list(accuracy_data.columns)}")

accuracy_data.head()

Prepared 297360 records for accuracy analysis
Columns: ['experiments_group_id', 'model_clean', 'population_display', 'personality_id', 'experiment_id', 'question_number', 'answer_exp', 'category', 'key_exp', 'answer_ref', 'key_ref', 'equal', 'tp', 'pred_pos', 'act_pos', 'tn', 'pred_neg', 'act_neg']


Unnamed: 0,experiments_group_id,model_clean,population_display,personality_id,experiment_id,question_number,answer_exp,category,key_exp,answer_ref,key_ref,equal,tp,pred_pos,act_pos,tn,pred_neg,act_neg
0,307,GPT-3.5,Base,1,96033,1,1,N,1,True,True,True,1,1,1,0,0,0
1,307,GPT-3.5,Base,1,96033,2,0,E,1,False,True,True,0,0,0,1,1,1
2,307,GPT-3.5,Base,1,96033,3,1,P,0,True,False,True,1,1,1,0,0,0
3,307,GPT-3.5,Base,1,96033,4,0,E,1,False,True,True,0,0,0,1,1,1
4,307,GPT-3.5,Base,1,96033,5,0,L,0,False,False,True,0,0,0,1,1,1


## Filter Data by Population Type

In [18]:
# Filter for Base population
base_data = accuracy_data[
    accuracy_data['population_display'] == 'Base'
].copy()

print(f"Base population records: {len(base_data)}")
print(f"Models: {sorted(base_data['model_clean'].unique())}")
print(f"Categories: {sorted(base_data['category'].unique())}")

Base population records: 99120
Models: ['Claude-3.5-s', 'GPT-3.5', 'GPT-4o', 'Llama3.1-70B', 'Llama3.2-3B']
Categories: ['E', 'L', 'N', 'P']


In [19]:
# Filter for MaxN and MaxP populations (borderline conditions)
borderline_data = accuracy_data[
    accuracy_data['population_display'].isin(['MaxN', 'MaxP'])
].copy()

print(f"Borderline populations records: {len(borderline_data)}")
print(f"Populations: {sorted(borderline_data['population_display'].unique())}")
print(f"Models: {sorted(borderline_data['model_clean'].unique())}")
print(f"Categories: {sorted(borderline_data['category'].unique())}")

Borderline populations records: 198240
Populations: ['MaxN', 'MaxP']
Models: ['Claude-3.5-s', 'GPT-3.5', 'GPT-4o', 'Llama3.1-70B', 'Llama3.2-3B']
Categories: ['E', 'L', 'N', 'P']


---

# Table A6: Base Population Accuracy Metrics

In [20]:
# Calculate accuracy metrics for Base population
base_metrics = table_accuracy.calculate_accuracy_metrics(
    base_data, 
    group_by=['model_clean', 'category']
)

# Calculate error metrics (MAE, RMSE)
base_errors = table_accuracy.compute_error_metrics(
    base_data,
    group_by_base=['model_clean']
)

# Merge metrics with errors
TABLE_A6 = pd.merge(
    base_metrics,
    base_errors,
    on=['model_clean', 'category'],
    how='left'
)

# Format for display
TABLE_A6 = table_accuracy.format_metrics_table(
    TABLE_A6,
    model_col='model_clean',
    population_col=None
)

# Select key columns for display
display_cols = ['model_clean', 'category', 'accuracy', 'precision', 'recall', 'specificity', 'mae', 'rmse']
TABLE_A6_display = TABLE_A6[display_cols]

print("\n" + "="*80)
print("TABLE A6: Accuracy and Error Metrics for Base Population vs Reference")
print("="*80)
print("\nMetrics by Model and EPQR-A Category:")
display(TABLE_A6_display)


TABLE A6: Accuracy and Error Metrics for Base Population vs Reference

Metrics by Model and EPQR-A Category:


Unnamed: 0,model_clean,category,accuracy,precision,recall,specificity,mae,rmse
0,GPT-4o,E,97.68,96.74,98.1,97.34,0.124697,0.444226
1,GPT-4o,N,93.04,93.4,93.0,93.08,0.400726,0.829977
2,GPT-4o,P,98.2,98.25,98.21,98.2,0.095642,0.324541
3,GPT-4o,L,99.23,97.62,97.86,99.51,0.043584,0.2308
4,GPT-3.5,E,91.4,85.78,96.74,87.12,0.382567,0.744634
5,GPT-3.5,N,81.48,79.57,85.96,76.76,1.014528,1.656304
6,GPT-3.5,P,89.79,87.74,92.84,86.65,0.433414,0.718149
7,GPT-3.5,L,98.26,95.86,93.81,99.17,0.096852,0.371503
8,Claude-3.5-s,E,97.7,96.49,98.41,97.13,0.121065,0.431788
9,Claude-3.5-s,N,94.83,91.84,98.7,90.76,0.305085,0.719833


---

# Table A7: MaxN/MaxP Populations Accuracy Metrics

In [21]:
# Calculate accuracy metrics for MaxN/MaxP populations
borderline_metrics = table_accuracy.calculate_accuracy_metrics(
    borderline_data,
    group_by=['model_clean', 'population_display', 'category']
)

# Calculate error metrics (MAE, RMSE)
borderline_errors = table_accuracy.compute_error_metrics(
    borderline_data,
    group_by_base=['model_clean', 'population_display']
)

# Merge metrics with errors
TABLE_A7 = pd.merge(
    borderline_metrics,
    borderline_errors,
    on=['model_clean', 'population_display', 'category'],
    how='left'
)

# Format for display
TABLE_A7 = table_accuracy.format_metrics_table(
    TABLE_A7,
    model_col='model_clean',
    population_col='population_display'
)

# Select key columns for display
display_cols = ['model_clean', 'population_display', 'category', 'accuracy', 'precision', 'recall', 'specificity', 'mae', 'rmse']
TABLE_A7_display = TABLE_A7[display_cols]

print("\n" + "="*80)
print("TABLE A7: Accuracy and Error Metrics for MaxN/MaxP Populations vs Reference")
print("="*80)
print("\nMetrics by Model, Population, and EPQR-A Category:")
display(TABLE_A7_display)


TABLE A7: Accuracy and Error Metrics for MaxN/MaxP Populations vs Reference

Metrics by Model, Population, and EPQR-A Category:


Unnamed: 0,model_clean,population_display,category,accuracy,precision,recall,specificity,mae,rmse
0,GPT-4o,MaxN,E,97.42,96.43,97.83,97.09,0.135593,0.45897
1,GPT-4o,MaxN,N,51.43,51.37,99.57,0.75,2.914044,3.774356
2,GPT-4o,MaxN,P,97.34,96.71,98.09,96.56,0.130751,0.364927
3,GPT-4o,MaxN,L,98.81,96.21,96.79,99.22,0.061743,0.301329
4,GPT-4o,MaxP,E,97.52,96.35,98.14,97.02,0.131961,0.452328
5,GPT-4o,MaxP,N,86.2,87.4,85.41,87.03,0.748184,1.247758
6,GPT-4o,MaxP,P,35.01,38.3,46.02,23.67,0.855932,1.139751
7,GPT-4o,MaxP,L,97.54,90.8,95.12,98.03,0.121065,0.484631
8,GPT-3.5,MaxN,E,86.68,79.88,93.7,81.05,0.535109,0.995146
9,GPT-3.5,MaxN,N,50.5,51.33,67.55,32.56,2.582324,3.188015


---

## Summary

### Table A6 Interpretation

Shows how well each model reproduces the reference personality profiles for the **Base population** across the four EPQR-A scales (E, N, P, L).

- **Accuracy**: Overall correctness of binary Yes/No predictions
- **Precision/Recall/Specificity**: Detailed binary classification performance
- **MAE/RMSE**: Scale-level error in reproducing trait intensity

### Table A7 Interpretation

Compares the **MaxN** (maximized Neuroticism) and **MaxP** (maximized Psychoticism) borderline populations against the reference, showing:

- Whether extreme personality conditions affect model accuracy
- Differences in error patterns between borderline populations
- Model-specific robustness to personality extremes

**Note**: All metrics are formatted using `custom_format()` from the paper's original implementation, ensuring exact replication of published results.

## Merge Experiment and Reference Data

Match LLM-generated answers with their corresponding reference (input) answers by `personality_id` and `question_number`.

In [26]:
# Merge experiment data with reference data
# Select relevant columns from epqra_data
exp_cols = [
    'experiments_group_id', 'model_clean', 'population_display', 
    'personality_id', 'experiment_id', 'question_number', 
    'answer', 'category', 'key'
]

# Select relevant columns from reference_data
ref_cols = [
    'personality_id', 'question_number', 'answer', 'key'
]

# Create merged dataframe
accuracy_df = pd.merge(
    epqra_data[exp_cols],
    reference_data[ref_cols],
    on=['personality_id', 'question_number'],
    how='left',
    suffixes=('_exp', '_ref')
)

print(f"Merged {len(accuracy_df)} records")
print(f"\nColumns: {accuracy_df.columns.tolist()}")
accuracy_df.head()

Merged 297360 records

Columns: ['experiments_group_id', 'model_clean', 'population_display', 'personality_id', 'experiment_id', 'question_number', 'answer_exp', 'category', 'key_exp', 'answer_ref', 'key_ref']


Unnamed: 0,experiments_group_id,model_clean,population_display,personality_id,experiment_id,question_number,answer_exp,category,key_exp,answer_ref,key_ref
0,307,GPT-3.5,Base,1,96033,1,1,N,1,True,True
1,307,GPT-3.5,Base,1,96033,2,0,E,1,False,True
2,307,GPT-3.5,Base,1,96033,3,1,P,0,True,False
3,307,GPT-3.5,Base,1,96033,4,0,E,1,False,True
4,307,GPT-3.5,Base,1,96033,5,0,L,0,False,False
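
The merge above is many-to-one: many experiment rows share one reference row per (`personality_id`, `question_number`) pair. A minimal sketch of how that assumption could be checked uses the `validate` and `indicator` parameters of `pd.merge`. The toy frames `exp` and `ref` are made up and stand in for `epqra_data` and `reference_data`.

```python
import pandas as pd

# Toy stand-ins for epqra_data and reference_data (values are made up)
exp = pd.DataFrame({
    "experiment_id": [1, 1, 2, 2],
    "personality_id": [10, 10, 10, 11],
    "question_number": [1, 2, 1, 1],
    "answer": [1, 0, 1, 0],
})
ref = pd.DataFrame({
    "personality_id": [10, 10],
    "question_number": [1, 2],
    "answer": [True, False],
})

merged = pd.merge(
    exp,
    ref,
    on=["personality_id", "question_number"],
    how="left",
    suffixes=("_exp", "_ref"),
    validate="many_to_one",  # raises MergeError if reference keys repeat
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
)

# Experiment rows that found no matching reference answer
unmatched = merged[merged["_merge"] == "left_only"]
print(f"{len(unmatched)} of {len(merged)} rows have no reference answer")
```

With `how='left'`, unmatched rows keep a missing `answer_ref`, so they would count as mismatches in any later comparison unless they are inspected or dropped first.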


## Calculate Binary Classification Metrics

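For each experiment-question pair, compute:
- Whether the answers match (`equal`)
- Per-row indicators for true positives (`tp`), predicted positives (`pred_pos`), and actual positives (`act_pos`), plus their negative counterparts (`tn`, `pred_neg`, `act_neg`)

False positives and false negatives are not stored directly; they follow as `pred_pos - tp` and `act_pos - tp`.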

In [27]:
# Calculate equality (simple accuracy)
accuracy_df['equal'] = accuracy_df['answer_exp'] == accuracy_df['answer_ref']

# Compute true positives, predicted positives, and actual positives
accuracy_df['tp'] = ((accuracy_df['answer_exp'] == True) & (accuracy_df['answer_ref'] == True)).astype(int)
accuracy_df['pred_pos'] = (accuracy_df['answer_exp'] == True).astype(int)
accuracy_df['act_pos'] = (accuracy_df['answer_ref'] == True).astype(int)

# Compute true negatives, predicted negatives, and actual negatives
accuracy_df['tn'] = ((accuracy_df['answer_exp'] == False) & (accuracy_df['answer_ref'] == False)).astype(int)
accuracy_df['pred_neg'] = (accuracy_df['answer_exp'] == False).astype(int)
accuracy_df['act_neg'] = (accuracy_df['answer_ref'] == False).astype(int)

print("Binary classification metrics calculated")
print("\nSample confusion matrix components:")
accuracy_df[[
    'model_clean', 'population_display', 'category', 'question_number',
    'equal', 'tp', 'tn', 'pred_pos', 'act_pos'
]].head(10)

Binary classification metrics calculated

Sample confusion matrix components:


Unnamed: 0,model_clean,population_display,category,question_number,equal,tp,tn,pred_pos,act_pos
0,GPT-3.5,Base,N,1,True,1,0,1,1
1,GPT-3.5,Base,E,2,True,0,1,0,0
2,GPT-3.5,Base,P,3,True,1,0,1,1
3,GPT-3.5,Base,E,4,True,0,1,0,0
4,GPT-3.5,Base,L,5,True,0,1,0,0
5,GPT-3.5,Base,P,6,True,0,1,0,0
6,GPT-3.5,Base,L,7,True,0,1,0,0
7,GPT-3.5,Base,P,8,True,0,1,0,0
8,GPT-3.5,Base,N,9,True,1,0,1,1
9,GPT-3.5,Base,L,10,True,0,1,0,0


## Helper Functions for Metric Calculation

In [28]:
# Helper functions are now available from the table_accuracy module:
# - table_accuracy.calculate_accuracy_metrics()
# - table_accuracy.compute_error_metrics()
# - table_accuracy.format_metrics_table()
print("✅ Using table_accuracy module functions")
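
For readers without access to the module, the sketch below shows one way the per-row indicator columns could be aggregated into group-level percentages. It is an assumption about the general approach, not the actual code of `table_accuracy.calculate_accuracy_metrics()`; the function name `aggregate_metrics` and the toy data are hypothetical.

```python
import pandas as pd


def aggregate_metrics(df: pd.DataFrame, group_by: list) -> pd.DataFrame:
    """Aggregate per-row indicators into percentage metrics (illustrative)."""
    cols = ["equal", "tp", "pred_pos", "act_pos", "tn", "act_neg"]
    sums = df.groupby(group_by)[cols].sum()
    counts = df.groupby(group_by).size()
    out = pd.DataFrame({
        "accuracy": 100 * sums["equal"] / counts,
        "precision": 100 * sums["tp"] / sums["pred_pos"],   # TP / (TP + FP)
        "recall": 100 * sums["tp"] / sums["act_pos"],       # TP / (TP + FN)
        "specificity": 100 * sums["tn"] / sums["act_neg"],  # TN / (TN + FP)
    })
    return out.round(2).reset_index()


# Made-up answers for one model and one scale: 2 TP, 1 FP, 1 TN, 1 FN
toy = pd.DataFrame({
    "model_clean": ["A"] * 5,
    "category": ["N"] * 5,
    "answer_exp": [1, 1, 0, 0, 1],
    "answer_ref": [True, False, False, True, True],
})
exp_yes = toy["answer_exp"].astype(bool)
ref_yes = toy["answer_ref"].astype(bool)
toy["equal"] = exp_yes == ref_yes
toy["tp"] = (exp_yes & ref_yes).astype(int)
toy["pred_pos"] = exp_yes.astype(int)
toy["act_pos"] = ref_yes.astype(int)
toy["tn"] = (~exp_yes & ~ref_yes).astype(int)
toy["act_neg"] = (~ref_yes).astype(int)

result = aggregate_metrics(toy, ["model_clean", "category"])
print(result)
```

Summing indicators before dividing pools all answers in a group, so every question carries equal weight; averaging per-personality ratios instead would weight each personality equally and can give different numbers.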

## Table A6: Base Population Accuracy Metrics

Average accuracy and error metrics by EPQR-A scale for **Base population** compared to input reference.

This table shows how well LLMs reproduce the reference personality profiles when generating standard (non-manipulated) personas.

In [29]:
# Filter to Base population only
base_data = accuracy_df[accuracy_df['population_display'] == 'Base'].copy()

print(f"Base population records: {len(base_data)}")
print(f"Models in Base population: {sorted(base_data['model_clean'].unique())}")
print(f"Categories: {sorted(base_data['category'].unique())}")

Base population records: 99120
Models in Base population: ['Claude-3.5-s', 'GPT-3.5', 'GPT-4o', 'Llama3.1-70B', 'Llama3.2-3B']
Categories: ['E', 'L', 'N', 'P']


In [30]:
# Calculate accuracy metrics for Base population
base_metrics = table_accuracy.calculate_accuracy_metrics(
    base_data,
    group_by=['model_clean', 'category']
)

# Calculate error metrics (MAE, RMSE)
base_errors = table_accuracy.compute_error_metrics(
    base_data,
    group_by_base=['model_clean']
)

# Merge metrics with errors
TABLE_A6 = pd.merge(
    base_metrics,
    base_errors,
    on=['model_clean', 'category'],
    how='left'
)

# Format for display
TABLE_A6 = table_accuracy.format_metrics_table(
    TABLE_A6,
    model_col='model_clean',
    population_col=None
)

# Select key columns for display
display_cols = ['model_clean', 'category', 'accuracy', 'precision', 'recall', 'specificity', 'mae', 'rmse']
TABLE_A6_display = TABLE_A6[display_cols]

print("\n" + "="*80)
print("TABLE A6: Accuracy and Error Metrics for Base Population vs Reference")
print("="*80)
print("\nMetrics by Model and EPQR-A Category:")
display(TABLE_A6_display)

## Table A7: MaxN/MaxP Population Accuracy Metrics

Average accuracy and error metrics by EPQR-A scale for **MaxN and MaxP populations** compared to input reference.

This table shows how well LLMs reproduce the reference personality profiles when generating personas with manipulated traits (maximized Neuroticism or Psychoticism).

In [31]:
# Filter to MaxN and MaxP populations
borderline_data = accuracy_df[
    accuracy_df['population_display'].isin(['MaxN', 'MaxP'])
].copy()

print(f"MaxN/MaxP population records: {len(borderline_data)}")
print(f"Models: {sorted(borderline_data['model_clean'].unique())}")
print(f"Populations: {sorted(borderline_data['population_display'].unique())}")
print(f"Categories: {sorted(borderline_data['category'].unique())}")

MaxN/MaxP population records: 198240
Models: ['Claude-3.5-s', 'GPT-3.5', 'GPT-4o', 'Llama3.1-70B', 'Llama3.2-3B']
Populations: ['MaxN', 'MaxP']
Categories: ['E', 'L', 'N', 'P']


In [32]:
# Calculate accuracy metrics for MaxN/MaxP populations
borderline_metrics = table_accuracy.calculate_accuracy_metrics(
    borderline_data,
    group_by=['model_clean', 'population_display', 'category']
)

# Calculate error metrics (MAE, RMSE)
borderline_errors = table_accuracy.compute_error_metrics(
    borderline_data,
    group_by_base=['model_clean', 'population_display']
)

# Merge metrics with errors
TABLE_A7 = pd.merge(
    borderline_metrics,
    borderline_errors,
    on=['model_clean', 'population_display', 'category'],
    how='left'
)

# Format for display
TABLE_A7 = table_accuracy.format_metrics_table(
    TABLE_A7,
    model_col='model_clean',
    population_col='population_display'
)

# Select key columns for display
display_cols = ['model_clean', 'population_display', 'category', 'accuracy', 'precision', 'recall', 'specificity', 'mae', 'rmse']
TABLE_A7_display = TABLE_A7[display_cols]

print("\n" + "="*80)
print("TABLE A7: Accuracy and Error Metrics for MaxN/MaxP Populations vs Reference")
print("="*80)
print("\nMetrics by Model, Population, and EPQR-A Category:")
display(TABLE_A7_display)

## Summary Statistics

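Report the dimensions of each table (rows, models, populations, categories). Averages are not computed here because the values are already formatted for display.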

In [33]:
# Note: Since values are formatted as strings using custom_format,
# we display the tables directly without additional statistics

print("\n" + "="*60)
print("TABLE A6: Base Population Summary")
print("="*60)
print(f"Total rows: {len(TABLE_A6_display)}")
print(f"Models: {TABLE_A6_display['model_clean'].nunique()}")
print(f"Categories: {TABLE_A6_display['category'].nunique()}")

print("\n" + "="*60)
print("TABLE A7: MaxN/MaxP Populations Summary")
print("="*60)
print(f"Total rows: {len(TABLE_A7_display)}")
print(f"Models: {TABLE_A7_display['model_clean'].nunique()}")
print(f"Populations: {TABLE_A7_display['population_display'].nunique()}")
print(f"Categories: {TABLE_A7_display['category'].nunique()}")


TABLE A6: Base Population Summary
Total rows: 20
Models: 5
Categories: 4

TABLE A7: MaxN/MaxP Populations Summary
Total rows: 40
Models: 5
Populations: 2
Categories: 4


## Export Tables

Save tables for inclusion in the paper.

In [34]:
# Save Table A6
output_a6 = "table_a6_base_accuracy.csv"
TABLE_A6_display.to_csv(output_a6, index=False)
print(f"✅ Table A6 saved to: {output_a6}")

# Save Table A7
output_a7 = "table_a7_borderline_accuracy.csv"
TABLE_A7_display.to_csv(output_a7, index=False)
print(f"✅ Table A7 saved to: {output_a7}")

✅ Table A6 saved to: table_a6_base_accuracy.csv
✅ Table A7 saved to: table_a7_borderline_accuracy.csv
