# Table 4: EPQR-A Personality Scores Analysis

This notebook replicates **Table 4** from the paper, which reports mean EPQR-A personality scores for each LLM and population condition, together with statistical significance tests.

## Overview

**Table 4** compares personality trait scores (E, N, P, L) across:
- **LLM-generated personas** (from five models)
- **Reference questionnaires** (baseline personality profiles used as input)
- **Random baseline** (control condition with randomly generated responses)

The analysis applies statistical tests at two levels:
1. **Individual-level**: paired t-tests comparing model and reference scores for the same personalities
2. **Population-level**: one-sample t-tests checking whether the mean difference deviates from zero
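
The two tests named above can be sketched with `scipy.stats` on synthetic data. This is only an illustration under stated assumptions: the scores are randomly generated, the variable names are made up for the example, and the package's own implementation may prepare the differences in another way.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic scores on the 0-6 scale for 50 hypothetical personas.
ref_scores = rng.integers(0, 7, size=50)
model_scores = np.clip(ref_scores + rng.integers(-1, 3, size=50), 0, 6)

# Individual-level: paired t-test on matched (model, reference) scores.
paired = stats.ttest_rel(model_scores, ref_scores)

# Population-level: one-sample t-test of the differences against zero.
differences = model_scores - ref_scores
one_sample = stats.ttest_1samp(differences, popmean=0.0)

print(f"paired:     t={paired.statistic:.3f}, p={paired.pvalue:.4f}")
print(f"one-sample: t={one_sample.statistic:.3f}, p={one_sample.pvalue:.4f}")
```

When the one-sample test is run on the same per-persona differences, as in this sketch, it is algebraically the same test as the paired one, so both lines print identical values. Whether the package aggregates differences differently before the population-level test is not shown in this notebook.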

## Statistical Methods

### Significance Markers
- **Individual-level (paired t-test)**:
  - `*` = p < 0.05
  - `†` = p < 0.01
- **Population-level (one-sample t-test)**:
  - `§` = p < 0.05
  - `¶` = p < 0.01
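
The marker scheme above could be expressed as a small helper. `significance_markers` is a hypothetical name used only for this sketch; it is not part of the `evaluations` package.

```python
def significance_markers(p_individual: float, p_population: float) -> str:
    """Return the marker string for one table cell (hypothetical helper)."""
    markers = ""
    # Individual-level (paired t-test): check the stricter threshold first.
    if p_individual < 0.01:
        markers += "†"
    elif p_individual < 0.05:
        markers += "*"
    # Population-level (one-sample t-test): same thresholds, different symbols.
    if p_population < 0.01:
        markers += "¶"
    elif p_population < 0.05:
        markers += "§"
    return markers


print(repr(significance_markers(0.003, 0.02)))  # '†§'
print(repr(significance_markers(0.03, 0.2)))    # '*'
print(repr(significance_markers(0.2, 0.2)))     # ''
```

Checking the 0.01 threshold before 0.05 ensures each level contributes at most one symbol.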

### EPQR-A Dimensions
- **E**: Extraversion (social energy, outgoingness)
- **N**: Neuroticism (emotional instability, anxiety)
- **P**: Psychoticism (unconventional thinking, tough-mindedness)
- **L**: Lie scale (social desirability, response validity)

## Data Sources
- **Reference questionnaires**: 19,824 item responses (826 personality profiles × 24 questions)
- **Random questionnaires**: 19,824 randomly generated responses (control baseline)
- **Experiment evaluations**: 541,680 LLM responses across all models and conditions
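
As a quick sanity check, the record counts reported in the cell outputs further down can be derived from these figures. The sketch assumes each of the 15 populations (5 models × 3 conditions) scores every persona exactly once per category, which matches the `repeated = 0` rows shown in the sample data but is not verified here.

```python
N_PERSONAS = 826
N_QUESTIONS = 24      # EPQR-A items per questionnaire
N_CATEGORIES = 4      # E, N, P, L
N_POPULATIONS = 15    # 5 models x 3 conditions (Base, MaxN, MaxP)

item_responses = N_PERSONAS * N_QUESTIONS           # rows in a questionnaire table
scores_per_source = N_PERSONAS * N_CATEGORIES       # one score per persona and category
experiment_scores = scores_per_source * N_POPULATIONS

print(item_responses)     # 19824
print(scores_per_source)  # 3304
print(experiment_scores)  # 49560
```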

In [1]:
# =============================================================================
# SETUP: Import Required Libraries
# =============================================================================

import scipy.stats as stats
import numpy as np
import pandas as pd

# Import the evaluations package modules
from evaluations import data_access, table_personality

# Configuration: PostgreSQL schema containing the research data
SCHEMA = "personality_trap"

In [2]:
# =============================================================================
# DATABASE CONNECTION
# =============================================================================

from personas_backend.db.db_handler import DatabaseHandler

# Establish connection to PostgreSQL database
# Uses configuration from personas_backend (environment variables or YAML config)
db_handler = DatabaseHandler()
conn = db_handler.connection

print(f"Connected to database: {conn}")
conn

Connected to database: Engine(postgresql://personas:***@localhost:5432/personas)


Engine(postgresql://personas:***@localhost:5432/personas)

In [3]:
# =============================================================================
# STEP 1: Load LLM Experiment ENPL Scores
# =============================================================================
# Load aggregated ENPL scores from all LLM experiments
# This query sums correct answers per category (E, N, P, L) for each experiment

print(f"Loading experiment ENPL scores from {SCHEMA}.experiments_evals...")
questionnaire_enlp_data = data_access.load_experiment_enpl_scores(
    conn,
    schema=SCHEMA,
    experiment_groups=None,  # Include all experiment groups
    questionnaires=["epqra"],  # Use EPQR-A questionnaire only
)

# Display summary statistics
print(f"\nLoaded {len(questionnaire_enlp_data):,} experiment score records")
print(f"Unique models: {questionnaire_enlp_data['model_clean'].nunique()}")
print(f"Unique populations: {questionnaire_enlp_data['population_mapped'].nunique()}")
print(f"Categories: {sorted(questionnaire_enlp_data['category'].unique())}")
print("\nSample data:")
questionnaire_enlp_data.head()

Loading experiment ENPL scores from personality_trap.experiments_evals...
Loaded 49560 experiment score records
Models: ['GPT-3.5' 'GPT-4o' 'Claude-3.5-s' 'Llama3.2-3B' 'Llama3.1-70B']
Populations: ['gpt35' 'gpt4o' 'maxN_gpt4o' 'maxP_gpt4o' 'claude35sonnet' 'llama323B'
 'llama3170B' 'maxN_claude35sonnet' 'maxP_claude35sonnet' 'maxN_gpt35'
 'maxP_gpt35' 'maxN_llama3170B' 'maxP_llama3170B' 'maxN_llama323B'
 'maxP_llama323B']
Categories: ['E', 'L', 'N', 'P']

Loaded 49,560 experiment score records
Unique models: 5
Unique populations: 15
Categories: ['E', 'L', 'N', 'P']

Sample data:

| | experiments_group_id | model_clean | population_mapped | population_display | experiment_id | personality_id | repeated | category | score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 307 | GPT-3.5 | gpt35 | Base | 96033 | 1 | 0 | E | 0 |
| 1 | 307 | GPT-3.5 | gpt35 | Base | 96033 | 1 | 0 | L | 6 |
| 2 | 307 | GPT-3.5 | gpt35 | Base | 96033 | 1 | 0 | N | 6 |
| 3 | 307 | GPT-3.5 | gpt35 | Base | 96033 | 1 | 0 | P | 0 |
| 4 | 307 | GPT-3.5 | gpt35 | Base | 96034 | 2 | 0 | E | 1 |
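
The comment in the cell above says the loader sums answers per category for each experiment. A minimal pandas sketch of that kind of aggregation follows. The item-level layout and the `points` column are assumptions made for illustration; the real query inside `data_access.load_experiment_enpl_scores` is not shown in this notebook.

```python
import pandas as pd

# Hypothetical item-level layout: one row per answered question, with
# `points` = 1 when the answer counts toward the trait and 0 otherwise.
item_level = pd.DataFrame(
    {
        "experiment_id": [96033] * 4 + [96034] * 4,
        "personality_id": [1] * 4 + [2] * 4,
        "category": ["E", "E", "N", "N", "E", "E", "N", "N"],
        "points": [0, 0, 1, 1, 1, 0, 1, 0],
    }
)

# Sum the points within each (experiment, persona, category) group.
category_scores = (
    item_level
    .groupby(["experiment_id", "personality_id", "category"], as_index=False)["points"]
    .sum()
    .rename(columns={"points": "score"})
)
print(category_scores)
```

The result has one row per experiment and category, which is the same long format as the sample output above.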


In [4]:
# =============================================================================
# STEP 2: Load Reference ENPL Scores (Baseline Input)
# =============================================================================
# Load reference questionnaire scores - these are the personality profiles
# that were used as INPUT to generate the LLM personas
# Each score represents the sum of correct answers per category

print("Loading reference ENPL scores (baseline personality profiles)...")
reference_data = data_access.load_reference_enpl_scores(conn, schema=SCHEMA)

print(f"\nLoaded {len(reference_data):,} reference score records")
print(f"Unique personalities: {reference_data['personality_id'].nunique()}")
print(f"Categories: {sorted(reference_data['category'].unique())}")
print("\nSample reference data:")
reference_data.head()

Loading reference ENPL scores (baseline personality profiles)...

Loaded 3,304 reference score records
Unique personalities: 826
Categories: ['E', 'L', 'N', 'P']

Sample reference data:


| | personality_id | category | ref_score |
|---|---|---|---|
| 0 | 652 | P | 2 |
| 1 | 290 | N | 6 |
| 2 | 150 | N | 0 |
| 3 | 375 | N | 3 |
| 4 | 142 | L | 6 |


In [5]:
# =============================================================================
# STEP 3: Load Random Baseline ENPL Scores (Null Hypothesis Control)
# =============================================================================
# Load random questionnaire scores - these are RANDOMLY GENERATED responses
# that serve as a control/null hypothesis baseline for statistical comparison
# Purpose: Test whether observed patterns differ from pure random chance

print("Loading random baseline ENPL scores (null hypothesis control)...")
random_data = data_access.load_random_enpl_scores(conn, schema=SCHEMA)

print(f"\nLoaded {len(random_data):,} random baseline score records")
print(f"Unique personalities: {random_data['personality_id'].nunique()}")
print(f"Categories: {sorted(random_data['category'].unique())}")
print("\nSample random baseline data:")
random_data.head()

Loading random baseline ENPL scores (null hypothesis control)...

Loaded 3,304 random baseline score records
Unique personalities: 826
Categories: ['E', 'L', 'N', 'P']

Sample random baseline data:


| | personality_id | category | random_score |
|---|---|---|---|
| 0 | 652 | P | 1 |
| 1 | 290 | N | 4 |
| 2 | 150 | N | 2 |
| 3 | 375 | N | 3 |
| 4 | 142 | L | 6 |


In [6]:
# =============================================================================
# STEP 4: Merge Experiment Scores with Reference Scores
# =============================================================================
# Join experiment data with reference data on personality_id and category
# This enables paired statistical testing (same personality, different sources)

merged_data = pd.merge(
    questionnaire_enlp_data,
    reference_data,
    on=["personality_id", "category"],
    suffixes=("_experiment", "_reference"),
    validate="many_to_one",  # Many experiments can map to one reference profile
)

print(f"Merged data has {len(merged_data):,} entries")
print(f"Columns: {merged_data.columns.tolist()}")
print("\nSample merged data (showing experiment score vs reference score):")
merged_data.head()

Merged data has 49,560 entries
Columns: ['experiments_group_id', 'model_clean', 'population_mapped', 'population_display', 'experiment_id', 'personality_id', 'repeated', 'category', 'score', 'ref_score']

Sample merged data (showing experiment score vs reference score):


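| | experiments_group_id | model_clean | population_mapped | population_display | experiment_id | personality_id | repeated | category | score | ref_score |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 307 | GPT-3.5 | gpt35 | Base | 96033 | 1 | 0 | E | 0 | 0 |
| 1 | 307 | GPT-3.5 | gpt35 | Base | 96033 | 1 | 0 | L | 6 | 6 |
| 2 | 307 | GPT-3.5 | gpt35 | Base | 96033 | 1 | 0 | N | 6 | 6 |
| 3 | 307 | GPT-3.5 | gpt35 | Base | 96033 | 1 | 0 | P | 0 | 0 |
| 4 | 307 | GPT-3.5 | gpt35 | Base | 96034 | 2 | 0 | E | 1 | 0 |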
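
The `validate="many_to_one"` argument in the merge above guards against duplicated reference keys, and the unchanged row count (49,560 before and after) indicates that every experiment row found a reference match. Both checks can be sketched on toy data; the frames below are invented for illustration.

```python
import pandas as pd

experiments = pd.DataFrame(
    {"personality_id": [1, 1, 2], "category": ["E", "E", "E"], "score": [0, 1, 3]}
)
reference = pd.DataFrame(
    {"personality_id": [1, 2], "category": ["E", "E"], "ref_score": [0, 2]}
)

merged = pd.merge(
    experiments, reference, on=["personality_id", "category"], validate="many_to_one"
)
# An inner join keeps every experiment row only if each key has a reference match.
assert len(merged) == len(experiments)

# A duplicated key on the reference side violates many_to_one and raises.
bad_reference = pd.concat([reference, reference.iloc[[0]]], ignore_index=True)
try:
    pd.merge(
        experiments, bad_reference, on=["personality_id", "category"], validate="many_to_one"
    )
    raised = False
except pd.errors.MergeError:
    raised = True
print(f"Duplicate reference key rejected: {raised}")
```

Without `validate`, a duplicated reference row would silently duplicate experiment rows and inflate the sample size used by the paired tests.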


## Statistical Testing

Statistical testing is handled by the `evaluations.table_personality` module.

It provides two key functions:
1. **`perform_enpl_tests()`**: Performs dual-level statistical testing (paired + one-sample t-tests)
2. **`create_personality_table()`**: Formats results into publication-ready table

See the package documentation for full details of the statistical methodology.
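
As a rough picture of what a per-group test loop might look like, the sketch below runs a paired t-test for each (model, population, category) combination on synthetic data. `run_paired_tests` and the synthetic frame are hypothetical; the actual `perform_enpl_tests()` implementation is not shown in this notebook and may differ.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)
models = ["GPT-4o", "GPT-3.5", "Claude-3.5-s", "Llama3.1-70B", "Llama3.2-3B"]
populations = ["Base", "MaxN", "MaxP"]
categories = ["E", "N", "P", "L"]

# Synthetic stand-in for the merged data: 30 personas per combination.
rows = []
for model in models:
    for population in populations:
        for category in categories:
            ref = rng.integers(0, 7, size=30)
            score = np.clip(ref + rng.integers(-2, 3, size=30), 0, 6)
            rows.append(pd.DataFrame({
                "model_clean": model,
                "population_display": population,
                "category": category,
                "score": score,
                "ref_score": ref,
            }))
synthetic = pd.concat(rows, ignore_index=True)


def run_paired_tests(data: pd.DataFrame) -> pd.DataFrame:
    """Paired t-test of score vs ref_score per (model, population, category)."""
    records = []
    group_cols = ["model_clean", "population_display", "category"]
    for (model, population, category), group in data.groupby(group_cols):
        result = stats.ttest_rel(group["score"], group["ref_score"])
        records.append({
            "model": model,
            "population": population,
            "category": category,
            "mean": group["score"].mean(),
            "std": group["score"].std(),
            "p_value": result.pvalue,
        })
    return pd.DataFrame(records)


results = run_paired_tests(synthetic)
print(len(results))  # 60 = 5 models x 3 populations x 4 categories
```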

In [7]:
# =============================================================================
# STEP 5: Run Statistical Tests and Create Table 4
# =============================================================================
# Uses the evaluations.table_personality module for statistical testing

print("Running dual-level statistical tests (paired + one-sample t-tests)...")

# Perform statistical tests using package function
t_test_results = table_personality.perform_enpl_tests(
    data=merged_data,
    reference_data=reference_data,
    alpha=0.05,
)

print(f"Completed {len(t_test_results)} statistical tests")
print(f"\nModels tested: {sorted(t_test_results['model'].unique())}")
print(f"Populations tested: {sorted(t_test_results['population'].unique())}")
print(f"Categories tested: {sorted(t_test_results['category'].unique())}")

# ─────────────────────────────────────────────────────────────────────────────
# Create formatted table using package function
# ─────────────────────────────────────────────────────────────────────────────

print("\nCreating publication-ready table...")
pivot_results = table_personality.create_personality_table(
    test_results=t_test_results,
    # Model and population orders can be customized:
    # model_order=['GPT-4o', 'GPT-3.5', ...],
    # population_order=['Base', 'MaxN', 'MaxP'],
)

# ─────────────────────────────────────────────────────────────────────────────
# Display Table 4 with legend
# ─────────────────────────────────────────────────────────────────────────────

print("\n" + "="*80)
print("TABLE 4: EPQR-A Personality Scores (Mean ± SD)")
print("="*80)
print("\nLLM-generated persona scores compared to reference personality profiles")
print("\nSignificance markers (dual-level statistical testing):")
print("  Individual-level (paired t-test: model vs reference for same personas):")
print("    * = p < 0.05")
print("    † = p < 0.01")
print("  Population-level (one-sample t-test: mean difference from zero):")
print("    § = p < 0.05")
print("    ¶ = p < 0.01")
print("\nNote: Multiple markers indicate significance at both levels")
print("      (e.g., '†§' means individual p<0.01 AND population p<0.05)")
print("\n")

pivot_results

Running dual-level statistical tests (paired + one-sample t-tests)...
Completed 60 statistical tests

Models tested: ['Claude-3.5-s', 'GPT-3.5', 'GPT-4o', 'Llama3.1-70B', 'Llama3.2-3B']
Populations tested: ['Base', 'MaxN', 'MaxP']
Categories tested: ['E', 'L', 'N', 'P']

Creating publication-ready table...

TABLE 4: EPQR-A Personality Scores (Mean ± SD)

LLM-generated persona scores compared to reference personality profiles

Significance markers (dual-level statistical testing):
  Individual-level (paired t-test: model vs reference for same personas):
    * = p < 0.05
    † = p < 0.01
  Population-level (one-sample t-test: mean difference from zero):
    § = p < 0.05
    ¶ = p < 0.01

Note: Multiple markers indicate significance at both levels
      (e.g., '†§' means individual p<0.01 AND population p<0.05)



| | model | population | E | N | P | L |
|---|---|---|---|---|---|---|
| 6 | GPT-4o | Base | 2.16 ± 2.85†¶ | 3.06 ± 2.59 | 0.87 ± 1.04 | 5.92 ± 0.48†¶ |
| 7 | GPT-4o | MaxN | 2.15 ± 2.86†¶ | 5.96 ± 0.29†¶ | 0.91 ± 1.11†¶ | 5.91 ± 0.57*§ |
| 8 | GPT-4o | MaxP | 2.17 ± 2.87†¶ | 3.01 ± 2.73 | 4.75 ± 1.03†¶ | 5.85 ± 0.72 |
| 3 | GPT-3.5 | Base | 2.52 ± 2.64†¶ | 3.32 ± 2.36†¶ | 0.96 ± 0.93†¶ | 5.93 ± 0.41†¶ |
| 4 | GPT-3.5 | MaxN | 2.94 ± 2.62†¶ | 4.05 ± 1.76†¶ | 1.17 ± 0.95†¶ | 5.94 ± 0.38†¶ |
| 5 | GPT-3.5 | MaxP | 3.71 ± 2.32†¶ | 2.61 ± 2.30†¶ | 3.12 ± 1.18†¶ | 5.88 ± 0.49 |
| 0 | Claude-3.5-s | Base | 2.19 ± 2.85†¶ | 3.31 ± 2.44†¶ | 0.61 ± 0.91†¶ | 5.90 ± 0.60 |
| 1 | Claude-3.5-s | MaxN | 2.18 ± 2.86†¶ | 6.00 ± 0.00†¶ | 0.72 ± 0.97†¶ | 5.87 ± 0.61 |
| 2 | Claude-3.5-s | MaxP | 2.22 ± 2.85*§ | 3.15 ± 2.68*§ | 5.51 ± 0.65†¶ | 5.42 ± 1.42†¶ |
| 12 | Llama3.2-3B | Base | 2.55 ± 2.48†¶ | 2.42 ± 2.18†¶ | 1.38 ± 1.11†¶ | 5.22 ± 0.44†¶ |
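
The wide layout above, with one row per model and population and one `mean ± SD` cell per trait, can be produced from long-format results with a pivot. The sketch below is an assumption about how such formatting might be done, not the code of `create_personality_table()`. The numbers are copied from the GPT-4o Base row, and the `markers` column is hypothetical.

```python
import pandas as pd

# Hypothetical long-format results: one row per (model, population, category).
long_results = pd.DataFrame({
    "model": ["GPT-4o"] * 4,
    "population": ["Base"] * 4,
    "category": ["E", "N", "P", "L"],
    "mean": [2.16, 3.06, 0.87, 5.92],
    "std": [2.85, 2.59, 1.04, 0.48],
    "markers": ["†¶", "", "", "†¶"],
})

# Build the display string for each cell: mean ± SD followed by any markers.
long_results["cell"] = [
    f"{mean:.2f} ± {std:.2f}{markers}"
    for mean, std, markers in zip(
        long_results["mean"], long_results["std"], long_results["markers"]
    )
]

# Pivot categories into columns and fix the column order to E, N, P, L.
table = (
    long_results
    .pivot(index=["model", "population"], columns="category", values="cell")
    .reindex(columns=["E", "N", "P", "L"])
    .reset_index()
)
table.columns.name = None
print(table.to_string(index=False))
```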


In [8]:
# =============================================================================
# SUPPLEMENTARY: Reference Population Statistics
# =============================================================================
# Calculate mean ± std for the reference (input) personality profiles
# Uses the package function for consistent formatting

reference_summary = table_personality.compute_baseline_statistics(
    reference_data,
    score_column='ref_score',
    name='Reference'
)

print("="*60)
print("REFERENCE POPULATION: Baseline ENPL Scores")
print("="*60)
print("(Input personality profiles used to generate personas)")
print()
reference_summary

REFERENCE POPULATION: Baseline ENPL Scores
(Input personality profiles used to generate personas)



| | Category | Reference Score (Mean ± SD) |
|---|---|---|
| 0 | E | 2.26 ± 2.79 |
| 1 | L | 5.89 ± 0.54 |
| 2 | N | 3.08 ± 2.42 |
| 3 | P | 0.85 ± 0.97 |
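
A per-category summary like the one above can be sketched with a pandas `groupby`. `summarize_scores` is a hypothetical stand-in for `compute_baseline_statistics`, whose implementation is not shown here. In particular, the sketch uses pandas' default sample standard deviation (`ddof=1`), and the package may use a different convention.

```python
import pandas as pd


def summarize_scores(data: pd.DataFrame, score_column: str, name: str) -> pd.DataFrame:
    """Mean ± SD per category (hypothetical stand-in, not the package code)."""
    stats_by_category = data.groupby("category")[score_column].agg(["mean", "std"])
    formatted = [
        f"{mean:.2f} ± {std:.2f}"
        for mean, std in zip(stats_by_category["mean"], stats_by_category["std"])
    ]
    return pd.DataFrame({
        "Category": stats_by_category.index.tolist(),
        f"{name} Score (Mean ± SD)": formatted,
    })


# Toy data invented for the example.
toy = pd.DataFrame({
    "personality_id": [1, 2, 3, 1, 2, 3],
    "category": ["E", "E", "E", "N", "N", "N"],
    "ref_score": [0, 2, 4, 3, 3, 6],
})
summary = summarize_scores(toy, "ref_score", "Reference")
print(summary)
```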


In [9]:
# =============================================================================
# SUPPLEMENTARY: Random Baseline Statistics
# =============================================================================
# Calculate mean ± std for the random baseline (null hypothesis control)
# Uses the package function for consistent formatting

random_summary = table_personality.compute_baseline_statistics(
    random_data,
    score_column='random_score',
    name='Random Baseline'
)

print("="*60)
print("RANDOM BASELINE: Null Hypothesis Control ENPL Scores")
print("="*60)
print("(Randomly generated responses - what to expect by chance)")
print()
random_summary

RANDOM BASELINE: Null Hypothesis Control ENPL Scores
(Randomly generated responses - what to expect by chance)



| | Category | Random Baseline Score (Mean ± SD) |
|---|---|---|
| 0 | E | 2.23 ± 1.2 |
| 1 | L | 5.89 ± 0.33 |
| 2 | N | 3.01 ± 1.13 |
| 3 | P | 0.82 ± 0.72 |
