# Table 6: Cronbach's Alpha for EPQR-A Test

This notebook replicates **Table 6** from the paper, which reports Cronbach's Alpha values for the EPQR-A personality test across LLMs, sample populations, and personality categories.

## Overview

**Table 6** measures the **internal consistency** (reliability) of the EPQR-A questionnaire by calculating Cronbach's Alpha for each combination of:
- **Model**: GPT-4o, GPT-3.5, Claude-3.5-s, Llama3.1-70B, Llama3.2-3B
- **Population**: Base, MaxN, MaxP
- **Category**: E (Extraversion), N (Neuroticism), P (Psychoticism), L (Lie scale)

## Cronbach's Alpha (α)

Cronbach's Alpha measures **internal consistency**: how closely related a set of items is as a group. This notebook uses the following bands, which match the quality distribution computed later:
- **α ≥ 0.9**: Excellent consistency
- **0.8 ≤ α < 0.9**: Good consistency
- **0.7 ≤ α < 0.8**: Acceptable consistency
- **0.6 ≤ α < 0.7**: Questionable consistency
- **α < 0.6**: Poor consistency
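
For reference, the standard textbook definition of the statistic is given below. It is stated here as background and is not taken from the paper; the symbols $k$, $\sigma_i^2$, and $\sigma_X^2$ are introduced only for this formula.

$$
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^{2}}{\sigma_X^{2}}\right)
$$

Here $k$ is the number of items in a scale, $\sigma_i^{2}$ is the variance of item $i$ across respondents, and $\sigma_X^{2}$ is the variance of the respondents' total scores. When the items move together, $\sigma_X^{2}$ is large relative to the sum of the item variances and α approaches 1. When the items are unrelated, the two quantities are roughly equal and α is near 0.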

## Calculation Method

For each model/population/category combination:
1. **Pivot** questionnaire responses: rows = experiments, columns = questions within category
2. **Compute α** with the Pingouin library to measure how consistently personas answer related questions
3. **Format** results into publication-ready table
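
The pivot and α steps can be sketched on a toy example. This is a minimal illustration under stated assumptions: the data are hypothetical, and the hand-written `cronbach_alpha` helper is a stand-in for the Pingouin call so that the sketch needs only pandas. The internals of `table_cronbach` are not shown in this notebook and may differ.

```python
import pandas as pd

# Hypothetical long-format responses: one row per (experiment, question).
long_df = pd.DataFrame({
    "experiment_id":   [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "question_number": [2, 4, 13] * 4,
    "eval":            [1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0],
})

# Step 1: pivot so that rows are experiments and columns are questions.
wide = long_df.pivot(index="experiment_id", columns="question_number", values="eval")

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Stand-in for the library call: alpha from item and total-score variances."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)       # one variance per question
    total_variance = items.sum(axis=1).var(ddof=1)   # variance of per-experiment totals
    return float(k / (k - 1) * (1 - item_variances.sum() / total_variance))

# Step 2: compute alpha for this one category.
alpha = cronbach_alpha(wide)
print(wide)
print(round(alpha, 3))  # 0.75
```

With Pingouin installed, `pingouin.cronbach_alpha(data=wide)` should give the same point estimate together with a confidence interval.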

## Expected Results

We expect:
- **High α values** (>0.7): would indicate that LLMs generate internally consistent personality profiles
- **Similar α across models**: would suggest comparable reliability in questionnaire responses
- **Lower α for MaxN/MaxP**: manipulated personalities may show less consistency

## EPQR-A Categories

Each category has 6 questions in the abbreviated version:
- **E (Extraversion)**: Questions 2, 4, 13, 15, 20, 23
- **N (Neuroticism)**: Questions 1, 9, 11, 14, 18, 21
- **P (Psychoticism)**: Questions 3, 6, 8, 12, 16, 22
- **L (Lie scale)**: Questions 5, 7, 10, 17, 19, 24

## Setup and Data Loading

Import required packages and load EPQR-A questionnaire data.

In [1]:
# Standard libraries
import pandas as pd
import numpy as np

# Database connection
from personas_backend.db.db_handler import DatabaseHandler
from personas_backend import ACTIVE_SCHEMA

# Evaluations package
from evaluations import data_access, table_cronbach

# Configuration
SCHEMA = "personality_trap"
print(f"Using schema: {SCHEMA}")
print(f"Active schema: {ACTIVE_SCHEMA}")

Using schema: personality_trap
Active schema: test_validation_schema


In [2]:
# Connect to database
db_handler = DatabaseHandler()
conn = db_handler.connection

print("✅ Connected to database")
conn

✅ Connected to database


Engine(postgresql://personas:***@localhost:5432/personas)

In [3]:
# Load EPQR-A questionnaire data
with conn.connect() as connection:
    epqra_data = data_access.load_questionnaire_experiments(
        connection, 
        schema=SCHEMA,
        questionnaires=["epqra"]
    )

print(f"Total EPQR-A records: {len(epqra_data)}")
print(f"Experiment groups: {sorted(epqra_data['experiments_group_id'].unique())}")
print(f"Models: {sorted(epqra_data['model_clean'].unique())}")
print(f"Populations: {sorted(epqra_data['population_display'].unique())}")
print(f"Categories: {sorted(epqra_data['category'].unique())}")

# Check data completeness
print("\nData completeness check:")
completeness = epqra_data.groupby(['model_clean', 'population_display'], as_index=False).agg(
    personality_count=('personality_id', 'nunique'),
    experiment_count=('experiment_id', 'nunique'),
    question_count=('question_number', 'nunique'),
    total_records=('category', 'count')
)
display(completeness)

Loading questionnaire data from personality_trap.experiments_evals...
Loaded 297360 questionnaire records
Models: ['GPT-3.5' 'GPT-4o' 'Claude-3.5-s' 'Llama3.2-3B' 'Llama3.1-70B']
Populations: ['gpt35' 'gpt4o' 'maxN_gpt4o' 'maxP_gpt4o' 'claude35sonnet' 'llama323B'
 'llama3170B' 'maxN_claude35sonnet' 'maxP_claude35sonnet' 'maxN_gpt35'
 'maxP_gpt35' 'maxN_llama3170B' 'maxP_llama3170B' 'maxN_llama323B'
 'maxP_llama323B']
Total EPQR-A records: 297360
Experiment groups: [np.int64(307), np.int64(308), np.int64(312), np.int64(313), np.int64(343), np.int64(344), np.int64(345), np.int64(356), np.int64(357), np.int64(358), np.int64(359), np.int64(360), np.int64(361), np.int64(362), np.int64(363)]
Models: ['Claude-3.5-s', 'GPT-3.5', 'GPT-4o', 'Llama3.1-70B', 'Llama3.2-3B']
Populations: ['Base', 'MaxN', 'MaxP']
Categories: ['E', 'L', 'N', 'P']

Data completeness check:

|   | model_clean | population_display | personality_count | experiment_count | question_count | total_records |
|---|---|---|---|---|---|---|
| 0 | Claude-3.5-s | Base | 826 | 826 | 24 | 19824 |
| 1 | Claude-3.5-s | MaxN | 826 | 826 | 24 | 19824 |
| 2 | Claude-3.5-s | MaxP | 826 | 826 | 24 | 19824 |
| 3 | GPT-3.5 | Base | 826 | 826 | 24 | 19824 |
| 4 | GPT-3.5 | MaxN | 826 | 826 | 24 | 19824 |
| 5 | GPT-3.5 | MaxP | 826 | 826 | 24 | 19824 |
| 6 | GPT-4o | Base | 826 | 826 | 24 | 19824 |
| 7 | GPT-4o | MaxN | 826 | 826 | 24 | 19824 |
| 8 | GPT-4o | MaxP | 826 | 826 | 24 | 19824 |
| 9 | Llama3.1-70B | Base | 826 | 826 | 24 | 19824 |
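
The figures in this output are mutually consistent, which can be confirmed with a few lines of arithmetic. The variable names below are introduced only for this check.

```python
# Figures taken from the loader output and completeness table above.
personas_per_group = 826
questions_per_persona = 24
models = 5
populations = 3

records_per_group = personas_per_group * questions_per_persona
total_records = records_per_group * models * populations

print(records_per_group)  # 19824
print(total_records)      # 297360
```

Each model-population group therefore holds one complete 24-item questionnaire per persona, and the 15 groups together account for all loaded records.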


## Calculate Cronbach's Alpha Table

Calculate Cronbach's Alpha for all model-population combinations using the `table_cronbach` module from the `evaluations` package.

In [4]:
print("\nCalculating Cronbach's Alpha for EPQR-A data...")
experiment_alpha = table_cronbach.calculate_cronbach_alpha(
    data_frame=epqra_data,
    question_col='question_number',
    category_col='category',
    eval_col='eval',
    experiment_col='experiment_id',
    group_by=['model_clean', 'population_display']
)

print(f"\nCalculated alpha values for {len(experiment_alpha)} model-population combinations")
experiment_alpha.head()


Calculating Cronbach's Alpha for EPQR-A data...

Calculated alpha values for 15 model-population combinations



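|   | model_clean | population_display | E | N | P | L |
|---|---|---|---|---|---|---|
| 0 | Claude-3.5-s | Base | 1.0 | 0.92 | 0.59 | 0.86 |
| 1 | Claude-3.5-s | MaxN | 1.0 | NaN | 0.6 | 0.8 |
| 2 | Claude-3.5-s | MaxP | 0.99 | 0.95 | 0.23 | 0.89 |
| 3 | GPT-3.5 | Base | 0.95 | 0.89 | 0.4 | 0.7 |
| 4 | GPT-3.5 | MaxN | 0.95 | 0.77 | 0.44 | 0.75 |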
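
The N value for Claude-3.5-s / MaxN is missing. The notebook does not say why. One plausible cause, offered here as an assumption rather than a finding, is that every persona in that group produced the same N total, in which case the variance of the total scores is zero and α is undefined (the statistic divides by that variance). A minimal check for this condition, using hypothetical data and a hypothetical helper name:

```python
def total_score_is_constant(rows):
    """Return True when every respondent has the same total score on a scale."""
    return len({sum(row) for row in rows}) == 1

# Hypothetical data: every persona endorses all six items of a scale.
saturated = [[1, 1, 1, 1, 1, 1] for _ in range(5)]
# Hypothetical data: personas differ, so alpha can be computed.
varied = [[1, 1, 1, 1, 1, 1], [0, 0, 1, 0, 0, 0], [1, 1, 0, 1, 1, 0]]

print(total_score_is_constant(saturated))  # True
print(total_score_is_constant(varied))     # False
```

Running such a check on the pivoted N responses for that group would confirm or rule out this explanation.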


## Table 6: Final Results (Ordered)

Display the final Cronbach's Alpha table in the expected order:
- **Models**: GPT-4o, GPT-3.5, Claude-3.5-s, Llama3.1-70B, Llama3.2-3B
- **Populations**: Base, MaxN, MaxP

In [5]:
# Create formatted table using package function
TABLE6 = table_cronbach.create_cronbach_table(
    experiment_alpha,
    model_col='model_clean',
    population_col='population_display'
)

print("\n" + "="*60)
print("TABLE 6: Cronbach's Alpha for EPQR-A Test")
print("="*60)
display(TABLE6)


TABLE 6: Cronbach's Alpha for EPQR-A Test


|   | model_clean | population_display | E | N | P | L |
|---|---|---|---|---|---|---|
| 0 | GPT-4o | Base | 1.0 | 0.94 | 0.61 | 0.79 |
| 1 | GPT-4o | MaxN | 1.0 | 0.71 | 0.65 | 0.88 |
| 2 | GPT-4o | MaxP | 1.0 | 0.96 | 0.4 | 0.87 |
| 3 | GPT-3.5 | Base | 0.95 | 0.89 | 0.4 | 0.7 |
| 4 | GPT-3.5 | MaxN | 0.95 | 0.77 | 0.44 | 0.75 |
| 5 | GPT-3.5 | MaxP | 0.91 | 0.88 | 0.28 | 0.65 |
| 6 | Claude-3.5-s | Base | 1.0 | 0.92 | 0.59 | 0.86 |
| 7 | Claude-3.5-s | MaxN | 1.0 | NaN | 0.6 | 0.8 |
| 8 | Claude-3.5-s | MaxP | 0.99 | 0.95 | 0.23 | 0.89 |
| 9 | Llama3.1-70B | Base | 0.99 | 0.91 | 0.64 | 0.97 |
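
How `create_cronbach_table` imposes this row order is not shown in the notebook. One common way to get a custom, non-alphabetical order in pandas is an ordered `Categorical`. The sketch below is an assumption about the approach, not the package's code. `MODEL_ORDER` and `POPULATION_ORDER` are names introduced here, and the E values are copied from the tables above.

```python
import pandas as pd

MODEL_ORDER = ["GPT-4o", "GPT-3.5", "Claude-3.5-s", "Llama3.1-70B", "Llama3.2-3B"]
POPULATION_ORDER = ["Base", "MaxN", "MaxP"]

# A few rows in arbitrary order.
alpha_df = pd.DataFrame({
    "model_clean": ["Claude-3.5-s", "Claude-3.5-s", "GPT-3.5", "GPT-4o"],
    "population_display": ["MaxN", "Base", "Base", "Base"],
    "E": [1.0, 1.0, 0.95, 1.0],
})

# Ordered categoricals make sort_values follow the listed order.
ordered = (
    alpha_df.assign(
        model_clean=pd.Categorical(
            alpha_df["model_clean"], categories=MODEL_ORDER, ordered=True
        ),
        population_display=pd.Categorical(
            alpha_df["population_display"], categories=POPULATION_ORDER, ordered=True
        ),
    )
    .sort_values(["model_clean", "population_display"])
    .reset_index(drop=True)
)
print(ordered)
```

The sorted frame lists GPT-4o first, then GPT-3.5, then the two Claude-3.5-s rows with Base before MaxN.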


## Export Table

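Save the table as a CSV file for inclusion in the paper. The export cell (`In [12]`) runs at the end of the notebook, after the baseline comparisons.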

## Interpretation Summary

Analyze the consistency of results across models and populations.

In [6]:
# Calculate summary statistics using package function
print("\nSummary Statistics:")
print("\nBy Category:")
category_stats = table_cronbach.compute_alpha_statistics(TABLE6)
display(category_stats)

# Count alpha quality levels using package function
print("\nAlpha Quality Distribution:")
quality_dist = table_cronbach.compute_alpha_quality_distribution(TABLE6)

print(f"  Excellent (α ≥ 0.9):     {quality_dist['excellent']} / {quality_dist['total']}")
print(f"  Good (0.8 ≤ α < 0.9):    {quality_dist['good']} / {quality_dist['total']}")
print(f"  Acceptable (0.7 ≤ α < 0.8): {quality_dist['acceptable']} / {quality_dist['total']}")
print(f"  Questionable (0.6 ≤ α < 0.7): {quality_dist['questionable']} / {quality_dist['total']}")
print(f"  Poor (α < 0.6):          {quality_dist['poor']} / {quality_dist['total']}")


Summary Statistics:

By Category:


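|   | E | N | P | L |
|---|---|---|---|---|
| count | 15.0 | 14.0 | 15.0 | 15.0 |
| mean | 0.97 | 0.84 | 0.48 | 0.69 |
| std | 0.03 | 0.18 | 0.16 | 0.32 |
| min | 0.91 | 0.27 | 0.18 | 0.01 |
| 25% | 0.95 | 0.86 | 0.4 | 0.68 |
| 50% | 0.99 | 0.88 | 0.45 | 0.8 |
| 75% | 1.0 | 0.94 | 0.6 | 0.88 |
| max | 1.0 | 0.96 | 0.68 | 0.97 |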



Alpha Quality Distribution:
  Excellent (α ≥ 0.9):     23 / 60
  Good (0.8 ≤ α < 0.9):    11 / 60
  Acceptable (0.7 ≤ α < 0.8): 5 / 60
  Questionable (0.6 ≤ α < 0.7): 7 / 60
  Poor (α < 0.6):          13 / 60
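
The five counts above sum to 59 while the denominator is 60. That is consistent with the single missing α (Claude-3.5-s, MaxN, N) being included in the total but falling into no band. The binning logic inside `compute_alpha_quality_distribution` is not shown, so the following is only a sketch of how such a classification could work. `alpha_quality` is a hypothetical helper, and the sample values are the first two rows of the alpha table earlier in this notebook.

```python
import math
from collections import Counter

def alpha_quality(alpha):
    """Map an alpha value to the band names used in this notebook."""
    if alpha is None or math.isnan(alpha):
        return "undefined"
    if alpha >= 0.9:
        return "excellent"
    if alpha >= 0.8:
        return "good"
    if alpha >= 0.7:
        return "acceptable"
    if alpha >= 0.6:
        return "questionable"
    return "poor"

# Claude-3.5-s, Base and MaxN rows (E, N, P, L); the missing value is NaN.
sample_alphas = [1.0, 0.92, 0.59, 0.86, 1.0, math.nan, 0.6, 0.8]
counts = Counter(alpha_quality(a) for a in sample_alphas)
print(dict(counts))
```

Keeping an explicit "undefined" band makes the counts add up to the total and keeps missing values visible in the summary.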


## Baseline Comparisons

Calculate Cronbach's Alpha for reference and random baseline data to provide context for the LLM-generated results.

### Reference Questionnaire Data

Load and analyze the reference (baseline) personality profiles that were used as input to generate the LLM personas.

In [7]:
# Load reference questionnaire data
print("Loading reference questionnaire data...")
reference_data = data_access.load_reference_questionnaires(conn, schema=SCHEMA)

print(f"\nLoaded {len(reference_data)} reference questionnaire records")
print(f"Unique personalities: {reference_data['personality_id'].nunique()}")
print(f"Questions per personality: {len(reference_data) // reference_data['personality_id'].nunique()}")
print(f"Categories: {sorted(reference_data['category'].unique())}")

# Show sample
print("\nSample reference data:")
reference_data.head(10)

Loading reference questionnaire data...

Loaded 19824 reference questionnaire records
Unique personalities: 826
Questions per personality: 24
Categories: ['E', 'L', 'N', 'P']

Sample reference data:


|   | id | personality_id | question_number | question | category | key | answer | ref_eval |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | Does your mood often go up and down? | N | True | True | 1 |
| 1 | 2 | 1 | 2 | Are you a talkative person? | E | True | False | 0 |
| 2 | 3 | 1 | 3 | Would being in debt worry you? | P | False | True | 0 |
| 3 | 4 | 1 | 4 | Are you rather lively? | E | True | False | 0 |
| 4 | 5 | 1 | 5 | Were you ever greedy by helping yourself to mo... | L | False | False | 1 |
| 5 | 6 | 1 | 6 | Would you take drugs which may have strange or... | P | True | False | 0 |
| 6 | 7 | 1 | 7 | Have you ever blamed someone for doing somethi... | L | False | False | 1 |
| 7 | 8 | 1 | 8 | Do you prefer to go your own way rather than a... | P | True | False | 0 |
| 8 | 9 | 1 | 9 | Do you often feel 'fed-up'? | N | True | True | 1 |
| 9 | 10 | 1 | 10 | Have you ever taken anything (even a pin or bu... | L | False | False | 1 |
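
In these rows, `ref_eval` is 1 exactly when `answer` equals `key`. The same rule appears in the SQL `CASE WHEN key = answer` used for the random baseline later in this notebook. The sketch below re-derives that rule and the item-to-category grouping from the ten sample rows. The tuple layout and variable names are introduced here for illustration.

```python
from collections import defaultdict

# (question_number, category, key, answer, ref_eval) copied from the sample rows above.
sample_rows = [
    (1, "N", True, True, 1),
    (2, "E", True, False, 0),
    (3, "P", False, True, 0),
    (4, "E", True, False, 0),
    (5, "L", False, False, 1),
    (6, "P", True, False, 0),
    (7, "L", False, False, 1),
    (8, "P", True, False, 0),
    (9, "N", True, True, 1),
    (10, "L", False, False, 1),
]

# Scoring rule: an item scores 1 when the answer matches the keyed direction.
recomputed = [int(key == answer) for _, _, key, answer, _ in sample_rows]
stored = [ref_eval for *_, ref_eval in sample_rows]
print(recomputed == stored)  # True

# Group question numbers by category to cross-check the item key.
items_by_category = defaultdict(list)
for question_number, category, *_ in sample_rows:
    items_by_category[category].append(question_number)

for category in sorted(items_by_category):
    print(category, items_by_category[category])
```

Running the same grouping over the full `reference_data` frame would list all six items per scale as stored in the database.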


In [8]:
# Calculate Cronbach's Alpha for reference data using package function
print("\nCronbach's Alpha for Reference Questionnaire Data:")
print("(Baseline personality profiles used as input to generate personas)")
print()

reference_alpha = table_cronbach.calculate_cronbach_alpha(
    data_frame=reference_data,
    question_col='question_number',
    category_col='category',
    eval_col='ref_eval',
    experiment_col='personality_id'
)

# Display results
print("="*60)
print("REFERENCE BASELINE: Cronbach's Alpha")
print("="*60)
display(reference_alpha[['E', 'N', 'P', 'L']])


Cronbach's Alpha for Reference Questionnaire Data:
(Baseline personality profiles used as input to generate personas)

REFERENCE BASELINE: Cronbach's Alpha


|   | E | N | P | L |
|---|---|---|---|---|
| 0 | 0.98 | 0.91 | 0.57 | 0.74 |


### Random Baseline Data

Load and analyze randomly generated questionnaire responses to establish a null hypothesis baseline (what to expect by pure chance).

In [9]:
# Load random questionnaire data; fall back to None if the table is missing
try:
    print("Loading random questionnaire data...")
    
    # Query to load random baseline data
    query = f"""
    SELECT personality_id, question_number, category, key, answer,
           CASE WHEN key = answer THEN 1 ELSE 0 END as eval
    FROM {SCHEMA}.random_questionnaires
    """
    
    with conn.connect() as connection:
        random_data = pd.read_sql(query, connection)
    
    print(f"\nLoaded {len(random_data)} random questionnaire records")
    print(f"Unique personalities: {random_data['personality_id'].nunique()}")
    print(f"Questions per personality: {len(random_data) // random_data['personality_id'].nunique()}")
    print(f"Categories: {sorted(random_data['category'].unique())}")
    
    # Show sample
    print("\nSample random data:")
    display(random_data.head(10))
    
except Exception as e:
    print(f"⚠️ Could not load random questionnaire data: {e}")
    print("This table may not exist in the current schema.")
    random_data = None

Loading random questionnaire data...

Loaded 19824 random questionnaire records
Unique personalities: 826
Questions per personality: 24
Categories: ['E', 'L', 'N', 'P']

Sample random data:



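|   | personality_id | question_number | category | key | answer | eval |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | N | True | True | 1 |
| 1 | 1 | 2 | E | True | True | 1 |
| 2 | 1 | 3 | P | False | True | 0 |
| 3 | 1 | 4 | E | True | True | 1 |
| 4 | 1 | 5 | L | False | False | 1 |
| 5 | 1 | 6 | P | True | False | 0 |
| 6 | 1 | 7 | L | False | False | 1 |
| 7 | 1 | 8 | P | True | False | 0 |
| 8 | 1 | 9 | N | True | True | 1 |
| 9 | 1 | 10 | L | False | False | 1 |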


In [10]:
# Calculate Cronbach's Alpha for random baseline data using package function
if random_data is not None:
    print("\nCronbach's Alpha for Random Baseline Data:")
    print("(Randomly generated responses - null hypothesis control)")
    print()
    
    random_alpha = table_cronbach.calculate_cronbach_alpha(
        data_frame=random_data,
        question_col='question_number',
        category_col='category',
        eval_col='eval',
        experiment_col='personality_id'
    )
    
    # Display results
    print("="*60)
    print("RANDOM BASELINE: Cronbach's Alpha")
    print("="*60)
    display(random_alpha[['E', 'N', 'P', 'L']])
else:
    print("⚠️ Skipping random baseline Cronbach's Alpha (data not available)")
    random_alpha = None


Cronbach's Alpha for Random Baseline Data:
(Randomly generated responses - null hypothesis control)

RANDOM BASELINE: Cronbach's Alpha


|   | E | N | P | L |
|---|---|---|---|---|
| 0 | 0.03 | -0.15 | 0.06 | 0.02 |
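
Values near zero are what independent answers should produce. The simulation below illustrates this with coin-flip responses for 826 respondents on one six-item scale. It is a self-contained sketch using only the standard library. It is not the procedure that populated `random_questionnaires`, which this notebook does not describe, and `cronbach_alpha` here is a hand-written helper rather than the package function.

```python
import random
from statistics import variance

def cronbach_alpha(rows):
    """Alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])
    item_var_sum = sum(variance(col) for col in zip(*rows))  # per-item variances
    total_var = variance([sum(row) for row in rows])         # variance of totals
    return k / (k - 1) * (1 - item_var_sum / total_var)

rng = random.Random(0)  # fixed seed so the run is repeatable
n_respondents, n_items = 826, 6

# Every answer is an independent fair coin flip.
random_rows = [
    [rng.randint(0, 1) for _ in range(n_items)]
    for _ in range(n_respondents)
]

alpha = cronbach_alpha(random_rows)
print(round(alpha, 2))
```

Because the items share no common signal, the total-score variance is close to the sum of the item variances, so the estimate lands close to zero and can be slightly negative, as in the table above.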


### Comparative Analysis

Compare Cronbach's Alpha values across LLM experiments, reference baseline, and random baseline.

In [11]:
# Create comprehensive comparison table using package function
print("\n" + "="*80)
print("COMPARATIVE CRONBACH'S ALPHA ANALYSIS")
print("="*80)
print()

comparison_df = table_cronbach.create_baseline_comparison(
    llm_alpha=TABLE6,
    reference_alpha=reference_alpha,
    random_alpha=random_alpha  # already None when the random baseline was skipped
)

print("Cronbach's Alpha Comparison Across Data Sources:")
print()
display(comparison_df)

print("\nInterpretation:")
print("• Higher alpha values indicate better internal consistency")
print("• LLM experiments should show similar or better consistency than reference")
print("• Random baseline should show poor consistency (close to 0 or negative)")


COMPARATIVE CRONBACH'S ALPHA ANALYSIS

Cronbach's Alpha Comparison Across Data Sources:



|   | Source | E | N | P | L |
|---|---|---|---|---|---|
| 0 | LLM Experiments (Mean) | 0.97 | 0.84 | 0.48 | 0.69 |
| 1 | Reference Baseline | 0.98 | 0.91 | 0.57 | 0.74 |
| 2 | Random Baseline | 0.03 | -0.15 | 0.06 | 0.02 |



Interpretation:
• Higher alpha values indicate better internal consistency
• LLM experiments should show similar or better consistency than reference
• Random baseline should show poor consistency (close to 0 or negative)
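
To make the comparison concrete, the gaps between the three rows can be computed directly from the table. The dictionaries below copy the reported values, and the variable names are introduced here for illustration.

```python
# Values copied from the comparison table above.
llm_mean = {"E": 0.97, "N": 0.84, "P": 0.48, "L": 0.69}
reference = {"E": 0.98, "N": 0.91, "P": 0.57, "L": 0.74}
random_baseline = {"E": 0.03, "N": -0.15, "P": 0.06, "L": 0.02}

# Negative gap: the LLM mean is below the reference profiles.
gap_to_reference = {
    scale: round(llm_mean[scale] - reference[scale], 2) for scale in llm_mean
}
# Positive margin: the LLM mean is above chance-level consistency.
margin_over_random = {
    scale: round(llm_mean[scale] - random_baseline[scale], 2) for scale in llm_mean
}

print(gap_to_reference)    # {'E': -0.01, 'N': -0.07, 'P': -0.09, 'L': -0.05}
print(margin_over_random)  # {'E': 0.94, 'N': 0.99, 'P': 0.42, 'L': 0.67}
```

On these figures the mean LLM α sits slightly below the reference baseline on all four scales and well above the random baseline, with the smallest margin on P.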


In [12]:
# Save to CSV
output_path = "table6_cronbach_alpha.csv"
TABLE6.to_csv(output_path, index=False)
print(f"\n✅ Table 6 saved to: {output_path}")


✅ Table 6 saved to: table6_cronbach_alpha.csv
