# Phase 6: Exploratory Analysis

**Swiss Ballot Chatbot Study - Measurement Analysis**

2×2 Factorial Design: Transparency (T0/T1) × Control (C0/C1)

## Purpose

### 6A. Dashboard Behavior (C1 only: Conditions C & D)
- Frequency analysis of each dashboard variable (scope/purpose/storage/retention)
- Compare C vs D distributions for each dashboard variable (χ² + Cramér's V)
- Optional: cluster analysis of dashboard preference profiles

### 6B. Q14 Open Text ("What mattered most…")
- Theme codebook (multi-label coding)
- Theme frequencies by condition (A/B/C/D)
- Theme frequencies by donate vs decline
- Condition contrasts: A vs B, A vs C, C vs D, B vs D
- 5 short representative quotes

## Setup & Imports

In [None]:
import os
import re
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

# Import from Phase 1
from phase1_descriptive_statistics import (
    AnalysisConfig,
    load_participant_data,
    prepare_variables,
    compute_sample_flow
)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 100)
plt.style.use('seaborn-v0_8-whitegrid')

# Significance threshold
ALPHA = 0.05

print("Setup complete!")

## Configuration

In [None]:
# Set participant type: 'ai' for AI test users, 'human' for real participants
PARTICIPANT_TYPE = 'ai'

config = AnalysisConfig(is_ai_participant=(PARTICIPANT_TYPE == 'ai'))
print(f"Analyzing: {'AI Test Users' if PARTICIPANT_TYPE == 'ai' else 'Human Participants'}")

## Helper Functions

In [None]:
def cramers_v(contingency_table: np.ndarray) -> tuple:
    """
    Calculate Cramér's V effect size for a contingency table.
    
    Returns: (V, interpretation)
    Interpretation: V < 0.1 = negligible, 0.1-0.2 = small, 0.2-0.4 = medium, >= 0.4 = large
    """
    chi2 = stats.chi2_contingency(contingency_table)[0]
    n = contingency_table.sum()
    min_dim = min(contingency_table.shape) - 1
    
    V = np.sqrt(chi2 / (n * min_dim)) if (n * min_dim) > 0 else 0
    
    # Interpretation
    if V < 0.1:
        interpretation = "negligible"
    elif V < 0.2:
        interpretation = "small"
    elif V < 0.4:
        interpretation = "medium"
    else:
        interpretation = "large"
    
    return V, interpretation


def chi_square_test(data: pd.DataFrame, var1: str, var2: str) -> dict:
    """
    Perform chi-square test between two categorical variables.
    
    Returns dict with contingency table, chi2, df, p, Cramér's V.
    """
    ct = pd.crosstab(data[var1], data[var2])
    chi2, p, df, expected = stats.chi2_contingency(ct)
    V, V_interp = cramers_v(ct.values)
    
    return {
        'contingency_table': ct,
        'chi2': chi2,
        'df': df,
        'p': p,
        'cramers_v': V,
        'v_interpretation': V_interp,
        'expected': expected
    }


print("Helper functions defined!")
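
One detail worth keeping in mind when reading the effect sizes: for 2×2 tables, `stats.chi2_contingency` applies Yates' continuity correction by default, so χ² (and therefore Cramér's V) comes out smaller than the uncorrected value. The sketch below compares the two on a hypothetical 2×2 table; the counts and the helper name `cramers_v_from_chi2` are illustrative assumptions, not part of the study code.

```python
import numpy as np
from scipy import stats


def cramers_v_from_chi2(chi2: float, n: int, shape: tuple) -> float:
    """Cramér's V from a chi-square statistic, sample size and table shape."""
    min_dim = min(shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim))) if n * min_dim > 0 else 0.0


# Hypothetical 2x2 counts (rows: condition, columns: dashboard option)
table = np.array([[10, 20],
                  [20, 10]])
n = int(table.sum())

# Default call: Yates' correction is applied because the table has df = 1
chi2_corrected = stats.chi2_contingency(table)[0]
# Explicitly disable the correction
chi2_uncorrected = stats.chi2_contingency(table, correction=False)[0]

v_corrected = cramers_v_from_chi2(chi2_corrected, n, table.shape)
v_uncorrected = cramers_v_from_chi2(chi2_uncorrected, n, table.shape)

print(f"V with Yates' correction:    {v_corrected:.3f}")
print(f"V without Yates' correction: {v_uncorrected:.3f}")
```

The correction only affects tables with one degree of freedom, so dashboard variables with more than two options are unaffected.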

## Load and Prepare Data

In [None]:
# Load data
df_raw = load_participant_data(config)
df = prepare_variables(df_raw, config)

# Apply exclusions
sample_flow = compute_sample_flow(df)
# Copy so that columns added later (e.g. 'themes') do not trigger SettingWithCopyWarning
df_filtered = sample_flow['df_filtered'].copy()

print(f"\nFinal sample size: N = {len(df_filtered)}")

# Filter to C1 conditions (C and D) for dashboard analysis
df_c1 = df_filtered[df_filtered['control_level'] == 1].copy()
print(f"C1 participants (Conditions C & D): n = {len(df_c1)}")
print(f"  Condition C: n = {len(df_c1[df_c1['condition'] == 'C'])}")
print(f"  Condition D: n = {len(df_c1[df_c1['condition'] == 'D'])}")
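
Before filtering on `control_level`, it can help to confirm that the condition labels agree with the factor levels. The sketch below assumes the mapping implied by the contrasts used later in this notebook (A = T0/C0, B = T1/C0, C = T0/C1, D = T1/C1, with the DNL taken to be the transparency manipulation); the key `transparency_level` and the toy rows are hypothetical.

```python
import pandas as pd

# Assumed condition-to-factor mapping (inferred, not confirmed by the study code)
CONDITION_FACTORS = {
    'A': {'transparency_level': 0, 'control_level': 0},
    'B': {'transparency_level': 1, 'control_level': 0},
    'C': {'transparency_level': 0, 'control_level': 1},
    'D': {'transparency_level': 1, 'control_level': 1},
}

# Toy data standing in for df_filtered
toy = pd.DataFrame({
    'condition': ['A', 'B', 'C', 'D', 'C'],
    'control_level': [0, 0, 1, 1, 1],
})

# Look up the control level each condition label should have
expected_control = toy['condition'].map(
    lambda c: CONDITION_FACTORS[c]['control_level']
)

# Rows where the stored factor level disagrees with the condition label
mismatches = toy[toy['control_level'] != expected_control]
print(f"Rows with inconsistent labels: {len(mismatches)}")
```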

## 6A: Dashboard Behavior Analysis

In [None]:
print("6A: DASHBOARD BEHAVIOR ANALYSIS")
print("=" * 70)
print("\nAnalyzing dashboard selections for Conditions C & D only.")

# Dashboard variables
dashboard_vars = ['dashboard_scope', 'dashboard_purpose', 'dashboard_storage', 'dashboard_retention']

# Check which variables exist
available_vars = [v for v in dashboard_vars if v in df_c1.columns]
print(f"\nAvailable dashboard variables: {available_vars}")

In [None]:
# Frequency tables for each dashboard variable
print("\nDASHBOARD OPTION FREQUENCIES")
print("=" * 70)

dashboard_freq_results = {}

for var in available_vars:
    print(f"\n--- {var.upper()} ---")
    
    # Overall frequencies
    freq_overall = df_c1[var].value_counts().sort_index()
    pct_overall = (df_c1[var].value_counts(normalize=True) * 100).sort_index()
    
    # By condition (C vs D)
    freq_c = df_c1[df_c1['condition'] == 'C'][var].value_counts().sort_index()
    freq_d = df_c1[df_c1['condition'] == 'D'][var].value_counts().sort_index()
    
    pct_c = (df_c1[df_c1['condition'] == 'C'][var].value_counts(normalize=True) * 100).sort_index()
    pct_d = (df_c1[df_c1['condition'] == 'D'][var].value_counts(normalize=True) * 100).sort_index()
    
    # Combine into table
    freq_table = pd.DataFrame({
        'Overall n': freq_overall,
        'Overall %': pct_overall.round(1),
        'C n': freq_c,
        'C %': pct_c.round(1),
        'D n': freq_d,
        'D %': pct_d.round(1)
    }).fillna(0)
    
    print(freq_table)
    dashboard_freq_results[var] = freq_table
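
The same counts and percentages can also be obtained from a single cross-tabulation. This is a minimal sketch on toy data, offered as an alternative rather than a replacement for the cell above; the option labels `'full'` and `'topics'` are made up for illustration.

```python
import pandas as pd

# Toy data standing in for df_c1; option labels are hypothetical
toy = pd.DataFrame({
    'condition': ['C', 'C', 'C', 'D', 'D'],
    'dashboard_scope': ['full', 'topics', 'full', 'full', 'topics'],
})

# Counts per option and condition, with an 'Overall' margin row and column
counts = pd.crosstab(
    toy['dashboard_scope'], toy['condition'],
    margins=True, margins_name='Overall'
)

# Column percentages: divide each column by its total (the 'Overall' row)
pct = (
    counts.drop(index='Overall')
    .div(counts.loc['Overall'], axis=1)
    .mul(100)
    .round(1)
)

print(counts)
print(pct)
```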

In [None]:
# Chi-square tests: C vs D for each dashboard variable
print("\nCHI-SQUARE TESTS: CONDITION C vs D")
print("=" * 70)

chi_results = []

for var in available_vars:
    print(f"\n--- {var} ---")
    
    # Only test if both conditions have variation
    n_categories_c = df_c1[df_c1['condition'] == 'C'][var].nunique()
    n_categories_d = df_c1[df_c1['condition'] == 'D'][var].nunique()
    
    if n_categories_c > 1 and n_categories_d > 1:
        result = chi_square_test(df_c1, 'condition', var)
        
        print(f"χ²({result['df']}) = {result['chi2']:.3f}, p = {result['p']:.4f}")
        print(f"Cramér's V = {result['cramers_v']:.3f} ({result['v_interpretation']})")
        print(f"Significant: {'Yes' if result['p'] < ALPHA else 'No'}")
        
        chi_results.append({
            'Variable': var,
            'χ²': round(result['chi2'], 3),
            'df': result['df'],
            'p': round(result['p'], 4),
            'Cramér\'s V': round(result['cramers_v'], 3),
            'Interpretation': result['v_interpretation'],
            'Significant': 'Yes' if result['p'] < ALPHA else 'No'
        })
    else:
        print("Insufficient variation for chi-square test.")
        chi_results.append({
            'Variable': var,
            'χ²': 'N/A',
            'df': 'N/A',
            'p': 'N/A',
            'Cramér\'s V': 'N/A',
            'Interpretation': 'N/A',
            'Significant': 'N/A'
        })

chi_results_df = pd.DataFrame(chi_results)
print("\n" + "=" * 70)
print("Summary:")
chi_results_df
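
`chi_square_test` returns the expected counts, but the cell above does not inspect them. With small C1 subsamples the χ² approximation can be unreliable; a common rule of thumb is that no more than 20% of cells should have an expected count below 5. The sketch below shows one way to check this and, for 2×2 tables, to fall back on Fisher's exact test. The data and option labels are hypothetical.

```python
import pandas as pd
from scipy import stats

# Toy data standing in for df_c1; option labels are hypothetical
toy = pd.DataFrame({
    'condition': ['C'] * 10 + ['D'] * 10,
    'dashboard_storage': ['swiss'] * 8 + ['eu'] * 2 + ['swiss'] * 7 + ['eu'] * 3,
})

ct = pd.crosstab(toy['condition'], toy['dashboard_storage'])
chi2, p, dof, expected = stats.chi2_contingency(ct)

# Diagnostics on the expected counts
min_expected = expected.min()
share_small = (expected < 5).mean()  # fraction of cells with expected count < 5
print(f"Smallest expected count: {min_expected:.1f}")
print(f"Cells with expected count < 5: {share_small:.0%}")

# Fisher's exact test is available in SciPy for 2x2 tables
p_exact = None
if ct.shape == (2, 2) and min_expected < 5:
    odds_ratio, p_exact = stats.fisher_exact(ct.to_numpy())
    print(f"Fisher's exact test: p = {p_exact:.4f}")
```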

In [None]:
# Top configuration profiles
print("\nTOP DASHBOARD CONFIGURATIONS")
print("=" * 70)

if len(available_vars) == 4:
    # Create configuration string
    df_c1['config'] = (df_c1['dashboard_scope'].astype(str) + ' | ' +
                       df_c1['dashboard_purpose'].astype(str) + ' | ' +
                       df_c1['dashboard_storage'].astype(str) + ' | ' +
                       df_c1['dashboard_retention'].astype(str))
    
    # Top 10 configurations
    top_configs = df_c1['config'].value_counts().head(10)
    print("\nTop 10 most common configurations (scope | purpose | storage | retention):")
    for i, (config_str, count) in enumerate(top_configs.items(), 1):
        pct = count / len(df_c1) * 100
        print(f"{i}. {config_str}: n={count} ({pct:.1f}%)")
else:
    print("Not all 4 dashboard variables available for configuration analysis.")
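
The optional cluster analysis mentioned under Purpose is not implemented in this notebook. One possible approach, sketched here under stated assumptions, is hierarchical clustering on the Hamming distance between configurations (the share of the four settings on which two participants differ). The option labels, the choice of average linkage and the number of clusters are all illustrative.

```python
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Toy configurations; option labels are hypothetical
toy = pd.DataFrame({
    'dashboard_scope':     ['full', 'full', 'topics', 'topics', 'full', 'topics'],
    'dashboard_purpose':   ['academic', 'academic', 'commercial', 'commercial', 'academic', 'commercial'],
    'dashboard_storage':   ['swiss', 'swiss', 'eu', 'eu', 'swiss', 'eu'],
    'dashboard_retention': ['1y', '1y', '6m', '6m', '6m', '6m'],
})

# Encode each categorical column as integer codes (only equality matters for Hamming)
codes = toy.apply(lambda col: col.astype('category').cat.codes)

# Pairwise Hamming distance: proportion of settings that differ
dist = pdist(codes.to_numpy(), metric='hamming')

# Average-linkage hierarchical clustering, cut into two clusters
Z = linkage(dist, method='average')
labels = fcluster(Z, t=2, criterion='maxclust')

toy['cluster'] = labels
print(toy)
```

With real data, the number of clusters would need to be justified, for example by inspecting a dendrogram.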

In [None]:
# Dashboard visualizations
if len(available_vars) >= 2:
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    axes = axes.flatten()
    
    colors_cd = {'C': '#45B7D1', 'D': '#96CEB4'}
    
    for i, var in enumerate(available_vars[:4]):
        ax = axes[i]
        
        # Get data
        data_c = df_c1[df_c1['condition'] == 'C'][var].value_counts(normalize=True) * 100
        data_d = df_c1[df_c1['condition'] == 'D'][var].value_counts(normalize=True) * 100
        
        # Combine for plotting
        categories = sorted(set(data_c.index) | set(data_d.index))
        x = np.arange(len(categories))
        width = 0.35
        
        vals_c = [data_c.get(cat, 0) for cat in categories]
        vals_d = [data_d.get(cat, 0) for cat in categories]
        
        bars1 = ax.bar(x - width/2, vals_c, width, label='Condition C', color=colors_cd['C'], alpha=0.8)
        bars2 = ax.bar(x + width/2, vals_d, width, label='Condition D', color=colors_cd['D'], alpha=0.8)
        
        ax.set_xticks(x)
        ax.set_xticklabels([str(c)[:15] for c in categories], rotation=45, ha='right', fontsize=9)
        ax.set_ylabel('Percentage (%)', fontsize=10)
        ax.set_title(var.replace('dashboard_', '').title(), fontsize=11, fontweight='bold')
        ax.legend(fontsize=9)
        ax.set_ylim(0, 100)
    
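    # Hide any unused panels when fewer than four dashboard variables are available
    for unused_ax in axes[len(available_vars):]:
        unused_ax.set_visible(False)

    plt.suptitle('Dashboard Selections: Condition C vs D', fontsize=12, fontweight='bold', y=1.02)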
    plt.tight_layout()
    plt.show()
else:
    print("Insufficient dashboard variables for visualization.")

## 6B: Q14 Open Text Analysis

In [None]:
print("6B: Q14 OPEN TEXT ANALYSIS")
print("=" * 70)
print("\nQ14: 'What mattered most for your data donation decision?'")

# Check if Q14 exists
q14_col = None
for col in ['q14', 'Q14', 'q14_text', 'open_text', 'decision_reason']:
    if col in df_filtered.columns:
        q14_col = col
        break

if q14_col:
    print(f"\nFound Q14 in column: {q14_col}")
    
    # Response rate
    total_n = len(df_filtered)
    non_empty = df_filtered[q14_col].dropna()
    non_empty = non_empty[non_empty.str.strip() != '']
    response_rate = len(non_empty) / total_n * 100
    
    print(f"Response rate: {len(non_empty)}/{total_n} ({response_rate:.1f}%)")
    
    # By condition
    print("\nResponse rate by condition:")
    for cond in ['A', 'B', 'C', 'D']:
        cond_df = df_filtered[df_filtered['condition'] == cond]
        cond_responses = cond_df[q14_col].dropna()
        cond_responses = cond_responses[cond_responses.str.strip() != '']
        cond_rate = len(cond_responses) / len(cond_df) * 100 if len(cond_df) > 0 else 0
        print(f"  {cond}: {len(cond_responses)}/{len(cond_df)} ({cond_rate:.1f}%)")
else:
    print("\nQ14 column not found in dataset.")
    print("Available columns:", list(df_filtered.columns))

In [None]:
# Theme codebook
THEME_CODEBOOK = {
    'transparency': ['transparent', 'clear', 'understand', 'know', 'information', 'explain', 'disclosure'],
    'control': ['control', 'choice', 'choose', 'decide', 'option', 'configure', 'granular'],
    'anonymity': ['anonymous', 'anonymity', 'identity', 'personal', 'identifiable', 'private'],
    'risk': ['risk', 'danger', 'unsafe', 'concern', 'worry', 'afraid', 'misuse'],
    'purpose': ['purpose', 'research', 'academic', 'science', 'commercial', 'profit'],
    'storage': ['storage', 'store', 'server', 'switzerland', 'local', 'location', 'where'],
    'retention': ['delete', 'retain', 'keep', 'time', 'duration', 'permanent', 'temporary'],
    'trust': ['trust', 'believe', 'reliable', 'credible', 'honest', 'trustworthy'],
    'civic': ['civic', 'citizen', 'democracy', 'vote', 'public', 'society', 'benefit'],
    'general_privacy': ['privacy', 'data protection', 'gdpr', 'sensitive']
}

def code_themes(text: str, codebook: dict) -> list:
    """Identify themes in text based on keyword matching."""
    # Treat missing values and non-string entries as "no response"
    if not isinstance(text, str) or text.strip() == '':
        return []
    
    text_lower = text.lower()
    themes_found = []
    
    for theme, keywords in codebook.items():
        for keyword in keywords:
            if keyword in text_lower:
                themes_found.append(theme)
                break  # Only count each theme once per response
    
    return themes_found

print("Theme codebook defined with", len(THEME_CODEBOOK), "themes.")
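
Plain substring matching, as used in `code_themes`, also fires inside longer words: `'know'` matches "unknown" and `'time'` matches "sometimes". A possible alternative, sketched below, anchors each keyword at the start of a word while still allowing suffixes ("deleted", "knowing"). The two-theme codebook and the function names are illustrative assumptions; this is not the coding scheme used above.

```python
import re

# Hypothetical excerpt of a keyword codebook, for illustration only
codebook = {
    'transparency': ['know', 'clear'],
    'retention': ['time', 'delete'],
}


def compile_codebook(codebook: dict) -> dict:
    """Compile one regex per theme; keywords must start at a word boundary."""
    return {
        theme: re.compile(
            r'\b(?:' + '|'.join(re.escape(k) for k in keywords) + r')\w*',
            flags=re.IGNORECASE,
        )
        for theme, keywords in codebook.items()
    }


def code_themes_regex(text, patterns: dict) -> list:
    """Return each theme whose pattern occurs at least once in the text."""
    if not isinstance(text, str) or not text.strip():
        return []
    return [theme for theme, pattern in patterns.items() if pattern.search(text)]


patterns = compile_codebook(codebook)

no_match = code_themes_regex("The consequences are unknown, sometimes", patterns)
two_match = code_themes_regex("I wanted to know when it is deleted", patterns)
print(no_match)
print(two_match)
```

The trade-off is that prefix matching still over-matches in some cases (`'time'` also matches "timely"), so automated codes are best spot-checked against a manually coded subsample.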

In [None]:
# Code responses
if q14_col:
    print("\nTHEME CODING RESULTS")
    print("=" * 70)
    
    # Apply coding
    df_filtered['themes'] = df_filtered[q14_col].apply(lambda x: code_themes(x, THEME_CODEBOOK))
    
    # Count themes overall
    all_themes = []
    for themes_list in df_filtered['themes']:
        all_themes.extend(themes_list)
    
    theme_counts = Counter(all_themes)
    
    print("\nOverall theme frequencies (% of non-empty responses):")
    for theme, count in theme_counts.most_common():
        pct = count / len(non_empty) * 100 if len(non_empty) > 0 else 0
        print(f"  {theme}: n={count} ({pct:.1f}%)")
else:
    print("Skipping theme coding - Q14 not available.")

In [None]:
# Theme frequencies by condition
if q14_col:
    print("\nTHEME FREQUENCIES BY CONDITION (% of all participants per condition)")
    print("=" * 70)
    
    theme_by_condition = {}
    
    for cond in ['A', 'B', 'C', 'D']:
        cond_df = df_filtered[df_filtered['condition'] == cond]
        cond_themes = []
        for themes_list in cond_df['themes']:
            cond_themes.extend(themes_list)
        
        theme_by_condition[cond] = Counter(cond_themes)
    
    # Create comparison table
    all_theme_names = list(THEME_CODEBOOK.keys())
    comparison_data = []
    
    for theme in all_theme_names:
        row = {'Theme': theme}
        for cond in ['A', 'B', 'C', 'D']:
            count = theme_by_condition[cond].get(theme, 0)
            n_cond = len(df_filtered[df_filtered['condition'] == cond])
            pct = count / n_cond * 100 if n_cond > 0 else 0
            row[f'{cond} (%)'] = round(pct, 1)
        comparison_data.append(row)
    
    theme_comparison_df = pd.DataFrame(comparison_data)
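    # A bare expression inside an `if` block is not rendered, so print explicitly
    print(theme_comparison_df.to_string(index=False))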
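
Because each response can carry several themes, the per-condition table can also be built by expanding the list column to one row per theme and cross-tabulating. This is a minimal sketch on toy data, using all participants per condition as the denominator.

```python
import pandas as pd

# Toy data standing in for df_filtered with its multi-label 'themes' column
toy = pd.DataFrame({
    'condition': ['A', 'A', 'B', 'B'],
    'themes': [['trust'], ['trust', 'risk'], [], ['risk']],
})

# Denominator: all participants per condition, including those with no themes
n_per_condition = toy['condition'].value_counts()

# One row per (participant, theme); empty lists become NaN and are dropped
long = toy.explode('themes').dropna(subset=['themes'])

# Theme counts by condition, then percentages of participants per condition
counts = pd.crosstab(long['themes'], long['condition'])
pct = counts.div(n_per_condition, axis=1).mul(100).round(1)

print(pct)
```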

In [None]:
# Theme frequencies by donation decision
if q14_col:
    print("\nTHEME FREQUENCIES BY DONATION DECISION")
    print("=" * 70)
    
    theme_by_decision = {}
    
    for decision in [0, 1]:
        decision_df = df_filtered[df_filtered['donation_decision'] == decision]
        decision_themes = []
        for themes_list in decision_df['themes']:
            decision_themes.extend(themes_list)
        
        theme_by_decision[decision] = Counter(decision_themes)
    
    # Create comparison table
    decision_comparison = []
    
    for theme in all_theme_names:
        n_decline = len(df_filtered[df_filtered['donation_decision'] == 0])
        n_donate = len(df_filtered[df_filtered['donation_decision'] == 1])
        
        count_decline = theme_by_decision[0].get(theme, 0)
        count_donate = theme_by_decision[1].get(theme, 0)
        
        pct_decline = count_decline / n_decline * 100 if n_decline > 0 else 0
        pct_donate = count_donate / n_donate * 100 if n_donate > 0 else 0
        
        decision_comparison.append({
            'Theme': theme,
            'Decline (%)': round(pct_decline, 1),
            'Donate (%)': round(pct_donate, 1),
            'Δ (pp)': round(pct_donate - pct_decline, 1)
        })
    
    decision_comparison_df = pd.DataFrame(decision_comparison)
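    # A bare expression inside an `if` block is not rendered, so print explicitly
    print(decision_comparison_df.sort_values('Δ (pp)', ascending=False).to_string(index=False))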

In [None]:
# Condition contrasts
if q14_col:
    print("\nCONDITION CONTRASTS (Theme % differences)")
    print("=" * 70)
    
    contrasts = [
        ('A', 'B', 'Effect of adding DNL (no dashboard)'),
        ('A', 'C', 'Effect of adding Dashboard (no DNL)'),
        ('C', 'D', 'Effect of adding DNL (with dashboard)'),
        ('B', 'D', 'Effect of adding Dashboard (with DNL)')
    ]
    
    for cond1, cond2, description in contrasts:
        print(f"\n{cond1} vs {cond2}: {description}")
        
        n1 = len(df_filtered[df_filtered['condition'] == cond1])
        n2 = len(df_filtered[df_filtered['condition'] == cond2])
        
        differences = []
        for theme in all_theme_names:
            pct1 = theme_by_condition[cond1].get(theme, 0) / n1 * 100 if n1 > 0 else 0
            pct2 = theme_by_condition[cond2].get(theme, 0) / n2 * 100 if n2 > 0 else 0
            diff = pct2 - pct1
            if abs(diff) >= 5:  # Only show meaningful differences
                differences.append((theme, diff))
        
        if differences:
            differences.sort(key=lambda x: abs(x[1]), reverse=True)
            for theme, diff in differences[:5]:
                direction = '↑' if diff > 0 else '↓'
                print(f"  {theme}: {direction} {abs(diff):.1f} pp")
        else:
            print("  No meaningful differences (≥5 pp)")
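
The 5 pp cut-off above is descriptive and says nothing about sampling uncertainty. As a rough complement, a Wald 95% confidence interval for the difference between two proportions can be computed as below. The counts are hypothetical, and the Wald interval is only an approximation that becomes unreliable for small groups or proportions near 0 or 1.

```python
import numpy as np
from scipy import stats

# Hypothetical theme counts in two conditions
count1, n1 = 10, 50   # e.g. theme mentioned by 10 of 50 participants in condition 1
count2, n2 = 20, 50   # e.g. theme mentioned by 20 of 50 participants in condition 2

p1, p2 = count1 / n1, count2 / n2
diff = p2 - p1

# Standard error of the difference between two independent proportions
se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z = stats.norm.ppf(0.975)  # two-sided 95% interval

diff_pp = diff * 100
ci_low_pp = (diff - z * se) * 100
ci_high_pp = (diff + z * se) * 100

print(f"Difference: {diff_pp:.1f} pp, 95% CI [{ci_low_pp:.1f}, {ci_high_pp:.1f}]")
```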

In [None]:
# Representative quotes
if q14_col:
    print("\nREPRESENTATIVE QUOTES")
    print("=" * 70)
    
    # Get non-empty responses
    df_with_text = df_filtered[df_filtered[q14_col].notna()].copy()
    df_with_text = df_with_text[df_with_text[q14_col].str.strip() != '']
    
    if len(df_with_text) > 0:
        # Sample 5 quotes (stratified by condition if possible)
        quotes = []
        
        for cond in ['A', 'B', 'C', 'D']:
            cond_texts = df_with_text[df_with_text['condition'] == cond]
            if len(cond_texts) > 0:
                # Take first valid quote from this condition
                sample_row = cond_texts.iloc[0]
                quotes.append({
                    'Condition': cond,
                    'Donated': 'Yes' if sample_row['donation_decision'] == 1 else 'No',
                    'Quote': sample_row[q14_col][:200] + ('...' if len(str(sample_row[q14_col])) > 200 else '')
                })
        
        # Add one more quote if we have fewer than 5
        if len(quotes) < 5 and len(df_with_text) > 4:
            # Index labels of the rows already quoted (first response per condition)
            used_idx = [df_with_text[df_with_text['condition'] == q['Condition']].index[0] for q in quotes]
            remaining = df_with_text[~df_with_text.index.isin(used_idx)]
            if len(remaining) > 0:
                sample_row = remaining.iloc[0]
                quotes.append({
                    'Condition': sample_row['condition'],
                    'Donated': 'Yes' if sample_row['donation_decision'] == 1 else 'No',
                    'Quote': sample_row[q14_col][:200] + ('...' if len(str(sample_row[q14_col])) > 200 else '')
                })
        
        print(f"\n{len(quotes)} representative quotes:")
        for i, q in enumerate(quotes, 1):
            print(f"\n{i}. [Condition {q['Condition']}, Donated: {q['Donated']}]")
            print(f"   \"{q['Quote']}\"")
    else:
        print("No text responses available for quotes.")
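
Taking the first response per condition depends on row order. If a less order-dependent selection is preferred, one option is to draw one quote per condition at random with a fixed seed so the draw is reproducible. This sketch uses toy data; the column name `q14` and the seed value are assumptions.

```python
import pandas as pd

# Toy data standing in for df_with_text; texts are placeholders
toy = pd.DataFrame({
    'condition': ['A', 'A', 'B', 'B', 'C', 'D'],
    'q14': ['text a1', 'text a2', 'text b1', 'text b2', 'text c1', 'text d1'],
})

# One randomly drawn response per condition, reproducible via random_state
sampled = toy.groupby('condition').sample(n=1, random_state=42)

for _, row in sampled.iterrows():
    print(f"[Condition {row['condition']}] {row['q14']}")
```

Random draws are reproducible but not necessarily representative; quotes chosen for a report would still need to be read and selected by hand.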

## Save Results

In [None]:
# Create output directory
output_dir = './output/phase6'
os.makedirs(output_dir, exist_ok=True)

# Save dashboard results
if len(available_vars) > 0:
    chi_results_df.to_csv(f'{output_dir}/phase6_dashboard_chi_square_{PARTICIPANT_TYPE}.csv', index=False)
    
    for var, freq_table in dashboard_freq_results.items():
        freq_table.to_csv(f'{output_dir}/phase6_{var}_frequencies_{PARTICIPANT_TYPE}.csv')

# Save Q14 results
if q14_col:
    theme_comparison_df.to_csv(f'{output_dir}/phase6_themes_by_condition_{PARTICIPANT_TYPE}.csv', index=False)
    decision_comparison_df.to_csv(f'{output_dir}/phase6_themes_by_decision_{PARTICIPANT_TYPE}.csv', index=False)

print(f"Results saved to {output_dir}/")

In [None]:
# Final Summary
print("\n" + "="*70)
print("PHASE 6 SUMMARY")
print("="*70)

# Count only tests that were actually run (rows marked 'N/A' were skipped)
n_tests_run = sum(1 for r in chi_results if r['Significant'] != 'N/A')
n_significant = sum(1 for r in chi_results if r['Significant'] == 'Yes')

# Precompute Q14 values so the summary also runs when Q14 is missing
if q14_col:
    response_summary = f"{response_rate:.1f}% ({len(non_empty)}/{total_n})"
    top_theme = theme_counts.most_common(1)[0] if theme_counts else 'N/A'
    top_three = ', '.join(t for t, _ in theme_counts.most_common(3)) if theme_counts else 'N/A'
else:
    response_summary = top_theme = top_three = 'N/A'

print(f"""
6A: DASHBOARD BEHAVIOR (Conditions C & D, n={len(df_c1)})
  Dashboard variables analyzed: {len(available_vars)}
  Chi-square tests (C vs D): {n_tests_run} completed
  Significant differences: {n_significant}

6B: Q14 OPEN TEXT ANALYSIS
  Q14 column: {q14_col if q14_col else 'Not found'}
  Response rate: {response_summary}
  Themes coded: {len(THEME_CODEBOOK)}
  Most common theme: {top_theme}

KEY FINDINGS:
  - Dashboard preferences {'differ significantly' if n_significant > 0 else 'do not differ significantly'} between C and D
  - Most cited themes relate to: {top_three}
""")

## Phase 6 Complete

The exploratory analysis is complete. Key outputs:

### 6A: Dashboard Behavior
1. Frequency tables for each dashboard variable (overall and by condition)
2. Chi-square tests comparing C vs D distributions
3. Top configuration profiles
4. Visualization of dashboard selections

### 6B: Q14 Open Text
1. Theme coding using keyword-based codebook
2. Theme frequencies by condition (A/B/C/D)
3. Theme frequencies by donation decision
4. Condition contrasts (A vs B, A vs C, C vs D, B vs D)
5. Representative quotes