# Sensitivity Analysis: Minimum Detectable Effect (MDE)

**Context:** Sample size was externally fixed by the Metaculus Q2 2025 AI forecasting tournament (N = 202 questions). We did not perform a-priori power analysis to determine sample size.

**Objective:** Determine the smallest effect each scenario was powered to detect, given the fixed N and observed variance. This explains why some conditions reached significance while others did not.

**Approach:** Sensitivity analysis (NOT post-hoc power). We compute the Minimum Detectable Effect (MDE) at 80% power for each scenario using observed standard deviations.

In [None]:
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
from statsmodels.stats.power import TTestPower

# Parameters
N = 202
alpha = 0.05
target_power = 0.80

# Observed standard deviations of paired differences (from main analysis)
# These are the SDs of (deliberative - independent) log loss scores
scenarios = {
    'diverse_full': {'sd': 0.117, 'observed_effect': -0.020, 'p': 0.017},
    'diverse_info': {'sd': 0.237, 'observed_effect': -0.022, 'p': 0.182},
    'homo_full': {'sd': 0.194, 'observed_effect': 0.020, 'p': 0.144},
    'homo_info': {'sd': 0.308, 'observed_effect': 0.008, 'p': 0.717},
}

print(f"Sample size: N = {N} (fixed by Metaculus tournament)")
print(f"Alpha: {alpha} (two-sided)")
print(f"Target power: {target_power}")

## 1. Minimum Detectable Effect by Scenario

For a paired t-test with N = 202, we first compute the required Cohen's d for 80% power, then convert to log loss units using each scenario's observed SD.
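
As a rough cross-check on the solver, the required standardized effect can be approximated in closed form as `d ≈ (z_{1-alpha/2} + z_{power}) / sqrt(N)`. The sketch below assumes this normal approximation, so it should come out slightly below the t-based value, and it uses the rounded SDs from the parameters cell. `approx_mde` is a hypothetical helper introduced for illustration, not part of the main analysis.

```python
from math import sqrt
from statistics import NormalDist

N = 202
alpha = 0.05
target_power = 0.80

# Normal-approximation quantiles (assumption: the t-based solver gives a slightly larger d)
z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
z_power = NormalDist().inv_cdf(target_power)
d_approx = (z_alpha + z_power) / sqrt(N)

def approx_mde(sd, d=d_approx):
    """Hypothetical helper: convert a standardized effect to log loss units."""
    return d * sd

# Rounded SDs of paired differences, copied from the parameters cell
sds = {'diverse_full': 0.117, 'diverse_info': 0.237,
       'homo_full': 0.194, 'homo_info': 0.308}
approx_mdes = {name: approx_mde(sd) for name, sd in sds.items()}

print(f"Approximate required d: {d_approx:.3f}")
for name, mde in approx_mdes.items():
    print(f"{name}: approximate MDE = {mde:.3f}")
```

If the approximate and solver-based values differ by more than a few percent, that would suggest a mistake in the inputs rather than in the approximation.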

In [None]:
# Calculate required Cohen's d for 80% power with N = 202
analyzer = TTestPower()
d_required = analyzer.solve_power(
    effect_size=None, 
    nobs=N, 
    alpha=alpha, 
    power=target_power, 
    alternative='two-sided'
)

print(f"Required Cohen's d for 80% power: {d_required:.3f}")
print()

# Compute MDE for each scenario
results = []
for name, data in scenarios.items():
    mde = d_required * data['sd']
    observed_abs = abs(data['observed_effect'])
    detectable = observed_abs >= mde
    
    results.append({
        'Scenario': name,
        'SD of change': data['sd'],
        'MDE (log loss)': mde,
        'Observed effect': data['observed_effect'],
        '|Observed| >= MDE': 'Yes' if detectable else 'No',
        'p-value': data['p'],
        'Significant': 'Yes' if data['p'] < alpha else 'No'
    })

mde_table = pd.DataFrame(results)
mde_table = mde_table.round(4)
mde_table

## 2. Interpretation

The table above explains why `diverse_full` reached significance while `diverse_info` did not, despite similar raw effect sizes:

- **diverse_full**: Low variability (SD = 0.12) → MDE = 0.023 → observed effect magnitude (0.020) is just below the MDE
- **diverse_info**: High variability (SD = 0.24) → MDE = 0.047 → observed effect magnitude (0.022) is less than half the MDE

The distributed information condition has ~2× the standard deviation (~4× the variance) of the full information condition. Because the MDE scales linearly with the SD, the effect would need to be ~2× larger to be detected with the same power.

**Key insight**: Distributing information across agents increases noise in the deliberation effect, reducing statistical power. This is itself a finding, not just a limitation.
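
The scaling between the two diverse conditions can be checked directly from the reported SDs. The required Cohen's d depends only on N, alpha, and power, so the MDE ratio between two scenarios equals their SD ratio, and the variance ratio is the square of that. A minimal sketch, assuming the rounded SDs from the parameters cell:

```python
sd_full = 0.117   # diverse_full: SD of paired differences
sd_info = 0.237   # diverse_info: SD of paired differences

# At fixed N, alpha, and power, MDE = d_required * SD, so this is also the MDE ratio
sd_ratio = sd_info / sd_full
variance_ratio = sd_ratio ** 2

print(f"SD (and MDE) ratio: {sd_ratio:.2f}")
print(f"Variance ratio: {variance_ratio:.2f}")
```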

## 3. Sensitivity Curves

Power as a function of effect size for each scenario.
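
Each curve has a simple closed form under a normal approximation: power ≈ Φ(d·sqrt(N) − z_crit) + Φ(−d·sqrt(N) − z_crit), where d is the effect divided by the scenario's SD. The sketch below is an illustration of that formula only. `approx_power` is a hypothetical helper, and the effect sizes passed to it are arbitrary grid points, not results from the study.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(effect, sd, n=202, alpha=0.05):
    """Hypothetical helper: two-sided power under a normal approximation."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    ncp = abs(effect) / sd * sqrt(n)   # standardized effect times sqrt(N)
    # Probability of rejecting in either tail
    return norm.cdf(ncp - z_crit) + norm.cdf(-ncp - z_crit)

# Arbitrary illustrative effect sizes in log loss units, using the diverse_full SD
power_null = approx_power(0.0, sd=0.117)
power_small = approx_power(0.01, sd=0.117)
power_large = approx_power(0.04, sd=0.117)

print(f"Power at 0.00: {power_null:.3f}")
print(f"Power at 0.01: {power_small:.3f}")
print(f"Power at 0.04: {power_large:.3f}")
```

At an effect of zero the function returns alpha, which is a quick sanity check that both rejection tails are counted.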

In [None]:
# Plot sensitivity curves for each scenario
fig, ax = plt.subplots(figsize=(10, 6))

# Effect sizes to evaluate (in log loss units)
effect_sizes = np.linspace(0.001, 0.08, 100)

colors = {'diverse_full': '#2ca02c', 'diverse_info': '#1f77b4', 
          'homo_full': '#d62728', 'homo_info': '#ff7f0e'}

for name, data in scenarios.items():
    sd = data['sd']
    # Convert effect sizes to Cohen's d for this scenario
    d_values = effect_sizes / sd
    # Calculate power for each d
    powers = [analyzer.power(effect_size=d, nobs=N, alpha=alpha, alternative='two-sided') 
              for d in d_values]
    ax.plot(effect_sizes, powers, label=f"{name} (SD={sd:.2f})", color=colors[name], linewidth=2)
    
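    # Mark the observed effect magnitude on the curve for reference only:
    # this is the power *if* the true effect equalled the observed one, not a post-hoc power claim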
    obs_effect = abs(data['observed_effect'])
    obs_d = obs_effect / sd
    obs_power = analyzer.power(effect_size=obs_d, nobs=N, alpha=alpha, alternative='two-sided')
    ax.scatter([obs_effect], [obs_power], color=colors[name], s=100, zorder=5, edgecolor='black')

ax.axhline(y=0.80, color='gray', linestyle='--', alpha=0.7, label='80% power')
ax.set_xlabel('Absolute effect size (log loss units)')
ax.set_ylabel('Power')
ax.set_title('Sensitivity Analysis: Power by Effect Size and Scenario')
ax.set_xlim(0, 0.08)
ax.set_ylim(0, 1)
ax.legend(loc='lower right')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig("../data/analysis/fig_sensitivity.png", dpi=150, bbox_inches='tight')
plt.show()

## 4. Summary

| Scenario | MDE (80% power) | Observed Effect | Informative? |
|----------|-----------------|-----------------|--------------|
| diverse_full | 0.023 | -0.020 | Yes - effect just below MDE (~86%), significant |
| diverse_info | 0.047 | -0.022 | No - effect ~47% of MDE |
| homo_full | 0.038 | +0.020 | Marginal - effect ~52% of MDE |
| homo_info | 0.061 | +0.008 | No - effect ~13% of MDE |

**Conclusions:**

1. The study was close to adequately powered for the `diverse_full` condition: its low variance gives an MDE (0.023) only slightly above the observed effect magnitude (0.020).

2. The `diverse_info` condition showed a similar point estimate (-0.022) but did not reach significance, because the SD of paired differences was about twice as large under distributed information.

3. The homogeneous conditions had small positive point estimates, below their respective MDEs. We cannot distinguish "no effect" from "small undetectable effect" in these conditions.

4. Future studies examining distributed information effects should anticipate ~2× higher SD (~4× higher variance) and plan for roughly 4× the sample size to reach the same MDE.
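
For planning, the sample size implied by a target MDE can be sketched by inverting the MDE formula: `N ≈ ((z_{1-alpha/2} + z_{power}) · SD / MDE)^2`. The example below assumes the normal approximation and the rounded SDs from this notebook. `approx_required_n` is a hypothetical helper, and the target MDE of 0.023 is chosen only because it matches the `diverse_full` MDE.

```python
from math import ceil
from statistics import NormalDist

def approx_required_n(target_mde, sd, alpha=0.05, power=0.80):
    """Hypothetical helper: questions needed for a target MDE (normal approximation)."""
    norm = NormalDist()
    z_sum = norm.inv_cdf(1 - alpha / 2) + norm.inv_cdf(power)
    return ceil((z_sum * sd / target_mde) ** 2)

target_mde = 0.023  # illustrative target, equal to the diverse_full MDE

n_full = approx_required_n(target_mde, sd=0.117)  # full information SD
n_info = approx_required_n(target_mde, sd=0.237)  # distributed information SD

print(f"Full information: N ~ {n_full}")
print(f"Distributed information: N ~ {n_info}")
print(f"Ratio: {n_info / n_full:.1f}x")
```

Because the required N grows with the square of the SD, doubling the SD roughly quadruples the number of questions needed. A t-based solver such as `TTestPower.solve_power` with `nobs=None` would give a slightly larger N.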