# Multi-Group Comparisons: Beyond A/B Testing

A/B testing compares two groups. But what if we have **multiple treatments** or **multiple classifiers** to compare?

This notebook covers statistical methods for comparing 3+ groups, with special focus on:
- **Friedman test**: Non-parametric test for repeated measures (e.g., multiple classifiers on same datasets)
- **Nemenyi post-hoc test**: Pairwise comparisons after Friedman
- **Kruskal-Wallis**: Non-parametric test for independent groups
- **ANOVA**: Parametric alternative

## Learning Objectives

1. Understand when to use each multi-group test
2. Apply Friedman test to ensemble learning scenarios
3. Perform post-hoc pairwise comparisons correctly
4. Visualize results with critical difference diagrams
5. Avoid multiple testing pitfalls

## Why This Matters

**Use cases:**
- Comparing multiple ML models on the same datasets (your ensemble learning case)
- Testing multiple drug treatments
- Comparing gene expression across multiple conditions
- A/B/C/D/... testing in product development

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import friedmanchisquare, kruskal, f_oneway
from itertools import combinations

np.random.seed(42)

plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11

---

## Part 1: The Problem with Multiple Pairwise Tests

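Why can't we just run a separate A/B test for every pair of groups? Each test at level $\alpha$ carries its own chance of a false positive, and those chances accumulate: with $m$ independent comparisons, the probability of at least one false positive is $1 - (1 - \alpha)^m$.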

In [None]:
# Demonstrate multiple testing problem
def multiple_pairwise_problem(k_groups=5, alpha=0.05):
    """
    Show how pairwise testing inflates Type I error.
    
    k_groups: Number of groups to compare
    """
    # Number of pairwise comparisons
    n_comparisons = k_groups * (k_groups - 1) // 2
    
    # Family-wise error rate (FWER):
    # probability of at least one false positive, assuming independent tests
    fwer = 1 - (1 - alpha)**n_comparisons
    
    return n_comparisons, fwer

print("Multiple Testing Problem")
print("="*60)
print(f"{'Groups':<10} {'Comparisons':<15} {'FWER (α=0.05)':<20}")
print("-"*60)

for k in [2, 3, 4, 5, 10]:
    n_comp, fwer = multiple_pairwise_problem(k)
    print(f"{k:<10} {n_comp:<15} {fwer:.3f}")

print("\n⚠️  With 5 groups (10 comparisons), there is a ~40% chance of at least one false positive!")
print("Solution: Use omnibus test first, then post-hoc corrections.")
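
The formula above treats the pairwise tests as independent, but tests that share groups are correlated. The following is a minimal simulation sketch for checking the inflation empirically. It assumes normally distributed groups with identical means and uses Welch-free two-sample t-tests; the function name `simulate_fwer` and its parameters are illustrative, not part of the notebook.

```python
import numpy as np
from itertools import combinations
from scipy import stats

def simulate_fwer(k_groups=5, n_per_group=30, n_sims=2000, alpha=0.05, seed=0):
    """Estimate the chance of at least one false positive among all pairwise t-tests."""
    rng = np.random.default_rng(seed)
    # Every group comes from the same distribution, so any rejection is a false positive
    data = rng.normal(0.0, 1.0, size=(n_sims, k_groups, n_per_group))
    any_rejection = np.zeros(n_sims, dtype=bool)
    for i, j in combinations(range(k_groups), 2):
        # One vectorized t-test per pair, evaluated across all simulations at once
        p = stats.ttest_ind(data[:, i, :], data[:, j, :], axis=1).pvalue
        any_rejection |= p < alpha
    return any_rejection.mean()

empirical_fwer = simulate_fwer()
independent_fwer = 1 - (1 - 0.05) ** 10
print(f"Empirical FWER (5 groups):      {empirical_fwer:.3f}")
print(f"Independence formula (m = 10):  {independent_fwer:.3f}")
```

Because the ten comparisons reuse the same five groups, the simulated rate will generally not match the independence formula exactly, but it should still sit well above the nominal 5%.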

---

## Part 2: Friedman Test (Repeated Measures)

**When to use**: Comparing k treatments/algorithms on the **same** subjects/datasets

**Example**: Comparing multiple classifiers on the same benchmark datasets

### The Setup

- **Rows**: Datasets (or subjects, blocks)
- **Columns**: Classifiers (or treatments)
- **Values**: Performance metric (accuracy, F1, etc.)

The Friedman test ranks the classifiers within each row, then tests whether the mean ranks differ across columns.

In [None]:
# Generate example data: 5 classifiers on 10 datasets
np.random.seed(42)

n_datasets = 10
classifiers = ['Random Forest', 'XGBoost', 'SVM', 'Logistic Reg', 'Neural Net']

# Simulate accuracies (RF and XGBoost are better)
data = {
    'Random Forest': np.random.beta(8, 2, n_datasets),
    'XGBoost': np.random.beta(8, 2, n_datasets),
    'SVM': np.random.beta(6, 3, n_datasets),
    'Logistic Reg': np.random.beta(5, 4, n_datasets),
    'Neural Net': np.random.beta(7, 3, n_datasets),
}

df_classifiers = pd.DataFrame(data)
df_classifiers.index = [f'Dataset_{i+1}' for i in range(n_datasets)]

print("Classifier Performance (Accuracy)")
print("="*70)
print(df_classifiers.round(3))
print(f"\nMean accuracy per classifier:")
print(df_classifiers.mean().round(3))

In [None]:
# Perform Friedman test
statistic, p_value = friedmanchisquare(*[df_classifiers[col] for col in df_classifiers.columns])

print("Friedman Test Results")
print("="*50)
print(f"Test statistic (χ²): {statistic:.3f}")
print(f"p-value: {p_value:.4f}")
print(f"Significant at α=0.05: {p_value < 0.05}")

if p_value < 0.05:
    print("\n✓ Reject null hypothesis: Classifiers have different performance")
    print("  → Proceed to post-hoc tests to find which pairs differ")
else:
    print("\n✗ Fail to reject null: No evidence of differences")

### Understanding Friedman Test

**How it works:**
1. Within each dataset (row), rank the classifiers from 1 to k
2. Compute the mean rank of each classifier across datasets
3. Test whether the mean ranks differ more than expected by chance

**Null hypothesis**: All classifiers perform equivalently, so their mean ranks should be equal

**Advantages**:
- Non-parametric (no normality assumption)
- Accounts for dataset-specific difficulty
- More powerful than independent tests when data is paired
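
The three steps above can be sketched by hand and compared against `friedmanchisquare`. This is an illustrative check on synthetic scores without ties (so no tie correction is needed); the variable names are assumptions made for the example. It uses the standard form $\chi^2_F = \frac{12N}{k(k+1)}\left[\sum_j R_j^2 - \frac{k(k+1)^2}{4}\right]$, where $R_j$ is the mean rank of classifier $j$.

```python
import numpy as np
from scipy.stats import chi2, friedmanchisquare, rankdata

rng = np.random.default_rng(0)
scores = rng.random((10, 4))  # synthetic data: 10 datasets (rows) x 4 classifiers (columns)
n_datasets, k = scores.shape

# Step 1: rank classifiers within each dataset
row_ranks = np.apply_along_axis(rankdata, 1, scores)

# Step 2: mean rank per classifier
mean_rank_per_clf = row_ranks.mean(axis=0)

# Step 3: Friedman chi-square statistic and its p-value (k - 1 degrees of freedom)
chi2_manual = 12 * n_datasets / (k * (k + 1)) * (
    np.sum(mean_rank_per_clf ** 2) - k * (k + 1) ** 2 / 4
)
p_manual = chi2.sf(chi2_manual, df=k - 1)

chi2_scipy, p_scipy = friedmanchisquare(*scores.T)
print(f"manual: chi2 = {chi2_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  chi2 = {chi2_scipy:.4f}, p = {p_scipy:.4f}")
```

The statistic does not depend on whether rank 1 goes to the best or the worst score, which is why the ranking direction is left at the default here.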

In [None]:
# Compute and visualize ranks
ranks = df_classifiers.rank(axis=1, ascending=False)
mean_ranks = ranks.mean()

print("Mean Ranks (lower is better)")
print("="*50)
print(mean_ranks.sort_values().round(2))

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Mean accuracy
df_classifiers.mean().sort_values(ascending=False).plot(kind='barh', ax=axes[0], color='steelblue')
axes[0].set_xlabel('Mean Accuracy')
axes[0].set_title('Mean Performance Across Datasets')
axes[0].grid(True, alpha=0.3)

# Mean ranks
mean_ranks.sort_values().plot(kind='barh', ax=axes[1], color='coral')
axes[1].set_xlabel('Mean Rank (lower is better)')
axes[1].set_title('Friedman Test Rankings')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

## Part 3: Nemenyi Post-Hoc Test

After the Friedman test rejects the null hypothesis, we need **post-hoc tests** to determine which pairs differ.

**Nemenyi test**: All pairwise comparisons with family-wise error rate control

### Critical Difference

Two classifiers are significantly different if their mean rank difference exceeds the **critical difference (CD)**:

$$CD = q_\alpha \sqrt{\frac{k(k+1)}{6N}}$$

where:
- $q_\alpha$ is the critical value of the Studentized range distribution (for $k$ groups and infinite degrees of freedom) divided by $\sqrt{2}$
- $k$ is the number of classifiers
- $N$ is the number of datasets
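
As a worked instance of the formula, the sketch below evaluates the critical difference for $k = 5$ classifiers and $N = 10$ datasets at $\alpha = 0.05$, the same dimensions as the example data in this notebook. It assumes `scipy.stats.studentized_range` accepts `np.inf` as the degrees of freedom, which is how the large-sample critical value is obtained.

```python
import numpy as np
from scipy.stats import studentized_range

k, n_datasets, alpha = 5, 10, 0.05

# Studentized range critical value (infinite df), divided by sqrt(2) for Nemenyi
q_alpha = studentized_range.ppf(1 - alpha, k, np.inf) / np.sqrt(2)

# Critical difference in mean-rank units
cd = q_alpha * np.sqrt(k * (k + 1) / (6 * n_datasets))

print(f"q_alpha = {q_alpha:.3f}")
print(f"CD      = {cd:.3f}")
```

The value of `q_alpha` should land close to the 2.728 tabulated by Demšar (2006) for five classifiers, which gives a critical difference of roughly 1.93: two classifiers need mean ranks almost two positions apart to be declared different with only ten datasets.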

In [None]:
def nemenyi_test(ranks_df, alpha=0.05):
    """
    Perform Nemenyi post-hoc test.
    
    Parameters
    ----------
    ranks_df : DataFrame
        Ranks for each dataset (rows) and classifier (columns)
    alpha : float
        Significance level
    
    Returns
    -------
    dict with critical difference and pairwise comparisons
    """
    k = ranks_df.shape[1]  # Number of classifiers
    N = ranks_df.shape[0]  # Number of datasets
    
    # Critical value from Studentized range distribution
    # For Nemenyi, we use q_alpha / sqrt(2)
    from scipy.stats import studentized_range
    q_alpha = studentized_range.ppf(1 - alpha, k, np.inf) / np.sqrt(2)
    
    # Critical difference
    cd = q_alpha * np.sqrt(k * (k + 1) / (6 * N))
    
    # Mean ranks
    mean_ranks = ranks_df.mean()
    
    # Pairwise comparisons
    comparisons = []
    for clf1, clf2 in combinations(ranks_df.columns, 2):
        rank_diff = abs(mean_ranks[clf1] - mean_ranks[clf2])
        significant = rank_diff > cd
        comparisons.append({
            'classifier_1': clf1,
            'classifier_2': clf2,
            'rank_diff': rank_diff,
            'significant': significant,
        })
    
    return {
        'critical_difference': cd,
        'mean_ranks': mean_ranks,
        'comparisons': pd.DataFrame(comparisons),
    }

# Apply Nemenyi test
nemenyi_results = nemenyi_test(ranks)

print("Nemenyi Post-Hoc Test")
print("="*70)
print(f"Critical Difference (α=0.05): {nemenyi_results['critical_difference']:.3f}")
print(f"\nMean Ranks:")
print(nemenyi_results['mean_ranks'].sort_values().round(3))
print(f"\nPairwise Comparisons:")
print(nemenyi_results['comparisons'].to_string(index=False))
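
The critical difference gives only a yes/no answer per pair. A p-value for each pair can be obtained from the same Studentized range distribution. The helper below is a minimal sketch under that assumption; `nemenyi_p_value` and the mean ranks fed to it are hypothetical and chosen only for illustration.

```python
import numpy as np
from itertools import combinations
from scipy.stats import studentized_range

def nemenyi_p_value(rank_diff, k, n_datasets):
    """Approximate Nemenyi p-value for a difference in mean ranks."""
    se = np.sqrt(k * (k + 1) / (6 * n_datasets))
    # Rescale to the Studentized range scale (undoing the division by sqrt(2))
    q = abs(rank_diff) / se * np.sqrt(2)
    return studentized_range.sf(q, k, np.inf)

# Hypothetical mean ranks for 4 classifiers evaluated on 12 datasets
example_mean_ranks = {"A": 1.5, "B": 2.2, "C": 2.8, "D": 3.5}
k_example, n_example = len(example_mean_ranks), 12

for (name_1, rank_1), (name_2, rank_2) in combinations(example_mean_ranks.items(), 2):
    p = nemenyi_p_value(rank_1 - rank_2, k_example, n_example)
    print(f"{name_1} vs {name_2}: rank diff = {abs(rank_1 - rank_2):.2f}, p = {p:.4f}")
```

By construction, a rank difference exactly equal to the critical difference maps to a p-value of α, so the two views of the test agree.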

### Critical Difference Diagram

A standard visualization in ML benchmarking literature.

In [None]:
def plot_critical_difference(mean_ranks, cd, figsize=(12, 4)):
    """
    Plot critical difference diagram.
    
    Classifiers connected by a horizontal line are NOT significantly different.
    """
    fig, ax = plt.subplots(figsize=figsize)
    
    # Sort by rank
    sorted_ranks = mean_ranks.sort_values()
    n_classifiers = len(sorted_ranks)
    
    # Plot ranks on x-axis
    y_positions = np.arange(n_classifiers)
    
    # Draw classifier names and ranks
    for i, (clf, rank) in enumerate(sorted_ranks.items()):
        ax.plot(rank, i, 'o', markersize=10, color='steelblue')
        ax.text(rank, i + 0.3, clf, ha='center', fontsize=10, fontweight='bold')
        ax.text(rank, i - 0.3, f'{rank:.2f}', ha='center', fontsize=9, color='gray')
    
    # Draw critical difference bars
    # Connect classifiers that are NOT significantly different
    clf_list = sorted_ranks.index.tolist()
    for i in range(n_classifiers):
        for j in range(i + 1, n_classifiers):
            rank_diff = abs(sorted_ranks.iloc[j] - sorted_ranks.iloc[i])
            if rank_diff <= cd:
                # Not significant - draw connecting line
                y_mid = (i + j) / 2
                ax.plot([sorted_ranks.iloc[i], sorted_ranks.iloc[j]], 
                       [y_mid, y_mid], 'k-', linewidth=2, alpha=0.3)
    
    # Add CD reference
    ax.plot([1, 1 + cd], [-0.5, -0.5], 'r-', linewidth=3, label=f'CD = {cd:.2f}')
    ax.text(1 + cd/2, -0.7, f'Critical Difference', ha='center', color='red', fontsize=9)
    
    ax.set_xlabel('Mean Rank (lower is better)', fontsize=12)
    ax.set_yticks([])
    ax.set_ylim(-1, n_classifiers)
    ax.set_title('Critical Difference Diagram (Nemenyi Test)\nClassifiers connected by lines are not significantly different', 
                 fontsize=13, fontweight='bold')
    ax.grid(True, axis='x', alpha=0.3)
    
    plt.tight_layout()
    plt.show()

plot_critical_difference(nemenyi_results['mean_ranks'], nemenyi_results['critical_difference'])

---

## Part 4: Kruskal-Wallis Test (Independent Groups)

**When to use**: Comparing k groups that are **independent** (not repeated measures)

**Example**: Comparing gene expression across multiple tissue types (different samples)

In [None]:
# Generate independent group data
np.random.seed(42)

# Gene expression in 4 different tissues
tissue_a = np.random.lognormal(3, 0.5, 30)
tissue_b = np.random.lognormal(3.2, 0.5, 30)
tissue_c = np.random.lognormal(3.5, 0.5, 30)
tissue_d = np.random.lognormal(3.1, 0.5, 30)

# Kruskal-Wallis test
statistic, p_value = kruskal(tissue_a, tissue_b, tissue_c, tissue_d)

print("Kruskal-Wallis Test: Gene Expression Across Tissues")
print("="*60)
print(f"Test statistic (H): {statistic:.3f}")
print(f"p-value: {p_value:.4f}")
print(f"Significant at α=0.05: {p_value < 0.05}")

# Visualize
data_tissues = pd.DataFrame({
    'Tissue A': tissue_a,
    'Tissue B': tissue_b,
    'Tissue C': tissue_c,
    'Tissue D': tissue_d,
})

fig, ax = plt.subplots(figsize=(10, 6))
data_tissues.boxplot(ax=ax)
ax.set_yscale('log')  # Expression values are log-normal, so use a log axis to match the label
ax.set_ylabel('Gene Expression (log scale)')
ax.set_title('Gene Expression Across Tissue Types')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
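
For intuition about the H statistic, the sketch below recomputes it by hand on synthetic samples and compares the result with `kruskal`. It assumes continuous data without ties, so the tie correction is omitted, and uses $H = \frac{12}{N(N+1)} \sum_i \frac{R_i^2}{n_i} - 3(N+1)$, where $R_i$ is the rank sum of group $i$ and $N$ is the total sample size. The sample sizes and parameters are arbitrary.

```python
import numpy as np
from scipy.stats import chi2, kruskal, rankdata

rng = np.random.default_rng(0)
samples = [
    rng.lognormal(3.0, 0.5, 30),
    rng.lognormal(3.2, 0.5, 25),
    rng.lognormal(3.5, 0.5, 20),
]

# Rank all observations jointly, ignoring group membership
sizes = [len(s) for s in samples]
pooled_ranks = rankdata(np.concatenate(samples))
n_total = len(pooled_ranks)

# Split the pooled ranks back into their original groups
rank_groups = np.split(pooled_ranks, np.cumsum(sizes)[:-1])

h_manual = (
    12 / (n_total * (n_total + 1))
    * sum(r.sum() ** 2 / len(r) for r in rank_groups)
    - 3 * (n_total + 1)
)
p_manual_kw = chi2.sf(h_manual, df=len(samples) - 1)

h_scipy, p_scipy_kw = kruskal(*samples)
print(f"manual: H = {h_manual:.4f}, p = {p_manual_kw:.4f}")
print(f"scipy:  H = {h_scipy:.4f}, p = {p_scipy_kw:.4f}")
```

Unlike Friedman, the ranking here is global rather than per row, which is what makes the test suitable for independent groups of unequal size.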

### Dunn's Post-Hoc Test

After Kruskal-Wallis, use **Dunn's test** for pairwise comparisons.

In [None]:
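def dunn_test(groups, alpha=0.05):
    """
    Perform Dunn's post-hoc test with Bonferroni correction.
    
    Dunn's test ranks all observations jointly (as Kruskal-Wallis does) and
    compares mean ranks pairwise with a z statistic. No tie correction is
    applied here, which is adequate for continuous data.
    
    Parameters
    ----------
    groups : dict
        Dictionary of group_name: data arrays
    alpha : float
        Significance level
    """
    from scipy.stats import norm, rankdata
    
    group_names = list(groups.keys())
    n_comparisons = len(group_names) * (len(group_names) - 1) // 2
    
    # Rank the pooled data, then split the ranks back into their groups
    sizes = {g: len(groups[g]) for g in group_names}
    pooled_ranks = rankdata(np.concatenate([groups[g] for g in group_names]))
    n_total = len(pooled_ranks)
    split_points = np.cumsum(list(sizes.values()))[:-1]
    mean_rank = {g: r.mean() for g, r in zip(group_names, np.split(pooled_ranks, split_points))}
    
    results = []
    for g1, g2 in combinations(group_names, 2):
        # Standard error of the difference in mean ranks
        se = np.sqrt(n_total * (n_total + 1) / 12 * (1 / sizes[g1] + 1 / sizes[g2]))
        z = (mean_rank[g1] - mean_rank[g2]) / se
        p = 2 * norm.sf(abs(z))  # Two-sided p-value
        p_corrected = min(p * n_comparisons, 1.0)  # Bonferroni adjustment, capped at 1
        results.append({
            'group_1': g1,
            'group_2': g2,
            'statistic': z,
            'p_value': p,
            'p_corrected': p_corrected,
            'significant': p_corrected < alpha,
        })
    
    return pd.DataFrame(results)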

groups_dict = {
    'Tissue A': tissue_a,
    'Tissue B': tissue_b,
    'Tissue C': tissue_c,
    'Tissue D': tissue_d,
}

dunn_results = dunn_test(groups_dict)

print("Dunn's Post-Hoc Test (Bonferroni corrected)")
print("="*70)
print(dunn_results.to_string(index=False))
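
Bonferroni is simple but conservative. A common alternative that still controls the family-wise error rate is Holm's step-down procedure. The sketch below is an illustrative implementation, not part of the notebook's workflow; `holm_adjust` is a hypothetical helper and the p-values passed to it are made up for the example.

```python
import numpy as np

def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)  # process p-values from smallest to largest
    adjusted = np.empty(m)
    running_max = 0.0
    for position, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on
        candidate = (m - position) * p[idx]
        # Enforce monotonicity so adjusted p-values never decrease along the order
        running_max = max(running_max, candidate)
        adjusted[idx] = min(running_max, 1.0)
    return adjusted

raw_p = [0.01, 0.04, 0.03]  # made-up raw p-values
holm_p = holm_adjust(raw_p)
bonferroni_p = np.minimum(np.array(raw_p) * len(raw_p), 1.0)

print("Raw:       ", raw_p)
print("Holm:      ", holm_p.round(3))
print("Bonferroni:", bonferroni_p.round(3))
```

Holm-adjusted p-values are never larger than their Bonferroni counterparts, so the procedure rejects at least as many hypotheses at the same α.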

---

## Part 5: ANOVA (Parametric Alternative)

**When to use**: Independent groups whose data are approximately normal with roughly equal variances

**Advantage**: More powerful than non-parametric tests when assumptions hold

In [None]:
# Generate normal data
np.random.seed(42)

group1 = np.random.normal(100, 15, 50)
group2 = np.random.normal(105, 15, 50)
group3 = np.random.normal(110, 15, 50)
group4 = np.random.normal(103, 15, 50)

# One-way ANOVA
f_stat, p_value = f_oneway(group1, group2, group3, group4)

print("One-Way ANOVA")
print("="*50)
print(f"F-statistic: {f_stat:.3f}")
print(f"p-value: {p_value:.4f}")
print(f"Significant at α=0.05: {p_value < 0.05}")
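
Before relying on ANOVA, it can help to check the two assumptions named above. The sketch below is one possible way to do that, assuming Shapiro-Wilk for per-group normality and Levene's test for equal variances; the synthetic groups and the dictionary name `example_groups` are illustrative only.

```python
import numpy as np
from scipy.stats import levene, shapiro

rng = np.random.default_rng(0)
example_groups = {
    "group1": rng.normal(100, 15, 50),
    "group2": rng.normal(105, 15, 50),
    "group3": rng.normal(110, 15, 50),
}

# Normality: one Shapiro-Wilk test per group (H0: the sample is normal)
shapiro_p = {}
for name, values in example_groups.items():
    _, shapiro_p[name] = shapiro(values)
    print(f"{name}: Shapiro-Wilk p = {shapiro_p[name]:.3f}")

# Equal variances: Levene's test across all groups (H0: variances are equal)
levene_stat, levene_p = levene(*example_groups.values())
print(f"Levene p = {levene_p:.3f}")
```

Small p-values here argue against the corresponding assumption and would point toward Kruskal-Wallis instead. With large samples these tests flag even minor deviations, so a visual check such as a boxplot or Q-Q plot remains useful alongside them.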

### Tukey's HSD Post-Hoc

After ANOVA, use **Tukey's Honestly Significant Difference** test.

In [None]:
from scipy.stats import tukey_hsd

# Perform Tukey HSD
res = tukey_hsd(group1, group2, group3, group4)

print("Tukey's HSD Post-Hoc Test")
print("="*50)
print("Pairwise confidence intervals:")
print(res.confidence_interval())
print(f"\nPairwise p-values:")
print(res.pvalue)

---

## Part 6: Practical Application - Ensemble Learning

Complete workflow for comparing multiple classifiers.

In [None]:
def compare_classifiers_workflow(performance_df, alpha=0.05):
    """
    Complete workflow for comparing multiple classifiers.
    
    Parameters
    ----------
    performance_df : DataFrame
        Rows = datasets, Columns = classifiers, Values = performance metric
    alpha : float
        Significance level
    """
    print("="*70)
    print("CLASSIFIER COMPARISON WORKFLOW")
    print("="*70)
    
    # Step 1: Descriptive statistics
    print("\n1. DESCRIPTIVE STATISTICS")
    print("-"*70)
    print("Mean performance:")
    print(performance_df.mean().sort_values(ascending=False).round(4))
    
    # Step 2: Friedman test
    print("\n2. FRIEDMAN TEST (Omnibus)")
    print("-"*70)
    stat, p = friedmanchisquare(*[performance_df[col] for col in performance_df.columns])
    print(f"χ² = {stat:.3f}, p = {p:.4f}")
    
    if p >= alpha:
        print(f"\n✗ No significant differences detected (p ≥ {alpha})")
        print("  → Stop here. No need for post-hoc tests.")
        return
    
    print(f"\n✓ Significant differences detected (p < {alpha})")
    print("  → Proceed to post-hoc tests")
    
    # Step 3: Compute ranks
    ranks = performance_df.rank(axis=1, ascending=False)
    mean_ranks = ranks.mean().sort_values()
    
    print("\n3. MEAN RANKS")
    print("-"*70)
    print(mean_ranks.round(3))
    
    # Step 4: Nemenyi test
    print("\n4. NEMENYI POST-HOC TEST")
    print("-"*70)
    nemenyi_res = nemenyi_test(ranks, alpha)
    print(f"Critical Difference: {nemenyi_res['critical_difference']:.3f}")
    print("\nSignificant pairwise differences:")
    sig_comparisons = nemenyi_res['comparisons'][nemenyi_res['comparisons']['significant']]
    if len(sig_comparisons) > 0:
        print(sig_comparisons[['classifier_1', 'classifier_2', 'rank_diff']].to_string(index=False))
    else:
        print("  (none)")
    
    # Step 5: Visualization
    print("\n5. CRITICAL DIFFERENCE DIAGRAM")
    print("-"*70)
    plot_critical_difference(mean_ranks, nemenyi_res['critical_difference'])
    
    return nemenyi_res

# Apply to our classifier data
results = compare_classifiers_workflow(df_classifiers)

---

## Part 7: Decision Guide

### Which Test Should I Use?

```
┌─────────────────────────────────────┐
│ How many groups?                    │
└─────────────────────────────────────┘
           │
           ├─ 2 groups → Use A/B testing (t-test, Mann-Whitney)
           │
           └─ 3+ groups
                  │
                  ├─ Repeated measures? (same subjects/datasets)
                  │     │
                  │     ├─ Yes → FRIEDMAN TEST
                  │     │         Post-hoc: NEMENYI
                  │     │
                  │     └─ No → Independent groups
                  │               │
                  │               ├─ Normal + equal variance?
                  │               │     │
                  │               │     ├─ Yes → ANOVA
                  │               │     │         Post-hoc: TUKEY HSD
                  │               │     │
                  │               │     └─ No → KRUSKAL-WALLIS
                  │               │               Post-hoc: DUNN
```
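
The decision tree can also be written as a small lookup function. This is only a sketch of the guide above: `choose_test` and its argument names are hypothetical, and the two-group branch ignores `repeated_measures` because the guide does not cover paired two-group tests.

```python
def choose_test(n_groups, repeated_measures, normal_equal_variance):
    """Return (omnibus_test, post_hoc_test) following the decision guide."""
    if n_groups < 2:
        raise ValueError("Need at least two groups to compare")
    if n_groups == 2:
        # Two groups: ordinary A/B testing, no post-hoc step needed
        test = "t-test" if normal_equal_variance else "Mann-Whitney"
        return test, None
    if repeated_measures:
        return "Friedman", "Nemenyi"
    if normal_equal_variance:
        return "ANOVA", "Tukey HSD"
    return "Kruskal-Wallis", "Dunn"

# Five classifiers evaluated on the same datasets
print(choose_test(5, repeated_measures=True, normal_equal_variance=False))
# Four independent tissue types with skewed expression values
print(choose_test(4, repeated_measures=False, normal_equal_variance=False))
```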

### Summary Table

| Test | Data Type | Groups | Assumptions | Post-Hoc |
|------|-----------|--------|-------------|----------|
| **Friedman** | Repeated measures | 3+ | Non-parametric | Nemenyi |
| **Kruskal-Wallis** | Independent | 3+ | Non-parametric | Dunn |
| **ANOVA** | Independent | 3+ | Normal, equal variance | Tukey HSD |
| **t-test** | Independent | 2 | Normal | N/A |
| **Mann-Whitney** | Independent | 2 | Non-parametric | N/A |

---

## Summary

### Key Takeaways

1. **Don't run uncorrected pairwise tests**: Start with an omnibus test, then use a post-hoc procedure that controls the family-wise error rate
2. **Friedman test is ideal for ML benchmarking**: Accounts for dataset-specific difficulty
3. **Nemenyi test controls FWER**: Safe for all pairwise comparisons after Friedman
4. **Critical difference diagrams**: Standard visualization in ML literature
5. **Choose the right test**: Match your test to your data structure (repeated vs independent)

### Connection to A/B Testing

- **A/B testing**: Special case with k=2 groups
- **Multi-group testing**: Generalization to k≥3 groups
- **Same principles**: Randomization, hypothesis testing, multiple testing correction

### When to Use This

**Your ensemble learning case**: Comparing multiple constituent classifiers
- Use **Friedman test** (repeated measures on same datasets)
- Follow with **Nemenyi post-hoc**
- Visualize with **critical difference diagram**

### References

- Demšar (2006). "Statistical Comparisons of Classifiers over Multiple Data Sets"
- García & Herrera (2008). "An Extension on 'Statistical Comparisons of Classifiers over Multiple Data Sets' for all Pairwise Comparisons"
- Benavoli et al. (2016). "Should We Really Use Post-Hoc Tests Based on Mean-Ranks?"