<a href="https://colab.research.google.com/github/lawrennd/fitkit/blob/main/examples/wikipedia_editing_fitness_complexity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Wikipedia Editing Data: fitness / complexity analysis

This notebook:

- Downloads a sample of Wikipedia users from BigQuery and aggregates their edits into per-user text.
- Builds a `user` $\times$ `word` matrix and its *support* (analogous to `country` $\times$ `product`).
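
As a rough illustration of the second step, the sketch below builds a tiny `user` $\times$ `word` count matrix from made-up per-user text and then takes its support. The `texts` dictionary and the whitespace tokeniser are assumptions for illustration only; the notebook itself obtains the matrix from `WikipediaLoader` and stores it in sparse form.

```python
import numpy as np

# Hypothetical per-user aggregated text (not real Wikipedia data).
texts = {
    "alice": "galaxy redshift galaxy quasar",
    "bob": "enzyme protein galaxy",
    "carol": "protein protein",
}

users = sorted(texts)
vocab = sorted({w for t in texts.values() for w in t.split()})
col = {w: j for j, w in enumerate(vocab)}

# Count matrix X: X[u, w] = number of times user u wrote word w.
X = np.zeros((len(users), len(vocab)), dtype=int)
for i, u in enumerate(users):
    for w in texts[u].split():
        X[i, col[w]] += 1

# Support M: 1 where the user has used the word at all, 0 otherwise.
M = (X > 0).astype(int)

print(vocab)
print(M)
```

Row sums of `M` play the role of diversification and column sums the role of ubiquity, mirroring the `country` $\times$ `product` setting.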

**Expected behavior**: Wikipedia likely has **community/modular structure** (different specialist fields like astrophysics, biology, history) rather than a single nested hierarchy. This means:
- **ECI** will show **poor correlation** with diversification globally (community structure breaks its assumption)
- **Eigenvalue analysis** can detect and quantify community structure
- **Community detection** can separate users into specialist groups
- **ECI may work better within communities** (where structure is more nested)
- **log(Fitness)** remains **robust** as a global complexity measure
- See the nested_matrix notebook for systematic analysis of this effect


### Why log(Fitness)?

Throughout this notebook, we use **log(Fitness)** for correlations with ECI and diversification because:
- Fitness is a **multiplicative/exponential** quantity (spans many orders of magnitude)
- ECI and diversification are **linear** scales
- log(Fitness) provides the appropriate scale for meaningful comparison
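
The effect of the log transform can be seen on synthetic data. The sketch below is not taken from the notebook's dataset: it assumes a made-up linear score and a "fitness" that is exponential in that score, then compares Pearson correlations before and after taking logs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: a linear score and a "fitness" that is exponential in it.
linear_score = rng.normal(size=500)
fitness = np.exp(3.0 * linear_score + 0.3 * rng.normal(size=500))

# Pearson correlation is dominated by the few very large raw fitness values,
# whereas on the log scale the relationship is close to linear.
corr_raw = np.corrcoef(linear_score, fitness)[0, 1]
corr_log = np.corrcoef(linear_score, np.log(fitness))[0, 1]

print(f"corr(score, Fitness)      = {corr_raw:.3f}")
print(f"corr(score, log(Fitness)) = {corr_log:.3f}")
```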


### BigQuery Setup Instructions

To run the BigQuery query cells in this notebook, you need a Google Cloud project with the BigQuery API enabled and authentication configured.

Here's a general guide:

1.  **Google Cloud Account**: If you don't have one, sign up for a Google Cloud account. You might be eligible for a free trial.
    *   [Sign up for Google Cloud](https://cloud.google.com/free)

2.  **Create/Select a Project**: In the [Google Cloud Console](https://console.cloud.google.com/), create a new project or select an existing one.
    *   Ensure that **billing is enabled** for your project, as BigQuery usage incurs costs (though often minimal for small queries, especially with the free tier).

3.  **Enable the BigQuery API**: For your selected project, ensure the BigQuery API is enabled.
    *   Go to the [API Library](https://console.cloud.google.com/apis/library) in the Cloud Console.
    *   Search for "BigQuery API" and enable it if it's not already enabled.

4.  **Authentication**:
    
    **In Google Colab**: Authentication is handled for you. The `WikipediaLoader` detects the Colab environment and calls `google.colab.auth.authenticate_user()`, which prompts you to log in with your Google account.
    
    **In Local Jupyter**: You need to set up Application Default Credentials (ADC) using the `gcloud` CLI:
    ```bash
    gcloud auth application-default login
    ```
    This will authenticate you and allow the `WikipediaLoader` to access BigQuery using your credentials.

Once these steps are complete, you should be able to run the BigQuery cells successfully!

### 0) Setup

You’ll need BigQuery credentials configured locally (e.g. `gcloud auth application-default login`).

If you don’t have BigQuery access, you can still run the later cells by loading from the cached parquet file (see the caching cell below).


In [None]:
import sys
import subprocess
from pathlib import Path


def _pip_install(args: list[str]) -> None:
    cmd = [sys.executable, "-m", "pip", *args]
    print("Running:", " ".join(cmd))
    subprocess.check_call(cmd)


def ensure_fitkit_installed() -> None:
    """Prefer editable local install; fall back to GitHub.

    - Local (typical): `pip install -e ..` when running from `examples/`
    - Colab/remote: `pip install git+https://github.com/lawrennd/fitkit.git`
    """
    try:
        import fitkit  # noqa: F401

        return
    except ImportError:
        pass

    here = Path.cwd().resolve()
    candidates = [here, here.parent, here.parent.parent]

    for root in candidates:
        if (root / "pyproject.toml").exists() and (root / "fitkit").is_dir():
            _pip_install(["install", "-e", str(root)])
            return

    _pip_install(["install", "git+https://github.com/lawrennd/fitkit.git"])


ensure_fitkit_installed()
import fitkit

print("fitkit version:", getattr(fitkit, "__version__", "unknown"))

In [None]:
from fitkit.data import WikipediaLoader, QueryConfig, create_small_fixture
from fitkit.algorithms import FitnessComplexity, ECI, SinkhornScaler
from fitkit.algorithms import fitness_complexity, compute_eci_pci, sinkhorn_masked  # functional API also available

In [None]:
# Core
import os
import numpy as np
import pandas as pd

# Sparse matrices
import scipy.sparse as sp

# Plotting
import matplotlib.pyplot as plt


In [None]:
# Standard random sampling (no specific users)
cfg = QueryConfig()

CACHE_DIR = "data"
os.makedirs(CACHE_DIR, exist_ok=True)

# Updated cache path for Wikipedia data (v4 - random sample)
CACHE_PATH = os.path.join(
    CACHE_DIR,
    f"wikipedia_authors{cfg.max_authors}_v4.parquet",
)

print("Cache path:", CACHE_PATH)

In [None]:
# Load data using WikipediaLoader
print(f"Using cache path: {CACHE_PATH}")

loader = WikipediaLoader(cfg, CACHE_PATH)
bundle = loader.load()

# Extract components from bundle
X = bundle.matrix
user_ids = bundle.row_labels.tolist()
vocab = bundle.col_labels.tolist()

print(f"Loaded: {len(user_ids)} users, {len(vocab)} words")
print(f"Matrix shape: {X.shape}, dtype: {X.dtype}")
print(f"Matrix is sparse: {sp.issparse(X)}")

In [None]:
# 2) Extract support matrix and prepare for analysis
#
# In the paper's language, we will treat the *support* as M_{uw} = 1{X_{uw} > 0}.
# The matrix X from WikipediaLoader already has filtering applied (via QueryConfig).
# The loader uses binary=False by default (word counts), but we can work with either.

# Support mask (structural zeros off-support)
M = X.copy()
M.data = np.ones_like(M.data)

# Basic margins (analogues of diversification and ubiquity)
user_strength = np.asarray(X.sum(axis=1)).ravel()
word_strength = np.asarray(X.sum(axis=0)).ravel()

print("User strength:", pd.Series(user_strength).describe())
print("Word strength:", pd.Series(word_strength).describe())
print(f"Matrix -> Users: {X.shape[0]}, Vocab: {X.shape[1]}")

# Labeled view for plotting and downstream helpers
M_df = pd.DataFrame.sparse.from_spmatrix(M, index=user_ids, columns=vocab)

### 3) Baseline: 1D Pietronero Fitness–Complexity fixed point

This is the usual nonlinear rank-1 fixed point on the **support matrix** $M$ (binary incidence). We’ll compute it as a scalar reference, then move to the rank-2 extension.
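
For orientation, here is a minimal sketch of the plain Fitness–Complexity iteration on a dense toy matrix. It is an illustrative re-implementation under simple assumptions (fixed iteration count, mean normalisation at every step), not the code inside `fitkit`'s `FitnessComplexity`; the function name `fitness_complexity_sketch` is hypothetical.

```python
import numpy as np

def fitness_complexity_sketch(M, n_iter=50):
    """Plain Fitness-Complexity iteration on a dense 0/1 matrix (illustrative only)."""
    M = np.asarray(M, dtype=float)
    F = np.ones(M.shape[0])
    Q = np.ones(M.shape[1])
    for _ in range(n_iter):
        F_new = M @ Q                      # fitness: sum of complexities of the words used
        Q_new = 1.0 / (M.T @ (1.0 / F))    # complexity: dominated by the least-fit users of a word
        F = F_new / F_new.mean()           # fix the overall scale at each step
        Q = Q_new / Q_new.mean()
    return F, Q

# Toy nested support: user 0 uses every word, user 3 only the most common one.
M_toy = np.triu(np.ones((4, 4), dtype=int))
F_toy, Q_toy = fitness_complexity_sketch(M_toy)
print("Fitness:   ", np.round(F_toy, 4))
print("Complexity:", np.round(Q_toy, 4))
```

On this nested toy matrix the fitness ordering follows diversification, and the word used only by the most diversified user receives the highest complexity.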


### Fitness–Complexity ⇄ IPF/Sinkhorn equivalence (what the paper is using)

In the paper (`economic-fitness.tex`), the key point is that **Fitness–Complexity is a reparameterisation of masked IPF/Sinkhorn matrix scaling** on the support graph.

- We solve for a coupling/flow $W_{uw}\ge 0$ supported on $M$ such that $\sum_w W_{uw}=r_u$ and $\sum_u W_{uw}=c_w$.
- IPF/Sinkhorn gives a diagonal scaling solution $W_{uw} = M_{uw} A_u B_w$.
- Setting $A_u \equiv 1/F_u$ and $B_w \equiv Q_w$ yields $W_{uw} \propto M_{uw} Q_w/F_u$, and the FC fixed-point updates recover the scaling equations (up to the usual projective normalisation/gauge).

So the Sinkhorn/IPF object here is **not a different model**: it is the same masked matrix-scaling problem, viewed in “flow” form. The only extra modelling choice is **which marginals $(r,c)$** to impose (uniform is a common default in the support-only setting; data marginals are natural for quantitative flows).
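
The correspondence can be checked numerically on a small example. The sketch below is a bare-bones masked Sinkhorn loop written for illustration (the name `masked_sinkhorn_sketch` is hypothetical and this is not `fitkit`'s `SinkhornScaler`); it assumes a dense toy support on which uniform marginals are feasible.

```python
import numpy as np

def masked_sinkhorn_sketch(M, r, c, n_iter=500):
    """Alternate row/column scaling so that diag(A) M diag(B) has marginals (r, c)."""
    M = np.asarray(M, dtype=float)
    A = np.ones(M.shape[0])
    B = np.ones(M.shape[1])
    for _ in range(n_iter):
        A = r / (M @ B)        # rescale rows to match r
        B = c / (M.T @ A)      # rescale columns to match c
    W = A[:, None] * M * B[None, :]
    return W, A, B

# Toy support on which uniform marginals are feasible.
M_toy = np.array([[1, 1, 1],
                  [0, 1, 1],
                  [1, 0, 1]])
r = np.full(3, 1 / 3)
c = np.full(3, 1 / 3)

W, A, B = masked_sinkhorn_sketch(M_toy, r, c)

# Reparameterise as in the text: A = 1/F, B = Q.
F, Q = 1.0 / A, B
print("row sums:", W.sum(axis=1))
print("col sums:", W.sum(axis=0))
# At the fixed point r_u * F_u = sum_w M_uw Q_w, i.e. the FC fitness update up to scale.
print("FC fitness relation holds:", np.allclose(r * F, M_toy @ Q, atol=1e-8))
```

Entries outside the support stay exactly zero because the scaling only multiplies `M` elementwise.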


In [None]:
# Use sklearn-style estimators
fc = FitnessComplexity()
F, Q = fc.fit_transform(M)
fc_hist = fc.history_

eci_model = ECI()
eci, pci = eci_model.fit_transform(M)

F_s = pd.Series(F, index=user_ids, name="Fitness")
Q_s = pd.Series(Q, index=vocab, name="Complexity")
eci_s = pd.Series(eci, index=user_ids, name="ECI")
pci_s = pd.Series(pci, index=vocab, name="PCI")

kc = pd.Series(np.asarray(M.sum(axis=1)).ravel(), index=user_ids, name="diversification_kc")
kp = pd.Series(np.asarray(M.sum(axis=0)).ravel(), index=vocab, name="ubiquity_kp")

# Sinkhorn/IPF scaling to build a flow W on the support.
# For the FC ⇄ Sinkhorn equivalence viewpoint, the natural default is *uniform* marginals.
# However, uniform marginals can be infeasible on some sparse masks; we fall back if needed.

# default: uniform marginals (same total mass, different per-node mass if rectangular)
r_uniform = np.ones(M.shape[0], dtype=float)
r_uniform = r_uniform / r_uniform.sum()
c_uniform = np.ones(M.shape[1], dtype=float)
c_uniform = c_uniform / c_uniform.sum()

scaler = SinkhornScaler()
W = scaler.fit_transform(M, row_marginals=r_uniform, col_marginals=c_uniform)
u, v, sk_hist = scaler.u_, scaler.v_, scaler.history_

if not sk_hist.get("converged", False):
    print("Sinkhorn with uniform marginals did not converge; falling back to degree marginals.")
    r_deg = kc.to_numpy(dtype=float)
    r_deg = r_deg / r_deg.sum()
    c_deg = kp.to_numpy(dtype=float)
    c_deg = c_deg / c_deg.sum()
    scaler = SinkhornScaler()
    W = scaler.fit_transform(M, row_marginals=r_deg, col_marginals=c_deg)
    u, v, sk_hist = scaler.u_, scaler.v_, scaler.history_

results_countries = pd.concat([F_s, eci_s, kc], axis=1).sort_values("Fitness", ascending=False)
results_products = pd.concat([Q_s, pci_s, kp], axis=1).sort_values("Complexity", ascending=False)

word_scores_1d = Q_s.sort_values(ascending=False)
user_scores_1d = F_s.sort_values(ascending=False)

print("Top 20 words by complexity:")
print(word_scores_1d.head(20))
print("Top 20 users by fitness:")
print(user_scores_1d.head(20))

## Flow-native visualisations (Sinkhorn/OT coupling) + ranked barcodes

The objects we visualise here are:

- binary support: `M` (user×word)
- Sinkhorn/IPF scaling factors: `u`, `v` (dual variables)
- coupling / feasible flow: `W` where `W = diag(u) * M * diag(v)` (on the support)

To avoid “hairballs”, every flow plot below supports **top-k / top-edge filtering**.
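
As an illustration of what top-edge filtering means, the following sketch keeps only the heaviest entries of a small dense flow matrix. The helper `top_edges_sketch` and the toy matrix are hypothetical; the plotting functions in `fitkit.diagnostics` may implement their `max_edges` filtering differently.

```python
import numpy as np

def top_edges_sketch(W, max_edges):
    """Return (row, col, weight) for the `max_edges` heaviest entries of a dense flow matrix."""
    W = np.asarray(W, dtype=float)
    rows, cols = np.nonzero(W)                     # edges on the support only
    weights = W[rows, cols]
    order = np.argsort(weights)[::-1][:max_edges]  # heaviest first
    return [(int(rows[k]), int(cols[k]), float(weights[k])) for k in order]

# Toy coupling: rows are users, columns are words.
W_toy = np.array([[0.30, 0.00, 0.05],
                  [0.10, 0.25, 0.00],
                  [0.00, 0.10, 0.20]])

for i, j, w in top_edges_sketch(W_toy, max_edges=3):
    print(f"user {i} -> word {j}: {w:.2f}")
```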

In [None]:
# Diagnostics: convergence
fig, ax = plt.subplots(1, 2, figsize=(10, 3))
ax[0].plot(fc.history_["dF"], label="max |ΔF|")
ax[0].plot(fc.history_["dQ"], label="max |ΔQ|")
ax[0].set_yscale("log")
ax[0].set_title("FC convergence")
ax[0].legend()

ax[1].plot(scaler.history_["dr"], label="max row marginal error")
ax[1].plot(scaler.history_["dc"], label="max col marginal error")
ax[1].set_yscale("log")
ax[1].set_title("Sinkhorn/IPF convergence")
ax[1].legend()

plt.tight_layout()
plt.show()

# Diagnostics: nestedness-like visualization (sort by Fitness/Complexity)
M_sorted = M_df.loc[results_countries.index, results_products.index]
plt.figure(figsize=(10, 4))
plt.imshow(M_sorted.sparse.to_dense().to_numpy(), aspect="auto", interpolation="nearest", cmap="Greys")
plt.title("M sorted by Fitness (rows) and Complexity (cols)")
plt.xlabel("words")
plt.ylabel("users")
plt.tight_layout()
plt.show()

# Diagnostics: compare rankings
# Calculate log(Fitness) for meaningful correlation with linear scales
results_countries["log_Fitness"] = np.log(results_countries["Fitness"])

# Compute correlations
corr_eci_div = results_countries["ECI"].corr(results_countries["diversification_kc"])
corr_logF_div = results_countries["log_Fitness"].corr(results_countries["diversification_kc"])
corr_eci_logF = results_countries["ECI"].corr(results_countries["log_Fitness"])

print("\n" + "="*60)
print("CORRELATION ANALYSIS")
print("="*60)
print(f"Correlation(ECI, Diversification):         {corr_eci_div:.4f}")
print(f"Correlation(log(Fitness), Diversification): {corr_logF_div:.4f}")
print(f"Correlation(ECI, log(Fitness)):             {corr_eci_logF:.4f}")
print("\nInterpretation:")
print("- Low ECI correlations suggest Wikipedia has COMMUNITY/MODULAR structure")
print("- Different specialist communities (astrophysics, biology, history, etc.)")
print("  each have their own specialized vocabulary (rare technical terms)")
print("- This breaks the single nested hierarchy that ECI requires")
print("- log(Fitness) remains robust: works across different data structures")
print("\nWhy ECI fails here:")
print("- ECI assumes: high-diversity users use ALL words low-diversity users have")
print("- Reality: specialist editors use rare technical terms generalists don't")
print("- Just 2 communities drops ECI correlation from 0.9 → 0.1 (see nested notebook)")
print("="*60 + "\n")

plt.figure(figsize=(5, 4))
plt.scatter(results_countries["ECI"], results_countries["Fitness"], s=15, alpha=0.7)
plt.xlabel("ECI (standardised)")
plt.ylabel("Fitness")
plt.title(f"Users: Fitness vs ECI\n(Correlation with log(Fitness): {corr_eci_logF:.3f})")
plt.yscale('log')
plt.tight_layout()
plt.show()


### Eigenvalue Spectrum Analysis

**Key diagnostic**: The eigenvalue spectrum of the user-user projection matrix C reveals community structure.

**Expected patterns**:
- **Perfect nesting**: λ₂ >> λ₃ >> ... (clear spectral gap, single dominant structure)
- **k Communities**: λ₂ ≈ λ₃ ≈ ... ≈ λₖ (multiple significant eigenvalues)

The **spectral gap ratio** (λ₂/λ₃) tells us:
- Large ratio (>5): Single nested hierarchy → ECI works
- Small ratio (<2): Multiple communities → ECI fails
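
To make the diagnostic concrete, the sketch below computes the spectrum of $C$ for two toy supports. It is an illustration under stated assumptions: dense matrices with no empty rows or columns, and hypothetical helper names `projection_spectrum` and `spectral_gap_ratio`. Because $C$ is not symmetric, the eigenvalues are taken from the similar symmetric matrix $S = D^{-1/2} M U^{-1} M^\top D^{-1/2}$, where $D$ and $U$ are the diagonal matrices of diversification and ubiquity.

```python
import numpy as np

def projection_spectrum(M):
    """Eigenvalues (descending) of C = D^-1 M U^-1 M^T via its symmetric similar form."""
    M = np.asarray(M, dtype=float)
    kc = M.sum(axis=1)   # diversification (row degrees), assumed > 0
    kp = M.sum(axis=0)   # ubiquity (column degrees), assumed > 0
    # S = B B^T with B = D^-1/2 M U^-1/2 is symmetric and shares C's eigenvalues.
    B = M / np.sqrt(kc)[:, None] / np.sqrt(kp)[None, :]
    S = B @ B.T
    return np.sort(np.linalg.eigvalsh(S))[::-1]

def spectral_gap_ratio(M, eps=1e-12):
    """Ratio of the 2nd to the 3rd eigenvalue (inf if the 3rd is numerically zero)."""
    lam = projection_spectrum(M)
    return np.inf if lam[2] < eps else lam[1] / lam[2]

# Toy nested support: each user's words are a subset of the previous user's.
M_nested = np.triu(np.ones((6, 6)))

# Toy modular support: three disjoint blocks of 4 users x 3 words.
M_blocks = np.kron(np.eye(3), np.ones((4, 3)))

print("nested :", spectral_gap_ratio(M_nested))
print("modular:", spectral_gap_ratio(M_blocks))
```

With three fully disconnected blocks the top three eigenvalues all equal 1, so the ratio is 1; real data with weakly connected communities sits somewhere between the two toy cases.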

In [None]:
# Compute user-user projection matrix (what ECI uses)
# C = (M/kc) @ (M^T/kp)
kc = np.asarray(M.sum(axis=1)).ravel()  # diversification
kp = np.asarray(M.sum(axis=0)).ravel()  # ubiquity

# Avoid division by zero
kc_safe = np.where(kc > 0, kc, 1)
kp_safe = np.where(kp > 0, kp, 1)

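# Compute C (row-stochastic, but not symmetric in general)
M_normalized = M.toarray() / kc_safe[:, np.newaxis]
M_T_normalized = M.toarray().T / kp_safe[:, np.newaxis]
C = M_normalized @ M_T_normalized

# Compute eigenvalues
# eigvalsh assumes a symmetric input, and C is not symmetric. Use the similar
# symmetric matrix S = D^{1/2} C D^{-1/2} (D = diag(kc)), which has the same spectrum.
d_sqrt = np.sqrt(kc_safe)
S = d_sqrt[:, np.newaxis] * C / d_sqrt[np.newaxis, :]
S = (S + S.T) / 2  # remove floating-point asymmetry
eigenvalues = np.linalg.eigvalsh(S)
eigenvalues = np.sort(eigenvalues)[::-1]  # Sort descending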

# Analyze top eigenvalues
n_show = min(15, len(eigenvalues))
top_eigenvalues = eigenvalues[:n_show]

# Compute spectral gap
spectral_gap = eigenvalues[1] / eigenvalues[2] if len(eigenvalues) > 2 else np.inf

print("\n" + "="*70)
print("EIGENVALUE SPECTRUM ANALYSIS")
print("="*70)
print(f"\nTop {n_show} eigenvalues of user-user matrix C:")
print("-" * 70)
for i, eig in enumerate(top_eigenvalues, 1):
    marker = "  ← ECI uses this" if i == 2 else ""
    print(f"  λ{i:<2} = {eig:>8.4f}{marker}")

print(f"\nSpectral gap ratio (λ₂/λ₃): {spectral_gap:.2f}")

# Compute relative magnitudes
if len(eigenvalues) > 3:
    rel_3 = eigenvalues[2] / eigenvalues[1]
    rel_4 = eigenvalues[3] / eigenvalues[1]
    print(f"Relative magnitudes: λ₃/λ₂={rel_3:.2f}, λ₄/λ₂={rel_4:.2f}")

# Count significant eigenvalues (>20% of λ₂)
n_sig = np.sum(eigenvalues > 0.2 * eigenvalues[1])
print(f"\nSignificant eigenvalues (>20% of λ₂): {n_sig}")

print("\nInterpretation:")
if spectral_gap > 5:
    print("  ✓ Large spectral gap → Single dominant structure → ECI should work")
elif spectral_gap > 2:
    print("  ⚠ Moderate spectral gap → Mixed structure → ECI may struggle")
else:
    print("  ✗ Small spectral gap → Multiple communities → ECI fails")
    print(f"    {n_sig} significant eigenvalues → {n_sig-1} communities likely present")

print("\n⚠️  KEY INSIGHT:")
print("  ECI uses only λ₂ (2nd eigenvector = 1 dimension) to capture complexity")
print(f"  But with {n_sig} significant eigenvalues, we need {n_sig-1} dimensions (λ₂, λ₃, ...)")
print("  The single-dimension projection misses most of the structure!")
print("="*70)

In [None]:
# Visualize eigenvalue spectrum
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Top eigenvalues (linear scale)
ax1.bar(range(1, n_show + 1), top_eigenvalues, color='steelblue', alpha=0.7, edgecolor='black')
ax1.axvline(2, color='red', linestyle='--', linewidth=2, label='ECI uses λ₂', alpha=0.7)
ax1.set_xlabel('Eigenvalue Index', fontsize=12)
ax1.set_ylabel('Eigenvalue', fontsize=12)
ax1.set_title('Top Eigenvalues of User-User Matrix', fontsize=13)
ax1.legend()
ax1.grid(True, alpha=0.3)

# Plot 2: Eigenvalue decay (log scale)
ax2.semilogy(range(1, n_show + 1), top_eigenvalues, 'o-', markersize=8, linewidth=2, color='steelblue')
ax2.axvline(2, color='red', linestyle='--', linewidth=2, label='ECI uses λ₂', alpha=0.7)
# Same 20% threshold as used for counting significant eigenvalues
ax2.axhline(0.2 * top_eigenvalues[1], color='orange', linestyle=':', linewidth=2,
            label='20% of λ₂ threshold', alpha=0.7)
ax2.set_xlabel('Eigenvalue Index', fontsize=12)
ax2.set_ylabel('Eigenvalue (log scale)', fontsize=12)
ax2.set_title('Eigenvalue Decay (reveals community structure)', fontsize=13)
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n✓ If flat eigenvalue spectrum → Multiple communities → ECI fails")
print("✓ If exponential decay → Single hierarchy → ECI works")

## Community Detection via Eigenvector Clustering

If the eigenvalue spectrum indicates multiple communities, we can:
1. Project each user onto the top eigenvectors
2. Cluster users based on which eigenvector(s) they align with
3. Run ECI *within* each community (where it should work better)
4. Maintain Fitness as the global complexity measure

**Key Insight**: ECI fails globally because it uses only the eigenvector associated with λ₂ (1 dimension), but by separating communities first, ECI can work within each community's nested structure.
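
The idea behind steps 1 and 2 can be sketched on a toy support with two groups joined by a weak bridge. This example swaps in a simpler technique than the k-means clustering on several eigenvectors used in this notebook: with only two communities, the sign of the second eigenvector is enough to separate them. The toy matrix is an assumption made for illustration.

```python
import numpy as np

# Toy support: users 0-2 share words 0-2, users 3-5 share words 3-5,
# and user 2 also uses word 3 (a weak bridge between the two groups).
M_toy = np.zeros((6, 6))
M_toy[:3, :3] = 1
M_toy[3:, 3:] = 1
M_toy[2, 3] = 1

kc = M_toy.sum(axis=1)   # diversification
kp = M_toy.sum(axis=0)   # ubiquity

# Symmetric form of the user-user projection, S = D^-1/2 M U^-1 M^T D^-1/2.
B = M_toy / np.sqrt(kc)[:, None] / np.sqrt(kp)[None, :]
S = B @ B.T

eigvals, eigvecs = np.linalg.eigh(S)       # eigenvalues in ascending order
second = eigvecs[:, -2] / np.sqrt(kc)      # map back to a right eigenvector of C

# With two communities, the sign of the second eigenvector splits them.
labels = (second > 0).astype(int)
print("second eigenvalue:", round(float(eigvals[-2]), 3))
print("labels:", labels)
```

Which group receives label 0 or 1 is arbitrary, since an eigenvector is only defined up to sign.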

In [None]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

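# Compute eigenvectors (not just eigenvalues)
# C is not symmetric, so eigh cannot be applied to it directly. Diagonalise the similar
# symmetric matrix S = D^{1/2} C D^{-1/2} (D = diag(kc)) and map back:
# if S y = λ y, then D^{-1/2} y is a right eigenvector of C with the same λ.
d_sqrt = np.sqrt(kc_safe)
S = d_sqrt[:, np.newaxis] * C / d_sqrt[np.newaxis, :]
eigenvalues_full, eigenvectors = np.linalg.eigh((S + S.T) / 2)
eigenvectors = eigenvectors / d_sqrt[:, np.newaxis]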
# Sort by eigenvalue (descending)
idx = np.argsort(eigenvalues_full)[::-1]
eigenvalues_full = eigenvalues_full[idx]
eigenvectors = eigenvectors[:, idx]

# Project users onto top k eigenvectors (excluding λ₁ which is trivial uniform)
# We use 2-5 eigenvectors to capture the community structure
n_components = min(5, eigenvectors.shape[1])
user_projections = eigenvectors[:, 1:n_components+1]  # Skip λ₁

print(f"User projections shape: {user_projections.shape}")
print(f"Using top {n_components} eigenvectors (λ₂ through λ_{n_components+1})")

# Determine optimal number of communities
# Use the number of significant eigenvalues as a guide
n_sig = np.sum(eigenvalues_full > 0.2 * eigenvalues_full[1])
n_communities = min(n_sig - 1, 5)  # -1 because λ₁ is trivial, cap at 5 for interpretability

print(f"\nDetected {n_sig} significant eigenvalues")
print(f"Clustering into {n_communities} communities\n")

# Standardize projections for clustering
# (named proj_scaler so it does not overwrite the SinkhornScaler stored in `scaler`)
proj_scaler = StandardScaler()
user_projections_scaled = proj_scaler.fit_transform(user_projections)

# K-means clustering
kmeans = KMeans(n_clusters=n_communities, random_state=42, n_init=20)
community_labels = kmeans.fit_predict(user_projections_scaled)

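# Add community labels to results
# community_labels follows the row order of M (user_ids), but results_countries is sorted
# by Fitness, so align on the index instead of assigning positionally.
results_countries['community'] = pd.Series(community_labels, index=user_ids)

# Print community statistics
print("="*70)
print("COMMUNITY STATISTICS")
print("="*70)
for comm in range(n_communities):
    mask = (results_countries['community'] == comm).to_numpy()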
    n_users = mask.sum()
    avg_div = results_countries.loc[mask, 'diversification_kc'].mean()
    avg_fitness = results_countries.loc[mask, 'Fitness'].mean()
    print(f"\nCommunity {comm}:")
    print(f"  Users: {n_users} ({100*n_users/len(results_countries):.1f}%)")
    print(f"  Avg diversification: {avg_div:.1f}")
    print(f"  Avg Fitness: {avg_fitness:.3f}")
    # Show top 3 users
    top_users = results_countries.loc[mask].nlargest(3, 'Fitness')
    print(f"  Top users: {', '.join(map(str, top_users.index[:3]))}")

print("\n" + "="*70)

In [None]:
# Visualize communities in eigenvector space
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Plot 1: Users projected onto λ₂ vs λ₃
ax1 = axes[0]
for comm in range(n_communities):
    mask = community_labels == comm
    ax1.scatter(user_projections[mask, 0], user_projections[mask, 1], 
                label=f'Community {comm}', alpha=0.6, s=50)
ax1.set_xlabel('λ₂ (2nd eigenvector)', fontsize=12)
ax1.set_ylabel('λ₃ (3rd eigenvector)', fontsize=12)
ax1.set_title('Users Clustered by Eigenvector Projections', fontsize=13)
ax1.legend()
ax1.grid(True, alpha=0.3)

# Plot 2: Community sizes and average Fitness
ax2 = axes[1]
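comm_sizes = [np.sum(community_labels == c) for c in range(n_communities)]
# results_countries is sorted by Fitness, so reorder the labels (which follow user_ids) to match it
labels_sorted = pd.Series(community_labels, index=user_ids).reindex(results_countries.index).to_numpy()
comm_fitness = [results_countries.loc[labels_sorted == c, 'Fitness'].mean()
                for c in range(n_communities)]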
x = np.arange(n_communities)
width = 0.35
ax2_twin = ax2.twinx()
bars1 = ax2.bar(x - width/2, comm_sizes, width, label='# Users', color='steelblue', alpha=0.7)
bars2 = ax2_twin.bar(x + width/2, comm_fitness, width, label='Avg Fitness', color='coral', alpha=0.7)
ax2.set_xlabel('Community', fontsize=12)
ax2.set_ylabel('Number of Users', fontsize=12, color='steelblue')
ax2_twin.set_ylabel('Average Fitness', fontsize=12, color='coral')
ax2.set_title('Community Sizes and Fitness', fontsize=13)
ax2.set_xticks(x)
ax2.tick_params(axis='y', labelcolor='steelblue')
ax2_twin.tick_params(axis='y', labelcolor='coral')
ax2.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

In [None]:
# Test ECI performance within each community vs. globally
print("\n" + "="*70)
print("ECI PERFORMANCE: GLOBAL vs. WITHIN-COMMUNITY")
print("="*70)

# Global ECI correlation (we already computed this)
print(f"\nGlobal ECI correlation with log(Fitness): {corr_eci_logF:.3f}")
print(f"Global ECI correlation with diversification: {corr_eci_div:.3f}")

print("\nWithin-community ECI correlations:")
print("-" * 70)

within_comm_correlations = []
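# results_countries is sorted by Fitness, so reorder the labels (which follow user_ids) to match it
labels_sorted = pd.Series(community_labels, index=user_ids).reindex(results_countries.index).to_numpy()
for comm in range(n_communities):
    mask = labels_sorted == comm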
    comm_data = results_countries.loc[mask]
    
    # Remove NaN values for correlation
    valid_mask = ~np.isnan(comm_data['ECI'])
    if valid_mask.sum() > 3:  # Need at least a few points
        corr_eci_logF_comm = np.corrcoef(
            comm_data.loc[valid_mask, 'ECI'],
            comm_data.loc[valid_mask, 'log_Fitness']
        )[0, 1]
        corr_eci_div_comm = np.corrcoef(
            comm_data.loc[valid_mask, 'ECI'],
            comm_data.loc[valid_mask, 'diversification_kc']
        )[0, 1]
        
        within_comm_correlations.append({
            'community': comm,
            'n_users': mask.sum(),
            'corr_eci_logF': corr_eci_logF_comm,
            'corr_eci_div': corr_eci_div_comm
        })
        
        print(f"\nCommunity {comm} (n={mask.sum()}):")
        print(f"  ECI vs log(Fitness): {corr_eci_logF_comm:6.3f}  {'✓' if abs(corr_eci_logF_comm) > 0.7 else '✗'}")
        print(f"  ECI vs diversification: {corr_eci_div_comm:6.3f}  {'✓' if abs(corr_eci_div_comm) > 0.7 else '✗'}")
    else:
        print(f"\nCommunity {comm}: Too few users ({mask.sum()}) for correlation")

# Compute average within-community correlation
if within_comm_correlations:
    avg_within_corr_logF = np.mean([c['corr_eci_logF'] for c in within_comm_correlations])
    avg_within_corr_div = np.mean([c['corr_eci_div'] for c in within_comm_correlations])
    
    print("\n" + "="*70)
    print("SUMMARY:")
    print("="*70)
    print(f"Global ECI correlation with log(Fitness):      {corr_eci_logF:6.3f}")
    print(f"Average within-community correlation:          {avg_within_corr_logF:6.3f}")
    print(f"\nImprovement: {avg_within_corr_logF - corr_eci_logF:+.3f}")
    
    print("\n⚠️  KEY FINDING:")
    if avg_within_corr_logF > corr_eci_logF + 0.1:
        print("  ✓ ECI works BETTER within communities!")
        print("  → Community detection + within-community ECI is a viable strategy")
        print("  → Use Fitness for global ranking, ECI for within-community ranking")
    else:
        print("  ✗ ECI still struggles even within communities")
        print("  → May need finer community detection or different approach")
        print("  → Fitness remains the more robust global measure")
    print("="*70)

In [None]:
from fitkit.diagnostics import (
    plot_circular_bipartite_flow,
    plot_alluvial_bipartite,
    plot_dual_potential_bipartite,
    plot_ranked_barcodes,
    _to_flow_df,
    _top_subset,
)

import plotly.express as px
import plotly.graph_objects as go
from IPython.display import display, Javascript

# Prepare data with URL
plot_df = results_countries.copy()
# Construct Wikipedia User URLs (replacing spaces with underscores)
plot_df["wiki_url"] = "https://en.wikipedia.org/wiki/User:" + plot_df.index.astype(str).str.replace(' ', '_')

# Interactive scatter plot with custom_data for the URL
fig = px.scatter(
    plot_df,
    x="ECI",
    y="Fitness",
    hover_name=plot_df.index,
    hover_data=["diversification_kc", "log_Fitness"],
    custom_data=["wiki_url"],
    title=f"Users: Fitness vs ECI (Click dot to open User Page)<br>Correlation(ECI, log(Fitness)): {corr_eci_logF:.3f}",
    labels={"ECI": "ECI (standardised)", "Fitness": "Fitness (log scale)"},
    template="plotly_white",
    opacity=0.7,
    log_y=True
)

fig.update_traces(marker=dict(size=8))
fig.update_layout(width=700, height=500)



In [None]:
# Plotting functions have been moved to fitkit.diagnostics
# Import them from the module instead (see cell above)


In [None]:
# Build a labeled coupling DataFrame
W_df = _to_flow_df(M_df, W)

# Sort according to Fitness/Complexity orderings
W_sorted = W_df.loc[results_countries.index, results_products.index]

# Filter to top nodes for readability
W_small = _top_subset(W_sorted, top_c=18, top_p=28, by="fitness_complexity", F_s=F_s, Q_s=Q_s)

# 1) Circular bipartite flow (chord-style)
plot_circular_bipartite_flow(
    W_small,
    max_edges=320,
    color_by="country",
    title="Circular bipartite flow for Sinkhorn coupling W (filtered)",
)

# 2) Alluvial / Sankey-style flow
plot_alluvial_bipartite(
    W_small,
    max_edges=220,
    title="Alluvial view of Sinkhorn coupling W (filtered)",
)

# 3) Dual potentials landscape (log u/log v) + top edges
plot_dual_potential_bipartite(
    M=M_df,
    W_df=W_df,
    u=u,
    v=v,
    max_edges=450,
    title="Dual potentials (log u, log v) + flow edges from W",
)

# 4) Ranked barcode plots
plot_ranked_barcodes(results_countries, results_products, top_n=40)

