<a href="https://colab.research.google.com/github/lawrennd/fitkit/blob/main/examples/atlas_fitness_comparison.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Economic Fitness Analysis: Comparing 2000 vs 2020

This notebook demonstrates the Fitness-Complexity algorithm using real-world trade data from the Harvard Atlas of Economic Complexity. We compare economic fitness and product complexity between 2000 and 2020 to understand how countries and products evolved over this 20-year period.

## What is Economic Fitness?

The Fitness-Complexity algorithm (Tacchella et al., 2012) measures:
- **Economic Fitness (F)**: A country's ability to produce complex products
- **Product Complexity (Q)**: How difficult a product is to produce

These are computed iteratively from the country-product export matrix, where F_c depends on Q_p and vice versa.
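
The mutual dependence can be made concrete with a minimal NumPy sketch of the iteration as formulated by Tacchella et al. (2012): a country's fitness sums the complexity of the products it exports, while a product's complexity is pulled down by its least fit exporters. This is an illustration only, not fitkit's implementation; the function name `fitness_complexity_sketch`, the fixed iteration count, and the toy matrix are assumptions made for this example.

```python
import numpy as np


def fitness_complexity_sketch(M, n_iter=50):
    """Hypothetical helper: fixed-count Fitness-Complexity iteration on a dense binary matrix."""
    M = np.asarray(M, dtype=float)
    F = np.ones(M.shape[0])  # country fitness, initialised to 1
    Q = np.ones(M.shape[1])  # product complexity, initialised to 1
    for _ in range(n_iter):
        F_new = M @ Q                      # sum of complexities of exported products
        Q_new = 1.0 / (M.T @ (1.0 / F))    # dominated by the least fit exporters
        F = F_new / F_new.mean()           # normalise to mean 1 each step
        Q = Q_new / Q_new.mean()
    return F, Q


# Toy nested matrix: country 0 exports everything, country 2 only product 0.
M_toy = np.array([
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
])
F_toy, Q_toy = fitness_complexity_sketch(M_toy)
print("Fitness:   ", np.round(F_toy, 3))
print("Complexity:", np.round(Q_toy, 3))
```

In this toy case the most diversified country ends up fittest, and the product exported only by that country ends up most complex.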

In [None]:
import sys
import subprocess
from pathlib import Path


def _pip_install(args: list[str]) -> None:
    cmd = [sys.executable, "-m", "pip", *args]
    print("Running:", " ".join(cmd))
    subprocess.check_call(cmd)


def ensure_fitkit_installed() -> None:
    """Prefer editable local install; fall back to GitHub.

    - Local (typical): `pip install -e ..` when running from `examples/`
    - Colab/remote: `pip install git+https://github.com/lawrennd/fitkit.git`
    """
    try:
        import fitkit  # noqa: F401

        return
    except ImportError:
        pass

    here = Path.cwd().resolve()
    candidates = [here, here.parent, here.parent.parent]

    for root in candidates:
        if (root / "pyproject.toml").exists() and (root / "fitkit").is_dir():
            _pip_install(["install", "-e", str(root)])
            return

    _pip_install(["install", "git+https://github.com/lawrennd/fitkit.git"])


ensure_fitkit_installed()
import fitkit

print("fitkit version:", getattr(fitkit, "__version__", "unknown"))

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import spearmanr, pearsonr

from fitkit import load_atlas_trade, list_atlas_available_years, load_gdp_per_capita, load_human_capital_index
from fitkit.algorithms import FitnessComplexity

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10

## 1. Data Loading

First, let's check what years are available and load data for 2000 and 2020.
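
The `rca_threshold=1.0` argument used when loading refers to Revealed Comparative Advantage. As a rough sketch of what such a threshold means, the Balassa RCA index compares a product's share in a country's exports with its share in world exports, and the matrix is binarised where the index reaches the threshold. The helper `rca_binarize` and the toy export values are hypothetical, and whether the loader uses `>=` or `>` internally is an assumption here.

```python
import numpy as np


def rca_binarize(X, threshold=1.0):
    """Hypothetical helper: Balassa RCA and its binarisation.

    Assumes every country (row) and product (column) has non-zero total exports.
    """
    X = np.asarray(X, dtype=float)
    country_share = X / X.sum(axis=1, keepdims=True)  # product share within each country
    world_share = X.sum(axis=0) / X.sum()             # product share in world trade
    rca = country_share / world_share
    return rca, (rca >= threshold).astype(int)


# Toy export values (rows: countries, columns: products); not real data.
X_toy = np.array([
    [10.0, 0.0],
    [10.0, 20.0],
])
rca_toy, M_rca_toy = rca_binarize(X_toy)
print(np.round(rca_toy, 3))
print(M_rca_toy)
```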

In [None]:
# Check available years
years = list_atlas_available_years('hs92')
print(f"Atlas HS92 data available for {len(years)} years")
print(f"Range: {min(years)} - {max(years)}")
print(f"\nYears: {years}")

In [None]:
# Load data for 2000 and 2020 at 4-digit product level
print("Loading 2000 data...")
M_2000, countries_2000, products_2000 = load_atlas_trade(
    year=2000,
    classification='hs92',
    product_level=4,
    rca_threshold=1.0
)

print("\nLoading 2020 data...")
M_2020, countries_2020, products_2020 = load_atlas_trade(
    year=2020,
    classification='hs92',
    product_level=4,
    rca_threshold=1.0
)

print("\n" + "="*60)
print("DATA SUMMARY")
print("="*60)
print(f"\n2000: {M_2000.shape[0]} countries × {M_2000.shape[1]} products")
print(f"      Density: {M_2000.nnz / (M_2000.shape[0] * M_2000.shape[1]):.4f}")
print(f"      Non-zero entries: {M_2000.nnz:,}")

print(f"\n2020: {M_2020.shape[0]} countries × {M_2020.shape[1]} products")
print(f"      Density: {M_2020.nnz / (M_2020.shape[0] * M_2020.shape[1]):.4f}")
print(f"      Non-zero entries: {M_2020.nnz:,}")

In [None]:
# Display loaded data structure (with product names!)
print("\n" + "="*70)
print("LOADED DATA STRUCTURE")
print("="*70)

print("\nCountries DataFrame:")
print(countries_2020.head())
print(f"\nTotal countries: {len(countries_2020)}")

print("\nProducts DataFrame (now includes product names!):")
print(products_2020.head(15))
print(f"\nTotal products: {len(products_2020)}")

print("\nSample product names:")
for code in ['0101', '2709', '8703', '8517', '3004', '9018']:
    match = products_2020[products_2020['product'] == code]
    if not match.empty:
        print(f"  {code}: {match['product_name'].values[0]}")

print("\nRCA Matrix:")
print(f"  Shape: {M_2020.shape}")
print(f"  Density: {M_2020.nnz / (M_2020.shape[0] * M_2020.shape[1]):.4f}")
print(f"  Non-zero entries: {M_2020.nnz:,}")

## 2. Computing Fitness and Complexity

Now we compute the Fitness-Complexity metrics for both years. The algorithm iteratively updates country fitness and product complexity until convergence.

**Note on filtering**: The Fitness-Complexity algorithm mathematically requires the bipartite graph (countries ↔ products) to be **connected**. Disconnected components or isolated subgraphs violate this assumption, causing numerical collapse where one component dominates.

The FitnessComplexity estimator automatically filters low-connectivity nodes:
- Products exported by < 3 countries (min_ubiquity=3)
- Countries exporting < 5 products (min_diversification=5)

This heuristic prevents the most common cause of disconnection: products with ubiquity=1-2 create isolated 2-node subgraphs (country ↔ product). Without filtering, these can cause extreme fitness concentration on a single country.
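
To illustrate the kind of filtering described above, here is a minimal sketch that drops low-ubiquity products and low-diversification countries from a dense binary matrix. It is not fitkit's code: the helper name `filter_low_connectivity` is hypothetical, and repeating the filter until nothing changes is an assumption (the estimator may apply the thresholds only once).

```python
import numpy as np


def filter_low_connectivity(M, min_ubiquity=3, min_diversification=5):
    """Hypothetical helper: return the filtered matrix plus the kept row/column indices."""
    M = np.asarray(M)
    rows = np.arange(M.shape[0])
    cols = np.arange(M.shape[1])
    while True:
        sub = M[np.ix_(rows, cols)]
        keep_rows = sub.sum(axis=1) >= min_diversification  # products per country
        keep_cols = sub.sum(axis=0) >= min_ubiquity         # countries per product
        if keep_rows.all() and keep_cols.all():
            return sub, rows, cols
        # Removing nodes can push others below threshold, so loop again.
        rows = rows[keep_rows]
        cols = cols[keep_cols]


# Toy matrix: the last country and last product form an isolated pair.
M_demo = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
])
M_kept, kept_rows, kept_cols = filter_low_connectivity(
    M_demo, min_ubiquity=2, min_diversification=2
)
print("Kept countries:", kept_rows, "Kept products:", kept_cols)
```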

In [None]:
# Compute fitness-complexity for 2000
print("Computing fitness-complexity for 2000...")
fc_2000 = FitnessComplexity()
F_2000, Q_2000 = fc_2000.fit_transform(M_2000)
print(f"  Converged: {fc_2000.converged_}, iterations: {fc_2000.n_iter_}")
print(f"  Filtered: {fc_2000.n_countries_filtered_} countries, {fc_2000.n_products_filtered_} products")

# Compute fitness-complexity for 2020
print("\nComputing fitness-complexity for 2020...")
fc_2020 = FitnessComplexity()
F_2020, Q_2020 = fc_2020.fit_transform(M_2020)
print(f"  Converged: {fc_2020.converged_}, iterations: {fc_2020.n_iter_}")
print(f"  Filtered: {fc_2020.n_countries_filtered_} countries, {fc_2020.n_products_filtered_} products")


In [None]:
# Show convergence info
print(f"2000: converged={fc_2000.converged_}, n_iter={fc_2000.n_iter_}")
print(f"2020: converged={fc_2020.converged_}, n_iter={fc_2020.n_iter_}")

In [None]:
# Plot fitness distributions
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Distribution comparison
ax1.hist(np.log10(F_2000 + 1e-10), bins=30, alpha=0.5, label='2000', density=True, edgecolor='black')
ax1.hist(np.log10(F_2020 + 1e-10), bins=30, alpha=0.5, label='2020', density=True, edgecolor='black')
ax1.set_xlabel('log₁₀(Fitness)', fontsize=12)
ax1.set_ylabel('Density', fontsize=12)
ax1.set_title('Distribution of Country Fitness', fontsize=13)
ax1.legend(fontsize=11)
ax1.grid(True, alpha=0.3)

# Complexity distribution comparison
ax2.hist(np.log10(Q_2000 + 1e-10), bins=30, alpha=0.5, label='2000', density=True, edgecolor='black')
ax2.hist(np.log10(Q_2020 + 1e-10), bins=30, alpha=0.5, label='2020', density=True, edgecolor='black')
ax2.set_xlabel('log₁₀(Complexity)', fontsize=12)
ax2.set_ylabel('Density', fontsize=12)
ax2.set_title('Distribution of Product Complexity', fontsize=13)
ax2.legend(fontsize=11)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\n{'='*60}")
print(f"2000: {fc_2000.n_iter_} iterations (converged: {fc_2000.converged_})")
print(f"2020: {fc_2020.n_iter_} iterations (converged: {fc_2020.converged_})")
print(f"{'='*60}")

## 3. Country Rankings and Changes

Let's examine which countries have the highest economic fitness in each year and how rankings changed.

In [None]:
# Add fitness to dataframes
countries_2000['fitness'] = F_2000
countries_2000['year'] = 2000
countries_2000 = countries_2000.sort_values('fitness', ascending=False).reset_index(drop=True)
countries_2000['rank_2000'] = countries_2000.index + 1

countries_2020['fitness'] = F_2020
countries_2020['year'] = 2020
countries_2020 = countries_2020.sort_values('fitness', ascending=False).reset_index(drop=True)
countries_2020['rank_2020'] = countries_2020.index + 1

# Show top countries in each year
print("="*60)
print("TOP 20 COUNTRIES BY ECONOMIC FITNESS")
print("="*60)

# Take the 2020 top 20 and look up each country's 2000 rank and fitness in the
# full 2000 table, so a country outside the 2000 top 20 still shows its real rank.
top_2020 = countries_2020[['rank_2020', 'country', 'fitness']].head(20).copy()

comparison_top = pd.merge(
    countries_2000[['rank_2000', 'country', 'fitness']].rename(columns={'fitness': 'fitness_2000'}),
    top_2020.rename(columns={'fitness': 'fitness_2020'}),
    on='country',
    how='right'
).fillna({'rank_2000': 999, 'rank_2020': 999})

comparison_top['rank_2000'] = comparison_top['rank_2000'].astype(int)
comparison_top['rank_2020'] = comparison_top['rank_2020'].astype(int)
comparison_top = comparison_top.sort_values('rank_2020')

print("\n{:<10} {:>10} {:>15} {:>10} {:>15}".format(
    'Country', 'Rank 2000', 'Fitness 2000', 'Rank 2020', 'Fitness 2020'
))
print("-"*70)

for _, row in comparison_top.head(20).iterrows():
    r2000 = str(row['rank_2000']) if row['rank_2000'] < 999 else '-'
    f2000 = f"{row['fitness_2000']:.4f}" if pd.notna(row['fitness_2000']) else '-'
    r2020 = str(row['rank_2020']) if row['rank_2020'] < 999 else '-'
    f2020 = f"{row['fitness_2020']:.4f}" if pd.notna(row['fitness_2020']) else '-'

    print(f"{row['country']:<10} {r2000:>10} {f2000:>15} {r2020:>10} {f2020:>15}")

## 4. Fitness Changes Over Time

Now let's analyze which countries gained or lost the most economic fitness between 2000 and 2020.
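
The sign convention for rank changes is easy to get backwards, so here is a small sketch on made-up data: rank 1 is the fittest country, and `rank_2000 - rank_2020` is positive when a country moved up. The country codes and fitness values are hypothetical.

```python
import pandas as pd

# Hypothetical fitness values for three made-up countries.
toy_2000 = pd.DataFrame({'country': ['AAA', 'BBB', 'CCC'], 'fitness': [3.0, 2.0, 1.0]})
toy_2020 = pd.DataFrame({'country': ['AAA', 'BBB', 'CCC'], 'fitness': [1.5, 2.5, 2.0]})

# Rank 1 = highest fitness in that year.
toy_2000['rank_2000'] = toy_2000['fitness'].rank(ascending=False, method='first').astype(int)
toy_2020['rank_2020'] = toy_2020['fitness'].rank(ascending=False, method='first').astype(int)

merged_toy = toy_2000.merge(toy_2020, on='country', suffixes=('_2000', '_2020'))
merged_toy['rank_change'] = merged_toy['rank_2000'] - merged_toy['rank_2020']  # positive = improved
print(merged_toy[['country', 'rank_2000', 'rank_2020', 'rank_change']])
```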

In [None]:
# Merge data for countries present in both years
comparison = countries_2000[['country', 'fitness', 'rank_2000']].merge(
    countries_2020[['country', 'fitness', 'rank_2020']],
    on='country',
    how='inner',
    suffixes=('_2000', '_2020')
)

comparison['fitness_change'] = comparison['fitness_2020'] - comparison['fitness_2000']
comparison['fitness_pct_change'] = 100 * (comparison['fitness_2020'] / comparison['fitness_2000'] - 1)
comparison['rank_change'] = comparison['rank_2000'] - comparison['rank_2020']  # Positive = improved

print(f"\nAnalyzing {len(comparison)} countries present in both years\n")

# Biggest gainers and losers
print("="*60)
print("TOP 10 FITNESS GAINERS (2000-2020)")
print("="*60)
gainers = comparison.nlargest(10, 'fitness_change')
print("\n{:<10} {:>12} {:>12} {:>15} {:>12}".format(
    'Country', 'Fit. 2000', 'Fit. 2020', 'Abs. Change', 'Rank Change'
))
print("-"*70)
for _, row in gainers.iterrows():
    print(f"{row['country']:<10} {row['fitness_2000']:>12.4f} {row['fitness_2020']:>12.4f} "
          f"{row['fitness_change']:>+15.4f} {row['rank_change']:>+12.0f}")

print("\n" + "="*60)
print("TOP 10 FITNESS LOSERS (2000-2020)")
print("="*60)
losers = comparison.nsmallest(10, 'fitness_change')
print("\n{:<10} {:>12} {:>12} {:>15} {:>12}".format(
    'Country', 'Fit. 2000', 'Fit. 2020', 'Abs. Change', 'Rank Change'
))
print("-"*70)
for _, row in losers.iterrows():
    print(f"{row['country']:<10} {row['fitness_2000']:>12.4f} {row['fitness_2020']:>12.4f} "
          f"{row['fitness_change']:>+15.4f} {row['rank_change']:>+12.0f}")

## 5. Visualization: Fitness Scatter Plot

Let's create a scatter plot showing fitness in 2000 vs 2020, with countries colored by their change in fitness.

In [None]:
# Create scatter plot with color-coded changes
fig, ax = plt.subplots(figsize=(12, 10))

# Plot all countries
scatter = ax.scatter(
    comparison['fitness_2000'],
    comparison['fitness_2020'],
    c=comparison['fitness_change'],
    cmap='RdYlGn',
    s=100,
    alpha=0.6,
    edgecolors='black',
    linewidth=0.5
)

# Add diagonal line (no change)
max_val = max(comparison['fitness_2000'].max(), comparison['fitness_2020'].max())
ax.plot([0, max_val], [0, max_val], 'k--', alpha=0.3, linewidth=1, label='No change')

# Label top gainers and losers
top_movers = pd.concat([
    comparison.nlargest(5, 'fitness_change'),
    comparison.nsmallest(5, 'fitness_change')
])

for _, row in top_movers.iterrows():
    ax.annotate(
        row['country'],
        (row['fitness_2000'], row['fitness_2020']),
        xytext=(5, 5),
        textcoords='offset points',
        fontsize=9,
        bbox=dict(boxstyle='round,pad=0.3', facecolor='white', edgecolor='gray', alpha=0.7)
    )

ax.set_xlabel('Economic Fitness 2000', fontsize=12)
ax.set_ylabel('Economic Fitness 2020', fontsize=12)
ax.set_title('Economic Fitness: 2000 vs 2020\n(Countries above diagonal improved, below declined)', fontsize=14)
ax.legend()
ax.grid(True, alpha=0.3)

# Add colorbar
cbar = plt.colorbar(scatter, ax=ax)
cbar.set_label('Fitness Change (2020 - 2000)', fontsize=11)

plt.tight_layout()
plt.show()

# Compute correlation
r_pearson, p_pearson = pearsonr(comparison['fitness_2000'], comparison['fitness_2020'])
r_spearman, p_spearman = spearmanr(comparison['fitness_2000'], comparison['fitness_2020'])

print("\nCorrelation between 2000 and 2020 fitness:")
print(f"  Pearson r = {r_pearson:.3f} (p = {p_pearson:.1e})")
print(f"  Spearman ρ = {r_spearman:.3f} (p = {p_spearman:.1e})")

## 6. Product Complexity Analysis

Now let's examine which products are most complex and how complexity changed over time.

In [None]:
# Add complexity to product dataframes
products_2000['complexity'] = Q_2000
products_2000 = products_2000.sort_values('complexity', ascending=False).reset_index(drop=True)
products_2000['rank_2000'] = products_2000.index + 1

products_2020['complexity'] = Q_2020
products_2020 = products_2020.sort_values('complexity', ascending=False).reset_index(drop=True)
products_2020['rank_2020'] = products_2020.index + 1

# Merge for comparison (include product_name from 2020)
product_comparison = products_2000[['product', 'complexity', 'rank_2000']].merge(
    products_2020[['product', 'product_name', 'complexity', 'rank_2020']],
    on='product',
    how='inner',
    suffixes=('_2000', '_2020')
)

product_comparison['complexity_change'] = product_comparison['complexity_2020'] - product_comparison['complexity_2000']

print("="*85)
print("TOP 15 MOST COMPLEX PRODUCTS (2020)")
print("="*85)
print("\n{:<8} {:<32} {:>12} {:>12} {:>12}".format(
    'Code', 'Product Name', 'Comp. 2000', 'Comp. 2020', 'Change'
))
print("-"*85)

for _, row in product_comparison.nlargest(15, 'complexity_2020').iterrows():
    name = row['product_name'][:30]  # Truncate long names
    print(f"{row['product']:<8} {name:<32} "
          f"{row['complexity_2000']:>12.4f} {row['complexity_2020']:>12.4f} "
          f"{row['complexity_change']:>+12.4f}")

In [None]:
# Plot product complexity distributions
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Histogram of complexity
axes[0, 0].hist(np.log10(Q_2000 + 1e-10), bins=30, alpha=0.5, label='2000', density=True)
axes[0, 0].hist(np.log10(Q_2020 + 1e-10), bins=30, alpha=0.5, label='2020', density=True)
axes[0, 0].set_xlabel('log₁₀(Complexity)')
axes[0, 0].set_ylabel('Density')
axes[0, 0].set_title('Distribution of Product Complexity')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Complexity scatter plot
axes[0, 1].scatter(
    product_comparison['complexity_2000'],
    product_comparison['complexity_2020'],
    alpha=0.5,
    s=50
)
max_q = max(product_comparison['complexity_2000'].max(), product_comparison['complexity_2020'].max())
axes[0, 1].plot([0, max_q], [0, max_q], 'k--', alpha=0.3)
axes[0, 1].set_xlabel('Product Complexity 2000')
axes[0, 1].set_ylabel('Product Complexity 2020')
axes[0, 1].set_title('Product Complexity: 2000 vs 2020')
axes[0, 1].grid(True, alpha=0.3)

# Complexity changes
axes[1, 0].hist(product_comparison['complexity_change'], bins=40, edgecolor='black', alpha=0.7)
axes[1, 0].axvline(0, color='red', linestyle='--', linewidth=1, alpha=0.5)
axes[1, 0].set_xlabel('Change in Complexity (2020 - 2000)')
axes[1, 0].set_ylabel('Number of Products')
axes[1, 0].set_title('Distribution of Complexity Changes')
axes[1, 0].grid(True, alpha=0.3)

# Rank changes
product_comparison['rank_change'] = product_comparison['rank_2000'] - product_comparison['rank_2020']
axes[1, 1].hist(product_comparison['rank_change'], bins=40, edgecolor='black', alpha=0.7)
axes[1, 1].axvline(0, color='red', linestyle='--', linewidth=1, alpha=0.5)
axes[1, 1].set_xlabel('Rank Change (positive = more complex in 2020)')
axes[1, 1].set_ylabel('Number of Products')
axes[1, 1].set_title('Distribution of Complexity Rank Changes')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 7. Diversification Analysis

Let's examine how product diversification (number of products exported) relates to economic fitness.
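
Diversification is the row sum of the binary country-product matrix. Because the country tables get re-sorted by fitness during ranking, per-row quantities need to be attached while the table is still in matrix row order, or joined by country code afterwards. A minimal sketch with made-up country codes and values:

```python
import numpy as np
import pandas as pd

# Toy binary matrix (rows: countries, columns: products); not real data.
M_div_toy = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
])
countries_toy = pd.DataFrame({
    'country': ['AAA', 'BBB', 'CCC'],  # hypothetical codes, in matrix row order
    'fitness': [0.5, 2.0, 1.0],        # hypothetical fitness values
})

# Attach diversification before sorting, while rows still line up with the matrix.
countries_toy['diversification'] = M_div_toy.sum(axis=1)

ranked_toy = countries_toy.sort_values('fitness', ascending=False).reset_index(drop=True)
print(ranked_toy)
```

After sorting, each country keeps its own diversification value, which positional indexing into the row sums would not guarantee.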

In [None]:
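# Compute diversification (number of products per country).
# np.asarray(...).ravel() works for both scipy sparse matrices and sparse arrays.
div_2000 = np.asarray(M_2000.sum(axis=1)).ravel()
div_2020 = np.asarray(M_2020.sum(axis=1)).ravel()

# countries_2000/countries_2020 were re-sorted by fitness in Section 3, so their
# row order no longer matches the rows of M_2000/M_2020. Reload the country tables
# in matrix row order (assumes load_atlas_trade returns the same ordering as the
# first call) and join diversification by country code rather than by position.
_, rows_2000, _ = load_atlas_trade(year=2000, classification='hs92', product_level=4, rca_threshold=1.0)
_, rows_2020, _ = load_atlas_trade(year=2020, classification='hs92', product_level=4, rca_threshold=1.0)
div_by_country_2000 = pd.Series(div_2000, index=rows_2000['country'].to_numpy())
div_by_country_2020 = pd.Series(div_2020, index=rows_2020['country'].to_numpy())

comparison['diversification_2000'] = comparison['country'].map(div_by_country_2000)
comparison['diversification_2020'] = comparison['country'].map(div_by_country_2020)
comparison['div_change'] = comparison['diversification_2020'] - comparison['diversification_2000']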

# Plot diversification vs fitness
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# 2000
axes[0].scatter(
    comparison['diversification_2000'],
    comparison['fitness_2000'],
    alpha=0.6,
    s=80,
    edgecolors='black',
    linewidth=0.5
)
axes[0].set_xlabel('Number of Products Exported', fontsize=12)
axes[0].set_ylabel('Economic Fitness', fontsize=12)
axes[0].set_title('Diversification vs Fitness (2000)', fontsize=13)
axes[0].grid(True, alpha=0.3)

# Add trend line
z = np.polyfit(comparison['diversification_2000'], comparison['fitness_2000'], 1)
p = np.poly1d(z)
x_line = np.linspace(comparison['diversification_2000'].min(), comparison['diversification_2000'].max(), 100)
axes[0].plot(x_line, p(x_line), 'r--', alpha=0.5, linewidth=2, label='Linear fit')
axes[0].legend()

# 2020
axes[1].scatter(
    comparison['diversification_2020'],
    comparison['fitness_2020'],
    alpha=0.6,
    s=80,
    edgecolors='black',
    linewidth=0.5
)
axes[1].set_xlabel('Number of Products Exported', fontsize=12)
axes[1].set_ylabel('Economic Fitness', fontsize=12)
axes[1].set_title('Diversification vs Fitness (2020)', fontsize=13)
axes[1].grid(True, alpha=0.3)

# Add trend line
z = np.polyfit(comparison['diversification_2020'], comparison['fitness_2020'], 1)
p = np.poly1d(z)
x_line = np.linspace(comparison['diversification_2020'].min(), comparison['diversification_2020'].max(), 100)
axes[1].plot(x_line, p(x_line), 'r--', alpha=0.5, linewidth=2, label='Linear fit')
axes[1].legend()

plt.tight_layout()
plt.show()

# Compute correlations
r_2000, p_2000 = spearmanr(comparison['diversification_2000'], comparison['fitness_2000'])
r_2020, p_2020 = spearmanr(comparison['diversification_2020'], comparison['fitness_2020'])

print("\nCorrelation between diversification and fitness:")
print(f"  2000: Spearman ρ = {r_2000:.3f} (p = {p_2000:.1e})")
print(f"  2020: Spearman ρ = {r_2020:.3f} (p = {p_2020:.1e})")

## 8. Summary Statistics

Let's summarize the key findings from our analysis.

In [None]:
print("="*70)
print("SUMMARY OF CHANGES (2000-2020)")
print("="*70)

print(f"\nCountries analyzed: {len(comparison)}")
print(f"Products compared: {len(product_comparison)}")

print(f"\n{'FITNESS CHANGES:':<40}")
print(f"  Mean change: {comparison['fitness_change'].mean():>25.4f}")
print(f"  Median change: {comparison['fitness_change'].median():>23.4f}")
print(f"  Std dev: {comparison['fitness_change'].std():>29.4f}")
print(f"  Countries that improved: {(comparison['fitness_change'] > 0).sum():>16}")
print(f"  Countries that declined: {(comparison['fitness_change'] < 0).sum():>16}")

print(f"\n{'DIVERSIFICATION CHANGES:':<40}")
print(f"  Mean change in # products: {comparison['div_change'].mean():>14.1f}")
print(f"  Median change in # products: {comparison['div_change'].median():>12.1f}")
print(f"  Countries more diversified: {(comparison['div_change'] > 0).sum():>15}")
print(f"  Countries less diversified: {(comparison['div_change'] < 0).sum():>15}")

print(f"\n{'PRODUCT COMPLEXITY:':<40}")
print(f"  Mean complexity change: {product_comparison['complexity_change'].mean():>17.4f}")
print(f"  Products more complex: {(product_comparison['complexity_change'] > 0).sum():>18}")
print(f"  Products less complex: {(product_comparison['complexity_change'] < 0).sum():>18}")

print("\n" + "="*70)

## 10. Fitness vs Economic Indicators

Let's examine how economic fitness relates to traditional economic metrics like GDP per capita and the Human Capital Index (HCI).
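
One way to read such a comparison is through a log-log linear fit: the squared Pearson correlation gives the share of variance explained, and the residuals indicate which countries sit above or below the trend. The sketch below uses synthetic numbers only (a seeded random generator), so it says nothing about the real data; all variable names are assumptions, and residual ranking is offered as an alternative to a raw fitness/GDP ratio, not as the method this notebook uses.

```python
import numpy as np

# Synthetic stand-ins for log10(GDP per capita) and log10(fitness); not real data.
rng = np.random.default_rng(0)
log_gdp_toy = rng.uniform(3.0, 5.0, size=50)
log_fitness_toy = 0.8 * log_gdp_toy - 3.0 + rng.normal(0.0, 0.3, size=50)

# Ordinary least-squares line in log-log space.
slope, intercept = np.polyfit(log_gdp_toy, log_fitness_toy, 1)
residuals = log_fitness_toy - (slope * log_gdp_toy + intercept)

# For a simple linear fit, R² equals the squared Pearson correlation.
r_squared = np.corrcoef(log_gdp_toy, log_fitness_toy)[0, 1] ** 2

# Indices of the five points furthest above the trend line.
top_over = np.argsort(residuals)[::-1][:5]
print(f"slope = {slope:.3f}, R² = {r_squared:.3f}")
print("Largest positive residuals at indices:", top_over)
```

Residuals from the fit are on the same log scale for every country, whereas a ratio of fitness to GDP per capita is dominated by the GDP denominator.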

In [None]:
# Load GDP per capita and Human Capital Index data
print("Loading GDP per capita data...")
gdp_df = load_gdp_per_capita(start_year=2020, end_year=2020)
print(f"  Loaded data for {len(gdp_df)} countries")

print("\nLoading Human Capital Index data...")
hci_df = load_human_capital_index(start_year=2020, end_year=2020)
print(f"  Loaded data for {len(hci_df)} countries")

# Merge with 2020 fitness data
comparison_gdp = countries_2020[['country', 'fitness']].copy()
comparison_gdp = comparison_gdp.merge(
    gdp_df[[2020]].rename(columns={2020: 'gdp_per_capita'}),
    left_on='country',
    right_index=True,
    how='inner'
)

comparison_hci = countries_2020[['country', 'fitness']].copy()
comparison_hci = comparison_hci.merge(
    hci_df[[2020]].rename(columns={2020: 'hci'}),
    left_on='country',
    right_index=True,
    how='inner'
)

print(f"\n{len(comparison_gdp)} countries with both fitness and GDP per capita data")
print(f"{len(comparison_hci)} countries with both fitness and HCI data")

In [None]:
# Create scatter plots with country code labels
fig, axes = plt.subplots(1, 2, figsize=(18, 8))

# === Plot 1: log Fitness vs log GDP per Capita ===
ax1 = axes[0]

# Remove any zero or negative values for log scale
comparison_gdp_clean = comparison_gdp[(comparison_gdp['fitness'] > 0) & (comparison_gdp['gdp_per_capita'] > 0)]

log_fitness = np.log10(comparison_gdp_clean['fitness'])
log_gdp = np.log10(comparison_gdp_clean['gdp_per_capita'])

# Scatter plot
ax1.scatter(
    log_gdp,
    log_fitness,
    s=100,
    alpha=0.6,
    edgecolors='black',
    linewidth=0.5,
    c='steelblue'
)

# Add country code labels to all points
for _, row in comparison_gdp_clean.iterrows():
    ax1.annotate(
        row['country'],
        (np.log10(row['gdp_per_capita']), np.log10(row['fitness'])),
        fontsize=7,
        alpha=0.7,
        ha='center',
        va='bottom'
    )

# Add trend line
z = np.polyfit(log_gdp, log_fitness, 1)
p = np.poly1d(z)
x_line = np.linspace(log_gdp.min(), log_gdp.max(), 100)
ax1.plot(x_line, p(x_line), 'r--', alpha=0.5, linewidth=2, label=f'Linear fit (R² = {np.corrcoef(log_gdp, log_fitness)[0,1]**2:.3f})')

ax1.set_xlabel('log₁₀(GDP per Capita) [current US$]', fontsize=12)
ax1.set_ylabel('log₁₀(Economic Fitness)', fontsize=12)
ax1.set_title('Economic Fitness vs GDP per Capita (2020)', fontsize=14, fontweight='bold')
ax1.legend(loc='lower right', fontsize=10)  # keep clear of the stats box in the upper left
ax1.grid(True, alpha=0.3)

# Compute correlation
r_gdp, p_gdp = pearsonr(log_gdp, log_fitness)
ax1.text(0.05, 0.95, f'Pearson r = {r_gdp:.3f}\np = {p_gdp:.1e}',
         transform=ax1.transAxes, fontsize=10, verticalalignment='top',
         bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

# === Plot 2: log Fitness vs log HCI ===
ax2 = axes[1]

# Remove any zero or negative values for log scale
comparison_hci_clean = comparison_hci[(comparison_hci['fitness'] > 0) & (comparison_hci['hci'] > 0)]

log_fitness_hci = np.log10(comparison_hci_clean['fitness'])
log_hci = np.log10(comparison_hci_clean['hci'])

# Scatter plot
ax2.scatter(
    log_hci,
    log_fitness_hci,
    s=100,
    alpha=0.6,
    edgecolors='black',
    linewidth=0.5,
    c='coral'
)

# Add country code labels to all points
for _, row in comparison_hci_clean.iterrows():
    ax2.annotate(
        row['country'],
        (np.log10(row['hci']), np.log10(row['fitness'])),
        fontsize=7,
        alpha=0.7,
        ha='center',
        va='bottom'
    )

# Add trend line
z = np.polyfit(log_hci, log_fitness_hci, 1)
p = np.poly1d(z)
x_line = np.linspace(log_hci.min(), log_hci.max(), 100)
ax2.plot(x_line, p(x_line), 'r--', alpha=0.5, linewidth=2, label=f'Linear fit (R² = {np.corrcoef(log_hci, log_fitness_hci)[0,1]**2:.3f})')

ax2.set_xlabel('log₁₀(Human Capital Index)', fontsize=12)
ax2.set_ylabel('log₁₀(Economic Fitness)', fontsize=12)
ax2.set_title('Economic Fitness vs Human Capital Index (2020)', fontsize=14, fontweight='bold')
ax2.legend(loc='lower right', fontsize=10)  # keep clear of the stats box in the upper left
ax2.grid(True, alpha=0.3)

# Compute correlation
r_hci, p_hci = pearsonr(log_hci, log_fitness_hci)
ax2.text(0.05, 0.95, f'Pearson r = {r_hci:.3f}\np = {p_hci:.1e}',
         transform=ax2.transAxes, fontsize=10, verticalalignment='top',
         bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

plt.tight_layout()
plt.show()

print(f"\n{'='*70}")
print("CORRELATION ANALYSIS")
print(f"{'='*70}")
print("\nlog(Fitness) vs log(GDP per Capita):")
print(f"  Pearson r = {r_gdp:.4f} (p = {p_gdp:.1e})")
print(f"  R² = {r_gdp**2:.4f}")
print(f"  N = {len(comparison_gdp_clean)} countries")

print("\nlog(Fitness) vs log(HCI):")
print(f"  Pearson r = {r_hci:.4f} (p = {p_hci:.1e})")
print(f"  R² = {r_hci**2:.4f}")
print(f"  N = {len(comparison_hci_clean)} countries")

print(f"\n{'='*70}")

In [None]:
# Identify interesting countries (outliers and extremes)

print("="*70)
print("NOTABLE COUNTRIES")
print("="*70)

# High fitness, low GDP (overperformers in complexity relative to GDP)
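# Use assign() to build a new frame; writing a column into the filtered slice
# can trigger pandas' SettingWithCopyWarning.
comparison_gdp_clean = comparison_gdp_clean.assign(
    fitness_gdp_ratio=comparison_gdp_clean['fitness'] / comparison_gdp_clean['gdp_per_capita']
)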
high_fitness_low_gdp = comparison_gdp_clean.nlargest(10, 'fitness_gdp_ratio')[['country', 'fitness', 'gdp_per_capita']]
print("\nTop 10: High Fitness relative to GDP per Capita")
print("-"*70)
for _, row in high_fitness_low_gdp.iterrows():
    print(f"{row['country']:<10} Fitness: {row['fitness']:>8.4f}  GDP/capita: ${row['gdp_per_capita']:>12,.0f}")

# Low fitness, high GDP (underperformers in complexity relative to GDP)
low_fitness_high_gdp = comparison_gdp_clean.nsmallest(10, 'fitness_gdp_ratio')[['country', 'fitness', 'gdp_per_capita']]
print("\nTop 10: Low Fitness relative to GDP per Capita")
print("-"*70)
for _, row in low_fitness_high_gdp.iterrows():
    print(f"{row['country']:<10} Fitness: {row['fitness']:>8.4f}  GDP/capita: ${row['gdp_per_capita']:>12,.0f}")

# Highest fitness countries
print("\nTop 10: Highest Economic Fitness")
print("-"*70)
for _, row in comparison_gdp_clean.nlargest(10, 'fitness')[['country', 'fitness', 'gdp_per_capita']].iterrows():
    print(f"{row['country']:<10} Fitness: {row['fitness']:>8.4f}  GDP/capita: ${row['gdp_per_capita']:>12,.0f}")

print("\n" + "="*70)

## 11. Key Insights

This analysis reveals several important patterns:

1. **Persistence**: High correlation between fitness in 2000 and 2020 shows that economic capabilities are persistent - countries that were fit in 2000 tend to remain fit.

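2. **Diversification matters**: The number of products exported correlates positively with economic fitness. Part of this is by construction, since fitness sums complexity over the products a country exports, but it is consistent with the view that capability diversity is important.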

3. **Complexity evolution**: Product complexity rankings are relatively stable, but some products become more or less complex as global production patterns shift.

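4. **Winners and losers**: The gainers and losers tables show that fitness shifted substantially for some countries between 2000 and 2020. Attributing those shifts to causes such as emerging-economy growth or specialization in declining sectors would require sector-level analysis that this notebook does not perform.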

5. **Validation with traditional metrics**: Economic fitness correlates strongly with GDP per capita and Human Capital Index, validating it as a meaningful measure of productive capabilities. However, the relationship is not deterministic - some countries overperform or underperform their GDP/HCI levels in terms of production complexity.

6. **Complementary perspectives**: While GDP measures output and HCI measures human capital, fitness measures the diversity and sophistication of productive capabilities embedded in a country's export basket.

## References

- Tacchella, A., Cristelli, M., Caldarelli, G., Gabrielli, A., & Pietronero, L. (2012). A new metrics for countries' fitness and products' complexity. *Scientific reports*, 2(1), 723.
- Hausmann, R., et al. (2014). *The Atlas of Economic Complexity*. MIT Press.
- Lawrence, N.D. (2024). "Conditional Likelihood Interpretation of Economic Fitness" (working paper).
- Data sources: 
  - Harvard Growth Lab's Atlas of Economic Complexity (https://atlas.hks.harvard.edu/)
  - World Bank World Development Indicators (GDP per capita, Human Capital Index)