# 08 - Sheaf Cohomology Analysis

Apply a sheaf-theoretic framework to measure governance coherence in open-source software (OSS) projects.

**Theoretical Foundation:** See `theory/sheaf-cohomology-framework.md`

**Goals:**
1. Construct project topology from contributor/module data
2. Build governance sheaf from extracted rules/decisions/monitoring
3. Compute Čech cohomology groups (H⁰, H¹, H²)
4. Calculate cohomological health index χ_gov
5. Test fork prediction hypothesis (H² spike precedes forks)

**Key Hypotheses (from H7):**
- H² spike precedes fork events by 6-12 months
- H¹ correlates with organizational entropy
- χ_gov predicts project sustainability
- Quadrant-specific cohomology signatures exist

## Setup

In [None]:
import os
import sys
import json
from pathlib import Path
from datetime import datetime

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from dotenv import load_dotenv

# Add src to path
sys.path.insert(0, '../src')
sys.path.insert(0, '../data')

from analysis.entropy_calculation import EntropyCalculator

# Load environment from .env file
env_path = Path("../.env")
if env_path.exists():
    load_dotenv(env_path)
    print(f"✅ Loaded .env from {env_path.resolve()}")
else:
    load_dotenv()

# Visualization settings
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

print("✅ Setup complete!")

## 1. Theoretical Background

### Sheaf Theory for OSS Governance

**Key Concepts:**

| Sheaf Concept | OSS Interpretation |
|---------------|--------------------|
| Base Space X | Project topology (contributors, modules, time periods) |
| Open Sets U | Subprojects, teams, modules |
| Stalks | Local governance data at a point |
| Sections | Consistent governance rules across regions |
| Gluing Axiom | Local decisions combine into coherent global policy |
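
To make the gluing axiom concrete, here is a minimal sketch that treats a local section as a mapping from modules to a policy value and glues sections only when they agree on every overlap. The region names, module names, and the "required reviewers" policy are hypothetical illustrations, not data from the projects analyzed below.

```python
from itertools import combinations

def glue(local_sections):
    """Glue local sections into a global one, or report where they disagree.

    local_sections maps a region name to {module: policy value}.
    Returns (glued_section, conflicts); glued_section is None on conflict.
    """
    conflicts = []
    for (name_a, sec_a), (name_b, sec_b) in combinations(local_sections.items(), 2):
        # Sections must agree on every module the two regions share
        for module in sec_a.keys() & sec_b.keys():
            if sec_a[module] != sec_b[module]:
                conflicts.append((module, name_a, name_b))
    if conflicts:
        return None, sorted(conflicts)
    glued = {}
    for section in local_sections.values():
        glued.update(section)
    return glued, []

# Hypothetical policy: number of required reviewers per module
sections = {
    "core": {"parser": 2, "runtime": 2},
    "tools": {"runtime": 2, "cli": 1},
}
print(glue(sections))  # ({'parser': 2, 'runtime': 2, 'cli': 1}, [])

sections["docs"] = {"cli": 0}  # disagrees with "tools" on the shared module
print(glue(sections))  # (None, [('cli', 'tools', 'docs')])
```

A failed gluing is the kind of local incompatibility that H¹ is meant to count.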

### Čech Cohomology Interpretation

| Cohomology | Meaning |
|------------|--------|
| H⁰ | Global sections = universal governance rules |
| H¹ | Governance conflicts = incompatible local policies |
| H² | Structural obstructions = deep incompatibilities, fork precursors |
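
As an illustration of how these dimensions arise, the sketch below computes Čech cohomology for the simplest case: a constant sheaf with rational coefficients, where the computation reduces to the simplicial cohomology of the cover's nerve. A real governance sheaf would also need restriction maps between regions, which this sketch does not model; the helper names are hypothetical.

```python
from fractions import Fraction

def rank(matrix):
    """Rank over the rationals via Gaussian elimination."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_cols = len(rows[0]) if rows else 0
    found = 0
    for col in range(n_cols):
        pivot = next((r for r in range(found, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[found], rows[pivot] = rows[pivot], rows[found]
        for r in range(found + 1, len(rows)):
            factor = rows[r][col] / rows[found][col]
            if factor:
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[found])]
        found += 1
    return found

def coboundary(k_simplices, k1_simplices):
    """Matrix of d^k: one row per (k+1)-simplex, one column per k-simplex."""
    index = {s: i for i, s in enumerate(k_simplices)}
    matrix = []
    for simplex in k1_simplices:
        row = [0] * len(k_simplices)
        for pos in range(len(simplex)):
            face = simplex[:pos] + simplex[pos + 1:]
            row[index[face]] = (-1) ** pos  # alternating sign of the face
        matrix.append(row)
    return matrix

def cech_dimensions(nerve):
    """nerve[k] lists the k-simplices (sorted vertex tuples) of the nerve.

    Returns [dim H^0, dim H^1, ...] using
    dim H^k = dim C^k - rank d^k - rank d^(k-1).
    """
    ranks = [0]  # rank of d^(-1)
    for k in range(len(nerve)):
        higher = nerve[k + 1] if k + 1 < len(nerve) else []
        ranks.append(rank(coboundary(nerve[k], higher)) if higher else 0)
    return [len(nerve[k]) - ranks[k + 1] - ranks[k] for k in range(len(nerve))]

# Three regions that overlap pairwise but share no common triple overlap:
# the nerve is a hollow triangle, which has one loop.
hollow = [[(0,), (1,), (2,)], [(0, 1), (0, 2), (1, 2)], []]
print(cech_dimensions(hollow))  # [1, 1, 0]

# Adding the triple overlap fills the triangle and removes the loop.
filled = [[(0,), (1,), (2,)], [(0, 1), (0, 2), (1, 2)], [(0, 1, 2)]]
print(cech_dimensions(filled))  # [1, 0, 0]
```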

### Health Metrics

**Cohomological Health Index:**
$$\chi_{gov}(X) = \dim H^0 - \dim H^1 + \dim H^2$$

**Governance Health Ratio:**
$$\rho_{gov}(X) = \frac{\dim H^0}{\dim H^0 + \dim H^1}$$
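
A small worked example of both metrics, using made-up dimensions (4 global rules, 1 conflict, no obstructions). Returning 0.0 when H⁰ and H¹ are both trivial is a convention assumed here, since the ratio is undefined in that case.

```python
def health_metrics(h0: int, h1: int, h2: int) -> tuple[int, float]:
    """Return (chi_gov, rho_gov) for the given cohomology dimensions."""
    chi_gov = h0 - h1 + h2
    denom = h0 + h1
    # rho_gov is undefined when H0 and H1 are both trivial; report 0.0 by convention
    rho_gov = h0 / denom if denom > 0 else 0.0
    return chi_gov, rho_gov

chi, rho = health_metrics(h0=4, h1=1, h2=0)
print(chi, rho)  # 3 0.8
```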

## 2. Load Collected Data

In [None]:
# Load all collected project data
data_dir = Path("../data/raw")
data_files = list(data_dir.glob("*_data.json"))

print(f"Found {len(data_files)} collected projects:\n")

projects = {}
for f in sorted(data_files):
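    # File stems look like "<owner>_<repo>_data". GitHub owner names cannot contain
    # underscores, so only the first underscore separates owner from repo.
    repo_name = f.stem.removesuffix('_data').replace('_', '/', 1)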
    with open(f) as fp:
        projects[repo_name] = json.load(fp)
    stars = projects[repo_name].get('repository', {}).get('stargazers_count', 0)
    print(f"  - {repo_name}: {stars:,} stars")

## 3. Build Project Topology

Construct the simplicial complex representing project structure:
- **0-simplices (vertices):** Contributors
- **1-simplices (edges):** Collaboration relationships
- **2-simplices (triangles):** Team structures
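
One way to obtain these simplices from a weighted collaboration matrix is to keep edges above a threshold and fill in every triangle whose three edges are present (a clique complex). The sketch below assumes that construction; the threshold and the 4×4 weights are hypothetical.

```python
from itertools import combinations

def clique_complex(weights, threshold):
    """Return (vertices, edges, triangles) of the thresholded clique complex.

    weights is a symmetric square matrix (nested lists or a NumPy array).
    """
    n = len(weights)
    vertices = [(i,) for i in range(n)]
    edges = [(i, j) for i, j in combinations(range(n), 2)
             if weights[i][j] >= threshold]
    edge_set = set(edges)
    # A triangle is included only when all three of its edges are present
    triangles = [(i, j, k) for i, j, k in combinations(range(n), 3)
                 if {(i, j), (i, k), (j, k)} <= edge_set]
    return vertices, edges, triangles

# Hypothetical collaboration strengths between four contributors
weights = [
    [0.0, 0.9, 0.8, 0.1],
    [0.9, 0.0, 0.7, 0.2],
    [0.8, 0.7, 0.0, 0.6],
    [0.1, 0.2, 0.6, 0.0],
]
vertices, edges, triangles = clique_complex(weights, threshold=0.5)
print(edges)      # [(0, 1), (0, 2), (1, 2), (2, 3)]
print(triangles)  # [(0, 1, 2)]
```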

In [None]:
def build_collaboration_matrix(project_data):
    """
    Build collaboration matrix from project data.
    
    Contributors are considered to collaborate if they:
    - Reviewed each other's PRs
    - Contributed to the same files
    - Participated in the same issues
    """
    contributors = project_data.get('contributors', [])
    if not contributors:
        return None, []
    
    # Get contributor names
    names = [c['login'] for c in contributors[:50]]  # Limit to top 50
    n = len(names)
    
    # Use contribution counts as proxy for collaboration strength
    # More sophisticated: analyze PR reviews, co-commits, issue discussions
    contributions = np.array([c['contributions'] for c in contributors[:50]], dtype=float)
    
    # Simple model: collaboration strength ~ geometric mean of contributions
    collab = np.sqrt(np.outer(contributions, contributions))
    np.fill_diagonal(collab, 0.0)  # no self-collaboration
    
    # Normalize
    if collab.max() > 0:
        collab = collab / collab.max()
    
    return collab, names

# Test on first project (falls back to None if no projects were loaded)
test_project = next(iter(projects), None)
collab_matrix, contributor_names = (
    build_collaboration_matrix(projects[test_project]) if test_project else (None, [])
)

if collab_matrix is not None:
    print(f"Project: {test_project}")
    print(f"Contributors: {len(contributor_names)}")
    print(f"Collaboration matrix shape: {collab_matrix.shape}")
    print(f"Non-zero collaborations: {np.count_nonzero(collab_matrix)}")

In [None]:
# Visualize collaboration network
if collab_matrix is not None:
    fig, ax = plt.subplots(figsize=(10, 8))
    
    # Show top 20 contributors
    n_show = min(20, len(contributor_names))
    
    im = ax.imshow(collab_matrix[:n_show, :n_show], cmap='Blues')
    ax.set_xticks(range(n_show))
    ax.set_yticks(range(n_show))
    ax.set_xticklabels(contributor_names[:n_show], rotation=45, ha='right')
    ax.set_yticklabels(contributor_names[:n_show])
    ax.set_title(f"Collaboration Matrix: {test_project}")
    plt.colorbar(im, ax=ax, label='Collaboration Strength')
    plt.tight_layout()
    plt.show()

## 4. Define Governance Sheaf

Extract governance data for each project (a full treatment would extract one section per region of the project):
- **Rules (R):** Coding standards, review requirements, CI checks
- **Decisions (D):** Merge patterns, issue resolution, RFC processes
- **Monitoring (M):** CI/CD, bots, review requirements

In [None]:
def extract_governance_section(project_data):
    """
    Extract governance data from project.
    Returns a governance section (rules, decisions, monitoring).
    """
    governance = {
        'rules': set(),
        'decisions': set(),
        'monitoring': set()
    }
    
    # Rules from governance files
    gov_files = project_data.get('governance_files', {})
    for file_type, content in gov_files.items():
        if content:
            governance['rules'].add(f'has_{file_type}')
    
    # Decision patterns from PR data
    pr_stats = project_data.get('pull_requests', {}).get('statistics', {})
    if pr_stats:
        merge_rate = pr_stats.get('merged_count', 0) / max(pr_stats.get('total_prs', 1), 1)
        governance['decisions'].add(f'merge_rate_{int(merge_rate*10)*10}')
        
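        # Time to merge in hours; skip the label when the statistic is missing
        # (a default of 0 would wrongly mark the project as 'fast_merge')
        avg_time = pr_stats.get('avg_time_to_merge')
        if avg_time is not None:
            if avg_time < 24:
                governance['decisions'].add('fast_merge')
            elif avg_time > 168:  # > 1 week
                governance['decisions'].add('slow_merge')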
    
    # Monitoring from maintainer data
    maintainers = project_data.get('maintainers', {}).get('statistics', {})
    active = maintainers.get('active_maintainers_6mo', 0)
    if active <= 3:
        governance['monitoring'].add('stadium_style')
    elif active <= 10:
        governance['monitoring'].add('club_style')
    else:
        governance['monitoring'].add('federation_style')
    
    return governance

# Extract governance for all projects
governance_data = {}
for repo, data in projects.items():
    governance_data[repo] = extract_governance_section(data)

# Display sample
print("Sample Governance Sections:\n")
for repo in list(governance_data.keys())[:3]:
    print(f"{repo}:")
    for category, items in governance_data[repo].items():
        print(f"  {category}: {items}")
    print()

## 5. Compute Čech Cohomology (Simplified)

For this initial implementation, we use a simplified cohomology calculation:

- **H⁰ (Global Sections):** Count of governance rules that apply universally
- **H¹ (Conflicts):** Measure of governance inconsistencies
- **H² (Obstructions):** Higher-order structural issues

A full implementation would use GUDHI or Dionysus for proper simplicial cohomology.

In [None]:
def compute_simplified_cohomology(project_data, governance_section):
    """
    Compute simplified cohomology metrics.
    
    This is a proxy calculation - full implementation would use
    proper Čech complex construction and linear algebra.
    """
    # H⁰: Global governance rules (count of formal governance indicators)
    h0 = len(governance_section['rules'])
    
    # H¹: Governance conflicts (proxy: contributor concentration + PR conflict rate)
    contributors = project_data.get('contributors', [])
    if contributors:
        total = sum(c['contributions'] for c in contributors)
        # Use max() so the result does not depend on the list being sorted
        top = max(c['contributions'] for c in contributors)
        concentration = top / max(total, 1)
        
        # High concentration with multiple contributors = potential conflict
        if concentration > 0.5 and len(contributors) > 5:
            h1 = 2  # Moderate conflict potential
        elif concentration > 0.7:
            h1 = 1  # Low conflict (BDFL model)
        else:
            h1 = len(contributors) // 20  # More contributors = more potential conflicts
    else:
        h1 = 0
    
    pr_stats = project_data.get('pull_requests', {}).get('statistics', {})
    conflict_rate = pr_stats.get('conflict_rate', 0)
    if conflict_rate > 0.1:
        h1 += 1
    
    # H²: Structural obstructions (proxy: lack of governance docs + many active maintainers)
    gov_files = project_data.get('governance_files', {})
    has_governance_docs = any(gov_files.values())
    maintainers = project_data.get('maintainers', {}).get('statistics', {})
    active = maintainers.get('active_maintainers_6mo', 0)
    
    h2 = 0
    if not has_governance_docs and active > 5:
        h2 = 1  # Missing governance docs with many active maintainers = structural issue
    if active > 15 and len(governance_section['rules']) < 2:
        h2 += 1  # Many maintainers, few rules = potential for deep conflicts
    
    # Calculate health metrics
    chi_gov = h0 - h1 + h2  # Euler characteristic-like
    rho_gov = h0 / max(h0 + h1, 1)  # Health ratio
    
    return {
        'H0': h0,
        'H1': h1,
        'H2': h2,
        'chi_gov': chi_gov,
        'rho_gov': rho_gov
    }

# Compute cohomology for all projects
cohomology_results = {}
for repo, data in projects.items():
    gov = governance_data[repo]
    cohomology_results[repo] = compute_simplified_cohomology(data, gov)

# Display results
print("Cohomology Results:\n")
print(f"{'Project':<40} {'H⁰':>4} {'H¹':>4} {'H²':>4} {'χ_gov':>6} {'ρ_gov':>6}")
print("─" * 70)
for repo, result in cohomology_results.items():
    print(f"{repo:<40} {result['H0']:>4} {result['H1']:>4} {result['H2']:>4} {result['chi_gov']:>6} {result['rho_gov']:>6.2f}")

## 6. Cohomology by Project Type

In [None]:
# Classify projects and analyze cohomology by type
entropy_calc = EntropyCalculator()

cohomology_df = []
for repo, data in projects.items():
    contributors = data.get('contributors', [])
    classification = entropy_calc.classify_project(contributors)
    coh = cohomology_results[repo]
    
    cohomology_df.append({
        'repo': repo,
        'classification': classification['classification'],
        'H0': coh['H0'],
        'H1': coh['H1'],
        'H2': coh['H2'],
        'chi_gov': coh['chi_gov'],
        'rho_gov': coh['rho_gov'],
        'entropy': classification['metrics'].get('normalized_entropy', 0),
        'gini': classification['metrics'].get('gini_coefficient', 0)
    })

df = pd.DataFrame(cohomology_df)

# Summary by classification
print("Cohomology by Project Type:\n")
summary = df.groupby('classification').agg({
    'H0': 'mean',
    'H1': 'mean',
    'H2': 'mean',
    'chi_gov': 'mean',
    'rho_gov': 'mean',
    'repo': 'count'
}).rename(columns={'repo': 'count'})

print(summary.round(2))

In [None]:
# Visualize cohomology metrics
if len(df) > 1:
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    
    # H⁰ vs H¹
    ax = axes[0]
    for cls in df['classification'].unique():
        subset = df[df['classification'] == cls]
        ax.scatter(subset['H0'], subset['H1'], label=cls, alpha=0.7, s=100)
    ax.set_xlabel('H⁰ (Global Sections)')
    ax.set_ylabel('H¹ (Conflicts)')
    ax.set_title('Governance Cohomology')
    ax.legend()
    
    # χ_gov distribution
    ax = axes[1]
    df.boxplot(column='chi_gov', by='classification', ax=ax)
    ax.set_title('Cohomological Health Index')
    ax.set_xlabel('Classification')
    ax.set_ylabel('χ_gov')
    plt.suptitle('')
    
    # ρ_gov distribution
    ax = axes[2]
    df.boxplot(column='rho_gov', by='classification', ax=ax)
    ax.set_title('Governance Health Ratio')
    ax.set_xlabel('Classification')
    ax.set_ylabel('ρ_gov')
    plt.suptitle('')
    
    plt.tight_layout()
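    figures_dir = Path('../results/figures')
    figures_dir.mkdir(parents=True, exist_ok=True)  # savefig fails if the directory is missing
    plt.savefig(figures_dir / 'cohomology_by_type.png', dpi=150)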
    plt.show()

## 7. Entropy-Cohomology Correlation

Test the hypothesis that H¹ correlates with organizational entropy.
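
The `entropy` column comes from `EntropyCalculator`, whose implementation is not shown in this notebook. The sketch below shows one common definition, assumed here for illustration only: Shannon entropy of contribution shares divided by its maximum, log n, where n counts contributors with at least one contribution. The function name is hypothetical.

```python
import math

def normalized_entropy(contributions):
    """Shannon entropy of contribution shares, scaled to [0, 1]."""
    counts = [c for c in contributions if c > 0]
    total = sum(counts)
    if len(counts) < 2:
        return 0.0  # a single contributor carries no distributional uncertainty
    shares = [c / total for c in counts]
    entropy = -sum(p * math.log(p) for p in shares)
    return entropy / math.log(len(counts))

print(normalized_entropy([10, 10, 10, 10]))  # evenly shared work, close to 1
print(normalized_entropy([97, 1, 1, 1]))     # one dominant contributor, close to 0
```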

In [None]:
from scipy import stats

if len(df) > 3:
    # Correlation between H¹ and entropy
    corr_h1_entropy, p_h1 = stats.pearsonr(df['H1'], df['entropy'])
    print(f"H¹ vs Entropy: r = {corr_h1_entropy:.3f}, p = {p_h1:.4f}")
    
    # Correlation between χ_gov and Gini
    corr_chi_gini, p_chi = stats.pearsonr(df['chi_gov'], df['gini'])
    print(f"χ_gov vs Gini: r = {corr_chi_gini:.3f}, p = {p_chi:.4f}")
    
    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    
    ax = axes[0]
    ax.scatter(df['entropy'], df['H1'], alpha=0.7)
    ax.set_xlabel('Normalized Entropy')
    ax.set_ylabel('H¹ (Conflicts)')
    ax.set_title(f'Entropy vs Conflicts (r={corr_h1_entropy:.2f})')
    
    ax = axes[1]
    ax.scatter(df['gini'], df['chi_gov'], alpha=0.7)
    ax.set_xlabel('Gini Coefficient')
    ax.set_ylabel('χ_gov (Health Index)')
    ax.set_title(f'Inequality vs Health (r={corr_chi_gini:.2f})')
    
    plt.tight_layout()
    plt.show()
else:
    print("Need more projects for correlation analysis")

## 8. Next Steps

### Immediate
1. **Collect more projects** - Need 25-30 for statistical power
2. **Implement proper Čech complex** using GUDHI library
3. **Add temporal analysis** - Track cohomology over time

### Fork Prediction Study
1. Identify known fork events (Node.js/io.js, Bitcoin/Bitcoin Cash, etc.)
2. Reconstruct historical project states
3. Compute H² trajectory before fork
4. Test prediction hypothesis
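
The last two steps could be sketched as follows, assuming a monthly H² series per project and a simple spike rule (value above the baseline mean plus `z` standard deviations). The spike rule, the function names, and the example series are hypothetical; the 6-12 month lead window comes from the hypothesis stated at the top of this notebook.

```python
from statistics import mean, pstdev

def spike_months(series, baseline_len, z=2.0):
    """Indices after the baseline whose value exceeds baseline mean + z * std."""
    baseline = series[:baseline_len]
    threshold = mean(baseline) + z * pstdev(baseline)
    return [t for t in range(baseline_len, len(series)) if series[t] > threshold]

def spike_precedes_fork(series, fork_index, baseline_len, lead=(6, 12), z=2.0):
    """True if any spike falls between lead[1] and lead[0] months before the fork."""
    window = range(fork_index - lead[1], fork_index - lead[0] + 1)
    return any(t in window for t in spike_months(series, baseline_len, z))

# Made-up monthly H² values; the fork is assumed to happen at month 18
h2_series = [0, 0, 1, 0, 0, 1, 0, 0, 0, 2, 3, 2, 1, 1, 0, 1, 0, 0]
print(spike_months(h2_series, baseline_len=6))                        # [9, 10, 11]
print(spike_precedes_fork(h2_series, fork_index=18, baseline_len=6))  # True
```

A real study would also need matched non-fork projects as controls, since a spike rule that fires everywhere predicts nothing.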

### Full Implementation
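```python
# TODO: Implement with GUDHI
import gudhi
import numpy as np

def build_rips_complex(collab_matrix):
    """Build a Rips complex from collaboration distances."""
    # Strong collaboration -> short distance; the 0.01 offset avoids division by zero
    distance_matrix = 1.0 / (np.asarray(collab_matrix, dtype=float) + 0.01)
    np.fill_diagonal(distance_matrix, 0.0)  # each contributor is at distance 0 from itself
    rips = gudhi.RipsComplex(distance_matrix=distance_matrix, max_edge_length=2.0)
    return rips.create_simplex_tree(max_dimension=3)

def compute_persistence(simplex_tree):
    """Compute persistent homology and return the Betti numbers of the complex."""
    simplex_tree.compute_persistence()  # must run before betti_numbers()
    return simplex_tree.betti_numbers()
```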

In [None]:
# Summary statistics
print("\n" + "=" * 60)
print("SHEAF COHOMOLOGY ANALYSIS SUMMARY")
print("=" * 60)
print(f"\nProjects analyzed: {len(df)}")
print("\nMean cohomology values:")
print(f"  H⁰ (Global Sections): {df['H0'].mean():.2f}")
print(f"  H¹ (Conflicts): {df['H1'].mean():.2f}")
print(f"  H² (Obstructions): {df['H2'].mean():.2f}")
print("\nHealth metrics:")
print(f"  Mean χ_gov: {df['chi_gov'].mean():.2f}")
print(f"  Mean ρ_gov: {df['rho_gov'].mean():.2f}")
print("\n" + "=" * 60)