# 🧠 Circuit Analysis Pipeline: Anthropological Feature Analysis

**Complete anthropological analysis of LLM features via circuit tracing**

---

## 📊 Overview

This notebook implements the full pipeline for analyzing the features extracted from an attribution graph generated with Circuit Tracer.

### Final Results
- ✅ **52.3% logit influence coverage** (after the influence-first refactoring)
- ✅ **23 supernodes** (15 semantic + 8 computational)
- ✅ **483 features** covered by the supernodes
- ✅ **2.4% BOS leakage** (under control)

### Methodology
- **Influence-First Filtering**: admission based on causal impact (`logit_influence`)
- **Dual View**: separates the "Situational Core" from the "Generalizable Scaffold"
- **Semantic Supernodes**: narrative-guided clustering with controlled growth
- **Empirical Validation**: logit influence coverage plus metric correlations


---

## 🔄 Full Workflow

```mermaid
flowchart TD
    A[🎯 COLAB: Circuit Tracer] --> B[example_graph.pt]
    A --> C[graph_feature_static_metrics.csv]
    A --> D[acts_compared.csv]
    
    B --> E[📥 Download files locally]
    C --> E
    D --> E
    
    E --> F[Step 1: Anthropological Basic]
    F --> G[feature_personalities_corrected.json]
    F --> H[feature_typology.json]
    
    G --> I[Step 2: Compute Thresholds]
    I --> J[robust_thresholds.json]
    
    G --> K[Step 3: Cicciotti Supernodes]
    J --> K
    K --> L[cicciotti_supernodes.json]
    
    L --> M[Step 4: Final Clustering]
    J --> M
    M --> N[final_anthropological_optimized.json]
    
    N --> O[Step 5: Verify Logit Influence]
    G --> O
    O --> P[logit_influence_validation.json]
    
    N --> Q[Step 6: Visualizations]
    H --> Q
    Q --> R[feature_space_3d.png]
    Q --> S[neuronpedia_url_improved.txt]
    
    style A fill:#e1f5ff
    style F fill:#fff3e0
    style I fill:#fff3e0
    style K fill:#fff3e0
    style M fill:#fff3e0
    style O fill:#e8f5e9
    style Q fill:#f3e5f5
```


---

## 🎯 COLAB Phase (Prerequisite)

The Colab phase runs the Gemma-2-2B model with Circuit Tracer to generate the input files.

### Colab Outputs (download into `output/`):
1. ✅ `example_graph.pt` - Attribution graph (~167MB)
2. ✅ `graph_feature_static_metrics.csv` - Static metrics (logit_influence, frac_external)
3. ✅ `acts_compared.csv` - Activations on semantic concepts (Dallas, Texas, Capital, etc.)

**💡 These files must exist before you run this pipeline.**

---

## ⚙️ Setup and Checks


In [None]:
# Imports
import json
import csv
import pandas as pd
import numpy as np
from pathlib import Path
from collections import defaultdict, Counter

# Setup
OUTPUT_DIR = Path('output')
OUTPUT_DIR.mkdir(exist_ok=True)

# Check prerequisites
required_files = [
    'output/example_graph.pt',
    'output/graph_feature_static_metrics.csv',
    'output/acts_compared.csv'
]

print("📊 Checking prerequisites:\n")
missing = []
for f in required_files:
    if Path(f).exists():
        size_mb = Path(f).stat().st_size / (1024*1024)
        print(f"✅ {f} ({size_mb:.1f} MB)")
    else:
        print(f"❌ {f} - MISSING")
        missing.append(f)

if missing:
    print("\n⚠️ Missing files! Download them from Colab before continuing.")
else:
    print("\n✅ All prerequisites found. You can proceed!")


---

## 📊 Step 1: Anthropological Basic

Computes anthropological metrics for each feature:
- `mean_consistency`: cross-prompt generalizability
- `max_affinity`: semantic specialization
- `conditional_consistency`: consistency when the feature is active
- `activation_threshold`: adaptive threshold (p75 + Otsu)

**Output**: `feature_personalities_corrected.json`, `feature_typology.json`, `quality_scores.json`, `metric_correlations.json`
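
The adaptive `activation_threshold` could be sketched roughly as follows. This is an illustration, not the code in `scripts/02_anthropological_basic.py`: the helper names `otsu_threshold` and `activation_threshold` are hypothetical, and since the notebook only says "p75 + Otsu", taking the larger of the two values is an assumption.

```python
import numpy as np

def otsu_threshold(values, bins=64):
    """Otsu's method: pick the histogram split that maximizes between-class variance."""
    hist, edges = np.histogram(np.asarray(values, dtype=float), bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2.0
    w0 = np.cumsum(hist).astype(float)       # samples at or below each candidate split
    w1 = w0[-1] - w0                         # samples above the split
    cum_sum = np.cumsum(hist * centers)
    mu0 = cum_sum / np.maximum(w0, 1.0)      # mean of the lower class
    mu1 = (cum_sum[-1] - cum_sum) / np.maximum(w1, 1.0)  # mean of the upper class
    between = w0 * w1 * (mu0 - mu1) ** 2     # zero whenever one class is empty
    # Return the upper edge of the last bin assigned to the lower class.
    return float(edges[int(np.argmax(between)) + 1])

def activation_threshold(values, bins=64):
    """ASSUMPTION: combine the 75th percentile and Otsu by taking the maximum."""
    p75 = float(np.percentile(values, 75))
    return max(p75, otsu_threshold(values, bins))
```

On clearly bimodal activations, the Otsu split falls in the gap between the two modes, while the p75 term keeps the threshold from dropping too low when most activations are small.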


In [None]:
print("🎭 STEP 1: ANTHROPOLOGICAL BASIC ANALYSIS")
print("=" * 60)

# Run the script
exec(open('scripts/02_anthropological_basic.py', encoding='utf-8').read())

# Show results
with open('output/feature_personalities_corrected.json', 'r') as f:
    personalities = json.load(f)
with open('output/feature_typology.json', 'r') as f:
    typology = json.load(f)

print(f"\n📊 Results:")
print(f"   Total features: {len(personalities)}")
print(f"\n   Typology distribution:")
for t, features in typology.items():
    print(f"      {t}: {len(features)} features")


---

## 🎯 Step 2: Compute Robust Thresholds

Computes robust thresholds for influence-first filtering:
- **τ_inf**: max(p90, 80% cumulative cutoff)
- **τ_aff**: 0.60 (configurable)
- **τ_inf_very_high**: p95 (BOS filter)

**Criterion**: `admitted = (logit_influence >= τ_inf) OR (max_affinity >= τ_aff)`

**Output**: `robust_thresholds.json`
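
A minimal sketch of the threshold rule and the admission criterion above, assuming that the "80% cumulative cutoff" is the influence value at which the descending cumulative sum first reaches 80% of the total. That reading, and the helper names `compute_tau_inf` and `admit`, are assumptions and may differ from `scripts/03_compute_thresholds.py`.

```python
import numpy as np

def compute_tau_inf(influence, pct=90, cum_target=0.80):
    """tau_inf = max(p90, influence value where cumulative share reaches 80%)."""
    influence = np.asarray(influence, dtype=float)
    p_cut = float(np.percentile(influence, pct))
    total = influence.sum()
    if total <= 0:
        return p_cut  # no influence mass: fall back to the percentile alone
    sorted_desc = np.sort(influence)[::-1]
    cum_share = np.cumsum(sorted_desc) / total
    # First position whose cumulative share is >= cum_target.
    idx = min(int(np.searchsorted(cum_share, cum_target)), len(sorted_desc) - 1)
    return max(p_cut, float(sorted_desc[idx]))

def admit(influence, affinity, tau_inf, tau_aff=0.60):
    """Boolean mask: admitted = (logit_influence >= tau_inf) OR (max_affinity >= tau_aff)."""
    influence = np.asarray(influence, dtype=float)
    affinity = np.asarray(affinity, dtype=float)
    return (influence >= tau_inf) | (affinity >= tau_aff)
```

Taking the maximum of the two cutoffs keeps the situational core strict: a feature enters on influence alone only if it clears both the percentile and the cumulative-mass cutoff.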


In [None]:
print("🎯 STEP 2: COMPUTE ROBUST THRESHOLDS")
print("=" * 60)

exec(open('scripts/03_compute_thresholds.py', encoding='utf-8').read())

# Show thresholds
with open('output/robust_thresholds.json', 'r') as f:
    thresholds = json.load(f)

print(f"\n📊 Thresholds:")
print(f"   τ_inf: {thresholds['tau_inf']:.6f}")
print(f"   Situational core: {thresholds['situational_core']['n_features']} features ({thresholds['situational_core']['influence_coverage']*100:.1f}% coverage)")
print(f"   Generalizable scaffold: {thresholds['generalizable_scaffold']['n_features']} features")
print(f"   BOS leakage: {thresholds['bos_leakage_pct']:.1f}% {'✅' if thresholds['bos_leakage_ok'] else '❌'}")


---

## 🌱 Step 3: Cicciotti Supernodes (Semantic)

Builds semantic supernodes through:
1. **Influence-first seed selection**: sort by (logit_influence, max_affinity)
2. **Narrative-guided growth**: compatibility = 0.4·cosine + 0.3·jaccard + 0.2·consistency + 0.1·layer
3. **Coherence tracking**: stop when coherence < threshold

**Output**: `cicciotti_supernodes.json`
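
The weighted compatibility score and the coherence-based stop can be illustrated with the sketch below. Only the four weights come from the notebook. What each term measures, the feature dictionary fields (`acts`, `tokens`, `consistency`, `layer`), the `max_layer_gap` normalization, and the definition of coherence as the running mean compatibility with the seed are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def compatibility(seed, cand, max_layer_gap=26):
    """0.4*cosine + 0.3*jaccard + 0.2*consistency + 0.1*layer (terms are assumed)."""
    cos = cosine(seed["acts"], cand["acts"])                      # activation profiles
    jac = jaccard(seed["tokens"], cand["tokens"])                 # shared top tokens
    cons = 1.0 - abs(seed["consistency"] - cand["consistency"])   # similar consistency
    layer = max(0.0, 1.0 - abs(seed["layer"] - cand["layer"]) / max_layer_gap)
    return 0.4 * cos + 0.3 * jac + 0.2 * cons + 0.1 * layer

def grow_supernode(seed, candidates, min_coherence=0.5):
    """Greedily add the most compatible candidates until coherence drops too low."""
    members, scores = [seed], []
    ranked = sorted(candidates, key=lambda c: compatibility(seed, c), reverse=True)
    for cand in ranked:
        trial = scores + [compatibility(seed, cand)]
        if sum(trial) / len(trial) < min_coherence:  # hypothetical coherence measure
            break
        members.append(cand)
        scores = trial
    return members
```

Because candidates are visited in descending compatibility, the running mean can only fall, so stopping at the first violation never excludes a candidate that a later step would have accepted.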


In [None]:
print("🌱 STEP 3: CICCIOTTI SUPERNODES (SEMANTIC)")
print("=" * 60)

exec(open('scripts/04_cicciotti_supernodes.py', encoding='utf-8').read())

# Show supernodes
with open('output/cicciotti_supernodes.json', 'r') as f:
    cicciotti = json.load(f)

print(f"\n📊 Semantic supernodes: {len(cicciotti)}\n")
for name, data in list(cicciotti.items())[:3]:
    print(f"   {name}: {data['narrative_theme']}")
    print(f"      {data['n_members']} members, influence={data['seed_logit_influence']:.4f}, coherence={data['final_coherence']:.3f}")


---

## 🔧 Step 4: Final Optimized Clustering

Clusters the residual features (those not in the semantic supernodes) that pass the thresholds:
- Identifies quality residuals using τ_inf/τ_aff
- Clusters by dominant_token and layer_range
- Merges the result with the semantic supernodes

**Output**: `final_anthropological_optimized.json`
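
As a rough sketch of the residual clustering described above, the snippet below groups admitted leftover features by a `(dominant_token, layer_range)` key. The feature dictionary layout, the 5-layer bucket width, and the `min_size` filter are hypothetical choices for illustration and are not taken from `scripts/05_final_optimized_clustering.py`.

```python
from collections import defaultdict

def layer_range(layer, width=5):
    """Bucket a layer index into a label such as 'L0-4' (bucket width is assumed)."""
    lo = (layer // width) * width
    return f"L{lo}-{lo + width - 1}"

def cluster_residuals(features, assigned, tau_inf, tau_aff, min_size=2):
    """Group quality residuals by (dominant_token, layer_range)."""
    clusters = defaultdict(list)
    for fid, f in features.items():
        if fid in assigned:
            continue  # already a member of a semantic supernode
        if f["logit_influence"] >= tau_inf or f["max_affinity"] >= tau_aff:
            key = (f["dominant_token"], layer_range(f["layer"]))
            clusters[key].append(fid)
    # Drop singleton groups so that every computational supernode has real members.
    return {k: v for k, v in clusters.items() if len(v) >= min_size}
```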


In [None]:
print("🔧 STEP 4: FINAL OPTIMIZED CLUSTERING")
print("=" * 60)

exec(open('scripts/05_final_optimized_clustering.py', encoding='utf-8').read())

# Show results
with open('output/final_anthropological_optimized.json', 'r') as f:
    final = json.load(f)

n_semantic = len(final.get('semantic_supernodes', {}))
n_computational = len(final.get('computational_supernodes', {}))
total_features = sum(sn['n_members'] for sn in final.get('semantic_supernodes', {}).values())
total_features += sum(sn['n_members'] for sn in final.get('computational_supernodes', {}).values())

print(f"\n📊 Final supernodes:")
print(f"   Semantic: {n_semantic}")
print(f"   Computational: {n_computational}")
print(f"   TOTAL: {n_semantic + n_computational}")
print(f"   Features covered: {total_features}")


---

## ✅ Step 5: Verify Logit Influence

Validates the supernodes' logit influence coverage:
- Computes the % of total influence covered
- Breakdown by feature type
- Rating (EXCELLENT/GOOD/MODERATE/WEAK)

**Output**: `logit_influence_validation.json`
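
The coverage figure amounts to the share of total logit influence carried by features that belong to some supernode. The sketch below shows that calculation under stated assumptions: the function names are hypothetical, and the rating cutoffs (80/60/40) are placeholders, since the notebook does not state the values used by `scripts/06_verify_logit_influence.py`.

```python
def influence_coverage(influence_by_feature, covered_ids):
    """Percentage of total logit influence carried by covered features."""
    total = sum(influence_by_feature.values())
    if total <= 0:
        return 0.0
    covered = sum(v for fid, v in influence_by_feature.items() if fid in covered_ids)
    return 100.0 * covered / total

def rate(coverage_pct, cutoffs=(80.0, 60.0, 40.0)):
    """Map a coverage percentage to a label. The cutoffs are hypothetical."""
    excellent, good, moderate = cutoffs
    if coverage_pct >= excellent:
        return "EXCELLENT"
    if coverage_pct >= good:
        return "GOOD"
    if coverage_pct >= moderate:
        return "MODERATE"
    return "WEAK"
```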


In [None]:
print("✅ STEP 5: VERIFY LOGIT INFLUENCE")
print("=" * 60)

exec(open('scripts/06_verify_logit_influence.py', encoding='utf-8').read())

# Show validation
with open('output/logit_influence_validation.json', 'r') as f:
    validation = json.load(f)

print(f"\n📊 Validation:")
print(f"   Coverage: {validation['coverage_percentage']:.1f}%")
print(f"   Rating: {validation['rating']}")
print(f"\n   Breakdown:")
for t, data in validation['type_breakdown'].items():
    print(f"      {t}: {data['count']} features ({data['pct']:.1f}%)")


---

## 📈 Step 6: Visualizations and Export

Generates visualizations and exports:
- 3D plot of the typology space (mean_consistency × max_affinity × logit_influence)
- Supernode export to Neuronpedia (interactive URL)

**Output**: `feature_space_3d.png`, `neuronpedia_url_improved.txt`


In [None]:
print("📈 STEP 6: VISUALIZATIONS AND EXPORT")
print("=" * 60)

# 3D visualization
try:
    exec(open('scripts/visualization/visualize_feature_space_3d.py', encoding='utf-8').read())
    print("\n✅ 3D visualization: output/feature_space_3d.png")
except Exception as e:
    print(f"⚠️ 3D visualization error: {e}")

# Neuronpedia export
try:
    exec(open('scripts/visualization/neuronpedia_export.py', encoding='utf-8').read())
    print("✅ Neuronpedia export: output/neuronpedia_url_improved.txt")
except Exception as e:
    print(f"⚠️ Export error: {e}")


---

## 🎉 Final Summary


In [None]:
print("\n" + "=" * 60)
print("🎉 PIPELINE COMPLETE")
print("=" * 60)

# Load final results
with open('output/robust_thresholds.json', 'r') as f:
    th = json.load(f)
with open('output/final_anthropological_optimized.json', 'r') as f:
    final = json.load(f)
with open('output/logit_influence_validation.json', 'r') as f:
    val = json.load(f)

n_sem = len(final.get('semantic_supernodes', {}))
n_comp = len(final.get('computational_supernodes', {}))
total_feat = sum(s['n_members'] for s in final.get('semantic_supernodes', {}).values())
total_feat += sum(s['n_members'] for s in final.get('computational_supernodes', {}).values())

print(f"\n✅ Supernodes: {n_sem + n_comp} ({n_sem} semantic + {n_comp} computational)")
print(f"✅ Features covered: {total_feat}")
print(f"✅ Coverage logit influence: {val['coverage_percentage']:.1f}% ({val['rating']})")
print(f"✅ BOS leakage: {th['bos_leakage_pct']:.1f}%")
print(f"\n📁 Output directory: {OUTPUT_DIR.absolute()}")


---

## 📚 References and Documentation

### Key Metrics

| Metric | Range | Meaning |
|--------|-------|---------|
| `mean_consistency` | 0-1 | Cross-prompt generalizability |
| `max_affinity` | 0-1 | Semantic specialization |
| `logit_influence` | 0-∞ | Causal impact on the output |

### Typology

- **Generalist**: High consistency + high affinity + low influence
- **Specialist**: Low consistency + high affinity + high influence
- **Computational**: High consistency + low affinity
- **Hybrid**: Mixed combinations
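
The typology rules above can be read as a small decision function. In the sketch below, the cutoffs separating "high" from "low" (`hi_cons`, `hi_aff`, `hi_inf`) are hypothetical placeholders, because the notebook does not state the values the pipeline uses.

```python
def classify(mean_consistency, max_affinity, logit_influence,
             hi_cons=0.6, hi_aff=0.6, hi_inf=0.01):
    """Assign a typology label. All three cutoffs are assumed values."""
    high_c = mean_consistency >= hi_cons
    high_a = max_affinity >= hi_aff
    high_i = logit_influence >= hi_inf
    if high_c and high_a and not high_i:
        return "Generalist"
    if not high_c and high_a and high_i:
        return "Specialist"
    if high_c and not high_a:
        return "Computational"
    return "Hybrid"  # every remaining combination
```

For example, a feature with high consistency, high affinity, and high influence matches none of the first three rules and is labeled `Hybrid`.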

### Dual View (Influence-First)

1. **Situational Core**: `logit_influence >= τ_inf` - Causally decisive features
2. **Generalizable Scaffold**: `max_affinity >= τ_aff OR mean_consistency >= τ_cons` - Stable features
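
A minimal sketch of the two views as set selections, assuming each feature is a dictionary with `logit_influence`, `max_affinity`, and `mean_consistency` fields. The function name `dual_view` and the data layout are hypothetical.

```python
def dual_view(features, tau_inf, tau_aff, tau_cons):
    """Return (situational_core, generalizable_scaffold) as sets of feature ids."""
    core = {
        fid for fid, f in features.items()
        if f["logit_influence"] >= tau_inf
    }
    scaffold = {
        fid for fid, f in features.items()
        if f["max_affinity"] >= tau_aff or f["mean_consistency"] >= tau_cons
    }
    return core, scaffold
```

The two views are not mutually exclusive under these definitions: a feature that is both highly influential and stable appears in both sets, so `core & scaffold` shows the overlap.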

### Links

- **Circuit Tracer**: https://github.com/safety-research/circuit-tracer
- **Paper**: https://transformer-circuits.pub/2025/attribution-graphs/
- **Neuronpedia**: https://www.neuronpedia.org

---

**Version**: 2.0 (Influence-First)  
**Model**: Gemma-2-2B  
**Last Updated**: 2025-10-09
