# GraphRAG Step 2: Graph Construction & Community Detection

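This notebook continues the GraphRAG pipeline from notebook 01. It builds the graph, detects communities, summarizes them, and persists everything to SQLite.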

## Pipeline Steps
1. **Load extraction results** - Multi-source data from notebook 01
2. **Build NetworkX graph** - Entities as nodes, relationships as edges, with source provenance
3. **Compute graph metrics** - PageRank, centrality, degree
4. **Community detection** - Louvain algorithm for topic clustering
5. **Generate community summaries** - LLM-generated summary report for each community
6. **Store in SQLite** - Persist graph structure, sources, and summaries
7. **Interactive graph visualization** - Standalone HTML with Cytoscape.js (community summaries, chunk expansion)

## Setup

In [17]:
import json
import sqlite3
import webbrowser
from pathlib import Path
from dataclasses import dataclass, asdict

import httpx
import networkx as nx
import community as community_louvain  # python-louvain

OLLAMA_BASE_URL = "http://localhost:11434"
MODEL = "qwen2.5:3b"
DB_PATH = Path("graphrag.db")
GRAPH_HTML_PATH = Path("knowledge_graph.html")

In [18]:
def chat_ollama(prompt: str, system: str = "", temperature: float = 0.0) -> str:
    """Send a chat request to Ollama and return the response."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    
    response = httpx.post(
        f"{OLLAMA_BASE_URL}/api/chat",
        json={
            "model": MODEL,
            "messages": messages,
            "stream": False,
            "options": {"temperature": temperature}
        },
        timeout=120.0
    )
    response.raise_for_status()
    return response.json()["message"]["content"]

## Step 1: Load Extraction Results

In [19]:
# Load multi-source results from notebook 01
with open("extraction_results.json", "r") as f:
    data = json.load(f)

# Global merged data
entities = data["merged"]["entities"]
relationships = data["merged"]["relationships"]
claims = data["merged"]["claims"]
entity_source_map = data["merged"]["entity_source_map"]
entity_chunk_map = data["merged"].get("entity_chunk_map", {})

# Per-source data
sources_data = data["sources"]

# Build flat chunk lookup: global_chunk_index -> {text, source_id}
# Chunks are indexed sequentially across sources in document order
all_chunks = []
chunk_lookup: dict[int, dict] = {}
for source in sources_data:
    for chunk_text in source["chunks"]:
        idx = len(all_chunks)
        entry = {"index": idx, "source_id": source["source_id"], "text": chunk_text}
        all_chunks.append(entry)
        chunk_lookup[idx] = entry

print(f"Loaded: {len(entities)} entities, {len(relationships)} relationships, {len(claims)} claims")
print(f"From {len(sources_data)} sources, {len(all_chunks)} total chunks")
print(f"Entity chunk provenance: {len(entity_chunk_map)} entities mapped to chunks")
print()
print("=== PER-SOURCE BREAKDOWN ===")
for source in sources_data:
    print(f"  [{source['source_type']}] {source['source_id']}: "
          f"{len(source['entities'])}E {len(source['relationships'])}R {len(source['claims'])}C "
          f"({source['content_length']} chars)")

Loaded: 101 entities, 53 relationships, 2 claims
From 7 sources, 52 total chunks
Entity chunk provenance: 101 entities mapped to chunks

=== PER-SOURCE BREAKDOWN ===
  [arxiv] arxiv:2404.16130: 7E 4R 0C (1500 chars)
  [arxiv] arxiv:2404.18021: 7E 1R 0C (1673 chars)
  [arxiv] arxiv:2312.14090: 12E 7R 0C (1724 chars)
  [arxiv] arxiv:2304.04869: 11E 4R 0C (1025 chars)
  [web] web:quanta-memory: 3E 0R 0C (1481 chars)
  [web] web:imf-ai-economy: 29E 19R 1C (5408 chars)
  [web] web:planetary-voyager: 35E 18R 1C (7012 chars)


In [20]:
# Preview entities
print(f"=== ENTITIES ({len(entities)} total from {len(sources_data)} sources) ===")
for e in entities[:10]:
    sources = entity_source_map.get(e["name"], [])
    src_str = f" [{len(sources)} sources]" if len(sources) > 1 else ""
    print(f"  [{e['type']}] {e['name']}{src_str}")
if len(entities) > 10:
    print(f"  ... and {len(entities) - 10} more")

=== ENTITIES (101 total from 7 sources) ===
  [CONCEPT] RETRIEVAL-AUGMENTED GENERATION
  [ORGANIZATION] LLM [2 sources]
  [LOCATION] PRIVATE DOCUMENT COLLECTIONS
  [LOCATION] EXTERNAL KNOWLEDGE SOURCE
  [ORGANIZATION] RAG SYSTEMS
  [PRODUCT] GRAPHRAG
  [UNKNOWN] 
  [CONCEPT] CRISPR-GPT
  [CONCEPT] AGENTIC AUTOMATION
  [CONCEPT] CRISPR
  ... and 91 more


## Step 2: Build NetworkX Graph

In [21]:
# Create directed graph
G = nx.DiGraph()

# Add entity nodes with attributes including source provenance
for entity in entities:
    source_refs = entity_source_map.get(entity["name"], [])
    G.add_node(
        entity["name"],
        type=entity["type"],
        description=entity["description"],
        source_refs=json.dumps(source_refs),
        num_sources=len(source_refs),
    )

# Add relationship edges with attributes
for rel in relationships:
    # Only add edge if both nodes exist
    if rel["source"] in G.nodes and rel["target"] in G.nodes:
        G.add_edge(
            rel["source"],
            rel["target"],
            description=rel["description"],
            weight=rel["strength"]
        )

print(f"Graph created: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")
multi_source_nodes = sum(1 for _, d in G.nodes(data=True) if d.get("num_sources", 1) > 1)
print(f"Multi-source nodes (in 2+ sources): {multi_source_nodes}")

Graph created: 101 nodes, 46 edges
Multi-source nodes (in 2+ sources): 3


In [22]:
# Display graph structure
print("\n=== GRAPH NODES ===")
for node, attrs in list(G.nodes(data=True))[:10]:
    print(f"  {node} ({attrs.get('type', 'N/A')})")

print("\n=== GRAPH EDGES ===")
for source, target, attrs in list(G.edges(data=True))[:10]:
    print(f"  {source} --> {target}")
    print(f"    {attrs.get('description', 'N/A')[:60]}...")


=== GRAPH NODES ===
  RETRIEVAL-AUGMENTED GENERATION (CONCEPT)
  LLM (ORGANIZATION)
  PRIVATE DOCUMENT COLLECTIONS (LOCATION)
  EXTERNAL KNOWLEDGE SOURCE (LOCATION)
  RAG SYSTEMS (ORGANIZATION)
  GRAPHRAG (PRODUCT)
   (UNKNOWN)
  CRISPR-GPT (CONCEPT)
  AGENTIC AUTOMATION (CONCEPT)
  CRISPR (CONCEPT)

=== GRAPH EDGES ===
  RETRIEVAL-AUGMENTED GENERATION --> LLM
    RAG is used to generate relevant information for LLMs...
  RETRIEVAL-AUGMENTED GENERATION --> EXTERNAL KNOWLEDGE SOURCE
    RAG retrieves information from an external knowledge source...
  LLM --> PRIVATE DOCUMENT COLLECTIONS
    LLMs can answer questions over private document collections ...
  GRAPHRAG --> LLM
    GraphRAG uses LLM to build a graph index in two stages...
  AGENTIC AUTOMATION --> CRISPR-GPT
    Agentic Automation uses CRISPR-GPT for gene-editing experime...
  SOCIAL NETWORKS --> MOBILE PHONES
    Social Networks and Mobile Phones contribute to the emergenc...
  CRIMINOLOGISTS --> SEXTOSTRONGERS
    Criminolo

## Step 3: Compute Graph Metrics

Calculate centrality scores for ranking node importance.
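
To make the PageRank score concrete, here is a minimal power-iteration sketch in plain Python. It is an illustration only, not what `nx.pagerank` runs internally: it ignores edge weights and uses a fixed iteration count rather than a convergence tolerance. The helper name `pagerank_sketch` and the toy edges are hypothetical.

```python
def pagerank_sketch(edges, damping=0.85, iterations=100):
    """Unweighted PageRank by power iteration over (source, target) pairs."""
    nodes = sorted({name for edge in edges for name in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        # Every node passes an equal share of its rank along each outgoing edge.
        for node in nodes:
            for target in out_links[node]:
                new_rank[target] += damping * rank[node] / len(out_links[node])
        rank = new_rank
    return rank


# Hypothetical toy graph: A and B point to C, and C points to D.
toy_edges = [("A", "C"), ("B", "C"), ("C", "D")]
scores = pagerank_sketch(toy_edges)
for node, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{score:.4f} | {node}")
```

In this toy graph, D ranks above C even though it has only one incoming edge, because that edge comes from a highly ranked node. The same effect can lift a sparsely connected entity in the real graph.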

In [23]:
# Convert to undirected for some algorithms
G_undirected = G.to_undirected()

# PageRank - importance from incoming edges, weighted by relationship strength
pagerank = nx.pagerank(G, weight="weight")

# Degree centrality - degree (in + out edges) divided by n - 1
degree_centrality = nx.degree_centrality(G)

# Betweenness centrality - how often a node lies on shortest paths (bridges between clusters)
betweenness = nx.betweenness_centrality(G_undirected)

# Store metrics on nodes
for node in G.nodes:
    G.nodes[node]["pagerank"] = pagerank.get(node, 0)
    G.nodes[node]["degree_centrality"] = degree_centrality.get(node, 0)
    G.nodes[node]["betweenness"] = betweenness.get(node, 0)

print("Graph metrics computed: pagerank, degree_centrality, betweenness")

Graph metrics computed: pagerank, degree_centrality, betweenness


In [24]:
# Top entities by PageRank
print("\n=== TOP ENTITIES BY PAGERANK ===")
sorted_by_pagerank = sorted(pagerank.items(), key=lambda x: -x[1])
for node, score in sorted_by_pagerank[:10]:
    node_type = G.nodes[node].get("type", "N/A")
    print(f"  {score:.4f} | [{node_type}] {node}")


=== TOP ENTITIES BY PAGERANK ===
  0.0242 | [CONCEPT] AI
  0.0227 | [LOCATION] PRIVATE DOCUMENT COLLECTIONS
  0.0211 | [LOCATION] INTERSTELLAR SPACE
  0.0211 | [CONCEPT] HYDROGEN ATOM
  0.0208 | [PRODUCT] TRUSTED BLOCKCHAIN DECENTRALIZED APPLICATIONS
  0.0201 | [ORGANIZATION] STARTUPXYZ
  0.0175 | [ORGANIZATION] LLM
  0.0153 | [EVENT] SEXTORTION CASES
  0.0148 | [LOCATION] ADVANCED ECONOMIES
  0.0144 | [CONCEPT] CRISPR-GPT


## Step 4: Community Detection

Use the Louvain algorithm to find clusters of related entities.
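
Louvain searches for a partition with high modularity, the score reported after detection. As a rough illustration of what that number measures, the sketch below computes the standard Newman modularity for an unweighted, undirected toy graph in plain Python. It assumes resolution 1.0 and no edge weights. The helper name `modularity_sketch` and the toy data are hypothetical and not part of python-louvain.

```python
from collections import defaultdict


def modularity_sketch(edges, partition):
    """Newman modularity Q for an unweighted, undirected edge list."""
    m = len(edges)
    internal_edges = defaultdict(int)  # edges with both endpoints in one community
    degree_sum = defaultdict(int)      # summed degree of each community's nodes
    for u, v in edges:
        degree_sum[partition[u]] += 1
        degree_sum[partition[v]] += 1
        if partition[u] == partition[v]:
            internal_edges[partition[u]] += 1
    return sum(
        internal_edges[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in set(partition.values())
    )


# Two triangles joined by a single bridge edge (C-D).
toy_edges = [
    ("A", "B"), ("B", "C"), ("A", "C"),
    ("C", "D"),
    ("D", "E"), ("E", "F"), ("D", "F"),
]
two_groups = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}
one_group = {node: 0 for node in two_groups}
# An isolated node G placed in its own community.
with_isolated = {**two_groups, "G": 2}

q_two = modularity_sketch(toy_edges, two_groups)
q_one = modularity_sketch(toy_edges, one_group)
q_with_isolated = modularity_sketch(toy_edges, with_isolated)
print(f"two groups: {q_two:.4f}, one group: {q_one:.4f}, with isolated node: {q_with_isolated:.4f}")
```

Splitting the triangles scores higher than lumping everything together. An isolated node adds nothing to the score, and because it has no neighbor community to move into, Louvain-style methods leave it as a singleton. That is worth keeping in mind when a sparse graph yields many one-member communities.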

In [25]:
# Louvain community detection (works on undirected graphs)
partition = community_louvain.best_partition(G_undirected, weight="weight", resolution=1.0)

# Store community assignment on nodes
for node, community_id in partition.items():
    G.nodes[node]["community"] = community_id

# Count communities (distinct labels, rather than assuming IDs are contiguous from 0)
num_communities = len(set(partition.values()))
print(f"Detected {num_communities} communities")

# Modularity score (quality of partition)
modularity = community_louvain.modularity(partition, G_undirected, weight="weight")
print(f"Modularity score: {modularity:.4f}")

Detected 58 communities
Modularity score: 0.8496


In [26]:
# Group entities by community
communities: dict[int, list[str]] = {}
for node, community_id in partition.items():
    if community_id not in communities:
        communities[community_id] = []
    communities[community_id].append(node)

print("\n=== COMMUNITIES ===")
for comm_id, members in sorted(communities.items()):
    # Sort members by PageRank within community
    sorted_members = sorted(members, key=lambda x: -pagerank.get(x, 0))
    print(f"\nCommunity {comm_id} ({len(members)} members):")
    for member in sorted_members[:5]:
        node_type = G.nodes[member].get("type", "N/A")
        print(f"  [{node_type}] {member}")
    if len(members) > 5:
        print(f"  ... and {len(members) - 5} more")


=== COMMUNITIES ===

Community 0 (1 members):
  [PERSON] JIMMY CARTER

Community 1 (5 members):
  [LOCATION] PRIVATE DOCUMENT COLLECTIONS
  [ORGANIZATION] LLM
  [LOCATION] EXTERNAL KNOWLEDGE SOURCE
  [CONCEPT] RETRIEVAL-AUGMENTED GENERATION
  [PRODUCT] GRAPHRAG

Community 2 (1 members):
  [ORGANIZATION] RAG SYSTEMS

Community 3 (1 members):
  [UNKNOWN] 

Community 4 (2 members):
  [CONCEPT] CRISPR-GPT
  [CONCEPT] AGENTIC AUTOMATION

Community 5 (1 members):
  [CONCEPT] CRISPR

Community 6 (1 members):
  [ORGANIZATION] CRISPR CORP

Community 7 (3 members):
  [ORGANIZATION] STARTUPXYZ
  [LOCATION] OBSERVATORY
  [PRODUCT] TELESCOPE

Community 8 (1 members):
  [ORGANIZATION] NATURE

Community 9 (1 members):
  [EVENT] SEXTORTION

Community 10 (1 members):
  [CONCEPT] SOCIAL DECENTRALIZED AUTONOMOUS ORGANIZATIONS

Community 11 (2 members):
  [PRODUCT] MOBILE PHONES
  [CONCEPT] SOCIAL NETWORKS

Community 12 (2 members):
  [PERSON] SEXTOSTRONGERS
  [ORGANIZATION] CRIMINOLOGISTS

Community 13 

## Step 5: Generate Community Summaries

Create LLM-powered summaries for each community following GraphRAG's report format.
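
Small local models often wrap their JSON in prose or code fences, so the reply needs defensive parsing before it becomes a summary. One option, sketched here as a hypothetical helper `extract_json_object` (an assumption for illustration, not this notebook's parsing code), is to parse only the span between the first `{` and the last `}`.

```python
import json
from typing import Optional


def extract_json_object(text: str) -> Optional[dict]:
    """Return the outermost JSON object found in an LLM reply, or None."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


# Build the fence marker indirectly so this example nests cleanly in markdown.
fence = "`" * 3
payload_text = '{"title": "Demo", "key_insights": ["a"]}'
fenced_reply = fence + "json\n" + payload_text + "\n" + fence
chatty_reply = "Sure! Here is the report: " + payload_text + " Hope that helps."

print(extract_json_object(fenced_reply))
print(extract_json_object(chatty_reply))
print(extract_json_object("no json here"))
```

This approach tolerates leading or trailing chatter, but it will still fail on replies that contain several separate JSON objects, so a fallback summary remains necessary.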

In [27]:
@dataclass
class CommunitySummary:
    community_id: int
    title: str
    summary: str
    key_entities: list[str]
    key_insights: list[str]

COMMUNITY_SUMMARY_PROMPT = """
You are an expert analyst creating a summary report for a knowledge graph community.

Given the following entities and their relationships, create a structured summary.

ENTITIES IN THIS COMMUNITY:
{entities_info}

RELATIONSHIPS:
{relationships_info}

RELEVANT CLAIMS:
{claims_info}

Create a JSON response with:
1. title: A short descriptive title for this community (5-10 words)
2. summary: A 2-3 sentence executive summary of what this community represents
3. key_insights: 3-5 bullet points of key facts or relationships

Return ONLY valid JSON:
{{
  "title": "...",
  "summary": "...",
  "key_insights": ["...", "...", "..."]
}}

JSON OUTPUT:
"""

def generate_community_summary(community_id: int, members: list[str], G: nx.DiGraph, claims: list[dict]) -> CommunitySummary:
    """Generate a summary for a community using the LLM."""
    
    # Gather entity info
    entities_info = []
    for member in members:
        node_data = G.nodes[member]
        entities_info.append(f"- {member} ({node_data.get('type', 'N/A')}): {node_data.get('description', 'N/A')}")
    
    # Gather relationships within community (a set gives O(1) membership checks)
    member_set = set(members)
    relationships_info = []
    for source, target, data in G.edges(data=True):
        if source in member_set and target in member_set:
            relationships_info.append(f"- {source} -> {target}: {data.get('description', 'N/A')}")
    
    # Gather relevant claims
    claims_info = []
    for claim in claims:
        if claim["subject"] in member_set:
            claims_info.append(f"- [{claim['claim_type']}] {claim['subject']}: {claim['description']}")
    
    prompt = COMMUNITY_SUMMARY_PROMPT.format(
        entities_info="\n".join(entities_info) or "No entities",
        relationships_info="\n".join(relationships_info) or "No relationships",
        claims_info="\n".join(claims_info[:10]) or "No claims"  # Limit claims
    )
    
    response = chat_ollama(prompt)
    
    # Parse JSON
    json_str = response.strip()
    if json_str.startswith("```"):
        json_str = json_str.split("```")[1]
        if json_str.startswith("json"):
            json_str = json_str[4:]
    json_str = json_str.strip()
    
    try:
        data = json.loads(json_str)
        return CommunitySummary(
            community_id=community_id,
            title=data.get("title", f"Community {community_id}"),
            summary=data.get("summary", ""),
            key_entities=members[:5],  # First 5 members; the caller passes them sorted by PageRank
            key_insights=data.get("key_insights", [])
        )
    except json.JSONDecodeError as ex:
        print(f"Failed to parse JSON for community {community_id}: {ex}")
        print(f"Raw response: {response}")
        return CommunitySummary(
            community_id=community_id,
            title=f"Community {community_id}",
            summary="Summary generation failed",
            key_entities=members[:5],
            key_insights=[]
        )

In [28]:
# Generate summaries for each community
community_summaries: list[CommunitySummary] = []

for comm_id, members in sorted(communities.items()):
    print(f"Generating summary for Community {comm_id} ({len(members)} members)...")
    # Sort members by PageRank
    sorted_members = sorted(members, key=lambda x: -pagerank.get(x, 0))
    summary = generate_community_summary(comm_id, sorted_members, G, claims)
    community_summaries.append(summary)
    print(f"  Title: {summary.title}")

print(f"\nGenerated {len(community_summaries)} community summaries")

Generating summary for Community 0 (1 members)...
  Title: Voyager Message Addressing Community
Generating summary for Community 1 (5 members)...
  Title: Knowledge Graph Community for Retrieval-Augmented Generation
Generating summary for Community 2 (1 members)...
  Title: RAG Systems Community
Generating summary for Community 3 (1 members)...
  Title: Unknown Community Analysis Report
Generating summary for Community 4 (2 members)...
  Title: CRISPR-GPT and Agentic Automation Community
Generating summary for Community 5 (1 members)...
  Title: CRISPR Technology Community
Generating summary for Community 6 (1 members)...
  Title: CRISPR CORP Community
Generating summary for Community 7 (3 members)...
  Title: Observatory and Telescope Community
Generating summary for Community 8 (1 members)...
  Title: Nature Journal Community
Generating summary for Community 9 (1 members)...
  Title: Sextortion Case Community
Generating summary for Community 10 (1 members)...
  Title: Social Decentra

In [29]:
# Display community summaries
print("\n" + "="*60)
print("COMMUNITY SUMMARIES")
print("="*60)

for summary in community_summaries:
    print(f"\n### Community {summary.community_id}: {summary.title}")
    print(f"\n{summary.summary}")
    print(f"\nKey Entities: {', '.join(summary.key_entities)}")
    print(f"\nKey Insights:")
    for insight in summary.key_insights:
        print(f"  - {insight}")


COMMUNITY SUMMARIES

### Community 0: Voyager Message Addressing Community

This community focuses on the historical event where U.S. President Jimmy Carter addressed a message inscribed on Voyager spacecraft, representing a unique moment in space communication history.

Key Entities: JIMMY CARTER

Key Insights:
  - The community revolves around the interaction between human leadership and interstellar communication.
  - It highlights the significance of preserving messages for future civilizations by NASA's Voyager program.
  - Jimmy Carter's address underscores the importance of international cooperation in space exploration.

### Community 1: Knowledge Graph Community for Retrieval-Augmented Generation

This community focuses on the integration of Large Language Models (LLMs) with retrieval-augmented generation techniques to enhance information retrieval and document understanding over private collections. It also explores a proposed method, GraphRAG, which uses LLMs in two stages 

## Step 6: Store in SQLite

Persist the graph structure, metrics, sources, and summaries.
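
Relationships are stored as integer foreign keys into the entities table, so reading them back with entity names requires joining `entities` twice. The sketch below shows that join against an in-memory database with a trimmed-down version of the two tables. The connection name `demo`, the reduced column set, and the weight value are assumptions for illustration.

```python
import sqlite3

# In-memory database so the sketch never touches graphrag.db.
demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE entities (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE relationships (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source_id INTEGER REFERENCES entities(id),
    target_id INTEGER REFERENCES entities(id),
    description TEXT,
    weight REAL DEFAULT 1.0
);
""")

# Insert entities first and remember their row ids, as the notebook does.
ids = {}
for name in ["GRAPHRAG", "LLM"]:
    cur = demo.execute("INSERT INTO entities (name) VALUES (?)", (name,))
    ids[name] = cur.lastrowid

demo.execute(
    "INSERT INTO relationships (source_id, target_id, description, weight) VALUES (?, ?, ?, ?)",
    (ids["GRAPHRAG"], ids["LLM"], "GraphRAG uses LLM to build a graph index in two stages", 1.0),
)

# Join entities twice: once for the source name, once for the target name.
rows = demo.execute("""
    SELECT s.name, t.name, r.description, r.weight
    FROM relationships r
    JOIN entities s ON s.id = r.source_id
    JOIN entities t ON t.id = r.target_id
    ORDER BY r.weight DESC
""").fetchall()

for source_name, target_name, description, weight in rows:
    print(f"{source_name} --> {target_name} ({weight}): {description}")

demo.close()
```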

In [30]:
# Create SQLite database (delete and recreate for clean schema on re-runs)
# We delete the file instead of DROP TABLE because notebook 03 creates
# sqlite-vec virtual tables (vec0 module) that can't be dropped without
# loading the extension.
if DB_PATH.exists():
    DB_PATH.unlink()
    print(f"Deleted existing {DB_PATH}")

conn = sqlite3.connect(DB_PATH)
cursor = conn.cursor()

# Create tables with current schema
cursor.executescript("""
-- Sources table: tracks ingested documents
CREATE TABLE sources (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source_id TEXT UNIQUE NOT NULL,
    source_type TEXT,
    title TEXT,
    url TEXT,
    content_type TEXT,
    content_length INTEGER,
    fetched_at TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Entities table
CREATE TABLE entities (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT UNIQUE NOT NULL,
    type TEXT,
    description TEXT,
    pagerank REAL DEFAULT 0,
    degree_centrality REAL DEFAULT 0,
    betweenness REAL DEFAULT 0,
    community_id INTEGER,
    source_refs TEXT,           -- JSON array of source_ids
    num_sources INTEGER DEFAULT 1,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Relationships table
CREATE TABLE relationships (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source_id INTEGER REFERENCES entities(id),
    target_id INTEGER REFERENCES entities(id),
    description TEXT,
    weight REAL DEFAULT 1.0,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Claims table
CREATE TABLE claims (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    subject_id INTEGER REFERENCES entities(id),
    claim_type TEXT,
    description TEXT,
    claim_date TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Community summaries table
CREATE TABLE community_summaries (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    community_id INTEGER UNIQUE NOT NULL,
    title TEXT,
    summary TEXT,
    key_entities TEXT,  -- JSON array
    key_insights TEXT,  -- JSON array
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Chunks table (source text)
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    content TEXT,
    chunk_index INTEGER,
    source_ref TEXT,            -- references sources.source_id
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Create indexes
CREATE INDEX idx_entities_name ON entities(name);
CREATE INDEX idx_entities_community ON entities(community_id);
CREATE INDEX idx_relationships_source ON relationships(source_id);
CREATE INDEX idx_relationships_target ON relationships(target_id);
CREATE INDEX idx_chunks_source ON chunks(source_ref);
CREATE INDEX idx_sources_source_id ON sources(source_id);
""")

conn.commit()
print("Database tables created (fresh schema with sources table and source provenance columns)")

Deleted existing graphrag.db
Database tables created (fresh schema with sources table and source provenance columns)


In [31]:
# Insert entities (with source provenance)
entity_id_map: dict[str, int] = {}

for node, attrs in G.nodes(data=True):
    cursor.execute("""
        INSERT INTO entities (name, type, description, pagerank, degree_centrality, betweenness, community_id, source_refs, num_sources)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
    """, (
        node,
        attrs.get("type"),
        attrs.get("description"),
        attrs.get("pagerank", 0),
        attrs.get("degree_centrality", 0),
        attrs.get("betweenness", 0),
        attrs.get("community"),
        attrs.get("source_refs", "[]"),
        attrs.get("num_sources", 1),
    ))
    entity_id_map[node] = cursor.lastrowid

conn.commit()
print(f"Inserted {len(entity_id_map)} entities")

Inserted 101 entities


In [32]:
# Insert relationships
rel_count = 0
for source, target, attrs in G.edges(data=True):
    source_id = entity_id_map.get(source)
    target_id = entity_id_map.get(target)
    if source_id is not None and target_id is not None:
        cursor.execute("""
            INSERT INTO relationships (source_id, target_id, description, weight)
            VALUES (?, ?, ?, ?)
        """, (
            source_id,
            target_id,
            attrs.get("description"),
            attrs.get("weight", 1.0)
        ))
        rel_count += 1

conn.commit()
print(f"Inserted {rel_count} relationships")

Inserted 46 relationships


In [33]:
# Insert claims
claim_count = 0
for claim in claims:
    subject_id = entity_id_map.get(claim["subject"])
    if subject_id:
        cursor.execute("""
            INSERT INTO claims (subject_id, claim_type, description, claim_date)
            VALUES (?, ?, ?, ?)
        """, (
            subject_id,
            claim.get("claim_type"),
            claim.get("description"),
            claim.get("date")
        ))
        claim_count += 1

conn.commit()
print(f"Inserted {claim_count} claims")

Inserted 1 claims


In [34]:
# Insert community summaries
for summary in community_summaries:
    cursor.execute("""
        INSERT INTO community_summaries (community_id, title, summary, key_entities, key_insights)
        VALUES (?, ?, ?, ?, ?)
    """, (
        summary.community_id,
        summary.title,
        summary.summary,
        json.dumps(summary.key_entities),
        json.dumps(summary.key_insights)
    ))

conn.commit()
print(f"Inserted {len(community_summaries)} community summaries")

Inserted 58 community summaries


In [35]:
# Insert chunks with source provenance
chunk_count = 0
for source in sources_data:
    source_id = source["source_id"]
    for chunk in source["chunks"]:  # chunk_count serves as the global chunk index
        cursor.execute("""
            INSERT INTO chunks (content, chunk_index, source_ref)
            VALUES (?, ?, ?)
        """, (chunk, chunk_count, source_id))
        chunk_count += 1

conn.commit()
print(f"Inserted {chunk_count} chunks (with source_ref)")

Inserted 52 chunks (with source_ref)


In [36]:
# Insert source records
for source in sources_data:
    cursor.execute("""
        INSERT INTO sources (source_id, source_type, title, url, content_type, content_length, fetched_at)
        VALUES (?, ?, ?, ?, ?, ?, ?)
    """, (
        source["source_id"],
        source["source_type"],
        source["title"],
        source["url"],
        source["content_type"],
        source["content_length"],
        source.get("fetched_at", ""),
    ))

conn.commit()
print(f"Inserted {len(sources_data)} source records")

Inserted 7 source records


In [37]:
# Verify data
print("\n=== DATABASE SUMMARY ===")
for table in ["sources", "entities", "relationships", "claims", "community_summaries", "chunks"]:
    cursor.execute(f"SELECT COUNT(*) FROM {table}")
    count = cursor.fetchone()[0]
    print(f"  {table}: {count} rows")

# Show multi-source entities
cursor.execute("SELECT name, source_refs, num_sources FROM entities WHERE num_sources > 1 ORDER BY num_sources DESC")
multi = cursor.fetchall()
if multi:
    print(f"\n=== MULTI-SOURCE ENTITIES ({len(multi)}) ===")
    for name, refs, n in multi:
        print(f"  {name}: {n} sources — {refs}")


=== DATABASE SUMMARY ===
  sources: 7 rows
  entities: 101 rows
  relationships: 46 rows
  claims: 1 rows
  community_summaries: 58 rows
  chunks: 52 rows

=== MULTI-SOURCE ENTITIES (3) ===
  LLM: 2 sources — ["arxiv:2404.16130", "arxiv:2404.18021"]
  STARTUPXYZ: 2 sources — ["arxiv:2404.18021", "arxiv:2304.04869"]
  AI: 2 sources — ["arxiv:2312.14090", "web:imf-ai-economy"]


In [38]:
# Sample query: Top entities by PageRank
print("\n=== TOP ENTITIES (from DB) ===")
cursor.execute("""
    SELECT name, type, pagerank, community_id 
    FROM entities 
    ORDER BY pagerank DESC 
    LIMIT 10
""")
for row in cursor.fetchall():
    print(f"  {row[2]:.4f} | [{row[1]}] {row[0]} (Community {row[3]})")


=== TOP ENTITIES (from DB) ===
  0.0242 | [CONCEPT] AI (Community 15)
  0.0227 | [LOCATION] PRIVATE DOCUMENT COLLECTIONS (Community 1)
  0.0211 | [LOCATION] INTERSTELLAR SPACE (Community 54)
  0.0211 | [CONCEPT] HYDROGEN ATOM (Community 56)
  0.0208 | [PRODUCT] TRUSTED BLOCKCHAIN DECENTRALIZED APPLICATIONS (Community 15)
  0.0201 | [ORGANIZATION] STARTUPXYZ (Community 7)
  0.0175 | [ORGANIZATION] LLM (Community 1)
  0.0153 | [EVENT] SEXTORTION CASES (Community 15)
  0.0148 | [LOCATION] ADVANCED ECONOMIES (Community 30)
  0.0144 | [CONCEPT] CRISPR-GPT (Community 4)


In [39]:
# Sample query: Get community with its entities
print("\n=== COMMUNITY 0 DETAILS (from DB) ===")
cursor.execute("""
    SELECT title, summary FROM community_summaries WHERE community_id = 0
""")
row = cursor.fetchone()
if row:
    print(f"Title: {row[0]}")
    print(f"Summary: {row[1]}")
    
    cursor.execute("""
        SELECT name, type FROM entities WHERE community_id = 0 ORDER BY pagerank DESC LIMIT 5
    """)
    print("\nTop Members:")
    for row in cursor.fetchall():
        print(f"  [{row[1]}] {row[0]}")


=== COMMUNITY 0 DETAILS (from DB) ===
Title: Voyager Message Addressing Community
Summary: This community focuses on the historical event where U.S. President Jimmy Carter addressed a message inscribed on Voyager spacecraft, representing a unique moment in space communication history.

Top Members:
  [PERSON] JIMMY CARTER


In [40]:
# Close connection
conn.close()
print(f"\nDatabase saved to: {DB_PATH.absolute()}")


Database saved to: /Users/pacho-home-server/daily-knowledge-ingestion-assistant/notebooks/graphrag.db


## Step 7: Interactive Graph Visualization

Generate a standalone HTML file with Cytoscape.js and open it in the browser.
Nodes are colored by community and sized by PageRank; the 15-color palette repeats when there are more than 15 communities.
Click a node to see its details, community summary, and expand source chunks.
Click a community in the legend to see its LLM-generated summary.
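
Cytoscape.js takes its graph as a flat list of elements, each wrapping its fields in a `data` object. Nodes carry an `id`; edges additionally carry `source` and `target` that refer to node ids. The sketch below builds such a list for two nodes and one edge, sizing nodes by min-max scaling of PageRank. The helper name `scale_to_size` and the PageRank values are hypothetical.

```python
import json


def scale_to_size(value, lo, hi, min_size=25, max_size=90):
    """Min-max scale a score into a pixel size."""
    if hi == lo:
        return (min_size + max_size) // 2
    return int(min_size + (value - lo) / (hi - lo) * (max_size - min_size))


# Hypothetical PageRank scores for two entities.
toy_pagerank = {"GRAPHRAG": 0.01, "LLM": 0.03}
lo, hi = min(toy_pagerank.values()), max(toy_pagerank.values())

# Node elements: one dict per entity, fields nested under "data".
elements = [
    {"data": {"id": name, "label": name, "size": scale_to_size(pr, lo, hi)}}
    for name, pr in toy_pagerank.items()
]

# Edge element: "source" and "target" must match existing node ids.
elements.append({"data": {"id": "GRAPHRAG-->LLM", "source": "GRAPHRAG", "target": "LLM"}})

payload = json.dumps(elements, indent=2)
print(payload)
```

Serializing the list with `json.dumps` is what allows it to be embedded directly into the standalone HTML page.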

In [41]:
# Prepare graph data for Cytoscape.js export

COMMUNITY_COLORS = [
    "#e6194b", "#3cb44b", "#4363d8", "#f58231", "#911eb4",
    "#42d4f4", "#f032e6", "#bfef45", "#fabed4", "#469990",
    "#dcbeff", "#9A6324", "#fffac8", "#800000", "#aaffc3",
]

# Compute the PageRank range once instead of on every call
_all_pr = [G.nodes[n].get("pagerank", 0) for n in G.nodes]
PR_MIN, PR_MAX = min(_all_pr), max(_all_pr)

def scale_pagerank_to_size(pr: float, min_size: int = 25, max_size: int = 90) -> int:
    """Map a PageRank value to a node pixel size via min-max scaling."""
    if PR_MAX == PR_MIN:
        return (min_size + max_size) // 2
    normalized = (pr - PR_MIN) / (PR_MAX - PR_MIN)
    return int(min_size + normalized * (max_size - min_size))

# Build Cytoscape.js elements array
cyto_elements = []
skipped = 0

for node in G.nodes:
    if not node or not node.strip():
        skipped += 1
        continue

    attrs = G.nodes[node]
    comm_id = attrs.get("community", 0)
    pr = attrs.get("pagerank", 0)
    node_type = attrs.get("type", "UNKNOWN")
    chunk_refs = entity_chunk_map.get(node, [])

    cyto_elements.append({
        "data": {
            "id": node,
            "label": node,
            "type": node_type,
            "description": attrs.get("description", ""),
            "community": comm_id,
            "pagerank": round(pr, 6),
            "degree_centrality": round(attrs.get("degree_centrality", 0), 4),
            "betweenness": round(attrs.get("betweenness", 0), 4),
            "num_sources": attrs.get("num_sources", 1),
            "source_refs": attrs.get("source_refs", "[]"),
            "color": COMMUNITY_COLORS[comm_id % len(COMMUNITY_COLORS)],
            "size": scale_pagerank_to_size(pr),
            "chunk_count": len(chunk_refs),
        }
    })

valid_node_ids = {e["data"]["id"] for e in cyto_elements}

for source, target, attrs in G.edges(data=True):
    if source in valid_node_ids and target in valid_node_ids:
        cyto_elements.append({
            "data": {
                "id": f"{source}-->{target}",
                "source": source,
                "target": target,
                "description": attrs.get("description", ""),
                "weight": attrs.get("weight", 1.0),
            }
        })

# Build chunk data for the visualization
cyto_chunk_map = {}
for entity_name in valid_node_ids:
    chunk_refs = entity_chunk_map.get(entity_name, [])
    if chunk_refs:
        chunks_for_entity = []
        for ref in chunk_refs:
            chunk_idx = ref["chunk_index"]
            chunk_data = chunk_lookup.get(chunk_idx)
            if chunk_data:
                chunks_for_entity.append({
                    "index": chunk_idx,
                    "source_id": ref["source_id"],
                    "text": chunk_data["text"],
                })
        if chunks_for_entity:
            cyto_chunk_map[entity_name] = chunks_for_entity

# Build community summary data for the visualization
cyto_community_summaries = {}
for summary in community_summaries:
    cyto_community_summaries[summary.community_id] = {
        "title": summary.title,
        "summary": summary.summary,
        "key_entities": summary.key_entities,
        "key_insights": summary.key_insights,
    }

node_count = len([e for e in cyto_elements if "source" not in e["data"]])
edge_count = len([e for e in cyto_elements if "source" in e["data"]])
entities_with_chunks = sum(1 for v in cyto_chunk_map.values() if v)
total_chunk_refs = sum(len(v) for v in cyto_chunk_map.values())
print(f"Prepared {node_count} nodes and {edge_count} edges for Cytoscape.js")
print(f"Chunk provenance: {entities_with_chunks} entities linked to {total_chunk_refs} chunk references")
print(f"Community summaries: {len(cyto_community_summaries)} communities with LLM summaries")
if skipped:
    print(f"Skipped {skipped} nodes with empty names")

Prepared 100 nodes and 46 edges for Cytoscape.js
Chunk provenance: 100 entities linked to 138 chunk references
Community summaries: 58 communities with LLM summaries
Skipped 1 nodes with empty names


In [42]:
# Generate standalone HTML with Cytoscape.js and open in browser

# Build community legend data for the sidebar
legend_items = []
for comm_id, members in sorted(communities.items()):
    color = COMMUNITY_COLORS[comm_id % len(COMMUNITY_COLORS)]
    sorted_members = sorted(members, key=lambda x: -pagerank.get(x, 0))
    summary = cyto_community_summaries.get(comm_id, {})
    legend_items.append({
        "id": comm_id,
        "color": color,
        "count": len(members),
        "members": sorted_members[:5],
        "title": summary.get("title", f"Community {comm_id}"),
    })

html_content = """<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>DKIA Knowledge Graph</title>
<script src="https://cdnjs.cloudflare.com/ajax/libs/cytoscape/3.30.4/cytoscape.min.js"></script>
<style>
  * { margin: 0; padding: 0; box-sizing: border-box; }
  body {
    font-family: 'Inter', -apple-system, BlinkMacSystemFont, sans-serif;
    background: #0a0a0f;
    color: #e0e0e0;
    display: flex;
    height: 100vh;
    overflow: hidden;
  }
  #cy {
    flex: 1;
    background: #0a0a0f;
    position: relative;
  }
  #sidebar {
    width: 360px;
    background: #12121a;
    border-left: 1px solid #2a2a3a;
    display: flex;
    flex-direction: column;
    overflow: hidden;
  }
  #sidebar h2 {
    padding: 16px 20px 12px;
    font-size: 14px;
    text-transform: uppercase;
    letter-spacing: 1.5px;
    color: #f58231;
    border-bottom: 1px solid #2a2a3a;
  }
  #node-info {
    padding: 16px 20px;
    border-bottom: 1px solid #2a2a3a;
    min-height: 120px;
    max-height: 380px;
    overflow-y: auto;
    font-size: 13px;
    line-height: 1.6;
  }
  #node-info .name {
    font-size: 16px;
    font-weight: 600;
    color: #fff;
    margin-bottom: 8px;
  }
  #node-info .type-badge {
    display: inline-block;
    padding: 2px 8px;
    border-radius: 4px;
    font-size: 11px;
    font-weight: 600;
    margin-bottom: 10px;
  }
  #node-info .metric { color: #888; }
  #node-info .metric span { color: #e0e0e0; font-weight: 500; }
  #node-info .desc {
    margin-top: 10px;
    color: #aaa;
    font-style: italic;
    font-size: 12px;
  }
  #node-info .placeholder { color: #555; font-style: italic; }
  #node-info .chunk-hint {
    margin-top: 8px;
    padding: 6px 10px;
    background: #1a1a2e;
    border-radius: 4px;
    font-size: 11px;
    color: #42d4f4;
  }
  /* Community summary panel in node-info */
  .comm-summary-box {
    margin-top: 12px;
    padding: 10px 12px;
    background: #1a1a2e;
    border-radius: 6px;
    border-left: 3px solid #f58231;
  }
  .comm-summary-box .comm-title {
    font-size: 13px;
    font-weight: 600;
    color: #f58231;
    margin-bottom: 6px;
  }
  .comm-summary-box .comm-text {
    font-size: 12px;
    color: #aaa;
    line-height: 1.5;
    margin-bottom: 8px;
  }
  .comm-summary-box .comm-insights {
    list-style: none;
    padding: 0;
  }
  .comm-summary-box .comm-insights li {
    font-size: 11px;
    color: #888;
    padding: 2px 0 2px 12px;
    position: relative;
    line-height: 1.4;
  }
  .comm-summary-box .comm-insights li::before {
    content: '';
    position: absolute;
    left: 0;
    top: 8px;
    width: 5px;
    height: 5px;
    border-radius: 50%;
    background: #f58231;
  }
  #chunk-panel {
    border-bottom: 1px solid #2a2a3a;
    max-height: 0;
    overflow: hidden;
    transition: max-height 0.3s ease;
  }
  #chunk-panel.open {
    max-height: 400px;
    overflow-y: auto;
  }
  #chunk-panel .chunk-header {
    padding: 10px 20px 6px;
    font-size: 11px;
    text-transform: uppercase;
    letter-spacing: 1px;
    color: #42d4f4;
    position: sticky;
    top: 0;
    background: #12121a;
  }
  .chunk-card {
    margin: 4px 12px;
    padding: 10px 14px;
    background: #1a1a2e;
    border-radius: 6px;
    border-left: 3px solid #42d4f4;
    cursor: default;
    transition: background 0.15s;
  }
  .chunk-card:hover { background: #222240; }
  .chunk-card .chunk-source {
    font-size: 10px;
    color: #666;
    margin-bottom: 4px;
  }
  .chunk-card .chunk-text {
    font-size: 12px;
    color: #bbb;
    line-height: 1.5;
    max-height: 44px;
    overflow: hidden;
    transition: max-height 0.3s ease;
  }
  .chunk-card:hover .chunk-text { max-height: 600px; }
  .chunk-card .expand-hint {
    font-size: 10px;
    color: #555;
    margin-top: 4px;
  }
  .chunk-card:hover .expand-hint { display: none; }
  #legend-header {
    padding: 12px 20px 8px;
    font-size: 12px;
    text-transform: uppercase;
    letter-spacing: 1px;
    color: #888;
    border-bottom: 1px solid #2a2a3a;
  }
  #legend {
    flex: 1;
    overflow-y: auto;
    padding: 8px 20px;
  }
  .legend-item {
    display: flex;
    align-items: flex-start;
    gap: 8px;
    padding: 6px 0;
    cursor: pointer;
    border-radius: 4px;
    transition: background 0.15s;
  }
  .legend-item:hover { background: #1a1a2a; }
  .legend-dot {
    width: 12px;
    height: 12px;
    border-radius: 50%;
    flex-shrink: 0;
    margin-top: 3px;
  }
  .legend-text {
    font-size: 12px;
    line-height: 1.4;
    color: #aaa;
  }
  .legend-text .count { color: #666; }
  .legend-text .title { color: #ccc; font-weight: 500; }
  #controls {
    padding: 12px 20px;
    border-top: 1px solid #2a2a3a;
    display: flex;
    gap: 8px;
    flex-wrap: wrap;
  }
  #controls button {
    padding: 6px 12px;
    background: #1a1a2a;
    border: 1px solid #2a2a3a;
    color: #ccc;
    border-radius: 4px;
    cursor: pointer;
    font-size: 12px;
    transition: all 0.15s;
  }
  #controls button:hover { background: #2a2a3a; color: #fff; }
  #stats {
    padding: 8px 20px;
    font-size: 11px;
    color: #555;
    border-top: 1px solid #2a2a3a;
  }
  #chunk-tooltip {
    display: none;
    position: fixed;
    max-width: 450px;
    padding: 14px 18px;
    background: #1a1a2e;
    border: 1px solid #42d4f4;
    border-radius: 8px;
    font-size: 12px;
    color: #ccc;
    line-height: 1.6;
    z-index: 9999;
    pointer-events: none;
    box-shadow: 0 8px 32px rgba(0,0,0,0.6);
  }
  #chunk-tooltip .tt-source {
    font-size: 10px;
    color: #42d4f4;
    margin-bottom: 6px;
    text-transform: uppercase;
    letter-spacing: 0.5px;
  }
  #chunk-tooltip .tt-text {
    white-space: pre-wrap;
    word-wrap: break-word;
  }
</style>
</head>
<body>
<div id="cy"></div>
<div id="chunk-tooltip"></div>
<div id="sidebar">
  <h2>DKIA Knowledge Graph</h2>
  <div id="node-info">
    <div class="placeholder">Click a node to see details<br>Click a community in the legend to see its summary</div>
  </div>
  <div id="chunk-panel"></div>
  <div id="legend-header">COMMUNITIES (LEGEND_COUNT)</div>
  <div id="legend">LEGEND_HTML</div>
  <div id="controls">
    <button onclick="cy.fit(undefined, 40)">Fit</button>
    <button onclick="cy.layout({name:'cose',animate:true,nodeOverlap:20,idealEdgeLength:100,nodeRepulsion:8000,gravity:0.25,numIter:300}).run()">Re-layout</button>
    <button onclick="resetView()">Reset</button>
  </div>
  <div id="stats">NODES_COUNT nodes &middot; EDGES_COUNT edges &middot; COMM_COUNT communities</div>
</div>

<script>
const elements = ELEMENTS_JSON;
const chunkMap = CHUNK_MAP_JSON;
const commSummaries = COMMUNITY_SUMMARIES_JSON;

const cy = cytoscape({
  container: document.getElementById('cy'),
  elements: elements,
  style: [
    {
      selector: 'node',
      style: {
        'label': 'data(label)',
        'background-color': 'data(color)',
        'width': 'data(size)',
        'height': 'data(size)',
        'shape': 'ellipse',
        'font-size': '9px',
        'text-valign': 'bottom',
        'text-halign': 'center',
        'text-margin-y': 4,
        'color': '#999',
        'text-outline-color': '#0a0a0f',
        'text-outline-width': 2,
        'border-width': 1.5,
        'border-color': '#333',
        'transition-property': 'opacity, border-color, border-width',
        'transition-duration': '0.2s',
      }
    },
    {
      selector: 'edge',
      style: {
        'width': 1.5,
        'line-color': '#333',
        'target-arrow-color': '#444',
        'target-arrow-shape': 'triangle',
        'curve-style': 'bezier',
        'opacity': 0.5,
        'transition-property': 'opacity, line-color, width',
        'transition-duration': '0.2s',
      }
    },
    {
      selector: 'node:selected',
      style: {
        'border-width': 4,
        'border-color': '#f58231',
        'color': '#fff',
        'font-size': '12px',
        'font-weight': 'bold',
        'text-outline-width': 3,
        'z-index': 999,
      }
    },
    {
      selector: '.highlighted',
      style: {
        'opacity': 1,
        'border-width': 3,
        'border-color': '#f58231',
        'color': '#fff',
        'z-index': 999,
      }
    },
    {
      selector: 'edge.highlighted',
      style: {
        'opacity': 1,
        'width': 2.5,
        'line-color': '#f58231',
        'target-arrow-color': '#f58231',
      }
    },
    {
      selector: '.dimmed',
      style: { 'opacity': 0.12 }
    },
    {
      selector: '.chunk-node',
      style: {
        'background-color': '#42d4f4',
        'background-opacity': 0.25,
        'border-color': '#42d4f4',
        'border-width': 2,
        'width': 18,
        'height': 18,
        'shape': 'ellipse',
        'label': 'data(label)',
        'font-size': '8px',
        'color': '#42d4f4',
        'text-valign': 'bottom',
        'text-margin-y': 3,
        'text-outline-color': '#0a0a0f',
        'text-outline-width': 2,
        'z-index': 1000,
        'opacity': 1,
      }
    },
    {
      selector: '.chunk-node:active, .chunk-node:selected',
      style: { 'background-opacity': 0.6, 'border-width': 3 }
    },
    {
      selector: '.chunk-edge',
      style: {
        'width': 1,
        'line-color': '#42d4f4',
        'line-style': 'dashed',
        'opacity': 0.4,
        'target-arrow-shape': 'none',
        'curve-style': 'bezier',
        'z-index': 999,
      }
    },
  ],
  layout: {
    name: 'cose',
    animate: false,
    nodeOverlap: 20,
    idealEdgeLength: 100,
    nodeRepulsion: 8000,
    gravity: 0.25,
    numIter: 400,
  },
  minZoom: 0.15,
  maxZoom: 4,
  wheelSensitivity: 0.3,
});

const tooltip = document.getElementById('chunk-tooltip');
const chunkPanel = document.getElementById('chunk-panel');

function clearChunks() {
  cy.remove(cy.elements('.chunk-node, .chunk-edge'));
  chunkPanel.classList.remove('open');
  chunkPanel.innerHTML = '';
  tooltip.style.display = 'none';
}

function resetView() {
  clearChunks();
  cy.elements().removeClass('dimmed highlighted');
  document.getElementById('node-info').innerHTML =
    '<div class="placeholder">Click a node to see details<br>Click a community in the legend to see its summary</div>';
}

function buildCommSummaryHtml(commId, color) {
  const s = commSummaries[commId];
  if (!s) return '';
  let html = '<div class="comm-summary-box" style="border-left-color:' + (color || '#f58231') + '">';
  html += '<div class="comm-title">Community ' + commId + ': ' + String(s.title || '').replace(/</g, '&lt;') + '</div>';
  html += '<div class="comm-text">' + String(s.summary || '').replace(/</g, '&lt;') + '</div>';
  if (s.key_insights && s.key_insights.length > 0) {
    html += '<ul class="comm-insights">';
    s.key_insights.forEach(function(insight) {
      html += '<li>' + insight.replace(/</g, '&lt;') + '</li>';
    });
    html += '</ul>';
  }
  html += '</div>';
  return html;
}

function showChunks(entityId) {
  clearChunks();
  const chunks = chunkMap[entityId];
  if (!chunks || chunks.length === 0) return;

  const entityNode = cy.getElementById(entityId);
  const pos = entityNode.position();
  const radius = 120 + chunks.length * 8;

  const addedElements = [];
  chunks.forEach(function(chunk, i) {
    const angle = (2 * Math.PI * i) / chunks.length - Math.PI / 2;
    const chunkId = 'chunk-' + entityId + '-' + chunk.index;
    const preview = chunk.source_id.split(':').pop();

    addedElements.push({
      group: 'nodes',
      data: {
        id: chunkId,
        label: '#' + chunk.index + ' ' + preview,
        fullText: chunk.text,
        sourceId: chunk.source_id,
        chunkIndex: chunk.index,
      },
      position: {
        x: pos.x + radius * Math.cos(angle),
        y: pos.y + radius * Math.sin(angle),
      },
      classes: 'chunk-node',
    });
    addedElements.push({
      group: 'edges',
      data: {
        id: 'cedge-' + chunkId,
        source: entityId,
        target: chunkId,
      },
      classes: 'chunk-edge',
    });
  });

  cy.add(addedElements);

  // Sidebar chunk panel
  let html = '<div class="chunk-header">' + chunks.length + ' source chunks</div>';
  chunks.forEach(function(chunk) {
    const fullText = chunk.text.replace(/</g, '&lt;').replace(/>/g, '&gt;');
    html += '<div class="chunk-card" data-chunk-id="chunk-' + entityId + '-' + chunk.index + '">' +
      '<div class="chunk-source">#' + chunk.index + ' &middot; ' + chunk.source_id + '</div>' +
      '<div class="chunk-text">' + fullText + '</div>' +
      '<div class="expand-hint">hover to expand</div>' +
      '</div>';
  });
  chunkPanel.innerHTML = html;
  chunkPanel.classList.add('open');

  chunkPanel.querySelectorAll('.chunk-card').forEach(function(card) {
    card.addEventListener('mouseenter', function() {
      const cnode = cy.getElementById(card.dataset.chunkId);
      if (cnode.length) {
        cnode.style('background-opacity', 0.7);
        cnode.style('border-width', 3);
        cnode.style('width', 24);
        cnode.style('height', 24);
      }
    });
    card.addEventListener('mouseleave', function() {
      const cnode = cy.getElementById(card.dataset.chunkId);
      if (cnode.length) {
        cnode.style('background-opacity', 0.25);
        cnode.style('border-width', 2);
        cnode.style('width', 18);
        cnode.style('height', 18);
      }
    });
  });
}

// Hover chunk nodes: tooltip
cy.on('mouseover', '.chunk-node', function(evt) {
  const node = evt.target;
  const d = node.data();
  const rpos = node.renderedPosition();
  const container = cy.container().getBoundingClientRect();

  tooltip.innerHTML =
    '<div class="tt-source">#' + d.chunkIndex + ' &middot; ' + d.sourceId + '</div>' +
    '<div class="tt-text">' + d.fullText.replace(/</g, '&lt;').replace(/>/g, '&gt;') + '</div>';

  let left = container.left + rpos.x + 20;
  let top = container.top + rpos.y - 20;
  if (left + 460 > window.innerWidth) left = container.left + rpos.x - 470;
  if (top + 300 > window.innerHeight) top = window.innerHeight - 310;
  if (top < 10) top = 10;

  tooltip.style.left = left + 'px';
  tooltip.style.top = top + 'px';
  tooltip.style.display = 'block';
});

cy.on('mouseout', '.chunk-node', function() {
  tooltip.style.display = 'none';
});

// Click entity node: info + community summary + chunks
cy.on('tap', 'node', function(evt) {
  const node = evt.target;
  if (node.hasClass('chunk-node')) return;

  const d = node.data();

  clearChunks();
  cy.elements().addClass('dimmed').removeClass('highlighted');
  const neighborhood = node.neighborhood().add(node);
  neighborhood.removeClass('dimmed').addClass('highlighted');

  const sources = JSON.parse(d.source_refs || '[]');
  const sourceStr = sources.length > 0 ? sources.join(', ') : 'single source';

  let infoHtml =
    '<div class="name">' + String(d.label).replace(/</g, '&lt;') + '</div>' +
    '<span class="type-badge" style="background:' + d.color + '33; color:' + d.color + '">' + d.type + '</span>' +
    ' <span class="type-badge" style="background:#2a2a3a; color:#888">C' + d.community + '</span>' +
    '<div class="metric">PageRank: <span>' + d.pagerank.toFixed(4) + '</span></div>' +
    '<div class="metric">Degree centrality: <span>' + d.degree_centrality.toFixed(4) + '</span></div>' +
    '<div class="metric">Betweenness: <span>' + d.betweenness.toFixed(4) + '</span></div>' +
    '<div class="metric">Sources: <span>' + d.num_sources + '</span> (' + sourceStr + ')</div>' +
    (d.description ? '<div class="desc">' + String(d.description).replace(/</g, '&lt;') + '</div>' : '');

  if (d.chunk_count > 0) {
    infoHtml += '<div class="chunk-hint">' + d.chunk_count + ' source chunks — expanding on graph</div>';
  }

  // Community summary for this entity's community
  infoHtml += buildCommSummaryHtml(d.community, d.color);

  document.getElementById('node-info').innerHTML = infoHtml;

  if (d.chunk_count > 0) {
    showChunks(d.id);
  }
});

// Click background: reset
cy.on('tap', function(evt) {
  if (evt.target === cy) {
    resetView();
  }
});

// Legend: click to focus community and show summary
document.querySelectorAll('.legend-item').forEach(function(item) {
  item.addEventListener('click', function() {
    clearChunks();
    const commId = parseInt(item.dataset.community, 10);
    const color = item.dataset.color;
    cy.elements().addClass('dimmed').removeClass('highlighted');
    const commNodes = cy.nodes().filter(function(n) { return n.data('community') === commId; });
    const commEdges = commNodes.edgesWith(commNodes);
    commNodes.add(commEdges).removeClass('dimmed').addClass('highlighted');
    cy.fit(commNodes, 60);

    // Show community summary in sidebar
    const s = commSummaries[commId];
    let infoHtml = '';
    if (s) {
      infoHtml += '<div class="name" style="color:' + color + '">Community ' + commId + '</div>';
      infoHtml += '<div class="metric">Members: <span>' + commNodes.length + ' entities</span></div>';
      infoHtml += buildCommSummaryHtml(commId, color);
      const memberNames = [];
      commNodes.forEach(function(n) { memberNames.push(n.data('label')); });
      infoHtml += '<div style="margin-top:10px;font-size:11px;color:#666">Members: ' +
        memberNames.join(', ').replace(/</g, '&lt;') + '</div>';
    }
    document.getElementById('node-info').innerHTML = infoHtml;
  });
});
</script>
</body>
</html>"""

# Build legend HTML (titles are LLM-generated, so escape text before embedding it in markup)
from html import escape

legend_html_parts = []
for item in legend_items:
    # Show the top three members by PageRank, then a "+N" count for the rest
    members_str = escape(", ".join(item["members"][:3]))
    if item["count"] > 3:
        members_str += f" +{item['count'] - 3}"
    legend_html_parts.append(
        f'<div class="legend-item" data-community="{item["id"]}" data-color="{item["color"]}">'
        f'<div class="legend-dot" style="background:{item["color"]}"></div>'
        f'<div class="legend-text"><span class="title">{escape(str(item["title"]))}</span> '
        f'<span class="count">({item["count"]})</span><br>{members_str}</div>'
        f'</div>'
    )

# Substitute placeholders. The scalar and legend placeholders go first so their
# names cannot be rewritten inside the much larger JSON payloads inserted afterwards.
# "</" is escaped as "<\/" so text containing "</script>" cannot end the inline script early.
html_content = html_content.replace("LEGEND_COUNT", str(len(legend_items)))
html_content = html_content.replace("NODES_COUNT", str(node_count))
html_content = html_content.replace("EDGES_COUNT", str(edge_count))
html_content = html_content.replace("COMM_COUNT", str(num_communities))
html_content = html_content.replace("LEGEND_HTML", "\n".join(legend_html_parts))
html_content = html_content.replace("COMMUNITY_SUMMARIES_JSON", json.dumps(cyto_community_summaries).replace("</", "<\\/"))
html_content = html_content.replace("CHUNK_MAP_JSON", json.dumps(cyto_chunk_map).replace("</", "<\\/"))
html_content = html_content.replace("ELEMENTS_JSON", json.dumps(cyto_elements).replace("</", "<\\/"))

# Write as UTF-8 (matching the <meta charset> declaration) and open in the default browser
GRAPH_HTML_PATH.write_text(html_content, encoding="utf-8")
webbrowser.open(GRAPH_HTML_PATH.absolute().as_uri())

print(f"Graph visualization written to: {GRAPH_HTML_PATH.absolute()}")
print(f"Opened in browser. {node_count} nodes, {edge_count} edges, {num_communities} communities.")
print()
print("Interactions:")
print("  - Click entity node: highlight neighborhood + community summary + expand source chunks")
print("  - Click community in legend: focus cluster + show LLM-generated summary")
print("  - Hover chunk node on graph: tooltip with full text")
print("  - Hover chunk card in sidebar: highlights corresponding node on graph")
print("  - Click background: reset view")

Graph visualization written to: /Users/pacho-home-server/daily-knowledge-ingestion-assistant/notebooks/knowledge_graph.html
Opened in browser. 100 nodes, 46 edges, 58 communities.

Interactions:
  - Click entity node: highlight neighborhood + community summary + expand source chunks
  - Click community in legend: focus cluster + show LLM-generated summary
  - Hover chunk node on graph: tooltip with full text
  - Hover chunk card in sidebar: highlights corresponding node on graph
  - Click background: reset view
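
The visualization cell embeds `json.dumps` output directly inside an inline `<script>` element and fills the template with chained `str.replace` calls. The sketch below shows one possible way to harden that pattern: escape `</` so that chunk text cannot close the script element, and substitute all placeholders in a single pass so that placeholder names appearing inside the data are left alone. The helper names `json_for_script` and `fill_template` are hypothetical and do not appear in the notebook.

```python
import json
import re


def json_for_script(obj) -> str:
    """Serialize obj so it can sit inside an inline <script> element."""
    text = json.dumps(obj)
    # "<\/" is a valid JSON/JavaScript string escape for "</", so the parsed
    # value is unchanged while the HTML parser no longer sees "</script>".
    return text.replace("</", "<\\/")


def fill_template(template: str, values: dict) -> str:
    """Replace every placeholder in one pass over the template.

    Inserted values are never rescanned, so a placeholder name that happens
    to occur inside a payload is not substituted.
    """
    # Longest keys first so a longer placeholder wins over any shorter prefix.
    keys = sorted(values, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: values[m.group(0)], template)


# Toy data: the chunk text contains both a closing tag and a placeholder name.
payload = json_for_script({"text": "a </script> b NODES_COUNT"})
page = fill_template(
    "<script>const d = DATA_JSON;</script><p>NODES_COUNT nodes</p>",
    {"DATA_JSON": payload, "NODES_COUNT": "3"},
)
print(page)
```

With this approach the order of substitutions no longer matters, at the cost of building one regular expression from the placeholder names.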


## Summary

This notebook completed:

1. **Graph Construction** - Built NetworkX graph from multi-source entities and relationships
2. **Graph Metrics** - Computed PageRank, degree centrality, and betweenness centrality
3. **Community Detection** - Applied Louvain algorithm to find cross-domain topic clusters
4. **Community Summaries** - Generated LLM-powered summaries for each cluster
5. **SQLite Storage** - Persisted graph, sources, and provenance data
6. **Interactive Visualization** - Standalone HTML with Cytoscape.js: community summaries, chunk expansion, source provenance
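
Louvain searches for a partition that maximizes modularity, so it can help to see how that score is computed for a fixed partition. The following is a minimal pure-Python sketch for an undirected, unweighted graph without self-loops. The function name `modularity` and the toy graph are illustrative assumptions. The notebook itself relies on `python-louvain` rather than this code.

```python
from collections import defaultdict


def modularity(edges, partition):
    """Modularity Q = sum_c [ L_c / m - (d_c / (2m))^2 ].

    edges:     list of (u, v) pairs, undirected, no self-loops
    partition: dict mapping node -> community id
    L_c is the number of edges inside community c, d_c the sum of the
    degrees of its nodes, and m the total number of edges.
    """
    m = len(edges)
    internal = defaultdict(int)
    degree_sum = defaultdict(int)
    for u, v in edges:
        cu, cv = partition[u], partition[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )


# Toy graph: two triangles joined by a single bridge edge (c, d).
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
merged = {node: 0 for node in split}

print(modularity(edges, split))   # 5/14: the two-triangle split scores higher
print(modularity(edges, merged))  # 0.0: a single community has zero modularity
```

A graph with many small communities relative to its edge count, like the 58 communities over 46 edges reported above, typically contains many isolated or weakly connected nodes, each of which ends up in its own community.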

## Next Steps

In the next notebook we will:
1. **Add embeddings** - Embed entities and chunks using nomic-embed-text
2. **Vector search** - Set up sqlite-vec for semantic retrieval
3. **Triple-factor retrieval** - Combine semantic similarity, content-type-aware temporal relevance, and graph centrality
4. **Cross-domain queries** - Test retrieval across all 7 source domains
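
As a preview of the triple-factor idea, the sketch below combines the three signals into one score using a weighted sum. Everything specific here is an assumption made for illustration: the function names, the weights, the exponential decay, and the per-content-type half-lives are placeholders, and the next notebook may combine the factors differently.

```python
import math

# Hypothetical half-lives in days per content type (placeholder values).
HALF_LIFE_DAYS = {"news": 7.0, "paper": 365.0}
DEFAULT_HALF_LIFE_DAYS = 30.0


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recency(age_days, content_type):
    """Exponential decay: the score halves every half-life for that type."""
    half_life = HALF_LIFE_DAYS.get(content_type, DEFAULT_HALF_LIFE_DAYS)
    return 0.5 ** (age_days / half_life)


def triple_score(query_vec, doc_vec, age_days, content_type, centrality,
                 weights=(0.6, 0.2, 0.2)):
    """Weighted sum of semantic, temporal, and graph factors.

    centrality is assumed to be pre-normalized to [0, 1], for example
    PageRank divided by the maximum PageRank in the graph.
    """
    w_sem, w_time, w_graph = weights
    return (
        w_sem * cosine(query_vec, doc_vec)
        + w_time * recency(age_days, content_type)
        + w_graph * centrality
    )


# The same embedding match scored for a fresh and a month-old news item.
fresh = triple_score([1.0, 0.0], [1.0, 0.0], 0, "news", 0.5)
stale = triple_score([1.0, 0.0], [1.0, 0.0], 30, "news", 0.5)
print(fresh > stale)
```

Keeping the three factors on a shared 0 to 1 scale makes the weights directly comparable, which is why the centrality input is normalized before it enters the sum.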