# GraphRAG Step 2: Graph Construction & Community Detection

This notebook continues the GraphRAG pipeline started in notebook 01.

## Pipeline Steps
1. **Load extraction results** - Multi-source data from notebook 01
2. **Build NetworkX graph** - Entities as nodes, relationships as edges, with source provenance
3. **Compute graph metrics** - PageRank, centrality, degree
4. **Community detection** - Leiden algorithm for topic clustering
5. **Generate community summaries** - LLM-powered hierarchical summaries
6. **Store in SQLite** - Persist graph structure, sources, and summaries
7. **Interactive graph visualization** - Standalone HTML with Cytoscape.js (community summaries, chunk expansion)

## Setup

In [15]:
import json
import sqlite3
import webbrowser
from pathlib import Path
from dataclasses import dataclass, asdict

import httpx
import networkx as nx
import igraph as ig
import leidenalg

OLLAMA_BASE_URL = "http://localhost:11434"
MODEL = "qwen2.5:3b"
DB_PATH = Path("graphrag.db")
GRAPH_HTML_PATH = Path("knowledge_graph.html")

In [16]:
def chat_ollama(prompt: str, system: str = "", temperature: float = 0.0) -> str:
    """Send a chat request to Ollama and return the response."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    
    response = httpx.post(
        f"{OLLAMA_BASE_URL}/api/chat",
        json={
            "model": MODEL,
            "messages": messages,
            "stream": False,
            "options": {"temperature": temperature}
        },
        timeout=120.0
    )
    response.raise_for_status()
    return response.json()["message"]["content"]

## Step 1: Load Extraction Results

In [17]:
# Load multi-source results from notebook 01
with open("extraction_results.json", "r") as f:
    data = json.load(f)

# Global merged data
entities = data["merged"]["entities"]
relationships = data["merged"]["relationships"]
claims = data["merged"]["claims"]
entity_source_map = data["merged"]["entity_source_map"]
entity_chunk_map = data["merged"].get("entity_chunk_map", {})

# Semantic entity groups (non-destructive overlay from notebook 01)
semantic_entity_groups = data.get("semantic_entity_groups", [])
entity_to_semantic_group = data.get("entity_to_semantic_group", {})

# Per-source data
sources_data = data["sources"]

# Build flat chunk lookup: global_chunk_index -> {text, source_id}
# Chunk indices run sequentially across all sources: source order first, then chunk order within each source
all_chunks = []
chunk_lookup: dict[int, dict] = {}
for source in sources_data:
    for chunk_text in source["chunks"]:
        idx = len(all_chunks)
        entry = {"index": idx, "source_id": source["source_id"], "text": chunk_text}
        all_chunks.append(entry)
        chunk_lookup[idx] = entry

print(f"Loaded: {len(entities)} entities, {len(relationships)} relationships, {len(claims)} claims")
print(f"From {len(sources_data)} sources, {len(all_chunks)} total chunks")
print(f"Entity chunk provenance: {len(entity_chunk_map)} entities mapped to chunks")
print(f"Semantic entity groups: {len(semantic_entity_groups)} groups, {len(entity_to_semantic_group)} entities grouped")
print()
print("=== PER-SOURCE BREAKDOWN ===")
for source in sources_data:
    print(f"  [{source['source_type']}] {source['source_id']}: "
          f"{len(source['entities'])}E {len(source['relationships'])}R {len(source['claims'])}C "
          f"({source['content_length']} chars)")

Loaded: 466 entities, 198 relationships, 5 claims
From 1 sources, 193 total chunks
Entity chunk provenance: 466 entities mapped to chunks
Semantic entity groups: 45 groups, 320 entities grouped

=== PER-SOURCE BREAKDOWN ===
  [arxiv] arxiv:2404.16130: 466E 198R 5C (89608 chars)
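
The flat chunk index built above keeps counting across sources, so with several sources the second source's chunks start where the first one's end. Only one source is loaded in this run, so the effect is not visible in the output. A minimal sketch with two hypothetical sources (the `source_id` values and chunk texts are invented for illustration; only the `source_id` and `chunks` keys mirror the real data):

```python
# Hypothetical sources; only the "source_id" and "chunks" keys mirror the real data.
toy_sources = [
    {"source_id": "doc:a", "chunks": ["a0", "a1"]},
    {"source_id": "doc:b", "chunks": ["b0", "b1", "b2"]},
]

toy_lookup: dict[int, dict] = {}
for source in toy_sources:
    for chunk_text in source["chunks"]:
        idx = len(toy_lookup)  # global index keeps counting across sources
        toy_lookup[idx] = {"index": idx, "source_id": source["source_id"], "text": chunk_text}

# The first chunk of the second source gets global index 2, not 0.
print(toy_lookup[2])
```

Any per-entity chunk reference therefore has to use this global index, not a per-source position, to resolve through `chunk_lookup`.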


In [18]:
# Preview entities
print(f"=== ENTITIES ({len(entities)} total from {len(sources_data)} sources) ===")
for e in entities[:10]:
    sources = entity_source_map.get(e["name"], [])
    src_str = f" [{len(sources)} sources]" if len(sources) > 1 else ""
    print(f"  [{e['type']}] {e['name']}{src_str}")
if len(entities) > 10:
    print(f"  ... and {len(entities) - 10} more")

=== ENTITIES (466 total from 1 sources) ===
  [ORGANIZATION] MICROSOFT RESEARCH
  [ORGANIZATION] MICROSOFT STRATEGIC MISSIONS AND TECHNOLOGIES
  [ORGANIZATION] MICROSOFT OFFICE OF THE CTO
  [CONCEPT] RETRIEVAL-AUGMENTED GENERATION
  [ORGANIZATION] LARGE LANGUAGE MODELS
  [ORGANIZATION] LLMS
  [CONCEPT] RETRIEVAL-AUGMENTED GENERATION (RAG)
  [LOCATION] EXTERNAL KNOWLEDGE SOURCE
  [LOCATION] PRIVATE DOCUMENT COLLECTIONS
  [EVENT] GLOBAL QUESTIONS
  ... and 456 more
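
The preview contains name variants such as `LLMS` and `LARGE LANGUAGE MODELS`, and `RETRIEVAL-AUGMENTED GENERATION` with and without `(RAG)`. The semantic entity groups loaded earlier are a non-destructive overlay that can relate such variants without merging nodes. A hedged sketch of turning groups into a name-to-canonical lookup, assuming each group carries `canonical` and `members` keys (as the visualization step uses them); `build_canonical_lookup` is a hypothetical helper and the toy group is invented, not taken from the actual data:

```python
def build_canonical_lookup(groups: list[dict]) -> dict[str, str]:
    """Map every member name to its group's canonical name."""
    lookup: dict[str, str] = {}
    for group in groups:
        for member in group["members"]:
            # Keep the first assignment if a name appears in several groups.
            lookup.setdefault(member, group["canonical"])
    return lookup

# Invented example group; the real groups come from notebook 01.
toy_groups = [
    {"group_id": 0, "canonical": "LARGE LANGUAGE MODELS",
     "members": ["LARGE LANGUAGE MODELS", "LLMS"]},
]
canonical = build_canonical_lookup(toy_groups)

# Names outside any group fall back to themselves.
print(canonical.get("LLMS", "LLMS"))
print(canonical.get("NEO4J", "NEO4J"))
```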


## Step 2: Build NetworkX Graph

In [19]:
# Create directed graph
G = nx.DiGraph()

# Add entity nodes with attributes including source provenance
for entity in entities:
    source_refs = entity_source_map.get(entity["name"], [])
    G.add_node(
        entity["name"],
        type=entity["type"],
        description=entity["description"],
        source_refs=json.dumps(source_refs),
        num_sources=len(source_refs),
    )

# Add relationship edges with attributes
for rel in relationships:
    # Only add edge if both nodes exist
    if rel["source"] in G.nodes and rel["target"] in G.nodes:
        G.add_edge(
            rel["source"],
            rel["target"],
            description=rel["description"],
            weight=rel["strength"]
        )

print(f"Graph created: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")
multi_source_nodes = sum(1 for _, d in G.nodes(data=True) if d.get("num_sources", 1) > 1)
print(f"Multi-source nodes (in 2+ sources): {multi_source_nodes}")

Graph created: 466 nodes, 146 edges
Multi-source nodes (in 2+ sources): 0
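
Only 146 of the 198 loaded relationships ended up as edges. Two things in the cell above can cause that: a relationship is skipped when either endpoint is not an entity node, and `nx.DiGraph` keeps a single edge per ordered `(source, target)` pair, so a repeated pair overwrites the earlier edge's attributes. The sketch below counts both cases. `audit_relationships` is a hypothetical helper, not part of the notebook, and the toy data is invented.

```python
from collections import Counter

def audit_relationships(entity_names: set[str], rels: list[dict]) -> dict[str, int]:
    """Count relationships that the edge loop would skip or collapse."""
    missing_endpoint = 0
    pair_counts: Counter = Counter()
    for rel in rels:
        if rel["source"] not in entity_names or rel["target"] not in entity_names:
            missing_endpoint += 1
            continue
        pair_counts[(rel["source"], rel["target"])] += 1
    # Every repeat of an ordered pair beyond the first is collapsed into one edge.
    duplicates = sum(n - 1 for n in pair_counts.values())
    return {
        "missing_endpoint": missing_endpoint,
        "duplicate_pairs": duplicates,
        "unique_edges": len(pair_counts),
    }

toy_entities = {"A", "B", "C"}
toy_rels = [
    {"source": "A", "target": "B"},
    {"source": "A", "target": "B"},   # same ordered pair: collapses into one edge
    {"source": "B", "target": "A"},   # reverse direction: a separate edge
    {"source": "A", "target": "Z"},   # "Z" is not an entity: skipped
]
report = audit_relationships(toy_entities, toy_rels)
print(report)
```

In the notebook this could be called as `audit_relationships(set(G.nodes), relationships)` to see how the 52 missing relationships split between the two causes.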


In [20]:
# Display graph structure
print("\n=== GRAPH NODES ===")
for node, attrs in list(G.nodes(data=True))[:10]:
    print(f"  {node} ({attrs.get('type', 'N/A')})")

print("\n=== GRAPH EDGES ===")
for source, target, attrs in list(G.edges(data=True))[:10]:
    print(f"  {source} --> {target}")
    print(f"    {attrs.get('description', 'N/A')[:60]}...")


=== GRAPH NODES ===
  MICROSOFT RESEARCH (ORGANIZATION)
  MICROSOFT STRATEGIC MISSIONS AND TECHNOLOGIES (ORGANIZATION)
  MICROSOFT OFFICE OF THE CTO (ORGANIZATION)
  RETRIEVAL-AUGMENTED GENERATION (CONCEPT)
  LARGE LANGUAGE MODELS (ORGANIZATION)
  LLMS (ORGANIZATION)
  RETRIEVAL-AUGMENTED GENERATION (RAG) (CONCEPT)
  EXTERNAL KNOWLEDGE SOURCE (LOCATION)
  PRIVATE DOCUMENT COLLECTIONS (LOCATION)
  GLOBAL QUESTIONS (EVENT)

=== GRAPH EDGES ===
  MICROSOFT RESEARCH --> MICROSOFT STRATEGIC MISSIONS AND TECHNOLOGIES
    Microsoft Research is a parent entity of Microsoft Strategic...
  MICROSOFT STRATEGIC MISSIONS AND TECHNOLOGIES --> MICROSOFT OFFICE OF THE CTO
    Microsoft Strategic Missions and Technologies is a sub-entit...
  RETRIEVAL-AUGMENTED GENERATION --> LARGE LANGUAGE MODELS
    Retrieval-augmented generation (RAG) is used by large langua...
  RETRIEVAL-AUGMENTED GENERATION --> EXTERNAL KNOWLEDGE SOURCE
    Retrieval-augmented generation (RAG) retrieves relevant info...
  RETRIE

## Step 3: Compute Graph Metrics

Calculate centrality scores for ranking node importance.

In [21]:
# Convert to undirected for some algorithms
G_undirected = G.to_undirected()

# PageRank - importance based on incoming connections
pagerank = nx.pagerank(G, weight="weight")

# Degree centrality - number of connections
degree_centrality = nx.degree_centrality(G)

# Betweenness centrality - bridges between clusters
betweenness = nx.betweenness_centrality(G_undirected)

# Store metrics on nodes
for node in G.nodes:
    G.nodes[node]["pagerank"] = pagerank.get(node, 0)
    G.nodes[node]["degree_centrality"] = degree_centrality.get(node, 0)
    G.nodes[node]["betweenness"] = betweenness.get(node, 0)

print("Graph metrics computed: pagerank, degree_centrality, betweenness")

Graph metrics computed: pagerank, degree_centrality, betweenness
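
To make "importance based on incoming connections" concrete, here is a simplified, unweighted power-iteration sketch of PageRank. It is for illustration only: `nx.pagerank` above also takes edge weights into account and stops on a convergence tolerance, which this version omits. The damping factor of 0.85 matches the NetworkX default; `simple_pagerank` and the toy graph are assumptions made for the example.

```python
def simple_pagerank(edges: list[tuple[str, str]], damping: float = 0.85,
                    iterations: int = 100) -> dict[str, float]:
    """Unweighted PageRank by power iteration (illustrative only)."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links: dict[str, list[str]] = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        # Each node passes its rank in equal shares along its outgoing edges.
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank

# Toy graph: three nodes point at "HUB", which has no outgoing edges.
toy_edges = [("A", "HUB"), ("B", "HUB"), ("C", "HUB"), ("A", "B")]
scores = simple_pagerank(toy_edges)
for node, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{score:.4f} | {node}")
```

The scores sum to 1, and the node with the most incoming links ranks first, which is the behavior the ranking in the next cell relies on.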


In [22]:
# Top entities by PageRank
print("\n=== TOP ENTITIES BY PAGERANK ===")
sorted_by_pagerank = sorted(pagerank.items(), key=lambda x: -x[1])
for node, score in sorted_by_pagerank[:10]:
    node_type = G.nodes[node].get("type", "N/A")
    print(f"  {score:.4f} | [{node_type}] {node}")


=== TOP ENTITIES BY PAGERANK ===
  0.0173 | [PRODUCT] NEBULAGRAPH
  0.0161 | [] 
  0.0125 | [PRODUCT] SS
  0.0109 | [ORGANIZATION] LANGCHAIN
  0.0099 | [ORGANIZATION] LLAMAINDEX
  0.0097 | [CONCEPT] GLOBAL ANSWER
  0.0086 | [PRODUCT] TS
  0.0085 | [PRODUCT] GRAPHRAG
  0.0069 | [PRODUCT] NEO4J
  0.0054 | [ORGANIZATION] CANADIAN CONFERENCE ON ARTIFICIAL INTELLIGENCE
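
The second row above has an empty name and type, which suggests that at least one blank entity came through extraction and picked up edges. The visualization step later filters on `n.strip()`, but the blank node still takes part in PageRank and community detection. A hedged sketch of one way to spot such records before building the graph; `find_blank_entities` is a hypothetical helper and the toy records are invented:

```python
def find_blank_entities(entity_records: list[dict]) -> list[int]:
    """Return positions of records whose name is missing or only whitespace."""
    return [
        i for i, record in enumerate(entity_records)
        if not str(record.get("name") or "").strip()
    ]

toy_records = [
    {"name": "NEBULAGRAPH", "type": "PRODUCT"},
    {"name": "", "type": ""},
    {"name": "   ", "type": "PERSON"},
]
blank_positions = find_blank_entities(toy_records)
print(blank_positions)
```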


In [23]:
import plotly.graph_objects as go

# Use spring layout for positions (only connected nodes for clarity)
connected = [n for n in G.nodes if G.degree(n) > 0]
H = G.subgraph(connected)
pos = nx.spring_layout(H, k=0.5, seed=42)

# Edge traces
edge_x, edge_y = [], []
mid_x, mid_y, mid_text = [], [], []
for u, v, attrs in H.edges(data=True):
    x0, y0 = pos[u]
    x1, y1 = pos[v]
    edge_x += [x0, x1, None]
    edge_y += [y0, y1, None]
    # Midpoint for hover label
    mid_x.append((x0 + x1) / 2)
    mid_y.append((y0 + y1) / 2)
    desc = attrs.get("description", "")
    mid_text.append(f"{u} → {v}<br>{desc}")

edge_trace = go.Scatter(x=edge_x, y=edge_y, mode='lines',
                        line=dict(width=0.5, color='#888'), hoverinfo='none')

# Invisible midpoint markers for edge hover
edge_mid_trace = go.Scatter(
    x=mid_x, y=mid_y, mode='markers', hovertext=mid_text, hoverinfo='text',
    marker=dict(size=10, color='rgba(0,0,0,0)'),  # invisible
    showlegend=False)

# Node traces
node_x = [pos[n][0] for n in H.nodes]
node_y = [pos[n][1] for n in H.nodes]
node_pr = [pagerank.get(n, 0) for n in H.nodes]
node_text = [f"{n}<br>PR={pagerank.get(n,0):.4f}<br>deg={G.degree(n)}" for n in H.nodes]

node_trace = go.Scatter(
    x=node_x, y=node_y, mode='markers', hovertext=node_text, hoverinfo='text',
    marker=dict(size=[max(6, pr * 3000) for pr in node_pr],
                color=node_pr, colorscale='YlOrRd', showscale=True,
                colorbar=dict(title='PageRank'), line=dict(width=0.5, color='#333')))

fig = go.Figure(data=[edge_trace, edge_mid_trace, node_trace],
                layout=go.Layout(
                    title=f'Knowledge Graph ({H.number_of_nodes()} connected nodes, {H.number_of_edges()} edges)',
                    showlegend=False, hovermode='closest',
                    xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                    yaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                    template='plotly_dark', width=900, height=700))

# Write to HTML and open in browser (same pattern as Cytoscape viz)
fig.write_html("graph_plotly.html")
# as_uri() builds a valid, percent-encoded file:// URL on every platform
webbrowser.open(Path("graph_plotly.html").resolve().as_uri())
print(f"Plotly graph opened in browser ({H.number_of_nodes()} nodes, {H.number_of_edges()} edges)")

Plotly graph opened in browser (159 nodes, 146 edges)


## Step 4: Community Detection

Use the Leiden algorithm to find clusters of related entities. Leiden guarantees well-connected communities, fixing a known flaw in Louvain, where communities can become internally disconnected.

In [24]:
# Leiden community detection (works on undirected graphs)
# Convert NetworkX -> igraph (Leiden requires igraph)
G_ig = ig.Graph.from_networkx(G_undirected)

# Map edge weights (from_networkx stores them as edge attributes)
weights = G_ig.es["weight"] if "weight" in G_ig.es.attributes() else None

# Run Leiden with modularity optimization
leiden_partition = leidenalg.find_partition(
    G_ig,
    leidenalg.ModularityVertexPartition,
    weights=weights,
)

# Build partition dict: node_name -> community_id
# from_networkx stores each original NetworkX node name in the _nx_name vertex attribute
partition = {}
for comm_id, members in enumerate(leiden_partition):
    for vertex_idx in members:
        node_name = G_ig.vs[vertex_idx]["_nx_name"]
        partition[node_name] = comm_id

# Store community assignment on nodes
for node, community_id in partition.items():
    G.nodes[node]["community"] = community_id

# Count communities
num_communities = max(partition.values()) + 1 if partition else 0
print(f"Detected {num_communities} communities (Leiden)")

# Modularity score (quality of partition)
modularity = leiden_partition.modularity
print(f"Modularity score: {modularity:.4f}")

Detected 347 communities (Leiden)
Modularity score: 0.9172
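
Isolated nodes have no edges to share, so modularity-based clustering typically leaves each one in its own community. With 466 nodes but only 159 connected ones in this run, most of the 347 communities are therefore likely singletons. The sketch below summarizes a partition by community size and lists the communities that reach a minimum size. `summarize_partition` and `min_size` are hypothetical names, and the toy partition is invented.

```python
from collections import Counter

def summarize_partition(partition: dict[str, int], min_size: int = 2):
    """Return (size histogram, ids of communities with at least min_size members)."""
    sizes = Counter(partition.values())      # community_id -> member count
    histogram = Counter(sizes.values())      # member count -> number of communities
    kept = sorted(cid for cid, size in sizes.items() if size >= min_size)
    return dict(histogram), kept

# Toy partition: one community of 3, one of 2, and two singletons.
toy_partition = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 2, "G": 3}
histogram, kept = summarize_partition(toy_partition)
print(histogram)
print(kept)
```

Applied to the notebook's `partition`, the `kept` list could be used to skip singleton communities before the LLM summarization in Step 5, at the cost of having no summary rows for those nodes.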


In [25]:
# Group entities by community
communities: dict[int, list[str]] = {}
for node, community_id in partition.items():
    if community_id not in communities:
        communities[community_id] = []
    communities[community_id].append(node)

print("\n=== COMMUNITIES ===")
for comm_id, members in sorted(communities.items()):
    # Sort members by PageRank within community
    sorted_members = sorted(members, key=lambda x: -pagerank.get(x, 0))
    print(f"\nCommunity {comm_id} ({len(members)} members):")
    for member in sorted_members[:5]:
        node_type = G.nodes[member].get("type", "N/A")
        print(f"  [{node_type}] {member}")
    if len(members) > 5:
        print(f"  ... and {len(members) - 5} more")


=== COMMUNITIES ===

Community 0 (13 members):
  [CONCEPT] GLOBAL ANSWER
  [EVENT] QUERY-FOCUSED SUMMARIZATION
  [CONCEPT] ENTITY EXTRACTOR
  [PRODUCT] COMMUNITY ANSWERS
  [CONCEPT] KNOWLEDGE GRAPH
  ... and 8 more

Community 1 (13 members):
  [PRODUCT] NEBULAGRAPH
  [ORGANIZATION] LANGCHAIN
  [ORGANIZATION] LLAMAINDEX
  [PRODUCT] GRAPHRAG
  [PRODUCT] NEO4J
  ... and 8 more

Community 2 (11 members):
  [] 
  [PERSON] AMBER HOAK
  [PERSON] ANDR´ES MORALES ESQUIVEL
  [PERSON] BEN CUTLER
  [PERSON] BILLIE RINALDI
  ... and 6 more

Community 3 (10 members):
  [ORGANIZATION] NAACL-HLT
  [ORGANIZATION] OPENAI
  [LOCATION] KAU
  [PERSON] MELNYK ET AL.
  [PERSON] TRAJANOSSKA ET AL.
  ... and 5 more

Community 4 (10 members):
  [PRODUCT] SS
  [PRODUCT] TS
  [PRODUCT] VECTOR RAG (SS)
  [CONCEPT] COMPREHENSIVENESS
  [EVENT] C3
  ... and 5 more

Community 5 (8 members):
  [LOCATION] APPENDIX E
  [CONCEPT] CHUNG ET AL.
  [EVENT] EXPERIMENT 1
  [LOCATION] APPENDIX F
  [LOCATION] APPENDIX G
  ... an

## Step 5: Generate Community Summaries

Create LLM-powered summaries for each community following GraphRAG's report format.

In [26]:
@dataclass
class CommunitySummary:
    community_id: int
    title: str
    summary: str
    key_entities: list[str]
    key_insights: list[str]

COMMUNITY_SUMMARY_PROMPT = """
You are an expert analyst creating a summary report for a knowledge graph community.

Given the following entities and their relationships, create a structured summary.

ENTITIES IN THIS COMMUNITY:
{entities_info}

RELATIONSHIPS:
{relationships_info}

RELEVANT CLAIMS:
{claims_info}

Create a JSON response with:
1. title: A short descriptive title for this community (5-10 words)
2. summary: A 2-3 sentence executive summary of what this community represents
3. key_insights: 3-5 bullet points of key facts or relationships

Return ONLY valid JSON:
{{
  "title": "...",
  "summary": "...",
  "key_insights": ["...", "...", "..."]
}}

JSON OUTPUT:
"""

def generate_community_summary(community_id: int, members: list[str], G: nx.DiGraph, claims: list[dict]) -> CommunitySummary:
    """Generate a summary for a community using the LLM."""
    
    # Gather entity info
    entities_info = []
    for member in members:
        node_data = G.nodes[member]
        entities_info.append(f"- {member} ({node_data.get('type', 'N/A')}): {node_data.get('description', 'N/A')}")
    
    # Gather relationships within community
    relationships_info = []
    for source, target, data in G.edges(data=True):
        if source in members and target in members:
            relationships_info.append(f"- {source} -> {target}: {data.get('description', 'N/A')}")
    
    # Gather relevant claims
    claims_info = []
    for claim in claims:
        if claim["subject"] in members:
            claims_info.append(f"- [{claim['claim_type']}] {claim['subject']}: {claim['description']}")
    
    prompt = COMMUNITY_SUMMARY_PROMPT.format(
        entities_info="\n".join(entities_info) or "No entities",
        relationships_info="\n".join(relationships_info) or "No relationships",
        claims_info="\n".join(claims_info[:10]) or "No claims"  # Limit claims
    )
    
    response = chat_ollama(prompt)
    
    # Parse JSON
    json_str = response.strip()
    if json_str.startswith("```"):
        json_str = json_str.split("```")[1]
        if json_str.startswith("json"):
            json_str = json_str[4:]
    json_str = json_str.strip()
    
    try:
        data = json.loads(json_str)
        return CommunitySummary(
            community_id=community_id,
            title=data.get("title", f"Community {community_id}"),
            summary=data.get("summary", ""),
            key_entities=members[:5],  # First 5 members; top 5 by PageRank only if the caller pre-sorted them
            key_insights=data.get("key_insights", [])
        )
    except json.JSONDecodeError as ex:
        print(f"Failed to parse JSON for community {community_id}: {ex}")
        print(f"Raw response: {response}")
        return CommunitySummary(
            community_id=community_id,
            title=f"Community {community_id}",
            summary="Summary generation failed",
            key_entities=members[:5],
            key_insights=[]
        )
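
The fence-stripping above handles a response that begins with a code fence, but small local models sometimes wrap the JSON in extra prose, which then ends up in the failure branch. A more tolerant sketch scans for the first `{` and lets `json.JSONDecoder.raw_decode` parse from there, ignoring whatever follows the object. `extract_json_object` is a hypothetical helper, not part of the notebook, and the sample reply is invented.

```python
import json
from typing import Optional

def extract_json_object(text: str) -> Optional[dict]:
    """Return the first JSON object found in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores trailing text.
            obj, _ = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass
        start = text.find("{", start + 1)
    return None

reply = 'Here is the report:\n```json\n{"title": "Toy", "key_insights": ["a", "b"]}\n```\nDone.'
parsed = extract_json_object(reply)
print(parsed)
```

A `None` result can be routed to the same fallback `CommunitySummary` that the `except` branch already builds.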

In [27]:
# Generate summaries for each community
community_summaries: list[CommunitySummary] = []

for comm_id, members in sorted(communities.items()):
    print(f"Generating summary for Community {comm_id} ({len(members)} members)...")
    # Sort members by PageRank
    sorted_members = sorted(members, key=lambda x: -pagerank.get(x, 0))
    summary = generate_community_summary(comm_id, sorted_members, G, claims)
    community_summaries.append(summary)
    print(f"  Title: {summary.title}")

print(f"\nGenerated {len(community_summaries)} community summaries")

Generating summary for Community 0 (13 members)...
  Title: Community Knowledge Hub
Generating summary for Community 1 (13 members)...
  Title: GraphRAG and Related Technologies Community
Generating summary for Community 2 (11 members)...
  Title: Contributor Community Network
Generating summary for Community 3 (10 members)...
  Title: Knowledge Graph Extraction Community
Generating summary for Community 4 (10 members)...
  Title: Text Summarization and Stock Market Analysis Community
Generating summary for Community 5 (8 members)...
  Title: Graph Analysis Community Report
Generating summary for Community 6 (6 members)...
  Title: Knowledge Graph Community Analysis
Generating summary for Community 7 (6 members)...
  Title: Knowledge Graph Community Overview
Generating summary for Community 8 (4 members)...
  Title: Example Corp Community Overview
Generating summary for Community 9 (4 members)...
  Title: Knowledge Graph Collaboration Network
Generating summary for Community 10 (4 memb

In [28]:
# Display community summaries
print("\n" + "="*60)
print("COMMUNITY SUMMARIES")
print("="*60)

for summary in community_summaries:
    print(f"\n### Community {summary.community_id}: {summary.title}")
    print(f"\n{summary.summary}")
    print(f"\nKey Entities: {', '.join(summary.key_entities)}")
    print("\nKey Insights:")
    for insight in summary.key_insights:
        print(f"  - {insight}")


COMMUNITY SUMMARIES

### Community 0: Community Knowledge Hub

The Community Knowledge Hub is a structured ecosystem where user queries are processed through various stages of summarization and entity extraction to generate comprehensive community summaries, domain-tailored summaries, and ultimately, global answers for users.

Key Entities: GLOBAL ANSWER, QUERY-FOCUSED SUMMARIZATION, ENTITY EXTRACTOR, COMMUNITY ANSWERS, KNOWLEDGE GRAPH

Key Insights:
  - User queries drive the entire process from start to finish in this knowledge hub.
  - Entity extraction is a critical step that informs multiple downstream processes including community detection and summarization.
  - The Knowledge Graph integrates all processed information to provide a comprehensive understanding of the domain.
  - Community answers are generated by combining summaries, ensuring each user receives relevant and tailored responses.
  - Query-focused summarization ensures that only pertinent information reaches the fin

## Step 6: Store in SQLite

Persist the graph structure, metrics, sources, and summaries.

In [42]:
# Create SQLite database (delete and recreate for clean schema on re-runs)
# We delete the file instead of DROP TABLE because notebook 03 creates
# sqlite-vec virtual tables (vec0 module) that can't be dropped without
# loading the extension.
if DB_PATH.exists():
    DB_PATH.unlink()
    print(f"Deleted existing {DB_PATH}")

conn = sqlite3.connect(DB_PATH)
cursor = conn.cursor()

# Create tables with current schema
cursor.executescript("""
-- Sources table: tracks ingested documents
CREATE TABLE sources (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source_id TEXT UNIQUE NOT NULL,
    source_type TEXT,
    title TEXT,
    url TEXT,
    content_type TEXT,
    content_length INTEGER,
    fetched_at TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Entities table
CREATE TABLE entities (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT UNIQUE NOT NULL,
    type TEXT,
    description TEXT,
    pagerank REAL DEFAULT 0,
    degree_centrality REAL DEFAULT 0,
    betweenness REAL DEFAULT 0,
    community_id INTEGER,
    source_refs TEXT,           -- JSON array of source_ids
    num_sources INTEGER DEFAULT 1,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Relationships table
CREATE TABLE relationships (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source_id INTEGER REFERENCES entities(id),
    target_id INTEGER REFERENCES entities(id),
    description TEXT,
    weight REAL DEFAULT 1.0,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Claims table
CREATE TABLE claims (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    subject_id INTEGER REFERENCES entities(id),
    claim_type TEXT,
    description TEXT,
    claim_date TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Community summaries table
CREATE TABLE community_summaries (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    community_id INTEGER UNIQUE NOT NULL,
    title TEXT,
    summary TEXT,
    key_entities TEXT,  -- JSON array
    key_insights TEXT,  -- JSON array
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Chunks table (source text)
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    content TEXT,
    chunk_index INTEGER,
    source_ref TEXT,            -- references sources.source_id
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Create indexes
CREATE INDEX idx_entities_name ON entities(name);
CREATE INDEX idx_entities_community ON entities(community_id);
CREATE INDEX idx_relationships_source ON relationships(source_id);
CREATE INDEX idx_relationships_target ON relationships(target_id);
CREATE INDEX idx_chunks_source ON chunks(source_ref);
CREATE INDEX idx_sources_source_id ON sources(source_id);
""")

conn.commit()
print("Database tables created (fresh schema with sources table and source provenance columns)")

Deleted existing graphrag.db
Database tables created (fresh schema with sources table and source provenance columns)


In [43]:
# Insert entities (with source provenance)
entity_id_map: dict[str, int] = {}

for node, attrs in G.nodes(data=True):
    cursor.execute("""
        INSERT INTO entities (name, type, description, pagerank, degree_centrality, betweenness, community_id, source_refs, num_sources)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
    """, (
        node,
        attrs.get("type"),
        attrs.get("description"),
        attrs.get("pagerank", 0),
        attrs.get("degree_centrality", 0),
        attrs.get("betweenness", 0),
        attrs.get("community"),
        attrs.get("source_refs", "[]"),
        attrs.get("num_sources", 1),
    ))
    entity_id_map[node] = cursor.lastrowid

conn.commit()
print(f"Inserted {len(entity_id_map)} entities")

Inserted 466 entities


In [44]:
# Insert relationships
rel_count = 0
for source, target, attrs in G.edges(data=True):
    source_id = entity_id_map.get(source)
    target_id = entity_id_map.get(target)
    if source_id is not None and target_id is not None:
        cursor.execute("""
            INSERT INTO relationships (source_id, target_id, description, weight)
            VALUES (?, ?, ?, ?)
        """, (
            source_id,
            target_id,
            attrs.get("description"),
            attrs.get("weight", 1.0)
        ))
        rel_count += 1

conn.commit()
print(f"Inserted {rel_count} relationships")

Inserted 146 relationships


In [45]:
# Insert claims
claim_count = 0
for claim in claims:
    subject_id = entity_id_map.get(claim["subject"])
    if subject_id is not None:
        cursor.execute("""
            INSERT INTO claims (subject_id, claim_type, description, claim_date)
            VALUES (?, ?, ?, ?)
        """, (
            subject_id,
            claim.get("claim_type"),
            claim.get("description"),
            claim.get("date")
        ))
        claim_count += 1

conn.commit()
print(f"Inserted {claim_count} claims")

Inserted 5 claims


In [46]:
# Insert community summaries
for summary in community_summaries:
    cursor.execute("""
        INSERT INTO community_summaries (community_id, title, summary, key_entities, key_insights)
        VALUES (?, ?, ?, ?, ?)
    """, (
        summary.community_id,
        summary.title,
        summary.summary,
        json.dumps(summary.key_entities),
        json.dumps(summary.key_insights)
    ))

conn.commit()
print(f"Inserted {len(community_summaries)} community summaries")

Inserted 347 community summaries


In [47]:
# Insert chunks with source provenance
chunk_count = 0
for source in sources_data:
    source_id = source["source_id"]
    for chunk in source["chunks"]:  # chunk_count doubles as the global chunk index
        cursor.execute("""
            INSERT INTO chunks (content, chunk_index, source_ref)
            VALUES (?, ?, ?)
        """, (chunk, chunk_count, source_id))
        chunk_count += 1

conn.commit()
print(f"Inserted {chunk_count} chunks (with source_ref)")

Inserted 193 chunks (with source_ref)


In [48]:
# Insert source records
for source in sources_data:
    cursor.execute("""
        INSERT INTO sources (source_id, source_type, title, url, content_type, content_length, fetched_at)
        VALUES (?, ?, ?, ?, ?, ?, ?)
    """, (
        source["source_id"],
        source["source_type"],
        source["title"],
        source["url"],
        source["content_type"],
        source["content_length"],
        source.get("fetched_at", ""),
    ))

conn.commit()
print(f"Inserted {len(sources_data)} source records")

Inserted 1 source records


In [49]:
# Verify data
print("\n=== DATABASE SUMMARY ===")
for table in ["sources", "entities", "relationships", "claims", "community_summaries", "chunks"]:
    cursor.execute(f"SELECT COUNT(*) FROM {table}")
    count = cursor.fetchone()[0]
    print(f"  {table}: {count} rows")

# Show multi-source entities
cursor.execute("SELECT name, source_refs, num_sources FROM entities WHERE num_sources > 1 ORDER BY num_sources DESC")
multi = cursor.fetchall()
if multi:
    print(f"\n=== MULTI-SOURCE ENTITIES ({len(multi)}) ===")
    for name, refs, n in multi:
        print(f"  {name}: {n} sources — {refs}")


=== DATABASE SUMMARY ===
  sources: 1 rows
  entities: 466 rows
  relationships: 146 rows
  claims: 5 rows
  community_summaries: 347 rows
  chunks: 193 rows


In [50]:
# Sample query: Top entities by PageRank
print("\n=== TOP ENTITIES (from DB) ===")
cursor.execute("""
    SELECT name, type, pagerank, community_id 
    FROM entities 
    ORDER BY pagerank DESC 
    LIMIT 10
""")
for row in cursor.fetchall():
    print(f"  {row[2]:.4f} | [{row[1]}] {row[0]} (Community {row[3]})")


=== TOP ENTITIES (from DB) ===
  0.0173 | [PRODUCT] NEBULAGRAPH (Community 1)
  0.0161 | []  (Community 2)
  0.0125 | [PRODUCT] SS (Community 4)
  0.0109 | [ORGANIZATION] LANGCHAIN (Community 1)
  0.0099 | [ORGANIZATION] LLAMAINDEX (Community 1)
  0.0097 | [CONCEPT] GLOBAL ANSWER (Community 0)
  0.0086 | [PRODUCT] TS (Community 4)
  0.0085 | [PRODUCT] GRAPHRAG (Community 1)
  0.0069 | [PRODUCT] NEO4J (Community 1)
  0.0054 | [ORGANIZATION] CANADIAN CONFERENCE ON ARTIFICIAL INTELLIGENCE (Community 9)


In [51]:
# Sample query: Get community with its entities
print("\n=== COMMUNITY 0 DETAILS (from DB) ===")
cursor.execute("""
    SELECT title, summary FROM community_summaries WHERE community_id = 0
""")
row = cursor.fetchone()
if row:
    print(f"Title: {row[0]}")
    print(f"Summary: {row[1]}")
    
    cursor.execute("""
        SELECT name, type FROM entities WHERE community_id = 0 ORDER BY pagerank DESC LIMIT 5
    """)
    print("\nTop Members:")
    for row in cursor.fetchall():
        print(f"  [{row[1]}] {row[0]}")


=== COMMUNITY 0 DETAILS (from DB) ===
Title: Community Knowledge Hub
Summary: The Community Knowledge Hub is a structured ecosystem where user queries are processed through various stages of summarization and entity extraction to generate comprehensive community summaries, domain-tailored summaries, and ultimately, global answers for users.

Top Members:
  [CONCEPT] GLOBAL ANSWER
  [EVENT] QUERY-FOCUSED SUMMARIZATION
  [CONCEPT] ENTITY EXTRACTOR
  [PRODUCT] COMMUNITY ANSWERS
  [CONCEPT] KNOWLEDGE GRAPH
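
Because `relationships` stores integer entity ids, not names, reading edges back requires joining `entities` twice. A self-contained sketch against an in-memory database with a trimmed-down version of the schema (only the columns the query needs; the rows and the `conn_demo` name are invented for the example):

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript("""
CREATE TABLE entities (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT UNIQUE NOT NULL);
CREATE TABLE relationships (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source_id INTEGER REFERENCES entities(id),
    target_id INTEGER REFERENCES entities(id),
    description TEXT,
    weight REAL DEFAULT 1.0
);
""")
conn_demo.executemany("INSERT INTO entities (name) VALUES (?)", [("A",), ("B",), ("C",)])
conn_demo.executemany(
    "INSERT INTO relationships (source_id, target_id, description, weight) VALUES (?, ?, ?, ?)",
    [(1, 2, "A uses B", 0.9), (2, 3, "B extends C", 0.4)],
)

# Join entities once for the source endpoint and once for the target endpoint.
rows = conn_demo.execute("""
    SELECT s.name, t.name, r.description, r.weight
    FROM relationships AS r
    JOIN entities AS s ON s.id = r.source_id
    JOIN entities AS t ON t.id = r.target_id
    ORDER BY r.weight DESC
""").fetchall()
for source_name, target_name, description, weight in rows:
    print(f"{source_name} --> {target_name} ({weight}): {description}")
conn_demo.close()
```

The same query shape should work against `graphrag.db`, since the table and column names match the schema created above.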


In [52]:
# Close connection
conn.close()
print(f"\nDatabase saved to: {DB_PATH.absolute()}")


Database saved to: /Users/pacho-home-server/daily-knowledge-ingestion-assistant/notebooks/graphrag.db


## Step 7: Interactive Graph Visualization

Generate a standalone HTML file with Cytoscape.js and open it in the browser.
Nodes are colored by community, sized by PageRank.
Click a node to see its details, community summary, and expand source chunks.
Click a community in the legend to see its LLM-generated summary.
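
Cytoscape.js takes its graph as a list of elements, each with a `data` object; edge elements carry `source` and `target` node ids. The sketch below shows one way to turn nodes with a community and a PageRank score into such elements, colored by community and sized by min-max scaled PageRank. It is a simplified illustration, not the export code used in the following cell: `to_cytoscape_elements`, the `color` and `size` field names, the shortened palette, and the toy data are all assumptions.

```python
import json

PALETTE = ["#e6194b", "#3cb44b", "#4363d8"]  # shortened palette for the example

def to_cytoscape_elements(nodes: dict[str, dict], edges: list[tuple[str, str]],
                          min_size: int = 25, max_size: int = 90) -> list[dict]:
    """Build a Cytoscape.js elements list from node attributes and edge pairs."""
    ranks = [attrs["pagerank"] for attrs in nodes.values()]
    lo, hi = min(ranks), max(ranks)
    elements = []
    for name, attrs in nodes.items():
        # Min-max scale PageRank into a pixel size; use the midpoint if all scores are equal.
        scaled = (attrs["pagerank"] - lo) / (hi - lo) if hi > lo else 0.5
        elements.append({"data": {
            "id": name,
            "label": name,
            "community": attrs["community"],
            # Cycle through the palette when there are more communities than colors.
            "color": PALETTE[attrs["community"] % len(PALETTE)],
            "size": int(min_size + scaled * (max_size - min_size)),
        }})
    for source, target in edges:
        elements.append({"data": {"id": f"{source}->{target}", "source": source, "target": target}})
    return elements

toy_nodes = {
    "A": {"community": 0, "pagerank": 0.5},
    "B": {"community": 0, "pagerank": 0.3},
    "C": {"community": 4, "pagerank": 0.2},
}
elements = to_cytoscape_elements(toy_nodes, [("A", "B"), ("B", "C")])
print(json.dumps(elements[0], indent=2))
```

Computing the minimum and maximum once per export, as here, avoids rescanning every node for each size lookup.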

In [61]:
# Prepare graph data for Cytoscape.js export — multi-level drill-down

COMMUNITY_COLORS = [
    "#e6194b", "#3cb44b", "#4363d8", "#f58231", "#911eb4",
    "#42d4f4", "#f032e6", "#bfef45", "#fabed4", "#469990",
    "#dcbeff", "#9A6324", "#fffac8", "#800000", "#aaffc3",
]

MAX_COMPOUND_SIZE = 15
SEMANTIC_GROUP_COLOR = "#bfef45"

def scale_pagerank_to_size(pr: float, min_size: int = 25, max_size: int = 90) -> int:
    """Map PageRank value to node pixel size."""
    all_pr = [G.nodes[n].get("pagerank", 0) for n in G.nodes]
    pr_min, pr_max = min(all_pr), max(all_pr)
    if pr_max == pr_min:
        return (min_size + max_size) // 2
    normalized = (pr - pr_min) / (pr_max - pr_min)
    return int(min_size + normalized * (max_size - min_size))

# --- Filter visualization to connected nodes only ---
node_degree: dict[str, int] = {}
for source, target, _ in G.edges(data=True):
    node_degree[source] = node_degree.get(source, 0) + 1
    node_degree[target] = node_degree.get(target, 0) + 1

viz_nodes = {n for n in G.nodes if node_degree.get(n, 0) > 0 and n.strip()}
isolated_count = G.number_of_nodes() - len(viz_nodes)
print(f"Visualization filter: {len(viz_nodes)} connected nodes "
      f"(filtered out {isolated_count} isolated nodes, kept in DB)")

entity_node_ids = set(viz_nodes)

# Build community summary data for viz
MIN_COMMUNITY_SIZE_FOR_VIZ = 2

viz_community_counts: dict[int, int] = {}
for node in viz_nodes:
    comm_id = G.nodes[node].get("community", -1)
    viz_community_counts[comm_id] = viz_community_counts.get(comm_id, 0) + 1

cyto_community_summaries = {}
other_community_count = 0
other_node_count = 0

for summary in community_summaries:
    member_count = viz_community_counts.get(summary.community_id, 0)
    if member_count >= MIN_COMMUNITY_SIZE_FOR_VIZ:
        cyto_community_summaries[summary.community_id] = {
            "title": summary.title,
            "summary": summary.summary,
            "key_entities": summary.key_entities,
            "key_insights": summary.key_insights,
        }
    elif member_count > 0:
        other_community_count += 1
        other_node_count += member_count

# Build chunk data (deduplicated)
chunk_texts: list[str] = []
chunk_text_to_idx: dict[str, int] = {}
cyto_chunk_refs: dict[str, list] = {}

for entity_name in entity_node_ids:
    refs = entity_chunk_map.get(entity_name, [])
    if not refs:
        continue
    refs_for_entity = []
    for ref in refs:
        chunk_idx = ref["chunk_index"]
        chunk_data = chunk_lookup.get(chunk_idx)
        if not chunk_data:
            continue
        text = chunk_data["text"]
        if text not in chunk_text_to_idx:
            chunk_text_to_idx[text] = len(chunk_texts)
            chunk_texts.append(text)
        refs_for_entity.append({
            "index": chunk_idx,
            "source_id": ref["source_id"],
            "text_idx": chunk_text_to_idx[text],
        })
    if refs_for_entity:
        cyto_chunk_refs[entity_name] = refs_for_entity

# Semantic groups lookup
cyto_semantic_groups = {}
for group in semantic_entity_groups:
    gid = group["group_id"]
    members = group["members"]
    if len(members) > MAX_COMPOUND_SIZE:
        continue
    valid_members = [m for m in members if m in entity_node_ids]
    if len(valid_members) < 2:
        continue
    cyto_semantic_groups[gid] = {
        "canonical": group["canonical"],
        "members": valid_members,
        "member_similarities": group.get("member_similarities", {}),
    }

# ============================================================
# LEVEL 0: Community meta-nodes + inter-community edges
# ============================================================

def scale_community_size(member_count, min_size=40, max_size=120):
    """Map community member count to meta-node pixel size."""
    counts = [viz_community_counts[c] for c in viz_community_counts
              if viz_community_counts[c] >= MIN_COMMUNITY_SIZE_FOR_VIZ]
    if not counts:
        return (min_size + max_size) // 2
    c_min, c_max = min(counts), max(counts)
    if c_max == c_min:
        return (min_size + max_size) // 2
    normalized = (member_count - c_min) / (c_max - c_min)
    return int(min_size + normalized * (max_size - min_size))

community_meta_elements = []

for comm_id, summary_data in cyto_community_summaries.items():
    member_count = viz_community_counts[comm_id]
    color = COMMUNITY_COLORS[comm_id % len(COMMUNITY_COLORS)]
    top_members = sorted(
        [n for n in viz_nodes if G.nodes[n].get("community") == comm_id],
        key=lambda x: -pagerank.get(x, 0)
    )[:5]
    pr_sum = sum(pagerank.get(m, 0) for m in top_members)

    community_meta_elements.append({
        "data": {
            "id": f"comm-{comm_id}",
            "label": summary_data["title"][:35],
            "type": "COMMUNITY",
            "community": comm_id,
            "member_count": member_count,
            "top_members": [m[:25] for m in top_members],
            "color": color,
            "size": scale_community_size(member_count),
            "pagerank_sum": round(pr_sum, 4),
        }
    })

# "Other" meta-node for communities smaller than MIN_COMMUNITY_SIZE_FOR_VIZ
if other_node_count > 0:
    community_meta_elements.append({
        "data": {
            "id": "comm-other",
            "label": f"Other ({other_community_count} small)",
            "type": "COMMUNITY",
            "community": -1,
            "member_count": other_node_count,
            "top_members": [],
            "color": "#555555",
            "size": 40,
            "pagerank_sum": 0,
        }
    })

# Inter-community edges (aggregated from entity-level edges)
inter_comm_edges = {}
for src, tgt, attrs in G.edges(data=True):
    if src not in viz_nodes or tgt not in viz_nodes:
        continue
    src_comm = G.nodes[src].get("community", -1)
    tgt_comm = G.nodes[tgt].get("community", -1)
    if src_comm == tgt_comm:
        continue
    src_in = src_comm in cyto_community_summaries
    tgt_in = tgt_comm in cyto_community_summaries
    if not src_in and not tgt_in:
        continue
    src_id = f"comm-{src_comm}" if src_in else "comm-other"
    tgt_id = f"comm-{tgt_comm}" if tgt_in else "comm-other"
    key = (src_id, tgt_id)
    if key not in inter_comm_edges:
        inter_comm_edges[key] = {"count": 0, "descriptions": []}
    inter_comm_edges[key]["count"] += 1
    desc = attrs.get("description", "")
    if desc and len(inter_comm_edges[key]["descriptions"]) < 5:
        inter_comm_edges[key]["descriptions"].append(
            f"{src} \u2192 {tgt}: {desc[:80]}"
        )

# `edge_info` rather than `data`, so the extraction results loaded into `data`
# in Step 1 are not overwritten.
for (src_id, tgt_id), edge_info in inter_comm_edges.items():
    community_meta_elements.append({
        "data": {
            "id": f"{src_id}-->{tgt_id}",
            "source": src_id,
            "target": tgt_id,
            "weight": edge_info["count"],
            "description": f"{edge_info['count']} cross-community relationships",
            "details": edge_info["descriptions"],
        }
    })

# ============================================================
# LEVEL 1: Per-community entity data (loaded on expand)
# ============================================================

community_entity_data = {}

for comm_id in cyto_community_summaries:
    comm_members = [n for n in viz_nodes if G.nodes[n].get("community") == comm_id]
    member_set = set(comm_members)

    # Named comm_entities / comm_edges so the notebook-level `entities` list
    # loaded in Step 1 is not overwritten.
    comm_entities = []
    for node in comm_members:
        attrs = G.nodes[node]
        pr = attrs.get("pagerank", 0)
        chunk_refs = entity_chunk_map.get(node, [])
        comm_entities.append({
            "data": {
                "id": node,
                "label": node,
                "parent": f"comm-{comm_id}",
                "type": attrs.get("type", "UNKNOWN"),
                "description": attrs.get("description", ""),
                "community": comm_id,
                "pagerank": round(pr, 6),
                "degree_centrality": round(attrs.get("degree_centrality", 0), 4),
                "betweenness": round(attrs.get("betweenness", 0), 4),
                "num_sources": attrs.get("num_sources", 1),
                "source_refs": attrs.get("source_refs", "[]"),
                "color": COMMUNITY_COLORS[comm_id % len(COMMUNITY_COLORS)],
                "size": scale_pagerank_to_size(pr),
                "chunk_count": len(chunk_refs),
            }
        })

    # Intra-community edges only
    comm_edges = []
    for src, tgt, attrs in G.edges(data=True):
        if src in member_set and tgt in member_set:
            comm_edges.append({
                "data": {
                    "id": f"{src}-->{tgt}",
                    "source": src,
                    "target": tgt,
                    "description": attrs.get("description", ""),
                    "weight": attrs.get("weight", 1.0),
                }
            })

    # Semantic groups within this community
    sg_elements = []
    for group in semantic_entity_groups:
        gid = group["group_id"]
        valid_members = [m for m in group["members"]
                         if m in member_set and m in entity_node_ids]
        if len(valid_members) < 2 or len(group["members"]) > MAX_COMPOUND_SIZE:
            continue
        parent_id = f"sg-{gid}"
        sg_elements.append({
            "data": {
                "id": parent_id,
                "label": group["canonical"],
                "parent": f"comm-{comm_id}",
                "type": "SEMANTIC_GROUP",
                "group_id": gid,
                "canonical": group["canonical"],
                "member_count": len(valid_members),
                "color": SEMANTIC_GROUP_COLOR,
            }
        })
        # Re-parent grouped entities under the semantic-group compound node
        for ent in comm_entities:
            if ent["data"]["id"] in valid_members:
                ent["data"]["parent"] = parent_id

    community_entity_data[comm_id] = {
        "entities": comm_entities,
        "edges": comm_edges,
        "semantic_groups": sg_elements,
    }

# Print summary
meta_nodes = len([e for e in community_meta_elements if "source" not in e["data"]])
meta_edges = len([e for e in community_meta_elements if "source" in e["data"]])
total_entities = sum(len(d["entities"]) for d in community_entity_data.values())
total_intra_edges = sum(len(d["edges"]) for d in community_entity_data.values())
total_sg = sum(len(d["semantic_groups"]) for d in community_entity_data.values())
entities_with_chunks = sum(1 for v in cyto_chunk_refs.values() if v)
total_chunk_refs = sum(len(v) for v in cyto_chunk_refs.values())

print(f"\nLevel 0: {meta_nodes} community meta-nodes, {meta_edges} inter-community edges")
print(f"Level 1: {total_entities} entities across {len(community_entity_data)} communities, {total_intra_edges} intra-community edges")
print(f"Level 1: {total_sg} semantic groups nested in communities")
print(f"Level 2: {entities_with_chunks} entities with {total_chunk_refs} chunk refs ({len(chunk_texts)} unique texts)")
print(f"Community summaries: {len(cyto_community_summaries)} included, {other_community_count} collapsed to 'Other' ({other_node_count} nodes)")

Visualization filter: 158 connected nodes (filtered out 308 isolated nodes, kept in DB)

Level 0: 40 community meta-nodes, 2 inter-community edges
Level 1: 158 entities across 40 communities, 134 intra-community edges
Level 1: 17 semantic groups nested in communities
Level 2: 158 entities with 268 chunk refs (96 unique texts)
Community summaries: 40 included, 0 collapsed to 'Other' (0 nodes)
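
The Level 0 edge count above comes from collapsing entity-level edges into one meta-edge per ordered pair of communities. Below is a minimal, self-contained sketch of that aggregation. It uses a plain edge list and a hypothetical `membership` dict in place of the notebook's `G` and its `community` node attribute; the function name and the toy data are assumptions for illustration, not part of the pipeline.

```python
from collections import Counter

def aggregate_inter_community_edges(edges, membership):
    """Count entity-level edges that cross community boundaries.

    edges: iterable of (source, target) entity pairs.
    membership: dict mapping entity name -> community id (hypothetical input).
    Returns a Counter keyed by (source_community, target_community).
    """
    counts = Counter()
    for src, tgt in edges:
        src_comm = membership.get(src, -1)  # -1 mirrors the cell's default
        tgt_comm = membership.get(tgt, -1)
        if src_comm == tgt_comm:
            continue  # intra-community edge: only shown at Level 1
        counts[(src_comm, tgt_comm)] += 1
    return counts

# Toy data (hypothetical, not taken from the notebook's corpus)
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "a")]
toy_membership = {"a": 0, "b": 0, "c": 1, "d": 1}
toy_counts = aggregate_inter_community_edges(toy_edges, toy_membership)
print(dict(toy_counts))
```

Because the key is an ordered pair, relationships from community A to B and from B to A produce two separate meta-edges, which matches the `(src_id, tgt_id)` key used in the cell above. Keying on a sorted pair instead would merge both directions into one edge, if that reads better in the overview.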


In [62]:
# Generate standalone HTML with multi-level Cytoscape.js drill-down

# Build community legend
legend_items = []
for comm_id in sorted(cyto_community_summaries.keys()):
    color = COMMUNITY_COLORS[comm_id % len(COMMUNITY_COLORS)]
    members = [n for n in viz_nodes if G.nodes[n].get("community") == comm_id]
    sorted_members = sorted(members, key=lambda x: -pagerank.get(x, 0))
    summary = cyto_community_summaries.get(comm_id, {})
    legend_items.append({
        "id": comm_id,
        "color": color,
        "count": len(members),
        "members": sorted_members[:5],
        "title": summary.get("title", f"Community {comm_id}"),
    })

if other_community_count > 0:
    legend_items.append({
        "id": -1,
        "color": "#555555",
        "count": other_node_count,
        "members": [],
        "title": f"Other ({other_community_count} small communities)",
    })

from html import escape  # titles and entity names may contain <, >, & or quotes

legend_html_parts = []
for item in legend_items:
    members_str = ", ".join(escape(m) for m in item["members"][:3])
    if len(item["members"]) > 3:
        members_str += f" +{item['count'] - 3}"
    legend_html_parts.append(
        f'<div class="legend-item" data-community="{item["id"]}" data-color="{item["color"]}">'
        f'<div class="legend-dot" style="background:{item["color"]}"></div>'
        f'<div class="legend-text"><span class="title">{escape(item["title"])}</span> '
        f'<span class="count">({item["count"]})</span><br>{members_str}</div>'
        f'</div>'
    )

html_content = """<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>DKIA Knowledge Graph — Multi-Level</title>
<script src="https://cdnjs.cloudflare.com/ajax/libs/cytoscape/3.30.4/cytoscape.min.js"></script>
<style>
  * { margin: 0; padding: 0; box-sizing: border-box; }
  body {
    font-family: 'Inter', -apple-system, BlinkMacSystemFont, sans-serif;
    background: #0a0a0f;
    color: #e0e0e0;
    display: flex;
    height: 100vh;
    overflow: hidden;
  }
  #cy {
    flex: 1;
    background: #0a0a0f;
    position: relative;
  }
  #sidebar {
    width: 360px;
    background: #12121a;
    border-left: 1px solid #2a2a3a;
    display: flex;
    flex-direction: column;
    overflow: hidden;
  }
  #sidebar h2 {
    padding: 16px 20px 12px;
    font-size: 14px;
    text-transform: uppercase;
    letter-spacing: 1.5px;
    color: #f58231;
    border-bottom: 1px solid #2a2a3a;
  }
  #node-info {
    padding: 16px 20px;
    border-bottom: 1px solid #2a2a3a;
    min-height: 120px;
    max-height: 380px;
    overflow-y: auto;
    font-size: 13px;
    line-height: 1.6;
  }
  #node-info .name {
    font-size: 16px;
    font-weight: 600;
    color: #fff;
    margin-bottom: 8px;
  }
  #node-info .type-badge {
    display: inline-block;
    padding: 2px 8px;
    border-radius: 4px;
    font-size: 11px;
    font-weight: 600;
    margin-bottom: 10px;
  }
  #node-info .metric { color: #888; }
  #node-info .metric span { color: #e0e0e0; font-weight: 500; }
  #node-info .desc {
    margin-top: 10px;
    color: #aaa;
    font-style: italic;
    font-size: 12px;
  }
  #node-info .placeholder { color: #555; font-style: italic; }
  #node-info .chunk-hint {
    margin-top: 8px;
    padding: 6px 10px;
    background: #1a1a2e;
    border-radius: 4px;
    font-size: 11px;
    color: #42d4f4;
  }
  .comm-summary-box {
    margin-top: 12px;
    padding: 10px 12px;
    background: #1a1a2e;
    border-radius: 6px;
    border-left: 3px solid #f58231;
  }
  .comm-summary-box .comm-title {
    font-size: 13px;
    font-weight: 600;
    color: #f58231;
    margin-bottom: 6px;
  }
  .comm-summary-box .comm-text {
    font-size: 12px;
    color: #aaa;
    line-height: 1.5;
    margin-bottom: 8px;
  }
  .comm-summary-box .comm-insights {
    list-style: none;
    padding: 0;
  }
  .comm-summary-box .comm-insights li {
    font-size: 11px;
    color: #888;
    padding: 2px 0 2px 12px;
    position: relative;
    line-height: 1.4;
  }
  .comm-summary-box .comm-insights li::before {
    content: '';
    position: absolute;
    left: 0;
    top: 8px;
    width: 5px;
    height: 5px;
    border-radius: 50%;
    background: #f58231;
  }
  #chunk-panel {
    border-bottom: 1px solid #2a2a3a;
    max-height: 0;
    overflow: hidden;
    transition: max-height 0.3s ease;
  }
  #chunk-panel.open {
    max-height: 400px;
    overflow-y: auto;
  }
  #chunk-panel .chunk-header {
    padding: 10px 20px 6px;
    font-size: 11px;
    text-transform: uppercase;
    letter-spacing: 1px;
    color: #42d4f4;
    position: sticky;
    top: 0;
    background: #12121a;
  }
  .chunk-card {
    margin: 4px 12px;
    padding: 10px 14px;
    background: #1a1a2e;
    border-radius: 6px;
    border-left: 3px solid #42d4f4;
    cursor: default;
    transition: background 0.15s;
  }
  .chunk-card:hover { background: #222240; }
  .chunk-card .chunk-source {
    font-size: 10px;
    color: #666;
    margin-bottom: 4px;
  }
  .chunk-card .chunk-text {
    font-size: 12px;
    color: #bbb;
    line-height: 1.5;
    max-height: 44px;
    overflow: hidden;
    transition: max-height 0.3s ease;
  }
  .chunk-card:hover .chunk-text { max-height: 600px; }
  .chunk-card .expand-hint {
    font-size: 10px;
    color: #555;
    margin-top: 4px;
  }
  .chunk-card:hover .expand-hint { display: none; }
  #chunk-tooltip {
    display: none;
    position: fixed;
    max-width: 450px;
    padding: 14px 18px;
    background: #1a1a2e;
    border: 1px solid #42d4f4;
    border-radius: 8px;
    font-size: 12px;
    color: #ccc;
    line-height: 1.6;
    z-index: 9999;
    pointer-events: none;
    box-shadow: 0 8px 32px rgba(0,0,0,0.6);
  }
  #chunk-tooltip .tt-source {
    font-size: 10px;
    color: #42d4f4;
    margin-bottom: 6px;
    text-transform: uppercase;
    letter-spacing: 0.5px;
  }
  #chunk-tooltip .tt-text {
    white-space: pre-wrap;
    word-wrap: break-word;
  }
  #legend-header {
    padding: 12px 20px 8px;
    font-size: 12px;
    text-transform: uppercase;
    letter-spacing: 1px;
    color: #888;
    border-bottom: 1px solid #2a2a3a;
  }
  #legend {
    flex: 1;
    overflow-y: auto;
    padding: 8px 20px;
  }
  .legend-item {
    display: flex;
    align-items: flex-start;
    gap: 8px;
    padding: 6px 0;
    cursor: pointer;
    border-radius: 4px;
    transition: background 0.15s;
  }
  .legend-item:hover { background: #1a1a2a; }
  .legend-dot {
    width: 12px;
    height: 12px;
    border-radius: 50%;
    flex-shrink: 0;
    margin-top: 3px;
  }
  .legend-text {
    font-size: 12px;
    line-height: 1.4;
    color: #aaa;
  }
  .legend-text .count { color: #666; }
  .legend-text .title { color: #ccc; font-weight: 500; }
  .level-indicator {
    padding: 6px 20px;
    font-size: 11px;
    color: #f58231;
    background: #1a1a0e;
    border-bottom: 1px solid #2a2a3a;
    text-transform: uppercase;
    letter-spacing: 1px;
  }
  #controls {
    padding: 12px 20px;
    border-top: 1px solid #2a2a3a;
    display: flex;
    gap: 8px;
    flex-wrap: wrap;
  }
  #controls button {
    padding: 6px 12px;
    background: #1a1a2a;
    border: 1px solid #2a2a3a;
    color: #ccc;
    border-radius: 4px;
    cursor: pointer;
    font-size: 12px;
    transition: all 0.15s;
  }
  #controls button:hover { background: #2a2a3a; color: #fff; }
  #stats {
    padding: 8px 20px;
    font-size: 11px;
    color: #555;
    border-top: 1px solid #2a2a3a;
  }
</style>
</head>
<body>
<div id="cy"></div>
<div id="chunk-tooltip"></div>
<div id="sidebar">
  <h2>DKIA Knowledge Graph</h2>
  <div class="level-indicator" id="level-indicator">Level 0 — Community Overview</div>
  <div id="node-info">
    <div class="placeholder">Click a community node to expand it<br>Click a community in the legend to focus it</div>
  </div>
  <div id="chunk-panel"></div>
  <div id="legend-header">COMMUNITIES (LEGEND_COUNT)</div>
  <div id="legend">LEGEND_HTML</div>
  <div id="controls">
    <button onclick="cy.fit(undefined, 40)">Fit</button>
    <button onclick="collapseAll()">Collapse All</button>
    <button onclick="resetView()">Reset</button>
  </div>
  <div id="stats">META_NODES meta-nodes &middot; TOTAL_ENTITIES entities &middot; COMM_COUNT communities</div>
</div>

<script>
document.getElementById('level-indicator').textContent = 'JS loaded, initializing...';
if (typeof cytoscape === 'undefined') {
  document.getElementById('cy').style.background = '#220000';
  document.getElementById('cy').innerHTML = '<div style="color:#ff4444;padding:40px;font-size:18px"><b>Cytoscape.js CDN failed to load!</b><br><br>Check internet connection or refresh the page.</div>';
  throw new Error('CDN not loaded');
}
const metaElements = COMMUNITY_META_JSON;
const communityData = COMMUNITY_ENTITIES_JSON;
const chunkTexts = CHUNK_TEXTS_JSON;
const chunkRefs = CHUNK_REFS_JSON;
const commSummaries = COMMUNITY_SUMMARIES_JSON;
const semanticGroups = SEMANTIC_GROUPS_JSON;

const expandedCommunities = new Set();

let cy;
try {
  cy = cytoscape({
  container: document.getElementById('cy'),
  elements: metaElements,
  style: [
    {
      selector: 'node[type="COMMUNITY"]',
      style: {
        'shape': 'round-rectangle',
        'background-color': 'data(color)',
        'background-opacity': 0.3,
        'border-color': 'data(color)',
        'border-width': 3,
        'label': 'data(label)',
        'font-size': '11px',
        'text-valign': 'center',
        'text-halign': 'center',
        'text-wrap': 'wrap',
        'text-max-width': '100px',
        'color': '#fff',
        'width': 'data(size)',
        'height': 'data(size)',
        'text-outline-color': '#0a0a0f',
        'text-outline-width': 2,
        'transition-property': 'opacity, border-color, border-width, background-opacity',
        'transition-duration': '0.2s',
      }
    },
    {
      selector: ':parent[type="COMMUNITY"]',
      style: {
        'background-opacity': 0.08,
        'border-style': 'solid',
        'border-width': 2,
        'padding': '20px',
        'text-valign': 'top',
        'font-size': '10px',
        'shape': 'round-rectangle',
      }
    },
    {
      selector: 'node[type!="COMMUNITY"][type!="SEMANTIC_GROUP"]',
      style: {
        'label': 'data(label)',
        'background-color': 'data(color)',
        'width': 'data(size)',
        'height': 'data(size)',
        'shape': 'ellipse',
        'font-size': '9px',
        'text-valign': 'bottom',
        'text-halign': 'center',
        'text-margin-y': 4,
        'color': '#999',
        'text-outline-color': '#0a0a0f',
        'text-outline-width': 2,
        'border-width': 1.5,
        'border-color': '#333',
        'transition-property': 'opacity, border-color, border-width',
        'transition-duration': '0.2s',
      }
    },
    {
      selector: 'node[type="SEMANTIC_GROUP"]',
      style: {
        'background-color': '#151a12',
        'background-opacity': 0.5,
        'border-color': '#bfef45',
        'border-width': 2,
        'border-style': 'dashed',
        'border-opacity': 0.5,
        'shape': 'round-rectangle',
        'padding': '15px',
        'text-valign': 'top',
        'text-halign': 'center',
        'font-size': '10px',
        'color': '#bfef45',
        'text-opacity': 0.7,
        'label': 'data(label)',
        'text-outline-color': '#0a0a0f',
        'text-outline-width': 2,
      }
    },
    {
      selector: 'edge',
      style: {
        'width': 1.5,
        'line-color': '#333',
        'target-arrow-color': '#444',
        'target-arrow-shape': 'triangle',
        'curve-style': 'bezier',
        'opacity': 0.5,
        'transition-property': 'opacity, line-color, width',
        'transition-duration': '0.2s',
      }
    },
    {
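      // Match only aggregated inter-community edges: entity-level edges also
      // carry a weight, but only meta-edges carry details.
      selector: 'edge[details]',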
      style: {
        'width': 3,
        'line-color': '#555',
        'target-arrow-shape': 'none',
        'line-style': 'dashed',
        'opacity': 0.4,
      }
    },
    {
      selector: 'node:selected',
      style: {
        'border-width': 4,
        'border-color': '#f58231',
        'color': '#fff',
        'font-size': '12px',
        'font-weight': 'bold',
        'text-outline-width': 3,
        'z-index': 999,
      }
    },
    {
      selector: '.highlighted',
      style: {
        'opacity': 1,
        'border-width': 3,
        'border-color': '#f58231',
        'color': '#fff',
        'z-index': 999,
      }
    },
    {
      selector: 'edge.highlighted',
      style: {
        'opacity': 1,
        'width': 2.5,
        'line-color': '#f58231',
        'target-arrow-color': '#f58231',
      }
    },
    {
      selector: '.dimmed',
      style: { 'opacity': 0.12 }
    },
    {
      selector: '.chunk-node',
      style: {
        'background-color': '#42d4f4',
        'background-opacity': 0.25,
        'border-color': '#42d4f4',
        'border-width': 2,
        'width': 18,
        'height': 18,
        'shape': 'ellipse',
        'label': 'data(label)',
        'font-size': '8px',
        'color': '#42d4f4',
        'text-valign': 'bottom',
        'text-margin-y': 3,
        'text-outline-color': '#0a0a0f',
        'text-outline-width': 2,
        'z-index': 1000,
        'opacity': 1,
      }
    },
    {
      selector: '.chunk-edge',
      style: {
        'width': 1,
        'line-color': '#42d4f4',
        'line-style': 'dashed',
        'opacity': 0.4,
        'target-arrow-shape': 'none',
        'curve-style': 'bezier',
        'z-index': 999,
      }
    },
  ],
  // Use a grid layout for Level 0: it is deterministic and handles the many
  // disconnected community meta-nodes reliably
  layout: {
    name: 'grid',
    animate: false,
    fit: true,
    padding: 40,
  },
  minZoom: 0.15,
  maxZoom: 4,
  wheelSensitivity: 0.3,
});

// Explicit fit after layout completes
  cy.fit(undefined, 40);
  console.log('Cytoscape initialized:', cy.nodes().length, 'nodes,', cy.edges().length, 'edges');
  document.getElementById('level-indicator').textContent = 'Level 0 — ' + cy.nodes().length + ' nodes, ' + cy.edges().length + ' edges loaded';
} catch(e) {
  console.error('Cytoscape init error:', e);
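  // Backslashes are doubled because this script lives inside a Python string;
  // a single backslash-n would become a raw line break in the JS literal.
  document.getElementById('cy').innerHTML = '<div style="color:#ff4444;padding:40px;font-size:16px;font-family:monospace"><b>Cytoscape Error:</b><br><pre style="white-space:pre-wrap;margin-top:12px">' + e.message + '\\n\\n' + e.stack + '</pre></div>';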
}

const tooltip = document.getElementById('chunk-tooltip');
const chunkPanel = document.getElementById('chunk-panel');
const levelIndicator = document.getElementById('level-indicator');

function updateLevelIndicator() {
  if (expandedCommunities.size === 0) {
    levelIndicator.textContent = 'Level 0 — Community Overview';
  } else {
    levelIndicator.textContent = 'Level 1 — ' + expandedCommunities.size + ' expanded';
  }
}

function clearChunks() {
  cy.remove(cy.elements('.chunk-node, .chunk-edge'));
  chunkPanel.classList.remove('open');
  chunkPanel.innerHTML = '';
  tooltip.style.display = 'none';
}

function expandCommunity(commId) {
  const metaNode = cy.getElementById('comm-' + commId);
  if (!metaNode.length) return;

  const data = communityData[commId];
  if (!data) return;

  const toAdd = [...data.entities, ...data.edges, ...data.semantic_groups];
  cy.add(toAdd);

  const children = metaNode.children();
  children.layout({
    name: 'cose',
    animate: true,
    animationDuration: 300,
    boundingBox: metaNode.boundingBox(),
    fit: false,
    nodeDimensionsIncludeLabels: true,
    nodeRepulsion: function() { return 4000; },
    idealEdgeLength: function() { return 60; },
  }).run();

  expandedCommunities.add(commId);
  updateLevelIndicator();

  cy.animate({ fit: { eles: metaNode.union(children), padding: 60 }, duration: 300 });
}

function collapseCommunity(commId) {
  const metaNode = cy.getElementById('comm-' + commId);
  if (!metaNode.length) return;

  // Remove chunk nodes attached to this community's entities, including
  // entities nested in semantic groups; Cytoscape removes the connected
  // chunk edges together with the nodes.
  metaNode.descendants().neighborhood('.chunk-node').remove();

  const descendants = metaNode.descendants();
  cy.remove(descendants.edgesWith(descendants));
  cy.remove(descendants);

  expandedCommunities.delete(commId);
  updateLevelIndicator();
}

function collapseAll() {
  clearChunks();
  Array.from(expandedCommunities).forEach(function(commId) {
    collapseCommunity(commId);
  });
  cy.elements().removeClass('dimmed highlighted');
  resetView();
}

function resetView() {
  clearChunks();
  cy.elements().removeClass('dimmed highlighted');
  document.getElementById('node-info').innerHTML =
    '<div class="placeholder">Click a community node to expand it<br>Click a community in the legend to focus it</div>';
  updateLevelIndicator();
  cy.fit(undefined, 40);
}

function buildCommSummaryHtml(commId, color) {
  const s = commSummaries[commId];
  if (!s) return '';
  let html = '<div class="comm-summary-box" style="border-left-color:' + (color || '#f58231') + '">';
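  // Escape "<" in LLM-generated text, as is already done for the insights below
  html += '<div class="comm-title">Community ' + commId + ': ' + String(s.title).replace(/</g, '&lt;') + '</div>';
  html += '<div class="comm-text">' + String(s.summary).replace(/</g, '&lt;') + '</div>';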
  if (s.key_insights && s.key_insights.length > 0) {
    html += '<ul class="comm-insights">';
    s.key_insights.forEach(function(insight) {
      html += '<li>' + insight.replace(/</g, '&lt;') + '</li>';
    });
    html += '</ul>';
  }
  html += '</div>';
  return html;
}

function buildSemanticGroupHtml(groupId) {
  var g = semanticGroups[groupId];
  if (!g) return '';
  var html = '<div class="comm-summary-box" style="border-left-color:#bfef45">';
  html += '<div class="comm-title" style="color:#bfef45">Semantic Group: ' + g.canonical + '</div>';
  html += '<div class="comm-text">' + g.members.length + ' semantically similar entities</div>';
  html += '<ul class="comm-insights">';
  g.members.forEach(function(m) {
    var sim = g.member_similarities[m];
    var simStr = sim ? ' (sim=' + sim.toFixed(4) + ')' : ' (canonical)';
    html += '<li>' + m + simStr + '</li>';
  });
  html += '</ul>';
  html += '</div>';
  return html;
}

function showCommunitySummary(commId, color) {
  const expanded = expandedCommunities.has(commId);
  const s = commSummaries[commId];
  if (!s) return;
  const memberCount = communityData[commId] ? communityData[commId].entities.length : 0;
  let html = '<div class="name" style="color:' + color + '">Community ' + commId + '</div>';
  html += '<span class="type-badge" style="background:' + color + '33; color:' + color + '">COMMUNITY</span>';
  html += '<div class="metric">Members: <span>' + memberCount + ' entities</span></div>';
  html += '<div class="metric">Status: <span>' + (expanded ? 'Expanded — click border to collapse' : 'Collapsed — click to expand') + '</span></div>';
  html += buildCommSummaryHtml(commId, color);
  document.getElementById('node-info').innerHTML = html;
}

function showChunks(entityId) {
  clearChunks();
  const refs = chunkRefs[entityId];
  if (!refs || refs.length === 0) return;
  const chunks = refs.map(function(r) {
    return { index: r.index, source_id: r.source_id, text: chunkTexts[r.text_idx] };
  });

  const entityNode = cy.getElementById(entityId);
  const pos = entityNode.position();
  const radius = 120 + chunks.length * 8;

  const addedElements = [];
  chunks.forEach(function(chunk, i) {
    const angle = (2 * Math.PI * i) / chunks.length - Math.PI / 2;
    const chunkId = 'chunk-' + entityId + '-' + chunk.index;
    const preview = chunk.source_id.split(':').pop();

    addedElements.push({
      group: 'nodes',
      data: {
        id: chunkId,
        label: '#' + chunk.index + ' ' + preview,
        fullText: chunk.text,
        sourceId: chunk.source_id,
        chunkIndex: chunk.index,
      },
      position: {
        x: pos.x + radius * Math.cos(angle),
        y: pos.y + radius * Math.sin(angle),
      },
      classes: 'chunk-node',
    });
    addedElements.push({
      group: 'edges',
      data: {
        id: 'cedge-' + chunkId,
        source: entityId,
        target: chunkId,
      },
      classes: 'chunk-edge',
    });
  });

  cy.add(addedElements);
  levelIndicator.textContent = 'Level 2 — Chunk expansion for ' + entityId;

  let html = '<div class="chunk-header">' + chunks.length + ' source chunks</div>';
  chunks.forEach(function(chunk) {
    const fullText = chunk.text.replace(/</g, '&lt;').replace(/>/g, '&gt;');
    html += '<div class="chunk-card" data-chunk-id="chunk-' + entityId + '-' + chunk.index + '">' +
      '<div class="chunk-source">#' + chunk.index + ' &middot; ' + chunk.source_id + '</div>' +
      '<div class="chunk-text">' + fullText + '</div>' +
      '<div class="expand-hint">hover to expand</div>' +
      '</div>';
  });
  chunkPanel.innerHTML = html;
  chunkPanel.classList.add('open');

  chunkPanel.querySelectorAll('.chunk-card').forEach(function(card) {
    card.addEventListener('mouseenter', function() {
      const cnode = cy.getElementById(card.dataset.chunkId);
      if (cnode.length) {
        cnode.style('background-opacity', 0.7);
        cnode.style('border-width', 3);
        cnode.style('width', 24);
        cnode.style('height', 24);
      }
    });
    card.addEventListener('mouseleave', function() {
      const cnode = cy.getElementById(card.dataset.chunkId);
      if (cnode.length) {
        cnode.style('background-opacity', 0.25);
        cnode.style('border-width', 2);
        cnode.style('width', 18);
        cnode.style('height', 18);
      }
    });
  });
}

// Hover chunk nodes: tooltip
cy.on('mouseover', '.chunk-node', function(evt) {
  const node = evt.target;
  const d = node.data();
  const rpos = node.renderedPosition();
  const container = cy.container().getBoundingClientRect();

  tooltip.innerHTML =
    '<div class="tt-source">#' + d.chunkIndex + ' &middot; ' + d.sourceId + '</div>' +
    '<div class="tt-text">' + d.fullText.replace(/</g, '&lt;').replace(/>/g, '&gt;') + '</div>';

  let left = container.left + rpos.x + 20;
  let top = container.top + rpos.y - 20;
  if (left + 460 > window.innerWidth) left = container.left + rpos.x - 470;
  if (top + 300 > window.innerHeight) top = window.innerHeight - 310;
  if (top < 10) top = 10;

  tooltip.style.left = left + 'px';
  tooltip.style.top = top + 'px';
  tooltip.style.display = 'block';
});

cy.on('mouseout', '.chunk-node', function() {
  tooltip.style.display = 'none';
});

// Hover inter-community edges: show details
cy.on('mouseover', 'edge[details]', function(evt) {
  const edge = evt.target;
  const d = edge.data();
  const rpos = edge.renderedMidpoint();
  const container = cy.container().getBoundingClientRect();

  let html = '<div class="tt-source">' + d.description + '</div>';
  if (d.details && d.details.length > 0) {
    html += '<div class="tt-text">' + d.details.join('\\n').replace(/</g, '&lt;').replace(/>/g, '&gt;') + '</div>';
  }
  tooltip.innerHTML = html;

  let left = container.left + rpos.x + 20;
  let top = container.top + rpos.y - 20;
  if (left + 460 > window.innerWidth) left = container.left + rpos.x - 470;
  tooltip.style.left = left + 'px';
  tooltip.style.top = top + 'px';
  tooltip.style.display = 'block';
});

cy.on('mouseout', 'edge[details]', function() {
  tooltip.style.display = 'none';
});

// Main tap handler
cy.on('tap', 'node', function(evt) {
  const node = evt.target;

  if (node.hasClass('chunk-node')) return;

  // Semantic group parent
  if (node.data('type') === 'SEMANTIC_GROUP') {
    clearChunks();
    var gid = node.data('group_id');
    cy.elements().addClass('dimmed').removeClass('highlighted');
    var children = node.children();
    children.add(node).add(children.edgesWith(children)).removeClass('dimmed').addClass('highlighted');
    cy.fit(children.add(node), 60);
    var infoHtml = '<div class="name" style="color:#bfef45">' + node.data('label') + '</div>';
    infoHtml += '<span class="type-badge" style="background:#bfef4533; color:#bfef45">SEMANTIC GROUP</span>';
    infoHtml += '<div class="metric">Members: <span>' + node.data('member_count') + ' entities</span></div>';
    infoHtml += buildSemanticGroupHtml(gid);
    document.getElementById('node-info').innerHTML = infoHtml;
    return;
  }

  // Community meta-node: expand/collapse
  if (node.data('type') === 'COMMUNITY') {
    const commId = node.data('community');
    if (commId === -1) return;
    clearChunks();
    cy.elements().removeClass('dimmed highlighted');

    if (expandedCommunities.has(commId)) {
      collapseCommunity(commId);
    } else {
      expandCommunity(commId);
    }
    showCommunitySummary(commId, node.data('color'));
    return;
  }

  // Entity node
  const d = node.data();
  clearChunks();
  cy.elements().addClass('dimmed').removeClass('highlighted');
  const neighborhood = node.neighborhood().add(node);
  const parentComm = node.parent();
  if (parentComm.length) {
    parentComm.add(parentComm.children()).removeClass('dimmed');
  }
  neighborhood.removeClass('dimmed').addClass('highlighted');

  const sources = JSON.parse(d.source_refs || '[]');
  const sourceStr = sources.length > 0 ? sources.join(', ') : 'single source';

  let infoHtml =
    '<div class="name">' + d.label + '</div>' +
    '<span class="type-badge" style="background:' + d.color + '33; color:' + d.color + '">' + d.type + '</span>' +
    ' <span class="type-badge" style="background:#2a2a3a; color:#888">C' + d.community + '</span>' +
    '<div class="metric">PageRank: <span>' + d.pagerank.toFixed(4) + '</span></div>' +
    '<div class="metric">Degree: <span>' + d.degree_centrality.toFixed(4) + '</span></div>' +
    '<div class="metric">Betweenness: <span>' + d.betweenness.toFixed(4) + '</span></div>' +
    '<div class="metric">Sources: <span>' + d.num_sources + '</span> (' + sourceStr + ')</div>' +
    (d.description ? '<div class="desc">' + d.description + '</div>' : '');

  if (d.chunk_count > 0) {
    infoHtml += '<div class="chunk-hint">' + d.chunk_count + ' source chunks — expanding on graph</div>';
  }

  infoHtml += buildCommSummaryHtml(d.community, d.color);

  var parentNode = node.parent();
  if (parentNode.length > 0 && parentNode.data('type') === 'SEMANTIC_GROUP') {
    var sgId = parentNode.data('group_id');
    infoHtml += '<span class="type-badge" style="background:#bfef4533; color:#bfef45;margin-top:8px;display:inline-block">SG: ' + parentNode.data('label') + '</span>';
    infoHtml += buildSemanticGroupHtml(sgId);
  }

  document.getElementById('node-info').innerHTML = infoHtml;

  if (d.chunk_count > 0) {
    showChunks(d.id);
  }
});

// Click background: reset
cy.on('tap', function(evt) {
  if (evt.target === cy) {
    clearChunks();
    cy.elements().removeClass('dimmed highlighted');
    document.getElementById('node-info').innerHTML =
      '<div class="placeholder">Click a community node to expand it<br>Click a community in the legend to focus it</div>';
    updateLevelIndicator();
  }
});

// Legend click
document.querySelectorAll('.legend-item').forEach(function(item) {
  item.addEventListener('click', function() {
    clearChunks();
    const commId = parseInt(item.dataset.community, 10);
    const color = item.dataset.color;

    if (commId === -1) {
      cy.elements().addClass('dimmed').removeClass('highlighted');
      const otherNodes = cy.nodes('[type="COMMUNITY"][community = -1]');
      otherNodes.removeClass('dimmed').addClass('highlighted');
      if (otherNodes.length) cy.fit(otherNodes, 60);
      return;
    }

    cy.elements().removeClass('dimmed highlighted');

    if (!expandedCommunities.has(commId)) {
      expandCommunity(commId);
    } else {
      const metaNode = cy.getElementById('comm-' + commId);
      cy.animate({ fit: { eles: metaNode.union(metaNode.descendants()), padding: 60 }, duration: 300 });
    }

    showCommunitySummary(commId, color);
  });
});
</script>
</body>
</html>"""

# Substitute placeholders
html_content = html_content.replace("COMMUNITY_META_JSON", json.dumps(community_meta_elements))
html_content = html_content.replace("COMMUNITY_ENTITIES_JSON", json.dumps(community_entity_data))
html_content = html_content.replace("CHUNK_TEXTS_JSON", json.dumps(chunk_texts))
html_content = html_content.replace("CHUNK_REFS_JSON", json.dumps(cyto_chunk_refs))
html_content = html_content.replace("COMMUNITY_SUMMARIES_JSON", json.dumps(cyto_community_summaries))
html_content = html_content.replace("SEMANTIC_GROUPS_JSON", json.dumps(cyto_semantic_groups))
html_content = html_content.replace("LEGEND_HTML", "\n".join(legend_html_parts))
html_content = html_content.replace("LEGEND_COUNT", str(len(legend_items)))
html_content = html_content.replace("META_NODES", str(len([e for e in community_meta_elements if "source" not in e["data"]])))
html_content = html_content.replace("TOTAL_ENTITIES", str(sum(len(d["entities"]) for d in community_entity_data.values())))
html_content = html_content.replace("COMM_COUNT", str(len(cyto_community_summaries)))

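# Write and open (explicit UTF-8 so non-ASCII chunk text is encoded the same on every platform;
# as_uri() builds a valid file:// URL even when the path contains spaces)
GRAPH_HTML_PATH.write_text(html_content, encoding="utf-8")
webbrowser.open(GRAPH_HTML_PATH.absolute().as_uri())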

meta_nodes = len([e for e in community_meta_elements if "source" not in e["data"]])
meta_edges = len([e for e in community_meta_elements if "source" in e["data"]])
print(f"Multi-level graph visualization written to: {GRAPH_HTML_PATH.absolute()}")
print(f"Level 0: {meta_nodes} community meta-nodes, {meta_edges} inter-community edges")
print(f"Total entity data embedded: {sum(len(d['entities']) for d in community_entity_data.values())} entities")
print()
print("Interactions:")
print("  Level 0: Click community node to expand -> Level 1")
print("  Level 1: Click entity inside community -> chunk expansion (Level 2)")
print("  Level 1: Click community border -> collapse back to Level 0")
print("  Legend: Click community -> expand + focus")
print("  Hover inter-community edge -> tooltip with relationship details")
print("  Hover chunk node -> tooltip with full text")
print("  Click background -> reset dimming")
print("  Collapse All -> return to Level 0")

Multi-level graph visualization written to: /Users/pacho-home-server/daily-knowledge-ingestion-assistant/notebooks/knowledge_graph.html
Level 0: 40 community meta-nodes, 2 inter-community edges
Total entity data embedded: 158 entities

Interactions:
  Level 0: Click community node to expand -> Level 1
  Level 1: Click entity inside community -> chunk expansion (Level 2)
  Level 1: Click community border -> collapse back to Level 0
  Legend: Click community -> expand + focus
  Hover inter-community edge -> tooltip with relationship details
  Hover chunk node -> tooltip with full text
  Click background -> reset dimming
  Collapse All -> return to Level 0
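
The placeholder substitution in the cell above pastes `json.dumps` output straight into the HTML template. Assuming the JSON placeholders sit inside a `<script>` element (the top of the template is not shown here, so this is an assumption), a chunk text that happens to contain `</script>` would end the script early. A minimal sketch of one way to guard against that follows; `embed_json` is a hypothetical helper, not part of the notebook:

```python
import json


def embed_json(template: str, placeholder: str, obj) -> str:
    """Replace `placeholder` in `template` with a script-safe JSON literal.

    Hypothetical helper for illustration only.
    """
    payload = json.dumps(obj)
    # "<\/" is a valid JSON and JavaScript string escape for "</", so the
    # parsed value is unchanged, but the HTML parser never sees a closing tag.
    payload = payload.replace("</", "<\\/")
    return template.replace(placeholder, payload)


# Tiny stand-in template; the real one is the large HTML string built above.
template = "<script>const chunkTexts = CHUNK_TEXTS_JSON;</script>"
page = embed_json(template, "CHUNK_TEXTS_JSON", {"c1": "text with </script> inside"})
```

Each `html_content.replace(NAME, json.dumps(...))` call could be swapped for `embed_json(html_content, NAME, ...)` if this turns out to matter for the ingested sources.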


## Summary

This notebook completed:

1. **Graph Construction** - Built NetworkX graph from multi-source entities and relationships
2. **Graph Metrics** - Computed PageRank, degree centrality, and betweenness
3. **Community Detection** - Applied Leiden algorithm to find cross-domain topic clusters
4. **Community Summaries** - Generated LLM-powered summaries for each cluster
5. **SQLite Storage** - Persisted graph, sources, and provenance data
6. **Interactive Visualization** - Standalone HTML with Cytoscape.js: multi-level drill-down, community summaries, chunk expansion, source provenance

## Next Steps

In the next notebook we will:
1. **Add embeddings** - Embed entities and chunks using nomic-embed-text
2. **Vector search** - Set up sqlite-vec for semantic retrieval
3. **Triple-factor retrieval** - Combine semantic similarity, content-type-aware temporal relevance, and graph centrality into one ranking score
4. **Cross-domain queries** - Test retrieval across all 7 source domains
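
As a rough preview of the triple-factor idea, the sketch below combines the three signals as a weighted sum. Everything in it is an assumption made for illustration: the `Candidate` fields, the weights, the per-content-type half-lives, and the exponential decay are placeholders, not the design the next notebook will necessarily use.

```python
from dataclasses import dataclass

# Hypothetical half-lives in days per content type (not values from this project).
HALF_LIFE_DAYS = {"news": 7.0, "paper": 365.0}
DEFAULT_HALF_LIFE_DAYS = 90.0


@dataclass
class Candidate:
    chunk_id: str
    semantic: float     # similarity to the query, assumed to be in [0, 1]
    age_days: float     # age of the source document
    content_type: str   # selects the decay rate
    centrality: float   # e.g. PageRank rescaled to [0, 1]


def temporal_score(age_days: float, content_type: str) -> float:
    """Exponential decay: the score halves every `half_life` days."""
    half_life = HALF_LIFE_DAYS.get(content_type, DEFAULT_HALF_LIFE_DAYS)
    return 0.5 ** (age_days / half_life)


def triple_score(c: Candidate, w_sem: float = 0.6,
                 w_time: float = 0.2, w_graph: float = 0.2) -> float:
    """Weighted sum of semantic, temporal, and graph-centrality signals."""
    return (w_sem * c.semantic
            + w_time * temporal_score(c.age_days, c.content_type)
            + w_graph * c.centrality)


def rank(candidates: list[Candidate]) -> list[Candidate]:
    """Return candidates sorted by descending combined score."""
    return sorted(candidates, key=triple_score, reverse=True)


candidates = [
    Candidate("a", semantic=0.8, age_days=7, content_type="news", centrality=0.1),
    Candidate("b", semantic=0.7, age_days=7, content_type="paper", centrality=0.5),
]
ranked = rank(candidates)
```

With these placeholder weights, a week-old paper attached to a central entity can outrank a slightly more similar news item of the same age, which is the kind of trade-off the content-type-aware decay is meant to express.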