# GraphRAG Step 2: Graph Construction & Community Detection

This notebook continues the GraphRAG pipeline, starting from the extraction results produced in notebook 01.

## Pipeline Steps
1. **Load extraction results** - Multi-source data from notebook 01
2. **Build NetworkX graph** - Entities as nodes, relationships as edges, with source provenance
3. **Compute graph metrics** - PageRank, centrality, degree
4. **Community detection** - Leiden algorithm for topic clustering
5. **Generate community summaries** - LLM-powered hierarchical summaries
6. **Store in SQLite** - Persist graph structure, sources, and summaries
7. **Interactive graph visualization** - Standalone HTML with Cytoscape.js (community summaries, chunk expansion)

## Setup

In [1]:
import json
import sqlite3
import webbrowser
from pathlib import Path
from dataclasses import dataclass, asdict

import httpx
import networkx as nx
import igraph as ig
import leidenalg

OLLAMA_BASE_URL = "http://localhost:11434"
MODEL = "qwen2.5:3b"
DB_PATH = Path("graphrag.db")
GRAPH_HTML_PATH = Path("knowledge_graph.html")

In [2]:
def chat_ollama(prompt: str, system: str = "", temperature: float = 0.0) -> str:
    """Send a chat request to Ollama and return the response."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    
    response = httpx.post(
        f"{OLLAMA_BASE_URL}/api/chat",
        json={
            "model": MODEL,
            "messages": messages,
            "stream": False,
            "options": {"temperature": temperature}
        },
        timeout=120.0
    )
    response.raise_for_status()
    return response.json()["message"]["content"]

## Step 1: Load Extraction Results

In [3]:
# Load multi-source results from notebook 01
with open("extraction_results.json", "r") as f:
    data = json.load(f)

# Global merged data
entities = data["merged"]["entities"]
relationships = data["merged"]["relationships"]
claims = data["merged"]["claims"]
entity_source_map = data["merged"]["entity_source_map"]
entity_chunk_map = data["merged"].get("entity_chunk_map", {})

# Semantic entity groups (non-destructive overlay from notebook 01)
semantic_entity_groups = data.get("semantic_entity_groups", [])
entity_to_semantic_group = data.get("entity_to_semantic_group", {})

# Per-source data
sources_data = data["sources"]

# Build flat chunk lookup: global_chunk_index -> {index, source_id, text}
# Chunks are numbered sequentially across all sources, in the order the sources appear
all_chunks = []
chunk_lookup: dict[int, dict] = {}
for source in sources_data:
    for chunk_text in source["chunks"]:
        idx = len(all_chunks)
        entry = {"index": idx, "source_id": source["source_id"], "text": chunk_text}
        all_chunks.append(entry)
        chunk_lookup[idx] = entry

print(f"Loaded: {len(entities)} entities, {len(relationships)} relationships, {len(claims)} claims")
print(f"From {len(sources_data)} sources, {len(all_chunks)} total chunks")
print(f"Entity chunk provenance: {len(entity_chunk_map)} entities mapped to chunks")
print(f"Semantic entity groups: {len(semantic_entity_groups)} groups, {len(entity_to_semantic_group)} entities grouped")
print()
print("=== PER-SOURCE BREAKDOWN ===")
for source in sources_data:
    print(f"  [{source['source_type']}] {source['source_id']}: "
          f"{len(source['entities'])}E {len(source['relationships'])}R {len(source['claims'])}C "
          f"({source['content_length']} chars)")

Loaded: 864 entities, 417 relationships, 5 claims
From 2 sources, 359 total chunks
Entity chunk provenance: 864 entities mapped to chunks
Semantic entity groups: 68 groups, 642 entities grouped

=== PER-SOURCE BREAKDOWN ===
  [arxiv] arxiv:2404.16130: 466E 217R 5C (89608 chars)
  [arxiv] arxiv:2404.18021: 405E 200R 0C (77759 chars)
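
The provenance maps loaded above can be joined back to chunk text. A minimal sketch, assuming each `entity_chunk_map` value is a list of `{"chunk_index", "source_id"}` references and that `chunk_index` is the same global index used as the key of `chunk_lookup`. The helper name `chunks_for_entity` and the toy data are hypothetical:

```python
def chunks_for_entity(name, entity_chunk_map, chunk_lookup, max_chars=80):
    """Return (source_id, snippet) pairs for every chunk an entity was extracted from."""
    results = []
    for ref in entity_chunk_map.get(name, []):
        chunk = chunk_lookup.get(ref["chunk_index"])
        if chunk is None:  # dangling reference: skip rather than fail
            continue
        results.append((chunk["source_id"], chunk["text"][:max_chars]))
    return results

# Toy data in the same shape as the structures built above (values are made up)
toy_chunk_lookup = {
    0: {"index": 0, "source_id": "doc:a", "text": "Graph RAG builds an entity graph."},
    1: {"index": 1, "source_id": "doc:b", "text": "Leiden finds communities."},
}
toy_entity_chunk_map = {
    "LEIDEN": [
        {"chunk_index": 1, "source_id": "doc:b"},
        {"chunk_index": 9, "source_id": "doc:b"},  # index 9 does not exist
    ],
}

snippets = chunks_for_entity("LEIDEN", toy_entity_chunk_map, toy_chunk_lookup)
print(snippets)
```

If the assumption about global indexing does not hold for your extraction output, the lookup key would need to combine `source_id` and a per-source index instead.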


In [4]:
# Preview entities
print(f"=== ENTITIES ({len(entities)} total from {len(sources_data)} sources) ===")
for e in entities[:10]:
    sources = entity_source_map.get(e["name"], [])
    src_str = f" [{len(sources)} sources]" if len(sources) > 1 else ""
    print(f"  [{e['type']}] {e['name']}{src_str}")
if len(entities) > 10:
    print(f"  ... and {len(entities) - 10} more")

=== ENTITIES (864 total from 2 sources) ===
  [ORGANIZATION] MICROSOFT RESEARCH
  [ORGANIZATION] MICROSOFT STRATEGIC MISSIONS AND TECHNOLOGIES
  [ORGANIZATION] MICROSOFT OFFICE OF THE CTO
  [CONCEPT] RETRIEVAL-AUGMENTED GENERATION
  [ORGANIZATION] LARGE LANGUAGE MODELS [2 sources]
  [ORGANIZATION] LLMS
  [CONCEPT] RETRIEVAL-AUGMENTED GENERATION (RAG)
  [LOCATION] EXTERNAL KNOWLEDGE SOURCE
  [LOCATION] PRIVATE DOCUMENT COLLECTIONS
  [EVENT] GLOBAL QUESTIONS
  ... and 854 more


## Step 2: Build NetworkX Graph

In [5]:
# Create directed graph
G = nx.DiGraph()

# Add entity nodes with attributes including source provenance
for entity in entities:
    source_refs = entity_source_map.get(entity["name"], [])
    G.add_node(
        entity["name"],
        type=entity["type"],
        description=entity["description"],
        source_refs=json.dumps(source_refs),
        num_sources=len(source_refs),
    )

# Add relationship edges with attributes
for rel in relationships:
    # Only add edge if both nodes exist
    if rel["source"] in G.nodes and rel["target"] in G.nodes:
        G.add_edge(
            rel["source"],
            rel["target"],
            description=rel["description"],
            weight=rel["strength"]
        )

print(f"Graph created: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")
multi_source_nodes = sum(1 for _, d in G.nodes(data=True) if d.get("num_sources", 1) > 1)
print(f"Multi-source nodes (in 2+ sources): {multi_source_nodes}")

Graph created: 864 nodes, 302 edges
Multi-source nodes (in 2+ sources): 7
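
Only 302 of the 417 loaded relationships became edges. Two things in the cell above can account for the gap: a relationship is skipped when either endpoint is not a known entity, and `nx.DiGraph` keeps a single edge per ordered `(source, target)` pair, so a repeated pair overwrites the earlier edge's attributes. A minimal sketch for counting both cases on made-up data; the helper name `audit_relationships` is hypothetical:

```python
from collections import Counter

def audit_relationships(relationships, entity_names):
    """Count relationships dropped for missing endpoints or collapsed as duplicates."""
    names = set(entity_names)
    missing_endpoint = 0
    pair_counts = Counter()
    for rel in relationships:
        if rel["source"] not in names or rel["target"] not in names:
            missing_endpoint += 1
            continue
        pair_counts[(rel["source"], rel["target"])] += 1
    # Every occurrence of a pair beyond the first is absorbed into the existing edge
    duplicate_pairs = sum(count - 1 for count in pair_counts.values())
    return {
        "total": len(relationships),
        "missing_endpoint": missing_endpoint,
        "duplicate_pairs": duplicate_pairs,
        "unique_edges": len(pair_counts),
    }

toy_rels = [
    {"source": "A", "target": "B"},
    {"source": "A", "target": "B"},  # same ordered pair: collapses into one edge
    {"source": "B", "target": "A"},  # reverse direction is a distinct edge
    {"source": "A", "target": "Z"},  # "Z" is not an entity: skipped
]
report = audit_relationships(toy_rels, ["A", "B", "C"])
print(report)
```

Running the same function on `relationships` and the entity names would show how much of the gap comes from each cause.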


In [6]:
# Display graph structure
print("\n=== GRAPH NODES ===")
for node, attrs in list(G.nodes(data=True))[:10]:
    print(f"  {node} ({attrs.get('type', 'N/A')})")

print("\n=== GRAPH EDGES ===")
for source, target, attrs in list(G.edges(data=True))[:10]:
    print(f"  {source} --> {target}")
    print(f"    {attrs.get('description', 'N/A')[:60]}...")


=== GRAPH NODES ===
  MICROSOFT RESEARCH (ORGANIZATION)
  MICROSOFT STRATEGIC MISSIONS AND TECHNOLOGIES (ORGANIZATION)
  MICROSOFT OFFICE OF THE CTO (ORGANIZATION)
  RETRIEVAL-AUGMENTED GENERATION (CONCEPT)
  LARGE LANGUAGE MODELS (ORGANIZATION)
  LLMS (ORGANIZATION)
  RETRIEVAL-AUGMENTED GENERATION (RAG) (CONCEPT)
  EXTERNAL KNOWLEDGE SOURCE (LOCATION)
  PRIVATE DOCUMENT COLLECTIONS (LOCATION)
  GLOBAL QUESTIONS (EVENT)

=== GRAPH EDGES ===
  MICROSOFT RESEARCH --> MICROSOFT STRATEGIC MISSIONS AND TECHNOLOGIES
    Microsoft Research is a parent entity of Microsoft Strategic...
  MICROSOFT STRATEGIC MISSIONS AND TECHNOLOGIES --> MICROSOFT OFFICE OF THE CTO
    Microsoft Strategic Missions and Technologies is a sub-entit...
  RETRIEVAL-AUGMENTED GENERATION --> LARGE LANGUAGE MODELS
    Retrieval-augmented generation (RAG) is used by large langua...
  RETRIEVAL-AUGMENTED GENERATION --> EXTERNAL KNOWLEDGE SOURCE
    Retrieval-augmented generation (RAG) retrieves relevant info...
  RETRIE

## Step 3: Compute Graph Metrics

Calculate centrality scores for ranking node importance.
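
To make the PageRank scores easier to interpret, here is a minimal pure-Python sketch of the power-iteration idea on a toy directed graph. It is an illustration only, not NetworkX's implementation: it ignores edge weights (the notebook passes `weight="weight"`), and the function name, damping value of 0.85, and toy edges are assumptions chosen for the example.

```python
def pagerank_power_iteration(edges, damping=0.85, tol=1e-10, max_iter=200):
    """Unweighted PageRank on a directed edge list via power iteration."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing edges is spread uniformly
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank

# Three nodes point at HUB, and HUB points back at A
toy_edges = [("A", "HUB"), ("B", "HUB"), ("C", "HUB"), ("HUB", "A")]
scores = pagerank_power_iteration(toy_edges)
for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{score:.4f}  {node}")
```

The scores sum to 1, which is why individual values in a graph with hundreds of nodes are small; a node ranks highly when it receives links from other well-ranked nodes, as `A` does here by being the only target of `HUB`.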

In [7]:
# Undirected copy, used for betweenness centrality here and for Leiden in Step 4
G_undirected = G.to_undirected()

# PageRank - importance based on incoming connections
pagerank = nx.pagerank(G, weight="weight")

# Degree centrality - in-degree plus out-degree, normalized by (n - 1)
degree_centrality = nx.degree_centrality(G)

# Betweenness centrality - bridges between clusters
betweenness = nx.betweenness_centrality(G_undirected)

# Store metrics on nodes
for node in G.nodes:
    G.nodes[node]["pagerank"] = pagerank.get(node, 0)
    G.nodes[node]["degree_centrality"] = degree_centrality.get(node, 0)
    G.nodes[node]["betweenness"] = betweenness.get(node, 0)

print("Graph metrics computed: pagerank, degree_centrality, betweenness")

Graph metrics computed: pagerank, degree_centrality, betweenness


In [8]:
# Top entities by PageRank
print("\n=== TOP ENTITIES BY PAGERANK ===")
sorted_by_pagerank = sorted(pagerank.items(), key=lambda x: -x[1])
for node, score in sorted_by_pagerank[:10]:
    node_type = G.nodes[node].get("type", "N/A")
    print(f"  {score:.4f} | [{node_type}] {node}")


=== TOP ENTITIES BY PAGERANK ===
  0.0093 | [PRODUCT] NEBULAGRAPH
  0.0087 | [] 
  0.0073 | [PERSON] LIANG
  0.0070 | [PERSON] W.-T.
  0.0067 | [PRODUCT] SS
  0.0059 | [ORGANIZATION] LANGCHAIN
  0.0053 | [ORGANIZATION] LLAMAINDEX
  0.0053 | [CONCEPT] GLOBAL ANSWER
  0.0047 | [PRODUCT] TASK EXECUTOR
  0.0046 | [PRODUCT] TS


In [9]:
import plotly.graph_objects as go

# Use spring layout for positions (only connected nodes for clarity)
connected = [n for n in G.nodes if G.degree(n) > 0]
H = G.subgraph(connected)
pos = nx.spring_layout(H, k=0.5, seed=42)

# Edge traces
edge_x, edge_y = [], []
mid_x, mid_y, mid_text = [], [], []
for u, v, attrs in H.edges(data=True):
    x0, y0 = pos[u]
    x1, y1 = pos[v]
    edge_x += [x0, x1, None]
    edge_y += [y0, y1, None]
    # Midpoint for hover label
    mid_x.append((x0 + x1) / 2)
    mid_y.append((y0 + y1) / 2)
    desc = attrs.get("description", "")
    mid_text.append(f"{u} → {v}<br>{desc}")

edge_trace = go.Scatter(x=edge_x, y=edge_y, mode='lines',
                        line=dict(width=0.5, color='#888'), hoverinfo='none')

# Invisible midpoint markers for edge hover
edge_mid_trace = go.Scatter(
    x=mid_x, y=mid_y, mode='markers', hovertext=mid_text, hoverinfo='text',
    marker=dict(size=10, color='rgba(0,0,0,0)'),  # invisible
    showlegend=False)

# Node traces
node_x = [pos[n][0] for n in H.nodes]
node_y = [pos[n][1] for n in H.nodes]
node_pr = [pagerank.get(n, 0) for n in H.nodes]
node_text = [f"{n}<br>PR={pagerank.get(n,0):.4f}<br>deg={G.degree(n)}" for n in H.nodes]

node_trace = go.Scatter(
    x=node_x, y=node_y, mode='markers', hovertext=node_text, hoverinfo='text',
    marker=dict(size=[max(6, pr * 3000) for pr in node_pr],
                color=node_pr, colorscale='YlOrRd', showscale=True,
                colorbar=dict(title='PageRank'), line=dict(width=0.5, color='#333')))

fig = go.Figure(data=[edge_trace, edge_mid_trace, node_trace],
                layout=go.Layout(
                    title=f'Knowledge Graph ({H.number_of_nodes()} connected nodes, {H.number_of_edges()} edges)',
                    showlegend=False, hovermode='closest',
                    xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                    yaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                    template='plotly_dark', width=900, height=700))

# Write to HTML and open in browser (same pattern as Cytoscape viz)
fig.write_html("graph_plotly.html")
webbrowser.open(Path("graph_plotly.html").resolve().as_uri())
print(f"Plotly graph opened in browser ({H.number_of_nodes()} nodes, {H.number_of_edges()} edges)")

Plotly graph opened in browser (324 nodes, 302 edges)


## Step 4: Community Detection

This step uses the Leiden algorithm to find clusters of related entities. Leiden guarantees well-connected communities, fixing a known flaw in Louvain, where communities can end up internally disconnected.
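
The modularity score reported by the Leiden cell compares the share of edges that fall inside communities with the share expected if edges were placed at random while keeping node degrees. A minimal sketch of that calculation for an undirected, unweighted toy graph, using Q = Σ_c [L_c / m − (d_c / 2m)²], where L_c is the number of edges inside community c, d_c is the sum of its members' degrees, and m is the total edge count. The function name and toy graph are hypothetical, and the sketch ignores the edge weights that the Leiden cell passes to `find_partition`:

```python
def modularity(edges, membership):
    """Modularity of an undirected, unweighted graph for a node -> community mapping."""
    m = len(edges)
    if m == 0:
        return 0.0
    internal = {}    # community -> number of edges with both ends inside it
    degree_sum = {}  # community -> sum of member degrees
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )

# Two triangles joined by a single bridge edge (c-d)
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}

q_good = modularity(toy_edges, two_groups)
q_single = modularity(toy_edges, one_group)
print(f"two triangles: {q_good:.4f}, single community: {q_single:.4f}")
```

Splitting the graph along the bridge scores higher than lumping every node together, which is the behaviour a modularity-optimizing method exploits.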

In [10]:
# Leiden community detection (works on undirected graphs)
# Convert NetworkX -> igraph (Leiden requires igraph)
G_ig = ig.Graph.from_networkx(G_undirected)

# Map edge weights (from_networkx stores them as edge attributes)
weights = G_ig.es["weight"] if "weight" in G_ig.es.attributes() else None

# Run Leiden with modularity optimization
leiden_partition = leidenalg.find_partition(
    G_ig,
    leidenalg.ModularityVertexPartition,
    weights=weights,
)

# Build partition dict: node_name -> community_id
# igraph preserves node order from NetworkX via the _nx_name attribute
partition = {}
for comm_id, members in enumerate(leiden_partition):
    for vertex_idx in members:
        node_name = G_ig.vs[vertex_idx]["_nx_name"]
        partition[node_name] = comm_id

# Store community assignment on nodes
for node, community_id in partition.items():
    G.nodes[node]["community"] = community_id

# Count communities
num_communities = max(partition.values()) + 1 if partition else 0
print(f"Detected {num_communities} communities (Leiden)")

# Modularity score (quality of partition)
modularity = leiden_partition.modularity
print(f"Modularity score: {modularity:.4f}")

Detected 613 communities (Leiden)
Modularity score: 0.9273
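
The Plotly cell above reported 324 connected nodes out of 864, which leaves 540 nodes without any edges. A node with no edges gives a modularity-based method no reason to merge it, so it will normally remain in its own community, which suggests that most of the 613 communities are singletons. A minimal sketch of one way to set those aside before spending LLM calls on summaries; the helper name `split_by_size` and the `min_size` threshold are assumptions, not part of the pipeline:

```python
def split_by_size(communities, min_size=2):
    """Split a {community_id: [members]} mapping into kept and skipped communities."""
    kept = {cid: members for cid, members in communities.items() if len(members) >= min_size}
    skipped = {cid: members for cid, members in communities.items() if len(members) < min_size}
    return kept, skipped

# Toy partition in the same node -> community_id shape as `partition`
toy_partition = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 2, "G": 3}
toy_communities = {}
for node, cid in toy_partition.items():
    toy_communities.setdefault(cid, []).append(node)

kept, skipped = split_by_size(toy_communities)
print(f"{len(kept)} communities to summarize, {len(skipped)} singletons skipped")
```

Whether singletons should be summarized at all is a design choice; skipping them trades completeness of the summaries table for far fewer model calls.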


In [11]:
# Group entities by community
communities: dict[int, list[str]] = {}
for node, community_id in partition.items():
    if community_id not in communities:
        communities[community_id] = []
    communities[community_id].append(node)

print("\n=== COMMUNITIES ===")
for comm_id, members in sorted(communities.items()):
    # Sort members by PageRank within community
    sorted_members = sorted(members, key=lambda x: -pagerank.get(x, 0))
    print(f"\nCommunity {comm_id} ({len(members)} members):")
    for member in sorted_members[:5]:
        node_type = G.nodes[member].get("type", "N/A")
        print(f"  [{node_type}] {member}")
    if len(members) > 5:
        print(f"  ... and {len(members) - 5} more")


=== COMMUNITIES ===

Community 0 (39 members):
  [PRODUCT] VALIDATION STRATEGIES
  [PRODUCT] PROTOCOL
  [PRODUCT] CRISPR-GPT
  [CONCEPT] META-TASKS
  [EVENT] ANALYSIS
  ... and 34 more

Community 1 (19 members):
  [PUBLICATION] ARXIV PREPRINT ARXIV:2312.11805
  [JOURNAL] CELL
  [LOCATION] CHROMOSOME 7
  [CONCEPT] CRISPR
  [LOCATION] HEPG2
  ... and 14 more

Community 2 (17 members):
  [PRODUCT] NEBULAGRAPH
  [ORGANIZATION] LANGCHAIN
  [ORGANIZATION] LLAMAINDEX
  [PRODUCT] GRAPHRAG
  [PRODUCT] NEO4J
  ... and 12 more

Community 3 (13 members):
  [CONCEPT] GLOBAL ANSWER
  [EVENT] QUERY-FOCUSED SUMMARIZATION
  [CONCEPT] ENTITY EXTRACTOR
  [PRODUCT] COMMUNITY ANSWERS
  [CONCEPT] KNOWLEDGE GRAPH
  ... and 8 more

Community 4 (11 members):
  [] 
  [PERSON] AMBER HOAK
  [PERSON] ANDR´ES MORALES ESQUIVEL
  [PERSON] BEN CUTLER
  [PERSON] BILLIE RINALDI
  ... and 6 more

Community 5 (10 members):
  [ORGANIZATION] NAACL-HLT
  [ORGANIZATION] OPENAI
  [LOCATION] KAU
  [PERSON] MELNYK ET AL.
  [PER

## Step 5: Generate Community Summaries

Create LLM-powered summaries for each community following GraphRAG's report format.
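
Small local models do not always return bare JSON: the reply may be wrapped in a code fence or surrounded by commentary. A minimal sketch of a more tolerant parser built on the standard library's `json.JSONDecoder.raw_decode`; the helper name `extract_json_object` is hypothetical, and this is one possible approach rather than the parser used in the cell that follows:

```python
import json

def extract_json_object(text):
    """Return the first decodable JSON object found in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # Not valid JSON from this brace: try the next candidate
        start = text.find("{", start + 1)
    return None

fence = "`" * 3  # built at runtime so the example contains no literal fence
reply = (
    "Here is the report:\n" + fence + "json\n"
    + '{"title": "Toy", "key_insights": ["a", "b"]}'
    + "\n" + fence + "\nDone."
)
parsed = extract_json_object(reply)
print(parsed)
```

Returning `None` instead of raising lets the caller fall back to a placeholder summary, in the same spirit as the `except json.JSONDecodeError` branch in the notebook's function.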

In [12]:
@dataclass
class CommunitySummary:
    community_id: int
    title: str
    summary: str
    key_entities: list[str]
    key_insights: list[str]

COMMUNITY_SUMMARY_PROMPT = """
You are an expert analyst creating a summary report for a knowledge graph community.

Given the following entities and their relationships, create a structured summary.

ENTITIES IN THIS COMMUNITY:
{entities_info}

RELATIONSHIPS:
{relationships_info}

RELEVANT CLAIMS:
{claims_info}

Create a JSON response with:
1. title: A short descriptive title for this community (5-10 words)
2. summary: A 2-3 sentence executive summary of what this community represents
3. key_insights: 3-5 bullet points of key facts or relationships

Return ONLY valid JSON:
{{
  "title": "...",
  "summary": "...",
  "key_insights": ["...", "...", "..."]
}}

JSON OUTPUT:
"""

def generate_community_summary(community_id: int, members: list[str], G: nx.DiGraph, claims: list[dict]) -> CommunitySummary:
    """Generate a summary for a community using the LLM."""
    
    # Gather entity info
    entities_info = []
    for member in members:
        node_data = G.nodes[member]
        entities_info.append(f"- {member} ({node_data.get('type', 'N/A')}): {node_data.get('description', 'N/A')}")
    
    # Gather relationships within community (a set gives O(1) membership checks)
    member_set = set(members)
    relationships_info = []
    for source, target, data in G.edges(data=True):
        if source in member_set and target in member_set:
            relationships_info.append(f"- {source} -> {target}: {data.get('description', 'N/A')}")
    
    # Gather relevant claims
    claims_info = []
    for claim in claims:
        if claim["subject"] in member_set:
            claims_info.append(f"- [{claim['claim_type']}] {claim['subject']}: {claim['description']}")
    
    prompt = COMMUNITY_SUMMARY_PROMPT.format(
        entities_info="\n".join(entities_info) or "No entities",
        relationships_info="\n".join(relationships_info) or "No relationships",
        claims_info="\n".join(claims_info[:10]) or "No claims"  # Limit claims
    )
    
    response = chat_ollama(prompt)
    
    # Parse JSON
    json_str = response.strip()
    if json_str.startswith("```"):
        json_str = json_str.split("```")[1]
        if json_str.startswith("json"):
            json_str = json_str[4:]
    json_str = json_str.strip()
    
    try:
        data = json.loads(json_str)
        return CommunitySummary(
            community_id=community_id,
            title=data.get("title", f"Community {community_id}"),
            summary=data.get("summary", ""),
            key_entities=members[:5],  # Top 5; assumes the caller passes members sorted by PageRank
            key_insights=data.get("key_insights", [])
        )
    except json.JSONDecodeError as ex:
        print(f"Failed to parse JSON for community {community_id}: {ex}")
        print(f"Raw response: {response}")
        return CommunitySummary(
            community_id=community_id,
            title=f"Community {community_id}",
            summary="Summary generation failed",
            key_entities=members[:5],
            key_insights=[]
        )

In [13]:
# Generate summaries for each community
community_summaries: list[CommunitySummary] = []

for comm_id, members in sorted(communities.items()):
    print(f"Generating summary for Community {comm_id} ({len(members)} members)...")
    # Sort members by PageRank
    sorted_members = sorted(members, key=lambda x: -pagerank.get(x, 0))
    summary = generate_community_summary(comm_id, sorted_members, G, claims)
    community_summaries.append(summary)
    print(f"  Title: {summary.title}")

print(f"\nGenerated {len(community_summaries)} community summaries")

Generating summary for Community 0 (39 members)...
  Title: CRISPR-GPT Gene Editing Community
Generating summary for Community 1 (19 members)...
  Title: CRISPR Gene Editing Community
Generating summary for Community 2 (17 members)...
  Title: GraphRAG and Related Technologies Community
Generating summary for Community 3 (13 members)...
  Title: Community Knowledge Hub
Generating summary for Community 4 (11 members)...
  Title: Contributor Community Network
Generating summary for Community 5 (10 members)...
  Title: Knowledge Graph Extraction Community
Generating summary for Community 6 (10 members)...
  Title: Text Summarization and Stock Market Analysis Community
Generating summary for Community 7 (8 members)...
  Title: Graph Analysis Community Report
Generating summary for Community 8 (6 members)...
  Title: Knowledge Graph Community Analysis
Generating summary for Community 9 (6 members)...
  Title: Knowledge Graph Community Overview
Generating summary for Community 10 (6 members)

In [14]:
# Display community summaries
print("\n" + "="*60)
print("COMMUNITY SUMMARIES")
print("="*60)

for summary in community_summaries:
    print(f"\n### Community {summary.community_id}: {summary.title}")
    print(f"\n{summary.summary}")
    print(f"\nKey Entities: {', '.join(summary.key_entities)}")
    print(f"\nKey Insights:")
    for insight in summary.key_insights:
        print(f"  - {insight}")


COMMUNITY SUMMARIES

### Community 0: CRISPR-GPT Gene Editing Community

The CRISPR-GPT gene editing community is a comprehensive platform that integrates various tools and techniques for designing, validating, and executing CRISPR-based experiments. It includes AI-driven systems like CRISPR-GPT, which offers expertly defined pipelines for gene editing tasks, along with retrieval techniques and validation strategies.

Key Entities: VALIDATION STRATEGIES, PROTOCOL, CRISPR-GPT, META-TASKS, ANALYSIS

Key Insights:
  - CRISPR-GPT is an AI-assisted tool designed to streamline the process of designing and validating CRISPR-based experiments.
  - The community leverages various tools such as LLM-powered design and planning engines, retrieval techniques, and expertly defined pipelines for gene editing tasks.
  - CRISPR-GPT provides comprehensive support from initial protocol design to final validation steps, ensuring accuracy and effectiveness in experimental outcomes.
  - Users can access di

## Step 6: Store in SQLite

Persist the graph structure, metrics, sources, and summaries.

In [15]:
# Create SQLite database (delete and recreate for clean schema on re-runs)
# We delete the file instead of DROP TABLE because notebook 03 creates
# sqlite-vec virtual tables (vec0 module) that can't be dropped without
# loading the extension.
if DB_PATH.exists():
    DB_PATH.unlink()
    print(f"Deleted existing {DB_PATH}")

conn = sqlite3.connect(DB_PATH)
cursor = conn.cursor()

# Create tables with current schema
cursor.executescript("""
-- Sources table: tracks ingested documents
CREATE TABLE sources (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source_id TEXT UNIQUE NOT NULL,
    source_type TEXT,
    title TEXT,
    url TEXT,
    content_type TEXT,
    content_length INTEGER,
    fetched_at TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Entities table
CREATE TABLE entities (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT UNIQUE NOT NULL,
    type TEXT,
    description TEXT,
    pagerank REAL DEFAULT 0,
    degree_centrality REAL DEFAULT 0,
    betweenness REAL DEFAULT 0,
    community_id INTEGER,
    source_refs TEXT,           -- JSON array of source_ids
    num_sources INTEGER DEFAULT 1,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Relationships table
CREATE TABLE relationships (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source_id INTEGER REFERENCES entities(id),
    target_id INTEGER REFERENCES entities(id),
    description TEXT,
    weight REAL DEFAULT 1.0,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Claims table
CREATE TABLE claims (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    subject_id INTEGER REFERENCES entities(id),
    claim_type TEXT,
    description TEXT,
    claim_date TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Community summaries table
CREATE TABLE community_summaries (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    community_id INTEGER UNIQUE NOT NULL,
    title TEXT,
    summary TEXT,
    key_entities TEXT,  -- JSON array
    key_insights TEXT,  -- JSON array
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Chunks table (source text)
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    content TEXT,
    chunk_index INTEGER,
    source_ref TEXT,            -- references sources.source_id
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);


-- Semantic entity groups (compound node overlay from cross-doc merge)
CREATE TABLE IF NOT EXISTS semantic_groups (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    group_id INTEGER UNIQUE NOT NULL,
    canonical TEXT NOT NULL,
    members TEXT NOT NULL,            -- JSON array of entity names
    member_similarities TEXT,         -- JSON object {name: score}
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Entity-to-chunk provenance (which chunks an entity was extracted from)
CREATE TABLE IF NOT EXISTS entity_chunk_map (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    entity_name TEXT NOT NULL,
    chunk_index INTEGER NOT NULL,
    source_id TEXT NOT NULL,
    FOREIGN KEY (entity_name) REFERENCES entities(name)
);

-- Create indexes
CREATE INDEX idx_entities_name ON entities(name);
CREATE INDEX idx_entities_community ON entities(community_id);
CREATE INDEX idx_relationships_source ON relationships(source_id);
CREATE INDEX idx_relationships_target ON relationships(target_id);
CREATE INDEX idx_chunks_source ON chunks(source_ref);
CREATE INDEX idx_sources_source_id ON sources(source_id);

CREATE INDEX idx_entity_chunk_map_entity ON entity_chunk_map(entity_name);
CREATE INDEX idx_semantic_groups_gid ON semantic_groups(group_id);

""")

conn.commit()
print("Database tables created (fresh schema with sources, semantic_groups, entity_chunk_map tables)")

Deleted existing graphrag.db
Database tables created (fresh schema with sources, semantic_groups, entity_chunk_map tables)


In [16]:
# Insert entities (with source provenance)
entity_id_map: dict[str, int] = {}

for node, attrs in G.nodes(data=True):
    cursor.execute("""
        INSERT INTO entities (name, type, description, pagerank, degree_centrality, betweenness, community_id, source_refs, num_sources)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
    """, (
        node,
        attrs.get("type"),
        attrs.get("description"),
        attrs.get("pagerank", 0),
        attrs.get("degree_centrality", 0),
        attrs.get("betweenness", 0),
        attrs.get("community"),
        attrs.get("source_refs", "[]"),
        attrs.get("num_sources", 1),
    ))
    entity_id_map[node] = cursor.lastrowid

conn.commit()
print(f"Inserted {len(entity_id_map)} entities")

Inserted 864 entities


In [17]:
# Insert relationships
rel_count = 0
for source, target, attrs in G.edges(data=True):
    source_id = entity_id_map.get(source)
    target_id = entity_id_map.get(target)
    if source_id is not None and target_id is not None:
        cursor.execute("""
            INSERT INTO relationships (source_id, target_id, description, weight)
            VALUES (?, ?, ?, ?)
        """, (
            source_id,
            target_id,
            attrs.get("description"),
            attrs.get("weight", 1.0)
        ))
        rel_count += 1

conn.commit()
print(f"Inserted {rel_count} relationships")

Inserted 302 relationships


In [18]:
# Insert claims
claim_count = 0
for claim in claims:
    subject_id = entity_id_map.get(claim["subject"])
    if subject_id is not None:
        cursor.execute("""
            INSERT INTO claims (subject_id, claim_type, description, claim_date)
            VALUES (?, ?, ?, ?)
        """, (
            subject_id,
            claim.get("claim_type"),
            claim.get("description"),
            claim.get("date")
        ))
        claim_count += 1

conn.commit()
print(f"Inserted {claim_count} claims")

Inserted 5 claims


In [19]:
# Insert community summaries
for summary in community_summaries:
    cursor.execute("""
        INSERT INTO community_summaries (community_id, title, summary, key_entities, key_insights)
        VALUES (?, ?, ?, ?, ?)
    """, (
        summary.community_id,
        summary.title,
        summary.summary,
        json.dumps(summary.key_entities),
        json.dumps(summary.key_insights)
    ))

conn.commit()
print(f"Inserted {len(community_summaries)} community summaries")

Inserted 613 community summaries


In [20]:
# Insert chunks with source provenance
chunk_count = 0
for source in sources_data:
    source_id = source["source_id"]
    for chunk in source["chunks"]:  # chunk_count doubles as the global chunk index
        cursor.execute("""
            INSERT INTO chunks (content, chunk_index, source_ref)
            VALUES (?, ?, ?)
        """, (chunk, chunk_count, source_id))
        chunk_count += 1

conn.commit()
print(f"Inserted {chunk_count} chunks (with source_ref)")

Inserted 359 chunks (with source_ref)


In [21]:
# Insert source records
for source in sources_data:
    cursor.execute("""
        INSERT INTO sources (source_id, source_type, title, url, content_type, content_length, fetched_at)
        VALUES (?, ?, ?, ?, ?, ?, ?)
    """, (
        source["source_id"],
        source["source_type"],
        source["title"],
        source["url"],
        source["content_type"],
        source["content_length"],
        source.get("fetched_at", ""),
    ))

conn.commit()
print(f"Inserted {len(sources_data)} source records")

Inserted 2 source records


In [22]:
# Insert semantic entity groups
for group in semantic_entity_groups:
    cursor.execute(
        "INSERT INTO semantic_groups (group_id, canonical, members, member_similarities) VALUES (?, ?, ?, ?)",
        (group["group_id"], group["canonical"],
         json.dumps(group["members"]), json.dumps(group.get("member_similarities", {}))),
    )
conn.commit()
print(f"Inserted {len(semantic_entity_groups)} semantic entity groups")

Inserted 68 semantic entity groups


In [23]:
# Insert entity-chunk provenance map
ecm_count = 0
for entity_name, refs in entity_chunk_map.items():
    for ref in refs:
        cursor.execute(
            "INSERT INTO entity_chunk_map (entity_name, chunk_index, source_id) VALUES (?, ?, ?)",
            (entity_name, ref["chunk_index"], ref["source_id"]),
        )
        ecm_count += 1
conn.commit()
print(f"Inserted {ecm_count} entity-chunk provenance records")

Inserted 1166 entity-chunk provenance records


In [24]:
# Verify data
print("\n=== DATABASE SUMMARY ===")
for table in ["sources", "entities", "relationships", "claims", "community_summaries", "chunks", "semantic_groups", "entity_chunk_map"]:
    cursor.execute(f"SELECT COUNT(*) FROM {table}")
    count = cursor.fetchone()[0]
    print(f"  {table}: {count} rows")

# Show multi-source entities
cursor.execute("SELECT name, source_refs, num_sources FROM entities WHERE num_sources > 1 ORDER BY num_sources DESC")
multi = cursor.fetchall()
if multi:
    print(f"\n=== MULTI-SOURCE ENTITIES ({len(multi)}) ===")
    for name, refs, n in multi:
        print(f"  {name}: {n} sources — {refs}")


=== DATABASE SUMMARY ===
  sources: 2 rows
  entities: 864 rows
  relationships: 302 rows
  claims: 5 rows
  community_summaries: 613 rows
  chunks: 359 rows
  semantic_groups: 68 rows
  entity_chunk_map: 1166 rows

=== MULTI-SOURCE ENTITIES (7) ===
  LARGE LANGUAGE MODELS: 2 sources — ["arxiv:2404.16130", "arxiv:2404.18021"]
  LLM: 2 sources — ["arxiv:2404.16130", "arxiv:2404.18021"]
  EXAMPLE CORP: 2 sources — ["arxiv:2404.16130", "arxiv:2404.18021"]
  GPT-4: 2 sources — ["arxiv:2404.16130", "arxiv:2404.18021"]
  APPENDIX B: 2 sources — ["arxiv:2404.16130", "arxiv:2404.18021"]
  USER: 2 sources — ["arxiv:2404.16130", "arxiv:2404.18021"]
  ARXIV: 2 sources — ["arxiv:2404.16130", "arxiv:2404.18021"]


In [25]:
# Sample query: Top entities by PageRank
print("\n=== TOP ENTITIES (from DB) ===")
cursor.execute("""
    SELECT name, type, pagerank, community_id 
    FROM entities 
    ORDER BY pagerank DESC 
    LIMIT 10
""")
for row in cursor.fetchall():
    print(f"  {row[2]:.4f} | [{row[1]}] {row[0]} (Community {row[3]})")


=== TOP ENTITIES (from DB) ===
  0.0093 | [PRODUCT] NEBULAGRAPH (Community 2)
  0.0087 | []  (Community 4)
  0.0073 | [PERSON] LIANG (Community 10)
  0.0070 | [PERSON] W.-T. (Community 10)
  0.0067 | [PRODUCT] SS (Community 6)
  0.0059 | [ORGANIZATION] LANGCHAIN (Community 2)
  0.0053 | [ORGANIZATION] LLAMAINDEX (Community 2)
  0.0053 | [CONCEPT] GLOBAL ANSWER (Community 3)
  0.0047 | [PRODUCT] TASK EXECUTOR (Community 11)
  0.0046 | [PRODUCT] TS (Community 6)


In [26]:
# Sample query: Get community with its entities
print("\n=== COMMUNITY 0 DETAILS (from DB) ===")
cursor.execute("""
    SELECT title, summary FROM community_summaries WHERE community_id = 0
""")
row = cursor.fetchone()
if row:
    print(f"Title: {row[0]}")
    print(f"Summary: {row[1]}")
    
    cursor.execute("""
        SELECT name, type FROM entities WHERE community_id = 0 ORDER BY pagerank DESC LIMIT 5
    """)
    print("\nTop Members:")
    for name, entity_type in cursor.fetchall():
        print(f"  [{entity_type}] {name}")


=== COMMUNITY 0 DETAILS (from DB) ===
Title: CRISPR-GPT Gene Editing Community
Summary: The CRISPR-GPT gene editing community is a comprehensive platform that integrates various tools and techniques for designing, validating, and executing CRISPR-based experiments. It includes AI-driven systems like CRISPR-GPT, which offers expertly defined pipelines for gene editing tasks, along with retrieval techniques and validation strategies.

Top Members:
  [PRODUCT] VALIDATION STRATEGIES
  [PRODUCT] PROTOCOL
  [PRODUCT] CRISPR-GPT
  [CONCEPT] META-TASKS
  [EVENT] ANALYSIS
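
Because `relationships` stores integer entity IDs, reading edges back with their names needs a join against `entities` on both endpoints. A minimal sketch against an in-memory database with a trimmed-down copy of the schema above; the rows are made up and the intra-community filter is only one example of a useful condition:

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE entities (
    id INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL,
    community_id INTEGER
);
CREATE TABLE relationships (
    id INTEGER PRIMARY KEY,
    source_id INTEGER REFERENCES entities(id),
    target_id INTEGER REFERENCES entities(id),
    description TEXT,
    weight REAL DEFAULT 1.0
);
INSERT INTO entities (id, name, community_id) VALUES (1, 'A', 0), (2, 'B', 0), (3, 'C', 1);
INSERT INTO relationships (source_id, target_id, description, weight) VALUES
    (1, 2, 'A uses B', 8.0),
    (2, 3, 'B cites C', 3.0);
""")

# Join entities twice: once for the source endpoint, once for the target
named_edges = demo.execute("""
    SELECT s.name, t.name, r.description, r.weight
    FROM relationships AS r
    JOIN entities AS s ON s.id = r.source_id
    JOIN entities AS t ON t.id = r.target_id
    WHERE s.community_id = t.community_id   -- intra-community edges only
    ORDER BY r.weight DESC
""").fetchall()
demo.close()
print(named_edges)
```

The same query shape should work against `graphrag.db`, since the column names match the tables created in this step.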


In [27]:
# Close connection
conn.close()
print(f"\nDatabase saved to: {DB_PATH.absolute()}")


Database saved to: /Users/pacho-home-server/daily-knowledge-ingestion-assistant/notebooks/graphrag.db


## Step 7: Generate Visualization

The interactive knowledge graph visualization is generated by a standalone script
that reads from the SQLite database:

```bash
python scripts/generate_viz.py
```

This produces a self-contained HTML file with multi-level Cytoscape.js drill-down:
- **Level 0**: Community meta-nodes (~20-40), sized by member count
- **Level 1**: Click community to expand entity children with intra-community edges
- **Level 2**: Click entity to expand chunk nodes with source text tooltips

See `scripts/generate_viz.py` for details.
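
`scripts/generate_viz.py` is not shown in this notebook. As a rough illustration of the Level 0 view only, the sketch below turns community member counts into Cytoscape.js-style element dictionaries of the form `{"data": {...}}`. The function name, the `label` and `size` fields, and the toy tables are assumptions and may differ from what the script actually does:

```python
import sqlite3

def community_meta_nodes(conn):
    """Build one Cytoscape.js-style node per community, sized by member count."""
    rows = conn.execute("""
        SELECT e.community_id,
               COUNT(*) AS member_count,
               COALESCE(cs.title, 'Community ' || e.community_id) AS label
        FROM entities AS e
        LEFT JOIN community_summaries AS cs ON cs.community_id = e.community_id
        WHERE e.community_id IS NOT NULL
        GROUP BY e.community_id, cs.title
        ORDER BY member_count DESC, e.community_id
    """).fetchall()
    return [
        {"data": {"id": f"community-{cid}", "label": label, "size": count}}
        for cid, count, label in rows
    ]

# Toy database with only the columns this sketch needs (rows are made up)
toy = sqlite3.connect(":memory:")
toy.executescript("""
CREATE TABLE entities (id INTEGER PRIMARY KEY, name TEXT, community_id INTEGER);
CREATE TABLE community_summaries (community_id INTEGER UNIQUE NOT NULL, title TEXT);
INSERT INTO entities (name, community_id) VALUES ('A', 0), ('B', 0), ('C', 0), ('D', 1);
INSERT INTO community_summaries (community_id, title) VALUES (0, 'Toy Cluster');
""")
elements = community_meta_nodes(toy)
toy.close()
print(elements)
```

Communities without a stored summary fall back to a generic label, so the view still renders if summary generation was skipped for some of them.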

## Summary

This notebook completed:

1. **Graph Construction** - Built NetworkX graph from multi-source entities and relationships
2. **Graph Metrics** - Computed PageRank, degree centrality, and betweenness
3. **Community Detection** - Applied Leiden algorithm to find cross-domain topic clusters
4. **Community Summaries** - Generated LLM-powered summaries for each cluster
5. **SQLite Storage** - Persisted graph, sources, provenance, semantic groups, and entity-chunk map
6. **Visualization** - Run `python scripts/generate_viz.py` to generate interactive Cytoscape.js HTML

## Next Steps

In the next notebook we will:
1. **Add embeddings** - Embed entities and chunks using nomic-embed-text
2. **Vector search** - Set up sqlite-vec for semantic retrieval
3. **Triple-factor retrieval** - Combine semantic + temporal (content-type-aware) + graph centrality
4. **Cross-domain queries** - Test retrieval across all 7 source domains