# GraphRAG Step 2: Graph Construction & Community Detection

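This notebook continues the GraphRAG pipeline from notebook 01. It turns the extracted entities and relationships into a graph, detects communities, summarizes them with an LLM, and stores the results in SQLite.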

## Pipeline Steps
1. **Load extraction results** - From notebook 01
2. **Build NetworkX graph** - Entities as nodes, relationships as edges
3. **Compute graph metrics** - PageRank, centrality, degree
4. **Community detection** - Louvain algorithm for topic clustering
5. **Generate community summaries** - LLM-powered hierarchical summaries
6. **Store in SQLite** - Persist graph structure and summaries

## Setup

In [1]:
import json
import sqlite3
from pathlib import Path
from dataclasses import dataclass, asdict

import httpx
import networkx as nx
import community as community_louvain  # python-louvain

OLLAMA_BASE_URL = "http://localhost:11434"
MODEL = "qwen2.5:3b"
DB_PATH = Path("graphrag.db")

In [2]:
def chat_ollama(prompt: str, system: str = "", temperature: float = 0.0) -> str:
    """Send a chat request to Ollama and return the response."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    
    response = httpx.post(
        f"{OLLAMA_BASE_URL}/api/chat",
        json={
            "model": MODEL,
            "messages": messages,
            "stream": False,
            "options": {"temperature": temperature}
        },
        timeout=120.0
    )
    response.raise_for_status()
    return response.json()["message"]["content"]

## Step 1: Load Extraction Results

In [3]:
# Load results from notebook 01
with open("extraction_results.json", "r") as f:
    data = json.load(f)

entities = data["entities"]
relationships = data["relationships"]
claims = data["claims"]
chunks = data["chunks"]

print(f"Loaded: {len(entities)} entities, {len(relationships)} relationships, {len(claims)} claims")

Loaded: 14 entities, 7 relationships, 5 claims
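
The cells that follow index each record by key (`name`, `type`, `description`, `source`, `target`, `strength`, `subject`, `claim_type`), so a malformed extraction file only fails later with a `KeyError`. A minimal sketch of an up-front shape check, assuming the field names this notebook reads; `REQUIRED_FIELDS` and `find_shape_problems` are hypothetical helpers, not part of notebook 01:

```python
# Field names mirror how this notebook reads each record;
# the helper names are illustrative assumptions.
REQUIRED_FIELDS = {
    "entities": {"name", "type", "description"},
    "relationships": {"source", "target", "description", "strength"},
    "claims": {"subject", "claim_type", "description"},
}


def find_shape_problems(data: dict) -> list:
    """Return a list of problems; an empty list means the shape looks usable."""
    problems = []
    for section, required in REQUIRED_FIELDS.items():
        if section not in data:
            problems.append(f"missing section: {section}")
            continue
        for index, record in enumerate(data[section]):
            missing = required - set(record)
            if missing:
                problems.append(f"{section}[{index}] missing {sorted(missing)}")
    if "chunks" not in data:
        problems.append("missing section: chunks")
    return problems


# Toy data for illustration only (not the real extraction output).
sample = {
    "entities": [{"name": "OPENAI", "type": "ORGANIZATION", "description": "AI lab"}],
    "relationships": [{"source": "OPENAI", "target": "GPT-5", "description": "develops"}],
    "claims": [],
    "chunks": ["some source text"],
}
print(find_shape_problems(sample))
```

Here the toy relationship lacks `strength`, so the check reports it before the graph-building cell would hit it.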


In [4]:
# Preview entities
print("=== ENTITIES ===")
for e in entities[:5]:
    print(f"  [{e['type']}] {e['name']}")
if len(entities) > 5:
    print(f"  ... and {len(entities) - 5} more")

=== ENTITIES ===
  [ORGANIZATION] OPENAI
  [ORGANIZATION] MICROSOFT
  [PERSON] SAM ALTMAN
  [PRODUCT] GPT-5
  [PERSON] SATYA NADELLA
  ... and 9 more


## Step 2: Build NetworkX Graph

In [5]:
# Create directed graph
G = nx.DiGraph()

# Add entity nodes with attributes
for entity in entities:
    G.add_node(
        entity["name"],
        type=entity["type"],
        description=entity["description"]
    )

# Add relationship edges with attributes
for rel in relationships:
    # Only add the edge if both endpoints exist as nodes; other relationships are skipped
    if rel["source"] in G.nodes and rel["target"] in G.nodes:
        G.add_edge(
            rel["source"],
            rel["target"],
            description=rel["description"],
            weight=rel["strength"]
        )

print(f"Graph created: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")

Graph created: 14 nodes, 6 edges
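
The graph has fewer edges than the number of relationships loaded. Two mechanisms in the cell above can cause that: a relationship whose `source` or `target` is not an entity is skipped, and `nx.DiGraph` keeps a single edge per (source, target) pair, so a repeated pair overwrites the earlier edge's attributes. A minimal plain-Python sketch for telling the two cases apart on toy records; `audit_relationships` is a hypothetical helper:

```python
from collections import Counter


def audit_relationships(entities: list, relationships: list):
    """Split relationships into kept/skipped and list repeated (source, target) pairs."""
    names = {entity["name"] for entity in entities}
    kept, skipped = [], []
    for rel in relationships:
        if rel["source"] in names and rel["target"] in names:
            kept.append(rel)
        else:
            skipped.append(rel)
    pair_counts = Counter((rel["source"], rel["target"]) for rel in kept)
    repeated = [pair for pair, count in pair_counts.items() if count > 1]
    return kept, skipped, repeated


# Toy records for illustration; "UNKNOWN LAB" is deliberately not an entity.
toy_entities = [{"name": "OPENAI"}, {"name": "GPT-5"}]
toy_relationships = [
    {"source": "OPENAI", "target": "GPT-5"},
    {"source": "OPENAI", "target": "GPT-5"},
    {"source": "UNKNOWN LAB", "target": "GPT-5"},
]
kept, skipped, repeated = audit_relationships(toy_entities, toy_relationships)
print(len(kept), len(skipped), repeated)
```

Running the same function on the notebook's `entities` and `relationships` would show which of the two cases accounts for the difference.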


In [6]:
# Display graph structure
print("\n=== GRAPH NODES ===")
for node, attrs in list(G.nodes(data=True))[:10]:
    print(f"  {node} ({attrs.get('type', 'N/A')})")

print("\n=== GRAPH EDGES ===")
for source, target, attrs in list(G.edges(data=True))[:10]:
    print(f"  {source} --> {target}")
    print(f"    {attrs.get('description', 'N/A')[:60]}...")


=== GRAPH NODES ===
  OPENAI (ORGANIZATION)
  MICROSOFT (ORGANIZATION)
  SAM ALTMAN (PERSON)
  GPT-5 (PRODUCT)
  SATYA NADELLA (PERSON)
  GOLDMAN SACHS (ORGANIZATION)
  GOOGLE (ORGANIZATION)
  SUNDAR PICHAI (PERSON)
  FEDERAL TRADE COMMISSION (ORGANIZATION)
  FTC (ORGANIZATION)

=== GRAPH EDGES ===
  OPENAI --> GPT-5
    OpenAI develops GPT-5...
  MICROSOFT --> GPT-5
    Microsoft integrates GPT-5 into its product suite...
  MICROSOFT --> GOLDMAN SACHS
    Goldman Sachs raised their price target for Microsoft stock...
  MICROSOFT --> GOOGLE
    Microsoft's primary competitor in cloud services is Google...
  GOOGLE --> SUNDAR PICHAI
    Sundar Pichai is the CEO of Google...
  FEDERAL TRADE COMMISSION --> LINA KAHN
    FEDERAL TRADE COMMISSION appoints Lina Kahn as Chair...


## Step 3: Compute Graph Metrics

Compute PageRank, degree centrality, and betweenness centrality to rank node importance.

In [7]:
# Convert to undirected for some algorithms
G_undirected = G.to_undirected()

# PageRank - importance based on incoming connections
pagerank = nx.pagerank(G, weight="weight")

# Degree centrality - normalized number of connections (in + out edges on a DiGraph)
degree_centrality = nx.degree_centrality(G)

# Betweenness centrality - bridges between clusters
betweenness = nx.betweenness_centrality(G_undirected)

# Store metrics on nodes
for node in G.nodes:
    G.nodes[node]["pagerank"] = pagerank.get(node, 0)
    G.nodes[node]["degree_centrality"] = degree_centrality.get(node, 0)
    G.nodes[node]["betweenness"] = betweenness.get(node, 0)

print("Graph metrics computed: pagerank, degree_centrality, betweenness")

Graph metrics computed: pagerank, degree_centrality, betweenness
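
For a directed graph, NetworkX's `degree_centrality` counts in-edges and out-edges together and divides by `n - 1`. The sketch below recomputes that by hand for the six edges printed in Step 2, assuming the 14 entity names that appear in this notebook's outputs. It is a plain-Python illustration, not NetworkX's implementation:

```python
# Node and edge names are copied from the outputs shown in this notebook.
NODES = [
    "OPENAI", "MICROSOFT", "SAM ALTMAN", "GPT-5", "SATYA NADELLA",
    "GOLDMAN SACHS", "GOOGLE", "SUNDAR PICHAI", "FEDERAL TRADE COMMISSION",
    "FTC", "LINA KAHN", "STARTUPXYZ", "TREASURY SECRETARY", "EXAMPLE CORP",
]
EDGES = [
    ("OPENAI", "GPT-5"),
    ("MICROSOFT", "GPT-5"),
    ("MICROSOFT", "GOLDMAN SACHS"),
    ("MICROSOFT", "GOOGLE"),
    ("GOOGLE", "SUNDAR PICHAI"),
    ("FEDERAL TRADE COMMISSION", "LINA KAHN"),
]


def degree_centrality(nodes, edges):
    """Total degree (in + out) of each node divided by n - 1."""
    degree = {node: 0 for node in nodes}
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    scale = 1 / (len(nodes) - 1)
    return {node: count * scale for node, count in degree.items()}


centrality = degree_centrality(NODES, EDGES)
print(round(centrality["MICROSOFT"], 4))  # three edges out of 13 possible neighbours
```

MICROSOFT touches three edges, so its score is 3/13, while entities without any edge score 0.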


In [8]:
# Top entities by PageRank
print("\n=== TOP ENTITIES BY PAGERANK ===")
sorted_by_pagerank = sorted(pagerank.items(), key=lambda x: -x[1])
for node, score in sorted_by_pagerank[:10]:
    node_type = G.nodes[node].get("type", "N/A")
    print(f"  {score:.4f} | [{node_type}] {node}")


=== TOP ENTITIES BY PAGERANK ===
  0.1215 | [PRODUCT] GPT-5
  0.1190 | [PERSON] SUNDAR PICHAI
  0.1048 | [PERSON] LINA KAHN
  0.0733 | [ORGANIZATION] GOOGLE
  0.0715 | [ORGANIZATION] GOLDMAN SACHS
  0.0567 | [ORGANIZATION] OPENAI
  0.0567 | [ORGANIZATION] MICROSOFT
  0.0567 | [PERSON] SAM ALTMAN
  0.0567 | [PERSON] SATYA NADELLA
  0.0567 | [ORGANIZATION] FEDERAL TRADE COMMISSION
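
Two features of this ranking are worth explaining. Nodes that only receive edges (GPT-5, SUNDAR PICHAI, LINA KAHN) lead, and nodes with only outgoing edges score the same 0.0567 as entities with no edges at all. Both follow from how PageRank works: score flows along out-edges, and the score held by nodes without out-edges is redistributed uniformly. A minimal unweighted power-iteration sketch on a three-node toy graph, assuming a damping factor of 0.85 (the NetworkX default). It illustrates the idea and is not NetworkX's implementation:

```python
def pagerank(nodes, edges, damping=0.85, iterations=100):
    """Unweighted PageRank by power iteration with uniform dangling redistribution."""
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    n = len(nodes)
    rank = {node: 1 / n for node in nodes}
    for _ in range(iterations):
        # Score held by nodes without out-edges is shared equally by all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1 - damping) / n + damping * dangling / n for node in nodes
        }
        # Every other node passes its score along its out-edges.
        for source, targets in out_links.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank


# Toy graph: A points at B, C is isolated.
scores = pagerank(["A", "B", "C"], [("A", "B")])
print(scores)
```

In the toy graph, B (receiver only) ends up above A, and A (sender only) ties with the isolated node C, which mirrors the pattern in the table above.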


## Step 4: Community Detection

Use the Louvain algorithm to find clusters of related entities.

In [9]:
# Louvain community detection (works on undirected graphs)
partition = community_louvain.best_partition(G_undirected, weight="weight", resolution=1.0)

# Store community assignment on nodes
for node, community_id in partition.items():
    G.nodes[node]["community"] = community_id

# Count communities (distinct IDs in the partition)
num_communities = len(set(partition.values()))
print(f"Detected {num_communities} communities")

# Modularity score (quality of partition)
modularity = community_louvain.modularity(partition, G_undirected, weight="weight")
print(f"Modularity score: {modularity:.4f}")

Detected 9 communities
Modularity score: 0.4037
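
The modularity score compares the edge weight that falls inside communities with what would be expected if edges were placed at random among nodes of the same degree. For each community, it adds the fraction of total edge weight inside the community and subtracts the squared fraction of total degree that the community holds. A plain-Python sketch of that standard definition (resolution 1.0) on a toy graph; it is an illustration, not python-louvain's code:

```python
def modularity(edges, partition):
    """edges: (u, v, weight) of an undirected graph; partition: node -> community id."""
    total_weight = sum(weight for _, _, weight in edges)
    inside = {}  # edge weight with both endpoints in the community
    degree = {}  # summed weighted degree of the community's nodes
    for u, v, weight in edges:
        cu, cv = partition[u], partition[v]
        degree[cu] = degree.get(cu, 0.0) + weight
        degree[cv] = degree.get(cv, 0.0) + weight
        if cu == cv:
            inside[cu] = inside.get(cu, 0.0) + weight
    return sum(
        inside.get(c, 0.0) / total_weight
        - (degree.get(c, 0.0) / (2 * total_weight)) ** 2
        for c in set(partition.values())
    )


# Toy graph: two separate pairs plus one isolated node "E".
toy_edges = [("A", "B", 1.0), ("C", "D", 1.0)]
split = {"A": 0, "B": 0, "C": 1, "D": 1, "E": 2}
merged = {"A": 0, "B": 0, "C": 0, "D": 0, "E": 0}
print(modularity(toy_edges, split), modularity(toy_edges, merged))
```

The isolated node contributes nothing to the score in either partition, which is consistent with the many one-member communities this graph produces.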


In [10]:
# Group entities by community
communities: dict[int, list[str]] = {}
for node, community_id in partition.items():
    if community_id not in communities:
        communities[community_id] = []
    communities[community_id].append(node)

print("\n=== COMMUNITIES ===")
for comm_id, members in sorted(communities.items()):
    # Sort members by PageRank within community
    sorted_members = sorted(members, key=lambda x: -pagerank.get(x, 0))
    print(f"\nCommunity {comm_id} ({len(members)} members):")
    for member in sorted_members[:5]:
        node_type = G.nodes[member].get("type", "N/A")
        print(f"  [{node_type}] {member}")
    if len(members) > 5:
        print(f"  ... and {len(members) - 5} more")


=== COMMUNITIES ===

Community 0 (1 members):
  [ORGANIZATION] STARTUPXYZ

Community 1 (4 members):
  [PRODUCT] GPT-5
  [ORGANIZATION] GOLDMAN SACHS
  [ORGANIZATION] OPENAI
  [ORGANIZATION] MICROSOFT

Community 2 (1 members):
  [PERSON] SAM ALTMAN

Community 3 (1 members):
  [PERSON] SATYA NADELLA

Community 4 (2 members):
  [PERSON] SUNDAR PICHAI
  [ORGANIZATION] GOOGLE

Community 5 (2 members):
  [PERSON] LINA KAHN
  [ORGANIZATION] FEDERAL TRADE COMMISSION

Community 6 (1 members):
  [ORGANIZATION] FTC

Community 7 (1 members):
  [PERSON] TREASURY SECRETARY

Community 8 (1 members):
  [ORGANIZATION] EXAMPLE CORP
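
Six of the nine communities above contain a single entity, so a summary built only from relationships inside the community has nothing but the entity description to draw on. One option, sketched here as an assumption rather than as part of the original pipeline, is to separate singletons before summarization and handle them differently (for example, reuse the entity description). `split_by_size` and `min_size` are hypothetical names:

```python
def split_by_size(communities: dict, min_size: int = 2):
    """Return (communities with at least min_size members, the remaining ones)."""
    multi = {cid: m for cid, m in communities.items() if len(m) >= min_size}
    small = {cid: m for cid, m in communities.items() if len(m) < min_size}
    return multi, small


# A subset of the communities listed above, for illustration.
toy_communities = {
    0: ["STARTUPXYZ"],
    1: ["GPT-5", "GOLDMAN SACHS", "OPENAI", "MICROSOFT"],
    4: ["SUNDAR PICHAI", "GOOGLE"],
}
multi, small = split_by_size(toy_communities)
print(sorted(multi), sorted(small))
```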


## Step 5: Generate Community Summaries

Create LLM-powered summaries for each community following GraphRAG's report format.

In [11]:
@dataclass
class CommunitySummary:
    community_id: int
    title: str
    summary: str
    key_entities: list[str]
    key_insights: list[str]

COMMUNITY_SUMMARY_PROMPT = """
You are an expert analyst creating a summary report for a knowledge graph community.

Given the following entities and their relationships, create a structured summary.

ENTITIES IN THIS COMMUNITY:
{entities_info}

RELATIONSHIPS:
{relationships_info}

RELEVANT CLAIMS:
{claims_info}

Create a JSON response with:
1. title: A short descriptive title for this community (5-10 words)
2. summary: A 2-3 sentence executive summary of what this community represents
3. key_insights: 3-5 bullet points of key facts or relationships

Return ONLY valid JSON:
{{
  "title": "...",
  "summary": "...",
  "key_insights": ["...", "...", "..."]
}}

JSON OUTPUT:
"""

def generate_community_summary(community_id: int, members: list[str], G: nx.DiGraph, claims: list[dict]) -> CommunitySummary:
    """Generate a summary for a community using the LLM."""
    
    # Gather entity info
    entities_info = []
    for member in members:
        node_data = G.nodes[member]
        entities_info.append(f"- {member} ({node_data.get('type', 'N/A')}): {node_data.get('description', 'N/A')}")
    
    # Gather relationships within community
    relationships_info = []
    for source, target, data in G.edges(data=True):
        if source in members and target in members:
            relationships_info.append(f"- {source} -> {target}: {data.get('description', 'N/A')}")
    
    # Gather relevant claims
    claims_info = []
    for claim in claims:
        if claim["subject"] in members:
            claims_info.append(f"- [{claim['claim_type']}] {claim['subject']}: {claim['description']}")
    
    prompt = COMMUNITY_SUMMARY_PROMPT.format(
        entities_info="\n".join(entities_info) or "No entities",
        relationships_info="\n".join(relationships_info) or "No relationships",
        claims_info="\n".join(claims_info[:10]) or "No claims"  # Limit claims
    )
    
    response = chat_ollama(prompt)
    
    # Parse JSON
    json_str = response.strip()
    if json_str.startswith("```"):
        json_str = json_str.split("```")[1]
        if json_str.startswith("json"):
            json_str = json_str[4:]
    json_str = json_str.strip()
    
    try:
        data = json.loads(json_str)
        return CommunitySummary(
            community_id=community_id,
            title=data.get("title", f"Community {community_id}"),
            summary=data.get("summary", ""),
            key_entities=members[:5],  # First 5 members; the caller passes them sorted by PageRank
            key_insights=data.get("key_insights", [])
        )
    except json.JSONDecodeError as ex:
        print(f"Failed to parse JSON for community {community_id}: {ex}")
        print(f"Raw response: {response}")
        return CommunitySummary(
            community_id=community_id,
            title=f"Community {community_id}",
            summary="Summary generation failed",
            key_entities=members[:5],
            key_insights=[]
        )
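
The fence-stripping in `generate_community_summary` handles a response that starts with a code fence, but a model may also wrap the JSON in extra prose. A more tolerant sketch that parses the span from the first `{` to the last `}`; `extract_json_object` is a hypothetical helper, and the heuristic assumes the response contains a single JSON object:

```python
import json
from typing import Optional


def extract_json_object(text: str) -> Optional[dict]:
    """Parse the outermost {...} span in text, or return None if that fails."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


# Toy responses for illustration.
fenced = '```json\n{"title": "T", "summary": "S", "key_insights": []}\n```'
chatty = 'Here is the report: {"title": "T"} Let me know if you need more.'
print(extract_json_object(fenced))
print(extract_json_object(chatty))
print(extract_json_object("no json here"))
```

Returning `None` keeps the caller's fallback path (the placeholder `CommunitySummary`) in one place.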

In [12]:
# Generate summaries for each community
community_summaries: list[CommunitySummary] = []

for comm_id, members in sorted(communities.items()):
    print(f"Generating summary for Community {comm_id} ({len(members)} members)...")
    # Sort members by PageRank
    sorted_members = sorted(members, key=lambda x: -pagerank.get(x, 0))
    summary = generate_community_summary(comm_id, sorted_members, G, claims)
    community_summaries.append(summary)
    print(f"  Title: {summary.title}")

print(f"\nGenerated {len(community_summaries)} community summaries")

Generating summary for Community 0 (1 members)...
  Title: StartupXYZ Community
Generating summary for Community 1 (4 members)...
  Title: AI and Tech Community Insights
Generating summary for Community 2 (1 members)...
  Title: OpenAI Leadership Community
Generating summary for Community 3 (1 members)...
  Title: Microsoft Leadership Community
Generating summary for Community 4 (2 members)...
  Title: Google Cloud Services Community
Generating summary for Community 5 (2 members)...
  Title: FTC and AI Regulation Community
Generating summary for Community 6 (1 members)...
  Title: Microsoft-OpenAI Regulatory Community
Generating summary for Community 7 (1 members)...
  Title: Treasury Secretary Community
Generating summary for Community 8 (1 members)...
  Title: Technology Company Community

Generated 9 community summaries


In [13]:
# Display community summaries
print("\n" + "="*60)
print("COMMUNITY SUMMARIES")
print("="*60)

for summary in community_summaries:
    print(f"\n### Community {summary.community_id}: {summary.title}")
    print(f"\n{summary.summary}")
    print(f"\nKey Entities: {', '.join(summary.key_entities)}")
    print("\nKey Insights:")
    for insight in summary.key_insights:
        print(f"  - {insight}")


COMMUNITY SUMMARIES

### Community 0: StartupXYZ Community

The StartupXYZ community is a platform dedicated to startups, providing support and resources for entrepreneurs. It offers networking opportunities, educational content, and access to funding and mentorship.

Key Entities: STARTUPXYZ

Key Insights:
  - No direct relationships or claims are provided for this community.
  - Focuses on supporting startup ventures through various services.
  - Aims to foster a supportive environment for entrepreneurial growth.
  - Offers resources that can help startups succeed in their journey.
  - Potential for collaboration and partnership opportunities within the community.

### Community 1: AI and Tech Community Insights

This community focuses on the integration of AI technologies like GPT-5 by Microsoft, its impact on stock prices, and insights from financial analysts such as Goldman Sachs.

Key Entities: GPT-5, GOLDMAN SACHS, OPENAI, MICROSOFT

Key Insights:
  - GPT-5 is developed by Open

## Step 6: Store in SQLite

Persist the graph structure, metrics, and summaries.

In [14]:
# Create SQLite database
conn = sqlite3.connect(DB_PATH)
cursor = conn.cursor()

# Create tables
cursor.executescript("""
-- Entities table
CREATE TABLE IF NOT EXISTS entities (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT UNIQUE NOT NULL,
    type TEXT,
    description TEXT,
    pagerank REAL DEFAULT 0,
    degree_centrality REAL DEFAULT 0,
    betweenness REAL DEFAULT 0,
    community_id INTEGER,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Relationships table
CREATE TABLE IF NOT EXISTS relationships (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source_id INTEGER REFERENCES entities(id),
    target_id INTEGER REFERENCES entities(id),
    description TEXT,
    weight REAL DEFAULT 1.0,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Claims table
CREATE TABLE IF NOT EXISTS claims (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    subject_id INTEGER REFERENCES entities(id),
    claim_type TEXT,
    description TEXT,
    claim_date TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Community summaries table
CREATE TABLE IF NOT EXISTS community_summaries (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    community_id INTEGER UNIQUE NOT NULL,
    title TEXT,
    summary TEXT,
    key_entities TEXT,  -- JSON array
    key_insights TEXT,  -- JSON array
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Chunks table (source text)
CREATE TABLE IF NOT EXISTS chunks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    content TEXT,
    chunk_index INTEGER,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Create indexes
CREATE INDEX IF NOT EXISTS idx_entities_name ON entities(name);
CREATE INDEX IF NOT EXISTS idx_entities_community ON entities(community_id);
CREATE INDEX IF NOT EXISTS idx_relationships_source ON relationships(source_id);
CREATE INDEX IF NOT EXISTS idx_relationships_target ON relationships(target_id);
""")

conn.commit()
print("Database tables created")

Database tables created
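
The `REFERENCES` clauses in this schema are declarations only unless foreign-key enforcement is switched on: in a default SQLite build it is off for each new connection. A minimal in-memory sketch, using a trimmed-down copy of two tables, that shows the difference; `dangling_insert_is_rejected` is a hypothetical helper:

```python
import sqlite3

# Trimmed copy of the notebook's schema, for illustration only.
SCHEMA = """
CREATE TABLE entities (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE relationships (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source_id INTEGER REFERENCES entities(id),
    target_id INTEGER REFERENCES entities(id)
);
"""


def dangling_insert_is_rejected(enforce: bool) -> bool:
    """Try to insert a relationship that points at entity IDs that do not exist."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"PRAGMA foreign_keys = {'ON' if enforce else 'OFF'}")
    conn.executescript(SCHEMA)
    try:
        conn.execute(
            "INSERT INTO relationships (source_id, target_id) VALUES (999, 1000)"
        )
        return False
    except sqlite3.IntegrityError:
        return True
    finally:
        conn.close()


print(dangling_insert_is_rejected(enforce=True))
print(dangling_insert_is_rejected(enforce=False))
```

If enforcement is wanted in this notebook, running the pragma right after `sqlite3.connect(DB_PATH)` would be the natural place. The clean-up cell already deletes child rows before `entities`, so its order is compatible with enforced foreign keys.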


In [15]:
# Clear existing data (for re-runs)
cursor.executescript("""
DELETE FROM relationships;
DELETE FROM claims;
DELETE FROM community_summaries;
DELETE FROM chunks;
DELETE FROM entities;
""")
conn.commit()
print("Cleared existing data")

Cleared existing data


In [16]:
# Insert entities
entity_id_map: dict[str, int] = {}

for node, attrs in G.nodes(data=True):
    cursor.execute("""
        INSERT INTO entities (name, type, description, pagerank, degree_centrality, betweenness, community_id)
        VALUES (?, ?, ?, ?, ?, ?, ?)
    """, (
        node,
        attrs.get("type"),
        attrs.get("description"),
        attrs.get("pagerank", 0),
        attrs.get("degree_centrality", 0),
        attrs.get("betweenness", 0),
        attrs.get("community")
    ))
    entity_id_map[node] = cursor.lastrowid

conn.commit()
print(f"Inserted {len(entity_id_map)} entities")

Inserted 14 entities


In [17]:
# Insert relationships
rel_count = 0
for source, target, attrs in G.edges(data=True):
    source_id = entity_id_map.get(source)
    target_id = entity_id_map.get(target)
    if source_id is not None and target_id is not None:
        cursor.execute("""
            INSERT INTO relationships (source_id, target_id, description, weight)
            VALUES (?, ?, ?, ?)
        """, (
            source_id,
            target_id,
            attrs.get("description"),
            attrs.get("weight", 1.0)
        ))
        rel_count += 1

conn.commit()
print(f"Inserted {rel_count} relationships")

Inserted 6 relationships
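
Relationships are stored as integer IDs, so reading them back with entity names takes two joins on `entities`. A self-contained sketch against an in-memory database with a trimmed schema and one toy row (the notebook's real tables have more columns):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entities (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE relationships (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source_id INTEGER REFERENCES entities(id),
    target_id INTEGER REFERENCES entities(id),
    description TEXT,
    weight REAL DEFAULT 1.0
);
""")

# Insert two entities and remember their generated IDs.
openai_id = conn.execute("INSERT INTO entities (name) VALUES (?)", ("OPENAI",)).lastrowid
gpt5_id = conn.execute("INSERT INTO entities (name) VALUES (?)", ("GPT-5",)).lastrowid
conn.execute(
    "INSERT INTO relationships (source_id, target_id, description) VALUES (?, ?, ?)",
    (openai_id, gpt5_id, "OpenAI develops GPT-5"),
)

# Join entities twice: once for the source name, once for the target name.
rows = conn.execute("""
    SELECT s.name, t.name, r.description, r.weight
    FROM relationships AS r
    JOIN entities AS s ON s.id = r.source_id
    JOIN entities AS t ON t.id = r.target_id
    ORDER BY r.weight DESC
""").fetchall()
conn.close()
print(rows)
```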


In [18]:
# Insert claims
claim_count = 0
for claim in claims:
    subject_id = entity_id_map.get(claim["subject"])
    # Claims whose subject is not a stored entity are skipped
    if subject_id is not None:
        cursor.execute("""
            INSERT INTO claims (subject_id, claim_type, description, claim_date)
            VALUES (?, ?, ?, ?)
        """, (
            subject_id,
            claim.get("claim_type"),
            claim.get("description"),
            claim.get("date")
        ))
        claim_count += 1

conn.commit()
print(f"Inserted {claim_count} claims")

Inserted 3 claims


In [19]:
# Insert community summaries
for summary in community_summaries:
    cursor.execute("""
        INSERT INTO community_summaries (community_id, title, summary, key_entities, key_insights)
        VALUES (?, ?, ?, ?, ?)
    """, (
        summary.community_id,
        summary.title,
        summary.summary,
        json.dumps(summary.key_entities),
        json.dumps(summary.key_insights)
    ))

conn.commit()
print(f"Inserted {len(community_summaries)} community summaries")

Inserted 9 community summaries
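
`key_entities` and `key_insights` are stored as JSON text, so a reader has to decode them with `json.loads`. A minimal round-trip sketch on an in-memory table with a trimmed schema; the row values are taken from the Community 0 output shown earlier:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE community_summaries (
        community_id INTEGER UNIQUE NOT NULL,
        title TEXT,
        key_entities TEXT,  -- JSON array
        key_insights TEXT   -- JSON array
    )
""")
conn.execute(
    "INSERT INTO community_summaries VALUES (?, ?, ?, ?)",
    (
        0,
        "StartupXYZ Community",
        json.dumps(["STARTUPXYZ"]),
        json.dumps(["No direct relationships or claims are provided for this community."]),
    ),
)

row = conn.execute(
    "SELECT title, key_entities, key_insights FROM community_summaries WHERE community_id = ?",
    (0,),
).fetchone()
conn.close()

# Decode the JSON text columns back into Python lists.
title = row[0]
key_entities = json.loads(row[1])
key_insights = json.loads(row[2])
print(title, key_entities, len(key_insights))
```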


In [20]:
# Insert chunks
for i, chunk in enumerate(chunks):
    cursor.execute("""
        INSERT INTO chunks (content, chunk_index)
        VALUES (?, ?)
    """, (chunk, i))

conn.commit()
print(f"Inserted {len(chunks)} chunks")

Inserted 5 chunks


In [21]:
# Verify data
print("\n=== DATABASE SUMMARY ===")
for table in ["entities", "relationships", "claims", "community_summaries", "chunks"]:
    cursor.execute(f"SELECT COUNT(*) FROM {table}")
    count = cursor.fetchone()[0]
    print(f"  {table}: {count} rows")


=== DATABASE SUMMARY ===
  entities: 14 rows
  relationships: 6 rows
  claims: 3 rows
  community_summaries: 9 rows
  chunks: 5 rows


In [22]:
# Sample query: Top entities by PageRank
print("\n=== TOP ENTITIES (from DB) ===")
cursor.execute("""
    SELECT name, type, pagerank, community_id 
    FROM entities 
    ORDER BY pagerank DESC 
    LIMIT 10
""")
for row in cursor.fetchall():
    print(f"  {row[2]:.4f} | [{row[1]}] {row[0]} (Community {row[3]})")


=== TOP ENTITIES (from DB) ===
  0.1215 | [PRODUCT] GPT-5 (Community 1)
  0.1190 | [PERSON] SUNDAR PICHAI (Community 4)
  0.1048 | [PERSON] LINA KAHN (Community 5)
  0.0733 | [ORGANIZATION] GOOGLE (Community 4)
  0.0715 | [ORGANIZATION] GOLDMAN SACHS (Community 1)
  0.0567 | [ORGANIZATION] OPENAI (Community 1)
  0.0567 | [ORGANIZATION] MICROSOFT (Community 1)
  0.0567 | [PERSON] SAM ALTMAN (Community 2)
  0.0567 | [PERSON] SATYA NADELLA (Community 3)
  0.0567 | [ORGANIZATION] FEDERAL TRADE COMMISSION (Community 5)


In [23]:
# Sample query: Get community with its entities
print("\n=== COMMUNITY 0 DETAILS (from DB) ===")
cursor.execute("""
    SELECT title, summary FROM community_summaries WHERE community_id = 0
""")
row = cursor.fetchone()
if row:
    print(f"Title: {row[0]}")
    print(f"Summary: {row[1]}")
    
    cursor.execute("""
        SELECT name, type FROM entities WHERE community_id = 0 ORDER BY pagerank DESC LIMIT 5
    """)
    print("\nTop Members:")
    # Use a separate name so the summary row above is not shadowed
    for member_row in cursor.fetchall():
        print(f"  [{member_row[1]}] {member_row[0]}")


=== COMMUNITY 0 DETAILS (from DB) ===
Title: StartupXYZ Community
Summary: The StartupXYZ community is a platform dedicated to startups, providing support and resources for entrepreneurs. It offers networking opportunities, educational content, and access to funding and mentorship.

Top Members:
  [ORGANIZATION] STARTUPXYZ


In [24]:
# Close connection
conn.close()
print(f"\nDatabase saved to: {DB_PATH.absolute()}")


Database saved to: /Users/pacho-home-server/daily-knowledge-ingestion-assistant/notebooks/graphrag.db


## Summary

This notebook completed:

1. **Graph Construction** - Built NetworkX graph from entities and relationships
2. **Graph Metrics** - Computed PageRank, degree centrality, and betweenness
3. **Community Detection** - Applied Louvain algorithm to find topic clusters
4. **Community Summaries** - Generated LLM-powered summaries for each cluster
5. **SQLite Storage** - Persisted everything for retrieval

## Next Steps

In the next notebook we will:
1. **Add embeddings** - Embed entities and chunks using nomic-embed-text
2. **Vector search** - Set up sqlite-vec for semantic retrieval
3. **Triple-factor retrieval** - Combine semantic + temporal + graph centrality
4. **Query interface** - Build the Navigator chat interface