# GraphRAG Step 3: Embeddings & Vector Search

This notebook adds semantic search capabilities to the GraphRAG pipeline.

## Pipeline Steps
1. **Pull embedding model** - nomic-embed-text via Ollama
2. **Set up sqlite-vec** - Vector storage and similarity search
3. **Generate embeddings** - Embed entities, chunks, and claims
4. **Vector search** - Test semantic similarity search over entities
5. **Triple-factor retrieval** - Combine semantic + temporal + graph centrality
6. **Query interface** - Test retrieval with sample queries

## Setup

In [1]:
import json
import sqlite3
import struct
from pathlib import Path
from datetime import datetime, timedelta
from dataclasses import dataclass

import httpx
import sqlite_vec

OLLAMA_BASE_URL = "http://localhost:11434"
CHAT_MODEL = "qwen2.5:3b"
EMBED_MODEL = "nomic-embed-text"  # 768 dimensions
EMBED_DIM = 768
DB_PATH = Path("graphrag.db")

## Step 1: Pull Embedding Model

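We need `nomic-embed-text` for generating embeddings. Run this cell to check whether the model is available; if it is not, pull it from a terminal with `ollama pull nomic-embed-text` and re-run the cell.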

In [2]:
# Check if embedding model is available
response = httpx.get(f"{OLLAMA_BASE_URL}/api/tags")
models = [m["name"] for m in response.json().get("models", [])]
print(f"Available models: {models}")

if EMBED_MODEL not in models and f"{EMBED_MODEL}:latest" not in models:
    print(f"\n⚠️  Embedding model '{EMBED_MODEL}' not found.")
    print(f"Run this in terminal: ollama pull {EMBED_MODEL}")
    print("Then re-run this cell.")
else:
    print(f"\n✓ Embedding model '{EMBED_MODEL}' is available")

Available models: ['nomic-embed-text:latest', 'qwen2.5:3b']

✓ Embedding model 'nomic-embed-text' is available


In [3]:
def get_embedding(text: str) -> list[float]:
    """Get embedding vector from Ollama."""
    response = httpx.post(
        f"{OLLAMA_BASE_URL}/api/embed",
        json={"model": EMBED_MODEL, "input": text},
        timeout=60.0
    )
    response.raise_for_status()
    # Ollama returns {"embeddings": [[...]]}
    embeddings = response.json().get("embeddings", [[]])
    return embeddings[0] if embeddings else []

# Test embedding
test_embedding = get_embedding("Hello world")
print(f"Embedding dimension: {len(test_embedding)}")
print(f"First 5 values: {test_embedding[:5]}")

Embedding dimension: 768
First 5 values: [-0.006819655, 0.031062, -0.15550135, 0.03674758, 0.02267155]


In [4]:
def chat_ollama(prompt: str, system: str = "", temperature: float = 0.0) -> str:
    """Send a chat request to Ollama."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    
    response = httpx.post(
        f"{OLLAMA_BASE_URL}/api/chat",
        json={
            "model": CHAT_MODEL,
            "messages": messages,
            "stream": False,
            "options": {"temperature": temperature}
        },
        timeout=120.0
    )
    response.raise_for_status()
    return response.json()["message"]["content"]

## Step 2: Set Up sqlite-vec

Add vector tables to our existing database.

In [5]:
# Connect to database and load sqlite-vec extension
conn = sqlite3.connect(DB_PATH)
conn.enable_load_extension(True)
sqlite_vec.load(conn)
conn.enable_load_extension(False)

cursor = conn.cursor()

# Verify sqlite-vec is loaded
cursor.execute("SELECT vec_version()")
print(f"sqlite-vec version: {cursor.fetchone()[0]}")

sqlite-vec version: v0.1.6


In [6]:
# Create vector tables
cursor.executescript(f"""
-- Entity embeddings (name + description)
CREATE VIRTUAL TABLE IF NOT EXISTS entity_embeddings USING vec0(
    entity_id INTEGER PRIMARY KEY,
    embedding FLOAT[{EMBED_DIM}]
);

-- Chunk embeddings (source text)
CREATE VIRTUAL TABLE IF NOT EXISTS chunk_embeddings USING vec0(
    chunk_id INTEGER PRIMARY KEY,
    embedding FLOAT[{EMBED_DIM}]
);

-- Claim embeddings
CREATE VIRTUAL TABLE IF NOT EXISTS claim_embeddings USING vec0(
    claim_id INTEGER PRIMARY KEY,
    embedding FLOAT[{EMBED_DIM}]
);
""")
conn.commit()
print("Vector tables created")

Vector tables created


## Step 3: Generate Embeddings

Embed all entities, chunks, and claims.

In [7]:
def serialize_embedding(embedding: list[float]) -> bytes:
    """Serialize embedding to bytes for sqlite-vec."""
    return struct.pack(f"{len(embedding)}f", *embedding)
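
To make the storage format concrete, here is a minimal round-trip sketch of the packing used above. `deserialize_embedding` is a hypothetical helper that is not part of the notebook, and the sketch assumes each value is stored as a 4-byte float32 in native byte order, which is what the `"f"` format code produces. The `sqlite_vec` Python package also appears to ship a `serialize_float32` helper that should be interchangeable with `serialize_embedding`, but that is not verified here.

```python
import struct


def serialize_embedding(embedding: list[float]) -> bytes:
    """Pack floats as consecutive float32 values (same as the notebook cell)."""
    return struct.pack(f"{len(embedding)}f", *embedding)


def deserialize_embedding(blob: bytes) -> list[float]:
    """Hypothetical inverse: unpack a float32 blob back into a list of floats."""
    count = len(blob) // 4  # each float32 occupies 4 bytes
    return list(struct.unpack(f"{count}f", blob))


# Values chosen to be exactly representable in float32
vector = [0.5, -0.25, 1.0]
blob = serialize_embedding(vector)
restored = deserialize_embedding(blob)
print(len(blob), restored)
```

A 768-dimensional embedding therefore occupies 768 * 4 = 3072 bytes per row. Values that are not exactly representable in float32 come back slightly rounded, which is expected.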

In [8]:
# Clear existing embeddings (for re-runs)
cursor.execute("DELETE FROM entity_embeddings")
cursor.execute("DELETE FROM chunk_embeddings")
cursor.execute("DELETE FROM claim_embeddings")
conn.commit()
print("Cleared existing embeddings")

Cleared existing embeddings


In [9]:
# Embed entities (name + type + description)
cursor.execute("SELECT id, name, type, description FROM entities")
entities = cursor.fetchall()

print(f"Embedding {len(entities)} entities...")
for i, (entity_id, name, etype, description) in enumerate(entities):
    # Combine name, type, and description for richer embedding
    text = f"{name} ({etype}): {description or 'No description'}"
    embedding = get_embedding(text)
    
    cursor.execute(
        "INSERT INTO entity_embeddings (entity_id, embedding) VALUES (?, ?)",
        (entity_id, serialize_embedding(embedding))
    )
    
    if (i + 1) % 5 == 0:
        print(f"  Processed {i + 1}/{len(entities)}")

conn.commit()
print(f"Embedded {len(entities)} entities")

Embedding 14 entities...
  Processed 5/14
  Processed 10/14
Embedded 14 entities


In [10]:
# Embed chunks (source text)
cursor.execute("SELECT id, content FROM chunks")
chunks = cursor.fetchall()

print(f"Embedding {len(chunks)} chunks...")
for i, (chunk_id, content) in enumerate(chunks):
    # Truncate to 8000 characters (embedding models have input limits);
    # slicing is a no-op for shorter content
    text = content[:8000]
    embedding = get_embedding(text)
    
    cursor.execute(
        "INSERT INTO chunk_embeddings (chunk_id, embedding) VALUES (?, ?)",
        (chunk_id, serialize_embedding(embedding))
    )
    
    print(f"  Processed chunk {i + 1}/{len(chunks)}")

conn.commit()
print(f"Embedded {len(chunks)} chunks")

Embedding 5 chunks...
  Processed chunk 1/5
  Processed chunk 2/5
  Processed chunk 3/5
  Processed chunk 4/5
  Processed chunk 5/5
Embedded 5 chunks


In [11]:
# Embed claims
cursor.execute("SELECT id, claim_type, description FROM claims")
claims = cursor.fetchall()

print(f"Embedding {len(claims)} claims...")
for i, (claim_id, claim_type, description) in enumerate(claims):
    text = f"[{claim_type}] {description}"
    embedding = get_embedding(text)
    
    cursor.execute(
        "INSERT INTO claim_embeddings (claim_id, embedding) VALUES (?, ?)",
        (claim_id, serialize_embedding(embedding))
    )

conn.commit()
print(f"Embedded {len(claims)} claims")

Embedding 3 claims...
Embedded 3 claims


In [12]:
# Verify embeddings
print("\n=== EMBEDDING COUNTS ===")
for table in ["entity_embeddings", "chunk_embeddings", "claim_embeddings"]:
    cursor.execute(f"SELECT COUNT(*) FROM {table}")
    count = cursor.fetchone()[0]
    print(f"  {table}: {count}")


=== EMBEDDING COUNTS ===
  entity_embeddings: 14
  chunk_embeddings: 5
  claim_embeddings: 3


## Step 4: Test Vector Search

Run a basic semantic similarity search over the entity embeddings, ranked by cosine distance.

In [13]:
def search_entities(query: str, top_k: int = 5) -> list[tuple]:
    """Search entities by semantic similarity."""
    query_embedding = get_embedding(query)
    
    cursor.execute("""
        SELECT 
            e.id,
            e.name,
            e.type,
            e.description,
            e.pagerank,
            vec_distance_cosine(ee.embedding, ?) as distance
        FROM entity_embeddings ee
        JOIN entities e ON e.id = ee.entity_id
        ORDER BY distance ASC
        LIMIT ?
    """, (serialize_embedding(query_embedding), top_k))
    
    return cursor.fetchall()
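
`vec_distance_cosine` returns a distance rather than a similarity, which is why the test cells convert it with `1 - distance`. The following pure-Python sketch shows the quantity being computed, assuming the usual definition of cosine distance as one minus cosine similarity. `cosine_distance` is an illustrative name, not a sqlite-vec function, and the sketch assumes neither vector is all zeros.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance = 1 - cosine similarity (illustrative, pure Python)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Assumes non-zero vectors; a zero vector would divide by zero here
    return 1.0 - dot / (norm_a * norm_b)


same = cosine_distance([1.0, 2.0], [2.0, 4.0])        # same direction
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])  # unrelated
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])   # opposite direction
print(same, orthogonal, opposite)
```

Under this definition the distance ranges from 0 (same direction) to 2 (opposite direction), so `1 - distance` ranges from 1 down to -1 and can be negative for dissimilar vectors.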

In [14]:
# Test entity search
query = "artificial intelligence companies"
print(f"Query: '{query}'\n")

results = search_entities(query)
print("=== TOP MATCHING ENTITIES ===")
for entity_id, name, etype, desc, pagerank, distance in results:
    similarity = 1 - distance  # Convert distance to similarity
    print(f"\n[{etype}] {name}")
    print(f"  Similarity: {similarity:.4f} | PageRank: {pagerank:.4f}")
    print(f"  {desc[:80]}..." if desc and len(desc) > 80 else f"  {desc}")

Query: 'artificial intelligence companies'

=== TOP MATCHING ENTITIES ===

[ORGANIZATION] OPENAI
  Similarity: 0.7021 | PageRank: 0.0567
  Artificial intelligence research company | Company

[ORGANIZATION] EXAMPLE CORP
  Similarity: 0.6101 | PageRank: 0.0567
  Technology company

[ORGANIZATION] MICROSOFT
  Similarity: 0.5469 | PageRank: 0.0567
  Technology company | Company whose shares increased after announcement

[ORGANIZATION] FEDERAL TRADE COMMISSION
  Similarity: 0.5176 | PageRank: 0.0567
  Regulatory body investigating Microsoft-OpenAI relationship | Regulatory body mo...

[ORGANIZATION] GOLDMAN SACHS
  Similarity: 0.5070 | PageRank: 0.0715
  Financial institution providing price target for Microsoft stock


In [15]:
# Test with different queries
test_queries = [
    "CEO executives leadership",
    "stock market financial",
    "government regulation antitrust"
]

for query in test_queries:
    print(f"\n{'='*50}")
    print(f"Query: '{query}'")
    results = search_entities(query, top_k=3)
    for entity_id, name, etype, desc, pagerank, distance in results:
        similarity = 1 - distance
        print(f"  {similarity:.3f} | [{etype}] {name}")


Query: 'CEO executives leadership'
  0.627 | [ORGANIZATION] EXAMPLE CORP
  0.576 | [ORGANIZATION] MICROSOFT
  0.564 | [ORGANIZATION] OPENAI

Query: 'stock market financial'
  0.557 | [ORGANIZATION] GOLDMAN SACHS
  0.509 | [ORGANIZATION] MICROSOFT
  0.475 | [ORGANIZATION] EXAMPLE CORP

Query: 'government regulation antitrust'
  0.510 | [ORGANIZATION] GOLDMAN SACHS
  0.440 | [ORGANIZATION] OPENAI
  0.432 | [ORGANIZATION] MICROSOFT


## Step 5: Triple-Factor Retrieval

Combine semantic similarity, temporal decay, and graph centrality into a single ranking score.

**Formula:**
```
final_score = 0.6 * semantic_similarity + 0.2 * temporal_score + 0.2 * graph_centrality
```
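
As a worked example of the formula, the sketch below plugs in the component scores printed for the first two rows of the results table for the "technology partnerships and deals" query. `final_score` is an illustrative helper, not a function from the notebook, and the inputs are the rounded values as displayed, so the match with the printed final scores is only to about four decimal places.

```python
def final_score(
    semantic: float,
    temporal: float,
    graph: float,
    semantic_weight: float = 0.6,
    temporal_weight: float = 0.2,
    graph_weight: float = 0.2,
) -> float:
    """Weighted sum of the three retrieval signals."""
    return (
        semantic_weight * semantic
        + temporal_weight * temporal
        + graph_weight * graph
    )


# Component scores as printed in the results table (rounded to 4 decimals)
pichai = final_score(semantic=0.4101, temporal=1.0, graph=1.0)
ftc = final_score(semantic=0.5573, temporal=1.0, graph=0.4762)
print(f"SUNDAR PICHAI: {pichai:.4f}")
print(f"FEDERAL TRADE COMMISSION: {ftc:.4f}")
```

The first row has the lower semantic score of the two but ranks higher because its normalized graph score is 1.0, which shows how the 20% graph weight can outweigh a sizeable gap in semantic similarity.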

In [16]:
from typing import Optional

@dataclass
class RetrievalResult:
    entity_id: int
    name: str
    entity_type: str
    description: str
    semantic_score: float
    temporal_score: float
    graph_score: float
    final_score: float
    community_id: Optional[int] = None  # None when the entity has no community

In [17]:
def temporal_decay(age_days: float, half_life: float = 7.0) -> float:
    """Calculate temporal decay score using half-life formula.
    
    score = 0.5 ^ (age_days / half_life)
    
    Args:
        age_days: Age of content in days
        half_life: Days until relevance halves (default 7 for news)
    
    Returns:
        Score between 0 and 1 (1 = fresh, 0 = very old)
    """
    return 0.5 ** (age_days / half_life)

# Test temporal decay
print("Temporal decay examples (half_life=7 days):")
for days in [0, 1, 3, 7, 14, 30]:
    score = temporal_decay(days, half_life=7)
    print(f"  {days:2d} days old: {score:.4f}")

Temporal decay examples (half_life=7 days):
   0 days old: 1.0000
   1 days old: 0.9057
   3 days old: 0.7430
   7 days old: 0.5000
  14 days old: 0.2500
  30 days old: 0.0513
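
The decay function takes an age in days, while stored content would more likely carry a timestamp. The sketch below shows one way to derive the age, assuming timestamps are ISO-8601 strings and that naive timestamps are in UTC. `age_in_days` and the sample timestamp are hypothetical and not part of the notebook's schema.

```python
from datetime import datetime, timezone


def temporal_decay(age_days: float, half_life: float = 7.0) -> float:
    """Half-life decay, same formula as the notebook cell."""
    return 0.5 ** (age_days / half_life)


def age_in_days(published_at: str, now: datetime) -> float:
    """Hypothetical helper: age of content in days from an ISO-8601 timestamp."""
    published = datetime.fromisoformat(published_at)
    if published.tzinfo is None:
        # Assumption: timestamps without an offset are treated as UTC
        published = published.replace(tzinfo=timezone.utc)
    seconds = (now - published).total_seconds()
    # Clamp future timestamps to age 0 so the score never exceeds 1
    return max(seconds / 86400.0, 0.0)


now = datetime(2024, 1, 15, tzinfo=timezone.utc)
age = age_in_days("2024-01-08T00:00:00+00:00", now)
score = temporal_decay(age, half_life=7.0)
print(f"age: {age:.1f} days, score: {score:.4f}")
```

Passing `now` explicitly keeps the helper deterministic, which makes it easier to test than calling `datetime.now()` inside the function.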


In [18]:
def triple_factor_search(
    query: str,
    top_k: int = 10,
    semantic_weight: float = 0.6,
    temporal_weight: float = 0.2,
    graph_weight: float = 0.2,
    content_age_days: float = 0.0,  # Assume fresh content for demo
    half_life: float = 7.0
) -> list[RetrievalResult]:
    """Triple-factor retrieval combining semantic, temporal, and graph signals."""
    
    # Get query embedding
    query_embedding = get_embedding(query)
    
    # Fetch the top_k * 2 nearest entities with their embeddings and metrics
    cursor.execute("""
        SELECT 
            e.id,
            e.name,
            e.type,
            e.description,
            e.pagerank,
            e.community_id,
            vec_distance_cosine(ee.embedding, ?) as distance
        FROM entity_embeddings ee
        JOIN entities e ON e.id = ee.entity_id
        ORDER BY distance ASC
        LIMIT ?
    """, (serialize_embedding(query_embedding), top_k * 2))  # Fetch more for re-ranking
    
    rows = cursor.fetchall()
    
    # Calculate triple-factor scores
    results = []
    
    # Get max pagerank for normalization
    max_pagerank = max(row[4] for row in rows) if rows else 1.0
    
    for entity_id, name, etype, desc, pagerank, community_id, distance in rows:
        # Semantic similarity (convert distance to similarity)
        semantic_score = 1.0 - distance
        
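        # Temporal score (based on content age). content_age_days is a single
        # value here, so this term is identical for every result: it shifts
        # the scores but does not change the ranking in this demo.
        temporal_score = temporal_decay(content_age_days, half_life)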
        
        # Graph centrality score (normalized pagerank)
        graph_score = pagerank / max_pagerank if max_pagerank > 0 else 0
        
        # Combined score
        final_score = (
            semantic_weight * semantic_score +
            temporal_weight * temporal_score +
            graph_weight * graph_score
        )
        
        results.append(RetrievalResult(
            entity_id=entity_id,
            name=name,
            entity_type=etype,
            description=desc,
            semantic_score=semantic_score,
            temporal_score=temporal_score,
            graph_score=graph_score,
            final_score=final_score,
            community_id=community_id
        ))
    
    # Re-rank by final score
    results.sort(key=lambda r: r.final_score, reverse=True)
    
    return results[:top_k]

In [19]:
# Test triple-factor search
query = "technology partnerships and deals"
print(f"Query: '{query}'\n")

results = triple_factor_search(query, top_k=5)

print("=== TRIPLE-FACTOR RETRIEVAL RESULTS ===")
print(f"{'Rank':<5} {'Entity':<25} {'Final':<8} {'Semantic':<10} {'Temporal':<10} {'Graph':<8}")
print("-" * 75)

for i, r in enumerate(results, 1):
    print(f"{i:<5} {r.name[:24]:<25} {r.final_score:.4f}   {r.semantic_score:.4f}     {r.temporal_score:.4f}     {r.graph_score:.4f}")

Query: 'technology partnerships and deals'

=== TRIPLE-FACTOR RETRIEVAL RESULTS ===
Rank  Entity                    Final    Semantic   Temporal   Graph   
---------------------------------------------------------------------------
1     SUNDAR PICHAI             0.6461   0.4101     1.0000     1.0000
2     FEDERAL TRADE COMMISSION  0.6296   0.5573     1.0000     0.4762
3     GOLDMAN SACHS             0.6186   0.4974     1.0000     0.6007
4     GOOGLE                    0.6155   0.4870     1.0000     0.6163
5     MICROSOFT                 0.5993   0.5068     1.0000     0.4762


In [20]:
# Compare with and without graph centrality boost
query = "AI regulation government"
print(f"Query: '{query}'\n")

print("=== SEMANTIC ONLY (100% semantic) ===")
results_semantic = triple_factor_search(query, semantic_weight=1.0, temporal_weight=0.0, graph_weight=0.0)
for i, r in enumerate(results_semantic[:5], 1):
    print(f"  {i}. [{r.entity_type}] {r.name} (score: {r.final_score:.4f})")

print("\n=== TRIPLE-FACTOR (60/20/20) ===")
results_triple = triple_factor_search(query, semantic_weight=0.6, temporal_weight=0.2, graph_weight=0.2)
for i, r in enumerate(results_triple[:5], 1):
    print(f"  {i}. [{r.entity_type}] {r.name} (score: {r.final_score:.4f}, graph: {r.graph_score:.4f})")

Query: 'AI regulation government'

=== SEMANTIC ONLY (100% semantic) ===
  1. [ORGANIZATION] EXAMPLE CORP (score: 0.5455)
  2. [ORGANIZATION] STARTUPXYZ (score: 0.5384)
  3. [PERSON] TREASURY SECRETARY (score: 0.5274)
  4. [PERSON] SAM ALTMAN (score: 0.5221)
  5. [PERSON] SATYA NADELLA (score: 0.5221)

=== TRIPLE-FACTOR (60/20/20) ===
  1. [PERSON] SUNDAR PICHAI (score: 0.7092, graph: 0.9794)
  2. [PRODUCT] GPT-5 (score: 0.6879, graph: 1.0000)
  3. [PERSON] LINA KAHN (score: 0.6848, graph: 0.8628)
  4. [ORGANIZATION] GOLDMAN SACHS (score: 0.6273, graph: 0.5883)
  5. [ORGANIZATION] EXAMPLE CORP (score: 0.6206, graph: 0.4664)


## Step 6: Query Interface

Build a simple RAG query function that retrieves context and generates an answer.

In [21]:
def get_community_context(community_id: int) -> str:
    """Get community summary for additional context."""
    cursor.execute("""
        SELECT title, summary, key_insights 
        FROM community_summaries 
        WHERE community_id = ?
    """, (community_id,))
    row = cursor.fetchone()
    if row:
        title, summary, insights_json = row
        insights = json.loads(insights_json) if insights_json else []
        return f"Topic: {title}\nSummary: {summary}\nKey insights: {'; '.join(insights)}"
    return ""

In [22]:
def get_related_claims(entity_ids: list[int]) -> list[str]:
    """Get claims related to the retrieved entities."""
    if not entity_ids:
        return []
    
    placeholders = ",".join("?" for _ in entity_ids)
    cursor.execute(f"""
        SELECT c.claim_type, c.description, e.name
        FROM claims c
        JOIN entities e ON e.id = c.subject_id
        WHERE c.subject_id IN ({placeholders})
    """, entity_ids)
    
    claims = []
    for claim_type, description, entity_name in cursor.fetchall():
        claims.append(f"[{claim_type}] {entity_name}: {description}")
    return claims

In [23]:
def navigator_query(question: str, top_k: int = 5) -> str:
    """The Navigator: Answer questions using GraphRAG retrieval."""
    
    # 1. Retrieve relevant entities using triple-factor search
    results = triple_factor_search(question, top_k=top_k)
    
    if not results:
        return "I don't have enough information to answer that question."
    
    # 2. Build context from retrieved entities
    entity_context = []
    for r in results:
        entity_context.append(f"- {r.name} ({r.entity_type}): {r.description}")
    
    # 3. Get community context for the top result
    community_context = ""
    if results[0].community_id is not None:
        community_context = get_community_context(results[0].community_id)
    
    # 4. Get related claims
    entity_ids = [r.entity_id for r in results]
    claims = get_related_claims(entity_ids)
    claims_context = "\n".join(claims[:5]) if claims else "No specific claims."
    
    # 5. Build prompt
    prompt = f"""You are a knowledgeable assistant. Answer the question based on the following context.

RELEVANT ENTITIES:
{chr(10).join(entity_context)}

TOPIC CONTEXT:
{community_context or 'No additional topic context.'}

SPECIFIC FACTS:
{claims_context}

QUESTION: {question}

Provide a clear, concise answer based on the context above. If the context doesn't contain enough information, say so.

ANSWER:"""
    
    # 6. Generate answer
    answer = chat_ollama(prompt, temperature=0.3)
    
    return answer

In [24]:
# Test the Navigator
question = "What is the partnership between OpenAI and Microsoft?"
print(f"Question: {question}\n")
print("=" * 50)
answer = navigator_query(question)
print(f"\nAnswer:\n{answer}")

Question: What is the partnership between OpenAI and Microsoft?


Answer:
Based on the provided context, there is no specific mention or information about any partnership between OpenAI and Microsoft. The context focuses on Google's cloud services, their Gemini Ultra model announcement, and regulatory scrutiny of AI capabilities among tech giants like Microsoft and Google. Therefore, I cannot provide details about a partnership between OpenAI and Microsoft from this given context.


In [25]:
# Test with more questions
questions = [
    "Who are the key people involved in AI companies?",
    "What regulatory concerns exist around AI?",
    "How did the stock market react to the AI news?"
]

for q in questions:
    print(f"\n{'='*60}")
    print(f"Q: {q}")
    print("-" * 60)
    answer = navigator_query(q)
    print(f"A: {answer}")


Q: Who are the key people involved in AI companies?
------------------------------------------------------------
A: Based on the provided context, Lina Kahn is identified as a key person involved in AI companies through her role as Chair of the Federal Trade Commission (FTC). The context does not provide specific details about other individuals or entities directly involved in AI companies. However, it mentions that regulators like Lina Kahn are closely monitoring the concentration of AI capabilities among a small number of tech giants, which includes Microsoft and OpenAI.

Q: What regulatory concerns exist around AI?
------------------------------------------------------------
A: Based on the provided context, there are no specific regulatory concerns related to AI mentioned. The context focuses on Google's announcement regarding its Gemini Ultra model and the competitive landscape of cloud services. Therefore, I cannot provide a clear, concise answer about regulatory concerns around

In [26]:
# Close database connection
conn.close()
print("\nDatabase connection closed.")


Database connection closed.


## Summary

This notebook completed:

1. **Embeddings** - Generated 768-dim vectors for entities, chunks, and claims
2. **sqlite-vec** - Set up vector storage with cosine similarity search
3. **Triple-factor retrieval** - Combined semantic (60%) + temporal (20%) + graph (20%)
4. **Navigator query** - Built RAG pipeline for question answering

## GraphRAG Pipeline Complete!

You now have a working GraphRAG system with:
- Entity/relationship/claim extraction
- Knowledge graph with community detection
- Semantic search with graph-aware ranking
- Conversational query interface

## Next Steps

For the full DKIA system:
1. **Source connectors** - Add RSS, HN, arXiv ingestion
2. **Temporal tracking** - Add timestamps and decay per content type
3. **FastAPI server** - Build the Navigator API
4. **Cytoscape.js UI** - Visualize the knowledge graph