# GraphRAG Step 3: Embeddings & Vector Search

This notebook adds semantic search to the GraphRAG pipeline.

## Pipeline Steps
1. **Pull embedding model** - nomic-embed-text via Ollama
2. **Set up sqlite-vec** - Vector storage and similarity search
3. **Generate embeddings** - Embed entities, chunks, and claims
4. **Test vector search** - Basic semantic similarity queries
5. **Triple-factor retrieval** - Combine semantic + temporal + graph centrality
6. **Query interface** - Test retrieval with sample queries

## Setup

In [27]:
import json
import sqlite3
import struct
from pathlib import Path
from datetime import datetime, timedelta
from dataclasses import dataclass

import httpx
import sqlite_vec

OLLAMA_BASE_URL = "http://localhost:11434"
CHAT_MODEL = "qwen2.5:3b"
EMBED_MODEL = "nomic-embed-text"  # 768 dimensions
EMBED_DIM = 768
DB_PATH = Path("graphrag.db")

## Step 1: Pull Embedding Model

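We need `nomic-embed-text` for generating embeddings. The cell below only checks whether the model is available; if it is missing, pull it from a terminal with `ollama pull nomic-embed-text` and re-run the cell.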

In [28]:
# Check if embedding model is available (fail fast if Ollama is unreachable)
response = httpx.get(f"{OLLAMA_BASE_URL}/api/tags", timeout=10.0)
response.raise_for_status()
models = [m["name"] for m in response.json().get("models", [])]
print(f"Available models: {models}")

if EMBED_MODEL not in models and f"{EMBED_MODEL}:latest" not in models:
    print(f"\n⚠️  Embedding model '{EMBED_MODEL}' not found.")
    print(f"Run this in terminal: ollama pull {EMBED_MODEL}")
    print("Then re-run this cell.")
else:
    print(f"\n✓ Embedding model '{EMBED_MODEL}' is available")

Available models: ['nomic-embed-text:latest', 'qwen2.5:3b']

✓ Embedding model 'nomic-embed-text' is available


In [29]:
def get_embedding(text: str) -> list[float]:
    """Get embedding vector from Ollama."""
    response = httpx.post(
        f"{OLLAMA_BASE_URL}/api/embed",
        json={"model": EMBED_MODEL, "input": text},
        timeout=60.0
    )
    response.raise_for_status()
    # Ollama returns {"embeddings": [[...]]}
    embeddings = response.json().get("embeddings", [[]])
    return embeddings[0] if embeddings else []

# Test embedding
test_embedding = get_embedding("Hello world")
print(f"Embedding dimension: {len(test_embedding)}")
print(f"First 5 values: {test_embedding[:5]}")

Embedding dimension: 768
First 5 values: [-0.006819655, 0.031062, -0.15550135, 0.03674758, 0.02267155]


In [30]:
def chat_ollama(prompt: str, system: str = "", temperature: float = 0.0) -> str:
    """Send a chat request to Ollama."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    
    response = httpx.post(
        f"{OLLAMA_BASE_URL}/api/chat",
        json={
            "model": CHAT_MODEL,
            "messages": messages,
            "stream": False,
            "options": {"temperature": temperature}
        },
        timeout=120.0
    )
    response.raise_for_status()
    return response.json()["message"]["content"]

## Step 2: Set Up sqlite-vec

Add vector tables to our existing database.

In [31]:
# Connect to database and load sqlite-vec extension
conn = sqlite3.connect(DB_PATH)
conn.enable_load_extension(True)
sqlite_vec.load(conn)
conn.enable_load_extension(False)

cursor = conn.cursor()

# Verify sqlite-vec is loaded
cursor.execute("SELECT vec_version()")
print(f"sqlite-vec version: {cursor.fetchone()[0]}")

sqlite-vec version: v0.1.6


In [32]:
# Create vector tables
cursor.executescript(f"""
-- Entity embeddings (name + description)
CREATE VIRTUAL TABLE IF NOT EXISTS entity_embeddings USING vec0(
    entity_id INTEGER PRIMARY KEY,
    embedding FLOAT[{EMBED_DIM}]
);

-- Chunk embeddings (source text)
CREATE VIRTUAL TABLE IF NOT EXISTS chunk_embeddings USING vec0(
    chunk_id INTEGER PRIMARY KEY,
    embedding FLOAT[{EMBED_DIM}]
);

-- Claim embeddings
CREATE VIRTUAL TABLE IF NOT EXISTS claim_embeddings USING vec0(
    claim_id INTEGER PRIMARY KEY,
    embedding FLOAT[{EMBED_DIM}]
);
""")
conn.commit()
print("Vector tables created")

Vector tables created


## Step 3: Generate Embeddings

Embed all entities, chunks, and claims.

In [33]:
def serialize_embedding(embedding: list[float]) -> bytes:
    """Serialize embedding to bytes for sqlite-vec."""
    return struct.pack(f"{len(embedding)}f", *embedding)
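
To make the storage format concrete, here is a minimal round-trip sketch. It assumes the BLOB is a flat run of 32-bit floats, 4 bytes each, which is what the `struct.pack` call above produces. `deserialize_embedding` is a hypothetical helper added for illustration; it is not part of the notebook or of sqlite-vec.

```python
import struct


def serialize_embedding(embedding: list[float]) -> bytes:
    """Pack floats as 32-bit values, 4 bytes each (same logic as the cell above)."""
    return struct.pack(f"{len(embedding)}f", *embedding)


def deserialize_embedding(blob: bytes) -> list[float]:
    """Hypothetical inverse helper: unpack a float32 BLOB back into Python floats."""
    count = len(blob) // 4  # each float32 occupies 4 bytes
    return list(struct.unpack(f"{count}f", blob))


# These values are exactly representable in float32, so the round trip is lossless
vector = [0.5, -0.25, 1.0]
blob = serialize_embedding(vector)
restored = deserialize_embedding(blob)

print(len(blob))
print(restored)
```

Under this layout a 768-dimension vector occupies 768 × 4 = 3072 bytes. Values that are not exactly representable in float32 come back slightly rounded, which is expected.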

In [34]:
# Clear existing embeddings (for re-runs)
cursor.execute("DELETE FROM entity_embeddings")
cursor.execute("DELETE FROM chunk_embeddings")
cursor.execute("DELETE FROM claim_embeddings")
conn.commit()
print("Cleared existing embeddings")

Cleared existing embeddings


In [35]:
# Embed entities (name + type + description)
cursor.execute("SELECT id, name, type, description FROM entities")
entities = cursor.fetchall()

print(f"Embedding {len(entities)} entities...")
for i, (entity_id, name, etype, description) in enumerate(entities):
    # Combine name, type, and description for richer embedding
    text = f"{name} ({etype}): {description or 'No description'}"
    embedding = get_embedding(text)
    
    cursor.execute(
        "INSERT INTO entity_embeddings (entity_id, embedding) VALUES (?, ?)",
        (entity_id, serialize_embedding(embedding))
    )
    
    if (i + 1) % 5 == 0:
        print(f"  Processed {i + 1}/{len(entities)}")

conn.commit()
print(f"Embedded {len(entities)} entities")

Embedding 101 entities...
  Processed 5/101
  Processed 10/101
  Processed 15/101
  Processed 20/101
  Processed 25/101
  Processed 30/101
  Processed 35/101
  Processed 40/101
  Processed 45/101
  Processed 50/101
  Processed 55/101
  Processed 60/101
  Processed 65/101
  Processed 70/101
  Processed 75/101
  Processed 80/101
  Processed 85/101
  Processed 90/101
  Processed 95/101
  Processed 100/101
Embedded 101 entities


In [36]:
# Embed chunks (source text)
cursor.execute("SELECT id, content FROM chunks")
chunks = cursor.fetchall()

print(f"Embedding {len(chunks)} chunks...")
for i, (chunk_id, content) in enumerate(chunks):
    # Truncate to 8,000 characters (embedding models have input limits);
    # slicing leaves shorter text unchanged
    text = content[:8000]
    embedding = get_embedding(text)
    
    cursor.execute(
        "INSERT INTO chunk_embeddings (chunk_id, embedding) VALUES (?, ?)",
        (chunk_id, serialize_embedding(embedding))
    )
    
    print(f"  Processed chunk {i + 1}/{len(chunks)}")

conn.commit()
print(f"Embedded {len(chunks)} chunks")

Embedding 52 chunks...
  Processed chunk 1/52
  Processed chunk 2/52
  Processed chunk 3/52
  Processed chunk 4/52
  Processed chunk 5/52
  Processed chunk 6/52
  Processed chunk 7/52
  Processed chunk 8/52
  Processed chunk 9/52
  Processed chunk 10/52
  Processed chunk 11/52
  Processed chunk 12/52
  Processed chunk 13/52
  Processed chunk 14/52
  Processed chunk 15/52
  Processed chunk 16/52
  Processed chunk 17/52
  Processed chunk 18/52
  Processed chunk 19/52
  Processed chunk 20/52
  Processed chunk 21/52
  Processed chunk 22/52
  Processed chunk 23/52
  Processed chunk 24/52
  Processed chunk 25/52
  Processed chunk 26/52
  Processed chunk 27/52
  Processed chunk 28/52
  Processed chunk 29/52
  Processed chunk 30/52
  Processed chunk 31/52
  Processed chunk 32/52
  Processed chunk 33/52
  Processed chunk 34/52
  Processed chunk 35/52
  Processed chunk 36/52
  Processed chunk 37/52
  Processed chunk 38/52
  Processed chunk 39/52
  Processed chunk 40/52
  Processed chunk 41/52
  

In [37]:
# Embed claims
cursor.execute("SELECT id, claim_type, description FROM claims")
claims = cursor.fetchall()

print(f"Embedding {len(claims)} claims...")
for i, (claim_id, claim_type, description) in enumerate(claims):
    text = f"[{claim_type}] {description}"
    embedding = get_embedding(text)
    
    cursor.execute(
        "INSERT INTO claim_embeddings (claim_id, embedding) VALUES (?, ?)",
        (claim_id, serialize_embedding(embedding))
    )

conn.commit()
print(f"Embedded {len(claims)} claims")

Embedding 1 claims...
Embedded 1 claims


In [38]:
# Verify embeddings
print("\n=== EMBEDDING COUNTS ===")
for table in ["entity_embeddings", "chunk_embeddings", "claim_embeddings"]:
    cursor.execute(f"SELECT COUNT(*) FROM {table}")
    count = cursor.fetchone()[0]
    print(f"  {table}: {count}")


=== EMBEDDING COUNTS ===
  entity_embeddings: 101
  chunk_embeddings: 52
  claim_embeddings: 1


## Step 4: Test Vector Search

Run a basic semantic similarity search: embed the query, then rank entities by cosine distance to it.
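
As a reference for reading the scores, the sketch below computes cosine distance in plain Python. It assumes `vec_distance_cosine` returns 1 minus the cosine similarity, which is how this notebook treats it when converting with `1 - distance`. The function name `cosine_distance` is illustrative only.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance, defined here as 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Same direction, orthogonal, and opposite direction
same = cosine_distance([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])

print(same, orthogonal, opposite)
```

Under this definition the distance ranges from 0 (same direction) to 2 (opposite direction), so the derived similarity `1 - distance` ranges from 1 down to -1. Vector length does not matter, only direction.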

In [39]:
def search_entities(query: str, top_k: int = 5) -> list[tuple]:
    """Search entities by semantic similarity."""
    query_embedding = get_embedding(query)
    
    cursor.execute("""
        SELECT 
            e.id,
            e.name,
            e.type,
            e.description,
            e.pagerank,
            vec_distance_cosine(ee.embedding, ?) as distance
        FROM entity_embeddings ee
        JOIN entities e ON e.id = ee.entity_id
        ORDER BY distance ASC
        LIMIT ?
    """, (serialize_embedding(query_embedding), top_k))
    
    return cursor.fetchall()

In [40]:
# Test entity search
query = "artificial intelligence companies"
print(f"Query: '{query}'\n")

results = search_entities(query)
print("=== TOP MATCHING ENTITIES ===")
for entity_id, name, etype, desc, pagerank, distance in results:
    similarity = 1 - distance  # Convert distance to similarity
    print(f"\n[{etype}] {name}")
    print(f"  Similarity: {similarity:.4f} | PageRank: {pagerank:.4f}")
    print(f"  {desc[:80]}..." if desc and len(desc) > 80 else f"  {desc}")

Query: 'artificial intelligence companies'

=== TOP MATCHING ENTITIES ===

[ORGANIZATION] LLM
  Similarity: 0.5818 | PageRank: 0.0175
  Large Language Model | Large language model used in the approach to build a grap...

[ORGANIZATION] EXAMPLE CORP
  Similarity: 0.5535 | PageRank: 0.0078
  Technology company acquiring StartupXYZ

[CONCEPT] CRISPR-GPT
  Similarity: 0.5510 | PageRank: 0.0144
  AI model for agentic automation in gene editing experiments | LLM agent for auto...

[ORGANIZATION] RAG SYSTEMS
  Similarity: 0.5468 | PageRank: 0.0078
  Systems that use retrieval-augmented generation

[LOCATION] EXTERNAL KNOWLEDGE SOURCE
  Similarity: 0.5246 | PageRank: 0.0113
  Source of information for retrieval-augmented generation


In [41]:
# Test with different queries
test_queries = [
    "CEO executives leadership",
    "stock market financial",
    "government regulation antitrust"
]

for query in test_queries:
    print(f"\n{'='*50}")
    print(f"Query: '{query}'")
    results = search_entities(query, top_k=3)
    for entity_id, name, etype, desc, pagerank, distance in results:
        similarity = 1 - distance
        print(f"  {similarity:.3f} | [{etype}] {name}")


Query: 'CEO executives leadership'
  0.584 | [CONCEPT] SOCIAL DECENTRALIZED AUTONOMOUS ORGANIZATIONS
  0.568 | [LOCATION] JUPITER
  0.568 | [LOCATION] URANUS

Query: 'stock market financial'
  0.516 | [LOCATION] ADVANCED ECONOMIES
  0.494 | [LOCATION] EMERGING MARKET AND DEVELOPING ECONOMIES
  0.480 | [ORGANIZATION] EXAMPLE CORP

Query: 'government regulation antitrust'
  0.562 | [LOCATION] ADVANCED ECONOMIES
  0.537 | [PRODUCT] JAMES WEBB SPACE TELESCOPE
  0.507 | [PRODUCT] TRUSTED BLOCKCHAIN DECENTRALIZED APPLICATIONS


## Step 5: Triple-Factor Retrieval

Combine semantic similarity + temporal decay + graph centrality.

**Formula:**
```
final_score = 0.6 * semantic_similarity + 0.2 * temporal_score + 0.2 * graph_centrality
```
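
A small worked example of the formula, as a sketch. `combine_scores` is a hypothetical helper, not a function defined in this notebook. The factor scores are copied from the first two rows of the triple-factor results table further down.

```python
def combine_scores(
    semantic: float,
    temporal: float,
    graph: float,
    weights: tuple[float, float, float] = (0.6, 0.2, 0.2),
) -> float:
    """Weighted sum of the three factor scores (weights default to 60/20/20)."""
    w_semantic, w_temporal, w_graph = weights
    return w_semantic * semantic + w_temporal * temporal + w_graph * graph


# Rank 1: semantic 0.6213, temporal 1.0, graph 1.0
top_score = combine_scores(semantic=0.6213, temporal=1.0, graph=1.0)
# Rank 2: semantic 0.5269, temporal 1.0, graph 0.8298
second_score = combine_scores(semantic=0.5269, temporal=1.0, graph=0.8298)

print(f"{top_score:.4f}")
print(f"{second_score:.4f}")
```

For the first row the arithmetic is 0.6 × 0.6213 + 0.2 × 1.0 + 0.2 × 1.0 = 0.37278 + 0.2 + 0.2 = 0.77278, which rounds to the 0.7728 reported in the table.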

In [42]:
@dataclass
class RetrievalResult:
    entity_id: int
    name: str
    entity_type: str
    description: str
    semantic_score: float
    temporal_score: float
    graph_score: float
    final_score: float
    community_id: int = None
    source_refs: list[str] = None

    def __post_init__(self):
        if self.source_refs is None:
            self.source_refs = []

In [43]:
HALF_LIVES = {
    "news": 7,
    "research_paper": 30,
    "reference": 365,
}

def temporal_decay(age_days: float, half_life: float = 7.0) -> float:
    """Calculate temporal decay score using half-life formula.
    
    score = 0.5 ^ (age_days / half_life)
    
    Args:
        age_days: Age of content in days
        half_life: Days until relevance halves
    
    Returns:
        Score between 0 and 1 (1 = fresh, 0 = very old)
    """
    return 0.5 ** (age_days / half_life)


def get_half_life_for_entity(entity_id: int) -> float:
    """Look up the content_type(s) of sources for an entity and return the appropriate half-life.
    
    If an entity appears in multiple sources with different content types,
    use the longest half-life (most persistent content type wins).
    """
    cursor.execute("SELECT source_refs FROM entities WHERE id = ?", (entity_id,))
    row = cursor.fetchone()
    if not row or not row[0]:
        return HALF_LIVES["news"]  # Default
    
    try:
        source_ids = json.loads(row[0])
    except (json.JSONDecodeError, TypeError):
        return HALF_LIVES["news"]
    
    if not source_ids:
        return HALF_LIVES["news"]
    
    # Look up content_type for each source
    max_half_life = HALF_LIVES["news"]
    for sid in source_ids:
        cursor.execute("SELECT content_type FROM sources WHERE source_id = ?", (sid,))
        src_row = cursor.fetchone()
        if src_row and src_row[0]:
            hl = HALF_LIVES.get(src_row[0], HALF_LIVES["news"])
            max_half_life = max(max_half_life, hl)
    
    return max_half_life


# Test temporal decay with content-type awareness
print("Temporal decay by content type:")
print(f"{'Age':>5}  {'news (7d)':>10}  {'paper (30d)':>12}  {'ref (365d)':>12}")
print("-" * 45)
for days in [0, 1, 7, 14, 30, 90, 365]:
    scores = [temporal_decay(days, HALF_LIVES[ct]) for ct in ["news", "research_paper", "reference"]]
    print(f"{days:>5}  {scores[0]:>10.4f}  {scores[1]:>12.4f}  {scores[2]:>12.4f}")

Temporal decay by content type:
  Age   news (7d)   paper (30d)    ref (365d)
---------------------------------------------
    0      1.0000        1.0000        1.0000
    1      0.9057        0.9772        0.9981
    7      0.5000        0.8507        0.9868
   14      0.2500        0.7236        0.9738
   30      0.0513        0.5000        0.9446
   90      0.0001        0.1250        0.8429
  365      0.0000        0.0002        0.5000
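
The search function below takes the content age as a plain number of days. As a sketch of how that number could be derived from a timestamp, the example below assumes each source carries a timezone-aware publication time. The `published_at` value and the `age_in_days` helper are hypothetical; the notebook does not show where publication dates are stored.

```python
from datetime import datetime, timezone


def age_in_days(published_at: datetime, now: datetime) -> float:
    """Elapsed time between publication and `now` in fractional days (never negative)."""
    seconds = (now - published_at).total_seconds()
    return max(seconds, 0.0) / 86400.0


def temporal_decay(age_days: float, half_life: float = 7.0) -> float:
    """Same half-life formula as the cell above."""
    return 0.5 ** (age_days / half_life)


# Fixed timestamps keep the example reproducible
now = datetime(2025, 1, 15, tzinfo=timezone.utc)
published_at = datetime(2025, 1, 1, tzinfo=timezone.utc)

age = age_in_days(published_at, now)
news_score = temporal_decay(age, half_life=7)
paper_score = temporal_decay(age, half_life=30)

print(age, news_score, round(paper_score, 4))
```

An age of 14 days is two half-lives for news, giving 0.25, and the 30-day half-life gives about 0.7236; both match the 14-day row of the table above. Clamping negative ages to zero is a design choice here so that a future-dated source never scores above 1.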


In [44]:
def triple_factor_search(
    query: str,
    top_k: int = 10,
    semantic_weight: float = 0.6,
    temporal_weight: float = 0.2,
    graph_weight: float = 0.2,
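    content_age_days: float = 0.0,  # One age applied to every candidate; 0.0 assumes fresh content for the demo
) -> list[RetrievalResult]:
    """Triple-factor retrieval combining semantic, temporal, and graph signals.
    
    Uses per-entity half-life based on source content_type:
    - news: 7 days
    - research_paper: 30 days
    - reference: 365 days
    
    Note: content_age_days is shared by all candidates. At the default of 0.0
    every temporal score is 1.0, so the temporal term adds a constant and
    does not change the ranking.
    """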
    
    # Get query embedding
    query_embedding = get_embedding(query)
    
    # Fetch all entities with their embeddings and metrics
    cursor.execute("""
        SELECT 
            e.id,
            e.name,
            e.type,
            e.description,
            e.pagerank,
            e.community_id,
            e.source_refs,
            vec_distance_cosine(ee.embedding, ?) as distance
        FROM entity_embeddings ee
        JOIN entities e ON e.id = ee.entity_id
        ORDER BY distance ASC
        LIMIT ?
    """, (serialize_embedding(query_embedding), top_k * 2))  # Fetch more for re-ranking
    
    rows = cursor.fetchall()
    
    # Calculate triple-factor scores
    results = []
    
    # Max pagerank among the fetched candidates; graph scores are normalized
    # relative to this candidate set, not the whole graph
    max_pagerank = max(row[4] for row in rows) if rows else 1.0
    
    for entity_id, name, etype, desc, pagerank_val, community_id, source_refs_json, distance in rows:
        # Semantic similarity (convert distance to similarity)
        semantic_score = 1.0 - distance
        
        # Temporal score (per-entity half-life based on content type)
        entity_half_life = get_half_life_for_entity(entity_id)
        temporal_score = temporal_decay(content_age_days, entity_half_life)
        
        # Graph centrality score (normalized pagerank)
        graph_score = pagerank_val / max_pagerank if max_pagerank > 0 else 0
        
        # Combined score
        final_score = (
            semantic_weight * semantic_score +
            temporal_weight * temporal_score +
            graph_weight * graph_score
        )
        
        # Parse source refs
        try:
            source_refs = json.loads(source_refs_json) if source_refs_json else []
        except (json.JSONDecodeError, TypeError):
            source_refs = []
        
        results.append(RetrievalResult(
            entity_id=entity_id,
            name=name,
            entity_type=etype,
            description=desc,
            semantic_score=semantic_score,
            temporal_score=temporal_score,
            graph_score=graph_score,
            final_score=final_score,
            community_id=community_id,
            source_refs=source_refs,
        ))
    
    # Re-rank by final score, highest first
    results.sort(key=lambda r: r.final_score, reverse=True)
    
    return results[:top_k]

In [45]:
# Test triple-factor search
query = "technology partnerships and deals"
print(f"Query: '{query}'\n")

results = triple_factor_search(query, top_k=5)

print("=== TRIPLE-FACTOR RETRIEVAL RESULTS ===")
print(f"{'Rank':<5} {'Entity':<25} {'Final':<8} {'Semantic':<10} {'Temporal':<10} {'Graph':<8}")
print("-" * 75)

for i, r in enumerate(results, 1):
    print(f"{i:<5} {r.name[:24]:<25} {r.final_score:.4f}   {r.semantic_score:.4f}     {r.temporal_score:.4f}     {r.graph_score:.4f}")

Query: 'technology partnerships and deals'

=== TRIPLE-FACTOR RETRIEVAL RESULTS ===
Rank  Entity                    Final    Semantic   Temporal   Graph   
---------------------------------------------------------------------------
1     AI                        0.7728   0.6213     1.0000     1.0000
2     STARTUPXYZ                0.6821   0.5269     1.0000     0.8298
3     ADVANCED ECONOMIES        0.6379   0.5252     1.0000     0.6140
4     MOBILE PHONES             0.6360   0.5278     1.0000     0.5968
5     EMERGING MARKET AND DEVE  0.6322   0.5702     1.0000     0.4505


In [46]:
# Compare with and without graph centrality boost
query = "AI regulation government"
print(f"Query: '{query}'\n")

print("=== SEMANTIC ONLY (100% semantic) ===")
results_semantic = triple_factor_search(query, semantic_weight=1.0, temporal_weight=0.0, graph_weight=0.0)
for i, r in enumerate(results_semantic[:5], 1):
    print(f"  {i}. [{r.entity_type}] {r.name} (score: {r.final_score:.4f})")

print("\n=== TRIPLE-FACTOR (60/20/20) ===")
results_triple = triple_factor_search(query, semantic_weight=0.6, temporal_weight=0.2, graph_weight=0.2)
for i, r in enumerate(results_triple[:5], 1):
    print(f"  {i}. [{r.entity_type}] {r.name} (score: {r.final_score:.4f}, graph: {r.graph_score:.4f})")

Query: 'AI regulation government'

=== SEMANTIC ONLY (100% semantic) ===
  1. [LOCATION] ADVANCED ECONOMIES (score: 0.6413)
  2. [PRODUCT] JAMES WEBB SPACE TELESCOPE (score: 0.6270)
  3. [CONCEPT] LABOR-MARKET (score: 0.5802)
  4. [LOCATION] JUPITER (score: 0.5796)
  5. [LOCATION] URANUS (score: 0.5796)

=== TRIPLE-FACTOR (60/20/20) ===
  1. [LOCATION] ADVANCED ECONOMIES (score: 0.7275, graph: 0.7134)
  2. [PRODUCT] TRUSTED BLOCKCHAIN DECENTRALIZED APPLICATIONS (score: 0.7151, graph: 1.0000)
  3. [PRODUCT] JAMES WEBB SPACE TELESCOPE (score: 0.6512, graph: 0.3748)
  4. [CONCEPT] LABOR-MARKET (score: 0.6231, graph: 0.3748)
  5. [LOCATION] JUPITER (score: 0.6227, graph: 0.3748)


## Step 6: Query Interface

Build a simple RAG query function that retrieves context and generates an answer.

In [47]:
def get_community_context(community_id: int) -> str:
    """Get community summary for additional context."""
    cursor.execute("""
        SELECT title, summary, key_insights 
        FROM community_summaries 
        WHERE community_id = ?
    """, (community_id,))
    row = cursor.fetchone()
    if row:
        title, summary, insights_json = row
        insights = json.loads(insights_json) if insights_json else []
        return f"Topic: {title}\nSummary: {summary}\nKey insights: {'; '.join(insights)}"
    return ""

In [48]:
def get_related_claims(entity_ids: list[int]) -> list[str]:
    """Get claims related to the retrieved entities."""
    if not entity_ids:
        return []
    
    placeholders = ",".join("?" for _ in entity_ids)
    cursor.execute(f"""
        SELECT c.claim_type, c.description, e.name
        FROM claims c
        JOIN entities e ON e.id = c.subject_id
        WHERE c.subject_id IN ({placeholders})
    """, entity_ids)
    
    claims = []
    for claim_type, description, entity_name in cursor.fetchall():
        claims.append(f"[{claim_type}] {entity_name}: {description}")
    return claims

In [49]:
def navigator_query(question: str, top_k: int = 5) -> str:
    """The Navigator: Answer questions using GraphRAG retrieval."""
    
    # 1. Retrieve relevant entities using triple-factor search
    results = triple_factor_search(question, top_k=top_k)
    
    if not results:
        return "I don't have enough information to answer that question."
    
    # 2. Build context from retrieved entities (with source provenance)
    entity_context = []
    for r in results:
        sources_str = ""
        if r.source_refs:
            sources_str = f" [from: {', '.join(r.source_refs)}]"
        entity_context.append(f"- {r.name} ({r.entity_type}): {r.description}{sources_str}")
    
    # 3. Get community context for the top result
    community_context = ""
    if results[0].community_id is not None:
        community_context = get_community_context(results[0].community_id)
    
    # 4. Get related claims
    entity_ids = [r.entity_id for r in results]
    claims = get_related_claims(entity_ids)
    claims_context = "\n".join(claims[:5]) if claims else "No specific claims."
    
    # 5. Collect unique source provenance
    all_sources = set()
    for r in results:
        all_sources.update(r.source_refs)
    source_provenance = f"\nSources consulted: {', '.join(sorted(all_sources))}" if all_sources else ""
    
    # 6. Build prompt
    prompt = f"""You are a knowledgeable assistant. Answer the question based on the following context.

RELEVANT ENTITIES:
{chr(10).join(entity_context)}

TOPIC CONTEXT:
{community_context or 'No additional topic context.'}

SPECIFIC FACTS:
{claims_context}
{source_provenance}

QUESTION: {question}

Provide a clear, concise answer based on the context above. If the context doesn't contain enough information, say so. Reference the source domains when relevant.

ANSWER:"""
    
    # 7. Generate answer
    answer = chat_ollama(prompt, temperature=0.3)
    
    return answer

In [50]:
# Test the Navigator
question = "What is the partnership between OpenAI and Microsoft?"
print(f"Question: {question}\n")
print("=" * 50)
answer = navigator_query(question)
print(f"\nAnswer:\n{answer}")

Question: What is the partnership between OpenAI and Microsoft?


Answer:
The provided context does not contain any information about the partnership between OpenAI and Microsoft. The context is related to the International Monetary Fund's study on the impact of artificial intelligence on labor markets, which involves staff analysis and use of an index to assess country readiness for AI adoption in the labor market. It does not mention anything about partnerships or collaborations involving companies like OpenAI or Microsoft.


In [51]:
# Cross-domain test queries spanning multiple source domains
questions = [
    "What is the role of AI in scientific research?",
    "What large-scale observation instruments are used in science?",
    "What are the economic impacts of emerging technologies?",
    "How is graph-based knowledge discovery used in information retrieval?",
]

for q in questions:
    print(f"\n{'='*60}")
    print(f"Q: {q}")
    print("-" * 60)
    
    # Show which sources the retrieval pulls from
    # (navigator_query repeats this search internally)
    results = triple_factor_search(q, top_k=5)
    source_domains = set()
    for r in results:
        source_domains.update(r.source_refs)
    if source_domains:
        print(f"Sources: {', '.join(sorted(source_domains))}")
    
    answer = navigator_query(q)
    print(f"A: {answer}")


Q: What is the role of AI in scientific research?
------------------------------------------------------------
Sources: arxiv:2304.04869, arxiv:2312.14090, web:imf-ai-economy, web:planetary-voyager, web:quanta-memory
A: Based on the provided context, Artificial Intelligence (AI) plays a significant role in scientific research, particularly in analyzing complex data and improving methodologies for designing blockchain applications. The context mentions that AI is utilized to analyze sextortion cases and enhance the design of blockchain applications. This indicates that AI contributes to more efficient analysis processes within these fields.

However, it's important to note that the specific details about AI’s role in scientific research are not provided directly in the given context. Therefore, while we can infer that AI aids in data analysis and methodology development, a comprehensive understanding of its exact contributions would require additional information from sources related t

In [52]:
# Close database connection
conn.close()
print("\nDatabase connection closed.")


Database connection closed.


## Summary

This notebook completed:

1. **Embeddings** - Generated 768-dim vectors for entities, chunks, and claims
2. **sqlite-vec** - Set up vector storage with cosine similarity search
3. **Content-type-aware temporal decay** - Different half-lives for news (7d), papers (30d), reference (365d)
4. **Triple-factor retrieval** - Combined semantic (60%) + temporal (20%) + graph (20%) with per-entity half-lives
5. **Source provenance** - Navigator answers reference which source domains contributed
6. **Cross-domain queries** - Tested retrieval spanning AI, biology, climate, astrophysics, neuroscience, economics, space

## GraphRAG Multi-Source Pipeline Complete!

You now have a working multi-source GraphRAG system with:
- 7 diverse sources across 5+ domains
- Entity/relationship/claim extraction with cross-document merge
- Knowledge graph with community detection and interactive visualization
- Semantic search with content-type-aware temporal decay and graph-aware ranking
- Conversational query interface with source provenance

## Next Steps

For the full DKIA system:
1. **Source connectors** - Add RSS, HN, arXiv ingestion (automated)
2. **FastAPI server** - Build the Navigator API
3. **Cytoscape.js UI** - Production graph visualization
4. **APScheduler** - Overnight pipeline automation