# GraphRAG Retrievers

This notebook demonstrates retrieval strategies for GraphRAG applications, progressing from simple vector search to graph-enhanced retrieval with custom Cypher queries.

**Prerequisites:** Complete [01 Data and Embeddings](01_data_and_embeddings.ipynb) first.

**Learning Objectives:**
- Set up a VectorRetriever using Neo4j's vector index
- Perform semantic similarity searches on your knowledge graph
- Use GraphRAG to combine vector search with LLM-generated answers
- Create custom Cypher queries with VectorCypherRetriever for richer context
- Compare standard vs. graph-enhanced retrieval results

---

## Retrieval Strategies Overview

We'll explore two retrieval approaches:

1. **VectorRetriever** - Simple semantic search using embeddings
   - Finds chunks by meaning similarity
   - Returns raw text for LLM context

2. **VectorCypherRetriever** - Graph-enhanced semantic search
   - Uses vector search as entry point
   - Traverses graph relationships for richer context
   - Returns structured data alongside text

## Setup

Import required modules and initialize connections.

In [None]:
from neo4j_graphrag.retrievers import VectorRetriever, VectorCypherRetriever
from neo4j_graphrag.generation import GraphRAG

from data_utils import Neo4jConnection, get_llm, get_embedder

## Connect to Neo4j

Create and verify the connection to your Neo4j graph database.

In [None]:
neo4j = Neo4jConnection().verify()
driver = neo4j.driver

## Initialize LLM and Embedder

Set up the Large Language Model (LLM) and embedding model for GraphRAG workflows.

- **LLM**: Uses Microsoft Foundry's model via the `OpenAILLM` interface
- **Embedder**: Uses Microsoft Foundry's embedding API via `OpenAIEmbeddings`

In [None]:
llm = get_llm()
embedder = get_embedder()

print(f"LLM initialized: {llm.model_name}")
print(f"Embedder initialized: {embedder.model}")

---

# Part 1: Vector Retriever

The VectorRetriever performs semantic search over your Neo4j knowledge graph. Instead of matching keywords, it finds the passages whose embeddings are closest to the embedding of your query.

## Initialize Vector Retriever

Set up the vector-based retriever for semantic search.

In [None]:
vector_retriever = VectorRetriever(
    driver=driver,
    index_name='chunkEmbeddings',
    embedder=embedder,
    return_properties=['text']
)

print("VectorRetriever initialized")

The **VectorRetriever** class:
- Connects to Neo4j using the provided `driver`
- Uses the `chunkEmbeddings` vector index for efficient semantic retrieval
- The `embedder` generates embeddings for the query
- Returns the `text` property from matching chunks

> **Tip:** You can modify the `return_properties` list to include additional properties from the retrieved nodes.

## Simple Vector Search

Test the vector search by retrieving the top 5 most relevant text chunks for a given query.

In [None]:
query = "What are the risks that Apple faces?"
result = vector_retriever.search(query_text=query, top_k=5)

print(f"Query: \"{query}\"\n")
print(f"Number of results returned: {len(result.items)}\n")
print("=" * 70)

for item in result.items:
    print(f"\nScore: {item.metadata['score']:.4f}")
    print(f"Content: {item.content[:150]}...")
    print(f"ID: {item.metadata['id']}")

**How it works:**
1. The query is converted to an embedding vector
2. `vector_retriever.search()` finds the top 5 matches based on vector similarity
3. Results show the similarity score, content snippet, and chunk ID
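
The ranking step can be illustrated without a database. The following is a minimal sketch, assuming toy 3-dimensional vectors and hypothetical chunk IDs (none of these come from the dataset). It scans every chunk by brute force, whereas a vector index typically uses an approximate nearest-neighbor search, and the scores Neo4j reports may be scaled differently from raw cosine values.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_chunks(query_vec, chunks, k):
    """Rank (chunk_id, vector) pairs by similarity to the query, best first."""
    scored = [(chunk_id, cosine_similarity(query_vec, vec)) for chunk_id, vec in chunks]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy "embeddings"; real ones have hundreds or thousands of dimensions.
toy_chunks = [
    ("chunk-supply-chain", [0.9, 0.1, 0.0]),
    ("chunk-competition", [0.6, 0.8, 0.0]),
    ("chunk-dividends", [0.0, 0.1, 0.9]),
]
toy_query = [1.0, 0.0, 0.0]

ranked = top_k_chunks(toy_query, toy_chunks, k=2)
for chunk_id, score in ranked:
    print(f"{chunk_id}: {score:.4f}")
```

The chunk whose vector points in nearly the same direction as the query ranks first, and the unrelated chunk is dropped by the `k=2` cutoff.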

> **Tip:** Inspecting returned results helps verify relevance and adjust your chunking or embedding strategy.

## GraphRAG Pipeline

The `GraphRAG` class pairs an LLM with a retriever (here, the vector-based one): it first retrieves the chunks most relevant to the question, then has the LLM generate an answer from them.

In [None]:
query = "What are the risks that Apple faces?"

rag = GraphRAG(
    llm=llm,
    retriever=vector_retriever
)

response = rag.search(query, retriever_config={"top_k": 5}, return_context=True)

print(f"Query: \"{query}\"\n")
print(f"Number of chunks used: {len(response.retriever_result.items)}\n")
print("=" * 70)
print("\nAnswer:")
print(response.answer)

**How it works:**
1. The retriever (`vector_retriever`) finds the most relevant text chunks
2. The LLM uses the retrieved context to generate a natural language answer
3. The `return_context=True` option lets you see what context was used

Because the answer is generated from retrieved chunks, it is grounded in your knowledge graph data rather than only in what the model learned during training.
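
To make the second step concrete, here is a rough sketch of how retrieved chunks could be folded into a prompt. The `build_prompt` helper and its template wording are hypothetical, written for illustration only; they are not the prompt that `GraphRAG` uses internally, and the context strings are made up rather than taken from the filings.

```python
def build_prompt(question, context_items):
    """Join retrieved chunk texts into one context block and attach the question."""
    context = "\n\n".join(
        f"[{i}] {text}" for i, text in enumerate(context_items, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Made-up stand-ins for the text of retrieved chunks.
toy_context = [
    "The company depends on a limited number of suppliers.",
    "Competition in the smartphone market is intense.",
]

prompt = build_prompt("What are the risks that Apple faces?", toy_context)
print(prompt)
```

Numbering the chunks is a design choice in this sketch: it makes it easier to trace a statement in the answer back to the chunk that supports it.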

---

**Try different queries:**
- What products does Microsoft reference?
- What warnings have Nvidia given?
- What companies mention AI in their filings?

---

# Part 2: Vector Cypher Retriever

The VectorCypherRetriever enhances vector search with custom Cypher queries, enabling you to traverse graph relationships and return richer, more contextual answers.

This approach is ideal when:
- Questions involve relationships between entities
- You want structured data alongside text context
- Graph traversal can surface insights that text alone cannot

## Example 1: Asset Manager Enrichment

Create a VectorCypherRetriever that returns companies and their asset managers alongside the text chunks.

In [None]:
# Custom Cypher query to enrich results with asset manager information.
# A COLLECT subquery gathers up to 5 asset managers per company into one list,
# so asset managers do not multiply the number of rows returned for top_k chunks.

asset_manager_query = """
MATCH (node)-[:FROM_DOCUMENT]-(doc:Document)-[:FILED]-(company:Company)
WITH node, company, COLLECT {
  MATCH (company)-[:OWNS]-(manager:AssetManager)
  RETURN manager.managerName
  LIMIT 5
} AS managers
RETURN company.name AS company, managers AS AssetManagersWithSharesInCompany, node.text AS context
"""

vector_cypher_retriever = VectorCypherRetriever(
    driver=driver,
    index_name='chunkEmbeddings',
    embedder=embedder,
    retrieval_query=asset_manager_query
)

print("VectorCypherRetriever initialized with asset manager query")

**How this query works:**

- Matches text chunks (`node`) to their source documents and associated companies
- For each company, collects up to 5 asset managers
- Returns: company name, list of asset managers, and context text

The `COLLECT` subquery folds each company's asset managers into a single list instead of emitting one row per manager, so the number of results stays tied to `top_k` rather than being multiplied by the graph traversal.
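
The row-multiplication effect is easy to reproduce in plain Python. This is a simulation under stated assumptions, not the Cypher engine: the chunk, company, and manager names are hypothetical, and each chunk is assumed to link to exactly one company.

```python
# Hypothetical toy data: each retrieved chunk links to one company,
# and each company links to several asset managers.
chunk_to_company = {"chunk-1": "Company A", "chunk-2": "Company B"}
company_to_managers = {
    "Company A": ["Manager 1", "Manager 2", "Manager 3"],
    "Company B": ["Manager 4", "Manager 5"],
}

def flat_match_rows(chunks):
    """Mimics a plain MATCH on managers: one row per (chunk, manager) pair."""
    return [
        (chunk, chunk_to_company[chunk], manager)
        for chunk in chunks
        for manager in company_to_managers[chunk_to_company[chunk]]
    ]

def collected_rows(chunks, per_company_limit=5):
    """Mimics the COLLECT subquery: one row per chunk, managers in a capped list."""
    rows = []
    for chunk in chunks:
        company = chunk_to_company[chunk]
        managers = company_to_managers[company][:per_company_limit]
        rows.append((chunk, company, managers))
    return rows

retrieved = ["chunk-1", "chunk-2"]  # pretend the vector search ran with top_k=2

print(len(flat_match_rows(retrieved)))  # 5 rows: 3 managers + 2 managers
print(len(collected_rows(retrieved)))   # 2 rows: one per retrieved chunk
```

With the flat version, a company held by hundreds of asset managers would crowd out every other result. The collected version keeps one row per chunk regardless.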

In [None]:
query = "Who are the asset managers most affected by banking regulations?"

rag = GraphRAG(llm=llm, retriever=vector_cypher_retriever)
response = rag.search(query, retriever_config={"top_k": 5}, return_context=True)

print(f"Query: \"{query}\"\n")
print(f"Number of results: {len(response.retriever_result.items)}\n")
print("=" * 70)
print("\nAnswer:")
print(response.answer)

In [None]:
# View the enriched context used by the LLM
print("Context used:")
print("=" * 70)
for item in response.retriever_result.items:
    print(f"\n{item.content}")

Notice how the context includes structured data (company names, asset manager lists) alongside the text. This lets the LLM name specific entities in its answer instead of relying only on what the text happens to mention.

> **Tip:** Modify `top_k` to see how the result count affects relevance. Results are ranked by similarity, so each additional result is a weaker match than the ones before it.
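
One way to explore this trade-off is to look at how far the lowest score falls as `top_k` grows, and optionally to drop results below a cutoff. The sketch below uses made-up scores and a hypothetical `keep_relevant` helper; the `0.80` threshold is arbitrary and would need tuning against scores observed in your own data.

```python
def keep_relevant(scored_items, min_score):
    """Keep (label, score) pairs whose score meets the cutoff, preserving rank order."""
    return [(label, score) for label, score in scored_items if score >= min_score]

# Made-up scores in descending order, as a vector search would return them.
toy_results = [
    ("chunk-a", 0.92),
    ("chunk-b", 0.90),
    ("chunk-c", 0.81),
    ("chunk-d", 0.64),
    ("chunk-e", 0.58),
]

# The weakest match gets weaker as top_k grows.
for k in (2, 5):
    top = toy_results[:k]
    print(f"top_k={k}: lowest score = {top[-1][1]:.2f}")

# An arbitrary cutoff keeps only the stronger matches.
filtered = keep_relevant(toy_results, min_score=0.80)
print(filtered)
```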

## Example 2: Discovering Shared Risks Among Companies

Combine semantic search with graph traversal to uncover relationships - specifically, risk factors that connect major tech companies.

In [None]:
# Custom Cypher to find companies sharing the same risk factors
# Uses slice notation [0..10] on collect() to limit array sizes per row

shared_risks_query = """
WITH node
MATCH (node)-[:FROM_DOCUMENT]-(doc:Document)-[:FILED]-(c1:Company)
MATCH (c1)-[:FACES_RISK]->(risk:RiskFactor)<-[:FACES_RISK]-(c2:Company)
WHERE c1 <> c2
WITH c1, c2, risk
RETURN
  c1.name AS source_company,
  collect(DISTINCT c2.name)[0..10] AS related_companies,
  collect(DISTINCT risk.name)[0..10] AS shared_risks
LIMIT 10
"""

risk_retriever = VectorCypherRetriever(
    driver=driver,
    index_name="chunkEmbeddings",
    embedder=embedder,
    retrieval_query=shared_risks_query
)

print("VectorCypherRetriever initialized with shared risks query")

In [None]:
query = "What risks connect major tech companies?"

rag = GraphRAG(llm=llm, retriever=risk_retriever)
response = rag.search(query, retriever_config={"top_k": 5}, return_context=True)

print(f"Query: \"{query}\"\n")
print(f"Number of results: {len(response.retriever_result.items)}\n")
print("=" * 70)
print("\nAnswer:")
print(response.answer)

In [None]:
# View the graph-derived context
print("Context (shared risks between companies):")
print("=" * 70)
for item in response.retriever_result.items:
    print(f"\n{item.content}")

**How this works:**

1. **Semantic Search:** Finds top-k text chunks relevant to the query
2. **Graph Traversal:** For each chunk:
   - Follows `FROM_DOCUMENT` and `FILED` relationships to find the company
   - Finds risk factors (`FACES_RISK`) that company faces
   - Finds other companies facing the same risks
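3. **Returns:** One row per source company with its related companies and shared risk factors (each list capped at 10). The chunk text itself is not returned, and because rows are grouped by company, there can be fewer results than `top_k`.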
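
The traversal logic in step 2 can be mirrored with plain Python sets. This is an illustrative sketch over a hypothetical toy graph (the company and risk names are invented), not a replacement for the Cypher query. Unlike the query, which returns no row for a company without shared risks, this helper returns two empty lists.

```python
# Hypothetical toy graph: company -> risk factors it faces (the FACES_RISK edges).
faces_risk = {
    "Company A": {"supply chain disruption", "cybersecurity"},
    "Company B": {"cybersecurity", "regulatory change"},
    "Company C": {"currency fluctuation"},
}

def shared_risks(source_company, limit=10):
    """Return (related_companies, shared_risks) for a company, each capped at `limit`."""
    related, shared = set(), set()
    for other, risks in faces_risk.items():
        if other == source_company:  # mirrors WHERE c1 <> c2
            continue
        overlap = faces_risk[source_company] & risks
        if overlap:
            related.add(other)
            shared |= overlap
    # sorted() gives a stable order here; Cypher's collect() makes no ordering promise.
    return sorted(related)[:limit], sorted(shared)[:limit]

related, shared = shared_risks("Company A")
print(related)  # ['Company B']
print(shared)   # ['cybersecurity']
```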

**Why this is powerful:**
- Uses the chunk as a semantic anchor, then lets graph traversal discover structured relationships from it
- Surfaces network-level insights that neither semantic search nor graph traversal would find on its own
- Ideal for exploratory questions about connections in your data

---

# Part 3: Comparing Retrieval Strategies

Let's compare the same query using both retrievers to see the difference in context and answers.

In [None]:
comparison_query = "What risks do technology companies face?"

print(f"Query: \"{comparison_query}\"")
print("\n" + "=" * 70)

# Basic Vector Retriever
print("\n[1] VECTOR RETRIEVER (text only)")
print("-" * 40)
rag_basic = GraphRAG(llm=llm, retriever=vector_retriever)
response_basic = rag_basic.search(comparison_query, retriever_config={"top_k": 3})
print(response_basic.answer)

# Graph-Enhanced Retriever
print("\n" + "=" * 70)
print("\n[2] VECTOR CYPHER RETRIEVER (graph-enhanced)")
print("-" * 40)
rag_enhanced = GraphRAG(llm=llm, retriever=risk_retriever)
response_enhanced = rag_enhanced.search(comparison_query, retriever_config={"top_k": 3})
print(response_enhanced.answer)

**Key Differences:**

| Aspect | VectorRetriever | VectorCypherRetriever |
|--------|-----------------|----------------------|
| Context | Raw text chunks | Whatever the Cypher query returns: text, structured graph data, or both |
| Relationships | Implicit in text | Explicit via Cypher traversal |
| Answer specificity | General | Specific entities and connections |
| Best for | General Q&A | Relationship-focused questions |

**When to use each:**
- **VectorRetriever**: Simple semantic search, general questions
- **VectorCypherRetriever**: Questions about relationships, comparisons, or when you need structured data alongside text

## Summary

In this notebook, you learned two retrieval strategies for GraphRAG:

**Part 1 - Vector Retriever:**
1. Simple semantic search using vector embeddings
2. GraphRAG pipeline combining retrieval with LLM generation
3. Diagnostic inspection of search results

**Part 2 - Vector Cypher Retriever:**
4. Custom Cypher queries for graph traversal
5. Enriching context with structured entity data
6. Discovering relationships (shared risks, asset managers)

**Part 3 - Comparison:**
7. Understanding when to use each approach
8. Trade-offs between simplicity and richness

The graph-enhanced approach leverages Neo4j's relationship traversal to provide more specific, contextual answers - the core value proposition of GraphRAG.

---

**Next:** Continue to [Lab 6 - Foundry Agents](../Lab_6_Foundry_Agents) to build intelligent agents that use your knowledge graph as a tool.

In [None]:
# Cleanup
neo4j.close()