# Hybrid Search with Neo4j GraphRAG

This notebook demonstrates Neo4j's official hybrid search using the `HybridRetriever` and `HybridCypherRetriever` classes from the `neo4j-graphrag` package.

## Why Hybrid Search?

| Search Type | Strengths | Weaknesses |
|-------------|-----------|------------|
| **Vector** | Semantic similarity, concept matching | Misses exact terms, dates, names |
| **Fulltext** | Exact keyword matching, specific terms | No semantic understanding |
| **Hybrid** | Combines both for better precision and recall | Needs both indexes and a tuned score weighting |

## How Neo4j Hybrid Search Works

1. Query executes against both vector and fulltext indexes **simultaneously**
2. Each index returns results with relevance scores
3. Scores are **normalized** for comparability
4. Results are **merged and deduplicated**
5. Combined scores are ranked; when the linear ranker is selected, the `alpha` parameter sets the weighting
6. Top-k results are returned

**Alpha parameter:** Controls the balance between vector and fulltext scores when `ranker="linear"` is passed to `search()`
- `alpha=1.0` = Pure vector search
- `alpha=0.0` = Pure fulltext search
- `alpha=0.5` = Equal weight to both
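
The steps above can be sketched in plain Python. This is a simplified model for intuition only, not the library's implementation (the real retriever does this work inside Cypher). The function names, the max-based normalization, and the sample scores are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale scores into [0, 1] by dividing by the largest score in the set."""
    if not scores:
        return {}
    top = max(scores.values())
    if top <= 0:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: score / top for doc_id, score in scores.items()}


def hybrid_rank(
    vector_scores: Dict[str, float],
    fulltext_scores: Dict[str, float],
    alpha: float = 0.5,
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Merge two result sets and rank them by an alpha-weighted score."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    vec = normalize(vector_scores)
    ft = normalize(fulltext_scores)
    combined = {}
    # The set union deduplicates ids returned by both indexes.
    for doc_id in set(vec) | set(ft):
        combined[doc_id] = (
            alpha * vec.get(doc_id, 0.0) + (1 - alpha) * ft.get(doc_id, 0.0)
        )
    # Sort by descending score; break ties by id for a stable order.
    ranked = sorted(combined.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:top_k]


# Made-up raw scores: cosine similarity on the left, Lucene-style on the right.
vector_hits = {"a": 0.9, "b": 0.6}
fulltext_hits = {"b": 4.0, "c": 2.0}

for alpha in (1.0, 0.5, 0.0):
    order = [doc_id for doc_id, _ in hybrid_rank(vector_hits, fulltext_hits, alpha)]
    print(f"alpha={alpha}: {order}")
```

With these made-up scores, `b` ranks first at `alpha=0.5` because it is the only id returned by both indexes.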

---

Import the required modules.

In [None]:
import sys
sys.path.insert(0, '../new-workshops/solutions')

import neo4j
from neo4j import GraphDatabase

# Official Neo4j GraphRAG retrievers
from neo4j_graphrag.retrievers import HybridRetriever, HybridCypherRetriever
from neo4j_graphrag.types import RetrieverResultItem

from config import Neo4jConfig, get_embedder

## Setup

Connect to Neo4j and initialize the embedder.

In [None]:
neo4j_config = Neo4jConfig()
driver = GraphDatabase.driver(
    neo4j_config.uri, 
    auth=(neo4j_config.username, neo4j_config.password)
)
driver.verify_connectivity()

# Initialize embedder for vector search
embedder = get_embedder()

print(f"Connected to Neo4j: {neo4j_config.uri}")
print(f"Embedder model: {embedder.model}")

## Verify Indexes

Hybrid search requires both a vector index and a fulltext index.

> **Note:** If the fulltext index doesn't exist, run:
> ```bash
> uv run python scripts/restore_neo4j.py --full-text
> ```
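
For orientation, a setup step like this typically ends up issuing a `CREATE FULLTEXT INDEX` statement. The sketch below only builds that statement as a string. The helper name is hypothetical, and the `Chunk` label and `text` property are assumptions; the workshop script may index different labels and properties.

```python
from typing import Sequence


def fulltext_index_statement(
    index_name: str, label: str, properties: Sequence[str]
) -> str:
    """Build a CREATE FULLTEXT INDEX statement for one node label."""
    names = [index_name, label, *properties]
    # Only plain identifiers are accepted, so nothing needs escaping.
    if not properties or not all(name.isidentifier() for name in names):
        raise ValueError("index name, label, and properties must be plain identifiers")
    props = ", ".join(f"n.{prop}" for prop in properties)
    return (
        f"CREATE FULLTEXT INDEX {index_name} IF NOT EXISTS "
        f"FOR (n:{label}) ON EACH [{props}]"
    )


# Hypothetical label and property; check your own graph schema first.
statement = fulltext_index_statement("search_entities", "Chunk", ["text"])
print(statement)
# To apply it against a live database: driver.execute_query(statement)
```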

In [None]:
# Index names used throughout this notebook
VECTOR_INDEX = "chunkEmbeddings"
FULLTEXT_INDEX = "search_entities"

with driver.session() as session:
    # Check vector index
    vec_result = session.run(
        "SHOW VECTOR INDEXES YIELD name, state WHERE name = $name RETURN state",
        name=VECTOR_INDEX
    )
    vec_state = vec_result.single()
    print(f"Vector index '{VECTOR_INDEX}': {vec_state['state'] if vec_state else 'NOT FOUND'}")
    
    # Check fulltext index
    ft_result = session.run(
        "SHOW FULLTEXT INDEXES YIELD name, state WHERE name = $name RETURN state",
        name=FULLTEXT_INDEX
    )
    ft_state = ft_result.single()
    print(f"Fulltext index '{FULLTEXT_INDEX}': {ft_state['state'] if ft_state else 'NOT FOUND'}")

## Pattern 1: Basic HybridRetriever

The `HybridRetriever` automatically:
- Searches both vector and fulltext indexes
- Normalizes scores for fair comparison
- Merges and deduplicates results
- Ranks by combined score (weighted by `alpha` when the linear ranker is used)

This replaces manual Cypher queries with a single, clean API call.

In [None]:
# Create a HybridRetriever instance
# This combines vector similarity search with fulltext keyword search
hybrid_retriever = HybridRetriever(
    driver=driver,
    vector_index_name=VECTOR_INDEX,      # For semantic similarity
    fulltext_index_name=FULLTEXT_INDEX,  # For keyword matching
    embedder=embedder,                    # Converts query text to vectors
    return_properties=["text"],           # Properties to return from matched nodes
)

print("HybridRetriever created successfully")

In [None]:
# Perform a hybrid search
# The query is used for both vector embedding AND fulltext keyword matching
query = "supply chain risks and disruptions"

results = hybrid_retriever.search(
    query_text=query,
    top_k=5,
)

print(f"Hybrid search for: '{query}'\n")
print(f"Found {len(results.items)} results\n")

for i, item in enumerate(results.items, 1):
    score = (item.metadata or {}).get('score', 'N/A')  # metadata may be None
    text = item.content[:200] if isinstance(item.content, str) else str(item.content)[:200]
    print(f"{i}. Score: {score}")
    print(f"   {text}...\n")

## Pattern 2: Tuning Alpha

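The `alpha` parameter controls how vector and fulltext scores are weighted. In `neo4j-graphrag` it is applied by the linear ranker, so pass `ranker="linear"` together with `alpha`: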

```
combined_score = alpha * vector_score + (1 - alpha) * fulltext_score
```

| Alpha | Effect |
|-------|--------|
| 1.0 | Pure vector (semantic) search |
| 0.7 | Favor vector, some fulltext |
| 0.5 | Equal weight |
| 0.3 | Favor fulltext, some vector |
| 0.0 | Pure fulltext (keyword) search |
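
A small worked example makes the table concrete. The two candidates and their normalized scores below are invented for illustration; the point is only that the top result flips once `alpha` crosses a threshold (about 0.57 for these numbers).

```python
def combined_score(alpha: float, vector_score: float, fulltext_score: float) -> float:
    """Weighted sum from the formula above."""
    return alpha * vector_score + (1 - alpha) * fulltext_score


# Hypothetical normalized scores for two chunks.
candidates = {
    "exact_name_match": {"vector": 0.40, "fulltext": 1.00},
    "semantic_match": {"vector": 1.00, "fulltext": 0.20},
}


def top_candidate(alpha: float) -> str:
    """Return the candidate with the highest combined score at this alpha."""
    return max(
        candidates,
        key=lambda name: combined_score(
            alpha, candidates[name]["vector"], candidates[name]["fulltext"]
        ),
    )


for alpha in (0.0, 0.3, 0.5, 0.7, 1.0):
    scores = {
        name: round(combined_score(alpha, s["vector"], s["fulltext"]), 2)
        for name, s in candidates.items()
    }
    print(f"alpha={alpha}: top={top_candidate(alpha)} scores={scores}")
```

At `alpha=0.5` the exact-name chunk still wins (0.70 versus 0.60); at `alpha=0.7` the semantic chunk overtakes it (0.76 versus 0.58).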

In [None]:
# Compare different alpha values
# This query includes a specific term ("Apple") that fulltext excels at
# AND a semantic concept ("financial risks") that vector excels at
query = "Apple financial risks"

for alpha in [1.0, 0.5, 0.0]:
    results = hybrid_retriever.search(
        query_text=query,
        top_k=3,
        ranker="linear",  # alpha is applied by the linear ranker
        alpha=alpha,      # Weight for vector score
    )
    
    label = {
        1.0: "Pure Vector",
        0.5: "Balanced",
        0.0: "Pure Fulltext"
    }[alpha]
    
    print(f"\n=== Alpha={alpha} ({label}) ===")
    for i, item in enumerate(results.items, 1):
        text = item.content[:100] if isinstance(item.content, str) else str(item.content)[:100]
        print(f"{i}. {text}...")

## Pattern 3: HybridCypherRetriever with Graph Traversal

The `HybridCypherRetriever` extends hybrid search with **graph traversal**:

1. First, hybrid search finds relevant nodes (like `Chunk`)
2. Then, a Cypher query traverses the graph to gather context
3. Returns enriched results with related entities

This is powerful for knowledge graphs where you want to:
- Find semantically relevant chunks
- AND get related companies, risk factors, products, etc.

In [None]:
# Define a retrieval query that runs AFTER hybrid search finds matching nodes
# The 'node' variable contains matched Chunk nodes from hybrid search
# The 'score' variable contains the combined hybrid score
RETRIEVAL_QUERY = """
// Traverse from matched chunk to document and company
MATCH (node)-[:FROM_DOCUMENT]->(doc:Document)<-[:FILED]-(company:Company)

// Get risk factors for context
OPTIONAL MATCH (company)-[:FACES_RISK]->(risk:RiskFactor)

// Get products mentioned
OPTIONAL MATCH (company)-[:MENTIONS]->(product:Product)

// Return enriched results
RETURN node.text AS text,
       score,
       company.name AS company,
       doc.path AS document,
       collect(DISTINCT risk.name)[0..3] AS risks,
       collect(DISTINCT product.name)[0..3] AS products
"""


def format_result(record: neo4j.Record) -> RetrieverResultItem:
    """
    Format HybridCypherRetriever results for easy access.
    
    This is the recommended pattern from Neo4j for customizing retriever output.
    The function receives a neo4j.Record with keys matching the RETURN clause.
    
    Reference: neo4j-graphrag-python/examples/customize/retrievers/result_formatter_vector_cypher_retriever.py
    """
    return RetrieverResultItem(
        content={
            "text": record.get("text", ""),
            "company": record.get("company", "N/A"),
            "document": record.get("document", ""),
            "risks": record.get("risks", []),
            "products": record.get("products", []),
        },
        metadata={"score": record.get("score", 0.0)},
    )


# Create HybridCypherRetriever with result_formatter
hybrid_cypher_retriever = HybridCypherRetriever(
    driver=driver,
    vector_index_name=VECTOR_INDEX,
    fulltext_index_name=FULLTEXT_INDEX,
    retrieval_query=RETRIEVAL_QUERY,  # Graph traversal after hybrid search
    embedder=embedder,
    result_formatter=format_result,   # Converts each neo4j.Record to a RetrieverResultItem
)

print("HybridCypherRetriever created successfully")

In [None]:
# Search with graph-enhanced results
query = "artificial intelligence and machine learning"

results = hybrid_cypher_retriever.search(
    query_text=query,
    top_k=5,
)

print(f"Graph-enhanced hybrid search for: '{query}'\n")

for i, item in enumerate(results.items, 1):
    # content is a dict from our result_formatter
    content = item.content if isinstance(item.content, dict) else {}
    # score is in metadata from our result_formatter
    score = item.metadata.get("score", "N/A") if item.metadata else "N/A"
    
    print(f"{i}. Company: {content.get('company', 'N/A')}")
    print(f"   Score: {score}")
    
    text = content.get('text') or ''  # guard against a null text property
    print(f"   Text: {text[:150]}...")
    
    risks = content.get('risks', [])
    if risks:
        print(f"   Risks: {', '.join(risks)}")
    
    products = content.get('products', [])
    if products:
        print(f"   Products: {', '.join(products)}")
    print()

## Pattern 4: Comparing Search Methods

This pattern runs one query in all three modes to show where hybrid search can outperform pure vector or pure fulltext search.

**Scenario:** Query contains both:
- A specific entity name ("Microsoft") - fulltext excels
- A semantic concept ("cloud computing strategy") - vector excels

In [None]:
# Query with both specific name AND semantic concept
query = "Microsoft cloud computing strategy"

print(f"Query: '{query}'\n")
print("="*60)

# Run the same query under three weightings.
# alpha is applied by the linear ranker, so ranker="linear" is passed explicitly.
modes = [
    (1.0, "VECTOR ONLY", "Finds semantically similar content, may miss exact 'Microsoft' mentions"),
    (0.0, "FULLTEXT ONLY", "Matches 'Microsoft' keyword exactly, may miss semantic meaning"),
    (0.5, "HYBRID", "Combines both: exact 'Microsoft' matches + semantic 'cloud' relevance"),
]

for alpha, label, note in modes:
    print(f"\n[{label} - alpha={alpha}]")
    print(f"{note}\n")
    results = hybrid_retriever.search(
        query_text=query, top_k=2, ranker="linear", alpha=alpha
    )
    for item in results.items:
        text = item.content[:120] if isinstance(item.content, str) else str(item.content)[:120]
        print(f"  - {text}...")

## Summary

### Retriever Classes

| Class | Use Case |
|-------|----------|
| `HybridRetriever` | Simple hybrid search, returns matched nodes |
| `HybridCypherRetriever` | Hybrid search + graph traversal for context |

### Key Parameters

| Parameter | Description |
|-----------|-------------|
| `vector_index_name` | Vector index for semantic search |
| `fulltext_index_name` | Fulltext index for keyword search |
| `alpha` | Weight for the vector score, used with `ranker="linear"` (1.0=vector only, 0.0=fulltext only) |
| `top_k` | Number of results to return |
| `retrieval_query` | Cypher for graph traversal (HybridCypherRetriever) |
| `result_formatter` | Function to format neo4j.Record to RetrieverResultItem |

### Best Practices

- **Use `alpha=0.5`** as a starting point for balanced results
- **Increase alpha** (toward 1.0) for natural language questions
- **Decrease alpha** (toward 0.0) for queries with specific names/dates
- **Use HybridCypherRetriever** when you need related graph context
- **Use `result_formatter`** to control how results are structured
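
Rather than picking `alpha` by feel, one option is to sweep a few values against a small set of queries with known relevant results. The sketch below assumes a hypothetical `search_fn(query, alpha)` that returns ranked result ids; `fake_search` is a stand-in so the example runs without a database, and recall is just one possible metric.

```python
from typing import Callable, Dict, List, Sequence, Set

# Hypothetical signature: (query, alpha) -> ranked list of result ids.
SearchFn = Callable[[str, float], List[str]]


def recall(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Fraction of the relevant ids that appear in the retrieved list."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


def best_alpha(
    search_fn: SearchFn,
    labelled_queries: Dict[str, Set[str]],
    alphas: Sequence[float] = (0.0, 0.25, 0.5, 0.75, 1.0),
) -> float:
    """Return the alpha with the highest mean recall (first one wins ties)."""
    if not labelled_queries:
        raise ValueError("at least one labelled query is required")

    def mean_recall(alpha: float) -> float:
        scores = [
            recall(search_fn(query, alpha), relevant)
            for query, relevant in labelled_queries.items()
        ]
        return sum(scores) / len(scores)

    return max(alphas, key=mean_recall)


def fake_search(query: str, alpha: float) -> List[str]:
    """Stand-in for a real retriever call; returns invented ids."""
    return ["c1", "c2"] if alpha >= 0.5 else ["c3"]


labelled = {"Apple financial risks": {"c1", "c2"}}
print(best_alpha(fake_search, labelled))
```

In the notebook, `search_fn` could wrap `hybrid_retriever.search(...)` and map each returned item to a stable identifier; how to derive that identifier depends on the properties your retriever returns.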

---

**References:**
- [Neo4j GraphRAG Python Documentation](https://neo4j.com/docs/neo4j-graphrag-python/current/user_guide_rag.html)
- [Hybrid Retrieval Blog Post](https://neo4j.com/blog/developer/hybrid-retrieval-graphrag-python-package/)

[View the complete code](../solutions/05_02_hybrid_search.py)

[Return to Fulltext Search](05_01_fulltext_search.ipynb)

In [None]:
# Cleanup
driver.close()
print("Connection closed")