# Hybrid Retrievers for Aircraft Maintenance (Optional)

This notebook demonstrates **hybrid search**: combining vector similarity search with fulltext keyword search to improve retrieval. Hybrid search is particularly useful when queries contain specific technical terminology (fault codes, part numbers, exact specifications) that keyword matching handles better than semantic similarity alone.

**Prerequisites:** Complete [03 Data and Embeddings](03_data_and_embeddings.ipynb) first (creates both vector and fulltext indexes).

**Learning Objectives:**
- Understand when hybrid search outperforms pure vector search
- Set up a `HybridRetriever` using both vector and fulltext indexes
- Use `HybridCypherRetriever` for graph-enhanced hybrid search
- Compare vector vs hybrid retrieval on different query types

---

## How Hybrid Search Works

| Approach | Strengths | Weaknesses |
|----------|-----------|------------|
| **Vector search** | Finds semantically similar content even with different wording | Can miss exact matches for specific terms |
| **Fulltext search** | Precise keyword matching for codes, numbers, exact terms | Misses semantically related content with different wording |
| **Hybrid search** | Combines semantic understanding with keyword precision | Needs both a vector and a fulltext index; slightly more complex setup |

Example: Searching for "V2500 EGT exceedance"
- Vector search finds chunks about "engine temperature limits" (semantic match)
- Fulltext search finds chunks containing the exact strings "V2500" and "EGT" (keyword match)
- Hybrid search returns both, ranked by combined relevance
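
The idea that hybrid search "returns both" can be pictured with a toy sketch. The chunk identifiers below are hypothetical, and the function only illustrates the union of two result lists; it ignores ranking and is not how the library is implemented.

```python
def hybrid_union(vector_hits, fulltext_hits):
    """Return each chunk once, tagged with the search(es) that found it."""
    sources = {}
    for chunk_id in vector_hits:
        sources.setdefault(chunk_id, []).append("vector")
    for chunk_id in fulltext_hits:
        sources.setdefault(chunk_id, []).append("fulltext")
    return sources


# Hypothetical chunk ids for the query "V2500 EGT exceedance"
vector_hits = ["chunk_temp_limits", "chunk_v2500_egt"]
fulltext_hits = ["chunk_v2500_egt", "chunk_v2500_overview"]

sources = hybrid_union(vector_hits, fulltext_hits)
for chunk_id, found_by in sources.items():
    print(f"{chunk_id}: found by {' + '.join(found_by)}")
```

A chunk found by both searches (here `chunk_v2500_egt`) is usually a strong candidate, which is why the combined ranking tends to favor it.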

## Configuration

Enter your Neo4j Aura connection details below (same credentials as notebooks 03 and 04).

In [None]:
# ============================================
# CONFIGURATION - Enter your Neo4j credentials
# ============================================

NEO4J_URI = ""  # e.g., "neo4j+s://xxxxxxxx.databases.neo4j.io"
NEO4J_USERNAME = "neo4j"
NEO4J_PASSWORD = ""  # Your password from Lab 1

# Validate configuration
if not NEO4J_URI or not NEO4J_PASSWORD:
    print("WARNING: Please enter your Neo4j credentials above before running the notebook!")
else:
    print("Configuration ready!")
    print(f"Neo4j URI: {NEO4J_URI}")
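
If you prefer not to type credentials into the notebook, one option is to read them from environment variables. The sketch below is an assumption, not part of the lab setup: the variable names `NEO4J_URI`, `NEO4J_USERNAME`, and `NEO4J_PASSWORD` and the helper `load_neo4j_config` are hypothetical.

```python
import os


def load_neo4j_config(env=os.environ):
    """Read connection settings from environment variables (names are assumptions)."""
    config = {
        "uri": env.get("NEO4J_URI", ""),
        "username": env.get("NEO4J_USERNAME", "neo4j"),
        "password": env.get("NEO4J_PASSWORD", ""),
    }
    # Report which required settings are empty instead of failing later
    missing = [key for key in ("uri", "password") if not config[key]]
    return config, missing


config, missing = load_neo4j_config()
if missing:
    print(f"WARNING: missing settings: {', '.join(missing)}")
else:
    # Never print the password itself
    print(f"Configuration ready for {config['uri']}")
```

The returned values could then be assigned to `NEO4J_URI`, `NEO4J_USERNAME`, and `NEO4J_PASSWORD` so the rest of the notebook runs unchanged.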

## Setup

Import required modules and initialize connections.

In [None]:
from neo4j_graphrag.retrievers import HybridRetriever, HybridCypherRetriever, VectorRetriever
from neo4j_graphrag.generation import GraphRAG

from data_utils import Neo4jConnection, get_llm, get_embedder

In [None]:
neo4j = Neo4jConnection(uri=NEO4J_URI, username=NEO4J_USERNAME, password=NEO4J_PASSWORD).verify()
driver = neo4j.driver

# Show graph statistics
neo4j.get_graph_stats()

In [None]:
llm = get_llm()
embedder = get_embedder()

print(f"LLM initialized: {llm.model_id}")
print(f"Embedder initialized: {embedder.model_id}")

In [None]:
# Index names created in notebook 03
VECTOR_INDEX = "maintenanceChunkEmbeddings"
FULLTEXT_INDEX = "maintenanceChunkText"

---

# Part 1: Hybrid Retriever

The `HybridRetriever` combines vector similarity search with fulltext keyword search. It queries both indexes and merges the results using a configurable ranking strategy.
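
To make "merging with a ranking strategy" concrete, here is a minimal sketch of one common approach: normalize each result list so its best hit scores 1.0, then take a weighted sum. The function name, the `alpha` parameter, and the sample scores are illustrative assumptions; this is not the library's internal code.

```python
def linear_rank(vector_scores, fulltext_scores, alpha=0.5, top_k=5):
    """Blend two score dicts: alpha weights vector hits, (1 - alpha) fulltext hits."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")

    def normalize(scores):
        # Raw vector and fulltext scores live on different scales,
        # so rescale each list relative to its own best hit.
        top = max(scores.values(), default=0.0)
        return {cid: (s / top if top else 0.0) for cid, s in scores.items()}

    v = normalize(vector_scores)
    f = normalize(fulltext_scores)
    combined = {
        cid: alpha * v.get(cid, 0.0) + (1 - alpha) * f.get(cid, 0.0)
        for cid in v.keys() | f.keys()
    }
    # Sort by score (descending), then by id so ties are deterministic
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical raw scores: cosine similarity vs. keyword relevance
vector_scores = {"a": 0.8, "b": 0.4}
fulltext_scores = {"b": 6.0, "c": 3.0}

balanced = [cid for cid, _ in linear_rank(vector_scores, fulltext_scores, alpha=0.5)]
vector_only = [cid for cid, _ in linear_rank(vector_scores, fulltext_scores, alpha=1.0)]
print(balanced, vector_only)
```

With equal weights, chunk `b` ranks first because both searches found it; with `alpha=1.0` the ranking reduces to the vector order.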

## Initialize Hybrid Retriever

Set up the hybrid retriever with both vector and fulltext index names.

In [None]:
hybrid_retriever = HybridRetriever(
    driver=driver,
    vector_index_name=VECTOR_INDEX,
    fulltext_index_name=FULLTEXT_INDEX,
    embedder=embedder,
    return_properties=["text"],
)

print("HybridRetriever initialized")

## Hybrid Search

Test the hybrid retriever with a query that benefits from both semantic and keyword matching.

In [None]:
query = "V2500 engine EGT exceedance troubleshooting"
result = hybrid_retriever.search(query_text=query, top_k=5)

print(f"Query: \"{query}\"\n")
print(f"Number of results: {len(result.items)}\n")
print("=" * 70)

for item in result.items:
    print(f"\nScore: {item.metadata['score']:.4f}")
    print(f"Content: {item.content[0:200]}...")

## GraphRAG with Hybrid Retriever

Combine the hybrid retriever with an LLM for context-aware answers.

In [None]:
query = "What are the V2500 engine operating limits and what should I do if EGT is exceeded?"

rag = GraphRAG(llm=llm, retriever=hybrid_retriever)
response = rag.search(
    query,
    retriever_config={"top_k": 5},
    return_context=True,
    response_fallback="No relevant maintenance procedures found.",
)

print(f"Query: \"{query}\"\n")
print(f"Number of chunks used: {len(response.retriever_result.items)}\n")
print("=" * 70)
print("\nAnswer:")
print(response.answer)

---

# Part 2: Hybrid Cypher Retriever

The `HybridCypherRetriever` extends hybrid search with a custom Cypher query for graph traversal. A single retriever then provides the keyword precision of fulltext search, the semantic understanding of vector search, and the structural context of graph traversal.

## Adjacent Chunk Retrieval with Hybrid Search

Use hybrid search as the entry point, then traverse `NEXT_CHUNK` relationships to gather surrounding context. This is the same pattern as in notebook 04, but with hybrid search powering the initial retrieval.

In [None]:
adjacent_chunks_query = """
WITH node
OPTIONAL MATCH (prev:Chunk)-[:NEXT_CHUNK]->(node)
OPTIONAL MATCH (node)-[:NEXT_CHUNK]->(next:Chunk)
MATCH (node)-[:FROM_DOCUMENT]->(doc:Document)
RETURN 
    doc.documentId AS document_id,
    node.index AS chunk_index,
    COALESCE(prev.text, '') AS previous_context,
    node.text AS main_context,
    COALESCE(next.text, '') AS next_context
"""

hybrid_adjacent_retriever = HybridCypherRetriever(
    driver=driver,
    vector_index_name=VECTOR_INDEX,
    fulltext_index_name=FULLTEXT_INDEX,
    embedder=embedder,
    retrieval_query=adjacent_chunks_query,
)

print("HybridCypherRetriever initialized with adjacent chunks query")
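
The effect of the retrieval query can be pictured without a database. The sketch below treats a document as an ordered list of chunk texts and returns the same three fields as the Cypher query, using an empty string when a neighbor does not exist (mirroring `COALESCE`). The helper name and sample texts are hypothetical.

```python
def with_adjacent_context(chunks, hit_index):
    """Return the matched chunk plus its previous and next neighbors."""
    previous_text = chunks[hit_index - 1] if hit_index > 0 else ""
    next_text = chunks[hit_index + 1] if hit_index + 1 < len(chunks) else ""
    return {
        "previous_context": previous_text,
        "main_context": chunks[hit_index],
        "next_context": next_text,
    }


# Hypothetical chunk texts from one maintenance document
chunks = [
    "Depressurize the hydraulic system.",
    "Draw a fluid sample from the reservoir.",
    "Inspect the sample for particles and discoloration.",
]

middle = with_adjacent_context(chunks, 1)
first = with_adjacent_context(chunks, 0)
print(middle["previous_context"], "|", middle["main_context"], "|", middle["next_context"])
```

Passing the neighbors to the LLM helps when a procedure step only makes sense together with the steps around it.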

In [None]:
query = "What is the procedure for hydraulic fluid contamination check?"

rag = GraphRAG(llm=llm, retriever=hybrid_adjacent_retriever)
response = rag.search(
    query,
    retriever_config={"top_k": 3},
    return_context=True,
    response_fallback="No relevant maintenance procedures found.",
)

print(f"Query: \"{query}\"\n")
print(f"Number of results: {len(response.retriever_result.items)}\n")
print("=" * 70)
print("\nAnswer:")
print(response.answer)

## Document Context with Hybrid Search

Enrich hybrid search results with document metadata for source traceability.

In [None]:
document_context_query = """
MATCH (node)-[:FROM_DOCUMENT]->(doc:Document)
RETURN 
    doc.documentId AS document_id,
    doc.aircraftType AS aircraft_type,
    doc.title AS document_title,
    node.index AS chunk_index,
    node.text AS context
"""

hybrid_document_retriever = HybridCypherRetriever(
    driver=driver,
    vector_index_name=VECTOR_INDEX,
    fulltext_index_name=FULLTEXT_INDEX,
    embedder=embedder,
    retrieval_query=document_context_query,
)

print("HybridCypherRetriever initialized with document context query")

In [None]:
query = "What are the borescope inspection intervals for the V2500?"

rag = GraphRAG(llm=llm, retriever=hybrid_document_retriever)
response = rag.search(
    query,
    retriever_config={"top_k": 3},
    return_context=True,
    response_fallback="No relevant maintenance procedures found.",
)

print(f"Query: \"{query}\"\n")
print("=" * 70)
print("\nAnswer:")
print(response.answer)

print("\n\nContext used:")
print("=" * 70)
for item in response.retriever_result.items:
    print(f"\n{item.content}")

---

# Part 3: Vector vs Hybrid Comparison

Let's compare retrieval results on queries where each approach has distinct advantages.

## Set Up Both Retrievers

Create a `VectorRetriever` alongside the `HybridRetriever` for side-by-side comparison.

In [None]:
vector_retriever = VectorRetriever(
    driver=driver,
    index_name=VECTOR_INDEX,
    embedder=embedder,
    return_properties=["text"],
)

print("VectorRetriever initialized for comparison")

## Side-by-Side Comparison

We'll test queries that exercise different retrieval strengths:
- **Semantic query**: Natural language question (vector search advantage)
- **Technical query**: Contains specific codes/terms (hybrid search advantage)

In [None]:
comparison_queries = [
    {
        "label": "Semantic (natural language)",
        "query": "How do I fix an engine that is running too hot?",
    },
    {
        "label": "Technical (specific terminology)",
        "query": "V2500 EGT limit 925 degrees",
    },
    {
        "label": "Mixed (natural language + specific terms)",
        "query": "What causes hydraulic pressure drop below 2800 PSI?",
    },
]

def show_top(label, result):
    """Print the top hit for one retriever, or a notice if nothing was returned."""
    if not result.items:
        print(f"\n  [{label}] No results")
        return
    top = result.items[0]
    print(f"\n  [{label}] Top result score: {top.metadata['score']:.4f}")
    print(f"  [{label}] Content: {top.content[:120]}...")


for entry in comparison_queries:
    print(f"\n{'=' * 70}")
    print(f"Query type: {entry['label']}")
    print(f"Query: \"{entry['query']}\"")
    print("=" * 70)

    # Scores may be on different scales per retriever; compare the content, not the numbers
    show_top("Vector", vector_retriever.search(query_text=entry["query"], top_k=3))
    show_top("Hybrid", hybrid_retriever.search(query_text=entry["query"], top_k=3))

**Key observations:**

- **Semantic queries**: Both retrievers tend to return similar chunks, since the query has few distinctive keywords and the vector component drives the ranking
- **Technical queries**: The hybrid retriever often finds more precise matches because the fulltext component catches exact terms like part numbers and fault codes
- **Mixed queries**: The hybrid retriever draws on both signals: semantic matching for the question and keyword matching for the specific values
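
One way to quantify how much the two retrievers agree is to measure the overlap of their top-k results. The sketch below computes a Jaccard overlap on chunk identifiers; the helper name and the sample identifiers are hypothetical, and how you obtain identifiers from the retriever results depends on the properties you return.

```python
def top_k_overlap(results_a, results_b, k=3):
    """Jaccard overlap between the top-k ids of two ranked result lists."""
    a, b = set(results_a[:k]), set(results_b[:k])
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical chunk ids returned by each retriever for the same query
vector_ids = ["c1", "c2", "c3"]
hybrid_ids = ["c2", "c4", "c1"]

overlap = top_k_overlap(vector_ids, hybrid_ids)
print(f"Top-3 overlap: {overlap:.2f}")
```

An overlap near 1.0 suggests the fulltext component added little for that query; a low overlap suggests keyword matching surfaced different chunks that are worth inspecting.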

## GraphRAG Answer Comparison

Compare the final LLM-generated answers using each retriever.

In [None]:
query = "What fault codes indicate bearing wear in the V2500 engine?"

print(f"Query: \"{query}\"")
print("\n" + "=" * 70)

# Vector Retriever
print("\n[1] VECTOR RETRIEVER")
print("-" * 40)
rag_vector = GraphRAG(llm=llm, retriever=vector_retriever)
response_vector = rag_vector.search(
    query,
    retriever_config={"top_k": 3},
    return_context=True,
    response_fallback="No relevant maintenance procedures found.",
)
print(response_vector.answer)

# Hybrid Retriever
print("\n" + "=" * 70)
print("\n[2] HYBRID RETRIEVER")
print("-" * 40)
rag_hybrid = GraphRAG(llm=llm, retriever=hybrid_retriever)
response_hybrid = rag_hybrid.search(
    query,
    retriever_config={"top_k": 3},
    return_context=True,
    response_fallback="No relevant maintenance procedures found.",
)
print(response_hybrid.answer)

---

## Summary

In this notebook, you explored hybrid retrieval strategies for aircraft maintenance GraphRAG:

**Part 1 - Hybrid Retriever:**
1. Combined vector + fulltext search using `HybridRetriever`
2. GraphRAG pipeline with hybrid retrieval

**Part 2 - Hybrid Cypher Retriever:**
3. Adjacent chunk retrieval powered by hybrid search
4. Document context enrichment with hybrid search

**Part 3 - Comparison:**
5. Vector vs hybrid retrieval on different query types
6. When each approach excels

**When to use Hybrid Search:**
- Queries may contain specific technical terms, codes, or part numbers
- Your corpus has domain-specific vocabulary that benefits from exact matching
- You want robust retrieval across both natural language and technical queries
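
As a rough illustration of the first criterion, the sketch below flags tokens that look like codes, part numbers, or acronyms. The rule (any token containing a digit, or an all-caps token of two or more letters) is a hypothetical heuristic for exploring your own queries, not something the retrievers do internally.

```python
import re


def technical_tokens(query):
    """Return tokens that look like codes, numbers, or acronyms."""
    found = []
    for token in re.findall(r"[A-Za-z0-9-]+", query):
        has_digit = any(ch.isdigit() for ch in token)
        is_acronym = len(token) >= 2 and token.isupper()
        if has_digit or is_acronym:
            found.append(token)
    return found


for query in [
    "How do I fix an engine that is running too hot?",
    "V2500 EGT limit 925 degrees",
    "What causes hydraulic pressure drop below 2800 PSI?",
]:
    tokens = technical_tokens(query)
    hint = "hybrid search likely helps" if tokens else "vector search may suffice"
    print(f"{query!r}: {tokens} -> {hint}")
```

Running a sample of real user queries through a check like this gives a quick sense of how often exact-term matching would matter for your workload.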

**Available Retrievers Summary:**

| Retriever | Search Method | Graph Traversal | Best For |
|-----------|--------------|-----------------|----------|
| `VectorRetriever` | Semantic similarity | No | Natural language questions |
| `VectorCypherRetriever` | Semantic similarity | Yes (custom Cypher) | Context-rich semantic search |
| `HybridRetriever` | Semantic + keyword | No | Technical terminology queries |
| `HybridCypherRetriever` | Semantic + keyword | Yes (custom Cypher) | Full-featured technical search |

In [None]:
# Cleanup
neo4j.close()