# Aircraft Maintenance Manual - Data Loading and Embeddings

This notebook covers the complete data preparation pipeline for adding semantic search capabilities to your aircraft knowledge graph. We'll load the A320-200 Maintenance Manual into Neo4j as a Document-Chunk structure, then enrich it with embeddings for semantic search.

**Prerequisites:**
- Complete **Lab 5** (Databricks ETL) to load the aircraft topology graph (Aircraft, System, Component nodes)
- Running in a Databricks notebook environment

**Learning Objectives:**
- Understand the Document -> Chunk graph structure for semantic search
- Connect to Neo4j and create Document and Chunk nodes alongside existing aircraft data
- Link chunks with `FROM_DOCUMENT` and `NEXT_CHUNK` relationships
- Generate embeddings using Databricks Foundation Model APIs (BGE-large)
- Create a vector index and perform similarity search over maintenance procedures

---

## Why Documents and Chunks?

When building GraphRAG applications, we split documents into smaller pieces called **chunks** because:

1. **Context windows are limited** - LLMs can only process a certain amount of text at once
2. **Retrieval precision** - Smaller chunks allow more precise matching to user queries
3. **Cost efficiency** - Passing only the relevant chunks to an LLM, rather than the whole manual, is faster and cheaper

The graph structure we'll build extends your existing aircraft topology:
```
(:Aircraft)-[:HAS_SYSTEM]->(:System)-[:HAS_COMPONENT]->(:Component)
                                |
(:Document) <-[:FROM_DOCUMENT]- (:Chunk) -[:NEXT_CHUNK]-> (:Chunk)
```

The maintenance manual chunks can later be linked to specific aircraft, systems, or components for graph-enhanced retrieval.
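
To make the Document-Chunk shape concrete before touching the database, here is a minimal in-memory sketch. The `Document` and `Chunk` classes and the `build_chain` helper are hypothetical illustrations, not part of the lab's `data_utils` module; their attributes simply mirror the `FROM_DOCUMENT` and `NEXT_CHUNK` relationships in the diagram above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Document:
    document_id: str
    title: str


@dataclass
class Chunk:
    text: str
    index: int
    from_document: Document               # mirrors (:Chunk)-[:FROM_DOCUMENT]->(:Document)
    next_chunk: Optional["Chunk"] = None  # mirrors (:Chunk)-[:NEXT_CHUNK]->(:Chunk)


def build_chain(document: Document, texts: list[str]) -> list[Chunk]:
    """Create one Chunk per text and link each chunk to its successor."""
    chunks = [Chunk(text=t, index=i, from_document=document) for i, t in enumerate(texts)]
    for current, following in zip(chunks, chunks[1:]):
        current.next_chunk = following
    return chunks


doc = Document(
    document_id="AMM-A320-2024-001",
    title="A320-200 Maintenance and Troubleshooting Manual",
)
# Illustrative texts only, not taken from the real manual
chain = build_chain(doc, ["Engine overview", "Vibration limits", "Troubleshooting steps"])
print([c.index for c in chain])
print(chain[0].next_chunk.text)
```

Every chunk points back to the same document, and only the last chunk has no successor, which is the same shape the Cypher statements later in this notebook create.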

## Configuration

Enter your Neo4j Aura connection details below. You received these credentials when you created your Neo4j Aura instance in Lab 1.

In [None]:
# ============================================
# CONFIGURATION - Enter your Neo4j credentials
# ============================================

NEO4J_URI = ""  # e.g., "neo4j+s://xxxxxxxx.databases.neo4j.io"
NEO4J_USERNAME = "neo4j"
NEO4J_PASSWORD = ""  # Your password from Lab 1

# Unity Catalog Volume path (pre-configured by workshop admin)
DATA_PATH = "/Volumes/aws-databricks-neo4j-lab/lab-schema/lab-volume"

# Validate configuration
if not NEO4J_URI or not NEO4J_PASSWORD:
    print("WARNING: Please enter your Neo4j credentials above before running the notebook!")
else:
    print("Configuration ready!")
    print(f"Neo4j URI: {NEO4J_URI}")
    print(f"Data Path: {DATA_PATH}")
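
If you would rather not type credentials into a notebook cell, one possible alternative is to read them from environment variables. The sketch below is an assumption-laden illustration: the function name `load_neo4j_config` and the variable names `NEO4J_URI`, `NEO4J_USERNAME`, and `NEO4J_PASSWORD` are hypothetical choices, and the example URI is a placeholder rather than a real instance.

```python
import os


def load_neo4j_config(env=os.environ) -> dict:
    """Read Neo4j settings from an environment-like mapping (hypothetical variable names)."""
    config = {
        "uri": env.get("NEO4J_URI", ""),
        "username": env.get("NEO4J_USERNAME", "neo4j"),
        "password": env.get("NEO4J_PASSWORD", ""),
    }
    # Fail early instead of printing a warning and continuing
    missing = [key for key in ("uri", "password") if not config[key]]
    if missing:
        raise ValueError(f"Missing Neo4j settings: {', '.join(missing)}")
    return config


# Example with an explicit mapping so the sketch runs anywhere (placeholder values)
example = load_neo4j_config(
    {"NEO4J_URI": "neo4j+s://example.databases.neo4j.io", "NEO4J_PASSWORD": "secret"}
)
print(example["uri"], example["username"])
```

Raising an error, rather than printing a warning, stops later cells from running against an empty URI.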

## Setup

Import required modules and configure the environment.

In [None]:
from neo4j_graphrag.indexes import create_vector_index, create_fulltext_index, upsert_vectors
from data_utils import (
    Neo4jConnection, VolumeDataLoader, split_text, get_embedder,
    EMBEDDING_DIMENSIONS
)

## Maintenance Manual Data

We'll load the A320-200 Maintenance and Troubleshooting Manual from the Unity Catalog Volume. This file was uploaded during lab setup along with the aircraft CSV data.

**Volume Path:** `/Volumes/aws-databricks-neo4j-lab/lab-schema/lab-volume/MAINTENANCE_A320.md`

This comprehensive document includes:

- **Aircraft specifications** for the SkyWays A320-200 fleet (5 aircraft)
- **System architecture** covering Engines (V2500-A1), Avionics, and Hydraulics
- **Troubleshooting procedures** with fault codes and decision trees
- **Operating limits** and scheduled maintenance tasks

The manual supports semantic search queries such as:
- "How do I troubleshoot engine vibration?"
- "What are the EGT limits during takeoff?"
- "What causes hydraulic pressure loss?"

In [None]:
# Load text from the maintenance manual in Unity Catalog Volume
loader = VolumeDataLoader("MAINTENANCE_A320.md", volume_path=DATA_PATH)
MANUAL_TEXT = loader.text

# Document metadata
DOCUMENT_ID = "AMM-A320-2024-001"
DOCUMENT_TYPE = "Maintenance Manual"
AIRCRAFT_TYPE = "A320-200"

metadata = loader.get_metadata()
print(f"Loaded: {metadata['name']}")
print(f"From Volume: {metadata['volume']}")
print(f"Size: {metadata['size']:,} characters")
print(f"\nFirst 500 characters:")
print(f"{MANUAL_TEXT[:500]}...")
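
Unity Catalog Volumes are exposed as ordinary file paths under `/Volumes`, so the loading step can be approximated with the standard library. The following is a rough sketch of what a loader like `VolumeDataLoader` might do; the real helper in `data_utils` may work differently, and `read_volume_text` is a hypothetical name. The metadata keys mirror the ones printed in the cell above.

```python
from pathlib import Path
import tempfile


def read_volume_text(volume_path: str, filename: str) -> tuple[str, dict]:
    """Read a UTF-8 text file and return its contents plus simple metadata."""
    path = Path(volume_path) / filename
    text = path.read_text(encoding="utf-8")
    metadata = {"name": path.name, "volume": str(path.parent), "size": len(text)}
    return text, metadata


# A temporary directory stands in for the Volume so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "MAINTENANCE_A320.md").write_text(
        "# A320-200 Manual\nEngine limits...", encoding="utf-8"
    )
    text, meta = read_volume_text(tmp, "MAINTENANCE_A320.md")

print(meta["name"], meta["size"])
```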

## Connect to Neo4j

Create a connection to your Neo4j database. This should already contain the aircraft topology from Lab 5.

In [None]:
neo4j = Neo4jConnection(uri=NEO4J_URI, username=NEO4J_USERNAME, password=NEO4J_PASSWORD).verify()
driver = neo4j.driver

# Show existing graph statistics
neo4j.get_graph_stats()

## Clear Existing Chunks (Optional)

For a clean start, remove any existing Document and Chunk nodes from previous runs. This preserves your aircraft topology (Aircraft, System, Component nodes).

In [None]:
neo4j.clear_chunks()

---

# Part 1: Building the Document-Chunk Graph

First, we'll create the basic graph structure with Document and Chunk nodes.

## Create Document Node

Create a Document node to represent the maintenance manual. This node stores metadata about the document including its ID, type, and applicable aircraft.

In [None]:
def create_document(driver, doc_id: str, doc_type: str, aircraft_type: str) -> str:
    """Create a Document node and return its element ID."""
    records, _, _ = driver.execute_query("""
        CREATE (d:Document {
            documentId: $doc_id,
            type: $doc_type,
            aircraftType: $aircraft_type,
            title: 'A320-200 Maintenance and Troubleshooting Manual'
        })
        RETURN elementId(d) as doc_id
    """, doc_id=doc_id, doc_type=doc_type, aircraft_type=aircraft_type)
    return records[0]["doc_id"]

doc_element_id = create_document(driver, DOCUMENT_ID, DOCUMENT_TYPE, AIRCRAFT_TYPE)
print(f"Created Document node with ID: {doc_element_id}")

## Split Text into Chunks

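Use the `split_text` helper from `data_utils`, which relies on `FixedSizeSplitter` from neo4j-graphrag-python, to split the manual into fixed-size chunks. Both parameters are measured in characters, and the values below suit technical documentation:

- `chunk_size=800`: Chunks of up to 800 characters are large enough to keep a procedure step or specification together with its surrounding context
- `chunk_overlap=100`: Consecutive chunks share 100 characters, which preserves context across chunk boundaries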

In [None]:
# Split text using the utility function
chunks = split_text(MANUAL_TEXT, chunk_size=800, chunk_overlap=100)

print(f"Split into {len(chunks)} chunks:\n")
for i, chunk in enumerate(chunks[:5]):  # Show first 5 chunks
    print(f"Chunk {i}: {len(chunk)} chars")
    print(f"  {chunk[:80]}...\n")

if len(chunks) > 5:
    print(f"... and {len(chunks) - 5} more chunks")
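
The sliding-window idea behind fixed-size splitting can be sketched in plain Python. This is an illustration only, not the `FixedSizeSplitter` implementation or the lab's `split_text` helper; `fixed_size_chunks` is a hypothetical name, and the sample text is synthetic.

```python
def fixed_size_chunks(text: str, chunk_size: int = 800, chunk_overlap: int = 100) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Synthetic 2000-character sample
sample = "".join(str(i % 10) for i in range(2000))
pieces = fixed_size_chunks(sample, chunk_size=800, chunk_overlap=100)
print([len(p) for p in pieces])  # [800, 800, 600]
print(pieces[0][-100:] == pieces[1][:100])  # True: the overlap is shared
```

With a step of 700 characters, the windows start at offsets 0, 700, and 1400, so the last 100 characters of each chunk reappear at the start of the next.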

## Create Chunk Nodes

Create Chunk nodes for each piece of text and link them to the Document with `FROM_DOCUMENT` relationships.

In [None]:
def create_chunks(driver, doc_element_id: str, chunks: list[str]) -> list[str]:
    """Create Chunk nodes linked to a Document. Returns chunk element IDs."""
    chunk_ids = []
    for index, text in enumerate(chunks):
        records, _, _ = driver.execute_query("""
            MATCH (d:Document) WHERE elementId(d) = $doc_id
            CREATE (c:Chunk {text: $text, index: $index})
            CREATE (c)-[:FROM_DOCUMENT]->(d)
            RETURN elementId(c) as chunk_id
        """, doc_id=doc_element_id, text=text, index=index)
        chunk_id = records[0]["chunk_id"]
        chunk_ids.append(chunk_id)
        if index < 5 or index == len(chunks) - 1:
            print(f"Created Chunk {index}")
        elif index == 5:
            print("...")
    return chunk_ids

chunk_ids = create_chunks(driver, doc_element_id, chunks)
print(f"\nCreated {len(chunk_ids)} chunks")
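
The cell above issues one query per chunk, which is easy to follow but means one round trip per chunk. For larger documents, a possible alternative is to send all chunks in a single query with `UNWIND`. The sketch below is one way to do that under the same schema; `build_chunk_rows` and `create_chunks_batched` are hypothetical names, and `driver` is assumed to be the Neo4j driver used elsewhere in this notebook.

```python
CREATE_CHUNKS_QUERY = """
MATCH (d:Document) WHERE elementId(d) = $doc_id
UNWIND $rows AS row
CREATE (c:Chunk {text: row.text, index: row.index})
CREATE (c)-[:FROM_DOCUMENT]->(d)
RETURN row.index AS index, elementId(c) AS chunk_id
"""


def build_chunk_rows(chunks: list[str]) -> list[dict]:
    """Turn chunk texts into parameter rows for UNWIND."""
    return [{"index": i, "text": text} for i, text in enumerate(chunks)]


def create_chunks_batched(driver, doc_element_id: str, chunks: list[str]) -> list[str]:
    """Create all Chunk nodes in one query and return element IDs in chunk order."""
    records, _, _ = driver.execute_query(
        CREATE_CHUNKS_QUERY, doc_id=doc_element_id, rows=build_chunk_rows(chunks)
    )
    # Sort client-side so the result order never depends on the server
    ordered = sorted(records, key=lambda record: record["index"])
    return [record["chunk_id"] for record in ordered]


print(build_chunk_rows(["first chunk", "second chunk"]))
```

Usage would mirror the original call: `chunk_ids = create_chunks_batched(driver, doc_element_id, chunks)`.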

## Link Chunks with NEXT_CHUNK

Create `NEXT_CHUNK` relationships between sequential chunks. This preserves the original document order for context retrieval.

In [None]:
def link_chunks(driver, chunk_ids: list[str]):
    """Create NEXT_CHUNK relationships between sequential chunks."""
    for i in range(len(chunk_ids) - 1):
        driver.execute_query("""
            MATCH (c1:Chunk) WHERE elementId(c1) = $id1
            MATCH (c2:Chunk) WHERE elementId(c2) = $id2
            CREATE (c1)-[:NEXT_CHUNK]->(c2)
        """, id1=chunk_ids[i], id2=chunk_ids[i+1])
    print(f"Created {max(len(chunk_ids) - 1, 0)} NEXT_CHUNK relationships")

link_chunks(driver, chunk_ids)
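
Linking sequential chunks comes down to pairing each chunk ID with its successor. The sketch below shows that pairing in plain Python and a batched variant of the linking query. Note two deliberate differences from the cell above, both assumptions of this sketch: it sends all pairs in one `UNWIND` query, and it uses `MERGE` instead of `CREATE` so that re-running it does not duplicate relationships. The names `consecutive_pairs` and `link_chunks_batched` are hypothetical.

```python
LINK_CHUNKS_QUERY = """
UNWIND $pairs AS pair
MATCH (c1:Chunk) WHERE elementId(c1) = pair.source
MATCH (c2:Chunk) WHERE elementId(c2) = pair.target
MERGE (c1)-[:NEXT_CHUNK]->(c2)
"""


def consecutive_pairs(chunk_ids: list[str]) -> list[dict]:
    """Pair every chunk ID with the ID that follows it."""
    return [{"source": a, "target": b} for a, b in zip(chunk_ids, chunk_ids[1:])]


def link_chunks_batched(driver, chunk_ids: list[str]) -> int:
    """Create NEXT_CHUNK relationships in one query; returns the number of pairs sent."""
    pairs = consecutive_pairs(chunk_ids)
    if pairs:  # nothing to link for zero or one chunk
        driver.execute_query(LINK_CHUNKS_QUERY, pairs=pairs)
    return len(pairs)


print(consecutive_pairs(["a", "b", "c"]))
```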

## Verify the Graph Structure

Query the graph to see what we created alongside the existing aircraft topology.

In [None]:
def show_graph_structure(driver):
    """Display the Document-Chunk graph structure."""
    # Count chunks per document
    records, _, _ = driver.execute_query("""
        MATCH (d:Document)
        OPTIONAL MATCH (d)<-[:FROM_DOCUMENT]-(c:Chunk)
        RETURN d.documentId as document_id, d.title as title, count(c) as chunks
    """)
    print("=== Document-Chunk Structure ===")
    for record in records:
        print(f"Document: {record['document_id']}")
        print(f"  Title: {record['title']}")
        print(f"  Chunks: {record['chunks']}")
    
    # Show chunk chain sample
    records, _, _ = driver.execute_query("""
        MATCH (c:Chunk)
        WHERE c.index IS NOT NULL
        OPTIONAL MATCH (c)-[:NEXT_CHUNK]->(next:Chunk)
        RETURN c.index as idx, 
               substring(c.text, 0, 60) as text,
               next.index as next_idx
        ORDER BY c.index
        LIMIT 5
    """)
    print("\n=== Chunk Chain (first 5) ===")
    for record in records:
        next_str = f" -> Chunk {record['next_idx']}" if record['next_idx'] is not None else " (end)"
        print(f"Chunk {record['idx']}: \"{record['text']}...\"{next_str}")

show_graph_structure(driver)

---

# Part 2: Adding Embeddings for Semantic Search

Now that we have our Document-Chunk graph, we'll add embeddings to enable semantic search. Embeddings are numerical representations (vectors) of text that capture semantic meaning.

```
"How to fix engine vibration" -> [0.12, -0.45, 0.78, ...] (1024 dimensions)
"Troubleshooting vibration exceedance" -> [0.11, -0.44, 0.77, ...] (similar vector!)
```

Similar texts have similar embeddings, enabling **semantic search** - finding maintenance procedures by meaning rather than exact keywords.
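
"Similar" here is usually measured with cosine similarity, the cosine of the angle between two vectors. The sketch below computes it in plain Python on toy 3-dimensional vectors; the numbers are made up for illustration (real BGE-large embeddings have 1024 dimensions), and `cosine_similarity` is a hypothetical helper rather than part of any library used in this lab.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings (illustrative values only)
fix_vibration = [0.12, -0.45, 0.78]
troubleshoot_vibration = [0.11, -0.44, 0.77]
unrelated = [-0.70, 0.30, 0.05]

print(round(cosine_similarity(fix_vibration, troubleshoot_vibration), 4))
print(round(cosine_similarity(fix_vibration, unrelated), 4))
```

The two vibration vectors point in nearly the same direction and score close to 1, while the unrelated vector scores much lower.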

## Initialize Embedder

Create an embedder using Databricks Foundation Model APIs. We use `databricks-bge-large-en`, which produces 1024-dimensional vectors optimized for semantic search.

In [None]:
embedder = get_embedder()
print(f"Embedder initialized: {embedder.model_id}")
print(f"Embedding dimensions: {EMBEDDING_DIMENSIONS}")

## Generate and Store Embeddings

Generate an embedding vector for each chunk and store them in Neo4j using `upsert_vectors` from the neo4j-graphrag library. This batch-stores all embeddings in a single operation rather than updating nodes one at a time.

In [None]:
def generate_embeddings(embedder, chunk_ids: list[str], driver) -> list[list[float]]:
    """Generate embeddings for all chunks and batch-store them using upsert_vectors."""
    # Fetch all chunk texts in one query
    records, _, _ = driver.execute_query("""
        MATCH (c:Chunk) WHERE elementId(c) IN $chunk_ids
          AND c.index IS NOT NULL
        RETURN elementId(c) as chunk_id, c.text as text
        ORDER BY c.index
    """, chunk_ids=chunk_ids)

    # Generate embeddings with progress display
    embeddings = []
    for i, record in enumerate(records):
        embedding = embedder.embed_query(record["text"])
        embeddings.append(embedding)
        if i < 3 or i == len(records) - 1:
            print(f"Chunk {i}: Generated {len(embedding)}-dimensional embedding")
        elif i == 3:
            print(f"Processing {len(records) - 4} more chunks...")

    # Batch-store all embeddings using neo4j-graphrag's upsert_vectors
    ordered_ids = [record["chunk_id"] for record in records]
    upsert_vectors(
        driver=driver,
        ids=ordered_ids,
        embedding_property="embedding",
        embeddings=embeddings,
    )
    print(f"\nStored {len(embeddings)} embeddings via upsert_vectors")
    return embeddings

embeddings = generate_embeddings(embedder, chunk_ids, driver)
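
Because the vector index created in the next step is configured for a fixed number of dimensions, it can be worth checking vector lengths before storing them. The helper below is a small optional sketch; `check_embedding_dimensions` is a hypothetical name, and the sample vectors are placeholders rather than real embeddings.

```python
def check_embedding_dimensions(embeddings: list[list[float]], expected_dim: int) -> list[int]:
    """Return the positions of embeddings whose length differs from expected_dim."""
    return [i for i, vector in enumerate(embeddings) if len(vector) != expected_dim]


# Placeholder vectors: two with 1024 dimensions and one with 768
sample_embeddings = [[0.0] * 1024, [0.0] * 1024, [0.0] * 768]
print(check_embedding_dimensions(sample_embeddings, 1024))  # [2]
```

In this notebook the check could be called as `check_embedding_dimensions(embeddings, EMBEDDING_DIMENSIONS)`, where an empty list means every vector has the expected length.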

## Create Vector Index

Create a vector index in Neo4j for efficient similarity search. The index uses cosine similarity to compare embeddings.

> **Note:** `create_vector_index` defaults to `fail_if_exists=False`, so it's safe to re-run without manually dropping the index first.

In [None]:
INDEX_NAME = "maintenanceChunkEmbeddings"

create_vector_index(
    driver=driver,
    name=INDEX_NAME,
    label="Chunk",
    embedding_property="embedding",
    dimensions=EMBEDDING_DIMENSIONS,
    similarity_fn="cosine"
)
print(f"Created vector index: {INDEX_NAME} ({EMBEDDING_DIMENSIONS} dimensions, cosine similarity)")

## Create Fulltext Index

Create a fulltext index on Chunk text for keyword-based search. This enables `HybridRetriever` in later notebooks, which combines vector similarity with traditional keyword matching.

In [None]:
FULLTEXT_INDEX_NAME = "maintenanceChunkText"

create_fulltext_index(
    driver=driver,
    name=FULLTEXT_INDEX_NAME,
    label="Chunk",
    node_properties=["text"]
)
print(f"Created fulltext index: {FULLTEXT_INDEX_NAME}")
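
To try the fulltext index directly, a keyword query can go through Neo4j's `db.index.fulltext.queryNodes` procedure. The sketch below assumes the index name defined above and the same `driver` object; `fulltext_search` is a hypothetical helper. The search string is interpreted with Lucene query syntax, so special characters in user input may need escaping.

```python
FULLTEXT_QUERY = """
CALL db.index.fulltext.queryNodes($index_name, $query_text)
YIELD node, score
RETURN node.text AS text, node.index AS idx, score
LIMIT $top_k
"""


def fulltext_search(driver, index_name: str, query_text: str, top_k: int = 3):
    """Run a keyword search against a fulltext index and return the matching records."""
    records, _, _ = driver.execute_query(
        FULLTEXT_QUERY, index_name=index_name, query_text=query_text, top_k=top_k
    )
    return records
```

A call might look like `fulltext_search(driver, FULLTEXT_INDEX_NAME, "hydraulic pressure")`. Unlike vector search, this only matches chunks that contain the query terms.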

## Vector Similarity Search

Now we can search for chunks that are semantically similar to a query. The search:
1. Converts the query to an embedding
2. Finds chunks with similar embedding vectors
3. Returns results ranked by similarity score

In [None]:
def vector_search(driver, embedder, query: str, top_k: int = 3):
    """Search for chunks similar to the query."""
    # Generate query embedding
    query_embedding = embedder.embed_query(query)
    
    records, _, _ = driver.execute_query("""
        CALL db.index.vector.queryNodes($index_name, $top_k, $embedding)
        YIELD node, score
        RETURN node.text as text, node.index as idx, score
    """, index_name=INDEX_NAME, top_k=top_k, embedding=query_embedding)
    
    return records

# Test search with a maintenance query
query = "How do I troubleshoot engine vibration?"
print(f"Query: \"{query}\"\n")
print("=" * 70)

results = vector_search(driver, embedder, query)
for i, record in enumerate(results):
    print(f"\n[{i+1}] Score: {record['score']:.4f} (Chunk {record['idx']})")
    print(f"    {record['text'][:250]}...")

## Compare Different Queries

Try different maintenance queries to see how semantic search finds relevant procedures even with different wording.

In [None]:
queries = [
    "What are the EGT limits during takeoff?",
    "How to detect bearing wear in the engine?",
    "What causes hydraulic pressure loss?",
    "When should I replace the fuel filter?"
]

for query in queries:
    print(f"\nQuery: \"{query}\"")
    print("-" * 60)
    results = vector_search(driver, embedder, query, top_k=1)
    if results:
        record = results[0]
        print(f"Best match (score: {record['score']:.4f}):")
        print(f"  {record['text'][:200]}...")

## Summary

In this notebook, you learned the complete data preparation pipeline for adding semantic search to your aircraft knowledge graph:

**Part 1 - Graph Structure:**
1. **Document-Chunk structure** - The maintenance manual is split into searchable chunks
2. **FROM_DOCUMENT relationship** - Links chunks back to their source document
3. **NEXT_CHUNK relationship** - Preserves the sequential order of chunks

**Part 2 - Embeddings:**
4. **Databricks BGE embeddings** - 1024-dimensional vectors that capture semantic meaning
5. **Batch vector storage** - Using `upsert_vectors` from neo4j-graphrag for efficient storage
6. **Vector index** - Enabling efficient similarity search with cosine similarity
7. **Fulltext index** - Enabling keyword-based search for hybrid retrieval
8. **Semantic search** - Finding maintenance procedures by meaning, not just keywords

Your knowledge graph now combines:
- **Structured data**: Aircraft -> System -> Component hierarchy from Lab 5
- **Unstructured data**: Maintenance manual chunks with embeddings and fulltext indexing

---

**Next:** [GraphRAG Retrievers](04_graphrag_retrievers.ipynb) - Learn to use retrievers that combine vector search with graph traversal for context-aware answers about aircraft maintenance.

**Optional:** [Hybrid Retrievers](05_hybrid_retrievers.ipynb) - Explore hybrid search that combines vector similarity with keyword matching.

In [None]:
# Cleanup
neo4j.close()