In [1]:
# https://neo4j.com/docs/neo4j-graphrag-python/current/user_guide_rag.html#using-another-llm-model
# https://neo4j.com/blog/news/graphrag-python-package/
# https://neo4j.com/blog/developer/enhancing-hybrid-retrieval-graphrag-python-package/

# Retrieval Modes and GraphRAG in Neo4j

This section clarifies the different retrieval modes used in GraphRAG systems, how they are combined in practice, and how Neo4j supports them internally through different indexing mechanisms and retrievers.

---

## 1. Three Fundamental Modes of Retrieval

In modern knowledge-assisted LLM systems, there are **three complementary modes of retrieval**:

### 1.1 Vector (Semantic) Search  
This is the retrieval mode used in **classical RAG**.

- Text is embedded into vectors using an embedding model
- Queries are embedded in the same vector space
- Retrieval is based on **semantic similarity**
- Captures meaning, paraphrases, and conceptual similarity

**Strengths:**  
- Works even when exact keywords do not match  
- Robust to paraphrasing  

**Limitations:**  
- Can miss exact terms, names, numbers, or rare tokens  
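
As a rough illustration of the ranking step, the sketch below scores a query vector against a few chunk vectors with cosine similarity and keeps the best matches. The three-dimensional vectors and chunk names are invented for the example; real embeddings come from an embedding model and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    """Return the k (chunk_id, score) pairs most similar to the query."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical toy embeddings, invented for illustration only.
toy_chunks = {
    "mars_moons": [0.9, 0.1, 0.0],
    "earth_moon": [0.7, 0.3, 0.1],
    "venus_atmosphere": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.0, 0.0]

ranking = top_k(toy_query, toy_chunks, k=2)
print([chunk_id for chunk_id, _ in ranking])
```

A vector index performs the same kind of nearest-neighbor ranking, but with an approximate search structure instead of scoring every chunk.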

---

### 1.2 Graph (Query-Based) Search  
This is **symbolic retrieval** over a structured knowledge graph.

- Nodes and relationships represent entities and facts
- Retrieval is done using **Cypher queries**
- Supports joins, constraints, and multi-hop reasoning

**Strengths:**  
- Precise, explainable, deterministic  
- Ideal for structured knowledge and reasoning  

**Limitations:**  
- Requires schema awareness  
- Not suitable for fuzzy or semantic matching  
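
To make the symbolic style concrete, here is a minimal in-memory sketch of the kind of question a Cypher query such as `MATCH (m:Moon)-[:ORBITS]->(p:Planet) WITH p, count(DISTINCT m) AS c WHERE c = 2 RETURN p.id` answers. The dictionary and list below are a toy stand-in for a graph database, using facts from the solar-system text in this notebook; the helper name is hypothetical.

```python
from collections import defaultdict

# Node labels and ORBITS facts from the solar-system example in this notebook.
labels = {
    "Earth": "Planet", "Mars": "Planet", "Jupiter": "Planet",
    "Moon": "Moon", "Phobos": "Moon", "Deimos": "Moon",
    "Io": "Moon", "Europa": "Moon", "Ganymede": "Moon", "Callisto": "Moon",
}
orbits = [  # (source, target) pairs for the ORBITS relationship
    ("Moon", "Earth"), ("Phobos", "Mars"), ("Deimos", "Mars"),
    ("Io", "Jupiter"), ("Europa", "Jupiter"),
    ("Ganymede", "Jupiter"), ("Callisto", "Jupiter"),
]

def planets_with_moon_count(orbits, labels, wanted):
    """Planets orbited by exactly `wanted` distinct moons."""
    moons_by_planet = defaultdict(set)
    for source, target in orbits:
        # Respect both labels and direction: Moon -> Planet only.
        if labels.get(source) == "Moon" and labels.get(target) == "Planet":
            moons_by_planet[target].add(source)
    return sorted(p for p, moons in moons_by_planet.items() if len(moons) == wanted)

print(planets_with_moon_count(orbits, labels, 2))  # ['Mars']
```

The answer is exact and reproducible, but only because the labels and the relationship direction are known in advance, which is the schema-awareness requirement listed above.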

---

### 1.3 Full-Text (Exact / Lexical) Search  
This is **keyword-based retrieval**, similar to classical information retrieval.

- Operates on raw text
- Uses tokenization and term statistics (e.g. BM25-style scoring)
- Matches exact words or phrases

**Strengths:**  
- Excellent for names, technical terms, identifiers  
- Deterministic and interpretable  

**Limitations:**  
- No semantic understanding  
- Sensitive to wording  
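
The sketch below implements a small BM25-style scorer to show both the strength and the limitation. It is a simplified illustration with a naive tokenizer and commonly used default parameters (`k1=1.2`, `b=0.75`), not the scoring code of any particular search engine.

```python
import math
from collections import Counter

def tokenize(text):
    """Naive tokenizer: lowercase, strip periods and commas, split on whitespace."""
    return text.lower().replace(".", " ").replace(",", " ").split()

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with a BM25-style formula."""
    tokenized = [tokenize(doc) for doc in docs]
    avg_len = sum(len(tokens) for tokens in tokenized) / len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue  # no exact match, no contribution
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores

docs = [
    "Mars has two moons called Phobos and Deimos.",
    "Earth has one moon called the Moon.",
    "Venus has a thick atmosphere.",
]
print(bm25_scores("Phobos", docs))      # only the first document scores above zero
print(bm25_scores("satellites", docs))  # a synonym with no exact match scores zero everywhere
```

The exact name `Phobos` is found reliably, while `satellites` retrieves nothing even though two documents are about moons.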

---

## 2. How GraphRAG Combines These Modes

**GraphRAG is not a single retrieval strategy**, but a family of approaches that combine the above modes.

In practice, the most effective GraphRAG pipelines follow this pattern:

> **Retrieve text first, then reason with the graph.**

Typical combinations include:

- **Vector → Graph**  
  Classical RAG retrieval followed by Cypher queries on the graph nodes linked to retrieved text.

- **Vector → Full-Text → Graph**  
  Semantic recall first, lexical filtering second, graph traversal last.

- **Full-Text → Vector → Graph**  
  Exact keyword filtering first, semantic ranking second, graph reasoning last.

The key idea is that **vector and full-text search identify relevant evidence**, while **graph queries provide structure, aggregation, and reasoning**.

---

## 3. How Neo4j Supports These Retrieval Modes

Neo4j supports these retrieval modes **side by side**, within the same database, using different **indexing mechanisms**.

Conceptually, you can think of Neo4j as hosting:

### 3.1 The Knowledge Graph (Symbolic Layer)
- Nodes: entities (e.g. `Planet`, `Person`, `Company`)
- Relationships: facts and relations
- Accessed via **Cypher queries**

This is the graph used in query-based GraphRAG.

---

### 3.2 The Vector Index (Semantic Layer)
- Built over nodes that store embeddings (typically `Chunk` nodes)
- Enables **k-nearest-neighbor vector search**
- Used for semantic retrieval

---

### 3.3 The Full-Text Index (Lexical Layer)
- Built over text properties of nodes (again, usually `Chunk`)
- Enables keyword-based search
- Complements vector retrieval

> These are **not separate databases**, but **different indexes co-existing on the same Neo4j graph**, each optimized for a different retrieval mode.

---

## 4. Neo4j Retrievers and the Retrieval Modes They Combine

The `neo4j-graphrag` package provides several retrievers; the four covered here progressively combine these capabilities.

---

### 4.1 `VectorRetriever`

**Retrieval modes used:**
- Vector search

**Requirements:**
- Vector index

**Description:**  
Performs pure semantic search over embeddings and returns the matched nodes with similarity scores.  
This is the closest analogue to classical RAG retrieval.

---

### 4.2 `VectorCypherRetriever`

**Retrieval modes used:**
- Vector search  
- Graph (Cypher) traversal

**Requirements:**
- Vector index  
- Knowledge graph

**Description:**  
Performs vector search first, then executes a Cypher retrieval query starting from the matched nodes.  
This is often the **first true GraphRAG retriever** users encounter.
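
As a sketch of what such a retrieval query might look like: according to the `neo4j-graphrag` user guide, the retrieval query is appended to the vector search, so it can refer to each matched node as `node` and to its similarity as `score`. The `FROM_CHUNK` relationship type and the `id` property below are assumptions about how entities are attached to chunks in a given graph, so adjust them to the actual data model.

```python
# Assumptions (not confirmed by this section): entity nodes point to their
# source chunk through a FROM_CHUNK relationship and expose an `id` property.
retrieval_query = """
OPTIONAL MATCH (entity)-[:FROM_CHUNK]->(node)
RETURN node.text AS chunk_text,
       score AS similarity,
       collect(DISTINCT entity.id) AS entities
"""
```

A string like this would be passed as the `retrieval_query` argument of `VectorCypherRetriever`. Because the traversal starts from `node`, the graph context it returns stays tied to the chunks that the vector search actually matched.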

---

### 4.3 `HybridRetriever`

**Retrieval modes used:**
- Vector search  
- Full-text search

**Requirements:**
- Vector index  
- Full-text index

**Description:**  
Combines semantic and lexical retrieval to improve recall and precision, but does not yet exploit graph structure.
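
One plausible way to merge the two result lists is sketched below: each list is normalized by its own maximum score, and a chunk keeps the higher of its two normalized scores. This is an illustrative fusion scheme under that stated assumption; the library's own ranking logic may differ, and the chunk names and scores are invented.

```python
def max_normalize(scores):
    """Divide every score by the largest one so the best hit scores 1.0."""
    if not scores:
        return {}
    top = max(scores.values())
    return {key: value / top for key, value in scores.items()}

def fuse(vector_scores, fulltext_scores, top_k=3):
    """Merge two rankings, keeping the higher normalized score per chunk."""
    merged = {}
    for scores in (max_normalize(vector_scores), max_normalize(fulltext_scores)):
        for chunk_id, score in scores.items():
            merged[chunk_id] = max(score, merged.get(chunk_id, 0.0))
    return sorted(merged.items(), key=lambda pair: pair[1], reverse=True)[:top_k]

# Hypothetical scores: cosine similarities and unbounded lexical scores.
vector_scores = {"chunk_11": 0.80, "chunk_8": 0.60, "chunk_14": 0.40}
fulltext_scores = {"chunk_11": 3.0, "chunk_16": 2.4}

fused = fuse(vector_scores, fulltext_scores)
print([chunk_id for chunk_id, _ in fused])
```

Normalizing first matters because vector similarities and lexical scores live on different scales; without it, one mode would dominate the merged ranking.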

---

### 4.4 `HybridCypherRetriever`

**Retrieval modes used:**
- Vector search  
- Full-text search  
- Graph (Cypher) traversal

**Requirements:**
- Vector index  
- Full-text index  
- Knowledge graph

**Description:**  
This retriever represents the **full GraphRAG retrieval pipeline**:
semantic + lexical retrieval followed by graph-based contextualization and reasoning.

---

## 5. Summary

- There are **three fundamental retrieval modes**: vector, graph, and full-text
- GraphRAG systems combine these modes in different ways
- Neo4j supports all three via **co-existing indexes on the same graph**
- The four retrievers covered here differ in **which modes they combine**
- `HybridCypherRetriever` is the most complete of the four, combining all three modes in a single retriever

Understanding these layers clarifies why GraphRAG is more than “RAG + a graph” — it is a structured retrieval and reasoning pipeline.


In [2]:
from neo4j import GraphDatabase
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline
from neo4j_graphrag.llm import OllamaLLM
from neo4j_graphrag.embeddings.ollama import OllamaEmbeddings
from neo4j_graphrag.retrievers import VectorRetriever, VectorCypherRetriever, HybridRetriever, HybridCypherRetriever
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import ChatOllama

### LLMs to use

We want to augment a `chat` LLM with RAG, full-text, and graph information. To do that, we use a second model, the `embedder`, which generates the embeddings needed for vector search. We then use a third model, `cypher`, which is good at formatting text as JSON and at translating natural-language queries into Cypher code.

In [3]:
# Initialize the chat model
chat_name = 'qwen2.5:3b'  # Or another text-focused model
chat_model = OllamaLLM(
    model_name=chat_name,
    model_params={
        # "response_format": {"type": "json_object"},
        "options": {"temperature": 0},
        # 'format': 'json'
    }
)

# Initialize the embedder model
embedder_name = 'qwen3-embedding:0.6b'
embedder = OllamaEmbeddings(
    model=embedder_name
)

# Initialize the cypher model
cypher_name = 'tomasonjo/llama3-text2cypher-demo:8b_4bit'
cypher_model = ChatOllama(
    model=cypher_name,
    validate_model_on_init=True,
    temperature=0
)
# cypher_model = OllamaLLM(
#     model_name=cypher_name,
#     model_params={
#         # "response_format": {"type": "json_object"},
#         "options": {"temperature": 0},
#         # 'format': 'json'
#     }
# )

### Neo4j driver

We need to initialize a Neo4j driver and use it to connect to the Neo4j graph database. Through the driver we can construct the vector database (which will appear in graph form).

In [4]:
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Get credentials from environment variables
neo4j_url = os.getenv("NEO4J_URL", "bolt://localhost:7687")
neo4j_user = os.getenv("NEO4J_USER", "neo4j")
neo4j_password = os.getenv("NEO4J_PASSWORD")

if not neo4j_password:
    raise ValueError("NEO4J_PASSWORD environment variable is not set. Please create a .env file with your credentials.")

driver = GraphDatabase.driver(
    neo4j_url,
    auth=(neo4j_user, neo4j_password)
)


In [5]:
text = '''
The solar system consists of the Sun and the objects that orbit it, including planets, moons, asteroids, comets, and meteoroids.
The Sun is a star at the center of the Solar System.
Mercury is a planet in the Solar System. Mercury orbits the Sun. Mercury has no atmosphere and no magnetic field.
Venus is a planet in the Solar System. Venus orbits the Sun. Venus has a thick atmosphere. The atmosphere of Venus is composed mainly of carbon dioxide. Venus has no magnetic field.
Earth is a planet in the Solar System. Earth orbits the Sun. Earth has one moon called the Moon. Earth has a thick atmosphere composed mainly of nitrogen and oxygen. Earth has a strong magnetic field.
Mars is a planet in the Solar System. Mars orbits the Sun. Mars has two moons called Phobos and Deimos. Mars has a thin atmosphere composed mainly of carbon dioxide. Mars has a weak magnetic field.
Jupiter is a planet in the Solar System. Jupiter orbits the Sun. Jupiter has moons called Io, Europa, Ganymede, and Callisto. Jupiter has a thick atmosphere composed mainly of hydrogen and helium. Jupiter has a strong magnetic field.
'''
print(text)


The solar system consists of the Sun and the objects that orbit it, including planets, moons, asteroids, comets, and meteoroids.
The Sun is a star at the center of the Solar System.
Mercury is a planet in the Solar System. Mercury orbits the Sun. Mercury has no atmosphere and no magnetic field.
Venus is a planet in the Solar System. Venus orbits the Sun. Venus has a thick atmosphere. The atmosphere of Venus is composed mainly of carbon dioxide. Venus has no magnetic field.
Earth is a planet in the Solar System. Earth orbits the Sun. Earth has one moon called the Moon. Earth has a thick atmosphere composed mainly of nitrogen and oxygen. Earth has a strong magnetic field.
Mars is a planet in the Solar System. Mars orbits the Sun. Mars has two moons called Phobos and Deimos. Mars has a thin atmosphere composed mainly of carbon dioxide. Mars has a weak magnetic field.
Jupiter is a planet in the Solar System. Jupiter orbits the Sun. Jupiter has moons called Io, Europa, Ganymede, and Callis

## Build the KG

Build the KG and store it in a Neo4j database.

The schema can also be built automatically; see the [Simple KG Pipeline documentation](https://neo4j.com/docs/neo4j-graphrag-python/current/user_guide_kg_builder.html#simple-kg-pipeline) for details. For the purposes of this tutorial, we define a manual schema.

In [6]:
from neo4j_graphrag.experimental.components.schema import (
    SchemaBuilder,
    NodeType,
    PropertyType,
    RelationshipType,
)

schema_builder = SchemaBuilder()

node_types = [
    NodeType(label="Star", properties=[PropertyType(name="id", type="STRING")]),
    NodeType(label="Planet", properties=[PropertyType(name="id", type="STRING")]),
    NodeType(label="Moon", properties=[PropertyType(name="id", type="STRING")]),
    NodeType(label="Atmosphere", properties=[PropertyType(name="id", type="STRING")]),
    NodeType(label="Substance", properties=[PropertyType(name="id", type="STRING")]),
    NodeType(label="MagneticFieldStrength", properties=[PropertyType(name="id", type="STRING")])
]

relationship_types = [
    RelationshipType(label="ORBITS"),
    RelationshipType(label="HAS_ATMOSPHERE"),
    RelationshipType(label="COMPOSED_OF"),
    RelationshipType(label="HAS_MAGNETIC_FIELD")
]

patterns = [
    ("Planet", "ORBITS", "Star"),
    ("Moon", "ORBITS", "Planet"),
    ("Planet", "HAS_ATMOSPHERE", "Atmosphere"),
    ("Atmosphere", "COMPOSED_OF", "Substance"),
    ("Planet", "HAS_MAGNETIC_FIELD", "MagneticFieldStrength")
]

graph_schema = await schema_builder.run(
    node_types=node_types,
    relationship_types=relationship_types,
    patterns=patterns
)

In [7]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from neo4j_graphrag.experimental.components.text_splitters.langchain import LangChainTextSplitterAdapter

In [9]:
kg_pipeline = SimpleKGPipeline(
    driver=driver,
    llm=chat_model,
    # schema=graph_schema,  # manual schema from the cell above (disabled in this run)
    schema=None,  # 'EXTRACTED'
    embedder=embedder,
    from_pdf=False,
    text_splitter=LangChainTextSplitterAdapter(
        RecursiveCharacterTextSplitter(
            chunk_size=100,
            chunk_overlap=50,
            separators=["\n\n", "\n", " ", ""]
        )
    )
)


In [11]:
await kg_pipeline.run_async(
    text=text
)

ERROR:neo4j_graphrag.experimental.components.entity_relation_extractor:LLM response has improper format for chunk_index=4


PipelineResult(run_id='62001e8d-8380-41fb-8ab3-1ba0f8a1f975', result={'resolver': {'number_of_nodes_to_resolve': 12, 'number_of_created_nodes': 8}})

<img src="figs/graph_embeddings.png" width=400px height=400px />

In [12]:
from neo4j_graphrag.indexes import create_vector_index, create_fulltext_index

# Get the embedding dimension by testing with a sample query
sample_embedding = embedder.embed_query("test")
dimensions = len(sample_embedding)
print(f"Embedding dimensions: {dimensions}")

# Create vector index
create_vector_index(
    driver,
    name="text_embeddings",
    label="Chunk",
    embedding_property="embedding",
    dimensions=dimensions,
    similarity_fn="cosine")

# Create full-text index for hybrid search
create_fulltext_index(
    driver,
    name="chunk_fulltext",
    label="Chunk",
    node_properties=["text"])

Embedding dimensions: 1024


In [40]:
# https://neo4j.com/docs/neo4j-graphrag-python/current/api.html#retrievers

In [41]:
query_text = "Which planet has exactly two moons?"

In [42]:
chat_result = chat_model.invoke(f'Based on this text: {text} Answer this question: {query_text}')
print(chat_result)

content='Jupiter is the planet that has exactly two moons. They are named Io, Europa, Ganymede, and Callisto, but the text specifically mentions that Jupiter has "two moons called Phobos and Deimos." However, it\'s important to note that Phobos and Deimos were not originally part of Jupiter\'s system; they are actually captured asteroids or possibly remnants from a collision.'


In [43]:
vector_retriever = VectorRetriever(
    driver,
    index_name="text_embeddings",
    embedder=embedder
)

In [44]:
import json

vector_result = vector_retriever.get_search_results(query_text=query_text, top_k=3)
for record in vector_result.records:
    print(json.dumps(record.data(), indent=4))

{
    "node": {
        "embedding": null,
        "index": 11,
        "text": "Mars is a planet in the Solar System. Mars orbits the Sun. Mars has two moons called Phobos and"
    },
    "nodeLabels": [
        "__KGBuilder__",
        "Chunk"
    ],
    "elementId": "4:ee307052-61bb-4a26-b230-a256204ad709:59",
    "id": "4:ee307052-61bb-4a26-b230-a256204ad709:59",
    "score": 0.8072717189788818
}
{
    "node": {
        "embedding": null,
        "index": 8,
        "text": "Earth is a planet in the Solar System. Earth orbits the Sun. Earth has one moon called the Moon."
    },
    "nodeLabels": [
        "__KGBuilder__",
        "Chunk"
    ],
    "elementId": "4:ee307052-61bb-4a26-b230-a256204ad709:56",
    "id": "4:ee307052-61bb-4a26-b230-a256204ad709:56",
    "score": 0.804288387298584
}
{
    "node": {
        "embedding": null,
        "index": 14,
        "text": "Jupiter is a planet in the Solar System. Jupiter orbits the Sun. Jupiter has moons called Io,"
    },
    "nodeL

In [45]:
# create text response based on retrieved chunks
template = ChatPromptTemplate.from_template(
    """You are an expert on the Solar System. Based on the following context, answer the question.
    Context: {context}
    Question: {question}
    Answer:"""
)

formatted_prompt = template.format(
    context="\n".join([record.data()["node"]["text"] for record in vector_result.records]),
    question=query_text
)

vector_response = chat_model.invoke(formatted_prompt)

In [46]:
print(vector_response)

content='Based on the provided context, Mars is the planet that has exactly two moons. The context states that "Mars has two moons called Phobos and Deimos." Earth, on the other hand, has one moon called the Moon, and Jupiter has multiple moons (Io and possibly others). Therefore, the answer to the question "Which planet has exactly two moons?" is Mars.'


In [47]:
hybrid_retriever = HybridRetriever(
    driver=driver,
    vector_index_name="text_embeddings",
    fulltext_index_name="chunk_fulltext",
    embedder=embedder,
)

hybrid_retriever_result = hybrid_retriever.get_search_results(query_text=query_text, top_k=3)
print(hybrid_retriever_result)

records=[<Record node={'embedding': None, 'index': 11, 'text': 'Mars is a planet in the Solar System. Mars orbits the Sun. Mars has two moons called Phobos and'} nodeLabels=['__KGBuilder__', 'Chunk'] elementId='4:ee307052-61bb-4a26-b230-a256204ad709:59' id='4:ee307052-61bb-4a26-b230-a256204ad709:59' score=1.0>, <Record node={'embedding': None, 'index': 8, 'text': 'Earth is a planet in the Solar System. Earth orbits the Sun. Earth has one moon called the Moon.'} nodeLabels=['__KGBuilder__', 'Chunk'] elementId='4:ee307052-61bb-4a26-b230-a256204ad709:56' id='4:ee307052-61bb-4a26-b230-a256204ad709:56' score=0.996304426861291>, <Record node={'embedding': None, 'index': 14, 'text': 'Jupiter is a planet in the Solar System. Jupiter orbits the Sun. Jupiter has moons called Io,'} nodeLabels=['__KGBuilder__', 'Chunk'] elementId='4:ee307052-61bb-4a26-b230-a256204ad709:62' id='4:ee307052-61bb-4a26-b230-a256204ad709:62' score=0.9541288793107268>] metadata={'query_vector': [-0.030172499, -0.00804708

In [48]:
formatted_prompt = template.format(
    context="\n".join([record.data()["node"]["text"] for record in hybrid_retriever_result.records]),
    question=query_text
)

In [49]:
hybrid_response = chat_model.invoke(formatted_prompt)

In [50]:
print(hybrid_response)

content='Based on the provided context, Mars is the planet that has exactly two moons. The context states that "Mars has two moons called Phobos and Deimos." Earth, in contrast, has one moon (the Moon), and Jupiter has multiple moons but no information about their exact number was given in the provided context.'


### Construct a symbolic graph

To augment the search with Cypher queries, we need to add symbolic nodes and edges to the existing graph.

In [13]:
import os
from dotenv import load_dotenv
from langchain_neo4j import Neo4jGraph
from langchain_core.documents import Document
from langchain_experimental.graph_transformers import LLMGraphTransformer

# Load environment variables from .env file
load_dotenv()

# Get credentials from environment variables
neo4j_url = os.getenv("NEO4J_URL", "bolt://localhost:7687")
neo4j_user = os.getenv("NEO4J_USER", "neo4j")
neo4j_password = os.getenv("NEO4J_PASSWORD")

if not neo4j_password:
    raise ValueError("NEO4J_PASSWORD environment variable is not set. Please create a .env file with your credentials.")

graph = Neo4jGraph(
    url=neo4j_url,
    username=neo4j_user,
    password=neo4j_password
)

In [14]:
# Create a ChatPromptTemplate for graph extraction
graph_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are an expert Neo4j Cypher query generator.

TASK:
- Translate the user's natural language question into a Cypher query.

CONSTRAINTS:
- Use ONLY the schema provided below.
- Do NOT invent labels, relationship types, or properties.
- Do NOT explain the query.
- Output ONLY valid Cypher.
- If the question cannot be answered unambiguously using the schema, output:
  // CANNOT_ANSWER

GRAPH SCHEMA:
Node labels:
- Star {{id}}
- Planet {{id}}
- Moon {{id}}
- Atmosphere {{id}}
- Substance {{id}}
- MagneticFieldStrength {{id}}

Relationships:
- (Planet)-[:ORBITS]->(Star)
- (Moon)-[:ORBITS]->(Planet)
- (Planet)-[:HAS_ATMOSPHERE]->(Atmosphere)
- (Atmosphere)-[:COMPOSED_OF]->(Substance)
- (Planet)-[:HAS_MAGNETIC_FIELD]->(MagneticFieldStrength)
     
ALLOWED VALUES:
- MagneticFieldStrength.id \\in {{"none", "weak", "strong"}}

QUERY RULES:
1. Always specify node labels.
2. Always specify relationship directions.
3. MagneticFieldStrength nodes MUST be matched or merged by id.
4. Use meaningful variable names.
5. Return only properties, not full nodes.
6. Use DISTINCT unless duplicates are required.
7. Use OPTIONAL MATCH if information may be missing.
8. Do not use APOC or procedures.

FAILURE CONDITIONS:
- If required entities, labels, or relationships are missing from the schema,
  output:
  // CANNOT_ANSWER

EXAMPLES:
Question:
Which planet orbits the Sun?

Cypher:
MATCH (planet:Planet)-[:ORBITS]->(star:Star {{id: "Sun"}})
RETURN DISTINCT planet.id

Question:
Which moon orbits planet Mars?

Cypher:
MATCH (moon:Moon)-[:ORBITS]->(planet:Planet {{id: "Mars"}})
RETURN DISTINCT moon.id

Question:
What substances compose the atmosphere of Mars?

Cypher:
MATCH (planet:Planet {{id: "Mars"}})
      -[:HAS_ATMOSPHERE]->(atm:Atmosphere)
      -[:COMPOSED_OF]->(substance:Substance)
RETURN DISTINCT substance.id

Question:
Does Jupiter have a magnetic field?

Cypher:
MATCH (planet:Planet {{id: "Jupiter"}})
      -[:HAS_MAGNETIC_FIELD]->(prop:MagneticFieldStrength)
RETURN DISTINCT prop.id
"""),
    ("human", "{input}")
])


In [15]:
prompt_schema = LLMGraphTransformer(
    llm=cypher_model,
    prompt=graph_prompt,
)

In [17]:
graph_prompt_schema = prompt_schema.convert_to_graph_documents([Document(page_content=text)])
print(graph_prompt_schema)

[GraphDocument(nodes=[Node(id='Sun', type='Star', properties={}), Node(id='Mercury', type='Planet', properties={}), Node(id='Venus', type='Planet', properties={}), Node(id='Earth', type='Planet', properties={}), Node(id='Moon', type='Moon', properties={}), Node(id='Mars', type='Planet', properties={}), Node(id='Phobos', type='Moon', properties={}), Node(id='Deimos', type='Moon', properties={}), Node(id='Jupiter', type='Planet', properties={}), Node(id='Io', type='Moon', properties={}), Node(id='Europa', type='Moon', properties={}), Node(id='Ganymede', type='Moon', properties={}), Node(id='Callisto', type='Moon', properties={})], relationships=[Relationship(source=Node(id='Mercury', type='Planet', properties={}), target=Node(id='Sun', type='Star', properties={}), type='ORBITS', properties={}), Relationship(source=Node(id='Venus', type='Planet', properties={}), target=Node(id='Sun', type='Star', properties={}), type='ORBITS', properties={}), Relationship(source=Node(id='Earth', type='Pla

In [18]:
graph.add_graph_documents(graph_prompt_schema)

<img src="figs/graph_symbolic_embeddings_disconnected.png" width=400px height=400px />

In [47]:
def fetch_chunks(driver):
    with driver.session() as session:
        result = session.run("""
        MATCH (c:Chunk)
        RETURN c.index AS index, c.text AS text
        """)
        return [
            Document(
                page_content=r["text"],
                metadata={"chunk_index": r["index"]}
            )
            for r in result
        ]


In [48]:
documents = fetch_chunks(driver)

In [49]:
print(len(documents))
print(documents)

19
[Document(metadata={'chunk_index': 15}, page_content='Jupiter is a planet in the Solar System. Jupiter orbits the Sun. Jupiter has moons called Io,'), Document(metadata={'chunk_index': 16}, page_content='orbits the Sun. Jupiter has moons called Io, Europa, Ganymede, and Callisto. Jupiter has a thick'), Document(metadata={'chunk_index': 17}, page_content='Ganymede, and Callisto. Jupiter has a thick atmosphere composed mainly of hydrogen and helium.'), Document(metadata={'chunk_index': 18}, page_content='composed mainly of hydrogen and helium. Jupiter has a strong magnetic field.'), Document(metadata={'chunk_index': 1}, page_content='objects that orbit it, including planets, moons, asteroids, comets, and meteoroids.'), Document(metadata={'chunk_index': 2}, page_content='The Sun is a star at the center of the Solar System.'), Document(metadata={'chunk_index': 3}, page_content='Mercury is a planet in the Solar System. Mercury orbits the Sun. Mercury has no atmosphere and no'), Document(

In [21]:
retrieval_query = """
MATCH (m:Moon)-[:ORBITS]->(p:Planet)
WITH p, COUNT(DISTINCT m) AS moon_count
WHERE moon_count = 2
RETURN p.id AS planet_name
"""

In [22]:
vector_cypher_retriever = VectorCypherRetriever(
    driver=driver,
    index_name="text_embeddings",
    retrieval_query=retrieval_query,
    embedder=embedder
)
vector_cypher_retriever_result = vector_cypher_retriever.get_search_results(query_text=query_text, top_k=3)
print(vector_cypher_retriever_result)

records=[] metadata={'query_vector': [-0.030172499, -0.008047088, -0.0077332663, -0.029828874, 0.054222304, 0.014122126, 0.031544387, -0.024864495, -0.0020415827, -0.056867238, -0.0075086444, 0.012181566, 0.03519656, -0.0073047196, -0.03788306, 0.07500315, 0.052464377, 0.077367134, 0.10219009, 0.0017662144, -0.0032555142, 0.02191525, -0.021454485, 0.06734899, 0.052508496, 0.055885617, -0.024706075, 0.026439797, -0.023486033, 0.03372772, 0.041791223, -0.008844478, 0.024663756, -0.011707786, 0.0012278217, -0.011680889, -0.010489083, -0.022559857, -0.024791345, 0.016040716, -0.010812873, 0.061017733, -0.01781538, 0.022677828, -0.013873393, -0.023641285, 0.07466772, 0.033045527, 0.026297787, 0.023529489, -0.009656674, -0.027476711, 0.006666778, -0.011979841, 0.02293994, -0.02538853, 0.0031599654, -0.012059918, 0.04604583, -0.010695301, -0.06907099, 0.06438417, 0.022348097, -0.0063296854, -0.0068382574, 0.06721644, 0.025090287, 0.013335668, -0.01431362, 0.0016918402, -0.02763321, -0.0507890

In [23]:
cypher_prompt = ChatPromptTemplate.from_messages([
    ("system", """
You are an expert Neo4j Cypher query generator.

TASK:
- Translate the user's natural language question into a Cypher query.

CONSTRAINTS:
- Use ONLY the schema provided below.
- Do NOT invent labels, relationship types, or properties.
- Do NOT explain the query.
- Output ONLY valid Cypher.
- If the question cannot be answered unambiguously using the schema, output:
  // CANNOT_ANSWER

GRAPH SCHEMA:
Node labels:
- Star {{id}}
- Planet {{id}}
- Moon {{id}}
- Atmosphere {{id}}
- Substance {{id}}
- MagneticFieldStrength {{id}}

Relationships:
- (Planet)-[:ORBITS]->(Star)
- (Moon)-[:ORBITS]->(Planet)
- (Planet)-[:HAS_ATMOSPHERE]->(Atmosphere)
- (Atmosphere)-[:COMPOSED_OF]->(Substance)
- (Planet)-[:HAS_MAGNETIC_FIELD]->(MagneticFieldStrength)
     
ALLOWED VALUES:
- MagneticFieldStrength.id \\in {{"none", "weak", "strong"}}

QUERY RULES:
1. Always specify node labels.
2. Always specify relationship directions.
3. MagneticFieldStrength nodes MUST be matched or merged by id.
4. Use meaningful variable names.
5. Return only properties, not full nodes.
6. Use DISTINCT unless duplicates are required.
7. Use OPTIONAL MATCH if information may be missing.
8. Do not use APOC or procedures.

FAILURE CONDITIONS:
- If required entities, labels, or relationships are missing from the schema,
  output:
  // CANNOT_ANSWER

"""),
    ("human", "{question}")
])

In [24]:
prompt_str = cypher_prompt.format(question=query_text)
retrieval_query = cypher_model.invoke(prompt_str)

In [25]:
print("Generated Cypher Query:")
print(retrieval_query.content)

Generated Cypher Query:
MATCH (p:Planet)-[:ORBITS]->(m:Moon)
WITH p, COUNT(m) AS moonCount
WHERE moonCount = 2
RETURN p.id


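### Error in generated Cypher

The generated query reverses the relationship direction: it matches `(Planet)-[:ORBITS]->(Moon)`, whereas the schema defines `(Moon)-[:ORBITS]->(Planet)`. Even with the direction listed in the schema, the model may find it hard to distinguish which object orbits which.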

```
MATCH (p:Planet)-[:ORBITS]->(m:Moon)
WITH p, COUNT(m) AS moonCount
WHERE moonCount = 2
RETURN p.id
```
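
A lightweight check can catch this kind of mistake before the query is executed. The sketch below is a heuristic, not a Cypher parser: it uses regular expressions to extract single-hop `(a:Label)-[:TYPE]->(b:Label)` patterns and compares them with the allowed patterns from the schema defined earlier. The function name is hypothetical, and chained multi-hop patterns or unlabeled nodes are not handled.

```python
import re

# Allowed (source, relationship, target) patterns, copied from the schema.
ALLOWED_PATTERNS = {
    ("Planet", "ORBITS", "Star"),
    ("Moon", "ORBITS", "Planet"),
    ("Planet", "HAS_ATMOSPHERE", "Atmosphere"),
    ("Atmosphere", "COMPOSED_OF", "Substance"),
    ("Planet", "HAS_MAGNETIC_FIELD", "MagneticFieldStrength"),
}

NODE = r"\(\w*:(\w+)[^)]*\)"  # captures the label of a labeled node
FORWARD = re.compile(NODE + r"-\[\w*:(\w+)\]->" + NODE)
BACKWARD = re.compile(NODE + r"<-\[\w*:(\w+)\]-" + NODE)

def invalid_patterns(cypher):
    """Return the single-hop patterns in `cypher` that the schema does not allow."""
    found = [(left, rel, right) for left, rel, right in FORWARD.findall(cypher)]
    # For a left-pointing arrow, the right-hand node is the source.
    found += [(right, rel, left) for left, rel, right in BACKWARD.findall(cypher)]
    return [pattern for pattern in found if pattern not in ALLOWED_PATTERNS]

generated = """
MATCH (p:Planet)-[:ORBITS]->(m:Moon)
WITH p, COUNT(m) AS moonCount
WHERE moonCount = 2
RETURN p.id
"""
print(invalid_patterns(generated))  # [('Planet', 'ORBITS', 'Moon')]
```

A non-empty result could be used to reject the query or to re-prompt the model with the offending pattern.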

Let's try again, this time stating inside the question itself that moons orbit planets.

In [26]:
new_query_text = "Moons orbit planets. Which planet has exactly two moons?"

prompt_str = cypher_prompt.format(question=new_query_text)
retrieval_query = cypher_model.invoke(prompt_str)

In [27]:
print("Generated Cypher Query:")
print(retrieval_query.content)

Generated Cypher Query:
MATCH (p:Planet)<-[:ORBITS]-(m:Moon)
WITH p, COUNT(m) AS moonCount
WHERE moonCount = 2
RETURN p.id


In [28]:
vector_cypher_retriever = VectorCypherRetriever(
    driver=driver,
    index_name="text_embeddings",
    retrieval_query=retrieval_query.content,
    embedder=embedder
)
vector_cypher_retriever_result = vector_cypher_retriever.get_search_results(query_text=new_query_text, top_k=3)
print(vector_cypher_retriever_result)

records=[] metadata={'query_vector': [-0.06761809, -0.026653253, -0.0076062735, -0.029500986, 0.062323287, 0.013448412, 0.024527883, -0.008000494, 0.011512892, -0.062426258, -0.001657002, 0.0031641268, 0.028899519, -0.006955838, -0.03656136, 0.091650754, 0.022950534, 0.055763617, 0.09427943, -0.007070434, -0.006330946, 0.023841143, -0.009253826, 0.04012057, 0.050484765, 0.051139083, -0.010881126, 0.0068818554, -0.034379203, 0.029960241, 0.021660985, 0.0031901416, 0.025139267, 0.017510053, 0.00049797347, -0.010738733, -0.008462424, -0.058304504, -0.024714375, -0.00028101428, -0.011708774, 0.061709348, -0.022604153, 0.021844473, -0.020796707, -0.041666754, 0.080053784, 0.038709212, -0.0012346813, 0.015877659, -0.016060764, -0.0225018, 0.0022629565, -0.012326456, 0.027882207, -0.03508181, -0.010921005, -0.015959844, 0.029429993, -0.020834511, -0.0518125, 0.06339612, -0.024803812, 0.007960487, 0.002038054, 0.07848519, 0.029958684, 0.02623016, -0.028066076, -0.0012124092, -0.028881477, -0.0

In [29]:
hybrid_cypher_retriever = HybridCypherRetriever(
    driver=driver,
    vector_index_name="text_embeddings",
    fulltext_index_name="chunk_fulltext",
    retrieval_query=retrieval_query.content,
    embedder=embedder,
)
hybrid_cypher_retriever_result = hybrid_cypher_retriever.get_search_results(query_text=new_query_text, top_k=3)
print(hybrid_cypher_retriever_result)

records=[] metadata={'query_vector': [-0.06761809, -0.026653253, -0.0076062735, -0.029500986, 0.062323287, 0.013448412, 0.024527883, -0.008000494, 0.011512892, -0.062426258, -0.001657002, 0.0031641268, 0.028899519, -0.006955838, -0.03656136, 0.091650754, 0.022950534, 0.055763617, 0.09427943, -0.007070434, -0.006330946, 0.023841143, -0.009253826, 0.04012057, 0.050484765, 0.051139083, -0.010881126, 0.0068818554, -0.034379203, 0.029960241, 0.021660985, 0.0031901416, 0.025139267, 0.017510053, 0.00049797347, -0.010738733, -0.008462424, -0.058304504, -0.024714375, -0.00028101428, -0.011708774, 0.061709348, -0.022604153, 0.021844473, -0.020796707, -0.041666754, 0.080053784, 0.038709212, -0.0012346813, 0.015877659, -0.016060764, -0.0225018, 0.0022629565, -0.012326456, 0.027882207, -0.03508181, -0.010921005, -0.015959844, 0.029429993, -0.020834511, -0.0518125, 0.06339612, -0.024803812, 0.007960487, 0.002038054, 0.07848519, 0.029958684, 0.02623016, -0.028066076, -0.0012124092, -0.028881477, -0.0