# Text to Cypher

Vector retrievers are great for finding relevant data based on semantic similarity.

To answer more specific questions, you may need to perform more complex queries to find data relating to specific nodes, relationships, or properties.

Text to Cypher allows you to convert natural language queries into Cypher queries that can be executed against the graph.

In this module, you will:

- Use the Cypher QA (question-answering) chain to query the graph using natural language queries.

- Create a custom Cypher generation prompt that includes specific instructions and example queries.

- Explore how restricting the schema can support more focused queries.

- Add a text to Cypher retriever to a LangChain agent.


# Cypher QA Chain

The LangChain [GraphCypherQAChain](https://python.langchain.com/api_reference/neo4j/chains/langchain_neo4j.chains.graph_qa.cypher.GraphCypherQAChain.html):

1. Accepts a question.

2. Converts the question into a Cypher query using the graph schema.

3. Executes the query.

4. Uses the result to generate an answer.

If asked the question "What year was the movie Babe released?", the chain will generate messages like:

```
[human]
What year was the movie Babe released?
[system]
Generate a Cypher query based on this question and this graph schema.
[assistant]
MATCH (m:Movie)
WHERE m.title = 'Babe'
RETURN m.released

The Cypher query is then executed and the result returned.

[system]
Generate an answer based on these results [{m.released: 1995}].
[assistant]
The movie Babe was released in 1995.
```

```python
import os
from dotenv import load_dotenv
load_dotenv()

from langchain.chat_models import init_chat_model
from langchain_neo4j import Neo4jGraph
from langchain_neo4j import GraphCypherQAChain

# Initialize the LLM
model = init_chat_model(
    "gpt-4o",
    model_provider="openai"
)

# Connect to Neo4j
graph = Neo4jGraph(
    url=os.getenv("NEO4J_URI"),
    username=os.getenv("NEO4J_USERNAME"),
    password=os.getenv("NEO4J_PASSWORD"),
)

# Create the Cypher QA chain
cypher_qa = GraphCypherQAChain.from_llm(
    graph=graph,
    llm=model,
    allow_dangerous_requests=True,
    verbose=True,
)

# Invoke the chain
question = "How many movies are in the Sci-Fi genre?"
response = cypher_qa.invoke({"query": question})
print(response["result"])
```

```
> Entering new GraphCypherQAChain chain...
Generated Cypher:
cypher
MATCH (:Movie)-[:IN_GENRE]->(:Genre {name: "Sci-Fi"})
RETURN count(*) AS sciFiMovieCount

Full Context:
[{'sciFiMovieCount': 5}]

> Finished chain.
There are 5 movies in the Sci-Fi genre.
```


## Allow Dangerous Requests

You are trusting the generation of Cypher to the LLM. It may generate invalid queries, or valid queries that modify or delete data in the graph or expose sensitive information.

You have to opt in to this risk by setting the `allow_dangerous_requests` flag to `True`.

In a production environment, you should ensure that access to data is limited and that sufficient security is in place to prevent malicious queries. This could include the use of a [read-only user](https://neo4j.com/docs/operations-manual/current/authentication-authorization/manage-users/) or [role-based access control](https://neo4j.com/docs/operations-manual/current/authentication-authorization/manage-privileges/).
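As a complementary safeguard, generated Cypher can be screened before it is executed. The following is a minimal sketch and not part of `GraphCypherQAChain`: the `is_read_only` helper and its keyword list are assumptions made for illustration. A keyword check like this is a heuristic only (it does not cover procedure calls, for example) and is no substitute for a read-only database user.

```python
import re

# Hypothetical deny-list of Cypher clauses that write to or alter the graph.
WRITE_CLAUSES = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP|LOAD\s+CSV|FOREACH)\b",
    re.IGNORECASE,
)

def is_read_only(cypher: str) -> bool:
    """Return True if no write clause keyword appears in the query text."""
    # Remove quoted string literals first so that a value such as
    # 'Set It Off' does not trigger a false positive.
    without_strings = re.sub(r"'[^']*'|\"[^\"]*\"", "", cypher)
    return WRITE_CLAUSES.search(without_strings) is None

print(is_read_only("MATCH (m:Movie) WHERE m.title = 'Babe' RETURN m.released"))
print(is_read_only("MATCH (m:Movie) DETACH DELETE m"))
```

The first call prints `True` and the second prints `False`.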

## Generated Cypher

The LLM may not always understand the graph schema or the question correctly. This can lead to the generated Cypher queries being incorrect or inefficient.

You will explore different ways to improve the quality of the generated Cypher queries in the next lesson.

## Cypher LLM

You can use different LLMs to generate the Cypher query and the answer.

This is useful because the requirements for generating a Cypher query may be different from those for generating an answer.

Modify the program to include a different LLM for the Cypher query generation:

```python
cypher_model = init_chat_model(
    "gpt-4o", 
    model_provider="openai",
    temperature=0.0
)
```

# Cypher Generation

To improve the accuracy of the generated Cypher queries, you can customize the generation prompt for your data requirements.

In this lesson, you will learn how to provide specific instructions and example queries to improve Cypher query generation.

## Prompt

You can provide a custom prompt to the GraphCypherQAChain. You can tailor the prompt to your use case to generate more accurate Cypher queries.

```python
from langchain_core.prompts.prompt import PromptTemplate

# Cypher template
cypher_template = """Task:Generate Cypher statement to query a graph database.
Instructions:
Use only the provided relationship types and properties in the schema.
Do not use any other relationship types or properties that are not provided.

Schema:
{schema}

Note: Do not include any explanations or apologies in your responses.
Do not respond to any questions that might ask anything else than for you to construct a Cypher statement.
Do not include any text except the generated Cypher statement.

The question is:
{question}"""

cypher_prompt = PromptTemplate(
    input_variables=["schema", "question"], 
    template=cypher_template
)
```

The prompt includes instructions for generating Cypher queries, along with placeholders for the `schema` and `question`. When invoked, the `GraphCypherQAChain` inserts the graph schema and the user's question into these placeholders.
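To see what the LLM receives, the substitution can be previewed with plain string formatting. This is a sketch only: the shortened `preview_template` and the `example_schema` text are made up for illustration, and the chain itself reads the real schema from the database.

```python
# Shortened version of the template, kept small for illustration.
preview_template = """Task:Generate Cypher statement to query a graph database.

Schema:
{schema}

The question is:
{question}"""

# Hypothetical schema text; the chain reads the real one from the database.
example_schema = "(:Movie {title: STRING})-[:IN_GENRE]->(:Genre {name: STRING})"

# Fill both placeholders, as the chain does when it is invoked.
rendered = preview_template.format(
    schema=example_schema,
    question="How many movies are in the Sci-Fi genre?",
)
print(rendered)
```

Printing the rendered prompt this way is a quick check that the schema and the question land where you expect.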

Add the custom prompt to the `GraphCypherQAChain`:

```python
cypher_qa = GraphCypherQAChain.from_llm(
    graph=graph, 
    llm=model, 
    cypher_llm=cypher_model,
    cypher_prompt=cypher_prompt,
    allow_dangerous_requests=True,
    verbose=True,
)
```

## Specific instructions

To manage specific data or business rules, you can provide specific instructions to the LLM when generating the Cypher.

For example, movie titles that start with "The" are stored in the graph as "Matrix, The" instead of "The Matrix".

Asking the LLM to generate Cypher queries without this information will result in no data being returned.

```
[user]
Who acted in the movie The Matrix?

[assistant]
I don't know.
```

Update the `cypher_template` to include a specific instruction to the LLM to handle this case:

    For movie titles that begin with "The", move "the" to the end,
    for example "The 39 Steps" becomes "39 Steps, The".

```python
# Cypher template with additional instructions
cypher_template = """Task:Generate Cypher statement to query a graph database.
Instructions:
Use only the provided relationship types and properties in the schema.
Do not use any other relationship types or properties that are not provided.
For movie titles that begin with "The", move "the" to the end, for example "The 39 Steps" becomes "39 Steps, The".

Schema:
{schema}

Note: Do not include any explanations or apologies in your responses.
Do not respond to any questions that might ask anything else than for you to construct a Cypher statement.
Do not include any text except the generated Cypher statement.

The question is:
{question}"""
```
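The title rule itself is easy to state in code, which can help when checking the titles the LLM produces. A minimal sketch, assuming a hypothetical helper named `to_stored_title` that is not part of the course code:

```python
def to_stored_title(title: str) -> str:
    """Move a leading "The" to the end, matching how titles are stored."""
    prefix = "The "
    if title.startswith(prefix):
        return f"{title[len(prefix):]}, The"
    return title

print(to_stored_title("The 39 Steps"))  # 39 Steps, The
print(to_stored_title("The Matrix"))    # Matrix, The
print(to_stored_title("Babe"))          # Babe
```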

## Examples

You can provide examples of questions and relevant Cypher queries to help the LLM generate more accurate Cypher queries.

Questions that relate to movie ratings often generate ambiguous or incorrect Cypher. This is because the user rating is a property of the `RATED` relationship, while the `Movie` node also has an `imdbRating` property.

Each example should pair a question with the expected Cypher query, for example:

```
Question: Get user ratings?
Cypher: MATCH (u:User)-[r:RATED]->(m:Movie)
        WHERE u.name = "User name"
        RETURN r.rating AS userRating
```

Update the `cypher_template` to include the examples relating to movie ratings:

```python
# Cypher template with examples
cypher_template = """Task:Generate Cypher statement to query a graph database.
Instructions:
Use only the provided relationship types and properties in the schema.
Do not use any other relationship types or properties that are not provided.
For movie titles that begin with "The", move "the" to the end, for example "The 39 Steps" becomes "39 Steps, The".

Schema:
{schema}
Examples:
1. Question: Get user ratings?
   Cypher: MATCH (u:User)-[r:RATED]->(m:Movie) WHERE u.name = "User name" RETURN r.rating AS userRating
2. Question: Get average rating for a movie?
   Cypher: MATCH (m:Movie)<-[r:RATED]-(u:User) WHERE m.title = 'Movie Title' RETURN avg(r.rating) AS userRating

Note: Do not include any explanations or apologies in your responses.
Do not respond to any questions that might ask anything else than for you to construct a Cypher statement.
Do not include any text except the generated Cypher statement.

The question is:
{question}"""
```
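As the number of examples grows, it can be easier to keep them as data and build the `Examples:` section from a list. The sketch below is one possible approach; the `format_examples` helper is hypothetical. It doubles any curly braces in the example Cypher, because format-style templates treat single braces as placeholders.

```python
def format_examples(examples: list[tuple[str, str]]) -> str:
    """Render (question, cypher) pairs as a numbered Examples section."""
    lines = ["Examples:"]
    for number, (question, cypher) in enumerate(examples, start=1):
        # Escape literal braces so they survive template formatting.
        escaped = cypher.replace("{", "{{").replace("}", "}}")
        lines.append(f"{number}. Question: {question}")
        lines.append(f"   Cypher: {escaped}")
    return "\n".join(lines)

rating_examples = [
    ("Get user ratings?",
     'MATCH (u:User)-[r:RATED]->(m:Movie) WHERE u.name = "User name" '
     "RETURN r.rating AS userRating"),
    ("Get average rating for a movie?",
     "MATCH (m:Movie)<-[r:RATED]-(u:User) WHERE m.title = 'Movie Title' "
     "RETURN avg(r.rating) AS userRating"),
]

examples_section = format_examples(rating_examples)
print(examples_section)
```

The returned text can be concatenated into `cypher_template` before the `PromptTemplate` is created.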

```python
import os
from dotenv import load_dotenv
load_dotenv()

from langchain_neo4j import Neo4jGraph
from langchain_neo4j import GraphCypherQAChain
from langchain.chat_models import init_chat_model
from langchain_core.prompts.prompt import PromptTemplate

model = init_chat_model(
    "gpt-4o",
    model_provider="openai"
)

cypher_model = init_chat_model(
    "gpt-4o-mini",
    model_provider="openai",
    temperature=0.0
)

graph = Neo4jGraph(
    url=os.getenv("NEO4J_URI"),
    username=os.getenv("NEO4J_USERNAME"),
    password=os.getenv("NEO4J_PASSWORD"),
)

# Cypher template with examples
cypher_template = """Task:Generate Cypher statement to query a graph database.
Instructions:
Use only the provided relationship types and properties in the schema.
Do not use any other relationship types or properties that are not provided.
For movie titles that begin with "The", move "the" to the end, for example "The 39 Steps" becomes "39 Steps, The".

Schema:
{schema}
Examples:
1. Question: Get user ratings?
   Cypher: MATCH (u:User)-[r:RATED]->(m:Movie) WHERE u.name = "User name" RETURN r.rating AS userRating
2. Question: Get average rating for a movie?
   Cypher: MATCH (m:Movie)<-[r:RATED]-(u:User) WHERE m.title = 'Movie Title' RETURN avg(r.rating) AS userRating

Note: Do not include any explanations or apologies in your responses.
Do not respond to any questions that might ask anything else than for you to construct a Cypher statement.
Do not include any text except the generated Cypher statement.

The question is:
{question}"""

cypher_prompt = PromptTemplate(
    input_variables=["schema", "question"],
    template=cypher_template
)

cypher_qa = GraphCypherQAChain.from_llm(
    graph=graph,
    llm=model,
    cypher_llm=cypher_model,
    cypher_prompt=cypher_prompt,
    allow_dangerous_requests=True,
    verbose=True,
)

question = "What was the release date of the movie The 39 Steps?"
response = cypher_qa.invoke({"query": question})
print(response["result"])
```

```
> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (m:Movie) WHERE m.title = '39 Steps, The' RETURN m.released AS releaseDate
Full Context:
[]

> Finished chain.
I don't know the answer.
```


## Genres

The database contains data about movie genres.

When generating more complex Cypher queries, such as those that involve genres, the LLM may not generate the correct Cypher query.

These queries may require a specific example on how to retrieve genres from the graph:

- What is the highest user rated movie in the Horror genre?

- How many Sci-Fi movies has Tom Hanks acted in?

Your challenge is to provide an example Cypher query that demonstrates how to retrieve genres from the graph.

```python
# Cypher template with examples
cypher_template = """Task:Generate Cypher statement to query a graph database.
Instructions:
Use only the provided relationship types and properties in the schema.
Do not use any other relationship types or properties that are not provided.
For movie titles that begin with "The", move "the" to the end, for example "The 39 Steps" becomes "39 Steps, The".

Schema:
{schema}
Examples:
1. Question: Get user ratings?
   Cypher: MATCH (u:User)-[r:RATED]->(m:Movie) WHERE u.name = "User name" RETURN r.rating AS userRating
2. Question: Get average rating for a movie?
   Cypher: MATCH (m:Movie)<-[r:RATED]-(u:User) WHERE m.title = 'Movie Title' RETURN avg(r.rating) AS userRating
3. Question: Get movies for a genre?
   Cypher: MATCH (m:Movie)-[:IN_GENRE]->(g:Genre) WHERE g.name = 'Genre Name' RETURN m.title AS movieTitle

Note: Do not include any explanations or apologies in your responses.
Do not respond to any questions that might ask anything else than for you to construct a Cypher statement.
Do not include any text except the generated Cypher statement.

The question is:
{question}"""
```
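Because the LLM tends to imitate the examples it is given, a syntax slip in an example can carry over into generated queries. A lightweight check, sketched here with a hypothetical `has_balanced_delimiters` helper, catches unbalanced brackets before an example goes into the prompt. It does not validate Cypher in general.

```python
def has_balanced_delimiters(cypher: str) -> bool:
    """Check that (), [] and {} are balanced and properly nested.

    Brackets inside quoted string values are not treated specially.
    """
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []
    for char in cypher:
        if char in "([{":
            stack.append(char)
        elif char in pairs:
            # A closer must match the most recent unclosed opener.
            if not stack or stack.pop() != pairs[char]:
                return False
    return not stack

genre_example = (
    "MATCH (m:Movie)-[:IN_GENRE]->(g:Genre) "
    "WHERE g.name = 'Genre Name' RETURN m.title AS movieTitle"
)
print(has_balanced_delimiters(genre_example))
print(has_balanced_delimiters("MATCH ((m:Movie) RETURN m"))
```

The first call prints `True`; the second prints `False` because one opening parenthesis is never closed.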

# Schema

The LLM generates Cypher queries based on the schema of the graph. When a query is submitted, the schema is automatically read from the database and added to the prompt.

In this lesson, you will learn how to restrict the schema to only include certain node labels or relationship types.

Restricting the schema can help generate better Cypher by:

- Reducing the complexity of the generated Cypher queries.

- Helping the LLM focus on the relevant parts of the graph.

- Excluding irrelevant or unwanted parts of the graph that may confuse the LLM.

More generally, the more focused the schema, the better the LLM can generate Cypher queries.

## Restricting the schema

You can restrict the schema by providing the `GraphCypherQAChain` with a list of node labels and relationship types to either **include** in or **exclude** from the schema.

If you wanted to just include data about movies and their directors, you could provide the following list of node labels and relationship types to include:

- Movie

- DIRECTED

- Director

You provide the types as a list to the `include_types` parameter of the `GraphCypherQAChain`:

```python
# Create the Cypher QA chain
cypher_qa = GraphCypherQAChain.from_llm(
    graph=graph, 
    llm=model, 
    include_types=["Movie", "DIRECTED", "Director"],
    allow_dangerous_requests=True,
    verbose=True, 
)
```

When prompted with a question about movies, the LLM will only be able to respond with answers related to movies and directors.

Alternatively, if you wanted to exclude ratings data, you could provide `User` and `RATED` as the types to the `exclude_types` parameter:

```python
# Create the Cypher QA chain
cypher_qa = GraphCypherQAChain.from_llm(
    graph=graph, 
    llm=model, 
    exclude_types=["User", "RATED"],
    allow_dangerous_requests=True,
    verbose=True, 
)
```

How you restrict the schema will depend on the graph structure and the types of questions you want to answer.
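Conceptually, restricting the schema is a filter applied before the schema text reaches the prompt. The sketch below illustrates the idea on a hand-written dictionary. The `restrict_schema` function and the `schema` structure are assumptions for illustration and do not reproduce how `langchain_neo4j` implements it.

```python
def restrict_schema(schema, include_types=None, exclude_types=None):
    """Filter node labels and relationships by an include or exclude list."""
    if include_types and exclude_types:
        raise ValueError("Provide include_types or exclude_types, not both.")

    def allowed(name):
        if include_types:
            return name in include_types
        return name not in (exclude_types or [])

    return {
        "nodes": [label for label in schema["nodes"] if allowed(label)],
        # Keep a relationship only if its type and both endpoints are allowed.
        "relationships": [
            (start, rel_type, end)
            for start, rel_type, end in schema["relationships"]
            if allowed(start) and allowed(rel_type) and allowed(end)
        ],
    }

# Hypothetical, simplified schema for the movie graph.
schema = {
    "nodes": ["Movie", "Director", "Genre", "User"],
    "relationships": [
        ("Director", "DIRECTED", "Movie"),
        ("Movie", "IN_GENRE", "Genre"),
        ("User", "RATED", "Movie"),
    ],
}

print(restrict_schema(schema, include_types=["Movie", "DIRECTED", "Director"]))
print(restrict_schema(schema, exclude_types=["User", "RATED"]))
```

With the include list, only `Movie`, `Director`, and the `DIRECTED` relationship remain. With the exclude list, everything except `User` and `RATED` remains.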

# Retriever

In this lesson, you will use the GraphCypherQAChain to add a text to Cypher retriever to the LangChain agent.

## Text to Cypher

```python
import os
from dotenv import load_dotenv
load_dotenv()

from langchain.chat_models import init_chat_model
from langgraph.graph import START, StateGraph
from langchain_core.prompts import PromptTemplate
from typing_extensions import TypedDict
from langchain_neo4j import Neo4jGraph
from langchain_neo4j import GraphCypherQAChain

# Initialize the LLM
model = init_chat_model("gpt-4o", model_provider="openai")

# Create a prompt
template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}

Answer:"""

prompt = PromptTemplate.from_template(template)

# Define state for application
class State(TypedDict):
    question: str
    context: dict  # the chain response: {"query": ..., "result": ...}
    answer: str

# Connect to Neo4j
graph = Neo4jGraph(
    url=os.getenv("NEO4J_URI"),
    username=os.getenv("NEO4J_USERNAME"),
    password=os.getenv("NEO4J_PASSWORD"),
)

# Create the Cypher QA chain
# return_direct=True returns the query results instead of a generated answer
cypher_qa = GraphCypherQAChain.from_llm(
    graph=graph,
    llm=model,
    allow_dangerous_requests=True,
    return_direct=True,
)

# Define functions for each step in the application

# Retrieve context
def retrieve(state: State):
    context = cypher_qa.invoke(
        {"query": state["question"]}
    )
    return {"context": context}

# Generate the answer based on the question and context
def generate(state: State):
    messages = prompt.invoke({"question": state["question"], "context": state["context"]})
    response = model.invoke(messages)
    return {"answer": response.content}

# Define application steps
workflow = StateGraph(State).add_sequence([retrieve, generate])
workflow.add_edge(START, "retrieve")
app = workflow.compile()

# Run the application
question = "What movies has Tom Hanks acted in?"
response = app.invoke({"question": question})
print("Answer:", response["answer"])
print("Context:", response["context"])
```

```
Answer: Tom Hanks has acted in "Toy Story" according to the provided context. However, he has also appeared in many other movies, including "Forrest Gump," "Cast Away," "Saving Private Ryan," "The Green Mile," "Big," and the "Toy Story" sequels, among others. Please note that the context provided only lists "Toy Story."
Context: {'query': 'What movies has Tom Hanks acted in?', 'result': [{'m.title': 'Toy Story'}]}
```
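The context passed to the answer prompt here is the whole chain response, including the original question. One option, sketched below with a hypothetical `format_context` helper that is not part of the course code, is to pass only the result rows as plain text.

```python
def format_context(response: dict) -> str:
    """Render the rows in a chain response as one line of text per row."""
    rows = response.get("result") or []
    if isinstance(rows, str):
        # Without return_direct=True the result is already an answer string.
        return rows
    return "\n".join(
        ", ".join(f"{key}: {value}" for key, value in row.items())
        for row in rows
    )

response = {
    "query": "What movies has Tom Hanks acted in?",
    "result": [{"m.title": "Toy Story"}],
}
print(format_context(response))  # m.title: Toy Story
```

A helper like this could be called inside `retrieve` before the context is returned, in which case the `context` field of the state would hold a string.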


## Improve the retriever

Your challenge is to improve the retriever using the techniques you learned in the previous lessons, which could include:

- Providing a custom prompt and specific instructions.

- Including example questions and Cypher queries.

- Using a different LLM for Cypher generation.

- Restricting the schema to provide more focused results.

Here are some examples of more complex questions you can try:

- When was the movie The Abyss released?

- What is the highest grossing movie of all time?

- Can you recommend a Horror movie based on user rating?

- What movies scored above 4 for user rating?

- What are the highest rated movies with more than 100 ratings?

There is no right or wrong solution. You should experiment with different approaches to see how they affect the accuracy and relevance of the generated Cypher queries.

```python
import os
from dotenv import load_dotenv
load_dotenv()

from langchain.chat_models import init_chat_model
from langgraph.graph import START, StateGraph
from langchain_core.prompts import PromptTemplate
from typing_extensions import TypedDict
from langchain_neo4j import Neo4jGraph
from langchain_neo4j import GraphCypherQAChain

# Initialize the LLM
model = init_chat_model("gpt-4o", model_provider="openai")

# Create a prompt
template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}

Answer:"""

prompt = PromptTemplate.from_template(template)

# Define state for application
class State(TypedDict):
    question: str
    context: dict  # the chain response: {"query": ..., "result": ...}
    answer: str

# Connect to Neo4j
graph = Neo4jGraph(
    url=os.getenv("NEO4J_URI"),
    username=os.getenv("NEO4J_USERNAME"),
    password=os.getenv("NEO4J_PASSWORD"),
)

# Cypher template with specific instructions and examples
cypher_template = """Task:Generate Cypher statement to query a graph database.
Instructions:
Use only the provided relationship types and properties in the schema.
Do not use any other relationship types or properties that are not provided.
For movie titles that begin with "The", move "the" to the end, for example "The 39 Steps" becomes "39 Steps, The".
Exclude NULL values when finding the highest value of a property.

Schema:
{schema}
Examples:
1. Question: Get user ratings?
   Cypher: MATCH (u:User)-[r:RATED]->(m:Movie) WHERE u.name = "User name" RETURN r.rating AS userRating
2. Question: Get average rating for a movie?
   Cypher: MATCH (m:Movie)<-[r:RATED]-(u:User) WHERE m.title = 'Movie Title' RETURN avg(r.rating) AS userRating
3. Question: Get movies for a genre?
   Cypher: MATCH (m:Movie)-[:IN_GENRE]->(g:Genre) WHERE g.name = 'Genre Name' RETURN m.title AS movieTitle

Note: Do not include any explanations or apologies in your responses.
Do not respond to any questions that might ask anything else than for you to construct a Cypher statement.
Do not include any text except the generated Cypher statement.

The question is:
{question}"""

cypher_prompt = PromptTemplate(
    input_variables=["schema", "question"],
    template=cypher_template
)

# Create the Cypher QA chain
cypher_qa = GraphCypherQAChain.from_llm(
    graph=graph,
    llm=model,
    cypher_prompt=cypher_prompt,
    allow_dangerous_requests=True,
    verbose=True,
)

# Define functions for each step in the application

# Retrieve context
def retrieve(state: State):
    context = cypher_qa.invoke(
        {"query": state["question"]}
    )
    return {"context": context}

# Generate the answer based on the question and context
def generate(state: State):
    messages = prompt.invoke({"question": state["question"], "context": state["context"]})
    response = model.invoke(messages)
    return {"answer": response.content}

# Define application steps
workflow = StateGraph(State).add_sequence([retrieve, generate])
workflow.add_edge(START, "retrieve")
app = workflow.compile()

# Run the application
question = "What is the highest grossing movie of all time?"
response = app.invoke({"question": question})
print("Answer:", response["answer"])
print("Context:", response["context"])
```

```
> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (m:Movie) WHERE m.revenue IS NOT NULL RETURN CASE WHEN m.title STARTS WITH 'The ' THEN substring(m.title, 4) + ', The' ELSE m.title END AS movieTitle ORDER BY m.revenue DESC LIMIT 1
Full Context:
[{'movieTitle': 'Toy Story'}]

> Finished chain.
Answer: I don't know.
Context: {'query': 'What is the highest grossing movie of all time?', 'result': "I don't know the answer."}
```
