In [None]:
!pip install -qU langchain langchain-community langchain-openai neo4j

# Build a Question Answering application over a Graph Database

In this session, we will create a Q&A chain over a graph database. Such a chain lets us ask a question about the data in a graph database and get back a natural language answer.

Note:
* Building Q&A systems over graph databases requires executing model-generated graph queries. Make sure that the database connection permissions are always scoped as narrowly as possible for the chain's or agent's needs. This mitigates, though does not eliminate, the risks of building a model-driven system.
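
As an extra layer on top of narrowly scoped database permissions, an application can also reject generated statements that appear to write data before they are executed. The sketch below is a simple keyword check and not part of LangChain: the `WRITE_CLAUSES` list and the `looks_read_only` helper are hypothetical names, and the list is illustrative rather than exhaustive. A read-only database role remains the primary control.

```python
import re

# Clauses that modify data or schema. Illustrative only, not exhaustive.
WRITE_CLAUSES = ("CREATE", "MERGE", "DELETE", "SET", "REMOVE", "DROP", "LOAD CSV")

# Build one case-insensitive pattern; allow any whitespace inside "LOAD CSV".
_WRITE_PATTERN = re.compile(
    r"\b(" + "|".join(c.replace(" ", r"\s+") for c in WRITE_CLAUSES) + r")\b",
    re.IGNORECASE,
)


def looks_read_only(cypher: str) -> bool:
    """Return True if no write-clause keyword appears outside string literals."""
    # Drop quoted string literals so a title such as 'The Set' is not flagged.
    without_strings = re.sub(r"'[^']*'|\"[^\"]*\"", "", cypher)
    return _WRITE_PATTERN.search(without_strings) is None


print(looks_read_only("MATCH (m:Movie) RETURN m.title LIMIT 5"))  # True
print(looks_read_only("MATCH (m:Movie) DETACH DELETE m"))         # False
```

A check like this can produce false positives and can be bypassed, so it is best treated as a backstop rather than a security boundary.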

## Architecture

At a high level, the steps of most graph chains are:
1. **Convert question to a graph database query**: Model converts user input to a graph database query (e.g., Cypher).
2. **Execute graph database query**: Execute the graph database query.
3. **Answer the question**: Model responds to user input using the query results.

Question -> `LLM` -> `Cypher Query` -> `GraphDB` -> `LLM` -> Answer; Optional: Graph Agent
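
The three steps can be sketched as a plain Python function. This is only an illustration of the data flow, not how LangChain implements it: `answer_question` and the three `fake_*` stand-ins are hypothetical, and the names returned by the fake database are placeholders rather than real query results.

```python
from typing import Callable


def answer_question(
    question: str,
    generate_cypher: Callable[[str], str],
    run_query: Callable[[str], list],
    summarize: Callable[[str, list], str],
) -> str:
    cypher = generate_cypher(question)  # Step 1: question -> graph query
    rows = run_query(cypher)            # Step 2: execute the query
    return summarize(question, rows)    # Step 3: results -> natural language


# Stand-ins for the LLM and the database so the flow runs without any services.
def fake_generate_cypher(question: str) -> str:
    return (
        "MATCH (p:Person)-[:ACTED_IN]->(m:Movie {title: 'Casino'}) "
        "RETURN p.name AS name"
    )


def fake_run_query(cypher: str) -> list:
    return [{"name": "Actor A"}, {"name": "Actor B"}]  # placeholder rows


def fake_summarize(question: str, rows: list) -> str:
    names = ", ".join(row["name"] for row in rows)
    return f"The cast includes: {names}."


print(answer_question(
    "What was the cast of Casino?",
    fake_generate_cypher,
    fake_run_query,
    fake_summarize,
))
# The cast includes: Actor A, Actor B.
```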

In [None]:
import os

langchain_api_key = 'your_langchain_api_key_here'  # Replace with your actual LangChain API key
os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_API_KEY'] = langchain_api_key

openai_api_key = 'your_openai_api_key_here'  # Replace with your actual OpenAI API key
os.environ['OPENAI_API_KEY'] = openai_api_key

Next, we need to define Neo4j credentials.

In [None]:
# Assume we have set up Neo4j properly
os.environ['NEO4J_URI'] = "bolt://localhost:7687"
os.environ['NEO4J_USERNAME'] = "neo4j"
os.environ['NEO4J_PASSWORD'] = "password"

Next, we will create a connection to a Neo4j database and populate it with example data about movies and their actors.

In [None]:
from langchain_community.graphs import Neo4jGraph

graph = Neo4jGraph()

# Import movie information

movies_query = """
LOAD CSV WITH HEADERS FROM
'https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/movies/movies_small.csv'
AS row
MERGE (m:Movie {id:row.movieId})
SET m.released = date(row.released),
    m.title = row.title,
    m.imdbRating = toFloat(row.imdbRating)
FOREACH (director in split(row.director, '|') |
    MERGE (p:Person {name:trim(director)})
    MERGE (p)-[:DIRECTED]->(m))
FOREACH (actor in split(row.actors, '|') |
    MERGE (p:Person {name:trim(actor)})
    MERGE (p)-[:ACTED_IN]->(m))
FOREACH (genre in split(row.genres, '|') |
    MERGE (g:Genre {name:trim(genre)})
    MERGE (m)-[:IN_GENRE]->(g))
"""

graph.query(movies_query)

`movies_query` is a Cypher statement that loads a CSV file of movie data and builds a graph with nodes and relationships for movies, people (directors and actors), and genres. Here’s a breakdown of each part:

1. **`LOAD CSV WITH HEADERS FROM ... AS row`**:
   - Loads the CSV file from the specified URL, treating each line as a row with headers.

2. **`MERGE (m:Movie {id:row.movieId})`**:
   - For each row in the CSV, it finds (or creates if it doesn’t exist) a `Movie` node whose `id` property equals the row's `movieId`.

3. **`SET m.released = date(row.released), m.title = row.title, m.imdbRating = toFloat(row.imdbRating)`**:
   - Sets properties for each `Movie` node:
     - `released`: The release date.
     - `title`: The movie title.
     - `imdbRating`: The IMDb rating, converted to a float.

4. **`FOREACH (director in split(row.director, '|') | ...)`**:
   - Iterates through a list of directors (assumes multiple directors are separated by `|` in the CSV).
   - For each director:
     - Creates a `Person` node with the name of the director if it doesn’t exist, then creates a `DIRECTED` relationship between that person and the movie.

5. **`FOREACH (actor in split(row.actors, '|') | ...)`**:
   - Similarly, iterates through actors, creating a `Person` node for each and establishing an `ACTED_IN` relationship with the movie.

6. **`FOREACH (genre in split(row.genres, '|') | ...)`**:
   - Iterates through genres, creating a `Genre` node for each unique genre and an `IN_GENRE` relationship between the movie and each genre.

Summary:
This script builds a Neo4j graph structure with:
- **Nodes**: `Movie`, `Person` (directors and actors), and `Genre`.
- **Relationships**:
  - `:DIRECTED` from `Person` to `Movie`.
  - `:ACTED_IN` from `Person` to `Movie`.
  - `:IN_GENRE` from `Movie` to `Genre`.

The result is a graph where you can query by movies, directors, actors, and genres, with relationships connecting relevant nodes.
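
To make the `split`, `trim`, and `MERGE` behavior concrete, here is a minimal pure-Python sketch that mimics the import on two made-up rows. The rows and the `load` helper are hypothetical. Python sets stand in for `MERGE`, since adding an existing element to a set is a no-op in the same way that `MERGE` does not duplicate an existing node or relationship.

```python
# Two made-up CSV rows; the values are placeholders, not real dataset content.
rows = [
    {"movieId": "1", "director": "Director A",
     "actors": "Actor A| Actor B", "genres": "Drama|Crime"},
    {"movieId": "2", "director": "Director A|Director B",
     "actors": "Actor B", "genres": "Drama"},
]


def load(rows):
    nodes, rels = set(), set()  # sets deduplicate, like MERGE
    for row in rows:
        movie = ("Movie", row["movieId"])
        nodes.add(movie)
        for name in row["director"].split("|"):
            person = ("Person", name.strip())  # strip() plays the role of trim()
            nodes.add(person)
            rels.add((person, "DIRECTED", movie))
        for name in row["actors"].split("|"):
            person = ("Person", name.strip())
            nodes.add(person)
            rels.add((person, "ACTED_IN", movie))
        for name in row["genres"].split("|"):
            genre = ("Genre", name.strip())
            nodes.add(genre)
            rels.add((movie, "IN_GENRE", genre))
    return nodes, rels


nodes, rels = load(rows)
print(len(nodes), len(rels))  # 8 9
```

Although "Director A", "Actor B", and "Drama" each appear in both rows, each ends up as a single node connected to both movies.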

## Graph schema

In order for an LLM to be able to generate a Cypher statement, it needs information about the graph schema.

When we instantiate a graph object, it retrieves the information about the graph schema. If we make any changes to the graph, we can run `refresh_schema` to refresh the schema information:

In [None]:
graph.refresh_schema()
graph.schema

The schema describes the node labels, their properties, and the relationships in the graph database that we can query.
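
To illustrate why the schema matters, the sketch below shows one way a schema description could be placed into a Cypher-generation prompt. Both the `SCHEMA` text and the `PROMPT_TEMPLATE` are hand-written assumptions for this example: the actual `graph.schema` output and LangChain's built-in prompt may be formatted differently.

```python
# Hand-written approximation of the schema produced by the import above.
SCHEMA = """Node properties:
Movie {id: STRING, title: STRING, released: DATE, imdbRating: FLOAT}
Person {name: STRING}
Genre {name: STRING}
Relationships:
(:Person)-[:DIRECTED]->(:Movie)
(:Person)-[:ACTED_IN]->(:Movie)
(:Movie)-[:IN_GENRE]->(:Genre)"""

# Hypothetical prompt template, not the one LangChain ships with.
PROMPT_TEMPLATE = """Generate a Cypher statement that answers the question.
Use only the labels, properties, and relationships in this schema:

{schema}

Question: {question}
Cypher:"""


def build_cypher_prompt(schema: str, question: str) -> str:
    # Braces inside the schema are safe: format() only parses the template.
    return PROMPT_TEMPLATE.format(schema=schema, question=question)


print(build_cypher_prompt(SCHEMA, "What was the cast of Casino?"))
```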

## Chain

We will use a simple chain that takes a question, turns it into a Cypher query, executes the query, and uses the result to answer the original question.

Steps:
1. The user asks a question, which is sent to the LangChain Cypher module.
2. A Cypher-generation chain translates the question into a Cypher statement.
3. The module connects to the Neo4j database and runs the generated Cypher statement.
4. A second chain summarizes the query results, converting them to natural language.
5. The module returns the answer to the user.

LangChain comes with a built-in chain for this workflow that is designed to work with Neo4j: `GraphCypherQAChain`.

In [None]:
from langchain.chains import GraphCypherQAChain
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model='gpt-3.5-turbo', temperature=0)

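# Recent langchain-community releases require explicitly acknowledging that the
# chain executes model-generated queries; see the security note at the top.
chain = GraphCypherQAChain.from_llm(
    graph=graph, llm=llm, verbose=True, allow_dangerous_requests=True
)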

response = chain.invoke({'query': "What was the cast of the Casino?"})
response

### Validating relationship direction

LLMs can struggle with relationship directions in generated Cypher statements. Since the graph schema is predefined, we can validate and optionally correct relationship directions in the generated Cypher statements by using the `validate_cypher` parameter:

In [None]:
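# allow_dangerous_requests is required by recent langchain-community releases.
chain = GraphCypherQAChain.from_llm(
    graph=graph,
    llm=llm,
    verbose=True,
    validate_cypher=True,
    allow_dangerous_requests=True,
)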

response = chain.invoke({'query': "What was the cast of the Casino?"})
response
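
To give a feel for what direction validation involves, here is a deliberately simplified sketch. It is not LangChain's implementation: `SCHEMA`, `PATTERN`, and `fix_directions` are hypothetical, and the regular expression only handles single-hop patterns in which both nodes carry a label.

```python
import re

# Allowed (start label, relationship type, end label) triples from the schema.
SCHEMA = {
    ("Person", "DIRECTED", "Movie"),
    ("Person", "ACTED_IN", "Movie"),
    ("Movie", "IN_GENRE", "Genre"),
}

PATTERN = re.compile(
    r"(\(\w*:(\w+)[^)]*\))"         # left node and its label
    r"(<?)-\[(\w*):(\w+)\]-(>?)"    # relationship with optional arrows
    r"(\(\w*:(\w+)[^)]*\))"         # right node and its label
)


def fix_directions(cypher: str) -> str:
    def repl(m: re.Match) -> str:
        left, l_label, l_arrow, var, rel, r_arrow, right, r_label = m.groups()
        if bool(l_arrow) == bool(r_arrow):
            return m.group(0)  # undirected or malformed: leave untouched
        start, end = (l_label, r_label) if r_arrow else (r_label, l_label)
        if (start, rel, end) in SCHEMA:
            return m.group(0)  # direction already matches the schema
        if (end, rel, start) in SCHEMA:
            # Only the reverse direction exists in the schema: flip the arrow.
            if r_arrow:
                return f"{left}<-[{var}:{rel}]-{right}"
            return f"{left}-[{var}:{rel}]->{right}"
        return m.group(0)  # unknown triple: nothing safe to correct

    return PATTERN.sub(repl, cypher)


wrong = "MATCH (m:Movie {title: 'Casino'})-[:ACTED_IN]->(p:Person) RETURN p.name"
print(fix_directions(wrong))
# MATCH (m:Movie {title: 'Casino'})<-[:ACTED_IN]-(p:Person) RETURN p.name
```

A statement whose arrows already agree with the schema passes through unchanged.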