# GraphRAG in Memgraph

In this tutorial, we will build GraphRAG using the Memgraph ecosystem and
OpenAI. This example is based on a portion of a fixed Game of Thrones dataset,
which will be enriched with unstructured data to create a knowledge graph. 

To search for relevant information, in this example we will use vector search on
node embeddings to find semantically relevant data. Following this, the
structured data will be extracted from the graph and passed to the LLM to answer
the question.

## Prerequisites

To begin with this tutorial, you will need Docker, Python and an OpenAI API key.
With a few small tweaks, you can adapt this setup to run on your local Ollama
environment. 

First, we need to start Memgraph with the vector search capabilities. You can do
this by running the following command: 

TODO: Update the command once vector search is available in the official
Memgraph Docker image
```bash
docker run -p 7687:7687 -p 7444:7444 memgraph/memgraph-mage:exp-vector-1 --log-level=TRACE --also-log-to-stderr --telemetry-enabled=False --experimental-vector-indexes='tag__Entity__embedding__{"dimension":384,"limit":3000}'
```

You can run this command outside of this notebook. 

Once Memgraph is running in the background, make sure to load the initial Game
of Thrones dataset:  

In [None]:
```bash
cat ./data/memgraph-export-got.cypherl | docker run -i memgraph/mgconsole --host=localhost
``` 

After the dataset is ingested, install a few Python packages needed to run the demo:  

In [None]:

%pip install neo4j                   # for driver and connection to Memgraph
%pip install sentence-transformers   # for calculating sentence embeddings
%pip install openai                  # for access to LLM
%pip install python-dotenv           # for environment variables


## Enrich knowledge graph with the embeddings 

In GraphRAG, you do not write Cypher queries yourself. Instead, you ask
questions about your domain knowledge graph in plain English. To retrieve the
relevant parts of the knowledge graph, you need to encode semantic meaning into
the graph so that you can locate the parts that are semantically similar to the
question.

There are several approaches to consider: embedding the node labels and
properties, embedding the triplets related to a node or embedding specific paths
a node can take. Adding more data into embeddings requires a vector with more
dimensions, which can be costly in terms of memory and performance. 

However, with this approach, you can locate semantically similar parts of the
graph with greater accuracy. This means that for longer questions, semantic
search is more likely to find the right part of the graph. 
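To get a feel for the memory side of this trade-off, here is a rough back-of-the-envelope sketch. It assumes each vector component is stored as a 4-byte float, which may not match how Memgraph stores embeddings internally. The helper name `embedding_memory_bytes` is hypothetical, and the numbers 3000 and 384 are taken from the index configuration in the Docker command above.

```python
def embedding_memory_bytes(node_count, dimension, bytes_per_value=4):
    """Raw storage for node_count vectors, assuming 4-byte values (an assumption)."""
    return node_count * dimension * bytes_per_value


# 3000 nodes with 384-dimensional embeddings
print(embedding_memory_bytes(3000, 384))  # 4608000

# Doubling the dimension doubles the raw storage
print(embedding_memory_bytes(3000, 768))  # 9216000
```

This only counts the raw vectors; any index structure built on top of them adds its own overhead.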

If the semantic search misses relevant parts of the graph, the LLM will not be
able to answer the question correctly. 

To illustrate a basic example, here is a function that calculates embeddings
based on the node labels and properties: 


In [None]:
def compute_embeddings(driver, model):
    with driver.session() as session:

        # Retrieve all nodes
        result = session.run("MATCH (n) RETURN n")

        for record in result:
            node = record["n"]
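            # Combine node labels and properties into a single string.
            # Skip any existing 'embedding' property so that re-running the
            # function does not embed the previous vector as text.
            node_data = (
                " ".join(node.labels)
                + " "
                + " ".join(f"{k}: {v}" for k, v in node.items() if k != "embedding")
            )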

            # Compute the embedding for the node
            node_embedding = model.encode(node_data)

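            # Store the embedding back into the node, passing values as query
            # parameters (assumes Memgraph element ids are numeric strings)
            session.run(
                "MATCH (n) WHERE id(n) = $id SET n.embedding = $embedding",
                id=int(node.element_id),
                embedding=node_embedding.tolist(),
            )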

        # Set the label to Entity for all nodes
        session.run("MATCH (n) SET n:Entity")

If we have a node `:Character {name:"Viserys Targaryen"}` in the graph, the
encoded embedding will include the label `:Character` and the property
`name:Viserys Targaryen`.
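As a small illustration, the string that gets embedded for such a node can be reproduced without a database. The helper `node_to_text` below is hypothetical; it mirrors the string building inside `compute_embeddings`, taking plain lists and dictionaries instead of driver objects.

```python
def node_to_text(labels, properties):
    """Build the string that gets embedded for one node (labels + properties)."""
    label_part = " ".join(labels)
    property_part = " ".join(f"{key}: {value}" for key, value in properties.items())
    return f"{label_part} {property_part}".strip()


print(node_to_text(["Character"], {"name": "Viserys Targaryen"}))
# prints: Character name: Viserys Targaryen
```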

Asking the question `Who is Viserys Targaryen?` will yield a very similar
embedding, allowing you to locate that node in the graph. However, if you ask a
longer question like, `To whom was Viserys Targaryen loyal in season 1 of Game
of Thrones?`, there is a chance that this question might not locate the `Viserys
Targaryen` node in the graph due to its length and complexity. 

Embedding the triplets connected to the node is likely to yield a better result in this case. 
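One possible way to prepare triplet text for embedding is sketched below. The helper `triplets_to_text` and the sample triplets are made up for illustration and are not taken from the dataset. The idea is to turn each `(source, relationship, target)` triplet into a short phrase so that the embedded text looks more like a natural-language question.

```python
def triplets_to_text(triplets):
    """Turn (source, relationship_type, target) triplets into one embeddable string."""
    sentences = []
    for source, relationship_type, target in triplets:
        # LOYAL_TO -> "loyal to", so the text reads closer to a natural question
        relation = relationship_type.replace("_", " ").lower()
        sentences.append(f"{source} {relation} {target}")
    return ". ".join(sentences)


# Illustrative triplets, not taken from the dataset
sample_triplets = [
    ("Viserys Targaryen", "LOYAL_TO", "House Targaryen"),
    ("Khal Drogo", "KILLED", "Viserys Targaryen"),
]
print(triplets_to_text(sample_triplets))
# prints: Viserys Targaryen loyal to House Targaryen. Khal Drogo killed Viserys Targaryen
```

The resulting string could then be passed to `model.encode` in place of the label and property string.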

## Finding the relevant part of the graph

TODO: configure and set vector search index based on new release

Once embeddings are calculated in your graph, you can perform a search based on
these embeddings by using a vector search. 

Memgraph supports vector search starting from version 2.22.  

The goal is to find the most similar node that resembles your question and to
extract the relevant knowledge from it. The function takes the question's
embedding and compares it to the embeddings stored on the nodes.

In [None]:
def find_most_similar_node(driver, question_embedding):

    with driver.session() as session:
        # Perform the vector search on all nodes based on the question embedding
        result = session.run(
            f"CALL vector_search.search('tag', 10, {question_embedding.tolist()}) YIELD * RETURN *;"
        )
        nodes_data = []
        
        # Retrieve all similar nodes and print them
        for record in result:
            node = record["node"]
            properties = {k: v for k, v in node.items() if k != "embedding"}
            node_data = {
                "distance": record["distance"],
                "id": node.element_id,
                "labels": list(node.labels),
                "properties": properties,
            }
            nodes_data.append(node_data)
        print("All similar nodes:")
        for node in nodes_data:
            print(node)

        # Return the most similar node
        return nodes_data[0] if nodes_data else None

Based on the similarity between the question embeddings and node embeddings, we
get the most similar node. This node serves as a pivot point from which we can
pull relevant data. For example, if we are searching for information about
`Viserys Targaryen`, we would pull data surrounding that node, making it our
pivot node. 
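To make the idea of "most similar" concrete, here is a toy sketch that ranks nodes against a question in plain Python. It uses cosine similarity purely as an illustration; the metric actually used by the vector index depends on how the index is configured. The three-dimensional embeddings and the helper names are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pick_pivot(question_embedding, node_embeddings):
    """Return the name of the node whose embedding is closest to the question."""
    return max(
        node_embeddings,
        key=lambda name: cosine_similarity(question_embedding, node_embeddings[name]),
    )


# Toy 3-dimensional embeddings (made up for illustration)
toy_nodes = {
    "Viserys Targaryen": [0.9, 0.1, 0.0],
    "Khal Drogo": [0.1, 0.9, 0.0],
    "Winterfell": [0.0, 0.1, 0.9],
}
toy_question = [0.8, 0.2, 0.1]
print(pick_pivot(toy_question, toy_nodes))
# prints: Viserys Targaryen
```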

## Getting the relevant data

Once we have the pivot node, we can begin retrieving the relevant structured
data around it. The most straightforward approach is to perform multiple hops
from the pivot node. 

Here is the function that fetches the data around the pivot node, up to a
specified number of `hops` away from it.  


In [None]:
def get_relevant_data(driver, node, hops):
    with driver.session() as session:
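        # Retrieve the paths from the node to other nodes that are at most 'hops' away.
        # The hop bound is interpolated into the query text; the node id is passed
        # as a parameter (assumes Memgraph element ids are numeric strings).
        query = f"MATCH path=((n)-[r*..{hops}]-(m)) WHERE id(n) = $id RETURN path"
        result = session.run(query, id=int(node["id"]))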

        paths = []
        for record in result:
            path_data = []
            for segment in record["path"]:

                # Process start node without 'embedding' property
                start_node_data = {
                    k: v for k, v in segment.start_node.items() if k != "embedding"
                }

                # Process relationship data (type plus all relationship properties)
                relationship_data = {
                    "type": segment.type,
                    "properties": dict(segment.items()),
                }

                # Process end node without 'embedding' property
                end_node_data = {
                    k: v for k, v in segment.end_node.items() if k != "embedding"
                }

                # Add to path_data as a tuple (start_node, relationship, end_node)
                path_data.append((start_node_data, relationship_data, end_node_data))

            paths.append(path_data)

        # Return all paths
        return paths
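To see what the `hops` parameter controls, the following toy sketch walks a small in-memory graph with a breadth-first search. It returns nodes rather than paths, so it is only an approximation of what the Cypher query above retrieves. The adjacency list and the helper `nodes_within_hops` are made up for illustration.

```python
from collections import deque


def nodes_within_hops(adjacency, start, hops):
    """Map every node reachable from start in at most `hops` steps to its distance."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distances[current] == hops:
            continue  # do not expand beyond the hop limit
        for neighbour in adjacency.get(current, ()):
            if neighbour not in distances:
                distances[neighbour] = distances[current] + 1
                queue.append(neighbour)
    distances.pop(start)  # the pivot itself is not part of the result
    return distances


# Toy undirected graph (illustrative, not the actual dataset)
toy_graph = {
    "Viserys Targaryen": ["Daenerys Targaryen", "House Targaryen"],
    "Daenerys Targaryen": ["Viserys Targaryen", "Khal Drogo"],
    "House Targaryen": ["Viserys Targaryen"],
    "Khal Drogo": ["Daenerys Targaryen", "Dothraki"],
    "Dothraki": ["Khal Drogo"],
}
print(nodes_within_hops(toy_graph, "Viserys Targaryen", 1))
print(nodes_within_hops(toy_graph, "Viserys Targaryen", 2))
```

With one hop only the direct neighbours are returned; with two hops `Khal Drogo` is added, while `Dothraki` (three hops away) stays out. Larger hop counts pull in more context but also more data for the LLM to process.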

TODO: Insert a picture showing this. 

To avoid overloading the LLM's limited context with non-relevant data, we drop
the embedding property from the nodes. Embeddings contain a lot of data that
isn't particularly relevant to the LLM. 
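The same concern applies to the paths themselves: overlapping paths repeat the same relationships many times. A possible way to shrink the context further is sketched below. The helper `paths_to_text` is hypothetical; it assumes the list-of-tuples shape returned by `get_relevant_data` and assumes that nodes carry a `name` property.

```python
def paths_to_text(paths):
    """Render (start, relationship, end) tuples as unique, compact text lines."""
    lines = []
    seen = set()
    for path in paths:
        for start, relationship, end in path:
            # Assumes nodes carry a "name" property; fall back to the raw dict otherwise
            start_name = start.get("name", start)
            end_name = end.get("name", end)
            line = f"({start_name})-[:{relationship['type']}]->({end_name})"
            if line not in seen:
                seen.add(line)
                lines.append(line)
    return "\n".join(lines)


# Illustrative data in the shape produced by get_relevant_data
sample_paths = [
    [({"name": "Khal Drogo"}, {"type": "KILLED", "properties": {}}, {"name": "Viserys Targaryen"})],
    [
        ({"name": "Khal Drogo"}, {"type": "KILLED", "properties": {}}, {"name": "Viserys Targaryen"}),
        ({"name": "Khal Drogo"}, {"type": "MARRIED_TO", "properties": {}}, {"name": "Daenerys Targaryen"}),
    ],
]
print(paths_to_text(sample_paths))
```

The duplicated `KILLED` relationship appears only once in the output, so the prompt carries the same facts with fewer tokens.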

## Helper functions 

For the LLM to understand its task, we need specific prompts. The `RAG_prompt`
describes how the LLM should answer the question, while the `question_prompt` is
optimized for calculating question embeddings by extracting only the key pieces
of information to improve embedding accuracy. For example, if you ask, `Who is
Viserys Targaryen?`, only `Viserys Targaryen` will be extracted from the
question. Ultimately, the LLM will receive the full question back in the
`RAG_prompt`.

In [None]:
def RAG_prompt(question, relevance_expansion_data):
    prompt = f"""
    You are an AI language model. I will provide you with a question and a set of data obtained through a relevance expansion process in a graph database. The relevance expansion process finds nodes connected to a target node within a specified number of hops and includes the relationships between these nodes.

    Question: {question}

    Relevance Expansion Data:
    {relevance_expansion_data}

    Please answer the question based only on the provided data. Add context describing which data you based your answer on.
    """
    return prompt


def question_prompt(question):
    prompt = f"""
    You are an AI language model. I will provide you with a question. 
    Extract the key information from the question. The key information is important information that is required to answer the question.

    Question: {question}

    The output format should be like this: 
    Key Information: [key information 1], [key information 2], ...
    """
    return prompt


async def get_response(client, prompt):
    response = await client.chat.completions.create(
        model="gpt-3.5-turbo", messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
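The `Key Information:` line requested by `question_prompt` still has to be pulled out of the model's reply, and the model may not always follow the requested format. A defensive parser could look like the sketch below. The helper `parse_key_information` is hypothetical; it falls back to a caller-supplied string (for example the original question) when the marker is missing.

```python
import re


def parse_key_information(response, fallback):
    """Extract the text after 'Key Information:'; return fallback if it is missing."""
    match = re.search(r"key information:\s*(.*)", response, flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        return fallback
    extracted = match.group(1).strip()
    return extracted or fallback


question = "In which episode was Viserys Targaryen killed?"
print(parse_key_information("Key Information: Viserys Targaryen, killed, episode", question))
print(parse_key_information("Sorry, I cannot extract anything.", question))
```

The first call prints the extracted key information, the second prints the original question because the marker is absent.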


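## Getting a Graph RAG answer

In [None]:
import asyncio
import os

import neo4j
from dotenv import load_dotenv
from openai import AsyncOpenAI
from sentence_transformers import SentenceTransformer


def main():

    # Create a Neo4j driver for Memgraph (modify the URI if Memgraph runs elsewhere)
    driver = neo4j.GraphDatabase.driver("bolt://localhost:7687", auth=("", ""))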

    # Load .env file
    load_dotenv()
    os.environ["TOKENIZERS_PARALLELISM"] = "false"

    # Load the SentenceTransformer model
    model = SentenceTransformer("paraphrase-MiniLM-L6-v2")

    # Compute embeddings for all nodes in the graph
    compute_embeddings(driver, model)
    client = AsyncOpenAI()

    # Ask a question  (feel free to change the question) 
    question = "In which episode was Viserys Targaryen killed?"

    # Key information from the question 
    prompt = question_prompt(question)
    response = asyncio.run(get_response(client, prompt))
    print(response)
    key_information = response.split("Key Information: ")[1].strip()

    # Compute the embedding for the key information
    question_embedding = model.encode(key_information)

    # Find the most similar node to the question embedding
    node = find_most_similar_node(driver, question_embedding)
    if node is None:
        # Without a pivot node there is nothing to expand from
        print("No similar node found.")
        driver.close()
        return
    print("The most similar node is:")
    print(node)

    # Get the relevant data based on the most similar node
    relevant_data = get_relevant_data(driver, node, hops=2)

    # Show the relevant data
    print("The relevant data is:")
    print(relevant_data)

    # LLM answers the question based on the relevant data
    prompt = RAG_prompt(question, relevant_data)
    response = asyncio.run(get_response(client, prompt))
    print("The response is:")
    print(response)

    driver.close()


if __name__ == "__main__":
    main()


## Expanding the knowledge

Let's say that we now want to expand our existing knowledge graph with
additional information to enrich the dataset, provide more context and retrieve
more relevant data. In this example, we will take unstructured data, such as the
character description summary provided below, extract entities from that
summary, generate triplets to build the knowledge graph, create Cypher queries
and finally execute those queries in Memgraph to merge the new data into the
existing graph. 

In [9]:
# Sample text summary for processing
summary = (
    "Viserys Targaryen is the last living son of the former king, Aerys II Targaryen (the 'Mad King'). As one of the last known Targaryen heirs, Viserys Targaryen is obsessed with reclaiming the Iron Throne and restoring his family’s rule over Westeros. Ambitious and arrogant, he often treats his younger sister, Daenerys Targaryen, as a pawn, seeing her only as a means to gain power. His ruthless ambition leads him to make a marriage alliance with Khal Drogo, a powerful Dothraki warlord, hoping Khal Drogo will give him the army he needs. However, Viserys Targaryen’s impatience and disrespect toward the Dothraki culture lead to his downfall; he is ultimately killed by Khal Drogo in a brutal display of 'a crown for a king' – having molten gold poured over his head. Khal Drogo is a prominent warlord and leader of the Dothraki people, known for his fearsome reputation and formidable combat skills. He enters into a marriage with Princess Daenerys Targaryen as part of an alliance orchestrated by her brother, Prince Viserys Targaryen. Though initially aloof and intimidating to Daenerys Targaryen, Khal Drogo grows to care deeply for her, and their relationship evolves into one of mutual respect and love. Khal Drogo becomes devoted to Daenerys Targaryen and her dreams, including her goal of reclaiming the Iron Throne for House Targaryen. However, Khal Drogo’s fate takes a tragic turn after he sustains a serious wound in a skirmish, which becomes infected. A healer named Mirri Maz Duur performs a ritual that leaves Khal Drogo in a vegetative state, robbing him of his strength and dignity. Heartbroken, Daenerys Targaryen ends Khal Drogo’s life mercifully, signaling both the end of her first love and a significant turning point in her journey. King Joffrey Baratheon is the eldest son of Queen Cersei Lannister and, officially, King Robert Baratheon, though he is actually the result of an incestuous relationship between Queen Cersei and her twin brother, Ser Jaime Lannister. "
    "Joffrey Baratheon's personality is marked by sadism, cruelty, and impulsive behavior, traits that make him a despised ruler and widely loathed by the people of Westeros. Following King Robert Baratheon’s death, Joffrey Baratheon takes the throne and quickly reveals himself as a tyrannical ruler, prone to rash decisions and heedless of the advice given by those around him. His cruelty is particularly directed at Lady Sansa Stark, his former fiancée, whom he torments and humiliates on multiple occasions. As king, he alienates his allies and fosters unrest with his reckless brutality. Joffrey Baratheon’s reign ends abruptly when he is poisoned at his wedding feast to Margaery Tyrell, an event known as the 'Purple Wedding' which brings relief to many who suffered under his rule. His death marks a turning point in the power struggles of King’s Landing."
)

## Entity extraction

TODO: add links 

The first step in the process is to extract entities from the summary using
spaCy's pretrained English pipeline `en_core_web_md`. spaCy is an advanced NLP
(natural language processing) library in Python, designed for tasks such as
entity recognition, part-of-speech tagging, and dependency parsing. It’s widely
used for its speed and accuracy in processing text.

To begin, we need to install spaCy and the specific model we will be using.

In [None]:
%pip install spacy
%pip install spacy_llm
!python -m spacy download en_core_web_md

Next, set up your OpenAI API key.

In [10]:
import os
from wasabi import msg

os.environ["OPENAI_API_KEY"] = "<YOUR_OPENAI_KEY>"

# Check for OpenAI API key
if not os.getenv("OPENAI_API_KEY"):
    msg.fail("OPENAI_API_KEY environment variable not set. Please set it to proceed.", exits=1)

The goal of extracting entities from the text is to preprocess the data before
sending it to the GPT model, aiming for more accurate and relevant results. By
using spaCy, we can identify key entities such as characters and locations,
which gives a better understanding of the context of the text.

This is particularly useful because spaCy is specifically trained to recognize
linguistic patterns and relationships in text, which helps to isolate and
highlight the most important pieces of information. By preprocessing the text
this way, we ensure that the GPT model receives a more structured input, which
helps reduce noise and irrelevant data and leads to more precise and
context-aware outputs. 

In [None]:

import json
from collections import Counter
from pathlib import Path

import spacy
from spacy_llm.util import assemble

# Load the spaCy model
nlp = spacy.load("en_core_web_md")

# Split document into sentences
def split_document_sent(text):
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents]

def process_text(text, verbose=False):
    doc = nlp(text)
    if verbose:
        msg.text(f"Text: {doc.text}")
        msg.text(f"Entities: {[(ent.text, ent.label_) for ent in doc.ents]}")
    return doc

# Pipeline to run entity extraction
def extract_entities(text, verbose=False):
    processed_data = []
    entity_counts = Counter()

    sentences = split_document_sent(text)
    for sent in sentences:
        doc = process_text(sent, verbose)
        entities = [(ent.text, ent.label_) for ent in doc.ents]

        # Store processed data for each sentence
        processed_data.append({'text': doc.text, 'entities': entities})

        # Update counters
        entity_counts.update([ent[1] for ent in entities])

    # Export to JSON
    with open('processed_data.json', 'w') as f:
        json.dump(processed_data, f)

    msg.text(f"Entity counts: {entity_counts}")

# Run the pipeline on the summary text
verbose = True
extract_entities(summary, verbose)
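The file written by `extract_entities` is what the next step reads, so it helps to know its shape. The sketch below uses a hand-written sample instead of running spaCy; the sentences and entity labels are illustrative and may differ from what the model actually returns. It also shows that a JSON round trip turns each `(text, label)` tuple into a two-element list.

```python
import json
from collections import Counter

# Hand-written sample in the same shape that extract_entities() writes to disk
sample_processed_data = [
    {
        "text": "Khal Drogo is a leader of the Dothraki people.",
        "entities": [("Khal Drogo", "PERSON"), ("Dothraki", "NORP")],
    },
    {
        "text": "Joffrey Baratheon takes the throne.",
        "entities": [("Joffrey Baratheon", "PERSON")],
    },
]

# A JSON round trip turns each (text, label) tuple into a [text, label] list
loaded = json.loads(json.dumps(sample_processed_data))

# Count how often each entity label occurs across all sentences
label_counts = Counter(label for entry in loaded for _, label in entry["entities"])
print(loaded[0]["entities"][0])
print(label_counts)
```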


## Extract node and relationship parameters

Now that we have extracted entities from the text, we have a better
understanding of the data and a more structured context to send to the GPT
model. The next step is to pass the extracted JSON data to the GPT prompt, along
with clear instructions on how to extract nodes and relationships from those
entities. These instructions will guide the model in identifying key
connections between the entities, which can then be used to build a knowledge
graph. In this example, we will be using a gpt-4 model. 

In [None]:
import json
import openai
from pathlib import Path

# Load processed data from JSON
json_path = Path("processed_data.json")
with open(json_path, "r") as f:
    processed_data = json.load(f)

# Prepare nodes and relationships
nodes = []
relationships = []

# Formulate a prompt for GPT-4
prompt = (
    "Extract entities and relationships from the following JSON data. For each entry in data['entities'], "
    "create a 'node' dictionary with fields 'id' (unique identifier), 'name' (entity text), and 'type' (entity label). "
    "For entities that have meaningful connections, define 'relationships' as dictionaries with 'source' (source node id), "
    "'target' (target node id), and 'relationship' (type of connection). Create max 30 nodes, format relationships in the format of capital letters and _ in between words and format the entire response in the JSON output containing only variables nodes and relationships without any text in between. Use following labels for nodes: Character, Title, Location, House, Death, Event, Allegiance and following relationship types: HAPPENED_IN, SIBLING_OF, PARENT_OF, MARRIED_TO, HEALED_BY, RULES, KILLED, LOYAL_TO, BETRAYED_BY. Make sure the entire JSON file fits in the output.\n"
    "JSON data:\n"
    f"{json.dumps(processed_data)}"
)

# Call GPT-4 to analyze the JSON and extract structured nodes and relationships.
# Uses the client interface of openai>=1.0; the key is read from OPENAI_API_KEY.
client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "system", "content": "You are a helpful assistant that structures data into nodes and relationships."},
              {"role": "user", "content": prompt}],
    max_tokens=2000
)

# Parse GPT-4 response and add to nodes and relationships lists
output = response.choices[0].message.content
print(output)
structured_data = json.loads(output)  # Assuming GPT-4 outputs structured JSON

# Populate nodes and relationships lists
nodes.extend(structured_data.get("nodes", []))
relationships.extend(structured_data.get("relationships", []))
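Calling `json.loads` directly on the model output assumes the reply is pure JSON. In practice a model may wrap its answer in a Markdown code fence or reference node ids that it never defined. A more forgiving parser could look like the sketch below; `parse_graph_json` and the sample output are hypothetical and only illustrate the idea.

```python
import json
import re

FENCE = "`" * 3  # Markdown code fence, built this way to keep the example readable


def parse_graph_json(output):
    """Parse model output into (nodes, relationships), tolerating Markdown fences."""
    text = output.strip()
    fenced = re.match(rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$", text, flags=re.DOTALL)
    if fenced:
        text = fenced.group(1)
    data = json.loads(text)
    parsed_nodes = data.get("nodes", [])
    known_ids = {node["id"] for node in parsed_nodes}
    # Keep only relationships whose endpoints both exist among the returned nodes
    parsed_relationships = [
        rel
        for rel in data.get("relationships", [])
        if rel.get("source") in known_ids and rel.get("target") in known_ids
    ]
    return parsed_nodes, parsed_relationships


# Made-up model output: fenced JSON with one dangling relationship (target 99)
sample_output = (
    FENCE + "json\n"
    '{"nodes": [{"id": 1, "name": "Khal Drogo", "type": "Character"},'
    ' {"id": 2, "name": "Viserys Targaryen", "type": "Character"}],'
    ' "relationships": [{"source": 1, "target": 2, "relationship": "KILLED"},'
    ' {"source": 1, "target": 99, "relationship": "RULES"}]}\n'
    + FENCE
)
parsed_nodes, parsed_relationships = parse_graph_json(sample_output)
print(len(parsed_nodes), parsed_relationships)
```

The dangling `RULES` relationship is dropped because node `99` does not exist, which avoids generating a query that silently matches nothing.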


## Generate queries

Now that GPT has provided us with the structured data for the nodes and
relationships, the next step is to generate the Cypher queries that we will
execute in Memgraph.

In [None]:
def generate_cypher_queries(nodes, relationships):
    queries = []

    # Create nodes (MERGE avoids duplicating entities that already exist)
    for node in nodes:
        # Escape backslashes and single quotes so that names containing an
        # apostrophe do not break the query
        name = str(node['name']).replace("\\", "\\\\").replace("'", "\\'")
        query = f"""
        MERGE (n:{node['type']}:Entity {{name: '{name}'}})
        SET n.id = {node['id']}
        """
        queries.append(query)

    # Create relationships (MERGE keeps re-runs from adding duplicates)
    for rel in relationships:
        query = f"MATCH (a {{id: {rel['source']}}}), (b {{id: {rel['target']}}}) " \
                f"MERGE (a)-[:{rel['relationship']}]->(b)"
        queries.append(query)

    return queries

cypher_queries = generate_cypher_queries(nodes, relationships)
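Because the node names and ids come from an LLM, building queries by string interpolation is fragile. An alternative is to return each query together with a parameter dictionary, as sketched below. The function `generate_parameterized_queries` is a hypothetical variant, not part of the original notebook. Labels and relationship types cannot be passed as parameters, so the sketch checks labels against the list used in the prompt and restricts relationship types to capital letters and underscores.

```python
import re

# Labels requested in the prompt; anything else is skipped
ALLOWED_LABELS = {"Character", "Title", "Location", "House", "Death", "Event", "Allegiance"}
RELATIONSHIP_TYPE = re.compile(r"^[A-Z][A-Z_]*$")


def generate_parameterized_queries(nodes, relationships):
    """Return (query, parameters) pairs; values travel as parameters, not query text."""
    queries = []
    for node in nodes:
        if node["type"] not in ALLOWED_LABELS:
            continue  # labels are interpolated, so only known ones are accepted
        queries.append((
            f"MERGE (n:{node['type']}:Entity {{name: $name}}) SET n.id = $id",
            {"name": node["name"], "id": node["id"]},
        ))
    for rel in relationships:
        if not RELATIONSHIP_TYPE.match(rel["relationship"]):
            continue  # relationship types are interpolated too, so restrict their shape
        queries.append((
            f"MATCH (a {{id: $source}}), (b {{id: $target}}) MERGE (a)-[:{rel['relationship']}]->(b)",
            {"source": rel["source"], "target": rel["target"]},
        ))
    return queries


sample_queries = generate_parameterized_queries(
    [{"id": 1, "name": "King's Landing", "type": "Location"}],
    [{"source": 1, "target": 1, "relationship": "HAPPENED_IN"}],
)
for query, parameters in sample_queries:
    print(query, parameters)
```

Each pair could then be executed with `session.run(query, parameters)`, which leaves the quoting of values such as `King's Landing` to the driver.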

## Execute queries

The final step is to execute those queries in Memgraph, enriching your graph
with the newly created context. 

In [None]:
from neo4j import GraphDatabase

# Initialize the Neo4j driver for Memgraph (modify the URI if necessary)
uri = "bolt://localhost:7687"
user = ""
password = ""
driver = GraphDatabase.driver(uri, auth=(user, password))

# Function to execute Cypher queries in Memgraph
def execute_cypher_queries(queries):
    with driver.session() as session:
        for query in queries:
            try:
                session.run(query)
                msg.good(f"Executed query: {query}")
            except Exception as e:
                msg.fail(f"Error executing query: {query}. Error: {e}")

# Execute the generated Cypher queries
execute_cypher_queries(cypher_queries)

