# GraphRAG in Memgraph
In this example, we build a GraphRAG pipeline using the Memgraph ecosystem and OpenAI.

## Prerequisites

First, start a Memgraph instance that supports vector search by running the following command:

TODO: Update the command once vector search is available in the official Memgraph Docker image.
```bash
docker run -p 7687:7687 -p 7444:7444 memgraph/memgraph-mage:exp-vector-1 --log-level=TRACE --also-log-to-stderr --telemetry-enabled=False --experimental-vector-indexes='tag__Entity__embedding__{"dimension":384,"limit":3000}'
```

You can run this command outside of this notebook.
After that, make sure the following packages are installed on your system. You can install them using pip:

In [None]:

pip install neo4j                   # for driver and connection to Memgraph
pip install sentence-transformers   # for sentence embeddings
pip install openai                  # for access to LLM
pip install python-dotenv           # for environment variables (provides `from dotenv import load_dotenv`)


After installing the prerequisites, load the base dataset into Memgraph with the following command:

In [None]:
```bash
cat ./data/memgraph-export-final.cypherl | docker run -i memgraph/mgconsole --host=localhost
``` 

## Enrich knowledge graph with the embeddings 

In GraphRAG you do not write queries by hand. Instead, you ask questions about your domain knowledge graph in plain English, so the system needs a way to locate the relevant parts of the graph.

To do this, you encode the semantic meaning of the graph's contents as embeddings, so that you can retrieve nodes based on your question.

Here is the function that calculates embeddings based on the node labels and properties:


In [None]:
def compute_embeddings(driver, model):
    with driver.session() as session:

        # Retrieve all nodes
        result = session.run("MATCH (n) RETURN n")

        for record in result:
            node = record["n"]
            # Combine node labels and properties into a single string
            node_data = (
                " ".join(node.labels)
                + " "
                + " ".join(f"{k}: {v}" for k, v in node.items())
            )

            # Compute the embedding for the node
            node_embedding = model.encode(node_data)

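            # Store the embedding back into the node. Values are passed as query
            # parameters; element_id is assumed to be the numeric id as a string,
            # which the original string-formatted query also relied on.
            session.run(
                "MATCH (n) WHERE id(n) = $id SET n.embedding = $embedding",
                {"id": int(node.element_id), "embedding": node_embedding.tolist()},
            )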

        # Set the label to Entity for all nodes
        session.run("MATCH (n) SET n:Entity")
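
As a small illustration of what gets embedded, the sketch below builds the same kind of "labels plus `key: value` pairs" string for a single node, without a database connection. The helper name `node_to_text`, the sample node, and the choice to skip an existing `embedding` property and to sort labels are assumptions added for this example; the function above does neither.

```python
def node_to_text(labels, properties, skip=("embedding",)):
    """Join labels and 'key: value' pairs into the string that would be embedded."""
    label_part = " ".join(sorted(labels))
    prop_part = " ".join(
        f"{key}: {value}" for key, value in properties.items() if key not in skip
    )
    return f"{label_part} {prop_part}".strip()


# Hypothetical node, not taken from the dataset
text = node_to_text(
    {"Character"}, {"name": "Viserys Targaryen", "embedding": [0.1, 0.2]}
)
print(text)
```

Skipping `embedding` matters mostly when the function is run a second time: otherwise the previously stored vector would be folded into the text that is embedded again.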

## Finding the relevant part of the graph

Once the embeddings are stored in your graph, you can search over them. For that you need vector search.

Memgraph supports vector search from version 2.22.  

The goal is to find the most similar node that resembles your question and extract the relevant knowledge from there. 

In [None]:
def find_most_similar_node(driver, question_embedding):

    with driver.session() as session:
        # Perform the vector search on all nodes based on the question embedding
        result = session.run(
            "CALL vector_search.search('tag', 10, $embedding) YIELD * RETURN *;",
            {"embedding": question_embedding.tolist()},
        )
        nodes_data = []
        
        # Retrieve all similar nodes and print them
        for record in result:
            node = record["node"]
            properties = {k: v for k, v in node.items() if k != "embedding"}
            node_data = {
                "distance": record["distance"],
                "id": node.element_id,
                "labels": list(node.labels),
                "properties": properties,
            }
            nodes_data.append(node_data)
        print("All similar nodes:")
        for node in nodes_data:
            print(node)

        # Return the most similar node
        return nodes_data[0] if nodes_data else None
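
To make the idea of "most similar node" concrete, here is a brute-force sketch that ranks a few toy vectors against a question vector. It assumes cosine similarity as the measure; the metric Memgraph's vector index actually applies depends on how the index is configured, and a real index avoids comparing against every node. The names `cosine_similarity`, `rank_by_similarity`, and `toy_nodes` are made up for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(question_vec, node_vecs, top_k=10):
    """Return the top_k (name, similarity) pairs, most similar first."""
    scored = [
        (name, cosine_similarity(question_vec, vec))
        for name, vec in node_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy two-dimensional "embeddings"; real ones have hundreds of dimensions
toy_nodes = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
ranking = rank_by_similarity([1.0, 0.1], toy_nodes, top_k=2)
print([name for name, _ in ranking])
```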

## Getting the relevant data

Once you have the pivot node that is connected to the knowledge you need, you can fetch the relevant data around that node: 


In [None]:
def get_relevant_data(driver, node, hops):
    with driver.session() as session:
        # Retrieve the paths from the node to other nodes that are 'hops' away
        query = (
            f"MATCH path=((n)-[r*..{hops}]-(m)) WHERE id(n) = {node['id']} RETURN path"
        )
        result = session.run(query)

        paths = []
        for record in result:
            path_data = []
            for segment in record["path"]:

                # Process start node without 'embedding' property
                start_node_data = {
                    k: v for k, v in segment.start_node.items() if k != "embedding"
                }

                # Process relationship data
                relationship_data = {
                    "type": segment.type,
                    # All properties stored on the relationship itself
                    "properties": dict(segment.items()),
                }

                # Process end node without 'embedding' property
                end_node_data = {
                    k: v for k, v in segment.end_node.items() if k != "embedding"
                }

                # Add to path_data as a tuple (start_node, relationship, end_node)
                path_data.append((start_node_data, relationship_data, end_node_data))

            paths.append(path_data)

        # Return all paths
        return paths

To avoid overloading the LLM with irrelevant data, we drop the `embedding` property from the nodes.
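
The variable-length pattern `[r*..{hops}]` in `get_relevant_data` collects everything within `hops` relationships of the pivot node. As a rough illustration of that expansion, the sketch below runs a breadth-first search with a hop limit over a small in-memory adjacency list. The graph, the node names, and the function `neighbourhood` are hypothetical, and unlike the Cypher query it returns reachable nodes with their distance rather than full paths.

```python
from collections import deque


def neighbourhood(adjacency, start, hops):
    """Map every node reachable from `start` in at most `hops` steps to its distance."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distances[current] == hops:
            continue  # do not expand beyond the hop limit
        for neighbour in adjacency.get(current, ()):
            if neighbour not in distances:
                distances[neighbour] = distances[current] + 1
                queue.append(neighbour)
    return distances


# Hypothetical undirected graph: pivot - a - c - d, and pivot - b
toy_graph = {
    "pivot": ["a", "b"],
    "a": ["pivot", "c"],
    "b": ["pivot"],
    "c": ["a", "d"],
    "d": ["c"],
}
print(neighbourhood(toy_graph, "pivot", hops=2))
```

With `hops=2`, node `d` is left out because it is three steps from the pivot. Raising `hops` widens the context handed to the LLM, at the cost of a larger prompt.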

## Helper functions 

We also need a few prompts that tell the LLM what it is expected to generate, and a helper that sends a prompt to the model:

In [None]:
def RAG_prompt(question, relevance_expansion_data):
    prompt = f"""
    You are an AI language model. I will provide you with a question and a set of data obtained through a relevance expansion process in a graph database. The relevance expansion process finds nodes connected to a target node within a specified number of hops and includes the relationships between these nodes.

    Question: {question}

    Relevance Expansion Data:
    {relevance_expansion_data}

    Based on the provided data, please answer the question, make sure to base your answers only based on the provided data. Add a context on what data did you base your answer on.
    """
    return prompt


def question_prompt(question):
    prompt = f"""
    You are an AI language model. I will provide you with a question. 
    Extract the key information from the questions. The key information is important information that is required to answer the question.

    Question: {question}

    The output format should be like this: 
    Key Information: [key information 1], [key information 2], ...
    """
    return prompt


async def get_response(client, prompt):
    response = await client.chat.completions.create(
        model="gpt-3.5-turbo", messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content


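## Getting a Graph RAG answer

In [None]:
import asyncio
import os

import neo4j
from dotenv import load_dotenv
from openai import AsyncOpenAI
from sentence_transformers import SentenceTransformer


def main():

    # Create a Neo4j driver (adjust the URI if Memgraph is not running locally)
    driver = neo4j.GraphDatabase.driver("bolt://localhost:7687", auth=("", ""))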

    # Load .env file
    load_dotenv()
    os.environ["TOKENIZERS_PARALLELISM"] = "false"

    # Load the SentenceTransformer model
    model = SentenceTransformer("paraphrase-MiniLM-L6-v2")

    # Compute embeddings for all nodes in the graph
    compute_embeddings(driver, model)
    client = AsyncOpenAI()

    # Ask a question  (feel free to change the question) 
    question = "In which episode was Viserys Targaryen killed?"

    # Key information from the question 
    prompt = question_prompt(question)
    response = asyncio.run(get_response(client, prompt))
    print(response)
    key_information = response.split("Key Information: ")[1].strip()

    # Compute the embedding for the key information
    question_embedding = model.encode(key_information)

    # Find the most similar node to the question embedding
    node = find_most_similar_node(driver, question_embedding)
    if node is None:
        # Without a pivot node there is nothing to expand
        print("No similar node was found for the question.")
        driver.close()
        return
    print("The most similar node is:")
    print(node)

    # Get the relevant data based on the most similar node
    relevant_data = get_relevant_data(driver, node, hops=2)

    # Show the relevant data
    print("The relevant data is:")
    print(relevant_data)

    # LLM answers the question based on the relevant data
    prompt = RAG_prompt(question, relevant_data)
    response = asyncio.run(get_response(client, prompt))
    print("The response is:")
    print(response)

    driver.close()


if __name__ == "__main__":
    main()
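
The `response.split("Key Information: ")[1]` step in `main` raises an `IndexError` whenever the model leaves out the expected marker. A more forgiving parser could look like the sketch below; `parse_key_information` is a hypothetical helper, and falling back to the whole reply is one possible design choice rather than the only reasonable one.

```python
def parse_key_information(response, marker="Key Information:"):
    """Return the text after `marker`, or the whole reply if the marker is missing."""
    _, found, tail = response.partition(marker)
    return tail.strip() if found else response.strip()


# Hypothetical replies, with and without the expected marker
print(parse_key_information("Key Information: Viserys Targaryen, episode, killed"))
print(parse_key_information("Viserys Targaryen, episode"))
```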


# KG in Memgraph
In this example, we summarize *House of the Dragon*, identify key entities
using spaCy and GPT-4, then generate and execute Cypher queries in Memgraph to
expand the current dataset into a knowledge graph around the story's themes
and characters.

## Entity extraction
The first step in the process is to extract entities from the summary using
spaCy. spaCy is an NLP (natural language processing) library for Python,
designed for tasks like entity recognition, part-of-speech tagging, and
dependency parsing. It is widely used for its speed and accuracy in processing
text. The cells below use its `en_core_web_md` English pipeline.

To start, we need to install spaCy and the specific model we’ll be using.

In [None]:
!pip install spacy
!pip install spacy_llm
!python -m spacy download en_core_web_md

Here’s the summary of *House of the Dragon* that we'll use to create the
knowledge graph.

In [1]:
# Sample text summary for processing
summary = """House of the Dragon centers on the history and tragic power struggle within House Targaryen, set nearly two centuries before the events of Game of Thrones. This prequel explores the Dance of the Dragons, a devastating civil war sparked by competing claims to the Iron Throne. The story unfolds through the lives of the Targaryens and their allies, revealing a family torn apart by ambition, loyalty, and betrayal. At the heart of the conflict is Prince Daemon Targaryen, a fierce and unpredictable character. Known for his skill in battle and his prowess as a dragon-rider, Daemon is the younger brother to King Viserys I Targaryen. Daemon’s fierce loyalty to his family’s legacy is matched only by his own ambitions, which lead him into direct and indirect conflicts with his kin. Over time, Daemon becomes an ally—and later the husband—of Princess Rhaenyra Targaryen, strengthening her claim to the throne while adding to the deepening divisions within the family. Princess Rhaenyra, daughter of King Viserys I and his first wife, is named as heir to the Iron Throne by her father, a decision that stirs resentment and opposition. As the first woman designated to inherit the throne, Rhaenyra faces resistance from those who believe a male heir should rule, especially after her father remarries and has sons with his second wife, Queen Alicent Hightower. Rhaenyra’s story becomes one of perseverance as she fights to secure her birthright in a society resistant to female rulers. Her marriage to Daemon only heightens the tension, setting the stage for an inevitable clash with Alicent and her faction. King Viserys I is portrayed as a well-intentioned but indecisive ruler, whose choice to name Rhaenyra as his heir ignites a feud within his own family. Though he cares deeply for both Rhaenyra and the children he fathers with Alicent, his inability to manage the building tension between his firstborn daughter and his second family leaves a power vacuum that eventually leads to war.
Viserys’s reign is marked by his efforts to keep the peace, but his lack of political decisiveness inadvertently sets the stage for the violent conflict that erupts after his death. Queen Alicent Hightower, Viserys’s second wife, becomes one of Rhaenyra’s main adversaries. Hailing from the influential House Hightower, Alicent is a shrewd political operator who believes strongly that her eldest son, Aegon II, should inherit the throne. Her stance is rooted in her belief in traditional succession, and she quickly gathers a faction of supporters who view Rhaenyra’s claim as illegitimate. With the support of her father, Otto Hightower, who serves as Hand of the King, Alicent effectively leads her faction, mobilizing noble houses and resources to ensure that Aegon II’s claim is seen as valid. The rivalry between Alicent and Rhaenyra becomes the centerpiece of the Targaryen family conflict. Aegon II Targaryen, the firstborn son of Viserys and Alicent, becomes the figurehead of the opposition to Rhaenyra’s claim. Despite some initial reluctance, Aegon is encouraged by his mother and grandfather to challenge his half-sister’s right to rule, and he eventually takes up the mantle of leadership for his faction. His rivalry with Rhaenyra sparks the bloody Targaryen civil war that becomes known as the Dance of the Dragons. House Hightower, led by Otto Hightower, is a powerful and ambitious family whose influence grows significantly due to Alicent’s marriage to the king. Otto is a calculating and politically astute figure who manipulates events behind the scenes to advance his grandson’s claim to the throne. The Hightowers provide essential support to Aegon II’s faction, rallying resources and allies to his cause and heightening the scale of the Targaryen conflict. House Velaryon is another pivotal family in the civil war, led by the powerful Lord Corlys Velaryon, often called the Sea Snake.
A wealthy and respected naval commander, Corlys has substantial influence due to his vast fleet and riches. His marriage to Rhaenys Targaryen, known as the Queen Who Never Was after being passed over for the throne in favor of Viserys, allies the Velaryons with Rhaenyra. Rhaenys, despite having been denied her own claim to the Iron Throne, is a loyal supporter of Rhaenyra, and House Velaryon’s resources prove critical to Rhaenyra’s faction as the war unfolds. The Dance of the Dragons is defined by a series of brutal battles and shifting allegiances as the two Targaryen factions vie for dominance. Among the most pivotal conflicts is the Battle of Rook’s Rest, where Rhaenyra’s supporters face off against Aegon II’s forces. The siege of Dragonstone, Rhaenyra’s stronghold and base of operations, also becomes a key moment in the war, as control of Dragonstone becomes vital for both factions. King’s Landing, the capital and seat of the Iron Throne, changes hands multiple times, highlighting the instability brought about by the war. Harrenhal, the vast and foreboding castle, also sees fierce fighting and is captured by various factions throughout the civil war, bearing witness to some of the most intense skirmishes in Westeros. Dragons are instrumental in the Targaryen civil war, as dragon-riders on both sides use them as weapons of terror and destruction. Rhaenyra rides Syrax, while Daemon rides Caraxes, both using their dragons to enforce their claims and crush opposition. Vhagar, one of the largest and oldest dragons, eventually fights on the side of Aegon II’s faction, tipping the balance in critical battles. Other dragons, like Vermax, Arrax, and Sunfyre, play notable roles, representing the sheer destructive power of House Targaryen. However, the Dance of the Dragons is costly, with many dragons killed in battle, marking the beginning of a sharp decline in the number of dragons in Westeros.
Their deaths symbolize the tragic toll of the Targaryens’ internal strife and foreshadow the weakening of the family’s power. The aftermath of the Dance of the Dragons leaves House Targaryen decimated. Both Rhaenyra and Aegon II perish, along with many other prominent Targaryen family members and key allies on both sides. The civil war drains the Targaryens of their strength, reducing their influence and resources and leading to a decline in the dragon population. This bloody conflict shapes Westeros’s future, creating lasting suspicions about Targaryen rule and bringing an end to the era of dragon supremacy. The Dance of the Dragons is remembered as a cautionary tale of ambition and family rivalry, with the Targaryens’ lust for power nearly resulting in the annihilation of their own house."""

Next, set up your OpenAI API key.

In [2]:
import os
from wasabi import msg

os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>"

# Check for OpenAI API key
if not os.getenv("OPENAI_API_KEY"):
    msg.fail("OPENAI_API_KEY environment variable not set. Please set it to proceed.", exits=1)

Extract entities from the text using spaCy.

In [3]:

import json
from collections import Counter
from pathlib import Path

import spacy
from spacy_llm.util import assemble

# load the spaCy model
nlp = spacy.load("en_core_web_md")

# split document into sentences
def split_document_sent(text):
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents]

# define custom relationship extraction and text processing
def process_text(text, verbose=False):
    doc = nlp(text)
    if verbose:
        msg.text(f"Text: {doc.text}")
        msg.text(f"Entities: {[(ent.text, ent.label_) for ent in doc.ents]}")
        # Relations extraction logic can be added here
    return doc

# Pipeline to run entity extraction
def extract_entities(text, verbose=False):
    processed_data = []
    entity_counts = Counter()

    sentences = split_document_sent(text)
    for sent in sentences:
        doc = process_text(sent, verbose)
        entities = [(ent.text, ent.label_) for ent in doc.ents]

        # Store processed data for each sentence
        processed_data.append({'text': doc.text, 'entities': entities})

        # Update counters
        entity_counts.update([ent[1] for ent in entities])

    # Export to JSON
    with open('processed_data.json', 'w') as f:
        json.dump(processed_data, f)

    # Display summary
    msg.text(f"Entity counts: {entity_counts}")

# Run the pipeline on the summary text
verbose = True
extract_entities(summary, verbose)




Text: House of the Dragon centers on the history and tragic power struggle
within House Targaryen, set nearly two centuries before the events of Game of
Thrones.
Entities: [('House', 'ORG'), ('House Targaryen', 'ORG'), ('nearly two
centuries', 'DATE')]
Text: This prequel explores the Dance of the Dragons, a devastating civil war
sparked by competing claims to the Iron Throne.
Entities: [('Dragons', 'PERSON'), ('the Iron Throne', 'ORG')]
Text: The story unfolds through the lives of the Targaryens and their allies,
revealing a family torn apart by ambition, loyalty, and betrayal.
Entities: [('Targaryens', 'NORP')]
Text: At the heart of the conflict is Prince Daemon Targaryen, a fierce and
unpredictable character.
Entities: [('Prince Daemon Targaryen', 'PERSON')]
Text: Known for his skill in battle and his prowess as a dragon-rider, Daemon is
the younger brother to King Viserys I
Entities: []
Text: Targaryen.
Entities: [('Targaryen', 'PERSON')]
Text: Daemon’s fierce loyalty to his family’
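
Since the output above shows some questionable labels (for example `Dragons` tagged as `PERSON`), it can help to review the extracted entities grouped by label before passing them on. The sketch below does that for data shaped like `processed_data.json`; the helper `group_entities_by_label` and the two sample entries are assumptions made for illustration, not the notebook's actual output.

```python
import json
from collections import defaultdict


def group_entities_by_label(processed_data):
    """Collect the distinct entity texts seen for each entity label."""
    grouped = defaultdict(set)
    for entry in processed_data:
        for text, label in entry["entities"]:
            grouped[label].add(text)
    return {label: sorted(texts) for label, texts in grouped.items()}


# Two hypothetical entries; the JSON round trip turns the (text, label)
# tuples into two-element lists, as happens when the file is reloaded.
sample = json.loads(json.dumps([
    {"text": "sentence one", "entities": [("House Targaryen", "ORG")]},
    {"text": "sentence two", "entities": [
        ("Prince Daemon Targaryen", "PERSON"),
        ("House Targaryen", "ORG"),
    ]},
]))
grouped = group_entities_by_label(sample)
print(grouped)
```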

## Create node and relationship parameters

In [None]:
!pip install openai==0.28 neo4j

In [5]:
import json
import openai
from pathlib import Path

# Load processed data from JSON
json_path = Path("processed_data.json")
with open(json_path, "r") as f:
    processed_data = json.load(f)

# Prepare nodes and relationships
nodes = []
relationships = []

# Formulate a prompt for GPT-4
prompt = (
    "Extract entities and relationships from the following JSON data. For each entry in data['entities'], "
    "create a 'node' dictionary with fields 'id' (unique identifier), 'name' (entity text), and 'type' (entity label). "
    "For entities that have meaningful connections, define 'relationships' as dictionaries with 'source' (source node id), "
    "'target' (target node id), and 'relationship' (type of connection). Create max 30 nodes, format relationships in the format of capital letters and _ in between words and format the entire response in the JSON output containing only variables nodes and relationships without any text in between. Use following labels for nodes: Character, Location, Death, Event, Allegiance and following relationship types: HAPPENED_IN, VICTIM_TO, KILLER_OF, LOYAL_TO and feel free to expand with whatever labels/types you think are needed. Make sure the entire JSON file fits in the output.\n"
    # The trailing newline above keeps the instructions separate from the data
    "JSON data:\n"
    f"{json.dumps(processed_data)}"
)

# Call GPT-4 to analyze the JSON and extract structured nodes and relationships
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "system", "content": "You are a helpful assistant that structures data into nodes and relationships."},
              {"role": "user", "content": prompt}],
    max_tokens=2000
)

# Parse GPT-4 response and add to nodes and relationships lists
output = response['choices'][0]['message']['content']
print(output)
structured_data = json.loads(output)  # Assuming GPT-4 outputs structured JSON

# Populate nodes and relationships lists
nodes.extend(structured_data.get("nodes", []))
relationships.extend(structured_data.get("relationships", []))

# Example of final output
#print("Nodes:", nodes)
#print("Relationships:", relationships)

{
"nodes": [
    {"id": 1, "name": "House of the Dragon", "type": "EVENT"},
    {"id": 2, "name": "House Targaryen", "type": "Allegiance"},
    {"id": 3, "name": "Game of Thrones", "type": "Event"},
    {"id": 4, "name": "Dance of the Dragons", "type": "Event"},
    {"id": 5, "name": "Iron Throne", "type": "Location"},
    {"id": 6, "name": "Prince Daemon Targaryen", "type": "Character"},
    {"id": 7, "name": "King Viserys I", "type": "Character"},
    {"id": 8, "name": "Princess Rhaenyra Targaryen", "type": "Character"},
    {"id": 9, "name": "Queen Alicent Hightower", "type": "Character"},
    {"id": 10, "name": "Aegon II", "type": "Character"},
    {"id": 11, "name": "Otto Hightower", "type": "Character"},
    {"id": 12, "name": "Rook’s Rest", "type": "Location"},
    {"id": 13, "name": "Dragonstone", "type": "Location"},
    {"id": 14, "name": "King’s Landing", "type": "Location"},
    {"id": 15, "name": "Harrenhal", "type": "Location"},
    {"id": 16, "name": "Westeros", "type": 
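
The `json.loads(output)` call assumes the reply is nothing but a JSON object. Replies are sometimes wrapped in extra text or a code fence, or cut off by the `max_tokens` limit, so a defensive parser can be useful. The sketch below is one way to handle that; `parse_llm_json` and the sample reply are hypothetical and not part of the notebook.

```python
import json


def parse_llm_json(reply):
    """Parse the span from the first '{' to the last '}'; raise ValueError on failure."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    try:
        return json.loads(reply[start : end + 1])
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc


fence = "`" * 3  # built at runtime to avoid a literal code fence in this example
reply = (
    "Here is the graph:\n"
    + fence
    + 'json\n{"nodes": [{"id": 1}], "relationships": []}\n'
    + fence
)
data = parse_llm_json(reply)
print(data)
```

A truncated reply fails the parse and surfaces as a `ValueError`, which is a clearer signal to retry with a higher `max_tokens` or fewer nodes than a bare decoding error.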

## Generate queries

In [6]:
def generate_cypher_queries(nodes, relationships):
    queries = []

    # Create nodes
    for node in nodes:
        query = f"CREATE (n:{node['type']} {{id: '{node['id']}', name: '{node['name']}'}})"
        queries.append(query)

    # Create relationships
    for rel in relationships:
        query = f"MATCH (a {{id: '{rel['source']}'}}), (b {{id: '{rel['target']}'}}) " \
                f"CREATE (a)-[:{rel['relationship']}]->(b)"
        queries.append(query)

    return queries

cypher_queries = generate_cypher_queries(nodes, relationships)
print(cypher_queries)

["CREATE (n:EVENT {id: '1', name: 'House of the Dragon'})", "CREATE (n:Allegiance {id: '2', name: 'House Targaryen'})", "CREATE (n:Event {id: '3', name: 'Game of Thrones'})", "CREATE (n:Event {id: '4', name: 'Dance of the Dragons'})", "CREATE (n:Location {id: '5', name: 'Iron Throne'})", "CREATE (n:Character {id: '6', name: 'Prince Daemon Targaryen'})", "CREATE (n:Character {id: '7', name: 'King Viserys I'})", "CREATE (n:Character {id: '8', name: 'Princess Rhaenyra Targaryen'})", "CREATE (n:Character {id: '9', name: 'Queen Alicent Hightower'})", "CREATE (n:Character {id: '10', name: 'Aegon II'})", "CREATE (n:Character {id: '11', name: 'Otto Hightower'})", "CREATE (n:Location {id: '12', name: 'Rook’s Rest'})", "CREATE (n:Location {id: '13', name: 'Dragonstone'})", "CREATE (n:Location {id: '14', name: 'King’s Landing'})", "CREATE (n:Location {id: '15', name: 'Harrenhal'})", "CREATE (n:Location {id: '16', name: 'Westeros'})", "CREATE (n:Character {id: '17', name: 'Syrax'})", "CREATE (n:Ch
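
The queries above are built by pasting names straight into the Cypher text, which breaks as soon as a name contains a straight apostrophe (for example `Rook's Rest`). A possible alternative, sketched below, keeps values in query parameters and only validates the parts that Cypher does not accept as parameters, namely labels and relationship types. The function names and the validation rule are assumptions made for this example.

```python
import re

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def safe_identifier(name):
    """Labels and relationship types cannot be parameters, so validate them instead."""
    if not IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe label or relationship type: {name!r}")
    return name


def generate_parameterized_queries(nodes, relationships):
    """Return (query, parameters) pairs instead of fully formatted query strings."""
    queries = []
    for node in nodes:
        label = safe_identifier(node["type"])
        queries.append((
            f"CREATE (n:{label} {{id: $id, name: $name}})",
            {"id": str(node["id"]), "name": node["name"]},
        ))
    for rel in relationships:
        rel_type = safe_identifier(rel["relationship"])
        queries.append((
            f"MATCH (a {{id: $source}}), (b {{id: $target}}) "
            f"CREATE (a)-[:{rel_type}]->(b)",
            {"source": str(rel["source"]), "target": str(rel["target"])},
        ))
    return queries


# Hypothetical input with a straight apostrophe in the name
queries = generate_parameterized_queries(
    [{"id": 1, "name": "Rook's Rest", "type": "Location"}],
    [{"source": 1, "target": 2, "relationship": "HAPPENED_IN"}],
)
print(queries[0])
```

Each pair can then be executed with `session.run(query, params)`. Ids are converted to strings so that they match the string ids produced by the original queries.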

## Execute queries

In [7]:
from neo4j import GraphDatabase

# Initialize the Neo4j driver for Memgraph (modify the URI if necessary)
uri = "bolt://localhost:7687"
user = ""
password = ""
driver = GraphDatabase.driver(uri, auth=(user, password))

# Function to execute Cypher queries in Memgraph
def execute_cypher_queries(queries):
    with driver.session() as session:
        for query in queries:
            try:
                session.run(query)
                msg.good(f"Executed query: {query}")
            except Exception as e:
                msg.fail(f"Error executing query: {query}. Error: {e}")

# Execute the generated Cypher queries
execute_cypher_queries(cypher_queries)



✔ Executed query: CREATE (n:EVENT {id: '1', name: 'House of the Dragon'})
✔ Executed query: CREATE (n:Allegiance {id: '2', name: 'House Targaryen'})
✔ Executed query: CREATE (n:Event {id: '3', name: 'Game of Thrones'})
✔ Executed query: CREATE (n:Event {id: '4', name: 'Dance of the Dragons'})
✔ Executed query: CREATE (n:Location {id: '5', name: 'Iron Throne'})
✔ Executed query: CREATE (n:Character {id: '6', name: 'Prince Daemon Targaryen'})
✔ Executed query: CREATE (n:Character {id: '7', name: 'King Viserys I'})
✔ Executed query: CREATE (n:Character {id: '8', name: 'Princess Rhaenyra Targaryen'})
✔ Executed query: CREATE (n:Character {id: '9', name: 'Queen Alicent Hightower'})
✔ Executed query: CREATE (n:Character {id: '10', name: 'Aegon II'})
✔ Executed query: CREATE (n:Character {id: '11', name: 'Otto Hightower'})
✔ Executed query: CR