# Prerequisites

- Install Python
- Install Docker and Docker Compose
- Create a new Python environment with VS Code, or create one manually:
    ```
    python -m venv .venv
    ```
- Install the required packages listed in the "Required libraries" cell.
- Copy the OpenAI API key from https://justpaste.it/infobip-graphrag-workshop and paste it into the `OpenAI(...)` constructor in the "Initialize OpenAI and GraphDatabase" cell.

# GraphRAG in Memgraph

In this tutorial, we will build a Q&A application that uses **GraphRAG**, powered by Memgraph and
an OpenAI LLM.

We will use **vector search** on node embeddings to find
semantically relevant data. Once the relevant node is located, structured data
is extracted from the graph around it and passed to the LLM to ground its answer to our question.
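
To make the idea of vector search concrete, here is a minimal pure-Python sketch that ranks a few toy vectors by cosine similarity. The three-dimensional vectors and the names `Speaker B` and `Speaker C` are made up for illustration; real embeddings have many more dimensions, and the metric Memgraph actually applies depends on how the vector index is configured.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (made-up values, for illustration only)
toy_embeddings = {
    "Clay Shirky": [0.9, 0.1, 0.0],
    "Speaker B": [0.1, 0.8, 0.3],
    "Speaker C": [0.2, 0.2, 0.9],
}

def most_similar(query_vector, embeddings):
    """Return the name whose vector is closest to the query by cosine similarity."""
    return max(
        embeddings,
        key=lambda name: cosine_similarity(query_vector, embeddings[name]),
    )

print(most_similar([0.8, 0.2, 0.1], toy_embeddings))  # prints "Clay Shirky"
```

A vector index does the same kind of ranking, but over all stored node embeddings and without comparing the query against every vector one by one.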

### Required libraries
Installs the required Python libraries: Neo4j driver (to connect with Memgraph) and OpenAI SDK (to interact with the LLM).

In [None]:
%pip install neo4j==5.28.2                   # for driver and connection to Memgraph
%pip install openai==1.107.3                 # for access to LLM

### Initialize OpenAI and GraphDatabase
Initializes connections to OpenAI (for LLM calls) and Memgraph (for graph storage/queries), setting up the core services used in the pipeline.

In [None]:
from openai import OpenAI
from neo4j import GraphDatabase

client = OpenAI(api_key="Replace with the key from: https://justpaste.it/infobip-graphrag-workshop")
driver = GraphDatabase.driver("bolt://localhost:7687")
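
As an alternative to pasting the key into the notebook, the key can be read from an environment variable. The sketch below assumes the key was exported as `OPENAI_API_KEY` before Jupyter was started; `load_api_key` is a hypothetical helper, not part of the OpenAI SDK.

```python
import os

def load_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, or raise a clear error if it is missing."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {name} environment variable before running the notebook."
        )
    return key

# Usage (assumes the variable is set in the shell that started Jupyter):
# client = OpenAI(api_key=load_api_key())
```

This keeps the key out of the saved notebook, which matters if the notebook is shared or committed to a repository.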

### Create a vector index
Creates a vector index on the `speaker_embedding` property of `Speaker` nodes in Memgraph to enable fast similarity search. The function first checks the existing indexes, so re-running the cell does not create a duplicate.

In [None]:
def setup_speaker_index():
    with driver.session() as session:
        result = session.run("SHOW INDEX INFO;")
        existing_indexes = [
            (record["label"], record["property"]) for record in result
        ]

        if ("Speaker", "speaker_embedding") not in existing_indexes:
            session.run(
                """CREATE VECTOR INDEX speaker_embedding ON :Speaker(speaker_embedding) WITH CONFIG {"dimension": 1536, "capacity": 1000};"""
            )
            print("Index 'speaker_embedding' created.")
        else:
            print("Index 'speaker_embedding' already exists.")

setup_speaker_index()
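
The same statement can be generated for other labels and properties. The sketch below builds the query text only; `build_vector_index_query` is a hypothetical helper, and the default `dimension` of 1536 matches the default output size of `text-embedding-3-small`.

```python
def build_vector_index_query(index_name, label, prop, dimension=1536, capacity=1000):
    """Build a CREATE VECTOR INDEX statement mirroring the one used above."""
    # Index names, labels and properties are formatted into the query text,
    # so reject anything that is not a plain identifier.
    for identifier in (index_name, label, prop):
        if not identifier.isidentifier():
            raise ValueError(f"Unsafe identifier: {identifier!r}")
    return (
        f"CREATE VECTOR INDEX {index_name} ON :{label}({prop}) "
        f'WITH CONFIG {{"dimension": {int(dimension)}, "capacity": {int(capacity)}}};'
    )

print(build_vector_index_query("speaker_embedding", "Speaker", "speaker_embedding"))
```

The `dimension` value has to match the length of the vectors stored in the property, so it should change if a different embedding model is used.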

### Generate embedding vector
Generates an embedding vector for the given text using OpenAI's `text-embedding-3-small` model and returns it.

In [None]:
def get_embedding(text: str):
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    embedding = response.data[0].embedding
    return embedding
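
The embeddings endpoint also accepts a list of inputs, which avoids one request per text. The following is a sketch only: `get_embeddings_batch` is a hypothetical name, and the client is passed in explicitly so the function can be exercised with a stub.

```python
def get_embeddings_batch(client, texts, model="text-embedding-3-small"):
    """Embed several texts in one request; returns vectors in the order of `texts`."""
    response = client.embeddings.create(input=list(texts), model=model)
    # Each returned item carries an `index` pointing back to its input position.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]

# Usage (requires a configured OpenAI client):
# vectors = get_embeddings_batch(client, ["Clay Shirky", "TED2009"])
```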

### Embed existing data
Fetches all Speaker nodes, generates embeddings for their names via OpenAI, and stores those embeddings back into Memgraph for use in similarity-based queries.

In [None]:
def embed_speakers_data():
    with driver.session() as session:
        # Fetch all Speaker nodes
        result = session.run("MATCH (s:Speaker) RETURN id(s) AS id, s.name AS name")
        
        for record in result:
            speaker_id = record["id"]
            text = record["name"]

            embedding = get_embedding(text)

            # Store embedding in Memgraph
            session.run(
                """
                MATCH (s:Speaker) WHERE id(s) = $id
                SET s.speaker_embedding = $embedding
                """,
                id=speaker_id,
                embedding=embedding
            )

embed_speakers_data()
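
For larger graphs, writing one node per query becomes slow. A possible variation, sketched below, sends the embeddings in batches with `UNWIND`. The names `chunked` and `store_embeddings` and the batch size of 100 are assumptions for illustration.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def store_embeddings(session, rows, batch_size=100):
    """Write embeddings in batches; each row is {"id": <node id>, "embedding": [...]}."""
    query = """
    UNWIND $rows AS row
    MATCH (s:Speaker) WHERE id(s) = row.id
    SET s.speaker_embedding = row.embedding
    """
    for batch in chunked(rows, batch_size):
        session.run(query, rows=batch)

# Usage (requires a running Memgraph instance):
# with driver.session() as session:
#     store_embeddings(session, [{"id": 0, "embedding": get_embedding("Clay Shirky")}])
```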

### Vector search
Runs a vector similarity search in Memgraph against the `speaker_embedding` index using the given query vector, returning the single most similar node (i.e., the closest speaker).

In [None]:
def vector_search(query_vector):
    # Pass the vector as a query parameter instead of formatting it into the query text
    query = """
    CALL vector_search.search("speaker_embedding", 1, $query_vector) YIELD * RETURN *;
    """

    with driver.session() as session:
        result = session.run(query, query_vector=query_vector)
        memgraph_response = [record.data() for record in result]
        return memgraph_response
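
The nearest node is not always a good match, so it can help to check how close it is before using it. The sketch below assumes each returned record contains a `similarity` field next to `node`; the helper name `best_match` and the threshold of 0.5 are arbitrary choices for illustration.

```python
def best_match(results, min_similarity=0.5):
    """Return the highest-similarity record, or None if nothing clears the threshold."""
    candidates = [r for r in results if r.get("similarity", 0.0) >= min_similarity]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r["similarity"])

# Example with hand-written records shaped like the assumed search output
sample = [
    {"node": {"name": "Clay Shirky"}, "similarity": 0.91},
    {"node": {"name": "Speaker B"}, "similarity": 0.42},
]
print(best_match(sample)["node"]["name"])  # prints "Clay Shirky"
```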

### Getting the relevant data

Once we have the pivot node (the most similar node returned by the vector search), we can retrieve the relevant structured
data around it. The goal is to enrich the prompt given to the LLM with relevant data so it can provide a more accurate answer. The most straightforward way to expand this context is to **perform multiple hops** starting from the pivot node.
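
The effect of the hop count can be seen on a toy graph. The sketch below is a plain breadth-first expansion over an adjacency dictionary; the relationship types and node names other than `Clay Shirky` are hypothetical. It collects unique edges, whereas a Cypher path query returns whole paths.

```python
from collections import deque

def expand(graph, pivot, hops):
    """Collect (source, relationship, target) triples within `hops` outgoing steps."""
    triples = []
    visited = {pivot}
    queue = deque([(pivot, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == hops:
            continue  # do not follow edges beyond the hop limit
        for rel_type, target in graph.get(node, []):
            triples.append((node, rel_type, target))
            if target not in visited:
                visited.add(target)
                queue.append((target, depth + 1))
    return triples

# Hypothetical toy graph: node -> list of (relationship type, target node)
toy_graph = {
    "Clay Shirky": [("GAVE", "Talk A")],
    "Talk A": [("PART_OF", "Event X"), ("HAS_TAG", "Tag Y")],
}

print(expand(toy_graph, "Clay Shirky", 1))  # one triple: the talk
print(len(expand(toy_graph, "Clay Shirky", 2)))  # prints 3
```

More hops bring in more context, but also more tokens in the prompt and more data that may be unrelated to the question.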

The `get_relevant_data` function fetches the data around the pivot node, up to a specified number
of `hops` away from it.


In [None]:
def get_relevant_data(node, hops):
    paths = []

    name = node["node"]["name"]
    with driver.session() as session:
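        # `name` is passed as a query parameter so quotes in a name cannot break the query;
        # the hop bound is formatted into the query text, so it is coerced to an int.
        query = f"MATCH path=((n:Speaker {{name: $name}})-[r*..{int(hops)}]->(m)) RETURN path"
        result = session.run(query, name=name)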

        for record in result:
            path_data = []
            for segment in record["path"]:
                # Process start node without 'embedding'
                start_node_data = {
                    k: v
                    for k, v in segment.start_node.items()
                    if k not in ["speaker_embedding"]
                }

                # Process relationship data (type plus the relationship's own properties)
                relationship_data = {
                    "type": segment.type,
                    "properties": dict(segment.items()),
                }

                # Process end node without 'embedding'
                end_node_data = {
                    k: v
                    for k, v in segment.end_node.items()
                    if k not in ["speaker_embedding"]
                }

                # Add to path_data as a tuple (start_node, relationship, end_node)
                path_data.append(
                    (start_node_data, relationship_data, end_node_data)
                )

            paths.append(path_data)


    return paths
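
The nested list returned by `get_relevant_data` can be passed to the LLM as is, but a compact text form is easier to read and uses fewer tokens. The sketch below is one possible formatter; the fallback to a `title` property is an assumption about how non-speaker nodes are labelled.

```python
def format_paths(paths):
    """Render each (start, relationship, end) triple as one line of text."""
    lines = []
    for path in paths:
        for start, rel, end in path:
            # Prefer a human-readable property; fall back to the raw dictionary
            start_label = start.get("name") or start.get("title") or str(start)
            end_label = end.get("name") or end.get("title") or str(end)
            lines.append(f"({start_label}) -[{rel['type']}]-> ({end_label})")
    # Drop duplicate lines while keeping first-seen order
    return "\n".join(dict.fromkeys(lines))

# Example with a hand-written path shaped like the output of get_relevant_data
example_paths = [
    [({"name": "Clay Shirky"}, {"type": "GAVE", "properties": {}}, {"title": "Talk A"})]
]
print(format_paths(example_paths))  # prints "(Clay Shirky) -[GAVE]-> (Talk A)"
```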

### Define the **prompting logic** for the RAG pipeline

* `RAG_prompt` → Builds a prompt that combines the user’s question with graph-expanded data, instructing the model to answer only from that context.
* `question_prompt` → Builds a prompt to extract the **key entity** (e.g., person or talk) from a user question.
* `get_response` → Sends the prompt to OpenAI (`gpt-5`) and returns the model’s response text.


In [None]:
def RAG_prompt(question, relevance_expansion_data):
    prompt = f"""
    I will provide you with a question and a set of data obtained through a relevance expansion process in a graph database. 
    The relevance expansion process finds nodes connected to a target node within a specified number of hops and includes 
    the relationships between these nodes.

    Question: {question}

    Relevance Expansion Data:
    {relevance_expansion_data}

    Based on the provided data, please answer the question, make sure to base your answers only based on the provided data. 
    Add a context on what data did you base your answer on. If you can't find the right answer, 
    politely suggest to ask another question and explain that you don't have enough context.
    """
    return prompt


def question_prompt(question):
    prompt = f"""
    Your task is to extract the key subject entity from a question. 

    Question: {question}

    Output **only** the most relevant entity (the person or the talk) needed to answer the question — nothing else.

    Examples:
    Question: "Where did Clay Shirky talk?"
    Key entity: Clay Shirky
    """
    return prompt


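async def get_response(client, prompt):
    # Note: this uses the synchronous OpenAI client, so the call blocks until the
    # model replies; the function is declared async only so it can be awaited below.
    response = client.chat.completions.create(
        model="gpt-5", messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content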
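
Even with the instruction to output only the entity, a model may echo the `Key entity:` label from the example or wrap the name in quotes. A small normalization step, sketched below with the hypothetical helper `clean_entity`, keeps the text that gets embedded clean.

```python
def clean_entity(raw):
    """Normalize the LLM's entity answer to a bare name."""
    text = raw.strip()
    prefix = "key entity:"
    # Remove the label if the model repeated it from the prompt example
    if text.lower().startswith(prefix):
        text = text[len(prefix):]
    # Remove surrounding whitespace and quotes
    return text.strip().strip("\"'")

print(clean_entity('Key entity: "Clay Shirky"\n'))  # prints "Clay Shirky"
```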


### End-to-end RAG pipeline

The pipeline below runs the following steps:

* Extracts the key entity from the user’s question.
* Generates an embedding for that entity.
* Finds the most similar node in Memgraph via vector search.
* Expands the graph around that node (1 hop in the code below) to gather relevant context.
* Builds a RAG-style prompt with the question + context.
* Queries the LLM to produce the final grounded answer.

In [None]:
# Ask a question (feel free to change the question)
question = 'Which talks did Clay Shirky give?'

async def ask_question(question):

    # Key information from the question 
    prompt = question_prompt(question)
    response = await get_response(client, prompt)
    print(response)

    # Compute the embedding for the key information
    question_embedding = get_embedding(response)

    # Find the most similar node to the question embedding
    node = vector_search(question_embedding)
    if not node:
        # Stop early: without a pivot node there is nothing to expand
        print("\n\nNo similar node was found.")
        return
    print("\n\nThe most similar node is:")
    print(node[0])

    # Get the relevant data based on the most similar node
    relevant_data = get_relevant_data(node[0], hops=1)

    # Show the relevant data
    print("\n\nThe relevant data is:")
    print(relevant_data)

    # LLM answers the question based on the relevant data
    prompt = RAG_prompt(question, relevant_data)
    response = await get_response(client, prompt)
    print("\n\nThe response is:")
    print(response)

await ask_question(question)

## Task

Identify the speakers from a specific TED Talk event.

The question we want to answer is:
> Who spoke at the TED2009 event?

Use what you’ve just learned, and feel free to edit, copy, adapt or remove the code provided above to complete this task.

In [None]:
def setup_event_index():
    raise NotImplementedError("Replace this with your code")

setup_event_index()

In [None]:
def embed_event_data():
    raise NotImplementedError("Replace this with your code")

embed_event_data()

In [None]:
def vector_search_event(query_vector):
    raise NotImplementedError("Replace this with your code")

In [None]:
question = 'Who spoke at the TED2009 event?'

async def ask_question(question):
    raise NotImplementedError("Replace this with your code")

await ask_question(question)