# Notebook for embeddings generation

This notebook creates vector embeddings for the text content of the `Chunk` nodes in a Neo4j graph database.

The code in this notebook covers the "*automatic generation of embeddings*" (*automatische Generierung der Embeddings*) part of the data ingestion process visualized below:

![Data ingestion process](img/data_ingestion.png)


The code retrieves all `Chunk` nodes from the specified database and creates embeddings for their content. `Chunk` nodes that already have an associated `Embedding` node are skipped, so no redundant embeddings are generated. To re-embed the content of a node, its `Embedding` node must be deleted first.
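
Removing an existing `Embedding` node before re-embedding could look roughly like the sketch below. It assumes the `HAS_EMBEDDING` relationship and the `id` property used by the code in this notebook. The helper name `delete_embedding` is hypothetical, and `session` is assumed to be an open neo4j session (any object with a compatible `run(query, **params)` method works).

```python
# Hypothetical helper, not part of the original notebook.
DELETE_EMBEDDING_QUERY = """
MATCH (ch:Chunk {id: $chunk_id})-[:HAS_EMBEDDING]->(e:Embedding)
DETACH DELETE e
RETURN count(*) AS deleted
"""


def delete_embedding(session, chunk_id):
    """Delete the Embedding node(s) attached to one Chunk and return how many were removed."""
    record = session.run(DELETE_EMBEDDING_QUERY, chunk_id=chunk_id).single()
    return record["deleted"]
```

After the deletion, the `Chunk` no longer has a `HAS_EMBEDDING` relationship, so the next run of the embedding cell picks it up again.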

A Neo4j graph database containing the standard documents must exist before this notebook can be used. Follow the steps in `markdown_ingestion.ipynb` to create it.
Running this notebook requires a VoyageAI API key. A free key can be obtained from [VoyageAI](https://www.voyageai.com).
Use the terminal command below to append your API key to a `.env` file in the notebook's working directory, from which `load_dotenv()` loads it into the environment:

```shell
echo "VOYAGE_API_KEY=YOUR_KEY" >> .env
```
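
Before running the cell, it can help to confirm that the key is actually visible to Python. The following is a small sketch using only the standard library. The helper name `require_env` is hypothetical, and in the notebook it would be called after `load_dotenv()`.

```python
import os


def require_env(name, env=None):
    """Return the value of an environment variable or raise a readable error."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in the shell."
        )
    return value


# Intended usage in the notebook (after load_dotenv()):
# require_env("VOYAGE_API_KEY")
```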

Steps:
- Install the requirements, including `neo4j`, `voyageai`, and `python-dotenv`.
- Insert your Neo4j credentials into the cell below.
- Run the cell below.

In [12]:
import os
from neo4j import GraphDatabase
import uuid
import voyageai
from dotenv import load_dotenv

# Define credentials for neo4j
neo4j_uri = ""  # Change to your Neo4j URI
username = ""  # Change to your username
password = ""  # Change to your password


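load_dotenv()  # Load variables such as VOYAGE_API_KEY from a local .env file
vo = voyageai.Client()  # Picks up VOYAGE_API_KEY from the environment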

EMBEDDING_MODEL = "voyage-multilingual-2"  # Using Voyage AI model
BATCH_SIZE = 128  # Batch size for embedding creation

def batch_get_embeddings(texts, model):
    embeddings = []
    for i in range(0, len(texts), BATCH_SIZE):
        batch = texts[i:i + BATCH_SIZE]
        batch_embeddings = vo.embed(batch, model=model, input_type="document").embeddings
        embeddings.extend(batch_embeddings)
    return embeddings

def LoadEmbeddingBatch(label: str, text_property: str):
    """Embed `text_property` of every `label` node that has no Embedding node yet."""
    driver = GraphDatabase.driver(neo4j_uri, auth=(username, password))
    vo.api_key = os.environ["VOYAGE_API_KEY"]  # Set Voyage AI API key

    # `with driver` closes the driver and its connections when the block exits
    with driver, driver.session() as session:
        # Get nodes without embeddings. Labels and property names cannot be
        # passed as Cypher parameters, so only use trusted values here.
        result = session.run(
            f"MATCH (ch:{label}) WHERE NOT (ch)-[:HAS_EMBEDDING]->() "
            f"RETURN ch.id AS id, ch.{text_property} AS text"
        )

        # Collect all texts and IDs
        texts = []
        ids = []
        for record in result:
            texts.append(record["text"])
            ids.append(record["id"])

        # Generate embeddings in batches
        embeddings = batch_get_embeddings(texts, EMBEDDING_MODEL)

        # Restricting the MATCH to the label keeps embeddings from being
        # attached to unrelated nodes that happen to share an id
        cypher = f"""
        UNWIND $batch AS item
        MATCH (n:{label}) WHERE n.id = item.id
        CREATE (e:Embedding {{key: item.key, value: item.embedding, model: item.model, id: item.uuid}})
        CREATE (n)-[:HAS_EMBEDDING]->(e)
        """

        # Create Embedding nodes and relationships in batches
        count = 0
        for i in range(0, len(ids), BATCH_SIZE):
            batch_ids = ids[i:i + BATCH_SIZE]
            batch_embeddings = embeddings[i:i + BATCH_SIZE]

            # Prepare batch operation
            batch = [
                {
                    "id": node_id,
                    "key": text_property,
                    "embedding": embedding,
                    "model": EMBEDDING_MODEL,
                    "uuid": str(uuid.uuid4()),
                }
                for node_id, embedding in zip(batch_ids, batch_embeddings)
            ]

            # Execute batch operation
            session.run(cypher, batch=batch)
            count += len(batch)

        print(f"Processed {count} {label} nodes for property @{text_property}.")
        return count

# Example usage
count = LoadEmbeddingBatch("Chunk", "content")



Processed 323 Chunk nodes for property @content.
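
Both the embedding requests and the database writes in the cell above walk through their inputs in slices of `BATCH_SIZE`. That slicing pattern can be sketched as a small generator. `batched` is a hypothetical helper name here, not something the notebook defines.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 323 items (the node count in the output above) with a batch size of 128
sizes = [len(batch) for batch in batched(list(range(323)), 128)]
print(sizes)  # [128, 128, 67]
```

With the values from this run, the 323 chunks would be sent in three requests, the last one only partially filled.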


In [None]:
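# Cypher statement to create the vector index over the embeddings.
# This cell only holds the statement as a string; run it in the Neo4j Browser
# or pass it to session.run() to actually create the index.
# 1024 dimensions matches the output size of voyage-multilingual-2.
"""CREATE VECTOR INDEX `content-embeddings-vo`
FOR (n:Embedding) ON (n.value)
OPTIONS {indexConfig: {
 `vector.dimensions`: 1024,
 `vector.similarity_function`: 'cosine'
}};"""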
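
Once the index exists, it can be queried for the chunks closest to a query embedding. The sketch below assumes a Neo4j 5 release with vector index support (where the `db.index.vector.queryNodes` procedure is available) and reuses the index name, the `HAS_EMBEDDING` relationship, and the `id` property from this notebook. `find_similar_chunks` is a hypothetical helper, and `session` is assumed to be an open neo4j session.

```python
# Hypothetical helper, not part of the original notebook.
SIMILAR_CHUNKS_QUERY = """
CALL db.index.vector.queryNodes($index_name, $top_k, $query_vector)
YIELD node AS e, score
MATCH (ch:Chunk)-[:HAS_EMBEDDING]->(e)
RETURN ch.id AS id, score
ORDER BY score DESC
"""


def find_similar_chunks(session, query_vector, top_k=5,
                        index_name="content-embeddings-vo"):
    """Return (chunk id, similarity score) pairs for the nearest embeddings."""
    result = session.run(
        SIMILAR_CHUNKS_QUERY,
        index_name=index_name,
        top_k=top_k,
        query_vector=query_vector,
    )
    return [(record["id"], record["score"]) for record in result]
```

The `query_vector` should come from the same model as the stored embeddings, requested with `input_type="query"`, so that both sides live in the same 1024-dimensional space.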

In [20]:
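# Function to generate a query embedding for a given text
def get_embedding(client, text, model):
    response = client.embed(
        texts=[text],  # embed() takes a list of texts
        model=model,
        input_type="query",  # use "query" for search queries, "document" for stored chunks
    )
    return response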

In [21]:
result = get_embedding(vo, "Hello, world!", "voyage-multilingual-2")

In [22]:
result.embeddings[0]  # The embedding for the input text

[0.04963812232017517,
 -0.027705121785402298,
 0.00772517966106534,
 -0.027495836839079857,
 0.06063716486096382,
 0.006642687134444714,
 0.008455268107354641,
 -0.026922492310404778,
 -0.03650641813874245,
 0.05570219084620476,
 0.007110630162060261,
 -0.07014128565788269,
 0.06036631762981415,
 -0.02391771785914898,
 -0.06917399168014526,
 -0.024286169558763504,
 0.026069512590765953,
 -0.03380260616540909,
 0.023855064064264297,
 0.004099577199667692,
 0.03315451741218567,
 -0.03379029408097267,
 -0.011755287647247314,
 -0.0398315005004406,
 0.010348310694098473,
 0.04646715149283409,
 0.010648392140865326,
 -0.008860433474183083,
 0.02327490597963333,
 -0.056602660566568375,
 0.013117415830492973,
 0.018454693257808685,
 -0.00317822746001184,
 0.005923370365053415,
 0.0046236757189035416,
 -0.00470721535384655,
 0.0015942800091579556,
 0.021656004711985588,
 0.008596625179052353,
 -0.08125991374254227,
 -0.02825472317636013,
 0.04153306409716606,
 -0.019928062334656715,
 -0.0223113