# Embeddings

## Embedding Models

You would typically use an embedding model to generate an embedding for a piece of data. As you have learned, the data could be anything - text, images, music, video, or any other data type.

Embedding models are widely available for use with different data types. For example, you can use a text embedding model to generate embeddings for text data or an image embedding model to generate embeddings for image data.

You can train your own embedding model, but it is usually easier to use a pre-trained one. Pre-trained models have already been trained on large datasets and are available for many data types.

Here are some well-known embedding models and types:

- [Word2Vec](https://en.wikipedia.org/wiki/Word2vec) - A model for generating word embeddings, turning words into vectors based on their context.

- [FastText](https://fasttext.cc/) - An extension of Word2Vec, FastText treats each word as composed of character n-grams, allowing it to generate embeddings for out-of-vocabulary words.

- [Node2Vec](https://neo4j.com/docs/graph-data-science/current/machine-learning/node-embeddings/node2vec/) - An algorithm that computes embeddings based on random walks through a graph.

- GPT (Generative Pre-trained Transformer) - A series of transformer-based models (e.g. GPT-4) for generating text, which you can also use to generate embeddings.

- Universal Sentence Encoder - Designed to convert sentences into embeddings.

- Doc2Vec - An extension of the Word2Vec model to generate embeddings for entire documents or paragraphs, capturing the overall meaning.

- [ResNet](https://en.wikipedia.org/wiki/Residual_neural_network) (Residual Networks) - Primarily used in image processing, ResNet models can also be used to generate embeddings for images that capture visual features and patterns.

- VGGNet - VGGNet models are used in image processing to generate embeddings for images, capturing various levels of visual information.


## Creating Embeddings

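Each embedding model is different and captures different aspects of the data. As such, you cannot meaningfully compare embeddings created by different models: the vectors may differ in length, and even when the lengths match, the individual dimensions do not represent the same features.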

You need to use the same model to generate the embeddings for the data you want to compare.
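
As an illustration of what "comparing" embeddings can look like, here is a minimal sketch that uses cosine similarity, one common measure. The function name `cosine_similarity` and the three-dimensional vectors are invented for this example; real embeddings have many more dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        # Different lengths usually mean the vectors came from different models.
        raise ValueError("embeddings have different dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors, made up for illustration only
plot_a = [0.1, 0.3, 0.5]
plot_b = [0.2, 0.6, 1.0]   # points in the same direction as plot_a
plot_c = [0.5, -0.3, 0.0]  # points in a different direction

print(cosine_similarity(plot_a, plot_b))
print(cosine_similarity(plot_a, plot_c))
```

A higher score means the two vectors point in a more similar direction. The score is only meaningful when both vectors come from the same model.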

Many embedding models provide APIs that you can use to generate embeddings for your data.

In a previous lesson, you looked at embeddings for movie plots. [OpenAI’s text-embedding-ada-002](https://platform.openai.com/docs/guides/embeddings/embedding-models) model generated those embeddings.

The code to generate the embeddings loaded the text for each movie plot and sent it to the model to generate the embeddings.
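
The original generation code is not shown here, so the following is only a sketch of how such a loop might look. It assumes the `openai` Python package (v1 client) and an `OPENAI_API_KEY` environment variable; the helper names `openai_embed` and `embed_plots` are hypothetical.

```python
from typing import Callable, Dict, List


def openai_embed(text: str, model: str = "text-embedding-ada-002") -> List[float]:
    """Request a single embedding from the OpenAI API."""
    # Imported lazily so the rest of the sketch runs without the package installed.
    from openai import OpenAI

    client = OpenAI()  # reads the OPENAI_API_KEY environment variable
    response = client.embeddings.create(model=model, input=text)
    return response.data[0].embedding


def embed_plots(
    plots: Dict[str, str],
    embed: Callable[[str], List[float]],
) -> Dict[str, List[float]]:
    """Send each plot to the embedding function and collect the vectors by title."""
    return {title: embed(plot) for title, plot in plots.items()}
```

Calling `embed_plots(plots, openai_embed)` would send one request per plot. Passing a stub function instead of `openai_embed` lets you exercise the loop without network access.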

## Load Embeddings

The OpenAI `text-embedding-ada-002` model was used to create embeddings for the questions and answers in the dataset. Using these embeddings, you can find similar questions and answers.

The [Quora-QuAD-1000-embeddings.csv](https://data.neo4j.com/llm-vectors-unstructured/Quora-QuAD-1000-embeddings.csv) file contains the embeddings for the questions and answers in the dataset.

The file has the following structure:

```
question,answer,question_embedding,answer_embedding
"The question","The answer","[0.1, 0.2, 0.3, ...]","[0.4, 0.5, 0.6, ...]"
```
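
Each embedding column holds a JSON-style list stored as a string, so it needs to be parsed before use. The sketch below reads a row with that structure using Python's standard library; the in-memory sample and its short vectors are invented for illustration and stand in for the real file.

```python
import csv
import io
import json

# A tiny stand-in for the real file, with made-up 3-dimensional embeddings
sample_csv = io.StringIO(
    'question,answer,question_embedding,answer_embedding\n'
    '"The question","The answer","[0.1, 0.2, 0.3]","[0.4, 0.5, 0.6]"\n'
)

rows = []
for row in csv.DictReader(sample_csv):
    rows.append({
        "question": row["question"],
        "answer": row["answer"],
        # Convert the string "[0.1, 0.2, ...]" into a list of floats
        "question_embedding": json.loads(row["question_embedding"]),
        "answer_embedding": json.loads(row["answer_embedding"]),
    })

print(rows[0]["question"], len(rows[0]["question_embedding"]))
```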

### Load into Neo4j

You will load each row as two nodes, a `Question` and an `Answer`, connected by an `ANSWERED_BY` relationship. Each node stores the original text in its `text` property and the corresponding vector in its `embedding` property.

<img 
    src="https://graphacademy.neo4j.com/courses/llm-vectors-unstructured/2-vector-indexes/2-load-embeddings/images/quora-data-model.svg" 
    alt="Data Model"
    style="width: 50%; height: auto; display: block; margin: 0 auto;"
/>

Review the following code, which runs a Cypher statement to load the data into Neo4j and create the nodes and relationships:

```python
import os
from dotenv import load_dotenv
load_dotenv()  # load the Neo4j connection settings from a .env file

import textwrap
from neo4j import GraphDatabase
from utils import execute_query, create_embedding

# Connection settings come from environment variables
neo4j_uri = os.getenv("NEO4J_URI")
neo4j_user = os.getenv("NEO4J_USERNAME")
neo4j_pass = os.getenv("NEO4J_PASSWORD")
neo4j_db = os.getenv("NEO4J_DATABASE")

neo4j_driver = GraphDatabase.driver(neo4j_uri, auth=(neo4j_user, neo4j_pass))

# For each CSV row: create the Question and Answer nodes, store their
# embeddings as vector properties, and link them with ANSWERED_BY
cypher = textwrap.dedent("""
LOAD CSV WITH HEADERS
FROM 'https://data.neo4j.com/llm-vectors-unstructured/Quora-QuAD-1000-embeddings.csv' AS row

MERGE (q:Question{text:row.question})
WITH row,q
CALL db.create.setNodeVectorProperty(q, 'embedding', apoc.convert.fromJsonList(row.question_embedding))

MERGE (a:Answer{text:row.answer})
WITH row,a,q
CALL db.create.setNodeVectorProperty(a, 'embedding', apoc.convert.fromJsonList(row.answer_embedding))

MERGE(q)-[:ANSWERED_BY]->(a)
""")

result = execute_query(neo4j_driver, cypher)
```
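
The `execute_query` helper is imported from a local `utils` module that is not shown in this lesson. The following is only a guess at what such a helper might look like, assuming version 5 of the Neo4j Python driver and its `Driver.execute_query` method; the `parameters` and `database` arguments are hypothetical.

```python
from typing import Any, Dict, List, Optional


def execute_query(
    driver,
    cypher: str,
    parameters: Optional[Dict[str, Any]] = None,
    database: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Run a Cypher statement and return each record as a plain dictionary."""
    # Driver.execute_query returns a (records, summary, keys) named tuple
    records, summary, keys = driver.execute_query(
        cypher,
        parameters_=parameters or {},
        database_=database,
    )
    return [record.data() for record in records]
```

Returning `record.data()` for each record produces a list of dictionaries keyed by the names in the `RETURN` clause.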

```python
# Fetch one question, its answer, and the relationship between them
cypher = textwrap.dedent("""
MATCH (q:Question)-[r:ANSWERED_BY]->(a:Answer)
RETURN q,r,a
LIMIT 1
""")

result = execute_query(neo4j_driver, cypher)
result
```

Output (truncated):

```
[{'q': {'text': 'What song has the lyrics "someone left the cake out in the rain"?',
   'embedding': [0.0001495040050940588,
    -0.02840697392821312,
    0.004463675431907177,
    0.010260951705276966,
    -0.005625720135867596,
    0.010901856236159801,
    0.013096469454467297,
    -0.02340921200811863,
    -0.013193576596677303,
    -0.008687821216881275,
    0.03943830728530884,
    -0.0019615571945905685,
    -0.02786317653954029,
    -0.005020421463996172,
    0.0014614572282880545,
    0.022166244685649872,
    0.02487228624522686,
    0.01856681890785694,
    0.008059863932430744,
    -0.04676663130521774,
    -0.007140586152672768,
    -0.012597988359630108,
    -0.01504507940262556,
    -0.012999363243579865,
    -0.009089196100831032,
    -0.03234303742647171,
    0.03115185908973217,
    -0.02905435301363468,
    0.03306810185313225,
    -0.01191824022680521,
    -0.008513028733432293,
    0.0015949790831655264,
    -0.03306810185313225,
    -0.005370005499571562,
    -0.0
```

```python
# Close the driver to release its connections
neo4j_driver.close()
```