## Embeddings and Vector Databases With ChromaDB

In this tutorial, we will learn about:
* Representing unstructured objects with vectors
* Using word and text embeddings in Python
* Harnessing the power of vector databases
* Encoding and querying over documents with ChromaDB
* Providing context to LLMs like ChatGPT with ChromaDB

Reference: https://realpython.com/chromadb-vector-database/

### Vector Basics
You can describe vectors with variable levels of complexity, but one great starting place is to think of a vector as an array of numbers.

In [1]:
# Sample vectors using NumPy
import numpy as np

vec1 = np.array([1, 0])
vec2 = np.array([0, 1])

vec1 + vec2



array([1, 1])

In [2]:
vec1.shape

(2,)

**A few keywords to remember:**

* Dimension: The dimension of a vector is the number of elements that it contains. In the example above, vec1 and vec2 are both two-dimensional since they each have two elements. You can only visualize vectors with three dimensions or fewer, but generally, vectors can have any number of dimensions. In fact, as you’ll see later, vectors that encode words and text tend to have hundreds or thousands of dimensions.

* Magnitude: The magnitude of a vector is a non-negative number that represents the vector’s size or length. You can also refer to the magnitude of a vector as the norm, and you can denote it with ||v|| or |v|. There are many different definitions of magnitude or norm, but the most common is the Euclidean norm or 2-norm. You’ll learn how to compute this later.

* Unit vector: A unit vector is a vector with a magnitude of one. In the example above, vec1 and vec2 are unit vectors.

* Direction: The direction of a vector specifies the line along which the vector points. You can represent direction using angles, unit vectors, or coordinates in different coordinate systems.

* Dot product (scalar product): The dot product of two vectors, u and v, is a number given by u ⋅ v = ||u|| ||v|| cos(θ), where θ is the angle between the two vectors. Another way to compute the dot product is to do an element-wise multiplication of u and v and sum the results. The dot product is one of the most important and widely used vector operations because it measures the similarity between two vectors. You’ll see more of this later on.

* Orthogonal vectors: Vectors are orthogonal if their dot product is zero, meaning that they’re at a 90 degree angle to each other. You can think of orthogonal vectors as being completely unrelated to each other.

* Dense vector: A vector is considered dense if most of its elements are non-zero. Later on, you’ll see that words and text are most usefully represented with dense vectors because each dimension encodes meaningful information.

In [3]:
import numpy as np

v1 = np.array([1, 0])
v2 = np.array([0, 1])
v3 = np.array([np.sqrt(2), np.sqrt(2)])

# Dimension
print("Dimension of v1: ", v1.shape)


# Magnitude
print("Magnitude of v1: ", np.sqrt(np.sum(v1**2)))

print("Magnitude of v1: ", np.linalg.norm(v1))


print("Magnitude of v3: ", np.linalg.norm(v3))


# Dot product
print("Dot product of v1 and v2: ", np.sum(v1 * v2))


print("Dot product of v1 and v3: ", v1 @ v3)

Dimension of v1:  (2,)
Magnitude of v1:  1.0
Magnitude of v1:  1.0
Magnitude of v3:  2.0
Dot product of v1 and v2:  0
Dot product of v1 and v3:  1.4142135623730951
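The identity u ⋅ v = ||u|| ||v|| cos(θ) from the keyword list can be rearranged to recover the angle between two vectors, and dividing a vector by its magnitude yields a unit vector. The following is a minimal sketch that reuses the values of v1 and v3 from the cell above; the names cos_theta, angle_deg, and v3_unit are introduced here only for illustration.

```python
import numpy as np

v1 = np.array([1.0, 0.0])
v3 = np.array([np.sqrt(2), np.sqrt(2)])

# Rearranged dot product identity: cos(theta) = (u . v) / (||u|| ||v||)
cos_theta = (v1 @ v3) / (np.linalg.norm(v1) * np.linalg.norm(v3))

# Clipping guards against tiny floating-point overshoot outside [-1, 1]
angle_deg = np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))
print("Angle between v1 and v3 (degrees):", angle_deg)

# Dividing by the magnitude keeps the direction but sets the length to one
v3_unit = v3 / np.linalg.norm(v3)
print("Magnitude of normalized v3:", np.linalg.norm(v3_unit))
```

The angle comes out at roughly 45 degrees, which matches the picture of v3 pointing halfway between the two axes.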


### Vector Similarity
Vector similarity measures how closely related two vectors are. The foundation for this measurement lies in the dot product, which serves as the bedrock for many vector similarity metrics.

One issue with the dot product, when used in isolation, is that it can take on any value and is therefore difficult to interpret in absolute terms. For example, if you know only that the dot product between two vectors is -3, then it’s unclear what that means without more context.

To overcome this shortcoming, one common approach is to use cosine similarity, a normalized form of the dot product. You compute cosine similarity by taking the cosine of the angle between two vectors. In essence, you rearrange the cosine definition of the dot product from earlier to solve for cos(θ).
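Written out, that rearrangement gives the following formula, with u, v, and θ as defined in the dot product entry earlier:

```latex
\cos(\theta) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
```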


Cosine similarity disregards the magnitude of both vectors, forcing the calculation to lie between -1 and 1. This is a really nice property because it gives cosine similarity the following interpretations:

* A value of 1 means the angle between the two vectors is 0 degrees. In other words, the two vectors are similar because they point in the exact same direction. Keep in mind this doesn’t mean that the vectors have the same magnitude.

* A value of 0 means the angle between the two vectors is 90 degrees. In this case, the vectors are orthogonal and unrelated to each other.

* A value of -1 means the angle between the two vectors is 180 degrees. This is an interesting case where the vectors are dissimilar because they point in opposite directions.
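The three cases above can be checked with a few hand-picked vectors. This is a small sketch, assuming a helper named cosine_similarity that is defined here purely for illustration; the vectors are chosen so that their magnitudes differ, which shows that only direction matters.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between u and v."""
    return float((u @ v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 0.0])

# Same direction as a, but three times longer
same_direction = cosine_similarity(a, np.array([3.0, 0.0]))
# At a 90 degree angle to a
orthogonal = cosine_similarity(a, np.array([0.0, 2.0]))
# Pointing the opposite way, 180 degrees from a
opposite = cosine_similarity(a, np.array([-5.0, 0.0]))

print(same_direction, orthogonal, opposite)
```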

## Embeddings

Embeddings are a way to represent data such as words, text, images, and audio in a numerical format that computational algorithms can more easily process.

More specifically, embeddings are dense vectors that characterize meaningful information about the objects that they encode. The most common kinds of embeddings are word and text embeddings.

### Word Embeddings
A word embedding is a vector that captures the semantic meaning of a word. Ideally, words that are semantically similar in natural language should have embeddings that are similar to each other in the encoded vector space. Analogously, words that are unrelated or opposite of one another should be further apart in the vector space.

In [15]:
import spacy

nlp = spacy.load("en_core_web_md")

dog_embedding = nlp.vocab["dog"].vector

print("Type of embedding: ", type(dog_embedding))

print("Shape of Embedding:", dog_embedding.shape)

dog_embedding

Type of embedding:  <class 'numpy.ndarray'>
Shape of Embedding: (300,)


array([ 1.2330e+00,  4.2963e+00, -7.9738e+00, -1.0121e+01,  1.8207e+00,
        1.4098e+00, -4.5180e+00, -5.2261e+00, -2.9157e-01,  9.5234e-01,
        6.9880e+00,  5.0637e+00, -5.5726e-03,  3.3395e+00,  6.4596e+00,
       -6.3742e+00,  3.9045e-02, -3.9855e+00,  1.2085e+00, -1.3186e+00,
       -4.8886e+00,  3.7066e+00, -2.8281e+00, -3.5447e+00,  7.6888e-01,
        1.5016e+00, -4.3632e+00,  8.6480e+00, -5.9286e+00, -1.3055e+00,
        8.3870e-01,  9.0137e-01, -1.7843e+00, -1.0148e+00,  2.7300e+00,
       -6.9039e+00,  8.0413e-01,  7.4880e+00,  6.1078e+00, -4.2130e+00,
       -1.5384e-01, -5.4995e+00,  1.0896e+01,  3.9278e+00, -1.3601e-01,
        7.7732e-02,  3.2218e+00, -5.8777e+00,  6.1359e-01, -2.4287e+00,
        6.2820e+00,  1.3461e+01,  4.3236e+00,  2.4266e+00, -2.6512e+00,
        1.1577e+00,  5.0848e+00, -1.7058e+00,  3.3824e+00,  3.2850e+00,
        1.0969e+00, -8.3711e+00, -1.5554e+00,  2.0296e+00, -2.6796e+00,
       -6.9195e+00, -2.3386e+00, -1.9916e+00, -3.0450e+00,  2.48

You first import spacy and load the medium English model into an object called nlp. You then look up the embedding for the word dog with nlp.vocab["dog"].vector and store it as dog_embedding. Calling type(dog_embedding) tells you that the embedding is a NumPy array, and dog_embedding.shape indicates that the embedding has 300 dimensions. Lastly, evaluating dog_embedding on the final line displays the embedding’s values (the output shown above is truncated).

This is pretty neat! The nlp.vocab object allows you to find the word embedding for any word in the model’s vocabulary. You can now assess the similarity between word embeddings using metrics like cosine similarity. 

In [16]:
def compute_cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Compute the cosine similarity between two vectors"""

    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

In [17]:
dog_embedding = nlp.vocab["dog"].vector
cat_embedding = nlp.vocab["cat"].vector
apple_embedding = nlp.vocab["apple"].vector
tasty_embedding = nlp.vocab["tasty"].vector
delicious_embedding = nlp.vocab["delicious"].vector
truck_embedding = nlp.vocab["truck"].vector

print("Cat Vs Dog : ", compute_cosine_similarity(dog_embedding, cat_embedding))
print("Apple Vs Delicious : ", compute_cosine_similarity(apple_embedding, delicious_embedding))

Cat Vs Dog :  0.8220817
Apple Vs Delicious :  0.5347654


### Text Embeddings
Text embeddings encode information about sentences and documents, not just individual words, into vectors. This allows you to compare larger bodies of text to each other just like you did with word vectors. Because they encode more information than a single word embedding, text embeddings are a more powerful representation of information.

Text embeddings are typically the fundamental objects stored in vector databases like ChromaDB.

In [1]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

texts = [
         "The canine barked loudly.",
         "The dog made a noisy bark.",
         "He ate a lot of pizza.",
         "He devoured a large quantity of pizza pie.",
]

text_embeddings = model.encode(texts)

print("Type: ", type(text_embeddings))

print("Shape: ", text_embeddings.shape)


Type:  <class 'numpy.ndarray'>
Shape:  (4, 384)


In [2]:
model.encode('hey'*600)

array([-4.15964164e-02, -2.46818159e-02,  7.78930783e-02, -6.14949502e-03,
        1.70014743e-02,  1.26202719e-03,  1.23739600e-01, -1.97151094e-03,
        5.89809977e-02, -4.21656072e-02,  5.91972917e-02, -7.12989345e-02,
        1.26703799e-01,  2.51503158e-02, -1.38013838e-02, -6.40119463e-02,
       -2.13553403e-02, -2.36788057e-02, -4.56758700e-02,  7.49071548e-03,
       -8.09309781e-02,  1.64253470e-02,  4.02019247e-02,  3.40726860e-02,
       -9.14752558e-02,  4.13327254e-02, -2.85369856e-03, -5.18556917e-03,
        6.23375550e-02, -3.98614444e-02,  1.68919116e-02,  1.91846248e-02,
        5.93932904e-03, -8.97368938e-02,  1.82577427e-02,  4.33312170e-03,
        1.75719745e-02, -6.44410774e-02, -6.92551285e-02,  4.20069136e-02,
       -6.00515008e-02, -3.75331156e-02,  5.68970330e-02, -6.44810423e-02,
        5.67378737e-02,  1.10392459e-02, -7.32904077e-02, -2.01278105e-02,
        2.71938723e-02, -1.15391687e-02,  2.39583943e-03,  1.29432669e-02,
       -2.86375731e-02,  

While all the texts in the first example are single sentences, you can encode longer texts up to a model-specific maximum length. For example, "all-MiniLM-L6-v2" encodes texts up to 256 word pieces (tokens), and it truncates any text longer than this.
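Because longer inputs are truncated, one common workaround is to split a long document into smaller pieces and embed each piece separately. The sketch below shows one way to do that with a plain whitespace split. The helper name chunk_words and the limit of 200 words per chunk are assumptions made for illustration, and since the model counts its own tokens rather than whitespace-separated words, the word limit is only a rough proxy.

```python
def chunk_words(text: str, max_words: int = 200) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]

# 600 repetitions of a short word, separated by spaces
long_text = "hey " * 600
chunks = chunk_words(long_text, max_words=200)
print(len(chunks), [len(chunk.split()) for chunk in chunks])
```

Each chunk could then be passed to model.encode() in the same way as the list of sentences earlier.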

In [20]:
text_embeddings_dict = dict(zip(texts, list(text_embeddings)))

dog_text_1 = "The canine barked loudly."
dog_text_2 = "The dog made a noisy bark."

print("Similarity between dog related texts: ", compute_cosine_similarity(text_embeddings_dict[dog_text_1],
                          text_embeddings_dict[dog_text_2]))


pizza_text_1 = "He ate a lot of pizza."
pizza_text_2 = "He devoured a large quantity of pizza pie."

print("Similarity between Pizza Texts: ", compute_cosine_similarity(text_embeddings_dict[pizza_text_1],
                          text_embeddings_dict[pizza_text_2]))

print("Similarity between Pizza Text and Dog Text: ", compute_cosine_similarity(text_embeddings_dict[dog_text_1],
                          text_embeddings_dict[pizza_text_2]))

Similarity between dog related texts:  0.77686167
Similarity between Pizza Texts:  0.787134
Similarity between Pizza Text and Dog Text:  0.11757195


* The cosine similarity between The canine barked loudly and The dog made a noisy bark is relatively high even though the two sentences use different words. The same is true for the similarity between He ate a lot of pizza and He devoured a large quantity of pizza pie. Because the text embeddings encode semantic meaning, any pair of related texts should have a high cosine similarity.

* As you might expect, the cosine similarity between The canine barked loudly and He devoured a large quantity of pizza pie is low because the sentences are unrelated to each other.
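Comparing texts pair by pair gets tedious as the list grows. One way to get every pairwise cosine similarity at once is to normalize each embedding to unit length and multiply the resulting matrix by its own transpose. The sketch below uses small made-up vectors in place of real text embeddings, and the function name cosine_similarity_matrix is hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Return the cosine similarity between every pair of rows."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit_rows = embeddings / norms      # every row now has magnitude 1
    return unit_rows @ unit_rows.T      # dot products of unit vectors are cosines

# Hypothetical 3-dimensional stand-ins for real text embeddings
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 3.0, 0.0],
])

similarities = cosine_similarity_matrix(toy_embeddings)
print(similarities.round(2))
```

Entry (i, j) of the result is the similarity between rows i and j, so the diagonal is always 1. Passing the (4, 384) text_embeddings array from the earlier cell should give a 4 by 4 matrix in the same way.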

## ChromaDB

### Vector Database

A vector database is a database that allows you to efficiently store and query embedding data. Vector databases extend the capabilities of traditional relational databases to embeddings. However, the key distinguishing feature of a vector database is that query results aren’t an exact match to the query. Instead, using a specified similarity metric, the vector database returns embeddings that are similar to a query.

As an example use case, suppose you’ve stored company documents in a vector database. This means each document has been embedded and can be compared to other embeddings through a similarity metric like cosine similarity.

Here are the core components of a vector database that you should know about:

* **Embedding function**: When using a vector database, oftentimes you’ll store and query data in its raw form, rather than uploading embeddings themselves. Internally, the vector database needs to know how to convert your data to embeddings, and you have to specify an embedding function for this. For text, you can use the embedding functions available in the SentenceTransformers library or any other function that maps raw text to vectors.

* **Similarity metric**: To assess embedding similarity, you need a similarity metric like cosine similarity, the dot product, or Euclidean distance. As you learned previously, cosine similarity is a popular choice, but choosing the right similarity metric depends on your application.

* **Indexing**: When you’re dealing with a large number of embeddings, comparing a query embedding to every embedding stored in the database is often too slow. To overcome this, vector databases employ indexing algorithms that group similar embeddings together. At query time, the query embedding is compared to a smaller subset of embeddings based on the index. Because the embeddings recommended by the index aren’t guaranteed to have the highest similarity to the query, this is called approximate nearest neighbor search.

* **Metadata**: You can store metadata with each embedding to help give context and make query results more precise. You can filter your embedding searches on metadata much like you would in a relational database. For example, you could store the year that a document was published as metadata and only look for similar documents that were published in a given year.

* **Storage location**: With any kind of database, you need a place to store the data. Vector databases can store embeddings and metadata both in memory and on disk. Keeping data in memory allows for faster reads and writes, while writing to disk is important for persistent storage.

* **CRUD operations**: Most vector databases support create, read, update, and delete (CRUD) operations. This means you can maintain and interact with data like you would in a relational database.
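To make the indexing point concrete, here is a sketch of the exhaustive search that an index is meant to avoid: the query is compared against every stored embedding and the best matches are returned. This is not how ChromaDB works internally. The function name exact_nearest_neighbors and the toy 2-dimensional vectors are assumptions for illustration.

```python
import numpy as np

def exact_nearest_neighbors(query, embeddings, n_results=1):
    """Return (row_index, cosine_similarity) pairs, most similar first."""
    # Cosine similarity between the query and every stored embedding
    scores = (embeddings @ query) / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query)
    )
    # Sort by descending similarity and keep the top n_results rows
    order = np.argsort(-scores)[:n_results]
    return [(int(i), float(scores[i])) for i in order]

# Hypothetical stored embeddings and query
toy_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_query = np.array([1.0, 0.1])

neighbors = exact_nearest_neighbors(toy_query, toy_embeddings, n_results=2)
print(neighbors)
```

The cost of this approach grows linearly with the number of stored embeddings, which is why vector databases trade a little accuracy for speed with approximate nearest neighbor indexes.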

**ChromaDB** is an open-source vector database designed specifically for LLM applications. ChromaDB offers you both a user-friendly API and impressive performance, making it a great choice for many embedding applications.

In [4]:
import chromadb
from chromadb.utils import embedding_functions

#set up paths, model..
CHROMA_DATA_PATH = "chroma_data/"
EMBED_MODEL = "all-MiniLM-L6-v2"
COLLECTION_NAME = "demo_docs"

client = chromadb.PersistentClient(path=CHROMA_DATA_PATH)

After importing chromadb and defining a few constants, you instantiate a PersistentClient object that writes your embedding data to CHROMA_DATA_PATH. By doing this, you ensure that data will be stored at CHROMA_DATA_PATH and persist to new clients. Alternatively, you can use chromadb.Client() to instantiate a ChromaDB instance that only writes to memory and doesn’t persist on disk.

Next, you instantiate your embedding function and the ChromaDB collection to store your documents in:



In [7]:
embedding_func = embedding_functions.SentenceTransformerEmbeddingFunction(EMBED_MODEL)

#define collection
collection = client.create_collection(
    name=COLLECTION_NAME,
    embedding_function=embedding_func,
    metadata={"hnsw:space": "cosine"},
)


A collection is the object that stores your embedded documents along with any associated metadata. If you’re familiar with relational databases, then you can think of a collection as a table. In this example, your collection is named demo_docs, it uses the "all-MiniLM-L6-v2" embedding function that you instantiated, and it uses the cosine similarity distance function as specified by metadata={"hnsw:space": "cosine"}.

In [11]:
documents = [
    "The latest iPhone model comes with impressive features and a powerful camera.",
    "Exploring the beautiful beaches and vibrant culture of Bali is a dream for many travelers.",
    "Einstein's theory of relativity revolutionized our understanding of space and time.",
    "Traditional Italian pizza is famous for its thin crust, fresh ingredients, and wood-fired ovens.",
    "The American Revolution had a profound impact on the birth of the United States as a nation.",
    "Regular exercise and a balanced diet are essential for maintaining good physical health.",
    "Leonardo da Vinci's Mona Lisa is considered one of the most iconic paintings in art history.",
    "Climate change poses a significant threat to the planet's ecosystems and biodiversity.",
    "Startup companies often face challenges in securing funding and scaling their operations.",
    "Beethoven's Symphony No. 9 is celebrated for its powerful choral finale, 'Ode to Joy.'",
]

genres = [
    "technology",
    "travel",
    "science",
    "food",
    "history",
    "fitness",
    "art",
    "climate change",
    "business",
    "music",
]

#Add data to DB
collection.add(
    documents=documents,
    ids=[f"id{i}" for i in range(len(documents))],
    metadatas=[{"genre": g} for g in genres]
)


In this block, you define a list of ten documents in documents and specify the genre of each document in genres. You then add the documents and genres using collection.add(). Each document in the documents argument is embedded and stored in the collection. You also have to define the ids argument to uniquely identify each document and embedding in the collection. You accomplish this with a list comprehension that creates a list of ID strings.

The metadatas argument is optional, but most of the time, it’s useful to store metadata with your embeddings. In this case, you define a single metadata field, "genre", that records the genre of each document. When you query a document, metadata provides you with additional information that can be helpful to better understand the document’s contents. You can also filter on metadata fields, just like you would in a relational database query.

In [31]:
#query database
query_results = collection.query(
    query_texts=["Find me some delicious pizza!"],
    n_results=1,
)

query_results

{'ids': [['id3']],
 'distances': [[0.4305993757604256]],
 'metadatas': [[{'genre': 'food'}]],
 'embeddings': None,
 'documents': [['Traditional Italian pizza is famous for its thin crust, fresh ingredients, and wood-fired ovens.']],
 'uris': None,
 'data': None}

In [21]:
query_results = collection.query(
    query_texts=["Teach me about history",
                 "What's going on in the world?"],
    include=["documents", "distances"],
    n_results=2
)

query_results

{'ids': [['id2', 'id4'], ['id7', 'id2']],
 'distances': [[0.6265882785638517, 0.6904193065163069],
  [0.8002944119697346, 0.8882106526920683]],
 'metadatas': None,
 'embeddings': None,
 'documents': [["Einstein's theory of relativity revolutionized our understanding of space and time.",
   'The American Revolution had a profound impact on the birth of the United States as a nation.'],
  ["Climate change poses a significant threat to the planet's ecosystems and biodiversity.",
   "Einstein's theory of relativity revolutionized our understanding of space and time."]],
 'uris': None,
 'data': None}
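The result dictionary is nested: each outer list holds one entry per query text, and each inner list holds that query's matches in order. The sketch below walks through that structure using the ids and distances copied from the output above. It also converts each distance to a similarity, assuming that with "hnsw:space" set to "cosine" the reported distance equals 1 minus the cosine similarity; treat that conversion as an assumption to verify against the ChromaDB documentation for your version.

```python
# Values copied from the query output above
query_results = {
    "ids": [["id2", "id4"], ["id7", "id2"]],
    "distances": [
        [0.6265882785638517, 0.6904193065163069],
        [0.8002944119697346, 0.8882106526920683],
    ],
}
query_texts = ["Teach me about history", "What's going on in the world?"]

# One (ids, distances) pair of lists per query text
for text, ids, distances in zip(
    query_texts, query_results["ids"], query_results["distances"]
):
    print(text)
    for doc_id, distance in zip(ids, distances):
        # Assumption: cosine distance = 1 - cosine similarity
        print(f"  {doc_id}: distance={distance:.3f}, similarity={1 - distance:.3f}")

similarities = [[1 - d for d in row] for row in query_results["distances"]]
```

Smaller distances mean closer matches, so the first ID in each inner list is the best match for the corresponding query.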

**Keep in mind that so-called similar documents returned from a semantic search over embeddings may not actually be relevant to the task that you’re trying to solve. The success of a semantic search is somewhat subjective, and you or your stakeholders might not agree on the quality of the results.**

In [44]:
#Filtering on metadata
query_results = collection.query(
    query_texts="Teach me about music history",
    include=["documents", "distances"],
    n_results=1
)
print("Result without metadata: ", query_results, '\n ')

#metadata equals music
query_results = collection.query(
    query_texts="Teach me about music history",
    include=["documents", "distances"],
    where={'genre': {'$eq':'music'}},
    n_results=2
)
print("Result with metadata: ", query_results, '\n')

#filter in a list of metadata
query_results = collection.query(
    query_texts="Teach me about music history",
    include=["documents", "distances"],
    where={'genre': {'$in':['music', 'history']}},
    n_results=2
)
print("Result with extra metadata: ", query_results)

Result without metadata:  {'ids': [['id2']], 'distances': [[0.7625819974843479]], 'metadatas': None, 'embeddings': None, 'documents': [["Einstein's theory of relativity revolutionized our understanding of space and time."]], 'uris': None, 'data': None} 
 
Result with metadata:  {'ids': [['id9']], 'distances': [[0.818632860764435]], 'metadatas': None, 'embeddings': None, 'documents': [["Beethoven's Symphony No. 9 is celebrated for its powerful choral finale, 'Ode to Joy.'"]], 'uris': None, 'data': None} 

Result with extra metadata:  {'ids': [['id9', 'id4']], 'distances': [[0.818632860764435, 0.8200413343809558]], 'metadatas': None, 'embeddings': None, 'documents': [["Beethoven's Symphony No. 9 is celebrated for its powerful choral finale, 'Ode to Joy.'", 'The American Revolution had a profound impact on the birth of the United States as a nation.']], 'uris': None, 'data': None}


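The first query runs without any filter. The second restricts the search to documents whose genre equals music, using where={"genre": {"$eq": "music"}}; only one document matches, so only one result comes back even though n_results=2. The third restricts the search to documents that have either a music or history genre, as specified by where={"genre": {"$in": ["music", "history"]}}.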
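To illustrate what these where filters select, the sketch below reimplements the two operators used above, $eq and $in, in plain Python and applies them to the genre metadata from earlier. It only mirrors the filtering semantics and is not ChromaDB's implementation; the helper name matches is hypothetical.

```python
def matches(metadata: dict, where: dict) -> bool:
    """Check a metadata dict against a filter that uses only $eq and $in."""
    for field, condition in where.items():
        value = metadata.get(field)
        for operator, operand in condition.items():
            if operator == "$eq":
                if value != operand:
                    return False
            elif operator == "$in":
                if value not in operand:
                    return False
            else:
                raise ValueError(f"Operator not covered by this sketch: {operator}")
    return True

genres = ["technology", "travel", "science", "food", "history",
          "fitness", "art", "climate change", "business", "music"]
metadatas = [{"genre": g} for g in genres]

eq_ids = [f"id{i}" for i, m in enumerate(metadatas)
          if matches(m, {"genre": {"$eq": "music"}})]
in_ids = [f"id{i}" for i, m in enumerate(metadatas)
          if matches(m, {"genre": {"$in": ["music", "history"]}})]
print(eq_ids, in_ids)
```

The IDs that pass each filter line up with the documents returned by the filtered queries above: only the music document for $eq, and the history and music documents for $in.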

In [46]:
# Update existing documents (id0 is the iPhone document, id1 is the Bali document)
collection.update(
    ids=["id0", "id1"],
    documents=["The new iPhone is awesome!",
               "Bali has beautiful beaches"],
    metadatas=[{"genre": "tech"}, {"genre": "beaches"}]
)

query_results = collection.get(ids=["id0", "id1"])

query_results["documents"]

['The new iPhone is awesome!', 'Bali has beautiful beaches']

If you’re not sure whether a document exists for an ID, you can use collection.upsert(). This works the same way as collection.update(), except it’ll insert new documents for IDs that don’t exist.

In [47]:
# Delete documents by ID
collection.delete(ids=["id0", "id1"])

collection.get(ids=["id0"])

{'ids': [],
 'embeddings': None,
 'metadatas': [],
 'documents': [],
 'uris': None,
 'data': None}