# Embeddings Type Comparison

## Setup and Introduction

Embeddings are dense vector representations of data (often words, sentences, or documents) in a continuous vector space. They capture semantic meaning and relationships between items in a way that is useful for machine learning models. In essence, embeddings translate high-dimensional, sparse data into lower-dimensional, dense vectors in which similar items lie closer together. Let's look at several types of embeddings:

**Float Embeddings**: Float embeddings are the most common and traditional type of embeddings. They use floating-point numbers (typically 32-bit or 64-bit) to represent each dimension of the embedding vector, offering high precision but requiring more storage and computational resources. They are the standard for most machine learning tasks.

**Int8 Embeddings**: Int8 embeddings use 8-bit integers instead of floating-point numbers. This is a form of quantization that balances precision against efficiency. They reduce memory usage and can speed up computations, making them suitable for resource-constrained environments.

**Binary Embeddings**: Binary embeddings represent each dimension with a single bit (0 or 1), resulting in extremely compact and fast-to-process embeddings. While they sacrifice precision, they excel in large-scale similarity search tasks.

This notebook compares the three types of embeddings: float, int8, and binary. The Cohere embedding model on Amazon Bedrock generates the embeddings, and we then compare their memory footprint and retrieval speed. The vector stores used are OpenSearch and FAISS (Facebook AI Similarity Search).
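
To make the three representations concrete, here is a minimal NumPy sketch that derives an int8 and a binary vector from a float vector. The random input vector and both quantization rules (max-magnitude scaling for int8, sign thresholding for binary) are illustrative assumptions; they are not necessarily how the Cohere model produces its quantized outputs.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# A hypothetical 1024-dimensional float embedding with values in [-1, 1]
float_vec = rng.uniform(-1.0, 1.0, size=1024).astype(np.float32)

# Illustrative int8 quantization: scale by the largest magnitude, round, clip
scale = 127.0 / np.max(np.abs(float_vec))
int8_vec = np.clip(np.round(float_vec * scale), -127, 127).astype(np.int8)

# Illustrative binary quantization: keep only the sign of each dimension,
# then pack 8 bits into each byte
bits = (float_vec > 0).astype(np.uint8)
binary_vec = np.packbits(bits)

print(float_vec.nbytes, int8_vec.nbytes, binary_vec.nbytes)  # 4096 1024 128
```

Each step trades precision for a smaller vector: 4 bytes per dimension, then 1 byte, then 1 bit.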

## Install Libraries

In [35]:
!pip install -U --quiet boto3 pandas datasets faiss-cpu opensearch-py

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
awscli 1.32.85 requires botocore==1.34.85, but you have botocore 1.34.152 which is incompatible.

## Setup Bedrock Client

The `invoke_bedrock` function is set up to interact with Amazon Bedrock's API. It sends requests to generate embeddings for given texts using the specified embedding types.

In [36]:
import boto3
import json

bedrock = boto3.client(service_name="bedrock")
bedrock_runtime = boto3.client(service_name="bedrock-runtime")

def invoke_bedrock(texts, input_type, embedding_types, model_id="cohere.embed-english-v3"):
    """Return the embeddings for `texts`, keyed by embedding type."""
    response = bedrock_runtime.invoke_model(
        body=json.dumps({
            "texts": texts,
            "input_type": input_type,
            "truncate": "END",
            "embedding_types": embedding_types
        }),
        modelId=model_id,
        accept="application/json",
        contentType="application/json"
    )
    response_body = json.loads(response.get("body").read())

    # "embeddings" maps each requested embedding type to a list of vectors
    return response_body.get("embeddings")
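
The notebook embeds exactly 96 passages, which fits in a single request. For a larger corpus the texts would need to be split across several calls. The sketch below shows one way to do that, assuming a per-request limit of 96 texts (check the model documentation for the actual limit). `embed_in_batches` and its `embed_fn` parameter are hypothetical names introduced here for illustration.

```python
from typing import Callable, Dict, List

def embed_in_batches(
    texts: List[str],
    embed_fn: Callable[[List[str]], Dict[str, list]],
    batch_size: int = 96,
) -> Dict[str, list]:
    """Call embed_fn on consecutive batches and merge the per-type results."""
    merged: Dict[str, list] = {}
    for start in range(0, len(texts), batch_size):
        batch_result = embed_fn(texts[start:start + batch_size])
        for embed_type, vectors in batch_result.items():
            merged.setdefault(embed_type, []).extend(vectors)
    return merged

# Usage with the notebook's function (requires AWS credentials and Bedrock access):
# all_embeddings = embed_in_batches(
#     corpus,
#     lambda batch: invoke_bedrock(batch, "search_document", ["float"]),
# )
```

Passing the embedding call in as a function keeps the batching logic independent of Bedrock, so it can be exercised with a stub.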

## Setup dataset

We use the [MSMARCO dataset](https://huggingface.co/datasets/microsoft/ms_marco), a large-scale dataset for information retrieval tasks. We load a subset of this dataset, extracting passages and queries for our embedding experiments.

In [37]:
from datasets import load_dataset

# Load a subset of the MSMARCO dataset from Hugging Face
dataset = load_dataset("microsoft/ms_marco", "v2.1", split="train[:96]")
passages = dataset['passages']
# Keep the first passage text of each record as the document to embed
corpus = [passage["passage_text"][0] for passage in passages]
queries = dataset['query']


## Vector Embedding Memory Footprint

This section demonstrates how different embedding types affect memory usage.

The `calculate_embedding_memory` function computes the memory footprint of embeddings based on their type (float, int8, or binary). It calculates the total size, embedding dimension, number of embeddings, and size of each element.

In [38]:
import numpy as np

def calculate_embedding_memory(embeddings, embed_type="float"):

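    if embed_type == "float":
        data_type = np.float32
    elif embed_type == "int8":
        data_type = np.int8
    elif embed_type == "ubinary":
        # Binary embeddings arrive packed, 8 dimensions per unsigned byte
        data_type = np.uint8
    else:
        raise ValueError(f"Unsupported embed_type: {embed_type}")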

    embeddings_np = np.array(embeddings).astype(data_type)
    num_embeddings, embedding_dim = embeddings_np.shape
    
    # Get the size of each element in bytes
    element_size = embeddings_np.itemsize
    
    # Calculate the total size
    total_size = num_embeddings * embedding_dim * element_size
    
    return total_size, embedding_dim, num_embeddings, element_size
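
As a cross-check on the function above, the expected footprint can be worked out directly from the number of embeddings, the number of dimensions, and the bits per dimension. The sketch below does this for 96 embeddings of 1024 dimensions, the corpus size and model dimension used in this notebook. `expected_size_bytes` is a hypothetical helper, and it assumes the dimensions are stored tightly packed with no per-vector overhead.

```python
def expected_size_bytes(num_embeddings: int, num_dims: int, bits_per_dim: int) -> int:
    """Bytes needed to store the embeddings, assuming tightly packed dimensions."""
    total_bits = num_embeddings * num_dims * bits_per_dim
    return total_bits // 8

for name, bits in [("float32", 32), ("int8", 8), ("binary", 1)]:
    print(name, expected_size_bytes(96, 1024, bits))
# float32 393216
# int8 98304
# binary 12288
```

Relative to float32, int8 is 4 times smaller and binary is 32 times smaller.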

### Float embeddings

Float embeddings are the standard output of most embedding models. They use 32 bits (4 bytes) per dimension, providing high precision but requiring more memory.

In [45]:
embed_type="float"
response= invoke_bedrock(corpus, input_type="search_document", embedding_types=[embed_type])
embeddings_float = response[embed_type]

# Calculate memory
total_size_float, embedding_dim, num_embeddings, element_size = calculate_embedding_memory(embeddings_float, embed_type)
total_size_float_mb = total_size_float / (1024 * 1024)

print(f"Total Size: {total_size_float} bytes")
print(f"Total Size (MB): {total_size_float_mb} MB")
print(f"Size of each embedding dimension: {element_size} byte(s)")
print(f"Embeddings dimension: {embedding_dim}")
print(f"Number of embeddings: {num_embeddings}")

Total Size: 393216 bytes
Total Size (MB): 0.375 MB
Size of each embedding dimension: 4 byte(s)
Embeddings dimension: 1024
Number of embeddings: 96


### int8 embeddings

Int8 embeddings quantize the float embeddings to 8-bit integers. This reduces memory usage to 1 byte per dimension, potentially sacrificing some precision for efficiency.

In [46]:
embed_type="int8"
response= invoke_bedrock(corpus, input_type="search_document", embedding_types=[embed_type])
embeddings_int8 = response[embed_type]

# Calculate memory
total_size_int8, embedding_dim, num_embeddings, element_size = calculate_embedding_memory(embeddings_int8, embed_type)
total_size_int8_mb = total_size_int8 / (1024 * 1024)

print(f"Total Size: {total_size_int8} bytes")
print(f"Total Size (MB): {total_size_int8_mb} MB")
print(f"Size of each embedding dimension: {element_size} byte(s)")
print(f"Embeddings dimension: {embedding_dim}")
print(f"Number of embeddings: {num_embeddings}")

Total Size: 98304 bytes
Total Size (MB): 0.09375 MB
Size of each embedding dimension: 1 byte(s)
Embeddings dimension: 1024
Number of embeddings: 96


### Binary embeddings

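Binary embeddings further compress the representation to 1 bit per dimension. While this dramatically reduces memory usage, it may lead to a more significant loss in precision compared to float or int8 embeddings. The `ubinary` embeddings are returned packed, with 8 dimensions per unsigned byte, so each 1024-dimensional embedding is reported below as 128 elements of 1 byte each.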

In [47]:
embed_type="ubinary"
response= invoke_bedrock(corpus, input_type="search_document", embedding_types=[embed_type])
embeddings_binary = response[embed_type]

# Calculate memory
total_size_binary, embedding_dim, num_embeddings, element_size = calculate_embedding_memory(embeddings_binary, embed_type)
total_size_binary_mb = total_size_binary / (1024 * 1024)

print(f"Total Size: {total_size_binary} bytes")
print(f"Total Size (MB): {total_size_binary_mb} MB")
print(f"Size of each embedding dimension: {element_size} byte(s)")
print(f"Embeddings dimension: {embedding_dim}")
print(f"Number of embeddings: {num_embeddings}")

Total Size: 12288 bytes
Total Size (MB): 0.01171875 MB
Size of each embedding dimension: 1 byte(s)
Embeddings dimension: 128
Number of embeddings: 96
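
The reported dimension of 128 reflects bit packing: 128 bytes hold 1024 one-bit dimensions. A small NumPy sketch of that relationship follows, using a randomly generated stand-in for a packed embedding rather than real model output.

```python
import numpy as np

# A hypothetical packed binary embedding: 128 unsigned bytes
rng = np.random.default_rng(seed=0)
packed = rng.integers(0, 256, size=128, dtype=np.uint8)

# Each byte holds 8 dimensions, so unpacking recovers 1024 individual bits
bits = np.unpackbits(packed)
print(packed.shape, bits.shape)  # (128,) (1024,)

# Packing the bits again reproduces the original bytes
assert np.array_equal(np.packbits(bits), packed)
```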


### Comparing Sizes

In [48]:
from IPython.display import display, HTML
import pandas as pd

data = {
    "Embedding Type": ["Float Embeddings", "Int8 Embeddings", "Binary Embeddings"],
    "Total Size (Bytes)": [total_size_float, total_size_int8, total_size_binary],
    "Total Size (MB)": [total_size_float_mb, total_size_int8_mb, total_size_binary_mb]
}
df = pd.DataFrame(data)
display(df.style.set_properties(**{'text-align': 'left'}))

| Embedding Type | Total Size (Bytes) | Total Size (MB) |
|---|---|---|
| Float Embeddings | 393216 | 0.375 |
| Int8 Embeddings | 98304 | 0.09375 |
| Binary Embeddings | 12288 | 0.011719 |


## Retrieval Speed

This section compares the retrieval speed of different embedding types using OpenSearch and FAISS.

### Prerequisite

**Before running the cell below, make sure an OpenSearch cluster is running. For simplicity, you can deploy an OpenSearch cluster with Docker on the same host where this notebook is running:**

`docker run -d -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" opensearchproject/opensearch:2.11.1`

Below we use the default `user: admin` and `password: admin`. You might need to replace the host/port/username/password if OpenSearch is hosted somewhere else.

In [57]:
from opensearchpy import OpenSearch, helpers
import time
import urllib3

# Disable SSL warnings from OpenSearch (certificates are not verified below)
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Here we use the default user: admin and password: admin
# You might need to replace the host/port/username/password if OpenSearch is hosted somewhere else
# The client is created at module level so that both create_os_index and query_os can use it
os_client = OpenSearch(
    hosts=[{'host': 'localhost', 'port': 9200}],
    http_auth=('admin', 'admin'),
    use_ssl=True,
    verify_certs=False
)

def create_os_index(index_name, index_body, documents, embeddings):
    
    # Delete the index if it exists already
    os_client.indices.delete(index=index_name, ignore=[400, 404])
    
    # Create the new index
    response = os_client.indices.create(index_name, body=index_body)
    print("OpenSearch index created:", response)
    
    # Index the documents and their precomputed embeddings in batches of 512
    batch_size = 512
    doc_id = 0
    for start_idx in range(0, len(documents), batch_size):
        batch_documents = documents[start_idx:start_idx+batch_size]
        # Slice the embeddings with the same window so they stay aligned with the documents
        batch_embeddings = embeddings[start_idx:start_idx+batch_size]

        # Do a bulk upsert to OpenSearch
        batch = []
        for document, doc_emb in zip(batch_documents, batch_embeddings):
            batch.append({
                    "_index": index_name,
                    "_id": doc_id,
                    "_source": {
                        "text": document,
                        "text_emb": doc_emb
                    }
                })
            doc_id += 1
        helpers.bulk(os_client, batch)

    print("Indexing of documents finished")
    
    # Give opensearch some time to index the data
    time.sleep(1)

    
def query_os(index_name, query_emb, top_k=5):
    
    query_body = {
        "size": top_k,
        "query": {
            "knn": {
            "text_emb": {
                "vector": query_emb,
                "k": top_k
                }
            }
        }
    }
    
    start_time = time.time()
    hits = os_client.search(index=index_name, body=query_body)["hits"]["hits"]
    end_time = time.time()

    elapsed_time = end_time - start_time
    elapsed_time_secs = f"{elapsed_time:.6f} seconds"
    
    return hits, elapsed_time_secs
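
A single timed request can be noisy, since it includes connection setup, caching effects, and scheduling jitter. One possible refinement is to repeat the call and look at the median, as sketched below. `time_calls` is a hypothetical helper and the workload shown is only a stand-in for an actual search request.

```python
import statistics
import time
from typing import Callable, List

def time_calls(fn: Callable[[], object], repeats: int = 20, warmup: int = 2) -> List[float]:
    """Return per-call latencies in seconds, after a few untimed warm-up calls."""
    for _ in range(warmup):
        fn()
    latencies = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    return latencies

# Stand-in workload; in the notebook this could be
# lambda: query_os(index_name, query_emb)
latencies = time_calls(lambda: sum(range(10_000)))
print(f"median: {statistics.median(latencies):.6f} s, max: {max(latencies):.6f} s")
```

`time.perf_counter` is used here because it is a monotonic, high-resolution clock intended for measuring short durations.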


### Float embeddings

We create an OpenSearch index using float embeddings and perform a search. This serves as a baseline for comparison with other embedding types.

In [58]:
embed_type="float"
index_name = "index-float"

index_body = {
    'settings': {
        'index': {
            'number_of_shards': 1,
            "knn": True,
            "knn.algo_param.ef_search": 100
        }
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "text_emb": {
                "type": "knn_vector",
                "dimension": 1024
            }
        }
    }
}
create_os_index(index_name, index_body, corpus, embeddings_float)

# Search in your index. First we define the query
query = queries[0]
response = invoke_bedrock([query], input_type="search_query", embedding_types=[embed_type])
# Keep the query as floats: casting it to int8 would truncate nearly every value to 0
query_emb = response[embed_type][0]

hits, response_time_float = query_os(index_name, query_emb)

print(f"Time taken for retrieval: {response_time_float}\n\n")




OpenSearch index created: {'acknowledged': True, 'shards_acknowledged': True, 'index': 'index-float'}
Indexing of documents finished
Time taken for retrieval: 0.011233 seconds




### int8 embeddings

Next we create an OpenSearch index from the int8 embeddings and compare its search performance with the float embeddings baseline.

In [59]:
embed_type="int8"
index_name = "index-int8"

# We specify a 'text_emb' property, that has "data_type": "byte". As engine we must use Lucene 
index_body = {
    'settings': {
        'index': {
            'number_of_shards': 1,
            "knn": True,
            "knn.algo_param.ef_search": 100
        }
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "text_emb": {
                "type": "knn_vector",
                "dimension": 1024,      #Use 1024 for the large models, 384 for the light models
                "data_type": "byte",    #Set data_type as byte
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "lucene", #Set Lucene as your engine
                    "parameters": {
                        "ef_construction": 256,
                        "m": 48
                    }
                }
            }
        }
    }
}
create_os_index(index_name, index_body, corpus, embeddings_int8)

# Search in your index. First we define the query
query = queries[0]
response = invoke_bedrock([query], input_type="search_query", embedding_types=[embed_type])
query_emb = np.array(response[embed_type]).astype('int8')[0]

hits, response_time_int8 = query_os(index_name, query_emb)

print(f"Time taken for retrieval: {response_time_int8}\n\n")    

OpenSearch index created: {'acknowledged': True, 'shards_acknowledged': True, 'index': 'index-int8'}
Indexing of documents finished
Time taken for retrieval: 0.008207 seconds
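
Speed is only half of the trade-off; quantization may also change which documents are returned. One simple way to gauge this is to measure how many of the float baseline's results also appear in the int8 results. The sketch below uses a hypothetical `overlap_at_k` helper and made-up document IDs, not results from this notebook.

```python
from typing import Iterable

def overlap_at_k(baseline_ids: Iterable[str], candidate_ids: Iterable[str]) -> float:
    """Fraction of baseline result IDs that also appear in the candidate results."""
    baseline = set(baseline_ids)
    if not baseline:
        return 0.0
    return len(baseline & set(candidate_ids)) / len(baseline)

# Hypothetical document IDs for illustration only
float_ids = ["0", "7", "12", "33", "41"]
int8_ids = ["0", "7", "12", "41", "58"]
print(overlap_at_k(float_ids, int8_ids))  # 0.8
```

With OpenSearch results, the IDs could be collected as `[hit["_id"] for hit in hits]`, provided the hits from each search are kept in separate variables.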




### Binary embeddings

OpenSearch as deployed above (version 2.11.1) does not support storing binary embeddings, so we use FAISS instead. A binary FAISS index is created from the binary embeddings and a search is performed against it.

**Note that FAISS is an in-memory vector store, so search is typically faster than querying an OpenSearch cluster. This is therefore not an apples-to-apples comparison like the one between float and int8 embeddings. It is included to demonstrate how search is performed with binary embeddings.**

In [61]:
import faiss
import time

embed_type="ubinary"

#Cast embeddings to numpy
embeddings_bin = np.array(embeddings_binary).astype('uint8')

#Add the embeddings to the faiss index
num_dim = 1024   #Dimension in bits for embed-english-v3.0; each vector is stored as 1024 / 8 = 128 bytes
index = faiss.IndexBinaryFlat(num_dim)
index.add(embeddings_bin)

# Search in your index
query = queries[0]
response= invoke_bedrock([query], input_type="search_query", embedding_types=[embed_type])
query_emb = np.array(response[embed_type]).astype('uint8')

start_time = time.time()
hits_scores, hits_doc_ids = index.search(query_emb, k=min(10*5, index.ntotal))
end_time = time.time()

elapsed_time = end_time - start_time
response_time_binary = f"{elapsed_time:.6f} seconds"
print(f"Time taken for retrieval: {response_time_binary}\n\n")

Time taken for retrieval: 0.000177 seconds
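
`faiss.IndexBinaryFlat` performs an exhaustive search using Hamming distance, that is, the number of bit positions in which two vectors differ. The NumPy sketch below illustrates that computation on packed `uint8` vectors. It is an illustration of the distance measure, not FAISS's implementation, and both the `hamming_search` helper and the random data are hypothetical.

```python
import numpy as np

def hamming_search(query: np.ndarray, corpus: np.ndarray, k: int):
    """Exhaustive Hamming-distance search over packed uint8 vectors."""
    # XOR marks the differing bits; unpack them and count per corpus row
    differing = np.bitwise_xor(corpus, query)
    distances = np.unpackbits(differing, axis=1).sum(axis=1)
    order = np.argsort(distances, kind="stable")[:k]
    return distances[order], order

rng = np.random.default_rng(seed=0)
corpus_bin = rng.integers(0, 256, size=(96, 128), dtype=np.uint8)  # hypothetical data
query_bin = corpus_bin[5].copy()
query_bin[0] ^= 0b00000001  # flip a single bit

distances, ids = hamming_search(query_bin, corpus_bin, k=3)
print(ids[0], distances[0])  # 5 1
```

Because the query differs from row 5 in exactly one bit, that row is returned first with a distance of 1.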




### Bringing it all together

In [63]:
from IPython.display import display, HTML
import pandas as pd

data = {
    "Embedding Type": ["Float Embeddings", "Int8 Embeddings", "Binary Embeddings (With FAISS)"],
    "Retrieval Speed (Seconds)": [response_time_float, response_time_int8, response_time_binary]
}
df = pd.DataFrame(data)
display(df.style.set_properties(**{'text-align': 'left'}))

| Embedding Type | Retrieval Speed (Seconds) |
|---|---|
| Float Embeddings | 0.011233 seconds |
| Int8 Embeddings | 0.008207 seconds |
| Binary Embeddings (With FAISS) | 0.000177 seconds |
