# Binary and int8 Embeddings

## Setup and Introduction
This notebook compares three types of embeddings: float, int8, and binary. We use the Cohere Embed model on Amazon Bedrock to generate the embeddings, then compare their memory footprint and their retrieval speed with FAISS (Facebook AI Similarity Search).

In [None]:
!pip install boto3
!pip install datasets

## Setup Bedrock Client

The `invoke_bedrock` function calls the Amazon Bedrock runtime API. It sends a list of texts to the embedding model and returns their embeddings in each of the requested embedding types.


In [128]:
import boto3
import json

bedrock = boto3.client(service_name="bedrock")
bedrock_runtime = boto3.client(service_name="bedrock-runtime")

def invoke_bedrock(texts, input_type, embedding_types, model_id="cohere.embed-english-v3"):
    """Return a dict that maps each requested embedding type to its list of embeddings."""
    response = bedrock_runtime.invoke_model(
        body=json.dumps({
            "texts": texts,
            "input_type": input_type,
            "truncate": "END",
            "embedding_types": embedding_types
        }),
        modelId=model_id,
        accept="application/json",
        contentType="application/json"
    )
    response_body = json.loads(response.get("body").read())
    # With embedding_types set, "embeddings" is keyed by type, e.g. {"float": [...]}
    return response_body.get("embeddings")
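
The corpus used in this notebook has exactly 96 passages, so it fits in a single request. A larger corpus would have to be sent in batches. The sketch below shows one way to do that, assuming a per-request cap of 96 texts (an assumption that should be checked against the current Bedrock documentation for the Cohere Embed models). `embed_in_batches` and its `embed_fn` parameter are hypothetical names; `embed_fn` stands in for a call such as `invoke_bedrock(batch, input_type, [embed_type])[embed_type]`.

```python
def embed_in_batches(texts, embed_fn, batch_size=96):
    """Embed `texts` in chunks of at most `batch_size`, preserving order.

    `embed_fn` takes a list of texts and returns one embedding per text.
    The default batch_size of 96 is an assumed per-request limit.
    """
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(embed_fn(batch))
    return embeddings


# Stand-in embedder so the sketch runs without AWS credentials:
# it maps each text to a one-element "embedding" holding its length.
def fake_embed(batch):
    return [[len(text)] for text in batch]


demo = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(demo)
```

Passing the embedding call in as a function keeps the batching logic independent of Bedrock, so it can be exercised offline.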

## Setup dataset

We use the MSMARCO dataset, a large-scale dataset for information retrieval tasks. We load a subset of this dataset, extracting passages and queries for our embedding experiments.

In [129]:
from datasets import load_dataset

# Load the MSMARCO dataset from Hugging Face
dataset = load_dataset("ms_marco", "v2.1", split="train[:96]")
passages = dataset['passages']
corpus = [ passage["passage_text"][0] for passage in passages ]
queries = dataset['query']


## Vector Embedding Memory Footprint

This section demonstrates how different embedding types affect memory usage.

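The `calculate_embedding_memory` function computes the memory footprint of a set of embeddings for a given type (float, int8, or binary). It returns the total size in bytes, the number of stored elements per embedding, the number of embeddings, and the size of each element in bytes. For binary embeddings the per-embedding element count is the number of packed bytes, not the number of bits.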

In [130]:
import numpy as np

def calculate_embedding_memory(embeddings, embed_type="float"):

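    if embed_type == "float":
        data_type=np.float32
    elif embed_type == "int8":
        data_type=np.int8
    elif embed_type == "ubinary":
        data_type=np.uint8
    else:
        # Without this branch, data_type would be undefined below
        raise ValueError(f"Unsupported embed_type: {embed_type}")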

    embeddings_np = np.array(embeddings).astype(data_type)
    num_embeddings, embedding_dim = embeddings_np.shape
    
    # Get the size of each element in bytes
    element_size = embeddings_np.itemsize
    
    # Calculate the total size
    total_size = num_embeddings * embedding_dim * element_size
    
    return total_size, embedding_dim, num_embeddings, element_size
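
As a cross-check on the function above, the expected footprint can also be worked out from the counts alone: number of embeddings, times dimensions, times bits per dimension, divided by 8. The sketch below does this for the 96 passages and the 1024-dimensional model used in this notebook. `expected_bytes` is a hypothetical helper name, and the result covers only the raw vectors, not any index overhead.

```python
def expected_bytes(num_embeddings, num_dims, bits_per_dim):
    """Storage needed for the raw vectors, ignoring any index overhead."""
    return num_embeddings * num_dims * bits_per_dim // 8


sizes = {
    name: expected_bytes(96, 1024, bits)
    for name, bits in [("float", 32), ("int8", 8), ("ubinary", 1)]
}
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
# float: 393216 bytes
# int8: 98304 bytes
# ubinary: 12288 bytes
```

Relative to float, int8 is 4x smaller and binary is 32x smaller.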

### Float embeddings

Float embeddings are the standard output of most embedding models. They use 32 bits (4 bytes) per dimension, providing high precision but requiring more memory.

In [131]:
embed_type="float"
response= invoke_bedrock(corpus, input_type="search_document", embedding_types=[embed_type])
embeddings_float = response[embed_type]

# Calculate memory
total_size, embedding_dim, num_embeddings, element_size = calculate_embedding_memory(embeddings_float, embed_type)

print(f"Total Size: {total_size} bytes")
print(f"Total Size (MB): {total_size / (1024 * 1024)} MB")
print(f"Size of each embedding dimension: {element_size} byte(s)")
print(f"Embeddings dimension: {embedding_dim}")
print(f"Number of embeddings: {num_embeddings}")

Total Size: 393216 bytes
Total Size (MB): 0.375 MB
Size of each embedding dimension: 4 byte(s)
Embeddings dimension: 1024
Number of embeddings: 96


### int8 embeddings

Int8 embeddings quantize the float embeddings to 8-bit integers. This reduces memory usage to 1 byte per dimension, potentially sacrificing some precision for efficiency.

In [132]:
embed_type="int8"
response= invoke_bedrock(corpus, input_type="search_document", embedding_types=[embed_type])
embeddings_int8 = response[embed_type]

# Calculate memory
total_size, embedding_dim, num_embeddings, element_size = calculate_embedding_memory(embeddings_int8, embed_type)

print(f"Total Size: {total_size} bytes")
print(f"Total Size (MB): {total_size / (1024 * 1024)} MB")
print(f"Size of each embedding dimension: {element_size} byte(s)")
print(f"Embeddings dimension: {embedding_dim}")
print(f"Number of embeddings: {num_embeddings}")

Total Size: 98304 bytes
Total Size (MB): 0.09375 MB
Size of each embedding dimension: 1 byte(s)
Embeddings dimension: 1024
Number of embeddings: 96
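
Bedrock returns the int8 values directly, so no client-side quantization is needed. To build intuition for what 8-bit quantization trades away, the sketch below applies one simple scheme, symmetric per-vector scaling, to a random vector. This scheme is an illustrative assumption and is not necessarily how the Cohere model produces its int8 output; `quantize_int8` is a hypothetical helper.

```python
import numpy as np


def quantize_int8(x):
    """Symmetric quantization: map [-max|x|, max|x|] onto [-127, 127]."""
    max_abs = float(np.max(np.abs(x)))
    scale = 127.0 / max_abs if max_abs > 0 else 1.0  # avoid dividing by zero
    q = np.round(x * scale).astype(np.int8)
    return q, scale


rng = np.random.default_rng(0)
vec = rng.standard_normal(1024).astype(np.float32)

q, scale = quantize_int8(vec)
restored = q.astype(np.float32) / scale          # approximate reconstruction
max_error = float(np.max(np.abs(vec - restored)))

print(f"float32 bytes: {vec.nbytes}, int8 bytes: {q.nbytes}")
print(f"largest reconstruction error: {max_error:.5f}")
```

The rounding step bounds the per-dimension error by half a quantization step, which is `0.5 / scale` here.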


### Binary embeddings

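Binary embeddings further compress the representation to 1 bit per dimension. With the `ubinary` type the bits are packed eight to a byte as unsigned 8-bit integers, so each 1024-dimensional embedding is returned as 128 `uint8` values. While this dramatically reduces memory usage, it may lose more precision than float or int8 embeddings.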

In [133]:
embed_type="ubinary"
response= invoke_bedrock(corpus, input_type="search_document", embedding_types=[embed_type])
embeddings_binary = response[embed_type]

# Calculate memory
total_size, embedding_dim, num_embeddings, element_size = calculate_embedding_memory(embeddings_binary, embed_type)

print(f"Total Size: {total_size} bytes")
print(f"Total Size (MB): {total_size / (1024 * 1024):.2f} MB")
print(f"Size of each embedding dimension: {element_size} byte(s)")
print(f"Embeddings dimension: {embedding_dim}")
print(f"Number of embeddings: {num_embeddings}")

Total Size: 12288 bytes
Total Size (MB): 0.01 MB
Size of each embedding dimension: 1 byte(s)
Embeddings dimension: 128
Number of embeddings: 96
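
The reported dimension of 128 is the packed byte count: 1024 bits divided by 8. The sketch below reproduces that packing with `np.packbits` on random data. The binarization rule used here (1 where the value is positive, 0 otherwise) is an assumption for illustration; the source does not state how the model derives its bits.

```python
import numpy as np

rng = np.random.default_rng(0)
float_emb = rng.standard_normal((2, 1024)).astype(np.float32)

# Assumed binarization rule: 1 where the value is positive, else 0.
bits = (float_emb > 0).astype(np.uint8)   # shape (2, 1024), one 0/1 value per dimension
packed = np.packbits(bits, axis=1)        # shape (2, 128), eight dimensions per byte

# Unpacking restores the original bit pattern exactly.
unpacked = np.unpackbits(packed, axis=1)

print(packed.shape, packed.dtype)
```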


## Retrieval Speed

This section compares the retrieval speed of different embedding types using FAISS.

In [134]:
!pip install faiss-cpu



### Float embeddings

We create a FAISS index using float embeddings and perform a search. This serves as a baseline for comparison with other embedding types.

In [139]:
import faiss
import time

embed_type="float"

#Cast embeddings to numpy
embeddings_fl = np.array(embeddings_float).astype('float32')

#Add the embeddings to the faiss index
num_dim = 1024   #Use 1024 dimensions for the embed-english-v3.0
index = faiss.IndexFlatL2(num_dim)
index.add(embeddings_fl)

# Search in your index
query = queries[0]
response= invoke_bedrock([query], input_type="search_query", embedding_types=[embed_type])
query_emb = np.array(response[embed_type]).astype('float32')

start_time = time.time()
hits_scores, hits_doc_ids = index.search(query_emb, k=min(10*5, index.ntotal))
end_time = time.time()

elapsed_time = end_time - start_time
elapsed_time_secs = f"{elapsed_time:.6f} seconds"
print(f"Time taken for retrieval: {elapsed_time_secs}\n\n")
# for score, doc_id in zip(hits_scores[0], hits_doc_ids[0]):
#     print(f"{score:.2f}", corpus[doc_id])

Time taken for retrieval: 0.000089 seconds
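
A single measurement at the scale of tens of microseconds is sensitive to noise from the operating system and from cache state. One way to get a steadier number is to repeat the call and take the median, as in the sketch below. `median_seconds` is a hypothetical helper, and the workload shown is a stand-in; in this notebook it would be something like `lambda: index.search(query_emb, k)`.

```python
import statistics
import time


def median_seconds(fn, repeats=100):
    """Call `fn` `repeats` times and return the median wall-clock duration."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()   # monotonic, high-resolution clock
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


# Stand-in workload so the sketch runs without a FAISS index.
elapsed = median_seconds(lambda: sum(range(1000)))
print(f"Median time: {elapsed:.6f} seconds")
```

`time.perf_counter` is used instead of `time.time` because it is designed for measuring short intervals.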




### int8 embeddings

Int8 embeddings are indexed with a FAISS HNSW index that stores each dimension through an 8-bit scalar quantizer. The FAISS Python API works with float32 input, so the int8 values are cast to float32 before they are added; the index then stores them as 8-bit codes. The search time is compared with the float baseline.

In [155]:
import time

embed_type="int8"

#Cast embeddings to numpy (FAISS expects float32 input)
embeddings_i8 = np.array(embeddings_int8).astype('float32')

#Add the embeddings to the faiss index
num_dim = 1024   #Use 1024 dimensions for the embed-english-v3.0
#IndexHNSWSQ takes the quantizer type directly, not a ScalarQuantizer object
index = faiss.IndexHNSWSQ(num_dim, faiss.ScalarQuantizer.QT_8bit, 256)
index.train(embeddings_i8)   #The scalar quantizer must be trained before vectors are added
index.add(embeddings_i8)

# Search in your index
query = queries[0]
response= invoke_bedrock([query], input_type="search_query", embedding_types=[embed_type])
query_emb = np.array(response[embed_type]).astype('float32')

start_time = time.time()
hits_scores, hits_doc_ids = index.search(query_emb, k=min(10*5, index.ntotal))
end_time = time.time()

elapsed_time = end_time - start_time
elapsed_time_secs = f"{elapsed_time:.6f} seconds"
print(f"Time taken for retrieval: {elapsed_time_secs}\n\n")

# for score, doc_id in zip(hits_scores[0], hits_doc_ids[0]):
#     print(f"{score:.2f}", corpus[doc_id])


### Binary embeddings

A binary FAISS index is created using binary embeddings. The search performance is again compared to the other embedding types.

In [141]:
import faiss
import time

embed_type="ubinary"

#Cast embeddings to numpy
embeddings_bin = np.array(embeddings_binary).astype('uint8')

#Add the embeddings to the faiss index
num_dim = 1024   #Use 1024 dimensions for the embed-english-v3.0
index = faiss.IndexBinaryFlat(num_dim)
index.add(embeddings_bin)

# Search in your index
query = queries[0]
response= invoke_bedrock([query], input_type="search_query", embedding_types=[embed_type])
query_emb = np.array(response[embed_type]).astype('uint8')

start_time = time.time()
hits_scores, hits_doc_ids = index.search(query_emb, k=min(10*5, index.ntotal))
end_time = time.time()

elapsed_time = end_time - start_time
elapsed_time_secs = f"{elapsed_time:.6f} seconds"
print(f"Time taken for retrieval: {elapsed_time_secs}\n\n")
# for score, doc_id in zip(hits_scores[0], hits_doc_ids[0]):
#     print(f"{score}", corpus[doc_id])

Time taken for retrieval: 0.000064 seconds
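
`IndexBinaryFlat` performs an exhaustive search by Hamming distance, that is, the number of bit positions in which two packed vectors differ. The sketch below computes the same quantity with NumPy on random packed data, to make explicit what the scores returned by the binary index represent. `hamming_distances` is a hypothetical helper, and the data is synthetic rather than taken from the notebook.

```python
import numpy as np


def hamming_distances(query, corpus):
    """Count differing bits between one packed query and each packed corpus row."""
    xor = np.bitwise_xor(corpus, query)            # 1-bits mark positions that differ
    return np.unpackbits(xor, axis=1).sum(axis=1)  # popcount per row


rng = np.random.default_rng(0)
corpus_bits = rng.integers(0, 256, size=(5, 128), dtype=np.uint8)

# Build a query that is row 3 with exactly two bits flipped.
query_bits = corpus_bits[3].copy()
query_bits[0] ^= 0b00000101

dists = hamming_distances(query_bits, corpus_bits)
nearest = int(np.argmin(dists))
print(dists, nearest)
```

XOR followed by a bit count works on whole bytes at a time, which is part of why binary search can be fast even without an approximate index.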


