### Embeddings benchmark

This notebook evaluates how quickly various transformer models generate embeddings for a dataset of comments and reviews left on NHS.UK.
- The key objective is to measure and analyze the time taken to generate embeddings for text samples of different sizes. This analysis helps in understanding the computational efficiency and performance trade-offs between models.

It loads a dataset of published reviews from an Azure ML workspace, prepares a random sample of the data, and uses a range of pre-trained transformer models to embed the text. These models were among the top performers on the Hugging Face leaderboard, and we explored them while searching for the best embedding model for our classification tasks.
The process is run twice: once on a basic CPU compute with a smaller sample, which simulates the deployment environment, and once on a GPU with a larger sample.
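
The per-text timing described above can be sketched independently of any particular model. The helper below is a minimal sketch, not the notebook's own code: `time_call`, `runs`, `warmup` and `sync` are hypothetical names introduced for illustration. It assumes `time.perf_counter` is preferable to `time.time` for short intervals, discards warm-up calls, and accepts an optional `sync` callback. On a GPU one would typically pass `torch.cuda.synchronize` so that asynchronous work has finished before the clock is read.

```python
import statistics
import time
from typing import Callable, Optional


def time_call(fn: Callable[[], object],
              runs: int = 10,
              warmup: int = 1,
              sync: Optional[Callable[[], None]] = None) -> dict:
    """Time `fn` over several runs and return summary statistics in seconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")

    # Warm-up calls are executed but not timed (caches, lazy initialisation).
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for any asynchronous work before stopping the clock
        timings.append(time.perf_counter() - start)

    return {
        "mean": statistics.mean(timings),
        "median": statistics.median(timings),
        "min": min(timings),
        "max": max(timings),
        "runs": runs,
    }


# Stand-in workload; in the notebook this would be the tokenize + forward pass.
result = time_call(lambda: sum(i * i for i in range(10_000)), runs=5)
print(sorted(result))
```

Reporting the median alongside the mean is a design choice: a single slow run (for example, the first call after a model is loaded) can pull the mean up noticeably, while the median is less affected.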

In [1]:
import time

from transformers import AutoTokenizer, AutoModel

In [2]:
from azureml.core import Workspace, Dataset

subscription_id = #REDACTED
resource_group = #REDACTED
workspace_name = #REDACTED

workspace = Workspace(subscription_id, resource_group, workspace_name)

dataset = Dataset.get_by_name(workspace, name='published_10k_subset')

In [3]:
df = dataset.to_pandas_dataframe()
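# Take the first N rows by position (.loc[0:N] is label-based and includes row N).
# .copy() lets timing columns be added later without a SettingWithCopyWarning.
sample = df[['Comment Text']].iloc[:5].copy()
sample100 = df[['Comment Text']].iloc[:100].copy()
sample2k = df[['Comment Text']].iloc[:2000].copy()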
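
The cell above takes the first rows of the dataframe. If a reproducible random sample is wanted instead, one option is to draw row positions with a seeded generator. The sketch below is an assumption about how that could be done, not the notebook's method: `sample_positions` and its parameters are hypothetical, and the row count of 10,000 is only inferred from the dataset name. In pandas the same effect is usually achieved with `df.sample(n=..., random_state=...)`.

```python
import random


def sample_positions(n_rows: int, k: int, seed: int = 42) -> list:
    """Return k distinct row positions drawn uniformly from range(n_rows)."""
    if k > n_rows:
        raise ValueError("cannot sample more rows than the dataset contains")
    rng = random.Random(seed)  # local generator so the global state is untouched
    return sorted(rng.sample(range(n_rows), k))


# 10_000 is assumed from the dataset name 'published_10k_subset'.
positions = sample_positions(n_rows=10_000, k=100)
print(len(positions), positions[:3])
# With pandas, the rows would then be selected as: df.iloc[positions]
```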

In [8]:
embeddings_models = ["BAAI/bge-large-en",
                     "BAAI/bge-base-en",
                     "BAAI/bge-small-en", 
                     "thenlper/gte-large", 
                     "thenlper/gte-base", 
                     "sentence-transformers/all-mpnet-base-v2", 
                     "intfloat/e5-large-v2"]
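
One of the models listed, `intfloat/e5-large-v2`, is handled specially in the timing cells: each text is prefixed with `"query: "` before tokenization. A small lookup table is one way to keep that rule in a single place rather than repeating the `if` in every loop. This is a sketch only; `MODEL_PREFIXES` and `prepare_text` are hypothetical names, and only the e5 entry is taken from the notebook.

```python
# Per-model text prefixes. Only the e5 entry comes from the notebook;
# models without an entry are passed through unchanged.
MODEL_PREFIXES = {
    "intfloat/e5-large-v2": "query: ",
}


def prepare_text(model_name: str, text: str) -> str:
    """Return `text` with the prefix registered for `model_name`, if any."""
    return MODEL_PREFIXES.get(model_name, "") + text


print(prepare_text("intfloat/e5-large-v2", "Great service"))
print(prepare_text("BAAI/bge-base-en", "Great service"))
```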

In [6]:
# Create a dictionary to store the tokenizer and model objects for each embedding
embeddings_dict = {}

for embedding in embeddings_models:
    tokenizer = AutoTokenizer.from_pretrained(embedding)
    model = AutoModel.from_pretrained(embedding)
    
    embeddings_dict[embedding] = {
        "tokenizer": tokenizer,
        "model": model
    }

# Iterate over the embeddings and encode the texts
for embedding, embedding_dict in embeddings_dict.items():

    # List to store timings for current embedding
    timings = []

    for text in sample100['Comment Text']:
        start_time = time.time()
        # Prefix text with "query: " if using "intfloat/e5-large-v2" embedding
        if embedding == "intfloat/e5-large-v2":
            text = "query: " + text

        # Truncate to 512 tokens so that all models are compared on equal terms (the other models have a 512-token limit)
        tokens = embedding_dict["tokenizer"](text, return_tensors="pt", max_length=512, truncation=True, padding=True)
        embeddings_out = embedding_dict["model"](**tokens)

        end_time = time.time()

        timings.append(end_time - start_time)
    
    # Add timings as a new column to the dataframe
    sample100[embedding] = timings


# Save the DataFrame
sample100.to_csv("sample100_with_timings.csv")



In [12]:
# Do the embedding process multiple times per text, record the timings for each run, and then compute the average for each text.
# Create a dictionary to store the tokenizer and model objects for each embedding
embeddings_dict = {}

for embedding in embeddings_models:
    tokenizer = AutoTokenizer.from_pretrained(embedding)
    model = AutoModel.from_pretrained(embedding)
    
    embeddings_dict[embedding] = {
        "tokenizer": tokenizer,
        "model": model
    }

# Number of times to embed each text to calculate average time
num_runs = 10

# Iterate over the embeddings and encode the texts
for embedding, embedding_dict in embeddings_dict.items():

    # List to store average timings for current embedding
    avg_timings = []

    for text in sample2k['Comment Text']:
        
        # Prefix text with "query: " if using "intfloat/e5-large-v2" embedding
        if embedding == "intfloat/e5-large-v2":
            text = "query: " + text
        
        # Store timings for multiple runs
        timings_for_text = []

        for _ in range(num_runs):
            start_time = time.time()

            # Specify the maximum length as 512
            tokens = embedding_dict["tokenizer"](text, return_tensors="pt", max_length=512, truncation=True, padding=True)
            embeddings_out = embedding_dict["model"](**tokens)

            end_time = time.time()
            
            timings_for_text.append(end_time - start_time)

        # Compute average timing for the text and append to avg_timings
        avg_time = sum(timings_for_text) / num_runs
        avg_timings.append(avg_time)

    # Add average timings as a new column to the dataframe
    sample2k[embedding] = avg_timings



# Save the DataFrame
sample2k.to_csv("sample2k_avg.csv")


## CPU

In [9]:
# Do the embedding process multiple times per text, record the timings for each run, and then compute the average for each text.
# Create a dictionary to store the tokenizer and model objects for each embedding
embeddings_dict = {}

for embedding in embeddings_models:
    tokenizer = AutoTokenizer.from_pretrained(embedding)
    model = AutoModel.from_pretrained(embedding)
    
    embeddings_dict[embedding] = {
        "tokenizer": tokenizer,
        "model": model
    }

# Number of times to embed each text to calculate average time
num_runs = 10

# Iterate over the embeddings and encode the texts
for embedding, embedding_dict in embeddings_dict.items():

    # List to store average timings for current embedding
    avg_timings = []

    for text in sample100['Comment Text']:
        
        # Prefix text with "query: " if using "intfloat/e5-large-v2" embedding
        if embedding == "intfloat/e5-large-v2":
            text = "query: " + text
        
        # Store timings for multiple runs
        timings_for_text = []

        for _ in range(num_runs):
            start_time = time.time()

            # Specify the maximum length as 512
            tokens = embedding_dict["tokenizer"](text, return_tensors="pt", max_length=512, truncation=True, padding=True)
            embeddings_out = embedding_dict["model"](**tokens)

            end_time = time.time()
            
            timings_for_text.append(end_time - start_time)

        # Compute average timing for the text and append to avg_timings
        avg_time = sum(timings_for_text) / num_runs
        avg_timings.append(avg_time)

    # Add average timings as a new column to the dataframe
    sample100[embedding] = avg_timings



# Save the DataFrame
sample100.to_csv("sample100_CPU_avg.csv")
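
Once the timings are saved, the per-model columns can be reduced to a few summary figures for comparison. The sketch below assumes the CSV layout that `DataFrame.to_csv` produces here: an unnamed index column, `Comment Text`, then one timing column per model. `summarise_timings` is a hypothetical helper, and the numbers in `demo` are made up to show the layout; they are not measured results.

```python
import csv
import io
import statistics


def summarise_timings(csv_file, skip_columns=("", "Comment Text")) -> dict:
    """Return {model_name: (mean_seconds, median_seconds)} for each timing column."""
    reader = csv.DictReader(csv_file)
    columns = {name: [] for name in reader.fieldnames if name not in skip_columns}
    for row in reader:
        for name in columns:
            columns[name].append(float(row[name]))
    return {name: (statistics.mean(vals), statistics.median(vals))
            for name, vals in columns.items()}


# Hypothetical timings, NOT measured results; they only illustrate the layout.
demo = io.StringIO(
    ",Comment Text,model-a,model-b\n"
    "0,first comment,0.10,0.30\n"
    "1,second comment,0.20,0.50\n"
    "2,third comment,0.30,0.40\n"
)
summary = summarise_timings(demo)
# Print models from fastest to slowest by mean time.
for name, (mean_s, median_s) in sorted(summary.items(), key=lambda kv: kv[1][0]):
    print(f"{name}: mean={mean_s:.3f}s median={median_s:.3f}s")
```

To run it against a saved file, open the file with `open("sample100_CPU_avg.csv", newline="")` and pass the file object in place of `demo`.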

