# General-purpose text embeddings with E5-Large

This notebook describes how to use the [E5 model](https://arxiv.org/pdf/2212.03533.pdf) (Emb**E**ddings from
bidir**E**ctional **E**ncoder r**E**presentations) to generate text embeddings on the IPU. This [state-of-the-art](https://syncedreview.com/2022/12/13/microsofts-e5-text-embedding-model-tops-the-mteb-benchmark-with-40x-fewer-parameters/) model produces general-purpose text embeddings for any task that requires a single-vector representation of a text, including embedding retrieval, semantic search, clustering and classification. E5 is available both as general-purpose checkpoints trained without labels (unsupervised) and as fine-tuned checkpoints.

!['E5 Model'](images/2Dretrieval.svg)

Here, we demonstrate how to use the fine-tuned E5 large checkpoint for inference over 4 IPUs.

First, install the requirements for running this notebook. This will install `gradio` and `sentence_transformers`:

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
! pip install -r requirements.txt

Next, import the general required modules for the notebook:

In [1]:
import os
import torch
import poptorch
import numpy as np
from tqdm.notebook import tqdm
import logging

We need to instantiate some global parameters that we will use to run the model. Here, we define the model name (the checkpoint which will be downloaded from the Hugging Face Hub) and the micro batch size. The micro batch size is kept small (2 in the cell below), because on-device loops (`device_iterations`, specific to the IPU) are used to reach a much larger effective batch size. A random seed is also set for reproducibility.

The checkpoint (`model_name`) can be directly modified to use one of the [unsupervised](https://github.com/microsoft/unilm/tree/master/e5#english-pre-trained-models) checkpoints.

In [2]:
logger = logging.getLogger("")

model_name = 'intfloat/e5-large'
pod_type = os.getenv("GRAPHCORE_POD_TYPE", "pod4")

n_ipu = 1
micro_batch_size = 2
device_iterations = 256
replication_factor = None

max_seq_len = 512

random_seed = 42
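To make the relationship between these parameters concrete, here is a minimal sketch of how the number of samples consumed per model call is usually derived for PopTorch inference. The replication factor of 4 is an assumption: the notebook passes `None` and leaves the value to `get_ipu_config`, whose source is not shown here. The helper name `samples_per_step` is hypothetical.

```python
# Values copied from the configuration cell above.
micro_batch_size = 2
device_iterations = 256

# Assumption: `replication_factor` is None in the notebook and is resolved
# inside `get_ipu_config`. The value 4 is a guess, not read from that file.
assumed_replication_factor = 4


def samples_per_step(micro_batch_size, device_iterations, replication_factor):
    """Number of samples consumed by one call to the compiled model."""
    return micro_batch_size * device_iterations * replication_factor


combined = samples_per_step(
    micro_batch_size, device_iterations, assumed_replication_factor
)
print(f"samples per model call: {combined}")
```

Under that assumption the product is 2048, which agrees with the `batch len` reported in the inference logs further down. If `get_ipu_config` resolves a different replication factor on your system, the figure changes accordingly.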

Next, use the `transformers` `AutoTokenizer` to instantiate a vocabulary tokenizer for our input text. For this task, we define a maximum input sequence length of 512 and pad each sequence to that length.

In [3]:
from transformers import AutoTokenizer, BatchEncoding

tokenizer = AutoTokenizer.from_pretrained(model_name)

def transform_func(example) -> BatchEncoding:
    return tokenizer(
        example['text'],
        max_length=max_seq_len,
        padding="max_length",
        truncation=True
    )
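As a rough illustration of what `padding="max_length"` and `truncation=True` do to a single sequence, the sketch below pads or cuts a list of token IDs to a fixed length and builds the matching attention mask. The token IDs and the pad ID of 0 are made-up values for illustration, and `pad_or_truncate` is a hypothetical helper, not part of `transformers`.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return fixed-length input_ids plus an attention mask (1 = real token)."""
    ids = list(token_ids[:max_length])      # truncate if too long
    mask = [1] * len(ids)
    n_pad = max_length - len(ids)           # pad if too short
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": mask + [0] * n_pad,
    }


# Illustrative token IDs only; real IDs come from the tokenizer vocabulary.
short = pad_or_truncate([101, 7592, 2088, 102], max_length=8)
long = pad_or_truncate(list(range(1, 21)), max_length=8)

print(short["input_ids"])       # padded up to 8 entries
print(short["attention_mask"])  # zeros mark the padding positions
print(len(long["input_ids"]))   # truncated down to 8 entries
```

Unlike this sketch, the real tokenizer also keeps the special tokens in place when it truncates. The fixed length matters on the IPU because the model is compiled for a static input shape.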

We define some IPU-specific configurations to get the most out of the model. A different configuration is described here for the smaller checkpoint `e5-small`, as it can run directly on a single IPU. The larger model is pipelined over 4 IPUs.

In [4]:
from config import get_ipu_config

ipu_config = get_ipu_config(pod_type, n_ipu, device_iterations, replication_factor, random_seed)



The model config needs to be instantiated for the E5 model. E5 uses a bidirectional encoder, essentially the encoder stage of a BERT model, to generate the trained embeddings. The config will define the architecture of the model, such as the number of encoder layers and size of the hidden dimension within the model.

The larger E5 model is run over 4 IPUs using IPU pipeline parallelism. The stages of the model that run on each IPU are defined by the `PipelinedE5Model` class from `modeling_e5.py`, which subclasses the BERT encoder and uses the `parallelize()` function to define the device information and stage for each set of layers in the model.

To run the model on the IPU, we simply need to import the `PipelinedE5Model`, pass the pretrained config to it and define the custom IPU config for the model, as certain parameters in the IPU config are used within the parallelisation function.

Finally, the model is passed into a `poptorch.inferenceModel()` wrapper to create an IPU-ready executor for it.

In [5]:
from transformers import AutoConfig, AutoModel
from modeling_e5 import PipelinedE5Model

from optimum.graphcore.modeling_utils import to_pipelined

e5_config = AutoConfig.from_pretrained(model_name)
e5_model = PipelinedE5Model.from_pretrained(model_name, config=e5_config).eval().half()
e5_model.ipu_config = ipu_config

ipu_options = ipu_config.to_options(for_inference=True)
e5_model_ipu = poptorch.inferenceModel(e5_model.parallelize(), ipu_options)
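The idea behind pipeline parallelism can be sketched without any IPU code: consecutive encoder layers are grouped into stages, and each stage is placed on its own device. The snippet below is only an illustration of that grouping. The actual split used by `PipelinedE5Model.parallelize()` lives in `modeling_e5.py` and may differ (for example, by giving the embedding layer its own share of the first IPU). The layer count of 24 is an assumption for a large BERT-style encoder; the real value can be read from `e5_config.num_hidden_layers`.

```python
def assign_layers_to_stages(num_layers, num_stages):
    """Split layer indices into contiguous, near-equal pipeline stages."""
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for stage in range(num_stages):
        # The first `extra` stages take one additional layer each.
        size = base + (1 if stage < extra else 0)
        stages.append(list(range(start, start + size)))
        start += size
    return stages


# Assumed values: 24 encoder layers pipelined over 4 devices.
stages = assign_layers_to_stages(num_layers=24, num_stages=4)
for device_id, layers in enumerate(stages):
    print(f"device {device_id}: layers {layers[0]}-{layers[-1]}")
```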

Let's load a dataset to try out the model. Using the Hugging Face `datasets` library, we can load a pre-existing dataset from the Hugging Face Hub. In this case, let's use the `rotten_tomatoes` dataset. Later in the notebook, we will use this dataset to create a basic semantic search functionality.

The dataset first needs to be tokenized. We can use the `map()` method to tokenize each of the inputs of the dataset.

Finally, we can convert the Hugging Face Arrow format dataset to a PyTorch-ready dataset with `set_format`, which converts the tokenized inputs into tensors.

In [6]:
from datasets import Dataset, load_dataset

dataset = load_dataset("rotten_tomatoes")

tokenized_dataset = dataset.map(transform_func, batched=True)

tokenized_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "token_type_ids"])

Found cached dataset rotten_tomatoes (/home/arsalanu/.cache/huggingface/datasets/rotten_tomatoes/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46)


  0%|          | 0/3 [00:00<?, ?it/s]

Loading cached processed dataset at /home/arsalanu/.cache/huggingface/datasets/rotten_tomatoes/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46/cache-0b4c773d2d3df4ba.arrow
Loading cached processed dataset at /home/arsalanu/.cache/huggingface/datasets/rotten_tomatoes/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46/cache-2872aae7edfd2479.arrow
Loading cached processed dataset at /home/arsalanu/.cache/huggingface/datasets/rotten_tomatoes/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46/cache-231657c4476b8f66.arrow


The tokenized dataset is passed to the [`poptorch.DataLoader`](https://docs.graphcore.ai/projects/poptorch-user-guide/en/latest/batching.html) to create an IPU-ready batched dataloader.

In [7]:
from transformers import default_data_collator as data_collator

poptorch_dataloader = poptorch.DataLoader(
    ipu_options,
    tokenized_dataset['train'],
    batch_size=micro_batch_size,
    shuffle=False,
    drop_last=True,
    num_workers=2,
    collate_fn=data_collator
)
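Because the dataloader is created with `drop_last=True`, any samples that do not fill a complete batch are skipped and never receive an embedding. The sketch below works through that arithmetic. Both inputs are assumptions rather than values stated in this notebook: 8530 is the size of the `rotten_tomatoes` train split as I understand it, and 2048 is the per-call batch length reported in the inference logs. `batching_summary` is a hypothetical helper.

```python
def batching_summary(num_samples, samples_per_step, drop_last=True):
    """Count batches, embedded samples and dropped samples for a dataloader."""
    full, remainder = divmod(num_samples, samples_per_step)
    if drop_last:
        # The incomplete trailing batch is discarded.
        return {
            "batches": full,
            "embedded": full * samples_per_step,
            "dropped": remainder,
        }
    # Otherwise a final, smaller batch carries the remainder.
    batches = full + (1 if remainder else 0)
    return {"batches": batches, "embedded": num_samples, "dropped": 0}


# Assumed values: 8530 training samples, 2048 samples per model call.
summary = batching_summary(num_samples=8530, samples_per_step=2048)
print(summary)
```

With these inputs the result is 4 batches, which lines up with the `0/4` progress bar in the inference output. The trailing samples that are dropped can never appear in the search results later on.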

We define a simple `infer()` function which will perform inference iteratively on each batch and return the concatenated list of embeddings for the entire dataset.

In [8]:
import time

def infer(model, dataloader):
    """Run inference batch by batch and return all embeddings concatenated."""
    encoded_embeds = []
    with torch.no_grad():
        for batch_dict in tqdm(dataloader, desc='encoding'):
            # Time a single call to the compiled model
            lat = time.time()
            outputs = model(**batch_dict)
            lat = time.time() - lat

            encoded_embeds.append(outputs)
            batch_len = len(batch_dict['input_ids'])
            print(f"batch len: {batch_len} | batch latency: {lat}s | per_sample: {lat/batch_len}s | throughput: {batch_len/lat} samples/s")

    return torch.cat(encoded_embeds, dim=0)

To run the model, first we pass an arbitrary call to the model using the first batch to ensure we have compiled the model executable (or loaded the already compiled executable).

In [9]:
import time

c = time.time()
e5_model_ipu(**next(iter(poptorch_dataloader)))
print(f"Compile time: {time.time() - c}")

Graph compilation: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:03<00:00]


Compile time: 29.782036304473877


Then, simply call the `infer` function to generate embeddings for the full dataset.

In [10]:
runtime = time.time()
embeddings = infer(e5_model_ipu, poptorch_dataloader)
runtime = time.time() - runtime

e5_model_ipu.detachFromDevice()

encoding:   0%|          | 0/4 [00:00<?, ?it/s]

batch len: 2048 | batch latency: 2.3720037937164307s | per_sample: 0.001158204977400601s | throughput: 863.4050271864089 samples/s
batch len: 2048 | batch latency: 2.368058681488037s | per_sample: 0.0011562786530703306s | throughput: 864.8434331505167 samples/s
batch len: 2048 | batch latency: 2.3680970668792725s | per_sample: 0.0011562973959371448s | throughput: 864.8294145724765 samples/s
batch len: 2048 | batch latency: 2.367774724960327s | per_sample: 0.0011561400024220347s | throughput: 864.9471499170238 samples/s


Let's print out one of the results and the total IPU runtime.

In [11]:
print(f"IPU runtime: {runtime}\n First embedding: {embeddings[0]}\n Shape: {embeddings[0].shape}")

IPU runtime: 9.597113132476807
 First embedding: tensor([-0.0108, -0.0634,  0.0363,  ...,  0.0274, -0.0333, -0.0008],
       dtype=torch.float16)
 Shape: torch.Size([1024])


The embedding vector in its current state doesn't look particularly meaningful. Each sequence is mapped to a single dense vector (1024 dimensions here) that captures the word-level and sentence-level context of its tokens. These pre-trained embeddings can be used in applications like embedding retrieval for recommender systems, or semantic search for query matching using cosine similarity. Both use cases take advantage of the generated embedding space by comparing the embedding of a user input against the stored embeddings using some proximity metric.

We'll use the open source `sentence_transformers` library which provides utilities for embeddings tasks to perform semantic search on a user query to retrieve the most similar sequences from the dataset to the query. This is a helpful utility for making, for example, more responsive FAQs.
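To show what a cosine-similarity search does under the hood, here is a small pure-Python sketch. It is not the `sentence_transformers` implementation. It only mirrors the shape of the results used in this notebook (a list of hits with `corpus_id` and `score`), and the 2-dimensional vectors are toy values chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, corpus, k):
    """Return the k corpus entries most similar to the query, best first."""
    scored = [
        {"corpus_id": i, "score": cosine_similarity(query, vec)}
        for i, vec in enumerate(corpus)
    ]
    return sorted(scored, key=lambda hit: hit["score"], reverse=True)[:k]


# Toy 2-dimensional "embeddings"; real E5-large vectors have 1024 entries.
corpus = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
hits = top_k([1.0, 0.1], corpus, k=2)
for hit in hits:
    print(hit)
```

Vectors that point in nearly the same direction as the query score close to 1, while opposite directions score close to -1. This is why semantically similar sentences end up at the top of the ranking.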

## Semantic search with E5 generated embeddings

Using the `rotten_tomatoes` dataset, let's create a simple similarity search engine using the `semantic_search` function from `sentence_transformers`, which uses cosine similarity to retrieve the sentences in a given set of embeddings that are closest to a given query. We have already generated embeddings for the dataset, so the next step is to do the same with a given query and perform the search.

First, to process the query, we need to tokenize it and convert it to a single-batch input for the model. This has been wrapped into a simple function, which takes a string, tokenizes it and prepares a dictionary of model inputs (`input_ids`, `attention_mask`, and so on).

In [12]:
def prepare_query(query: str):
    t_query = tokenizer(
            query,
            max_length=max_seq_len,
            padding="max_length",
            truncation=True
        )

    return {k: torch.as_tensor([t_query[k]]) for k in t_query}

Next, to perform inference with a single input (that is, an effective batch size of 1), we re-instantiate the model with device iterations, replication and micro batch size all set to 1, and recompile it. The change in batch size necessitates a recompilation, since the input shape to the model has changed. We follow the steps to instantiate the model outlined earlier in the notebook, but force the `get_ipu_config` function to turn all batching off.

In [13]:
ipu_config = get_ipu_config(pod_type, n_ipu=1, device_iterations=1, replication_factor=1, random_seed=random_seed)

inf_model = PipelinedE5Model.from_pretrained(model_name, config=e5_config).eval().half()
inf_model.ipu_config = ipu_config

inf_model = poptorch.inferenceModel(inf_model.parallelize(), ipu_config.to_options(for_inference=True))

inf_model(**prepare_query("Running once to compile"))

Graph compilation: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:03<00:00]


tensor([[-0.0124, -0.0453,  0.0185,  ..., -0.0058,  0.0110,  0.0155]],
       dtype=torch.float16)

In [21]:
from sentence_transformers.util import semantic_search

query = "Strongly disliked this action movie"

query_embeddings = inf_model(**prepare_query(query))
hits = semantic_search(query_embeddings.float(), embeddings.float(), top_k=10)

print(f"\n SEARCH QUERY: {query}")
for n, res in enumerate(hits[0]):
    print(f"\n Result (rank {n+1}) | Score: {res['score']} | Text: {dataset['train']['text'][res['corpus_id']]} ")


 SEARCH QUERY: Strongly disliked this action movie

 Result (rank 1) | Score: 0.8954101204872131 | Text: i hate this movie 

 Result (rank 2) | Score: 0.8791762590408325 | Text: it's a bad action movie because there's no rooting interest and the spectacle is grotesque and boring . 

 Result (rank 3) | Score: 0.8550726771354675 | Text: it is a comedy that's not very funny and an action movie that is not very thrilling ( and an uneasy alliance , at that ) . 

 Result (rank 4) | Score: 0.8523359894752502 | Text: . . . this movie has a glossy coat of action movie excess while remaining heartless at its core . 

 Result (rank 5) | Score: 0.8508359789848328 | Text: 'this movie sucks . ' 

 Result (rank 6) | Score: 0.8469082117080688 | Text: this movie . . . doesn't deserve the energy it takes to describe how bad it is . 

 Result (rank 7) | Score: 0.8436370491981506 | Text: the movie slides downhill as soon as macho action conventions assert themselves . 

 Result (rank 8) | Score: 0.843372

From the results, the pretrained embeddings appear to perform quite well on an unseen dataset without any fine-tuning. Let's turn our semantic search example into a neat Gradio app to demonstrate a mini "search engine":


In [None]:
import gradio as gr
import pandas as pd

def e5_semantic_search(query):
    query_embeddings = inf_model(**prepare_query(query))
    hits = semantic_search(query_embeddings.float(), embeddings.float(), top_k=5)

    results = {'text':[], 'score':[]}
    for n, res in enumerate(hits[0]):
        results['text'].append(f"{dataset['train']['text'][res['corpus_id']]}")
        results['score'].append(res['score'])
        
    return f"{results}"

demo = gr.Interface(
    fn=e5_semantic_search,
    inputs=gr.Textbox(lines=2, placeholder="Really liked this action movie"),
    outputs=gr.Textbox(lines=2, placeholder="what")
)

demo.launch(server_name='0.0.0.0', share=True)

Running on local URL:  http://0.0.0.0:7864
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
