# General-purpose text embeddings with E5-Large

This notebook describes how to use the [E5 model](https://arxiv.org/pdf/2212.03533.pdf) (Emb**E**ddings from
bidir**E**ctional **E**ncoder r**E**presentations) to generate text embeddings on the IPU. This [state-of-the-art](https://syncedreview.com/2022/12/13/microsofts-e5-text-embedding-model-tops-the-mteb-benchmark-with-40x-fewer-parameters/) model produces general-purpose embeddings for any task that requires a single-vector representation of a text, including retrieval, clustering and classification. E5 is available both as unsupervised checkpoints (pre-trained without labels) and as fine-tuned checkpoints.

Here, we demonstrate how to use the fine-tuned E5 large checkpoint for inference over 4 IPUs. To use one of the [unsupervised](https://github.com/microsoft/unilm/tree/master/e5#english-pre-trained-models) checkpoints instead, change the value of `model_name`.

First, import the general requirements for running this notebook:

In [1]:
import torch
import poptorch
import numpy as np
from tqdm.notebook import tqdm
import logging

Next, we define the global parameters used to run the model: the model name (the checkpoint that will be downloaded from the Hugging Face Hub) and the micro batch size. The micro batch size is set to 1 because we use on-device loops (`device iterations`, a feature specific to the IPU) to reach an effective batch size of 32. A random seed is also set for reproducibility.

In [2]:
logger = logging.getLogger("")

model_name = 'intfloat/e5-large'

micro_batch_size = 1

random_seed = 0

In [3]:
%env POPLAR_ENGINE_OPTIONS={"autoReport.all":"true", "autoReport.directory":"./e5-l-prof"}

env: POPLAR_ENGINE_OPTIONS={"autoReport.all":"true", "autoReport.directory":"./e5-l-prof"}
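The `%env` magic above only works inside IPython. Outside a notebook, the same setting could be applied from plain Python. The following is a minimal sketch, assuming the variable is set before the model is compiled; the option names and the report directory are copied from the cell above, and building the value with `json.dumps` avoids quoting mistakes.

```python
import json
import os

# Same options as the %env cell above; the values are JSON strings, not booleans.
engine_options = {
    "autoReport.all": "true",
    "autoReport.directory": "./e5-l-prof",
}

# Assumption: this runs before the model is compiled, so the options are picked up.
os.environ["POPLAR_ENGINE_OPTIONS"] = json.dumps(engine_options)

print(os.environ["POPLAR_ENGINE_OPTIONS"])
```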


Next, use the `transformers` `AutoTokenizer` class to instantiate a tokenizer for the input text. For this task, we set a maximum input sequence length of 512 and pad each sequence to that length.

In [4]:
from transformers import AutoTokenizer, BatchEncoding

tokenizer = AutoTokenizer.from_pretrained(model_name)

def transform_func(example) -> BatchEncoding:
    return tokenizer(
        example['input_texts'],
        max_length=512,
        padding="max_length",
        truncation=True
    )
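To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a toy sketch in plain Python. It is not the Hugging Face implementation: `pad_or_truncate`, `PAD_ID` and the token ids are hypothetical, and a real tokenizer also takes care to keep special tokens when it truncates. The point is only that every sequence leaves the function with the same length, which suits a model compiled for fixed input shapes.

```python
from typing import Dict, List

PAD_ID = 0  # hypothetical padding id, chosen for illustration


def pad_or_truncate(token_ids: List[int], max_length: int) -> Dict[str, List[int]]:
    """Mimic padding="max_length" with truncation=True for one sequence."""
    kept = token_ids[:max_length]      # truncation: drop tokens past max_length
    n_pad = max_length - len(kept)     # padding: fill up to max_length
    return {
        "input_ids": kept + [PAD_ID] * n_pad,
        # 1 marks a real token, 0 marks padding
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


short = pad_or_truncate([101, 2070, 8870, 102], max_length=8)
long = pad_or_truncate(list(range(1, 13)), max_length=8)

print(short)
print(len(long["input_ids"]))
```

The attention mask is what lets the model ignore the padded positions.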

We define some IPU-specific configurations to get the most out of the model. A separate configuration is provided for the smaller `e5-small` checkpoint, which fits on a single IPU. The larger model is pipelined over 4 IPUs.

In [5]:
from optimum.graphcore import IPUConfig

ipu_configs = {}

ipu_configs['intfloat/e5-small'] = {
    "inference_device_iterations": 32,
    "inference_replication_factor": 1,
    "executable_cache_dir": "./exe_cache",
    "matmul_proportion": 0.5,
    "replicated_tensor_sharding": True,
    "ipus_per_replica": 1,
    "profile_dir": "e5-small-profile"
}

ipu_configs['intfloat/e5-small-unsupervised'] = ipu_configs['intfloat/e5-small']

ipu_configs['intfloat/e5-large'] = {
    "embedding_serialization_factor": 2,
    "enable_half_partials": True,
    "executable_cache_dir": "./exe_cache",
    "inference_device_iterations": 32,
    "inference_replication_factor": 1,
    "ipus_per_replica": 4,
    "layers_per_ipu": [3, 7, 7, 7],
    "matmul_proportion": [0.1, 0.15, 0.15, 0.15],
    "profile_dir": "e5-large-profile",
    "recompute_checkpoint_every_layer": True,
    "replicated_tensor_sharding": True,
}

ipu_configs['intfloat/e5-large-unsupervised'] = ipu_configs['intfloat/e5-large']

ipu_config = IPUConfig.from_dict(ipu_configs[model_name]).eval()
ipu_opts = ipu_config.to_options(for_inference=True)

ipu_opts.randomSeed(random_seed)
torch.manual_seed(random_seed)

`replicated_tensor_sharding` is not used when `replication_factor=1`


<torch._C.Generator at 0x7fbdb5e7def0>

The model config needs to be instantiated for the E5 model. E5 uses a bidirectional encoder, essentially the encoder stage of a BERT model, to generate the trained embeddings. The config will define the architecture of the model, such as the number of encoder layers and size of the hidden dimension within the model.

The larger E5 model is run over 4 IPUs using IPU pipeline parallelism. The stages of the model that run on each IPU are defined by the `PipelinedE5Model` class from `modeling_e5.py`, which subclasses the BERT encoder and uses the `parallelize()` function to define the device information and stage for each set of layers in the model.

To run the model on the IPU, we simply need to import the `PipelinedE5Model`, pass the pretrained config to it and define the custom IPU config for the model, as certain parameters in the IPU config are used within the parallelisation function.

Finally, the model is passed into a `poptorch.inferenceModel()` wrapper to create an IPU-ready executor for it.

In [6]:
from transformers import AutoConfig
from modeling_e5 import PipelinedE5Model

e5_config = AutoConfig.from_pretrained(model_name)
e5_model = PipelinedE5Model.from_pretrained(model_name, config=e5_config)
e5_model.ipu_config = ipu_config

e5_model_ipu = poptorch.inferenceModel(e5_model.parallelize(), ipu_opts)
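The `layers_per_ipu` setting from the IPU config determines how the encoder layers are split across the pipeline. The sketch below is not the code in `modeling_e5.py`; it only illustrates how a list such as `[3, 7, 7, 7]` maps layer indices to IPUs. The helper `layer_to_ipu` is hypothetical, and the depth of 24 encoder layers is an assumption (in the notebook it could be read from `e5_config.num_hidden_layers`).

```python
from typing import List


def layer_to_ipu(layers_per_ipu: List[int]) -> List[int]:
    """Return, for each encoder layer index, the IPU (pipeline stage) hosting it."""
    assignment = []
    for ipu_id, n_layers in enumerate(layers_per_ipu):
        assignment.extend([ipu_id] * n_layers)
    return assignment


layers_per_ipu = [3, 7, 7, 7]  # from the e5-large IPU config above
num_hidden_layers = 24         # assumption: encoder depth of e5-large

assignment = layer_to_ipu(layers_per_ipu)
assert len(assignment) == num_hidden_layers, "layers_per_ipu must cover every layer"

for ipu_id in range(len(layers_per_ipu)):
    layers = [i for i, ipu in enumerate(assignment) if ipu == ipu_id]
    print(f"IPU {ipu_id}: layers {layers[0]}-{layers[-1]}")
```

The first IPU is given fewer encoder layers, presumably because it also has to hold the embedding layer.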

Here, we create a dummy dataset for the model. Using the Hugging Face `datasets` library we can create the dataset from a simple dictionary of `'input_texts'` and use the `map()` method to tokenize each of the inputs of the dataset.

Finally, we convert the dataset from the Hugging Face Arrow format to a PyTorch-ready dataset with `set_format`, which returns the tokenized inputs as tensors.

In [7]:
from datasets import Dataset

examples = {}
examples['input_texts'] = ["Some cats don't learn how to eat solid foods till they are five years old."] * 1024
dataset: Dataset = Dataset.from_dict(examples)

dataset = dataset.map(transform_func, batched=True)

dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "token_type_ids"])

Map:   0%|          | 0/1024 [00:00<?, ? examples/s]

The tokenized dataset is passed to [`poptorch.DataLoader`](https://docs.graphcore.ai/projects/poptorch-user-guide/en/latest/batching.html) to create an IPU-ready batched dataloader.

In [8]:
from transformers import default_data_collator as data_collator

poptorch_dataloader = poptorch.DataLoader(
    ipu_opts,
    dataset,
    batch_size=micro_batch_size,
    shuffle=False,
    drop_last=False,
    num_workers=2,
    collate_fn=data_collator
#   mode=poptorch.DataLoaderMode.Async
)
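Although `batch_size` is set to the micro batch size of 1, each batch yielded by the dataloader holds enough samples for all device iterations and replicas. The arithmetic can be sketched as follows, using the values from this notebook (the variable names other than `micro_batch_size` are chosen for illustration):

```python
import math

micro_batch_size = 1     # from the notebook
device_iterations = 32   # "inference_device_iterations" in the IPU config
replication_factor = 1   # "inference_replication_factor" in the IPU config
num_samples = 1024       # size of the dummy dataset

# Samples consumed by one call to the compiled model
samples_per_step = micro_batch_size * device_iterations * replication_factor

# Number of host-side steps needed to cover the dataset
steps = math.ceil(num_samples / samples_per_step)

print(samples_per_step, steps)
```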

We define a simple `infer()` function which will perform inference iteratively on each batch and return the concatenated list of embeddings for the entire dataset.

In [12]:
import time


def infer(model, dataloader):
    """Run inference over every batch and return all embeddings as one array."""
    encoded_embeds = []
    with torch.no_grad():
        for batch_dict in tqdm(dataloader, desc='encoding'):
            lat = time.time()

            outputs = model(**batch_dict)
            encoded_embeds.append(outputs.cpu().numpy())

            # Latency covers the model call and the copy of the outputs to NumPy
            lat = time.time() - lat
            print(f"batch latency: {lat}s | per_sample: {lat/len(batch_dict['input_ids'])}")

    return np.concatenate(encoded_embeds, axis=0)

To run the model, we first make a single call with the first batch. This ensures that the model executable has been compiled (or loaded from the cache if it was compiled previously) before we time inference.

In [10]:
import time

c = time.time()
e5_model_ipu(**next(iter(poptorch_dataloader)))
print(f"Compile time: {time.time() - c}")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Graph compilation: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [02:46<00:00]


Compile time: 212.9538974761963
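The reason for the separate warm-up call is that the first call pays a one-off cost that later calls do not. The sketch below shows the pattern with a hypothetical stand-in, `LazyModel`, whose first call sleeps briefly to imitate compilation; it does not use PopTorch.

```python
import time


class LazyModel:
    """Hypothetical stand-in for a model that compiles on its first call."""

    def __init__(self) -> None:
        self.compiled = False

    def __call__(self, x: int) -> int:
        if not self.compiled:
            time.sleep(0.2)  # pretend compilation cost
            self.compiled = True
        return x * 2


model = LazyModel()

start = time.perf_counter()
model(1)  # warm-up: pays the one-off cost
warmup_s = time.perf_counter() - start

start = time.perf_counter()
model(1)  # steady state: no compilation
steady_s = time.perf_counter() - start

print(f"warm-up: {warmup_s:.3f}s, steady state: {steady_s:.6f}s")
```

Timing only the calls after the warm-up keeps the compile time out of the latency figures.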


Then, simply call the infer function to generate embeddings for the full dataset.

In [13]:
runtime = time.time()
embeddings = infer(e5_model_ipu, poptorch_dataloader)
runtime = time.time() - runtime

encoding:   0%|          | 0/32 [00:00<?, ?it/s]

batch latency: 0.4789144992828369s | per_sample: 0.014966078102588654
batch latency: 0.4763667583465576s | per_sample: 0.014886461198329926
batch latency: 0.480635404586792s | per_sample: 0.01501985639333725
batch latency: 0.4762880802154541s | per_sample: 0.01488400250673294
batch latency: 0.476625919342041s | per_sample: 0.014894559979438782
batch latency: 0.47629594802856445s | per_sample: 0.01488424837589264
batch latency: 0.4761989116668701s | per_sample: 0.014881215989589691
batch latency: 0.4761807918548584s | per_sample: 0.014880649745464325
batch latency: 0.4760255813598633s | per_sample: 0.014875799417495728
batch latency: 0.4758172035217285s | per_sample: 0.014869287610054016
batch latency: 0.4763762950897217s | per_sample: 0.014886759221553802
batch latency: 0.4763514995574951s | per_sample: 0.014885984361171722
batch latency: 0.4764840602874756s | per_sample: 0.014890126883983612
batch latency: 0.4766852855682373s | per_sample: 0.014896415174007416
batch latency: 0.4760186
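The per-batch figures can be turned into a throughput estimate. The sketch below uses the first five latencies printed above and assumes 32 samples per batch (micro batch size 1 with 32 device iterations):

```python
from statistics import mean

# First five batch latencies in seconds, copied from the output above
latencies = [
    0.4789144992828369,
    0.4763667583465576,
    0.480635404586792,
    0.4762880802154541,
    0.476625919342041,
]
samples_per_batch = 32  # micro batch size 1 x 32 device iterations

mean_latency = mean(latencies)
throughput = samples_per_batch / mean_latency

print(f"mean batch latency: {mean_latency:.4f}s")
print(f"throughput: {throughput:.1f} samples/s")
```

This measures only the time spent inside each model call, so it excludes data loading and logging overhead.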

Finally, let's print out one of the results and the total IPU runtime.

In [14]:
print(f"IPU runtime: {runtime}\n First embedding: {embeddings[0]}\n Shape: {embeddings[0].shape}")

IPU runtime: 32.74954581260681
 First embedding: [-0.03733141 -0.09052704  0.02737014 ... -0.01642533  0.0154235
 -0.00438281]
 Shape: (1024,)
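For retrieval or clustering, embeddings like these are typically compared with cosine similarity. The following is a minimal sketch in plain Python, using toy 3-dimensional vectors in place of the 1024-dimensional embeddings; `cosine_similarity` is a helper written for this illustration, not part of the notebook.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings
query = [0.2, 0.1, 0.7]
same_direction = [0.4, 0.2, 1.4]  # query scaled by 2
orthogonal = [0.1, -0.2, 0.0]     # dot product with query is 0

sim_same = cosine_similarity(query, same_direction)
sim_orth = cosine_similarity(query, orthogonal)

print(sim_same, sim_orth)
```

Because every input text in the dummy dataset is identical, the embeddings produced in this notebook should all be nearly the same vector, so their pairwise cosine similarity would be close to 1.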
