# General-purpose text embeddings with E5-Large

This notebook describes how to use the [E5 model](https://arxiv.org/pdf/2212.03533.pdf) (**E**mb**E**ddings from
bidir**E**ctional **E**ncoder r**E**presentations) to generate text embeddings on the IPU. This [state-of-the-art](https://syncedreview.com/2022/12/13/microsofts-e5-text-embedding-model-tops-the-mteb-benchmark-with-40x-fewer-parameters/) model produces general-purpose text embeddings for any task that requires a single-vector representation of text, including retrieval, clustering and classification. E5 is available both as general-purpose checkpoints trained without labels (unsupervised) and as fine-tuned checkpoints.

Here, we demonstrate how to use the fine-tuned E5 large checkpoint for inference over 4 IPUs. To use one of the [unsupervised](https://github.com/microsoft/unilm/tree/master/e5#english-pre-trained-models) checkpoints instead, change the checkpoint name (`model_name`).

First, import the general requirements for running this notebook:

In [137]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [211]:
import os
import torch
import poptorch
import numpy as np
from tqdm.notebook import tqdm
import logging

We need to instantiate some global parameters that we will use to run the model. Here, we define the model name (the checkpoint that will be downloaded from the Hugging Face Hub), the micro batch size and the number of device iterations. The micro batch size is set to 2, and on-device loops (`device_iterations`, specific to the IPU) run 256 micro-batches per call, so each call to the model processes 512 samples per replica. A random seed is also set for reproducibility.

In [212]:
logger = logging.getLogger("")

model_name = 'intfloat/e5-large'
pod_type = os.getenv("GRAPHCORE_POD_TYPE", "pod4")

n_ipu = 1
micro_batch_size = 2
device_iterations = 256
replication_factor = None

max_seq_len = 512

random_seed = 42
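
As a rough check of how these settings interact, the sketch below computes how many samples a single call to the compiled model consumes. It assumes the usual PopTorch inference relationship (micro batch size × device iterations × replication factor). `samples_per_host_call` is a hypothetical helper, and the replication factor of 4 is only an illustrative value, since in this notebook the actual value is chosen inside `get_ipu_config`.

```python
def samples_per_host_call(micro_batch_size: int,
                          device_iterations: int,
                          replication_factor: int = 1) -> int:
    """Samples consumed by one call to the compiled model during inference."""
    return micro_batch_size * device_iterations * replication_factor


# Values from the cell above: micro_batch_size = 2, device_iterations = 256.
per_replica = samples_per_host_call(2, 256)

# Illustrative assumption: four single-IPU replicas on a POD4.
with_four_replicas = samples_per_host_call(2, 256, replication_factor=4)

print(per_replica, with_four_replicas)
```

With four replicas the result is 2048, which is consistent with the `batch len` values reported in the inference log later in the notebook.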

Next, use the `transformers` `AutoTokenizer` to instantiate a vocabulary tokenizer for our input text. For this task we define a maximum input sequence length of 512, truncate longer inputs, and pad each sequence to the maximum sequence length.

In [219]:
from transformers import AutoTokenizer, BatchEncoding

tokenizer = AutoTokenizer.from_pretrained(model_name)

def transform_func(example) -> BatchEncoding:
    return tokenizer(
        example['text'],
        max_length=max_seq_len,
        padding="max_length",
        truncation=True
    )
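
To illustrate what `padding="max_length"` and `truncation=True` do to a single sequence, here is a deliberately simplified sketch in plain Python. It is not the Hugging Face implementation: `pad_or_truncate` and `PAD_ID` are hypothetical names, and special-token handling is ignored. The point is only that every example ends up with the same length, which suits a model compiled for fixed tensor shapes.

```python
from typing import Dict, List

PAD_ID = 0  # assumed padding token id for this toy example


def pad_or_truncate(token_ids: List[int], max_length: int) -> Dict[str, List[int]]:
    """Mimic padding="max_length" with truncation=True for one sequence."""
    kept = token_ids[:max_length]        # truncation: drop tokens past max_length
    n_pad = max_length - len(kept)       # number of padding positions to add
    return {
        "input_ids": kept + [PAD_ID] * n_pad,
        # 1 marks a real token, 0 marks padding that the model should ignore
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


short = pad_or_truncate([101, 7592, 102], max_length=6)
long = pad_or_truncate(list(range(1, 11)), max_length=6)
print(short)
print(long)
```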

We define some IPU-specific configurations to get the most out of the model. A different configuration is provided for the smaller checkpoint, `e5-small`, as it can run directly on a single IPU; the larger model is pipelined over 4 IPUs.

In [214]:
from config import get_ipu_config

ipu_config = get_ipu_config(pod_type, n_ipu, device_iterations, replication_factor, random_seed)

The model config needs to be instantiated for the E5 model. E5 uses a bidirectional encoder, essentially the encoder stage of a BERT model, to generate the trained embeddings. The config will define the architecture of the model, such as the number of encoder layers and size of the hidden dimension within the model.
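
The encoder produces one hidden vector per token, and the E5 paper obtains a single embedding per text by average pooling over the output layer. The sketch below shows masked mean pooling in NumPy on toy shapes. Whether `PipelinedE5Model` pools in exactly this way is an assumption, as `modeling_e5.py` is not shown here, and `masked_mean_pool` is a hypothetical helper.

```python
import numpy as np


def masked_mean_pool(last_hidden: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors while ignoring padded positions.

    last_hidden:    (batch, seq_len, hidden)
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[..., None].astype(last_hidden.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden * mask).sum(axis=1)                   # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1.0, None)               # avoid division by zero
    return summed / counts


# One sequence of three tokens with hidden size 2; the last token is padding.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

The padded position contributes nothing, so the pooled vector is the mean of the first two token vectors only.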

The larger E5 model runs over 4 IPUs using IPU pipeline parallelism. The stages of the model that run on each IPU are defined by the `PipelinedE5Model` class from `modeling_e5.py`, which subclasses the BERT encoder and uses the `parallelize()` function to set the device and pipeline stage for each set of layers in the model.

To run the model on the IPU, we simply need to import the `PipelinedE5Model`, pass the pretrained config to it and define the custom IPU config for the model, as certain parameters in the IPU config are used within the parallelisation function.

Finally, the model is passed into a `poptorch.inferenceModel()` wrapper to create an IPU-ready executor for it.

In [215]:
from transformers import AutoConfig, AutoModel
from modeling_e5 import PipelinedE5Model

from optimum.graphcore.modeling_utils import to_pipelined

e5_config = AutoConfig.from_pretrained(model_name)
e5_model = PipelinedE5Model.from_pretrained(model_name, config=e5_config).eval().half()
e5_model.ipu_config = ipu_config

ipu_options = ipu_config.to_options(for_inference=True)
e5_model_ipu = poptorch.inferenceModel(e5_model.parallelize(), ipu_options)
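
To make the pipelining idea concrete, here is a minimal sketch of how encoder layers could be assigned to IPUs. The helper `assign_layers_to_ipus` and the even 6/6/6/6 split are hypothetical; the count of 24 layers corresponds to a large BERT-style encoder. In the notebook itself, the actual placement is decided by `parallelize()` together with the IPU config.

```python
from typing import List


def assign_layers_to_ipus(layers_per_ipu: List[int]) -> List[int]:
    """Return, for each encoder layer index, the IPU (pipeline stage) it runs on."""
    placement: List[int] = []
    for ipu_id, n_layers in enumerate(layers_per_ipu):
        placement.extend([ipu_id] * n_layers)
    return placement


# Hypothetical even split of 24 encoder layers over 4 IPUs.
placement = assign_layers_to_ipus([6, 6, 6, 6])

for ipu_id in range(4):
    layers = [i for i, p in enumerate(placement) if p == ipu_id]
    print(f"IPU {ipu_id}: layers {layers[0]}-{layers[-1]}")
```

An uneven split (for example, fewer layers on the first IPU to leave room for the embedding table) is a common design choice, but the right balance depends on the memory use of each stage.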

Here, we load a dataset for the model. Using the Hugging Face `datasets` library, we load the `go_emotions` dataset and use the `map()` method to tokenize each of its inputs. A dummy dataset built from a simple dictionary with `Dataset.from_dict` is left commented out as an alternative.

Finally, we convert the Hugging Face Arrow-format dataset to a PyTorch-ready dataset with `set_format`, which returns the tokenized inputs as tensors.

In [220]:
from datasets import Dataset, load_dataset

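# Alternative: a dummy dataset. The column is named 'text' to match transform_func;
# pass `dataset` rather than `dataset['train']` to the dataloader in that case.
# examples = {}
# examples['text'] = ["Some cats don't learn how to eat solid foods till they are five years old."] * 8192
# dataset: Dataset = Dataset.from_dict(examples)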

dataset = load_dataset("go_emotions")


dataset = dataset.map(transform_func, batched=True)

dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "token_type_ids"])

No config specified, defaulting to: go_emotions/simplified
Found cached dataset go_emotions (/home/arsalanu/.cache/huggingface/datasets/go_emotions/simplified/0.0.0/2637cfdd4e64d30249c3ed2150fa2b9d279766bfcd6a809b9f085c61a90d776d)


  0%|          | 0/3 [00:00<?, ?it/s]

Map:   0%|          | 0/43410 [00:00<?, ? examples/s]

Map:   0%|          | 0/5426 [00:00<?, ? examples/s]

Map:   0%|          | 0/5427 [00:00<?, ? examples/s]

The tokenized dataset is passed to the [`poptorch.DataLoader`](https://docs.graphcore.ai/projects/poptorch-user-guide/en/latest/batching.html) to create an IPU-ready batched dataloader.

In [221]:
from transformers import default_data_collator as data_collator

poptorch_dataloader = poptorch.DataLoader(
    ipu_options,
    dataset['train'],
    batch_size=micro_batch_size,
    shuffle=False,
    drop_last=False,
    num_workers=2,
    collate_fn=data_collator
#   mode=poptorch.DataLoaderMode.Async
)

We define a simple `infer()` function which will perform inference iteratively on each batch and return the concatenated list of embeddings for the entire dataset.

In [222]:
import time

def infer(model, dataloader):
    """Run inference batch by batch and return all embeddings as one array."""
    encoded_embeds = []
    with torch.no_grad():
        for batch_dict in tqdm(dataloader, desc='encoding'):
            n_samples = len(batch_dict['input_ids'])

            # Time only the model call, not the host-side copy to NumPy.
            lat = time.time()
            outputs = model(**batch_dict)
            lat = time.time() - lat

            encoded_embeds.append(outputs.cpu().numpy())
            print(f"batch len: {n_samples} | batch latency: {lat}s | per_sample: {lat/n_samples}s | throughput: {n_samples/lat} samples/s")

    return np.concatenate(encoded_embeds, axis=0)

To run the model, we first call it once on the first batch so that the model executable is compiled (or a previously compiled executable is loaded) before we time inference.

In [223]:
import time

c = time.time()
e5_model_ipu(**next(iter(poptorch_dataloader)))
print(f"Compile time: {time.time() - c}")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Graph compilation: 100%|██████████| 100/100 [00:04<00:00]


Compile time: 33.835981369018555


Then, simply call the infer function to generate embeddings for the full dataset.

In [210]:
runtime = time.time()
embeddings = infer(e5_model_ipu, poptorch_dataloader)
runtime = time.time() - runtime

encoding:   0%|          | 0/4 [00:00<?, ?it/s]

batch len: 2048 | batch latency: 2.3971943855285645s | per_sample: 0.0011705050710588694s | throughput: 854.3320526543075 samples/s
batch len: 2048 | batch latency: 2.384589672088623s | per_sample: 0.001164350425824523s | throughput: 858.847970353822 samples/s
batch len: 2048 | batch latency: 2.386845350265503s | per_sample: 0.0011654518311843276s | throughput: 858.0363196853909 samples/s
batch len: 2048 | batch latency: 2.377833366394043s | per_sample: 0.0011610514484345913s | throughput: 861.2882756817264 samples/s
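
As a worked check of the figures in this log, the short sketch below recomputes the per-batch throughput and an overall throughput from the logged batch length and latencies. The numbers are copied from the output above; the variable names are illustrative only.

```python
batch_len = 2048
latencies_s = [
    2.3971943855285645,
    2.384589672088623,
    2.386845350265503,
    2.377833366394043,
]

# Per-batch throughput, as printed by infer(): samples divided by latency.
throughputs = [batch_len / lat for lat in latencies_s]

# Overall throughput over the time spent inside the model calls only.
total_samples = batch_len * len(latencies_s)
overall = total_samples / sum(latencies_s)

print(f"total samples: {total_samples}")
print(f"time in model calls: {sum(latencies_s):.3f}s")
print(f"overall throughput: {overall:.1f} samples/s")
```

The time spent in model calls (about 9.55 s) is slightly below the wall-clock runtime printed at the end of the notebook, presumably because the latter also includes data loading and other host-side overhead.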


Finally, let's print one of the results and the total IPU runtime.

In [224]:
print(f"IPU runtime: {runtime}\n First embedding: {embeddings[0]}\n Shape: {embeddings[0].shape}")

IPU runtime: 9.685287714004517
 First embedding: [-0.03784  -0.09094   0.02756  ... -0.01642   0.01617  -0.004623]
 Shape: (1024,)
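
Since embeddings like these are typically compared with cosine similarity for retrieval, here is a minimal NumPy sketch of that step. Random vectors stand in for the real `embeddings` array so that the example is self-contained, and `cosine_similarity_matrix` is a hypothetical helper rather than part of this notebook's code. Casting to `float32` first is a cautious choice, given that the model above runs in half precision.

```python
import numpy as np


def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between every row of `a` and every row of `b`."""
    a = a.astype(np.float32)
    b = b.astype(np.float32)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


rng = np.random.default_rng(42)

# Stand-in for `embeddings`: 5 vectors with the same 1024-dim shape as above.
corpus = rng.standard_normal((5, 1024)).astype(np.float16)

# A "query" that is a slightly perturbed copy of corpus item 2.
query = corpus[2:3].astype(np.float32) + 0.01 * rng.standard_normal((1, 1024))

scores = cosine_similarity_matrix(query, corpus)   # shape (1, 5)
best = int(scores.argmax())
print(f"best match: item {best}, score {scores[0, best]:.4f}")
```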
