# General-purpose text embeddings on the IPU

This notebook shows how to use supported embeddings models to generate state-of-the-art text embeddings on the IPU. You can use either of the following:
* [E5 model](https://arxiv.org/pdf/2212.03533.pdf) (Emb**E**ddings from bidir**E**ctional **E**ncoder r**E**presentations).
* [Sentence Transformers MPNet Base V2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2), an embeddings model based on the MPNet base model.

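Here, we demonstrate how to use a fine-tuned embeddings model for inference, and then show how to use the embeddings in an example semantic search application. The configuration cell below selects `sentence-transformers/all-mpnet-base-v2` by default and lists the `intfloat/e5-large` and `intfloat/e5-small` checkpoints as alternatives.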

First, install the requirements for running this notebook:

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
! pip install sentence-transformers
! pip install --find-links https://download.pytorch.org/whl/cpu/torch_stable.html torch==1.13.1+cpu

Next, import the required modules for the notebook:

In [1]:
import os
import torch
import poptorch
import numpy as np
from tqdm.notebook import tqdm
import logging

We need to instantiate some global parameters that will be used to run the model. Here, we define the model name (the checkpoint which will be downloaded from the Hugging Face Hub) and the effective batch size. 

The **micro batch size** (the number of samples each model replica processes in a single forward pass) is set to a small value of 2 because it has the greatest effect on device memory.

We use on-IPU loops (**device iterations**) to process several micro batches sequentially on the device within a single dataloader call. This extends the effective batch size and improves throughput, because it is more efficient than loading many small batches from the host.

Data parallelism is controlled by the **replication factor**, that is, the number of copies of the model that run in parallel on separate IPUs. This value is set to `None` by default because it is determined automatically from the `pod_type` of the machine being used. By default, the model itself requires 1 IPU to run, so on an IPU POD4 (4 IPUs) the replication factor is set to 4. Similarly, on an IPU POD16, it is set to 16. This can be overridden with a different value if needed: with `replication_factor=N`, the model is replicated `N` times as long as `N * model_ipu <= n_ipu`, where `model_ipu` is the number of IPUs a single instance of the model uses and `n_ipu` is the total number of available IPUs.

The total effective batch size for inference is calculated by:
```
effective_batch_size = replication_factor * device_iterations * micro_batch_size
```
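
As a quick worked example of the formula above, the snippet below plugs in the values from the configuration cell. The replication factor of 4 is an assumption: it corresponds to running the single-IPU model on a 4-IPU machine.

```python
# Values taken from the configuration cell in this notebook.
micro_batch_size = 2
device_iterations = 256

# Assumption: a 4-IPU machine with a 1-IPU model, so 4 replicas.
replication_factor = 4

effective_batch_size = replication_factor * device_iterations * micro_batch_size
print(effective_batch_size)  # 2048
```

This matches the `batch len: 2048` reported in the inference logs later in the notebook.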

The model itself, through model pipelining, can also be run over **2** or **4** IPUs (by setting `model_ipu` to 2 or 4), in which case the replication factor will be adjusted accordingly. The reason we might want to spread the model over more IPUs is to reduce the memory consumption of the model over a single machine (e.g., with 4 IPUs, we compute far fewer layers per IPU, while with 1 IPU, all model layers are on a single IPU) allowing for higher batch sizes to be used. This is particularly beneficial on an IPU POD16 machine, as the 4-IPU pipelined version of the model can be run at a higher effective batch size (with higher micro batch size) and achieve even higher overall batched throughput.
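
The relationship between the pipelined model size and the replication factor can be sketched as follows. The helper `default_replication_factor` is hypothetical and written only for illustration; it is not the notebook's `get_ipu_config`, whose implementation is not shown here.

```python
def default_replication_factor(available_ipus: int, ipus_per_model: int) -> int:
    """Largest number of model copies that fit on the available IPUs.

    Hypothetical helper: it only encodes the constraint
    replication_factor * ipus_per_model <= available_ipus.
    """
    if ipus_per_model < 1:
        raise ValueError("ipus_per_model must be at least 1")
    if available_ipus < ipus_per_model:
        raise ValueError("not enough IPUs for a single copy of the model")
    return available_ipus // ipus_per_model


# Compare a 4-IPU and a 16-IPU machine for 1-, 2- and 4-IPU versions of the model.
for available in (4, 16):
    for per_model in (1, 2, 4):
        replicas = default_replication_factor(available, per_model)
        print(f"{available} IPUs, {per_model} per model -> {replicas} replicas")
```

Spreading the model over more IPUs lowers the replication factor, so the micro batch size has to grow for the effective batch size to stay the same or increase.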

The maximum sequence length for the tokenizer is set to 512, as this is the default maximum positional embeddings value based on the bidirectional encoder configuration for the pre-trained checkpoint.

The checkpoint (`model_name`) can be directly modified to use one of the [unsupervised](https://github.com/microsoft/unilm/tree/master/e5#english-pre-trained-models) checkpoints. 

In [2]:
logger = logging.getLogger("")

model_name = 'sentence-transformers/all-mpnet-base-v2' #'intfloat/e5-large' #'intfloat/e5-small'
n_ipu = int(os.getenv("NUM_AVAILABLE_IPU", 4))  # environment variables are strings, so cast to int

model_ipu = 1
micro_batch_size = 2
device_iterations = 256
replication_factor = None

max_seq_len = 512

random_seed = 42

Next, use the `transformers` `AutoTokenizer` to instantiate a vocabulary tokenizer for our input text. For this task, we define a maximum input sequence length of 512 and pad each sequence to that length.

In [3]:
from transformers import AutoTokenizer, BatchEncoding

tokenizer = AutoTokenizer.from_pretrained(model_name)

def transform_func(example) -> BatchEncoding:
    return tokenizer(
        example['text'],
        max_length=max_seq_len,
        padding="max_length",
        truncation=True
    )
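
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a toy sketch that operates on plain lists of token IDs. It is not the Hugging Face tokenizer implementation: the function name `pad_or_truncate` and the default `pad_id=0` are assumptions for illustration, and a real tokenizer also keeps its special tokens in place when truncating. The point is that every example leaves with the same length, which gives the compiled model a fixed input shape.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return fixed-length input IDs and the matching attention mask."""
    ids = list(token_ids[:max_length])       # truncate sequences that are too long
    n_pad = max_length - len(ids)            # padding needed for short sequences
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,  # 1 = real token, 0 = padding
    }


short = pad_or_truncate([101, 7592, 102], max_length=5)
long = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)
print(short)
print(long)
```

The attention mask produced here plays the same role as the `attention_mask` column used later for pooling: positions marked 0 are ignored.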

The model config needs to be instantiated for the E5 model. E5 uses a bidirectional encoder, essentially the encoder stage of a BERT model, to generate the trained embeddings. The config will define the architecture of the model, such as the number of encoder layers and size of the hidden dimension within the model.

We also define some IPU-specific configurations to get the most out of the model. The `get_ipu_config` function sets up the IPU config according to the model config, taking into consideration the number of IPUs used for model parallelism, the number of IPUs available and the batching configuration.

In [4]:
from config import get_ipu_config
from transformers import AutoConfig

model_config = AutoConfig.from_pretrained(model_name)

ipu_config = get_ipu_config(
    model_config, n_ipu, model_ipu, device_iterations, replication_factor, random_seed)



To run the model on the IPU, we use a simple wrapper class for embeddings models called `IPUEmbeddingsModel`. This loads the embeddings model and performs pooling and normalisation on the output. Let's write this class out here to see what it does:

In [5]:
import torch
import logging
from typing import Optional, List

from transformers import AutoModel
from optimum.graphcore.modeling_utils import to_pipelined

logger = logging.getLogger("e5")

class IPUEmbeddingsModel(torch.nn.Module):
    def __init__(self, model_name, model_config, ipu_config, fp16=True):
        super().__init__()
        self.model = AutoModel.from_pretrained(model_name, config=model_config)
        print(self.model)
        self.model = to_pipelined(self.model, ipu_config)
        self.model = self.model.parallelize()
        if fp16: self.model = self.model.half()
    
    def pool(
        self, 
        last_hidden_states: torch.Tensor,
        attention_mask: torch.Tensor,
        pool_type: str
        ) -> torch.Tensor:
             
        last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    
        if pool_type == "avg":
            emb = last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
        elif pool_type == "cls":
            emb = last_hidden[:, 0]
        else:
            raise ValueError(f"pool_type {pool_type} not supported")

        return emb
    
    def forward(self, pool_type: str ='avg', **kwargs) -> torch.Tensor:

        outputs = self.model(**kwargs)
        embeds = self.pool(outputs.last_hidden_state, kwargs["attention_mask"], pool_type=pool_type)
        embeds = torch.nn.functional.normalize(embeds, p=2, dim=-1)

        return embeds

The `IPUEmbeddingsModel` class instantiates the Transformers model from the checkpoint (`model_name`) and applies IPU parallelisation to it over the defined number of IPUs, applying some optimisations at the same time. Then, it can perform a forward pass using any supported model along with the pooling and normalisation required for embeddings models. 
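
The pooling and normalisation steps can be illustrated on a tiny example without any tensors. The sketch below uses plain Python lists and hypothetical helper names (`mean_pool`, `l2_normalize`); it mirrors the `"avg"` branch of `pool()` followed by the L2 normalisation in `forward()`, but it is not the code that runs on the IPU.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average only the token vectors whose mask entry is 1."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def l2_normalize(vector):
    """Scale the vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


# Three token vectors; the last one is padding and must not affect the result.
tokens = [[3.0, 0.0], [0.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)      # [1.5, 2.0]
embedding = l2_normalize(pooled)      # [0.6, 0.8]
print(pooled, embedding)
```

Because the output has unit length, the cosine similarity between two embeddings reduces to their dot product.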

To run the model on the IPU, the IPU config first needs to be converted to a `poptorch.Options` object and passed alongside the model to the `poptorch.inferenceModel` wrapper:

In [6]:
import modeling_bert, modeling_mpnet

model = IPUEmbeddingsModel(
    model_name = model_name,
    model_config = model_config,
    ipu_config = ipu_config
)

ipu_options = ipu_config.to_options(for_inference=True)
model = poptorch.inferenceModel(model, ipu_options)

MPNetModel(
  (embeddings): MPNetEmbeddings(
    (word_embeddings): Embedding(30527, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): MPNetEncoder(
    (layer): ModuleList(
      (0): MPNetLayer(
        (attention): MPNetAttention(
          (attn): MPNetSelfAttention(
            (q): Linear(in_features=768, out_features=768, bias=True)
            (k): Linear(in_features=768, out_features=768, bias=True)
            (v): Linear(in_features=768, out_features=768, bias=True)
            (o): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (intermediate): MPNetIntermediate(
          (dense): Linear(in_features

Let's load a dataset to try out the model. Using the Hugging Face `datasets` library, we can load a pre-existing dataset from the Hugging Face Hub. In this case, let's use the `rotten_tomatoes` film review dataset. Later in the notebook, we will use this dataset to create a basic semantic search functionality.

The dataset first needs to be tokenized. We can use the `map()` method to tokenize each of the inputs of the dataset.

Finally, we can convert the Hugging Face Arrow format dataset to a PyTorch-ready dataset with `set_format`, which converts the tokenized inputs into tensors.

In [7]:
from datasets import Dataset, load_dataset

dataset = load_dataset("rotten_tomatoes")

tokenized_dataset = dataset.map(transform_func, batched=True)

tokenized_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])

Found cached dataset rotten_tomatoes (/home/arsalanu/.cache/huggingface/datasets/rotten_tomatoes/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46)


  0%|          | 0/3 [00:00<?, ?it/s]

Loading cached processed dataset at /home/arsalanu/.cache/huggingface/datasets/rotten_tomatoes/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46/cache-264068b7cd1b9008.arrow
Loading cached processed dataset at /home/arsalanu/.cache/huggingface/datasets/rotten_tomatoes/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46/cache-e00562798c9ffcfe.arrow


Map:   0%|          | 0/1066 [00:00<?, ? examples/s]

The tokenized dataset is passed to the [`poptorch.DataLoader`](https://docs.graphcore.ai/projects/poptorch-user-guide/en/latest/batching.html) to create an IPU-ready batched dataloader.

In [8]:
from transformers import default_data_collator as data_collator

poptorch_dataloader = poptorch.DataLoader(
    ipu_options,
    tokenized_dataset['train'],
    batch_size=micro_batch_size,
    shuffle=False,
    drop_last=True,
    num_workers=2,
    collate_fn=data_collator
)
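
Because the dataloader is created with `drop_last=True`, any samples that do not fill a complete effective batch are skipped. The sketch below estimates how many batches a full pass yields. It assumes the `train` split holds 8,530 examples (a figure not stated in this notebook) and an effective batch size of 2048; `count_batches` is a hypothetical helper, not part of PopTorch.

```python
def count_batches(num_samples, batch_size, drop_last):
    """Return (number of batches, number of samples left unprocessed)."""
    full, remainder = divmod(num_samples, batch_size)
    if drop_last:
        return full, remainder               # the incomplete batch is skipped
    return full + (1 if remainder else 0), 0


# Assumption: 8,530 training examples and an effective batch size of 2048.
n_batches, n_dropped = count_batches(8530, 2048, drop_last=True)
print(n_batches, n_dropped)
```

Under these assumptions, embeddings are produced only for the first `n_batches * 2048` examples. Since `shuffle=False`, row `i` of the resulting embeddings still corresponds to example `i` of the split.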

We define a simple `infer()` function which will perform inference iteratively on each batch and return the concatenated list of embeddings for the entire dataset.

In [9]:
import time

def infer(model, dataloader):
    """Embed every batch in `dataloader` and return all embeddings as one tensor."""
    encoded_embeds = []
    with torch.no_grad():
        for batch_dict in tqdm(dataloader, desc='encoding'):
            batch_len = len(batch_dict['input_ids'])

            # Time a single call to the compiled model
            lat = time.time()
            outputs = model(**batch_dict)
            lat = time.time() - lat

            encoded_embeds.append(outputs)
            print(f"batch len: {batch_len} | batch latency: {lat}s | per_sample: {lat/batch_len}s | throughput: {batch_len/lat} samples/s")

    # Concatenate the per-batch outputs along the batch dimension
    return torch.cat(encoded_embeds, dim=0)

To run the model, we first make a warm-up call using the first batch. This ensures that the model executable has been compiled (or that an already compiled executable has been loaded) before we measure inference.

In [10]:
import time

c = time.time()
model(**next(iter(poptorch_dataloader)))
print(f"Compile time: {time.time() - c}")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Graph compilation: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [02:21<00:00]


Compile time: 156.66175532341003


Then, simply call the `infer` function to generate embeddings for the full dataset.

In [11]:
runtime = time.time()
embeddings = infer(model, poptorch_dataloader)
runtime = time.time() - runtime

model.detachFromDevice()

encoding:   0%|          | 0/4 [00:00<?, ?it/s]

batch len: 2048 | batch latency: 0.9031727313995361s | per_sample: 0.00044100231025367975s | throughput: 2267.5618171359815 samples/s
batch len: 2048 | batch latency: 0.8890142440795898s | per_sample: 0.00043408898636698723s | throughput: 2303.675125161044 samples/s
batch len: 2048 | batch latency: 0.888725996017456s | per_sample: 0.00043394824024289846s | throughput: 2304.4222957103348 samples/s
batch len: 2048 | batch latency: 0.8889319896697998s | per_sample: 0.00043404882308095694s | throughput: 2303.8882881926033 samples/s
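
The per-batch figures above can be combined into a single throughput number. The sketch below copies the four logged latencies by hand and only does the arithmetic; it measures nothing itself.

```python
# Per-batch latencies in seconds, copied from the log output above.
latencies_s = [
    0.9031727313995361,
    0.8890142440795898,
    0.888725996017456,
    0.8889319896697998,
]
batch_size = 2048

total_samples = batch_size * len(latencies_s)
total_time_s = sum(latencies_s)

throughput = total_samples / total_time_s          # samples per second overall
mean_latency_s = total_time_s / len(latencies_s)   # average time per batch

print(f"{total_samples} samples in {total_time_s:.3f}s "
      f"-> {throughput:.0f} samples/s, {mean_latency_s:.3f}s per batch")
```

This figure covers only the time spent inside the model calls. The wall-clock runtime measured around `infer()` is slightly higher because it also includes data loading and logging.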


Let's print out one of the results, and the total IPU runtime.

In [12]:
print(f"IPU runtime: {runtime}\n First embedding: {embeddings[0]}\n Shape: {embeddings[0].shape}")

IPU runtime: 3.665679454803467
 First embedding: tensor([ 1.4702e-02, -7.2594e-03,  1.1894e-02, -1.5137e-02,  5.8746e-03,
        -4.0703e-03, -2.7069e-02,  8.7814e-03,  6.8398e-03,  1.9318e-02,
        -1.7578e-02,  1.9852e-02, -6.7322e-02,  3.0914e-02,  8.8379e-02,
        -6.8542e-02,  2.5558e-02, -1.6830e-02,  2.5696e-02, -2.9953e-02,
         1.6052e-02, -2.5848e-02, -8.2245e-03, -1.3725e-02, -3.2379e-02,
        -4.5380e-02,  4.3091e-02,  5.2246e-02,  4.3182e-02, -2.6886e-02,
        -7.5562e-02, -7.0877e-03,  1.4450e-02,  5.1231e-03,  1.8477e-06,
        -4.3365e-02,  9.6817e-03, -3.0594e-02, -5.8365e-03, -4.5471e-02,
        -7.0129e-02, -2.9659e-04, -4.1534e-02, -2.8900e-02,  3.7781e-02,
        -6.2042e-02, -1.4290e-02,  2.6337e-02, -5.9814e-02, -5.4893e-03,
         1.9455e-02, -5.2948e-02, -1.1711e-02, -2.5604e-02,  3.4790e-02,
        -1.0109e-02, -2.0233e-02,  5.6458e-02, -7.1096e-04,  1.4687e-02,
        -3.5126e-02,  2.8824e-02, -2.4933e-02,  1.7426e-02,  9.2468e-02,
  

On its own, the embedding vector doesn't look particularly meaningful. Each embedding is a dense, fixed-length numerical representation of a whole input sequence, pooled from the contextual representations of its tokens. These pre-trained embeddings can be used in applications like embedding retrieval for recommender systems, or semantic search for query matching using cosine similarity. Both use cases exploit the embedding space by comparing the embedding of a user input against stored embeddings using a proximity metric.

We'll use the open source `sentence_transformers` library, which provides utilities for embeddings tasks, to perform semantic search: given a user query, we retrieve the sequences from the dataset that are most similar to it. This is a helpful utility for making, for example, more responsive FAQs.

## Semantic search with E5 generated embeddings

Using the `rotten_tomatoes` dataset, let's create a simple similarity search engine using the `sentence_transformers` semantic search function, which uses cosine similarity to retrieve the sentences in a given set of embeddings that are closest to a given query. We have already generated embeddings for the dataset, so the next step is to do the same with a given query and perform the search.
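
To show what a cosine-similarity search does, here is a minimal brute-force sketch in plain Python. It is not the `sentence_transformers` implementation; `cosine_similarity` and `top_k` are hypothetical helpers, and the result dictionaries simply reuse the `corpus_id` and `score` keys that appear later in this notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, corpus, k):
    """Score every corpus vector against the query and keep the k best."""
    scored = [
        {"corpus_id": i, "score": cosine_similarity(query, vec)}
        for i, vec in enumerate(corpus)
    ]
    return sorted(scored, key=lambda hit: hit["score"], reverse=True)[:k]


# Toy 2-dimensional "embeddings"; real embeddings have hundreds of dimensions.
corpus = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
query = [0.8, 0.6]

hits = top_k(query, corpus, k=2)
print(hits)
```

In this toy corpus the second vector is the closest match to the query, followed by the first.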

First, to process the query, we need to tokenize it and convert it to a single-batch input for the model. This has been wrapped into a simple function that takes a string, tokenizes it and prepares a dictionary of model inputs (`input_ids`, `attention_mask`, etc.).

In [13]:
def prepare_query(query: str):
    t_query = tokenizer(
            query,
            max_length=max_seq_len,
            padding="max_length",
            truncation=True
        )

    return {k: torch.as_tensor([t_query[k]]) for k in t_query}

Next, to perform inference with a single input (that is, an effective batch size of 1), we re-instantiate the model with device iterations, replication factor and micro batch size all set to 1, and recompile it. The change in batch size necessitates a recompilation, since the input shape to the model has changed. We follow the same steps to instantiate the model as earlier in the notebook; the only change is that we call `get_ipu_config` with all batching turned off.

In [14]:
ipu_config = get_ipu_config(model_config, n_ipu, model_ipu=1, device_iterations=1, replication_factor=1, random_seed=random_seed)

inf_model = IPUEmbeddingsModel(
    model_name = model_name,
    model_config = model_config,
    ipu_config = ipu_config
)

inf_model = poptorch.inferenceModel(inf_model, ipu_config.to_options(for_inference=True))

inf_model(**prepare_query("Running once to compile"))

Graph compilation: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:51<00:00]


tensor([[-2.6932e-02,  3.1342e-02, -2.2217e-02, -5.3680e-02,  7.3013e-03,
          5.1941e-02,  3.7811e-02, -2.4506e-02, -1.6937e-02,  2.7962e-03,
          7.0374e-02,  7.4463e-02,  3.8574e-02, -4.4067e-02,  2.1652e-02,
          7.5562e-02, -6.2752e-03,  2.0935e-02, -5.2185e-02, -5.7487e-03,
         -4.4678e-02, -2.0340e-02,  5.6213e-02, -3.9101e-03, -3.6499e-02,
         -3.6041e-02,  1.8204e-02, -2.0126e-02,  7.0381e-03,  2.8015e-02,
         -3.0708e-03,  1.1101e-02,  6.1302e-03, -3.3051e-02,  1.3113e-06,
          7.4310e-03,  7.8247e-02,  3.4912e-02, -4.5471e-02, -2.0790e-03,
         -7.2250e-03,  6.4575e-02,  2.5543e-02,  6.8054e-02,  1.2589e-02,
         -5.0306e-04, -4.1626e-02, -6.0699e-02, -1.7944e-02,  4.6277e-04,
          1.2772e-02, -6.1981e-02, -2.6569e-03,  7.5035e-03,  1.2077e-02,
         -4.1870e-02,  1.0063e-02, -3.3913e-03, -1.6098e-02,  4.7119e-02,
          4.4373e-02,  6.0913e-02,  9.3307e-03, -3.3894e-03, -2.1500e-02,
         -2.3468e-02,  3.4912e-02, -1.

Finally, we can use the model to embed a single query, and perform a semantic search across the full dataset embeddings to retrieve highly relevant reviews to the query.

In [15]:
from sentence_transformers.util import semantic_search

query = "Strongly disliked this action movie"

query_embeddings = inf_model(**prepare_query(query))
hits = semantic_search(query_embeddings.float(), embeddings.float(), top_k=10)

print(f"\n SEARCH QUERY: {query}")
for n, res in enumerate(hits[0]):
    print(f"\n Result (rank {n+1}) | Score: {res['score']} | Text: {dataset['train']['text'][res['corpus_id']]} ")


 SEARCH QUERY: Strongly disliked this action movie

 Result (rank 1) | Score: 0.761756420135498 | Text: it's a bad action movie because there's no rooting interest and the spectacle is grotesque and boring . 

 Result (rank 2) | Score: 0.6813229322433472 | Text: i hate this movie 

 Result (rank 3) | Score: 0.6623662710189819 | Text: features nonsensical and laughable plotting , wooden performances , ineptly directed action sequences and some of the worst dialogue in recent memory . 

 Result (rank 4) | Score: 0.6618168950080872 | Text: 'this movie sucks . ' 

 Result (rank 5) | Score: 0.6547254323959351 | Text: it is a comedy that's not very funny and an action movie that is not very thrilling ( and an uneasy alliance , at that ) . 

 Result (rank 6) | Score: 0.6510480046272278 | Text: the worst film of the year . 

 Result (rank 7) | Score: 0.6449141502380371 | Text: the acting is amateurish , the cinematography is atrocious , the direction is clumsy , the writing is insipid and the

From the results, the pretrained embeddings appear to perform quite well on an unseen dataset without any fine-tuning.
