# HuggingFace Pretrained BGE M3 Inference on Inf2

## Introduction

This notebook demonstrates how to compile and run a HuggingFace 🤗 BGE-M3(xlm-roberta) model for accelerated inference on Neuron. This notebook will use the ['BAAI/bge-m3'](https://huggingface.co/BAAI/bge-m3) model, which is primarily used for word embeddings. 

This Jupyter notebook should be run on a Inf2 instance (`inf2.8xlarge` or larger).

Verify that this Jupyter notebook is running the Python kernel environment that was set up according to the [PyTorch Installation Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/torch-neuronx.html#setup-torch-neuronx). You can select the kernel from the 'Kernel -> Change Kernel' option on the top of this Jupyter notebook page.

## Install Dependencies
This tutorial requires the following pip packages:

- `torch-neuronx`
- `neuronx-cc`
- `transformers`

Most of these packages will be installed when configuring your environment using the Trn1 setup guide. The additional dependencies must be installed here:

In [1]:
%env TOKENIZERS_PARALLELISM=True #Supresses tokenizer warnings making errors easier to detect
!pip install -U transformers

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting transformers
  Downloading transformers-4.41.0-py3-none-any.whl.metadata (43 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m43.8/43.8 kB[0m [31m2.6 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub<1.0,>=0.23.0 (from transformers)
  Downloading huggingface_hub-0.23.0-py3-none-any.whl.metadata (12 kB)
Collecting regex!=2019.12.17 (from transformers)
  Downloading regex-2024.5.15-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (40 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m40.9/40.9 kB[0m [31m5.5 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers<0.20,>=0.19 (from transformers)
  Downloading tokenizers-0.19.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting safetensors>=0.4.1 (from transformers)
  Downloading safetensors-0.4.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_6

## Compile the model into an AWS Neuron optimized TorchScript

In the following section, we load the model and tokenizer, get s sample input, run inference on CPU, compile the model for Neuron using `torch_neuronx.trace()` and save the optimized model as `TorchScript`.

`torch_neuronx.trace()` expects a tensor or tuple of tensor inputs to use for tracing, so we unpack the tokenzier output. Additionally, the input shape that's used duing compilation must match the input shape that's used during inference. To handle this, we pad the inputs to the maximum size that we will see during inference.

In [5]:
import torch
import torch_neuronx
from transformers import AutoTokenizer, AutoModel


def cls_pooling(model_output, attention_mask):
    return model_output[0][:,0]



# Create the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-m3')
model = AutoModel.from_pretrained('BAAI/bge-m3')

# Get an example input
sentences = ['This is an example sentence', 'Each sentence is converted']

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output_cpu = model(**encoded_input)

# Perform pooling. In this case, cls pooling.
sentence_embeddings_cpu = cls_pooling(model_output_cpu, encoded_input['attention_mask'])

print("Sentence embeddings cpu:")
print(sentence_embeddings_cpu)

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
example = (
    encoded_input['input_ids'],
    encoded_input['attention_mask'],
)

# Compile the model
model_neuron = torch_neuronx.trace(model,example)

# Save the TorchScript for inference deployment
filename = 'model.pt'
torch.jit.save(model_neuron, filename)

Sentence embeddings cpu:
tensor([[-0.3375,  1.2365, -1.0101,  ..., -0.6634, -0.0480, -0.3944],
        [ 0.3626, -0.0248, -0.4376,  ..., -0.0577, -0.8273,  1.4676]])
2024-05-21T15:57:36Z Running DoNothing
2024-05-21T15:57:36Z DoNothing finished after 0.000 seconds
2024-05-21T15:57:36Z Running AliasDependencyInduction
2024-05-21T15:57:36Z AliasDependencyInduction finished after 0.005 seconds
2024-05-21T15:57:36Z Running CanonicalizeIR
2024-05-21T15:57:36Z CanonicalizeIR finished after 0.024 seconds
2024-05-21T15:57:36Z Running LegalizeCCOpLayout
2024-05-21T15:57:37Z LegalizeCCOpLayout finished after 0.024 seconds
2024-05-21T15:57:37Z Running ResolveComplicatePredicates
2024-05-21T15:57:37Z ResolveComplicatePredicates finished after 0.022 seconds
2024-05-21T15:57:37Z Running AffinePredicateResolution
2024-05-21T15:57:37Z AffinePredicateResolution finished after 0.024 seconds
2024-05-21T15:57:37Z Running EliminateDivs
2024-05-21T15:57:37Z EliminateDivs finished after 0.023 seconds
2024-05

## Run inference and compare results

In this section we load the compiled model, run inference on Neuron, and compare the CPU and Neuron outputs.

In [1]:
# Load the TorchScript compiled model
model_neuron = torch.jit.load(filename)

# Run inference using the Neuron model
output_neuron = model_neuron(*example)
print(f"CPU last_hidden_state:    {model_output_cpu['last_hidden_state'][0][0][:10]}")
print(f"Neuron last_hidden_state: {output_neuron['last_hidden_state'][0][0][:10]}")


NameError: name 'torch' is not defined