# Deploy [Microsoft E5 Large V2](https://huggingface.co/intfloat/e5-large-v2) on SageMaker

This notebook walks through deploying the Microsoft E5 Large V2 model from Hugging Face on Amazon SageMaker. The model serves as the embedding model in a custom RAG architecture. For more background, see the accompanying blog post.

Steps:
1. **Prepare the Deployment Package**
    - Save the tokenizer and model weights to a local directory and add a custom `code/inference.py` handler.
    - Package the directory contents into a `tar.gz` file.

2. **Upload the Deployment Package to Amazon S3**
    - Upload the packaged `tar.gz` file to an Amazon S3 bucket, which serves as the storage location for the deployment package.

3. **Deploy the Model as a SageMaker Endpoint**
    - Create a `HuggingFaceModel` from the uploaded archive and deploy it as a real-time endpoint that can be called for inference.

*Note: This notebook assumes familiarity with Amazon SageMaker and basic concepts of deploying machine learning models. Additional documentation and resources are available for further reference and exploration.*

### 0. Initialization

In [2]:
%%capture
!pip install -U transformers

In [17]:
%%capture
!pip install -U torch

In [1]:
from torch import Tensor
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

In [None]:
import sagemaker
import boto3

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

In [3]:
%%capture

tokenizer = AutoTokenizer.from_pretrained('intfloat/e5-large-v2')
model = AutoModel.from_pretrained('intfloat/e5-large-v2')

### 1. Preparing deployment package

In [11]:
save_dir = 'pt_save_pretrained'
# -p creates parent directories and does not fail if the cell is re-run
%mkdir -p {save_dir}/code

tokenizer.save_pretrained(save_dir)
model.save_pretrained(save_dir)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [12]:
%%writefile {save_dir}/code/inference.py

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F
from torch import Tensor

def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


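def model_fn(model_dir):
    # Load the tokenizer and model from the unpacked archive; use the GPU when one is available
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModel.from_pretrained(model_dir).to(device)
    model.eval()
    return model, tokenizer

def predict_fn(data, model_and_tokenizer):
    model, tokenizer = model_and_tokenizer

    input_texts = data.pop("inputs")
    batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
    # Move the inputs to the same device as the model
    batch_dict = batch_dict.to(model.device)

    # Inference only: skip gradient tracking to save memory
    with torch.no_grad():
        outputs = model(**batch_dict)
    embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
    embeddings = F.normalize(embeddings, p=2, dim=1)

    return {"vectors": embeddings.tolist()}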

Writing pt_save_pretrained/code/inference.py
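
To make the pooling step in `inference.py` concrete, here is a dependency-free sketch of the same arithmetic on a single toy sequence: average the token vectors whose attention mask is 1, then scale the result to unit length. The helper names `masked_mean` and `l2_normalize` and the toy numbers are illustrative assumptions; the endpoint itself runs the batched PyTorch version above.

```python
import math

def masked_mean(hidden_states, attention_mask):
    """Average the token vectors whose mask entry is 1 (single sequence)."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    for vector, keep in zip(hidden_states, attention_mask):
        if keep:
            for j in range(dim):
                total[j] += vector[j]
    count = sum(attention_mask)
    return [value / count for value in total]

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

# Toy input: two real tokens followed by one padding token (mask 0).
hidden_states = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

pooled = masked_mean(hidden_states, attention_mask)  # padding row is ignored
unit = l2_normalize(pooled)
print(pooled, unit)
```

Because the padding row is masked out, it has no influence on the pooled vector, and the normalized embedding has length 1, so a dot product between two such embeddings equals their cosine similarity.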


### 2. Upload the model artifacts `tar.gz` file to S3

In [13]:
# Create the tar.gz archive with the saved files at the archive root
import tarfile
model_s3_name = 'e5-large-v2-embedding-model.tar.gz'

# The context manager closes the archive on exit, so no explicit close() is needed
with tarfile.open(model_s3_name, 'w:gz') as f:
    f.add(save_dir, arcname='.')
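
The Hugging Face inference container looks for the custom handler at `code/inference.py` relative to the archive root, so it can be worth checking the layout before uploading. The sketch below packs a throwaway directory the same way (`arcname='.'`) and lists the resulting member paths. The helper name `list_archive_members` and the toy files are assumptions for illustration, not part of the notebook.

```python
import os
import posixpath
import tarfile
import tempfile

def list_archive_members(src_dir, archive_path):
    """Pack src_dir at the archive root and return the normalized member paths."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(src_dir, arcname=".")
    with tarfile.open(archive_path, "r:gz") as tar:
        # Members are stored as './code/inference.py'; normalize to 'code/inference.py'
        return sorted(posixpath.normpath(name) for name in tar.getnames())

with tempfile.TemporaryDirectory() as tmp:
    # Toy stand-in for the saved model directory
    src = os.path.join(tmp, "model_dir")
    os.makedirs(os.path.join(src, "code"))
    for relative_path in ("config.json", os.path.join("code", "inference.py")):
        with open(os.path.join(src, relative_path), "w") as handle:
            handle.write("placeholder")

    members = list_archive_members(src, os.path.join(tmp, "model.tar.gz"))

print(members)
```

Running the same listing against the real archive should show `code/inference.py` next to the model and tokenizer files rather than nested under an extra top-level directory.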

In [14]:
# Build the S3 destination for the archive
s3_bucket = 'ai-models'
bucket_prefix = 'embeddings'

model_filename = model_s3_name
model_s3_key = f'{bucket_prefix}/{model_filename}'
model_url = f's3://{s3_bucket}/{model_s3_key}'

In [None]:
!aws s3 cp {model_filename} {model_url}

### 3. Deploy as SageMaker Inference Endpoint

In [None]:
import time

unix_time = int(time.time())
endpoint_name = f"e5-large-v2-{unix_time}"
print(f'Endpoint name: {endpoint_name}')
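
Endpoint names are restricted by SageMaker. The sketch below assumes the constraints are alphanumeric characters and hyphens, no leading or trailing hyphen, and at most 63 characters; check the current `CreateEndpoint` API reference before relying on these values. The helper `make_endpoint_name` is a hypothetical name used only for illustration.

```python
import re

# Assumed constraints (verify against the CreateEndpoint API reference)
ENDPOINT_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")
MAX_ENDPOINT_NAME_LENGTH = 63

def make_endpoint_name(base, timestamp):
    """Join a base name and a timestamp, rejecting names that break the assumed rules."""
    name = f"{base}-{timestamp}"
    if len(name) > MAX_ENDPOINT_NAME_LENGTH:
        raise ValueError(f"Endpoint name is longer than {MAX_ENDPOINT_NAME_LENGTH} characters: {name}")
    if not ENDPOINT_NAME_PATTERN.match(name):
        raise ValueError(f"Endpoint name contains unsupported characters: {name}")
    return name

example_name = make_endpoint_name("e5-large-v2", 1700000000)
print(example_name)
```

Validating the name up front surfaces a naming problem immediately instead of after the deployment call has been submitted.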

In [19]:
from sagemaker.huggingface.model import HuggingFaceModel


# Create the Hugging Face model object
huggingface_model = HuggingFaceModel(
    model_data=model_url,         # S3 path to the tar.gz with the model and inference script
    role=role,                    # IAM role with permissions to create an endpoint
    transformers_version="4.26",  # transformers version of the serving container
    pytorch_version="1.13",       # PyTorch version of the serving container
    py_version='py39',            # Python version of the serving container
)

# Deploy the model to a real-time endpoint
predictor = huggingface_model.deploy(
    endpoint_name=endpoint_name,
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
)

----------!
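
Once the endpoint is in service, `predictor.predict` can be called with a JSON payload containing an `inputs` key, and the handler in `inference.py` responds with `{"vectors": [...]}`. The E5 model card recommends prefixing each text with `query: ` or `passage: `. The sketch below shows one way to build such requests and rank passages by dot product, which equals cosine similarity because the returned vectors are L2-normalized. `StubPredictor` is a made-up offline stand-in that returns fake two-dimensional vectors so the example runs without AWS access; `embed` and `rank_passages` are likewise hypothetical helper names.

```python
class StubPredictor:
    """Offline stand-in for the SageMaker predictor; returns made-up unit vectors."""

    def predict(self, payload):
        # Fake embedding: [1, 0] if the text mentions "cat", otherwise [0, 1]
        vectors = [[1.0, 0.0] if "cat" in text else [0.0, 1.0] for text in payload["inputs"]]
        return {"vectors": vectors}

def embed(predictor, texts, kind):
    """Prefix each text with 'query: ' or 'passage: ' and request its embedding."""
    prefixed = [f"{kind}: {text}" for text in texts]
    return predictor.predict({"inputs": prefixed})["vectors"]

def rank_passages(predictor, query, passages):
    """Return (passage, score) pairs sorted by dot product with the query embedding."""
    query_vector = embed(predictor, [query], "query")[0]
    passage_vectors = embed(predictor, passages, "passage")
    scores = [sum(q * p for q, p in zip(query_vector, vector)) for vector in passage_vectors]
    return sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)

ranked = rank_passages(StubPredictor(), "what do cats eat", ["dogs bark", "cats eat fish"])
print(ranked)
```

With the real endpoint, passing the `predictor` returned by `deploy` in place of `StubPredictor()` should work the same way, provided the response keeps the `vectors` key defined in `inference.py`.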

### 4. Clean-up resources

In [22]:
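# Uncomment to delete the model and the endpoint once they are no longer needed;
# the endpoint incurs charges for as long as it is running.
#predictor.delete_model()
#predictor.delete_endpoint()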