# Deploy `BAAI/bge-base-en-v1.5` Text Embedding Model (768 Dimensions) to Amazon SageMaker

In this notebook, we demonstrate how to package and deploy the `BAAI/bge-base-en-v1.5` text embedding model, which produces **768**-dimensional embeddings.

`bge` is short for BAAI general embedding.

*NOTE*: If you need to retrieve long relevant passages for a short query (the s2p retrieval task), add the instruction to the query. In all other cases, no instruction is needed; use the original query directly. Passages never need an instruction.
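As a minimal sketch of the note above, the helper below prefixes the instruction to queries only. The instruction string is the one listed on the model card for the English v1.5 models; the function name `prepare_texts` is hypothetical and not part of this notebook.

```python
# Instruction from the BAAI/bge-*-en-v1.5 model card (applied to queries only).
QUERY_INSTRUCTION = "Represent this sentence for searching relevant passages: "


def prepare_texts(texts, is_query=False):
    """Prefix the retrieval instruction to queries; return passages unchanged."""
    if is_query:
        return [QUERY_INSTRUCTION + text for text in texts]
    return list(texts)


queries = prepare_texts(["what is a text embedding?"], is_query=True)
passages = prepare_texts(["A text embedding is a dense vector representation of text."])
print(queries[0])
```

The prepared lists can then be sent to the endpoint in place of the raw strings.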

Refer to the **Model Card** at <https://huggingface.co/BAAI/bge-base-en-v1.5#using-huggingface-transformers> for more details.

**NOTE:** bge model sizes and dimensions
- `BAAI/bge-base-en-v1.5`: **~438MB** (Dimensions: 768)
- `BAAI/bge-large-en-v1.5`: **~1.34GB** (Dimensions: 1024)
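To make the dimension figures concrete, the following back-of-the-envelope sketch estimates raw vector storage. It assumes embeddings are stored as 4-byte float32 values (the notebook does not specify a storage format), and the corpus size is illustrative.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as float32


def embedding_storage_bytes(num_vectors, dimensions):
    """Raw storage for dense vectors, excluding any index overhead."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32


num_docs = 1_000_000  # illustrative corpus size
for name, dims in [("bge-base-en-v1.5", 768), ("bge-large-en-v1.5", 1024)]:
    gib = embedding_storage_bytes(num_docs, dims) / 1024**3
    print(f"{name}: {gib:.2f} GiB for {num_docs:,} vectors")
```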

## References
- <https://huggingface.co/BAAI/bge-base-en-v1.5>
- <https://github.com/FlagOpen/FlagEmbedding>

## Inference script

Refer to [inference.py](./models/bi-encoders/bge-base-en-v15/code/inference.py) for implementation details.

In [None]:
# !pip install -Uq boto3 sagemaker rich watermark ipywidgets
# %load_ext rich
# %load_ext watermark
# %watermark -p boto3,sagemaker,ipywidgets,transformers
# %watermark -m -t -v

In [None]:
import shutil
import sys
from pathlib import Path

from huggingface_hub import snapshot_download
from rich import print
from sagemaker import get_execution_role
from sagemaker.deserializers import JSONDeserializer
from sagemaker.s3 import S3Uploader, s3_path_join
from sagemaker.serializers import JSONSerializer
from sagemaker.session import Session

sys.path.append("./utils")
from utils import sm_utils, utils

In [None]:
session = Session()
bucket = session.default_bucket()
role = get_execution_role()
region = session.boto_region_name

HF_MODEL_ID = "BAAI/bge-base-en-v1.5"
model_base_name = HF_MODEL_ID.split("/")[-1].replace(".", "")
model_folder = Path(f"./models/bi-encoders/{model_base_name}")
model_archive_path = model_folder.joinpath("model.tar.gz")
s3_baseuri = s3_path_join(f"s3://{bucket}/models", f"txt-embedding-models/{model_base_name}")

print(f"Region: [i]{region}[/i]")
print(f"S3 base URI: {s3_baseuri}")
print(f"Model dir: {model_folder}")

In [None]:
model_bin = model_folder.joinpath("pytorch_model.bin")

if not model_bin.exists():
    print("Downloading model ...")
    snapshot_download(
        repo_id=HF_MODEL_ID,
        local_dir=str(model_folder),
        local_dir_use_symlinks=False,
        allow_patterns=["1_Pooling", "*.txt", "*.json", "*.bin"],
    )
else:
    print(f"Model already downloaded. {model_folder}")

### Create Model

- Compress model artifacts to `model.tar.gz`
- Upload model to S3

In [None]:
utils.clear_ipynb_dirs(model_folder)  # remove any .ipynb_checkpoints, __pycache__
model_archive_path = model_folder.joinpath("model.tar.gz")
if model_archive_path.exists():
    print(f"Deleting existing model: {model_archive_path}")
    model_archive_path.unlink(missing_ok=True)

print(f"Creating archive with root_dir={model_folder}")
model_archive_path = shutil.make_archive(
    format="gztar",  # tar.gz format
    base_name=model_folder.name,  # writes <model_folder.name>.tar.gz in the current working directory
    root_dir=model_folder,  # dir to chdir into before archiving
)

In [None]:
# Verify contents of the model archive.
# !tar tvf $model_archive_path
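The same check can be done in Python without shelling out. This is a sketch that assumes the archive should contain `code/inference.py` relative to its root, matching the folder layout referenced earlier. `list_archive_members` is a hypothetical helper, and the demo builds a throwaway archive so the block runs on its own.

```python
import tarfile
import tempfile
from pathlib import Path


def list_archive_members(archive_path):
    """Return member names of a .tar.gz archive with any leading './' removed."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return [name[2:] if name.startswith("./") else name for name in tar.getnames()]


# Demo on a throwaway archive; in the notebook, pass model_archive_path instead.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp, "src")
    (src / "code").mkdir(parents=True)
    (src / "code" / "inference.py").write_text("# placeholder\n")
    (src / "config.json").write_text("{}")

    demo_archive = Path(tmp, "demo.tar.gz")
    with tarfile.open(demo_archive, "w:gz") as tar:
        tar.add(src, arcname=".")  # same './'-prefixed layout as archiving with root_dir

    members = list_archive_members(demo_archive)

has_inference_script = "code/inference.py" in members
print(sorted(members), has_inference_script)
```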

#### Upload archive to S3

In [None]:
print(f"Uploading model from {model_archive_path} to \n{s3_baseuri} ...")
model_data_url = S3Uploader.upload(
    local_path=str(model_archive_path),
    desired_s3_uri=s3_baseuri,
    sagemaker_session=session,
)
print(f"Model Data URL: {model_data_url}")

Create a `HuggingFaceModel` with the model data and the custom `inference.py` script.

<https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/sagemaker.huggingface.html#hugging-face-model>

### Deploy to real-time endpoint

Serverless endpoints can only host models where the container image size plus the model size is at most 10 GB.

The Hugging Face Transformers container plus the BGE base model exceeds 10 GB, so we deploy to a real-time endpoint here.

Helper functions to create and deploy a Hugging Face model are in the [utils (sm_utils)](./utils/sm_utils.py) module.
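The helper's source is not shown in this notebook. As a rough sketch of what such a function might do with the SageMaker Python SDK's `HuggingFaceModel`, consider the following. The function names, the container versions (`4.26` / `1.13` / `py39`), and the choice to reuse the model name as the endpoint name are assumptions for illustration, not the contents of `sm_utils`.

```python
def build_model_kwargs(model_name, model_s3uri, role, env=None):
    """Collect HuggingFaceModel constructor arguments (versions are assumptions)."""
    return {
        "name": model_name,
        "model_data": model_s3uri,
        "role": role,
        "env": env or {},
        "transformers_version": "4.26",  # assumed; use a combination available to you
        "pytorch_version": "1.13",
        "py_version": "py39",
    }


def create_and_deploy(model_name, model_s3uri, role, instance_type, env=None):
    """Create the model and start deploying it to a real-time endpoint."""
    # Imported lazily so build_model_kwargs can be used without the SDK installed.
    from sagemaker.huggingface import HuggingFaceModel

    model = HuggingFaceModel(**build_model_kwargs(model_name, model_s3uri, role, env))
    return model.deploy(
        initial_instance_count=1,
        instance_type=instance_type,
        endpoint_name=model_name,  # assumption: endpoint shares the model's name
        wait=False,  # return immediately; poll the endpoint status separately
    )


kwargs = build_model_kwargs("demo-model", "s3://example-bucket/model.tar.gz", "example-role")
print(kwargs["model_data"])
```

Because `create_and_deploy` passes `wait=False`, it returns before the endpoint is ready, which is why a separate status check is useful.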

In [None]:
instance_type = "ml.c5.4xlarge"
suffix = utils.get_suffix()  # returns a uuid4-datetime formatted string
model_name = f"{model_base_name}-{suffix}"
env = {"HF_TASK": "feature-extraction"}  # HF_TASK is required for HF models

# function to create and deploy model to real-time endpoint
predictor = sm_utils.create_deploy_huggingface_model(
    model_name=model_name,
    model_s3uri=model_data_url,
    role=role,
    instance_type=instance_type,
    env=env,
)

### Wait for endpoint to come online (`InService`)

In [None]:
sm_utils.get_endpoint_status(endpoint_name=model_name)
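`get_endpoint_status` is a notebook helper whose source is not shown. A polling loop along these lines would achieve the same goal. `wait_for_in_service` is a hypothetical name, and the status lookup is injected as a callable so the sketch runs without AWS credentials. With boto3, the lookup could be `lambda: boto3.client("sagemaker").describe_endpoint(EndpointName=model_name)["EndpointStatus"]`.

```python
import time


def wait_for_in_service(get_status, poll_seconds=30.0, timeout_seconds=1800.0):
    """Poll get_status() until it returns 'InService'; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError("Endpoint creation failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint still {status!r} after {timeout_seconds}s")
        time.sleep(poll_seconds)


# Demo with a fake status source that goes Creating -> Creating -> InService.
fake_statuses = iter(["Creating", "Creating", "InService"])
final_status = wait_for_in_service(lambda: next(fake_statuses), poll_seconds=0.0)
print(final_status)
```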

### Inference

Refer to [models/bi-encoders/bge-base-en-v15/code/inference.py](./models/bi-encoders/bge-base-en-v15/code/inference.py) for complete implementation.

**Model Card:** <https://huggingface.co/BAAI/bge-base-en-v1.5#using-huggingface-transformers>

```python
import torch
import torch.nn.functional as F


def generate_embeddings(texts, model, tokenizer, normalize=True):
    """
    Generate embeddings for a list of texts using a pre-trained model.

    Args:
        texts (List[str]): List of texts to calculate embeddings for.
        model (AutoModel): Pre-trained model.
        tokenizer (AutoTokenizer): Tokenizer corresponding to the pre-trained model.
        normalize (bool, optional): Whether to normalize the embeddings. Defaults to True.

    Returns:
        Tensor: Tensor containing the embeddings for the texts.
    """

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Tokenize the texts
    encoded_input = tokenizer(
        texts, max_length=512, padding=True, truncation=True, return_tensors="pt"
    )

    encoded_input = encoded_input.to(device)

    # Get the embeddings for the texts
    with torch.no_grad():
        model_output = model(**encoded_input)

        # Perform pooling. In this case, CLS pooling (hidden state of the first token).
        sentence_embeddings = model_output[0][:, 0]

    # Normalize embeddings if required
    if normalize:
        sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

    # Move to CPU and convert to a JSON-serializable nested list
    sentence_embeddings = sentence_embeddings.cpu().numpy()
    ret_value = sentence_embeddings.tolist()

    return ret_value

...

```



### Invoke Endpoint

Before invoking the endpoint, we attach a `JSONSerializer` and a `JSONDeserializer` to the predictor object.

In [None]:
predictor.serializer = JSONSerializer()
predictor.deserializer = JSONDeserializer()

In [None]:
sentences = ["That is a happy person", "That is a very happy person"]

embeddings = predictor.predict(sentences)
print(f"Number of embeddings: {len(embeddings)}")  # 2, one per sentence
print(f"Embedding dimensions: {len(embeddings[0])}")  # expected to be 768 for bge-base
print(embeddings[0])
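When normalization is enabled (the default in `generate_embeddings`), the returned vectors have unit length, so the dot product of two embeddings equals their cosine similarity. The sketch below uses short made-up vectors in place of real endpoint output, so the numbers are purely illustrative.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


# Toy stand-ins for predictor.predict(sentences) output (not real embeddings).
emb_a = l2_normalize([1.0, 2.0, 2.0])
emb_b = l2_normalize([2.0, 1.0, 2.0])

similarity = dot(emb_a, emb_b)
print(f"cosine similarity: {similarity:.4f}")
```

With real output, `dot(embeddings[0], embeddings[1])` would give the similarity between the two example sentences.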

## Cleanup

Delete the model and endpoint after use to avoid ongoing charges.

In [None]:
print(f"Deleting model and endpoint: {model_name}")
predictor.delete_model()
predictor.delete_endpoint()