> **_NOTE:_**  **This script is meant to be run on a SageMaker Notebook instance!**

## Prerequisites
- You have set up a **SageMaker Notebook** instance and an **S3 bucket** to store the model bundle, and configured their permissions.

## Step 1
Use git to clone this repository to your SageMaker Notebook instance, then open run.ipynb there.

## Step 2
Prepare the model files for SageMaker. Run the code blocks below in sequence.

In [None]:
!mkdir -p handler/code
!mkdir -p handler/MAR-INF

In [None]:
%%writefile handler/code/requirements.txt
sentence-transformers==5.0.0

In [None]:
%%writefile handler/MAR-INF/MANIFEST.json
{
  "runtime": "python",
  "model": {
    "modelName": "neuralsparse",
    "handler": "neural_sparse_handler.py",
    "modelVersion": "1.0",
    "configFile": "neural_sparse_config.yaml"
  },
  "archiverVersion": "0.9.0"
}

In [None]:
%%writefile handler/neural_sparse_config.yaml
## configs about dynamic batch inference
batchSize: 16
maxBatchDelay: 5
responseTimeout: 300

In [None]:
%%writefile handler/neural_sparse_handler.py

import os
import itertools
import json
import torch

from ts.torch_handler.base_handler import BaseHandler
from sentence_transformers.sparse_encoder import SparseEncoder

model_id = os.environ.get(
    "MODEL_ID", "opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte"
)
max_bs = int(os.environ.get("MAX_BS", 32))
trust_remote_code = model_id.endswith("gte")

class SparseEncodingModelHandler(BaseHandler):
    def __init__(self):
        super().__init__()
        self.initialized = False

    def initialize(self, context):
        self.manifest = context.manifest
        properties = context.system_properties

        # Print initialization parameters
        print(f"Initializing SparseEncodingModelHandler with model_id: {model_id}")

        # load model and tokenizer
        self.device = torch.device(
            "cuda:" + str(properties.get("gpu_id"))
            if torch.cuda.is_available()
            else "cpu"
        )
        print(f"Using device: {self.device}")
        self.model = SparseEncoder(model_id, device=self.device, trust_remote_code=trust_remote_code)
        self.initialized = True

    def preprocess(self, requests):
        # Flatten the sentences from all batched requests into one list and
        # record how many sentences each request contributed.
        inputSentence = []
        batch_idx = []

        for request in requests:
            request_body = request.get("body")
            # If the payload is raw bytes, decode it and parse it as JSON.
            if isinstance(request_body, (bytes, bytearray)):
                request_body = json.loads(request_body.decode("utf-8"))
            if isinstance(request_body, list):
                inputSentence += request_body
                batch_idx.append(len(request_body))
            else:
                inputSentence.append(request_body)
                batch_idx.append(1)

        return inputSentence, batch_idx

    def handle(self, data, context):
        inputSentence, batch_idx = self.preprocess(data)
        # Encode all sentences (in mini-batches of up to max_bs), then decode
        # each embedding into a {token: weight} dict.
        model_output = self.model.encode_document(inputSentence, batch_size=max_bs)
        sparse_embedding = list(map(dict, self.model.decode(model_output)))

        # Split the flat result list back into one group per request.
        ends = list(itertools.accumulate(batch_idx))
        starts = [0] + ends[:-1]
        outputs = [sparse_embedding[s:e] for s, e in zip(starts, ends)]
        return outputs

Pack the handler folder into a tarball and upload it to your S3 bucket.

handler/neural_sparse_handler.py defines model loading, preprocessing, inference, and postprocessing. Sentences from all requests in a batch are encoded together, and the results are then split back per request.
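The per-request regrouping done at the end of `handle` can be tried in isolation. The snippet below is a minimal sketch of that step only; the function name `regroup` and the placeholder strings are illustrative and do not appear in the handler.

```python
import itertools


def regroup(flat_results, batch_idx):
    """Split a flat list of results back into one group per request.

    batch_idx[i] is the number of sentences that request i contributed.
    """
    ends = list(itertools.accumulate(batch_idx))
    starts = [0] + ends[:-1]
    return [flat_results[s:e] for s, e in zip(starts, ends)]


# Three requests carrying 2, 1, and 3 sentences respectively.
batch_idx = [2, 1, 3]
flat = ["a1", "a2", "b1", "c1", "c2", "c3"]
groups = regroup(flat, batch_idx)
print(groups)
```

Each inner list holds the results for one request, in the order the requests arrived.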

handler/neural_sparse_config.yaml defines the TorchServe settings, including dynamic micro-batching.

In [None]:
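import subprocess

bucket_name = "your_bucket_name"
# Pack the contents of handler/ at the archive root; check=True raises if a command fails.
subprocess.run(
    ["tar", "-czvf", "neural-sparse-handler.tar.gz", "-C", "handler/", "."],
    check=True,
)
subprocess.run(
    ["aws", "s3", "cp", "neural-sparse-handler.tar.gz",
     f"s3://{bucket_name}/neural-sparse-handler.tar.gz"],
    check=True,
)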
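Before deploying, it can help to confirm that the tarball contains the files written in Step 2 at its root. The following is a rough sketch using only the standard library; the helper name `missing_members` and the `REQUIRED` set are illustrative and simply mirror the files created above.

```python
import os
import posixpath
import tarfile

# Files written by the cells in Step 2, relative to the archive root.
REQUIRED = {
    "MAR-INF/MANIFEST.json",
    "neural_sparse_handler.py",
    "neural_sparse_config.yaml",
    "code/requirements.txt",
}


def missing_members(archive_path, required=REQUIRED):
    """Return the required paths that are absent from the gzip tarball."""
    with tarfile.open(archive_path, "r:gz") as tar:
        # Normalize names such as "./code/requirements.txt" to "code/requirements.txt".
        names = {posixpath.normpath(name) for name in tar.getnames()}
    return sorted(set(required) - names)


# Only runs the check if the tarball from the previous cell exists.
if os.path.exists("neural-sparse-handler.tar.gz"):
    print(missing_members("neural-sparse-handler.tar.gz"))
```

An empty list means every expected file was found in the archive.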

## Step 3
Use the SageMaker Python SDK to deploy the tarball to a real-time inference endpoint.

Here we use ml.g5.xlarge, a GPU instance with good price-performance.

Adjust the region to match your setup.

In [None]:
# constants that can be customized for models
model_id = "opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte"
max_batch_size = "32"

# constants related to deployment
model_name = "ns-handler"
endpoint_name = "ns-handler"
instance_type = "ml.g5.xlarge"
initial_instance_count = 1

# run this cell
import boto3
import sagemaker
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

role = sagemaker.get_execution_role()
sess = boto3.Session()
region = sess.region_name
smsess = sagemaker.Session(boto_session=sess)

# Environment variables read by the handler (values must be strings).
envs = {
    "TS_ASYNC_LOGGING": "true",
    "MODEL_ID": model_id,
    "MAX_BS": max_batch_size,
}

baseimage = sagemaker.image_uris.retrieve(
    framework="pytorch",
    region=region,
    py_version="py312",
    image_scope="inference",
    version="2.6",
    instance_type=instance_type,
)

model = Model(
    model_data=f"s3://{bucket_name}/neural-sparse-handler.tar.gz",
    image_uri=baseimage,
    role=role,
    predictor_cls=Predictor,
    name=model_name,
    sagemaker_session=smsess,
    env=envs,
)

predictor = model.deploy(
    instance_type=instance_type,
    initial_instance_count=initial_instance_count,
    endpoint_name=endpoint_name,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
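    # Model.deploy takes these timeouts (in seconds) as snake_case keyword arguments.
    model_data_download_timeout=3600,
    container_startup_health_check_timeout=3600,
    # volume_size is left unset here; set it only for instance types that use EBS storage volumes.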
)

print(predictor.endpoint_name)

## Step 4

After the endpoint is created, send a sample request to see how it works.

In [None]:
# run this cell
import json

body = ["Currently New York is rainy."]
amz = boto3.client("sagemaker-runtime")

response = amz.invoke_endpoint(
    EndpointName=predictor.endpoint_name,
    Body=json.dumps(body),
    ContentType="application/json",
)

res = response["Body"].read()
results = json.loads(res.decode("utf8"))
results

Response:
```python
{'response': [{'has': 0.19832642376422882,
   'new': 0.9849710464477539,
   'like': 0.20112557709217072,
   'now': 0.7473171949386597,
   'state': 0.20818853378295898,
   'still': 0.26296505331993103,
   'going': 0.17759032547473907,
   'york': 1.5465646982192993,
   'water': 0.5180262327194214,
   'present': 0.24726435542106628,
   'today': 0.5316043496131897,
   'currently': 0.6706798672676086,
   'current': 0.9104140996932983,
   'dry': 0.2999960780143738,
   'rain': 1.3858059644699097,
   'weather': 1.4669378995895386,
   'climate': 0.392688512802124,
   'wet': 1.070887804031372,
   'happening': 0.3875649571418762,
   'ny': 1.4108916521072388,
   'brooklyn': 0.2983669638633728,
   'yorkshire': 0.15651951730251312,
   'manhattan': 0.969535231590271,
   'flood': 0.2403770089149475,
   'flooding': 0.4161500036716461,
   'rainfall': 0.9889746904373169,
   'damp': 0.38938602805137634,
   'moist': 0.32199856638908386,
   'mist': 0.2026219218969345,
   'precipitation': 0.5729197263717651,
   'drought': 0.41227778792381287,
   'rains': 0.8187123537063599,
   'rainy': 1.4709837436676025,
   'nyc': 1.308121681213379,
   'yorker': 0.6350979804992676,
   'monsoon': 0.6218147873878479,
   'raining': 0.9827804565429688,
   'cloudy': 0.6314691305160522,
   'nyu': 0.7196483612060547}],
 'tokens': [{'inputTokens': 8, 'outputTokens': 0}]}
```
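Each entry under `response` is a token-to-weight mapping, so the highest-weighted tokens can be listed with a simple sort. This is a small sketch; `top_tokens` is a hypothetical helper, and `sample` is a hand-copied subset of the response shown above, not a fresh endpoint call.

```python
def top_tokens(sparse_embedding, k=5):
    """Return the k (token, weight) pairs with the largest weights."""
    ranked = sorted(sparse_embedding.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]


# Subset of the sparse embedding from the sample response.
sample = {
    "has": 0.19832642376422882,
    "york": 1.5465646982192993,
    "rain": 1.3858059644699097,
    "weather": 1.4669378995895386,
    "ny": 1.4108916521072388,
    "rainy": 1.4709837436676025,
}

for token, weight in top_tokens(sample, k=3):
    print(f"{token}\t{weight:.3f}")
```

With a live endpoint, the same helper could be applied to `results["response"][0]` from the cell above.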

## Step 5
> **_NOTE:_**  **This step is meant to be run on an instance that has access to the OpenSearch cluster!**

Register the SageMaker endpoint with your OpenSearch cluster.

See the OpenSearch documentation for more information. The demo request body below authenticates with access_key and secret_key; choose the authentication method that fits your use case.

### Create connector

(Fill in the region and predictor.endpoint_name in the request body.)
```json
POST /_plugins/_ml/connectors/_create
{
  "name": "test",
  "description": "Test connector for Sagemaker model",
  "version": 1,
  "protocol": "aws_sigv4",
  "credential": {
    "access_key": "your access key",
    "secret_key": "your secret key"
  },
  "parameters": {
    "region": "{region}",
    "service_name": "sagemaker",
    "input_docs_processed_step_size": 2
  },
  "actions": [
    {
      "action_type": "predict",
      "method": "POST",
      "headers": {
        "content-type": "application/json"
      },
      "url": "https://runtime.sagemaker.{region}.amazonaws.com/endpoints/{predictor.endpoint_name}/invocations",
      "request_body": "${parameters.input}"
    }
  ],
  "client_config": {
    "max_retry_times": -1,
    "max_connection": 60,
    "retry_backoff_millis": 10
  }
}
```
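To avoid editing the placeholders by hand, the request body can be generated in Python. The snippet below is one possible sketch: `build_connector_body` is a hypothetical helper, and the region `us-east-1` and endpoint name `ns-handler` are example values to replace with your own.

```python
import json


def build_connector_body(region, endpoint_name, access_key, secret_key):
    """Return the connector creation payload with region and endpoint filled in."""
    url = (
        f"https://runtime.sagemaker.{region}.amazonaws.com"
        f"/endpoints/{endpoint_name}/invocations"
    )
    return {
        "name": "test",
        "description": "Test connector for Sagemaker model",
        "version": 1,
        "protocol": "aws_sigv4",
        "credential": {"access_key": access_key, "secret_key": secret_key},
        "parameters": {
            "region": region,
            "service_name": "sagemaker",
            "input_docs_processed_step_size": 2,
        },
        "actions": [
            {
                "action_type": "predict",
                "method": "POST",
                "headers": {"content-type": "application/json"},
                "url": url,
                # Kept as a literal placeholder string, exactly as in the request above.
                "request_body": "${parameters.input}",
            }
        ],
        "client_config": {
            "max_retry_times": -1,
            "max_connection": 60,
            "retry_backoff_millis": 10,
        },
    }


# Example values only; substitute your own region, endpoint name, and credentials.
body = build_connector_body("us-east-1", "ns-handler", "your access key", "your secret key")
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted as the body of the `POST /_plugins/_ml/connectors/_create` request.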

### Register model
```json
POST /_plugins/_ml/models/_register?deploy=true
{
  "name": "test",
  "function_name": "remote",
  "version": "1.0.0",
  "connector_id": "{connector id}",
  "description": "Test connector for Sagemaker model"
}
```