# SageMaker Example

## 1. Create your container repository

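Open the AWS console and create a repository for your container: https://us-west-2.console.aws.amazon.com/ecr/create-repository?region=us-west-2

For example: `236995464743.dkr.ecr.us-west-2.amazonaws.com/sagemaker_endpoint/vllm`. The cells below use account `434444145045` in `us-east-1`; substitute your own account ID and region throughout.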

In [None]:
# Log in to the ECR registry (use your own account ID and region)
!aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 434444145045.dkr.ecr.us-east-1.amazonaws.com

VLLM_VERSION = "v0.5.5"
REPO_NAME = "sagemaker_endpoint/vllm"
CONTAINER = f"434444145045.dkr.ecr.us-east-1.amazonaws.com/{REPO_NAME}:{VLLM_VERSION}"
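
The image URI above follows the pattern `<account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>`. A minimal sketch of building it from its parts, so that the account ID and region are defined in one place. `ecr_image_uri` is a hypothetical helper (not part of any AWS SDK), and the sketch assumes a commercial AWS region:

```python
def ecr_image_uri(account_id: str, region: str, repo_name: str, tag: str) -> str:
    """Return the ECR image URI for a repository and tag (hypothetical helper)."""
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    return f"{registry}/{repo_name}:{tag}"


# Values copied from the cell above; replace them with your own account and region.
CONTAINER = ecr_image_uri("434444145045", "us-east-1", "sagemaker_endpoint/vllm", "v0.5.5")
print(CONTAINER)
```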


## 2. Build the container

The demo code is in `app/`.
Build and push the Docker image with the following commands:

In [None]:
!docker build --build-arg VLLM_VERSION={VLLM_VERSION} -t {REPO_NAME}:{VLLM_VERSION} .
!docker tag {REPO_NAME}:{VLLM_VERSION} {CONTAINER}
!docker push {CONTAINER}

## 3. Deploy on SageMaker

Define the model and deploy it to a SageMaker endpoint.


### 3.1 Init SageMaker session

In [None]:
# !pip install boto3 sagemaker transformers
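import json
import os
import re

import boto3
import dotenv
import sagemaker
from sagemaker import Model


# Load variables such as `role` from a local .env file
dotenv.load_dotenv()
# Do not print os.environ: it would expose every variable, including credentials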

boto_sess = boto3.Session(
    region_name='us-east-1'
)

sess = sagemaker.session.Session(boto_session=boto_sess)
# role = sagemaker.get_execution_role()
role = os.environ.get('role')
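
`os.environ.get('role')` silently returns `None` when the variable is missing, and the error then only surfaces at deployment time. A minimal sketch of failing early instead. `load_role` is a hypothetical helper, the ARN pattern is an assumption about the usual shape of an IAM role ARN, and the role name in the example is a placeholder:

```python
import os
import re

# Assumed shape: arn:<partition>:iam::<12-digit account>:role/<path-and-name>
ROLE_ARN_RE = re.compile(r"^arn:aws[a-z-]*:iam::\d{12}:role/.+$")


def load_role(env=None, key="role"):
    """Return the role ARN stored under `key`, or raise ValueError with a clear message."""
    env = os.environ if env is None else env
    value = env.get(key)
    if not value:
        raise ValueError(f"environment variable {key!r} is not set; add it to your .env file")
    if not ROLE_ARN_RE.match(value):
        raise ValueError(f"{key!r} does not look like an IAM role ARN: {value!r}")
    return value


# Example with an explicit mapping; the role name is a placeholder.
print(load_role({"role": "arn:aws:iam::123456789012:role/ExampleSageMakerRole"}))
```

In the notebook, `role = load_role()` would read from the real environment after `dotenv.load_dotenv()`.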


### 3.2 Prepare model file

#### Option 1: deploy vLLM with scripts

In [None]:
!echo the entrypoint of the endpoint is "start.sh"
!echo ====================================================
!cat vllm_by_scripts/start.sh
!echo ====================================================

!rm -f vllm_by_scripts.tar.gz
!tar czvf vllm_by_scripts.tar.gz vllm_by_scripts/


s3_code_prefix = "sagemaker_endpoint/vllm/"
bucket = sess.default_bucket()
code_artifact = sess.upload_data("vllm_by_scripts.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {code_artifact}")
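
The `rm` and `tar czvf` steps above can also be done with Python's standard `tarfile` module, which avoids the error message when the archive does not exist yet. A minimal sketch; `make_tarball` is a hypothetical helper name:

```python
import tarfile
from pathlib import Path


def make_tarball(source_dir, output_path):
    """Pack source_dir into a gzip-compressed tarball and return the member names."""
    source = Path(source_dir)
    # Mode "w:gz" overwrites any existing archive, so no separate `rm` is needed.
    with tarfile.open(output_path, "w:gz") as tar:
        # arcname keeps the top-level directory name, like `tar czvf x.tar.gz x/`.
        tar.add(source, arcname=source.name)
    # Re-open the archive to list what was actually written.
    with tarfile.open(output_path, "r:gz") as tar:
        return tar.getnames()
```

For this notebook the call would be `make_tarball("vllm_by_scripts", "vllm_by_scripts.tar.gz")`, followed by the same `sess.upload_data(...)` call as above.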

#### Option 2: deploy vLLM by model_id

In [None]:
!echo write the model_id to file "model_id"
!echo ====================================================
!cat vllm_by_model_id/model_id
!echo ====================================================
!echo 
!echo write envs to file ".env"
!echo ====================================================
!cat vllm_by_model_id/.env
!echo ====================================================

!rm -f vllm_by_model_id.tar.gz
!tar czvf vllm_by_model_id.tar.gz vllm_by_model_id/


In [3]:
!mkdir -p dummy_dir
!cd dummy_dir && rm -rf ".ipynb_checkpoints"
!tar czvf model.tar.gz dummy_dir/

dummy_dir/
dummy_dir/env
dummy_dir/s5cmd


In [2]:


s3_code_prefix = "sagemaker_endpoint/vllm/"
bucket = sess.default_bucket()
code_artifact = sess.upload_data("model.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {code_artifact}")

S3 Code or Model tar ball uploaded to --- > s3://sagemaker-us-east-1-434444145045/sagemaker_endpoint/vllm//model.tar.gz
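
The URI printed above contains `vllm//model.tar.gz`, most likely because `s3_code_prefix` ends with `/` and the upload adds its own separator before the file name. S3 treats `a//b` and `a/b` as different keys, so it can help to normalize the pieces before building or comparing URIs. A minimal sketch; `s3_uri` is a hypothetical helper, not part of boto3 or the SageMaker SDK:

```python
def s3_uri(bucket: str, *parts: str) -> str:
    """Join a bucket and key segments into an s3:// URI without empty segments."""
    segments = []
    for part in parts:
        # Splitting on "/" and dropping empty pieces removes leading,
        # trailing, and doubled slashes in one step.
        segments.extend(p for p in part.split("/") if p)
    return "s3://" + "/".join([bucket] + segments)


print(s3_uri("sagemaker-us-east-1-434444145045", "sagemaker_endpoint/vllm/", "model.tar.gz"))
# s3://sagemaker-us-east-1-434444145045/sagemaker_endpoint/vllm/model.tar.gz
```

If you adopt this form, also pass the prefix to `upload_data` without its trailing slash so that the uploaded object and the computed URI agree.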


### 3.3 Deploy model

In [3]:
# CONTAINER='434444145045.dkr.ecr.us-east-1.amazonaws.com/sagemaker_endpoint/vllm:v0.5.5'
# model = Model(
#     name=sagemaker.utils.name_from_base("sagemaker-vllm")+"_model",
#     model_data=code_artifact,
#     image_uri=CONTAINER,
#     role=role,
#     sagemaker_session=sess,
# )

# # Deploy the model to an endpoint
# endpoint_name = sagemaker.utils.name_from_base("sagemaker-vllm")+"_endpoint"
# print(f"endpoint_name: {endpoint_name}")
# predictor = model.deploy(
#     initial_instance_count=1,
#     instance_type='ml.g5.2xlarge',
#     endpoint_name=endpoint_name,
# )

### Test deployment from S3

In [4]:
CONTAINER='434444145045.dkr.ecr.us-east-1.amazonaws.com/sagemaker_endpoint/vllm:v0.5.5'
model_path = "s3://sagemaker-us-east-1-434444145045/Qwen2-1-5B-Instruct/6d0410c634ea438fa5018072e84c10a6/finetuned_model_merged/"
# model_id="deepseek-ai/deepseek-coder-1.3b-instruct"
model_id = 'Qwen/Qwen2-1.5B-Instruct'
env = {
    "HF_MODEL_ID": model_id,
    "S3_MODEL_PATH": model_path,
}
model = Model(
    name=sagemaker.utils.name_from_base("sagemaker-vllm")+"-model",
    model_data=code_artifact,
    image_uri=CONTAINER,
    role=role,
    sagemaker_session=sess,
    env=env,
)

# Deploy the model to an endpoint
endpoint_name = sagemaker.utils.name_from_base("sagemaker-vllm")+"-endpoint"
print(f"endpoint_name: {endpoint_name}")
predictor = model.deploy(
    initial_instance_count=1,
    instance_type='ml.g5.2xlarge',
    endpoint_name=endpoint_name,
)

endpoint_name: sagemaker-vllm-2024-09-03-14-18-30-001-endpoint


-----------!
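
The names above end in `-model` and `-endpoint`. SageMaker restricts model and endpoint names to alphanumerics and hyphens, so a suffix such as `_endpoint` is rejected when the resource is created. A minimal sketch of checking a name locally first; the pattern and the 63-character limit are stated here as assumptions based on the SageMaker API reference, and `is_valid_sagemaker_name` is a hypothetical helper:

```python
import re

# Assumption: names allow only alphanumerics and hyphens, must start and end
# with an alphanumeric character, and are at most 63 characters long.
NAME_RE = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")
MAX_NAME_LEN = 63


def is_valid_sagemaker_name(name: str) -> bool:
    """Return True if `name` satisfies the assumed SageMaker naming rules."""
    return len(name) <= MAX_NAME_LEN and NAME_RE.match(name) is not None


print(is_valid_sagemaker_name("sagemaker-vllm-2024-09-03-14-18-30-001-endpoint"))  # True
print(is_valid_sagemaker_name("sagemaker-vllm-2024-09-03-14-18-30-001_endpoint"))  # False
```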

## 4. Test

You can invoke the endpoint with the boto3 SageMaker runtime client.

### 4.1 Message API non-stream mode

In [3]:
runtime = boto3.client('sagemaker-runtime', region_name='us-east-1')
endpoint_name = "Meta-Llama-3-1-8B-Instruct-2024-09-04-10-26-26-004"
payload = {
    # "model": "deepseek-ai/deepseek-coder-1.3b-instruct",
    "model":"Qwen/Qwen2-1.5B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": "Write a quick sort in python"
        }
    ],
    "max_tokens": 1024,
    "stream": False
}
response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

print(json.loads(response['Body'].read())["choices"][0]["message"]["content"])

Sure! Here is a quick sort function in Python:
```python
def quicksort(arr):
    if len(arr) <= 1:
        return arr

    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]

    return quicksort(left) + middle + quicksort(right)
```

This function takes an array as input and returns the sorted array. The function first checks if the input array has length 0 or 1, in which case it returns the array as it is. Otherwise, it chooses a pivot element from the array, which is the middle element in this case. It then separates the elements in the array into three groups: elements less than the pivot, elements equal to the pivot, and elements greater than the pivot. It then recursively applies the quicksort function to the left and right sub-arrays, and finally combines the three sorted sub-arrays to produce the final sorted array.


### 4.2 Message API stream mode

In [5]:
payload = {
    # "model": "deepseek-ai/deepseek-coder-1.3b-instruct",
    "model": "Qwen/Qwen2-1.5B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": "Write a quick sort in python"
        }
    ],
    "max_tokens": 1024,
    "stream": True
}

response = runtime.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

buffer = ""
for t in response['Body']:
    buffer += t["PayloadPart"]["Bytes"].decode()
    last_idx = 0
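    # re.MULTILINE lets `^` match at every event in the buffer, not only the first
    for match in re.finditer(r'^data:\s*(.+?)(\n\n)', buffer, flags=re.MULTILINE):
        # Mark the event as consumed even when it is not JSON (e.g. "data: [DONE]")
        last_idx = match.end()
        try:
            data = json.loads(match.group(1).strip())
            print(data["choices"][0]["delta"]["content"], end="")
        except (json.JSONDecodeError, KeyError, IndexError):
            pass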
    buffer = buffer[last_idx:]

Sure, here is a simple implementation of Quick Sort in Python:

```python
def quickSort(arr):


    if len(arr) <= 1:
        return arr
    else:
        pivot = arr[len(arr) // 2]
        left = [x for x in arr if x < pivot]
        middle = [x for x in arr if x == pivot]
        right = [x for x in arr if x > pivot]
        return quickSort(left) + middle + quickSort(right)
```

This function works by selecting a 'pivot' element from the array and partitioning the other elements into two sub-arrays, according to whether they are less than or greater than the pivot. The sub-arrays are then recursively sorted.
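
The streaming loop above decodes each chunk on its own, which can fail when a multi-byte UTF-8 character is split across two chunks. A minimal sketch of a reusable parser that uses an incremental decoder and splits events on the blank line between them. `iter_sse_events` is a hypothetical helper, and it assumes each event is a single `data: <json>` line terminated by `\n\n`; the byte chunks below are simulated, not real endpoint output:

```python
import codecs
import json


def iter_sse_events(chunks):
    """Yield parsed JSON objects from an iterable of raw byte chunks.

    Assumes 'data: <json>' lines separated by a blank line; payloads that are
    not JSON, such as 'data: [DONE]', are skipped.
    """
    # The incremental decoder holds back incomplete multi-byte characters.
    decoder = codecs.getincrementaldecoder("utf-8")()
    buffer = ""
    for chunk in chunks:
        buffer += decoder.decode(chunk)
        # Every complete event ends with a blank line; keep the unfinished tail.
        *events, buffer = buffer.split("\n\n")
        for event in events:
            for line in event.splitlines():
                if not line.startswith("data:"):
                    continue
                try:
                    yield json.loads(line[len("data:"):].strip())
                except json.JSONDecodeError:
                    pass  # e.g. the "[DONE]" sentinel


# Simulated stream: two events in the first chunk, with "é" split across chunks.
fake_chunks = [
    b'data: {"choices": [{"delta": {"content": "caf"}}]}\n\ndata: {"choices": [{"delta": {"content": "\xc3',
    b'\xa9"}}]}\n\ndata: [DONE]\n\n',
]
text = "".join(
    event["choices"][0]["delta"].get("content") or ""
    for event in iter_sse_events(fake_chunks)
)
print(text)  # café
```

With a real response, the chunks would come from `(t["PayloadPart"]["Bytes"] for t in response["Body"])`.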

### 4.3 Completion API non-stream mode

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct", trust_remote_code=True)
messages = [
    {"role": "user", "content": "write a quick sort algorithm in python."}
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

payload = {
    "model": "deepseek-ai/deepseek-coder-1.3b-instruct",
    "prompt": prompt,
    "max_tokens": 1024,
    "stream": False
}

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

print(json.loads(response['Body'].read())["choices"][0]["text"])

### 4.4 Completion API stream mode

In [None]:
payload = {
    "model": "deepseek-ai/deepseek-coder-1.3b-instruct",
    "prompt": prompt,
    "max_tokens": 1024,
    "stream": True
}

response = runtime.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

buffer = ""
for t in response['Body']:
    buffer += t["PayloadPart"]["Bytes"].decode()
    last_idx = 0
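    # re.MULTILINE lets `^` match at every event in the buffer, not only the first
    for match in re.finditer(r'^data:\s*(.+?)(\n\n)', buffer, flags=re.MULTILINE):
        # Mark the event as consumed even when it is not JSON (e.g. "data: [DONE]")
        last_idx = match.end()
        try:
            data = json.loads(match.group(1).strip())
            print(data["choices"][0]["text"], end="")
        except (json.JSONDecodeError, KeyError, IndexError):
            pass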
    buffer = buffer[last_idx:]
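
The two streaming loops in 4.2 and 4.4 differ only in where the generated text sits inside each event: `choices[0].delta.content` for the message API and `choices[0].text` for the completion API. A minimal sketch of one accessor that handles both shapes; `event_text` is a hypothetical helper, and the sample events are hand-written rather than captured from an endpoint:

```python
def event_text(event: dict) -> str:
    """Return the generated text carried by one streamed event, or "" if none."""
    choices = event.get("choices") or []
    if not choices:
        return ""
    choice = choices[0]
    if "delta" in choice:
        # Message API chunk; some chunks carry only a role and no content.
        return choice["delta"].get("content") or ""
    # Completion API chunk
    return choice.get("text") or ""


print(repr(event_text({"choices": [{"delta": {"content": "def "}}]})))    # 'def '
print(repr(event_text({"choices": [{"text": "quick"}]})))                 # 'quick'
print(repr(event_text({"choices": [{"delta": {"role": "assistant"}}]})))  # ''
```

Returning an empty string instead of raising keeps the printing loop free of `KeyError` handling.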
