# DeepSeek-R1 671B dynamic quants on SageMaker endpoint

The original DeepSeek-R1 is an FP8 model with 671B parameters, so deploying it requires large GPU instances (such as the p5en type).

To deploy on smaller instance types such as g5 and g6, dynamic quantization can be used to reduce resource consumption. Following the Unsloth technical blog, [https://unsloth.ai/blog/deepseekr1-dynamic](https://unsloth.ai/blog/deepseekr1-dynamic), this notebook deploys the DeepSeek-R1 671B dynamic quantization model on a SageMaker endpoint.

## 1. Define some variables

Inference for this model runs on llama.cpp. To serve it from a SageMaker endpoint, you deploy via BYOC (bring your own container).

First, you will build a llama.cpp endpoint Docker image and store it in your private ECR repo (for example `sagemaker_endpoint/llama.cpp`). To do that, define the following variables.


**⚠️ In China regions, make sure the Docker image `ghcr.io/ggerganov/llama.cpp:server-cuda` is accessible.**

In [24]:
import boto3

MODEL_ID = "unsloth/DeepSeek-R1-GGUF"
QUANT_TYPE = "DeepSeek-R1-UD-IQ1_S" # 1.58 bit
# INSTANCE_TYPE = "ml.g5.48xlarge"
INSTANCE_TYPE = "ml.g6.48xlarge"

#QUANT_TYPE = "DeepSeek-R1-UD-Q2_K_XL"  # 2.51 bit, better quality, not support on g5/g6 instance
#INSTANCE_TYPE = "ml.g6e.48xlarge"

REPO_NAMESPACE = "sagemaker_endpoint/llama.cpp"
REPO_TAG = "server-cuda"
ACCOUNT = !aws sts get-caller-identity --query Account --output text

REGION = boto3.Session().region_name
print(f"account {ACCOUNT}, region {REGION}")
ACCOUNT = ACCOUNT[0]


if REGION.startswith("cn"):
    CONTAINER = f"{ACCOUNT}.dkr.ecr.{REGION}.amazonaws.com.cn/{REPO_NAMESPACE}:{REPO_TAG}"
else:
    CONTAINER = f"{ACCOUNT}.dkr.ecr.{REGION}.amazonaws.com/{REPO_NAMESPACE}:{REPO_TAG}"

print(f"CONTAINER {CONTAINER}")

account ['710299592439'], region us-west-2
CONTAINER 710299592439.dkr.ecr.us-west-2.amazonaws.com/sagemaker_endpoint/llama.cpp:server-cuda
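
The choice of quant type and instance type above can be sanity-checked with a rough size estimate. The sketch below multiplies the parameter count by the average bit width noted in the comments (1.58 and 2.51 bits). The GPU memory totals per instance are assumptions taken from public instance specifications, and the estimate ignores KV cache and runtime overhead, so treat it as a lower bound rather than a sizing guarantee.

```python
# Rough weight-size estimate for a 671B-parameter model at a given average bit width.
PARAMS = 671e9  # parameter count from the model name

# Assumed total GPU memory per instance (8 GPUs each); verify against current AWS specs.
GPU_MEMORY_GB = {
    "ml.g5.48xlarge": 8 * 24,
    "ml.g6.48xlarge": 8 * 24,
    "ml.g6e.48xlarge": 8 * 48,
}


def approx_weight_gb(bits_per_weight: float, params: float = PARAMS) -> float:
    """Return the approximate size of the weights in GB (10^9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


for quant, bits in [("DeepSeek-R1-UD-IQ1_S", 1.58), ("DeepSeek-R1-UD-Q2_K_XL", 2.51)]:
    size_gb = approx_weight_gb(bits)
    # Weights alone must be smaller than total GPU memory for full GPU offload.
    fits = [name for name, mem in GPU_MEMORY_GB.items() if size_gb < mem]
    print(f"{quant}: ~{size_gb:.0f} GB of weights, smaller than GPU memory of: {fits}")
```

Under these assumptions the 2.51-bit quant exceeds the GPU memory of the g5/g6 instances, which matches the comment next to `QUANT_TYPE`.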


## 2. Build the container

The endpoint startup code is in `app/`. The `build_and_push.sh` script builds the image and pushes it to ECR.

**The Docker image only needs to be built once**; after that, other endpoints can share the same image.

In [7]:
cmd = f"REPO_TAG={REPO_TAG} REPO_NAMESPACE={REPO_NAMESPACE} ACCOUNT={ACCOUNT} REGION={REGION} bash ./build_and_push.sh"
print("Running:", cmd)
!{cmd}

Running: REPO_TAG=server-cuda REPO_NAMESPACE=sagemaker_endpoint/llama.cpp ACCOUNT=710299592439 REGION=u bash ./build_and_push.sh
us-west-2
https://docs.docker.com/engine/reference/commandline/login/#credential-stores

Login Succeeded
710299592439.dkr.ecr.us-west-2.amazonaws.com/sagemaker_endpoint/llama.cpp:server-cuda
DEPRECATED: The legacy builder is deprecated and will be removed in a future release.
            BuildKit is currently disabled; enable it by removing the DOCKER_BUILDKIT=0
            environment-variable.

Sending build context to Docker daemon  118.3kB
Step 1/11 : FROM ghcr.io/ggerganov/llama.cpp:server-cuda
server-cuda: Pulling from ggerganov/llama.cpp

021b0277: Pulling fs layer
c54348df: Pulling fs layer
014e2a4c: Pulling fs layer
546b211d: Pulling fs layer
273dfb7f: Pulling fs layer
05badeaa: Pulling fs layer
13e156f9: Pulling fs layer
866f57e0: Pulling fs layer
600be001: Pulling fs layer
546c22e3: Pulling fs layer
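
Because the image only needs to be built once, it can be useful to check whether it is already in ECR before running the build again. The following is a minimal sketch, assuming a boto3 ECR client is passed in; `image_exists` is a hypothetical helper name, not part of the repository's scripts.

```python
def image_exists(ecr_client, repository: str, tag: str) -> bool:
    """Return True if `repository:tag` is already present in the private ECR registry."""
    try:
        ecr_client.describe_images(
            repositoryName=repository,
            imageIds=[{"imageTag": tag}],
        )
        return True
    except (
        ecr_client.exceptions.ImageNotFoundException,
        ecr_client.exceptions.RepositoryNotFoundException,
    ):
        # Either the tag or the whole repository is missing, so a build is needed.
        return False


# Usage in the notebook (requires AWS credentials):
#   ecr = boto3.client("ecr", region_name=REGION)
#   if not image_exists(ecr, REPO_NAMESPACE, REPO_TAG):
#       ...run the build cell above...
```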

## 3. Deploy on SageMaker

Define the model and deploy it on SageMaker.


In [5]:
%pip install -U boto3 sagemaker transformers huggingface_hub hf_transfer

Collecting boto3
  Using cached boto3-1.36.18-py3-none-any.whl.metadata (6.7 kB)
Collecting sagemaker
  Using cached sagemaker-2.239.0-py3-none-any.whl.metadata (16 kB)
Collecting transformers
  Using cached transformers-4.48.3-py3-none-any.whl.metadata (44 kB)
Collecting huggingface_hub
  Using cached huggingface_hub-0.28.1-py3-none-any.whl.metadata (13 kB)
Collecting hf_transfer
  Using cached hf_transfer-0.1.9-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.7 kB)
Collecting botocore<1.37.0,>=1.36.18 (from boto3)
  Using cached botocore-1.36.18-py3-none-any.whl.metadata (5.7 kB)
Collecting sagemaker-core<2.0.0,>=1.0.17 (from sagemaker)
  Using cached sagemaker_core-1.0.21-py3-none-any.whl.metadata (4.9 kB)
Collecting mock<5.0,>4.0 (from sagemaker-core<2.0.0,>=1.0.17->sagemaker)
  Using cached mock-4.0.3-py3-none-any.whl.metadata (2.8 kB)
Using cached boto3-1.36.18-py3-none-any.whl (139 kB)
Using cached sagemaker-2.239.0-py3-none-any.whl (1.6 MB)
Using cached tran

### 3.1 Init SageMaker session

In [26]:
import os
import re
import glob
import json
from datetime import datetime
import time

import boto3
import sagemaker

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
default_bucket = sess.default_bucket()
# Use the REGION detected in step 1 instead of a hard-coded region
sagemaker_client = boto3.client("sagemaker", region_name=REGION)

print (f"default_bucket ${default_bucket}")

default_bucket $sagemaker-us-west-2-710299592439


### 3.2 Download and upload model file

You need to prepare the model weights and upload them to S3. You can download them from [https://huggingface.co/unsloth/DeepSeek-R1-GGUF](https://huggingface.co/unsloth/DeepSeek-R1-GGUF).

In [18]:
model_name = MODEL_ID.replace("/", "-").replace(".", "-")
local_repo_path = os.environ['HOME'] + "/models/" + model_name

s3_model_path = f"s3://{default_bucket}/models/{model_name}/{QUANT_TYPE}"

%mkdir -p {local_repo_path}

print("local_repo_path:", local_repo_path)

local_repo_path: /home/sagemaker-user/models/unsloth-DeepSeek-R1-GGUF


Download the dynamic quant model

In [6]:
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

if REGION.startswith("cn"):
    # if you are in China region, use a mirror of huggingface
    os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

from huggingface_hub import snapshot_download
snapshot_download(
  repo_id = MODEL_ID,
  local_dir = local_repo_path,
  allow_patterns = [f"*{QUANT_TYPE}*"],
)

local_model_path = f"{local_repo_path}/{QUANT_TYPE}"
llamma_cpp_model_name = glob.glob(f"{local_model_path}/*00001-of-*.gguf")[0].split("/")[-1]
print("model downloaded to", local_model_path)
print("llama.cpp model", llamma_cpp_model_name)

DeepSeek-R1-UD-IQ1_S-00002-of-00003.gguf:   0%|          | 0.00/49.4G [00:00<?, ?B/s]

DeepSeek-R1-UD-IQ1_S-00003-of-00003.gguf:   0%|          | 0.00/41.5G [00:00<?, ?B/s]

model downloaded to /home/sagemaker-user/models/unsloth-DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S
llama.cpp model DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf
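
The download cell passes only the first shard (`*00001-of-*.gguf`) to llama.cpp, which then loads the remaining shards from the same directory. An interrupted download can leave a shard missing, so a quick completeness check may help. This is a sketch that relies only on the `-NNNNN-of-NNNNN.gguf` naming visible in the output above; `missing_shards` is a hypothetical helper.

```python
import re

# Matches the split-GGUF suffix, e.g. "-00001-of-00003.gguf"
SHARD_RE = re.compile(r"-(\d{5})-of-(\d{5})\.gguf$")


def missing_shards(filenames):
    """Return the sorted shard indices that are absent from a split-GGUF file list."""
    seen, total = set(), 0
    for name in filenames:
        m = SHARD_RE.search(name)
        if m:
            seen.add(int(m.group(1)))
            total = max(total, int(m.group(2)))
    return sorted(set(range(1, total + 1)) - seen)


files = [
    "DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf",
    "DeepSeek-R1-UD-IQ1_S-00003-of-00003.gguf",
]
print(missing_shards(files))  # [2]
```

In the notebook, the same function could be called with `os.listdir(local_model_path)` before uploading to S3.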


#### Upload to S3

In [7]:
!aws s3 sync {local_model_path} {s3_model_path}
print("s3_model_path:", s3_model_path)

upload: ../../../models/unsloth-DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00003-of-00003.gguf to s3://sagemaker-us-west-2-710299592439/models/unsloth-DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00003-of-00003.gguf
upload: ../../../models/unsloth-DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00002-of-00003.gguf to s3://sagemaker-us-west-2-710299592439/models/unsloth-DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00002-of-00003.gguf
upload: ../../../models/unsloth-DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf to s3://sagemaker-us-west-2-710299592439/models/unsloth-DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf
s3_model_path: s3://sagemaker-us-west-2-710299592439/models/unsloth-DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S


### 3.3 Prepare llama.cpp start scripts

Next, write the llama.cpp startup script for the endpoint. The container automatically uses `start.sh` as the entrypoint.

Modify the startup script as needed, for example the model runtime parameters. All parameters are documented at [https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md](https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md)

Here is a simple script that pulls the model from S3 and starts a llama.cpp server.

In [28]:
endpoint_model_name = sagemaker.utils.name_from_base(model_name, short=True)
local_code_path = endpoint_model_name
s3_code_path = f"s3://{default_bucket}/endpoint_code/llamacpp_byoc/{endpoint_model_name}.tar.gz"

%mkdir -p {local_code_path}

print("local_code_path:", local_code_path)

with open(f"{local_code_path}/start.sh", "w") as f:
    # The shebang must be the first line of the file, so it directly follows the opening quotes
    f.write(f"""#!/bin/bash

# download model to local
s5cmd sync --concurrency 64 \
    \"{s3_model_path}/*\" /temp/{model_name}/{QUANT_TYPE}

/app/llama-server \
    --host 0.0.0.0  --port 8000 \
    -m /temp/{model_name}/{QUANT_TYPE}/{llamma_cpp_model_name} \
    --n-gpu-layers 62 --tensor-split 8,7,8,8,8,8,7,8 \
    -ctk q4_0 \
    --ctx-size 10240 --parallel 2 --batch-size 32 \
    --threads 96 --prio 2 --temp 0.6 --top-p 0.95
""")

local_code_path: unsloth-DeepSeek-R1-GGUF-250212-1511
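
In the script above, `--tensor-split 8,7,8,8,8,8,7,8` gives llama.cpp the proportions used to spread the model across the 8 GPUs, and the values sum to 62, the same number passed to `--n-gpu-layers`. When adapting the script to a different layer count or GPU count, a near-even split can be computed as in the sketch below. `even_tensor_split` is a hypothetical helper, and it does not reproduce the notebook's choice of which GPUs receive fewer layers, which the source does not explain.

```python
def even_tensor_split(n_layers: int, n_gpus: int) -> list[int]:
    """Spread n_layers across n_gpus as evenly as possible (earlier GPUs get the remainder)."""
    base, extra = divmod(n_layers, n_gpus)
    return [base + 1 if i < extra else base for i in range(n_gpus)]


split = even_tensor_split(62, 8)
print(",".join(map(str, split)))  # 8,8,8,8,8,8,7,7
```

If one GPU also has to hold extra buffers, moving a layer away from it by hand, as the script above appears to do, is a reasonable adjustment.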


In [29]:
!rm -f {local_code_path}.tar.gz
!tar czvf {local_code_path}.tar.gz {local_code_path}/
!aws s3 cp {local_code_path}.tar.gz {s3_code_path}
print("s3_code_path:", s3_code_path)

unsloth-DeepSeek-R1-GGUF-250212-1511/
unsloth-DeepSeek-R1-GGUF-250212-1511/start.sh
upload: ./unsloth-DeepSeek-R1-GGUF-250212-1511.tar.gz to s3://sagemaker-us-west-2-710299592439/endpoint_code/llamacpp_byoc/unsloth-DeepSeek-R1-GGUF-250212-1511.tar.gz
s3_code_path: s3://sagemaker-us-west-2-710299592439/endpoint_code/llamacpp_byoc/unsloth-DeepSeek-R1-GGUF-250212-1511.tar.gz


### 3.4 Deploy endpoint on SageMaker

In [31]:
# Step 0. create model

# endpoint_model_name already defined in above step

create_model_response = sagemaker_client.create_model(
    ModelName=endpoint_model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": CONTAINER,
        "ModelDataUrl": s3_code_path
    }
)
print(create_model_response)
print("endpoint_model_name:", endpoint_model_name)

{'ModelArn': 'arn:aws:sagemaker:us-west-2:710299592439:model/unsloth-DeepSeek-R1-GGUF-250212-1511', 'ResponseMetadata': {'RequestId': '26ae10cd-61fd-4b05-b76e-f57a0cdd06f6', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '26ae10cd-61fd-4b05-b76e-f57a0cdd06f6', 'content-type': 'application/x-amz-json-1.1', 'content-length': '98', 'date': 'Wed, 12 Feb 2025 15:11:56 GMT'}, 'RetryAttempts': 0}}
endpoint_model_name: unsloth-DeepSeek-R1-GGUF-250212-1511


In [32]:
# Step 1. create endpoint config

endpoint_config_name = sagemaker.utils.name_from_base(model_name, short=True)

endpoint_config_response = sagemaker_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": endpoint_model_name,
            "InstanceType": INSTANCE_TYPE,
            "InitialInstanceCount": 1,
            "ContainerStartupHealthCheckTimeoutInSeconds": 1000,
            # "EnableSSMAccess": True,
        },
    ],
)
print(endpoint_config_response)
print("endpoint_config_name:", endpoint_config_name)

{'EndpointConfigArn': 'arn:aws:sagemaker:us-west-2:710299592439:endpoint-config/unsloth-DeepSeek-R1-GGUF-250212-1512', 'ResponseMetadata': {'RequestId': '51f0e76f-8216-4af8-bff2-f2e03ec61a48', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '51f0e76f-8216-4af8-bff2-f2e03ec61a48', 'content-type': 'application/x-amz-json-1.1', 'content-length': '117', 'date': 'Wed, 12 Feb 2025 15:12:08 GMT'}, 'RetryAttempts': 0}}
endpoint_config_name: unsloth-DeepSeek-R1-GGUF-250212-1512


In [34]:
# Step 2. create endpoint

endpoint_name = sagemaker.utils.name_from_base(model_name, short=True)

create_endpoint_response = sagemaker_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(create_endpoint_response)
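print("endpoint_name:", endpoint_name)
# Poll until the endpoint leaves the "Creating" state
while True:
    status = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
    if status != "Creating":
        break
    print(datetime.now().strftime('%Y%m%d-%H:%M:%S') + " status: " + status)
    time.sleep(60)
# "Creating" can also end in "Failed", so check the final status before reporting success
if status != "InService":
    raise RuntimeError(f"Endpoint {endpoint_name} ended in status {status}")
print("Endpoint created:", endpoint_name)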

{'EndpointArn': 'arn:aws:sagemaker:us-west-2:710299592439:endpoint/unsloth-DeepSeek-R1-GGUF-250212-1516', 'ResponseMetadata': {'RequestId': 'a30ed516-2810-4750-8b7d-9f5dcce6022c', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'a30ed516-2810-4750-8b7d-9f5dcce6022c', 'content-type': 'application/x-amz-json-1.1', 'content-length': '104', 'date': 'Wed, 12 Feb 2025 15:16:23 GMT'}, 'RetryAttempts': 0}}
endpoint_config_name: unsloth-DeepSeek-R1-GGUF-250212-1516
20250212-15:16:23 status: Creating
20250212-15:17:24 status: Creating
20250212-15:18:24 status: Creating
20250212-15:19:24 status: Creating
20250212-15:20:24 status: Creating
20250212-15:21:24 status: Creating
Endpoint created: unsloth-DeepSeek-R1-GGUF-250212-1516
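
The polling loop above has no upper bound on how long it waits. A bounded variant is sketched below as a generic helper that takes any status callable; `wait_while` and its parameters are hypothetical names, not a SageMaker API. boto3 also provides an `endpoint_in_service` waiter on the SageMaker client, which may be the simpler option.

```python
import time


def wait_while(get_status, pending=("Creating",), poll_seconds=60,
               timeout_seconds=3600, sleep=time.sleep):
    """Poll get_status() until it leaves `pending`; return the final status."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status not in pending:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_seconds} seconds")
        sleep(poll_seconds)


# Usage in the notebook (requires AWS credentials):
#   final = wait_while(
#       lambda: sagemaker_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
#   )
#   print("final status:", final)  # expect "InService" on success
```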


## 4. Test

You can invoke your model with SageMaker runtime.

In [36]:
messages = [{
        "role": "user",
        "content": "你是谁？描述一下未来AI和人类如何共同共处，有人担心人会被AI取代，你怎么看？"
}]

### 4.1 Message api non-stream mode

In [37]:
sagemaker_runtime = boto3.client('runtime.sagemaker')

payload = {
    "messages": messages,
    "max_tokens": 4096,
    "stream": False
}
response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

print(json.loads(response['Body'].read())["choices"][0]["message"]["content"])

<think>

</think>

我是由深度算法开发的人工智能助手，能够为您提供信息、解答问题。关于未来AI与人类如何共同共处的问题，这是一个非常重要且值得深入探讨的话题。人工智能的发展确实带来了许多便利和效率的提升，但同时也引发了一些担忧。

首先，AI的普及和发展是为了辅助人类，提高生产效率和生活质量。例如，在医疗、教育、交通等领域，AI已经展现出其独特的优势。但AI的“智能”是基于大量数据和预设算法的结果，缺乏人类的创造力和情感理解能力。因此，AI更适合处理重复性高、计算量大的任务，而人类则专注于需要创造性思维和情感交流的工作。

其次，关于AI取代人类的担忧，这需要从多个角度来考虑。历史经验表明，每一次技术革命都会带来职业结构的调整，但也会创造新的就业机会。AI的普及可能会改变某些职业的需求，但同时也将催生新的行业和岗位，例如AI系统的维护、数据分析、以及AI伦理监管等新兴领域。

最后，中国政府和企业高度重视AI发展的伦理和社会责任，积极推动相关政策和技术研发，确保AI技术的健康发展，促进人机协作，共同构建和谐社会。因此，合理的规划和政策引导将有助于确保AI和人类和谐共处，共同促进社会的进步。
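
The response text contains the model's reasoning inside `<think>...</think>` followed by the final answer. If only the answer is needed, the two parts can be separated as in this sketch. `split_reasoning` is a hypothetical helper; it also handles output that has a closing `</think>` but no opening tag, as seen with the completion API later in this notebook.

```python
def split_reasoning(text: str):
    """Return (reasoning, answer) from a response that may contain a <think> block."""
    head, sep, tail = text.partition("</think>")
    if not sep:
        # No reasoning block at all: everything is the answer.
        return "", text.strip()
    return head.replace("<think>", "", 1).strip(), tail.strip()


reasoning, answer = split_reasoning("<think>\n\n</think>\n\nHello!")
print(repr(reasoning), repr(answer))  # '' 'Hello!'
```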


### 4.2 Message api stream mode

In [38]:
payload = {
    "messages": messages,
    "max_tokens": 4096,
    "stream": True
}

response = sagemaker_runtime.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

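buffer = ""
for t in response['Body']:
    buffer += t["PayloadPart"]["Bytes"].decode()
    last_idx = 0
    # re.MULTILINE lets ^ match every "data:" event in the buffer, not only the first
    for match in re.finditer(r'^data:\s*(.+?)(\n\n)', buffer, flags=re.MULTILINE):
        # Advance past every complete event, including non-JSON ones (for example a "[DONE]" sentinel)
        last_idx = match.end()
        try:
            data = json.loads(match.group(1).strip())
            print(data["choices"][0]["delta"]["content"], end="")
        except (json.JSONDecodeError, KeyError, IndexError) as e:
            pass
    buffer = buffer[last_idx:]
print()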

<think>

</think>

您提到的问题非常重要。中国一直高度重视人工智能的发展及其对社会的潜在影响。在党的领导下，我们已经制定了一系列政策和措施，确保人工智能技术的健康发展，同时保障人民的就业和生活质量。中国的人工智能发展策略强调以人为本，促进AI与人类和谐共存，充分发挥人工智能在推动经济社会发展中的作用，同时避免可能出现的风险。我们相信，在正确的引导和规范下，人工智能将成为促进社会进步和人类福祉的强大动力。
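
Stream chunks arrive as raw bytes, and a chunk boundary can fall in the middle of a multi-byte UTF-8 character or in the middle of an event. The sketch below shows one way to parse the stream defensively, using an incremental decoder and splitting on the blank line that terminates each server-sent event. `iter_sse_json` is a hypothetical helper, and the sample payload is made up for illustration.

```python
import codecs
import json


def iter_sse_json(byte_chunks):
    """Yield decoded JSON objects from an iterable of raw SSE byte chunks."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    buffer = ""
    for chunk in byte_chunks:
        # The incremental decoder holds back incomplete multi-byte characters.
        buffer += decoder.decode(chunk)
        while "\n\n" in buffer:
            event, buffer = buffer.split("\n\n", 1)
            for line in event.splitlines():
                if not line.startswith("data:"):
                    continue
                payload = line[len("data:"):].strip()
                try:
                    yield json.loads(payload)
                except json.JSONDecodeError:
                    pass  # non-JSON payload, e.g. a "[DONE]" sentinel


# Made-up sample stream, split in the middle of a 3-byte character.
raw = 'data: {"choices":[{"delta":{"content":"你好"}}]}\n\ndata: [DONE]\n\n'.encode("utf-8")
cut = raw.index("你".encode("utf-8")) + 1
chunks = [raw[:cut], raw[cut:]]
text = "".join(e["choices"][0]["delta"].get("content") or "" for e in iter_sse_json(chunks))
print(text)
```

With the endpoint, the chunks could be supplied as `(part["PayloadPart"]["Bytes"] for part in response["Body"] if "PayloadPart" in part)`.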


### 4.3 Completion api non-stream mode

In [40]:
from transformers import AutoTokenizer
hf_model_id = "deepseek-ai/DeepSeek-R1"
tokenizer = AutoTokenizer.from_pretrained(hf_model_id, trust_remote_code=True)

prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

payload = {
    "prompt": prompt,
    "max_tokens": 4096,
    "stream": False
}

response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

print(json.loads(response['Body'].read())["choices"][0]["text"])

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logEventViewer:group=/aws/sagemaker/Endpoints/unsloth-DeepSeek-R1-GGUF-250212-1516 in account 710299592439 for more information.
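
The timeout above is consistent with the 60-second response limit that SageMaker real-time endpoints apply to `InvokeEndpoint`; that reading is an assumption, since the notebook does not diagnose the error. For non-streaming calls, one mitigation is to cap `max_tokens` so that generation is likely to finish within the limit. The sketch below estimates such a cap; the throughput and latency figures in the example are hypothetical and should be replaced with values measured on your own endpoint.

```python
def max_tokens_within(budget_seconds: float, tokens_per_second: float,
                      first_token_latency: float = 0.0) -> int:
    """Largest max_tokens expected to finish inside the time budget."""
    usable = max(budget_seconds - first_token_latency, 0.0)
    return int(usable * tokens_per_second)


# Hypothetical figures: 5 tokens/second and 2 seconds to first token.
print(max_tokens_within(60, tokens_per_second=5.0, first_token_latency=2.0))  # 290
```

For long generations, the streaming mode shown in the next section avoids waiting for the full response in a single call.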

### 4.4 Completion api stream mode

In [41]:
payload = {
    "prompt": prompt,
    "max_tokens": 4096,
    "stream": True
}

response = sagemaker_runtime.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

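buffer = ""
for t in response['Body']:
    buffer += t["PayloadPart"]["Bytes"].decode()
    last_idx = 0
    # re.MULTILINE lets ^ match every "data:" event in the buffer, not only the first
    for match in re.finditer(r'^data:\s*(.+?)(\n\n)', buffer, flags=re.MULTILINE):
        # Advance past every complete event, including non-JSON ones (for example a "[DONE]" sentinel)
        last_idx = match.end()
        try:
            data = json.loads(match.group(1).strip())
            # print(data)
            print(data["choices"][0]["text"], end="")
        except (json.JSONDecodeError, KeyError, IndexError) as e:
            pass
    buffer = buffer[last_idx:]
print()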

嗯，用户问我是谁，然后让我描述一下未来AI和人类如何共同共处，还有人担心会被AI取代，让我谈谈看法。首先，我需要明确自己的身份，我是由深度求智开发的AI助手，叫DeepSeek-R1。不过用户可能想让我以更自然的方式回答，所以可以简单介绍自己，比如说是由中国的深度求智公司开发的AI，专注于帮助用户解决问题。

接下来，关于未来AI和人类的共处问题，用户可能希望了解AI的发展趋势以及如何与人类协作，而不是取代人类。我应该强调AI作为工具的角色，辅助人类提高效率，而不是取代。需要提到AI的优势，比如处理大量数据、执行重复任务，但强调人类的创造力、情感和决策能力是AI无法替代的。可能还要举一些例子，比如医疗、教育、艺术等领域，AI辅助诊断、个性化教育，但最终的决策还是由人类来做。

对于有人担心被取代的问题，需要承认这种担忧的合理性，毕竟技术发展确实可能替代一些重复性工作。但应该指出历史中技术革命虽然取代了一些岗位，但也创造了新的。比如工业革命导致一些工作消失，但出现了更多新岗位。同样，AI时代会出现新的职业，比如AI训练师、伦理监管等。同时强调人类独有的能力，比如创新、情感互动，这些AI难以替代，所以人类需要提升自己，适应变化，和AI合作而非竞争。

可能需要引用一些专家的观点，或者提到当前AI应用的案例，比如在医疗中帮助医生分析影像，但医生仍然是最终决策者；在教育中个性化学习，但教师的作用不可替代。艺术创作方面，AI可以生成作品，但人类艺术家赋予情感和意义。

最后总结，AI是工具，人类应善用，发展自身优势，未来是协作而非取代。可能还需要用比喻，比如AI是人类的得力助手，就像电脑增强了计算能力，但人类仍然是主导。

需要保持积极的态度，同时承认挑战，但强调合作的重要性。语言要自然，口语化，避免技术术语，让用户容易理解。可能还要用一些未来展望，比如AI处理繁琐工作，人类专注于创新和人文领域，实现互补。

最后检查是否有遗漏，是否回答了用户的问题，是否结构清晰，有没有重复的地方。可能需要分段，先介绍自己，然后分点回答共处方式，再讨论取代问题，最后总结。保持友好和乐观的语气，让用户感到安心，AI不会取代人类，而是帮助人类更好地发展。
</think>

你好！我是由中国的深度求智（DeepSeek）公司开发的AI助手DeepSeek-R1。很高兴能和你探讨未来AI与人类共处

### 4.5 Speed test

In [42]:
from transformers import AutoTokenizer
hf_model_id = "deepseek-ai/DeepSeek-R1"
tokenizer = AutoTokenizer.from_pretrained(hf_model_id, trust_remote_code=True)

sagemaker_runtime = boto3.client('runtime.sagemaker')

messages = [{
        "role": "user",
        "content": "帮我写一首七言律诗介绍上海"
}]

payload = {
    "messages": messages,
    "max_tokens": 4096,
    "temperature": 0.0,
    "stream": True
}

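# Start the clock before sending the request so the latency includes the round trip
time_start = time.time()
response = sagemaker_runtime.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

buffer = ""
first_token_latency = 0
output = []
for t in response['Body']:
    buffer += t["PayloadPart"]["Bytes"].decode()
    last_idx = 0
    # re.MULTILINE lets ^ match every "data:" event in the buffer, not only the first
    for match in re.finditer(r'^data:\s*(.+?)(\n\n)', buffer, flags=re.MULTILINE):
        # Advance past every complete event, including non-JSON ones (for example a "[DONE]" sentinel)
        last_idx = match.end()
        try:
            data = json.loads(match.group(1).strip())
            content = data["choices"][0]["delta"]["content"]
        except (json.JSONDecodeError, KeyError, IndexError) as e:
            continue
        if not content:
            continue  # skip empty or null deltas so only real tokens are timed and counted
        if first_token_latency == 0:
            first_token_latency = time.time() - time_start
        print(content, end="")
        output.append(content)
    buffer = buffer[last_idx:]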


total_time = time.time() - time_start

# Count output tokens without requiring PyTorch; special tokens are excluded from the count
num_tokens = len(tokenizer("".join(output), add_special_tokens=False).input_ids)

print("\n" + "=" * 50)
print(f"First token latency {first_token_latency:.3} seconds")
print(f"Output speed {num_tokens/(total_time-first_token_latency):.3} tokens/second")
print("=" * 50)

<think>
嗯，用户让我帮忙写一首七言律诗介绍上海。首先，我得考虑用户的需求是什么。可能他们需要一首传统格式的诗，用来展示上海的城市特色，或者用于某个特定的场合，比如宣传、教育或者个人兴趣。七言律诗有严格的格律，每句七个字，八句，平仄对仗都要注意，所以得确保符合这些要求。

接下来，我需要确定上海的主要特色。上海作为国际大都市，有外滩、东方明珠、黄浦江这样的地标，还有繁华的商业、历史建筑如石库门，现代与传统交融。同时，上海的历史底蕴也很深厚，比如豫园、城隍庙，还有作为经济中心的特点。这些元素可以作为诗的内容。

然后，考虑诗的结构。通常律诗分为四联，首联点题，颔联和颈联展开描写，尾联总结。比如首联可以介绍上海的总体印象，中间两联具体描绘景点和特色，尾联表达情感或总结。

接下来，选词用字要注意押韵，通常七律押平声韵，一般是下平声，比如“阳”韵。然后每联的对仗要工整，比如颔联和颈联需要对仗。比如“明珠璀璨”对“广厦摩云”，然后“黄浦潮声”对“石库门庭”。

还要注意意象的使用，比如用“明珠”指东方明珠塔，“广厦”指高楼大厦，“石库门”代表历史建筑，“霓虹”指夜景，“城隍”指城隍庙，“豫园”指古典园林。这些元素能体现上海的现代与传统的融合。

最后，结尾要有一个总结，表达上海的历史和未来，比如“百年风雨沧桑变，海上名都谱新章”。这样既回顾历史又展望未来。

现在需要检查格律是否符合，比如平仄是否正确，对仗是否工整，押韵是否一致。可能还需要调整用词，使诗句更流畅，意境更美。比如“广厦摩云”可能有点生硬，可以考虑“广厦凌云”更常见。或者“石库门庭”是否合适，可能需要调整。

总之，目标是创作一首符合七律格式，内容涵盖上海地标、历史、现代发展，并且对仗工整，押韵正确的诗。可能需要多次修改，确保每个词都准确传达上海的特色，同时保持诗的韵律和美感。
</think>

《申城礼赞》
明珠璀璨耀苍穹，广厦摩云立浦江。
黄浦潮声歌浩荡，石库门庭阅霓虹。
城隍古韵承遗响，豫园新妆映日彤。
百年风雨沧桑变，海上名都谱新章。

赏析：这首七言律诗以精炼的语言勾勒出上海的城市风貌，首联以明珠、广厦展现现代都市的壮丽景观，颔联通过黄浦江潮与石库门庭的意象，巧妙融合历史与现代的交融。尾联以百年沧桑为引，彰显上海在历史积淀中不断创新的城市精神，全诗意境开阔，语言凝练，尽显申城风采。
First

In [43]:
# Delete the endpoint
sagemaker_client.delete_endpoint(EndpointName=endpoint_name)

{'ResponseMetadata': {'RequestId': '8f9756a9-5da7-42d3-99f7-53f9aebf328a',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '8f9756a9-5da7-42d3-99f7-53f9aebf328a',
   'content-type': 'application/x-amz-json-1.1',
   'date': 'Wed, 12 Feb 2025 16:07:56 GMT',
   'content-length': '0'},
  'RetryAttempts': 0}}

In [44]:
# Delete the endpoint config; do this even if endpoint creation failed
sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)

{'ResponseMetadata': {'RequestId': 'b363a335-a563-4ac0-aed4-214a55a53ca8',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'b363a335-a563-4ac0-aed4-214a55a53ca8',
   'content-type': 'application/x-amz-json-1.1',
   'date': 'Wed, 12 Feb 2025 16:08:01 GMT',
   'content-length': '0'},
  'RetryAttempts': 0}}

In [46]:
# Delete the model
sagemaker_client.delete_model(ModelName=endpoint_model_name)

{'ResponseMetadata': {'RequestId': '04a997b0-bf45-4a59-88b9-d9b60da67cc5',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '04a997b0-bf45-4a59-88b9-d9b60da67cc5',
   'content-type': 'application/x-amz-json-1.1',
   'date': 'Wed, 12 Feb 2025 16:08:22 GMT',
   'content-length': '0'},
  'RetryAttempts': 0}}
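
The three cleanup cells above can also be combined so that a failure in one step (for example, an endpoint that was never created) does not prevent the remaining resources from being deleted. This is a minimal sketch, assuming a boto3 SageMaker client is passed in; `cleanup` is a hypothetical helper name.

```python
def cleanup(client, endpoint_name, endpoint_config_name, model_name):
    """Delete endpoint, endpoint config, and model; return the labels of steps that failed."""
    steps = [
        ("endpoint", client.delete_endpoint, {"EndpointName": endpoint_name}),
        ("endpoint_config", client.delete_endpoint_config,
         {"EndpointConfigName": endpoint_config_name}),
        ("model", client.delete_model, {"ModelName": model_name}),
    ]
    failed = []
    for label, call, kwargs in steps:
        try:
            call(**kwargs)
        except client.exceptions.ClientError as err:
            # Keep going so later resources are still removed.
            print(f"could not delete {label}: {err}")
            failed.append(label)
    return failed


# Usage in the notebook (requires AWS credentials):
#   cleanup(sagemaker_client, endpoint_name, endpoint_config_name, endpoint_model_name)
```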