# Serve Falcon 7B model with Amazon SageMaker Hosting

---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/inference%7Cgenerativeai%7Cllm-workshop%7Clab10-falcon-40b-and-7b%7Cfalcon-7b-accelerate.ipynb)

---

In this example we walk through how to deploy and perform inference on the **Falcon 7B model** using the **Large Model Inference (LMI)** container provided by AWS, with **DJL Serving** and **Accelerate**. The **Falcon 7B model** is a causal decoder model, like the **Falcon 40B model**.


## Setup

Install the dependencies required to package the model and run inference using Amazon SageMaker. This upgrades the `sagemaker` and `boto3` packages.

In [1]:
!pip install sagemaker boto3 --upgrade  --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
jupyter-events 0.6.0 requires jsonschema[format-nongpl]>=4.3.0, but you have jsonschema 3.2.0 which is incompatible.
docker-compose 1.29.2 requires PyYAML<6,>=3.10, but you have pyyaml 6.0 which is incompatible.
distributed 2022.11.0 requires tornado<6.2,>=6.0.3, but you have tornado 6.2 which is incompatible.
awscli 1.27.71 requires botocore==1.29.71, but you have botocore 1.29.155 which is incompatible.
awscli 1.27.71 requires PyYAML<5.5,>=3.10, but you have pyyaml 6.0 which is incompatible.

In [2]:
import sagemaker
import jinja2
from sagemaker import image_uris
import boto3
import os
import time
import json
from pathlib import Path
from sagemaker.utils import name_from_base

## Imports and variables

In [3]:
role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts
model_bucket = sess.default_bucket()  # bucket to house model artifacts

s3_code_prefix_accelerate = "baichuan-7B/code_baichuan7b"  # folder within bucket where code artifact will go

s3_model_prefix = (
    "baichuan-7B/model_baichuan7b"  # folder within bucket where model artifacts will go
)
region = sess.boto_region_name  # public attribute, equivalent to the private `_region_name`
account_id = sess.account_id()

s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

jinja_env = jinja2.Environment()

In [None]:
## [OPTIONAL] Download the model from Hugging Face and upload the model artifacts on Amazon S3

In [None]:
# from huggingface_hub import snapshot_download
# from pathlib import Path
# import os

# # - This will download the model into the current directory, wherever the Jupyter notebook is running
# local_model_path = Path("./model")
# local_model_path.mkdir(exist_ok=True)
# model_name = "tiiuae/falcon-7b"
# # Only download pytorch checkpoint files
# allow_patterns = ["*.json", "*.pt", "*.bin", "*.txt", "*.model", "*.py"]

# # - Leverage the snapshot library to download the model since the model is stored in repository using LFS
# model_download_path = snapshot_download(
#     repo_id=model_name,
#     cache_dir=local_model_path,
#     allow_patterns=allow_patterns,
# )

In [99]:

# # define a variable to contain the s3url of the location that has the model
pretrained_model_location = f"s3://{model_bucket}/{s3_model_prefix}/"
print(f"Pretrained model will be uploaded to ---- > {pretrained_model_location}")


Pretrained model will be uploaded to ---- > s3://sagemaker-cn-north-1-394224607677/baichuan-7B/model_baichuan7b/


In [None]:
# model_artifact = sess.upload_data(path=model_download_path, key_prefix=s3_model_prefix)
# print(f"Model uploaded to --- > {model_artifact}")
# print(f"We will set option.s3url={model_artifact}")

In [None]:
# !rm -rf {model_download_path}

### 1. Create SageMaker compatible model artifacts

To prepare our model for deployment to a SageMaker endpoint, we need to prepare a few files for SageMaker and our container. We will use a local folder to hold these files, including **serving.properties**, which defines parameters for the LMI container, and **requirements.txt**, which lists the dependencies to install.

In [124]:
!rm -rf code_baichuan7b_accelerate
!mkdir -p code_baichuan7b_accelerate

In the **serving.properties** file, define the **engine** to use and the **model** to host. Note the **tensor_parallel_degree** parameter, which is set to **1** in this scenario: since the entire model fits on a single GPU, we do not have to divide it into multiple parts. In this case we use a single-GPU instance: an `ml.g5.2xlarge` provides **1** GPU, as does the `ml.g4dn.2xlarge` used in the endpoint configuration later in this notebook. Be careful not to specify a value larger than the number of GPUs the instance provides, or your deployment will fail.
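As a rough sanity check on the claim that the model fits on one GPU, the sketch below estimates the memory taken by the weights alone and validates a tensor parallel degree against a small lookup table of GPU counts. The helper names and the table are hypothetical and incomplete; the estimate assumes about 7 billion parameters stored in bfloat16 (2 bytes each, matching the `torch_dtype` used in `model.py`) and ignores activations and the KV cache.

```python
# Illustrative, incomplete table of GPU counts per instance type (an assumption of this sketch).
GPUS_PER_INSTANCE = {
    "ml.g4dn.2xlarge": 1,
    "ml.g5.2xlarge": 1,
    "ml.g5.12xlarge": 4,
    "ml.g5.48xlarge": 8,
}


def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the weights only, in GiB (no activations, no KV cache)."""
    return num_params * bytes_per_param / 2**30


def check_tensor_parallel_degree(instance_type: str, degree: int) -> None:
    """Raise ValueError if `degree` is not between 1 and the instance's GPU count."""
    gpus = GPUS_PER_INSTANCE[instance_type]
    if degree < 1 or degree > gpus:
        raise ValueError(
            f"tensor_parallel_degree={degree} is invalid for {instance_type} ({gpus} GPU(s))"
        )


print(f"Approximate weight memory: {weight_memory_gib(7e9):.1f} GiB")
check_tensor_parallel_degree("ml.g5.2xlarge", 1)  # passes silently
```

The weights come to roughly 13 GiB, which leaves limited headroom on a 16 GB GPU, so a larger single-GPU instance may be more comfortable for long generations.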

In [125]:
%%writefile ./code_baichuan7b_accelerate/serving.properties
engine=Python
option.tensor_parallel_degree=1
option.s3url = {{s3url}}

Writing ./code_baichuan7b_accelerate/serving.properties


In [126]:
%%writefile ./code_baichuan7b_accelerate/requirements.txt
einops
torch==2.0.1

Writing ./code_baichuan7b_accelerate/requirements.txt


In [127]:
## SKIP

# Plug the appropriate model location into our `serving.properties` file,
# based on the region in which this notebook is running
properties_path = Path("code_baichuan7b_accelerate/serving.properties")
template = jinja_env.from_string(properties_path.read_text())
properties_path.write_text(template.render(s3url=pretrained_model_location))
!pygmentize code_baichuan7b_accelerate/serving.properties | cat -n

     1	engine=Python
     2	option.tensor_parallel_degree=1
     3	option.s3url = s3://sagemaker-cn-north-1-394224607677/baichuan-7B/model_baichuan7b/
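Before packaging, it can help to confirm that the template was fully rendered. The following is a minimal sketch, not part of the original notebook: `parse_properties` and `unresolved_placeholders` are hypothetical helpers, and the bucket name in the sample text is a placeholder.

```python
def parse_properties(text: str) -> dict:
    """Parse simple `key=value` lines, skipping blank lines and `#` comments."""
    props = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def unresolved_placeholders(props: dict) -> list:
    """Return the keys whose values still contain Jinja-style `{{ }}` markers."""
    return [key for key, value in props.items() if "{{" in value or "}}" in value]


# Sample text shaped like the rendered file above; the bucket name is made up.
rendered = """engine=Python
option.tensor_parallel_degree=1
option.s3url = s3://example-bucket/example-prefix/
"""
props = parse_properties(rendered)
assert unresolved_placeholders(props) == []
print(props)
```

In the notebook, the text could instead be read from `code_baichuan7b_accelerate/serving.properties`.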


### 2. Create a model.py with custom inference code

SageMaker allows you to bring your own script for inference. Here we create our **model.py** file with the appropriate code for the Falcon 7B model.

In [128]:
%%writefile ./code_baichuan7b_accelerate/model.py
from djl_python import Input, Output
import os
import torch
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
from typing import Any, Dict, Tuple
import warnings
import logging

predictor = None


def get_model(properties):
        
    tensor_parallel_degree = properties["tensor_parallel_degree"]

    model_location = properties["model_dir"]
    if "model_id" in properties:
        model_location = properties["model_id"]
    logging.info(f"Loading model in {model_location}")

    
    model = AutoModelForCausalLM.from_pretrained(
        model_location,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_location, trust_remote_code=True)
    generator = pipeline(
        task="text-generation", model=model, tokenizer=tokenizer, device_map="auto"
    )
    return generator


def handle(inputs: Input) -> None:
    global predictor
    if not predictor:
        predictor = get_model(inputs.get_properties())
    if inputs.is_empty():
        # Model server makes an empty call to warmup the model on startup
        return None
    data = inputs.get_as_json()
    text = data["text"]
    text_length = data["text_length"]
    outputs = predictor(text, do_sample=True, min_length=text_length, max_length=text_length)
    result = {"outputs": outputs}
    return Output().add_as_json(result)

Writing ./code_baichuan7b_accelerate/model.py
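The handler above expects a JSON body with a `text` prompt and an integer `text_length`, which it passes as both `min_length` and `max_length`. The sketch below shows one way that contract could be validated before calling the pipeline. `build_generation_kwargs` is a hypothetical helper and is not part of `model.py`.

```python
from typing import Any, Dict, Tuple


def build_generation_kwargs(data: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Validate a request payload and return (prompt, generation keyword arguments)."""
    for field in ("text", "text_length"):
        if field not in data:
            raise KeyError(f"missing required field: {field}")
    text = data["text"]
    text_length = data["text_length"]
    if not isinstance(text, str) or not text:
        raise ValueError("'text' must be a non-empty string")
    # bool is a subclass of int, so it is rejected explicitly.
    if isinstance(text_length, bool) or not isinstance(text_length, int) or text_length < 1:
        raise ValueError("'text_length' must be a positive integer")
    # Same generation settings as the handler: a fixed output length with sampling.
    kwargs = {"do_sample": True, "min_length": text_length, "max_length": text_length}
    return text, kwargs


prompt, generation_kwargs = build_generation_kwargs({"text": "Hello", "text_length": 32})
print(prompt, generation_kwargs)
```

Failing early with a clear message makes malformed requests easier to diagnose than an exception raised from inside the generation call.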


### 3. Create the Tarball and then upload to S3 location
Next, we package our artifacts as a `*.tar.gz` file and upload it to S3 for SageMaker to use during deployment.

In [129]:
!rm -f model.tar.gz
!rm -rf code_baichuan7b_accelerate/.ipynb_checkpoints
!tar czvf model.tar.gz -C code_baichuan7b_accelerate .

s3_code_artifact_accelerate = sess.upload_data("model.tar.gz", bucket, s3_code_prefix_accelerate)
print(f"S3 Code or Model tar for accelerate uploaded to --- > {s3_code_artifact_accelerate}")

./
./requirements.txt
./model.py
./serving.properties
S3 Code or Model tar for accelerate uploaded to --- > s3://sagemaker-cn-north-1-394224607677/baichuan-7B/code_baichuan7b/model.tar.gz
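The same archive could also be built from Python with the standard `tarfile` module, which makes it easy to check the member list before uploading. This is a sketch under assumptions: `make_code_tarball` is a hypothetical helper, and the demo writes placeholder files into a temporary directory rather than using the real code folder.

```python
import tarfile
import tempfile
from pathlib import Path


def make_code_tarball(src_dir: Path, out_path: Path) -> list:
    """Archive the top-level entries of src_dir at the tarball root and return their names."""
    with tarfile.open(out_path, "w:gz") as tar:
        for path in sorted(src_dir.iterdir()):
            if path.name == ".ipynb_checkpoints":
                continue  # skip notebook checkpoint folders, as the shell cell does
            tar.add(path, arcname=path.name)
    with tarfile.open(out_path, "r:gz") as tar:
        return tar.getnames()


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "code"
    src.mkdir()
    for name in ("serving.properties", "requirements.txt", "model.py"):
        (src / name).write_text("# placeholder\n")
    (src / ".ipynb_checkpoints").mkdir()
    names = make_code_tarball(src, Path(tmp) / "model.tar.gz")
    print(names)
```

Unlike `tar -C dir .`, this stores members without the leading `./`, but the files still sit at the root of the archive.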


### 4. Define a serving container, SageMaker Model and SageMaker endpoint
Now that we have uploaded the model artifacts to S3, we can create a SageMaker endpoint.


#### Define the serving container
Here we define the container to use for model inference. We will be using SageMaker's Large Model Inference (LMI) container with Accelerate.

In [4]:
# inference_image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/djl-ds:latest"
inference_image_uri = (
    f"727897471807.dkr.ecr.{region}.amazonaws.com.cn/djl-inference:0.22.1-deepspeed0.8.3-cu118"
)
print(f"Image going to be used is ---- > {inference_image_uri}")


Image going to be used is ---- > 727897471807.dkr.ecr.cn-north-1.amazonaws.com.cn/djl-inference:0.22.1-deepspeed0.8.3-cu118
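The image URI above hardcodes the `amazonaws.com.cn` suffix used by the China regions. A small sketch of how the domain suffix could be derived from the region is shown below. Both helpers are hypothetical, and the account ID that hosts the image differs between regions, so it is passed in rather than looked up.

```python
def ecr_domain_suffix(region: str) -> str:
    """China regions (cn-*) use a separate DNS suffix from other commercial regions."""
    return "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"


def ecr_image_uri(account_id: str, region: str, repository: str, tag: str) -> str:
    """Assemble an ECR image URI from its parts."""
    return f"{account_id}.dkr.ecr.{region}.{ecr_domain_suffix(region)}/{repository}:{tag}"


# Reproduces the URI printed above for cn-north-1.
uri = ecr_image_uri(
    "727897471807", "cn-north-1", "djl-inference", "0.22.1-deepspeed0.8.3-cu118"
)
print(uri)
```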


#### Create SageMaker model, endpoint configuration and endpoint.


In [5]:
model_name_acc = name_from_base("baichuan7b-model-acc")
print(model_name_acc)

baichuan7b-model-acc-2023-06-19-01-14-24-365


In [7]:
s3_code_artifact_accelerate="s3://sagemaker-cn-north-1-394224607677/baichuan-7B/code_baichuan7b/model.tar.gz"
create_model_response = sm_client.create_model(
    ModelName=model_name_acc,
    ExecutionRoleArn=role,
    PrimaryContainer={"Image": inference_image_uri, "ModelDataUrl": s3_code_artifact_accelerate},
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

Created Model: arn:aws-cn:sagemaker:cn-north-1:394224607677:model/baichuan7b-model-acc-2023-06-19-01-14-24-365


In [8]:
model_name = model_name_acc
print(f"Building EndpointConfig and Endpoint for: {model_name}")

Building EndpointConfig and Endpoint for: baichuan7b-model-acc-2023-06-19-01-14-24-365


In [9]:
endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.g4dn.2xlarge",
            "InitialInstanceCount": 1,
            #"ModelDataDownloadTimeoutInSeconds": 600,
            #"ContainerStartupHealthCheckTimeoutInSeconds": 600,
            # "VolumeSizeInGB": 512
        },
    ],
)
endpoint_config_response

{'EndpointConfigArn': 'arn:aws-cn:sagemaker:cn-north-1:394224607677:endpoint-config/baichuan7b-model-acc-2023-06-19-01-14-24-365-config',
 'ResponseMetadata': {'RequestId': '9be2c988-bc10-4142-a55b-9d179240b4b0',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '9be2c988-bc10-4142-a55b-9d179240b4b0',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '136',
   'date': 'Mon, 19 Jun 2023 01:16:55 GMT'},
  'RetryAttempts': 0}}

In [10]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

Created Endpoint: arn:aws-cn:sagemaker:cn-north-1:394224607677:endpoint/baichuan7b-model-acc-2023-06-19-01-14-24-365-endpoint


In [11]:
import time

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

Status: InService
Arn: arn:aws-cn:sagemaker:cn-north-1:394224607677:endpoint/baichuan7b-model-acc-2023-06-19-01-14-24-365-endpoint
Status: InService
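The polling loop above exits as soon as the status is no longer `Creating`, whether the endpoint came up or failed. The sketch below is one possible variant that adds a timeout and raises on a non-`InService` final status. `wait_for_endpoint` is a hypothetical helper; the describe call is injected so the example runs without AWS access, and the fake responses are made up for the demo.

```python
import time
from typing import Callable


def wait_for_endpoint(
    describe: Callable[[], dict],
    poll_seconds: float = 60,
    timeout_seconds: float = 3600,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll `describe` until the endpoint leaves 'Creating'; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        resp = describe()
        status = resp["EndpointStatus"]
        if status != "Creating":
            if status != "InService":
                reason = resp.get("FailureReason", "no reason given")
                raise RuntimeError(f"Endpoint ended in status {status}: {reason}")
            return resp
        if time.monotonic() >= deadline:
            raise TimeoutError("Timed out waiting for the endpoint to leave 'Creating'")
        sleep(poll_seconds)


# Demo with made-up responses and a no-op sleep.
_fake_responses = iter([{"EndpointStatus": "Creating"}, {"EndpointStatus": "InService"}])
result = wait_for_endpoint(lambda: next(_fake_responses), sleep=lambda seconds: None)
print(result["EndpointStatus"])
```

In the notebook, `describe` would be `lambda: sm_client.describe_endpoint(EndpointName=endpoint_name)`.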


### Run Inference

In [12]:
%%time

response_model = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    # Prompt (Chinese): "Where is Beijing?"
    Body=json.dumps({"text": "北京在哪里?", "text_length": 150}),
    ContentType="application/json",
)

response_model["Body"].read().decode("utf8")

CPU times: user 14.3 ms, sys: 687 µs, total: 15 ms
Wall time: 12.2 s


'{\n  "outputs":[\n    {\n      "generated_text":"北京在哪里?\\n北京在哪里 您问的这个问题也太笼统了吧 北京具体指北京市还是中国 北京啊,祖国的首都 中国首都,首都,就北京, 北京人不常说北京,而是说北京人。 北京在哪里?北京在北京市啊。 说你在哪儿,就说你家在哪儿,就说你家在北京市不就行了。 北京在北京市啊 北京在哪里 北京在那啊 北京就是城市,他不是在北不是在南,在哪儿都是城市,北京是在中国。 如果你说北京是市,那么北京是北京市,如果你说北京是首都那么北京是中国首都。 北京在中国和北京市 中国的首都就叫北京,北京是首都,北京"\n    }\n  ]\n}'
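The response body is a JSON string whose `outputs` list holds objects with a `generated_text` field, and the generated text begins with the prompt. A minimal sketch of extracting just the completion follows. `extract_completions` is a hypothetical helper, and the sample body is invented for illustration, not real model output.

```python
import json


def extract_completions(body: str, prompt: str) -> list:
    """Return the generated texts from a response body, with the echoed prompt removed."""
    payload = json.loads(body)
    completions = []
    for item in payload["outputs"]:
        text = item["generated_text"]
        if text.startswith(prompt):
            text = text[len(prompt):]
        completions.append(text.strip())
    return completions


# Invented sample shaped like the endpoint's response.
sample_body = json.dumps(
    {"outputs": [{"generated_text": "Where is Beijing?\nBeijing is in northern China."}]}
)
completions = extract_completions(sample_body, "Where is Beijing?")
print(completions)
```

In the notebook, `body` would be `response_model["Body"].read().decode("utf8")`.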

In [14]:
%%time


response_model = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    # Prompt (Chinese): "Where is the city of Beijing?"
    Body=json.dumps({"text": "北京市在哪里?", "text_length": 150}),
    ContentType="application/json",
)

response_model["Body"].read().decode("utf8")

CPU times: user 4.13 ms, sys: 19 µs, total: 4.15 ms
Wall time: 8.96 s


'{\n  "outputs":[\n    {\n      "generated_text":"北京市在哪里?\\n北京市在中华人民共和国的北方 北京市在中华人民共和国的地图上位于中央 北京市在中华人民共和国的地图上位于华北平原和内蒙古高原的交界处 北京市在中华人民共和国的地图上位于渤海和黄海的中间 北京市在11609平方公里 北京市在中华人民共和国的北方 北京市在中华人民共和国的地图上位于中央 北京市在中华人民共和国的地图上位于华北平原和内蒙古高原的交界处 北京市在中华人民共和国的地图上位于渤海和黄海的中间 北京市在11609平方公里 北京市 在地图的中心位置. 北京市在中华人民共和国的地图上位于中央 北京市在中华人民共和国的地图上位于华北平"\n    }\n  ]\n}'

In [139]:
start_time = time.time()

while (time.time() - start_time) < 300:  # 300 seconds = 5 minutes
    response_model = smr_client.invoke_endpoint(
        EndpointName=endpoint_name,
        # Prompt (Chinese): "In which dynasty was the Forbidden City built?"
        Body=json.dumps({"text": "故宫是哪个朝代创建的?", "text_length": 150}),
        ContentType="application/json",
    )

    print("Loop restarting - answer: " + response_model["Body"].read().decode("utf8"))

Loop restarting - answer: {
  "outputs":[
    {
      "generated_text":"故宫是哪个朝代创建的?\n明、清两代都是故宫。始建于明朝永乐年间,后经明、清两朝的建设,达到今日的规模、建筑群始建于1406年。全部建造用时14年的时间。\n故宫是“紫禁城”内金水河的源头,以太和殿、中和殿、保和殿为中心,东西两翼有东便门和西便门,宫城南北各有长安门(即玄武门)和神武门,并以两组宫院式布局和中轴线对称的建筑形式将72万平方米的面积分成"
    }
  ]
}
Loop restarting - answer: {
  "outputs":[
    {
      "generated_text":"故宫是哪个朝代创建的?\n\n 展开全部 故宫最早为西周所建。\n故宫,旧称紫禁城,位于北京中轴线的中心,是明、清两代的皇宫,无与伦比的古代建筑杰作,世界现存最大、最完整的木质结构的古建筑群。\n故宫旧称紫禁城,位于北京中轴线的中心,是明、清两代(公元1368~1911年)的皇宫,无与伦比的古代建筑杰作,世界现存最大、最完整的木质结构的古建筑群。故宫始建于公元1"
    }
  ]
}
Loop restarting - answer: {
  "outputs":[
    {
      "generated_text":"故宫是哪个朝代创建的?\n故宫是明、清两代的皇宫,是明、清两代24位皇帝居住的皇宫。\n1406年明成祖朱棣始建,1420年建成。\n故宫整体平面为长方形,南北长961米,东西宽753米,四面围有高10米的城墙,城外有宽52米的护城河。紫禁城有4个门,正门名午门,东西门名东华门与西华门,四门上各有三座门楼,合称“午门三阙”"
    }
  ]
}
Loop restarting - answer: {
  "outputs":[
    {
      "generated_text":"故宫是哪个朝代创建的?\n北京、山地。2013年4月、沈阳。2004年12月、天津。2019年7月19日、南京,是世界现存最大、保定。2008年1月。2001年7月13日。1987年12月,北京市故宫博物馆被联合国教科文组织列入“世界文化遗产”, 与同时代的埃及金字塔比肩、重庆,位于北京故宫宁寿宫区

KeyboardInterrupt: 
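The loop above sends requests repeatedly for up to five minutes but does not record how long each one takes. A small sketch of collecting per-request latencies is shown below. `measure_latencies` and `summarize` are hypothetical helpers, and the demo uses a stub in place of the real endpoint call.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latencies(invoke: Callable[[], object], num_requests: int) -> List[float]:
    """Call `invoke` num_requests times and return each call's wall time in seconds."""
    latencies = []
    for _ in range(num_requests):
        start = time.perf_counter()
        invoke()
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies: List[float]) -> Dict[str, float]:
    """Summarize a non-empty list of latencies."""
    return {
        "count": len(latencies),
        "mean_s": statistics.mean(latencies),
        "median_s": statistics.median(latencies),
        "max_s": max(latencies),
    }


# Stub invocation so the sketch runs without an endpoint.
demo_latencies = measure_latencies(lambda: None, num_requests=5)
print(summarize(demo_latencies))
```

In the notebook, `invoke` would wrap the `smr_client.invoke_endpoint(...)` call and read the response body.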

### Clean Up

In [None]:
# - Delete the endpoint
sm_client.delete_endpoint(EndpointName=endpoint_name)

In [None]:
# - In case the endpoint failed, we still want to delete the endpoint configuration and the model
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name)
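If one delete call raises (for example, because the endpoint never finished creating), the remaining resources are left behind. The sketch below shows one way to attempt every cleanup step and report failures at the end. `delete_all` is a hypothetical helper, and the demo uses stub callables instead of real SageMaker calls.

```python
from typing import Callable, List, Tuple


def delete_all(steps: List[Tuple[str, Callable[[], object]]]) -> List[Tuple[str, str]]:
    """Run every (label, action) step, collecting failures instead of stopping at the first."""
    failures = []
    for label, action in steps:
        try:
            action()
        except Exception as exc:  # broad on purpose: cleanup should keep going
            failures.append((label, str(exc)))
    return failures


def _failing_step() -> None:
    # Stub that simulates a delete call that fails.
    raise RuntimeError("endpoint not found")


failures = delete_all(
    [
        ("endpoint", _failing_step),
        ("endpoint config", lambda: None),
        ("model", lambda: None),
    ]
)
print(failures)
```

In the notebook, the steps would be lambdas wrapping `sm_client.delete_endpoint`, `sm_client.delete_endpoint_config`, and `sm_client.delete_model` with the same arguments as above.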

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.


![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/inference%7Cgenerativeai%7Cllm-workshop%7Clab10-falcon-40b-and-7b%7Cfalcon-7b-accelerate.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/inference%7Cgenerativeai%7Cllm-workshop%7Clab10-falcon-40b-and-7b%7Cfalcon-7b-accelerate.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/inference%7Cgenerativeai%7Cllm-workshop%7Clab10-falcon-40b-and-7b%7Cfalcon-7b-accelerate.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/inference%7Cgenerativeai%7Cllm-workshop%7Clab10-falcon-40b-and-7b%7Cfalcon-7b-accelerate.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/inference%7Cgenerativeai%7Cllm-workshop%7Clab10-falcon-40b-and-7b%7Cfalcon-7b-accelerate.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/inference%7Cgenerativeai%7Cllm-workshop%7Clab10-falcon-40b-and-7b%7Cfalcon-7b-accelerate.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/inference%7Cgenerativeai%7Cllm-workshop%7Clab10-falcon-40b-and-7b%7Cfalcon-7b-accelerate.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/inference%7Cgenerativeai%7Cllm-workshop%7Clab10-falcon-40b-and-7b%7Cfalcon-7b-accelerate.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/inference%7Cgenerativeai%7Cllm-workshop%7Clab10-falcon-40b-and-7b%7Cfalcon-7b-accelerate.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/inference%7Cgenerativeai%7Cllm-workshop%7Clab10-falcon-40b-and-7b%7Cfalcon-7b-accelerate.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/inference%7Cgenerativeai%7Cllm-workshop%7Clab10-falcon-40b-and-7b%7Cfalcon-7b-accelerate.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/inference%7Cgenerativeai%7Cllm-workshop%7Clab10-falcon-40b-and-7b%7Cfalcon-7b-accelerate.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/inference%7Cgenerativeai%7Cllm-workshop%7Clab10-falcon-40b-and-7b%7Cfalcon-7b-accelerate.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/inference%7Cgenerativeai%7Cllm-workshop%7Clab10-falcon-40b-and-7b%7Cfalcon-7b-accelerate.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/inference%7Cgenerativeai%7Cllm-workshop%7Clab10-falcon-40b-and-7b%7Cfalcon-7b-accelerate.ipynb)
