# Create Serverless endpoint using AWS Chalice
---

**[주의] 이 핸즈온 코드는 워크샵 참석자가 아닌 워크샵 진행자(호스트)가 실행하는 코드입니다!**

### AWS Chalice란?


AWS Chalice는 AWS의 오픈 소스 서버리스 프레임워크로 빠르고 쉽게 서버리스 어플리케이션을 구축할 수 있습니다. Flask 스타일의 마이크로 웹 프레임워크를 기반으로 하고 있으며, 자동으로 AWS Lambda 함수를 생성하고 API Gateway 엔드포인트를 구성해 줍니다. 또한 Amazon DynamoDB, Amazon S3, SQS, SNS 등과 같은 서비스의 통합도 지원합니다.

Chalice는 간단한 웹 애플리케이션 및 마이크로 서비스와 같은 작은 규모의 빠른 프로토타이핑 및 서버리스 애플리케이션 개발에 유용하며, 데이터 과학자가 Lambda 및 API Gateway와 같은 AWS 서비스에 대한 지식이 없더라도 쉽게 사용할 수 있습니다. 또한 Chalice는 일부 내장된 보안 기능, 로깅 및 오류 처리 기능을 제공하므로 개발자는 이러한 작업을 직접 처리할 필요가 없습니다.

참조
- https://aws.github.io/chalice/
- https://github.com/daekeun-ml/aws-chalice-examples

In [2]:
!pip install chalice
# #!sudo yum -y install tree

Collecting chalice
  Obtaining dependency information for chalice from https://files.pythonhosted.org/packages/54/d9/fc8d0744740dd1db2490049ac3035002ec52cde7385d8d14416e829a3bc2/chalice-1.29.0-py3-none-any.whl.metadata
  Using cached chalice-1.29.0-py3-none-any.whl.metadata (9.0 kB)
Collecting pip<23.2,>=9 (from chalice)
  Using cached pip-23.1.2-py3-none-any.whl (2.1 MB)
Collecting inquirer<3.0.0,>=2.7.0 (from chalice)
  Using cached inquirer-2.10.1-py3-none-any.whl (17 kB)
Collecting blessed>=1.19.0 (from inquirer<3.0.0,>=2.7.0->chalice)
  Using cached blessed-1.20.0-py2.py3-none-any.whl (58 kB)
Collecting python-editor>=1.0.4 (from inquirer<3.0.0,>=2.7.0->chalice)
  Using cached python_editor-1.0.4-py3-none-any.whl (4.9 kB)
Collecting readchar>=3.0.6 (from inquirer<3.0.0,>=2.7.0->chalice)
  Using cached readchar-4.0.5-py3-none-any.whl (8.5 kB)
Using cached chalice-1.29.0-py3-none-any.whl (264 kB)
Installing collected packages: python-editor, readchar, pip, blessed, inquirer, chalice

<br>

## 1. Create a project
---

In [27]:
PROJECT = "genai-rag-workshop"
!rm -rf $PROJECT
!chalice new-project $PROJECT

Your project has been generated in ./genai-rag-workshop


In [28]:
cat $PROJECT/.chalice/config.json

{
  "version": "2.0",
  "app_name": "genai-rag-workshop",
  "stages": {
    "dev": {
      "api_gateway_stage": "api"
    }
  }
}


### SageMaker Endpoint name

In [29]:
# us-east-1
#endpoint_emb_kosimcse = 'KoSimCSE-roberta-2023-08-03-22-52-21'
#endpoint_emb_gptj_6b = 'jumpstart-dft-hf-textembedding-gpt-j-6b-fp16'

# us-west-2
#endpoint_emb_kosimcse = 'KoSimCSE-roberta-2023-08-11-07-45-03' ##
#endpoint_emb_gptj_6b = 'jumpstart-dft-hf-textembedding-gpt-j-6b-fp16-1' ##

In [33]:
# us-east-1
# endpoint_llm_llama2_7b = 'jumpstart-dft-meta-textgeneration-llama-2-7b-1'
# endpoint_llm_llama2_13b = 'jumpstart-dft-meta-textgeneration-llama-2-13b'
endpoint_llm_kkulm_12_8b = 'kullm-polyglot-12-8b-v2-1694183328'
# endpoint_llm_falcon_40b = 'jumpstart-dft-hf-llm-falcon-40b-instruct-bf16'

# us-west-2
#endpoint_llm_llama2_7b = 'jumpstart-dft-meta-textgeneration-llama-2-7b-1-1' ##
#endpoint_llm_llama2_13b = 'jumpstart-dft-meta-textgeneration-llama-2-13b-1' ##
#endpoint_llm_kkulm_12_8b = 'kullm-polyglot-12-8b-v2-1694183328' ##
#endpoint_llm_falcon_40b = 'jumpstart-dft-hf-llm-falcon-40b-instruct-bf16-1' ##

In [34]:
cat $PROJECT/.chalice/config.json


{
    "Version": "2.0",
    "app_name": "{{app_name}}",
    "autogen_policy": false,
    "automatic_layer": true,
    "environment_variables": {
        "ENDPOINT_EMB_KOSIMCSE": "{{endpoint_emb_kosimcse}}",        
        "ENDPOINT_EMB_GPTJ_6B": "{{endpoint_emb_gptj_6b}}",        
        "ENDPOINT_LLM_LLAMA2_7B": "{{endpoint_llm_llama2_7b}}",
        "ENDPOINT_LLM_LLAMA2_13B": "{{endpoint_llm_llama2_13b}}",     
        "ENDPOINT_LLM_KKULM_12_8B": "{{endpoint_llm_kkulm_12_8b}}",
        "ENDPOINT_LLM_FALCON_40B": "{{endpoint_llm_falcon_40b}}"  
    },
    "stages": {
        "dev": {
            "api_gateway_stage": "api"
        }    
    }

}


### Setup config.json
Chalice는 IAM 정책 자동 생성 기능이 있지만, 필요한 정책을 가진 IAM 정책을 생성할수 있습니다. 기본적으로는 직접 IAM 정책을 생성하는 것이 안전합니다. <br>
자세한 내용은 https://chalice-fei.readthedocs.io/en/latest/topics/configfile.html 를 참조하기 바랍니다.

`autogen_policy`: 
- 애플리케이션 소스 코드 분석을 기반으로 chalice가 IAM 정책을 자동으로 생성할지 여부를 설정 (디폴트 = True)
- False인 경우, `.chalice/policy-<단계 이름>.json`에서 IAM 정책을 로드
- `iam_policy_file` 지정으로 불러올 policy 파일명을 변경할 수도 있음

In [35]:
%%writefile $PROJECT/.chalice/config.json

{
    "Version": "2.0",
    "app_name": "{{app_name}}",
    "autogen_policy": false,
    "automatic_layer": true,
    "environment_variables": {
        "ENDPOINT_EMB_KOSIMCSE": "{{endpoint_emb_kosimcse}}",        
        "ENDPOINT_EMB_GPTJ_6B": "{{endpoint_emb_gptj_6b}}",        
        "ENDPOINT_LLM_LLAMA2_7B": "{{endpoint_llm_llama2_7b}}",
        "ENDPOINT_LLM_LLAMA2_13B": "{{endpoint_llm_llama2_13b}}",     
        "ENDPOINT_LLM_KKULM_12_8B": "{{endpoint_llm_kkulm_12_8b}}",
        "ENDPOINT_LLM_FALCON_40B": "{{endpoint_llm_falcon_40b}}"  
    },
    "stages": {
        "dev": {
            "api_gateway_stage": "api"
        }    
    }

}

Overwriting genai-rag-workshop/.chalice/config.json


In [36]:
import jinja2
from pathlib import Path
jinja_env = jinja2.Environment()  # jinja environment to generate model configuration templates
# we plug in the appropriate model location into our `serving.properties` file based on the region in which this notebook is running
template = jinja_env.from_string(Path(f"{PROJECT}/.chalice/config.json").open().read())
Path(f"{PROJECT}/.chalice/config.json").open("w").write(
    template.render(
        endpoint_emb_kosimcse=endpoint_emb_kosimcse,
        endpoint_emb_gptj_6b=endpoint_emb_gptj_6b,
        endpoint_llm_llama2_7b=endpoint_llm_llama2_7b,
#        endpoint_llm_llama2_13b=endpoint_llm_llama2_13b,
        endpoint_llm_kkulm_12_8b=endpoint_llm_kkulm_12_8b,
        endpoint_llm_falcon_40b=endpoint_llm_falcon_40b,        
        app_name=PROJECT
    )
)
!pygmentize {PROJECT}/.chalice/config.json | cat -n

     1	{[37m[39;49;00m
     2	[37m    [39;49;00m[94m"Version"[39;49;00m:[37m [39;49;00m[33m"2.0"[39;49;00m,[37m[39;49;00m
     3	[37m    [39;49;00m[94m"app_name"[39;49;00m:[37m [39;49;00m[33m"genai-rag-workshop"[39;49;00m,[37m[39;49;00m
     4	[37m    [39;49;00m[94m"autogen_policy"[39;49;00m:[37m [39;49;00m[34mfalse[39;49;00m,[37m[39;49;00m
     5	[37m    [39;49;00m[94m"automatic_layer"[39;49;00m:[37m [39;49;00m[34mtrue[39;49;00m,[37m[39;49;00m
     6	[37m    [39;49;00m[94m"environment_variables"[39;49;00m:[37m [39;49;00m{[37m[39;49;00m
     7	[37m        [39;49;00m[94m"ENDPOINT_EMB_KOSIMCSE"[39;49;00m:[37m [39;49;00m[33m"KoSimCSE-roberta-2023-08-11-07-45-03"[39;49;00m,[37m        [39;49;00m
     8	[37m        [39;49;00m[94m"ENDPOINT_EMB_GPTJ_6B"[39;49;00m:[37m [39;49;00m[33m"jumpstart-dft-hf-textembedding-gpt-j-6b-fp16-1"[39;49;00m,[37m        [39;49;00m
     9	[37m        [39;49;00m[94m"ENDPOINT_LLM_LLAMA2_7B"[

#### Setup IAM policy

In [37]:
%%writefile $PROJECT/.chalice/policy-dev.json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:PutLogEvents",
                "logs:CreateLogGroup"
            ],
            "Resource": "arn:aws:logs:*:*:*"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": "sagemaker:InvokeEndpoint",
            "Resource": "*"
        }
    ]
}

Writing genai-rag-workshop/.chalice/policy-dev.json


### Develop `app.py`

app.py는 서버리스 마이크로프레임워크를 구성하는 핵심 스크립트입니다. 파이썬 데코레이터로(decorator)만으로 AWS의 핵심 서비스들을 쉽고 빠르게 설정할 수 있습니다.

In [38]:
%%writefile $PROJECT/app.py 
import os
import io
import json
import boto3
import base64
import logging
import numpy as np

from chalice import Chalice
from chalice import BadRequestError

app = Chalice(app_name="{{app_name}}")
app.debug = True

smr_client = boto3.client("runtime.sagemaker")
logger = logging.getLogger("{{app_name}}")
logger.setLevel(logging.DEBUG)

@app.route("/")
def index():
    return {'hello': 'world'}


@app.route("/emb/{variant_name}", methods=["POST"], content_types=["application/json"])
def invoke_emb(variant_name):

    models = ['gptj_6b', 'kosimcse']
    if variant_name not in models:
        raise BadRequestError("[ERROR] Invalid model!")
    
    logger.info(f"embedding model: {variant_name}")

    if variant_name == "gptj_6b":
        endpoint_name = os.environ["ENDPOINT_EMB_GPTJ_6B"]
    elif variant_name == "kosimcse":
        endpoint_name = os.environ["ENDPOINT_EMB_KOSIMCSE"]        

    payload = app.current_request.json_body

    try:
        response = smr_client.invoke_endpoint(
            EndpointName=endpoint_name, 
            ContentType='application/json',                        
            Body=json.dumps(payload).encode("utf-8")
        ) 
        res = response['Body'].read()
        return json.loads(res.decode("utf-8"))

    except Exception as e:
        print(e)
        print(payload)
        
        
@app.route("/llm/{variant_name}", methods=["POST"], content_types=["application/json"])
def invoke_llm(variant_name):
    
    models = ['llama2_7b', 'llama2_13b', 'kkulm_12_8b', 'falcon_40b']
    if variant_name not in models:
        raise BadRequestError("[ERROR] Invalid model!")
        
    logger.info(f"txt2txt model: {variant_name}")

    if variant_name == "llama2_7b":
        endpoint_name = os.environ["ENDPOINT_LLM_LLAMA2_7B"]
    elif variant_name == "llama2_13b":
        endpoint_name = os.environ["ENDPOINT_LLM_LLAMA2_13B"]
    elif variant_name == "kkulm_12_8b":
        endpoint_name = os.environ["ENDPOINT_LLM_KKULM_12_8B"]
    elif variant_name == "kkulm_12_8b":
        endpoint_name = os.environ["ENDPOINT_LLM_KKULM_12_8B"]
    elif variant_name == "falcon_40b":
        endpoint_name = os.environ["ENDPOINT_LLM_FALCON_40B"]        

    payload = app.current_request.json_body

    try:
        if "llama2" in variant_name:
            response = smr_client.invoke_endpoint(
                EndpointName=endpoint_name, 
                ContentType='application/json',                        
                Body=json.dumps(payload).encode("utf-8"),
                CustomAttributes="accept_eula=true",
            )
        else:
             response = smr_client.invoke_endpoint(
                EndpointName=endpoint_name, 
                ContentType='application/json',                        
                Body=json.dumps(payload).encode("utf-8")
            )           
        res = response['Body'].read()
        return json.loads(res.decode("utf-8"))
        
    except Exception as e:
        print(e)
        print(payload)

Overwriting genai-rag-workshop/app.py


In [39]:
jinja_env = jinja2.Environment()  # jinja environment to generate model configuration templates
# we plug in the appropriate model location into our `serving.properties` file based on the region in which this notebook is running
template = jinja_env.from_string(Path(f"{PROJECT}/app.py").open().read())
Path(f"{PROJECT}/app.py").open("w").write(
    template.render(
        app_name=PROJECT,
    )
)
!pygmentize {PROJECT}/app.py | cat -n

     1	[34mimport[39;49;00m [04m[36mos[39;49;00m[37m[39;49;00m
     2	[34mimport[39;49;00m [04m[36mio[39;49;00m[37m[39;49;00m
     3	[34mimport[39;49;00m [04m[36mjson[39;49;00m[37m[39;49;00m
     4	[34mimport[39;49;00m [04m[36mboto3[39;49;00m[37m[39;49;00m
     5	[34mimport[39;49;00m [04m[36mbase64[39;49;00m[37m[39;49;00m
     6	[34mimport[39;49;00m [04m[36mlogging[39;49;00m[37m[39;49;00m
     7	[34mimport[39;49;00m [04m[36mnumpy[39;49;00m [34mas[39;49;00m [04m[36mnp[39;49;00m[37m[39;49;00m
     8	[37m[39;49;00m
     9	[34mfrom[39;49;00m [04m[36mchalice[39;49;00m [34mimport[39;49;00m Chalice[37m[39;49;00m
    10	[34mfrom[39;49;00m [04m[36mchalice[39;49;00m [34mimport[39;49;00m BadRequestError[37m[39;49;00m
    11	[37m[39;49;00m
    12	app = Chalice(app_name=[33m"[39;49;00m[33mgenai-rag-workshop[39;49;00m[33m"[39;49;00m)[37m[39;49;00m
    13	app.debug = [34mTrue[39;49;00m[37m[39;49;00m
    14	[37m[

### requirements.txt

In [40]:
%%writefile $PROJECT/requirements.txt
numpy

Overwriting genai-rag-workshop/requirements.txt


<br>

## 2. Deploying
---
### Local Test
로컬 환경에서 편리하게 테스트를 수행할 수 있습니다. 아래 코드는 SageMaker Studio에서는 동작하지 않습니다!


In [41]:
# !cd $PROJECT && chalice local --port=8200

```
curl -X POST localhost:8200/llm/kkulm_12_8b -H "Content-Type: application/json" -d "{ \"inputs\": \"피자 만드는 법을 알려줘\", \"max_length\":50, \"parameters\": {\"max_new_tokens\": 64, \"top_p\": 0.9} }"
```

```
curl -X POST localhost:8200/llm/llama2_13b -H "Content-Type: application/json" -d "{ \"inputs\": \"Tell me the steps to make a pizza\", \"max_length\":50, \"parameters\": {\"max_new_tokens\": 64, \"top_p\": 0.9} }"
```

```
curl -X POST localhost:8200/emb/gptj_6b -H "Content-Type: application/json" -d "{ \"text_inputs\": \"Tell me the steps to make a pizza\" }"
```

```
curl -X POST localhost:8200/emb/kosimcse -H "Content-Type: application/json" -d "{ \"inputs\": \"Tell me the steps to make a pizza\" }"
```

### Deploy

`chalice deploy`를 실행하면 자동으로 IAM Role, Lambda, API Gateway를 생성해 줍니다. 

In [42]:
!cd $PROJECT && chalice deploy

Creating shared layer deployment package.
Creating app deployment package.
Creating lambda layer: genai-rag-workshop-dev-managed-layer
Updating policy for IAM role: genai-rag-workshop-dev-api_handler
Updating lambda function: genai-rag-workshop-dev
Creating Rest API
Resources deployed:
  - Lambda Layer ARN: arn:aws:lambda:us-east-1:654405684375:layer:genai-rag-workshop-dev-managed-layer:2
  - Lambda ARN: arn:aws:lambda:us-east-1:654405684375:function:genai-rag-workshop-dev
  - Rest API URL: https://mfh8ekbt73.execute-api.us-east-1.amazonaws.com/api/


<br>

## 3. LLM Inference
---


In [8]:
from IPython.display import display, HTML
import boto3
import json
import requests

client = boto3.client('apigateway')
region = boto3.Session().region_name
response = client.get_rest_apis(limit=1)

RESTAPI_ID = response['items'][0]['id']

URL = f'https://{RESTAPI_ID}.execute-api.{region}.amazonaws.com/api/'.replace('"','')
HEADERS = {
    'Content-Type': 'application/json',
    'Accept': 'application/json',
}

In [9]:
URL

'https://mfh8ekbt73.execute-api.us-east-1.amazonaws.com/api/'

In [10]:
RESTAPI_ID

'mfh8ekbt73'

```
curl -X POST https://6bk4r5mo4f.execute-api.us-east-1.amazonaws.com/api/llm/llama2_7b \
-H "Content-Type: application/json" -d "{ \"inputs\": \"Tell me the steps to make a pizza\", \"max_length\":50, \"parameters\": {\"max_new_tokens\": 64, \"top_p\": 0.9} }"
```

### Llama 2-7B

In [19]:
%%time
LLM_URL= f"{URL}llm/llama2_7b"

payload = {
    'inputs': "Please let us know SageMaker's advantages in 100 words",
    'parameters': {
        'max_new_tokens': 128,
        'top_p': 0.9,
        'temperature': 0.2,
        'return_full_text': False
    }
}

response = requests.post(url=LLM_URL, headers=HEADERS, json=payload)
print(response.json()[0]['generation'])

TypeError: 'NoneType' object is not subscriptable

### Llama 2-13B

In [20]:
%%time
LLM_URL = f"{URL}llm/llama2_13b"

payload = {
    'inputs': "Please let us know SageMaker's advantages in 100 words",
    'parameters': {
        'max_new_tokens': 128,
        'top_p': 0.9,
        'temperature': 0.2,
        'return_full_text': False
    }
}

response = requests.post(url=LLM_URL, headers=HEADERS, json=payload)
print(response.json()[0]['generation'])

TypeError: 'NoneType' object is not subscriptable

### KKULM-polyglot-12.8B

In [11]:
%%time
payload = {
    'inputs': "SageMaker의 장점을 알려줘",
    'parameters': {
        'max_new_tokens': 128,
        'top_p': 0.9,
        'temperature': 0.1,
        'return_full_text': False
    }
}

LLM_URL = f"{URL}llm/kkulm_12_8b"
response = requests.post(url=LLM_URL, headers=HEADERS, json=payload)
print(response.json()[0]['generated_text'])

!​1. SageMaker는 데이터를 분석하고, 시각화하고, 인사이트를 도출하는 데 도움이 되는 다양한 도구와 기능을 제공합니다. 이러한 도구와 기능은 데이터 과학자, 데이터 분석가 및 데이터 과학자가 되고자 하는 사람들에게 필수적입니다.​2. SageMaker는 데이터 과학자, 데이터 분석가 및 데이터 과학자가 되고자 하는 사람들에게 필수적인 도구와 기능을 제공합니다. SageMaker는 데이터를 분석하고, 시각화하고, 인사이트를 도출하는 데 도움이
CPU times: user 20.1 ms, sys: 0 ns, total: 20.1 ms
Wall time: 15.3 s


### Falcon-40B

In [None]:
%%time
LLM_URL = f"{URL}llm/falcon_40b"

payload = {
    'inputs': "Please let us know SageMaker's advantages in 100 words",
    'parameters': {
        'max_new_tokens': 128,
        'top_p': 0.9,
        'temperature': 0.2,
        'return_full_text': False
    }
}

response = requests.post(url=LLM_URL, headers=HEADERS, json=payload)
print(response.json()[0]['generated_text'])

### GPT-J-6B Embeddding

In [21]:
%%time
payload = {
    'text_inputs': "embedding",
}

EMB_URL = f"{URL}emb/gptj_6b"
headers = {
    'Content-Type': 'application/json',
    'Accept': 'application/json',
}

response = requests.post(url=EMB_URL, headers=headers, json=payload)
print(response.json()['embedding'][0][:5])

[0.0005843630060553551, -0.0013202275149524212, 0.02084660902619362, 0.018653083592653275, 0.023699166253209114]
CPU times: user 12.9 ms, sys: 2.64 ms, total: 15.5 ms
Wall time: 3.1 s


### KoSimCSE Embedding

In [None]:
%%time
payload = {
    'inputs': "임베딩",
}

EMB_URL = f"{URL}emb/kosimcse"
headers = {
    'Content-Type': 'application/json',
    'Accept': 'application/json',
}

response = requests.post(url=EMB_URL, headers=headers, json=payload)
print(response.json()[0][0][:5])


## Clean-up
---

In [None]:
%store RESTAPI_ID

In [None]:
!cd $PROJECT && chalice delete
!rm -rf $PROJECT 

In [None]:
RESTAPI_ID


## Stress test (Ongoing)
---

In [None]:
import functools
import concurrent.futures

In [None]:
def worker_llama2_7b(LLM_URL, words):
    
    print (LLM_URL, words)
    
    words = {
        'inputs': f"Please let us know SageMaker's advantages in 100 words",
        'parameters': {
            'max_new_tokens': 128,
            'top_p': 0.9,
            'temperature': 0.2,
            'return_full_text': False
        }
    }
    print (words)

    response = requests.post(url=LLM_URL, headers=HEADERS, json=payload)
    res = response.json()[0]['generation']
    
    return res


In [None]:
def worker_falcon_40b(LLM_URL, words):
    
    print (LLM_URL, words)
    
    payload = {
        'inputs': f"Please let us know SageMaker's advantages in {100} words",
        #'inputs': inp,
        
        'parameters': {
            'max_new_tokens': 128,
            'top_p': 0.9,
            'temperature': 0.2,
            'return_full_text': False
        }
    }
    response = requests.post(url=LLM_URL, headers=HEADERS, json=payload)
    
    if response.json() != None:
        res = response.json()[0]['generated_text']
    else:
        res = "None"
    return res

In [None]:
function = functools.partial(worker_falcon_40b, f"{URL}llm/falcon_40b") # 반복되는 것은 먼저 쓰기

In [None]:
with concurrent.futures.ThreadPoolExecutor(max_workers=60) as executor:
    results = list(executor.map(function, [idx+1 for idx in range(60)]))

In [None]:
len(results), results