# Serverless Inference


This example creates a serverless inference endpoint on SageMaker for the [GALACTICA](https://huggingface.co/facebook/galactica-125m)
model (the mini version), used here to generate citations for a given text prompt. It shows how the endpoint is created and how it is used.

## Resources

| Name                        | Link |
|-----------------------------|------|
| Galactica (model)           | [https://huggingface.co/facebook/galactica-125m](https://huggingface.co/facebook/galactica-125m) |
| Galactica (paper)           | [https://arxiv.org/abs/2211.09085](https://arxiv.org/abs/2211.09085) |
| AWS SageMaker Serverless    | [https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html](https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html) |
| AWS SageMaker SDK           | [https://sagemaker.readthedocs.io/en/stable/](https://sagemaker.readthedocs.io/en/stable/) |



### Permissions

This code obtains the execution role and creates the SageMaker session; the details can depend on the AWS account, role and region used.
For more details refer to the [AWS documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) and [SageMaker Serverless Examples](https://github.com/aws/amazon-sagemaker-examples/blob/main/serverless-inference/Serverless-Inference-Walkthrough.ipynb).

*Note*: Run `pip install` only when necessary, depending on the environment you are using (for SageMaker notebooks this is not required).

In [1]:
# !pip install boto3 --upgrade
# !pip install sagemaker --upgrade

In [2]:
import sagemaker
import boto3

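try:
    # Works inside SageMaker (notebooks, Studio), where an execution role is attached.
    role = sagemaker.get_execution_role()
except ValueError:
    # Outside SageMaker no execution role can be discovered, so look one up by name.
    # ROLE_NAME is a placeholder (assumption): use a role with SageMaker permissions.
    ROLE_NAME = ""  # TODO: add IAM role name
    role = boto3.client("iam").get_role(RoleName=ROLE_NAME)["Role"]["Arn"]

session = sagemaker.Session()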

In [3]:
MODEL_BUCKET = ""  # TODO: add S3 bucket where model is stored

Set up the required imports and set the logging level.

In [4]:
import logging

from endpoint import ServerlessEndpoint

logging.basicConfig(level=logging.INFO)

The HuggingFace environment variables raise the request and response size limits above their defaults: the model takes text as input and produces text, both of which can vary in length, and therefore the size of the request and response varies too.

In [5]:
HF_ENV = {
    "TS_MAX_RESPONSE_SIZE": "13107000",
    "TS_MAX_REQUEST_SIZE": "13107000",
    "MMS_MAX_RESPONSE_SIZE": "13107000",
    "MMS_MAX_REQUEST_SIZE": "13107000",
    "MMS_WORKERS_PER_MODEL": "4",
}
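The limits above are plain byte counts passed as strings. As a rough sketch, and assuming the model server's default limit is 6553500 bytes (worth checking against the server version in the container), the same dictionary could be derived from that default rather than hard-coded. The helper name `build_size_env` and its parameters are illustrative and not part of this example's code:

```python
# Assumed default for the max request/response size of the model server, in bytes.
DEFAULT_MAX_SIZE_BYTES = 6_553_500


def build_size_env(factor: int = 2, workers_per_model: int = 4) -> dict[str, str]:
    """Return container environment variables with scaled size limits.

    Environment variable values must be strings, so the numbers are converted.
    """
    limit = str(DEFAULT_MAX_SIZE_BYTES * factor)
    return {
        "TS_MAX_RESPONSE_SIZE": limit,
        "TS_MAX_REQUEST_SIZE": limit,
        "MMS_MAX_RESPONSE_SIZE": limit,
        "MMS_MAX_REQUEST_SIZE": limit,
        "MMS_WORKERS_PER_MODEL": str(workers_per_model),
    }


env = build_size_env()
print(env["TS_MAX_REQUEST_SIZE"])  # 13107000, the value used in HF_ENV
print(f"{int(env['TS_MAX_REQUEST_SIZE']) / 1024 ** 2:.1f} MiB")  # 12.5 MiB
```

With `factor=2` the result matches `HF_ENV` exactly, which makes it easier to see that the limits here are double the assumed default.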

Create the endpoint object for the model.

In [6]:
endpoint = ServerlessEndpoint(
    model_name="galactica-125m",
    model_dir=f"s3://{MODEL_BUCKET}/galactica-mini/galactica-citation-prediction.tar.gz",
    role_arn=role,
    env=HF_ENV,
)

Set up the endpoint. This takes a while: in this step the model is added to the SageMaker models section (as a Docker image) and the endpoint is deployed (on AWS Lambda underneath).

The dashes are progress output produced by the AWS SDK and are not very informative; a ready endpoint should show 4 dashes followed by an exclamation point (`----!`).

In [7]:
endpoint.setup()

INFO:sagemaker.image_uris:Defaulting to CPU type when using serverless inference
INFO:sagemaker:Creating model with name: galactica-125m
INFO:sagemaker:Creating endpoint-config with name galactica-125m-2023-08-04-07-56-15-555
INFO:sagemaker:Creating endpoint with name galactica-125m-2023-08-04-07-56-15-555


----!

INFO:endpoint:Created endpoint!
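The `endpoint` module itself is not shown here. A minimal sketch of what such a helper could look like with the SageMaker Python SDK follows; it is not the actual `ServerlessEndpoint` implementation. The class name `ServerlessEndpointSketch`, the memory and concurrency defaults, and the framework versions are all assumptions, and the response format depends on the inference handler packaged with the model:

```python
import logging
from typing import Optional

logger = logging.getLogger("endpoint")


class ServerlessEndpointSketch:
    """Hypothetical stand-in for the `ServerlessEndpoint` helper."""

    def __init__(self, model_name: str, model_dir: str, role_arn: str,
                 env: Optional[dict] = None, memory_size_in_mb: int = 4096,
                 max_concurrency: int = 1):
        self.model_name = model_name
        self.model_dir = model_dir
        self.role_arn = role_arn
        self.env = env or {}
        self.memory_size_in_mb = memory_size_in_mb
        self.max_concurrency = max_concurrency
        self.predictor = None

    def setup(self) -> None:
        # Imported lazily so the class can be defined without the SDK installed.
        from sagemaker.huggingface import HuggingFaceModel
        from sagemaker.serverless import ServerlessInferenceConfig

        model = HuggingFaceModel(
            name=self.model_name,
            model_data=self.model_dir,
            role=self.role_arn,
            env=self.env,
            # Assumed versions; pick a combination supported by the SDK and region.
            transformers_version="4.26",
            pytorch_version="1.13",
            py_version="py39",
        )
        config = ServerlessInferenceConfig(
            memory_size_in_mb=self.memory_size_in_mb,
            max_concurrency=self.max_concurrency,
        )
        # Creates the model, the endpoint configuration and the endpoint.
        self.predictor = model.deploy(serverless_inference_config=config)
        logger.info("Created endpoint!")

    def __call__(self, prompt: str):
        if self.predictor is None:
            raise RuntimeError("Call setup() before sending requests.")
        return self.predictor.predict({"inputs": prompt})

    def cleanup(self) -> None:
        if self.predictor is not None:
            self.predictor.delete_model()
            # Also removes the endpoint configuration by default.
            self.predictor.delete_endpoint()
            self.predictor = None
            logger.info("Cleaned up endpoint!")
```

Keeping the predictor on the object lets `cleanup()` delete exactly the resources that `setup()` created, which matters because serverless endpoints are named with a timestamp suffix.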


Run a few examples. The Galactica model should predict the publication for a given prompt relatively well: it does not work well for niche topics, but for general topics it should name the right publication in most cases. For the Transformer it correctly identifies *"Attention is All You Need"* along with the author.

In [8]:
%time endpoint("The Transformer architecture")  # correct

CPU times: user 15.6 ms, sys: 0 ns, total: 15.6 ms
Wall time: 4.93 s


'Attention is All you Need, Vaswani.# 3'

A second run, once the endpoint is warm, is much faster.

In [9]:
%time endpoint("Adam optimizer")  # correct

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 1.3 s


'Adam: A Method for Stochastic Optimization, Kingma, and a batch size of'
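To compare cold and warm latency without relying on `%time`, the calls can be timed in plain Python. This is only a sketch: `time_calls` is an illustrative helper, and `fake_endpoint` stands in for the deployed endpoint so the snippet runs without AWS access.

```python
import time
from typing import Callable


def time_calls(fn: Callable[[str], object], prompts: list[str]) -> list[float]:
    """Return the wall-clock latency in seconds of each call, in order."""
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        fn(prompt)
        latencies.append(time.perf_counter() - start)
    return latencies


# Stand-in for the deployed endpoint (assumption), so no AWS access is needed.
def fake_endpoint(prompt: str) -> str:
    time.sleep(0.01)
    return prompt.upper()


latencies = time_calls(fake_endpoint, ["The Transformer architecture", "Adam optimizer"])
for i, seconds in enumerate(latencies, start=1):
    print(f"call {i}: {seconds:.3f} s")
```

Passing the real `endpoint` object instead of `fake_endpoint` would show the first call paying the cold-start cost and later calls running faster.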

In [10]:
%time endpoint("LSTM")  # not the original paper

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 950 ms


'Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation, Cho,'

In [11]:
%time endpoint("AlphaFold")  # wrong!

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 859 ms


'A Fast and Accurate Method to Estimate Folding Energy, Kabsch'
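The raw outputs above mix the citation with trailing generated text (for example `.# 3` or `, and a batch size of`). A small post-processing step could separate the title from the first author. This is a sketch under the assumption that the output has the shape `Title, Author...` and that the title contains no comma followed by a space; `parse_citation` and `Citation` are illustrative names:

```python
import re
from typing import NamedTuple, Optional


class Citation(NamedTuple):
    title: str
    author: Optional[str]


def parse_citation(raw: str) -> Citation:
    """Split 'Title, Author<trailing text>' into title and first author."""
    title, _, rest = raw.partition(", ")
    # Keep only the leading name-like characters and drop what follows.
    match = re.match(r"[^\W\d_]+(?:[-'][^\W\d_]+)*", rest)
    return Citation(title.strip(), match.group(0) if match else None)


print(parse_citation("Attention is All you Need, Vaswani.# 3"))
print(parse_citation("Adam: A Method for Stochastic Optimization, Kingma, and a batch size of"))
```

When the output contains no author part at all, `author` is `None` rather than an empty string, so callers can tell the two cases apart.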

Remove the model, the endpoint and its configuration.

In [12]:
endpoint.cleanup()

INFO:sagemaker:Deleting model with name: galactica-125m
INFO:sagemaker:Deleting endpoint configuration with name: galactica-125m-2023-08-04-07-56-15-555
INFO:sagemaker:Deleting endpoint with name: galactica-125m-2023-08-04-07-56-15-555
INFO:endpoint:Cleaned up endpoint!
