## Deploy a pretrained/fine-tuned hateBERT model on AWS SageMaker with BentoML

BentoML is a flexible, lightweight framework for creating, deploying, and managing machine learning services. Here we use BentoML to deploy an NLP model (a pretrained BERT model for detecting violence in messages) to a cloud service. This notebook demonstrates an end-to-end project that deploys the "GroNLP/hateBERT" model to AWS SageMaker using a serverless architecture.

### Deploy a pretrained model
A Service is the core component of BentoML, where the serving logic is defined. Pretrained models from the Hugging Face Hub do not need to be saved to the BentoML model store first; instead, a custom runner can download and run the model at runtime.

In [2]:
%%writefile service.py
import bentoml

from bentoml.io import Text, JSON
from transformers import pipeline

class PretrainedModelRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("cpu",)
    SUPPORTS_CPU_MULTI_THREADING = True

    def __init__(self):
        self.classifier = pipeline(task="text-classification", model='GroNLP/hateBERT')

    @bentoml.Runnable.method(batchable=False)
    def __call__(self, input_text):
        return self.classifier(input_text)

runner = bentoml.Runner(PretrainedModelRunnable, name="pretrained_classifier")

svc = bentoml.Service('pretrained_classification_service', runners=[runner])

@svc.api(input=Text(), output=JSON())
async def detectViolence(input_series: str) -> list:
    return await runner.async_run(input_series)

Overwriting service.py


We can now run the BentoML server for our new service in development mode:

In [3]:
!bentoml serve service.py:svc --reload

2023-03-30T11:37:37-0600 [INFO] [cli] Prometheus metrics for HTTP BentoServer from "service.py:svc" can be accessed at http://localhost:3000/metrics.
2023-03-30T11:37:38-0600 [INFO] [cli] Starting development HTTP BentoServer from "service.py:svc" listening on http://0.0.0.0:3000 (Press CTRL+C to quit)
2023-03-30 11:37:39 circus[23370] [INFO] Loading the plugin...
2023-03-30 11:37:39 circus[23370] [INFO] Endpoint: 'tcp://127.0.0.1:53824'
2023-03-30 11:37:39 circus[23370] [INFO] Pub/sub: 'tcp://127.0.0.1:53825'
2023-03-30T11:37:39-0600 [INFO] [observer] Watching directories: ['/Users/li/OMSA/FullStackDL/BentoML/huggingface_deployment', '/Users/li/bentoml/models']
Some weights of the model checkpoint at GroNLP/hateBERT were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.tr
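
With the development server running, the endpoint can be exercised from a separate terminal or notebook. The sketch below is a minimal smoke test that uses only the standard library. It assumes the server is listening on `http://localhost:3000` (as in the log above) and that BentoML exposes the API under the function name, that is, `/detectViolence`. The helper names `build_request` and `classify` are illustrative and not part of BentoML.

```python
import json
import urllib.request

# Assumption: local dev server address taken from the log output above.
BASE_URL = "http://localhost:3000"


def build_request(text, base_url=BASE_URL, route="detectViolence"):
    """Build a POST request carrying plain text, matching the Text() input descriptor."""
    return urllib.request.Request(
        url=f"{base_url}/{route}",
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def classify(text, opener=urllib.request.urlopen):
    """Send the text to the service and decode the JSON() output."""
    with opener(build_request(text), timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires the running server):
#   print(classify("This is great!"))
```

The `opener` parameter exists only so that the network call can be swapped out when the server is not available, for example in a unit test.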

## Building a Bento
Once the service definition is finalized, we can build the model and service into a `bento`. A Bento is the distribution format for a service: a self-contained archive that contains all the source code, model files, and dependency specifications required to run the service.

To build a Bento, first create a `bentofile.yaml` file in your project directory:


In [4]:
%%writefile bentofile.yaml
service: "service.py:svc"
labels:
include:
- "*.py"
python:
  packages:
  - transformers
  - torch

Overwriting bentofile.yaml


In [5]:
!bentoml build

Building BentoML service "pretrained_classification_service:rnm6k6gpfots36wa" from build context "/Users/li/OMSA/FullStackDL/BentoML/huggingface_deployment".
Locking PyPI package versions.

██████╗░███████╗███╗░░██╗████████╗░█████╗░███╗░░░███╗██╗░░░░░
██╔══██╗██╔════╝████╗░██║╚══██╔══╝██╔══██╗████╗░████║██║░░░░░
██████╦╝█████╗░░██╔██╗██║░░░██║░░░██║░░██║██╔████╔██║██║░░░░░
██╔══██╗██╔══╝░░██║╚████║░░░██║░░░██║░░██║██║╚██╔╝██║██║░░░░░
██████╦╝███████╗██║░╚███║░░░██║░░░╚█████╔╝██║░╚═╝░██║███████╗
╚═════╝░╚══════╝╚═╝░░╚══╝░░░╚═╝░░░░╚════╝░╚═╝░░░░░╚═╝╚══════╝

Successfully built Bento(tag="pretrained_classification_service:rnm6k6gpfots36wa").

Possible next steps:

 * Containerize your Bento with `bentoml containerize`:
    $ bentoml containerize pretrained_classification_service:rnm6k6gpfots36wa

 * Push to BentoCloud with `bentoml push`:
    $ bentoml push pretrained_classification_service:rnm6k6gpfots36wa


We can now serve the built Bento in production mode:

In [6]:
!bentoml serve pretrained_classification_service:latest --production

2023-03-30T12:56:29-0600 [INFO] [cli] Environ for worker 0: set CPU thread count to 10
2023-03-30T12:56:29-0600 [INFO] [cli] Prometheus metrics for HTTP BentoServer from "pretrained_classification_service:latest" can be accessed at http://localhost:3000/metrics.
2023-03-30T12:56:30-0600 [INFO] [cli] Starting production HTTP BentoServer from "pretrained_classification_service:latest" listening on http://0.0.0.0:3000 (Press CTRL+C to quit)
Some weights of the model checkpoint at GroNLP/hateBERT were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing 

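## Deploy a fine-tuned model
Fine-tuning pretrained models is a powerful practice that saves computation cost and adapts state-of-the-art models to a domain-specific dataset. It also matters for this checkpoint specifically: the warning in the logs above indicates that the masked-language-model head weights were not used when the checkpoint was loaded for sequence classification. The Hugging Face ecosystem offers several libraries that support fine-tuning; the example below uses Transformers, Datasets, and Evaluate to fine-tune hateBERT on a small subset of the IMDB dataset.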

In [1]:
from datasets import load_dataset
import evaluate
import numpy as np

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    PreTrainedModel,
    Trainer, 
    TrainingArguments
)

datasets = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("GroNLP/hateBERT")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = datasets.map(tokenize_function, batched=True)

label_list = tokenized_datasets["train"].unique("label")

# Load pre-trained model from huggingface hub

model = AutoModelForSequenceClassification.from_pretrained(
    "GroNLP/hateBERT", num_labels=len(label_list)
)

# Create training arguments
training_args = TrainingArguments(
    output_dir="test_trainer",
    evaluation_strategy="epoch",
    num_train_epochs=1,
    per_device_train_batch_size=16,
)

# Define train set and eval set (1,000 shuffled examples each)
train_set = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
validation_set = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

# Report accuracy on the evaluation set
metric = evaluate.load("accuracy")


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_set,
    eval_dataset=validation_set,
    compute_metrics=compute_metrics,
)
trainer.train()


Found cached dataset imdb (/Users/li/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0)


  0%|          | 0/3 [00:00<?, ?it/s]

Loading cached processed dataset at /Users/li/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0/cache-52fa01bddfaeae7a.arrow
Loading cached processed dataset at /Users/li/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0/cache-b997016ac30a8840.arrow
Loading cached processed dataset at /Users/li/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0/cache-508808c0d29b9b55.arrow
Some weights of the model checkpoint at GroNLP/hateBERT were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertF

  0%|          | 0/63 [00:00<?, ?it/s]

  0%|          | 0/125 [00:00<?, ?it/s]

{'eval_loss': 0.29997846484184265, 'eval_accuracy': 0.887, 'eval_runtime': 169.6525, 'eval_samples_per_second': 5.894, 'eval_steps_per_second': 0.737, 'epoch': 1.0}
{'train_runtime': 804.5503, 'train_samples_per_second': 1.243, 'train_steps_per_second': 0.078, 'train_loss': 0.46503030686151414, 'epoch': 1.0}


TrainOutput(global_step=63, training_loss=0.46503030686151414, metrics={'train_runtime': 804.5503, 'train_samples_per_second': 1.243, 'train_steps_per_second': 0.078, 'train_loss': 0.46503030686151414, 'epoch': 1.0})
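
The progress-bar totals and metrics above follow from simple arithmetic. As a worked check, the sketch below assumes a single device and the `Trainer` default evaluation batch size of 8; neither is stated explicitly in the notebook.

```python
import math

train_examples = 1000   # size of train_set selected above
eval_examples = 1000    # size of validation_set selected above
train_batch_size = 16   # per_device_train_batch_size from TrainingArguments
eval_batch_size = 8     # assumption: Trainer default, not set in the notebook

# Steps per epoch = number of batches, counting the last partial batch
train_steps = math.ceil(train_examples / train_batch_size)
eval_steps = math.ceil(eval_examples / eval_batch_size)

# eval_accuracy is the fraction of evaluation examples classified correctly
eval_accuracy = 0.887
correct = round(eval_accuracy * eval_examples)

print(train_steps, eval_steps, correct)
```

Under these assumptions the step counts agree with the `0/63` and `0/125` progress bars and with `global_step=63` in the training output.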

### Save the fine-tuned model
Once the model is fine-tuned, create a Transformers pipeline with the model and save it to the BentoML model store. By design, only pipelines can be saved with the BentoML Transformers framework APIs, so models, tokenizers, feature extractors, and processors need to be part of a pipeline before they can be saved. Because Transformers pipelines are callable objects, the model signature is saved as `__call__` by default.

In [2]:
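import bentoml
from transformers import pipeline

# Wrap the fine-tuned model and its tokenizer in a pipeline, then save it to the BentoML model store
classifier = pipeline('text-classification', model=model, tokenizer=tokenizer)

bentoml.transformers.save_model(name="text-classifier", pipeline=classifier)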

No versions of Flax or Jax found on the current machine. In order to use Flax with transformers 4.x and above, refer to https://github.com/google/flax#quick-install


Model(tag="text-classifier:v7onwdgorscej6wa", path="/Users/li/bentoml/models/text-classifier/v7onwdgorscej6wa/")

Redefine the service so that it loads the fine-tuned model from the BentoML model store:

In [7]:
%%writefile service.py
import bentoml

from bentoml.io import Text, JSON

runner = bentoml.transformers.get("text-classifier:latest").to_runner()

svc = bentoml.Service("text-classifier_service", runners=[runner])

@svc.api(input=Text(), output=JSON())
async def predict(input_series: str) -> list:
    return await runner.async_run(input_series)

Overwriting service.py


Test the server in development mode:

In [8]:
!bentoml serve service:svc --reload

2023-03-30T13:15:22-0600 [INFO] [cli] Prometheus metrics for HTTP BentoServer from "service:svc" can be accessed at http://localhost:3000/metrics.
2023-03-30T13:15:22-0600 [INFO] [cli] Starting development HTTP BentoServer from "service:svc" listening on http://0.0.0.0:3000 (Press CTRL+C to quit)
2023-03-30 13:15:23 circus[71462] [INFO] Loading the plugin...
2023-03-30 13:15:23 circus[71462] [INFO] Endpoint: 'tcp://127.0.0.1:59318'
2023-03-30 13:15:23 circus[71462] [INFO] Pub/sub: 'tcp://127.0.0.1:59319'
2023-03-30T13:15:23-0600 [INFO] [observer] Watching directories: ['/Users/li/OMSA/FullStackDL/BentoML/huggingface_deployment', '/Users/li/bentoml/models']
2023-03-30T13:16:23-0600 [INFO] [dev_api_server:text-classifier_service] 127.0.0.1:59383 (scheme=http,method=GET,path=/,type=,length=) (status=200,type=text/html; charset=utf-8,length=2859) 0.347ms (trace=2880d889a82fc6fc3abf7c50ccb37489,span=10a054df0cd0cd3a,sampled=0)
2023-03-30T13:16:23-0600 [INFO] [dev_api_server:text-classifier_

Build the Bento:

In [9]:
!bentoml build

Building BentoML service "text-classifier_service:w3gbwkgpf6xep6wa" from build context "/Users/li/OMSA/FullStackDL/BentoML/huggingface_deployment".
Packing model "text-classifier:v7onwdgorscej6wa"
Locking PyPI package versions.

██████╗░███████╗███╗░░██╗████████╗░█████╗░███╗░░░███╗██╗░░░░░
██╔══██╗██╔════╝████╗░██║╚══██╔══╝██╔══██╗████╗░████║██║░░░░░
██████╦╝█████╗░░██╔██╗██║░░░██║░░░██║░░██║██╔████╔██║██║░░░░░
██╔══██╗██╔══╝░░██║╚████║░░░██║░░░██║░░██║██║╚██╔╝██║██║░░░░░
██████╦╝███████╗██║░╚███║░░░██║░░░╚█████╔╝██║░╚═╝░██║███████╗
╚═════╝░╚══════╝╚═╝░░╚══╝░░░╚═╝░░░░╚════╝░╚═╝░░░░░╚═╝╚══════╝

Successfully built Bento(tag="text-classifier_service:w3gbwkgpf6xep6wa").

Possible next steps:

 * Containerize your Bento with `bentoml containerize`:
    $ bentoml containerize text-classifier_service:w3gbwkgpf6xep6wa

 * Push to BentoCloud with `bentoml push`:
    $ bentoml push text-classifier_service:w3gbwkgpf6xep6wa


## Deploying Bentos in SageMaker
Prerequisites:
- Terraform: a tool for building, configuring, and managing infrastructure.
- AWS CLI: installed and configured with an AWS account that has permissions for SageMaker, Lambda, and ECR. Docker must also be logged in to ECR.
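
Before building, it can help to confirm that the command-line tools used in this section are available. The following is a minimal sketch using `shutil.which`; the tool list is inferred from the commands in this notebook, and `missing_tools` is a hypothetical helper name.

```python
import shutil

# Executables invoked in this section (inferred from the commands used in the notebook)
REQUIRED_TOOLS = ("terraform", "aws", "docker", "bentoctl")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


missing = missing_tools()
if missing:
    print("Missing prerequisites:", ", ".join(missing))
else:
    print("All prerequisites found on PATH.")
```

This only checks that the executables exist; it does not verify AWS credentials or the Docker login.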

Build a SageMaker-compatible Docker image and push it to the AWS ECR repository:



In [15]:
!bentoctl build -b text-classifier_service:latest -f deployment_config.yaml

  from bentoctl.cli import bentoctl
Usage of buildx:
      --builder string   
Multiple '--platform' arguments were found. Make sure to also use '--push' to push images to a repository or generated images will not be saved. See https://docs.docker.com/engine/reference/commandline/buildx_build/#load.
[+] Building 0.0s (0/0)
[+] Building 0.0s (1/1)
 => [internal] load build definition from Dockerfile                       0.0s
 => => transferring dockerfile: 1.64kB                                     0.0s
[+] Building 0.2s (2/3)
 => [internal] load build definition from Dockerfile                       0.0s
 => => transferring dockerfile: 1.64kB                                     0.0s
 => [internal] load .dockerig

## Apply Deployment with Terraform
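Initialize the Terraform project, which installs the AWS provider and sets up the Terraform folders, then apply it to create the SageMaker deployment.

In [18]:
# Uncomment on the first run to initialize the Terraform project:
#!terraform init
# Apply the Terraform project to create the SageMaker deployment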
!terraform apply -var-file=bentoctl.tfvars -auto-approve

aws_apigatewayv2_api.lambda: Refreshing state... [id=271qbbgdqc]
data.aws_ecr_repository.service: Reading...
aws_cloudwatch_log_group.api_gw: Refreshing state... [id=/aws/api_gw/finetuned-classifier-gw]
aws_apigatewayv2_stage.lambda: Refreshing state... [id=$default]
data.aws_ecr_repository.service: Read complete after 2s [id=finetuned_classifier]
data.aws_ecr_image.service_image: Reading...
data.aws_ecr_image.service_image: Read complete after 1s [id=sha256:a9ae8fb095c3ed89f00f461c6fbf6dd7932bb97bdde8435a603ec435d128f209]
aws_sagemaker_model.sagemaker_model: Refreshing state... [id=finetuned-classifier-model-w3gbwkgpf6xep6wa]
aws_sagemaker_endpoint_configuration.endpoint_config: Refreshing state... [id=finetuned-classifier-endpoint-config-w3gbwkgpf6xep6wa]
aws_sagemaker_endpoint.sagemaker_endpoint: Refreshing state... [id=finetuned-classifier-endpoint]

## Invoke the endpoint

In [20]:
#374806654920.dkr.ecr.us-east-2.amazonaws.com/pretrained_classification:2qgvurwfjojwv6wa
#https://runtime.sagemaker.us-east-2.amazonaws.com/endpoints/quickstart-endpoint/invocations
import boto3

# Create a low-level client representing Amazon SageMaker Runtime
sagemaker_runtime = boto3.client("sagemaker-runtime", region_name="us-east-2")

# The name of the endpoint. The name must be unique within an AWS Region in your AWS account.
endpoint_name = 'finetuned-classifier-endpoint'

# After you deploy a model into production using SageMaker hosting
# services, your client applications use this API to get inferences
# from the model hosted at the specified endpoint.
response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=bytes('{"body": "This is great!"}', 'utf-8'),  # Replace with your own data.
)

# Optional: decode and print the response body so that it is human-readable.
print(response['Body'].read().decode('utf-8'))

[{"label":"LABEL_0","score":0.6155382990837097}]
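
The endpoint returns the pipeline output as a JSON list of label/score pairs. A small helper can turn this into something easier to read. The mapping from `LABEL_0`/`LABEL_1` to `negative`/`positive` below assumes the IMDB label convention (0 = negative, 1 = positive) from the fine-tuning dataset, and `describe_prediction` is an illustrative name rather than part of any library.

```python
import json

# Assumption: IMDB convention, label 0 = negative review, label 1 = positive review.
LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "positive"}


def describe_prediction(raw_body):
    """Parse the JSON body and return (readable label, score) of the top prediction."""
    predictions = json.loads(raw_body)
    top = max(predictions, key=lambda p: p["score"])
    # Fall back to the raw label if it is not in the mapping
    return LABEL_NAMES.get(top["label"], top["label"]), top["score"]


# The response body printed above, used here as sample input
label, score = describe_prediction('[{"label":"LABEL_0","score":0.6155382990837097}]')
print(f"{label} ({score:.2f})")
```

In practice, `raw_body` would be the decoded string from `response['Body'].read().decode('utf-8')`.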


## Test API

## Delete deployment
Use the `bentoctl destroy` command to remove the registry and the deployment.

In [44]:
!bentoctl destroy -f deployment_config.yaml

  from bentoctl.cli import bentoctl
data.aws_ecr_repository.service: Reading...
aws_apigatewayv2_api.lambda: Refreshing state... [id=271qbbgdqc]
data.aws_ecr_repository.service: Read complete after 1s [id=finetuned_classifier]
data.aws_ecr_image.service_image: Reading...
data.aws_ecr_image.service_image: Read complete after 0s [id=sha256:a9ae8fb095c3ed89f00f461c6fbf6dd7932bb97bdde8435a603ec435d128f209]
aws_sagemaker_model.sagemaker_model: Refreshing state... [id=finetuned-classifier-model-w3gbwkgpf6xep6wa]
aws_cloudwatch_log_group.api_gw: Refreshing state... [id=/aws/api_gw/finetuned-classifier-gw]
aws_sagemaker_endpoint_configuration.endpoint_config: Refreshing state... [id=finetuned-classifier-endpoint-config-w3gbwkgpf6xep6wa]
aws_sagemaker_endpoint.sagemaker_endpoint: Refreshing state... [id=finetuned-classifier-endpoint]
aws_apigatewayv2_stage.lambda: Refreshi