Copyright (C) 2022 Intel Corporation
 
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
 
http://www.apache.org/licenses/LICENSE-2.0
 
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions
and limitations under the License.
 

SPDX-License-Identifier: Apache-2.0

# General Description

Version: 1.1 Date: Oct 28, 2022

This notebook outlines the general usage of a cloud inference platform: a model quantized with Intel Neural Compressor is served on Intel CPUs through Amazon SageMaker. It illustrates how users can send a request to the endpoint through the REST API and get the model output.

Users may wish to reuse parts of the code and customize them to suit their purposes.

# Step 0: Specify the AWS information (Optional)

Users may specify the AWS information in the ./config/sagemaker_config.yaml file to pre-fill the information required by the workflow. Alternatively, users may fill in the necessary fields when executing the steps.

In [None]:
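import yaml
read_from_yaml = False  # stays False if the optional config file cannot be loaded
try:
    with open('./config/sagemaker_config.yaml') as f:
        config_dict = yaml.safe_load(f)
    read_from_yaml = True
except FileNotFoundError:
    print('Config file not found; fill in the fields manually in Step 2.')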
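
Before moving on, it can help to confirm that the loaded configuration contains every field the later steps read. The sketch below checks for the three keys that Step 2 looks up in `config_dict` (`role`, `inference_image_uri`, `quantized_model_s3_path`). The helper name `missing_config_keys` is hypothetical and not part of this workflow.

```python
# Keys that Step 2 of this notebook reads from the YAML configuration.
REQUIRED_KEYS = ("role", "inference_image_uri", "quantized_model_s3_path")


def missing_config_keys(config):
    """Return the required keys that are absent or empty in `config`."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Illustrative values only; a real config would hold the user's own role, image, and S3 path.
example_config = {
    "role": "AmazonSageMaker-ExecutionRole-xxxxxxxxxxxxxx",
    "inference_image_uri": "",
}
print(missing_config_keys(example_config))
```

An empty list means all three fields are filled in. Any key it returns would otherwise surface later as a `KeyError` or as a deployment failure.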

# Step 1: Build a custom docker image for inference
    1. Copy the contents of "../src/sagemaker_inference_container" to a location outside the docker container.
    2. Modify the AWS credentials in build_and_push.sh.
       Pay attention to the region, account number, algorithm_name, and any firewall restrictions.
    3. Run build_and_push.sh to build the custom docker image for inference.
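
The region, account number, and algorithm_name mentioned above are the pieces that make up the image URI used in Step 2. As a rough sketch, assuming the standard Amazon ECR naming pattern seen in the example URI later in this notebook, the URI can be composed as follows. The function name `ecr_image_uri` is hypothetical, and this notebook does not show whether build_and_push.sh appends a tag, so the tag is optional here.

```python
def ecr_image_uri(account_id, region, algorithm_name, tag=None):
    """Compose an ECR image URI from its parts (a tag is appended only if given)."""
    uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{algorithm_name}"
    return f"{uri}:{tag}" if tag else uri


# The account number below is a placeholder, not a real account.
print(ecr_image_uri("123456789012", "us-west-1",
                    "sagemaker-inteltf-huggingface-inc-inference"))
```

Comparing this string with the repository that build_and_push.sh pushes to is a quick way to catch a mismatched region or account number before deployment.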

# Step 2: Deploy the model using SageMaker 

Users may wish to change the type and the number of cluster nodes used for serving.

To do so, change the two variables 'deploy_instance_type' and 'num_of_nodes'.

List of EC2 instances: https://docs.aws.amazon.com/sagemaker/latest/dg/notebooks-available-instance-types.html




In [None]:
import sagemaker
from sagemaker.tensorflow.model import TensorFlowModel

# Step 0 is optional, so fall back to the manual values if it was skipped
if globals().get('read_from_yaml', False):
    sagemaker_role = config_dict['role']
    inference_image = config_dict['inference_image_uri']
    model_data = config_dict['quantized_model_s3_path']
else:
    sagemaker_role = '' # AmazonSageMaker-ExecutionRole-xxxxxxxxxxxxxx
    inference_image = '' #e.g.: xxxxxxxxxx.dkr.ecr.us-west-1.amazonaws.com/sagemaker-inteltf-huggingface-inc-inference
    model_data = '' #The quantized model trained in the 1.0-intel-sagemaker-training.ipynb - e.g.: s3://sagemaker-us-west-1-xxxxxxxxx/model/model.tar.gz

#Specify the type of target nodes and number of nodes for the deployment
deploy_instance_type = "ml.c5.xlarge" #default value
num_of_nodes=1                        #default value

model = TensorFlowModel(model_data=model_data, role=sagemaker_role, image_uri=inference_image)

predictor = model.deploy(initial_instance_count=num_of_nodes, instance_type=deploy_instance_type)
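
Because the manual branch above starts with empty strings, a deployment can fail late with an unclear error when a field is left blank. The sketch below is one possible pre-flight check to run before `model.deploy`. The helper `deployment_problems` is hypothetical, and its rules are assumptions drawn from the example values in the comments above (an ECR image URI, an `s3://` path to a `.tar.gz` archive, an `ml.`-prefixed instance type).

```python
def deployment_problems(model_data, image_uri, role, instance_type, node_count):
    """Return a list of problems; an empty list means the inputs look plausible."""
    problems = []
    if not role:
        problems.append("role is empty")
    if ".dkr.ecr." not in image_uri:
        problems.append("image_uri does not look like an ECR image URI")
    if not (model_data.startswith("s3://") and model_data.endswith(".tar.gz")):
        problems.append("model_data should be an s3:// path to a .tar.gz archive")
    if not instance_type.startswith("ml."):
        problems.append("instance_type should start with 'ml.'")
    if node_count < 1:
        problems.append("node_count must be at least 1")
    return problems


# Placeholder values that follow the patterns shown in the cell above.
print(deployment_problems(
    "s3://sagemaker-us-west-1-xxxxxxxxx/model/model.tar.gz",
    "123456789012.dkr.ecr.us-west-1.amazonaws.com/sagemaker-inteltf-huggingface-inc-inference",
    "AmazonSageMaker-ExecutionRole-xxxxxxxxxxxxxx",
    "ml.c5.xlarge",
    1,
))
```

In the notebook, the same function could be called with `model_data`, `inference_image`, `sagemaker_role`, `deploy_instance_type`, and `num_of_nodes`. These checks only catch obvious omissions; they do not confirm that the role, image, or S3 object actually exists.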

# Step 3: Preprocess the input data and send it to the endpoint
The tokenizer has already been downloaded offline and placed in the container. The choice of tokenizer depends on the type of model and the task. Users may change it or use another preprocessing method.

In [None]:
from transformers import AutoTokenizer
import numpy as np
tokenizer = AutoTokenizer.from_pretrained('../src/sagemaker_inference_container/bert_uncased_tokenizer') #model dependent. Users may switch to the one that suits their use case.
sentence1 = 'Sheena Young of Child, the national infertility support network, hoped the guidelines would lead to a more "fair and equitable" service for infertility sufferers.'
sentence2 = 'Sheena Young, a spokesman for Child, the national infertility support network, said the proposed guidelines should lead to a more "fair and equitable" service for infertility sufferers.'
processed_input = tokenizer(sentence1, sentence2, padding=True, truncation=True)
batch = [dict(processed_input)]
input_data = {"instances": batch}

#Send the request to the endpoint in JSON format
result = predictor.predict(input_data)
prediction = np.argmax(result['predictions'][0])
prediction
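
The cell above goes through the SageMaker `predictor` object. For clients that do not have the SageMaker Python SDK, the same request can be sent through the SageMaker runtime `invoke_endpoint` API. The following is a minimal sketch, assuming the endpoint accepts and returns the same `instances` / `predictions` JSON layout used above. The helper names `build_payload`, `invoke`, and `predicted_class` are hypothetical, and the runtime client is passed in so the functions can be exercised without AWS access.

```python
import json


def build_payload(encoded):
    """Wrap one tokenized example in the 'instances' JSON layout used above."""
    return json.dumps({"instances": [dict(encoded)]})


def invoke(runtime_client, endpoint_name, payload):
    """Send a JSON payload to the endpoint and decode the JSON response."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=payload,
    )
    return json.loads(response["Body"].read())


def predicted_class(result):
    """Index of the largest score for the first instance (same as np.argmax above)."""
    scores = result["predictions"][0]
    return max(range(len(scores)), key=scores.__getitem__)
```

With AWS credentials configured, a call might look like `invoke(boto3.client("sagemaker-runtime"), predictor.endpoint_name, build_payload(processed_input))`, followed by `predicted_class(...)` on the returned dictionary. Injecting the client also makes it easy to substitute a stub during local testing.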

# Clean up

In [None]:
predictor.delete_endpoint()