# Optimize & deploy GPT-J on AWS Inferentia2 with Amazon SageMaker

In this end-to-end tutorial, you will learn how to speed up BERT inference down to `1ms` latency for text classification with Hugging Face Transformers, Amazon SageMaker, and AWS Inferentia2.

You will learn how to: 

1. Convert BERT to AWS Neuron (Inferentia2) with `optimum-neuron`
2. Create a custom `inference.py` script for `text-classification`
3. Upload the neuron model and inference script to Amazon S3
4. Deploy a Real-time Inference Endpoint on Amazon SageMaker
5. Run and evaluate Inference performance of BERT on Inferentia2

Let's get started! 🚀

---

*If you are going to use SageMaker in a local environment (not SageMaker Studio or Notebook Instances), you need access to an IAM role with the required permissions for SageMaker. You can find out more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).*

In [1]:
!pip install git+https://github.com/aws-neuron/transformers-neuronx.git --upgrade
!pip install git+https://github.com/huggingface/optimum-neuron.git@neuron_model_for_causal_lm --upgrade

Collecting git+https://github.com/aws-neuron/transformers-neuronx.git
  Cloning https://github.com/aws-neuron/transformers-neuronx.git to /tmp/pip-req-build-tebky619
  Running command git clone -q https://github.com/aws-neuron/transformers-neuronx.git /tmp/pip-req-build-tebky619
Building wheels for collected packages: transformers-neuronx
  Building wheel for transformers-neuronx (setup.py) ... done
  Created wheel for transformers-neuronx: filename=transformers_neuronx-0.4.20230629-py3-none-any.whl size=122672 sha256=66e4cc9f0757da9ccfa7786eb8944dff2d11d4ca04c90a870e4dc17750bcc09b
  Stored in directory: /tmp/pip-ephem-wheel-cache-abivopgt/wheels/a8/cd/08/7e54ef998d43ebf4954c9c66f5667a9801fec18049af641371
Successfully built transformers-neuronx
Installing collected packages: transformers-neuronx
  Attempting uninstall: transformers-neuronx
    Found existing installation: transformers-neuronx 0.4.20230629
    Uninstalling transformers-neuronx-0.4.20230629:
      Successfull

In [2]:
import os

import torch
from transformers import AutoTokenizer
from optimum.neuron import NeuronModelForCausalLM


os.environ["NEURON_CC_FLAGS"] = "--model-type=transformer-inference"

# Compilation does not work with batch_size = 1
batch_size = 2
seq_length = 128

# Load and convert the Hub model to Neuron format
model_neuron = NeuronModelForCausalLM.from_pretrained(
    "gpt2", batch_size=batch_size, sequence_length=seq_length, export=True, tp_degree=2, amp="f32"
)

print("HF model converted to Neuron")

# Get a tokenizer and example input
tokenizer = AutoTokenizer.from_pretrained("gpt2")
prompt_text = "Hello, I'm a language model,"
# We need to replicate the text because batch_size is not 1
prompts = [prompt_text for _ in range(batch_size)]

# Encode text and generate using AWS sampling loop
encoded_text = tokenizer(prompts, return_tensors='pt')
with torch.inference_mode():
    generated_sequence = model_neuron.model.sample(encoded_text.input_ids, sequence_length=seq_length)
    print([tokenizer.decode(tok) for tok in generated_sequence])

print("Outputs generated using AWS sampling loop")

# Specify padding options
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = "left"

# Encode tokens and generate using temperature
tokens = tokenizer(prompts, padding=True, return_tensors='pt')
model_neuron.reset_generation() # Need to check if this can be automated
sample_output = model_neuron.generate(
    **tokens,
    do_sample=True,
    max_length=seq_length,
    temperature=0.7,
)
print([tokenizer.decode(tok) for tok in sample_output])

print("Outputs generated using HF generate")



..
Compiler status PASS
2023-Jun-29 12:05:43.0738 2074:2074 [0] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2023-Jun-29 12:05:43.0738 2074:2074 [0] init.cc:99 CCOM WARN OFI plugin initNet() failed is EFA enabled?
HF model converted to Neuron


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


['Hello, I\'m a language model, and I\'d love to introduce you to my upcoming writing course," she said, speaking of her next writing book, "Beyond Borders: A New Translation of Shakespeare\'s Old Folio Series." "I\'m really looking forward to this learning time as well," she added. "I have no plans to become a linguistic learner anytime soon."\n\nAlfred Wiesel, an English professor and linguist at the University of Virginia, agrees: "There is nothing in American studies quite like watching a Shakespeare play as a teacher or as a citizen." However, he added: "American authors frequently', "Hello, I'm a language model, so I wanted to make a code base for working with C#. I think it's incredibly powerful and straightforward. So my thought was:\n\nA single implementation is possible…\n\n…but let's be bold and say we are working with C# and using something like C# 5.10. I want a single instance.\n\nWe know that we need a single instance, so I will create one.\n\nIn this case, I will create

## 1. Convert BERT to AWS Neuron (Inferentia2) with `optimum-neuron`

We are going to use [optimum-neuron](https://huggingface.co/docs/optimum-neuron/index). 🤗 Optimum Neuron is the interface between the 🤗 Transformers library and AWS Accelerators including [AWS Trainium](https://aws.amazon.com/machine-learning/trainium/?nc1=h_ls) and [AWS Inferentia](https://aws.amazon.com/machine-learning/inferentia/?nc1=h_ls). It provides a set of tools enabling easy model loading, training and inference on single- and multi-Accelerator settings for different downstream tasks.

As a first step, we need to install the `optimum-neuron` and other required packages.

*Tip: If you are using Amazon SageMaker Notebook Instances or Studio you can go with the `conda_python3` conda kernel.*


In [19]:
!python -m pip install "git+https://github.com/aws/sagemaker-python-sdk.git"  --upgrade


Collecting git+https://github.com/aws/sagemaker-python-sdk.git
  Cloning https://github.com/aws/sagemaker-python-sdk.git to /tmp/pip-req-build-t4tkqum2
  Running command git clone -q https://github.com/aws/sagemaker-python-sdk.git /tmp/pip-req-build-t4tkqum2
Building wheels for collected packages: sagemaker
  Building wheel for sagemaker (setup.py) ... done
  Created wheel for sagemaker: filename=sagemaker-2.168.1.dev0-py2.py3-none-any.whl size=1152959 sha256=36980350bb08cf9ba0efca10ac90bec3877c3c892e81fe377044a848c334eda9
  Stored in directory: /tmp/pip-ephem-wheel-cache-tyn1ws_0/wheels/86/90/ca/c446e4ac09f7ad1b813fe4ab437ffc09821067a2de18621ac6
Successfully built sagemaker
Installing collected packages: sagemaker
  Attempting uninstall: sagemaker
    Found existing installation: sagemaker 2.168.0
    Uninstalling sagemaker-2.168.0:
      Successfully uninstalled sagemaker-2.168.0
Successfully installed sagemaker-2.168.1.dev0


In [None]:
# Install the required packages
# !pip install "optimum-neuron[neuronx]==0.0.6"  --upgrade
!pip install "git+https://github.com/huggingface/optimum-neuron.git@b94d534cc0160f1e199fae6ae3a1c7b804b49e30"  --upgrade

# !python -m pip install "sagemaker==2.169.0"  --upgrade
!python -m pip install "git+https://github.com/aws/sagemaker-python-sdk.git"  --upgrade
# pip install sagemaker from github

After we have installed `optimum-neuron`, we can load and convert our model.

We are going to use the [yiyanghkust/finbert-tone](https://huggingface.co/yiyanghkust/finbert-tone) model. FinBERT is a BERT model pre-trained on financial communication text. The purpose is to enhance financial NLP research and practice. It is trained on three financial communication corpora with a total size of 4.9B tokens. The released finbert-tone model is the FinBERT model fine-tuned on 10,000 manually annotated (positive, negative, neutral) sentences from analyst reports.

In [None]:
model_id = "yiyanghkust/finbert-tone"

At the time of writing, the [AWS Inferentia2 does not support dynamic shapes for inference](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-features/dynamic-shapes.html?highlight=dynamic%20shapes#), which means that the input size needs to be static for compiling and inference. 

In simpler terms, this means that when the model is converted with a sequence length of 16, it can only run inference on inputs with the same shape. We are going to use the `optimum-cli` to convert our model with a sequence length of 128 and a batch size of 1.
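To make the static-shape constraint concrete, here is a minimal sketch in plain Python of what "same shape" implies for every request: token ids are truncated or padded to exactly the compiled sequence length. The helper names `pad_or_truncate` and `attention_mask` and the pad id `0` are assumptions made for illustration, not part of `optimum-neuron`.

```python
from typing import List


def pad_or_truncate(token_ids: List[int], seq_length: int, pad_token_id: int = 0) -> List[int]:
    """Return exactly `seq_length` ids: clip long inputs, right-pad short ones."""
    clipped = token_ids[:seq_length]
    return clipped + [pad_token_id] * (seq_length - len(clipped))


def attention_mask(token_ids: List[int], seq_length: int) -> List[int]:
    """Mark real tokens with 1 and padding positions with 0."""
    real = min(len(token_ids), seq_length)
    return [1] * real + [0] * (seq_length - real)


# Hypothetical token ids for a short and a long input
short_input = [101, 7592, 102]
long_input = list(range(200))

padded_short = pad_or_truncate(short_input, seq_length=8)
padded_long = pad_or_truncate(long_input, seq_length=128)
mask_short = attention_mask(short_input, seq_length=8)
```

With a Hugging Face tokenizer the same effect is usually achieved by passing `padding="max_length"`, `truncation=True` and `max_length=128` when encoding.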

_When using a `t2.medium` instance the compiling takes around 2-3 minutes_ 

In [None]:
%%bash -s "$model_id"
MODEL_ID=$1
SEQUENCE_LENGTH=128
BATCH_SIZE=1
OUTPUT_DIR=tmp/ # used to store temporary files
echo "Model ID: $MODEL_ID"

# exporting model
optimum-cli export neuron \
  --model $MODEL_ID \
  --sequence_length $SEQUENCE_LENGTH \
  --batch_size $BATCH_SIZE \
  $OUTPUT_DIR

## 2. Create a custom `inference.py` script for `text-classification`

The [Hugging Face Inference Toolkit](https://github.com/aws/sagemaker-huggingface-inference-toolkit) supports zero-code deployments on top of the [pipeline feature](https://huggingface.co/transformers/main_classes/pipelines.html) from 🤗 Transformers. This allows users to deploy Hugging Face transformers without an inference script [[Example](https://github.com/huggingface/notebooks/blob/master/sagemaker/11_deploy_model_from_hf_hub/deploy_transformer_model_from_hf_hub.ipynb)]. 

Currently, this feature is not supported with AWS Inferentia2, which means we need to provide an `inference.py` for running inference. But `optimum-neuron` has integrated support for the 🤗 Transformers pipeline feature. That way we can use `optimum-neuron` to create a pipeline for our model.

If you want to know more about the `inference.py` script check out this [example](https://github.com/huggingface/notebooks/blob/master/sagemaker/17_custom_inference_script/sagemaker-notebook.ipynb). It explains amongst other things what the `model_fn` and `predict_fn` are. 

In [4]:
!mkdir code

In addition to our `inference.py` script we need to provide a `requirements.txt`, which installs the latest version of the `transformers-neuronx` package that the script imports.
_Note: This is a temporary solution until the `optimum-neuron` package is updated inside the DLC._

In [5]:
%%writefile code/requirements.txt
git+https://github.com/aws-neuron/transformers-neuronx.git

Writing code/requirements.txt


We are using the `NEURON_RT_NUM_CORES=1` to make sure that each HTTP worker uses 1 Neuron core to maximize throughput.

In [6]:
%%writefile code/inference.py
import os
from transformers_neuronx.gptj.model import GPTJForSampling
from transformers_neuronx.generation_utils import HuggingFaceGenerationModelAdapter
from transformers_neuronx.module import save_pretrained_split
from transformers import AutoModelForCausalLM, AutoTokenizer
os.environ['NEURON_CC_FLAGS'] = '--model-type=transformer-inference'
# Load and save the CPU model
split_dir='gptj-split'
model_id='EleutherAI/gpt-j-6b'
revision='sharded'

####### LOAD AND COMPILE THE MODEL #######
model_cpu = AutoModelForCausalLM.from_pretrained(model_id, low_cpu_mem_usage=True, revision=revision)
save_pretrained_split(model_cpu, split_dir)

# Create and compile the Neuron model
model = GPTJForSampling.from_pretrained(split_dir, batch_size=1, tp_degree=2, n_positions=512, amp='f32', unroll=None)
model.to_neuron()
model = HuggingFaceGenerationModelAdapter(model_cpu.config, model)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = 'left'
# https://huggingface.co/amazon/LightGPT


def model_fn(model_dir):
    return model, tokenizer

def predict_fn(data, model_tokenizer):
    model, tokenizer = model_tokenizer
    inputs = data.pop("inputs", data)
    parameters = data.pop("parameters", None)

    # preprocess
    input_ids = tokenizer(inputs, return_tensors="pt").input_ids

    # pass inputs with all kwargs in data
    model.reset_generation()
    if parameters is not None:
        outputs = model.generate(input_ids, **parameters)
    else:
        outputs = model.generate(input_ids, do_sample=True, temperature=0.7)

    # postprocess the prediction
    prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return [{"generated_text": prediction}]

Writing code/inference.py
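Before deploying, it can help to check the request and response contract of `model_fn` and `predict_fn` locally, without a Neuron device. The sketch below mirrors the structure of the script above, but `FakeTokenizer` and `FakeModel` are hypothetical stand-ins invented for this example; they are not part of `transformers` or `transformers-neuronx`, and they only imitate the calls the handler makes.

```python
class FakeTokenizer:
    """Hypothetical stand-in: maps each whitespace-separated word to an integer id."""

    def __init__(self):
        self.vocab = {}
        self.inverse = {}

    def encode(self, text):
        ids = []
        for word in text.split():
            if word not in self.vocab:
                idx = len(self.vocab)
                self.vocab[word] = idx
                self.inverse[idx] = word
            ids.append(self.vocab[word])
        return ids

    def decode(self, ids):
        return " ".join(self.inverse[i] for i in ids)


class FakeModel:
    """Hypothetical stand-in: 'generates' by repeating the last input token."""

    def generate(self, input_ids, max_new_tokens=1, **kwargs):
        return input_ids + [input_ids[-1]] * max_new_tokens


def model_fn(model_dir):
    # The real script returns the compiled Neuron model and its tokenizer here
    return FakeModel(), FakeTokenizer()


def predict_fn(data, model_tokenizer):
    model, tokenizer = model_tokenizer
    inputs = data.pop("inputs", data)
    parameters = data.pop("parameters", None) or {}
    input_ids = tokenizer.encode(inputs)
    output_ids = model.generate(input_ids, **parameters)
    return [{"generated_text": tokenizer.decode(output_ids)}]


handlers = model_fn("unused")
response = predict_fn(
    {"inputs": "Hello world", "parameters": {"max_new_tokens": 2}}, handlers
)
print(response)
```

The payload shape (`inputs` plus optional `parameters`) and the returned list of `{"generated_text": ...}` dictionaries are the parts that carry over to the deployed endpoint.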


## 3. Upload the neuron model and inference script to Amazon S3

Before we can deploy our neuron model to Amazon SageMaker, we need to create a `model.tar.gz` archive containing all our model artifacts, e.g. `model.neuron`, and upload it to Amazon S3.

To do this we need to set up our permissions. Currently `inf2` instances are only available in the `us-east-2` region [[REF](https://aws.amazon.com/de/about-aws/whats-new/2023/05/sagemaker-ml-inf2-ml-trn1-instances-model-deployment/)]. Therefore we need to force the region to us-east-2.

In [7]:
import os 

os.environ["AWS_DEFAULT_REGION"] = "us-east-2" # need to set to ohio region

Now let's create our SageMaker session and upload our model to Amazon S3.

In [8]:
# Never hardcode real credentials in a notebook; replace the placeholders or use your AWS config instead
os.environ["AWS_ACCESS_KEY_ID"] = "<YOUR_AWS_ACCESS_KEY_ID>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<YOUR_AWS_SECRET_ACCESS_KEY>"

In [9]:
import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")
assert sess.boto_region_name == "us-east-2", "region must be us-east-2"

Couldn't call 'get_role' to get Role ARN from role name philippschmid to get Role path.


sagemaker role arn: arn:aws:iam::558105141721:role/sagemaker_execution_role
sagemaker bucket: sagemaker-us-east-2-558105141721
sagemaker session region: us-east-2


Next, we create our `model.tar.gz`. The `inference.py` script will be placed into a `code/` folder.

In [10]:
%cd gptj-transformers

[Errno 2] No such file or directory: 'gptj-transformers'
/home/ubuntu/huggingface-inferentia2-samples/gptj-transformers


In [11]:
# copy inference.py into the code/ directory of the model directory.
!mkdir -p tmp
!cp -r code/ tmp/code/
# create a model.tar.gz archive with all the model artifacts and the inference.py script.
%cd tmp
!tar zcvf model.tar.gz *
%cd ..

/home/ubuntu/huggingface-inferentia2-samples/gptj-transformers/tmp
code/
code/inference.py
code/requirements.txt
/home/ubuntu/huggingface-inferentia2-samples/gptj-transformers
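The same archive can also be built from Python, which avoids changing directories inside the notebook. This is a minimal sketch using the standard-library `tarfile` module; the helper name `build_model_archive` is an assumption made for this example, and the directory layout (a `code/` folder at the archive root) follows the cell above.

```python
import tarfile
from pathlib import Path


def build_model_archive(source_dir: str, archive_path: str) -> list:
    """Pack every entry of `source_dir` at the root of a gzip-compressed tar archive."""
    source = Path(source_dir)
    archive = Path(archive_path).resolve()
    with tarfile.open(archive, "w:gz") as tar:
        for item in sorted(source.iterdir()):
            if item.resolve() == archive:
                continue  # do not add the archive to itself
            # arcname keeps paths relative, e.g. "code/inference.py"
            tar.add(item, arcname=item.name)
    # Re-open the archive to report what was actually written
    with tarfile.open(archive, "r:gz") as tar:
        return tar.getnames()
```

Calling `build_model_archive("tmp", "tmp/model.tar.gz")` would then play the role of the `tar zcvf` command, and the returned names can be checked for `code/inference.py` before uploading.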


Now we can upload our `model.tar.gz` to our session S3 bucket with `sagemaker`.

In [12]:
from sagemaker.s3 import S3Uploader

# create s3 uri
s3_model_path = f"s3://{sess.default_bucket()}/neuronx/gptj"

# upload model.tar.gz
s3_model_uri = S3Uploader.upload(local_path="tmp/model.tar.gz",desired_s3_uri=s3_model_path)
print(f"model artifacts uploaded to {s3_model_uri}")

model artifacts uploaded to s3://sagemaker-us-east-2-558105141721/neuronx/gptj/model.tar.gz


In [13]:
# clean tmp directory after uploading
# !rm -rf tmp

## 4. Deploy a Real-time Inference Endpoint on Amazon SageMaker

After we have uploaded our `model.tar.gz` to Amazon S3, we can create a custom `HuggingFaceModel`. This class will be used to create and deploy our real-time inference endpoint on Amazon SageMaker.

The `inf2.xlarge` instance type is the smallest instance type with AWS Inferentia2 support. It comes with 1 Inferentia2 chip with 2 Neuron Cores. This means we can use 2 Model server workers to maximize throughput and run 2 inferences in parallel.
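The relationship between Neuron cores and model server workers is simple integer arithmetic, sketched below. The function name `max_model_server_workers` is hypothetical. The sketch assumes each worker holds its own copy of the model and that a model compiled with `tp_degree=2` (as in `inference.py`) occupies two cores per copy.

```python
def max_model_server_workers(neuron_cores: int, cores_per_worker: int) -> int:
    """Number of model copies that fit when each worker needs `cores_per_worker` cores."""
    if cores_per_worker < 1:
        raise ValueError("cores_per_worker must be at least 1")
    return neuron_cores // cores_per_worker


# 2 Neuron cores, one core per worker: two workers run in parallel
single_core_workers = max_model_server_workers(neuron_cores=2, cores_per_worker=1)

# 2 Neuron cores, model sharded over both cores (tp_degree=2): one worker
tensor_parallel_workers = max_model_server_workers(neuron_cores=2, cores_per_worker=2)
```

Under these assumptions, the tensor-parallel case leaves room for a single worker on two cores, which is consistent with `model_server_workers=1` in the cell below.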

In [14]:
from sagemaker.huggingface.model import HuggingFaceModel

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   model_data=s3_model_uri,        # path to your model and script
   role=role,                      # iam role with permissions to create an Endpoint
   transformers_version="4.28.1",  # transformers version used
   pytorch_version="1.13.0",       # pytorch version used
   py_version='py38',              # python version used
   model_server_workers=1,         # number of workers for the model server
)

# Let SageMaker know that we've already compiled the model
huggingface_model._is_compiled_model = True

# deploy the endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=1,         # number of instances
    instance_type="ml.inf2.8xlarge",  # AWS Inferentia2 instance
    # NOTE: the original cell ended with an unfinished `timeout` argument; no value was given
)

-------------!

## 5. Run and evaluate Inference performance of BERT on Inferentia2

The `.deploy()` method returns a `HuggingFacePredictor` object, which can be used to request inference.

In [17]:
data = {
  "inputs": "the mesmerizing performances of the leads keep the film grounded and keep the audience riveted .",
}

res = predictor.predict(data=data)
res

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-2.console.aws.amazon.com/cloudwatch/home?region=us-east-2#logEventViewer:group=/aws/sagemaker/Endpoints/huggingface-pytorch-inference-neuronx-m-2023-06-29-14-46-49-807 in account 558105141721 for more information.

We managed to deploy our neuron compiled BERT to AWS Inferentia on Amazon SageMaker. Now, let's test its performance. As a dummy load test, we will use threading to send 10000 requests to our endpoint with 10 threads.

_Note: When running the load test, the environment was based in Europe and the endpoint was deployed in us-east-2._
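A threaded load test of this kind could look like the sketch below. It is not the notebook's original load-test code: `run_load_test` and `fake_endpoint` are hypothetical names, and the fake endpoint only sleeps for a millisecond so the example runs without a deployed endpoint.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, quantiles


def run_load_test(send_request, payload, num_requests=10_000, num_threads=10):
    """Send `num_requests` requests from `num_threads` threads and collect latencies in ms."""

    def timed_call(_):
        start = time.perf_counter()
        send_request(dict(payload))  # copy the payload, since a handler may mutate it
        return (time.perf_counter() - start) * 1000.0

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        latencies = list(pool.map(timed_call, range(num_requests)))

    return {
        "count": len(latencies),
        "mean_ms": mean(latencies),
        # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile
        "p95_ms": quantiles(latencies, n=100)[94],
    }


def fake_endpoint(payload):
    """Hypothetical stand-in for a deployed endpoint."""
    time.sleep(0.001)
    return [{"generated_text": payload["inputs"]}]


stats = run_load_test(fake_endpoint, {"inputs": "Hello"}, num_requests=50, num_threads=10)
print(stats)
```

Against a real endpoint, `send_request` could be `lambda p: predictor.predict(data=p)`. Because the requests cross the Atlantic in the setup described above, the measured latency includes network round-trip time in addition to model latency.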

### Delete model and endpoint

To clean up, we can delete the model and endpoint.

In [18]:
predictor.delete_model()
predictor.delete_endpoint()