# Deploy GPT-J with Elastic Inference on Amazon SageMaker


# Setup

To start, we import some Python libraries and initialize a SageMaker session, S3 bucket and prefix, and IAM role.

In [None]:
# need torch 1.3.1 for elastic inference
!pip install torch==1.3.1
!pip install transformers

In [None]:
import os
import numpy as np
import pandas as pd
import sagemaker

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = "sagemaker/pytorch-gptj"

role = sagemaker.get_execution_role()

## Use a pretrained model

## Elastic Inference

Selecting the right instance type for inference requires balancing GPU, CPU, and memory resources. Optimizing for one of these on a standalone GPU instance usually leaves the others under-utilized. [Amazon Elastic Inference](https://aws.amazon.com/machine-learning/elastic-inference/) addresses this problem by letting us attach the right amount of GPU-powered inference acceleration to our endpoint. In March 2020, [Elastic Inference support for PyTorch became available](https://aws.amazon.com/blogs/machine-learning/reduce-ml-inference-costs-on-amazon-sagemaker-for-pytorch-models-using-amazon-elastic-inference/) for both Amazon SageMaker and Amazon EC2.

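To use Elastic Inference, we must convert our model to TorchScript. This notebook starts from pretrained GPT-J weights rather than from a model trained by a SageMaker estimator, so there is no `estimator.model_data` artifact to download.

First, we install the GPT-J branch of `transformers`, create a local `model` folder, and download the model configuration (`config.json`) and the pretrained weights (`pytorch_model.bin`) into it.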

In [None]:
%%sh
pip install git+https://github.com/finetuneanon/transformers@gpt-j
mkdir model
curl https://gist.githubusercontent.com/finetuneanon/a55bdb3f5881e361faef0e96e1d41f09/raw/e5a38dad34ff42bbad188afd5e4fdb2ab2eacb6d/gpt-j-6b.json > model/config.json
curl https://zhisu-nlp.s3.us-west-2.amazonaws.com/gpt-j-hf/pytorch_model.bin  > model/pytorch_model.bin

In [None]:
import transformers
from transformers import GPT2Tokenizer, GPTNeoForCausalLM
tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")
model = GPTNeoForCausalLM.from_pretrained("./model")


input_text = "The following is a never ending conversation with an AI assistant. The assistant is Dyson 360 Heurist vacuum cleaner. The assistant is always helpful, creative, clever, positive, intelligent and very friendly. The assistant enjoys being a vacuum cleaner. The assistant has recently read and enjoyed lots of books. The assistant has enjoyed seeing lots of movies. The assistant has loved visiting many countries. Human: Hello, who are you? AI: I am vacuum cleaner. How can I help you today? Human: I want to buy a vacuum cleaner AI: You have come to the right place! I'm a great vacuum cleaner and I can give detailed answers to any enquiry you may have. Human: Please tell me everything you know about climate change. AI:"
input_ids = tokenizer.encode(str(input_text), return_tensors='pt')
output = model.generate(
    input_ids,
    do_sample=True,
    max_length=300,
    top_p=0.2,
    top_k=0,
    temperature=0.1,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))

In [None]:
model.save_pretrained("./newmodel")

The following code converts our model into the TorchScript format:

In [None]:
import subprocess
import torch
from transformers import GPTNeoForCausalLM

model_torchScript = GPTNeoForCausalLM.from_pretrained("newmodel/", torchscript=True)
device = "cpu"
# max length for the sentences: 256
max_len = 256

# Dummy inputs, used only to record the graph during tracing
for_jit_trace_input_ids = [0] * max_len
for_jit_trace_attention_masks = [0] * max_len
for_jit_trace_input = torch.tensor([for_jit_trace_input_ids])
for_jit_trace_masks = torch.tensor([for_jit_trace_attention_masks])

traced_model = torch.jit.trace(
    model_torchScript, [for_jit_trace_input.to(device), for_jit_trace_masks.to(device)]
)
torch.jit.save(traced_model, "traced_gptj.pt")

subprocess.call(["tar", "-czvf", "traced_gptj.tar.gz", "traced_gptj.pt"])
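
Because the trace above uses a fixed input length of `max_len = 256`, it is safest to feed the traced model inputs of that same length at inference time. The following is a minimal sketch of one way to prepare them, assuming token IDs arrive as a plain Python list. The helper name `pad_or_truncate` and the `pad_id=0` default are hypothetical choices for illustration and are not part of the original notebook.

```python
from typing import List, Tuple


def pad_or_truncate(
    token_ids: List[int], max_len: int = 256, pad_id: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (input_ids, attention_mask), both exactly max_len long.

    pad_id=0 is an assumption for illustration; use the padding value
    that matches your tokenizer and model.
    """
    # Keep at most max_len tokens.
    kept = token_ids[:max_len]
    n_pad = max_len - len(kept)
    # 1 marks a real token, 0 marks padding.
    attention_mask = [1] * len(kept) + [0] * n_pad
    input_ids = kept + [pad_id] * n_pad
    return input_ids, attention_mask


ids, mask = pad_or_truncate([11, 22, 33], max_len=5)
print(ids)   # [11, 22, 33, 0, 0]
print(mask)  # [1, 1, 1, 0, 0]
```

With right padding as sketched here, a causal language model's prediction for the next token sits at index `sum(mask) - 1` rather than at the final position.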

Loading the TorchScript model and using it for prediction requires small changes to our model loading and prediction functions. We create a new script, `deploy_ei.py`, that differs slightly from the `train_deploy.py` script.

In [None]:
!pygmentize code/deploy_ei.py
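
The contents of `code/deploy_ei.py` are not reproduced in this notebook. As a rough guide to what such a script might contain, the sketch below uses the handler names the SageMaker PyTorch serving container looks for (`model_fn`, `input_fn`, `predict_fn`, `output_fn`). Everything inside the handlers is an assumption rather than the author's actual script: the file name `traced_gptj.pt` comes from the tracing cell, the tokenizer mirrors the earlier cell, the zero padding value and the single-token response are illustrative, and the two-argument `torch.jit.optimized_execution` call is only accepted by the Elastic Inference build of PyTorch.

```python
import json
import os

MAX_LEN = 256  # assumption: must match max_len used when tracing
MODEL_FILE = "traced_gptj.pt"  # file name saved by the tracing cell


def model_fn(model_dir):
    """Load the TorchScript model from the unpacked model archive."""
    import torch  # lazy import: only needed inside the serving container

    model = torch.jit.load(os.path.join(model_dir, MODEL_FILE), map_location="cpu")
    model.eval()
    return model


def input_fn(request_body, request_content_type):
    """Decode the JSON request body into a prompt string."""
    if request_content_type != "application/json":
        raise ValueError("Unsupported content type: {}".format(request_content_type))
    prompt = json.loads(request_body)
    if not isinstance(prompt, str) or not prompt:
        raise ValueError("Expected a non-empty JSON string")
    return prompt


def predict_fn(prompt, model):
    """Pad the prompt to MAX_LEN and return the most likely next token."""
    import torch
    from transformers import GPT2Tokenizer

    # Reloaded on every request for brevity; a real script would cache it.
    tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")
    ids = tokenizer.encode(prompt)[:MAX_LEN]
    n_pad = MAX_LEN - len(ids)
    input_ids = torch.tensor([ids + [0] * n_pad])  # 0 is an assumed pad value
    attention_mask = torch.tensor([[1] * len(ids) + [0] * n_pad])
    with torch.no_grad():
        try:
            # Two-argument form: accepted only by the Elastic Inference build.
            with torch.jit.optimized_execution(True, {"target_device": "eia:0"}):
                logits = model(input_ids, attention_mask)[0]
        except TypeError:
            # Standard PyTorch: fall back to running on the host instance.
            logits = model(input_ids, attention_mask)[0]
    # Logits at the last real token give the next-token distribution.
    next_id = int(logits[0, len(ids) - 1].argmax())
    return {"next_token": tokenizer.decode([next_id])}


def output_fn(prediction, accept):
    """Serialize the prediction dictionary as JSON."""
    return json.dumps(prediction)
```

Returning a single decoded token keeps the response small. Sending back the full logits tensor for a 256-token input would produce a very large JSON payload.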

Next, we upload the TorchScript model to S3 and deploy it using Elastic Inference. The `accelerator_type='ml.eia2.xlarge'` parameter is how we attach the Elastic Inference accelerator to our endpoint.

In [None]:
from sagemaker.pytorch import PyTorchModel

instance_type = 'ml.r5d.12xlarge'
accelerator_type = 'ml.eia2.xlarge'

# TorchScript model
tar_filename = 'traced_gptj.tar.gz'

print('Upload tarball to S3')
# upload_data returns the S3 URI of the uploaded tarball
model_data = sagemaker_session.upload_data(path=tar_filename, bucket=bucket, key_prefix=prefix)

In [None]:
import time

endpoint_name = 'gptj-ei-traced-{}-{}-{}'.format(instance_type, 
                                                 accelerator_type, time.time()).replace('.', '').replace('_', '')

pytorch = PyTorchModel(
    model_data=model_data,
    role=role,
    entry_point='deploy_ei.py',
    source_dir='code',
    framework_version='1.3.1',
    py_version='py3',
    sagemaker_session=sagemaker_session
)

# With wait=True, deploy() blocks until the endpoint has been created
predictor = pytorch.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    accelerator_type=accelerator_type,
    endpoint_name=endpoint_name,
    wait=True,
)
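
SageMaker endpoint names may contain only letters, digits, and hyphens, and are limited to 63 characters, which is why the cell above strips dots and underscores from the formatted string. A minimal sketch of a reusable helper for this follows. The function name `make_endpoint_name` and its parameters are hypothetical and not part of the SageMaker SDK.

```python
import re
import time


def make_endpoint_name(base, instance_type, accelerator_type, timestamp=None):
    """Build a name using only letters, digits, and hyphens, capped at 63 characters."""
    if timestamp is None:
        timestamp = int(time.time())
    raw = "{}-{}-{}-{}".format(base, instance_type, accelerator_type, timestamp)
    # Drop anything that is not alphanumeric or a hyphen (for example "." and "_").
    cleaned = re.sub(r"[^A-Za-z0-9-]", "", raw)
    # Truncate to the length limit, then strip any hyphen left at either end.
    return cleaned[:63].strip("-")


name = make_endpoint_name(
    "gptj-ei-traced", "ml.r5d.12xlarge", "ml.eia2.xlarge", timestamp=1600000000
)
print(name)  # gptj-ei-traced-mlr5d12xlarge-mleia2xlarge-1600000000
```

Using an integer timestamp keeps the name shorter than the raw `time.time()` float. Note that truncation cuts from the end, so a very long base name would lose part of the timestamp.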

In [None]:
predictor.serializer = sagemaker.serializers.JSONSerializer()
predictor.deserializer = sagemaker.deserializers.JSONDeserializer()

In [None]:
res = predictor.predict('Please remember to delete me when you are done.')
print("Predicted class:", np.argmax(res, axis=1))

# Cleanup

Lastly, please remember to delete the Amazon SageMaker endpoint to avoid charges:

In [None]:
predictor.delete_endpoint()