# Fine-tuning and Deploying a PyTorch BERT Model with Amazon Elastic Inference on Amazon SageMaker

Text classification is a technique for putting text into different categories, and has a wide range of applications: email providers use text classification to detect spam emails, marketing agencies use it for sentiment analysis of customer reviews, and discussion forum moderators use it to detect inappropriate comments.

In the past, data scientists used methods such as tf-idf, word2vec, or bag-of-words (BOW) to generate features for training classification models. Although these techniques have been very successful in many natural language processing (NLP) tasks, they don’t always capture the meanings of words accurately when they appear in different contexts. Recently, there has been growing interest in using Bidirectional Encoder Representations from Transformers (BERT) for text classification, because it encodes the meaning of words in different contexts more accurately.
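
To make the limitation of count-based features concrete, here is a minimal bag-of-words sketch in plain Python. It is written for illustration only and is not code from this post; the two example sentences are made up. Because word order is discarded, two sentences with opposite meanings end up with identical features:

```python
from collections import Counter


def bag_of_words(sentence):
    """Count lowercase word occurrences, ignoring word order."""
    return Counter(sentence.lower().split())


a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

# Word order is discarded, so both sentences get identical features.
print(a == b)
```

A contextual model such as BERT instead produces a representation for each token that depends on the surrounding words.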

Amazon SageMaker is a fully managed service that provides developers and data scientists the ability to build, train, and deploy machine learning (ML) models quickly. Amazon SageMaker removes the heavy lifting from each step of the ML process to make it easier to develop high-quality models. The Amazon SageMaker Python SDK provides open-source APIs and containers that make it easy to train and deploy models in Amazon SageMaker with several different ML and deep learning frameworks.

Our customers often ask for quick fine-tuning and easy deployment of their NLP models. Furthermore, customers prefer low inference latency and low model inference cost. Amazon Elastic Inference enables attaching GPU-powered inference acceleration to endpoints, which reduces the cost of deep learning inference without sacrificing performance.

This post demonstrates how to use Amazon SageMaker to fine-tune a PyTorch BERT model and deploy it with Elastic Inference. The code from this post is available in the GitHub repo. For more information about BERT fine-tuning, see BERT Fine-Tuning Tutorial with PyTorch.

We don't go into the details of the BERT model developed by Google here. You can find them in the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805.pdf).

### BERT fine-tuning
One of the biggest challenges data scientists face for NLP projects is lack of training data; you often have only a few thousand pieces of human-labeled text data for your model training. However, modern deep learning NLP tasks require a large amount of labeled data. One way to solve this problem is to use transfer learning.

Transfer learning is an ML method where a pretrained model, such as a pretrained ResNet model for image classification, is reused as the starting point for a different but related problem. By reusing parameters from pretrained models, you can save significant amounts of training time and cost.

BERT was trained on BookCorpus and English Wikipedia data, which contain 800 million words and 2,500 million words, respectively [1]. Training BERT from scratch would be prohibitively expensive. By taking advantage of transfer learning, you can quickly fine-tune BERT for another use case with a relatively small amount of training data to achieve state-of-the-art results for common NLP tasks, such as text classification and question answering.

In [1]:
import numpy as np                                # For matrix operations and numerical processing
import pandas as pd                               # For munging tabular data
import matplotlib.pyplot as plt                   # For charts and visualizations
from IPython.display import Image                 # For displaying images in the notebook
from IPython.display import display               # For displaying outputs in the notebook
from time import gmtime, strftime                 # For labeling SageMaker models, endpoints, etc.
import sys                                        # For writing outputs to notebook
import math                                       # For ceiling function
import json                                       # For parsing hosting outputs
import os                                         # For manipulating filepath names
import boto3
import botocore
from botocore.exceptions import ClientError

import csv
import io
import re
import s3fs


import sagemaker                                 
from sagemaker.predictor import csv_serializer 
from sagemaker.predictor import json_deserializer
import sagemaker.amazon.common as smac
from sagemaker import get_execution_role

### 1. Preparation (Specifying SageMaker roles)

In [2]:
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()                     
prefix = 'pytorch-bert'
region = boto3.Session().region_name
role = 'arn:aws:iam::570447867175:role/SageMakerNotebookRole' # replace with the ARN of your own IAM role

print('Sagemaker session :', sagemaker_session)
print('S3 bucket :', bucket)
print('Prefix :', prefix)
print('Region selected :', region)
print('IAM role :', role)

Sagemaker session : <sagemaker.session.Session object at 0x0000023BF7DD3508>
S3 bucket : sagemaker-us-west-2-570447867175
Prefix : pytorch-bert
Region selected : us-west-2
IAM role : arn:aws:iam::570447867175:role/SageMakerNotebookRole


### 2. Load Data

We use Corpus of Linguistic Acceptability (CoLA) (https://nyu-mll.github.io/CoLA/), a dataset of 10,657 English sentences labeled as grammatical or ungrammatical from published linguistics literature. We download and unzip the data using the following code:

In [3]:
# Run below for CURL
#if not os.path.exists("./data/cola_public_1.1.zip"):
#    !curl -o ./cola_public_1.1.zip https://nyu-mll.github.io/CoLA/cola_public_1.1.zip
#if not os.path.exists("./cola_public/"):
#    !unzip cola_public_1.1.zip

In the training data (we use 'in_domain_train.tsv' file), the only two columns we need are the sentence and its label:

In [12]:
df = pd.read_csv(
    "./data/in_domain_train.tsv",
    sep="\t",
    header=None,
    usecols=[1, 3],
    names=["label", "sentence"],
)
sentences = df.sentence.values
labels = df.label.values

In [13]:
print(sentences[20:25])
print(labels[20:25])

['The professor talked us.' 'We yelled ourselves hoarse.'
 'We yelled ourselves.' 'We yelled Harry hoarse.'
 'Harry coughed himself into a fit.']
[0 1 0 0 1]
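
The column selection above can be sketched without pandas. This assumes the four-column CoLA layout (source code, label, original annotation, sentence); the source code `src1` and the annotation values are made up for illustration, while the two sentences and labels are taken from the output above:

```python
import csv
import io

# Two illustrative rows in the assumed four-column layout:
# source code, label, original annotation, sentence. "src1" is made up.
sample_tsv = (
    "src1\t0\t*\tThe professor talked us.\n"
    "src1\t1\t\tWe yelled ourselves hoarse.\n"
)

reader = csv.reader(io.StringIO(sample_tsv), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)

sample_labels = [int(row[1]) for row in rows]  # column 1: acceptability label
sample_sentences = [row[3] for row in rows]    # column 3: sentence text

print(sample_labels)
print(sample_sentences)
```

Columns 0 and 2 are skipped, which mirrors `usecols=[1, 3]` in the `pd.read_csv` call.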


### 3. Data Pre-processing

In [6]:
# We then split the dataset for training and testing before uploading both to S3 Bucket
from sklearn.model_selection import train_test_split

train, test = train_test_split(df)
train.to_csv("./data/CoLA_train.csv", index=False)
test.to_csv("./data/CoLA_test.csv", index=False)
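
When called without a size argument, `train_test_split` shuffles the rows and holds out 25% of them for testing. The standard-library sketch below illustrates that behavior under the assumption of a simple shuffle-and-slice; the function name `simple_train_test_split` and the `seed` parameter are hypothetical and are not part of scikit-learn:

```python
import math
import random


def simple_train_test_split(rows, test_size=0.25, seed=0):
    """Shuffle rows and hold out a test_size fraction (rounded up) for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


train_rows, test_rows = simple_train_test_split(range(10))
print(len(train_rows), len(test_rows))  # 7 3
```

For a reproducible split with scikit-learn itself, passing `random_state` to `train_test_split` serves the same purpose as `seed` here.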

##### Add Data to S3 Bucket

In [8]:
inputs_train = sagemaker_session.upload_data("./data/CoLA_train.csv", bucket=bucket, key_prefix=prefix)
inputs_test = sagemaker_session.upload_data("./data/CoLA_test.csv", bucket=bucket, key_prefix=prefix)

print('train input path :', inputs_train)
print('test input path :', inputs_test)

train input path : s3://sagemaker-us-west-2-570447867175/pytorch-bert/CoLA_train.csv
test input path : s3://sagemaker-us-west-2-570447867175/pytorch-bert/CoLA_test.csv
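
The returned paths follow a simple pattern: bucket, then key prefix, then the file name. A small sketch of that composition follows; `s3_uri` is a hypothetical helper and `example-bucket` is a placeholder, not the bucket used in this post:

```python
import posixpath


def s3_uri(bucket, key_prefix, local_path):
    """Build the S3 URI expected for a file uploaded under key_prefix."""
    filename = posixpath.basename(local_path.replace("\\", "/"))
    return "s3://{}/{}".format(bucket, posixpath.join(key_prefix, filename))


# "example-bucket" is a placeholder bucket name.
print(s3_uri("example-bucket", "pytorch-bert", "./data/CoLA_train.csv"))
```

Only the file name is kept, so the local `./data/` directory does not appear in the S3 key.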


### 4. Start Training

**Training script**
We use the PyTorch-Transformers library, which contains PyTorch implementations and pre-trained model weights for many NLP models, including BERT.

Our training script should save model artifacts learned during training to a file path called model_dir, as stipulated by the SageMaker PyTorch image. Upon completion of training, model artifacts saved in model_dir will be uploaded to S3 by SageMaker and will become available in S3 for deployment.

We save this script in a file named train_deploy.py, and put the file in a directory named tools/. The full training script can be viewed under tools/.

In [14]:
!pygmentize tools/train_deploy.py

import argparse
import json
import logging
import os
import sys

import numpy as np
import pandas as pd
import torch
import torch.distributed as dist
import torch.utils.data
import torch.utils.data.distributed
from torch.utils.data import DataLoader, RandomSampler, TensorDataset
from transformers import AdamW, BertForSequenceClassification, BertTokenizer

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(sys.stdout))

MAX_LEN = 64  # this is the max length of the sentence

print("Loading BERT tokenizer...")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)


def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)


def _get_train_data_loader(batch_size, training_dir, is_distributed):
    logger.info("Get train data loader")

    dataset = pd.read_csv(os.path.join(training

As described above, the training script saves the trained model artifacts to model_dir, and Amazon SageMaker uploads them to Amazon S3 when training completes so they are available for deployment.
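
A minimal sketch of the save step, under stated assumptions: the script receives the output location through a `--model-dir` argument that defaults to the `SM_MODEL_DIR` environment variable set inside the training container, and the model exposes the Hugging Face `save_pretrained` method. The helper names `parse_model_dir` and `save_model` and the `./model` fallback are hypothetical, and the exact code in train_deploy.py may differ:

```python
import argparse
import os


def parse_model_dir(argv=None):
    """Read --model-dir, defaulting to the SM_MODEL_DIR environment variable."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model-dir",
        type=str,
        default=os.environ.get("SM_MODEL_DIR", "./model"),  # local fallback (assumption)
    )
    args, _ = parser.parse_known_args(argv)
    return args.model_dir


def save_model(model, model_dir):
    """Save a model, unwrapping a DataParallel-style wrapper if present."""
    model_to_save = model.module if hasattr(model, "module") else model
    model_to_save.save_pretrained(model_dir)
    return model_dir
```

Saving with `save_pretrained` is what lets `BertForSequenceClassification.from_pretrained(model_dir)` reload the weights and configuration at serving time.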

We use Amazon SageMaker to train and deploy a model using our custom PyTorch code. The Amazon SageMaker Python SDK makes it easier to run a PyTorch script in Amazon SageMaker using its PyTorch estimator. After that, we can use the SageMaker Python SDK to deploy the trained model and run predictions. For more information about using this SDK with PyTorch, see Using PyTorch with the SageMaker Python SDK.

To start, we use the PyTorch estimator class to train our model. When creating the estimator, we make sure to specify the following:

- entry_point – The name of the PyTorch script
- source_dir – The location of the training script and requirements.txt file
- framework_version – The PyTorch version we want to use

The PyTorch estimator supports multi-machine, distributed PyTorch training. To use this, we just set train_instance_count to be greater than 1. Our training script supports distributed training only on GPU instances.

After creating the estimator, we call fit(), which launches a training job. We use the Amazon S3 URIs we uploaded the training data to earlier. See the following code:

In [15]:
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train_deploy.py",  # script that loads data from the input channels, trains the model, and saves it
    source_dir="tools",  # directory containing the training script and requirements.txt
    role=role,
    framework_version="1.3.1",
    py_version="py3",
    train_instance_count=1,  # the script supports distributed training only on GPU instances
    train_instance_type="ml.p2.xlarge",
    hyperparameters={
        "epochs": 1,
        "num_labels": 2,
        "backend": "gloo",
    }
)
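
Inside the training container, the entries of `hyperparameters` reach the script as command-line arguments, and each input channel is exposed through an `SM_CHANNEL_<NAME>` environment variable. The sketch below shows how a script might read both. The argument names `--data-dir` and `--test` and all default values are assumptions for illustration, not necessarily those used in train_deploy.py:

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and input channel locations inside a training script."""
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line arguments.
    parser.add_argument("--epochs", type=int, default=2)
    parser.add_argument("--num_labels", type=int, default=2)
    parser.add_argument("--backend", type=str, default=None)
    # Channel directories come from environment variables set in the container.
    parser.add_argument("--data-dir", default=os.environ.get("SM_CHANNEL_TRAINING"))
    parser.add_argument("--test", default=os.environ.get("SM_CHANNEL_TESTING"))
    return parser.parse_args(argv)


args = parse_args(["--epochs", "1", "--num_labels", "2", "--backend", "gloo"])
print(args.epochs, args.num_labels, args.backend)
```

With channels named "training" and "testing", the two environment variables point at the directories where SageMaker has downloaded the corresponding S3 data.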

In [16]:
estimator.fit({"training": inputs_train, "testing": inputs_test})

'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


2020-07-22 05:22:22 Starting - Starting the training job...
2020-07-22 05:22:24 Starting - Launching requested ML instances.........
2020-07-22 05:24:22 Starting - Preparing the instances for training.........
2020-07-22 05:25:48 Downloading - Downloading input data......
2020-07-22 05:26:37 Training - Downloading the training image........bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-07-22 05:28:16,655 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-07-22 05:28:16,681 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-07-22 05:28:22,906 sagemaker_pytorch_container.training INFO     Invoking user training script.
2020-07-22 05:28:23,283 sagemaker-containers INFO     Module default_user_module_name does not provide a setup.py.
Generating setup.py
2020-07-22 05:28:23

### 5. Deploy the model

After training our model, we host it on an Amazon SageMaker endpoint by calling deploy on the PyTorch estimator. The endpoint runs an Amazon SageMaker PyTorch model server. We need to configure two components of the server: model loading and model serving. We implement these two components in our inference script train_deploy.py. 

'model_fn()' loads the saved model and returns a model object that can be used for model serving. The SageMaker PyTorch model server loads our model by invoking model_fn:

In [None]:
def model_fn(model_dir):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = BertForSequenceClassification.from_pretrained(model_dir)
    return model.to(device)

'input_fn()' deserializes and prepares the prediction input. In this use case, our request body is first serialized to JSON and then sent to the model serving endpoint. Therefore, in input_fn(), we first deserialize the JSON-formatted request body, then return the padded token IDs and the attention mask as torch.tensor objects, as required for BERT:

In [None]:
def input_fn(request_body, request_content_type):
    if request_content_type == "application/json":
        sentence = json.loads(request_body)
        
        input_ids = []
        encoded_sent = tokenizer.encode(sentence,add_special_tokens = True)
        input_ids.append(encoded_sent)
    
        # pad shorter sentences
        input_ids_padded =[]
        for i in input_ids:
            while len(i) < MAX_LEN:
                i.append(0)
            input_ids_padded.append(i)
        input_ids = input_ids_padded
    
        # mask; 0: added, 1: otherwise
        attention_masks = [[int(token_id > 0) for token_id in sent] for sent in input_ids]

        # convert to PyTorch data types.
        train_inputs = torch.tensor(input_ids)
        train_masks = torch.tensor(attention_masks)
    
        # train_data = TensorDataset(train_inputs, train_masks)
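        return train_inputs, train_masks
    # fail loudly instead of silently returning None for other content types
    raise ValueError("Unsupported content type: {}".format(request_content_type))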
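
The padding and masking logic in input_fn() can be tried in isolation, without PyTorch or a tokenizer. The sketch below uses a hypothetical helper, `pad_and_mask`, and made-up token IDs. Unlike the function above, it also truncates inputs longer than the maximum length, which is an added assumption:

```python
MAX_LEN = 64  # same maximum sequence length as in the script


def pad_and_mask(token_ids, max_len=MAX_LEN, pad_id=0):
    """Truncate or right-pad token IDs to max_len and build the attention mask."""
    ids = list(token_ids)[:max_len]
    ids = ids + [pad_id] * (max_len - len(ids))
    # 1 marks a real token, 0 marks padding
    mask = [int(token_id != pad_id) for token_id in ids]
    return ids, mask


# Hypothetical token IDs; 101 and 102 are [CLS] and [SEP] in bert-base-uncased.
ids, mask = pad_and_mask([101, 2057, 102], max_len=6)
print(ids)   # [101, 2057, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 0, 0, 0]
```

Note that plain truncation drops the trailing [SEP] token of an overlong sentence; a production version might prefer the tokenizer's own truncation options.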

predict_fn() performs the prediction and returns the result. See the following code:

In [None]:
def predict_fn(input_data, model):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()
    input_id, input_mask = input_data
    # Tensor.to() returns a new tensor, so the result must be reassigned
    input_id = input_id.to(device)
    input_mask = input_mask.to(device)
    with torch.no_grad():
        return model(input_id, token_type_ids=None,attention_mask=input_mask)[0]

##### Deploy the model with deploy() function

In [None]:
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.p2.xlarge")

We then configure the predictor to use application/json for the content type when sending requests to our endpoint:

In [19]:
from sagemaker.predictor import json_deserializer, json_serializer

predictor.content_type = "application/json"
predictor.accept = "application/json"
predictor.serializer = json_serializer
predictor.deserializer = json_deserializer
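
The round trip between the client-side serializer and input_fn() amounts to a JSON encode followed by a JSON decode. A minimal sketch with the standard library, using one of the CoLA sentences shown earlier:

```python
import json

sentence = "We yelled ourselves hoarse."

request_body = json.dumps(sentence)  # what a JSON serializer sends to the endpoint
received = json.loads(request_body)  # what input_fn recovers on the endpoint

print(request_body)
print(received == sentence)
```

Because the payload is a bare JSON string rather than an object, `json.loads` in input_fn() returns the sentence directly.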

Finally, we use the returned predictor object to call the endpoint:

In [20]:
result = predictor.predict("Somebody just left - guess who.")
print(np.argmax(result, axis=1))

[1]
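
The endpoint returns raw logits, one per class, and `np.argmax` picks the larger one. If probabilities are wanted as well, a softmax can be applied first. The sketch below uses the standard library only; the logit values are made up, and the class order (0 for unacceptable, 1 for acceptable) follows the CoLA labels:

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one sentence: [unacceptable, acceptable].
logits = [-1.2, 2.3]
probs = softmax(logits)
label = max(range(len(probs)), key=probs.__getitem__)
print(label)
```

Softmax preserves the ordering of the logits, so the predicted label is the same with or without it.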


In [21]:
predictor.delete_endpoint()

### 6. (Optional) Use a pretrained model

If you want to reuse a model that has already been trained, you can create a PyTorchModel from existing model artifacts in an S3 bucket:

In [None]:
from sagemaker.pytorch.model import PyTorchModel 

pytorch_model = PyTorchModel(model_data="<S3 location>/model.tar.gz",
                             role=role,
                             framework_version="1.3.1",
                             source_dir="tools",
                             entry_point="train_deploy.py")

predictor = pytorch_model.deploy(initial_instance_count=1, instance_type="ml.p2.xlarge")

predictor.delete_endpoint()

### 7. Deploy the model (endpoint) with Elastic Inference

Selecting the right instance type for inference requires deciding between different amounts of GPU, CPU, and memory resources. Optimizing for one of these resources on a standalone GPU instance usually leads to underutilization of other resources. Elastic Inference solves this problem by enabling you to attach the right amount of GPU-powered inference acceleration to your endpoint. In March 2020, Elastic Inference support for PyTorch became available for both Amazon SageMaker and Amazon EC2.

To use Elastic Inference, we must first convert our trained model to TorchScript. For more information, see [Reduce ML inference costs on Amazon SageMaker for PyTorch models using Amazon Elastic Inference](https://aws.amazon.com/ko/blogs/machine-learning/reduce-ml-inference-costs-on-amazon-sagemaker-for-pytorch-models-using-amazon-elastic-inference/).

In [22]:
estimator.model_data

's3://sagemaker-us-west-2-570447867175/pytorch-training-2020-07-22-05-22-19-755/output/model.tar.gz'

First, we create a folder for the trained model and download the model.tar.gz file into it.

In [None]:
%%sh -s $estimator.model_data
mkdir -p model
aws s3 cp $1 model/
tar xvzf model/model.tar.gz --directory ./model

With the trained model artifacts downloaded from estimator.model_data, we convert the model to TorchScript using the following code:

In [25]:
import subprocess
import torch
from transformers import BertForSequenceClassification

model_torchScript = BertForSequenceClassification.from_pretrained("./model/", torchscript=True)
device = "cpu"
for_jit_trace_input_ids = [0] * 64
for_jit_trace_attention_masks = [0] * 64
for_jit_trace_input = torch.tensor([for_jit_trace_input_ids])
for_jit_trace_masks = torch.tensor([for_jit_trace_attention_masks])

traced_model = torch.jit.trace(
    model_torchScript, [for_jit_trace_input.to(device), for_jit_trace_masks.to(device)]
)
torch.jit.save(traced_model, "traced_bert.pt")

subprocess.call(["tar", "-czvf", "traced_bert.tar.gz", "traced_bert.pt"])

I0721 23:19:17.148522 24952 configuration_utils.py:283] loading configuration file ./model/config.json
I0721 23:19:17.150515 24952 configuration_utils.py:321] Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "torchscript": true,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

I0721 23:19:17.152509 24952 modeling_utils.py:615] loading weights file ./model/pytorch_model.bin


0
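
The archive step above shells out to `tar`, which may not be available on every system. As a portable alternative, Python's `tarfile` module can build the same kind of archive. This is a sketch rather than the method used in this post; `make_tarball` is a hypothetical helper, and the demo runs on a throwaway placeholder file instead of a real traced model:

```python
import os
import tarfile
import tempfile


def make_tarball(source_path, tar_path):
    """Write source_path into a gzip-compressed tar archive at its top level."""
    with tarfile.open(tar_path, "w:gz") as archive:
        archive.add(source_path, arcname=os.path.basename(source_path))
    return tar_path


# Demonstrate on a throwaway file rather than a real traced model.
workdir = tempfile.mkdtemp()
dummy = os.path.join(workdir, "traced_bert.pt")
with open(dummy, "wb") as handle:
    handle.write(b"placeholder")

tar_path = make_tarball(dummy, os.path.join(workdir, "traced_bert.tar.gz"))
with tarfile.open(tar_path, "r:gz") as archive:
    names = archive.getnames()
print(names)
```

Using `arcname` keeps traced_bert.pt at the root of the archive, which is where a `model_fn` that joins `model_dir` with the file name expects to find it.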

Loading the TorchScript model and using it for prediction require small changes in our model loading and prediction functions. We create a new script, deploy_ei.py, that is slightly different from the train_deploy.py script.

In [26]:
!pygmentize tools/deploy_ei.py

import json
import logging
import os
import sys

import torch
import torch.utils.data
import torch.utils.data.distributed
from transformers import BertTokenizer

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(sys.stdout))

MAX_LEN = 64  # this is the max length of the sentence

print("Loading BERT tokenizer...")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)


def model_fn(model_dir):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    loaded_model = torch.jit.load(os.path.join(model_dir, "traced_bert.pt"))
    return loaded_model.to(device)


def input_fn(request_body, request_content_type):
    """An input_fn that loads a JSON-encoded sentence"""
    if request_content_type == "application/json":
        sentence = json.loads(request_body)

        input_ids = []
        encoded_sent = tokenizer.encode(sentence, add_special_tokens=True)
        input_ids.append(encoded

Next, we upload the TorchScript model to S3 and deploy it using Elastic Inference. The accelerator_type=ml.eia2.xlarge parameter is how we attach the Elastic Inference accelerator to our endpoint.

In [27]:
from sagemaker.pytorch import PyTorchModel

instance_type = 'ml.m5.large'  # CPU host instance; the accelerator below supplies the GPU acceleration
accelerator_type = 'ml.eia2.xlarge'

# TorchScript model
tar_filename = 'traced_bert.tar.gz'

# Returns S3 bucket URL
print('Upload tarball to S3')
model_data = sagemaker_session.upload_data(path=tar_filename, bucket=bucket, key_prefix=prefix)

endpoint_name = 'bert-ei-traced-{}-{}'.format(instance_type, accelerator_type).replace('.', '').replace('_', '')
print('endpoint name :', endpoint_name)

pytorch = PyTorchModel(
    model_data=model_data,
    role=role,
    entry_point='deploy_ei.py',
    source_dir='tools',
    framework_version='1.3.1',
    py_version='py3',
    sagemaker_session=sagemaker_session
)

Upload tarball to S3


W0721 23:26:41.529985 24952 model.py:111] Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


endpoint name : bert-ei-traced-mlm5large-mleia2xlarge
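
The `replace` calls above strip dots and underscores because endpoint names accept only letters, digits, and hyphens. A slightly more defensive sketch follows. The pattern and the 63-character limit are assumptions taken from my reading of the SageMaker CreateEndpoint API reference, `make_endpoint_name` is a hypothetical helper, and the instance type in the example is illustrative:

```python
import re

# Pattern and length limit assumed from the SageMaker CreateEndpoint API reference.
ENDPOINT_NAME_PATTERN = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9])*")
MAX_ENDPOINT_NAME_LENGTH = 63


def make_endpoint_name(prefix, *parts):
    """Join parts with hyphens, drop characters outside [a-zA-Z0-9-], and validate."""
    raw = "-".join((prefix,) + parts)
    name = re.sub(r"[^a-zA-Z0-9-]", "", raw)
    if len(name) > MAX_ENDPOINT_NAME_LENGTH or not ENDPOINT_NAME_PATTERN.fullmatch(name):
        raise ValueError("invalid endpoint name: {!r}".format(name))
    return name


print(make_endpoint_name("bert-ei-traced", "ml.m5.large", "ml.eia2.xlarge"))
```

Validating locally surfaces a bad name before any AWS call is made.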


In [None]:
# Function will exit before endpoint is finished creating
predictor = pytorch.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    accelerator_type=accelerator_type,
    endpoint_name=endpoint_name,
    wait=False
)
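
Because `wait=False` returns before the endpoint exists, requests sent immediately afterwards will fail. One way to block until the endpoint is ready is to poll its status. The sketch below takes the describe call as a parameter so it can be tried without AWS access; `wait_until_in_service` and its defaults are hypothetical, while `describe_endpoint` and the `EndpointStatus` field belong to the boto3 SageMaker client:

```python
import time


def wait_until_in_service(describe_endpoint, endpoint_name, delay=30, max_attempts=60):
    """Poll describe_endpoint until the endpoint is InService, fails, or times out."""
    for _ in range(max_attempts):
        status = describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError("endpoint {} failed to deploy".format(endpoint_name))
        time.sleep(delay)
    raise TimeoutError(
        "endpoint {} not ready after {} checks".format(endpoint_name, max_attempts)
    )
```

In a notebook this could be called as `wait_until_in_service(boto3.client("sagemaker").describe_endpoint, endpoint_name)`. The boto3 client also offers a built-in `endpoint_in_service` waiter that serves the same purpose.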

### 8. Delete the SageMaker endpoint

In [None]:
print(endpoint_name)
predictor.delete_endpoint()

In this post, we used Amazon SageMaker to take BERT as a starting point and train a model that labels sentences by their grammatical acceptability. We then deployed the model to an Amazon SageMaker endpoint, both with and without Elastic Inference acceleration. You can use this solution to tune BERT in other ways, or use other pretrained models provided by PyTorch-Transformers. For more about using PyTorch with Amazon SageMaker, see Using PyTorch with the SageMaker Python SDK.

Reference:
>https://aws.amazon.com/ko/blogs/machine-learning/fine-tuning-a-pytorch-bert-model-and-deploying-it-with-amazon-elastic-inference-on-amazon-sagemaker/