# Hands-on: Deploying Question Answering with BERT

Pre-trained language representations have been shown to improve many downstream NLP tasks, such as question answering and natural language inference. Devlin et al. proposed BERT [1] (Bidirectional Encoder Representations from Transformers), which fine-tunes deep bidirectional representations on a wide range of tasks with minimal task-specific parameters and obtained state-of-the-art results.

After training a QA model with BERT (see the previous notebook, "QA_Training.ipynb"), let us load the trained model and run inference on the SQuAD dataset.

### A quick overview: an example from the SQuAD dataset

    (2, 
    '56be4db0acb8001400a502ee', 
    'Where did Super Bowl 50 take place?', 

    'Super Bowl 50 was an American football game to determine the champion of the National 
    Football League (NFL) for the 2015 season. The American Football Conference (AFC) 
    champion Denver Broncos defeated the National Football Conference (NFC) champion 
    Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played 
    on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, 
    California. As this was the 50th Super Bowl, the league emphasized the "golden 
    anniversary" with various gold-themed initiatives, as well as temporarily suspending 
    the tradition of naming each Super Bowl game with Roman numerals (under which the 
    game would have been known as "Super Bowl L"), so that the logo could prominently 
    feature the Arabic numerals 50.', 

    ['Santa Clara, California', "Levi's Stadium", "Levi's Stadium 
    in the San Francisco Bay Area at Santa Clara, California."], 

    [403, 355, 355])
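
Each record appears to be a tuple of (record index, question ID, question, context, answer texts, answer start offsets), where every offset is a character position in the context. The sketch below illustrates that layout with a shortened, made-up context so the offsets are easy to verify. The `record` value and the variable names are illustrative and do not come from the dataset loader.

```python
# Hypothetical record that mimics the SQuAD tuple layout shown above.
context = "The game was played at Levi's Stadium in Santa Clara, California."
answers = ["Santa Clara, California", "Levi's Stadium"]
# Compute the character offsets instead of hard-coding them.
answer_starts = [context.find(a) for a in answers]
record = (0, "hypothetical-question-id", "Where was the game played?",
          context, answers, answer_starts)

index, question_id, question, ctx_text, answer_texts, starts = record
# Slicing the context at each offset should reproduce the answer text.
spans = [ctx_text[s:s + len(a)] for a, s in zip(answer_texts, starts)]
for start, span in zip(starts, spans):
    print(start, repr(span))
```

Checking that each offset really points at its answer is a cheap sanity test before feeding custom data to the model.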

## Deploy on SageMaker

1. Preparing functions for inference 
2. Saving the model parameters
3. Building a docker container with dependencies installed
4. Launching a serving end-point with SageMaker SDK

### 1. Preparing functions for inference

We need two functions:
1. ```model_fn``` to load the model parameters
2. ```transform_fn``` to run model inference on a given input
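
To make the division of labor concrete, here is a minimal sketch of the calling contract, assuming the hosting service calls `model_fn` once at startup and `transform_fn` once per request. The toy model, the `/opt/ml/model` path, and the request bodies are placeholders for illustration only; no BERT model is loaded.

```python
import json

def model_fn(model_dir):
    """Toy stand-in: runs once at startup and returns what transform_fn needs."""
    # A real implementation would load parameters from model_dir.
    return {"model_dir": model_dir}

def transform_fn(model, input_data, input_content_type=None, output_content_type=None):
    """Toy stand-in: runs per request; decodes JSON, 'predicts', encodes JSON."""
    questions = [question for question, _context in json.loads(input_data)]
    response = {str(i): "answer to: " + q for i, q in enumerate(questions)}
    return json.dumps(response), output_content_type

# Simulated hosting loop: load once, then serve several requests.
model = model_fn("/opt/ml/model")
requests = [json.dumps([("Q1?", "context 1")]),
            json.dumps([("Q2?", "context 2")])]
responses = [transform_fn(model, body, "application/json", "application/json")
             for body in requests]
print(responses[0])
```

Keeping the expensive loading step in `model_fn` means each request only pays for preprocessing and the forward pass.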

In [1]:
%%writefile code/serve.py

import collections, json, logging, warnings
import multiprocessing as mp
from functools import partial

import gluonnlp as nlp
import mxnet as mx
from mxnet.gluon import Block, nn
# import bert 
from qa import preprocess_dataset, SQuADTransform
import bert_qa_evaluate



class BertForQA(Block):
    """Model for SQuAD task with BERT.
    The model feeds token ids and token type ids into BERT to get the
    pooled BERT sequence representation, then apply a Dense layer for QA task.
    Parameters
    ----------
    bert: BERTModel
        Bidirectional encoder with transformer.
    prefix : str or None
        See document of `mx.gluon.Block`.
    params : ParameterDict or None
        See document of `mx.gluon.Block`.
    """

    def __init__(self, bert, prefix=None, params=None):
        super(BertForQA, self).__init__(prefix=prefix, params=params)
        self.bert = bert
        with self.name_scope():
            self.span_classifier = nn.Dense(units=2, flatten=False)

    def forward(self, inputs, token_types, valid_length=None):  # pylint: disable=arguments-differ
        """Generate the unnormalized score for the given the input sequences.
        Parameters
        ----------
        inputs : NDArray, shape (batch_size, seq_length)
            Input words for the sequences.
        token_types : NDArray, shape (batch_size, seq_length)
            Token types for the sequences, used to indicate whether the word belongs to the
            first sentence or the second one.
        valid_length : NDArray or None, shape (batch_size,)
            Valid length of the sequence. This is used to mask the padded tokens.
        Returns
        -------
        outputs : NDArray
            Shape (batch_size, seq_length, 2)
        """
        bert_output = self.bert(inputs, token_types, valid_length)
        output = self.span_classifier(bert_output)
        return output
    
    
def get_all_results(net, vocab, squadTransform, test_dataset, ctx = mx.cpu()):
    all_results = collections.defaultdict(list)
    
    def _vocab_lookup(example_id, subwords, type_ids, length, start, end):
        indices = vocab[subwords]
        return example_id, indices, type_ids, length, start, end
    
    dev_data_transform, _ = preprocess_dataset(test_dataset, squadTransform)
    dev_data_transform = dev_data_transform.transform(_vocab_lookup, lazy=False)
    dev_dataloader = mx.gluon.data.DataLoader(dev_data_transform, batch_size=1, shuffle=False)
    
    for data in dev_dataloader:
        example_ids, inputs, token_types, valid_length, _, _ = data
        batch_size = inputs.shape[0]
        output = net(inputs.astype('float32').as_in_context(ctx),
                     token_types.astype('float32').as_in_context(ctx),
                     valid_length.astype('float32').as_in_context(ctx))
        pred_start, pred_end = mx.nd.split(output, axis=2, num_outputs=2)
        example_ids = example_ids.asnumpy().tolist()
        pred_start = pred_start.reshape(batch_size, -1).asnumpy()
        pred_end = pred_end.reshape(batch_size, -1).asnumpy()

        for example_id, start, end in zip(example_ids, pred_start, pred_end):
            all_results[example_id].append(bert_qa_evaluate.PredResult(start=start, end=end))
    return(all_results)


def _test_example_transform(test_examples):
    """
    Change test examples to a format like SQUAD data.
    Parameters
    ---------- 
    test_examples: a list of (question, context) tuples.
        Example: [('Which NFL team represented the AFC at Super Bowl 50?',
                   'Super Bowl 50 was an American football game ......'),
                  ('Where did Super Bowl 50 take place?',
                   'Super Bowl 50 was ......'),
                  ......]
    Returns
    ----------
    test_examples_tuples : a list of SQUAD tuples
    """
    test_examples_tuples = []
    i = 0
    for test in test_examples:
        question, context = test[0], test[1]  # test.split(" [CONTEXT] ")
        tup = (i, "", question, context, [], [])
        test_examples_tuples.append(tup)
        i += 1
    return(test_examples_tuples)


def model_fn(model_dir = "", params_path = "bert_qa-7eb11865.params"):
    """
    Load the gluon model. Called once when hosting service starts.
    :param: model_dir The directory where model files are stored.
    :return: a Gluon model, and the vocabulary
    """
    bert_model, vocab = nlp.model.get_model('bert_12_768_12',
                                        dataset_name='book_corpus_wiki_en_uncased',
                                        use_classifier=False,
                                        use_decoder=False,
                                        use_pooler=False,
                                        pretrained=False)
    net = BertForQA(bert_model)
    if len(model_dir) > 0:
        params_path = model_dir + "/" +params_path
    net.load_parameters(params_path, ctx=mx.cpu())
    
    tokenizer = nlp.data.BERTTokenizer(vocab,  lower=True)
    transform = SQuADTransform(tokenizer, is_pad=False, is_training=False, do_lookup=False)
    return net, vocab, transform


def transform_fn(model, input_data, input_content_type=None, output_content_type=None):
    """
    Transform a request using the Gluon model. Called once per request.
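    :param model: The (net, vocab, transform) tuple returned by model_fn
    :param input_data: The request payload, a JSON-encoded list of
        (question, context) pairs; each pair is converted to a SQuAD-style
        tuple like the one below before inference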
    
        Example:
        ## (example_id, [question, content], ques_cont_token_types, valid_length, _, _)


        (2, 
        '56be4db0acb8001400a502ee', 
        'Where did Super Bowl 50 take place?', 
        
        'Super Bowl 50 was an American football game to determine the champion of the National 
        Football League (NFL) for the 2015 season. The American Football Conference (AFC) 
        champion Denver Broncos defeated the National Football Conference (NFC) champion 
        Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played 
        on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, 
        California. As this was the 50th Super Bowl, the league emphasized the "golden 
        anniversary" with various gold-themed initiatives, as well as temporarily suspending 
        the tradition of naming each Super Bowl game with Roman numerals (under which the 
        game would have been known as "Super Bowl L"), so that the logo could prominently 
        feature the Arabic numerals 50.', 
        
        ['Santa Clara, California', "Levi's Stadium", "Levi's Stadium 
        in the San Francisco Bay Area at Santa Clara, California."], 
        
        [403, 355, 355])

    :param input_content_type: The request content type, assume json
    :param output_content_type: The (desired) response content type, assume json
    :return: response payload and content type.
    """
    net, vocab, squadTransform = model
#     data = input_data
    data = json.loads(input_data)
#     test_examples_tuples = [(i, "", question, content, [], [])]
#     question, context = data #.split(" [CONTEXT] ")
#     tup = (0, "", question, context, [], [])
    test_examples_tuples = _test_example_transform(data)
    test_dataset = mx.gluon.data.SimpleDataset(test_examples_tuples)  # [tup]
    all_results = get_all_results(net, vocab, squadTransform, test_dataset, ctx=mx.cpu())
    all_predictions = collections.defaultdict(list) # collections.OrderedDict()
    data_transform = test_dataset.transform(squadTransform._transform)
    for features in data_transform:
        f_id = features[0].example_id
        results = all_results[f_id]
        prediction, nbest = bert_qa_evaluate.predict(
            features=features,
            results=results,
            tokenizer=nlp.data.BERTBasicTokenizer(vocab))        
        nbest_prediction = []
        # Report at most the top three candidates; nbest may hold fewer.
        for i in range(min(3, len(nbest))):
            nbest_prediction.append('%.2f%% \t %s'%(nbest[i][1] * 100, nbest[i][0]))
        all_predictions[f_id] = nbest_prediction
    response_body = json.dumps(all_predictions)
    return response_body, output_content_type

Overwriting code/serve.py
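
In `serve.py`, `get_all_results` splits the last axis of the network output into per-token start and end scores, and `bert_qa_evaluate.predict` turns those scores into answer spans. The sketch below illustrates the general idea with plain Python lists. It is a simplified stand-in, not the logic of `bert_qa_evaluate.predict`; the function name `best_span`, the `max_answer_len` parameter, and the scores are made up for illustration.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end, score) maximizing start_scores[s] + end_scores[e], s <= e."""
    best = None
    for s, s_score in enumerate(start_scores):
        # Only consider ends at or after the start, within the length limit.
        for e in range(s, min(s + max_answer_len, len(end_scores))):
            score = s_score + end_scores[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best

tokens = ["the", "game", "was", "played", "in", "santa", "clara"]
start_scores = [0.1, 0.0, 0.2, 0.1, 0.3, 4.0, 1.0]
end_scores = [0.0, 0.1, 0.0, 0.2, 0.1, 1.5, 5.0]

start, end, score = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(answer)
```

Restricting the search to `s <= e` rules out spans that end before they begin, which independent argmax picks over the two score vectors could otherwise produce.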


### 2. Saving the model parameters

We are going to pack the BERT model parameters, the vocabulary file, and all the inference files (```code/serve.py```, ```bert/data/qa.py```, ```bert_qa_evaluate.py```) into a ```model.tar.gz``` file. (Note that ```serve.py``` is the "entry_point" that SageMaker uses for inference, and it needs to be under the ```code/``` directory.)

In [2]:
import tarfile
with tarfile.open("model.tar.gz", "w:gz") as tar:
#     tar.add("code/serve.py")
#     tar.add("bert/data/qa.py")
#     tar.add("bert_qa_evaluate.py")
#     tar.add("bert_qa-7eb11865.params")
#     tar.add("vocab.json")
    tar.add("net.params")

### 3. Building a docker container with dependencies installed

Let's prepare a docker container with all the dependencies required for model inference. Here we build a docker container based on the SageMaker MXNet inference container, and you can find the list of all available inference containers at https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-frameworks-deep-learning.html

Here we use local mode for demonstration purposes. To deploy on actual instances, you need to log in to the Amazon Elastic Container Registry (ECR) service and push the container to ECR.

```
docker build -t $YOUR_ECR_DOCKER_TAG . -f Dockerfile
$(aws ecr get-login --no-include-email --region $YOUR_REGION)
docker push $YOUR_ECR_DOCKER_TAG
```

In [8]:
%%writefile Dockerfile

FROM nvidia/cuda:10.1-cudnn7-runtime-ubuntu16.04

LABEL maintainer="Amazon AI"

# Specify accept-bind-to-port LABEL for inference pipelines to use SAGEMAKER_BIND_TO_PORT
# https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipeline-real-time.html
LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true

ARG MMS_VERSION=1.0.8
ARG MX_URL=https://aws-mxnet-pypi.s3-us-west-2.amazonaws.com/1.6.0/aws_mxnet_cu101mkl-1.6.0-py2.py3-none-manylinux1_x86_64.whl
ARG PYTHON=python3
ARG PYTHON_PIP=python3-pip
ARG PIP=pip3
ARG PYTHON_VERSION=3.6.8

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/usr/local/lib" \
    PYTHONIOENCODING=UTF-8 \
    LANG=C.UTF-8 \
    LC_ALL=C.UTF-8 \
    TEMP=/home/model-server/tmp \
    CLOUD_PATH="/opt/ml/code"

RUN apt-get update \
 && apt-get -y install --no-install-recommends \
    build-essential \
    ca-certificates \
    curl \
    git \
    libopencv-dev \
    openjdk-8-jdk-headless \
    vim \
    wget \
    zlib1g-dev \
 && apt-get clean \
 && rm -rf /var/lib/apt/lists/*

RUN wget https://www.python.org/ftp/python/$PYTHON_VERSION/Python-$PYTHON_VERSION.tgz \
 && tar -xvf Python-$PYTHON_VERSION.tgz \
 && cd Python-$PYTHON_VERSION \
 && ./configure \
 && make \
 && make install \
 && apt-get update \
 && apt-get install -y --no-install-recommends \
    libreadline-gplv2-dev \
    libncursesw5-dev \
    libssl-dev \
    libsqlite3-dev \
    tk-dev \
    libgdbm-dev \
    libc6-dev \
    libbz2-dev \
 && make \
 && make install \
 && rm -rf ../Python-$PYTHON_VERSION* \
 && ln -s /usr/local/bin/pip3 /usr/bin/pip

RUN ln -s $(which ${PYTHON}) /usr/local/bin/python

RUN ${PIP} --no-cache-dir install --upgrade \
    pip \
    setuptools

WORKDIR /

RUN ${PIP} install --no-cache-dir \
    ${MX_URL} \
    git+https://github.com/dmlc/gluon-nlp.git@v0.9.0 \
#     gluoncv==0.6.0 \
    mxnet-model-server==$MMS_VERSION \
    keras-mxnet==2.2.4.1 \
    numpy==1.17.4 \
    onnx==1.4.1 \
    "sagemaker-mxnet-inference<2"


RUN useradd -m model-server \
 && mkdir -p /home/model-server/tmp \
 && chown -R model-server /home/model-server

COPY mms-entrypoint.py /usr/local/bin/dockerd-entrypoint.py
COPY config.properties /home/model-server
COPY code/serve.py $CLOUD_PATH/serve.py
COPY bert_qa_evaluate.py $CLOUD_PATH/bert_qa_evaluate.py
COPY qa.py $CLOUD_PATH/qa.py
RUN chmod +x /usr/local/bin/dockerd-entrypoint.py
RUN curl https://aws-dlc-licenses.s3.amazonaws.com/aws-mxnet-1.6.0/license.txt -o /license.txt


EXPOSE 8080 8081
ENTRYPOINT ["python", "/usr/local/bin/dockerd-entrypoint.py"]
CMD ["mxnet-model-server", "--start", "--mms-config", "/home/model-server/config.properties"]


Overwriting Dockerfile


In [9]:
%%writefile build.sh

#!/usr/bin/env bash

# This script shows how to build the Docker image and push it to ECR to be ready for use
# by SageMaker.

# The arguments to this script are the image name and application name
image=$1
app=$2

chmod +x $app/train
chmod +x $app/serve

# Get the account number associated with the current IAM credentials
account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration
region=$(aws configure get region)
fullname="${account}.dkr.ecr.${region}.amazonaws.com/${image}:latest"


# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${image}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${image}" > /dev/null
fi


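# Edit ECR policy permission rights. The file:// prefix makes the CLI read the
# policy from the file instead of treating the file name as the policy text.
aws ecr set-repository-policy --repository-name "${image}" --policy-text file://ecr_policy.json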

# Get the login command from ECR and execute it directly
$(aws ecr get-login --region ${region} --no-include-email)

# Build the docker image locally with the image name and then tag it with
# the full ECR name (run `docker push ${fullname}` to publish it to ECR).
docker build  -t ${image} --build-arg APP=$app .
docker tag ${image} ${fullname}

Overwriting build.sh


We set `image_name` to "kdd2020nlp" and the application name to "question_answering".

In [12]:
!bash build.sh kdd2020nlp question_answering


An error occurred (InvalidParameterException) when calling the SetRepositoryPolicy operation: Invalid parameter at 'PolicyText' failed to satisfy constraint: 'Invalid repository policy provided'
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
Sending build context to Docker daemon   1.28GB
Step 1/27 : FROM nvidia/cuda:10.1-cudnn7-runtime-ubuntu16.04
 ---> e11e11484e2e
Step 2/27 : LABEL maintainer="Amazon AI"
 ---> Using cache
 ---> 565649be145a
Step 3/27 : LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true
 ---> Using cache
 ---> bba5a1546099
Step 4/27 : ARG MMS_VERSION=1.0.8
 ---> Using cache
 ---> 2034f1d30c2e
Step 5/27 : ARG MX_URL=https://aws-mxnet-pypi.s3-us-west-2.amazonaws.com/1.6.0/aws_mxnet_cu101mkl-1.6.0-py2.py3-none-manylinux1_x86_64.whl
 ---> Using cache
 ---> 08afaf84a69c
Step 6/27 : ARG PYTHON=python3
 ---> Using cache
 ---> f0f7453df285
Step 7/27 : ARG PYTHON_PIP=python3-pip
 ---> Using cache
 ---> 4af5eabe
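
The `InvalidParameterException` at the top of this output is what the AWS CLI reports when `--policy-text` receives text that is not a valid policy document. The CLI treats the argument as the policy itself unless it carries a `file://` prefix, so a bare file name is rejected. The sketch below writes a policy file and checks that it is valid JSON. The statement contents (the `Sid`, the SageMaker service principal, and the chosen pull actions) are assumptions for illustration, since the tutorial does not show its `ecr_policy.json`.

```python
import json
import os
import tempfile

# Hypothetical repository policy: lets the SageMaker service pull the image.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowSageMakerPull",  # illustrative name
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": [
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchGetImage",
                "ecr:BatchCheckLayerAvailability",
            ],
        }
    ],
}

policy_path = os.path.join(tempfile.mkdtemp(), "ecr_policy.json")
with open(policy_path, "w") as f:
    json.dump(policy, f, indent=2)

# The CLI would then read the file contents via the file:// prefix, e.g.
#   aws ecr set-repository-policy --repository-name <name> \
#       --policy-text file://ecr_policy.json
with open(policy_path) as f:
    reloaded = json.load(f)
print(reloaded["Statement"][0]["Action"])
```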

### 4. Launching a serving end-point with SageMaker SDK

We create an MXNet model that can be deployed later by specifying the Docker image and the entry point for the inference code. If ```serve.py``` does not work, use ```dummy_hosting_module.py``` for debugging purposes.

#### Creating the Session

The session remembers our connection parameters to Amazon SageMaker. We'll use it to perform all of our SageMaker operations.

In [14]:
import sagemaker as sage

sess = sage.Session()

#### Defining the account, region and ECR address


In [15]:

account = sess.boto_session.client('sts').get_caller_identity()['Account']
region = sess.boto_session.region_name
image_name = "kdd2020nlp"
ecr_image = '{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account, region, image_name)

#### Uploading model

We can upload the trained model to the corresponding S3 bucket: https://s3.console.aws.amazon.com/s3/buckets/sagemaker-us-east-1-383827541835/sagemaker-deploy-gluoncv/data/?region=us-east-1

In [17]:
sess.default_bucket()

'sagemaker-us-east-1-383827541835'

In [None]:
import os

s3_bucket_name = "kdd2020"  # key prefix inside the default SageMaker bucket
model_path = "s3://{}/{}/model".format(sess.default_bucket(), s3_bucket_name)
model_uri = os.path.join(model_path, "model.tar.gz")  # full S3 URI of the model archive
model_prefix = s3_bucket_name + "/model"
train_data_local = "./data/minc-2500/train"
train_data_dir_prefix = s3_bucket_name + "/data/train"


# model_local_path = "model_output"
# Upload the local directory to the default bucket under the given key prefix.
train_data_upload = sess.upload_data(path=train_data_local, 
#                                 bucket=s3_bucket, 
                                key_prefix=train_data_dir_prefix)

In [None]:
import sagemaker
from sagemaker.mxnet.model import MXNetModel
sagemaker_model = MXNetModel(model_data='file:///home/ec2-user/SageMaker/ako2020-bert/tutorial/model.tar.gz',
                             image=ecr_image,
                             role=sagemaker.get_execution_role(), 
                             py_version='py3',            # python version
                             entry_point='serve.py',
                             source_dir='.')

We use 'local' mode to test our deployment code, where the inference happens on the current instance.
If you are ready to deploy the model on a new instance, change the `instance_type` argument to a value such as `ml.c4.xlarge`, `ml.c5.2xlarge`, or `ml.p2.xlarge`.

**The following line starts the Docker container.**

In [None]:
predictor = sagemaker_model.deploy(initial_instance_count=1, instance_type='local')

Now let us submit an inference request. Here we simply grab two data points from the SQuAD dataset and pass them to our predictor by calling ```predictor.predict```.

In [None]:
## test
my_test_example_0 = ('Which NFL team represented the AFC at Super Bowl 50?',
 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.')

my_test_example_1 = ('Where did Super Bowl 50 take place?',
 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.')

my_test_examples = (my_test_example_0, my_test_example_1)

# mymodel = model_fn(params_path = "bert_qa-7eb11865.params")
# transform_fn(mymodel, my_test_examples)
output = predictor.predict(my_test_examples)  
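
Because `transform_fn` calls `json.loads` on the request body, the tuples above reach the endpoint as JSON arrays. The sketch below mimics that round trip locally, assuming the predictor serializes the payload as JSON. The helper `to_squad_tuples` is a hypothetical re-implementation of `_test_example_transform` so that the snippet runs without the model, and the contexts are shortened placeholders.

```python
import json

def to_squad_tuples(examples):
    """Wrap (question, context) pairs in the SQuAD-style tuple layout."""
    return [(i, "", question, context, [], [])
            for i, (question, context) in enumerate(examples)]

examples = (("Where did Super Bowl 50 take place?", "Super Bowl 50 was ..."),
            ("Which NFL team represented the AFC?", "Super Bowl 50 was ..."))

body = json.dumps(examples)   # what goes over the wire
decoded = json.loads(body)    # tuples come back as lists
squad_tuples = to_squad_tuples(decoded)
print(squad_tuples[1][:3])
```

The position of each pair in the list becomes its example ID, which is why the response is keyed by `"0"`, `"1"`, and so on.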

In [None]:
print("\nPrediction output: \n\n")

for k in output.keys():
    print('{}\n\n'.format(output[k]))
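
Each value in `output` is a list of strings that `transform_fn` formats as a percentage, a tab, and the answer text. The keys are strings because JSON object keys always are. The sketch below parses such a response into (probability, answer) pairs; the sample values are made up and are not real model output, and `parse_prediction` is a hypothetical helper.

```python
# Hypothetical response in the shape produced by transform_fn.
sample_output = {
    "0": ["99.12% \t denver broncos", "0.51% \t broncos"],
    "1": ["85.40% \t santa clara, california", "10.02% \t levi's stadium"],
}

def parse_prediction(line):
    """Split '<prob>% <tab> <answer>' into (probability in [0, 1], answer text)."""
    prob_text, answer = line.split("\t", 1)
    return float(prob_text.strip().rstrip("%")) / 100.0, answer.strip()

# Convert keys back to integers and every line to a (probability, answer) pair.
parsed = {int(k): [parse_prediction(p) for p in preds]
          for k, preds in sample_output.items()}
top_answers = {k: preds[0][1] for k, preds in parsed.items()}
print(top_answers)
```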

### Clean Up

Remove the endpoint after we are done. 

In [None]:
predictor.delete_endpoint()