# Hands-on: Training and deploying Question Answering with BERT

Pre-trained language representations have been shown to improve many downstream NLP tasks, such as question answering and natural language inference. Devlin et al. proposed BERT [1] (Bidirectional Encoder Representations from Transformers), which fine-tunes deep bidirectional representations on a wide range of tasks with minimal task-specific parameters and obtains state-of-the-art results.

In this tutorial, we will focus on adapting the BERT model for the question answering task on the SQuAD dataset. Specifically, we will:

- understand how to pre-process the SQuAD dataset to leverage the learnt representation in BERT,
- adapt the BERT model to the question answering task, and
- load a trained model to perform inference on the SQuAD dataset.

## SageMaker configuration

This notebook requires mxnet-cu101 >= 1.6.0b20191102 and gluonnlp >= 0.8.1.
We can create a SageMaker notebook instance with the lifecycle configuration file `sagemaker-lifecycle.config`.

In [1]:
# One time script
# !bash sagemaker-lifecycle.config

In [1]:
!pip list | grep mxnet
!pip list | grep gluonnlp

aws-mxnet-cu101mkl                 1.6.0              
keras-mxnet                        2.2.4.2            
mxnet-cu101                        1.6.0              
mxnet-model-server                 1.0.8              
gluonnlp                           0.9.2              


## Loading MXNet and GluonNLP

We first import the libraries:

In [2]:
import argparse
import collections
import copy
import io
import json
import logging
import os
import random
import time
import warnings

import numpy as np
import gluonnlp as nlp
import mxnet as mx

from gluonnlp.data import SQuAD
from bert.data.qa import SQuADTransform, preprocess_dataset
import bert_qa_evaluate


## Inspecting the SQuAD Dataset

Next, we take a look at the Stanford Question Answering Dataset (SQuAD). The dataset can be downloaded using the `nlp.data.SQuAD` API. In this tutorial, we create a small dataset with 3 samples from the SQuAD dev set for demonstration purposes.

The question answering task on the SQuAD dataset is set up as follows. Each sample provides a context, usually a long paragraph containing a lot of information, and a question about that context. The goal is to find the text span in the context that answers the question.

In [3]:
full_data = nlp.data.SQuAD(segment='dev', version='1.1')
# loading a subset of the dev set of SQuAD
num_target_samples = 3
target_samples = [full_data[i] for i in range(num_target_samples)]
dataset = mx.gluon.data.SimpleDataset(target_samples)
print('Number of samples in the created dataset subsampled from SQuAD = %d'%len(dataset))

Downloading /home/ec2-user/SageMaker/datasets/squad/tmpbagvu_rv/dev-v1.1.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/squad/dev-v1.1.zip...
Number of samples in the created dataset subsampled from SQuAD = 3


In [4]:
target_samples[0]

(0,
 '56be4db0acb8001400a502ec',
 'Which NFL team represented the AFC at Super Bowl 50?',
 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.',
 ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'],
 [177, 177, 177])

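The sample above shows the structure of a SQuAD record: a tuple whose fields are, in order, the record index, the question ID, the question, the context, the list of answer texts, and the list of answer start positions. The question is therefore at index 2, the context at index 3, the answers at index 4, and the answer positions at index 5.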

In [5]:
question_idx = 2
context_idx = 3
answer_idx = 4
answer_pos_idx = 5
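
The answer positions are character offsets into the context string, so slicing the context at a stored offset should reproduce the answer text. The following is a minimal sketch of that relationship. The record is a made-up toy example in the same six-field layout, and the helper name `answer_spans` is hypothetical, not part of GluonNLP.

```python
# Hypothetical record in the same layout as a GluonNLP SQuAD sample:
# (record index, question id, question, context, answers, answer start offsets)
context = "The game was played at Levi's Stadium in Santa Clara, California."
answers = ["Santa Clara, California", "Levi's Stadium"]
record = (0, "toy-id-0", "Where was the game played?", context,
          answers, [context.find(a) for a in answers])

question_idx, context_idx, answer_idx, answer_pos_idx = 2, 3, 4, 5


def answer_spans(sample):
    """Return (start, end) character offsets (end exclusive) for each answer."""
    return [(start, start + len(text))
            for text, start in zip(sample[answer_idx], sample[answer_pos_idx])]


spans = answer_spans(record)
for (start, end), text in zip(spans, record[answer_idx]):
    # Slicing the context at the stored offset reproduces the answer text.
    assert record[context_idx][start:end] == text
    print(start, end, repr(record[context_idx][start:end]))
```

The same slicing check can be applied to real samples as a quick sanity test of the offsets.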

Let's take a look at a sample from the dataset. In this sample, the question asks where the game took place, and the context is a description of the Super Bowl 50 game. Note that three different answer spans are accepted as correct for this question, and they start at character offsets 403, 355, and 355 in the context, respectively.

In [6]:
sample = dataset[2]
print('\nContext:\n')
print(sample[context_idx])
print("\nQuestion")
print(sample[question_idx])
print("\nCorrect Answer Spans")
print(sample[answer_idx])
print("\nAnswer Span Start Indices:")
print(sample[answer_pos_idx])


Context:

Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.

Question
Where did Super Bowl 50 take place?

Correct Answer Spans
['Santa Clara, California', "Levi's Stadium", "Levi's Stadium in the San Francisco Bay Area at Santa Clara, California."]

Answer Span Start Indices:
[403, 355, 355]

## Data Pre-processing for QA with BERT

Recall that during pre-training, BERT takes a sentence pair as input, with the two sentences separated by the special `[SEP]` token. For SQuAD, we can feed the question-context pair as the sentence pair input. To predict the start and end of the answer span, we add a classification layer that scores each token in the context as a possible start or end of the span.

![qa](resources/qa.png)

In the next few code blocks, we will pre-process the samples in the SQuAD dataset into the desired format with these special separators.
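
As a rough illustration of the target format, the sketch below composes a question-context pair by hand. It assumes the usual BERT convention of `[CLS] question [SEP] context [SEP]`, with segment id 0 for the first part and 1 for the second. The function name `compose_pair` is hypothetical, and the real transformation in this tutorial is done by `SQuADTransform`.

```python
def compose_pair(question_tokens, context_tokens):
    """Build a BERT-style sentence pair and its segment (token type) ids."""
    first = ['[CLS]'] + question_tokens + ['[SEP]']
    second = context_tokens + ['[SEP]']  # trailing [SEP] assumed, per BERT convention
    tokens = first + second
    segment_ids = [0] * len(first) + [1] * len(second)
    return tokens, segment_ids, len(tokens)


tokens, segment_ids, valid_length = compose_pair(
    ['where', 'did', 'super', 'bowl', '50', 'take', 'place', '?'],
    ['super', 'bowl', '50', 'was', 'an', 'american', 'football', 'game'])

print(tokens)
print(segment_ids)
print(valid_length)
```

The segment ids tell the model which tokens belong to the question and which to the context, and the valid length marks where the real tokens end if the sequence is later padded.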


### Getting Pre-trained BERT Model

First, let's use the `get_model` API in GluonNLP to get the model definition for BERT and the vocabulary used by the model. Note that we discard the pooler and classifier layers used for the next sentence prediction task, as well as the decoder layers used for the masked language model task during BERT pre-training. These layers are not useful for predicting the starting and ending indices of the answer span.

The list of pre-trained BERT models available in GluonNLP can be found [here](http://gluon-nlp.mxnet.io/model_zoo/bert/index.html).

In [7]:
bert_model, vocab = nlp.model.get_model('bert_12_768_12',
                                        dataset_name='book_corpus_wiki_en_uncased',
                                        use_classifier=False,
                                        use_decoder=False,
                                        use_pooler=False,
                                        pretrained=False)
with open('vocab.json', 'w') as f:
    f.write(vocab.to_json())

Vocab file is not found. Downloading.
Downloading /home/ec2-user/SageMaker/models/7148235391387426985/7148235391387426985_book_corpus_wiki_en_uncased-a6607397.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/vocab/book_corpus_wiki_en_uncased-a6607397.zip...


Note that there are several special tokens in the vocabulary for BERT. In particular, the `[SEP]` token is used for separating the sentence pairs, and the `[CLS]` token is added at the beginning of the sentence pairs. They will be used to pre-process the SQuAD dataset later.

In [8]:
print(vocab)

Vocab(size=30522, unk="[UNK]", reserved="['[CLS]', '[SEP]', '[MASK]', '[PAD]']")


### Tokenization

The second step is to process the samples using the same tokenizer used for BERT, which is provided as the `BERTTokenizer` API in GluonNLP. Note that instead of word-level or character-level representations, BERT represents a word as one or more subwords, where every subword after the first is prefixed with `##`.

In the following example, the word `suspending` is tokenized as two subwords (`suspend` and `##ing`), and `numerals` is tokenized as three subwords (`nu`, `##meral`, `##s`).

In [9]:
tokenizer = nlp.data.BERTTokenizer(vocab=vocab, lower=True)

tokenizer("as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals")

['as',
 'well',
 'as',
 'temporarily',
 'suspend',
 '##ing',
 'the',
 'tradition',
 'of',
 'naming',
 'each',
 'super',
 'bowl',
 'game',
 'with',
 'roman',
 'nu',
 '##meral',
 '##s']
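
To give an intuition for how such subwords come about, here is a minimal sketch of greedy longest-match-first splitting, the scheme used by the reference BERT WordPiece tokenizer. The tiny vocabulary and the function name `wordpiece` are made up for illustration. `BERTTokenizer` additionally handles lower-casing and punctuation splitting, which this sketch omits.

```python
def wordpiece(word, vocab, unk='[UNK]'):
    """Greedy longest-match-first split of one lower-cased word into subwords."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary, not the real BERT vocabulary.
toy_vocab = {'suspend', '##ing', 'nu', '##meral', '##s', 'roman'}

print(wordpiece('suspending', toy_vocab))
print(wordpiece('numerals', toy_vocab))
print(wordpiece('roman', toy_vocab))
```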

### Sentence Pair Composition

With the tokenizer in place, we are ready to process the question-context texts and compose sentence pairs. The functionality is available via the `SQuADTransform` API.

In [10]:
transform = SQuADTransform(tokenizer, is_pad=False, is_training=False, do_lookup=False)
dev_data_transform, _ = preprocess_dataset(dataset, transform)
logging.info('The number of examples after preprocessing:{}'.format(len(dev_data_transform)))

Done! Transform dataset costs 0.10 seconds.


Let's take a look at the sample after the transformation:

In [11]:
sample = dev_data_transform[2]
print('\nsegment type: \n' + str(sample[2]))
print('\ntext length: ' + str(sample[3]))
print('\nsentence pair: \n' + str(sample[1]))


segment type: 
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

text length: 168

sentence pair: 
['[CLS]', 'where', 'did', 'super', 'bowl', '50', 'take', 'place', '?', '[SEP]', 'super', 'bowl', '50', 'was', 'an', 'american', 'football', 'game', 'to', 'determine', 'the', 'champion', 'of', 'the', 'national', 'football', 'league', '(', 'nfl', ')', 'for', 'the', '2015', 'season', '.', 'the', 'american', 'football', 'conference', '(', 'afc', ')', 'champion', 'denver', 'broncos', 'defeated', 'the', 'national', 'football', 'conference', '(', 

### Vocabulary Lookup

Finally, we convert the transformed texts to subword indices, which are used to construct NDArrays as the inputs to the model.

In [12]:
def vocab_lookup(example_id, subwords, type_ids, length, start, end):
    indices = vocab[subwords]
    return example_id, indices, type_ids, length, start, end

dev_data_transform = dev_data_transform.transform(vocab_lookup, lazy=False)
print(dev_data_transform[2][1])

[2, 2073, 2106, 3565, 4605, 2753, 2202, 2173, 1029, 3, 3565, 4605, 2753, 2001, 2019, 2137, 2374, 2208, 2000, 5646, 1996, 3410, 1997, 1996, 2120, 2374, 2223, 1006, 5088, 1007, 2005, 1996, 2325, 2161, 1012, 1996, 2137, 2374, 3034, 1006, 10511, 1007, 3410, 7573, 14169, 3249, 1996, 2120, 2374, 3034, 1006, 22309, 1007, 3410, 3792, 12915, 2484, 1516, 2184, 2000, 7796, 2037, 2353, 3565, 4605, 2516, 1012, 1996, 2208, 2001, 2209, 2006, 2337, 1021, 1010, 2355, 1010, 2012, 11902, 1005, 1055, 3346, 1999, 1996, 2624, 3799, 3016, 2181, 2012, 4203, 10254, 1010, 2662, 1012, 2004, 2023, 2001, 1996, 12951, 3565, 4605, 1010, 1996, 2223, 13155, 1996, 1000, 3585, 5315, 1000, 2007, 2536, 2751, 1011, 11773, 11107, 1010, 2004, 2092, 2004, 8184, 28324, 2075, 1996, 4535, 1997, 10324, 2169, 3565, 4605, 2208, 2007, 3142, 16371, 28990, 2015, 1006, 2104, 2029, 1996, 2208, 2052, 2031, 2042, 2124, 2004, 1000, 3565, 4605, 1048, 1000, 1007, 1010, 2061, 2008, 1996, 8154, 2071, 14500, 3444, 1996, 5640, 16371, 28990, 2015

## Model Definition

After the data is processed, we can define the model that uses the representation produced by BERT for predicting the starting and ending positions of the answer span. 
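
Conceptually, such a model only needs a small head on top of BERT: a dense layer that maps each token's hidden vector to two scores, one for "answer starts here" and one for "answer ends here". The pure-Python sketch below illustrates that idea with toy sizes and random numbers. It is not the implementation of `BertForQA`, whose code lives in `bert_qa_evaluate`, and the function name `span_logits` is hypothetical.

```python
import random


def span_logits(hidden_states, weight, bias):
    """Project each token's hidden vector to a (start, end) pair of logits."""
    start_logits, end_logits = [], []
    for h in hidden_states:
        # One dot product per output unit: unit 0 scores "start", unit 1 scores "end".
        scores = [sum(w * x for w, x in zip(row, h)) + b
                  for row, b in zip(weight, bias)]
        start_logits.append(scores[0])
        end_logits.append(scores[1])
    return start_logits, end_logits


random.seed(0)
seq_len, hidden_size = 6, 4  # toy sizes; bert_12_768_12 has 768 hidden units
hidden_states = [[random.gauss(0, 1) for _ in range(hidden_size)]
                 for _ in range(seq_len)]
weight = [[random.gauss(0, 0.02) for _ in range(hidden_size)] for _ in range(2)]
bias = [0.0, 0.0]

start_logits, end_logits = span_logits(hidden_states, weight, bias)
print(len(start_logits), len(end_logits))
```

Because the head adds only two output units per token, almost all parameters being fine-tuned come from the pre-trained BERT encoder.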

We download the parameters of a BERT model already fine-tuned on the SQuAD dataset, load them into the network, and prepare the data loader.

In [13]:
net = bert_qa_evaluate.BertForQA(bert_model)
ctx = mx.gpu(0)
ckpt = bert_qa_evaluate.download_qa_ckpt()
net.load_parameters(ckpt, ctx=ctx)

batch_size = 1
dev_dataloader = mx.gluon.data.DataLoader(
    dev_data_transform, batch_size=batch_size, shuffle=False)

Downloading ./bert_qa-7eb11865.zip4bd0e039-7638-47ed-b4f0-a76918fcb001 from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/bert_qa-7eb11865.zip...
Downloaded checkpoint to ./bert_qa-7eb11865.params


In [14]:
all_results = collections.defaultdict(list)

total_num = 0
for data in dev_dataloader:
    example_ids, inputs, token_types, valid_length, _, _ = data
    total_num += len(inputs)
    batch_size = inputs.shape[0]
    output = net(inputs.astype('float32').as_in_context(ctx),
                 token_types.astype('float32').as_in_context(ctx),
                 valid_length.astype('float32').as_in_context(ctx))
    pred_start, pred_end = mx.nd.split(output, axis=2, num_outputs=2)
    example_ids = example_ids.asnumpy().tolist()
    pred_start = pred_start.reshape(batch_size, -1).asnumpy()
    pred_end = pred_end.reshape(batch_size, -1).asnumpy()
    
    for example_id, start, end in zip(example_ids, pred_start, pred_end):
        all_results[example_id].append(bert_qa_evaluate.PredResult(start=start, end=end))
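
The loop above stores one start score and one end score per token. Turning these into answers means searching for spans whose start and end scores are jointly high. The sketch below shows one common way to do this, following the reference BERT SQuAD post-processing: score each span as start score plus end score, keep the best few, and normalise their scores with a softmax. The internals of `simple_predict` are not shown in this tutorial, so treat this as an assumption about the general approach. The function name `best_spans` and the toy scores are made up.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def best_spans(start_logits, end_logits, context_start,
               max_answer_length=30, n_best=3):
    """Rank candidate (start, end) token spans by start score + end score."""
    candidates = []
    for i in range(context_start, len(start_logits)):
        # The end may not precede the start, and the span length is capped.
        for j in range(i, min(i + max_answer_length, len(end_logits))):
            candidates.append((start_logits[i] + end_logits[j], i, j))
    candidates.sort(reverse=True)
    top = candidates[:n_best]
    probs = softmax([score for score, _, _ in top])
    return [(i, j, p) for (_, i, j), p in zip(top, probs)]


# Toy example: the scores are invented, not model outputs.
tokens = ['[CLS]', 'who', 'won', '?', '[SEP]', 'denver', 'broncos', 'won', '[SEP]']
start_logits = [0.1, 0.0, 0.0, 0.0, 0.0, 5.0, 1.0, 0.2, 0.0]
end_logits = [0.1, 0.0, 0.0, 0.0, 0.0, 0.5, 6.0, 0.3, 0.0]
context_start = 5  # index of the first context token

for i, j, prob in best_spans(start_logits, end_logits, context_start):
    print('%.2f%%\t%s' % (100 * prob, ' '.join(tokens[i:j + 1])))
```

Restricting the search to context positions keeps the question tokens from being returned as answers.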

In [15]:
bert_qa_evaluate.simple_predict(dataset, all_results, vocab)


Context: Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.

Question: which nfl team represented the afc at super bowl 50 ?

Top predictions: 
99.51% 	 Denver Broncos
0.18% 	 The American Football Conference (AFC) champion Denver Broncos
0.11% 	 Broncos


Context: Super Bo

### Model Training

Now we can put all the pieces together and fine-tune the model for a few epochs.

The full training script is provided in `finetune_squad.py`, with 20+ hyperparameters to set (such as batch_size, debug, epochs, gpu, log_interval, and lr). Let us first take a look at this training script before running the next cell.

In [18]:
import time
start = time.time()
!python finetune_squad.py --epochs 1 --batch_size 4 --bert_model 'bert_12_768_12' --gpu 0

INFO:gluonnlp:05:43:27 Namespace(accumulate=None, batch_size=4, bert_dataset='book_corpus_wiki_en_uncased', bert_model='bert_12_768_12', debug=False, doc_stride=128, epochs=1, gpu=0, log_interval=50, lr=5e-05, max_answer_length=30, max_query_length=64, max_seq_length=384, model_parameters=None, n_best_size=20, null_score_diff_threshold=0.0, only_predict=False, optimizer='bertadam', output_dir='./output_dir', pretrained_bert_parameters=None, sentencepiece=None, test_batch_size=24, uncased=True, version_2=False, warmup_ratio=0.1)
INFO:gluonnlp:05:43:31 Loading train data...
INFO:gluonnlp:05:43:32 Number of records in Train data:87599
Done! Transform dataset costs 49.24 seconds.
INFO:gluonnlp:05:44:22 The number of examples after preprocessing:88641
INFO:gluonnlp:05:44:22 Start Training
INFO:gluonnlp:05:44:28 Epoch: 0, Batch: 49/22161, Loss=5.8097, lr=0.0000011 Time cost=6.5 Thoughput=30.88 samples/s
INFO:gluonnlp:05:44:34 Epoch: 0, Batch: 99/22161, Loss=5.1750, lr=0.0000023 Time cost=5.8

INFO:gluonnlp:05:50:35 Epoch: 0, Batch: 3199/22161, Loss=1.5063, lr=0.0000475 Time cost=5.8 Thoughput=34.21 samples/s
INFO:gluonnlp:05:50:41 Epoch: 0, Batch: 3249/22161, Loss=1.6394, lr=0.0000474 Time cost=5.9 Thoughput=33.85 samples/s
INFO:gluonnlp:05:50:47 Epoch: 0, Batch: 3299/22161, Loss=1.4939, lr=0.0000473 Time cost=5.9 Thoughput=33.66 samples/s
INFO:gluonnlp:05:50:53 Epoch: 0, Batch: 3349/22161, Loss=1.5644, lr=0.0000472 Time cost=5.8 Thoughput=34.57 samples/s
INFO:gluonnlp:05:50:58 Epoch: 0, Batch: 3399/22161, Loss=1.4531, lr=0.0000470 Time cost=5.9 Thoughput=34.16 samples/s
INFO:gluonnlp:05:51:04 Epoch: 0, Batch: 3449/22161, Loss=1.7608, lr=0.0000469 Time cost=5.8 Thoughput=34.67 samples/s
INFO:gluonnlp:05:51:10 Epoch: 0, Batch: 3499/22161, Loss=1.7769, lr=0.0000468 Time cost=5.8 Thoughput=34.33 samples/s
INFO:gluonnlp:05:51:16 Epoch: 0, Batch: 3549/22161, Loss=1.5787, lr=0.0000467 Time cost=5.8 Thoughput=34.52 samples/s
INFO:gluonnlp:05:51:22 Epoch: 0, Batch: 3599/22161, Loss

INFO:gluonnlp:05:57:23 Epoch: 0, Batch: 6699/22161, Loss=1.2483, lr=0.0000388 Time cost=5.8 Thoughput=34.45 samples/s
INFO:gluonnlp:05:57:29 Epoch: 0, Batch: 6749/22161, Loss=1.2119, lr=0.0000386 Time cost=5.8 Thoughput=34.60 samples/s
INFO:gluonnlp:05:57:35 Epoch: 0, Batch: 6799/22161, Loss=1.3646, lr=0.0000385 Time cost=5.9 Thoughput=34.06 samples/s
INFO:gluonnlp:05:57:41 Epoch: 0, Batch: 6849/22161, Loss=1.5034, lr=0.0000384 Time cost=5.7 Thoughput=34.84 samples/s
INFO:gluonnlp:05:57:46 Epoch: 0, Batch: 6899/22161, Loss=1.3373, lr=0.0000383 Time cost=5.8 Thoughput=34.29 samples/s
INFO:gluonnlp:05:57:52 Epoch: 0, Batch: 6949/22161, Loss=1.3457, lr=0.0000381 Time cost=5.9 Thoughput=34.15 samples/s
INFO:gluonnlp:05:57:58 Epoch: 0, Batch: 6999/22161, Loss=1.4209, lr=0.0000380 Time cost=5.8 Thoughput=34.21 samples/s
INFO:gluonnlp:05:58:04 Epoch: 0, Batch: 7049/22161, Loss=1.3403, lr=0.0000379 Time cost=5.8 Thoughput=34.38 samples/s
INFO:gluonnlp:05:58:10 Epoch: 0, Batch: 7099/22161, Loss

INFO:gluonnlp:06:04:12 Epoch: 0, Batch: 10199/22161, Loss=1.1220, lr=0.0000300 Time cost=5.8 Thoughput=34.64 samples/s
INFO:gluonnlp:06:04:18 Epoch: 0, Batch: 10249/22161, Loss=1.2764, lr=0.0000299 Time cost=5.8 Thoughput=34.33 samples/s
INFO:gluonnlp:06:04:24 Epoch: 0, Batch: 10299/22161, Loss=1.3541, lr=0.0000297 Time cost=5.9 Thoughput=34.06 samples/s
INFO:gluonnlp:06:04:30 Epoch: 0, Batch: 10349/22161, Loss=1.3388, lr=0.0000296 Time cost=5.8 Thoughput=34.55 samples/s
INFO:gluonnlp:06:04:36 Epoch: 0, Batch: 10399/22161, Loss=1.1667, lr=0.0000295 Time cost=5.9 Thoughput=33.80 samples/s
INFO:gluonnlp:06:04:41 Epoch: 0, Batch: 10449/22161, Loss=1.3826, lr=0.0000294 Time cost=5.8 Thoughput=34.59 samples/s
INFO:gluonnlp:06:04:47 Epoch: 0, Batch: 10499/22161, Loss=1.3370, lr=0.0000292 Time cost=5.9 Thoughput=34.13 samples/s
INFO:gluonnlp:06:04:53 Epoch: 0, Batch: 10549/22161, Loss=1.1427, lr=0.0000291 Time cost=5.9 Thoughput=33.83 samples/s
INFO:gluonnlp:06:04:59 Epoch: 0, Batch: 10599/22

INFO:gluonnlp:06:10:55 Epoch: 0, Batch: 13649/22161, Loss=1.2725, lr=0.0000213 Time cost=5.9 Thoughput=34.03 samples/s
INFO:gluonnlp:06:11:01 Epoch: 0, Batch: 13699/22161, Loss=1.2749, lr=0.0000212 Time cost=5.8 Thoughput=34.47 samples/s
INFO:gluonnlp:06:11:07 Epoch: 0, Batch: 13749/22161, Loss=1.0370, lr=0.0000211 Time cost=5.8 Thoughput=34.46 samples/s
INFO:gluonnlp:06:11:13 Epoch: 0, Batch: 13799/22161, Loss=1.1329, lr=0.0000210 Time cost=5.8 Thoughput=34.33 samples/s
INFO:gluonnlp:06:11:19 Epoch: 0, Batch: 13849/22161, Loss=1.2716, lr=0.0000208 Time cost=5.8 Thoughput=34.54 samples/s
INFO:gluonnlp:06:11:24 Epoch: 0, Batch: 13899/22161, Loss=1.2175, lr=0.0000207 Time cost=5.8 Thoughput=34.41 samples/s
INFO:gluonnlp:06:11:30 Epoch: 0, Batch: 13949/22161, Loss=1.2390, lr=0.0000206 Time cost=5.9 Thoughput=34.10 samples/s
INFO:gluonnlp:06:11:36 Epoch: 0, Batch: 13999/22161, Loss=0.9471, lr=0.0000205 Time cost=5.8 Thoughput=34.57 samples/s
INFO:gluonnlp:06:11:42 Epoch: 0, Batch: 14049/22

INFO:gluonnlp:06:17:38 Epoch: 0, Batch: 17099/22161, Loss=1.0581, lr=0.0000127 Time cost=5.9 Thoughput=34.02 samples/s
INFO:gluonnlp:06:17:44 Epoch: 0, Batch: 17149/22161, Loss=0.9211, lr=0.0000126 Time cost=5.9 Thoughput=34.05 samples/s
INFO:gluonnlp:06:17:50 Epoch: 0, Batch: 17199/22161, Loss=1.1125, lr=0.0000124 Time cost=6.0 Thoughput=33.50 samples/s
INFO:gluonnlp:06:17:56 Epoch: 0, Batch: 17249/22161, Loss=1.0050, lr=0.0000123 Time cost=5.9 Thoughput=33.63 samples/s
INFO:gluonnlp:06:18:02 Epoch: 0, Batch: 17299/22161, Loss=1.0041, lr=0.0000122 Time cost=5.8 Thoughput=34.26 samples/s
INFO:gluonnlp:06:18:08 Epoch: 0, Batch: 17349/22161, Loss=1.0046, lr=0.0000121 Time cost=5.8 Thoughput=34.43 samples/s
INFO:gluonnlp:06:18:14 Epoch: 0, Batch: 17399/22161, Loss=0.9258, lr=0.0000119 Time cost=5.8 Thoughput=34.32 samples/s
INFO:gluonnlp:06:18:19 Epoch: 0, Batch: 17449/22161, Loss=1.1690, lr=0.0000118 Time cost=5.8 Thoughput=34.35 samples/s
INFO:gluonnlp:06:18:25 Epoch: 0, Batch: 17499/22

INFO:gluonnlp:06:24:22 Epoch: 0, Batch: 20549/22161, Loss=1.2698, lr=0.0000040 Time cost=5.9 Thoughput=34.00 samples/s
INFO:gluonnlp:06:24:28 Epoch: 0, Batch: 20599/22161, Loss=0.8955, lr=0.0000039 Time cost=5.9 Thoughput=34.14 samples/s
INFO:gluonnlp:06:24:34 Epoch: 0, Batch: 20649/22161, Loss=0.9238, lr=0.0000038 Time cost=5.8 Thoughput=34.19 samples/s
INFO:gluonnlp:06:24:39 Epoch: 0, Batch: 20699/22161, Loss=0.9906, lr=0.0000037 Time cost=5.8 Thoughput=34.46 samples/s
INFO:gluonnlp:06:24:45 Epoch: 0, Batch: 20749/22161, Loss=1.0895, lr=0.0000035 Time cost=5.8 Thoughput=34.78 samples/s
INFO:gluonnlp:06:24:51 Epoch: 0, Batch: 20799/22161, Loss=1.0938, lr=0.0000034 Time cost=5.8 Thoughput=34.60 samples/s
INFO:gluonnlp:06:24:57 Epoch: 0, Batch: 20849/22161, Loss=1.0016, lr=0.0000033 Time cost=5.9 Thoughput=34.05 samples/s
INFO:gluonnlp:06:25:03 Epoch: 0, Batch: 20899/22161, Loss=1.0262, lr=0.0000032 Time cost=5.8 Thoughput=34.28 samples/s
INFO:gluonnlp:06:25:09 Epoch: 0, Batch: 20949/22
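
The learning rates printed in the log rise over the first tenth of the epoch and then fall steadily. They are consistent with a linear warmup to `lr=5e-05` over `warmup_ratio=0.1` of the steps, followed by a linear decay to zero. The sketch below is inferred from the logged values rather than taken from `finetune_squad.py`. The function name `learning_rate` and the 1-based step counter are assumptions.

```python
def learning_rate(step, total_steps, base_lr=5e-5, warmup_ratio=0.1):
    """Linear warmup to base_lr, then linear decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    offset = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * (1.0 - offset)


total_steps = 22161  # batches per epoch reported in the log, for one epoch
for batch_id in (49, 99, 3199, 20549):
    step = batch_id + 1  # assumed: the step counter is 1-based
    print('Batch %5d  lr=%.7f' % (batch_id, learning_rate(step, total_steps)))
```

Warmup avoids large updates to the pre-trained weights while the newly added output layer is still random.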

In [19]:
end = time.time()
print(end - start)

2770.2639062404633
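
The training log reports 87599 records but 88641 examples after preprocessing. A plausible explanation, assuming `finetune_squad.py` follows the reference BERT SQuAD preprocessing, is that a context too long for `max_seq_length=384` is split into overlapping windows that advance by `doc_stride=128` tokens, with each window becoming its own example. The sketch below illustrates that windowing. The function name `split_context` is hypothetical.

```python
def split_context(num_context_tokens, num_query_tokens,
                  max_seq_length=384, doc_stride=128):
    """Return (start, length) windows over the context tokens."""
    # Reserve room for [CLS], the question, and two [SEP] tokens.
    max_context = max_seq_length - num_query_tokens - 3
    windows, start = [], 0
    while start < num_context_tokens:
        length = min(num_context_tokens - start, max_context)
        windows.append((start, length))
        if start + length == num_context_tokens:
            break  # this window reaches the end of the context
        start += min(length, doc_stride)
    return windows


print(split_context(157, 8))  # short context: a single window
print(split_context(600, 8))  # long context: several overlapping windows
```

The overlap means an answer that falls near a window boundary still appears with surrounding context in at least one window.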


### Further Resources

- Dive into Deep Learning http://d2l.ai/

- GluonNLP http://gluon-nlp.mxnet.io/
- GluonCV http://gluon-cv.mxnet.io/
- GluonTS https://gluon-ts.mxnet.io/
- Deep Graph Library https://www.dgl.ai/
- MXNet Forum https://discuss.mxnet.io/

- Amazon SageMaker https://aws.amazon.com/sagemaker/
- Amazon SageMaker Python SDK https://sagemaker.readthedocs.io/
- Amazon SageMaker Developer Guide https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-dg.pdf


### References

[1] Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).

[2] Peters, Matthew E., et al. "Deep contextualized word representations." arXiv preprint arXiv:1802.05365 (2018).

[3] Hendrycks, Dan, and Kevin Gimpel. "Gaussian error linear units (gelus)." arXiv preprint arXiv:1606.08415 (2016).