# Hands-on: Training and deploying Question Answering with BERT

Pre-trained language representations have been shown to improve many downstream NLP tasks such as question answering, and natural language inference. Devlin, Jacob, et al proposed BERT [1] (Bidirectional Encoder Representations from Transformers), which fine-tunes deep bidirectional representations on a wide range of tasks with minimal task-specific parameters, and obtained state- of-the-art results.

In this tutorial, we will focus on adapting the BERT model for the question answering task on the SQuAD dataset. Specifically, we will:

- understand how to pre-process the SQuAD dataset to leverage the learnt representation in BERT,
- adapt the BERT model to the question answering task, and
- load a trained model to perform inference on the SQuAD dataset

In [1]:
# this notebook requires mxnet-cu101 >= 1.6.0b20191102, gluonnlp >= 0.8.1
# we can create a sagemaker notebook instance with the lifecycle configuration file: sagemaker-lifecycle.config
!pip list | grep mxnet
!pip list | grep gluonnlp

keras-mxnet                        2.2.4.2       
mxnet-cu101                        1.6.0b20191122
mxnet-model-server                 1.0.5         
[33mYou are using pip version 10.0.1, however version 19.3.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.[0m
gluonnlp                           0.9.0.dev0    
[33mYou are using pip version 10.0.1, however version 19.3.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.[0m


## Load MXNet and GluonNLP

We first import the libraries:

In [2]:
import argparse, collections, time, logging
import json
import os
import io
import copy
import random
import warnings

import numpy as np
import gluonnlp as nlp
import mxnet as mx
import bert
import qa_utils

from gluonnlp.data import SQuAD
from bert.model.qa import BertForQALoss, BertForQA
from bert.data.qa import SQuADTransform, preprocess_dataset
from bert.bert_qa_evaluate import get_F1_EM, predict, PredResult

# Hyperparameters
parser = argparse.ArgumentParser('BERT finetuning')
parser.add_argument('--epochs', type=int, default=3)
parser.add_argument('--batch_size', default=32)
parser.add_argument('--num_epochs', default=1)
parser.add_argument('--lr', default=5e-5)

parser.add_argument('--output_dir',
                    type=str,
                    default='./output_dir',
                    help='The output directory where the model params will be written.'
                    ' default is ./output_dir')
parser.add_argument('--test_batch_size',
                    type=int,
                    default=24,
                    help='Test batch size. default is 24')
parser.add_argument('--optimizer',
                    type=str,
                    default='bertadam',
                    help='optimization algorithm. default is bertadam')
parser.add_argument('--accumulate',
                    type=int,
                    default=None,
                    help='The number of batches for '
                    'gradients accumulation to simulate large batch size. Default is None')
parser.add_argument('--warmup_ratio',
                    type=float,
                    default=0.1,
                    help='ratio of warmup steps that linearly increase learning rate from '
                    '0 to target learning rate. default is 0.1')
parser.add_argument('--log_interval',
                    type=int,
                    default=50,
                    help='report interval. default is 50')
parser.add_argument('--max_seq_length',
                    type=int,
                    default=384,
                    help='The maximum total input sequence length after WordPiece tokenization.'
                    'Sequences longer than this will be truncated, and sequences shorter '
                    'than this will be padded. default is 384')
parser.add_argument('--doc_stride',
                    type=int,
                    default=128,
                    help='When splitting up a long document into chunks, how much stride to '
                    'take between chunks. default is 128')
parser.add_argument('--max_query_length',
                    type=int,
                    default=64,
                    help='The maximum number of tokens for the question. Questions longer than '
                    'this will be truncated to this length. default is 64')
parser.add_argument('--n_best_size',
                    type=int,
                    default=20,
                    help='The total number of n-best predictions to generate in the '
                    'nbest_predictions.json output file. default is 20')
parser.add_argument('--max_answer_length',
                    type=int,
                    default=30,
                    help='The maximum length of an answer that can be generated. This is needed '
                    'because the start and end predictions are not conditioned on one another.'
                    ' default is 30')
# parser.add_argument('--version_2',
#                     action='store_true',
#                     help='SQuAD examples whether contain some that do not have an answer.')
parser.add_argument('--null_score_diff_threshold',
                    type=float,
                    default=0.0,
                    help='If null_score - best_non_null is greater than the threshold predict null.'
                    'Typical values are between -1.0 and -5.0. default is 0.0')
parser.add_argument('--sentencepiece',
                    type=str,
                    default=None,
                    help='Path to the sentencepiece .model file for both tokenization and vocab.')
# parser.add_argument('--debug',
#                     action='store_true',
#                     help='Run the example in test mode for sanity checks')


args = parser.parse_args([])


epochs = args.epochs
batch_size = args.batch_size
num_epochs = args.num_epochs
lr = args.lr

output_dir = args.output_dir
if not os.path.exists(output_dir):
    os.mkdir(output_dir)
test_batch_size = args.test_batch_size
optimizer = args.optimizer
accumulate = args.accumulate
warmup_ratio = args.warmup_ratio
log_interval = args.log_interval
max_seq_length = args.max_seq_length
doc_stride = args.doc_stride
max_query_length = args.max_query_length
n_best_size = args.n_best_size

## Inspect the SQuAD Dataset

Then we take a look at the Stanford Question Answering Dataset (SQuAD). The dataset can be downloaded using the `nlp.data.SQuAD` API. In this tutorial, we create a small dataset with 3 samples from the SQuAD dataset for demonstration purpose.

The question answering task on the SQuAD dataset is setup the following way. For each sample in the dataset, a context is provided. The context is usually a long paragraph which contains lots of information. Then a question asked based on the context. The goal is to find the text span in the context that answers the question in the sample.

In [3]:
full_data = nlp.data.SQuAD(segment='dev', version='1.1')
# loading a subset of the dev set of SQuAD
num_target_samples = 3
target_samples = [full_data[i] for i in range(num_target_samples)]
dataset = mx.gluon.data.SimpleDataset(target_samples)
print('Number of samples in the created dataset subsampled from SQuAD = %d'%len(dataset))

Number of samples in the created dataset subsampled from SQuAD = 3


Let's take a look at a sample from the dataset. In this sample, the question is about the location of the game, with a description about the Super Bowl 50 game as the context. Note that three different answer spans are correct for this question, and they start from index 403, 355 and 355 in the context respectively.

In [4]:
sample = dataset[2]

context_idx = 3

print('\nContext:\n')
print(sample[context_idx])


Context:

Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.


In [5]:
question_idx = 2
answer_idx = 4
answer_pos_idx = 5

print("\nQuestion")
print(sample[question_idx])
print("\nCorrect Answer Spans")
print(sample[answer_idx])
print("\nAnswer Span Start Indices:")
print(sample[answer_pos_idx])


Question
Where did Super Bowl 50 take place?

Correct Answer Spans
['Santa Clara, California', "Levi's Stadium", "Levi's Stadium in the San Francisco Bay Area at Santa Clara, California."]

Answer Span Start Indices:
[403, 355, 355]


## Data Pre-processing for QA with BERT

Recall that during BERT pre-training, it takes a sentence pair as the input, separated by the 'SEP' special token. For SQuAD, we can feed the context-question pair as the sentence pair input. To use BERT to predict the starting and ending span of the answer, we can add a classification layer for each token in the context texts, to predict if a token is the start or the end of the answer span. 

![qa](natural_language_understanding/qa.png)

In the next few code blocks, we will work on pre-processing the samples in the SQuAD dataset in the desired format with these special separators. 


### Get Pre-trained BERT Model

First, let's use the *get_model* API in GluonNLP to get the model definition for BERT, and the vocabulary used for the BERT model. Note that we discard the pooler and classifier layers used for the next sentence prediction task, as well as the decoder layers for the masked language model task during the BERT pre-training phase. These layers are not useful for predicting the starting and ending indices of the answer span.

The list of pre-trained BERT models available in GluonNLP can be found [here](http://gluon-nlp.mxnet.io/model_zoo/bert/index.html).

In [6]:
bert_model, vocab = nlp.model.get_model('bert_12_768_12',
                                        dataset_name='book_corpus_wiki_en_uncased',
                                        use_classifier=False,
                                        use_decoder=False,
                                        use_pooler=False,
                                        pretrained=False)

Note that there are several special tokens in the vocabulary for BERT. In particular, the `[SEP]` token is used for separating the sentence pairs, and the `[CLS]` token is added at the beginning of the sentence pairs. They will be used to pre-process the SQuAD dataset later.

In [7]:
print(vocab)

Vocab(size=30522, unk="[UNK]", reserved="['[CLS]', '[SEP]', '[MASK]', '[PAD]']")


### Tokenization

The second step is to process the samples using the same tokenizer used for BERT, which is provided as the `BERTTokenizer` API in GluonNLP. Note that instead of word level and character level representation, BERT uses subwords to represent a word, separated `##`. 

In the following example, the word `suspending` is tokenized as two subwords (`suspend` and `##ing`), and `numerals` is tokenized as three subwords (`nu`, `##meral`, `##s`).

In [8]:
tokenizer = nlp.data.BERTTokenizer(vocab=vocab, lower=True)

tokenizer("as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals")

['as',
 'well',
 'as',
 'temporarily',
 'suspend',
 '##ing',
 'the',
 'tradition',
 'of',
 'naming',
 'each',
 'super',
 'bowl',
 'game',
 'with',
 'roman',
 'nu',
 '##meral',
 '##s']

### Sentence Pair Composition

With the tokenizer inplace, we are ready to process the question-context texts and compose sentence pairs. The functionality is available via the `SQuADTransform` API. 

In [9]:
transform = bert.data.qa.SQuADTransform(tokenizer, is_pad=False, is_training=False, do_lookup=False)
dev_data_transform, _ = bert.data.qa.preprocess_dataset(dataset, transform)
logging.info('The number of examples after preprocessing:{}'.format(len(dev_data_transform)))

Done! Transform dataset costs 0.17 seconds.


Let's take a look at the sample after the transformation:

In [10]:
sample = dev_data_transform[2]
print('\nsegment type: \n' + str(sample[2]))
print('\ntext length: ' + str(sample[3]))
print('\nsentence pair: \n' + str(sample[1]))


segment type: 
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

text length: 168

sentence pair: 
['[CLS]', 'where', 'did', 'super', 'bowl', '50', 'take', 'place', '?', '[SEP]', 'super', 'bowl', '50', 'was', 'an', 'american', 'football', 'game', 'to', 'determine', 'the', 'champion', 'of', 'the', 'national', 'football', 'league', '(', 'nfl', ')', 'for', 'the', '2015', 'season', '.', 'the', 'american', 'football', 'conference', '(', 'afc', ')', 'champion', 'denver', 'broncos', 'defeated', 'the', 'national', 'football', 'conference', '(', 

### Vocabulary Lookup

Finally, we convert the transformed texts to subword indices, which are used to contructor NDArrays as the inputs to the model.

In [11]:
def vocab_lookup(example_id, subwords, type_ids, length, start, end):
    indices = vocab[subwords]
    return example_id, indices, type_ids, length, start, end

dev_data_transform = dev_data_transform.transform(vocab_lookup, lazy=False)
print(dev_data_transform[2][1])

[2, 2073, 2106, 3565, 4605, 2753, 2202, 2173, 1029, 3, 3565, 4605, 2753, 2001, 2019, 2137, 2374, 2208, 2000, 5646, 1996, 3410, 1997, 1996, 2120, 2374, 2223, 1006, 5088, 1007, 2005, 1996, 2325, 2161, 1012, 1996, 2137, 2374, 3034, 1006, 10511, 1007, 3410, 7573, 14169, 3249, 1996, 2120, 2374, 3034, 1006, 22309, 1007, 3410, 3792, 12915, 2484, 1516, 2184, 2000, 7796, 2037, 2353, 3565, 4605, 2516, 1012, 1996, 2208, 2001, 2209, 2006, 2337, 1021, 1010, 2355, 1010, 2012, 11902, 1005, 1055, 3346, 1999, 1996, 2624, 3799, 3016, 2181, 2012, 4203, 10254, 1010, 2662, 1012, 2004, 2023, 2001, 1996, 12951, 3565, 4605, 1010, 1996, 2223, 13155, 1996, 1000, 3585, 5315, 1000, 2007, 2536, 2751, 1011, 11773, 11107, 1010, 2004, 2092, 2004, 8184, 28324, 2075, 1996, 4535, 1997, 10324, 2169, 3565, 4605, 2208, 2007, 3142, 16371, 28990, 2015, 1006, 2104, 2029, 1996, 2208, 2052, 2031, 2042, 2124, 2004, 1000, 3565, 4605, 1048, 1000, 1007, 1010, 2061, 2008, 1996, 8154, 2071, 14500, 3444, 1996, 5640, 16371, 28990, 2015

## Model Definition

After the data is processed, we can define the model that uses the representation produced by BERT for predicting the starting and ending positions of the answer span.

We download a BERT model trained on the SQuAD dataset, prepare the dataloader.

In [12]:
net = BertForQA(bert_model)

ctx = mx.gpu(0)
## multi-gpu training
# GPU_COUNT = 4 # increase if you have more
# ctx = [mx.gpu(i) for i in range(GPU_COUNT)]

ckpt = qa_utils.download_qa_ckpt()
net.load_parameters(ckpt, ctx=ctx)

batch_size = 1
dev_dataloader = mx.gluon.data.DataLoader(
    dev_data_transform, batch_size=batch_size, shuffle=False)

Downloaded checkpoint to ./temp/bert_qa-7eb11865.params


In [13]:
# all_results = collections.defaultdict(list)

# total_num = 0
# for data in dev_dataloader:
#     example_ids, inputs, token_types, valid_length, _, _ = data
#     total_num += len(inputs)
#     batch_size = inputs.shape[0]
#     pred_start, pred_end = net(inputs.astype('float32').as_in_context(ctx),
#                                token_types.astype('float32').as_in_context(ctx),
#                                valid_length.astype('float32').as_in_context(ctx))

#     example_ids = example_ids.asnumpy().tolist()
#     pred_start = pred_start.reshape(batch_size, -1).asnumpy()
#     pred_end = pred_end.reshape(batch_size, -1).asnumpy()
    
#     for example_id, start, end in zip(example_ids, pred_start, pred_end):
#         all_results[example_id].append(PredResult(start=start, end=end))

In [14]:
# qa_utils.predict(dataset, all_results, vocab)

### Let's Train the Model

Now we can put all the pieces together, and start fine-tuning the model with a few epochs.

In [15]:
# net = BertForQA(bert=bert_model)
# nlp.utils.load_parameters(net, pretrained_bert_parameters, ctx=ctx,
#                           ignore_extra=True, cast_dtype=True)
net.span_classifier.initialize(init=mx.init.Normal(0.02), ctx=ctx)
net.hybridize(static_alloc=True)

loss_function = BertForQALoss()
loss_function.hybridize(static_alloc=True)

  v.initialize(None, ctx, init, force_reinit=force_reinit)
  v.initialize(None, ctx, init, force_reinit=force_reinit)


In [16]:

batchify_fn = nlp.data.batchify.Tuple(
    nlp.data.batchify.Stack(),
    nlp.data.batchify.Pad(axis=0, pad_val=vocab[vocab.padding_token]),
    nlp.data.batchify.Pad(axis=0, pad_val=vocab[vocab.padding_token]),
    nlp.data.batchify.Stack('float32'),
    nlp.data.batchify.Stack('float32'),
    nlp.data.batchify.Stack('float32'))

np.random.seed(6)
random.seed(6)
mx.random.seed(6)

log = logging.getLogger('gluonnlp')
log.setLevel(logging.DEBUG)
formatter = logging.Formatter(
    fmt='%(levelname)s:%(name)s:%(asctime)s %(message)s', datefmt='%H:%M:%S')


segment = 'train'  # if not args.debug else 'dev'
log.info('Loading %s data...', segment)
#     if version_2:
#         train_data = SQuAD(segment, version='2.0')
#     else:
train_data = SQuAD(segment, version='1.1')
#     if args.debug:
#         sampled_data = [train_data[i] for i in range(1000)]
#         train_data = mx.gluon.data.SimpleDataset(sampled_data)
log.info('Number of records in Train data:{}'.format(len(train_data)))

train_data_transform, _ = preprocess_dataset(
    train_data, SQuADTransform(
        copy.copy(tokenizer),
        max_seq_length=max_seq_length,
        doc_stride=doc_stride,
        max_query_length=max_query_length,
        is_pad=True,
        is_training=True))
log.info('The number of examples after preprocessing:{}'.format(
    len(train_data_transform)))

train_dataloader = mx.gluon.data.DataLoader(
    train_data_transform, batchify_fn=batchify_fn,
    batch_size=batch_size, num_workers=4, shuffle=True)

INFO:gluonnlp:Loading train data...
INFO:gluonnlp:Number of records in Train data:87599
INFO:gluonnlp:The number of examples after preprocessing:88641


Done! Transform dataset costs 57.50 seconds.


In [None]:

def train(log, train_data_transform, train_dataloader):
    """Training function."""

    log.info('Start Training')

    optimizer_params = {'learning_rate': lr}
    trainer = mx.gluon.Trainer(net.collect_params(), optimizer,
                               optimizer_params, update_on_kvstore=False)

    num_train_examples = len(train_data_transform)
    step_size = batch_size * accumulate if accumulate else batch_size
    num_train_steps = int(num_train_examples / step_size * epochs)
    num_warmup_steps = int(num_train_steps * warmup_ratio)
    step_num = 0
    
    def set_new_lr(step_num, batch_id):
        """set new learning rate"""
        # set grad to zero for gradient accumulation
        if accumulate:
            if batch_id % accumulate == 0:
                net.collect_params().zero_grad()
                step_num += 1
        else:
            step_num += 1
        # learning rate schedule
        # Notice that this learning rate scheduler is adapted from traditional linear learning
        # rate scheduler where step_num >= num_warmup_steps, new_lr = 1 - step_num/num_train_steps
        if step_num < num_warmup_steps:
            new_lr = lr * step_num / num_warmup_steps
        else:
            offset = (step_num - num_warmup_steps) * lr / \
                (num_train_steps - num_warmup_steps)
            new_lr = lr - offset
        trainer.set_learning_rate(new_lr)
        return step_num

    # Do not apply weight decay on LayerNorm and bias terms
    for _, v in net.collect_params('.*beta|.*gamma|.*bias').items():
        v.wd_mult = 0.0
    # Collect differentiable parameters
    params = [p for p in net.collect_params().values()
              if p.grad_req != 'null']
    # Set grad_req if gradient accumulation is required
    if accumulate:
        for p in params:
            p.grad_req = 'add'

    epoch_tic = time.time()
    total_num = 0
    log_num = 0
    for epoch_id in range(epochs):
        step_loss = 0.0
        tic = time.time()
        for batch_id, data in enumerate(train_dataloader):
            # set new lr
            step_num = set_new_lr(step_num, batch_id)
            # forward and backward
            with mx.autograd.record():
                _, inputs, token_types, valid_length, start_label, end_label = data

                log_num += len(inputs)
                total_num += len(inputs)

                out = net(inputs.astype('float32').as_in_context(ctx),
                          token_types.astype('float32').as_in_context(ctx),
                          valid_length.astype('float32').as_in_context(ctx))

                ls = loss_function(out, [
                    start_label.astype('float32').as_in_context(ctx),
                    end_label.astype('float32').as_in_context(ctx)]).mean()

                if accumulate:
                    ls = ls / accumulate
            ls.backward()
            # update
            if not accumulate or (batch_id + 1) % accumulate == 0:
                trainer.allreduce_grads()
                nlp.utils.clip_grad_global_norm(params, 1)
                trainer.update(1)

            step_loss += ls.asscalar()
        
#         for batch_id, data in enumerate(train_dataloader):
#             # set new lr
#             step_num = set_new_lr(step_num, batch_id)
#             # forward and backward
# #             with mx.autograd.record():
#             _, inputs, token_types, valid_length, start_label, end_label = data

#             log_num += len(inputs)
#             total_num += len(inputs)

#             def split_and_load(data, ctx):
#                 n, k = data.shape[0], len(ctx)
#                 print(n, k)
#                 if (n//k)*k != n:
#                     drop = n - (n//k)*k
#                     data = data[:-drop]
#                     n, k = data.shape[0], len(ctx)
#                 assert (n//k)*k == n, '# examples is not divided by # devices'
#                 idx = list(range(0, n+1, n//k))
#                 return [data[idx[i]:idx[i+1]].as_in_context(ctx[i]) for i in range(k)]
            
# #                 def train_batch(inputs, params, ctx, lr):
#                     # split the data batch and load them on GPUs
#             print(len(inputs[0]), len(token_types[0]), len(valid_length[0]), len(start_label[0]), len(end_label[0]))
#             inputs = split_and_load(inputs[0], ctx)
#             token_types = split_and_load(token_types[0], ctx)
# #             valid_length = split_and_load(valid_length, ctx)
# #             start_label = split_and_load(start_label[0], ctx)
# #             end_label = split_and_load(end_label[0], ctx)

#             # run forward on each GPU
#             with mx.autograd.record():
#                 losses = [loss_function(net(X, Y, W), [U, V])
#                           for X, Y, W, U, V in zip(inputs, token_types, valid_length, start_label, end_label)]
#             # run backward on each gpu
#             for ls in losses:
#                 ls.backward()
#                 step_loss += ls.asscalar()
#             # aggregate gradient over GPUs
#             for i in range(len(params[0])):
#                 allreduce([params[c][i].grad for c in range(len(ctx))])
#             # update parameters with SGD on each GPU
#             for p in params:
#                 nlp.utils.clip_grad_global_norm(P, 1)
#                 trainer.update(1)
            if (batch_id + 1) % log_interval == 0: 
                toc = time.time()
                log.info('Epoch: {}, Batch: {}/{}, Loss={:.4f}, lr={:.7f} Time cost={:.1f} Thoughput={:.2f} samples/s'  # pylint: disable=line-too-long
                         .format(epoch_id, batch_id, len(train_dataloader),
                                 step_loss / log_interval,
                                 trainer.learning_rate, toc - tic, log_num/(toc - tic)))
                tic = time.time()
                step_loss = 0.0
                log_num = 0
        epoch_toc = time.time()
        log.info('Time cost={:.2f} s, Thoughput={:.2f} samples/s'.format(
            epoch_toc - epoch_tic, total_num/(epoch_toc - epoch_tic)))

    net.save_parameters(os.path.join(output_dir, 'net.params'))


train(log, train_data_transform, train_dataloader)

INFO:gluonnlp:Start Training
INFO:gluonnlp:Epoch: 0, Batch: 49/88641, Loss=0.6069, lr=0.0000001 Time cost=10.9 Thoughput=4.59 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 99/88641, Loss=0.6334, lr=0.0000002 Time cost=4.1 Thoughput=12.11 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 149/88641, Loss=0.7261, lr=0.0000003 Time cost=4.1 Thoughput=12.06 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 199/88641, Loss=0.5254, lr=0.0000004 Time cost=4.3 Thoughput=11.62 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 249/88641, Loss=0.5394, lr=0.0000005 Time cost=4.1 Thoughput=12.08 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 299/88641, Loss=0.4510, lr=0.0000006 Time cost=4.1 Thoughput=12.27 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 349/88641, Loss=0.7645, lr=0.0000007 Time cost=4.1 Thoughput=12.25 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 399/88641, Loss=0.6211, lr=0.0000008 Time cost=4.1 Thoughput=12.24 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 449/88641, Loss=0.6223, lr=0.0000008 Time cost=4.1 Thoughput=12.25 samples/s
I

INFO:gluonnlp:Epoch: 0, Batch: 3849/88641, Loss=1.1156, lr=0.0000072 Time cost=4.1 Thoughput=12.27 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 3899/88641, Loss=0.9664, lr=0.0000073 Time cost=4.1 Thoughput=12.13 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 3949/88641, Loss=0.4874, lr=0.0000074 Time cost=4.1 Thoughput=12.09 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 3999/88641, Loss=0.7005, lr=0.0000075 Time cost=4.1 Thoughput=12.17 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 4049/88641, Loss=1.0819, lr=0.0000076 Time cost=4.1 Thoughput=12.20 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 4099/88641, Loss=0.6048, lr=0.0000077 Time cost=4.1 Thoughput=12.14 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 4149/88641, Loss=0.8377, lr=0.0000078 Time cost=4.1 Thoughput=12.15 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 4199/88641, Loss=0.6581, lr=0.0000079 Time cost=4.1 Thoughput=12.27 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 4249/88641, Loss=0.9878, lr=0.0000080 Time cost=4.1 Thoughput=12.23 samples/s
INFO:gluonnlp:Epoch

INFO:gluonnlp:Epoch: 0, Batch: 7649/88641, Loss=1.2989, lr=0.0000144 Time cost=4.1 Thoughput=12.10 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 7699/88641, Loss=0.7474, lr=0.0000145 Time cost=4.1 Thoughput=12.24 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 7749/88641, Loss=0.8412, lr=0.0000146 Time cost=4.1 Thoughput=12.25 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 7799/88641, Loss=1.1350, lr=0.0000147 Time cost=4.1 Thoughput=12.25 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 7849/88641, Loss=0.9285, lr=0.0000148 Time cost=4.1 Thoughput=12.25 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 7899/88641, Loss=1.0477, lr=0.0000149 Time cost=4.2 Thoughput=11.99 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 7949/88641, Loss=0.8970, lr=0.0000149 Time cost=4.1 Thoughput=12.27 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 7999/88641, Loss=0.9430, lr=0.0000150 Time cost=4.1 Thoughput=12.30 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 8049/88641, Loss=0.8372, lr=0.0000151 Time cost=4.1 Thoughput=12.30 samples/s
INFO:gluonnlp:Epoch

INFO:gluonnlp:Epoch: 0, Batch: 11399/88641, Loss=1.2425, lr=0.0000214 Time cost=4.1 Thoughput=12.26 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 11449/88641, Loss=1.1879, lr=0.0000215 Time cost=4.1 Thoughput=12.28 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 11499/88641, Loss=1.3414, lr=0.0000216 Time cost=4.1 Thoughput=12.29 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 11549/88641, Loss=1.2133, lr=0.0000217 Time cost=4.1 Thoughput=12.27 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 11599/88641, Loss=0.6397, lr=0.0000218 Time cost=4.1 Thoughput=12.21 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 11649/88641, Loss=1.3349, lr=0.0000219 Time cost=4.1 Thoughput=12.19 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 11699/88641, Loss=1.9567, lr=0.0000220 Time cost=4.1 Thoughput=12.20 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 11749/88641, Loss=0.9104, lr=0.0000221 Time cost=4.1 Thoughput=12.17 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 11799/88641, Loss=1.2024, lr=0.0000222 Time cost=4.1 Thoughput=12.19 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 0, Batch: 15149/88641, Loss=2.1788, lr=0.0000285 Time cost=4.1 Thoughput=12.26 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 15199/88641, Loss=1.1626, lr=0.0000286 Time cost=4.1 Thoughput=12.06 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 15249/88641, Loss=1.7322, lr=0.0000287 Time cost=4.2 Thoughput=12.02 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 15299/88641, Loss=1.3035, lr=0.0000288 Time cost=4.2 Thoughput=12.01 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 15349/88641, Loss=0.9746, lr=0.0000289 Time cost=4.2 Thoughput=12.00 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 15399/88641, Loss=1.1667, lr=0.0000290 Time cost=4.1 Thoughput=12.22 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 15449/88641, Loss=1.1696, lr=0.0000291 Time cost=4.1 Thoughput=12.26 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 15499/88641, Loss=1.2949, lr=0.0000291 Time cost=4.1 Thoughput=12.28 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 15549/88641, Loss=1.4785, lr=0.0000292 Time cost=4.1 Thoughput=12.25 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 0, Batch: 18899/88641, Loss=1.6568, lr=0.0000355 Time cost=4.1 Thoughput=12.26 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 18949/88641, Loss=1.9663, lr=0.0000356 Time cost=4.1 Thoughput=12.27 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 18999/88641, Loss=1.1927, lr=0.0000357 Time cost=4.1 Thoughput=12.25 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 19049/88641, Loss=1.3611, lr=0.0000358 Time cost=4.1 Thoughput=12.24 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 19099/88641, Loss=1.4417, lr=0.0000359 Time cost=4.1 Thoughput=12.20 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 19149/88641, Loss=1.5373, lr=0.0000360 Time cost=4.1 Thoughput=12.14 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 19199/88641, Loss=1.2282, lr=0.0000361 Time cost=4.1 Thoughput=12.12 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 19249/88641, Loss=1.6397, lr=0.0000362 Time cost=4.1 Thoughput=12.19 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 19299/88641, Loss=1.1250, lr=0.0000363 Time cost=4.1 Thoughput=12.16 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 0, Batch: 22649/88641, Loss=2.0125, lr=0.0000426 Time cost=4.1 Thoughput=12.24 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 22699/88641, Loss=1.6710, lr=0.0000427 Time cost=4.1 Thoughput=12.27 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 22749/88641, Loss=1.8365, lr=0.0000428 Time cost=4.1 Thoughput=12.25 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 22799/88641, Loss=1.6007, lr=0.0000429 Time cost=4.1 Thoughput=12.14 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 22849/88641, Loss=1.2620, lr=0.0000430 Time cost=4.1 Thoughput=12.26 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 22899/88641, Loss=1.5063, lr=0.0000431 Time cost=4.1 Thoughput=12.27 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 22949/88641, Loss=1.2442, lr=0.0000432 Time cost=4.1 Thoughput=12.26 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 22999/88641, Loss=1.2672, lr=0.0000432 Time cost=4.1 Thoughput=12.25 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 23049/88641, Loss=1.6810, lr=0.0000433 Time cost=4.1 Thoughput=12.20 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 0, Batch: 26399/88641, Loss=2.4368, lr=0.0000496 Time cost=4.1 Thoughput=12.27 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 26449/88641, Loss=1.7032, lr=0.0000497 Time cost=4.1 Thoughput=12.25 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 26499/88641, Loss=2.0771, lr=0.0000498 Time cost=4.1 Thoughput=12.20 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 26549/88641, Loss=1.8851, lr=0.0000499 Time cost=4.1 Thoughput=12.25 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 26599/88641, Loss=1.9358, lr=0.0000500 Time cost=4.1 Thoughput=12.27 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 26649/88641, Loss=1.3837, lr=0.0000500 Time cost=4.1 Thoughput=12.25 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 26699/88641, Loss=2.0859, lr=0.0000500 Time cost=4.1 Thoughput=12.20 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 26749/88641, Loss=1.7014, lr=0.0000500 Time cost=4.2 Thoughput=11.94 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 26799/88641, Loss=1.7931, lr=0.0000500 Time cost=4.2 Thoughput=11.93 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 0, Batch: 30149/88641, Loss=2.0343, lr=0.0000493 Time cost=4.1 Thoughput=12.08 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 30199/88641, Loss=1.9917, lr=0.0000492 Time cost=4.1 Thoughput=12.18 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 30249/88641, Loss=1.7873, lr=0.0000492 Time cost=4.1 Thoughput=12.19 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 30299/88641, Loss=1.6265, lr=0.0000492 Time cost=4.1 Thoughput=12.23 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 30349/88641, Loss=2.1594, lr=0.0000492 Time cost=4.1 Thoughput=12.22 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 30399/88641, Loss=2.5669, lr=0.0000492 Time cost=4.1 Thoughput=12.21 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 30449/88641, Loss=1.5214, lr=0.0000492 Time cost=4.1 Thoughput=12.20 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 30499/88641, Loss=2.2081, lr=0.0000492 Time cost=4.1 Thoughput=12.22 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 30549/88641, Loss=2.3224, lr=0.0000492 Time cost=4.1 Thoughput=12.21 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 0, Batch: 33899/88641, Loss=1.8574, lr=0.0000485 Time cost=4.1 Thoughput=12.26 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 33949/88641, Loss=1.9037, lr=0.0000485 Time cost=4.1 Thoughput=12.24 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 33999/88641, Loss=1.7972, lr=0.0000485 Time cost=4.1 Thoughput=12.24 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 34049/88641, Loss=1.3396, lr=0.0000484 Time cost=4.1 Thoughput=12.23 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 34099/88641, Loss=1.9498, lr=0.0000484 Time cost=4.1 Thoughput=12.20 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 34149/88641, Loss=2.2625, lr=0.0000484 Time cost=4.2 Thoughput=11.99 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 34199/88641, Loss=1.4953, lr=0.0000484 Time cost=4.1 Thoughput=12.16 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 34249/88641, Loss=1.9169, lr=0.0000484 Time cost=4.1 Thoughput=12.22 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 34299/88641, Loss=2.8526, lr=0.0000484 Time cost=4.1 Thoughput=12.22 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 0, Batch: 37649/88641, Loss=1.7576, lr=0.0000477 Time cost=4.1 Thoughput=12.27 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 37699/88641, Loss=2.1445, lr=0.0000477 Time cost=4.1 Thoughput=12.27 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 37749/88641, Loss=2.1158, lr=0.0000477 Time cost=4.1 Thoughput=12.26 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 37799/88641, Loss=2.0637, lr=0.0000477 Time cost=4.1 Thoughput=12.26 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 37849/88641, Loss=2.1947, lr=0.0000476 Time cost=4.1 Thoughput=12.26 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 37899/88641, Loss=1.8968, lr=0.0000476 Time cost=4.3 Thoughput=11.62 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 37949/88641, Loss=2.3330, lr=0.0000476 Time cost=4.1 Thoughput=12.19 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 37999/88641, Loss=2.0611, lr=0.0000476 Time cost=4.1 Thoughput=12.15 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 38049/88641, Loss=1.7274, lr=0.0000476 Time cost=4.1 Thoughput=12.09 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 0, Batch: 41399/88641, Loss=2.9395, lr=0.0000469 Time cost=4.1 Thoughput=12.22 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 41449/88641, Loss=2.2784, lr=0.0000469 Time cost=4.1 Thoughput=12.24 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 41499/88641, Loss=1.7505, lr=0.0000469 Time cost=4.1 Thoughput=12.23 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 41549/88641, Loss=1.8032, lr=0.0000469 Time cost=4.1 Thoughput=12.24 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 41599/88641, Loss=2.0588, lr=0.0000469 Time cost=4.1 Thoughput=12.21 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 41649/88641, Loss=2.0898, lr=0.0000469 Time cost=4.2 Thoughput=11.77 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 41699/88641, Loss=2.4149, lr=0.0000468 Time cost=4.1 Thoughput=12.06 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 41749/88641, Loss=1.9368, lr=0.0000468 Time cost=4.1 Thoughput=12.15 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 41799/88641, Loss=2.1717, lr=0.0000468 Time cost=4.1 Thoughput=12.25 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 0, Batch: 45149/88641, Loss=2.2712, lr=0.0000461 Time cost=4.1 Thoughput=12.11 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 45199/88641, Loss=1.6673, lr=0.0000461 Time cost=4.1 Thoughput=12.17 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 45249/88641, Loss=2.5072, lr=0.0000461 Time cost=4.1 Thoughput=12.21 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 45299/88641, Loss=2.1205, lr=0.0000461 Time cost=4.1 Thoughput=12.19 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 45349/88641, Loss=1.7288, lr=0.0000461 Time cost=4.1 Thoughput=12.20 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 45399/88641, Loss=1.6642, lr=0.0000461 Time cost=4.1 Thoughput=12.23 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 45449/88641, Loss=2.2498, lr=0.0000461 Time cost=4.1 Thoughput=12.29 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 45499/88641, Loss=1.9381, lr=0.0000460 Time cost=4.1 Thoughput=12.10 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 45549/88641, Loss=2.1085, lr=0.0000460 Time cost=4.2 Thoughput=11.94 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 0, Batch: 48899/88641, Loss=1.5594, lr=0.0000453 Time cost=4.1 Thoughput=12.23 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 48949/88641, Loss=1.8686, lr=0.0000453 Time cost=4.1 Thoughput=12.28 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 48999/88641, Loss=2.7171, lr=0.0000453 Time cost=4.1 Thoughput=12.30 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 49049/88641, Loss=1.8645, lr=0.0000453 Time cost=4.1 Thoughput=12.26 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 49099/88641, Loss=2.1759, lr=0.0000453 Time cost=4.1 Thoughput=12.26 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 49149/88641, Loss=2.5194, lr=0.0000453 Time cost=4.1 Thoughput=12.29 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 49199/88641, Loss=1.8235, lr=0.0000453 Time cost=4.1 Thoughput=12.30 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 49249/88641, Loss=2.3215, lr=0.0000453 Time cost=4.1 Thoughput=12.27 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 49299/88641, Loss=1.8541, lr=0.0000453 Time cost=4.1 Thoughput=12.26 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 0, Batch: 52649/88641, Loss=1.5604, lr=0.0000446 Time cost=4.1 Thoughput=12.18 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 52699/88641, Loss=1.1615, lr=0.0000445 Time cost=4.1 Thoughput=12.22 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 52749/88641, Loss=2.0803, lr=0.0000445 Time cost=4.1 Thoughput=12.21 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 52799/88641, Loss=2.1189, lr=0.0000445 Time cost=4.1 Thoughput=12.19 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 52849/88641, Loss=1.9833, lr=0.0000445 Time cost=4.1 Thoughput=12.13 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 52899/88641, Loss=2.1129, lr=0.0000445 Time cost=4.1 Thoughput=12.13 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 52949/88641, Loss=2.2757, lr=0.0000445 Time cost=4.1 Thoughput=12.27 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 52999/88641, Loss=2.0414, lr=0.0000445 Time cost=4.1 Thoughput=12.26 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 53049/88641, Loss=2.1663, lr=0.0000445 Time cost=4.1 Thoughput=12.28 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 0, Batch: 56399/88641, Loss=2.1098, lr=0.0000438 Time cost=4.1 Thoughput=12.21 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 56449/88641, Loss=2.3146, lr=0.0000438 Time cost=4.1 Thoughput=12.18 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 56499/88641, Loss=1.8491, lr=0.0000438 Time cost=4.1 Thoughput=12.18 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 56549/88641, Loss=2.5193, lr=0.0000437 Time cost=4.1 Thoughput=12.21 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 56599/88641, Loss=2.2013, lr=0.0000437 Time cost=4.1 Thoughput=12.23 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 56649/88641, Loss=2.4448, lr=0.0000437 Time cost=4.1 Thoughput=12.21 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 56699/88641, Loss=1.9879, lr=0.0000437 Time cost=4.1 Thoughput=12.21 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 56749/88641, Loss=2.1284, lr=0.0000437 Time cost=4.1 Thoughput=12.20 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 56799/88641, Loss=1.8980, lr=0.0000437 Time cost=4.1 Thoughput=12.20 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 0, Batch: 60149/88641, Loss=1.8385, lr=0.0000430 Time cost=4.1 Thoughput=12.29 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 60199/88641, Loss=1.9119, lr=0.0000430 Time cost=4.1 Thoughput=12.25 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 60249/88641, Loss=2.3997, lr=0.0000430 Time cost=4.1 Thoughput=12.30 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 60299/88641, Loss=1.4336, lr=0.0000430 Time cost=4.1 Thoughput=12.33 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 60349/88641, Loss=1.8182, lr=0.0000429 Time cost=4.2 Thoughput=11.99 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 60399/88641, Loss=1.7104, lr=0.0000429 Time cost=4.2 Thoughput=12.00 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 60449/88641, Loss=1.6095, lr=0.0000429 Time cost=4.1 Thoughput=12.13 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 60499/88641, Loss=2.1098, lr=0.0000429 Time cost=4.1 Thoughput=12.27 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 60549/88641, Loss=1.8987, lr=0.0000429 Time cost=4.1 Thoughput=12.06 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 0, Batch: 63899/88641, Loss=2.3191, lr=0.0000422 Time cost=4.1 Thoughput=12.21 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 63949/88641, Loss=2.4351, lr=0.0000422 Time cost=4.1 Thoughput=12.25 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 63999/88641, Loss=2.0314, lr=0.0000422 Time cost=4.1 Thoughput=12.27 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 64049/88641, Loss=1.1592, lr=0.0000422 Time cost=4.1 Thoughput=12.25 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 64099/88641, Loss=2.5617, lr=0.0000422 Time cost=4.1 Thoughput=12.26 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 64149/88641, Loss=1.5760, lr=0.0000422 Time cost=4.1 Thoughput=12.22 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 64199/88641, Loss=1.9523, lr=0.0000421 Time cost=4.1 Thoughput=12.22 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 64249/88641, Loss=2.0378, lr=0.0000421 Time cost=4.1 Thoughput=12.25 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 64299/88641, Loss=2.8015, lr=0.0000421 Time cost=4.1 Thoughput=12.28 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 0, Batch: 67649/88641, Loss=1.4990, lr=0.0000414 Time cost=4.1 Thoughput=12.25 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 67699/88641, Loss=2.5816, lr=0.0000414 Time cost=4.1 Thoughput=12.25 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 67749/88641, Loss=1.8354, lr=0.0000414 Time cost=4.1 Thoughput=12.28 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 67799/88641, Loss=1.8789, lr=0.0000414 Time cost=4.1 Thoughput=12.27 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 67849/88641, Loss=2.1095, lr=0.0000414 Time cost=4.1 Thoughput=12.28 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 67899/88641, Loss=2.4119, lr=0.0000414 Time cost=4.1 Thoughput=12.27 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 67949/88641, Loss=2.0322, lr=0.0000414 Time cost=4.1 Thoughput=12.30 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 67999/88641, Loss=2.3928, lr=0.0000413 Time cost=4.1 Thoughput=12.30 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 68049/88641, Loss=1.3924, lr=0.0000413 Time cost=4.1 Thoughput=12.30 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 0, Batch: 71399/88641, Loss=2.3167, lr=0.0000406 Time cost=4.1 Thoughput=12.13 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 71449/88641, Loss=1.8355, lr=0.0000406 Time cost=4.1 Thoughput=12.17 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 71499/88641, Loss=1.9428, lr=0.0000406 Time cost=4.1 Thoughput=12.18 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 71549/88641, Loss=2.0944, lr=0.0000406 Time cost=4.1 Thoughput=12.22 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 71599/88641, Loss=1.8491, lr=0.0000406 Time cost=4.1 Thoughput=12.27 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 71649/88641, Loss=1.7407, lr=0.0000406 Time cost=4.2 Thoughput=12.04 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 71699/88641, Loss=1.8197, lr=0.0000406 Time cost=4.2 Thoughput=12.03 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 71749/88641, Loss=1.5723, lr=0.0000406 Time cost=4.2 Thoughput=12.02 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 71799/88641, Loss=2.0590, lr=0.0000406 Time cost=4.2 Thoughput=12.00 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 0, Batch: 75149/88641, Loss=1.5743, lr=0.0000399 Time cost=4.1 Thoughput=12.15 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 75199/88641, Loss=1.7381, lr=0.0000398 Time cost=4.1 Thoughput=12.21 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 75249/88641, Loss=1.4470, lr=0.0000398 Time cost=4.1 Thoughput=12.18 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 75299/88641, Loss=1.8967, lr=0.0000398 Time cost=4.1 Thoughput=12.20 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 75349/88641, Loss=2.2144, lr=0.0000398 Time cost=4.1 Thoughput=12.19 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 75399/88641, Loss=1.9514, lr=0.0000398 Time cost=4.1 Thoughput=12.23 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 75449/88641, Loss=2.0400, lr=0.0000398 Time cost=4.1 Thoughput=12.27 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 75499/88641, Loss=2.3377, lr=0.0000398 Time cost=4.1 Thoughput=12.26 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 75549/88641, Loss=1.9035, lr=0.0000398 Time cost=4.1 Thoughput=12.28 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 0, Batch: 78899/88641, Loss=1.7973, lr=0.0000391 Time cost=4.1 Thoughput=12.21 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 78949/88641, Loss=2.3347, lr=0.0000391 Time cost=4.1 Thoughput=12.09 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 78999/88641, Loss=1.5988, lr=0.0000391 Time cost=4.2 Thoughput=11.98 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 79049/88641, Loss=1.2543, lr=0.0000390 Time cost=4.1 Thoughput=12.18 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 79099/88641, Loss=1.6230, lr=0.0000390 Time cost=4.1 Thoughput=12.21 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 79149/88641, Loss=1.9872, lr=0.0000390 Time cost=4.1 Thoughput=12.19 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 79199/88641, Loss=2.4216, lr=0.0000390 Time cost=4.1 Thoughput=12.20 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 79249/88641, Loss=2.4642, lr=0.0000390 Time cost=4.1 Thoughput=12.21 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 79299/88641, Loss=2.2782, lr=0.0000390 Time cost=4.1 Thoughput=12.24 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 0, Batch: 82649/88641, Loss=2.2721, lr=0.0000383 Time cost=4.7 Thoughput=10.53 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 82699/88641, Loss=1.6509, lr=0.0000383 Time cost=4.1 Thoughput=12.23 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 82749/88641, Loss=2.2947, lr=0.0000383 Time cost=4.1 Thoughput=12.25 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 82799/88641, Loss=1.8063, lr=0.0000383 Time cost=4.1 Thoughput=12.18 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 82849/88641, Loss=1.2470, lr=0.0000382 Time cost=4.1 Thoughput=12.21 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 82899/88641, Loss=2.0231, lr=0.0000382 Time cost=4.1 Thoughput=12.21 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 82949/88641, Loss=1.9765, lr=0.0000382 Time cost=4.1 Thoughput=12.28 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 82999/88641, Loss=1.9351, lr=0.0000382 Time cost=4.1 Thoughput=12.05 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 83049/88641, Loss=2.2368, lr=0.0000382 Time cost=4.1 Thoughput=12.26 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 0, Batch: 86399/88641, Loss=1.0634, lr=0.0000375 Time cost=4.1 Thoughput=12.22 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 86449/88641, Loss=2.5137, lr=0.0000375 Time cost=4.1 Thoughput=12.27 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 86499/88641, Loss=1.9312, lr=0.0000375 Time cost=4.1 Thoughput=12.25 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 86549/88641, Loss=2.1348, lr=0.0000375 Time cost=4.1 Thoughput=12.25 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 86599/88641, Loss=1.9851, lr=0.0000375 Time cost=4.2 Thoughput=12.01 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 86649/88641, Loss=1.4952, lr=0.0000375 Time cost=4.1 Thoughput=12.13 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 86699/88641, Loss=2.3255, lr=0.0000374 Time cost=4.1 Thoughput=12.21 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 86749/88641, Loss=2.3959, lr=0.0000374 Time cost=4.1 Thoughput=12.27 samples/s
INFO:gluonnlp:Epoch: 0, Batch: 86799/88641, Loss=1.7481, lr=0.0000374 Time cost=4.1 Thoughput=12.21 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 1, Batch: 1549/88641, Loss=1.6124, lr=0.0000367 Time cost=4.1 Thoughput=12.12 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 1599/88641, Loss=1.4201, lr=0.0000367 Time cost=4.2 Thoughput=11.98 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 1649/88641, Loss=2.0306, lr=0.0000367 Time cost=4.1 Thoughput=12.20 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 1699/88641, Loss=1.5982, lr=0.0000367 Time cost=4.1 Thoughput=12.23 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 1749/88641, Loss=1.9433, lr=0.0000367 Time cost=4.1 Thoughput=12.27 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 1799/88641, Loss=2.1873, lr=0.0000367 Time cost=4.1 Thoughput=12.25 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 1849/88641, Loss=1.1639, lr=0.0000367 Time cost=4.1 Thoughput=12.26 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 1899/88641, Loss=2.3087, lr=0.0000366 Time cost=4.1 Thoughput=12.06 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 1949/88641, Loss=1.7556, lr=0.0000366 Time cost=4.1 Thoughput=12.05 samples/s
INFO:gluonnlp:Epoch

INFO:gluonnlp:Epoch: 1, Batch: 5349/88641, Loss=1.6326, lr=0.0000359 Time cost=4.1 Thoughput=12.27 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 5399/88641, Loss=1.1341, lr=0.0000359 Time cost=4.1 Thoughput=12.31 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 5449/88641, Loss=1.9105, lr=0.0000359 Time cost=4.1 Thoughput=12.12 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 5499/88641, Loss=1.6433, lr=0.0000359 Time cost=4.2 Thoughput=11.99 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 5549/88641, Loss=1.7207, lr=0.0000359 Time cost=4.2 Thoughput=12.02 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 5599/88641, Loss=1.8039, lr=0.0000359 Time cost=4.1 Thoughput=12.25 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 5649/88641, Loss=1.6858, lr=0.0000359 Time cost=4.1 Thoughput=12.25 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 5699/88641, Loss=1.7810, lr=0.0000358 Time cost=4.1 Thoughput=12.21 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 5749/88641, Loss=1.8577, lr=0.0000358 Time cost=4.1 Thoughput=12.24 samples/s
INFO:gluonnlp:Epoch

INFO:gluonnlp:Epoch: 1, Batch: 9149/88641, Loss=1.8170, lr=0.0000351 Time cost=4.1 Thoughput=12.12 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 9199/88641, Loss=1.5920, lr=0.0000351 Time cost=4.1 Thoughput=12.23 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 9249/88641, Loss=1.9912, lr=0.0000351 Time cost=4.2 Thoughput=12.02 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 9299/88641, Loss=1.8384, lr=0.0000351 Time cost=4.2 Thoughput=11.99 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 9349/88641, Loss=1.7075, lr=0.0000351 Time cost=4.1 Thoughput=12.15 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 9399/88641, Loss=2.0798, lr=0.0000351 Time cost=4.1 Thoughput=12.28 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 9449/88641, Loss=1.9728, lr=0.0000351 Time cost=4.1 Thoughput=12.23 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 9499/88641, Loss=1.5132, lr=0.0000351 Time cost=4.1 Thoughput=12.15 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 9549/88641, Loss=2.0049, lr=0.0000350 Time cost=4.1 Thoughput=12.29 samples/s
INFO:gluonnlp:Epoch

INFO:gluonnlp:Epoch: 1, Batch: 12899/88641, Loss=1.7406, lr=0.0000343 Time cost=4.1 Thoughput=12.23 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 12949/88641, Loss=1.4065, lr=0.0000343 Time cost=4.1 Thoughput=12.21 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 12999/88641, Loss=1.9523, lr=0.0000343 Time cost=4.1 Thoughput=12.21 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 13049/88641, Loss=2.1559, lr=0.0000343 Time cost=4.1 Thoughput=12.21 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 13099/88641, Loss=1.4147, lr=0.0000343 Time cost=4.1 Thoughput=12.08 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 13149/88641, Loss=2.2314, lr=0.0000343 Time cost=4.2 Thoughput=12.04 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 13199/88641, Loss=1.5547, lr=0.0000343 Time cost=4.1 Thoughput=12.16 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 13249/88641, Loss=1.4226, lr=0.0000343 Time cost=4.2 Thoughput=11.81 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 13299/88641, Loss=1.8570, lr=0.0000343 Time cost=4.1 Thoughput=12.25 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 1, Batch: 16649/88641, Loss=1.8064, lr=0.0000336 Time cost=4.1 Thoughput=12.10 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 16699/88641, Loss=1.8855, lr=0.0000335 Time cost=4.2 Thoughput=11.86 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 16749/88641, Loss=1.4715, lr=0.0000335 Time cost=4.2 Thoughput=11.87 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 16799/88641, Loss=1.6430, lr=0.0000335 Time cost=4.2 Thoughput=11.87 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 16849/88641, Loss=2.2912, lr=0.0000335 Time cost=4.1 Thoughput=12.15 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 16899/88641, Loss=1.7829, lr=0.0000335 Time cost=4.1 Thoughput=12.27 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 16949/88641, Loss=1.6851, lr=0.0000335 Time cost=4.1 Thoughput=12.28 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 16999/88641, Loss=1.9656, lr=0.0000335 Time cost=4.6 Thoughput=10.97 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 17049/88641, Loss=2.4471, lr=0.0000335 Time cost=4.1 Thoughput=12.30 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 1, Batch: 20399/88641, Loss=1.7502, lr=0.0000328 Time cost=4.1 Thoughput=12.27 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 20449/88641, Loss=1.5152, lr=0.0000328 Time cost=4.1 Thoughput=12.29 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 20499/88641, Loss=1.5260, lr=0.0000328 Time cost=4.1 Thoughput=12.27 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 20549/88641, Loss=2.3974, lr=0.0000327 Time cost=4.1 Thoughput=12.28 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 20599/88641, Loss=2.2021, lr=0.0000327 Time cost=4.1 Thoughput=12.23 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 20649/88641, Loss=2.8181, lr=0.0000327 Time cost=4.1 Thoughput=12.24 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 20699/88641, Loss=1.8726, lr=0.0000327 Time cost=4.1 Thoughput=12.26 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 20749/88641, Loss=1.7084, lr=0.0000327 Time cost=4.1 Thoughput=12.26 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 20799/88641, Loss=1.4892, lr=0.0000327 Time cost=4.1 Thoughput=12.24 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 1, Batch: 24149/88641, Loss=2.2064, lr=0.0000320 Time cost=4.1 Thoughput=12.24 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 24199/88641, Loss=1.7776, lr=0.0000320 Time cost=4.2 Thoughput=11.99 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 24249/88641, Loss=1.8201, lr=0.0000320 Time cost=4.1 Thoughput=12.05 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 24299/88641, Loss=1.3753, lr=0.0000320 Time cost=4.2 Thoughput=12.04 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 24349/88641, Loss=1.8014, lr=0.0000319 Time cost=4.1 Thoughput=12.19 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 24399/88641, Loss=1.7468, lr=0.0000319 Time cost=4.1 Thoughput=12.19 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 24449/88641, Loss=1.7242, lr=0.0000319 Time cost=4.1 Thoughput=12.26 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 24499/88641, Loss=1.5061, lr=0.0000319 Time cost=4.1 Thoughput=12.28 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 24549/88641, Loss=1.5261, lr=0.0000319 Time cost=4.1 Thoughput=12.28 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 1, Batch: 27899/88641, Loss=1.9201, lr=0.0000312 Time cost=4.1 Thoughput=12.11 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 27949/88641, Loss=1.3815, lr=0.0000312 Time cost=4.1 Thoughput=12.25 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 27999/88641, Loss=2.3085, lr=0.0000312 Time cost=4.1 Thoughput=12.26 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 28049/88641, Loss=1.3472, lr=0.0000312 Time cost=4.1 Thoughput=12.27 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 28099/88641, Loss=1.5617, lr=0.0000312 Time cost=4.1 Thoughput=12.24 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 28149/88641, Loss=1.0740, lr=0.0000312 Time cost=4.1 Thoughput=12.23 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 28199/88641, Loss=1.7873, lr=0.0000311 Time cost=4.1 Thoughput=12.21 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 28249/88641, Loss=1.6035, lr=0.0000311 Time cost=4.1 Thoughput=12.20 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 28299/88641, Loss=1.6008, lr=0.0000311 Time cost=4.1 Thoughput=12.28 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 1, Batch: 31649/88641, Loss=1.6721, lr=0.0000304 Time cost=4.1 Thoughput=12.24 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 31699/88641, Loss=1.6107, lr=0.0000304 Time cost=4.1 Thoughput=12.25 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 31749/88641, Loss=1.9971, lr=0.0000304 Time cost=4.1 Thoughput=12.26 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 31799/88641, Loss=1.7955, lr=0.0000304 Time cost=4.1 Thoughput=12.25 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 31849/88641, Loss=2.2350, lr=0.0000304 Time cost=4.1 Thoughput=12.25 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 31899/88641, Loss=1.5034, lr=0.0000304 Time cost=4.1 Thoughput=12.26 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 31949/88641, Loss=2.1068, lr=0.0000304 Time cost=4.1 Thoughput=12.29 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 31999/88641, Loss=1.9314, lr=0.0000304 Time cost=4.1 Thoughput=12.24 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 32049/88641, Loss=1.9600, lr=0.0000303 Time cost=4.1 Thoughput=12.17 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 1, Batch: 35399/88641, Loss=2.2046, lr=0.0000296 Time cost=4.1 Thoughput=12.20 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 35449/88641, Loss=1.3273, lr=0.0000296 Time cost=4.1 Thoughput=12.22 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 35499/88641, Loss=1.8215, lr=0.0000296 Time cost=4.1 Thoughput=12.23 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 35549/88641, Loss=1.5036, lr=0.0000296 Time cost=4.1 Thoughput=12.20 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 35599/88641, Loss=1.8759, lr=0.0000296 Time cost=4.1 Thoughput=12.20 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 35649/88641, Loss=1.7733, lr=0.0000296 Time cost=4.1 Thoughput=12.17 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 35699/88641, Loss=1.5864, lr=0.0000296 Time cost=4.1 Thoughput=12.10 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 35749/88641, Loss=1.7584, lr=0.0000296 Time cost=4.2 Thoughput=12.03 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 35799/88641, Loss=2.4436, lr=0.0000296 Time cost=4.2 Thoughput=12.01 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 1, Batch: 39149/88641, Loss=1.4587, lr=0.0000289 Time cost=4.1 Thoughput=12.22 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 39199/88641, Loss=2.1625, lr=0.0000288 Time cost=4.1 Thoughput=12.22 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 39249/88641, Loss=2.1335, lr=0.0000288 Time cost=4.1 Thoughput=12.21 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 39299/88641, Loss=1.9244, lr=0.0000288 Time cost=4.1 Thoughput=12.21 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 39349/88641, Loss=1.3417, lr=0.0000288 Time cost=4.1 Thoughput=12.19 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 39399/88641, Loss=1.7993, lr=0.0000288 Time cost=4.1 Thoughput=12.20 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 39449/88641, Loss=1.8923, lr=0.0000288 Time cost=4.1 Thoughput=12.20 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 39499/88641, Loss=1.8757, lr=0.0000288 Time cost=4.1 Thoughput=12.15 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 39549/88641, Loss=1.7658, lr=0.0000288 Time cost=4.2 Thoughput=12.04 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 1, Batch: 42899/88641, Loss=1.3324, lr=0.0000281 Time cost=4.1 Thoughput=12.14 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 42949/88641, Loss=1.8475, lr=0.0000281 Time cost=4.1 Thoughput=12.18 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 42999/88641, Loss=1.4718, lr=0.0000281 Time cost=4.1 Thoughput=12.18 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 43049/88641, Loss=1.7915, lr=0.0000280 Time cost=4.1 Thoughput=12.22 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 43099/88641, Loss=2.0647, lr=0.0000280 Time cost=4.1 Thoughput=12.22 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 43149/88641, Loss=1.4715, lr=0.0000280 Time cost=4.1 Thoughput=12.22 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 43199/88641, Loss=1.4985, lr=0.0000280 Time cost=4.6 Thoughput=10.88 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 43249/88641, Loss=2.0297, lr=0.0000280 Time cost=4.1 Thoughput=12.19 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 43299/88641, Loss=1.5787, lr=0.0000280 Time cost=4.1 Thoughput=12.18 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 1, Batch: 46649/88641, Loss=1.4870, lr=0.0000273 Time cost=4.2 Thoughput=11.89 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 46699/88641, Loss=1.3658, lr=0.0000273 Time cost=4.2 Thoughput=11.93 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 46749/88641, Loss=1.5041, lr=0.0000273 Time cost=4.1 Thoughput=12.23 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 46799/88641, Loss=2.1536, lr=0.0000273 Time cost=4.1 Thoughput=12.14 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 46849/88641, Loss=1.3602, lr=0.0000272 Time cost=4.2 Thoughput=11.98 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 46899/88641, Loss=1.5989, lr=0.0000272 Time cost=4.2 Thoughput=11.81 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 46949/88641, Loss=1.4885, lr=0.0000272 Time cost=4.1 Thoughput=12.16 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 46999/88641, Loss=1.0135, lr=0.0000272 Time cost=4.1 Thoughput=12.17 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 47049/88641, Loss=2.1725, lr=0.0000272 Time cost=4.6 Thoughput=10.82 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 1, Batch: 50399/88641, Loss=2.5777, lr=0.0000265 Time cost=4.1 Thoughput=12.26 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 50449/88641, Loss=1.5659, lr=0.0000265 Time cost=4.1 Thoughput=12.23 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 50499/88641, Loss=2.3836, lr=0.0000265 Time cost=4.1 Thoughput=12.05 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 50549/88641, Loss=1.3364, lr=0.0000265 Time cost=4.1 Thoughput=12.26 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 50599/88641, Loss=1.6311, lr=0.0000265 Time cost=4.1 Thoughput=12.25 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 50649/88641, Loss=2.0004, lr=0.0000265 Time cost=4.1 Thoughput=12.23 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 50699/88641, Loss=1.4358, lr=0.0000264 Time cost=4.1 Thoughput=12.19 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 50749/88641, Loss=1.7844, lr=0.0000264 Time cost=4.1 Thoughput=12.07 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 50799/88641, Loss=1.1062, lr=0.0000264 Time cost=4.1 Thoughput=12.18 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 1, Batch: 54149/88641, Loss=1.3993, lr=0.0000257 Time cost=4.1 Thoughput=12.25 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 54199/88641, Loss=1.2691, lr=0.0000257 Time cost=4.1 Thoughput=12.26 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 54249/88641, Loss=1.2348, lr=0.0000257 Time cost=4.1 Thoughput=12.26 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 54299/88641, Loss=1.6117, lr=0.0000257 Time cost=4.1 Thoughput=12.26 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 54349/88641, Loss=1.4235, lr=0.0000257 Time cost=4.1 Thoughput=12.19 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 54399/88641, Loss=0.9643, lr=0.0000257 Time cost=4.1 Thoughput=12.22 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 54449/88641, Loss=2.1576, lr=0.0000257 Time cost=4.1 Thoughput=12.24 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 54499/88641, Loss=1.3076, lr=0.0000257 Time cost=4.1 Thoughput=12.26 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 54549/88641, Loss=1.7657, lr=0.0000256 Time cost=4.1 Thoughput=12.23 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 1, Batch: 57899/88641, Loss=2.4120, lr=0.0000249 Time cost=4.1 Thoughput=12.25 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 57949/88641, Loss=1.3040, lr=0.0000249 Time cost=4.1 Thoughput=12.24 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 57999/88641, Loss=1.6165, lr=0.0000249 Time cost=4.1 Thoughput=12.19 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 58049/88641, Loss=1.6051, lr=0.0000249 Time cost=4.1 Thoughput=12.23 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 58099/88641, Loss=1.4662, lr=0.0000249 Time cost=4.1 Thoughput=12.24 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 58149/88641, Loss=1.8499, lr=0.0000249 Time cost=4.1 Thoughput=12.24 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 58199/88641, Loss=1.8868, lr=0.0000249 Time cost=4.1 Thoughput=12.27 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 58249/88641, Loss=1.5480, lr=0.0000249 Time cost=4.1 Thoughput=12.28 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 58299/88641, Loss=2.2630, lr=0.0000249 Time cost=4.1 Thoughput=12.24 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 1, Batch: 61649/88641, Loss=0.9391, lr=0.0000242 Time cost=4.1 Thoughput=12.22 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 61699/88641, Loss=1.3298, lr=0.0000241 Time cost=4.1 Thoughput=12.12 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 61749/88641, Loss=1.4386, lr=0.0000241 Time cost=4.1 Thoughput=12.23 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 61799/88641, Loss=1.8105, lr=0.0000241 Time cost=4.2 Thoughput=12.03 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 61849/88641, Loss=1.0077, lr=0.0000241 Time cost=4.1 Thoughput=12.23 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 61899/88641, Loss=1.3393, lr=0.0000241 Time cost=4.1 Thoughput=12.11 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 61949/88641, Loss=1.5552, lr=0.0000241 Time cost=4.2 Thoughput=11.77 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 61999/88641, Loss=1.4533, lr=0.0000241 Time cost=4.2 Thoughput=11.83 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 62049/88641, Loss=1.2557, lr=0.0000241 Time cost=4.2 Thoughput=11.98 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 1, Batch: 65399/88641, Loss=2.3966, lr=0.0000234 Time cost=4.1 Thoughput=12.28 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 65449/88641, Loss=1.9131, lr=0.0000234 Time cost=4.1 Thoughput=12.29 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 65499/88641, Loss=1.6839, lr=0.0000234 Time cost=4.1 Thoughput=12.30 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 65549/88641, Loss=1.4222, lr=0.0000233 Time cost=4.1 Thoughput=12.29 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 65599/88641, Loss=0.9255, lr=0.0000233 Time cost=4.1 Thoughput=12.17 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 65649/88641, Loss=1.4318, lr=0.0000233 Time cost=4.2 Thoughput=11.97 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 65699/88641, Loss=2.0611, lr=0.0000233 Time cost=4.2 Thoughput=11.99 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 65749/88641, Loss=1.2486, lr=0.0000233 Time cost=4.2 Thoughput=12.02 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 65799/88641, Loss=1.5025, lr=0.0000233 Time cost=4.2 Thoughput=11.95 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 1, Batch: 69149/88641, Loss=1.6482, lr=0.0000226 Time cost=4.1 Thoughput=12.27 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 69199/88641, Loss=1.5573, lr=0.0000226 Time cost=4.1 Thoughput=12.15 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 69249/88641, Loss=2.0132, lr=0.0000226 Time cost=4.2 Thoughput=12.03 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 69299/88641, Loss=1.8906, lr=0.0000226 Time cost=4.1 Thoughput=12.27 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 69349/88641, Loss=1.8081, lr=0.0000225 Time cost=4.1 Thoughput=12.24 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 69399/88641, Loss=2.0015, lr=0.0000225 Time cost=4.1 Thoughput=12.24 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 69449/88641, Loss=1.1883, lr=0.0000225 Time cost=4.6 Thoughput=10.94 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 69499/88641, Loss=1.3868, lr=0.0000225 Time cost=4.1 Thoughput=12.21 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 69549/88641, Loss=1.7349, lr=0.0000225 Time cost=4.1 Thoughput=12.25 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 1, Batch: 72899/88641, Loss=1.5161, lr=0.0000218 Time cost=4.1 Thoughput=12.23 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 72949/88641, Loss=1.9171, lr=0.0000218 Time cost=4.1 Thoughput=12.25 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 72999/88641, Loss=1.3566, lr=0.0000218 Time cost=4.1 Thoughput=12.20 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 73049/88641, Loss=1.6701, lr=0.0000218 Time cost=4.1 Thoughput=12.15 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 73099/88641, Loss=1.6624, lr=0.0000218 Time cost=4.1 Thoughput=12.20 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 73149/88641, Loss=1.5490, lr=0.0000218 Time cost=4.1 Thoughput=12.14 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 73199/88641, Loss=1.1134, lr=0.0000217 Time cost=4.1 Thoughput=12.24 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 73249/88641, Loss=1.3910, lr=0.0000217 Time cost=4.1 Thoughput=12.24 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 73299/88641, Loss=1.6503, lr=0.0000217 Time cost=4.6 Thoughput=10.90 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 1, Batch: 76649/88641, Loss=1.2488, lr=0.0000210 Time cost=4.2 Thoughput=12.04 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 76699/88641, Loss=1.2278, lr=0.0000210 Time cost=4.5 Thoughput=11.20 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 76749/88641, Loss=1.2164, lr=0.0000210 Time cost=4.1 Thoughput=12.23 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 76799/88641, Loss=1.7373, lr=0.0000210 Time cost=4.1 Thoughput=12.24 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 76849/88641, Loss=1.8412, lr=0.0000210 Time cost=4.1 Thoughput=12.14 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 76899/88641, Loss=1.2768, lr=0.0000210 Time cost=4.2 Thoughput=12.00 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 76949/88641, Loss=1.5269, lr=0.0000210 Time cost=4.2 Thoughput=12.02 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 76999/88641, Loss=1.5091, lr=0.0000210 Time cost=4.1 Thoughput=12.24 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 77049/88641, Loss=1.3345, lr=0.0000209 Time cost=4.1 Thoughput=12.11 samples/s
INFO:gluon

INFO:gluonnlp:Epoch: 1, Batch: 80399/88641, Loss=1.2037, lr=0.0000202 Time cost=4.1 Thoughput=12.23 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 80449/88641, Loss=1.4740, lr=0.0000202 Time cost=4.1 Thoughput=12.19 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 80499/88641, Loss=1.7257, lr=0.0000202 Time cost=4.2 Thoughput=12.04 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 80549/88641, Loss=1.5663, lr=0.0000202 Time cost=4.1 Thoughput=12.23 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 80599/88641, Loss=1.6288, lr=0.0000202 Time cost=4.1 Thoughput=12.26 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 80649/88641, Loss=1.5895, lr=0.0000202 Time cost=4.1 Thoughput=12.25 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 80699/88641, Loss=1.7532, lr=0.0000202 Time cost=4.1 Thoughput=12.21 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 80749/88641, Loss=1.3047, lr=0.0000202 Time cost=4.1 Thoughput=12.22 samples/s
INFO:gluonnlp:Epoch: 1, Batch: 80799/88641, Loss=1.6022, lr=0.0000202 Time cost=4.3 Thoughput=11.70 samples/s
INFO:gluon

In [20]:
def evaluate():
    """Evaluate the model on validation dataset.
    """
    log.info('Loading dev data...')
#     if version_2:
#         dev_data = SQuAD('dev', version='2.0')
#     else:
    dev_data = SQuAD('dev', version='1.1')
    if args.debug:
        sampled_data = [dev_data[0], dev_data[1], dev_data[2]]
        dev_data = mx.gluon.data.SimpleDataset(sampled_data)
    log.info('Number of records in dev data:{}'.format(len(dev_data)))

    dev_dataset = dev_data.transform(
        SQuADTransform(
            copy.copy(tokenizer),
            max_seq_length=max_seq_length,
            doc_stride=doc_stride,
            max_query_length=max_query_length,
            is_pad=False,
            is_training=False)._transform, lazy=False)
    
    dev_data_transform, _ = preprocess_dataset(
        dev_data, SQuADTransform(
            copy.copy(tokenizer),
            max_seq_length=max_seq_length,
            doc_stride=doc_stride,
            max_query_length=max_query_length,
            is_pad=False,
            is_training=False))
    log.info('The number of examples after preprocessing:{}'.format(
        len(dev_data_transform)))

    dev_dataloader = mx.gluon.data.DataLoader(
        dev_data_transform,
        batchify_fn=batchify_fn,
        num_workers=4, batch_size=test_batch_size,
        shuffle=False, last_batch='keep')

    log.info('start prediction')

    all_results = collections.defaultdict(list)

    epoch_tic = time.time()
    total_num = 0
    for data in dev_dataloader:
        example_ids, inputs, token_types, valid_length, _, _ = data
        total_num += len(inputs)
        out = net(inputs.astype('float32').as_in_context(ctx),
                  token_types.astype('float32').as_in_context(ctx),
                  valid_length.astype('float32').as_in_context(ctx))

        output = mx.nd.split(out, axis=2, num_outputs=2)
        example_ids = example_ids.asnumpy().tolist()
        pred_start = output[0].reshape((0, -3)).asnumpy()
        pred_end = output[1].reshape((0, -3)).asnumpy()

        for example_id, start, end in zip(example_ids, pred_start, pred_end):
            all_results[example_id].append(PredResult(start=start, end=end))
            
    epoch_toc = time.time()
    log.info('Time cost={:.2f} s, Thoughput={:.2f} samples/s'.format(
        epoch_toc - epoch_tic, total_num/(epoch_toc - epoch_tic)))

    log.info('Get prediction results...')

    all_predictions = collections.OrderedDict()

    for features in dev_dataset:
        results = all_results[features[0].example_id]
        example_qas_id = features[0].qas_id

        prediction, _ = predict(
            features=features,
            results=results,
            tokenizer=nlp.data.BERTBasicTokenizer(lower=lower),
            max_answer_length=max_answer_length,
            null_score_diff_threshold=null_score_diff_threshold,
            n_best_size=n_best_size,
#             version_2=version_2
        )

        all_predictions[example_qas_id] = prediction

    with io.open(os.path.join(output_dir, 'predictions.json'),
                 'w', encoding='utf-8') as fout:
        data = json.dumps(all_predictions, ensure_ascii=False)
        fout.write(data)

#     if version_2:
#         log.info('Please run evaluate-v2.0.py to get evaluation results for SQuAD 2.0')
#     else:
    F1_EM = get_F1_EM(dev_data, all_predictions)
    log.info(F1_EM)

In [21]:
evaluate()

INFO:gluonnlp:Loading dev data...


AttributeError: 'Namespace' object has no attribute 'debug'

## Deploy on SageMaker

1. Saving the model parameters
2. Preparing functions for inference 
3. Building a docker container with dependencies installed
4. Launch a serving end-point with SageMaker SDK

### 1. Save the model parameters

In [27]:
## save parameters, model definition and vocabulary in a zip file

# net.export('checkpoint')
with open('vocab.json', 'w') as f:
    f.write(vocab.to_json())
import tarfile
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("output_dir/checkpoint-0000.params") 
    tar.add("output_dir/checkpoint-symbol.json") 
    tar.add("output_dir/vocab.json")

### 2. Preparing functions for inference

Two functions: 
1. model_fn() to load model parameters
2. transform_fn() to run model inference given an input

### 3. Building a docker container with dependencies installed

Let's prepare a docker container with all the dependencies required for model inference. Here we build a docker container based on the SageMaker MXNet inference container, and you can find the list of all available inference containers at https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-frameworks-deep-learning.html

Here we use local mode for demonstration purpose. To deploy on actual instances, you need to login into AWS elastic container registry (ECR) service, and push the container to ECR. 

```
docker build -t $YOUR_EDR_DOCKER_TAG . -f Dockerfile
$(aws ecr get-login --no-include-email --region $YOUR_REGION)
docker push $YOUR_EDR_DOCKER_TAG
```

In [42]:
%%writefile Dockerfile

ARG REGION
FROM 763104351884.dkr.ecr.$REGION.amazonaws.com/mxnet-inference:1.6.0-gpu-py3

RUN pip install --upgrade --user --pre 'mxnet-mkl' 'https://github.com/dmlc/gluon-nlp/tarball/v0.9.x'

RUN pip list | grep mxnet

COPY *.py /opt/ml/model/code/

Overwriting Dockerfile


In [58]:
## Docker login cmd
!$(aws ecr get-login --no-include-email --region us-east-1 --registry-ids 763104351884)

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded


In [60]:
!export REGION=$(wget -qO- http://169.254.169.254/latest/meta-data/placement/availability-zone) &&\
 docker build --no-cache --build-arg REGION=${REGION::-1} -t my-docker:inference . -f Dockerfile

Sending build context to Docker daemon  2.127GB
Step 1/5 : ARG REGION
Step 2/5 : FROM 763104351884.dkr.ecr.$REGION.amazonaws.com/mxnet-inference:1.6.0-gpu-py3
1.6.0-gpu-py3: Pulling from mxnet-inference

[1B7927d38a: Pulling fs layer 
[1Bac894db4: Pulling fs layer 
[1B2af6d627: Pulling fs layer 
[1B86211d23: Pulling fs layer 
[1B603ff777: Pulling fs layer 
[1B7165632f: Pulling fs layer 
[1B96e40dcf: Pulling fs layer 
[1B91ff3706: Pulling fs layer 
[1B02a4385b: Pulling fs layer 
[1Be229cfdb: Pulling fs layer 
[1B0e6ed5b5: Pulling fs layer 
[1Bc8e328fe: Pulling fs layer 
[1Bbb20abb1: Pulling fs layer 
[1B0702cb67: Pulling fs layer 
[1Bd6c2671b: Pulling fs layer 
[1B486e676d: Pulling fs layer 
[1Ba8b75933: Pulling fs layer 
[1B7d871d5a: Pulling fs layer 
[1Bc8e48618: Pulling fs layer 
[1B9ef7425f: Pulling fs layer 
[1Bc02fa024: Pulling fs layer 
[9B0702cb67: Downloading  686.7MB/686.7MBK[19A[1K[K[18A[1K[K[22A[1K[K[17A[1K[K[22A[1K[K[18A[1K[K[16A[1K

In [30]:
!wget -qO- http://169.254.169.254/latest/meta-data/placement/availability-zone

us-east-1a

In [56]:
# !aws ecr get-login --no-include-email --registry-ids 763104351884
!aws ecr get-login --no-include-email --region us-east-1 --registry-ids 763104351884

docker login -u AWS -p eyJwYXlsb2FkIjoiV3FDZFFmeFhrOXZ2aVhsSzZ4YUFmU2U0aWJsRHp5ZEJLRjhNd3hXcW5KREwveEtDcjdUaUhRZlJ0NEg2ZmR2ZWswcVQ2eGFjMUswVWFtblhwOWNTVUROUkloRWZaL1M3YWU1K01JZEdUbFAvMmtWV09nWW1LUExMVFArV2JPblhxR0dSVGFFQjA1RHdrcGFEcThtVFFidjU0WWtpRkpqRkttRzAvVzNUTWxUM3VqRWp3ZkppVWlIMnBZU3JYY3lqWHBtUlRsUXNJbzRRbDJkVG5TM25FVkRlRm5jckFYOHpZUWNtY2RtcmxqSHo3UEhSZ2h5RVcxWjI3eU9Yd2FGckVDRTk3VXhYVy9aVHB0N2RSb21HVERCVGcrRVJ1RDVCb0hyVGF0SGk3RC9ROEs2d0ZaUVE0YXYrT1NQbUJQSUViNFFBOSswMG5tWXR4Rmk3K3c2eU8vSjl6eXdtRW9nQzA3ejdiTHpIRjduZDRTeHlVK2hzTjJkV1ZjWVBpVXI3aU05d21ESEdJZWkxWkJjWnk5eW9CWGNwcE00OXhGb0NBRE80b3J3L1NRNmpxdk5NSi9sU1FYb01yNU1GV0d2Sy9QdTNXUnlXcEpobjU5WDUrOEsxalBxWnhuTVJrTXdxb1A5N2RYQzdZUFF6NCtFYXVvNSszMHROck9QZktiMmQyYk9US1htOHhIa2NsblJhVm84aVlpaWl3QnhVSWJHalhIOHB0eVVrSVUzTlo5Nkw5TVpTc0YzdmdhTzZVQytPSFFXbTdsQU9obk15Wm1GeVVoQlA2OTFDNHJlY1lGTEIxc1IwZVJtQzhRMXRZMW45NThzMDhCK2F6VVlHYWZtbE9NSjdraStjMTB6SnA1WXhseGRsejdIdGhicVE4WHJrVkJGSHFJaWszRjVuRERBMWd0NGZQdHIzTnBkWnp4YjZjVzRsZ0lQTjBoQVgxYlhsV

In [59]:
!docker pull 763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-inference:1.6.0-gpu-py3
#              763104351884.dkr.ecr.<region>.amazonaws.com/mxnet-inference:1.4.1-gpu-py3

1.6.0-gpu-py3: Pulling from mxnet-inference

[1B7927d38a: Pulling fs layer 
[1Bac894db4: Pulling fs layer 
[1B2af6d627: Pulling fs layer 
[1B86211d23: Pulling fs layer 
[1B603ff777: Pulling fs layer 
[1B7165632f: Pulling fs layer 
[1B96e40dcf: Pulling fs layer 
[1B91ff3706: Pulling fs layer 
[1B02a4385b: Pulling fs layer 
[1Be229cfdb: Pulling fs layer 
[1B0e6ed5b5: Pulling fs layer 
[1Bc8e328fe: Pulling fs layer 
[1Bbb20abb1: Pulling fs layer 
[1B0702cb67: Pulling fs layer 
[1Bd6c2671b: Pulling fs layer 
[1B486e676d: Pulling fs layer 
[1Ba8b75933: Pulling fs layer 
[13B165632f: Waiting fs layer 
[1Bc8e48618: Pulling fs layer 
[12B2a4385b: Waiting fs layer 
[1Bc02fa024: Pulling fs layer 
[7B486e676d: Download complete  B/768.9MB0A[1K[K[19A[1K[K[22A[1K[K[18A[1K[K[17A[1K[K[18A[1K[K[16A[1K[K[22A[1K[K[14A[1K[K[22A[1K[K[15A[1K[K[22A[1K[K[15A[1K[K[22A[1K[K[22A[1K[K[14A[1K[K[22A[1K[K[15A[1K[K[22A[1K[K[13A[1K[K[22

In [32]:
!aws ecr get-login --no-include-email --region us-east-1

docker login -u AWS -p eyJwYXlsb2FkIjoidWdBRkZyVERaMGFNRnJzYUFKNHZZQm1oZ0I5eVFJd25qRDBpcit2VU9SU3EwWkg1RTEyaVVTdWJSSmpJMmpzbXVjVnZxYzZVR01CcVpaa0JKb2oreEMycCtMNytWK2ZhNHh3bVlLMnZVWnFWN1ByYVp4Y0FaYjlaSHpWRFpXYXlabHZ4a0NQamVDbE55a3JpSkNrMGNwLzR5TU9ONnRTTHpRWEdGOTFpQWRldzE2Y05SbWtpWXZsVGZZWkJVSjJTemF5MEFLTXcwMHBzZmt5UXM0VkYycTlDcW1oRDZTMmVQWEY3UjNDRjUzSnN2cHpBbDRTRHorQ0dIOEUyaUNETlZobUVCQUpTNFZLR3E4ZGd6MWFGRjRhZkJVc01PUnE4Tm9zc3pvQ3J0b2dsS2o4Qkx4MHc4VXZySUNWODFjNWJnTnM3YlJTUVQzQ2s2enY1NERtRXRDZEk5UEZkdmwrTHZHQjNKTTEzZlQ0UUtmOUU0Y214TzY3ekpwUUNmSHcwTmN5VzQ3bVdVRnJxajhiaTJBNk54SlltRU12K2xQNjVjbHI1UGVZTTBvZTlZc2hDaGJxT2xlOWVuMTBTZlJmMG1DYmNDcDdDRUtaMUEyTUs2a0t5dEJoQWVrN1Y0U0hGRmJKdHJIa01Hb2ZNTGk2L2tmS2QzZ2lySy8weWgxT2VFcnlJaXBNL0Q5MFE0UTBxbHBXSktBVk5BTlhYTFdLVDd2Q2J5OXJOZjJ2aktpRWVuRFo0bUt3YWFSV3YwcmMvT0h1M1BJWFh6ZWJ4QitONVVWM251emgyc3Y3RTFXaWdyeWpYcTUxS1pqUmR4ck9XK2Z4OU94VlM2SzVyU081WVZ6cE9zRVkzMkFKcHdyYlRCblRBZTU0TUJBdWduczRhM3BOdEVQU3E2OU5PYW0yc0RGV2tVTitvL2F0M1dMRUduaURIYmg2c2NTNlZrbXAvd

In [45]:
!docker pull 520713654638.dkr.ecr.us-east-1.amazonaws.com/mxnet-inference:1.6.0-gpu-py3


Error response from daemon: Get https://520713654638.dkr.ecr.us-east-1.amazonaws.com/v2/mxnet-inference/manifests/1.6.0-gpu-py3: no basic auth credentials


In [34]:
import sagemaker

role=sagemaker.get_execution_role()


In [38]:
!aws ecr get-login --no-include-email --registry-ids 520713654638

docker login -u AWS -p eyJwYXlsb2FkIjoiV2Z4Q05ESDd3WW5hYXpVZjc2c3JrM3RBSTZQZUV6NGw2VFRrQ0dLd2E0dVNQTjdkSkdNSFk4Y2dNcGoxcXcrN25ZY0lLcDN4aC9IMzhZb2tqUU1WWld2WGlYc2lhbFdHelppMVB3b3ErcUVnWXVybUJxUjhuZ2ZzcTRnQmtRRFYrZXI1UDJxRExrUHFtVlJ3U2w4eERScFFZVXIwbzF6MGtOQ0dVdXpocDVSZmlXc0hHY29qeXdsaFEwWEpMYkczMGhhUUJGZUVYekRpdEJLL1Ezc0tYUm90djlyRlVLSHdhcG5UOEEzL3JsRDZxNEp6TjFORDI5RlRoVE9QY0g4YXRidExISG9LL3c3RTNXTTNVZERQbTk5NmNsa0hVWFpjM2tFeDhzS1lUM3Z1QzF6T1J1TS9RaGJxdE91Y29ZSXUwaXBPZnJIYUdLRXc3aFBKYktHb1RUR0Z0b0xwZVZ1UWtDV0w1YjVsVDk3VkxJZEU4REZsTkw1bzF0VzU2NWptb3dYc2RpTk5MT3gvdkhtbmFHa0YzVXlnL01EQkcyWUxYMTVqYThiNWV3MlFMUEEzMUJsNkpTQUxaMEhzUFp1M01qQk5XTFN2eXIyeGw2anNZZy9jRzVleVNvbUZGbUo1SVI0WGFXMmZLWnIySDQrYUt4TVk5ODVENUZKYzZ4M3lKNUhHa3JzbVVFb2xXNFd2MG9VVEFSUVVvSTdnQ3VVaUtCRWlvZkpPRk9ublc2L1MxOXgyQ3hOT2d3aDVpTi81MU4zVnEwRlhtZGwwNmlheFB0MWk4cmVkbndNM2Y5R0tMemtSUnR6bTlQTFJNbjFXMTVDVHBwNExSMS9QajVKQzZoTXYrRlJKcXpKenVWbll4dS9rMW5UM0Q2YVAySEIwR0dDLzBEOW5VQUhwRkdic2V5YlJQSXJSMXF1b3lHd1pDMERsMXdLcTdvcVpIK1ZjU