# Character-wise RNN as a TensorBoard Example

In this notebook, I'll build a character-wise RNN trained on "Anna Karenina". It'll be able to generate new text based on the text from the book. The network's graph and parameters will be visualized and debugged using TensorBoard.

This network is based on Andrej Karpathy's [post on RNNs](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) and [implementation in Torch](https://github.com/karpathy/char-rnn). I also drew on material [at r2rt](http://r2rt.com/recurrent-neural-networks-in-tensorflow-ii.html) and from [Sherjil Ozair](https://github.com/sherjilozair/char-rnn-tensorflow) on GitHub. Below is the general architecture of the character-wise RNN.

<img src="assets/charseq.jpeg" width="500">

In [15]:
import time
from collections import namedtuple

import numpy as np
import tensorflow as tf

First we'll load the text file and convert it into integers for our network to use.

In [16]:
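# Read the whole book into a single string
with open('anna.txt', 'r') as f:
    text = f.read()
# Map each distinct character to an integer id and back
vocab = set(text)
vocab_to_int = {c: i for i, c in enumerate(vocab)}
int_to_vocab = dict(enumerate(vocab))
# Encode the text as an array of integer character ids
chars = np.array([vocab_to_int[c] for c in text], dtype=np.int32)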

In [17]:
text[:100]

'Chapter 1\n\n\nHappy families are all alike; every unhappy family is unhappy in its own\nway.\n\nEverythin'

In [18]:
chars[:100]

array([77, 51, 46, 35, 28, 18, 48, 53, 20, 58, 58, 58, 45, 46, 35, 35, 52,
       53, 23, 46, 50, 67, 72, 67, 18,  8, 53, 46, 48, 18, 53, 46, 72, 72,
       53, 46, 72, 67, 17, 18, 68, 53, 18, 44, 18, 48, 52, 53,  1, 37, 51,
       46, 35, 35, 52, 53, 23, 46, 50, 67, 72, 52, 53, 67,  8, 53,  1, 37,
       51, 46, 35, 35, 52, 53, 67, 37, 53, 67, 28,  8, 53,  6, 33, 37, 58,
       33, 46, 52, 39, 58, 58, 60, 44, 18, 48, 52, 28, 51, 67, 37], dtype=int32)
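As a quick sanity check on the encoding, here is a minimal sketch on a toy string rather than the book. The names `sample`, `toy_vocab`, `toy_vocab_to_int` and `toy_int_to_vocab` are made up for this illustration, and sorting the characters is my own assumption to make the mapping reproducible; the notebook itself enumerates a plain `set`.

```python
# Toy stand-in for the book text; the notebook reads anna.txt instead.
sample = "happy families"

# Sorting makes the character-to-integer mapping the same on every run
# (an assumption of this sketch, not something the notebook does).
toy_vocab = sorted(set(sample))
toy_vocab_to_int = {c: i for i, c in enumerate(toy_vocab)}
toy_int_to_vocab = dict(enumerate(toy_vocab))

# Encode to integers, then decode back to text
encoded = [toy_vocab_to_int[c] for c in sample]
decoded = "".join(toy_int_to_vocab[i] for i in encoded)

print(encoded)
print(decoded == sample)  # the two dictionaries should invert each other
```

If the round trip holds on the toy string, the same pair of dictionaries can be used later to turn sampled integers back into characters.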

Now I need to split up the data into batches, and into training and validation sets. I should be making a test set here, but I'm not going to worry about that. My test will be if the network can generate new text.

Here I'll make both input and target arrays. The targets are the same as the inputs, except shifted one character over. I'll also drop the last bit of data so that I'll only have completely full batches.

The idea here is to make a 2D matrix where the number of rows is equal to the batch size, that is, the number of sequences in each batch. Each row will be one long contiguous stretch of the character data. We'll split this data into a training set and validation set using the `split_frac` keyword. By default this keeps the first 90% of the batches in the training set and the remaining 10% in the validation set.
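To make the shift and the reshape concrete, here is a small sketch in plain Python lists (no NumPy) on a toy sequence. `toy_split` is a hypothetical helper written for this illustration; it mirrors only the slicing logic and leaves out the training/validation split.

```python
def toy_split(seq, batch_size, num_steps):
    """Return (x, y) as lists of rows, mirroring the slicing described above."""
    slice_size = batch_size * num_steps
    n_batches = len(seq) // slice_size
    usable = n_batches * slice_size

    x_flat = seq[:usable]
    y_flat = seq[1:usable + 1]       # same data, shifted one step over

    row_len = n_batches * num_steps  # each row is one contiguous stretch
    x = [x_flat[r * row_len:(r + 1) * row_len] for r in range(batch_size)]
    y = [y_flat[r * row_len:(r + 1) * row_len] for r in range(batch_size)]
    return x, y


seq = list(range(14))                # stand-in for the encoded characters
x, y = toy_split(seq, batch_size=2, num_steps=3)
print(x)
print(y)
```

Each target row is its input row moved forward by one position, and the trailing values that don't fill a complete batch are dropped.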

In [19]:
def split_data(chars, batch_size, num_steps, split_frac=0.9):
    """ 
    Split character data into training and validation sets, inputs and targets for each set.
    
    Arguments
    ---------
    chars: character array
    batch_size: Number of sequences in each batch
    num_steps: Number of sequence steps to keep in the input and pass to the network
    split_frac: Fraction of batches to keep in the training set
    
    
    Returns train_x, train_y, val_x, val_y
    """
    
    
    slice_size = batch_size * num_steps
    n_batches = int(len(chars) / slice_size)
    
    # Drop the last few characters to make only full batches
    x = chars[: n_batches*slice_size]
    y = chars[1: n_batches*slice_size + 1]
    
    # Split the data into batch_size slices, then stack them into a 2D matrix 
    x = np.stack(np.split(x, batch_size))
    y = np.stack(np.split(y, batch_size))
    
    # Now x and y are arrays with dimensions batch_size x n_batches*num_steps
    
    # Split into training and validation sets, keep the first split_frac batches for training
    split_idx = int(n_batches*split_frac)
    train_x, train_y= x[:, :split_idx*num_steps], y[:, :split_idx*num_steps]
    val_x, val_y = x[:, split_idx*num_steps:], y[:, split_idx*num_steps:]
    
    return train_x, train_y, val_x, val_y

In [20]:
train_x, train_y, val_x, val_y = split_data(chars, 10, 200)

In [21]:
train_x.shape

(10, 178400)
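The shape above can be checked by hand. The sketch below repeats the arithmetic from `split_data` without touching the data. The notebook never prints `len(chars)`, so `assumed_n_chars` is an assumption: I picked a value that is consistent with the shape shown, and any length from 1,984,000 to 1,985,999 would give the same result for these settings.

```python
def train_shape(n_chars, batch_size, num_steps, split_frac=0.9):
    """Shape of train_x that split_data would return for n_chars characters."""
    slice_size = batch_size * num_steps
    n_batches = n_chars // slice_size
    split_idx = int(n_batches * split_frac)
    return batch_size, split_idx * num_steps


# Assumed text length, chosen only to be consistent with the output above
assumed_n_chars = 1985000

print(train_shape(assumed_n_chars, batch_size=10, num_steps=200))
print(train_shape(assumed_n_chars, batch_size=100, num_steps=100))
```

With `batch_size=100` and `num_steps=100`, the second dimension divided by `num_steps` gives 178 training batches per epoch, which matches the iteration counts in the training logs further down.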

In [22]:
train_x[:,:10]

array([[77, 51, 46, 35, 28, 18, 48, 53, 20, 58],
       [69, 37, 19, 53, 51, 18, 53, 50,  6, 44],
       [53, 24, 46, 28, 24, 51, 67, 37, 63, 53],
       [ 6, 28, 51, 18, 48, 53, 33,  6,  1, 72],
       [53, 28, 51, 18, 53, 72, 46, 37, 19, 55],
       [53,  5, 51, 48,  6,  1, 63, 51, 53, 72],
       [28, 53, 28,  6, 58, 19,  6, 39, 58, 58],
       [ 6, 53, 51, 18, 48,  8, 18, 72, 23,  0],
       [51, 46, 28, 53, 67,  8, 53, 28, 51, 18],
       [18, 48,  8, 18, 72, 23, 53, 46, 37, 19]], dtype=int32)

I'll write another function to grab batches out of the arrays made by `split_data`. Here each batch will be a sliding window on these arrays with size `batch_size x num_steps`. For example, if we want our network to train on a sequence of 100 characters, `num_steps = 100`. For the next batch, we'll shift this window to the next sequence of `num_steps` characters. In this way we can feed batches to the network and the cell states will carry over from one batch to the next.

In [23]:
def get_batch(arrs, num_steps):
    batch_size, slice_size = arrs[0].shape
    
    n_batches = int(slice_size/num_steps)
    for b in range(n_batches):
        yield [x[:, b*num_steps: (b+1)*num_steps] for x in arrs]
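Here is a toy version of the sliding window using nested lists, so it runs without NumPy. `toy_get_batch` is a hypothetical stand-in that windows a single 2D list, whereas `get_batch` windows every array in `arrs` at once.

```python
def toy_get_batch(rows, num_steps):
    """Yield consecutive windows of num_steps columns from a list of rows."""
    n_batches = len(rows[0]) // num_steps
    for b in range(n_batches):
        yield [row[b * num_steps:(b + 1) * num_steps] for row in rows]


rows = [[0, 1, 2, 3, 4, 5],
        [6, 7, 8, 9, 10, 11]]
batches = list(toy_get_batch(rows, num_steps=3))
for batch in batches:
    print(batch)
```

Row `r` of each batch picks up exactly where row `r` of the previous batch stopped. That continuity is what makes it reasonable to feed the final LSTM state of one batch in as the initial state of the next.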

In [24]:
def build_rnn(num_classes, batch_size=50, num_steps=50, lstm_size=128, num_layers=2,
              learning_rate=0.001, grad_clip=5, sampling=False):
        
    if sampling:
        batch_size, num_steps = 1, 1

    tf.reset_default_graph()
    
    # Declare placeholders we'll feed into the graph
    with tf.name_scope("inputs"):
        inputs = tf.placeholder(tf.int32, [batch_size, num_steps], name='inputs')
        x_one_hot = tf.one_hot(inputs, num_classes, name='x_one_hot')

    
    with tf.name_scope("targets"):
        targets = tf.placeholder(tf.int32, [batch_size, num_steps], name='targets')
        y_one_hot = tf.one_hot(targets, num_classes, name='y_one_hot')
        y_reshaped = tf.reshape(y_one_hot, [-1, num_classes])
    
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')
    
    # Build the RNN layers
    with tf.name_scope("RNN_layers"):
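        # Apply dropout to each LSTM layer's output, controlled by the keep_prob placeholder
        cell = tf.contrib.rnn.MultiRNNCell(
            [tf.contrib.rnn.DropoutWrapper(tf.contrib.rnn.BasicLSTMCell(lstm_size),
                                           output_keep_prob=keep_prob)
             for _ in range(num_layers)])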

    with tf.name_scope("RNN_init_state"):
        initial_state = cell.zero_state(batch_size, tf.float32)

    # Run the data through the RNN layers
    with tf.name_scope("RNN_forward"):
        outputs, state = tf.nn.dynamic_rnn(cell, x_one_hot, initial_state=initial_state)
    final_state = state
    
    # Reshape output so it's a bunch of rows, one row for each cell output
    with tf.name_scope("sequence_reshape"):
        seq_output = tf.concat(outputs, axis=1,name='seq_output')
        output = tf.reshape(seq_output, [-1, lstm_size], name='graph_output')
    
    # Now connect the RNN outputs to a softmax layer and calculate the cost
    with tf.name_scope("logits"):
        softmax_w = tf.Variable(tf.truncated_normal((lstm_size, num_classes), stddev=0.1),
                               name='softmax_w')
        softmax_b = tf.Variable(tf.zeros(num_classes), name='softmax_b')
        logits = tf.matmul(output, softmax_w) + softmax_b
        tf.summary.histogram("softmax_w",softmax_w)
        tf.summary.histogram("softmax_b",softmax_b)

    with tf.name_scope("predictions"):
        preds = tf.nn.softmax(logits, name='predictions')
        tf.summary.histogram("predictions",preds)
    
    with tf.name_scope("cost"):
        loss = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y_reshaped, name='loss')
        cost = tf.reduce_mean(loss, name='cost')
        tf.summary.scalar("cost",cost)

    # Optimizer for training, using gradient clipping to control exploding gradients
    with tf.name_scope("train"):
        tvars = tf.trainable_variables()
        grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars), grad_clip)
        train_op = tf.train.AdamOptimizer(learning_rate)
        optimizer = train_op.apply_gradients(zip(grads, tvars))
        
    summaries = tf.summary.merge_all()

    # Export the nodes 
    export_nodes = ['inputs', 'targets', 'initial_state', 'final_state',
                    'keep_prob', 'cost', 'preds', 'optimizer','summaries']
    Graph = namedtuple('Graph', export_nodes)
    local_dict = locals()
    graph = Graph(*[local_dict[each] for each in export_nodes])
    
    return graph
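The `train` scope clips gradients by their global norm. As a rough illustration of what that does, here is a plain-Python sketch of the usual rule: every gradient is multiplied by `clip_norm / max(global_norm, clip_norm)`. The function below is my own re-implementation on lists of floats for illustration, not the TensorFlow op itself.

```python
import math


def clip_by_global_norm(grads, clip_norm):
    """Scale every gradient by clip_norm / max(global_norm, clip_norm).

    grads is a list of flat lists of floats, one list per variable.
    """
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = clip_norm / max(global_norm, clip_norm)
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, global_norm


grads = [[3.0, 4.0], [12.0]]   # global norm is sqrt(9 + 16 + 144) = 13
clipped, norm = clip_by_global_norm(grads, clip_norm=5.0)
print(norm)
print(clipped)
```

Because all gradients share one scale factor, the direction of the update is preserved and only its length is capped. Gradients whose global norm is already below `clip_norm` pass through unchanged.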

## Hyperparameters

Here I'm defining the hyperparameters for the network. The two you probably haven't seen before are `lstm_size` and `num_layers`. These set the number of hidden units in the LSTM layers and the number of LSTM layers, respectively. Making these bigger gives the network more capacity, which usually lowers the training loss, but you'll have to watch out for overfitting. If your validation loss is much larger than the training loss, you're probably overfitting. In that case, decrease the size of the network or decrease the dropout keep probability.

In [25]:
batch_size = 100
num_steps = 100
lstm_size = 512
num_layers = 2
learning_rate = 0.001
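To get a feel for how `lstm_size` and `num_layers` drive model size, here is a back-of-the-envelope parameter count. It assumes a standard LSTM layer without peephole connections (four gates, each with weights over the input and the previous hidden state plus a bias) and a vocabulary of 83 characters. The notebook doesn't print `len(vocab)`, so `assumed_num_classes` is hypothetical, as are the helper names.

```python
def lstm_layer_params(input_size, num_units):
    """Weights plus biases for the four gates of one standard LSTM layer."""
    return 4 * num_units * (input_size + num_units + 1)


def network_params(num_classes, lstm_size, num_layers):
    """Stacked LSTM layers followed by the softmax weights and biases."""
    total = 0
    input_size = num_classes        # the first layer sees one-hot characters
    for _ in range(num_layers):
        total += lstm_layer_params(input_size, lstm_size)
        input_size = lstm_size      # later layers see the previous layer's output
    total += lstm_size * num_classes + num_classes
    return total


assumed_num_classes = 83            # hypothetical vocabulary size
for size in (128, 256, 512):
    print(size, network_params(assumed_num_classes, size, num_layers=2))
```

The count grows roughly with the square of `lstm_size`, so doubling the hidden size costs far more than adding a layer of the same width.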

## Write out the graph for TensorBoard

In [26]:
model = build_rnn(len(vocab),
                  batch_size=batch_size,
                  num_steps=num_steps,
                  learning_rate=learning_rate,
                  lstm_size=lstm_size,
                  num_layers=num_layers)

with tf.Session() as sess:
    
    sess.run(tf.global_variables_initializer())
    train_writer = tf.summary.FileWriter("./logs/train",sess.graph)
    test_writer = tf.summary.FileWriter("./logs/test")

Type is unsupported, or the types of the items don't match field type in CollectionDef.
'dict' object has no attribute 'name'


## Training

Time for training, which is pretty straightforward. Here I pass in some data and get an LSTM state back. Then I pass that state back into the network so the next batch can continue the state from the previous batch. Every so often (set by `save_every_n`) I calculate the validation loss and save a checkpoint.
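The checkpoint schedule follows a simple rule: validate and save whenever the iteration count is a multiple of `save_every_n`, and once more on the very last iteration. The sketch below lists the iterations that would trigger a save. `checkpoint_iterations` is a hypothetical helper, and the value of 178 batches per epoch is taken from the training log in this notebook.

```python
def checkpoint_iterations(n_batches, epochs, save_every_n):
    """Iterations at which the training loop validates and saves."""
    iterations = n_batches * epochs
    return [i for i in range(1, iterations + 1)
            if i % save_every_n == 0 or i == iterations]


print(checkpoint_iterations(n_batches=178, epochs=1, save_every_n=200))
print(checkpoint_iterations(n_batches=178, epochs=3, save_every_n=200))
```

With a single epoch of 178 iterations and `save_every_n = 200`, the only checkpoint is the final one, so a smaller `save_every_n` is needed to see validation loss during a short run.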

In [27]:
!mkdir -p checkpoints/anna

In [28]:
epochs = 1
save_every_n = 200
train_x, train_y, val_x, val_y = split_data(chars, batch_size, num_steps)

model = build_rnn(len(vocab), 
                  batch_size=batch_size,
                  num_steps=num_steps,
                  learning_rate=learning_rate,
                  lstm_size=lstm_size,
                  num_layers=num_layers)

saver = tf.train.Saver(max_to_keep=100)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    
    # Use the line below to load a checkpoint and resume training
    #saver.restore(sess, 'checkpoints/anna20.ckpt')
    
    n_batches = int(train_x.shape[1]/num_steps)
    iterations = n_batches * epochs
    for e in range(epochs):
        
        # Train network
        new_state = sess.run(model.initial_state)
        loss = 0
        for b, (x, y) in enumerate(get_batch([train_x, train_y], num_steps), 1):
            iteration = e*n_batches + b
            start = time.time()
            feed = {model.inputs: x,
                    model.targets: y,
                    model.keep_prob: 0.5,
                    model.initial_state: new_state}
            summaries,batch_loss, new_state, _ = sess.run([model.summaries,model.cost, model.final_state, model.optimizer], 
                                                 feed_dict=feed)
            loss += batch_loss
            end = time.time()
            print('Epoch {}/{} '.format(e+1, epochs),
                  'Iteration {}/{}'.format(iteration, iterations),
                  'Training loss: {:.4f}'.format(loss/b),
                  '{:.4f} sec/batch'.format((end-start)))
        
            train_writer.add_summary(summaries,iteration)
            
            if (iteration%save_every_n == 0) or (iteration == iterations):
                # Check performance; keep_prob is set to 1 here, which turns dropout off
                val_loss = []
                new_state = sess.run(model.initial_state)
                for x, y in get_batch([val_x, val_y], num_steps):
                    feed = {model.inputs: x,
                            model.targets: y,
                            model.keep_prob: 1.,
                            model.initial_state: new_state}
                    summaries,batch_loss, new_state = sess.run([model.summaries,model.cost, model.final_state], feed_dict=feed)
                    val_loss.append(batch_loss)

                test_writer.add_summary(summaries,iteration)
                
                print('Validation loss:', np.mean(val_loss),
                      'Saving checkpoint!')
                saver.save(sess, "checkpoints/anna/i{}_l{}_{:.3f}.ckpt".format(iteration, lstm_size, np.mean(val_loss)))

Epoch 1/1  Iteration 1/178 Training loss: 4.4165 0.1873 sec/batch
Epoch 1/1  Iteration 2/178 Training loss: 4.3668 0.1474 sec/batch
Epoch 1/1  Iteration 3/178 Training loss: 4.1579 0.1476 sec/batch
Epoch 1/1  Iteration 4/178 Training loss: 4.2689 0.1492 sec/batch
Epoch 1/1  Iteration 5/178 Training loss: 4.1726 0.1477 sec/batch
Epoch 1/1  Iteration 6/178 Training loss: 4.0751 0.1513 sec/batch
Epoch 1/1  Iteration 7/178 Training loss: 3.9772 0.1495 sec/batch
Epoch 1/1  Iteration 8/178 Training loss: 3.8872 0.1477 sec/batch
Epoch 1/1  Iteration 9/178 Training loss: 3.8096 0.1510 sec/batch
Epoch 1/1  Iteration 10/178 Training loss: 3.7503 0.1478 sec/batch
Epoch 1/1  Iteration 11/178 Training loss: 3.7003 0.1495 sec/batch
Epoch 1/1  Iteration 12/178 Training loss: 3.6585 0.1474 sec/batch
Epoch 1/1  Iteration 13/178 Training loss: 3.6211 0.1476 sec/batch
Epoch 1/1  Iteration 14/178 Training loss: 3.5895 0.1471 sec/batch
Epoch 1/1  Iteration 15/178 Training loss: 3.5605 0.1476 sec/batch
...

Epoch 1/1  Iteration 125/178 Training loss: 3.1087 0.1485 sec/batch
Epoch 1/1  Iteration 126/178 Training loss: 3.1059 0.1492 sec/batch
Epoch 1/1  Iteration 127/178 Training loss: 3.1033 0.1477 sec/batch
Epoch 1/1  Iteration 128/178 Training loss: 3.1007 0.1477 sec/batch
Epoch 1/1  Iteration 129/178 Training loss: 3.0980 0.1476 sec/batch
Epoch 1/1  Iteration 130/178 Training loss: 3.0952 0.1479 sec/batch
Epoch 1/1  Iteration 131/178 Training loss: 3.0926 0.1489 sec/batch
Epoch 1/1  Iteration 132/178 Training loss: 3.0895 0.1493 sec/batch
Epoch 1/1  Iteration 133/178 Training loss: 3.0867 0.1479 sec/batch
Epoch 1/1  Iteration 134/178 Training loss: 3.0837 0.1569 sec/batch
Epoch 1/1  Iteration 135/178 Training loss: 3.0805 0.1576 sec/batch
Epoch 1/1  Iteration 136/178 Training loss: 3.0772 0.1568 sec/batch
Epoch 1/1  Iteration 137/178 Training loss: 3.0741 0.1564 sec/batch
Epoch 1/1  Iteration 138/178 Training loss: 3.0711 0.1534 sec/batch
...
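One way to sanity-check the first numbers in a log like this: a freshly initialized softmax spreads probability roughly evenly over the vocabulary, so the cross-entropy should start near the natural log of the number of classes. The sketch below inverts that relationship. The helper names are made up for this illustration, and the "implied" class count is only a rough guide, since the logged value is a running average after the first batch.

```python
import math


def uniform_cross_entropy(num_classes):
    """Cross-entropy, in nats, of predicting every class with equal probability."""
    p = 1.0 / num_classes
    return -math.log(p)


def implied_classes(loss):
    """Number of equally likely classes that would produce the given loss."""
    return math.exp(loss)


print(round(implied_classes(4.4165), 1))   # first logged training loss
print(round(implied_classes(3.0711), 1))   # running average late in the epoch
print(round(uniform_cross_entropy(83), 4)) # loss for a hypothetical 83-character vocabulary
```

The first logged loss corresponds to a little over 80 equally likely characters, which is plausible for a novel's character set. By the end of the epoch the running average is equivalent to guessing among far fewer options.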

### Hyperparameter selection
TensorBoard can be used for hyperparameter selection. To show this, I'll create a `train` function that takes a model, a number of epochs, and a file writer, then trains the model and stores TensorBoard summaries for the selected hyperparameters. We can then use TensorBoard to compare the results.
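Each hyperparameter combination gets its own log directory, so TensorBoard shows it as a separate run. The sketch below only enumerates the directory names for the grid used in this section, without training anything. It reuses the format string from the notebook; `lstm_sizes`, `layer_counts`, `learning_rates` and `log_dirs` are names introduced here for illustration.

```python
from itertools import product

lstm_sizes = [128, 256, 512]
layer_counts = [1, 2]
learning_rates = [0.002, 0.001]

# Same nesting order as the notebook's loops: size, then layers, then rate
log_dirs = ['logs/hp_selection/lr={},rl={},ru={}'.format(lr, layers, size)
            for size, layers, lr in product(lstm_sizes, layer_counts, learning_rates)]

print(len(log_dirs))
for path in log_dirs[:3]:
    print(path)
```

That is 12 runs in total. At 20 epochs of 178 batches each, every run takes 3560 iterations, so it's worth checking the directory names before starting the full sweep.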

In [35]:
def train(model, epochs, file_writer):
    
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())

        # Use the line below to load a checkpoint and resume training
        #saver.restore(sess, 'checkpoints/anna20.ckpt')

        n_batches = int(train_x.shape[1]/num_steps)
        iterations = n_batches * epochs
        for e in range(epochs):

            # Train network
            new_state = sess.run(model.initial_state)
            loss = 0
            for b, (x, y) in enumerate(get_batch([train_x, train_y], num_steps), 1):
                iteration = e*n_batches + b
                start = time.time()
                feed = {model.inputs: x,
                        model.targets: y,
                        model.keep_prob: 0.5,
                        model.initial_state: new_state}
                summary, batch_loss, new_state, _ = sess.run([model.summaries, model.cost, 
                                                              model.final_state, model.optimizer], 
                                                              feed_dict=feed)
                loss += batch_loss
                end = time.time()
                print('Epoch {}/{} '.format(e+1, epochs),
                      'Iteration {}/{}'.format(iteration, iterations),
                      'Training loss: {:.4f}'.format(loss/b),
                      '{:.4f} sec/batch'.format((end-start)))

                file_writer.add_summary(summary, iteration)

In [None]:
epochs = 20
batch_size = 100
num_steps = 100
train_x, train_y, val_x, val_y = split_data(chars, batch_size, num_steps)

for lstm_size in [128,256,512]:
    for num_layers in [1, 2]:
        for learning_rate in [0.002, 0.001]:
            log_string = 'logs/hp_selection/lr={},rl={},ru={}'.format(learning_rate, num_layers, lstm_size)
            writer = tf.summary.FileWriter(log_string)
            model = build_rnn(len(vocab), 
                    batch_size=batch_size,
                    num_steps=num_steps,
                    learning_rate=learning_rate,
                    lstm_size=lstm_size,
                    num_layers=num_layers)
            
            train(model, epochs, writer)

Epoch 1/20  Iteration 1/3560 Training loss: 4.4229 0.0392 sec/batch
Epoch 1/20  Iteration 2/3560 Training loss: 4.4120 0.0256 sec/batch
Epoch 1/20  Iteration 3/3560 Training loss: 4.3998 0.0267 sec/batch
Epoch 1/20  Iteration 4/3560 Training loss: 4.3825 0.0269 sec/batch
Epoch 1/20  Iteration 5/3560 Training loss: 4.3489 0.0259 sec/batch
Epoch 1/20  Iteration 6/3560 Training loss: 4.2688 0.0261 sec/batch
Epoch 1/20  Iteration 7/3560 Training loss: 4.1706 0.0256 sec/batch
Epoch 1/20  Iteration 8/3560 Training loss: 4.0814 0.0279 sec/batch
Epoch 1/20  Iteration 9/3560 Training loss: 4.0002 0.0281 sec/batch
Epoch 1/20  Iteration 10/3560 Training loss: 3.9280 0.0278 sec/batch
Epoch 1/20  Iteration 11/3560 Training loss: 3.8621 0.0258 sec/batch
Epoch 1/20  Iteration 12/3560 Training loss: 3.8055 0.0276 sec/batch
Epoch 1/20  Iteration 13/3560 Training loss: 3.7563 0.0258 sec/batch
Epoch 1/20  Iteration 14/3560 Training loss: 3.7144 0.0271 sec/batch
...

Epoch 1/20  Iteration 124/3560 Training loss: 3.0748 0.0354 sec/batch
Epoch 1/20  Iteration 125/3560 Training loss: 3.0723 0.0351 sec/batch
Epoch 1/20  Iteration 126/3560 Training loss: 3.0696 0.0353 sec/batch
Epoch 1/20  Iteration 127/3560 Training loss: 3.0671 0.0407 sec/batch
Epoch 1/20  Iteration 128/3560 Training loss: 3.0647 0.0372 sec/batch
Epoch 1/20  Iteration 129/3560 Training loss: 3.0621 0.0350 sec/batch
Epoch 1/20  Iteration 130/3560 Training loss: 3.0596 0.0350 sec/batch
Epoch 1/20  Iteration 131/3560 Training loss: 3.0572 0.0410 sec/batch
Epoch 1/20  Iteration 132/3560 Training loss: 3.0546 0.0440 sec/batch
Epoch 1/20  Iteration 133/3560 Training loss: 3.0521 0.0367 sec/batch
Epoch 1/20  Iteration 134/3560 Training loss: 3.0496 0.0353 sec/batch
Epoch 1/20  Iteration 135/3560 Training loss: 3.0467 0.0427 sec/batch
Epoch 1/20  Iteration 136/3560 Training loss: 3.0441 0.0367 sec/batch
Epoch 1/20  Iteration 137/3560 Training loss: 3.0414 0.0350 sec/batch
...

Epoch 2/20  Iteration 242/3560 Training loss: 2.4403 0.0379 sec/batch
Epoch 2/20  Iteration 243/3560 Training loss: 2.4390 0.0428 sec/batch
Epoch 2/20  Iteration 244/3560 Training loss: 2.4383 0.0381 sec/batch
Epoch 2/20  Iteration 245/3560 Training loss: 2.4373 0.0379 sec/batch
Epoch 2/20  Iteration 246/3560 Training loss: 2.4359 0.0380 sec/batch
Epoch 2/20  Iteration 247/3560 Training loss: 2.4347 0.0386 sec/batch
Epoch 2/20  Iteration 248/3560 Training loss: 2.4339 0.0448 sec/batch
Epoch 2/20  Iteration 249/3560 Training loss: 2.4329 0.0428 sec/batch
Epoch 2/20  Iteration 250/3560 Training loss: 2.4321 0.0382 sec/batch
Epoch 2/20  Iteration 251/3560 Training loss: 2.4312 0.0402 sec/batch
Epoch 2/20  Iteration 252/3560 Training loss: 2.4302 0.0397 sec/batch
Epoch 2/20  Iteration 253/3560 Training loss: 2.4292 0.0380 sec/batch
Epoch 2/20  Iteration 254/3560 Training loss: 2.4288 0.0410 sec/batch
Epoch 2/20  Iteration 255/3560 Training loss: 2.4278 0.0373 sec/batch
...

Epoch 3/20  Iteration 361/3560 Training loss: 2.2407 0.0436 sec/batch
Epoch 3/20  Iteration 362/3560 Training loss: 2.2386 0.0400 sec/batch
Epoch 3/20  Iteration 363/3560 Training loss: 2.2391 0.0390 sec/batch
Epoch 3/20  Iteration 364/3560 Training loss: 2.2402 0.0407 sec/batch
Epoch 3/20  Iteration 365/3560 Training loss: 2.2411 0.0408 sec/batch
Epoch 3/20  Iteration 366/3560 Training loss: 2.2409 0.0396 sec/batch
Epoch 3/20  Iteration 367/3560 Training loss: 2.2391 0.0414 sec/batch
Epoch 3/20  Iteration 368/3560 Training loss: 2.2383 0.0390 sec/batch
Epoch 3/20  Iteration 369/3560 Training loss: 2.2383 0.0401 sec/batch
Epoch 3/20  Iteration 370/3560 Training loss: 2.2401 0.0392 sec/batch
Epoch 3/20  Iteration 371/3560 Training loss: 2.2396 0.0387 sec/batch
Epoch 3/20  Iteration 372/3560 Training loss: 2.2390 0.0402 sec/batch
Epoch 3/20  Iteration 373/3560 Training loss: 2.2387 0.0405 sec/batch
Epoch 3/20  Iteration 374/3560 Training loss: 2.2402 0.0393 sec/batch
...

Epoch 3/20  Iteration 479/3560 Training loss: 2.1846 0.0404 sec/batch
Epoch 3/20  Iteration 480/3560 Training loss: 2.1843 0.0442 sec/batch
Epoch 3/20  Iteration 481/3560 Training loss: 2.1839 0.0415 sec/batch
Epoch 3/20  Iteration 482/3560 Training loss: 2.1834 0.0399 sec/batch
Epoch 3/20  Iteration 483/3560 Training loss: 2.1830 0.0398 sec/batch
Epoch 3/20  Iteration 484/3560 Training loss: 2.1828 0.0421 sec/batch
Epoch 3/20  Iteration 485/3560 Training loss: 2.1824 0.0415 sec/batch
Epoch 3/20  Iteration 486/3560 Training loss: 2.1822 0.0389 sec/batch
Epoch 3/20  Iteration 487/3560 Training loss: 2.1818 0.0392 sec/batch
Epoch 3/20  Iteration 488/3560 Training loss: 2.1812 0.0394 sec/batch
Epoch 3/20  Iteration 489/3560 Training loss: 2.1808 0.0439 sec/batch
Epoch 3/20  Iteration 490/3560 Training loss: 2.1806 0.0390 sec/batch
Epoch 3/20  Iteration 491/3560 Training loss: 2.1801 0.0391 sec/batch
Epoch 3/20  Iteration 492/3560 Training loss: 2.1798 0.0389 sec/batch
...

Epoch 4/20  Iteration 600/3560 Training loss: 2.0866 0.0400 sec/batch
Epoch 4/20  Iteration 601/3560 Training loss: 2.0863 0.0401 sec/batch
Epoch 4/20  Iteration 602/3560 Training loss: 2.0856 0.0408 sec/batch
Epoch 4/20  Iteration 603/3560 Training loss: 2.0853 0.0405 sec/batch
Epoch 4/20  Iteration 604/3560 Training loss: 2.0850 0.0412 sec/batch
Epoch 4/20  Iteration 605/3560 Training loss: 2.0849 0.0420 sec/batch
Epoch 4/20  Iteration 606/3560 Training loss: 2.0849 0.0401 sec/batch
Epoch 4/20  Iteration 607/3560 Training loss: 2.0848 0.0403 sec/batch
Epoch 4/20  Iteration 608/3560 Training loss: 2.0843 0.0465 sec/batch
Epoch 4/20  Iteration 609/3560 Training loss: 2.0839 0.0443 sec/batch
Epoch 4/20  Iteration 610/3560 Training loss: 2.0842 0.0436 sec/batch
Epoch 4/20  Iteration 611/3560 Training loss: 2.0838 0.0402 sec/batch
Epoch 4/20  Iteration 612/3560 Training loss: 2.0838 0.0393 sec/batch
Epoch 4/20  Iteration 613/3560 Training loss: 2.0832 0.0443 sec/batch
...

Epoch 5/20  Iteration 720/3560 Training loss: 2.0192 0.0400 sec/batch
Epoch 5/20  Iteration 721/3560 Training loss: 2.0206 0.0401 sec/batch
Epoch 5/20  Iteration 722/3560 Training loss: 2.0203 0.0473 sec/batch
Epoch 5/20  Iteration 723/3560 Training loss: 2.0183 0.0399 sec/batch
Epoch 5/20  Iteration 724/3560 Training loss: 2.0168 0.0413 sec/batch
Epoch 5/20  Iteration 725/3560 Training loss: 2.0171 0.0443 sec/batch
Epoch 5/20  Iteration 726/3560 Training loss: 2.0194 0.0398 sec/batch
Epoch 5/20  Iteration 727/3560 Training loss: 2.0190 0.0448 sec/batch
Epoch 5/20  Iteration 728/3560 Training loss: 2.0178 0.0397 sec/batch
Epoch 5/20  Iteration 729/3560 Training loss: 2.0175 0.0398 sec/batch
Epoch 5/20  Iteration 730/3560 Training loss: 2.0195 0.0409 sec/batch
Epoch 5/20  Iteration 731/3560 Training loss: 2.0192 0.0447 sec/batch
Epoch 5/20  Iteration 732/3560 Training loss: 2.0186 0.0401 sec/batch
Epoch 5/20  Iteration 733/3560 Training loss: 2.0179 0.0398 sec/batch
...

Epoch 5/20  Iteration 840/3560 Training loss: 1.9887 0.0403 sec/batch
Epoch 5/20  Iteration 841/3560 Training loss: 1.9886 0.0394 sec/batch
Epoch 5/20  Iteration 842/3560 Training loss: 1.9885 0.0401 sec/batch
Epoch 5/20  Iteration 843/3560 Training loss: 1.9883 0.0399 sec/batch
Epoch 5/20  Iteration 844/3560 Training loss: 1.9879 0.0395 sec/batch
Epoch 5/20  Iteration 845/3560 Training loss: 1.9878 0.0398 sec/batch
Epoch 5/20  Iteration 846/3560 Training loss: 1.9877 0.0400 sec/batch
Epoch 5/20  Iteration 847/3560 Training loss: 1.9875 0.0402 sec/batch
Epoch 5/20  Iteration 848/3560 Training loss: 1.9874 0.0400 sec/batch
Epoch 5/20  Iteration 849/3560 Training loss: 1.9873 0.0408 sec/batch
Epoch 5/20  Iteration 850/3560 Training loss: 1.9872 0.0432 sec/batch
Epoch 5/20  Iteration 851/3560 Training loss: 1.9873 0.0422 sec/batch
Epoch 5/20  Iteration 852/3560 Training loss: 1.9872 0.0399 sec/batch
Epoch 5/20  Iteration 853/3560 Training loss: 1.9872 0.0404 sec/batch
...

Epoch 6/20  Iteration 960/3560 Training loss: 1.9418 0.0402 sec/batch
Epoch 6/20  Iteration 961/3560 Training loss: 1.9418 0.0423 sec/batch
Epoch 6/20  Iteration 962/3560 Training loss: 1.9419 0.0420 sec/batch
Epoch 6/20  Iteration 963/3560 Training loss: 1.9421 0.0468 sec/batch
Epoch 6/20  Iteration 964/3560 Training loss: 1.9416 0.0414 sec/batch
Epoch 6/20  Iteration 965/3560 Training loss: 1.9414 0.0408 sec/batch
Epoch 6/20  Iteration 966/3560 Training loss: 1.9418 0.0400 sec/batch
Epoch 6/20  Iteration 967/3560 Training loss: 1.9416 0.0421 sec/batch
Epoch 6/20  Iteration 968/3560 Training loss: 1.9417 0.0398 sec/batch
Epoch 6/20  Iteration 969/3560 Training loss: 1.9412 0.0400 sec/batch
Epoch 6/20  Iteration 970/3560 Training loss: 1.9409 0.0406 sec/batch
Epoch 6/20  Iteration 971/3560 Training loss: 1.9402 0.0405 sec/batch
Epoch 6/20  Iteration 972/3560 Training loss: 1.9402 0.0401 sec/batch
Epoch 6/20  Iteration 973/3560 Training loss: 1.9396 0.0398 sec/batch
...

Epoch 7/20  Iteration 1080/3560 Training loss: 1.8987 0.0409 sec/batch
Epoch 7/20  Iteration 1081/3560 Training loss: 1.8990 0.0417 sec/batch
Epoch 7/20  Iteration 1082/3560 Training loss: 1.9014 0.0417 sec/batch
Epoch 7/20  Iteration 1083/3560 Training loss: 1.9013 0.0418 sec/batch
Epoch 7/20  Iteration 1084/3560 Training loss: 1.8998 0.0403 sec/batch
Epoch 7/20  Iteration 1085/3560 Training loss: 1.8997 0.0428 sec/batch
Epoch 7/20  Iteration 1086/3560 Training loss: 1.9017 0.0450 sec/batch
Epoch 7/20  Iteration 1087/3560 Training loss: 1.9013 0.0402 sec/batch
Epoch 7/20  Iteration 1088/3560 Training loss: 1.9013 0.0394 sec/batch
Epoch 7/20  Iteration 1089/3560 Training loss: 1.9006 0.0418 sec/batch
Epoch 7/20  Iteration 1090/3560 Training loss: 1.9030 0.0447 sec/batch
Epoch 7/20  Iteration 1091/3560 Training loss: 1.9021 0.0432 sec/batch
Epoch 7/20  Iteration 1092/3560 Training loss: 1.9014 0.0398 sec/batch
Epoch 7/20  Iteration 1093/3560 Training loss: 1.9010 0.0403 sec/batch
...

Epoch 7/20  Iteration 1200/3560 Training loss: 1.8777 0.0402 sec/batch
Epoch 7/20  Iteration 1201/3560 Training loss: 1.8777 0.0451 sec/batch
Epoch 7/20  Iteration 1202/3560 Training loss: 1.8776 0.0423 sec/batch
Epoch 7/20  Iteration 1203/3560 Training loss: 1.8775 0.0406 sec/batch
Epoch 7/20  Iteration 1204/3560 Training loss: 1.8775 0.0418 sec/batch
Epoch 7/20  Iteration 1205/3560 Training loss: 1.8775 0.0426 sec/batch
Epoch 7/20  Iteration 1206/3560 Training loss: 1.8775 0.0403 sec/batch
Epoch 7/20  Iteration 1207/3560 Training loss: 1.8777 0.0469 sec/batch
Epoch 7/20  Iteration 1208/3560 Training loss: 1.8775 0.0398 sec/batch
Epoch 7/20  Iteration 1209/3560 Training loss: 1.8777 0.0419 sec/batch
Epoch 7/20  Iteration 1210/3560 Training loss: 1.8775 0.0401 sec/batch
Epoch 7/20  Iteration 1211/3560 Training loss: 1.8775 0.0401 sec/batch
Epoch 7/20  Iteration 1212/3560 Training loss: 1.8775 0.0395 sec/batch
Epoch 7/20  Iteration 1213/3560 Training loss: 1.8773 0.0426 sec/batch
...

Epoch 8/20  Iteration 1320/3560 Training loss: 1.8482 0.0415 sec/batch
Epoch 8/20  Iteration 1321/3560 Training loss: 1.8481 0.0414 sec/batch
Epoch 8/20  Iteration 1322/3560 Training loss: 1.8485 0.0426 sec/batch
Epoch 8/20  Iteration 1323/3560 Training loss: 1.8483 0.0397 sec/batch
Epoch 8/20  Iteration 1324/3560 Training loss: 1.8484 0.0452 sec/batch
Epoch 8/20  Iteration 1325/3560 Training loss: 1.8480 0.0400 sec/batch
Epoch 8/20  Iteration 1326/3560 Training loss: 1.8477 0.0398 sec/batch
Epoch 8/20  Iteration 1327/3560 Training loss: 1.8470 0.0394 sec/batch
Epoch 8/20  Iteration 1328/3560 Training loss: 1.8470 0.0442 sec/batch
Epoch 8/20  Iteration 1329/3560 Training loss: 1.8465 0.0410 sec/batch
Epoch 8/20  Iteration 1330/3560 Training loss: 1.8463 0.0404 sec/batch
Epoch 8/20  Iteration 1331/3560 Training loss: 1.8457 0.0446 sec/batch
Epoch 8/20  Iteration 1332/3560 Training loss: 1.8453 0.0401 sec/batch
Epoch 8/20  Iteration 1333/3560 Training loss: 1.8450 0.0404 sec/batch
...

Epoch 9/20  Iteration 1440/3560 Training loss: 1.8184 0.0425 sec/batch
Epoch 9/20  Iteration 1441/3560 Training loss: 1.8185 0.0403 sec/batch
Epoch 9/20  Iteration 1442/3560 Training loss: 1.8203 0.0444 sec/batch
Epoch 9/20  Iteration 1443/3560 Training loss: 1.8202 0.0397 sec/batch
Epoch 9/20  Iteration 1444/3560 Training loss: 1.8201 0.0405 sec/batch
Epoch 9/20  Iteration 1445/3560 Training loss: 1.8195 0.0401 sec/batch
Epoch 9/20  Iteration 1446/3560 Training loss: 1.8215 0.0420 sec/batch
Epoch 9/20  Iteration 1447/3560 Training loss: 1.8205 0.0448 sec/batch
Epoch 9/20  Iteration 1448/3560 Training loss: 1.8199 0.0397 sec/batch
Epoch 9/20  Iteration 1449/3560 Training loss: 1.8195 0.0402 sec/batch
Epoch 9/20  Iteration 1450/3560 Training loss: 1.8187 0.0447 sec/batch
Epoch 9/20  Iteration 1451/3560 Training loss: 1.8176 0.0402 sec/batch
Epoch 9/20  Iteration 1452/3560 Training loss: 1.8179 0.0403 sec/batch
Epoch 9/20  Iteration 1453/3560 Training loss: 1.8191 0.0397 sec/batch
...

Epoch 9/20  Iteration 1560/3560 Training loss: 1.7985 0.0400 sec/batch
Epoch 9/20  Iteration 1561/3560 Training loss: 1.7985 0.0406 sec/batch
Epoch 9/20  Iteration 1562/3560 Training loss: 1.7986 0.0402 sec/batch
Epoch 9/20  Iteration 1563/3560 Training loss: 1.7987 0.0393 sec/batch
Epoch 9/20  Iteration 1564/3560 Training loss: 1.7986 0.0406 sec/batch
Epoch 9/20  Iteration 1565/3560 Training loss: 1.7989 0.0403 sec/batch
Epoch 9/20  Iteration 1566/3560 Training loss: 1.7987 0.0406 sec/batch
Epoch 9/20  Iteration 1567/3560 Training loss: 1.7987 0.0400 sec/batch
Epoch 9/20  Iteration 1568/3560 Training loss: 1.7988 0.0400 sec/batch
Epoch 9/20  Iteration 1569/3560 Training loss: 1.7986 0.0397 sec/batch
Epoch 9/20  Iteration 1570/3560 Training loss: 1.7988 0.0403 sec/batch
Epoch 9/20  Iteration 1571/3560 Training loss: 1.7988 0.0407 sec/batch
Epoch 9/20  Iteration 1572/3560 Training loss: 1.7989 0.0400 sec/batch
Epoch 9/20  Iteration 1573/3560 Training loss: 1.7989 0.0448 sec/batch
...

Epoch 10/20  Iteration 1675/3560 Training loss: 1.7769 0.0398 sec/batch
Epoch 10/20  Iteration 1676/3560 Training loss: 1.7765 0.0402 sec/batch
Epoch 10/20  Iteration 1677/3560 Training loss: 1.7764 0.0398 sec/batch
Epoch 10/20  Iteration 1678/3560 Training loss: 1.7768 0.0395 sec/batch
Epoch 10/20  Iteration 1679/3560 Training loss: 1.7766 0.0397 sec/batch
Epoch 10/20  Iteration 1680/3560 Training loss: 1.7767 0.0428 sec/batch
Epoch 10/20  Iteration 1681/3560 Training loss: 1.7763 0.0403 sec/batch
Epoch 10/20  Iteration 1682/3560 Training loss: 1.7761 0.0443 sec/batch
Epoch 10/20  Iteration 1683/3560 Training loss: 1.7755 0.0393 sec/batch
Epoch 10/20  Iteration 1684/3560 Training loss: 1.7755 0.0475 sec/batch
Epoch 10/20  Iteration 1685/3560 Training loss: 1.7750 0.0459 sec/batch
Epoch 10/20  Iteration 1686/3560 Training loss: 1.7749 0.0426 sec/batch
Epoch 10/20  Iteration 1687/3560 Training loss: 1.7744 0.0469 sec/batch
Epoch 10/20  Iteration 1688/3560 Training loss: 1.7740 0.0442 sec/batch

Epoch 11/20  Iteration 1790/3560 Training loss: 1.7564 0.0409 sec/batch
Epoch 11/20  Iteration 1791/3560 Training loss: 1.7538 0.0404 sec/batch
Epoch 11/20  Iteration 1792/3560 Training loss: 1.7518 0.0427 sec/batch
Epoch 11/20  Iteration 1793/3560 Training loss: 1.7515 0.0396 sec/batch
Epoch 11/20  Iteration 1794/3560 Training loss: 1.7539 0.0402 sec/batch
Epoch 11/20  Iteration 1795/3560 Training loss: 1.7534 0.0406 sec/batch
Epoch 11/20  Iteration 1796/3560 Training loss: 1.7519 0.0415 sec/batch
Epoch 11/20  Iteration 1797/3560 Training loss: 1.7520 0.0402 sec/batch
Epoch 11/20  Iteration 1798/3560 Training loss: 1.7539 0.0471 sec/batch
Epoch 11/20  Iteration 1799/3560 Training loss: 1.7539 0.0399 sec/batch
Epoch 11/20  Iteration 1800/3560 Training loss: 1.7541 0.0405 sec/batch
Epoch 11/20  Iteration 1801/3560 Training loss: 1.7535 0.0427 sec/batch
Epoch 11/20  Iteration 1802/3560 Training loss: 1.7554 0.0449 sec/batch
Epoch 11/20  Iteration 1803/3560 Training loss: 1.7544 0.0456 sec/batch

Epoch 11/20  Iteration 1905/3560 Training loss: 1.7365 0.0442 sec/batch
Epoch 11/20  Iteration 1906/3560 Training loss: 1.7362 0.0448 sec/batch
Epoch 11/20  Iteration 1907/3560 Training loss: 1.7363 0.0399 sec/batch
Epoch 11/20  Iteration 1908/3560 Training loss: 1.7363 0.0396 sec/batch
Epoch 11/20  Iteration 1909/3560 Training loss: 1.7362 0.0406 sec/batch
Epoch 11/20  Iteration 1910/3560 Training loss: 1.7362 0.0404 sec/batch
Epoch 11/20  Iteration 1911/3560 Training loss: 1.7360 0.0436 sec/batch
Epoch 11/20  Iteration 1912/3560 Training loss: 1.7357 0.0401 sec/batch
Epoch 11/20  Iteration 1913/3560 Training loss: 1.7357 0.0406 sec/batch
Epoch 11/20  Iteration 1914/3560 Training loss: 1.7357 0.0402 sec/batch
Epoch 11/20  Iteration 1915/3560 Training loss: 1.7356 0.0408 sec/batch
Epoch 11/20  Iteration 1916/3560 Training loss: 1.7357 0.0404 sec/batch
Epoch 11/20  Iteration 1917/3560 Training loss: 1.7357 0.0424 sec/batch
Epoch 11/20  Iteration 1918/3560 Training loss: 1.7358 0.0401 sec/batch

Epoch 12/20  Iteration 2020/3560 Training loss: 1.7186 0.0427 sec/batch
Epoch 12/20  Iteration 2021/3560 Training loss: 1.7189 0.0441 sec/batch
Epoch 12/20  Iteration 2022/3560 Training loss: 1.7190 0.0402 sec/batch
Epoch 12/20  Iteration 2023/3560 Training loss: 1.7189 0.0400 sec/batch
Epoch 12/20  Iteration 2024/3560 Training loss: 1.7193 0.0414 sec/batch
Epoch 12/20  Iteration 2025/3560 Training loss: 1.7195 0.0411 sec/batch
Epoch 12/20  Iteration 2026/3560 Training loss: 1.7190 0.0400 sec/batch
Epoch 12/20  Iteration 2027/3560 Training loss: 1.7190 0.0404 sec/batch
Epoch 12/20  Iteration 2028/3560 Training loss: 1.7190 0.0400 sec/batch
Epoch 12/20  Iteration 2029/3560 Training loss: 1.7193 0.0402 sec/batch
Epoch 12/20  Iteration 2030/3560 Training loss: 1.7194 0.0427 sec/batch
Epoch 12/20  Iteration 2031/3560 Training loss: 1.7197 0.0428 sec/batch
Epoch 12/20  Iteration 2032/3560 Training loss: 1.7194 0.0398 sec/batch
Epoch 12/20  Iteration 2033/3560 Training loss: 1.7193 0.0421 sec/batch

Epoch 12/20  Iteration 2135/3560 Training loss: 1.7104 0.0422 sec/batch
Epoch 12/20  Iteration 2136/3560 Training loss: 1.7104 0.0464 sec/batch
Epoch 13/20  Iteration 2137/3560 Training loss: 1.7700 0.0413 sec/batch
Epoch 13/20  Iteration 2138/3560 Training loss: 1.7326 0.0412 sec/batch
Epoch 13/20  Iteration 2139/3560 Training loss: 1.7209 0.0398 sec/batch
Epoch 13/20  Iteration 2140/3560 Training loss: 1.7140 0.0457 sec/batch
Epoch 13/20  Iteration 2141/3560 Training loss: 1.7096 0.0409 sec/batch
Epoch 13/20  Iteration 2142/3560 Training loss: 1.7019 0.0424 sec/batch
Epoch 13/20  Iteration 2143/3560 Training loss: 1.7026 0.0403 sec/batch
Epoch 13/20  Iteration 2144/3560 Training loss: 1.7020 0.0404 sec/batch
Epoch 13/20  Iteration 2145/3560 Training loss: 1.7046 0.0406 sec/batch
Epoch 13/20  Iteration 2146/3560 Training loss: 1.7040 0.0452 sec/batch
Epoch 13/20  Iteration 2147/3560 Training loss: 1.7010 0.0404 sec/batch
Epoch 13/20  Iteration 2148/3560 Training loss: 1.6992 0.0410 sec/batch

Epoch 13/20  Iteration 2250/3560 Training loss: 1.6875 0.0410 sec/batch
Epoch 13/20  Iteration 2251/3560 Training loss: 1.6873 0.0441 sec/batch
Epoch 13/20  Iteration 2252/3560 Training loss: 1.6869 0.0423 sec/batch
Epoch 13/20  Iteration 2253/3560 Training loss: 1.6867 0.0429 sec/batch
Epoch 13/20  Iteration 2254/3560 Training loss: 1.6866 0.0422 sec/batch
Epoch 13/20  Iteration 2255/3560 Training loss: 1.6865 0.0415 sec/batch
Epoch 13/20  Iteration 2256/3560 Training loss: 1.6864 0.0417 sec/batch
Epoch 13/20  Iteration 2257/3560 Training loss: 1.6864 0.0428 sec/batch
Epoch 13/20  Iteration 2258/3560 Training loss: 1.6861 0.0410 sec/batch
Epoch 13/20  Iteration 2259/3560 Training loss: 1.6859 0.0451 sec/batch
Epoch 13/20  Iteration 2260/3560 Training loss: 1.6860 0.0417 sec/batch
Epoch 13/20  Iteration 2261/3560 Training loss: 1.6859 0.0413 sec/batch
Epoch 13/20  Iteration 2262/3560 Training loss: 1.6855 0.0444 sec/batch
Epoch 13/20  Iteration 2263/3560 Training loss: 1.6857 0.0407 sec/batch

Epoch 14/20  Iteration 2365/3560 Training loss: 1.6700 0.0418 sec/batch
Epoch 14/20  Iteration 2366/3560 Training loss: 1.6708 0.0410 sec/batch
Epoch 14/20  Iteration 2367/3560 Training loss: 1.6706 0.0426 sec/batch
Epoch 14/20  Iteration 2368/3560 Training loss: 1.6707 0.0406 sec/batch
Epoch 14/20  Iteration 2369/3560 Training loss: 1.6705 0.0424 sec/batch
Epoch 14/20  Iteration 2370/3560 Training loss: 1.6706 0.0416 sec/batch
Epoch 14/20  Iteration 2371/3560 Training loss: 1.6709 0.0413 sec/batch
Epoch 14/20  Iteration 2372/3560 Training loss: 1.6705 0.0478 sec/batch
Epoch 14/20  Iteration 2373/3560 Training loss: 1.6700 0.0486 sec/batch
Epoch 14/20  Iteration 2374/3560 Training loss: 1.6707 0.0438 sec/batch
Epoch 14/20  Iteration 2375/3560 Training loss: 1.6707 0.0416 sec/batch
Epoch 14/20  Iteration 2376/3560 Training loss: 1.6714 0.0406 sec/batch
Epoch 14/20  Iteration 2377/3560 Training loss: 1.6717 0.0420 sec/batch
Epoch 14/20  Iteration 2378/3560 Training loss: 1.6719 0.0453 sec/batch

Epoch 14/20  Iteration 2480/3560 Training loss: 1.6636 0.0458 sec/batch
Epoch 14/20  Iteration 2481/3560 Training loss: 1.6637 0.0430 sec/batch
Epoch 14/20  Iteration 2482/3560 Training loss: 1.6640 0.0403 sec/batch
Epoch 14/20  Iteration 2483/3560 Training loss: 1.6640 0.0414 sec/batch
Epoch 14/20  Iteration 2484/3560 Training loss: 1.6639 0.0425 sec/batch
Epoch 14/20  Iteration 2485/3560 Training loss: 1.6638 0.0416 sec/batch
Epoch 14/20  Iteration 2486/3560 Training loss: 1.6638 0.0452 sec/batch
Epoch 14/20  Iteration 2487/3560 Training loss: 1.6639 0.0414 sec/batch
Epoch 14/20  Iteration 2488/3560 Training loss: 1.6640 0.0415 sec/batch
Epoch 14/20  Iteration 2489/3560 Training loss: 1.6641 0.0456 sec/batch
Epoch 14/20  Iteration 2490/3560 Training loss: 1.6641 0.0428 sec/batch
Epoch 14/20  Iteration 2491/3560 Training loss: 1.6639 0.0403 sec/batch
Epoch 14/20  Iteration 2492/3560 Training loss: 1.6640 0.0412 sec/batch
Epoch 15/20  Iteration 2493/3560 Training loss: 1.7240 0.0404 sec/batch

Epoch 15/20  Iteration 2595/3560 Training loss: 1.6460 0.0428 sec/batch
Epoch 15/20  Iteration 2596/3560 Training loss: 1.6458 0.0431 sec/batch
Epoch 15/20  Iteration 2597/3560 Training loss: 1.6456 0.0405 sec/batch
Epoch 15/20  Iteration 2598/3560 Training loss: 1.6455 0.0415 sec/batch
Epoch 15/20  Iteration 2599/3560 Training loss: 1.6454 0.0432 sec/batch
Epoch 15/20  Iteration 2600/3560 Training loss: 1.6454 0.0418 sec/batch
Epoch 15/20  Iteration 2601/3560 Training loss: 1.6453 0.0438 sec/batch
Epoch 15/20  Iteration 2602/3560 Training loss: 1.6453 0.0408 sec/batch
Epoch 15/20  Iteration 2603/3560 Training loss: 1.6452 0.0408 sec/batch
Epoch 15/20  Iteration 2604/3560 Training loss: 1.6450 0.0411 sec/batch
Epoch 15/20  Iteration 2605/3560 Training loss: 1.6448 0.0406 sec/batch
Epoch 15/20  Iteration 2606/3560 Training loss: 1.6447 0.0415 sec/batch
Epoch 15/20  Iteration 2607/3560 Training loss: 1.6445 0.0422 sec/batch
Epoch 15/20  Iteration 2608/3560 Training loss: 1.6441 0.0456 sec/batch

Epoch 16/20  Iteration 2710/3560 Training loss: 1.6331 0.0475 sec/batch
Epoch 16/20  Iteration 2711/3560 Training loss: 1.6326 0.0434 sec/batch
Epoch 16/20  Iteration 2712/3560 Training loss: 1.6329 0.0409 sec/batch
Epoch 16/20  Iteration 2713/3560 Training loss: 1.6322 0.0412 sec/batch
Epoch 16/20  Iteration 2714/3560 Training loss: 1.6314 0.0411 sec/batch
Epoch 16/20  Iteration 2715/3560 Training loss: 1.6315 0.0407 sec/batch
Epoch 16/20  Iteration 2716/3560 Training loss: 1.6305 0.0408 sec/batch
Epoch 16/20  Iteration 2717/3560 Training loss: 1.6304 0.0404 sec/batch
Epoch 16/20  Iteration 2718/3560 Training loss: 1.6299 0.0432 sec/batch
Epoch 16/20  Iteration 2719/3560 Training loss: 1.6297 0.0429 sec/batch
Epoch 16/20  Iteration 2720/3560 Training loss: 1.6304 0.0418 sec/batch
Epoch 16/20  Iteration 2721/3560 Training loss: 1.6299 0.0420 sec/batch
Epoch 16/20  Iteration 2722/3560 Training loss: 1.6308 0.0422 sec/batch
Epoch 16/20  Iteration 2723/3560 Training loss: 1.6306 0.0433 sec/batch

Epoch 16/20  Iteration 2825/3560 Training loss: 1.6248 0.0417 sec/batch
Epoch 16/20  Iteration 2826/3560 Training loss: 1.6248 0.0411 sec/batch
Epoch 16/20  Iteration 2827/3560 Training loss: 1.6248 0.0410 sec/batch
Epoch 16/20  Iteration 2828/3560 Training loss: 1.6248 0.0414 sec/batch
Epoch 16/20  Iteration 2829/3560 Training loss: 1.6246 0.0417 sec/batch
Epoch 16/20  Iteration 2830/3560 Training loss: 1.6247 0.0414 sec/batch
Epoch 16/20  Iteration 2831/3560 Training loss: 1.6248 0.0416 sec/batch
Epoch 16/20  Iteration 2832/3560 Training loss: 1.6248 0.0478 sec/batch
Epoch 16/20  Iteration 2833/3560 Training loss: 1.6248 0.0427 sec/batch
Epoch 16/20  Iteration 2834/3560 Training loss: 1.6248 0.0408 sec/batch
Epoch 16/20  Iteration 2835/3560 Training loss: 1.6248 0.0417 sec/batch
Epoch 16/20  Iteration 2836/3560 Training loss: 1.6248 0.0413 sec/batch
Epoch 16/20  Iteration 2837/3560 Training loss: 1.6249 0.0454 sec/batch
Epoch 16/20  Iteration 2838/3560 Training loss: 1.6252 0.0429 sec/batch

Epoch 17/20  Iteration 2940/3560 Training loss: 1.6125 0.0423 sec/batch
Epoch 17/20  Iteration 2941/3560 Training loss: 1.6122 0.0417 sec/batch
Epoch 17/20  Iteration 2942/3560 Training loss: 1.6119 0.0429 sec/batch
Epoch 17/20  Iteration 2943/3560 Training loss: 1.6115 0.0410 sec/batch
Epoch 17/20  Iteration 2944/3560 Training loss: 1.6115 0.0404 sec/batch
Epoch 17/20  Iteration 2945/3560 Training loss: 1.6115 0.0426 sec/batch
Epoch 17/20  Iteration 2946/3560 Training loss: 1.6111 0.0460 sec/batch
Epoch 17/20  Iteration 2947/3560 Training loss: 1.6108 0.0407 sec/batch
Epoch 17/20  Iteration 2948/3560 Training loss: 1.6104 0.0415 sec/batch
Epoch 17/20  Iteration 2949/3560 Training loss: 1.6104 0.0478 sec/batch
Epoch 17/20  Iteration 2950/3560 Training loss: 1.6103 0.0461 sec/batch
Epoch 17/20  Iteration 2951/3560 Training loss: 1.6100 0.0412 sec/batch
Epoch 17/20  Iteration 2952/3560 Training loss: 1.6098 0.0407 sec/batch
Epoch 17/20  Iteration 2953/3560 Training loss: 1.6096 0.0414 sec/batch

Epoch 18/20  Iteration 3055/3560 Training loss: 1.6023 0.0457 sec/batch
Epoch 18/20  Iteration 3056/3560 Training loss: 1.6027 0.0411 sec/batch
Epoch 18/20  Iteration 3057/3560 Training loss: 1.6025 0.0431 sec/batch
Epoch 18/20  Iteration 3058/3560 Training loss: 1.6017 0.0411 sec/batch
Epoch 18/20  Iteration 3059/3560 Training loss: 1.6021 0.0415 sec/batch
Epoch 18/20  Iteration 3060/3560 Training loss: 1.6026 0.0415 sec/batch
Epoch 18/20  Iteration 3061/3560 Training loss: 1.6024 0.0422 sec/batch
Epoch 18/20  Iteration 3062/3560 Training loss: 1.6023 0.0412 sec/batch
Epoch 18/20  Iteration 3063/3560 Training loss: 1.6016 0.0414 sec/batch
Epoch 18/20  Iteration 3064/3560 Training loss: 1.6007 0.0409 sec/batch
Epoch 18/20  Iteration 3065/3560 Training loss: 1.5993 0.0412 sec/batch
Epoch 18/20  Iteration 3066/3560 Training loss: 1.5987 0.0436 sec/batch
Epoch 18/20  Iteration 3067/3560 Training loss: 1.5981 0.0425 sec/batch
Epoch 18/20  Iteration 3068/3560 Training loss: 1.5986 0.0457 sec/batch

Epoch 18/20  Iteration 3170/3560 Training loss: 1.5911 0.0482 sec/batch
Epoch 18/20  Iteration 3171/3560 Training loss: 1.5910 0.0411 sec/batch
Epoch 18/20  Iteration 3172/3560 Training loss: 1.5912 0.0412 sec/batch
Epoch 18/20  Iteration 3173/3560 Training loss: 1.5913 0.0408 sec/batch
Epoch 18/20  Iteration 3174/3560 Training loss: 1.5916 0.0438 sec/batch
Epoch 18/20  Iteration 3175/3560 Training loss: 1.5916 0.0464 sec/batch
Epoch 18/20  Iteration 3176/3560 Training loss: 1.5915 0.0430 sec/batch
Epoch 18/20  Iteration 3177/3560 Training loss: 1.5912 0.0413 sec/batch
Epoch 18/20  Iteration 3178/3560 Training loss: 1.5914 0.0415 sec/batch
Epoch 18/20  Iteration 3179/3560 Training loss: 1.5914 0.0410 sec/batch
Epoch 18/20  Iteration 3180/3560 Training loss: 1.5914 0.0477 sec/batch
Epoch 18/20  Iteration 3181/3560 Training loss: 1.5914 0.0437 sec/batch
Epoch 18/20  Iteration 3182/3560 Training loss: 1.5914 0.0415 sec/batch
Epoch 18/20  Iteration 3183/3560 Training loss: 1.5914 0.0419 sec/batch

Epoch 19/20  Iteration 3285/3560 Training loss: 1.5835 0.0434 sec/batch
Epoch 19/20  Iteration 3286/3560 Training loss: 1.5836 0.0419 sec/batch
Epoch 19/20  Iteration 3287/3560 Training loss: 1.5831 0.0408 sec/batch
Epoch 19/20  Iteration 3288/3560 Training loss: 1.5832 0.0411 sec/batch
Epoch 19/20  Iteration 3289/3560 Training loss: 1.5827 0.0417 sec/batch
Epoch 19/20  Iteration 3290/3560 Training loss: 1.5825 0.0423 sec/batch
Epoch 19/20  Iteration 3291/3560 Training loss: 1.5823 0.0410 sec/batch
Epoch 19/20  Iteration 3292/3560 Training loss: 1.5821 0.0435 sec/batch
Epoch 19/20  Iteration 3293/3560 Training loss: 1.5817 0.0407 sec/batch
Epoch 19/20  Iteration 3294/3560 Training loss: 1.5819 0.0432 sec/batch
Epoch 19/20  Iteration 3295/3560 Training loss: 1.5817 0.0417 sec/batch
Epoch 19/20  Iteration 3296/3560 Training loss: 1.5816 0.0439 sec/batch
Epoch 19/20  Iteration 3297/3560 Training loss: 1.5812 0.0413 sec/batch
Epoch 19/20  Iteration 3298/3560 Training loss: 1.5810 0.0412 sec/batch

Epoch 20/20  Iteration 3399/3560 Training loss: 1.5721 0.0411 sec/batch
Epoch 20/20  Iteration 3400/3560 Training loss: 1.5738 0.0415 sec/batch
Epoch 20/20  Iteration 3401/3560 Training loss: 1.5743 0.0408 sec/batch
Epoch 20/20  Iteration 3402/3560 Training loss: 1.5749 0.0418 sec/batch
Epoch 20/20  Iteration 3403/3560 Training loss: 1.5745 0.0414 sec/batch
Epoch 20/20  Iteration 3404/3560 Training loss: 1.5754 0.0418 sec/batch
Epoch 20/20  Iteration 3405/3560 Training loss: 1.5745 0.0410 sec/batch
Epoch 20/20  Iteration 3406/3560 Training loss: 1.5741 0.0414 sec/batch
Epoch 20/20  Iteration 3407/3560 Training loss: 1.5736 0.0417 sec/batch
Epoch 20/20  Iteration 3408/3560 Training loss: 1.5725 0.0440 sec/batch
Epoch 20/20  Iteration 3409/3560 Training loss: 1.5717 0.0431 sec/batch
Epoch 20/20  Iteration 3410/3560 Training loss: 1.5725 0.0430 sec/batch
Epoch 20/20  Iteration 3411/3560 Training loss: 1.5733 0.0431 sec/batch
Epoch 20/20  Iteration 3412/3560 Training loss: 1.5737 0.0437 sec/batch

Epoch 20/20  Iteration 3514/3560 Training loss: 1.5624 0.0417 sec/batch
Epoch 20/20  Iteration 3515/3560 Training loss: 1.5624 0.0418 sec/batch
Epoch 20/20  Iteration 3516/3560 Training loss: 1.5625 0.0407 sec/batch
Epoch 20/20  Iteration 3517/3560 Training loss: 1.5624 0.0415 sec/batch
Epoch 20/20  Iteration 3518/3560 Training loss: 1.5625 0.0410 sec/batch
Epoch 20/20  Iteration 3519/3560 Training loss: 1.5626 0.0449 sec/batch
Epoch 20/20  Iteration 3520/3560 Training loss: 1.5627 0.0410 sec/batch
Epoch 20/20  Iteration 3521/3560 Training loss: 1.5628 0.0464 sec/batch
Epoch 20/20  Iteration 3522/3560 Training loss: 1.5627 0.0425 sec/batch
Epoch 20/20  Iteration 3523/3560 Training loss: 1.5630 0.0418 sec/batch
Epoch 20/20  Iteration 3524/3560 Training loss: 1.5630 0.0412 sec/batch
Epoch 20/20  Iteration 3525/3560 Training loss: 1.5630 0.0429 sec/batch
Epoch 20/20  Iteration 3526/3560 Training loss: 1.5631 0.0415 sec/batch
Epoch 20/20  Iteration 3527/3560 Training loss: 1.5631 0.0454 sec/batch

Epoch 1/20  Iteration 76/3560 Training loss: 3.3020 0.0300 sec/batch
Epoch 1/20  Iteration 77/3560 Training loss: 3.2991 0.0261 sec/batch
Epoch 1/20  Iteration 78/3560 Training loss: 3.2962 0.0288 sec/batch
Epoch 1/20  Iteration 79/3560 Training loss: 3.2934 0.0273 sec/batch
Epoch 1/20  Iteration 80/3560 Training loss: 3.2904 0.0260 sec/batch
Epoch 1/20  Iteration 81/3560 Training loss: 3.2876 0.0265 sec/batch
Epoch 1/20  Iteration 82/3560 Training loss: 3.2851 0.0261 sec/batch
Epoch 1/20  Iteration 83/3560 Training loss: 3.2826 0.0273 sec/batch
Epoch 1/20  Iteration 84/3560 Training loss: 3.2800 0.0260 sec/batch
Epoch 1/20  Iteration 85/3560 Training loss: 3.2773 0.0260 sec/batch
Epoch 1/20  Iteration 86/3560 Training loss: 3.2748 0.0263 sec/batch
Epoch 1/20  Iteration 87/3560 Training loss: 3.2722 0.0261 sec/batch
Epoch 1/20  Iteration 88/3560 Training loss: 3.2698 0.0347 sec/batch
Epoch 1/20  Iteration 89/3560 Training loss: 3.2676 0.0278 sec/batch
Epoch 1/20  Iteration 90/3560 Trai

Epoch 2/20  Iteration 195/3560 Training loss: 2.7987 0.0349 sec/batch
Epoch 2/20  Iteration 196/3560 Training loss: 2.7987 0.0344 sec/batch
Epoch 2/20  Iteration 197/3560 Training loss: 2.7970 0.0355 sec/batch
Epoch 2/20  Iteration 198/3560 Training loss: 2.7939 0.0352 sec/batch
Epoch 2/20  Iteration 199/3560 Training loss: 2.7918 0.0429 sec/batch
Epoch 2/20  Iteration 200/3560 Training loss: 2.7902 0.0350 sec/batch
Epoch 2/20  Iteration 201/3560 Training loss: 2.7881 0.0354 sec/batch
Epoch 2/20  Iteration 202/3560 Training loss: 2.7860 0.0349 sec/batch
Epoch 2/20  Iteration 203/3560 Training loss: 2.7836 0.0376 sec/batch
Epoch 2/20  Iteration 204/3560 Training loss: 2.7821 0.0360 sec/batch
Epoch 2/20  Iteration 205/3560 Training loss: 2.7805 0.0353 sec/batch
Epoch 2/20  Iteration 206/3560 Training loss: 2.7783 0.0383 sec/batch
Epoch 2/20  Iteration 207/3560 Training loss: 2.7765 0.0360 sec/batch
Epoch 2/20  Iteration 208/3560 Training loss: 2.7748 0.0363 sec/batch

Epoch 2/20  Iteration 313/3560 Training loss: 2.6149 0.0388 sec/batch
Epoch 2/20  Iteration 314/3560 Training loss: 2.6138 0.0373 sec/batch
Epoch 2/20  Iteration 315/3560 Training loss: 2.6126 0.0387 sec/batch
Epoch 2/20  Iteration 316/3560 Training loss: 2.6115 0.0397 sec/batch
Epoch 2/20  Iteration 317/3560 Training loss: 2.6106 0.0373 sec/batch
Epoch 2/20  Iteration 318/3560 Training loss: 2.6095 0.0399 sec/batch
Epoch 2/20  Iteration 319/3560 Training loss: 2.6086 0.0395 sec/batch
Epoch 2/20  Iteration 320/3560 Training loss: 2.6075 0.0401 sec/batch
Epoch 2/20  Iteration 321/3560 Training loss: 2.6065 0.0385 sec/batch
Epoch 2/20  Iteration 322/3560 Training loss: 2.6054 0.0389 sec/batch
Epoch 2/20  Iteration 323/3560 Training loss: 2.6044 0.0381 sec/batch
Epoch 2/20  Iteration 324/3560 Training loss: 2.6035 0.0407 sec/batch
Epoch 2/20  Iteration 325/3560 Training loss: 2.6025 0.0374 sec/batch
Epoch 2/20  Iteration 326/3560 Training loss: 2.6017 0.0390 sec/batch

Epoch 3/20  Iteration 431/3560 Training loss: 2.3737 0.0397 sec/batch
Epoch 3/20  Iteration 432/3560 Training loss: 2.3737 0.0376 sec/batch
Epoch 3/20  Iteration 433/3560 Training loss: 2.3732 0.0386 sec/batch
Epoch 3/20  Iteration 434/3560 Training loss: 2.3729 0.0377 sec/batch
Epoch 3/20  Iteration 435/3560 Training loss: 2.3723 0.0378 sec/batch
Epoch 3/20  Iteration 436/3560 Training loss: 2.3717 0.0401 sec/batch
Epoch 3/20  Iteration 437/3560 Training loss: 2.3711 0.0381 sec/batch
Epoch 3/20  Iteration 438/3560 Training loss: 2.3708 0.0389 sec/batch
Epoch 3/20  Iteration 439/3560 Training loss: 2.3703 0.0404 sec/batch
Epoch 3/20  Iteration 440/3560 Training loss: 2.3697 0.0383 sec/batch
Epoch 3/20  Iteration 441/3560 Training loss: 2.3687 0.0379 sec/batch
Epoch 3/20  Iteration 442/3560 Training loss: 2.3681 0.0384 sec/batch
Epoch 3/20  Iteration 443/3560 Training loss: 2.3676 0.0408 sec/batch
Epoch 3/20  Iteration 444/3560 Training loss: 2.3671 0.0387 sec/batch

Epoch 4/20  Iteration 550/3560 Training loss: 2.2700 0.0436 sec/batch
Epoch 4/20  Iteration 551/3560 Training loss: 2.2699 0.0387 sec/batch
Epoch 4/20  Iteration 552/3560 Training loss: 2.2717 0.0438 sec/batch
Epoch 4/20  Iteration 553/3560 Training loss: 2.2717 0.0394 sec/batch
Epoch 4/20  Iteration 554/3560 Training loss: 2.2706 0.0432 sec/batch
Epoch 4/20  Iteration 555/3560 Training loss: 2.2699 0.0398 sec/batch
Epoch 4/20  Iteration 556/3560 Training loss: 2.2712 0.0394 sec/batch
Epoch 4/20  Iteration 557/3560 Training loss: 2.2707 0.0398 sec/batch
Epoch 4/20  Iteration 558/3560 Training loss: 2.2697 0.0387 sec/batch
Epoch 4/20  Iteration 559/3560 Training loss: 2.2690 0.0383 sec/batch
Epoch 4/20  Iteration 560/3560 Training loss: 2.2684 0.0434 sec/batch
Epoch 4/20  Iteration 561/3560 Training loss: 2.2679 0.0384 sec/batch
Epoch 4/20  Iteration 562/3560 Training loss: 2.2676 0.0408 sec/batch
Epoch 4/20  Iteration 563/3560 Training loss: 2.2679 0.0392 sec/batch

Epoch 4/20  Iteration 667/3560 Training loss: 2.2294 0.0384 sec/batch
Epoch 4/20  Iteration 668/3560 Training loss: 2.2293 0.0419 sec/batch
Epoch 4/20  Iteration 669/3560 Training loss: 2.2290 0.0392 sec/batch
Epoch 4/20  Iteration 670/3560 Training loss: 2.2288 0.0391 sec/batch
Epoch 4/20  Iteration 671/3560 Training loss: 2.2285 0.0391 sec/batch
Epoch 4/20  Iteration 672/3560 Training loss: 2.2284 0.0396 sec/batch
Epoch 4/20  Iteration 673/3560 Training loss: 2.2283 0.0384 sec/batch
Epoch 4/20  Iteration 674/3560 Training loss: 2.2281 0.0439 sec/batch
Epoch 4/20  Iteration 675/3560 Training loss: 2.2280 0.0391 sec/batch
Epoch 4/20  Iteration 676/3560 Training loss: 2.2277 0.0395 sec/batch
Epoch 4/20  Iteration 677/3560 Training loss: 2.2275 0.0437 sec/batch
Epoch 4/20  Iteration 678/3560 Training loss: 2.2272 0.0400 sec/batch
Epoch 4/20  Iteration 679/3560 Training loss: 2.2270 0.0389 sec/batch
Epoch 4/20  Iteration 680/3560 Training loss: 2.2270 0.0397 sec/batch

Epoch 5/20  Iteration 785/3560 Training loss: 2.1645 0.0417 sec/batch
Epoch 5/20  Iteration 786/3560 Training loss: 2.1641 0.0419 sec/batch
Epoch 5/20  Iteration 787/3560 Training loss: 2.1638 0.0417 sec/batch
Epoch 5/20  Iteration 788/3560 Training loss: 2.1641 0.0393 sec/batch
Epoch 5/20  Iteration 789/3560 Training loss: 2.1638 0.0395 sec/batch
Epoch 5/20  Iteration 790/3560 Training loss: 2.1638 0.0402 sec/batch
Epoch 5/20  Iteration 791/3560 Training loss: 2.1632 0.0400 sec/batch
Epoch 5/20  Iteration 792/3560 Training loss: 2.1628 0.0394 sec/batch
Epoch 5/20  Iteration 793/3560 Training loss: 2.1623 0.0406 sec/batch
Epoch 5/20  Iteration 794/3560 Training loss: 2.1622 0.0394 sec/batch
Epoch 5/20  Iteration 795/3560 Training loss: 2.1618 0.0399 sec/batch
Epoch 5/20  Iteration 796/3560 Training loss: 2.1613 0.0407 sec/batch
Epoch 5/20  Iteration 797/3560 Training loss: 2.1606 0.0392 sec/batch
Epoch 5/20  Iteration 798/3560 Training loss: 2.1601 0.0396 sec/batch

Epoch 6/20  Iteration 906/3560 Training loss: 2.1124 0.0425 sec/batch
Epoch 6/20  Iteration 907/3560 Training loss: 2.1123 0.0398 sec/batch
Epoch 6/20  Iteration 908/3560 Training loss: 2.1142 0.0389 sec/batch
Epoch 6/20  Iteration 909/3560 Training loss: 2.1146 0.0401 sec/batch
Epoch 6/20  Iteration 910/3560 Training loss: 2.1139 0.0392 sec/batch
Epoch 6/20  Iteration 911/3560 Training loss: 2.1134 0.0441 sec/batch
Epoch 6/20  Iteration 912/3560 Training loss: 2.1155 0.0448 sec/batch
Epoch 6/20  Iteration 913/3560 Training loss: 2.1151 0.0394 sec/batch
Epoch 6/20  Iteration 914/3560 Training loss: 2.1142 0.0393 sec/batch
Epoch 6/20  Iteration 915/3560 Training loss: 2.1138 0.0398 sec/batch
Epoch 6/20  Iteration 916/3560 Training loss: 2.1131 0.0417 sec/batch
Epoch 6/20  Iteration 917/3560 Training loss: 2.1125 0.0393 sec/batch
Epoch 6/20  Iteration 918/3560 Training loss: 2.1123 0.0407 sec/batch
Epoch 6/20  Iteration 919/3560 Training loss: 2.1131 0.0393 sec/batch

Epoch 6/20  Iteration 1026/3560 Training loss: 2.0870 0.0406 sec/batch
Epoch 6/20  Iteration 1027/3560 Training loss: 2.0869 0.0425 sec/batch
Epoch 6/20  Iteration 1028/3560 Training loss: 2.0868 0.0394 sec/batch
Epoch 6/20  Iteration 1029/3560 Training loss: 2.0869 0.0397 sec/batch
Epoch 6/20  Iteration 1030/3560 Training loss: 2.0867 0.0401 sec/batch
Epoch 6/20  Iteration 1031/3560 Training loss: 2.0868 0.0412 sec/batch
Epoch 6/20  Iteration 1032/3560 Training loss: 2.0866 0.0396 sec/batch
Epoch 6/20  Iteration 1033/3560 Training loss: 2.0866 0.0430 sec/batch
Epoch 6/20  Iteration 1034/3560 Training loss: 2.0865 0.0399 sec/batch
Epoch 6/20  Iteration 1035/3560 Training loss: 2.0862 0.0397 sec/batch
Epoch 6/20  Iteration 1036/3560 Training loss: 2.0863 0.0398 sec/batch
Epoch 6/20  Iteration 1037/3560 Training loss: 2.0863 0.0401 sec/batch
Epoch 6/20  Iteration 1038/3560 Training loss: 2.0863 0.0404 sec/batch
Epoch 6/20  Iteration 1039/3560 Training loss: 2.0861 0.0396 sec/batch

Epoch 7/20  Iteration 1142/3560 Training loss: 2.0502 0.0460 sec/batch
Epoch 7/20  Iteration 1143/3560 Training loss: 2.0499 0.0394 sec/batch
Epoch 7/20  Iteration 1144/3560 Training loss: 2.0503 0.0443 sec/batch
Epoch 7/20  Iteration 1145/3560 Training loss: 2.0501 0.0393 sec/batch
Epoch 7/20  Iteration 1146/3560 Training loss: 2.0501 0.0390 sec/batch
Epoch 7/20  Iteration 1147/3560 Training loss: 2.0496 0.0393 sec/batch
Epoch 7/20  Iteration 1148/3560 Training loss: 2.0494 0.0425 sec/batch
Epoch 7/20  Iteration 1149/3560 Training loss: 2.0489 0.0393 sec/batch
Epoch 7/20  Iteration 1150/3560 Training loss: 2.0488 0.0391 sec/batch
Epoch 7/20  Iteration 1151/3560 Training loss: 2.0484 0.0403 sec/batch
Epoch 7/20  Iteration 1152/3560 Training loss: 2.0481 0.0391 sec/batch
Epoch 7/20  Iteration 1153/3560 Training loss: 2.0473 0.0414 sec/batch
Epoch 7/20  Iteration 1154/3560 Training loss: 2.0470 0.0393 sec/batch
Epoch 7/20  Iteration 1155/3560 Training loss: 2.0468 0.0395 sec/batch

Epoch 8/20  Iteration 1259/3560 Training loss: 2.0169 0.0395 sec/batch
Epoch 8/20  Iteration 1260/3560 Training loss: 2.0190 0.0397 sec/batch
Epoch 8/20  Iteration 1261/3560 Training loss: 2.0184 0.0399 sec/batch
Epoch 8/20  Iteration 1262/3560 Training loss: 2.0173 0.0451 sec/batch
Epoch 8/20  Iteration 1263/3560 Training loss: 2.0172 0.0412 sec/batch
Epoch 8/20  Iteration 1264/3560 Training loss: 2.0192 0.0429 sec/batch
Epoch 8/20  Iteration 1265/3560 Training loss: 2.0193 0.0397 sec/batch
Epoch 8/20  Iteration 1266/3560 Training loss: 2.0189 0.0422 sec/batch
Epoch 8/20  Iteration 1267/3560 Training loss: 2.0183 0.0454 sec/batch
Epoch 8/20  Iteration 1268/3560 Training loss: 2.0207 0.0398 sec/batch
Epoch 8/20  Iteration 1269/3560 Training loss: 2.0202 0.0408 sec/batch
Epoch 8/20  Iteration 1270/3560 Training loss: 2.0193 0.0392 sec/batch
Epoch 8/20  Iteration 1271/3560 Training loss: 2.0190 0.0395 sec/batch
Epoch 8/20  Iteration 1272/3560 Training loss: 2.0183 0.0424 sec/batch

Epoch 8/20  Iteration 1375/3560 Training loss: 1.9962 0.0395 sec/batch
Epoch 8/20  Iteration 1376/3560 Training loss: 1.9962 0.0399 sec/batch
Epoch 8/20  Iteration 1377/3560 Training loss: 1.9960 0.0398 sec/batch
Epoch 8/20  Iteration 1378/3560 Training loss: 1.9957 0.0419 sec/batch
Epoch 8/20  Iteration 1379/3560 Training loss: 1.9957 0.0416 sec/batch
Epoch 8/20  Iteration 1380/3560 Training loss: 1.9956 0.0404 sec/batch
Epoch 8/20  Iteration 1381/3560 Training loss: 1.9956 0.0403 sec/batch
Epoch 8/20  Iteration 1382/3560 Training loss: 1.9956 0.0398 sec/batch
Epoch 8/20  Iteration 1383/3560 Training loss: 1.9956 0.0428 sec/batch
Epoch 8/20  Iteration 1384/3560 Training loss: 1.9955 0.0402 sec/batch
Epoch 8/20  Iteration 1385/3560 Training loss: 1.9957 0.0427 sec/batch
Epoch 8/20  Iteration 1386/3560 Training loss: 1.9955 0.0392 sec/batch
Epoch 8/20  Iteration 1387/3560 Training loss: 1.9956 0.0399 sec/batch
Epoch 8/20  Iteration 1388/3560 Training loss: 1.9956 0.0394 sec/batch

Epoch 9/20  Iteration 1495/3560 Training loss: 1.9675 0.0468 sec/batch
Epoch 9/20  Iteration 1496/3560 Training loss: 1.9677 0.0397 sec/batch
Epoch 9/20  Iteration 1497/3560 Training loss: 1.9679 0.0405 sec/batch
Epoch 9/20  Iteration 1498/3560 Training loss: 1.9675 0.0422 sec/batch
Epoch 9/20  Iteration 1499/3560 Training loss: 1.9673 0.0393 sec/batch
Epoch 9/20  Iteration 1500/3560 Training loss: 1.9677 0.0449 sec/batch
Epoch 9/20  Iteration 1501/3560 Training loss: 1.9675 0.0400 sec/batch
Epoch 9/20  Iteration 1502/3560 Training loss: 1.9676 0.0412 sec/batch
Epoch 9/20  Iteration 1503/3560 Training loss: 1.9671 0.0396 sec/batch
Epoch 9/20  Iteration 1504/3560 Training loss: 1.9669 0.0399 sec/batch
Epoch 9/20  Iteration 1505/3560 Training loss: 1.9664 0.0417 sec/batch
Epoch 9/20  Iteration 1506/3560 Training loss: 1.9664 0.0397 sec/batch
Epoch 9/20  Iteration 1507/3560 Training loss: 1.9660 0.0399 sec/batch
Epoch 9/20  Iteration 1508/3560 Training loss: 1.9657 0.0397 sec/batch

Epoch 10/20  Iteration 1615/3560 Training loss: 1.9405 0.0401 sec/batch
Epoch 10/20  Iteration 1616/3560 Training loss: 1.9428 0.0425 sec/batch
Epoch 10/20  Iteration 1617/3560 Training loss: 1.9421 0.0402 sec/batch
Epoch 10/20  Iteration 1618/3560 Training loss: 1.9406 0.0391 sec/batch
Epoch 10/20  Iteration 1619/3560 Training loss: 1.9406 0.0424 sec/batch
Epoch 10/20  Iteration 1620/3560 Training loss: 1.9426 0.0399 sec/batch
Epoch 10/20  Iteration 1621/3560 Training loss: 1.9427 0.0410 sec/batch
Epoch 10/20  Iteration 1622/3560 Training loss: 1.9425 0.0421 sec/batch
Epoch 10/20  Iteration 1623/3560 Training loss: 1.9420 0.0419 sec/batch
Epoch 10/20  Iteration 1624/3560 Training loss: 1.9443 0.0401 sec/batch
Epoch 10/20  Iteration 1625/3560 Training loss: 1.9440 0.0418 sec/batch
Epoch 10/20  Iteration 1626/3560 Training loss: 1.9431 0.0395 sec/batch
Epoch 10/20  Iteration 1627/3560 Training loss: 1.9430 0.0402 sec/batch
Epoch 10/20  Iteration 1628/3560 Training loss: 1.9424 0.0404 sec/batch

Epoch 10/20  Iteration 1730/3560 Training loss: 1.9235 0.0466 sec/batch
Epoch 10/20  Iteration 1731/3560 Training loss: 1.9234 0.0397 sec/batch
Epoch 10/20  Iteration 1732/3560 Training loss: 1.9235 0.0405 sec/batch
Epoch 10/20  Iteration 1733/3560 Training loss: 1.9233 0.0402 sec/batch
Epoch 10/20  Iteration 1734/3560 Training loss: 1.9230 0.0401 sec/batch
Epoch 10/20  Iteration 1735/3560 Training loss: 1.9230 0.0403 sec/batch
Epoch 10/20  Iteration 1736/3560 Training loss: 1.9230 0.0400 sec/batch
Epoch 10/20  Iteration 1737/3560 Training loss: 1.9230 0.0403 sec/batch
Epoch 10/20  Iteration 1738/3560 Training loss: 1.9230 0.0466 sec/batch
Epoch 10/20  Iteration 1739/3560 Training loss: 1.9230 0.0405 sec/batch
Epoch 10/20  Iteration 1740/3560 Training loss: 1.9230 0.0413 sec/batch
Epoch 10/20  Iteration 1741/3560 Training loss: 1.9232 0.0403 sec/batch
Epoch 10/20  Iteration 1742/3560 Training loss: 1.9230 0.0396 sec/batch
Epoch 10/20  Iteration 1743/3560 Training loss: 1.9232 0.0397 sec/batch

Epoch 11/20  Iteration 1845/3560 Training loss: 1.9017 0.0411 sec/batch
Epoch 11/20  Iteration 1846/3560 Training loss: 1.9019 0.0399 sec/batch
Epoch 11/20  Iteration 1847/3560 Training loss: 1.9018 0.0400 sec/batch
Epoch 11/20  Iteration 1848/3560 Training loss: 1.9014 0.0395 sec/batch
Epoch 11/20  Iteration 1849/3560 Training loss: 1.9013 0.0397 sec/batch
Epoch 11/20  Iteration 1850/3560 Training loss: 1.9012 0.0407 sec/batch
Epoch 11/20  Iteration 1851/3560 Training loss: 1.9014 0.0467 sec/batch
Epoch 11/20  Iteration 1852/3560 Training loss: 1.9016 0.0399 sec/batch
Epoch 11/20  Iteration 1853/3560 Training loss: 1.9018 0.0401 sec/batch
Epoch 11/20  Iteration 1854/3560 Training loss: 1.9015 0.0398 sec/batch
Epoch 11/20  Iteration 1855/3560 Training loss: 1.9013 0.0404 sec/batch
Epoch 11/20  Iteration 1856/3560 Training loss: 1.9016 0.0403 sec/batch
Epoch 11/20  Iteration 1857/3560 Training loss: 1.9015 0.0396 sec/batch
Epoch 11/20  Iteration 1858/3560 Training loss: 1.9016 0.0453 sec/batch

Epoch 12/20  Iteration 1960/3560 Training loss: 1.9108 0.0403 sec/batch
Epoch 12/20  Iteration 1961/3560 Training loss: 1.8990 0.0401 sec/batch
Epoch 12/20  Iteration 1962/3560 Training loss: 1.8909 0.0394 sec/batch
Epoch 12/20  Iteration 1963/3560 Training loss: 1.8861 0.0397 sec/batch
Epoch 12/20  Iteration 1964/3560 Training loss: 1.8809 0.0404 sec/batch
Epoch 12/20  Iteration 1965/3560 Training loss: 1.8802 0.0413 sec/batch
Epoch 12/20  Iteration 1966/3560 Training loss: 1.8806 0.0426 sec/batch
Epoch 12/20  Iteration 1967/3560 Training loss: 1.8831 0.0447 sec/batch
Epoch 12/20  Iteration 1968/3560 Training loss: 1.8833 0.0398 sec/batch
Epoch 12/20  Iteration 1969/3560 Training loss: 1.8808 0.0399 sec/batch
Epoch 12/20  Iteration 1970/3560 Training loss: 1.8783 0.0400 sec/batch
Epoch 12/20  Iteration 1971/3560 Training loss: 1.8785 0.0397 sec/batch
Epoch 12/20  Iteration 1972/3560 Training loss: 1.8811 0.0395 sec/batch
Epoch 12/20  Iteration 1973/3560 Training loss: 1.8803 0.0402 sec/batch

Epoch 12/20  Iteration 2075/3560 Training loss: 1.8639 0.0406 sec/batch
Epoch 12/20  Iteration 2076/3560 Training loss: 1.8638 0.0414 sec/batch
Epoch 12/20  Iteration 2077/3560 Training loss: 1.8638 0.0415 sec/batch
Epoch 12/20  Iteration 2078/3560 Training loss: 1.8637 0.0424 sec/batch
Epoch 12/20  Iteration 2079/3560 Training loss: 1.8637 0.0476 sec/batch
Epoch 12/20  Iteration 2080/3560 Training loss: 1.8635 0.0407 sec/batch
Epoch 12/20  Iteration 2081/3560 Training loss: 1.8633 0.0407 sec/batch
Epoch 12/20  Iteration 2082/3560 Training loss: 1.8634 0.0420 sec/batch
Epoch 12/20  Iteration 2083/3560 Training loss: 1.8634 0.0402 sec/batch
Epoch 12/20  Iteration 2084/3560 Training loss: 1.8631 0.0409 sec/batch
Epoch 12/20  Iteration 2085/3560 Training loss: 1.8631 0.0429 sec/batch
Epoch 12/20  Iteration 2086/3560 Training loss: 1.8632 0.0409 sec/batch
Epoch 12/20  Iteration 2087/3560 Training loss: 1.8631 0.0399 sec/batch
Epoch 12/20  Iteration 2088/3560 Training loss: 1.8632 0.0400 sec/batch

Epoch 13/20  Iteration 2190/3560 Training loss: 1.8455 0.0407 sec/batch
Epoch 13/20  Iteration 2191/3560 Training loss: 1.8453 0.0408 sec/batch
Epoch 13/20  Iteration 2192/3560 Training loss: 1.8455 0.0420 sec/batch
Epoch 13/20  Iteration 2193/3560 Training loss: 1.8456 0.0426 sec/batch
Epoch 13/20  Iteration 2194/3560 Training loss: 1.8451 0.0427 sec/batch
Epoch 13/20  Iteration 2195/3560 Training loss: 1.8447 0.0399 sec/batch
Epoch 13/20  Iteration 2196/3560 Training loss: 1.8451 0.0393 sec/batch
Epoch 13/20  Iteration 2197/3560 Training loss: 1.8449 0.0394 sec/batch
Epoch 13/20  Iteration 2198/3560 Training loss: 1.8455 0.0399 sec/batch
Epoch 13/20  Iteration 2199/3560 Training loss: 1.8457 0.0395 sec/batch
Epoch 13/20  Iteration 2200/3560 Training loss: 1.8456 0.0404 sec/batch
Epoch 13/20  Iteration 2201/3560 Training loss: 1.8454 0.0452 sec/batch
Epoch 13/20  Iteration 2202/3560 Training loss: 1.8456 0.0396 sec/batch
Epoch 13/20  Iteration 2203/3560 Training loss: 1.8456 0.0418 sec/batch

Epoch 13/20  Iteration 2304/3560 Training loss: 1.8369 0.0395 sec/batch
Epoch 13/20  Iteration 2305/3560 Training loss: 1.8369 0.0415 sec/batch
Epoch 13/20  Iteration 2306/3560 Training loss: 1.8368 0.0416 sec/batch
Epoch 13/20  Iteration 2307/3560 Training loss: 1.8367 0.0417 sec/batch
Epoch 13/20  Iteration 2308/3560 Training loss: 1.8367 0.0443 sec/batch
Epoch 13/20  Iteration 2309/3560 Training loss: 1.8369 0.0451 sec/batch
Epoch 13/20  Iteration 2310/3560 Training loss: 1.8370 0.0421 sec/batch
Epoch 13/20  Iteration 2311/3560 Training loss: 1.8372 0.0404 sec/batch
Epoch 13/20  Iteration 2312/3560 Training loss: 1.8371 0.0397 sec/batch
Epoch 13/20  Iteration 2313/3560 Training loss: 1.8370 0.0394 sec/batch
Epoch 13/20  Iteration 2314/3560 Training loss: 1.8370 0.0398 sec/batch
Epoch 14/20  Iteration 2315/3560 Training loss: 1.9021 0.0409 sec/batch
Epoch 14/20  Iteration 2316/3560 Training loss: 1.8588 0.0396 sec/batch
Epoch 14/20  Iteration 2317/3560 Training loss: 1.8472 0.0401 sec/batch

Epoch 14/20  Iteration 2419/3560 Training loss: 1.8138 0.0450 sec/batch
Epoch 14/20  Iteration 2420/3560 Training loss: 1.8138 0.0399 sec/batch
Epoch 14/20  Iteration 2421/3560 Training loss: 1.8137 0.0402 sec/batch
Epoch 14/20  Iteration 2422/3560 Training loss: 1.8137 0.0400 sec/batch
Epoch 14/20  Iteration 2423/3560 Training loss: 1.8138 0.0447 sec/batch
Epoch 14/20  Iteration 2424/3560 Training loss: 1.8137 0.0414 sec/batch
Epoch 14/20  Iteration 2425/3560 Training loss: 1.8137 0.0390 sec/batch
Epoch 14/20  Iteration 2426/3560 Training loss: 1.8136 0.0398 sec/batch
Epoch 14/20  Iteration 2427/3560 Training loss: 1.8134 0.0392 sec/batch
Epoch 14/20  Iteration 2428/3560 Training loss: 1.8133 0.0445 sec/batch
Epoch 14/20  Iteration 2429/3560 Training loss: 1.8131 0.0398 sec/batch
Epoch 14/20  Iteration 2430/3560 Training loss: 1.8127 0.0421 sec/batch
Epoch 14/20  Iteration 2431/3560 Training loss: 1.8126 0.0403 sec/batch
Epoch 14/20  Iteration 2432/3560 Training loss: 1.8126 0.0451 sec/batch

Epoch 15/20  Iteration 2534/3560 Training loss: 1.7996 0.0402 sec/batch
Epoch 15/20  Iteration 2535/3560 Training loss: 1.7990 0.0402 sec/batch
Epoch 15/20  Iteration 2536/3560 Training loss: 1.7984 0.0404 sec/batch
Epoch 15/20  Iteration 2537/3560 Training loss: 1.7983 0.0401 sec/batch
Epoch 15/20  Iteration 2538/3560 Training loss: 1.7970 0.0464 sec/batch
Epoch 15/20  Iteration 2539/3560 Training loss: 1.7969 0.0405 sec/batch
Epoch 15/20  Iteration 2540/3560 Training loss: 1.7964 0.0422 sec/batch
Epoch 15/20  Iteration 2541/3560 Training loss: 1.7963 0.0445 sec/batch
Epoch 15/20  Iteration 2542/3560 Training loss: 1.7970 0.0395 sec/batch
Epoch 15/20  Iteration 2543/3560 Training loss: 1.7965 0.0406 sec/batch
Epoch 15/20  Iteration 2544/3560 Training loss: 1.7971 0.0405 sec/batch
Epoch 15/20  Iteration 2545/3560 Training loss: 1.7969 0.0403 sec/batch
Epoch 15/20  Iteration 2546/3560 Training loss: 1.7966 0.0424 sec/batch
Epoch 15/20  Iteration 2547/3560 Training loss: 1.7965 0.0399 sec/batch

Epoch 15/20  Iteration 2649/3560 Training loss: 1.7891 0.0409 sec/batch
Epoch 15/20  Iteration 2650/3560 Training loss: 1.7891 0.0427 sec/batch
Epoch 15/20  Iteration 2651/3560 Training loss: 1.7889 0.0400 sec/batch
Epoch 15/20  Iteration 2652/3560 Training loss: 1.7891 0.0401 sec/batch
Epoch 15/20  Iteration 2653/3560 Training loss: 1.7893 0.0399 sec/batch
Epoch 15/20  Iteration 2654/3560 Training loss: 1.7892 0.0400 sec/batch
Epoch 15/20  Iteration 2655/3560 Training loss: 1.7892 0.0406 sec/batch
Epoch 15/20  Iteration 2656/3560 Training loss: 1.7892 0.0407 sec/batch
Epoch 15/20  Iteration 2657/3560 Training loss: 1.7892 0.0423 sec/batch
Epoch 15/20  Iteration 2658/3560 Training loss: 1.7891 0.0401 sec/batch
Epoch 15/20  Iteration 2659/3560 Training loss: 1.7892 0.0398 sec/batch
Epoch 15/20  Iteration 2660/3560 Training loss: 1.7894 0.0424 sec/batch
Epoch 15/20  Iteration 2661/3560 Training loss: 1.7893 0.0397 sec/batch
Epoch 15/20  Iteration 2662/3560 Training loss: 1.7893 0.0400 sec/batch

Epoch 16/20  Iteration 2763/3560 Training loss: 1.7708 0.0452 sec/batch
Epoch 16/20  Iteration 2764/3560 Training loss: 1.7705 0.0452 sec/batch
Epoch 16/20  Iteration 2765/3560 Training loss: 1.7702 0.0461 sec/batch
Epoch 16/20  Iteration 2766/3560 Training loss: 1.7701 0.0472 sec/batch
Epoch 16/20  Iteration 2767/3560 Training loss: 1.7699 0.0424 sec/batch
Epoch 16/20  Iteration 2768/3560 Training loss: 1.7695 0.0400 sec/batch
Epoch 16/20  Iteration 2769/3560 Training loss: 1.7692 0.0398 sec/batch
Epoch 16/20  Iteration 2770/3560 Training loss: 1.7687 0.0500 sec/batch
Epoch 16/20  Iteration 2771/3560 Training loss: 1.7687 0.0450 sec/batch
Epoch 16/20  Iteration 2772/3560 Training loss: 1.7687 0.0394 sec/batch
Epoch 16/20  Iteration 2773/3560 Training loss: 1.7685 0.0416 sec/batch
Epoch 16/20  Iteration 2774/3560 Training loss: 1.7683 0.0420 sec/batch
Epoch 16/20  Iteration 2775/3560 Training loss: 1.7681 0.0408 sec/batch
Epoch 16/20  Iteration 2776/3560 Training loss: 1.7681 0.0412 sec/batch

Epoch 17/20  Iteration 2878/3560 Training loss: 1.7589 0.0478 sec/batch
Epoch 17/20  Iteration 2879/3560 Training loss: 1.7585 0.0431 sec/batch
Epoch 17/20  Iteration 2880/3560 Training loss: 1.7577 0.0475 sec/batch
Epoch 17/20  Iteration 2881/3560 Training loss: 1.7580 0.0534 sec/batch
Epoch 17/20  Iteration 2882/3560 Training loss: 1.7586 0.0471 sec/batch
Epoch 17/20  Iteration 2883/3560 Training loss: 1.7582 0.0477 sec/batch
Epoch 17/20  Iteration 2884/3560 Training loss: 1.7581 0.0503 sec/batch
Epoch 17/20  Iteration 2885/3560 Training loss: 1.7577 0.0445 sec/batch
Epoch 17/20  Iteration 2886/3560 Training loss: 1.7569 0.0433 sec/batch
Epoch 17/20  Iteration 2887/3560 Training loss: 1.7559 0.0436 sec/batch
Epoch 17/20  Iteration 2888/3560 Training loss: 1.7553 0.0424 sec/batch
Epoch 17/20  Iteration 2889/3560 Training loss: 1.7547 0.0430 sec/batch
Epoch 17/20  Iteration 2890/3560 Training loss: 1.7548 0.0403 sec/batch
Epoch 17/20  Iteration 2891/3560 Training loss: 1.7542 0.0406 sec/batch

Epoch 17/20  Iteration 2993/3560 Training loss: 1.7447 0.0448 sec/batch
Epoch 17/20  Iteration 2994/3560 Training loss: 1.7449 0.0400 sec/batch
Epoch 17/20  Iteration 2995/3560 Training loss: 1.7449 0.0474 sec/batch
Epoch 17/20  Iteration 2996/3560 Training loss: 1.7451 0.0399 sec/batch
Epoch 17/20  Iteration 2997/3560 Training loss: 1.7451 0.0400 sec/batch
Epoch 17/20  Iteration 2998/3560 Training loss: 1.7450 0.0399 sec/batch
Epoch 17/20  Iteration 2999/3560 Training loss: 1.7448 0.0403 sec/batch
Epoch 17/20  Iteration 3000/3560 Training loss: 1.7450 0.0404 sec/batch
Epoch 17/20  Iteration 3001/3560 Training loss: 1.7451 0.0405 sec/batch
Epoch 17/20  Iteration 3002/3560 Training loss: 1.7451 0.0401 sec/batch
Epoch 17/20  Iteration 3003/3560 Training loss: 1.7451 0.0403 sec/batch
Epoch 17/20  Iteration 3004/3560 Training loss: 1.7451 0.0397 sec/batch
Epoch 17/20  Iteration 3005/3560 Training loss: 1.7451 0.0420 sec/batch
Epoch 17/20  Iteration 3006/3560 Training loss: 1.7451 0.0407 sec/batch

Epoch 18/20  Iteration 3108/3560 Training loss: 1.7312 0.0398 sec/batch
Epoch 18/20  Iteration 3109/3560 Training loss: 1.7307 0.0397 sec/batch
Epoch 18/20  Iteration 3110/3560 Training loss: 1.7307 0.0398 sec/batch
Epoch 18/20  Iteration 3111/3560 Training loss: 1.7301 0.0395 sec/batch
Epoch 18/20  Iteration 3112/3560 Training loss: 1.7298 0.0402 sec/batch
Epoch 18/20  Iteration 3113/3560 Training loss: 1.7297 0.0410 sec/batch
Epoch 18/20  Iteration 3114/3560 Training loss: 1.7294 0.0404 sec/batch
Epoch 18/20  Iteration 3115/3560 Training loss: 1.7290 0.0396 sec/batch
Epoch 18/20  Iteration 3116/3560 Training loss: 1.7291 0.0403 sec/batch
Epoch 18/20  Iteration 3117/3560 Training loss: 1.7288 0.0397 sec/batch
Epoch 18/20  Iteration 3118/3560 Training loss: 1.7287 0.0424 sec/batch
Epoch 18/20  Iteration 3119/3560 Training loss: 1.7283 0.0402 sec/batch
Epoch 18/20  Iteration 3120/3560 Training loss: 1.7280 0.0406 sec/batch
Epoch 18/20  Iteration 3121/3560 Training loss: 1.7276 0.0406 sec/batch

Epoch 19/20  Iteration 3223/3560 Training loss: 1.7178 0.0480 sec/batch
Epoch 19/20  Iteration 3224/3560 Training loss: 1.7181 0.0426 sec/batch
Epoch 19/20  Iteration 3225/3560 Training loss: 1.7177 0.0477 sec/batch
Epoch 19/20  Iteration 3226/3560 Training loss: 1.7186 0.0426 sec/batch
Epoch 19/20  Iteration 3227/3560 Training loss: 1.7182 0.0466 sec/batch
Epoch 19/20  Iteration 3228/3560 Training loss: 1.7175 0.0452 sec/batch
Epoch 19/20  Iteration 3229/3560 Training loss: 1.7174 0.0440 sec/batch
Epoch 19/20  Iteration 3230/3560 Training loss: 1.7164 0.0421 sec/batch
Epoch 19/20  Iteration 3231/3560 Training loss: 1.7154 0.0425 sec/batch
Epoch 19/20  Iteration 3232/3560 Training loss: 1.7157 0.0415 sec/batch
Epoch 19/20  Iteration 3233/3560 Training loss: 1.7166 0.0439 sec/batch
Epoch 19/20  Iteration 3234/3560 Training loss: 1.7172 0.0416 sec/batch
Epoch 19/20  Iteration 3235/3560 Training loss: 1.7167 0.0458 sec/batch
Epoch 19/20  Iteration 3236/3560 Training loss: 1.7160 0.0470 sec/batch

Epoch 19/20  Iteration 3338/3560 Training loss: 1.7040 0.0415 sec/batch
Epoch 19/20  Iteration 3339/3560 Training loss: 1.7040 0.0423 sec/batch
Epoch 19/20  Iteration 3340/3560 Training loss: 1.7040 0.0444 sec/batch
Epoch 19/20  Iteration 3341/3560 Training loss: 1.7041 0.0477 sec/batch
Epoch 19/20  Iteration 3342/3560 Training loss: 1.7042 0.0429 sec/batch
Epoch 19/20  Iteration 3343/3560 Training loss: 1.7044 0.0412 sec/batch
Epoch 19/20  Iteration 3344/3560 Training loss: 1.7043 0.0433 sec/batch
Epoch 19/20  Iteration 3345/3560 Training loss: 1.7046 0.0422 sec/batch
Epoch 19/20  Iteration 3346/3560 Training loss: 1.7046 0.0443 sec/batch
Epoch 19/20  Iteration 3347/3560 Training loss: 1.7046 0.0401 sec/batch
Epoch 19/20  Iteration 3348/3560 Training loss: 1.7047 0.0435 sec/batch
Epoch 19/20  Iteration 3349/3560 Training loss: 1.7046 0.0476 sec/batch
Epoch 19/20  Iteration 3350/3560 Training loss: 1.7047 0.0424 sec/batch
Epoch 19/20  Iteration 3351/3560 Training loss: 1.7048 0.0411 sec/batch

Epoch 20/20  Iteration 3453/3560 Training loss: 1.6940 0.0453 sec/batch
Epoch 20/20  Iteration 3454/3560 Training loss: 1.6942 0.0401 sec/batch
Epoch 20/20  Iteration 3455/3560 Training loss: 1.6946 0.0421 sec/batch
Epoch 20/20  Iteration 3456/3560 Training loss: 1.6943 0.0414 sec/batch
Epoch 20/20  Iteration 3457/3560 Training loss: 1.6941 0.0450 sec/batch
Epoch 20/20  Iteration 3458/3560 Training loss: 1.6944 0.0404 sec/batch
Epoch 20/20  Iteration 3459/3560 Training loss: 1.6943 0.0404 sec/batch
Epoch 20/20  Iteration 3460/3560 Training loss: 1.6944 0.0449 sec/batch
Epoch 20/20  Iteration 3461/3560 Training loss: 1.6940 0.0402 sec/batch
Epoch 20/20  Iteration 3462/3560 Training loss: 1.6940 0.0414 sec/batch
Epoch 20/20  Iteration 3463/3560 Training loss: 1.6934 0.0410 sec/batch
Epoch 20/20  Iteration 3464/3560 Training loss: 1.6936 0.0410 sec/batch
Epoch 20/20  Iteration 3465/3560 Training loss: 1.6931 0.0408 sec/batch
Epoch 20/20  Iteration 3466/3560 Training loss: 1.6931 0.0398 sec/batch

Epoch 1/20  Iteration 11/3560 Training loss: 3.7176 0.0451 sec/batch
Epoch 1/20  Iteration 12/3560 Training loss: 3.6719 0.0454 sec/batch
Epoch 1/20  Iteration 13/3560 Training loss: 3.6320 0.0448 sec/batch
Epoch 1/20  Iteration 14/3560 Training loss: 3.5981 0.0447 sec/batch
Epoch 1/20  Iteration 15/3560 Training loss: 3.5682 0.0538 sec/batch
Epoch 1/20  Iteration 16/3560 Training loss: 3.5422 0.0448 sec/batch
Epoch 1/20  Iteration 17/3560 Training loss: 3.5181 0.0448 sec/batch
Epoch 1/20  Iteration 18/3560 Training loss: 3.4988 0.0477 sec/batch
Epoch 1/20  Iteration 19/3560 Training loss: 3.4800 0.0453 sec/batch
Epoch 1/20  Iteration 20/3560 Training loss: 3.4610 0.0455 sec/batch
Epoch 1/20  Iteration 21/3560 Training loss: 3.4448 0.0480 sec/batch
Epoch 1/20  Iteration 22/3560 Training loss: 3.4303 0.0485 sec/batch
Epoch 1/20  Iteration 23/3560 Training loss: 3.4163 0.0533 sec/batch
Epoch 1/20  Iteration 24/3560 Training loss: 3.4038 0.0504 sec/batch

Epoch 1/20  Iteration 132/3560 Training loss: 3.1249 0.0484 sec/batch
Epoch 1/20  Iteration 133/3560 Training loss: 3.1232 0.0447 sec/batch
Epoch 1/20  Iteration 134/3560 Training loss: 3.1213 0.0477 sec/batch
Epoch 1/20  Iteration 135/3560 Training loss: 3.1192 0.0496 sec/batch
Epoch 1/20  Iteration 136/3560 Training loss: 3.1172 0.0449 sec/batch
Epoch 1/20  Iteration 137/3560 Training loss: 3.1153 0.0446 sec/batch
Epoch 1/20  Iteration 138/3560 Training loss: 3.1133 0.0445 sec/batch
Epoch 1/20  Iteration 139/3560 Training loss: 3.1115 0.0451 sec/batch
Epoch 1/20  Iteration 140/3560 Training loss: 3.1095 0.0492 sec/batch
Epoch 1/20  Iteration 141/3560 Training loss: 3.1075 0.0452 sec/batch
Epoch 1/20  Iteration 142/3560 Training loss: 3.1054 0.0445 sec/batch
Epoch 1/20  Iteration 143/3560 Training loss: 3.1033 0.0464 sec/batch
Epoch 1/20  Iteration 144/3560 Training loss: 3.1011 0.0447 sec/batch
Epoch 1/20  Iteration 145/3560 Training loss: 3.0991 0.0459 sec/batch

Epoch 2/20  Iteration 252/3560 Training loss: 2.4683 0.0468 sec/batch
Epoch 2/20  Iteration 253/3560 Training loss: 2.4672 0.0471 sec/batch
Epoch 2/20  Iteration 254/3560 Training loss: 2.4666 0.0489 sec/batch
Epoch 2/20  Iteration 255/3560 Training loss: 2.4654 0.0523 sec/batch
Epoch 2/20  Iteration 256/3560 Training loss: 2.4645 0.0456 sec/batch
Epoch 2/20  Iteration 257/3560 Training loss: 2.4633 0.0461 sec/batch
Epoch 2/20  Iteration 258/3560 Training loss: 2.4622 0.0483 sec/batch
Epoch 2/20  Iteration 259/3560 Training loss: 2.4610 0.0465 sec/batch
Epoch 2/20  Iteration 260/3560 Training loss: 2.4601 0.0549 sec/batch
Epoch 2/20  Iteration 261/3560 Training loss: 2.4592 0.0468 sec/batch
Epoch 2/20  Iteration 262/3560 Training loss: 2.4580 0.0497 sec/batch
Epoch 2/20  Iteration 263/3560 Training loss: 2.4565 0.0451 sec/batch
Epoch 2/20  Iteration 264/3560 Training loss: 2.4554 0.0489 sec/batch
Epoch 2/20  Iteration 265/3560 Training loss: 2.4544 0.0602 sec/batch

Epoch 3/20  Iteration 374/3560 Training loss: 2.2464 0.0472 sec/batch
Epoch 3/20  Iteration 375/3560 Training loss: 2.2460 0.0759 sec/batch
Epoch 3/20  Iteration 376/3560 Training loss: 2.2443 0.0525 sec/batch
Epoch 3/20  Iteration 377/3560 Training loss: 2.2434 0.0468 sec/batch
Epoch 3/20  Iteration 378/3560 Training loss: 2.2445 0.0459 sec/batch
Epoch 3/20  Iteration 379/3560 Training loss: 2.2437 0.0568 sec/batch
Epoch 3/20  Iteration 380/3560 Training loss: 2.2425 0.0493 sec/batch
Epoch 3/20  Iteration 381/3560 Training loss: 2.2415 0.0465 sec/batch
Epoch 3/20  Iteration 382/3560 Training loss: 2.2408 0.0459 sec/batch
Epoch 3/20  Iteration 383/3560 Training loss: 2.2397 0.0459 sec/batch
Epoch 3/20  Iteration 384/3560 Training loss: 2.2392 0.0460 sec/batch
Epoch 3/20  Iteration 385/3560 Training loss: 2.2391 0.0501 sec/batch
Epoch 3/20  Iteration 386/3560 Training loss: 2.2388 0.0476 sec/batch
Epoch 3/20  Iteration 387/3560 Training loss: 2.2387 0.0480 sec/batch

Epoch 3/20  Iteration 493/3560 Training loss: 2.1753 0.0481 sec/batch
Epoch 3/20  Iteration 494/3560 Training loss: 2.1749 0.0479 sec/batch
Epoch 3/20  Iteration 495/3560 Training loss: 2.1748 0.0476 sec/batch
Epoch 3/20  Iteration 496/3560 Training loss: 2.1743 0.0475 sec/batch
Epoch 3/20  Iteration 497/3560 Training loss: 2.1741 0.0552 sec/batch
Epoch 3/20  Iteration 498/3560 Training loss: 2.1737 0.0484 sec/batch
Epoch 3/20  Iteration 499/3560 Training loss: 2.1733 0.0473 sec/batch
Epoch 3/20  Iteration 500/3560 Training loss: 2.1729 0.0478 sec/batch
Epoch 3/20  Iteration 501/3560 Training loss: 2.1724 0.0467 sec/batch
Epoch 3/20  Iteration 502/3560 Training loss: 2.1723 0.0472 sec/batch
Epoch 3/20  Iteration 503/3560 Training loss: 2.1720 0.0481 sec/batch
Epoch 3/20  Iteration 504/3560 Training loss: 2.1717 0.0486 sec/batch
Epoch 3/20  Iteration 505/3560 Training loss: 2.1713 0.0479 sec/batch
Epoch 3/20  Iteration 506/3560 Training loss: 2.1708 0.0471 sec/batch

Epoch 4/20  Iteration 611/3560 Training loss: 2.0569 0.0475 sec/batch
Epoch 4/20  Iteration 612/3560 Training loss: 2.0567 0.0500 sec/batch
Epoch 4/20  Iteration 613/3560 Training loss: 2.0561 0.0504 sec/batch
Epoch 4/20  Iteration 614/3560 Training loss: 2.0557 0.0478 sec/batch
Epoch 4/20  Iteration 615/3560 Training loss: 2.0551 0.0599 sec/batch
Epoch 4/20  Iteration 616/3560 Training loss: 2.0548 0.0476 sec/batch
Epoch 4/20  Iteration 617/3560 Training loss: 2.0541 0.0478 sec/batch
Epoch 4/20  Iteration 618/3560 Training loss: 2.0536 0.0475 sec/batch
Epoch 4/20  Iteration 619/3560 Training loss: 2.0528 0.0483 sec/batch
Epoch 4/20  Iteration 620/3560 Training loss: 2.0522 0.0476 sec/batch
Epoch 4/20  Iteration 621/3560 Training loss: 2.0518 0.0469 sec/batch
Epoch 4/20  Iteration 622/3560 Training loss: 2.0513 0.0577 sec/batch
Epoch 4/20  Iteration 623/3560 Training loss: 2.0507 0.0501 sec/batch
Epoch 4/20  Iteration 624/3560 Training loss: 2.0504 0.0509 sec/batch

Epoch 5/20  Iteration 732/3560 Training loss: 1.9721 0.0502 sec/batch
Epoch 5/20  Iteration 733/3560 Training loss: 1.9713 0.0471 sec/batch
Epoch 5/20  Iteration 734/3560 Training loss: 1.9733 0.0470 sec/batch
Epoch 5/20  Iteration 735/3560 Training loss: 1.9722 0.0611 sec/batch
Epoch 5/20  Iteration 736/3560 Training loss: 1.9713 0.0472 sec/batch
Epoch 5/20  Iteration 737/3560 Training loss: 1.9708 0.0477 sec/batch
Epoch 5/20  Iteration 738/3560 Training loss: 1.9700 0.0474 sec/batch
Epoch 5/20  Iteration 739/3560 Training loss: 1.9690 0.0468 sec/batch
Epoch 5/20  Iteration 740/3560 Training loss: 1.9690 0.0577 sec/batch
Epoch 5/20  Iteration 741/3560 Training loss: 1.9698 0.0501 sec/batch
Epoch 5/20  Iteration 742/3560 Training loss: 1.9701 0.0499 sec/batch
Epoch 5/20  Iteration 743/3560 Training loss: 1.9698 0.0478 sec/batch
Epoch 5/20  Iteration 744/3560 Training loss: 1.9688 0.0474 sec/batch
Epoch 5/20  Iteration 745/3560 Training loss: 1.9683 0.0482 sec/batch

Epoch 5/20  Iteration 850/3560 Training loss: 1.9335 0.0473 sec/batch
Epoch 5/20  Iteration 851/3560 Training loss: 1.9336 0.0477 sec/batch
Epoch 5/20  Iteration 852/3560 Training loss: 1.9333 0.0489 sec/batch
Epoch 5/20  Iteration 853/3560 Training loss: 1.9334 0.0488 sec/batch
Epoch 5/20  Iteration 854/3560 Training loss: 1.9332 0.0480 sec/batch
Epoch 5/20  Iteration 855/3560 Training loss: 1.9331 0.0486 sec/batch
Epoch 5/20  Iteration 856/3560 Training loss: 1.9329 0.0492 sec/batch
Epoch 5/20  Iteration 857/3560 Training loss: 1.9327 0.0479 sec/batch
Epoch 5/20  Iteration 858/3560 Training loss: 1.9328 0.0469 sec/batch
Epoch 5/20  Iteration 859/3560 Training loss: 1.9326 0.0472 sec/batch
Epoch 5/20  Iteration 860/3560 Training loss: 1.9327 0.0485 sec/batch
Epoch 5/20  Iteration 861/3560 Training loss: 1.9325 0.0602 sec/batch
Epoch 5/20  Iteration 862/3560 Training loss: 1.9323 0.0475 sec/batch
Epoch 5/20  Iteration 863/3560 Training loss: 1.9322 0.0475 sec/batch

Epoch 6/20  Iteration 969/3560 Training loss: 1.8732 0.0483 sec/batch
Epoch 6/20  Iteration 970/3560 Training loss: 1.8730 0.0502 sec/batch
Epoch 6/20  Iteration 971/3560 Training loss: 1.8725 0.0495 sec/batch
Epoch 6/20  Iteration 972/3560 Training loss: 1.8724 0.0492 sec/batch
Epoch 6/20  Iteration 973/3560 Training loss: 1.8718 0.0482 sec/batch
Epoch 6/20  Iteration 974/3560 Training loss: 1.8715 0.0474 sec/batch
Epoch 6/20  Iteration 975/3560 Training loss: 1.8709 0.0480 sec/batch
Epoch 6/20  Iteration 976/3560 Training loss: 1.8704 0.0478 sec/batch
Epoch 6/20  Iteration 977/3560 Training loss: 1.8702 0.0483 sec/batch
Epoch 6/20  Iteration 978/3560 Training loss: 1.8698 0.0476 sec/batch
Epoch 6/20  Iteration 979/3560 Training loss: 1.8692 0.0479 sec/batch
Epoch 6/20  Iteration 980/3560 Training loss: 1.8692 0.0478 sec/batch
Epoch 6/20  Iteration 981/3560 Training loss: 1.8688 0.0479 sec/batch
Epoch 6/20  Iteration 982/3560 Training loss: 1.8686 0.0480 sec/batch

Epoch 7/20  Iteration 1087/3560 Training loss: 1.8251 0.0490 sec/batch
Epoch 7/20  Iteration 1088/3560 Training loss: 1.8249 0.0485 sec/batch
Epoch 7/20  Iteration 1089/3560 Training loss: 1.8241 0.0481 sec/batch
Epoch 7/20  Iteration 1090/3560 Training loss: 1.8260 0.0481 sec/batch
Epoch 7/20  Iteration 1091/3560 Training loss: 1.8248 0.0476 sec/batch
Epoch 7/20  Iteration 1092/3560 Training loss: 1.8237 0.0486 sec/batch
Epoch 7/20  Iteration 1093/3560 Training loss: 1.8232 0.0505 sec/batch
Epoch 7/20  Iteration 1094/3560 Training loss: 1.8221 0.0477 sec/batch
Epoch 7/20  Iteration 1095/3560 Training loss: 1.8210 0.0473 sec/batch
Epoch 7/20  Iteration 1096/3560 Training loss: 1.8211 0.0482 sec/batch
Epoch 7/20  Iteration 1097/3560 Training loss: 1.8221 0.0509 sec/batch
Epoch 7/20  Iteration 1098/3560 Training loss: 1.8222 0.0476 sec/batch
Epoch 7/20  Iteration 1099/3560 Training loss: 1.8219 0.0509 sec/batch
Epoch 7/20  Iteration 1100/3560 Training loss: 1.8208 0.0475 sec/batch

Epoch 7/20  Iteration 1205/3560 Training loss: 1.7942 0.0487 sec/batch
Epoch 7/20  Iteration 1206/3560 Training loss: 1.7941 0.0549 sec/batch
Epoch 7/20  Iteration 1207/3560 Training loss: 1.7943 0.0513 sec/batch
Epoch 7/20  Iteration 1208/3560 Training loss: 1.7941 0.0509 sec/batch
Epoch 7/20  Iteration 1209/3560 Training loss: 1.7942 0.0490 sec/batch
Epoch 7/20  Iteration 1210/3560 Training loss: 1.7941 0.0486 sec/batch
Epoch 7/20  Iteration 1211/3560 Training loss: 1.7940 0.0517 sec/batch
Epoch 7/20  Iteration 1212/3560 Training loss: 1.7939 0.0489 sec/batch
Epoch 7/20  Iteration 1213/3560 Training loss: 1.7938 0.0482 sec/batch
Epoch 7/20  Iteration 1214/3560 Training loss: 1.7939 0.0495 sec/batch
Epoch 7/20  Iteration 1215/3560 Training loss: 1.7939 0.0489 sec/batch
Epoch 7/20  Iteration 1216/3560 Training loss: 1.7940 0.0490 sec/batch
Epoch 7/20  Iteration 1217/3560 Training loss: 1.7940 0.0480 sec/batch
Epoch 7/20  Iteration 1218/3560 Training loss: 1.7938 0.0485 sec/batch

Epoch 8/20  Iteration 1323/3560 Training loss: 1.7538 0.0514 sec/batch
Epoch 8/20  Iteration 1324/3560 Training loss: 1.7538 0.0514 sec/batch
Epoch 8/20  Iteration 1325/3560 Training loss: 1.7533 0.0472 sec/batch
Epoch 8/20  Iteration 1326/3560 Training loss: 1.7532 0.0474 sec/batch
Epoch 8/20  Iteration 1327/3560 Training loss: 1.7527 0.0474 sec/batch
Epoch 8/20  Iteration 1328/3560 Training loss: 1.7527 0.0522 sec/batch
Epoch 8/20  Iteration 1329/3560 Training loss: 1.7521 0.0480 sec/batch
Epoch 8/20  Iteration 1330/3560 Training loss: 1.7520 0.0509 sec/batch
Epoch 8/20  Iteration 1331/3560 Training loss: 1.7514 0.0478 sec/batch
Epoch 8/20  Iteration 1332/3560 Training loss: 1.7510 0.0479 sec/batch
Epoch 8/20  Iteration 1333/3560 Training loss: 1.7508 0.0500 sec/batch
Epoch 8/20  Iteration 1334/3560 Training loss: 1.7504 0.0472 sec/batch
Epoch 8/20  Iteration 1335/3560 Training loss: 1.7500 0.0483 sec/batch
Epoch 8/20  Iteration 1336/3560 Training loss: 1.7500 0.0496 sec/batch

Epoch 9/20  Iteration 1441/3560 Training loss: 1.7164 0.0520 sec/batch
Epoch 9/20  Iteration 1442/3560 Training loss: 1.7185 0.0493 sec/batch
Epoch 9/20  Iteration 1443/3560 Training loss: 1.7184 0.0488 sec/batch
Epoch 9/20  Iteration 1444/3560 Training loss: 1.7188 0.0513 sec/batch
Epoch 9/20  Iteration 1445/3560 Training loss: 1.7180 0.0480 sec/batch
Epoch 9/20  Iteration 1446/3560 Training loss: 1.7197 0.0480 sec/batch
Epoch 9/20  Iteration 1447/3560 Training loss: 1.7185 0.0480 sec/batch
Epoch 9/20  Iteration 1448/3560 Training loss: 1.7177 0.0513 sec/batch
Epoch 9/20  Iteration 1449/3560 Training loss: 1.7173 0.0479 sec/batch
Epoch 9/20  Iteration 1450/3560 Training loss: 1.7162 0.0530 sec/batch
Epoch 9/20  Iteration 1451/3560 Training loss: 1.7152 0.0483 sec/batch
Epoch 9/20  Iteration 1452/3560 Training loss: 1.7154 0.0530 sec/batch
Epoch 9/20  Iteration 1453/3560 Training loss: 1.7165 0.0511 sec/batch
Epoch 9/20  Iteration 1454/3560 Training loss: 1.7166 0.0515 sec/batch

Epoch 9/20  Iteration 1557/3560 Training loss: 1.6961 0.0472 sec/batch
Epoch 9/20  Iteration 1558/3560 Training loss: 1.6961 0.0474 sec/batch
Epoch 9/20  Iteration 1559/3560 Training loss: 1.6960 0.0481 sec/batch
Epoch 9/20  Iteration 1560/3560 Training loss: 1.6960 0.0473 sec/batch
Epoch 9/20  Iteration 1561/3560 Training loss: 1.6960 0.0474 sec/batch
Epoch 9/20  Iteration 1562/3560 Training loss: 1.6960 0.0503 sec/batch
Epoch 9/20  Iteration 1563/3560 Training loss: 1.6962 0.0474 sec/batch
Epoch 9/20  Iteration 1564/3560 Training loss: 1.6960 0.0481 sec/batch
Epoch 9/20  Iteration 1565/3560 Training loss: 1.6963 0.0484 sec/batch
Epoch 9/20  Iteration 1566/3560 Training loss: 1.6962 0.0477 sec/batch
Epoch 9/20  Iteration 1567/3560 Training loss: 1.6961 0.0523 sec/batch
Epoch 9/20  Iteration 1568/3560 Training loss: 1.6961 0.0479 sec/batch
Epoch 9/20  Iteration 1569/3560 Training loss: 1.6960 0.0480 sec/batch
Epoch 9/20  Iteration 1570/3560 Training loss: 1.6961 0.0476 sec/batch

Epoch 10/20  Iteration 1675/3560 Training loss: 1.6701 0.0508 sec/batch
Epoch 10/20  Iteration 1676/3560 Training loss: 1.6698 0.0482 sec/batch
Epoch 10/20  Iteration 1677/3560 Training loss: 1.6695 0.0573 sec/batch
Epoch 10/20  Iteration 1678/3560 Training loss: 1.6697 0.0474 sec/batch
Epoch 10/20  Iteration 1679/3560 Training loss: 1.6696 0.0475 sec/batch
Epoch 10/20  Iteration 1680/3560 Training loss: 1.6696 0.0497 sec/batch
Epoch 10/20  Iteration 1681/3560 Training loss: 1.6691 0.0482 sec/batch
Epoch 10/20  Iteration 1682/3560 Training loss: 1.6691 0.0474 sec/batch
Epoch 10/20  Iteration 1683/3560 Training loss: 1.6686 0.0579 sec/batch
Epoch 10/20  Iteration 1684/3560 Training loss: 1.6685 0.0503 sec/batch
Epoch 10/20  Iteration 1685/3560 Training loss: 1.6680 0.0511 sec/batch
Epoch 10/20  Iteration 1686/3560 Training loss: 1.6679 0.0577 sec/batch
Epoch 10/20  Iteration 1687/3560 Training loss: 1.6675 0.0558 sec/batch
Epoch 10/20  Iteration 1688/3560 Training loss: 1.6671 0.0549 sec/batch

Epoch 11/20  Iteration 1789/3560 Training loss: 1.6470 0.0485 sec/batch
Epoch 11/20  Iteration 1790/3560 Training loss: 1.6469 0.0501 sec/batch
Epoch 11/20  Iteration 1791/3560 Training loss: 1.6440 0.0479 sec/batch
Epoch 11/20  Iteration 1792/3560 Training loss: 1.6423 0.0462 sec/batch
Epoch 11/20  Iteration 1793/3560 Training loss: 1.6425 0.0478 sec/batch
Epoch 11/20  Iteration 1794/3560 Training loss: 1.6450 0.0603 sec/batch
Epoch 11/20  Iteration 1795/3560 Training loss: 1.6436 0.0480 sec/batch
Epoch 11/20  Iteration 1796/3560 Training loss: 1.6420 0.0494 sec/batch
Epoch 11/20  Iteration 1797/3560 Training loss: 1.6420 0.0590 sec/batch
Epoch 11/20  Iteration 1798/3560 Training loss: 1.6439 0.0494 sec/batch
Epoch 11/20  Iteration 1799/3560 Training loss: 1.6440 0.0522 sec/batch
Epoch 11/20  Iteration 1800/3560 Training loss: 1.6447 0.0487 sec/batch
Epoch 11/20  Iteration 1801/3560 Training loss: 1.6440 0.0490 sec/batch
Epoch 11/20  Iteration 1802/3560 Training loss: 1.6455 0.0485 sec/batch

Epoch 11/20  Iteration 1906/3560 Training loss: 1.6272 0.0537 sec/batch
Epoch 11/20  Iteration 1907/3560 Training loss: 1.6273 0.0575 sec/batch
Epoch 11/20  Iteration 1908/3560 Training loss: 1.6273 0.0480 sec/batch
Epoch 11/20  Iteration 1909/3560 Training loss: 1.6271 0.0485 sec/batch
Epoch 11/20  Iteration 1910/3560 Training loss: 1.6270 0.0485 sec/batch
Epoch 11/20  Iteration 1911/3560 Training loss: 1.6266 0.0484 sec/batch
Epoch 11/20  Iteration 1912/3560 Training loss: 1.6264 0.0473 sec/batch
Epoch 11/20  Iteration 1913/3560 Training loss: 1.6264 0.0478 sec/batch
Epoch 11/20  Iteration 1914/3560 Training loss: 1.6264 0.0486 sec/batch
Epoch 11/20  Iteration 1915/3560 Training loss: 1.6264 0.0512 sec/batch
Epoch 11/20  Iteration 1916/3560 Training loss: 1.6264 0.0507 sec/batch
Epoch 11/20  Iteration 1917/3560 Training loss: 1.6265 0.0485 sec/batch
Epoch 11/20  Iteration 1918/3560 Training loss: 1.6265 0.0485 sec/batch
Epoch 11/20  Iteration 1919/3560 Training loss: 1.6267 0.0522 sec/batch

Epoch 12/20  Iteration 2020/3560 Training loss: 1.6078 0.0488 sec/batch
Epoch 12/20  Iteration 2021/3560 Training loss: 1.6081 0.0485 sec/batch
Epoch 12/20  Iteration 2022/3560 Training loss: 1.6081 0.0490 sec/batch
Epoch 12/20  Iteration 2023/3560 Training loss: 1.6079 0.0613 sec/batch
Epoch 12/20  Iteration 2024/3560 Training loss: 1.6082 0.0497 sec/batch
Epoch 12/20  Iteration 2025/3560 Training loss: 1.6084 0.0483 sec/batch
Epoch 12/20  Iteration 2026/3560 Training loss: 1.6081 0.0492 sec/batch
Epoch 12/20  Iteration 2027/3560 Training loss: 1.6080 0.0484 sec/batch
Epoch 12/20  Iteration 2028/3560 Training loss: 1.6080 0.0490 sec/batch
Epoch 12/20  Iteration 2029/3560 Training loss: 1.6085 0.0495 sec/batch
Epoch 12/20  Iteration 2030/3560 Training loss: 1.6087 0.0518 sec/batch
Epoch 12/20  Iteration 2031/3560 Training loss: 1.6091 0.0492 sec/batch
Epoch 12/20  Iteration 2032/3560 Training loss: 1.6088 0.0494 sec/batch
Epoch 12/20  Iteration 2033/3560 Training loss: 1.6086 0.0508 sec/batch

Epoch 12/20  Iteration 2134/3560 Training loss: 1.5990 0.0508 sec/batch
Epoch 12/20  Iteration 2135/3560 Training loss: 1.5988 0.0481 sec/batch
Epoch 12/20  Iteration 2136/3560 Training loss: 1.5989 0.0481 sec/batch
Epoch 13/20  Iteration 2137/3560 Training loss: 1.6521 0.0478 sec/batch
Epoch 13/20  Iteration 2138/3560 Training loss: 1.6213 0.0480 sec/batch
Epoch 13/20  Iteration 2139/3560 Training loss: 1.6143 0.0489 sec/batch
Epoch 13/20  Iteration 2140/3560 Training loss: 1.6079 0.0506 sec/batch
Epoch 13/20  Iteration 2141/3560 Training loss: 1.6010 0.0477 sec/batch
Epoch 13/20  Iteration 2142/3560 Training loss: 1.5910 0.0501 sec/batch
Epoch 13/20  Iteration 2143/3560 Training loss: 1.5913 0.0480 sec/batch
Epoch 13/20  Iteration 2144/3560 Training loss: 1.5902 0.0505 sec/batch
Epoch 13/20  Iteration 2145/3560 Training loss: 1.5919 0.0475 sec/batch
Epoch 13/20  Iteration 2146/3560 Training loss: 1.5914 0.0478 sec/batch
Epoch 13/20  Iteration 2147/3560 Training loss: 1.5886 0.0484 sec/batch

Epoch 13/20  Iteration 2248/3560 Training loss: 1.5764 0.0481 sec/batch
Epoch 13/20  Iteration 2249/3560 Training loss: 1.5762 0.0586 sec/batch
Epoch 13/20  Iteration 2250/3560 Training loss: 1.5761 0.0477 sec/batch
Epoch 13/20  Iteration 2251/3560 Training loss: 1.5758 0.0486 sec/batch
Epoch 13/20  Iteration 2252/3560 Training loss: 1.5755 0.0587 sec/batch
Epoch 13/20  Iteration 2253/3560 Training loss: 1.5754 0.0481 sec/batch
Epoch 13/20  Iteration 2254/3560 Training loss: 1.5754 0.0484 sec/batch
Epoch 13/20  Iteration 2255/3560 Training loss: 1.5753 0.0512 sec/batch
Epoch 13/20  Iteration 2256/3560 Training loss: 1.5752 0.0491 sec/batch
Epoch 13/20  Iteration 2257/3560 Training loss: 1.5751 0.0501 sec/batch
Epoch 13/20  Iteration 2258/3560 Training loss: 1.5748 0.0584 sec/batch
Epoch 13/20  Iteration 2259/3560 Training loss: 1.5744 0.0490 sec/batch
Epoch 13/20  Iteration 2260/3560 Training loss: 1.5745 0.0512 sec/batch
Epoch 13/20  Iteration 2261/3560 Training loss: 1.5744 0.0503 sec/batch

Epoch 14/20  Iteration 2366/3560 Training loss: 1.5585 0.0491 sec/batch
Epoch 14/20  Iteration 2367/3560 Training loss: 1.5584 0.0481 sec/batch
Epoch 14/20  Iteration 2368/3560 Training loss: 1.5585 0.0484 sec/batch
Epoch 14/20  Iteration 2369/3560 Training loss: 1.5581 0.0481 sec/batch
Epoch 14/20  Iteration 2370/3560 Training loss: 1.5583 0.0483 sec/batch
Epoch 14/20  Iteration 2371/3560 Training loss: 1.5585 0.0495 sec/batch
Epoch 14/20  Iteration 2372/3560 Training loss: 1.5582 0.0509 sec/batch
Epoch 14/20  Iteration 2373/3560 Training loss: 1.5577 0.0481 sec/batch
Epoch 14/20  Iteration 2374/3560 Training loss: 1.5582 0.0495 sec/batch
Epoch 14/20  Iteration 2375/3560 Training loss: 1.5583 0.0495 sec/batch
Epoch 14/20  Iteration 2376/3560 Training loss: 1.5592 0.0486 sec/batch
Epoch 14/20  Iteration 2377/3560 Training loss: 1.5595 0.0478 sec/batch
Epoch 14/20  Iteration 2378/3560 Training loss: 1.5594 0.0490 sec/batch
Epoch 14/20  Iteration 2379/3560 Training loss: 1.5593 0.0484 sec/batch

Epoch 14/20  Iteration 2482/3560 Training loss: 1.5518 0.0610 sec/batch
Epoch 14/20  Iteration 2483/3560 Training loss: 1.5518 0.0489 sec/batch
Epoch 14/20  Iteration 2484/3560 Training loss: 1.5517 0.0511 sec/batch
Epoch 14/20  Iteration 2485/3560 Training loss: 1.5516 0.0480 sec/batch
Epoch 14/20  Iteration 2486/3560 Training loss: 1.5515 0.0478 sec/batch
Epoch 14/20  Iteration 2487/3560 Training loss: 1.5516 0.0477 sec/batch
Epoch 14/20  Iteration 2488/3560 Training loss: 1.5516 0.0487 sec/batch
Epoch 14/20  Iteration 2489/3560 Training loss: 1.5517 0.0516 sec/batch
Epoch 14/20  Iteration 2490/3560 Training loss: 1.5516 0.0490 sec/batch
Epoch 14/20  Iteration 2491/3560 Training loss: 1.5514 0.0598 sec/batch
Epoch 14/20  Iteration 2492/3560 Training loss: 1.5515 0.0511 sec/batch
Epoch 15/20  Iteration 2493/3560 Training loss: 1.6037 0.0513 sec/batch
Epoch 15/20  Iteration 2494/3560 Training loss: 1.5766 0.0516 sec/batch
Epoch 15/20  Iteration 2495/3560 Training loss: 1.5699 0.0503 sec/batch

Epoch 15/20  Iteration 2597/3560 Training loss: 1.5334 0.0499 sec/batch
Epoch 15/20  Iteration 2598/3560 Training loss: 1.5334 0.0482 sec/batch
Epoch 15/20  Iteration 2599/3560 Training loss: 1.5334 0.0483 sec/batch
Epoch 15/20  Iteration 2600/3560 Training loss: 1.5334 0.0478 sec/batch
Epoch 15/20  Iteration 2601/3560 Training loss: 1.5333 0.0507 sec/batch
Epoch 15/20  Iteration 2602/3560 Training loss: 1.5333 0.0492 sec/batch
Epoch 15/20  Iteration 2603/3560 Training loss: 1.5331 0.0484 sec/batch
Epoch 15/20  Iteration 2604/3560 Training loss: 1.5329 0.0485 sec/batch
Epoch 15/20  Iteration 2605/3560 Training loss: 1.5328 0.0477 sec/batch
Epoch 15/20  Iteration 2606/3560 Training loss: 1.5326 0.0532 sec/batch
Epoch 15/20  Iteration 2607/3560 Training loss: 1.5324 0.0496 sec/batch
Epoch 15/20  Iteration 2608/3560 Training loss: 1.5321 0.0482 sec/batch
Epoch 15/20  Iteration 2609/3560 Training loss: 1.5320 0.0481 sec/batch
Epoch 15/20  Iteration 2610/3560 Training loss: 1.5319 0.0477 sec/batch

Epoch 16/20  Iteration 2713/3560 Training loss: 1.5190 0.0507 sec/batch
Epoch 16/20  Iteration 2714/3560 Training loss: 1.5183 0.0490 sec/batch
Epoch 16/20  Iteration 2715/3560 Training loss: 1.5185 0.0495 sec/batch
Epoch 16/20  Iteration 2716/3560 Training loss: 1.5175 0.0498 sec/batch
Epoch 16/20  Iteration 2717/3560 Training loss: 1.5173 0.0501 sec/batch
Epoch 16/20  Iteration 2718/3560 Training loss: 1.5167 0.0592 sec/batch
Epoch 16/20  Iteration 2719/3560 Training loss: 1.5166 0.0481 sec/batch
Epoch 16/20  Iteration 2720/3560 Training loss: 1.5172 0.0492 sec/batch
Epoch 16/20  Iteration 2721/3560 Training loss: 1.5167 0.0484 sec/batch
Epoch 16/20  Iteration 2722/3560 Training loss: 1.5175 0.0481 sec/batch
Epoch 16/20  Iteration 2723/3560 Training loss: 1.5175 0.0490 sec/batch
Epoch 16/20  Iteration 2724/3560 Training loss: 1.5176 0.0600 sec/batch
Epoch 16/20  Iteration 2725/3560 Training loss: 1.5173 0.0492 sec/batch
Epoch 16/20  Iteration 2726/3560 Training loss: 1.5174 0.0479 sec/batch

Epoch 16/20  Iteration 2828/3560 Training loss: 1.5119 0.0518 sec/batch
Epoch 16/20  Iteration 2829/3560 Training loss: 1.5118 0.0478 sec/batch
Epoch 16/20  Iteration 2830/3560 Training loss: 1.5119 0.0485 sec/batch
Epoch 16/20  Iteration 2831/3560 Training loss: 1.5121 0.0480 sec/batch
Epoch 16/20  Iteration 2832/3560 Training loss: 1.5121 0.0476 sec/batch
Epoch 16/20  Iteration 2833/3560 Training loss: 1.5121 0.0494 sec/batch
Epoch 16/20  Iteration 2834/3560 Training loss: 1.5121 0.0481 sec/batch
Epoch 16/20  Iteration 2835/3560 Training loss: 1.5121 0.0481 sec/batch
Epoch 16/20  Iteration 2836/3560 Training loss: 1.5120 0.0478 sec/batch
Epoch 16/20  Iteration 2837/3560 Training loss: 1.5121 0.0478 sec/batch
Epoch 16/20  Iteration 2838/3560 Training loss: 1.5126 0.0518 sec/batch
Epoch 16/20  Iteration 2839/3560 Training loss: 1.5126 0.0479 sec/batch
Epoch 16/20  Iteration 2840/3560 Training loss: 1.5125 0.0489 sec/batch
Epoch 16/20  Iteration 2841/3560 Training loss: 1.5124 0.0480 sec/batch

Epoch 17/20  Iteration 2944/3560 Training loss: 1.4987 0.0676 sec/batch
Epoch 17/20  Iteration 2945/3560 Training loss: 1.4987 0.0504 sec/batch
Epoch 17/20  Iteration 2946/3560 Training loss: 1.4983 0.0487 sec/batch
Epoch 17/20  Iteration 2947/3560 Training loss: 1.4980 0.0489 sec/batch
Epoch 17/20  Iteration 2948/3560 Training loss: 1.4976 0.0545 sec/batch
Epoch 17/20  Iteration 2949/3560 Training loss: 1.4975 0.0484 sec/batch
Epoch 17/20  Iteration 2950/3560 Training loss: 1.4974 0.0485 sec/batch
Epoch 17/20  Iteration 2951/3560 Training loss: 1.4972 0.0477 sec/batch
Epoch 17/20  Iteration 2952/3560 Training loss: 1.4971 0.0483 sec/batch
Epoch 17/20  Iteration 2953/3560 Training loss: 1.4969 0.0514 sec/batch
Epoch 17/20  Iteration 2954/3560 Training loss: 1.4969 0.0479 sec/batch
Epoch 17/20  Iteration 2955/3560 Training loss: 1.4969 0.0576 sec/batch
Epoch 17/20  Iteration 2956/3560 Training loss: 1.4969 0.0479 sec/batch
Epoch 17/20  Iteration 2957/3560 Training loss: 1.4968 0.0496 sec/batch

Epoch 18/20  Iteration 3061/3560 Training loss: 1.4889 0.0607 sec/batch
Epoch 18/20  Iteration 3062/3560 Training loss: 1.4884 0.0484 sec/batch
Epoch 18/20  Iteration 3063/3560 Training loss: 1.4878 0.0485 sec/batch
Epoch 18/20  Iteration 3064/3560 Training loss: 1.4868 0.0475 sec/batch
Epoch 18/20  Iteration 3065/3560 Training loss: 1.4855 0.0481 sec/batch
Epoch 18/20  Iteration 3066/3560 Training loss: 1.4850 0.0526 sec/batch
Epoch 18/20  Iteration 3067/3560 Training loss: 1.4845 0.0503 sec/batch
Epoch 18/20  Iteration 3068/3560 Training loss: 1.4851 0.0487 sec/batch
Epoch 18/20  Iteration 3069/3560 Training loss: 1.4846 0.0485 sec/batch
Epoch 18/20  Iteration 3070/3560 Training loss: 1.4839 0.0606 sec/batch
Epoch 18/20  Iteration 3071/3560 Training loss: 1.4840 0.0514 sec/batch
Epoch 18/20  Iteration 3072/3560 Training loss: 1.4831 0.0487 sec/batch
Epoch 18/20  Iteration 3073/3560 Training loss: 1.4829 0.0478 sec/batch
Epoch 18/20  Iteration 3074/3560 Training loss: 1.4823 0.0487 sec/batch

Epoch 18/20  Iteration 3179/3560 Training loss: 1.4784 0.0493 sec/batch
Epoch 18/20  Iteration 3180/3560 Training loss: 1.4785 0.0484 sec/batch
Epoch 18/20  Iteration 3181/3560 Training loss: 1.4785 0.0476 sec/batch
Epoch 18/20  Iteration 3182/3560 Training loss: 1.4786 0.0478 sec/batch
Epoch 18/20  Iteration 3183/3560 Training loss: 1.4786 0.0515 sec/batch
Epoch 18/20  Iteration 3184/3560 Training loss: 1.4787 0.0527 sec/batch
Epoch 18/20  Iteration 3185/3560 Training loss: 1.4785 0.0489 sec/batch
Epoch 18/20  Iteration 3186/3560 Training loss: 1.4786 0.0607 sec/batch
Epoch 18/20  Iteration 3187/3560 Training loss: 1.4788 0.0508 sec/batch
Epoch 18/20  Iteration 3188/3560 Training loss: 1.4788 0.0503 sec/batch
Epoch 18/20  Iteration 3189/3560 Training loss: 1.4789 0.0488 sec/batch
Epoch 18/20  Iteration 3190/3560 Training loss: 1.4789 0.0480 sec/batch
Epoch 18/20  Iteration 3191/3560 Training loss: 1.4789 0.0488 sec/batch
Epoch 18/20  Iteration 3192/3560 Training loss: 1.4789 0.0521 sec/batch

Epoch 19/20  Iteration 3294/3560 Training loss: 1.4684 0.0515 sec/batch
Epoch 19/20  Iteration 3295/3560 Training loss: 1.4681 0.0610 sec/batch
Epoch 19/20  Iteration 3296/3560 Training loss: 1.4681 0.0491 sec/batch
Epoch 19/20  Iteration 3297/3560 Training loss: 1.4678 0.0492 sec/batch
Epoch 19/20  Iteration 3298/3560 Training loss: 1.4676 0.0499 sec/batch
Epoch 19/20  Iteration 3299/3560 Training loss: 1.4673 0.0485 sec/batch
Epoch 19/20  Iteration 3300/3560 Training loss: 1.4673 0.0499 sec/batch
Epoch 19/20  Iteration 3301/3560 Training loss: 1.4674 0.0490 sec/batch
Epoch 19/20  Iteration 3302/3560 Training loss: 1.4669 0.0482 sec/batch
Epoch 19/20  Iteration 3303/3560 Training loss: 1.4667 0.0479 sec/batch
Epoch 19/20  Iteration 3304/3560 Training loss: 1.4662 0.0507 sec/batch
Epoch 19/20  Iteration 3305/3560 Training loss: 1.4662 0.0482 sec/batch
Epoch 19/20  Iteration 3306/3560 Training loss: 1.4661 0.0490 sec/batch
Epoch 19/20  Iteration 3307/3560 Training loss: 1.4660 0.0491 sec/batch

Epoch 20/20  Iteration 3411/3560 Training loss: 1.4602 0.0501 sec/batch
Epoch 20/20  Iteration 3412/3560 Training loss: 1.4606 0.0496 sec/batch
Epoch 20/20  Iteration 3413/3560 Training loss: 1.4601 0.0598 sec/batch
Epoch 20/20  Iteration 3414/3560 Training loss: 1.4592 0.0500 sec/batch
Epoch 20/20  Iteration 3415/3560 Training loss: 1.4596 0.0513 sec/batch
Epoch 20/20  Iteration 3416/3560 Training loss: 1.4599 0.0500 sec/batch
Epoch 20/20  Iteration 3417/3560 Training loss: 1.4598 0.0521 sec/batch
Epoch 20/20  Iteration 3418/3560 Training loss: 1.4592 0.0501 sec/batch
Epoch 20/20  Iteration 3419/3560 Training loss: 1.4586 0.0510 sec/batch
Epoch 20/20  Iteration 3420/3560 Training loss: 1.4576 0.0536 sec/batch
Epoch 20/20  Iteration 3421/3560 Training loss: 1.4563 0.0502 sec/batch
Epoch 20/20  Iteration 3422/3560 Training loss: 1.4558 0.0500 sec/batch
Epoch 20/20  Iteration 3423/3560 Training loss: 1.4553 0.0510 sec/batch
Epoch 20/20  Iteration 3424/3560 Training loss: 1.4559 0.0546 sec/batch

Epoch 20/20  Iteration 3527/3560 Training loss: 1.4496 0.0487 sec/batch
Epoch 20/20  Iteration 3528/3560 Training loss: 1.4498 0.0513 sec/batch
Epoch 20/20  Iteration 3529/3560 Training loss: 1.4499 0.0489 sec/batch
Epoch 20/20  Iteration 3530/3560 Training loss: 1.4502 0.0530 sec/batch
Epoch 20/20  Iteration 3531/3560 Training loss: 1.4503 0.0493 sec/batch
Epoch 20/20  Iteration 3532/3560 Training loss: 1.4502 0.0516 sec/batch
Epoch 20/20  Iteration 3533/3560 Training loss: 1.4498 0.0489 sec/batch
Epoch 20/20  Iteration 3534/3560 Training loss: 1.4499 0.0615 sec/batch
Epoch 20/20  Iteration 3535/3560 Training loss: 1.4500 0.0492 sec/batch
Epoch 20/20  Iteration 3536/3560 Training loss: 1.4500 0.0496 sec/batch
Epoch 20/20  Iteration 3537/3560 Training loss: 1.4501 0.0519 sec/batch
Epoch 20/20  Iteration 3538/3560 Training loss: 1.4502 0.0496 sec/batch
Epoch 20/20  Iteration 3539/3560 Training loss: 1.4502 0.0501 sec/batch
Epoch 20/20  Iteration 3540/3560 Training loss: 1.4502 0.0502 sec/batch

Epoch 1/20  Iteration 87/3560 Training loss: 3.2375 0.0472 sec/batch
Epoch 1/20  Iteration 88/3560 Training loss: 3.2359 0.0489 sec/batch
Epoch 1/20  Iteration 89/3560 Training loss: 3.2345 0.0482 sec/batch
Epoch 1/20  Iteration 90/3560 Training loss: 3.2332 0.0489 sec/batch
Epoch 1/20  Iteration 91/3560 Training loss: 3.2318 0.0512 sec/batch
Epoch 1/20  Iteration 92/3560 Training loss: 3.2304 0.0477 sec/batch
Epoch 1/20  Iteration 93/3560 Training loss: 3.2291 0.0479 sec/batch
Epoch 1/20  Iteration 94/3560 Training loss: 3.2278 0.0491 sec/batch
Epoch 1/20  Iteration 95/3560 Training loss: 3.2265 0.0507 sec/batch
Epoch 1/20  Iteration 96/3560 Training loss: 3.2252 0.0535 sec/batch
Epoch 1/20  Iteration 97/3560 Training loss: 3.2240 0.0697 sec/batch
Epoch 1/20  Iteration 98/3560 Training loss: 3.2228 0.0646 sec/batch
Epoch 1/20  Iteration 99/3560 Training loss: 3.2216 0.0506 sec/batch
Epoch 1/20  Iteration 100/3560 Training loss: 3.2204 0.0553 sec/batch

Epoch 2/20  Iteration 208/3560 Training loss: 2.9971 0.0569 sec/batch
Epoch 2/20  Iteration 209/3560 Training loss: 2.9961 0.0666 sec/batch
Epoch 2/20  Iteration 210/3560 Training loss: 2.9939 0.0474 sec/batch
Epoch 2/20  Iteration 211/3560 Training loss: 2.9915 0.0497 sec/batch
Epoch 2/20  Iteration 212/3560 Training loss: 2.9896 0.0498 sec/batch
Epoch 2/20  Iteration 213/3560 Training loss: 2.9872 0.0518 sec/batch
Epoch 2/20  Iteration 214/3560 Training loss: 2.9855 0.0458 sec/batch
Epoch 2/20  Iteration 215/3560 Training loss: 2.9829 0.0451 sec/batch
Epoch 2/20  Iteration 216/3560 Training loss: 2.9803 0.0445 sec/batch
Epoch 2/20  Iteration 217/3560 Training loss: 2.9776 0.0464 sec/batch
Epoch 2/20  Iteration 218/3560 Training loss: 2.9751 0.0496 sec/batch
Epoch 2/20  Iteration 219/3560 Training loss: 2.9724 0.0564 sec/batch
Epoch 2/20  Iteration 220/3560 Training loss: 2.9698 0.0503 sec/batch
Epoch 2/20  Iteration 221/3560 Training loss: 2.9671 0.0468 sec/batch

Epoch 2/20  Iteration 328/3560 Training loss: 2.7423 0.0467 sec/batch
Epoch 2/20  Iteration 329/3560 Training loss: 2.7412 0.0468 sec/batch
Epoch 2/20  Iteration 330/3560 Training loss: 2.7400 0.0461 sec/batch
Epoch 2/20  Iteration 331/3560 Training loss: 2.7387 0.0464 sec/batch
Epoch 2/20  Iteration 332/3560 Training loss: 2.7373 0.0462 sec/batch
Epoch 2/20  Iteration 333/3560 Training loss: 2.7359 0.0452 sec/batch
Epoch 2/20  Iteration 334/3560 Training loss: 2.7344 0.0475 sec/batch
Epoch 2/20  Iteration 335/3560 Training loss: 2.7329 0.0499 sec/batch
Epoch 2/20  Iteration 336/3560 Training loss: 2.7315 0.0459 sec/batch
Epoch 2/20  Iteration 337/3560 Training loss: 2.7300 0.0498 sec/batch
Epoch 2/20  Iteration 338/3560 Training loss: 2.7286 0.0461 sec/batch
Epoch 2/20  Iteration 339/3560 Training loss: 2.7273 0.0469 sec/batch
Epoch 2/20  Iteration 340/3560 Training loss: 2.7257 0.0523 sec/batch
Epoch 2/20  Iteration 341/3560 Training loss: 2.7242 0.0483 sec/batch

Epoch 3/20  Iteration 450/3560 Training loss: 2.4262 0.0471 sec/batch
Epoch 3/20  Iteration 451/3560 Training loss: 2.4255 0.0469 sec/batch
Epoch 3/20  Iteration 452/3560 Training loss: 2.4249 0.0463 sec/batch
Epoch 3/20  Iteration 453/3560 Training loss: 2.4244 0.0462 sec/batch
Epoch 3/20  Iteration 454/3560 Training loss: 2.4238 0.0462 sec/batch
Epoch 3/20  Iteration 455/3560 Training loss: 2.4233 0.0464 sec/batch
Epoch 3/20  Iteration 456/3560 Training loss: 2.4227 0.0456 sec/batch
Epoch 3/20  Iteration 457/3560 Training loss: 2.4224 0.0495 sec/batch
Epoch 3/20  Iteration 458/3560 Training loss: 2.4219 0.0457 sec/batch
Epoch 3/20  Iteration 459/3560 Training loss: 2.4212 0.0456 sec/batch
Epoch 3/20  Iteration 460/3560 Training loss: 2.4206 0.0451 sec/batch
Epoch 3/20  Iteration 461/3560 Training loss: 2.4202 0.0460 sec/batch
Epoch 3/20  Iteration 462/3560 Training loss: 2.4197 0.0496 sec/batch
Epoch 3/20  Iteration 463/3560 Training loss: 2.4191 0.0486 sec/batch

Epoch 4/20  Iteration 570/3560 Training loss: 2.3172 0.0501 sec/batch
Epoch 4/20  Iteration 571/3560 Training loss: 2.3166 0.0482 sec/batch
Epoch 4/20  Iteration 572/3560 Training loss: 2.3154 0.0496 sec/batch
Epoch 4/20  Iteration 573/3560 Training loss: 2.3145 0.0472 sec/batch
Epoch 4/20  Iteration 574/3560 Training loss: 2.3135 0.0477 sec/batch
Epoch 4/20  Iteration 575/3560 Training loss: 2.3126 0.0470 sec/batch
Epoch 4/20  Iteration 576/3560 Training loss: 2.3118 0.0468 sec/batch
Epoch 4/20  Iteration 577/3560 Training loss: 2.3108 0.0469 sec/batch
Epoch 4/20  Iteration 578/3560 Training loss: 2.3100 0.0467 sec/batch
Epoch 4/20  Iteration 579/3560 Training loss: 2.3092 0.0463 sec/batch
Epoch 4/20  Iteration 580/3560 Training loss: 2.3078 0.0473 sec/batch
Epoch 4/20  Iteration 581/3560 Training loss: 2.3077 0.0467 sec/batch
Epoch 4/20  Iteration 582/3560 Training loss: 2.3072 0.0472 sec/batch
Epoch 4/20  Iteration 583/3560 Training loss: 2.3066 0.0468 sec/batch

Epoch 4/20  Iteration 689/3560 Training loss: 2.2715 0.0466 sec/batch
Epoch 4/20  Iteration 690/3560 Training loss: 2.2713 0.0493 sec/batch
Epoch 4/20  Iteration 691/3560 Training loss: 2.2710 0.0464 sec/batch
Epoch 4/20  Iteration 692/3560 Training loss: 2.2706 0.0470 sec/batch
Epoch 4/20  Iteration 693/3560 Training loss: 2.2702 0.0503 sec/batch
Epoch 4/20  Iteration 694/3560 Training loss: 2.2702 0.0465 sec/batch
Epoch 4/20  Iteration 695/3560 Training loss: 2.2700 0.0575 sec/batch
Epoch 4/20  Iteration 696/3560 Training loss: 2.2696 0.0470 sec/batch
Epoch 4/20  Iteration 697/3560 Training loss: 2.2693 0.0467 sec/batch
Epoch 4/20  Iteration 698/3560 Training loss: 2.2690 0.0495 sec/batch
Epoch 4/20  Iteration 699/3560 Training loss: 2.2688 0.0492 sec/batch
Epoch 4/20  Iteration 700/3560 Training loss: 2.2686 0.0475 sec/batch
Epoch 4/20  Iteration 701/3560 Training loss: 2.2684 0.0502 sec/batch
Epoch 4/20  Iteration 702/3560 Training loss: 2.2683 0.0569 sec/batch

Epoch 5/20  Iteration 809/3560 Training loss: 2.1877 0.0487 sec/batch
Epoch 5/20  Iteration 810/3560 Training loss: 2.1874 0.0469 sec/batch
Epoch 5/20  Iteration 811/3560 Training loss: 2.1871 0.0476 sec/batch
Epoch 5/20  Iteration 812/3560 Training loss: 2.1866 0.0476 sec/batch
Epoch 5/20  Iteration 813/3560 Training loss: 2.1865 0.0577 sec/batch
Epoch 5/20  Iteration 814/3560 Training loss: 2.1863 0.0485 sec/batch
Epoch 5/20  Iteration 815/3560 Training loss: 2.1858 0.0471 sec/batch
Epoch 5/20  Iteration 816/3560 Training loss: 2.1855 0.0473 sec/batch
Epoch 5/20  Iteration 817/3560 Training loss: 2.1851 0.0481 sec/batch
Epoch 5/20  Iteration 818/3560 Training loss: 2.1849 0.0488 sec/batch
Epoch 5/20  Iteration 819/3560 Training loss: 2.1845 0.0519 sec/batch
Epoch 5/20  Iteration 820/3560 Training loss: 2.1845 0.0475 sec/batch
Epoch 5/20  Iteration 821/3560 Training loss: 2.1844 0.0473 sec/batch
Epoch 5/20  Iteration 822/3560 Training loss: 2.1839 0.0470 sec/batch

Epoch 6/20  Iteration 929/3560 Training loss: 2.1302 0.0546 sec/batch
Epoch 6/20  Iteration 930/3560 Training loss: 2.1293 0.0496 sec/batch
Epoch 6/20  Iteration 931/3560 Training loss: 2.1286 0.0480 sec/batch
Epoch 6/20  Iteration 932/3560 Training loss: 2.1280 0.0572 sec/batch
Epoch 6/20  Iteration 933/3560 Training loss: 2.1271 0.0476 sec/batch
Epoch 6/20  Iteration 934/3560 Training loss: 2.1264 0.0480 sec/batch
Epoch 6/20  Iteration 935/3560 Training loss: 2.1257 0.0519 sec/batch
Epoch 6/20  Iteration 936/3560 Training loss: 2.1243 0.0473 sec/batch
Epoch 6/20  Iteration 937/3560 Training loss: 2.1243 0.0466 sec/batch
Epoch 6/20  Iteration 938/3560 Training loss: 2.1238 0.0473 sec/batch
Epoch 6/20  Iteration 939/3560 Training loss: 2.1234 0.0472 sec/batch
Epoch 6/20  Iteration 940/3560 Training loss: 2.1237 0.0588 sec/batch
Epoch 6/20  Iteration 941/3560 Training loss: 2.1231 0.0495 sec/batch
Epoch 6/20  Iteration 942/3560 Training loss: 2.1232 0.0482 sec/batch

Epoch 6/20  Iteration 1046/3560 Training loss: 2.1017 0.0487 sec/batch
Epoch 6/20  Iteration 1047/3560 Training loss: 2.1015 0.0501 sec/batch
Epoch 6/20  Iteration 1048/3560 Training loss: 2.1013 0.0513 sec/batch
Epoch 6/20  Iteration 1049/3560 Training loss: 2.1010 0.0473 sec/batch
Epoch 6/20  Iteration 1050/3560 Training loss: 2.1012 0.0476 sec/batch
Epoch 6/20  Iteration 1051/3560 Training loss: 2.1011 0.0507 sec/batch
Epoch 6/20  Iteration 1052/3560 Training loss: 2.1009 0.0476 sec/batch
Epoch 6/20  Iteration 1053/3560 Training loss: 2.1007 0.0480 sec/batch
Epoch 6/20  Iteration 1054/3560 Training loss: 2.1006 0.0479 sec/batch
Epoch 6/20  Iteration 1055/3560 Training loss: 2.1006 0.0479 sec/batch
Epoch 6/20  Iteration 1056/3560 Training loss: 2.1004 0.0476 sec/batch
Epoch 6/20  Iteration 1057/3560 Training loss: 2.1004 0.0483 sec/batch
Epoch 6/20  Iteration 1058/3560 Training loss: 2.1004 0.0499 sec/batch
Epoch 6/20  Iteration 1059/3560 Training loss: 2.1003 0.0585 sec/batch

Epoch 7/20  Iteration 1165/3560 Training loss: 2.0497 0.0471 sec/batch
Epoch 7/20  Iteration 1166/3560 Training loss: 2.0494 0.0488 sec/batch
Epoch 7/20  Iteration 1167/3560 Training loss: 2.0491 0.0474 sec/batch
Epoch 7/20  Iteration 1168/3560 Training loss: 2.0486 0.0486 sec/batch
Epoch 7/20  Iteration 1169/3560 Training loss: 2.0486 0.0475 sec/batch
Epoch 7/20  Iteration 1170/3560 Training loss: 2.0484 0.0481 sec/batch
Epoch 7/20  Iteration 1171/3560 Training loss: 2.0481 0.0473 sec/batch
Epoch 7/20  Iteration 1172/3560 Training loss: 2.0479 0.0484 sec/batch
Epoch 7/20  Iteration 1173/3560 Training loss: 2.0476 0.0473 sec/batch
Epoch 7/20  Iteration 1174/3560 Training loss: 2.0475 0.0477 sec/batch
Epoch 7/20  Iteration 1175/3560 Training loss: 2.0472 0.0589 sec/batch
Epoch 7/20  Iteration 1176/3560 Training loss: 2.0472 0.0472 sec/batch
Epoch 7/20  Iteration 1177/3560 Training loss: 2.0472 0.0478 sec/batch
Epoch 7/20  Iteration 1178/3560 Training loss: 2.0468 0.0509 sec/batch

Epoch 8/20  Iteration 1281/3560 Training loss: 2.0176 0.0552 sec/batch
Epoch 8/20  Iteration 1282/3560 Training loss: 2.0174 0.0481 sec/batch
Epoch 8/20  Iteration 1283/3560 Training loss: 2.0171 0.0479 sec/batch
Epoch 8/20  Iteration 1284/3560 Training loss: 2.0159 0.0528 sec/batch
Epoch 8/20  Iteration 1285/3560 Training loss: 2.0150 0.0478 sec/batch
Epoch 8/20  Iteration 1286/3560 Training loss: 2.0142 0.0480 sec/batch
Epoch 8/20  Iteration 1287/3560 Training loss: 2.0136 0.0475 sec/batch
Epoch 8/20  Iteration 1288/3560 Training loss: 2.0131 0.0509 sec/batch
Epoch 8/20  Iteration 1289/3560 Training loss: 2.0123 0.0528 sec/batch
Epoch 8/20  Iteration 1290/3560 Training loss: 2.0117 0.0498 sec/batch
Epoch 8/20  Iteration 1291/3560 Training loss: 2.0111 0.0501 sec/batch
Epoch 8/20  Iteration 1292/3560 Training loss: 2.0096 0.0475 sec/batch
Epoch 8/20  Iteration 1293/3560 Training loss: 2.0096 0.0470 sec/batch
Epoch 8/20  Iteration 1294/3560 Training loss: 2.0091 0.0479 sec/batch

Epoch 8/20  Iteration 1398/3560 Training loss: 1.9935 0.0609 sec/batch
Epoch 8/20  Iteration 1399/3560 Training loss: 1.9935 0.0492 sec/batch
Epoch 8/20  Iteration 1400/3560 Training loss: 1.9934 0.0480 sec/batch
Epoch 8/20  Iteration 1401/3560 Training loss: 1.9933 0.0493 sec/batch
Epoch 8/20  Iteration 1402/3560 Training loss: 1.9932 0.0490 sec/batch
Epoch 8/20  Iteration 1403/3560 Training loss: 1.9931 0.0532 sec/batch
Epoch 8/20  Iteration 1404/3560 Training loss: 1.9930 0.0488 sec/batch
Epoch 8/20  Iteration 1405/3560 Training loss: 1.9928 0.0477 sec/batch
Epoch 8/20  Iteration 1406/3560 Training loss: 1.9930 0.0472 sec/batch
Epoch 8/20  Iteration 1407/3560 Training loss: 1.9930 0.0475 sec/batch
Epoch 8/20  Iteration 1408/3560 Training loss: 1.9929 0.0481 sec/batch
Epoch 8/20  Iteration 1409/3560 Training loss: 1.9928 0.0484 sec/batch
Epoch 8/20  Iteration 1410/3560 Training loss: 1.9927 0.0587 sec/batch
Epoch 8/20  Iteration 1411/3560 Training loss: 1.9927 0.0477 sec/batch

Epoch 9/20  Iteration 1516/3560 Training loss: 1.9576 0.0478 sec/batch
Epoch 9/20  Iteration 1517/3560 Training loss: 1.9573 0.0485 sec/batch
Epoch 9/20  Iteration 1518/3560 Training loss: 1.9569 0.0618 sec/batch
Epoch 9/20  Iteration 1519/3560 Training loss: 1.9566 0.0479 sec/batch
Epoch 9/20  Iteration 1520/3560 Training loss: 1.9563 0.0503 sec/batch
Epoch 9/20  Iteration 1521/3560 Training loss: 1.9561 0.0474 sec/batch
Epoch 9/20  Iteration 1522/3560 Training loss: 1.9559 0.0600 sec/batch
Epoch 9/20  Iteration 1523/3560 Training loss: 1.9555 0.0602 sec/batch
Epoch 9/20  Iteration 1524/3560 Training loss: 1.9551 0.0511 sec/batch
Epoch 9/20  Iteration 1525/3560 Training loss: 1.9550 0.0579 sec/batch
Epoch 9/20  Iteration 1526/3560 Training loss: 1.9549 0.0508 sec/batch
Epoch 9/20  Iteration 1527/3560 Training loss: 1.9546 0.0470 sec/batch
Epoch 9/20  Iteration 1528/3560 Training loss: 1.9545 0.0481 sec/batch
Epoch 9/20  Iteration 1529/3560 Training loss: 1.9542 0.0533 sec/batch

Epoch 10/20  Iteration 1631/3560 Training loss: 1.9339 0.0494 sec/batch
Epoch 10/20  Iteration 1632/3560 Training loss: 1.9343 0.0477 sec/batch
Epoch 10/20  Iteration 1633/3560 Training loss: 1.9342 0.0482 sec/batch
Epoch 10/20  Iteration 1634/3560 Training loss: 1.9334 0.0480 sec/batch
Epoch 10/20  Iteration 1635/3560 Training loss: 1.9331 0.0475 sec/batch
Epoch 10/20  Iteration 1636/3560 Training loss: 1.9338 0.0473 sec/batch
Epoch 10/20  Iteration 1637/3560 Training loss: 1.9332 0.0601 sec/batch
Epoch 10/20  Iteration 1638/3560 Training loss: 1.9331 0.0504 sec/batch
Epoch 10/20  Iteration 1639/3560 Training loss: 1.9327 0.0483 sec/batch
Epoch 10/20  Iteration 1640/3560 Training loss: 1.9316 0.0473 sec/batch
Epoch 10/20  Iteration 1641/3560 Training loss: 1.9306 0.0479 sec/batch
Epoch 10/20  Iteration 1642/3560 Training loss: 1.9298 0.0481 sec/batch
Epoch 10/20  Iteration 1643/3560 Training loss: 1.9292 0.0478 sec/batch
Epoch 10/20  Iteration 1644/3560 Training loss: 1.9289 0.0503 sec/batch

Epoch 10/20  Iteration 1745/3560 Training loss: 1.9125 0.0491 sec/batch
Epoch 10/20  Iteration 1746/3560 Training loss: 1.9124 0.0477 sec/batch
Epoch 10/20  Iteration 1747/3560 Training loss: 1.9123 0.0476 sec/batch
Epoch 10/20  Iteration 1748/3560 Training loss: 1.9124 0.0479 sec/batch
Epoch 10/20  Iteration 1749/3560 Training loss: 1.9124 0.0482 sec/batch
Epoch 10/20  Iteration 1750/3560 Training loss: 1.9125 0.0634 sec/batch
Epoch 10/20  Iteration 1751/3560 Training loss: 1.9125 0.0485 sec/batch
Epoch 10/20  Iteration 1752/3560 Training loss: 1.9123 0.0587 sec/batch
Epoch 10/20  Iteration 1753/3560 Training loss: 1.9124 0.0474 sec/batch
Epoch 10/20  Iteration 1754/3560 Training loss: 1.9127 0.0580 sec/batch
Epoch 10/20  Iteration 1755/3560 Training loss: 1.9127 0.0603 sec/batch
Epoch 10/20  Iteration 1756/3560 Training loss: 1.9127 0.0580 sec/batch
Epoch 10/20  Iteration 1757/3560 Training loss: 1.9126 0.0543 sec/batch
Epoch 10/20  Iteration 1758/3560 Training loss: 1.9126 0.0480 sec/batch

Epoch 11/20  Iteration 1863/3560 Training loss: 1.8873 0.0501 sec/batch
Epoch 11/20  Iteration 1864/3560 Training loss: 1.8871 0.0503 sec/batch
Epoch 11/20  Iteration 1865/3560 Training loss: 1.8865 0.0473 sec/batch
Epoch 11/20  Iteration 1866/3560 Training loss: 1.8862 0.0485 sec/batch
Epoch 11/20  Iteration 1867/3560 Training loss: 1.8861 0.0486 sec/batch
Epoch 11/20  Iteration 1868/3560 Training loss: 1.8857 0.0479 sec/batch
Epoch 11/20  Iteration 1869/3560 Training loss: 1.8853 0.0481 sec/batch
Epoch 11/20  Iteration 1870/3560 Training loss: 1.8854 0.0474 sec/batch
Epoch 11/20  Iteration 1871/3560 Training loss: 1.8851 0.0475 sec/batch
Epoch 11/20  Iteration 1872/3560 Training loss: 1.8849 0.0506 sec/batch
Epoch 11/20  Iteration 1873/3560 Training loss: 1.8845 0.0509 sec/batch
Epoch 11/20  Iteration 1874/3560 Training loss: 1.8842 0.0499 sec/batch
Epoch 11/20  Iteration 1875/3560 Training loss: 1.8839 0.0501 sec/batch
Epoch 11/20  Iteration 1876/3560 Training loss: 1.8837 0.0601 sec/batch

Epoch 12/20  Iteration 1978/3560 Training loss: 1.8671 0.0608 sec/batch
Epoch 12/20  Iteration 1979/3560 Training loss: 1.8668 0.0516 sec/batch
Epoch 12/20  Iteration 1980/3560 Training loss: 1.8698 0.0476 sec/batch
Epoch 12/20  Iteration 1981/3560 Training loss: 1.8689 0.0490 sec/batch
Epoch 12/20  Iteration 1982/3560 Training loss: 1.8681 0.0481 sec/batch
Epoch 12/20  Iteration 1983/3560 Training loss: 1.8675 0.0494 sec/batch
Epoch 12/20  Iteration 1984/3560 Training loss: 1.8666 0.0535 sec/batch
Epoch 12/20  Iteration 1985/3560 Training loss: 1.8656 0.0480 sec/batch
Epoch 12/20  Iteration 1986/3560 Training loss: 1.8656 0.0480 sec/batch
Epoch 12/20  Iteration 1987/3560 Training loss: 1.8665 0.0491 sec/batch
Epoch 12/20  Iteration 1988/3560 Training loss: 1.8668 0.0480 sec/batch
Epoch 12/20  Iteration 1989/3560 Training loss: 1.8667 0.0490 sec/batch
Epoch 12/20  Iteration 1990/3560 Training loss: 1.8659 0.0482 sec/batch
Epoch 12/20  Iteration 1991/3560 Training loss: 1.8657 0.0486 sec/batch

Epoch 12/20  Iteration 2093/3560 Training loss: 1.8480 0.0582 sec/batch
Epoch 12/20  Iteration 2094/3560 Training loss: 1.8480 0.0479 sec/batch
Epoch 12/20  Iteration 2095/3560 Training loss: 1.8480 0.0486 sec/batch
Epoch 12/20  Iteration 2096/3560 Training loss: 1.8480 0.0472 sec/batch
Epoch 12/20  Iteration 2097/3560 Training loss: 1.8482 0.0477 sec/batch
Epoch 12/20  Iteration 2098/3560 Training loss: 1.8482 0.0514 sec/batch
Epoch 12/20  Iteration 2099/3560 Training loss: 1.8484 0.0502 sec/batch
Epoch 12/20  Iteration 2100/3560 Training loss: 1.8483 0.0512 sec/batch
Epoch 12/20  Iteration 2101/3560 Training loss: 1.8483 0.0512 sec/batch
Epoch 12/20  Iteration 2102/3560 Training loss: 1.8483 0.0479 sec/batch
Epoch 12/20  Iteration 2103/3560 Training loss: 1.8481 0.0485 sec/batch
Epoch 12/20  Iteration 2104/3560 Training loss: 1.8483 0.0502 sec/batch
Epoch 12/20  Iteration 2105/3560 Training loss: 1.8483 0.0587 sec/batch
Epoch 12/20  Iteration 2106/3560 Training loss: 1.8485 0.0484 sec/batch

Epoch 13/20  Iteration 2209/3560 Training loss: 1.8298 0.0476 sec/batch
Epoch 13/20  Iteration 2210/3560 Training loss: 1.8295 0.0509 sec/batch
Epoch 13/20  Iteration 2211/3560 Training loss: 1.8293 0.0482 sec/batch
Epoch 13/20  Iteration 2212/3560 Training loss: 1.8296 0.0508 sec/batch
Epoch 13/20  Iteration 2213/3560 Training loss: 1.8294 0.0485 sec/batch
Epoch 13/20  Iteration 2214/3560 Training loss: 1.8296 0.0503 sec/batch
Epoch 13/20  Iteration 2215/3560 Training loss: 1.8292 0.0482 sec/batch
Epoch 13/20  Iteration 2216/3560 Training loss: 1.8291 0.0479 sec/batch
Epoch 13/20  Iteration 2217/3560 Training loss: 1.8286 0.0477 sec/batch
Epoch 13/20  Iteration 2218/3560 Training loss: 1.8287 0.0492 sec/batch
Epoch 13/20  Iteration 2219/3560 Training loss: 1.8283 0.0488 sec/batch
Epoch 13/20  Iteration 2220/3560 Training loss: 1.8282 0.0484 sec/batch
Epoch 13/20  Iteration 2221/3560 Training loss: 1.8276 0.0484 sec/batch
Epoch 13/20  Iteration 2222/3560 Training loss: 1.8273 0.0617 sec/batch

Epoch 14/20  Iteration 2326/3560 Training loss: 1.8083 0.0501 sec/batch
Epoch 14/20  Iteration 2327/3560 Training loss: 1.8085 0.0481 sec/batch
Epoch 14/20  Iteration 2328/3560 Training loss: 1.8111 0.0475 sec/batch
Epoch 14/20  Iteration 2329/3560 Training loss: 1.8105 0.0504 sec/batch
Epoch 14/20  Iteration 2330/3560 Training loss: 1.8095 0.0482 sec/batch
Epoch 14/20  Iteration 2331/3560 Training loss: 1.8093 0.0475 sec/batch
Epoch 14/20  Iteration 2332/3560 Training loss: 1.8115 0.0484 sec/batch
Epoch 14/20  Iteration 2333/3560 Training loss: 1.8114 0.0476 sec/batch
Epoch 14/20  Iteration 2334/3560 Training loss: 1.8113 0.0479 sec/batch
Epoch 14/20  Iteration 2335/3560 Training loss: 1.8109 0.0502 sec/batch
Epoch 14/20  Iteration 2336/3560 Training loss: 1.8137 0.0480 sec/batch
Epoch 14/20  Iteration 2337/3560 Training loss: 1.8128 0.0481 sec/batch
Epoch 14/20  Iteration 2338/3560 Training loss: 1.8120 0.0490 sec/batch
Epoch 14/20  Iteration 2339/3560 Training loss: 1.8115 0.0481 sec/batch

Epoch 14/20  Iteration 2443/3560 Training loss: 1.7950 0.0475 sec/batch
Epoch 14/20  Iteration 2444/3560 Training loss: 1.7950 0.0480 sec/batch
Epoch 14/20  Iteration 2445/3560 Training loss: 1.7948 0.0507 sec/batch
Epoch 14/20  Iteration 2446/3560 Training loss: 1.7946 0.0501 sec/batch
Epoch 14/20  Iteration 2447/3560 Training loss: 1.7946 0.0486 sec/batch
Epoch 14/20  Iteration 2448/3560 Training loss: 1.7947 0.0490 sec/batch
Epoch 14/20  Iteration 2449/3560 Training loss: 1.7946 0.0475 sec/batch
Epoch 14/20  Iteration 2450/3560 Training loss: 1.7947 0.0477 sec/batch
Epoch 14/20  Iteration 2451/3560 Training loss: 1.7947 0.0469 sec/batch
Epoch 14/20  Iteration 2452/3560 Training loss: 1.7947 0.0470 sec/batch
Epoch 14/20  Iteration 2453/3560 Training loss: 1.7949 0.0478 sec/batch
Epoch 14/20  Iteration 2454/3560 Training loss: 1.7949 0.0484 sec/batch
Epoch 14/20  Iteration 2455/3560 Training loss: 1.7951 0.0481 sec/batch
Epoch 14/20  Iteration 2456/3560 Training loss: 1.7951 0.0481 sec/batch

Epoch 15/20  Iteration 2557/3560 Training loss: 1.7790 0.0512 sec/batch
Epoch 15/20  Iteration 2558/3560 Training loss: 1.7795 0.0624 sec/batch
Epoch 15/20  Iteration 2559/3560 Training loss: 1.7797 0.0528 sec/batch
Epoch 15/20  Iteration 2560/3560 Training loss: 1.7792 0.0571 sec/batch
Epoch 15/20  Iteration 2561/3560 Training loss: 1.7790 0.0475 sec/batch
Epoch 15/20  Iteration 2562/3560 Training loss: 1.7790 0.0485 sec/batch
Epoch 15/20  Iteration 2563/3560 Training loss: 1.7793 0.0496 sec/batch
Epoch 15/20  Iteration 2564/3560 Training loss: 1.7793 0.0510 sec/batch
Epoch 15/20  Iteration 2565/3560 Training loss: 1.7797 0.0500 sec/batch
Epoch 15/20  Iteration 2566/3560 Training loss: 1.7794 0.0496 sec/batch
Epoch 15/20  Iteration 2567/3560 Training loss: 1.7792 0.0480 sec/batch
Epoch 15/20  Iteration 2568/3560 Training loss: 1.7795 0.0537 sec/batch
Epoch 15/20  Iteration 2569/3560 Training loss: 1.7793 0.0516 sec/batch
Epoch 15/20  Iteration 2570/3560 Training loss: 1.7795 0.0484 sec/batch

Epoch 16/20  Iteration 2674/3560 Training loss: 1.7717 0.0481 sec/batch
Epoch 16/20  Iteration 2675/3560 Training loss: 1.7682 0.0492 sec/batch
Epoch 16/20  Iteration 2676/3560 Training loss: 1.7617 0.0484 sec/batch
Epoch 16/20  Iteration 2677/3560 Training loss: 1.7618 0.0480 sec/batch
Epoch 16/20  Iteration 2678/3560 Training loss: 1.7620 0.0486 sec/batch
Epoch 16/20  Iteration 2679/3560 Training loss: 1.7636 0.0493 sec/batch
Epoch 16/20  Iteration 2680/3560 Training loss: 1.7637 0.0492 sec/batch
Epoch 16/20  Iteration 2681/3560 Training loss: 1.7618 0.0486 sec/batch
Epoch 16/20  Iteration 2682/3560 Training loss: 1.7604 0.0486 sec/batch
Epoch 16/20  Iteration 2683/3560 Training loss: 1.7607 0.0484 sec/batch
Epoch 16/20  Iteration 2684/3560 Training loss: 1.7633 0.0518 sec/batch
Epoch 16/20  Iteration 2685/3560 Training loss: 1.7626 0.0499 sec/batch
Epoch 16/20  Iteration 2686/3560 Training loss: 1.7615 0.0507 sec/batch
Epoch 16/20  Iteration 2687/3560 Training loss: 1.7613 0.0495 sec/batch

Epoch 16/20  Iteration 2791/3560 Training loss: 1.7504 0.0492 sec/batch
Epoch 16/20  Iteration 2792/3560 Training loss: 1.7501 0.0482 sec/batch
Epoch 16/20  Iteration 2793/3560 Training loss: 1.7499 0.0491 sec/batch
Epoch 16/20  Iteration 2794/3560 Training loss: 1.7499 0.0602 sec/batch
Epoch 16/20  Iteration 2795/3560 Training loss: 1.7498 0.0477 sec/batch
Epoch 16/20  Iteration 2796/3560 Training loss: 1.7495 0.0475 sec/batch
Epoch 16/20  Iteration 2797/3560 Training loss: 1.7496 0.0521 sec/batch
Epoch 16/20  Iteration 2798/3560 Training loss: 1.7497 0.0492 sec/batch
Epoch 16/20  Iteration 2799/3560 Training loss: 1.7496 0.0508 sec/batch
Epoch 16/20  Iteration 2800/3560 Training loss: 1.7496 0.0454 sec/batch
Epoch 16/20  Iteration 2801/3560 Training loss: 1.7494 0.0483 sec/batch
Epoch 16/20  Iteration 2802/3560 Training loss: 1.7491 0.0482 sec/batch
Epoch 16/20  Iteration 2803/3560 Training loss: 1.7492 0.0475 sec/batch
Epoch 16/20  Iteration 2804/3560 Training loss: 1.7492 0.0528 sec/batch

Epoch 17/20  Iteration 2906/3560 Training loss: 1.7355 0.0517 sec/batch
Epoch 17/20  Iteration 2907/3560 Training loss: 1.7350 0.0507 sec/batch
Epoch 17/20  Iteration 2908/3560 Training loss: 1.7357 0.0490 sec/batch
Epoch 17/20  Iteration 2909/3560 Training loss: 1.7357 0.0528 sec/batch
Epoch 17/20  Iteration 2910/3560 Training loss: 1.7364 0.0485 sec/batch
Epoch 17/20  Iteration 2911/3560 Training loss: 1.7368 0.0513 sec/batch
Epoch 17/20  Iteration 2912/3560 Training loss: 1.7369 0.0500 sec/batch
Epoch 17/20  Iteration 2913/3560 Training loss: 1.7367 0.0508 sec/batch
Epoch 17/20  Iteration 2914/3560 Training loss: 1.7372 0.0487 sec/batch
Epoch 17/20  Iteration 2915/3560 Training loss: 1.7374 0.0503 sec/batch
Epoch 17/20  Iteration 2916/3560 Training loss: 1.7370 0.0503 sec/batch
Epoch 17/20  Iteration 2917/3560 Training loss: 1.7369 0.0499 sec/batch
Epoch 17/20  Iteration 2918/3560 Training loss: 1.7369 0.0491 sec/batch
Epoch 17/20  Iteration 2919/3560 Training loss: 1.7372 0.0487 sec/batch

Epoch 17/20  Iteration 3021/3560 Training loss: 1.7311 0.0511 sec/batch
Epoch 17/20  Iteration 3022/3560 Training loss: 1.7315 0.0501 sec/batch
Epoch 17/20  Iteration 3023/3560 Training loss: 1.7318 0.0487 sec/batch
Epoch 17/20  Iteration 3024/3560 Training loss: 1.7318 0.0494 sec/batch
Epoch 17/20  Iteration 3025/3560 Training loss: 1.7316 0.0517 sec/batch
Epoch 17/20  Iteration 3026/3560 Training loss: 1.7316 0.0520 sec/batch
Epoch 18/20  Iteration 3027/3560 Training loss: 1.7871 0.0483 sec/batch
Epoch 18/20  Iteration 3028/3560 Training loss: 1.7518 0.0505 sec/batch
Epoch 18/20  Iteration 3029/3560 Training loss: 1.7404 0.0479 sec/batch
Epoch 18/20  Iteration 3030/3560 Training loss: 1.7321 0.0491 sec/batch
Epoch 18/20  Iteration 3031/3560 Training loss: 1.7285 0.0490 sec/batch
Epoch 18/20  Iteration 3032/3560 Training loss: 1.7217 0.0483 sec/batch
Epoch 18/20  Iteration 3033/3560 Training loss: 1.7219 0.0501 sec/batch
Epoch 18/20  Iteration 3034/3560 Training loss: 1.7220 0.0498 sec/batch

Epoch 18/20  Iteration 3138/3560 Training loss: 1.7125 0.0496 sec/batch
Epoch 18/20  Iteration 3139/3560 Training loss: 1.7124 0.0504 sec/batch
Epoch 18/20  Iteration 3140/3560 Training loss: 1.7123 0.0496 sec/batch
Epoch 18/20  Iteration 3141/3560 Training loss: 1.7121 0.0482 sec/batch
Epoch 18/20  Iteration 3142/3560 Training loss: 1.7117 0.0489 sec/batch
Epoch 18/20  Iteration 3143/3560 Training loss: 1.7115 0.0495 sec/batch
Epoch 18/20  Iteration 3144/3560 Training loss: 1.7115 0.0482 sec/batch
Epoch 18/20  Iteration 3145/3560 Training loss: 1.7114 0.0481 sec/batch
Epoch 18/20  Iteration 3146/3560 Training loss: 1.7113 0.0491 sec/batch
Epoch 18/20  Iteration 3147/3560 Training loss: 1.7113 0.0498 sec/batch
Epoch 18/20  Iteration 3148/3560 Training loss: 1.7110 0.0509 sec/batch
Epoch 18/20  Iteration 3149/3560 Training loss: 1.7107 0.0483 sec/batch
Epoch 18/20  Iteration 3150/3560 Training loss: 1.7108 0.0521 sec/batch
Epoch 18/20  Iteration 3151/3560 Training loss: 1.7107 0.0497 sec/batch

Epoch 19/20  Iteration 3254/3560 Training loss: 1.6983 0.0490 sec/batch
Epoch 19/20  Iteration 3255/3560 Training loss: 1.6978 0.0468 sec/batch
Epoch 19/20  Iteration 3256/3560 Training loss: 1.6987 0.0493 sec/batch
Epoch 19/20  Iteration 3257/3560 Training loss: 1.6987 0.0483 sec/batch
Epoch 19/20  Iteration 3258/3560 Training loss: 1.6986 0.0590 sec/batch
Epoch 19/20  Iteration 3259/3560 Training loss: 1.6982 0.0503 sec/batch
Epoch 19/20  Iteration 3260/3560 Training loss: 1.6984 0.0492 sec/batch
Epoch 19/20  Iteration 3261/3560 Training loss: 1.6986 0.0618 sec/batch
Epoch 19/20  Iteration 3262/3560 Training loss: 1.6983 0.0504 sec/batch
Epoch 19/20  Iteration 3263/3560 Training loss: 1.6977 0.0484 sec/batch
Epoch 19/20  Iteration 3264/3560 Training loss: 1.6985 0.0493 sec/batch
Epoch 19/20  Iteration 3265/3560 Training loss: 1.6985 0.0500 sec/batch
Epoch 19/20  Iteration 3266/3560 Training loss: 1.6993 0.0496 sec/batch
Epoch 19/20  Iteration 3267/3560 Training loss: 1.6997 0.0513 sec/batch

Epoch 19/20  Iteration 3369/3560 Training loss: 1.6935 0.0508 sec/batch
Epoch 19/20  Iteration 3370/3560 Training loss: 1.6934 0.0495 sec/batch
Epoch 19/20  Iteration 3371/3560 Training loss: 1.6936 0.0498 sec/batch
Epoch 19/20  Iteration 3372/3560 Training loss: 1.6940 0.0486 sec/batch
Epoch 19/20  Iteration 3373/3560 Training loss: 1.6939 0.0497 sec/batch
Epoch 19/20  Iteration 3374/3560 Training loss: 1.6939 0.0509 sec/batch
Epoch 19/20  Iteration 3375/3560 Training loss: 1.6939 0.0491 sec/batch
Epoch 19/20  Iteration 3376/3560 Training loss: 1.6941 0.0506 sec/batch
Epoch 19/20  Iteration 3377/3560 Training loss: 1.6945 0.0500 sec/batch
Epoch 19/20  Iteration 3378/3560 Training loss: 1.6948 0.0493 sec/batch
Epoch 19/20  Iteration 3379/3560 Training loss: 1.6952 0.0499 sec/batch
Epoch 19/20  Iteration 3380/3560 Training loss: 1.6951 0.0491 sec/batch
Epoch 19/20  Iteration 3381/3560 Training loss: 1.6949 0.0509 sec/batch
Epoch 19/20  Iteration 3382/3560 Training loss: 1.6950 0.0508 sec/batch

Epoch 20/20  Iteration 3484/3560 Training loss: 1.6786 0.0494 sec/batch
Epoch 20/20  Iteration 3485/3560 Training loss: 1.6784 0.0526 sec/batch
Epoch 20/20  Iteration 3486/3560 Training loss: 1.6783 0.0514 sec/batch
Epoch 20/20  Iteration 3487/3560 Training loss: 1.6781 0.0495 sec/batch
Epoch 20/20  Iteration 3488/3560 Training loss: 1.6782 0.0495 sec/batch
Epoch 20/20  Iteration 3489/3560 Training loss: 1.6781 0.0483 sec/batch
Epoch 20/20  Iteration 3490/3560 Training loss: 1.6782 0.0478 sec/batch
Epoch 20/20  Iteration 3491/3560 Training loss: 1.6782 0.0515 sec/batch
Epoch 20/20  Iteration 3492/3560 Training loss: 1.6781 0.0482 sec/batch
Epoch 20/20  Iteration 3493/3560 Training loss: 1.6779 0.0493 sec/batch
Epoch 20/20  Iteration 3494/3560 Training loss: 1.6777 0.0493 sec/batch
Epoch 20/20  Iteration 3495/3560 Training loss: 1.6776 0.0595 sec/batch
Epoch 20/20  Iteration 3496/3560 Training loss: 1.6775 0.0590 sec/batch
Epoch 20/20  Iteration 3497/3560 Training loss: 1.6773 0.0523 sec/batch

Epoch 1/20  Iteration 43/3560 Training loss: 3.1997 0.0316 sec/batch
Epoch 1/20  Iteration 44/3560 Training loss: 3.1955 0.0335 sec/batch
Epoch 1/20  Iteration 45/3560 Training loss: 3.1913 0.0311 sec/batch
Epoch 1/20  Iteration 46/3560 Training loss: 3.1876 0.0315 sec/batch
Epoch 1/20  Iteration 47/3560 Training loss: 3.1841 0.0312 sec/batch
Epoch 1/20  Iteration 48/3560 Training loss: 3.1808 0.0313 sec/batch
Epoch 1/20  Iteration 49/3560 Training loss: 3.1775 0.0315 sec/batch
Epoch 1/20  Iteration 50/3560 Training loss: 3.1743 0.0315 sec/batch
Epoch 1/20  Iteration 51/3560 Training loss: 3.1709 0.0310 sec/batch
Epoch 1/20  Iteration 52/3560 Training loss: 3.1675 0.0312 sec/batch
Epoch 1/20  Iteration 53/3560 Training loss: 3.1643 0.0339 sec/batch
Epoch 1/20  Iteration 54/3560 Training loss: 3.1609 0.0323 sec/batch
Epoch 1/20  Iteration 55/3560 Training loss: 3.1579 0.0337 sec/batch
Epoch 1/20  Iteration 56/3560 Training loss: 3.1545 0.0308 sec/batch

Epoch 1/20  Iteration 164/3560 Training loss: 2.8072 0.0396 sec/batch
Epoch 1/20  Iteration 165/3560 Training loss: 2.8046 0.0445 sec/batch
Epoch 1/20  Iteration 166/3560 Training loss: 2.8019 0.0437 sec/batch
Epoch 1/20  Iteration 167/3560 Training loss: 2.7992 0.0391 sec/batch
Epoch 1/20  Iteration 168/3560 Training loss: 2.7966 0.0399 sec/batch
Epoch 1/20  Iteration 169/3560 Training loss: 2.7940 0.0401 sec/batch
Epoch 1/20  Iteration 170/3560 Training loss: 2.7913 0.0395 sec/batch
Epoch 1/20  Iteration 171/3560 Training loss: 2.7887 0.0388 sec/batch
Epoch 1/20  Iteration 172/3560 Training loss: 2.7862 0.0409 sec/batch
Epoch 1/20  Iteration 173/3560 Training loss: 2.7839 0.0402 sec/batch
Epoch 1/20  Iteration 174/3560 Training loss: 2.7816 0.0399 sec/batch
Epoch 1/20  Iteration 175/3560 Training loss: 2.7792 0.0440 sec/batch
Epoch 1/20  Iteration 176/3560 Training loss: 2.7767 0.0411 sec/batch
Epoch 1/20  Iteration 177/3560 Training loss: 2.7741 0.0442 sec/batch

Epoch 2/20  Iteration 284/3560 Training loss: 2.2159 0.0427 sec/batch
Epoch 2/20  Iteration 285/3560 Training loss: 2.2151 0.0404 sec/batch
Epoch 2/20  Iteration 286/3560 Training loss: 2.2146 0.0407 sec/batch
Epoch 2/20  Iteration 287/3560 Training loss: 2.2139 0.0407 sec/batch
Epoch 2/20  Iteration 288/3560 Training loss: 2.2130 0.0405 sec/batch
Epoch 2/20  Iteration 289/3560 Training loss: 2.2122 0.0412 sec/batch
Epoch 2/20  Iteration 290/3560 Training loss: 2.2115 0.0403 sec/batch
Epoch 2/20  Iteration 291/3560 Training loss: 2.2107 0.0439 sec/batch
Epoch 2/20  Iteration 292/3560 Training loss: 2.2099 0.0407 sec/batch
Epoch 2/20  Iteration 293/3560 Training loss: 2.2091 0.0408 sec/batch
Epoch 2/20  Iteration 294/3560 Training loss: 2.2080 0.0413 sec/batch
Epoch 2/20  Iteration 295/3560 Training loss: 2.2073 0.0413 sec/batch
Epoch 2/20  Iteration 296/3560 Training loss: 2.2065 0.0401 sec/batch
Epoch 2/20  Iteration 297/3560 Training loss: 2.2059 0.0423 sec/batch

Epoch 3/20  Iteration 404/3560 Training loss: 2.0214 0.0449 sec/batch
Epoch 3/20  Iteration 405/3560 Training loss: 2.0207 0.0407 sec/batch
Epoch 3/20  Iteration 406/3560 Training loss: 2.0208 0.0458 sec/batch
Epoch 3/20  Iteration 407/3560 Training loss: 2.0198 0.0403 sec/batch
Epoch 3/20  Iteration 408/3560 Training loss: 2.0199 0.0420 sec/batch
Epoch 3/20  Iteration 409/3560 Training loss: 2.0192 0.0400 sec/batch
Epoch 3/20  Iteration 410/3560 Training loss: 2.0184 0.0411 sec/batch
Epoch 3/20  Iteration 411/3560 Training loss: 2.0179 0.0400 sec/batch
Epoch 3/20  Iteration 412/3560 Training loss: 2.0175 0.0457 sec/batch
Epoch 3/20  Iteration 413/3560 Training loss: 2.0171 0.0404 sec/batch
Epoch 3/20  Iteration 414/3560 Training loss: 2.0163 0.0407 sec/batch
Epoch 3/20  Iteration 415/3560 Training loss: 2.0155 0.0466 sec/batch
Epoch 3/20  Iteration 416/3560 Training loss: 2.0156 0.0409 sec/batch
Epoch 3/20  Iteration 417/3560 Training loss: 2.0149 0.0408 sec/batch

Epoch 3/20  Iteration 524/3560 Training loss: 1.9648 0.0417 sec/batch
Epoch 3/20  Iteration 525/3560 Training loss: 1.9644 0.0412 sec/batch
Epoch 3/20  Iteration 526/3560 Training loss: 1.9640 0.0433 sec/batch
Epoch 3/20  Iteration 527/3560 Training loss: 1.9636 0.0413 sec/batch
Epoch 3/20  Iteration 528/3560 Training loss: 1.9633 0.0463 sec/batch
Epoch 3/20  Iteration 529/3560 Training loss: 1.9631 0.0408 sec/batch
Epoch 3/20  Iteration 530/3560 Training loss: 1.9628 0.0426 sec/batch
Epoch 3/20  Iteration 531/3560 Training loss: 1.9626 0.0422 sec/batch
Epoch 3/20  Iteration 532/3560 Training loss: 1.9623 0.0411 sec/batch
Epoch 3/20  Iteration 533/3560 Training loss: 1.9618 0.0424 sec/batch
Epoch 3/20  Iteration 534/3560 Training loss: 1.9615 0.0437 sec/batch
Epoch 4/20  Iteration 535/3560 Training loss: 1.9658 0.0402 sec/batch
Epoch 4/20  Iteration 536/3560 Training loss: 1.9199 0.0453 sec/batch
Epoch 4/20  Iteration 537/3560 Training loss: 1.9062 0.0413 sec/batch

Epoch 4/20  Iteration 643/3560 Training loss: 1.8482 0.0411 sec/batch
Epoch 4/20  Iteration 644/3560 Training loss: 1.8478 0.0406 sec/batch
Epoch 4/20  Iteration 645/3560 Training loss: 1.8474 0.0416 sec/batch
Epoch 4/20  Iteration 646/3560 Training loss: 1.8470 0.0407 sec/batch
Epoch 4/20  Iteration 647/3560 Training loss: 1.8467 0.0412 sec/batch
Epoch 4/20  Iteration 648/3560 Training loss: 1.8464 0.0409 sec/batch
Epoch 4/20  Iteration 649/3560 Training loss: 1.8459 0.0422 sec/batch
Epoch 4/20  Iteration 650/3560 Training loss: 1.8453 0.0429 sec/batch
Epoch 4/20  Iteration 651/3560 Training loss: 1.8450 0.0406 sec/batch
Epoch 4/20  Iteration 652/3560 Training loss: 1.8446 0.0430 sec/batch
Epoch 4/20  Iteration 653/3560 Training loss: 1.8444 0.0408 sec/batch
Epoch 4/20  Iteration 654/3560 Training loss: 1.8441 0.0404 sec/batch
Epoch 4/20  Iteration 655/3560 Training loss: 1.8439 0.0420 sec/batch
Epoch 4/20  Iteration 656/3560 Training loss: 1.8434 0.0491 sec/batch

Epoch 5/20  Iteration 763/3560 Training loss: 1.7665 0.0404 sec/batch
Epoch 5/20  Iteration 764/3560 Training loss: 1.7671 0.0417 sec/batch
Epoch 5/20  Iteration 765/3560 Training loss: 1.7667 0.0413 sec/batch
Epoch 5/20  Iteration 766/3560 Training loss: 1.7664 0.0475 sec/batch
Epoch 5/20  Iteration 767/3560 Training loss: 1.7661 0.0468 sec/batch
Epoch 5/20  Iteration 768/3560 Training loss: 1.7660 0.0408 sec/batch
Epoch 5/20  Iteration 769/3560 Training loss: 1.7661 0.0418 sec/batch
Epoch 5/20  Iteration 770/3560 Training loss: 1.7656 0.0406 sec/batch
Epoch 5/20  Iteration 771/3560 Training loss: 1.7648 0.0462 sec/batch
Epoch 5/20  Iteration 772/3560 Training loss: 1.7651 0.0408 sec/batch
Epoch 5/20  Iteration 773/3560 Training loss: 1.7647 0.0480 sec/batch
Epoch 5/20  Iteration 774/3560 Training loss: 1.7652 0.0409 sec/batch
Epoch 5/20  Iteration 775/3560 Training loss: 1.7653 0.0407 sec/batch
Epoch 5/20  Iteration 776/3560 Training loss: 1.7653 0.0407 sec/batch

Epoch 5/20  Iteration 883/3560 Training loss: 1.7388 0.0416 sec/batch
Epoch 5/20  Iteration 884/3560 Training loss: 1.7385 0.0406 sec/batch
Epoch 5/20  Iteration 885/3560 Training loss: 1.7384 0.0420 sec/batch
Epoch 5/20  Iteration 886/3560 Training loss: 1.7383 0.0478 sec/batch
Epoch 5/20  Iteration 887/3560 Training loss: 1.7382 0.0426 sec/batch
Epoch 5/20  Iteration 888/3560 Training loss: 1.7380 0.0435 sec/batch
Epoch 5/20  Iteration 889/3560 Training loss: 1.7377 0.0413 sec/batch
Epoch 5/20  Iteration 890/3560 Training loss: 1.7376 0.0424 sec/batch
Epoch 6/20  Iteration 891/3560 Training loss: 1.7923 0.0427 sec/batch
Epoch 6/20  Iteration 892/3560 Training loss: 1.7438 0.0458 sec/batch
Epoch 6/20  Iteration 893/3560 Training loss: 1.7309 0.0473 sec/batch
Epoch 6/20  Iteration 894/3560 Training loss: 1.7227 0.0433 sec/batch
Epoch 6/20  Iteration 895/3560 Training loss: 1.7182 0.0402 sec/batch
Epoch 6/20  Iteration 896/3560 Training loss: 1.7082 0.0443 sec/batch

Epoch 6/20  Iteration 1003/3560 Training loss: 1.6747 0.0424 sec/batch
Epoch 6/20  Iteration 1004/3560 Training loss: 1.6746 0.0418 sec/batch
Epoch 6/20  Iteration 1005/3560 Training loss: 1.6742 0.0413 sec/batch
Epoch 6/20  Iteration 1006/3560 Training loss: 1.6737 0.0418 sec/batch
Epoch 6/20  Iteration 1007/3560 Training loss: 1.6734 0.0416 sec/batch
Epoch 6/20  Iteration 1008/3560 Training loss: 1.6733 0.0457 sec/batch
Epoch 6/20  Iteration 1009/3560 Training loss: 1.6731 0.0419 sec/batch
Epoch 6/20  Iteration 1010/3560 Training loss: 1.6729 0.0418 sec/batch
Epoch 6/20  Iteration 1011/3560 Training loss: 1.6727 0.0413 sec/batch
Epoch 6/20  Iteration 1012/3560 Training loss: 1.6723 0.0460 sec/batch
Epoch 6/20  Iteration 1013/3560 Training loss: 1.6719 0.0416 sec/batch
Epoch 6/20  Iteration 1014/3560 Training loss: 1.6719 0.0476 sec/batch
Epoch 6/20  Iteration 1015/3560 Training loss: 1.6717 0.0459 sec/batch
Epoch 6/20  Iteration 1016/3560 Training loss: 1.6712 0.0421 sec/batch

Epoch 7/20  Iteration 1123/3560 Training loss: 1.6274 0.0457 sec/batch
Epoch 7/20  Iteration 1124/3560 Training loss: 1.6274 0.0414 sec/batch
Epoch 7/20  Iteration 1125/3560 Training loss: 1.6278 0.0417 sec/batch
Epoch 7/20  Iteration 1126/3560 Training loss: 1.6274 0.0418 sec/batch
Epoch 7/20  Iteration 1127/3560 Training loss: 1.6267 0.0417 sec/batch
Epoch 7/20  Iteration 1128/3560 Training loss: 1.6271 0.0406 sec/batch
Epoch 7/20  Iteration 1129/3560 Training loss: 1.6269 0.0415 sec/batch
Epoch 7/20  Iteration 1130/3560 Training loss: 1.6276 0.0416 sec/batch
Epoch 7/20  Iteration 1131/3560 Training loss: 1.6279 0.0420 sec/batch
Epoch 7/20  Iteration 1132/3560 Training loss: 1.6280 0.0431 sec/batch
Epoch 7/20  Iteration 1133/3560 Training loss: 1.6278 0.0421 sec/batch
Epoch 7/20  Iteration 1134/3560 Training loss: 1.6279 0.0411 sec/batch
Epoch 7/20  Iteration 1135/3560 Training loss: 1.6280 0.0467 sec/batch
Epoch 7/20  Iteration 1136/3560 Training loss: 1.6276 0.0465 sec/batch

Epoch 7/20  Iteration 1243/3560 Training loss: 1.6091 0.0420 sec/batch
Epoch 7/20  Iteration 1244/3560 Training loss: 1.6089 0.0427 sec/batch
Epoch 7/20  Iteration 1245/3560 Training loss: 1.6086 0.0429 sec/batch
Epoch 7/20  Iteration 1246/3560 Training loss: 1.6087 0.0461 sec/batch
Epoch 8/20  Iteration 1247/3560 Training loss: 1.6814 0.0404 sec/batch
Epoch 8/20  Iteration 1248/3560 Training loss: 1.6330 0.0463 sec/batch
Epoch 8/20  Iteration 1249/3560 Training loss: 1.6189 0.0456 sec/batch
Epoch 8/20  Iteration 1250/3560 Training loss: 1.6120 0.0414 sec/batch
Epoch 8/20  Iteration 1251/3560 Training loss: 1.6064 0.0471 sec/batch
Epoch 8/20  Iteration 1252/3560 Training loss: 1.5949 0.0429 sec/batch
Epoch 8/20  Iteration 1253/3560 Training loss: 1.5937 0.0453 sec/batch
Epoch 8/20  Iteration 1254/3560 Training loss: 1.5924 0.0421 sec/batch
Epoch 8/20  Iteration 1255/3560 Training loss: 1.5937 0.0425 sec/batch
Epoch 8/20  Iteration 1256/3560 Training loss: 1.5929 0.0446 sec/batch

Epoch 8/20  Iteration 1363/3560 Training loss: 1.5658 0.0406 sec/batch
Epoch 8/20  Iteration 1364/3560 Training loss: 1.5657 0.0426 sec/batch
Epoch 8/20  Iteration 1365/3560 Training loss: 1.5656 0.0441 sec/batch
Epoch 8/20  Iteration 1366/3560 Training loss: 1.5654 0.0479 sec/batch
Epoch 8/20  Iteration 1367/3560 Training loss: 1.5652 0.0432 sec/batch
Epoch 8/20  Iteration 1368/3560 Training loss: 1.5649 0.0419 sec/batch
Epoch 8/20  Iteration 1369/3560 Training loss: 1.5645 0.0468 sec/batch
Epoch 8/20  Iteration 1370/3560 Training loss: 1.5644 0.0450 sec/batch
Epoch 8/20  Iteration 1371/3560 Training loss: 1.5643 0.0425 sec/batch
Epoch 8/20  Iteration 1372/3560 Training loss: 1.5638 0.0412 sec/batch
Epoch 8/20  Iteration 1373/3560 Training loss: 1.5638 0.0456 sec/batch
Epoch 8/20  Iteration 1374/3560 Training loss: 1.5638 0.0434 sec/batch
Epoch 8/20  Iteration 1375/3560 Training loss: 1.5636 0.0423 sec/batch
Epoch 8/20  Iteration 1376/3560 Training loss: 1.5634 0.0421 sec/batch

Epoch 9/20  Iteration 1481/3560 Training loss: 1.5322 0.0409 sec/batch
Epoch 9/20  Iteration 1482/3560 Training loss: 1.5318 0.0428 sec/batch
Epoch 9/20  Iteration 1483/3560 Training loss: 1.5312 0.0428 sec/batch
Epoch 9/20  Iteration 1484/3560 Training loss: 1.5317 0.0457 sec/batch
Epoch 9/20  Iteration 1485/3560 Training loss: 1.5317 0.0413 sec/batch
Epoch 9/20  Iteration 1486/3560 Training loss: 1.5325 0.0417 sec/batch
Epoch 9/20  Iteration 1487/3560 Training loss: 1.5329 0.0426 sec/batch
Epoch 9/20  Iteration 1488/3560 Training loss: 1.5330 0.0414 sec/batch
Epoch 9/20  Iteration 1489/3560 Training loss: 1.5329 0.0422 sec/batch
Epoch 9/20  Iteration 1490/3560 Training loss: 1.5331 0.0412 sec/batch
Epoch 9/20  Iteration 1491/3560 Training loss: 1.5332 0.0414 sec/batch
Epoch 9/20  Iteration 1492/3560 Training loss: 1.5328 0.0470 sec/batch
Epoch 9/20  Iteration 1493/3560 Training loss: 1.5328 0.0430 sec/batch
Epoch 9/20  Iteration 1494/3560 Training loss: 1.5327 0.0412 sec/batch

Epoch 9/20  Iteration 1601/3560 Training loss: 1.5186 0.0490 sec/batch
Epoch 9/20  Iteration 1602/3560 Training loss: 1.5187 0.0484 sec/batch
Epoch 10/20  Iteration 1603/3560 Training loss: 1.5950 0.0403 sec/batch
Epoch 10/20  Iteration 1604/3560 Training loss: 1.5497 0.0417 sec/batch
Epoch 10/20  Iteration 1605/3560 Training loss: 1.5347 0.0407 sec/batch
Epoch 10/20  Iteration 1606/3560 Training loss: 1.5281 0.0460 sec/batch
Epoch 10/20  Iteration 1607/3560 Training loss: 1.5207 0.0408 sec/batch
Epoch 10/20  Iteration 1608/3560 Training loss: 1.5096 0.0424 sec/batch
Epoch 10/20  Iteration 1609/3560 Training loss: 1.5092 0.0414 sec/batch
Epoch 10/20  Iteration 1610/3560 Training loss: 1.5077 0.0409 sec/batch
Epoch 10/20  Iteration 1611/3560 Training loss: 1.5087 0.0410 sec/batch
Epoch 10/20  Iteration 1612/3560 Training loss: 1.5078 0.0412 sec/batch
Epoch 10/20  Iteration 1613/3560 Training loss: 1.5043 0.0406 sec/batch
Epoch 10/20  Iteration 1614/3560 Training loss: 1.5031 0.0411 sec/batch

Epoch 10/20  Iteration 1718/3560 Training loss: 1.4853 0.0414 sec/batch
Epoch 10/20  Iteration 1719/3560 Training loss: 1.4852 0.0419 sec/batch
Epoch 10/20  Iteration 1720/3560 Training loss: 1.4852 0.0412 sec/batch
Epoch 10/20  Iteration 1721/3560 Training loss: 1.4851 0.0434 sec/batch
Epoch 10/20  Iteration 1722/3560 Training loss: 1.4850 0.1806 sec/batch
Epoch 10/20  Iteration 1723/3560 Training loss: 1.4848 0.0482 sec/batch
Epoch 10/20  Iteration 1724/3560 Training loss: 1.4845 0.0430 sec/batch
Epoch 10/20  Iteration 1725/3560 Training loss: 1.4841 0.0417 sec/batch
Epoch 10/20  Iteration 1726/3560 Training loss: 1.4841 0.0411 sec/batch
Epoch 10/20  Iteration 1727/3560 Training loss: 1.4839 0.0416 sec/batch
Epoch 10/20  Iteration 1728/3560 Training loss: 1.4835 0.0466 sec/batch
Epoch 10/20  Iteration 1729/3560 Training loss: 1.4835 0.0408 sec/batch
Epoch 10/20  Iteration 1730/3560 Training loss: 1.4835 0.0429 sec/batch
Epoch 10/20  Iteration 1731/3560 Training loss: 1.4833 0.0409 sec/batch

Epoch 11/20  Iteration 1832/3560 Training loss: 1.4595 0.0417 sec/batch
Epoch 11/20  Iteration 1833/3560 Training loss: 1.4593 0.0405 sec/batch
Epoch 11/20  Iteration 1834/3560 Training loss: 1.4595 0.0459 sec/batch
Epoch 11/20  Iteration 1835/3560 Training loss: 1.4592 0.0407 sec/batch
Epoch 11/20  Iteration 1836/3560 Training loss: 1.4593 0.0417 sec/batch
Epoch 11/20  Iteration 1837/3560 Training loss: 1.4598 0.0415 sec/batch
Epoch 11/20  Iteration 1838/3560 Training loss: 1.4594 0.0407 sec/batch
Epoch 11/20  Iteration 1839/3560 Training loss: 1.4588 0.0412 sec/batch
Epoch 11/20  Iteration 1840/3560 Training loss: 1.4594 0.0435 sec/batch
Epoch 11/20  Iteration 1841/3560 Training loss: 1.4595 0.0422 sec/batch
Epoch 11/20  Iteration 1842/3560 Training loss: 1.4604 0.0418 sec/batch
Epoch 11/20  Iteration 1843/3560 Training loss: 1.4608 0.0415 sec/batch
Epoch 11/20  Iteration 1844/3560 Training loss: 1.4610 0.0410 sec/batch
Epoch 11/20  Iteration 1845/3560 Training loss: 1.4609 0.0416 sec/batch

Epoch 11/20  Iteration 1947/3560 Training loss: 1.4495 0.0436 sec/batch
Epoch 11/20  Iteration 1948/3560 Training loss: 1.4499 0.0418 sec/batch
Epoch 11/20  Iteration 1949/3560 Training loss: 1.4499 0.0408 sec/batch
Epoch 11/20  Iteration 1950/3560 Training loss: 1.4498 0.0411 sec/batch
Epoch 11/20  Iteration 1951/3560 Training loss: 1.4497 0.0459 sec/batch
Epoch 11/20  Iteration 1952/3560 Training loss: 1.4495 0.0421 sec/batch
Epoch 11/20  Iteration 1953/3560 Training loss: 1.4495 0.0463 sec/batch
Epoch 11/20  Iteration 1954/3560 Training loss: 1.4495 0.0466 sec/batch
Epoch 11/20  Iteration 1955/3560 Training loss: 1.4495 0.0424 sec/batch
Epoch 11/20  Iteration 1956/3560 Training loss: 1.4493 0.0413 sec/batch
Epoch 11/20  Iteration 1957/3560 Training loss: 1.4492 0.0415 sec/batch
Epoch 11/20  Iteration 1958/3560 Training loss: 1.4493 0.0415 sec/batch
Epoch 12/20  Iteration 1959/3560 Training loss: 1.5291 0.0405 sec/batch
Epoch 12/20  Iteration 1960/3560 Training loss: 1.4871 0.0409 sec/batch

Epoch 12/20  Iteration 2065/3560 Training loss: 1.4245 0.0417 sec/batch
Epoch 12/20  Iteration 2066/3560 Training loss: 1.4245 0.0416 sec/batch
Epoch 12/20  Iteration 2067/3560 Training loss: 1.4244 0.0413 sec/batch
Epoch 12/20  Iteration 2068/3560 Training loss: 1.4244 0.0439 sec/batch
Epoch 12/20  Iteration 2069/3560 Training loss: 1.4240 0.0412 sec/batch
Epoch 12/20  Iteration 2070/3560 Training loss: 1.4239 0.0484 sec/batch
Epoch 12/20  Iteration 2071/3560 Training loss: 1.4238 0.0410 sec/batch
Epoch 12/20  Iteration 2072/3560 Training loss: 1.4237 0.0426 sec/batch
Epoch 12/20  Iteration 2073/3560 Training loss: 1.4234 0.0407 sec/batch
Epoch 12/20  Iteration 2074/3560 Training loss: 1.4230 0.0433 sec/batch
Epoch 12/20  Iteration 2075/3560 Training loss: 1.4230 0.0414 sec/batch
Epoch 12/20  Iteration 2076/3560 Training loss: 1.4229 0.0416 sec/batch
Epoch 12/20  Iteration 2077/3560 Training loss: 1.4229 0.0457 sec/batch
Epoch 12/20  Iteration 2078/3560 Training loss: 1.4227 0.0416 sec/batch

Epoch 13/20  Iteration 2180/3560 Training loss: 1.4045 0.0419 sec/batch
Epoch 13/20  Iteration 2181/3560 Training loss: 1.4047 0.0416 sec/batch
Epoch 13/20  Iteration 2182/3560 Training loss: 1.4037 0.0434 sec/batch
Epoch 13/20  Iteration 2183/3560 Training loss: 1.4034 0.0415 sec/batch
Epoch 13/20  Iteration 2184/3560 Training loss: 1.4028 0.0428 sec/batch
Epoch 13/20  Iteration 2185/3560 Training loss: 1.4026 0.0417 sec/batch
Epoch 13/20  Iteration 2186/3560 Training loss: 1.4029 0.0440 sec/batch
Epoch 13/20  Iteration 2187/3560 Training loss: 1.4024 0.0426 sec/batch
Epoch 13/20  Iteration 2188/3560 Training loss: 1.4032 0.0405 sec/batch
Epoch 13/20  Iteration 2189/3560 Training loss: 1.4030 0.0463 sec/batch
Epoch 13/20  Iteration 2190/3560 Training loss: 1.4032 0.0418 sec/batch
Epoch 13/20  Iteration 2191/3560 Training loss: 1.4030 0.0410 sec/batch
Epoch 13/20  Iteration 2192/3560 Training loss: 1.4031 0.0424 sec/batch
Epoch 13/20  Iteration 2193/3560 Training loss: 1.4035 0.0484 sec/batch

Epoch 13/20  Iteration 2295/3560 Training loss: 1.3952 0.0417 sec/batch
Epoch 13/20  Iteration 2296/3560 Training loss: 1.3953 0.0421 sec/batch
Epoch 13/20  Iteration 2297/3560 Training loss: 1.3954 0.0417 sec/batch
Epoch 13/20  Iteration 2298/3560 Training loss: 1.3954 0.0434 sec/batch
Epoch 13/20  Iteration 2299/3560 Training loss: 1.3954 0.0442 sec/batch
Epoch 13/20  Iteration 2300/3560 Training loss: 1.3954 0.0414 sec/batch
Epoch 13/20  Iteration 2301/3560 Training loss: 1.3954 0.0425 sec/batch
Epoch 13/20  Iteration 2302/3560 Training loss: 1.3953 0.0435 sec/batch
Epoch 13/20  Iteration 2303/3560 Training loss: 1.3954 0.0416 sec/batch
Epoch 13/20  Iteration 2304/3560 Training loss: 1.3959 0.0405 sec/batch
Epoch 13/20  Iteration 2305/3560 Training loss: 1.3958 0.0418 sec/batch
Epoch 13/20  Iteration 2306/3560 Training loss: 1.3958 0.0417 sec/batch
Epoch 13/20  Iteration 2307/3560 Training loss: 1.3957 0.0416 sec/batch
Epoch 13/20  Iteration 2308/3560 Training loss: 1.3955 0.0419 sec/batch

Epoch 14/20  Iteration 2410/3560 Training loss: 1.3781 0.0464 sec/batch
Epoch 14/20  Iteration 2411/3560 Training loss: 1.3781 0.0421 sec/batch
Epoch 14/20  Iteration 2412/3560 Training loss: 1.3777 0.0436 sec/batch
Epoch 14/20  Iteration 2413/3560 Training loss: 1.3774 0.0421 sec/batch
Epoch 14/20  Iteration 2414/3560 Training loss: 1.3770 0.0424 sec/batch
Epoch 14/20  Iteration 2415/3560 Training loss: 1.3770 0.0464 sec/batch
Epoch 14/20  Iteration 2416/3560 Training loss: 1.3768 0.0414 sec/batch
Epoch 14/20  Iteration 2417/3560 Training loss: 1.3766 0.0416 sec/batch
Epoch 14/20  Iteration 2418/3560 Training loss: 1.3764 0.0416 sec/batch
Epoch 14/20  Iteration 2419/3560 Training loss: 1.3761 0.0430 sec/batch
Epoch 14/20  Iteration 2420/3560 Training loss: 1.3761 0.0412 sec/batch
Epoch 14/20  Iteration 2421/3560 Training loss: 1.3761 0.0467 sec/batch
Epoch 14/20  Iteration 2422/3560 Training loss: 1.3761 0.0408 sec/batch
Epoch 14/20  Iteration 2423/3560 Training loss: 1.3760 0.0411 sec/batch

Epoch 15/20  Iteration 2525/3560 Training loss: 1.3647 0.0424 sec/batch
Epoch 15/20  Iteration 2526/3560 Training loss: 1.3653 0.0409 sec/batch
Epoch 15/20  Iteration 2527/3560 Training loss: 1.3654 0.0417 sec/batch
Epoch 15/20  Iteration 2528/3560 Training loss: 1.3649 0.0425 sec/batch
Epoch 15/20  Iteration 2529/3560 Training loss: 1.3641 0.0412 sec/batch
Epoch 15/20  Iteration 2530/3560 Training loss: 1.3630 0.0413 sec/batch
Epoch 15/20  Iteration 2531/3560 Training loss: 1.3617 0.0420 sec/batch
Epoch 15/20  Iteration 2532/3560 Training loss: 1.3614 0.0403 sec/batch
Epoch 15/20  Iteration 2533/3560 Training loss: 1.3609 0.0436 sec/batch
Epoch 15/20  Iteration 2534/3560 Training loss: 1.3617 0.0414 sec/batch
Epoch 15/20  Iteration 2535/3560 Training loss: 1.3613 0.0419 sec/batch
Epoch 15/20  Iteration 2536/3560 Training loss: 1.3609 0.0410 sec/batch
Epoch 15/20  Iteration 2537/3560 Training loss: 1.3612 0.0418 sec/batch
Epoch 15/20  Iteration 2538/3560 Training loss: 1.3602 0.0411 sec/batch

Epoch 15/20  Iteration 2640/3560 Training loss: 1.3538 0.0412 sec/batch
Epoch 15/20  Iteration 2641/3560 Training loss: 1.3538 0.0442 sec/batch
Epoch 15/20  Iteration 2642/3560 Training loss: 1.3537 0.0413 sec/batch
Epoch 15/20  Iteration 2643/3560 Training loss: 1.3534 0.0410 sec/batch
Epoch 15/20  Iteration 2644/3560 Training loss: 1.3533 0.0414 sec/batch
Epoch 15/20  Iteration 2645/3560 Training loss: 1.3533 0.0414 sec/batch
Epoch 15/20  Iteration 2646/3560 Training loss: 1.3533 0.0419 sec/batch
Epoch 15/20  Iteration 2647/3560 Training loss: 1.3534 0.0416 sec/batch
Epoch 15/20  Iteration 2648/3560 Training loss: 1.3533 0.0408 sec/batch
Epoch 15/20  Iteration 2649/3560 Training loss: 1.3534 0.0413 sec/batch
Epoch 15/20  Iteration 2650/3560 Training loss: 1.3534 0.0413 sec/batch
Epoch 15/20  Iteration 2651/3560 Training loss: 1.3532 0.0443 sec/batch
Epoch 15/20  Iteration 2652/3560 Training loss: 1.3534 0.0414 sec/batch
Epoch 15/20  Iteration 2653/3560 Training loss: 1.3535 0.0418 sec/batch

Epoch 16/20  Iteration 2755/3560 Training loss: 1.3417 0.0417 sec/batch
Epoch 16/20  Iteration 2756/3560 Training loss: 1.3415 0.0418 sec/batch
Epoch 16/20  Iteration 2757/3560 Training loss: 1.3413 0.0447 sec/batch
Epoch 16/20  Iteration 2758/3560 Training loss: 1.3410 0.0434 sec/batch
Epoch 16/20  Iteration 2759/3560 Training loss: 1.3406 0.0419 sec/batch
Epoch 16/20  Iteration 2760/3560 Training loss: 1.3408 0.0444 sec/batch
Epoch 16/20  Iteration 2761/3560 Training loss: 1.3405 0.0407 sec/batch
Epoch 16/20  Iteration 2762/3560 Training loss: 1.3405 0.0414 sec/batch
Epoch 16/20  Iteration 2763/3560 Training loss: 1.3402 0.0406 sec/batch
Epoch 16/20  Iteration 2764/3560 Training loss: 1.3399 0.0413 sec/batch
Epoch 16/20  Iteration 2765/3560 Training loss: 1.3397 0.0419 sec/batch
Epoch 16/20  Iteration 2766/3560 Training loss: 1.3399 0.0415 sec/batch
Epoch 16/20  Iteration 2767/3560 Training loss: 1.3398 0.0411 sec/batch
Epoch 16/20  Iteration 2768/3560 Training loss: 1.3395 0.0420 sec/batch

Epoch 17/20  Iteration 2870/3560 Training loss: 1.3320 0.0416 sec/batch
Epoch 17/20  Iteration 2871/3560 Training loss: 1.3310 0.0457 sec/batch
Epoch 17/20  Iteration 2872/3560 Training loss: 1.3313 0.0461 sec/batch
Epoch 17/20  Iteration 2873/3560 Training loss: 1.3313 0.0422 sec/batch
Epoch 17/20  Iteration 2874/3560 Training loss: 1.3296 0.0458 sec/batch
Epoch 17/20  Iteration 2875/3560 Training loss: 1.3286 0.0462 sec/batch
Epoch 17/20  Iteration 2876/3560 Training loss: 1.3291 0.0413 sec/batch
Epoch 17/20  Iteration 2877/3560 Training loss: 1.3295 0.0430 sec/batch
Epoch 17/20  Iteration 2878/3560 Training loss: 1.3298 0.0404 sec/batch
Epoch 17/20  Iteration 2879/3560 Training loss: 1.3293 0.0419 sec/batch
Epoch 17/20  Iteration 2880/3560 Training loss: 1.3283 0.0413 sec/batch
Epoch 17/20  Iteration 2881/3560 Training loss: 1.3286 0.0418 sec/batch
Epoch 17/20  Iteration 2882/3560 Training loss: 1.3293 0.0413 sec/batch
Epoch 17/20  Iteration 2883/3560 Training loss: 1.3293 0.0478 sec/batch

Epoch 17/20  Iteration 2985/3560 Training loss: 1.3181 0.0489 sec/batch
Epoch 17/20  Iteration 2986/3560 Training loss: 1.3181 0.0413 sec/batch
Epoch 17/20  Iteration 2987/3560 Training loss: 1.3181 0.0413 sec/batch
Epoch 17/20  Iteration 2988/3560 Training loss: 1.3181 0.0471 sec/batch
Epoch 17/20  Iteration 2989/3560 Training loss: 1.3184 0.0439 sec/batch
Epoch 17/20  Iteration 2990/3560 Training loss: 1.3184 0.0413 sec/batch
Epoch 17/20  Iteration 2991/3560 Training loss: 1.3184 0.0408 sec/batch
Epoch 17/20  Iteration 2992/3560 Training loss: 1.3186 0.0418 sec/batch
Epoch 17/20  Iteration 2993/3560 Training loss: 1.3185 0.0420 sec/batch
Epoch 17/20  Iteration 2994/3560 Training loss: 1.3187 0.0420 sec/batch
Epoch 17/20  Iteration 2995/3560 Training loss: 1.3189 0.0422 sec/batch
Epoch 17/20  Iteration 2996/3560 Training loss: 1.3192 0.0424 sec/batch
Epoch 17/20  Iteration 2997/3560 Training loss: 1.3193 0.0462 sec/batch
Epoch 17/20  Iteration 2998/3560 Training loss: 1.3192 0.0409 sec/batch

Epoch 18/20  Iteration 3100/3560 Training loss: 1.3117 0.0413 sec/batch
Epoch 18/20  Iteration 3101/3560 Training loss: 1.3115 0.0419 sec/batch
Epoch 18/20  Iteration 3102/3560 Training loss: 1.3116 0.0435 sec/batch
Epoch 18/20  Iteration 3103/3560 Training loss: 1.3114 0.0443 sec/batch
Epoch 18/20  Iteration 3104/3560 Training loss: 1.3113 0.0471 sec/batch
Epoch 18/20  Iteration 3105/3560 Training loss: 1.3108 0.0463 sec/batch
Epoch 18/20  Iteration 3106/3560 Training loss: 1.3107 0.0417 sec/batch
Epoch 18/20  Iteration 3107/3560 Training loss: 1.3104 0.0413 sec/batch
Epoch 18/20  Iteration 3108/3560 Training loss: 1.3103 0.0420 sec/batch
Epoch 18/20  Iteration 3109/3560 Training loss: 1.3097 0.0416 sec/batch
Epoch 18/20  Iteration 3110/3560 Training loss: 1.3097 0.0441 sec/batch
Epoch 18/20  Iteration 3111/3560 Training loss: 1.3094 0.0472 sec/batch
Epoch 18/20  Iteration 3112/3560 Training loss: 1.3092 0.0424 sec/batch
Epoch 18/20  Iteration 3113/3560 Training loss: 1.3090 0.0416 sec/batch

Epoch 19/20  Iteration 3215/3560 Training loss: 1.3007 0.0424 sec/batch
Epoch 19/20  Iteration 3216/3560 Training loss: 1.3007 0.0433 sec/batch
Epoch 19/20  Iteration 3217/3560 Training loss: 1.2999 0.0413 sec/batch
Epoch 19/20  Iteration 3218/3560 Training loss: 1.3013 0.0461 sec/batch
Epoch 19/20  Iteration 3219/3560 Training loss: 1.3001 0.0410 sec/batch
Epoch 19/20  Iteration 3220/3560 Training loss: 1.2987 0.0437 sec/batch
Epoch 19/20  Iteration 3221/3560 Training loss: 1.2990 0.0416 sec/batch
Epoch 19/20  Iteration 3222/3560 Training loss: 1.3003 0.0426 sec/batch
Epoch 19/20  Iteration 3223/3560 Training loss: 1.3006 0.0433 sec/batch
Epoch 19/20  Iteration 3224/3560 Training loss: 1.3019 0.0422 sec/batch
Epoch 19/20  Iteration 3225/3560 Training loss: 1.3016 0.0463 sec/batch
Epoch 19/20  Iteration 3226/3560 Training loss: 1.3018 0.0431 sec/batch
Epoch 19/20  Iteration 3227/3560 Training loss: 1.3008 0.0414 sec/batch
Epoch 19/20  Iteration 3228/3560 Training loss: 1.3011 0.0413 sec/batch

Epoch 19/20  Iteration 3330/3560 Training loss: 1.2896 0.0414 sec/batch
Epoch 19/20  Iteration 3331/3560 Training loss: 1.2897 0.0448 sec/batch
Epoch 19/20  Iteration 3332/3560 Training loss: 1.2897 0.0420 sec/batch
Epoch 19/20  Iteration 3333/3560 Training loss: 1.2895 0.0439 sec/batch
Epoch 19/20  Iteration 3334/3560 Training loss: 1.2892 0.0487 sec/batch
Epoch 19/20  Iteration 3335/3560 Training loss: 1.2886 0.0412 sec/batch
Epoch 19/20  Iteration 3336/3560 Training loss: 1.2885 0.0424 sec/batch
Epoch 19/20  Iteration 3337/3560 Training loss: 1.2886 0.0423 sec/batch
Epoch 19/20  Iteration 3338/3560 Training loss: 1.2886 0.0448 sec/batch
Epoch 19/20  Iteration 3339/3560 Training loss: 1.2886 0.0423 sec/batch
Epoch 19/20  Iteration 3340/3560 Training loss: 1.2887 0.0464 sec/batch
Epoch 19/20  Iteration 3341/3560 Training loss: 1.2888 0.0414 sec/batch
Epoch 19/20  Iteration 3342/3560 Training loss: 1.2888 0.0444 sec/batch
Epoch 19/20  Iteration 3343/3560 Training loss: 1.2888 0.0414 sec/batch

Epoch 20/20  Iteration 3445/3560 Training loss: 1.2834 0.0414 sec/batch
Epoch 20/20  Iteration 3446/3560 Training loss: 1.2836 0.0417 sec/batch
Epoch 20/20  Iteration 3447/3560 Training loss: 1.2836 0.0440 sec/batch
Epoch 20/20  Iteration 3448/3560 Training loss: 1.2837 0.0423 sec/batch
Epoch 20/20  Iteration 3449/3560 Training loss: 1.2840 0.0410 sec/batch
Epoch 20/20  Iteration 3450/3560 Training loss: 1.2837 0.0419 sec/batch
Epoch 20/20  Iteration 3451/3560 Training loss: 1.2838 0.0412 sec/batch
Epoch 20/20  Iteration 3452/3560 Training loss: 1.2837 0.0445 sec/batch
Epoch 20/20  Iteration 3453/3560 Training loss: 1.2842 0.0427 sec/batch
Epoch 20/20  Iteration 3454/3560 Training loss: 1.2846 0.0461 sec/batch
Epoch 20/20  Iteration 3455/3560 Training loss: 1.2850 0.0436 sec/batch
Epoch 20/20  Iteration 3456/3560 Training loss: 1.2847 0.0431 sec/batch
Epoch 20/20  Iteration 3457/3560 Training loss: 1.2845 0.0418 sec/batch
Epoch 20/20  Iteration 3458/3560 Training loss: 1.2846 0.0436 sec/batch

Epoch 20/20  Iteration 3560/3560 Training loss: 1.2773 0.0469 sec/batch
Epoch 1/20  Iteration 1/3560 Training loss: 4.4252 0.0477 sec/batch
Epoch 1/20  Iteration 2/3560 Training loss: 4.4122 0.0311 sec/batch
Epoch 1/20  Iteration 3/3560 Training loss: 4.3973 0.0304 sec/batch
Epoch 1/20  Iteration 4/3560 Training loss: 4.3758 0.0307 sec/batch
Epoch 1/20  Iteration 5/3560 Training loss: 4.3309 0.0308 sec/batch
Epoch 1/20  Iteration 6/3560 Training loss: 4.2357 0.0305 sec/batch
Epoch 1/20  Iteration 7/3560 Training loss: 4.1496 0.0326 sec/batch
Epoch 1/20  Iteration 8/3560 Training loss: 4.0714 0.0325 sec/batch
Epoch 1/20  Iteration 9/3560 Training loss: 3.9952 0.0332 sec/batch
Epoch 1/20  Iteration 10/3560 Training loss: 3.9264 0.0301 sec/batch
Epoch 1/20  Iteration 11/3560 Training loss: 3.8648 0.0307 sec/batch
Epoch 1/20  Iteration 12/3560 Training loss: 3.8115 0.0332 sec/batch
Epoch 1/20  Iteration 13/3560 Training loss: 3.7650 0.0317 sec/batch

Epoch 1/20  Iteration 124/3560 Training loss: 3.1065 0.0417 sec/batch
Epoch 1/20  Iteration 125/3560 Training loss: 3.1047 0.0329 sec/batch
Epoch 1/20  Iteration 126/3560 Training loss: 3.1027 0.0356 sec/batch
Epoch 1/20  Iteration 127/3560 Training loss: 3.1009 0.0331 sec/batch
Epoch 1/20  Iteration 128/3560 Training loss: 3.0992 0.0386 sec/batch
Epoch 1/20  Iteration 129/3560 Training loss: 3.0973 0.0338 sec/batch
Epoch 1/20  Iteration 130/3560 Training loss: 3.0955 0.0338 sec/batch
Epoch 1/20  Iteration 131/3560 Training loss: 3.0938 0.0349 sec/batch
Epoch 1/20  Iteration 132/3560 Training loss: 3.0918 0.0353 sec/batch
Epoch 1/20  Iteration 133/3560 Training loss: 3.0900 0.0343 sec/batch
Epoch 1/20  Iteration 134/3560 Training loss: 3.0881 0.0344 sec/batch
Epoch 1/20  Iteration 135/3560 Training loss: 3.0859 0.0339 sec/batch
Epoch 1/20  Iteration 136/3560 Training loss: 3.0839 0.0349 sec/batch
Epoch 1/20  Iteration 137/3560 Training loss: 3.0819 0.0343 sec/batch

Epoch 2/20  Iteration 245/3560 Training loss: 2.5099 0.0385 sec/batch
Epoch 2/20  Iteration 246/3560 Training loss: 2.5080 0.0385 sec/batch
Epoch 2/20  Iteration 247/3560 Training loss: 2.5064 0.0383 sec/batch
Epoch 2/20  Iteration 248/3560 Training loss: 2.5053 0.0430 sec/batch
Epoch 2/20  Iteration 249/3560 Training loss: 2.5041 0.0384 sec/batch
Epoch 2/20  Iteration 250/3560 Training loss: 2.5030 0.0381 sec/batch
Epoch 2/20  Iteration 251/3560 Training loss: 2.5017 0.0388 sec/batch
Epoch 2/20  Iteration 252/3560 Training loss: 2.5003 0.0391 sec/batch
Epoch 2/20  Iteration 253/3560 Training loss: 2.4991 0.0383 sec/batch
Epoch 2/20  Iteration 254/3560 Training loss: 2.4983 0.0394 sec/batch
Epoch 2/20  Iteration 255/3560 Training loss: 2.4970 0.0384 sec/batch
Epoch 2/20  Iteration 256/3560 Training loss: 2.4959 0.0385 sec/batch
Epoch 2/20  Iteration 257/3560 Training loss: 2.4946 0.0392 sec/batch
Epoch 2/20  Iteration 258/3560 Training loss: 2.4933 0.0386 sec/batch

Epoch 3/20  Iteration 363/3560 Training loss: 2.2616 0.0394 sec/batch
Epoch 3/20  Iteration 364/3560 Training loss: 2.2618 0.0387 sec/batch
Epoch 3/20  Iteration 365/3560 Training loss: 2.2629 0.0467 sec/batch
Epoch 3/20  Iteration 366/3560 Training loss: 2.2624 0.0396 sec/batch
Epoch 3/20  Iteration 367/3560 Training loss: 2.2603 0.0417 sec/batch
Epoch 3/20  Iteration 368/3560 Training loss: 2.2596 0.0403 sec/batch
Epoch 3/20  Iteration 369/3560 Training loss: 2.2592 0.0399 sec/batch
Epoch 3/20  Iteration 370/3560 Training loss: 2.2613 0.0390 sec/batch
Epoch 3/20  Iteration 371/3560 Training loss: 2.2611 0.0399 sec/batch
Epoch 3/20  Iteration 372/3560 Training loss: 2.2605 0.0386 sec/batch
Epoch 3/20  Iteration 373/3560 Training loss: 2.2600 0.0444 sec/batch
Epoch 3/20  Iteration 374/3560 Training loss: 2.2614 0.0387 sec/batch
Epoch 3/20  Iteration 375/3560 Training loss: 2.2612 0.0396 sec/batch
Epoch 3/20  Iteration 376/3560 Training loss: 2.2596 0.0391 sec/batch

Epoch 3/20  Iteration 484/3560 Training loss: 2.2000 0.0404 sec/batch
Epoch 3/20  Iteration 485/3560 Training loss: 2.1995 0.0398 sec/batch
Epoch 3/20  Iteration 486/3560 Training loss: 2.1992 0.0400 sec/batch
Epoch 3/20  Iteration 487/3560 Training loss: 2.1987 0.0456 sec/batch
Epoch 3/20  Iteration 488/3560 Training loss: 2.1981 0.0400 sec/batch
Epoch 3/20  Iteration 489/3560 Training loss: 2.1978 0.0391 sec/batch
Epoch 3/20  Iteration 490/3560 Training loss: 2.1974 0.0408 sec/batch
Epoch 3/20  Iteration 491/3560 Training loss: 2.1970 0.0397 sec/batch
Epoch 3/20  Iteration 492/3560 Training loss: 2.1967 0.0418 sec/batch
Epoch 3/20  Iteration 493/3560 Training loss: 2.1963 0.0429 sec/batch
Epoch 3/20  Iteration 494/3560 Training loss: 2.1960 0.0454 sec/batch
Epoch 3/20  Iteration 495/3560 Training loss: 2.1958 0.0398 sec/batch
Epoch 3/20  Iteration 496/3560 Training loss: 2.1954 0.0453 sec/batch
Epoch 3/20  Iteration 497/3560 Training loss: 2.1952 0.0397 sec/batch

Epoch 4/20  Iteration 604/3560 Training loss: 2.0906 0.0410 sec/batch
Epoch 4/20  Iteration 605/3560 Training loss: 2.0904 0.0500 sec/batch
Epoch 4/20  Iteration 606/3560 Training loss: 2.0903 0.0400 sec/batch
Epoch 4/20  Iteration 607/3560 Training loss: 2.0902 0.0411 sec/batch
Epoch 4/20  Iteration 608/3560 Training loss: 2.0896 0.0401 sec/batch
Epoch 4/20  Iteration 609/3560 Training loss: 2.0892 0.0404 sec/batch
Epoch 4/20  Iteration 610/3560 Training loss: 2.0894 0.0405 sec/batch
Epoch 4/20  Iteration 611/3560 Training loss: 2.0889 0.0455 sec/batch
Epoch 4/20  Iteration 612/3560 Training loss: 2.0888 0.0457 sec/batch
Epoch 4/20  Iteration 613/3560 Training loss: 2.0882 0.0473 sec/batch
Epoch 4/20  Iteration 614/3560 Training loss: 2.0876 0.0404 sec/batch
Epoch 4/20  Iteration 615/3560 Training loss: 2.0870 0.0404 sec/batch
Epoch 4/20  Iteration 616/3560 Training loss: 2.0868 0.0402 sec/batch
Epoch 4/20  Iteration 617/3560 Training loss: 2.0861 0.0408 sec/batch

Epoch 5/20  Iteration 724/3560 Training loss: 2.0083 0.0474 sec/batch
Epoch 5/20  Iteration 725/3560 Training loss: 2.0082 0.0410 sec/batch
Epoch 5/20  Iteration 726/3560 Training loss: 2.0099 0.0401 sec/batch
Epoch 5/20  Iteration 727/3560 Training loss: 2.0092 0.0401 sec/batch
Epoch 5/20  Iteration 728/3560 Training loss: 2.0078 0.0449 sec/batch
Epoch 5/20  Iteration 729/3560 Training loss: 2.0072 0.0453 sec/batch
Epoch 5/20  Iteration 730/3560 Training loss: 2.0091 0.0411 sec/batch
Epoch 5/20  Iteration 731/3560 Training loss: 2.0091 0.0400 sec/batch
Epoch 5/20  Iteration 732/3560 Training loss: 2.0085 0.0408 sec/batch
Epoch 5/20  Iteration 733/3560 Training loss: 2.0077 0.0426 sec/batch
Epoch 5/20  Iteration 734/3560 Training loss: 2.0096 0.0403 sec/batch
Epoch 5/20  Iteration 735/3560 Training loss: 2.0088 0.0401 sec/batch
Epoch 5/20  Iteration 736/3560 Training loss: 2.0078 0.0405 sec/batch
Epoch 5/20  Iteration 737/3560 Training loss: 2.0072 0.0414 sec/batch

Epoch 5/20  Iteration 844/3560 Training loss: 1.9704 0.0430 sec/batch
Epoch 5/20  Iteration 845/3560 Training loss: 1.9703 0.0407 sec/batch
Epoch 5/20  Iteration 846/3560 Training loss: 1.9702 0.0401 sec/batch
Epoch 5/20  Iteration 847/3560 Training loss: 1.9700 0.0408 sec/batch
Epoch 5/20  Iteration 848/3560 Training loss: 1.9699 0.0403 sec/batch
Epoch 5/20  Iteration 849/3560 Training loss: 1.9698 0.0425 sec/batch
Epoch 5/20  Iteration 850/3560 Training loss: 1.9697 0.0406 sec/batch
Epoch 5/20  Iteration 851/3560 Training loss: 1.9698 0.0459 sec/batch
Epoch 5/20  Iteration 852/3560 Training loss: 1.9695 0.0405 sec/batch
Epoch 5/20  Iteration 853/3560 Training loss: 1.9695 0.0402 sec/batch
Epoch 5/20  Iteration 854/3560 Training loss: 1.9693 0.0425 sec/batch
Epoch 5/20  Iteration 855/3560 Training loss: 1.9691 0.0408 sec/batch
Epoch 5/20  Iteration 856/3560 Training loss: 1.9690 0.0405 sec/batch
Epoch 5/20  Iteration 857/3560 Training loss: 1.9687 0.0406 sec/batch

Epoch 6/20  Iteration 964/3560 Training loss: 1.9144 0.0478 sec/batch
Epoch 6/20  Iteration 965/3560 Training loss: 1.9142 0.0458 sec/batch
Epoch 6/20  Iteration 966/3560 Training loss: 1.9145 0.0494 sec/batch
Epoch 6/20  Iteration 967/3560 Training loss: 1.9143 0.0427 sec/batch
Epoch 6/20  Iteration 968/3560 Training loss: 1.9144 0.0404 sec/batch
Epoch 6/20  Iteration 969/3560 Training loss: 1.9138 0.0408 sec/batch
Epoch 6/20  Iteration 970/3560 Training loss: 1.9136 0.0402 sec/batch
Epoch 6/20  Iteration 971/3560 Training loss: 1.9130 0.0409 sec/batch
Epoch 6/20  Iteration 972/3560 Training loss: 1.9130 0.0452 sec/batch
Epoch 6/20  Iteration 973/3560 Training loss: 1.9124 0.0420 sec/batch
Epoch 6/20  Iteration 974/3560 Training loss: 1.9121 0.0412 sec/batch
Epoch 6/20  Iteration 975/3560 Training loss: 1.9115 0.0421 sec/batch
Epoch 6/20  Iteration 976/3560 Training loss: 1.9110 0.0402 sec/batch
Epoch 6/20  Iteration 977/3560 Training loss: 1.9108 0.0400 sec/batch

Epoch 7/20  Iteration 1083/3560 Training loss: 1.8668 0.0417 sec/batch
Epoch 7/20  Iteration 1084/3560 Training loss: 1.8653 0.0401 sec/batch
Epoch 7/20  Iteration 1085/3560 Training loss: 1.8649 0.0466 sec/batch
Epoch 7/20  Iteration 1086/3560 Training loss: 1.8669 0.0492 sec/batch
Epoch 7/20  Iteration 1087/3560 Training loss: 1.8670 0.0459 sec/batch
Epoch 7/20  Iteration 1088/3560 Training loss: 1.8671 0.0479 sec/batch
Epoch 7/20  Iteration 1089/3560 Training loss: 1.8667 0.0446 sec/batch
Epoch 7/20  Iteration 1090/3560 Training loss: 1.8686 0.0507 sec/batch
Epoch 7/20  Iteration 1091/3560 Training loss: 1.8678 0.0474 sec/batch
Epoch 7/20  Iteration 1092/3560 Training loss: 1.8671 0.0560 sec/batch
Epoch 7/20  Iteration 1093/3560 Training loss: 1.8667 0.0466 sec/batch
Epoch 7/20  Iteration 1094/3560 Training loss: 1.8654 0.0544 sec/batch
Epoch 7/20  Iteration 1095/3560 Training loss: 1.8642 0.0433 sec/batch
Epoch 7/20  Iteration 1096/3560 Training loss: 1.8643 0.0420 sec/batch

Epoch 7/20  Iteration 1202/3560 Training loss: 1.8403 0.0416 sec/batch
Epoch 7/20  Iteration 1203/3560 Training loss: 1.8402 0.0475 sec/batch
Epoch 7/20  Iteration 1204/3560 Training loss: 1.8402 0.0446 sec/batch
Epoch 7/20  Iteration 1205/3560 Training loss: 1.8401 0.0481 sec/batch
Epoch 7/20  Iteration 1206/3560 Training loss: 1.8401 0.0458 sec/batch
Epoch 7/20  Iteration 1207/3560 Training loss: 1.8403 0.0416 sec/batch
Epoch 7/20  Iteration 1208/3560 Training loss: 1.8401 0.0430 sec/batch
Epoch 7/20  Iteration 1209/3560 Training loss: 1.8402 0.0420 sec/batch
Epoch 7/20  Iteration 1210/3560 Training loss: 1.8400 0.0442 sec/batch
Epoch 7/20  Iteration 1211/3560 Training loss: 1.8399 0.0425 sec/batch
Epoch 7/20  Iteration 1212/3560 Training loss: 1.8399 0.0438 sec/batch
Epoch 7/20  Iteration 1213/3560 Training loss: 1.8396 0.0445 sec/batch
Epoch 7/20  Iteration 1214/3560 Training loss: 1.8397 0.0419 sec/batch
Epoch 7/20  Iteration 1215/3560 Training loss: 1.8397 0.0479 sec/batch

In [29]:
tf.train.get_checkpoint_state('checkpoints/anna')

model_checkpoint_path: "checkpoints/anna/i178_l512_2.463.ckpt"
all_model_checkpoint_paths: "checkpoints/anna/i178_l512_2.463.ckpt"
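
Hard-coded checkpoint paths are easy to get wrong, so it can help to pick one programmatically. The sketch below assumes the file names follow the pattern `i<iteration>_l<lstm_size>_<loss>.ckpt`, which I'm inferring only from the names that appear in this notebook. The helpers `parse_checkpoint_name` and `latest_by_iteration` are hypothetical names for illustration, not part of TensorFlow.

```python
import re

# Assumed naming scheme, inferred from the file names in this notebook:
#   i<iteration>_l<lstm_size>_<loss>.ckpt
CKPT_PATTERN = re.compile(r"i(\d+)_l(\d+)_(\d+\.\d+)\.ckpt$")


def parse_checkpoint_name(path):
    """Return (iteration, lstm_size, loss) parsed from a checkpoint path, or None."""
    match = CKPT_PATTERN.search(path)
    if match is None:
        return None
    iteration, lstm_size, loss = match.groups()
    return int(iteration), int(lstm_size), float(loss)


def latest_by_iteration(paths):
    """Pick the path with the highest iteration number, skipping unparseable names."""
    parsed = [(parse_checkpoint_name(p), p) for p in paths]
    parsed = [(info, p) for info, p in parsed if info is not None]
    if not parsed:
        return None
    return max(parsed, key=lambda item: item[0][0])[1]


paths = ["checkpoints/anna/i178_l512_2.463.ckpt",
         "checkpoints/anna/i200_l512_2.432.ckpt"]
print(parse_checkpoint_name(paths[0]))
print(latest_by_iteration(paths))
```

In TensorFlow 1.x, `tf.train.latest_checkpoint('checkpoints/anna')` should also return the most recent path recorded for that directory, which avoids typing a file name by hand.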

## Sampling

Now that the network is trained, we can use it to generate new text. The idea is that we pass in a character and the network predicts the next character. We then feed that prediction back in to predict the one after it, and keep doing this to generate entirely new text. I also included some functionality to prime the network with some text by passing in a string and building up a state from that.

The network gives us predictions for each character. To reduce noise and make things a little less random, I'm going to only choose a new character from the top N most likely characters.



In [30]:
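def pick_top_n(preds, vocab_size, top_n=5):
    # Copy so the caller's prediction array is not modified in place
    p = np.squeeze(preds).copy()
    # Zero out everything except the top_n most likely characters
    p[np.argsort(p)[:-top_n]] = 0
    # Renormalize so the remaining probabilities sum to 1
    p = p / np.sum(p)
    # Sample one character index from the truncated distribution
    c = np.random.choice(vocab_size, 1, p=p)[0]
    return c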
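
To see what the top-N filtering does, here is a small standalone check on a made-up distribution. The 6-character vocabulary and the probabilities are invented for illustration, and `top_n_distribution` is a hypothetical helper that returns the truncated distribution itself rather than a sampled index.

```python
import numpy as np


def top_n_distribution(preds, top_n=5):
    """Return a copy of preds with all but the top_n entries zeroed, renormalized."""
    p = np.squeeze(np.asarray(preds, dtype=np.float64)).copy()
    # argsort is ascending, so everything before the last top_n indices is dropped
    p[np.argsort(p)[:-top_n]] = 0
    return p / np.sum(p)


# Toy softmax output over a hypothetical 6-character vocabulary
toy_preds = np.array([[0.05, 0.40, 0.10, 0.25, 0.15, 0.05]])
toy_p = top_n_distribution(toy_preds, top_n=3)
print(toy_p)

# Sampling from the truncated distribution only ever returns the kept indices
rng = np.random.RandomState(0)
draws = rng.choice(len(toy_p), size=1000, p=toy_p)
print(sorted(set(draws.tolist())))
```

Only the three most likely characters keep a nonzero probability, so unlikely characters can never be sampled. A smaller `top_n` makes the text more conservative, a larger one more varied.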

In [31]:
def sample(checkpoint, n_samples, lstm_size, vocab_size, prime="The "):
    # Start the output with the priming text (prime must contain at least one character)
    samples = [c for c in prime]
    model = build_rnn(vocab_size, lstm_size=lstm_size, sampling=True)
    saver = tf.train.Saver()
    with tf.Session() as sess:
        saver.restore(sess, checkpoint)
        new_state = sess.run(model.initial_state)
        # Feed the priming text one character at a time to build up the LSTM state
        for c in prime:
            x = np.zeros((1, 1))
            x[0,0] = vocab_to_int[c]
            feed = {model.inputs: x,
                    model.keep_prob: 1.,
                    model.initial_state: new_state}
            preds, new_state = sess.run([model.preds, model.final_state],
                                         feed_dict=feed)

        # Sample the first new character from the prediction after the last prime character
        c = pick_top_n(preds, vocab_size)
        samples.append(int_to_vocab[c])

        # Feed each sampled character back in to predict the next one
        for i in range(n_samples):
            x[0,0] = c
            feed = {model.inputs: x,
                    model.keep_prob: 1.,
                    model.initial_state: new_state}
            preds, new_state = sess.run([model.preds, model.final_state],
                                         feed_dict=feed)

            c = pick_top_n(preds, vocab_size)
            samples.append(int_to_vocab[c])

    return ''.join(samples)
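
The prime-then-feed-back loop can be tried without TensorFlow. In the sketch below a bigram table stands in for the trained network, which is an assumption made purely for illustration: its "state" is just the last character, whereas the LSTM carries a learned state vector. All `toy_*` names are hypothetical and chosen so they don't overwrite the notebook's own variables.

```python
import numpy as np

toy_text = "hello hello hello world "
toy_vocab = sorted(set(toy_text))
toy_to_int = {c: i for i, c in enumerate(toy_vocab)}
toy_to_char = dict(enumerate(toy_vocab))

# Stand-in "model": bigram counts with add-one smoothing, normalized per row
toy_counts = np.ones((len(toy_vocab), len(toy_vocab)))
for a, b in zip(toy_text[:-1], toy_text[1:]):
    toy_counts[toy_to_int[a], toy_to_int[b]] += 1
toy_probs = toy_counts / toy_counts.sum(axis=1, keepdims=True)


def toy_sample(n_samples, prime="he", seed=0):
    rng = np.random.RandomState(seed)
    samples = list(prime)
    # Priming: for a bigram model the state is simply the last prime character
    c = toy_to_int[prime[-1]]
    for _ in range(n_samples):
        # Predict a distribution for the next character, sample from it,
        # then feed the sampled character back in on the next step
        c = int(rng.choice(len(toy_vocab), p=toy_probs[c]))
        samples.append(toy_to_char[c])
    return "".join(samples)


print(toy_sample(20))
```

The structure mirrors `sample`: consume the prime, then repeatedly sample a character and use it as the next input.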

In [32]:
checkpoint = "checkpoints/anna/i3560_l512_1.122.ckpt"
samp = sample(checkpoint, 2000, lstm_size, len(vocab), prime="Far")
print(samp)

INFO:tensorflow:Restoring parameters from checkpoints/anna/i3560_l512_1.122.ckpt


NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for checkpoints/anna/i3560_l512_1.122.ckpt
	 [[Node: save/RestoreV2_5 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_arg_save/Const_0_0, save/RestoreV2_5/tensor_names, save/RestoreV2_5/shape_and_slices)]]
	 [[Node: save/RestoreV2_6/_37 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_80_save/RestoreV2_6", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]

Caused by op 'save/RestoreV2_5', defined at:
  File "/home/luis/anaconda2/envs/dlnd_gpu/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/luis/anaconda2/envs/dlnd_gpu/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/luis/anaconda2/envs/dlnd_gpu/lib/python3.6/site-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/home/luis/anaconda2/envs/dlnd_gpu/lib/python3.6/site-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/home/luis/anaconda2/envs/dlnd_gpu/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 477, in start
    ioloop.IOLoop.instance().start()
  File "/home/luis/anaconda2/envs/dlnd_gpu/lib/python3.6/site-packages/zmq/eventloop/ioloop.py", line 177, in start
    super(ZMQIOLoop, self).start()
  File "/home/luis/anaconda2/envs/dlnd_gpu/lib/python3.6/site-packages/tornado/ioloop.py", line 887, in start
    handler_func(fd_obj, events)
  File "/home/luis/anaconda2/envs/dlnd_gpu/lib/python3.6/site-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/luis/anaconda2/envs/dlnd_gpu/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 440, in _handle_events
    self._handle_recv()
  File "/home/luis/anaconda2/envs/dlnd_gpu/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 472, in _handle_recv
    self._run_callback(callback, msg)
  File "/home/luis/anaconda2/envs/dlnd_gpu/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 414, in _run_callback
    callback(*args, **kwargs)
  File "/home/luis/anaconda2/envs/dlnd_gpu/lib/python3.6/site-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/luis/anaconda2/envs/dlnd_gpu/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 276, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/home/luis/anaconda2/envs/dlnd_gpu/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 228, in dispatch_shell
    handler(stream, idents, msg)
  File "/home/luis/anaconda2/envs/dlnd_gpu/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 390, in execute_request
    user_expressions, allow_stdin)
  File "/home/luis/anaconda2/envs/dlnd_gpu/lib/python3.6/site-packages/ipykernel/ipkernel.py", line 196, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "/home/luis/anaconda2/envs/dlnd_gpu/lib/python3.6/site-packages/ipykernel/zmqshell.py", line 533, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/home/luis/anaconda2/envs/dlnd_gpu/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2717, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/home/luis/anaconda2/envs/dlnd_gpu/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2821, in run_ast_nodes
    if self.run_code(code, result):
  File "/home/luis/anaconda2/envs/dlnd_gpu/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2881, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-32-91ad82b46673>", line 2, in <module>
    samp = sample(checkpoint, 2000, lstm_size, len(vocab), prime="Far")
  File "<ipython-input-31-a7ae04af7e97>", line 5, in sample
    saver = tf.train.Saver()
  File "/home/luis/anaconda2/envs/dlnd_gpu/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1139, in __init__
    self.build()
  File "/home/luis/anaconda2/envs/dlnd_gpu/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1170, in build
    restore_sequentially=self._restore_sequentially)
  File "/home/luis/anaconda2/envs/dlnd_gpu/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 691, in build
    restore_sequentially, reshape)
  File "/home/luis/anaconda2/envs/dlnd_gpu/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 407, in _AddRestoreOps
    tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
  File "/home/luis/anaconda2/envs/dlnd_gpu/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 247, in restore_op
    [spec.tensor.dtype])[0])
  File "/home/luis/anaconda2/envs/dlnd_gpu/lib/python3.6/site-packages/tensorflow/python/ops/gen_io_ops.py", line 684, in restore_v2
    dtypes=dtypes, name=name)
  File "/home/luis/anaconda2/envs/dlnd_gpu/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/home/luis/anaconda2/envs/dlnd_gpu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2500, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/luis/anaconda2/envs/dlnd_gpu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
    self._traceback = _extract_stack()



In [None]:
checkpoint = "checkpoints/anna/i200_l512_2.432.ckpt"
samp = sample(checkpoint, 1000, lstm_size, len(vocab), prime="Far")
print(samp)

In [None]:
checkpoint = "checkpoints/anna/i600_l512_1.750.ckpt"
samp = sample(checkpoint, 1000, lstm_size, len(vocab), prime="Far")
print(samp)

In [None]:
checkpoint = "checkpoints/anna/i1000_l512_1.484.ckpt"
samp = sample(checkpoint, 1000, lstm_size, len(vocab), prime="Far")
print(samp)